A : k x n. Usually k > n; otherwise the minimum is trivially zero. Analytical solution:


1 1-5: Least-squares I A : k x n. Usually k > n; otherwise the minimum is trivially zero. Analytical solution: f(x) = (Ax - b)^T (Ax - b) = x^T A^T A x - 2b^T Ax + b^T b, ∇f(x) = 2A^T Ax - 2A^T b = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 1 / 228
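A quick numerical sketch of the analytical solution above (the sizes and random data below are made up for illustration): solving the normal equations A^T A x = A^T b should match a library least-squares routine.

```python
import numpy as np

# Minimal check of the normal-equation solution A^T A x = A^T b
# (assumes A has full column rank so A^T A is invertible).
rng = np.random.default_rng(0)
k, n = 20, 5                                   # k > n, as on the slide
A = rng.standard_normal((k, n))
b = rng.standard_normal(k)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)   # solve A^T A x = A^T b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))          # True
```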

2 1-5: Least-squares II Regularization, weights: (1/2) λ x^T x + w_1 (Ax - b)_1^2 + ... + w_k (Ax - b)_k^2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 2 / 228

3 2-4: Convex Combination and Convex Hull I Convex hull is convex: let x = θ_1 x_1 + ... + θ_k x_k and x' = θ'_1 x_1 + ... + θ'_k x_k. Then αx + (1 - α)x' = αθ_1 x_1 + ... + αθ_k x_k + (1 - α)θ'_1 x_1 + ... + (1 - α)θ'_k x_k Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 3 / 228

4 2-4: Convex Combination and Convex Hull II Each coefficient is nonnegative and αθ_1 + ... + αθ_k + (1 - α)θ'_1 + ... + (1 - α)θ'_k = α + (1 - α) = 1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 4 / 228

5 2-7: Euclidean Balls and Ellipsoid I We prove that any x = x_c + Au with ||u||_2 ≤ 1 satisfies (x - x_c)^T P^{-1} (x - x_c) ≤ 1. Let A = P^{1/2}, which is possible because P is symmetric positive definite. Then u^T A^T P^{-1} A u = u^T P^{1/2} P^{-1} P^{1/2} u ≤ 1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 5 / 228

6 2-10: Positive Semidefinite Cone I S^n_+ is a convex cone: for any θ_1 ≥ 0, θ_2 ≥ 0, X_1, X_2 ∈ S^n_+, and any z, z^T (θ_1 X_1 + θ_2 X_2) z = θ_1 z^T X_1 z + θ_2 z^T X_2 z ≥ 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 6 / 228

7 2-10: Positive Semidefinite Cone II Example: [x y; y z] ∈ S^2_+ is equivalent to x ≥ 0, z ≥ 0, xz - y^2 ≥ 0. If x > 0 (or z > 0) is fixed, we can see that z ≥ y^2/x has a parabolic shape Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 7 / 228

8 2-12: Intersection I When t is fixed, {(x_1, x_2) | -1 ≤ x_1 cos t + x_2 cos 2t ≤ 1} gives a region between two parallel lines. This region is convex Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 8 / 228

9 2-13: Affine Function I f(S) is convex: let f(x_1) ∈ f(S), f(x_2) ∈ f(S). Then αf(x_1) + (1 - α)f(x_2) = α(Ax_1 + b) + (1 - α)(Ax_2 + b) = A(αx_1 + (1 - α)x_2) + b ∈ f(S) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 9 / 228

10 2-13: Affine Function II f^{-1}(C) convex: x_1, x_2 ∈ f^{-1}(C) means that Ax_1 + b ∈ C, Ax_2 + b ∈ C. Because C is convex, α(Ax_1 + b) + (1 - α)(Ax_2 + b) = A(αx_1 + (1 - α)x_2) + b ∈ C. Thus αx_1 + (1 - α)x_2 ∈ f^{-1}(C) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 10 / 228

11 2-13: Affine Function III Scaling: Translation αs = {αx x S} S + a = {x + a x S} Projection T = {x 1 R m (x 1, x 2 ) S, x 2 R n }, S R m R n Scaling, translation, and projection are all affine functions Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 11 / 228

12 2-13: Affine Function IV For example, for projection f(x) = [I 0][x_1; x_2] = x_1, where I is the identity matrix. Solution set of a linear matrix inequality: C = {S | S ⪯ 0} is convex, f(x) = x_1 A_1 + ... + x_m A_m - B (in the form "Ax + b"), and f^{-1}(C) = {x | f(x) ⪯ 0} is convex Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 12 / 228

13 2-13: Affine Function V But this isn't rigorous because of some problems in arguing f(x) = Ax + b. A more formal explanation: C = {s ∈ R^{p^2} | mat(s) ∈ S^p and mat(s) ⪯ 0} is convex, f(x) = x_1 vec(A_1) + ... + x_m vec(A_m) - vec(B) = [vec(A_1) ... vec(A_m)] x + (-vec(B)) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 13 / 228

14 2-13: Affine Function VI f 1 (C) = {x mat(f (x)) S p and mat(f (x)) 0} is convex Hyperbolic cone: C = {(z, t) z T z t 2, t 0} is convex (by drawing a figure in 2 or 3 dimensional space) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 14 / 228

15 2-13: Affine Function VII We have that f(x) = [P^{1/2}; c^T] x is affine. Then f^{-1}(C) = {x | f(x) ∈ C} = {x | x^T P x ≤ (c^T x)^2, c^T x ≥ 0} is convex Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 15 / 228

16 Perspective and linear-fractional function I Image convex: if S is convex, check if {P(x, t) (x, t) S} convex or not Note that S is in the domain of P Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 16 / 228

17 Perspective and linear-fractional function II Assume We hope (x 1, t 1 ), (x 2, t 2 ) S α x 1 t 1 + (1 α) x 2 t 2 = P(A, B), where (A, B) S Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 17 / 228

18 Perspective and linear-fractional function III We have α x_1/t_1 + (1 - α) x_2/t_2 = [αt_2 x_1 + (1 - α)t_1 x_2]/(t_1 t_2) = [αt_2 x_1 + (1 - α)t_1 x_2]/[αt_1 t_2 + (1 - α)t_1 t_2] = [ (αt_2/(αt_2 + (1 - α)t_1)) x_1 + ((1 - α)t_1/(αt_2 + (1 - α)t_1)) x_2 ] / [ (αt_2/(αt_2 + (1 - α)t_1)) t_1 + ((1 - α)t_1/(αt_2 + (1 - α)t_1)) t_2 ] Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 18 / 228

19 Perspective and linear-fractional function IV Let θ = αt_2/(αt_2 + (1 - α)t_1). We have (θx_1 + (1 - θ)x_2)/(θt_1 + (1 - θ)t_2) = A/B with (A, B) = θ(x_1, t_1) + (1 - θ)(x_2, t_2). Further, (A, B) ∈ S because (x_1, t_1), (x_2, t_2) ∈ S and S is convex Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 19 / 228

20 Perspective and linear-fractional function V and Inverse image is convex Given C a convex set is convex S is convex P 1 (C) = {(x, t) P(x, t) = x/t C} Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 20 / 228

21 Perspective and linear-fractional function VI Let Do we have (x 1, t 1 ) : x 1 /t 1 C (x 2, t 2 ) : x 2 /t 2 C θ(x 1, t 1 ) + (1 θ)(x 2, t 2 ) P 1 (C)? That is, θx 1 + (1 θ)x 2 θt 1 + (1 θ)t 2 C? Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 21 / 228

22 Perspective and linear-fractional function VII Let Earlier we had θx 1 + (1 θ)x 2 θt 1 + (1 θ)t 2 = α x 1 t 1 + (1 α) x 2 t 2, θ = αt 2 αt 2 + (1 α)t 1 Then (α(t 2 t 1 ) + t 1 )θ = αt 2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 22 / 228

23 Perspective and linear-fractional function VIII t 1 θ = αt 2 αt 2 θ + αt 1 θ t 1 θ α = t 1 θ + (1 θ)t 2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 23 / 228

24 2-16: Generalized inequalities I K contains no line: x with x K and x K x = 0 Nonnegative orthant Clearly all properties are satisfied Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 24 / 228

25 2-16: Generalized inequalities II Positive semidefinite cone: PD matrices are interior Nonnegative polynomial on [0, 1] When n = 2 t = 1 x 1 tx 2, t [0, 1] Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 25 / 228

26 2-16: Generalized inequalities III t = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 26 / 228

27 2-16: Generalized inequalities IV t [0, 1] It really becomes a proper cone Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 27 / 228

28 2-17: I Properties: x ⪯_K y, u ⪯_K v implies that y - x ∈ K, v - u ∈ K. From the definition of a convex cone, (y - x) + (v - u) ∈ K. Then x + u ⪯_K y + v Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 28 / 228

29 2-18: Minimum and minimal elements I The minimum element: S ⊆ x_1 + K. A minimal element: (x_2 - K) ∩ S = {x_2} Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 29 / 228

30 2-19: Separating hyperplane theorem I We consider a simplified situation and omit part of the proof. Assume inf_{u ∈ C, v ∈ D} ||u - v|| > 0 and the minimum is attained at c, d. We will show that a = d - c, b = (||d||^2 - ||c||^2)/2 forms a separating hyperplane a^T x = b Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 30 / 228

31 2-19: Separating hyperplane theorem II c d a T x = 2 b 1 1 = (d c) T d, b = a = d c 2 = (d c) T c = d T d c T c 2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 31 / 228

32 2-19: Separating hyperplane theorem III Assume the result is wrong so there is u D such that a T u b < 0 We will derive a point u in D but it is closer to c than d. That is, u c < d c Then we have a contradiction The concept Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 32 / 228

33 2-19: Separating hyperplane theorem IV [Figure: points c, d, u and the hyperplane a^T x = b] a^T u - b = (d - c)^T u - (d^T d - c^T c)/2 < 0 implies that (d - c)^T (u - d) + ||d - c||^2/2 < 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 33 / 228

34 2-19: Separating hyperplane theorem V d/dt ||d + t(u - d) - c||^2 |_{t=0} = 2(d + t(u - d) - c)^T (u - d) |_{t=0} = 2(d - c)^T (u - d) < 0. There exists a small t ∈ (0, 1) such that ||d + t(u - d) - c|| < ||d - c||. However, d + t(u - d) ∈ D, so there is a contradiction Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 34 / 228

35 2-19: Separating hyperplane theorem VI Strict separation C D They are disjoint convex sets. However, no a, b such that a T x < b, x C and a T x > b, x D Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 35 / 228

36 2-20: Supporting hyperplane theorem I Case 1: C has an interior region Consider 2 sets: interior of C versus {x 0 }, where x 0 is any boundary point If C is convex, then interior of C is also convex Then both sets are convex We can apply results in slide 2-19 so that there exists a such that a T x a T x 0, x interior of C Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 36 / 228

37 2-20: Supporting hyperplane theorem II Then for all boundary point x we also have a T x a T x 0 because any boundary point is the limit of interior points Case 2: C has no interior region In this situation, C is like a line in R 3 (so no interior). Then of course it has a supporting hyperplane We don t do a rigourous proof here Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 37 / 228

38 3-3: Examples on R I Example: x 3, x 0 Example: x 1, x > 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 38 / 228

39 3-3: Examples on R II Example: x 1/2, x 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 39 / 228

40 3-3: Examples on R III Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 40 / 228

41 3-4: Examples on R^n and R^{m x n} I A, X ∈ R^{m x n}: tr(A^T X) = Σ_j (A^T X)_{jj} = Σ_j Σ_i (A^T)_{ji} X_{ij} = Σ_i Σ_j A_{ij} X_{ij} Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 41 / 228
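A small numerical check of this trace identity, with made-up random matrices:

```python
import numpy as np

# Numerical check of tr(A^T X) = sum_{i,j} A_ij X_ij (Frobenius inner product).
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((3, 4))

lhs = np.trace(A.T @ X)
rhs = np.sum(A * X)
print(np.isclose(lhs, rhs))   # True
```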

42 3-7: First-order Condition I An open set: for any x, there is a ball covering x such that this ball is in the set. Global underestimator: the tangent line satisfies (z - f(x))/(y - x) = ∇f(x), i.e., z = f(x) + ∇f(x)(y - x) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 42 / 228

43 3-7: First-order Condition II Because domain f is convex, for all 0 < t ≤ 1, x + t(y - x) ∈ domain f. From f(x + t(y - x)) ≤ (1 - t)f(x) + tf(y): f(y) ≥ f(x) + [f(x + t(y - x)) - f(x)]/t, and when t → 0, f(y) ≥ f(x) + ∇f(x)^T (y - x) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 43 / 228

44 3-7: First-order Condition III For any 0 ≤ θ ≤ 1, let z = θx + (1 - θ)y. Then f(x) ≥ f(z) + ∇f(z)^T (x - z) = f(z) + ∇f(z)^T (1 - θ)(x - y) and f(y) ≥ f(z) + ∇f(z)^T (y - z) = f(z) + ∇f(z)^T θ(y - x), so θf(x) + (1 - θ)f(y) ≥ f(z) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 44 / 228

45 3-7: First-order Condition IV First-order condition for strictly convex function: f is strictly convex if and only if f (y) > f (x) + f (x) T (y x) : it s easy by directly modifying to > f (x) >f (z) + f (z) T (x z) =f (z) + f (z) T (1 θ)(x y) f (y) > f (z)+ f (z) T (y z) = f (z)+ f (z) T θ(y x) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 45 / 228

46 3-7: First-order Condition V : Assume the result is wrong. From the 1st-order condition of a convex function, x, y such that x y and f (x) T (y x) = f (y) f (x) (1) For this (x, y), from the strict convexity f (x + t(y x)) f (x) <tf (y) tf (x) = f (x) T t(y x), t (0, 1) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 46 / 228

47 3-7: First-order Condition VI Therefore, f (x +t(y x)) < f (x)+ f (x) T t(y x), t (0, 1) However, this contradicts the first-order condition: f (x +t(y x)) f (x)+ f (x) T t(y x), t (0, 1) This proof was given by a student of this course before Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 47 / 228

48 3-8: Second-order condition I Proof of the 2nd-order condition: we consider only the simpler case n = 1. From f(x + t) ≥ f(x) + f'(x)t, lim_{t→0} 2[f(x + t) - f(x) - f'(x)t]/t^2 = lim_{t→0} 2[f'(x + t) - f'(x)]/(2t) = f''(x) ≥ 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 48 / 228

49 3-8: Second-order condition II f(x + t) = f(x) + f'(x)t + (1/2) f''(ξ)t^2 ≥ f(x) + f'(x)t, so f is convex by the 1st-order condition. The extension to general n is straightforward. If ∇^2 f(x) ≻ 0, then f is strictly convex Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 49 / 228

50 3-8: Second-order condition III Using 1st-order condition for strictly convex function: f (y) =f (x) + f (x) T (y x) (y x)t 2 f ( x)(y x) >f (x) + f (x) T (y x) It s possible that f is strictly convex but 2 f (x) 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 50 / 228

51 3-8: Second-order condition IV Example: Details omitted f (x) = x 4 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 51 / 228

52 3-9: Examples I Quadratic-over-linear f(x, y) = x^2/y: ∂f/∂x = 2x/y, ∂f/∂y = -x^2/y^2, ∂^2 f/∂x^2 = 2/y, ∂^2 f/∂x∂y = -2x/y^2, ∂^2 f/∂y^2 = 2x^2/y^3, so ∇^2 f = (2/y^3) [y; -x][y; -x]^T = (2/y^3) [y^2 -xy; -xy x^2] ⪰ 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 52 / 228
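A short numerical sketch of this Hessian computation (the point (x, y) below is an arbitrary choice with y > 0): the explicit Hessian should match the rank-one form and have nonnegative eigenvalues.

```python
import numpy as np

# Hessian of f(x, y) = x^2 / y at a point with y > 0, and the rank-one form
# (2 / y^3) [y, -x]^T [y, -x]; both should be positive semidefinite.
x, y = 1.5, 2.0
H = np.array([[2 / y,         -2 * x / y**2],
              [-2 * x / y**2,  2 * x**2 / y**3]])
v = np.array([y, -x])
H_rank1 = (2 / y**3) * np.outer(v, v)

print(np.allclose(H, H_rank1))                   # True
print(np.all(np.linalg.eigvalsh(H) >= -1e-12))   # True: eigenvalues >= 0
```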

53 3-10 I f(x) = log Σ_k exp x_k, ∇f(x) = (1/Σ_k e^{x_k}) [e^{x_1}; ...; e^{x_n}], ∇^2_{ii} f = [(Σ_k e^{x_k}) e^{x_i} - e^{x_i} e^{x_i}]/(Σ_k e^{x_k})^2, ∇^2_{ij} f = -e^{x_i} e^{x_j}/(Σ_k e^{x_k})^2, i ≠ j. Note that if z_k = exp x_k Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 53 / 228

54 3-10 II then (zz^T)_{ij} = z_i (z^T)_j = z_i z_j. Cauchy-Schwarz inequality: (a_1 b_1 + ... + a_n b_n)^2 ≤ (a_1^2 + ... + a_n^2)(b_1^2 + ... + b_n^2). Note that a_k = v_k √z_k, b_k = √z_k, z_k > 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 54 / 228
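A small check that the resulting Hessian of log-sum-exp, written with z_k = exp x_k as diag(z)/Σz - zz^T/(Σz)^2, is positive semidefinite at a random point (data made up for illustration):

```python
import numpy as np

# Numerical check that the Hessian of f(x) = log(sum_k exp(x_k)) is PSD:
# H = diag(z)/sum(z) - z z^T / sum(z)^2 with z_k = exp(x_k).
rng = np.random.default_rng(2)
x = rng.standard_normal(5)
z = np.exp(x)
s = z.sum()
H = np.diag(z) / s - np.outer(z, z) / s**2

print(np.all(np.linalg.eigvalsh(H) >= -1e-12))   # True: all eigenvalues >= 0
```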

55 3-12: Jensen s inequality I General form f ( p(z)zdz) p(z)f (z)dz Discrete situation f ( p i z i ) p i f (z i ), p i = 1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 55 / 228

56 3-12: Jensen's inequality II Proof: f(p_1 z_1 + p_2 z_2 + p_3 z_3) ≤ (1 - p_3) f((p_1 z_1 + p_2 z_2)/(1 - p_3)) + p_3 f(z_3) ≤ (1 - p_3)[(p_1/(1 - p_3)) f(z_1) + (p_2/(1 - p_3)) f(z_2)] + p_3 f(z_3) = p_1 f(z_1) + p_2 f(z_2) + p_3 f(z_3). Note that p_1/(1 - p_3) + p_2/(1 - p_3) = (1 - p_3)/(1 - p_3) = 1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 56 / 228

57 3-14: Positive weighted sum & composition with affine function I Composition with affine function: We know f (x) is convex Is g(x) = f (Ax + b) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 57 / 228

58 3-14: Positive weighted sum & composition with affine function II convex? g((1 α)x 1 + αx 2 ) =f (A((1 α)x 1 + αx 2 ) + b) =f ((1 α)(ax 1 + b) + α(ax 2 + b)) (1 α)f (Ax 1 + b) + αf (Ax 2 + b) =(1 α)g(x 1 ) + αg(x 2 ) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 58 / 228

59 3-15: Pointwise maximum I Proof of the convexity f ((1 α)x 1 + αx 2 ) = max(f 1 ((1 α)x 1 + αx 2 ),..., f m ((1 α)x 1 + αx 2 )) max((1 α)f 1 (x 1 ) + αf 1 (x 2 ),..., (1 α)f m (x 1 ) + αf m (x 2 )) (1 α) max(f 1 (x 1 ),..., f m (x 1 ))+ α max(f 1 (x 2 ),..., f m (x 2 )) (1 α)f (x 1 ) + αf (x 2 ) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 59 / 228

60 3-15: Pointwise maximum II For consider all combinations f (x) = x [1] + + x [r] ( ) n r Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 60 / 228

61 3-16: Pointwise supremum I The proof is similar to pointwise maximum. Support function of a set C: when y is fixed, f(x, y) = y^T x is linear (convex) in x. Maximum eigenvalue of a symmetric matrix: f(X, y) = y^T X y is a linear function of X when y is fixed Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 61 / 228

62 3-19: Minimization I Proof: Let ɛ > 0. y 1, y 2 C such that f (x 1, y 1 ) g(x 1 ) + ɛ f (x 2, y 2 ) g(x 2 ) + ɛ g(θx 1 + (1 θ)x 2 ) = inf y C f (θx 1 + (1 θ)x 2, y) f (θx 1 + (1 θ)x 2, θy 1 + (1 θ)y 2 ) θf (x 1, y 1 ) + (1 θ)f (x 2, y 2 ) θg(x 1 ) + (1 θ)g(x 2 ) + ɛ Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 62 / 228

63 3-19: Minimization II Note that the first inequality uses the property that C is convex to have θy_1 + (1 - θ)y_2 ∈ C. Because the above inequality holds for all ɛ > 0, g(θx_1 + (1 - θ)x_2) ≤ θg(x_1) + (1 - θ)g(x_2). First example: Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 63 / 228

64 3-19: Minimization III The goal is to prove A BC 1 B T 0 Instead of a direct proof, here we use the property in this slide. First we have that f (x, y) is convex in (x, y) because [ ] A B B T 0 C Consider min y f (x, y) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 64 / 228

65 3-19: Minimization IV Because C ≻ 0, the minimum occurs at 2Cy + 2B^T x = 0. Then y = -C^{-1} B^T x and g(x) = x^T A x - 2x^T B C^{-1} B^T x + x^T B C^{-1} C C^{-1} B^T x = x^T (A - B C^{-1} B^T) x Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 65 / 228

66 3-19: Minimization V is convex. The second-order condition implies that A BC 1 B T 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 66 / 228
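A numerical sketch of this conclusion under an assumed random positive definite block matrix: its Schur complement A - BC^{-1}B^T should come out positive semidefinite.

```python
import numpy as np

# Check the minimization argument numerically: if [[A, B], [B^T, C]] is PSD
# with C positive definite, then the Schur complement A - B C^{-1} B^T is PSD.
rng = np.random.default_rng(3)
M = rng.standard_normal((6, 6))
M = M @ M.T + 1e-3 * np.eye(6)       # a random positive definite block matrix
A, B, C = M[:3, :3], M[:3, 3:], M[3:, 3:]

S = A - B @ np.linalg.solve(C, B.T)  # Schur complement A - B C^{-1} B^T
print(np.all(np.linalg.eigvalsh(S) >= -1e-10))   # True
```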

67 3-21: the conjugate function I This function is useful later When y is fixed, maximum happens at y = f (x) (2) by taking the derivative on x Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 67 / 228

68 3-21: the conjugate function II Explanation of the figure: when y is fixed z = xy is a straight line passing through the origin, where y is the slope of the line. Check under which x, yx and f (x) have the largest distance From the figure, the largest distance happens when (2) holds Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 68 / 228

69 3-21: the conjugate function III About the point (0, -f*(y)): the tangent line at x_0 is (z - f(x_0))/(x - x_0) = f'(x_0), where x_0 is the point satisfying y = f'(x_0). When x = 0, z = f(x_0) - x_0 f'(x_0) = -(x_0 y - f(x_0)) = -f*(y) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 69 / 228

70 3-21: the conjugate function IV f is convex: Given x, y T x f (x) is linear (convex) in y. Then we apply the property of pointwise supremum Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 70 / 228

71 3-22: examples I negative logarithm f (x) = log x x (xy + log x) = y + 1 x = 0 If y < 0, pictures of xy and log x are Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 71 / 228

72 3-22: examples II Thus xy + log x has maximum. Then xy + log x = 1 log( y) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 72 / 228

73 3-22: examples III strictly convex quadratic: Qx = y, x = Q^{-1} y, and y^T x - (1/2) x^T Q x = y^T Q^{-1} y - (1/2) y^T Q^{-1} Q Q^{-1} y = (1/2) y^T Q^{-1} y Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 73 / 228
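A quick sanity check of this conjugate formula with a made-up positive definite Q: y^T x - f(x) should never exceed (1/2) y^T Q^{-1} y, with equality at x = Q^{-1} y.

```python
import numpy as np

# For f(x) = (1/2) x^T Q x with Q positive definite, the conjugate is
# f*(y) = (1/2) y^T Q^{-1} y.  Check: y^T x - f(x) never exceeds f*(y),
# and equality holds at the maximizer x = Q^{-1} y.
rng = np.random.default_rng(4)
Q = rng.standard_normal((4, 4))
Q = Q @ Q.T + np.eye(4)                 # symmetric positive definite
y = rng.standard_normal(4)

f_star = 0.5 * y @ np.linalg.solve(Q, y)
X = rng.standard_normal((1000, 4))      # random trial points
vals = X @ y - 0.5 * np.einsum('ij,jk,ik->i', X, Q, X)
print(np.all(vals <= f_star + 1e-12))   # True
x_opt = np.linalg.solve(Q, y)
print(np.isclose(y @ x_opt - 0.5 * x_opt @ Q @ x_opt, f_star))  # True
```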

74 3-23: quasiconvex functions I Figure on slide: S α = [a, b], S β = (, c] Both are convex The figure is an example showing that quasi convex may not be convex Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 74 / 228

75 3-26: properties of quasiconvex functions I Modified Jensen inequality: f quasiconvex if and only if f(θx + (1 - θ)y) ≤ max{f(x), f(y)}, ∀x, y, θ ∈ [0, 1]. (⇒): Let α = max{f(x), f(y)}. Then x ∈ S_α, y ∈ S_α, and because S_α is convex, θx + (1 - θ)y ∈ S_α Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 75 / 228

76 3-26: properties of quasiconvex functions II f (θx + (1 θ)y) and the result is obtained If results are wrong, there exists α such that S α is not convex. x, y, θ with x, y S α, θ [0, 1] such that θx + (1 θ)y / S α Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 76 / 228

77 3-26: properties of quasiconvex functions III Then f (θx + (1 θ)y) > α max{f (x), f (y)} This violates the assumption First-order condition (this is exercise 3.43): Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 77 / 228

78 3-26: properties of quasiconvex functions IV f((1 - t)x + ty) ≤ max(f(x), f(y)) = f(x), so [f(x + t(y - x)) - f(x)]/t ≤ 0 and lim_{t→0} [f(x + t(y - x)) - f(x)]/t = ∇f(x)^T (y - x) ≤ 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 78 / 228

79 3-26: properties of quasiconvex functions V : If results are wrong, there exists α such that S α is not convex. x, y, θ with x, y S α, θ [0, 1] such that Then θx + (1 θ)y / S α f (θx + (1 θ)y) > α max{f (x), f (y)} (3) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 79 / 228

80 3-26: properties of quasiconvex functions VI Because f is differentiable, it is continuous. Without loss of generality, we have f (z) f (x), f (y), z between x and y (4) Let s give a 1-D interpretation. From (3), we can find a ball surrounding θx + (1 θ)y Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 80 / 228

81 3-26: properties of quasiconvex functions VII and two points x, y such that f (z) f (x ) = f (y ), z between x and y With (4), z = x + θ(y x), θ (0, 1) f (z) T ( θ(y x)) 0 f (z) T (y x θ(y x)) 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 81 / 228

82 3-26: properties of quasiconvex functions VIII Then f (z) T (y x) = 0, θ (0, 1) This contradicts (3). f (x + θ(y x)) =f (x) + f (t) T θ(y x) =f (x), θ [0, 1) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 82 / 228

83 3-27: Log-concave and log-convex functions I Powers: log(x a ) = a log x log x is concave Probability densities: log f (x) = 1 2 (x x)t Σ 1 (x x) + constant Σ 1 is positive definite. Thus log f (x) is concave Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 83 / 228

84 3-27: Log-concave and log-convex functions II Cumulative Gaussian distribution: log Φ(x) = log ∫_{-∞}^x e^{-u^2/2} du, d/dx log Φ(x) = e^{-x^2/2} / ∫_{-∞}^x e^{-u^2/2} du, d^2/dx^2 log Φ(x) = [(∫_{-∞}^x e^{-u^2/2} du) e^{-x^2/2} (-x) - e^{-x^2/2} e^{-x^2/2}] / (∫_{-∞}^x e^{-u^2/2} du)^2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 84 / 228

85 3-27: Log-concave and log-convex functions III Need to prove that (∫_{-∞}^x e^{-u^2/2} du) x + e^{-x^2/2} ≥ 0. Because x ≥ u for all u ∈ (-∞, x], Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 85 / 228

86 3-27: Log-concave and log-convex functions IV we have (∫_{-∞}^x e^{-u^2/2} du) x + e^{-x^2/2} ≥ ∫_{-∞}^x u e^{-u^2/2} du + e^{-x^2/2} = -e^{-u^2/2} |_{-∞}^x + e^{-x^2/2} = -e^{-x^2/2} + e^{-x^2/2} = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 86 / 228

87 3-27: Log-concave and log-convex functions V This proof was given by a student (and polished by another student) of this course before Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 87 / 228

88 4-3: Optimal and locally optimal points I f 0 (x) = 1/x f 0 (x) = x log x Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 88 / 228

89 4-3: Optimal and locally optimal points II f 0 (x) = x 3 3x f 0(x) = 1 + log x = 0 x = e 1 = 1/e Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 89 / 228

90 4-3: Optimal and locally optimal points III f 0(x) = 3x 2 3 = 0 x = ±1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 90 / 228

91 4-7: example I x 1 /(1 + x 2 2 ) 0 x 1 0 (x 1 + x 2 ) 2 = 0 x 1 + x 2 = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 91 / 228

92 4-9: Optimality criterion for differentiable f_0 I (⇐): easy. From the first-order condition, f_0(y) ≥ f_0(x) + ∇f_0(x)^T (y - x). Together with ∇f_0(x)^T (y - x) ≥ 0 we have f_0(y) ≥ f_0(x), for all feasible y Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 92 / 228

93 4-9: Optimality criterion for differentiable f_0 II Assume the result is wrong. Then ∇f_0(x)^T (y - x) < 0 for some feasible y. Let z(t) = ty + (1 - t)x. Then d/dt f_0(z(t)) = ∇f_0(z(t))^T (y - x) and d/dt f_0(z(t)) |_{t=0} = ∇f_0(x)^T (y - x) < 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 93 / 228

94 4-9: Optimality criterion for differentiable f_0 III There exists t such that f_0(z(t)) < f_0(x). Note that z(t) is feasible because f_i(z(t)) ≤ t f_i(x) + (1 - t) f_i(y) ≤ 0 and A(tx + (1 - t)y) = tAx + (1 - t)Ay = tb + (1 - t)b = b Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 94 / 228

95 4-9: Optimality criterion for differentiable f 0 IV Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 95 / 228

96 4-10 I Unconstrained problem: let y = x - t∇f_0(x). It is feasible (unconstrained problem). The optimality condition implies ∇f_0(x)^T (y - x) = -t ||∇f_0(x)||^2 ≥ 0. Thus ∇f_0(x) = 0. Equality constrained problem Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 96 / 228

97 4-10 II (⇐): easy. For any feasible y, Ay = b, so ∇f_0(x)^T (y - x) = -ν^T A(y - x) = -ν^T (b - b) = 0 ≥ 0. So x is optimal. (⇒): more complicated. We only do a rough explanation Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 97 / 228

98 4-10 III From optimality condition f 0 (x) T ν = f 0 (x) T ((x + ν) x) 0, ν N(A) N(A) is a subspace in 2-D. Thus ν N(A) ν N(A) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 98 / 228

99 4-10 IV f 0 N(A) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 99 / 228

100 4-10 V We have ∇f_0(x)^T ν = 0, ∀ν ∈ N(A), so ∇f_0(x) ⊥ N(A) and ∇f_0(x) ∈ R(A^T): ∃ν such that ∇f_0(x) + A^T ν = 0. Minimization over the nonnegative orthant: (⇐) easy Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 100 / 228

101 4-10 VI For any y 0, i f 0 (x)(y i x i ) = { i f 0 (x)y i 0 if x i = 0 0 if x i > 0. Therefore, and f 0 (x) T (y x) 0 x is optimal Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 101 / 228

102 4-10 VII If x i = 0, we claim i f 0 (x) 0 Otherwise, i f 0 (x) < 0 Let y = x except y i f 0 (x) T (y x) = i f 0 (x)(y i x i ) This violates the optimality condition Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 102 / 228

103 4-10 VIII If x i > 0, we claim i f 0 (x) = 0 Otherwise, assume i f 0 (x) > 0 Consider y = x except y i = x i /2 > 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 103 / 228

104 4-10 IX It is feasible. Then f 0 (x) T (y x) = i f 0 (x)(y i x i ) = i f 0 (x)x i /2 < 0 violates the optimality condition. The situation for i f 0 (x) < 0 is similar Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 104 / 228

105 4-23: examples I least-squares: min x^T (A^T A) x - 2b^T Ax + b^T b. A^T A may not be invertible, so we use the pseudo inverse. Linear program with random cost: c̄ = E(C), Σ = E_C((C - c̄)(C - c̄)^T) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 105 / 228
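A small sketch of the pseudo-inverse remark (the rank-deficient A below is constructed artificially by repeating a column): the pseudo-inverse and a least-squares solver both return the minimum-norm solution.

```python
import numpy as np

# When A^T A is singular the normal equations have no unique solution;
# the pseudo-inverse picks the minimum-norm least-squares solution.
rng = np.random.default_rng(5)
A = rng.standard_normal((8, 3))
A = np.hstack([A, A[:, :1]])          # repeat a column: A^T A is now singular
b = rng.standard_normal(8)

x_pinv = np.linalg.pinv(A) @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_pinv, x_lstsq))   # True: both give the min-norm solution
```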

106 4-23: examples II Var(C T x) = E C ((C T x c T x)(c T x c T x)) = E C (x T (C c)(c c) T x) = x T Σx Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 106 / 228

107 4-25: second-order cone programming I Cone was defined on slide 2-8 {(x, t) x t} Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 107 / 228

108 4-35: generalized inequality constraint I f i R n R k i K i -convex: f i (θx + (1 θ)y) Ki θf i (x) + (1 θ)f i (y) See page 3-31 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 108 / 228

109 4-37: LP and SOCP as SDP I LP and equivalent SDP a 11 a 1n x 1 Ax =.. a m1 a mn x n 11 1n a x n a... x 1 1 b... a m1 0 a mn b m Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 109 / 228

110 4-37: LP and SOCP as SDP II For SOCP and SDP we will use the result in 4-39: [tI_p A; A^T tI_q] ⪰ 0 ⟺ A^T A ⪯ t^2 I_q, t ≥ 0. Now p = m, q = 1, A = A_i x + b_i, t = c_i^T x + d_i. Thus ||A_i x + b_i||^2 ≤ (c_i^T x + d_i)^2 and c_i^T x + d_i ≥ 0 (from t ≥ 0), i.e., ||A_i x + b_i|| ≤ c_i^T x + d_i Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 110 / 228

111 4-39: matrix norm minimization I Following 4-38, we have the following equivalent problem: min t subject to ||A||_2 ≤ t. We then use ||A||_2 ≤ t ⟺ A^T A ⪯ t^2 I, t ≥ 0 ⟺ [tI A; A^T tI] ⪰ 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 111 / 228

112 4-39: matrix norm minimization II to have the SDP min subject to t[ ] ti A(x) A(x) T 0 ti Next we prove [ ] tip p A p q A T 0 A ti T A t 2 I q q, t 0 q q Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 112 / 228

113 4-39: matrix norm minimization III (⇒) we immediately have t ≥ 0. If t > 0, for any v, [-Av; tv]^T [tI_p A; A^T tI_q] [-Av; tv] = [-Av; tv]^T [-tAv + tAv; -A^T A v + t^2 v] = t(t^2 v^T v - v^T A^T A v) ≥ 0, so v^T (t^2 I - A^T A) v ≥ 0, ∀v Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 113 / 228

114 4-39: matrix norm minimization IV and hence t^2 I - A^T A ⪰ 0. If t = 0, [-Av; v]^T [0 A; A^T 0] [-Av; v] = [-Av; v]^T [Av; -A^T A v] = -2 v^T A^T A v ≥ 0, ∀v. Therefore A^T A ⪯ 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 114 / 228

115 4-39: matrix norm minimization V (⇐) Consider [u^T v^T] [tI A; A^T tI] [u; v] = [u^T v^T] [tu + Av; A^T u + tv] = t u^T u + 2 v^T A^T u + t v^T v. We hope to have t u^T u + 2 v^T A^T u + t v^T v ≥ 0, ∀(u, v) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 115 / 228

116 4-39: matrix norm minimization VI If t > 0, min_u t u^T u + 2 v^T A^T u + t v^T v has its optimum at u = -Av/t. We have t u^T u + 2 v^T A^T u + t v^T v = t v^T v - v^T A^T A v / t = (1/t) v^T (t^2 I - A^T A) v ≥ 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 116 / 228

117 4-39: matrix norm minimization VII Hence [ ] ti A A T 0 ti If t = 0 A T A 0 Thus v T A T Av 0, v T A T Av = Av 2 = 0 Av = 0, v [ u T v ] [ [ ] T 0 A u A 0] T v = [ u T v ] [ ] T 0 A T = 0 0 u Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 117 / 228

118 4-39: matrix norm minimization VIII Thus [ ] 0 A A T 0 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 118 / 228

119 4-40: Vector optimization I Though f 0 (x) is a vector note that f i (x) is still R n R 1 K-convex See 3-31 though we didn t discuss it earlier f 0 (θx + (1 θ)y) K θf 0 (x) + (1 θ)f 0 (y) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 119 / 228

120 4-41: optimal and pareto optimal points I See definition in slide 2-38 Optimal O {x} + K Pareto optimal (x K) O = {x} Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 120 / 228

121 5-3: Lagrange dual function I Note that g is concave no matter if the original problem is convex or not f 0 (x) + λ i f i (x) + ν i h i (x) is convex (linear) in λ, ν for each x Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 121 / 228

122 5-3: Lagrange dual function II Use pointwise supremum on 3-16: sup_{x ∈ D} (-f_0(x) - Σ_i λ_i f_i(x) - Σ_i ν_i h_i(x)) is convex. Hence inf_{x ∈ D} (f_0(x) + Σ_i λ_i f_i(x) + Σ_i ν_i h_i(x)) is concave. Note that sup(-...) convex ⟹ inf(...) concave Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 122 / 228

123 5-8: Lagrange dual and conjugate function I f_0^*(-A^T λ - C^T ν) = sup_x ((-A^T λ - C^T ν)^T x - f_0(x)) = -inf_x (f_0(x) + (A^T λ + C^T ν)^T x) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 123 / 228

124 5-9: The dual problem I From 5-5, the dual problem is max g(λ, ν) subject to λ ⪰ 0. It can be simplified to max -b^T ν subject to A^T ν - λ + c = 0, λ ⪰ 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 124 / 228

125 5-9: The dual problem II Further, max -b^T ν subject to A^T ν + c ⪰ 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 125 / 228
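A tiny made-up instance, solved with scipy, illustrating this primal/dual pair; the primal and dual optimal values should agree (strong duality for LP).

```python
import numpy as np
from scipy.optimize import linprog

# Small standard-form LP:  min c^T x  s.t.  Ax = b, x >= 0,
# and its dual             max -b^T nu  s.t.  A^T nu + c >= 0.
c = np.array([1.0, 2.0, 3.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

primal = linprog(c, A_eq=A, b_eq=b, bounds=(0, None))
# dual: maximize -b^T nu  <=>  minimize b^T nu  s.t.  -A^T nu <= c, nu free
dual = linprog(b, A_ub=-A.T, b_ub=c, bounds=(None, None))

print(primal.fun, -dual.fun)   # both 1.0: strong duality holds for this LP
```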

126 5-10: weak and strong duality I We don t discuss the SDP problem on this slide because we omitted 5-7 on the two-way partitioning problem Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 126 / 228

127 5-11: Slater s constraint qualification I We omit the proof because of no time linear inequality do not need to hold with strict inequality : for linear inequalities we DO NOT need constraint qualification We will see some explanation later Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 127 / 228

128 5-12: inequality from LP I If we have only linear constraints, then constraint qualification holds Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 128 / 228

129 5-15: geometric interpretation I Explanation of g(λ): when λ is fixed λu + t = is a line. We lower until it touches the boundary of G The value then becomes g(λ) When u = 0 t = so we see the point marked as g(λ) on t-axis Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 129 / 228

130 5-15: geometric interpretation II We have λ 0, so λu + t = must be like rather than Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 130 / 228

131 5-15: geometric interpretation III Explanation of p : In G, only points satisfying u 0 are feasible Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 131 / 228

132 5-16 I We do not discuss a formal proof of "Slater condition ⟹ strong duality". Instead, we explain this result by figures. Reason of using A: G may not be convex. Example: min x^2 subject to x + 2 ≤ 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 132 / 228

133 5-16 II This is a convex optimization problem is only a quadratic curve G = {(x + 2, x 2 ) x R} Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 133 / 228

134 5-16 III The curve is not convex However, A is convex Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 134 / 228

135 5-16 IV A Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 135 / 228

136 5-16 V Primal problem: x = -2, optimal objective value = 4. Dual problem: g(λ) = min_x x^2 + λ(x + 2), minimized at x = -λ/2, so max_{λ ≥ 0} -λ^2/4 + 2λ, optimal λ = 4 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 136 / 228
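A quick numerical look at this example (grids chosen arbitrarily): the feasible primal values bottom out at 4 and the dual function peaks at 4, as derived above.

```python
import numpy as np

# Numerical look at the example  min x^2  s.t.  x + 2 <= 0:
# primal optimum is x = -2 with value 4; the dual g(lam) = -lam^2/4 + 2*lam
# is maximized at lam = 4, also with value 4 (strong duality).
xs = np.linspace(-10, -2, 2001)           # feasible x (x + 2 <= 0)
print(np.min(xs**2))                      # 4.0 at x = -2

lams = np.linspace(0, 10, 2001)           # lam >= 0
g = -lams**2 / 4 + 2 * lams
print(np.max(g), lams[np.argmax(g)])      # 4.0 at lam = 4
```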

137 5-16 VI optimal objective value = -4 + 8 = 4. Proving that A is convex: (u_1, t_1) ∈ A, (u_2, t_2) ∈ A means ∃ x_1, x_2 such that f_1(x_1) ≤ u_1, f_0(x_1) ≤ t_1 and f_1(x_2) ≤ u_2, f_0(x_2) ≤ t_2. Consider x = θx_1 + (1 - θ)x_2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 137 / 228

138 5-16 VII We have f 1 (x) θu 1 + (1 θ)u 2 So f 0 (x) θt 1 + (1 θ)t 2 [ ] [ ] [ ] u u1 u2 = θ + (1 θ) A t t 1 t 2 Why non-vertical supporting hyperplane? Then g(λ) is well defined. Note that we have Slater condition strong duality Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 138 / 228

139 5-16 VIII However, it s possible that Slater condition doesn t hold but strong duality holds Example from exercise 5.22: min x subject to x 2 0 Slater condition doesn t hold because no x satisfies x 2 < 0 G = {(x 2, x) x R} Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 139 / 228

140 5-16 IX t u There is only one feasible point (0, 0) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 140 / 228

141 5-16 X t A u Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 141 / 228

142 5-16 XI Dual problem: g(λ) = min_x x + λx^2 = -1/(4λ) if λ > 0 (at x = -1/(2λ)) and -∞ if λ = 0; max_{λ ≥ 0} -1/(4λ): as λ → ∞, objective value → 0. d* = 0, p* = 0. Strong duality holds Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 142 / 228

143 5-17: complementary slackness I In deriving the inequality we use h i (x ) = 0 and f i (x ) 0 Complementary slackness compare the earlier results in 4-10 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 143 / 228

144 5-17: complementary slackness II 4-10: x is optimal of min subject to f 0 (x) x i 0, i if and only if x i 0, { i f 0 (x) 0 x i = 0 i f 0 (x) = 0 x i > 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 144 / 228

145 5-17: complementary slackness III From KKT condition i f 0 (x) = λ i λ i x i = 0 λ i 0, x i 0 If then x i > 0, λ i = 0 = i f 0 (x) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 145 / 228

146 5-19: KKT conditions for convex problem I For the problem on p5-16, neither slater condition nor KKT condition holds 1 λ0 Therefore, for convex problems, KKT optimality but not vice versa. Next we explain why for linear constraints we don t need constraint qualification Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 146 / 228

147 5-19: KKT conditions for convex problem II Consider the situation of inequality constraints only: min f_0(x) subject to f_i(x) ≤ 0, i = 1, ..., m. Consider an optimal solution x. We would like to see if x satisfies the KKT condition. We claim that ∇f_0(x) = Σ_{i: f_i(x)=0} λ_i ∇f_i(x) (5) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 147 / 228

148 5-19: KKT conditions for convex problem III Assume the result is wrong. Then ∇f_0(x) = (linear combination of {∇f_i(x) | f_i(x) = 0}) + Δ, where Δ ≠ 0 and Δ^T ∇f_i(x) = 0, ∀i : f_i(x) = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 148 / 228

149 5-19: KKT conditions for convex problem IV Then there exists α < 0 such that δx ≡ αΔ satisfies ∇f_i(x)^T δx = 0 if f_i(x) = 0 and f_i(x + δx) ≤ 0 if f_i(x) < 0. We claim that x + δx is feasible. That is, f_i(x + δx) ≤ 0, ∀i Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 149 / 228

150 5-19: KKT conditions for convex problem V We have f_i(x + δx) ≈ f_i(x) + ∇f_i(x)^T δx = 0 if f_i(x) = 0. However, ∇f_0(x)^T δx = α Δ^T Δ < 0. This contradicts the optimality condition from slide 4-9: for any feasible direction δx, ∇f_0(x)^T δx ≥ 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 150 / 228

151 5-19: KKT conditions for convex problem VI We do not continue to prove ∇f_0(x) = Σ_{λ_i ≤ 0, f_i(x)=0} λ_i ∇f_i(x) (6) because the proof is not trivial. However, what we want to say is that in proving (5), the proof is not rigorous because of the approximation f_i(x + δx) ≈ f_i(x) + ∇f_i(x)^T δx. For linear f_i the proof becomes rigorous. This roughly gives you a feeling that linear is different from non-linear Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 151 / 228

152 5-25 I Explanation of f 0 (ν) inf y (f 0(y) ν T y) = sup(ν T y f 0 (y)) = f0 (ν) y where f0 (ν) is the conjugate function Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 152 / 228

153 5-26 I The original problem Dual norm: If ν > 1, g(λ, ν) = inf x Ax b = constant ν sup{ν T y y 1} ν T y > 1, y 1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 153 / 228

154 5-26 II inf y + ν T y y ν T y < 0 Hence ty ν T (ty ) as t inf y If ν 1, we claim that y + νt y = inf y y + νt y = 0 y = 0 y + ν T y = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 154 / 228

155 5-26 III If y such that y + ν T y < 0 then y < ν T y We can scale y so that sup{ν T y y 1} > 1 but this causes a contradiction Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 155 / 228

156 5-27: implicit constraint I The dual function c T x + ν T (Ax b) = b T ν + x T (A T ν + c) inf x i(a T ν + c) i = (A T ν + c) i 1 x i 1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 156 / 228

157 5-30: semidefinite program I From 5-29 we need that Z is non-negative in the dual cone of S k + Dual cone of S k + is S k + (we didn t discuss dual cone so we assume this result) Why tr(z( ))? We are supposed to do component-wise produt between Z and x 1 F x n F n G Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 157 / 228

158 5-30: semidefinite program II Trace is the component-wise product: tr(AB) = Σ_i (AB)_{ii} = Σ_i Σ_j A_{ij} B_{ji} = Σ_i Σ_j A_{ij} B_{ij}. Note that we use the property that B is symmetric Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 158 / 228

159 7-4 I Uniform noise p(z) = { 1 2a if z a 0 otherwise Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 159 / 228

160 8-10: Dual of maximum margin problem I Lagrangian: ||a||/2 - Σ_i λ_i (a^T x_i + b - 1) + Σ_i μ_i (a^T y_i + b + 1) = ||a||/2 + a^T (-Σ_i λ_i x_i + Σ_i μ_i y_i) + b(-Σ_i λ_i + Σ_i μ_i) + Σ_i λ_i + Σ_i μ_i. Because of b(-Σ_i λ_i + Σ_i μ_i) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 160 / 228

161 8-10: Dual of maximum margin problem II we have inf_{a,b} L = Σ_i λ_i + Σ_i μ_i + { inf_a ||a||/2 + a^T (-Σ_i λ_i x_i + Σ_i μ_i y_i) if Σ_i λ_i = Σ_i μ_i; -∞ if Σ_i λ_i ≠ Σ_i μ_i }. For inf_a ||a||/2 + a^T (-Σ_i λ_i x_i + Σ_i μ_i y_i) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 161 / 228

162 8-10: Dual of maximum margin problem III we can denote it as inf a a 2 + v T a where v is a vector. We cannot do derivative because a is not differentiable. Formal solution: Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 162 / 228

163 8-10: Dual of maximum margin problem IV Case 1: If ||v|| ≤ 1/2: |a^T v| ≤ ||a|| ||v|| ≤ ||a||/2, so ||a||/2 + v^T a ≥ 0. However, a = 0 gives ||a||/2 + v^T a = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 163 / 228

164 8-10: Dual of maximum margin problem V Therefore inf_a ||a||/2 + v^T a = 0. If ||v|| > 1/2, let a = -tv/||v||: ||a||/2 + v^T a = t/2 - t||v|| = t(1/2 - ||v||) → -∞ as t → ∞ Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 164 / 228

165 8-10: Dual of maximum margin problem VI Thus inf_a ||a||/2 + v^T a = { 0 if ||v|| ≤ 1/2; -∞ otherwise }. Finally, inf_{a,b} L = Σ_i λ_i + Σ_i μ_i + { 0 if Σ_i λ_i = Σ_i μ_i and ||Σ_i λ_i x_i - Σ_i μ_i y_i|| ≤ 1/2; -∞ otherwise } Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 165 / 228

166 8-14. θ = vec(p) z i z j q, F (z) =. r z i. 1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 166 / 228

167 10-3: initial point and sublevel set I The condition that S is closed if domain of f = R^n. Proof: By definition, S is closed if for every convergent sequence {x_i} with x_i ∈ S and lim_{i→∞} x_i = x*, we have x* ∈ S Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 167 / 228

168 10-3: initial point and sublevel set II Because we have domain of f = R n x domain of f Thus by the continuity of f, lim f (x i) = f (x ) f (x 0 ) i and x S Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 168 / 228

169 10-3: initial point and sublevel set III The condition that S is closed if f (x) as x boundary of domain f Proof: if not, from the definition of the closeness of S, there exists {x i } S such that Thus x i x / domain f x is on the boundary Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 169 / 228

170 10-3: initial point and sublevel set IV Then violates f (x i ) > f (x 0 ) f (x i ) f (x 0 ), i Thus the assumption is wrong and S is closed Example f (x) = log( m exp(ai T x + b i )) i=1 domain = R n Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 170 / 228

171 10-3: initial point and sublevel set V Example f (x) = i log(b i a T i x) We use the condition that domain R n f (x) as x boundary of domain f Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 171 / 228

172 10-4: strong convexity and implications I S is bounded. Otherwise, there exists a sequence {y_i | y_i = x + Δ_i} ⊆ S satisfying lim_{i→∞} ||Δ_i|| = ∞. Then f(y_i) ≥ f(x) + ∇f(x)^T Δ_i + (m/2) ||Δ_i||^2 → ∞. This contradicts f(y) ≤ f(x_0) for all y ∈ S Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 172 / 228

173 10-4: strong convexity and implications II Proof of p* > -∞ and p* ≥ f(x) - (1/2m) ||∇f(x)||^2: From f(y) ≥ f(x) + ∇f(x)^T (y - x) + (m/2) ||x - y||^2, minimize the right-hand side with respect to y: ∇f(x) + m(y - x) = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 173 / 228

174 10-4: strong convexity and implications III ỹ = x - ∇f(x)/m, f(y) ≥ f(x) + ∇f(x)^T (ỹ - x) + (m/2) ||ỹ - x||^2 = f(x) - (1/2m) ||∇f(x)||^2, ∀y. Then p* ≥ f(x) - (1/2m) ||∇f(x)||^2 > -∞ and f(x) - p* ≤ (1/2m) ||∇f(x)||^2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 174 / 228

175 10-5: descent methods I If f(x + tΔx) < f(x) then ∇f(x)^T Δx < 0. Proof: From the first-order condition of a convex function, f(x + tΔx) ≥ f(x) + t ∇f(x)^T Δx. Then t ∇f(x)^T Δx ≤ f(x + tΔx) - f(x) < 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 175 / 228

176 10-6: line search types I Why α ∈ (0, 1/2)? The use of 1/2 is for convergence, though we won't discuss details. Finite termination of backtracking line search: we argue that ∃ t̄ > 0 such that f(x + tΔx) < f(x) + αt ∇f(x)^T Δx, ∀t ∈ (0, t̄). Otherwise, ∃ {t_k} → 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 176 / 228

177 10-6: line search types II such that f(x + t_k Δx) ≥ f(x) + α t_k ∇f(x)^T Δx, ∀k. However, lim_{t_k → 0} [f(x + t_k Δx) - f(x)]/t_k = ∇f(x)^T Δx ≥ α ∇f(x)^T Δx causes a contradiction because ∇f(x)^T Δx < 0 and α ∈ (0, 1) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 177 / 228

178 10-6: line search types III Graphical interpretation: the tangent line passes through (0, f(x)), so its equation is (y - f(x))/(t - 0) = ∇f(x)^T Δx. Because ∇f(x)^T Δx < 0, we see that the line of f(x) + αt ∇f(x)^T Δx Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 178 / 228

179 10-6: line search types IV is above that of f(x) + t ∇f(x)^T Δx Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 179 / 228
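A minimal sketch of gradient descent with this backtracking rule (function, parameters, and starting point below are made-up choices, with α in (0, 1/2) and β in (0, 1) as on the slide):

```python
import numpy as np

# Gradient descent with backtracking line search (a sketch, not the slides' code).
def backtracking_gd(f, grad, x0, alpha=0.25, beta=0.5, tol=1e-8, max_iter=500):
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        dx = -g                       # descent direction
        t = 1.0
        # backtrack until f(x + t*dx) < f(x) + alpha*t*grad(x)^T dx
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta
        x = x + t * dx
    return x

# Example: strongly convex quadratic f(x) = x1^2 + 10*x2^2
f = lambda x: x[0]**2 + 10 * x[1]**2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(backtracking_gd(f, grad, np.array([5.0, 3.0])))   # close to [0, 0]
```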

180 10-7 I Linear convergence. We consider exact line search; the proof for backtracking line search is more complicated. S closed and bounded ⟹ ∇^2 f(x) ⪯ MI, ∀x ∈ S, so f(y) ≤ f(x) + ∇f(x)^T (y - x) + (M/2) ||y - x||^2. Solve min_t f(x) - t ∇f(x)^T ∇f(x) + (t^2 M/2) ∇f(x)^T ∇f(x) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 180 / 228

181 10-7 II t = 1/M, f(x_next) ≤ f(x - (1/M) ∇f(x)) ≤ f(x) - (1/2M) ∇f(x)^T ∇f(x). The first inequality is from the fact that we use exact line search. f(x_next) - p* ≤ f(x) - p* - (1/2M) ∇f(x)^T ∇f(x). From slide 10-4, ||∇f(x)||^2 ≥ 2m(f(x) - p*) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 181 / 228

182 10-7 III Hence f (x next ) p (1 m M )(f (x) p ) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 182 / 228

183 10-8 I Assume x_1^k = γ((γ - 1)/(γ + 1))^k, x_2^k = (-(γ - 1)/(γ + 1))^k, ∇f(x_1, x_2) = [x_1; γx_2]. Exact line search: min_t (1/2)((x_1 - tx_1)^2 + γ(x_2 - tγx_2)^2) = min_t (1/2)(x_1^2 (1 - t)^2 + γ x_2^2 (1 - tγ)^2) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 183 / 228

184 10-8 II x_1^2 (1 - t)(-1) + γ x_2^2 (1 - tγ)(-γ) = 0, so t = (x_1^2 + γ^2 x_2^2)/(x_1^2 + γ^3 x_2^2) = [γ^2 ((γ-1)/(γ+1))^{2k} + γ^2 ((γ-1)/(γ+1))^{2k}]/[γ^2 ((γ-1)/(γ+1))^{2k} + γ^3 ((γ-1)/(γ+1))^{2k}] = 2γ^2/(γ^2 + γ^3) = 2/(1 + γ). Then x^{k+1} = x^k - t ∇f(x^k) = [x_1^k (1 - t); x_2^k (1 - γt)] Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 184 / 228

185 10-8 III x_1^{k+1} = γ((γ - 1)/(γ + 1))^k (1 - 2/(γ + 1)) = γ((γ - 1)/(γ + 1))^{k+1}, x_2^{k+1} = (-(γ - 1)/(γ + 1))^k (1 - 2γ/(1 + γ)) = (-(γ - 1)/(γ + 1))^k ((1 - γ)/(1 + γ)) = (-(γ - 1)/(γ + 1))^{k+1}. Why is the gradient orthogonal to the tangent line of the contour curve? Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 185 / 228

186 10-8 IV Assume f(g(t)) is constant along the contour with g(0) = x. Then 0 = f(g(t)) - f(g(0)) and 0 = lim_{t→0} [f(g(t)) - f(g(0))]/t = lim_{t→0} ∇f(g(t))^T g'(t) = ∇f(x)^T g'(0) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 186 / 228

187 10-8 V where x + t g'(0) is the tangent line Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 187 / 228
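A short script reproducing the closed-form iterates of slides 183-185 for an arbitrary γ: exact line search on f(x) = (x_1^2 + γx_2^2)/2 starting from (γ, 1) matches the formulas at every step.

```python
import numpy as np

# Reproduce the closed-form iterates for f(x) = (x1^2 + gamma*x2^2)/2 with
# exact line search, starting from (gamma, 1): the error shrinks by the
# factor (gamma-1)/(gamma+1) per step and the iterates zig-zag.
gamma = 10.0
x = np.array([gamma, 1.0])
for k in range(5):
    g = np.array([x[0], gamma * x[1]])          # gradient
    t = (g @ g) / (g[0]**2 + gamma * g[1]**2)   # exact minimizer of f(x - t g)
    x = x - t * g
    predicted = np.array([gamma * ((gamma - 1) / (gamma + 1))**(k + 1),
                          (-(gamma - 1) / (gamma + 1))**(k + 1)])
    print(np.allclose(x, predicted))            # True at every step
```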

188 10-10 I linear convergence: from slide 10-7 f (x k ) p c k (f (x 0 ) p ) log(c k (f (x 0 ) p )) = k log c + log(f (x 0 ) p ) is a straight line. Note that now k is the x-axis Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 188 / 228

189 10-11: steepest descent method I (unnormalized) steepest descent direction: x sd = f (x) x nsd Here is the dual norm We didn t discuss much about dual norm, but we can still explain some examples on Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 189 / 228

190 10-12 I Euclidean norm: Δx_nsd is obtained by solving min ∇f^T v subject to ||v|| = 1. ∇f^T v = ||∇f|| ||v|| cos θ = -||∇f|| when θ = π, so Δx_nsd = -∇f(x)/||∇f(x)||, and Δx_sd = ||∇f(x)||_* Δx_nsd = ||∇f(x)|| (-∇f(x)/||∇f(x)||) = -∇f(x) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 190 / 228

191 10-12 II Quadratic norm: x nsd is by solving min f T v subject to v T Pv = 1 Now v P = v T Pv, where P is symmetric positive definite Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 191 / 228

192 10-12 III Let w = P^{1/2} v. The optimization problem becomes min_w ∇f^T P^{-1/2} w subject to ||w|| = 1, with optimal w = -P^{-1/2} ∇f/||P^{-1/2} ∇f|| = -P^{-1/2} ∇f/√(∇f^T P^{-1} ∇f) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 192 / 228

193 10-12 IV optimal v = -P^{-1} ∇f/√(∇f^T P^{-1} ∇f) = Δx_nsd. Dual norm: ||z||_* = ||P^{-1/2} z||. Therefore Δx_sd = √(∇f^T P^{-1} ∇f) (-P^{-1} ∇f/√(∇f^T P^{-1} ∇f)) = -P^{-1} ∇f Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 193 / 228

194 10-12 V Explanation of the figure: f (x) T x nsd = f (x) x nsd cos θ f (x) is a constant. From a point x nsd on the boundary, the projected point on f (x) indicates x nsd cos θ In the figure, we see that the chosen x nsd has the largest x nsd cos θ We omit the discusssion of l 1 -norm Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 194 / 228

195 10-13 I The two figures are by using two P matrices; the left one has faster convergence. Gradient descent after change of variables y = P^{1/2} x, x = P^{-1/2} y: min_x f(x) ⟺ min_y f(P^{-1/2} y), y ← y - α P^{-1/2} ∇f(P^{-1/2} y), i.e., P^{1/2} x ← P^{1/2} x - α P^{-1/2} ∇f(x), i.e., x ← x - α P^{-1} ∇f(x) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 195 / 228

196 10-14 I f (x) x Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 196 / 228

197 10-14 II Solve f(x) = 0. Finding the tangent line at x_k (x_k: the current iterate): (y - f(x_k))/(x - x_k) = f'(x_k). Let y = 0: x_{k+1} = x_k - f(x_k)/f'(x_k) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 197 / 228
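A minimal sketch of this iteration for a 1-D root-finding example (the function below is an arbitrary choice):

```python
# Newton's method for a 1-D root of f(x) = 0, following the tangent-line
# derivation above: x_{k+1} = x_k - f(x_k)/f'(x_k).
def newton_root(f, fprime, x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x = x - fx / fprime(x)
    return x

# Example: root of f(x) = x^2 - 2 (i.e., sqrt(2))
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, 1.0))  # 1.41421356...
```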

198 10-16 I f̂(y) = f(x) + ∇f(x)^T (y - x) + (1/2)(y - x)^T ∇^2 f(x)(y - x). inf_y f̂(y): 0 = ∇f(x) + ∇^2 f(x)(y - x), so y - x = -∇^2 f(x)^{-1} ∇f(x) and inf_y f̂(y) = f(x) - (1/2) ∇f(x)^T ∇^2 f(x)^{-1} ∇f(x), i.e., f(x) - inf_y f̂(y) = (1/2) λ(x)^2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 198 / 228

199 10-16 II Norm of the Newton step in the quadratic Hessian norm: Δx_nt = -∇^2 f(x)^{-1} ∇f(x), Δx_nt^T ∇^2 f(x) Δx_nt = ∇f(x)^T ∇^2 f(x)^{-1} ∇f(x) = λ(x)^2. Directional derivative in the Newton direction: lim_{t→0} [f(x + t Δx_nt) - f(x)]/t = ∇f(x)^T Δx_nt = -∇f(x)^T ∇^2 f(x)^{-1} ∇f(x) = -λ(x)^2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 199 / 228
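A small sketch of a damped Newton method that uses λ(x)^2/2 from the previous slide as the stopping estimate (the test function and parameters below are made-up choices):

```python
import numpy as np

# Damped Newton's method sketch: stop when lambda(x)^2/2 (the estimated gap
# f(x) - inf_y fhat(y) from the slide) is small.  Backtracking as before.
def newton_min(f, grad, hess, x0, alpha=0.25, beta=0.5, eps=1e-10, iters=50):
    x = x0.astype(float)
    for _ in range(iters):
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)       # Newton step
        lam2 = -g @ dx                    # lambda(x)^2 = g^T H^{-1} g
        if lam2 / 2 <= eps:
            break
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta
        x = x + t * dx
    return x

# Example: f(x) = exp(x1 + x2) + x1^2 + x2^2 (smooth, strictly convex)
f = lambda x: np.exp(x[0] + x[1]) + x[0]**2 + x[1]**2
grad = lambda x: np.exp(x[0] + x[1]) * np.ones(2) + 2 * x
hess = lambda x: np.exp(x[0] + x[1]) * np.ones((2, 2)) + 2 * np.eye(2)
print(newton_min(f, grad, hess, np.array([1.0, -2.0])))
```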

200 10-16 III Affine invariance: define f̄(y) ≡ f(Ty), so f̄(y) = f(x) when x = Ty. Assume T is an invertible square matrix. Then λ̄(y) = λ(Ty). Proof: ∇f̄(y) = T^T ∇f(Ty), ∇^2 f̄(y) = T^T ∇^2 f(Ty) T Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 200 / 228

201 10-16 IV λ̄(y)^2 = ∇f̄(y)^T ∇^2 f̄(y)^{-1} ∇f̄(y) = ∇f(Ty)^T T T^{-1} ∇^2 f(Ty)^{-1} T^{-T} T^T ∇f(Ty) = ∇f(Ty)^T ∇^2 f(Ty)^{-1} ∇f(Ty) = λ(Ty)^2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 201 / 228

202 10-17 I Affine invariant y nt = 2 f (y) 1 f (y) = T 1 2 f (Ty) 1 f (Ty) = T 1 x nt Note that y k = T 1 x k so y k+1 = T 1 x k+1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 202 / 228

203 10-17 II But how about line search f (y) T y nt = f (Ty) T TT 1 x nt = f (x) T x nt Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 203 / 228


More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

Convex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014

Convex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014 Convex Optimization Dani Yogatama School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA February 12, 2014 Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12,

More information

Subgradient. Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes. definition. subgradient calculus

Subgradient. Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes. definition. subgradient calculus 1/41 Subgradient Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes definition subgradient calculus duality and optimality conditions directional derivative Basic inequality

More information

minimize x subject to (x 2)(x 4) u,

minimize x subject to (x 2)(x 4) u, Math 6366/6367: Optimization and Variational Methods Sample Preliminary Exam Questions 1. Suppose that f : [, L] R is a C 2 -function with f () on (, L) and that you have explicit formulae for

More information

Convex Functions. Pontus Giselsson

Convex Functions. Pontus Giselsson Convex Functions Pontus Giselsson 1 Today s lecture lower semicontinuity, closure, convex hull convexity preserving operations precomposition with affine mapping infimal convolution image function supremum

More information

14. Duality. ˆ Upper and lower bounds. ˆ General duality. ˆ Constraint qualifications. ˆ Counterexample. ˆ Complementary slackness.

14. Duality. ˆ Upper and lower bounds. ˆ General duality. ˆ Constraint qualifications. ˆ Counterexample. ˆ Complementary slackness. CS/ECE/ISyE 524 Introduction to Optimization Spring 2016 17 14. Duality ˆ Upper and lower bounds ˆ General duality ˆ Constraint qualifications ˆ Counterexample ˆ Complementary slackness ˆ Examples ˆ Sensitivity

More information

Convex Optimization in Communications and Signal Processing

Convex Optimization in Communications and Signal Processing Convex Optimization in Communications and Signal Processing Prof. Dr.-Ing. Wolfgang Gerstacker 1 University of Erlangen-Nürnberg Institute for Digital Communications National Technical University of Ukraine,

More information

subject to (x 2)(x 4) u,

subject to (x 2)(x 4) u, Exercises Basic definitions 5.1 A simple example. Consider the optimization problem with variable x R. minimize x 2 + 1 subject to (x 2)(x 4) 0, (a) Analysis of primal problem. Give the feasible set, the

More information

Lagrange Duality. Daniel P. Palomar. Hong Kong University of Science and Technology (HKUST)

Lagrange Duality. Daniel P. Palomar. Hong Kong University of Science and Technology (HKUST) Lagrange Duality Daniel P. Palomar Hong Kong University of Science and Technology (HKUST) ELEC5470 - Convex Optimization Fall 2017-18, HKUST, Hong Kong Outline of Lecture Lagrangian Dual function Dual

More information

Lecture 2: Convex Sets and Functions

Lecture 2: Convex Sets and Functions Lecture 2: Convex Sets and Functions Hyang-Won Lee Dept. of Internet & Multimedia Eng. Konkuk University Lecture 2 Network Optimization, Fall 2015 1 / 22 Optimization Problems Optimization problems are

More information

Duality. Lagrange dual problem weak and strong duality optimality conditions perturbation and sensitivity analysis generalized inequalities

Duality. Lagrange dual problem weak and strong duality optimality conditions perturbation and sensitivity analysis generalized inequalities Duality Lagrange dual problem weak and strong duality optimality conditions perturbation and sensitivity analysis generalized inequalities Lagrangian Consider the optimization problem in standard form

More information

Linear and non-linear programming

Linear and non-linear programming Linear and non-linear programming Benjamin Recht March 11, 2005 The Gameplan Constrained Optimization Convexity Duality Applications/Taxonomy 1 Constrained Optimization minimize f(x) subject to g j (x)

More information

Subgradients. subgradients and quasigradients. subgradient calculus. optimality conditions via subgradients. directional derivatives

Subgradients. subgradients and quasigradients. subgradient calculus. optimality conditions via subgradients. directional derivatives Subgradients subgradients and quasigradients subgradient calculus optimality conditions via subgradients directional derivatives Prof. S. Boyd, EE392o, Stanford University Basic inequality recall basic

More information

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order

More information

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be

More information

Lecture 2: Convex functions

Lecture 2: Convex functions Lecture 2: Convex functions f : R n R is convex if dom f is convex and for all x, y dom f, θ [0, 1] f is concave if f is convex f(θx + (1 θ)y) θf(x) + (1 θ)f(y) x x convex concave neither x examples (on

More information

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 17

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 17 EE/ACM 150 - Applications of Convex Optimization in Signal Processing and Communications Lecture 17 Andre Tkacenko Signal Processing Research Group Jet Propulsion Laboratory May 29, 2012 Andre Tkacenko

More information

In English, this means that if we travel on a straight line between any two points in C, then we never leave C.

In English, this means that if we travel on a straight line between any two points in C, then we never leave C. Convex sets In this section, we will be introduced to some of the mathematical fundamentals of convex sets. In order to motivate some of the definitions, we will look at the closest point problem from

More information

Unconstrained minimization: assumptions

Unconstrained minimization: assumptions Unconstrained minimization I terminology and assumptions I gradient descent method I steepest descent method I Newton s method I self-concordant functions I implementation IOE 611: Nonlinear Programming,

More information

HW1 solutions. 1. α Ef(x) β, where Ef(x) is the expected value of f(x), i.e., Ef(x) = n. i=1 p if(a i ). (The function f : R R is given.

HW1 solutions. 1. α Ef(x) β, where Ef(x) is the expected value of f(x), i.e., Ef(x) = n. i=1 p if(a i ). (The function f : R R is given. HW1 solutions Exercise 1 (Some sets of probability distributions.) Let x be a real-valued random variable with Prob(x = a i ) = p i, i = 1,..., n, where a 1 < a 2 < < a n. Of course p R n lies in the standard

More information

12. Interior-point methods

12. Interior-point methods 12. Interior-point methods Convex Optimization Boyd & Vandenberghe inequality constrained minimization logarithmic barrier function and central path barrier method feasibility and phase I methods complexity

More information

Lagrangian Duality and Convex Optimization

Lagrangian Duality and Convex Optimization Lagrangian Duality and Convex Optimization David Rosenberg New York University February 11, 2015 David Rosenberg (New York University) DS-GA 1003 February 11, 2015 1 / 24 Introduction Why Convex Optimization?

More information

Convex Optimization and Modeling

Convex Optimization and Modeling Convex Optimization and Modeling Duality Theory and Optimality Conditions 5th lecture, 12.05.2010 Jun.-Prof. Matthias Hein Program of today/next lecture Lagrangian and duality: the Lagrangian the dual

More information

ICS-E4030 Kernel Methods in Machine Learning

ICS-E4030 Kernel Methods in Machine Learning ICS-E4030 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 28. September, 2016 Juho Rousu 28. September, 2016 1 / 38 Convex optimization Convex optimisation This

More information

Summer School: Semidefinite Optimization

Summer School: Semidefinite Optimization Summer School: Semidefinite Optimization Christine Bachoc Université Bordeaux I, IMB Research Training Group Experimental and Constructive Algebra Haus Karrenberg, Sept. 3 - Sept. 7, 2012 Duality Theory

More information

Convex Optimization. 9. Unconstrained minimization. Prof. Ying Cui. Department of Electrical Engineering Shanghai Jiao Tong University

Convex Optimization. 9. Unconstrained minimization. Prof. Ying Cui. Department of Electrical Engineering Shanghai Jiao Tong University Convex Optimization 9. Unconstrained minimization Prof. Ying Cui Department of Electrical Engineering Shanghai Jiao Tong University 2017 Autumn Semester SJTU Ying Cui 1 / 40 Outline Unconstrained minimization

More information

N. L. P. NONLINEAR PROGRAMMING (NLP) deals with optimization models with at least one nonlinear function. NLP. Optimization. Models of following form:

N. L. P. NONLINEAR PROGRAMMING (NLP) deals with optimization models with at least one nonlinear function. NLP. Optimization. Models of following form: 0.1 N. L. P. Katta G. Murty, IOE 611 Lecture slides Introductory Lecture NONLINEAR PROGRAMMING (NLP) deals with optimization models with at least one nonlinear function. NLP does not include everything

More information

The Lagrangian L : R d R m R r R is an (easier to optimize) lower bound on the original problem:

The Lagrangian L : R d R m R r R is an (easier to optimize) lower bound on the original problem: HT05: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford Convex Optimization and slides based on Arthur Gretton s Advanced Topics in Machine Learning course

More information

Lecture Note 5: Semidefinite Programming for Stability Analysis

Lecture Note 5: Semidefinite Programming for Stability Analysis ECE7850: Hybrid Systems:Theory and Applications Lecture Note 5: Semidefinite Programming for Stability Analysis Wei Zhang Assistant Professor Department of Electrical and Computer Engineering Ohio State

More information

CSCI 1951-G Optimization Methods in Finance Part 09: Interior Point Methods

CSCI 1951-G Optimization Methods in Finance Part 09: Interior Point Methods CSCI 1951-G Optimization Methods in Finance Part 09: Interior Point Methods March 23, 2018 1 / 35 This material is covered in S. Boyd, L. Vandenberge s book Convex Optimization https://web.stanford.edu/~boyd/cvxbook/.

More information

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems Robert M. Freund February 2016 c 2016 Massachusetts Institute of Technology. All rights reserved. 1 1 Introduction

More information

Convex Optimization and l 1 -minimization

Convex Optimization and l 1 -minimization Convex Optimization and l 1 -minimization Sangwoon Yun Computational Sciences Korea Institute for Advanced Study December 11, 2009 2009 NIMS Thematic Winter School Outline I. Convex Optimization II. l

More information

Tutorial on Convex Optimization for Engineers Part II

Tutorial on Convex Optimization for Engineers Part II Tutorial on Convex Optimization for Engineers Part II M.Sc. Jens Steinwandt Communications Research Laboratory Ilmenau University of Technology PO Box 100565 D-98684 Ilmenau, Germany jens.steinwandt@tu-ilmenau.de

More information

Written Examination

Written Examination Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes

More information

I.3. LMI DUALITY. Didier HENRION EECI Graduate School on Control Supélec - Spring 2010

I.3. LMI DUALITY. Didier HENRION EECI Graduate School on Control Supélec - Spring 2010 I.3. LMI DUALITY Didier HENRION henrion@laas.fr EECI Graduate School on Control Supélec - Spring 2010 Primal and dual For primal problem p = inf x g 0 (x) s.t. g i (x) 0 define Lagrangian L(x, z) = g 0

More information

Geometric problems. Chapter Projection on a set. The distance of a point x 0 R n to a closed set C R n, in the norm, is defined as

Geometric problems. Chapter Projection on a set. The distance of a point x 0 R n to a closed set C R n, in the norm, is defined as Chapter 8 Geometric problems 8.1 Projection on a set The distance of a point x 0 R n to a closed set C R n, in the norm, is defined as dist(x 0,C) = inf{ x 0 x x C}. The infimum here is always achieved.

More information

Optimality, Duality, Complementarity for Constrained Optimization

Optimality, Duality, Complementarity for Constrained Optimization Optimality, Duality, Complementarity for Constrained Optimization Stephen Wright University of Wisconsin-Madison May 2014 Wright (UW-Madison) Optimality, Duality, Complementarity May 2014 1 / 41 Linear

More information

1. Introduction. mathematical optimization. least-squares and linear programming. convex optimization. example. course goals and topics

1. Introduction. mathematical optimization. least-squares and linear programming. convex optimization. example. course goals and topics 1. Introduction Convex Optimization Boyd & Vandenberghe mathematical optimization least-squares and linear programming convex optimization example course goals and topics nonlinear optimization brief history

More information

1. Introduction. mathematical optimization. least-squares and linear programming. convex optimization. example. course goals and topics

1. Introduction. mathematical optimization. least-squares and linear programming. convex optimization. example. course goals and topics 1. Introduction Convex Optimization Boyd & Vandenberghe mathematical optimization least-squares and linear programming convex optimization example course goals and topics nonlinear optimization brief history

More information

Inequality Constraints

Inequality Constraints Chapter 2 Inequality Constraints 2.1 Optimality Conditions Early in multivariate calculus we learn the significance of differentiability in finding minimizers. In this section we begin our study of the

More information

The proximal mapping

The proximal mapping The proximal mapping http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/37 1 closed function 2 Conjugate function

More information

Lecture: Introduction to LP, SDP and SOCP

Lecture: Introduction to LP, SDP and SOCP Lecture: Introduction to LP, SDP and SOCP Zaiwen Wen Beijing International Center For Mathematical Research Peking University http://bicmr.pku.edu.cn/~wenzw/bigdata2015.html wenzw@pku.edu.cn Acknowledgement:

More information

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 15 Newton Method and Self-Concordance. October 23, 2008 Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

More information

Equality constrained minimization

Equality constrained minimization Chapter 10 Equality constrained minimization 10.1 Equality constrained minimization problems In this chapter we describe methods for solving a convex optimization problem with equality constraints, minimize

More information

Lecture 8. Strong Duality Results. September 22, 2008

Lecture 8. Strong Duality Results. September 22, 2008 Strong Duality Results September 22, 2008 Outline Lecture 8 Slater Condition and its Variations Convex Objective with Linear Inequality Constraints Quadratic Objective over Quadratic Constraints Representation

More information

Appendix PRELIMINARIES 1. THEOREMS OF ALTERNATIVES FOR SYSTEMS OF LINEAR CONSTRAINTS

Appendix PRELIMINARIES 1. THEOREMS OF ALTERNATIVES FOR SYSTEMS OF LINEAR CONSTRAINTS Appendix PRELIMINARIES 1. THEOREMS OF ALTERNATIVES FOR SYSTEMS OF LINEAR CONSTRAINTS Here we consider systems of linear constraints, consisting of equations or inequalities or both. A feasible solution

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 4. Subgradient

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 4. Subgradient Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 4 Subgradient Shiqian Ma, MAT-258A: Numerical Optimization 2 4.1. Subgradients definition subgradient calculus duality and optimality conditions Shiqian

More information

Lecture 18: Optimization Programming

Lecture 18: Optimization Programming Fall, 2016 Outline Unconstrained Optimization 1 Unconstrained Optimization 2 Equality-constrained Optimization Inequality-constrained Optimization Mixture-constrained Optimization 3 Quadratic Programming

More information

8. Geometric problems

8. Geometric problems 8. Geometric problems Convex Optimization Boyd & Vandenberghe extremal volume ellipsoids centering classification placement and facility location 8 Minimum volume ellipsoid around a set Löwner-John ellipsoid

More information

8. Geometric problems

8. Geometric problems 8. Geometric problems Convex Optimization Boyd & Vandenberghe extremal volume ellipsoids centering classification placement and facility location 8 1 Minimum volume ellipsoid around a set Löwner-John ellipsoid

More information

ELE539A: Optimization of Communication Systems Lecture 15: Semidefinite Programming, Detection and Estimation Applications

ELE539A: Optimization of Communication Systems Lecture 15: Semidefinite Programming, Detection and Estimation Applications ELE539A: Optimization of Communication Systems Lecture 15: Semidefinite Programming, Detection and Estimation Applications Professor M. Chiang Electrical Engineering Department, Princeton University March

More information

Convex Optimization Theory. Chapter 5 Exercises and Solutions: Extended Version

Convex Optimization Theory. Chapter 5 Exercises and Solutions: Extended Version Convex Optimization Theory Chapter 5 Exercises and Solutions: Extended Version Dimitri P. Bertsekas Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com

More information

Convex Functions. Wing-Kin (Ken) Ma The Chinese University of Hong Kong (CUHK)

Convex Functions. Wing-Kin (Ken) Ma The Chinese University of Hong Kong (CUHK) Convex Functions Wing-Kin (Ken) Ma The Chinese University of Hong Kong (CUHK) Course on Convex Optimization for Wireless Comm. and Signal Proc. Jointly taught by Daniel P. Palomar and Wing-Kin (Ken) Ma

More information

15. Conic optimization

15. Conic optimization L. Vandenberghe EE236C (Spring 216) 15. Conic optimization conic linear program examples modeling duality 15-1 Generalized (conic) inequalities Conic inequality: a constraint x K where K is a convex cone

More information

Lagrange duality. The Lagrangian. We consider an optimization program of the form

Lagrange duality. The Lagrangian. We consider an optimization program of the form Lagrange duality Another way to arrive at the KKT conditions, and one which gives us some insight on solving constrained optimization problems, is through the Lagrange dual. The dual is a maximization

More information

Subgradients. subgradients. strong and weak subgradient calculus. optimality conditions via subgradients. directional derivatives

Subgradients. subgradients. strong and weak subgradient calculus. optimality conditions via subgradients. directional derivatives Subgradients subgradients strong and weak subgradient calculus optimality conditions via subgradients directional derivatives Prof. S. Boyd, EE364b, Stanford University Basic inequality recall basic inequality

More information

Chapter 3: Constrained Extrema

Chapter 3: Constrained Extrema Chapter 3: Constrained Extrema Math 368 c Copyright 2012, 2013 R Clark Robinson May 22, 2013 Chapter 3: Constrained Extrema 1 Implicit Function Theorem For scalar fn g : R n R with g(x ) 0 and g(x ) =

More information

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers.

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers. Chapter 3 Duality in Banach Space Modern optimization theory largely centers around the interplay of a normed vector space and its corresponding dual. The notion of duality is important for the following

More information

Additional Homework Problems

Additional Homework Problems Additional Homework Problems Robert M. Freund April, 2004 2004 Massachusetts Institute of Technology. 1 2 1 Exercises 1. Let IR n + denote the nonnegative orthant, namely IR + n = {x IR n x j ( ) 0,j =1,...,n}.

More information

Applications of Linear Programming

Applications of Linear Programming Applications of Linear Programming lecturer: András London University of Szeged Institute of Informatics Department of Computational Optimization Lecture 9 Non-linear programming In case of LP, the goal

More information

Convex functions. Definition. f : R n! R is convex if dom f is a convex set and. f ( x +(1 )y) < f (x)+(1 )f (y) f ( x +(1 )y) apple f (x)+(1 )f (y)

Convex functions. Definition. f : R n! R is convex if dom f is a convex set and. f ( x +(1 )y) < f (x)+(1 )f (y) f ( x +(1 )y) apple f (x)+(1 )f (y) Convex functions I basic properties and I operations that preserve convexity I quasiconvex functions I log-concave and log-convex functions IOE 611: Nonlinear Programming, Fall 2017 3. Convex functions

More information

Primal/Dual Decomposition Methods

Primal/Dual Decomposition Methods Primal/Dual Decomposition Methods Daniel P. Palomar Hong Kong University of Science and Technology (HKUST) ELEC5470 - Convex Optimization Fall 2018-19, HKUST, Hong Kong Outline of Lecture Subgradients

More information

10 Numerical methods for constrained problems

10 Numerical methods for constrained problems 10 Numerical methods for constrained problems min s.t. f(x) h(x) = 0 (l), g(x) 0 (m), x X The algorithms can be roughly divided the following way: ˆ primal methods: find descent direction keeping inside

More information

ORIE 6326: Convex Optimization. Quasi-Newton Methods

ORIE 6326: Convex Optimization. Quasi-Newton Methods ORIE 6326: Convex Optimization Quasi-Newton Methods Professor Udell Operations Research and Information Engineering Cornell April 10, 2017 Slides on steepest descent and analysis of Newton s method adapted

More information

Lecture 7: Convex Optimizations

Lecture 7: Convex Optimizations Lecture 7: Convex Optimizations Radu Balan, David Levermore March 29, 2018 Convex Sets. Convex Functions A set S R n is called a convex set if for any points x, y S the line segment [x, y] := {tx + (1

More information

Motivation. Lecture 2 Topics from Optimization and Duality. network utility maximization (NUM) problem:

Motivation. Lecture 2 Topics from Optimization and Duality. network utility maximization (NUM) problem: CDS270 Maryam Fazel Lecture 2 Topics from Optimization and Duality Motivation network utility maximization (NUM) problem: consider a network with S sources (users), each sending one flow at rate x s, through

More information

CHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS. W. Erwin Diewert January 31, 2008.

CHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS. W. Erwin Diewert January 31, 2008. 1 ECONOMICS 594: LECTURE NOTES CHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS W. Erwin Diewert January 31, 2008. 1. Introduction Many economic problems have the following structure: (i) a linear function

More information

Homework Set #6 - Solutions

Homework Set #6 - Solutions EE 15 - Applications of Convex Optimization in Signal Processing and Communications Dr Andre Tkacenko JPL Third Term 11-1 Homework Set #6 - Solutions 1 a The feasible set is the interval [ 4] The unique

More information

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space. Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space

More information

4TE3/6TE3. Algorithms for. Continuous Optimization

4TE3/6TE3. Algorithms for. Continuous Optimization 4TE3/6TE3 Algorithms for Continuous Optimization (Duality in Nonlinear Optimization ) Tamás TERLAKY Computing and Software McMaster University Hamilton, January 2004 terlaky@mcmaster.ca Tel: 27780 Optimality

More information

EE 546, Univ of Washington, Spring Proximal mapping. introduction. review of conjugate functions. proximal mapping. Proximal mapping 6 1

EE 546, Univ of Washington, Spring Proximal mapping. introduction. review of conjugate functions. proximal mapping. Proximal mapping 6 1 EE 546, Univ of Washington, Spring 2012 6. Proximal mapping introduction review of conjugate functions proximal mapping Proximal mapping 6 1 Proximal mapping the proximal mapping (prox-operator) of a convex

More information