Convex optimization COMS 4771
1. Recap: learning via optimization
Soft-margin SVMs

The soft-margin SVM optimization problem is defined by the training data:
$$\min_{w \in \mathbb{R}^d} \; \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^n \left[1 - y_i x_i^T w\right]_+ .$$
Compare with empirical risk minimization (ERM) for the zero-one loss:
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{y_i x_i^T w \le 0\}.$$
In both cases, the $i$-th term in the summation is a function of $y_i x_i^T w$.
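Both objectives are straightforward to evaluate. Below is a minimal sketch (my illustration, not from the slides) that computes each on synthetic data, assuming numpy and arbitrary choices of $\lambda$ and $w$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 5, 0.1
X = rng.normal(size=(n, d))      # rows are the feature vectors x_1, ..., x_n
y = np.sign(rng.normal(size=n))  # labels in {-1, +1}
w = rng.normal(size=d)           # an arbitrary candidate weight vector

margins = y * (X @ w)            # y_i x_i^T w for each i

# Soft-margin SVM objective: (lambda/2) ||w||_2^2 plus average hinge loss.
svm_objective = lam / 2 * np.dot(w, w) + np.mean(np.maximum(0.0, 1 - margins))
# Zero-one loss ERM objective: fraction of training points with margin <= 0.
zero_one_erm = np.mean(margins <= 0)

print(svm_objective, zero_one_erm)
```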
Zero-one loss vs. hinge loss

[Figure: the zero-one loss $\mathbf{1}\{y x^T w \le 0\}$ and the hinge loss $[1 - y x^T w]_+$, plotted as functions of the margin $y x^T w$.]

- Zero-one loss: count $1$ if $f_w(x) \ne y$.
- Hinge loss: an upper bound on the zero-one loss,
$$\mathbf{1}\{y x^T w \le 0\} \le \left[1 - y x^T w\right]_+ .$$

So the soft-margin SVM minimizes an upper bound on the training error rate, plus a term that encourages large margins.
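As a quick numeric sanity check of this bound (my addition, not from the slides), one can compare the two losses on a grid of margin values $z = y x^T w$:

```python
import numpy as np

z = np.linspace(-3, 3, 601)          # margin values y x^T w
zero_one = (z <= 0).astype(float)    # 1{y x^T w <= 0}
hinge = np.maximum(0.0, 1 - z)       # [1 - y x^T w]_+

# The hinge loss dominates the zero-one loss at every margin value.
assert np.all(hinge >= zero_one)
```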
Learning via optimization

- Zero-one loss ERM: $\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{y_i x_i^T w \le 0\}$
- Squared loss ERM: $\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (x_i^T w - y_i)^2$
- Logistic regression MLE: $\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ln\left(1 + \exp(-y_i x_i^T w)\right)$
- Soft-margin SVM: $\min_{w \in \mathbb{R}^d} \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^n \left[1 - y_i x_i^T w\right]_+$

Generic learning objective:
$$\min_{w \in \mathbb{R}^d} \; \underbrace{r(w)}_{\text{regularization}} \;+\; \underbrace{\frac{1}{n}\sum_{i=1}^n \varphi_i(w)}_{\text{data fitting}}$$

- Regularization: encodes the learning bias (e.g., prefer large margins).
- Data fitting: measures how poor the fit to the training data is.
Regularization

Other kinds of regularization encode other inductive biases (each is sketched in code at the end of this slide), e.g.:

- $r(w) = \lambda \|w\|_1$: encourages $w$ to be small and sparse.
- $r(w) = \lambda \sum_{i=1}^{d-1} |w_{i+1} - w_i|$: encourages piecewise constant weights.
- $r(w) = \lambda \|w - w_{\text{old}}\|_2^2$: encourages closeness to an old solution $w_{\text{old}}$.

Regularization can be combined with other data fitting objectives, e.g.:

- Ridge regression: $\min_{w \in \mathbb{R}^d} \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{2n}\sum_{i=1}^n (x_i^T w - y_i)^2$
- Lasso: $\min_{w \in \mathbb{R}^d} \lambda\|w\|_1 + \frac{1}{2n}\sum_{i=1}^n (x_i^T w - y_i)^2$
- Sparse SVM: $\min_{w \in \mathbb{R}^d} \lambda\|w\|_1 + \frac{1}{n}\sum_{i=1}^n \left[1 - y_i x_i^T w\right]_+$
- Sparse logistic regression: $\min_{w \in \mathbb{R}^d} \lambda\|w\|_1 + \frac{1}{n}\sum_{i=1}^n \ln\left(1 + \exp(-y_i x_i^T w)\right)$
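Each regularizer above is a one-liner in code. A minimal sketch (the helper names l1, total_variation, and proximity are mine, not from the slides):

```python
import numpy as np

def l1(w, lam):
    # lambda * ||w||_1 : encourages small, sparse weights.
    return lam * np.sum(np.abs(w))

def total_variation(w, lam):
    # lambda * sum_i |w_{i+1} - w_i| : encourages piecewise constant weights.
    return lam * np.sum(np.abs(np.diff(w)))

def proximity(w, w_old, lam):
    # lambda * ||w - w_old||_2^2 : encourages staying near an old solution.
    return lam * np.sum((w - w_old) ** 2)
```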
2. Introduction to convexity
Convex sets

A set $A$ is convex if, for every pair of points $x, x'$ in $A$, the line segment between $x$ and $x'$ is also contained in $A$.

[Figure: four example sets, labeled convex, not convex, convex, convex.]

Examples (the half-space case is spot-checked in the sketch below):

- All of $\mathbb{R}^d$.
- The empty set.
- Half-spaces: $\{x \in \mathbb{R}^d : b^T x \le c\}$.
- Intersections of convex sets.
- Convex hulls: $\mathrm{conv}(S) := \left\{\sum_{i=1}^k \alpha_i x_i : k \in \mathbb{N},\; x_i \in S,\; \alpha_i \ge 0,\; \sum_{i=1}^k \alpha_i = 1\right\}$.
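Convexity of a half-space can be verified numerically: any point on the segment between two feasible points must again be feasible. A small sketch (my illustration, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
b, c = rng.normal(size=3), 1.0
in_halfspace = lambda x: b @ x <= c   # the half-space {x : b^T x <= c}

for _ in range(1000):
    x1, x2 = rng.normal(size=3), rng.normal(size=3)
    if in_halfspace(x1) and in_halfspace(x2):
        a = rng.uniform()             # a point on the segment from x1 to x2
        assert in_halfspace((1 - a) * x1 + a * x2)
```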
Convex functions

A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if for any $x, x' \in \mathbb{R}^d$ and $\alpha \in [0, 1]$,
$$f\left((1-\alpha)x + \alpha x'\right) \le (1-\alpha)\, f(x) + \alpha\, f(x').$$

[Figure: a function that is not convex, where a chord dips below the graph, next to a function that is convex.]

Examples (see the numeric check below):

- $f(x) = c^x$ for any $c > 0$ (on $\mathbb{R}$).
- $f(x) = |x|^c$ for any $c \ge 1$ (on $\mathbb{R}$).
- $f(x) = b^T x$ for any $b \in \mathbb{R}^d$.
- $f(x) = \|x\|$ for any norm $\|\cdot\|$.
- $f(x) = x^T A x$ for symmetric positive semidefinite $A$.
- $af + g$ for convex functions $f$ and $g$, and $a \ge 0$.
- $\max\{f, g\}$ for convex functions $f$ and $g$.
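A single failed instance of the defining inequality disproves convexity, so random sampling gives a cheap one-sided check. A sketch (the helper looks_convex is hypothetical, my own):

```python
import numpy as np

def looks_convex(f, d=3, trials=2000, seed=0):
    """Sample the inequality f((1-a)x + a x') <= (1-a) f(x) + a f(x').
    A violation disproves convexity; passing all trials is only evidence."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, xp = rng.normal(size=d), rng.normal(size=d)
        a = rng.uniform()
        if f((1 - a) * x + a * xp) > (1 - a) * f(x) + a * f(xp) + 1e-9:
            return False
    return True

A = np.eye(3)                                # symmetric positive semidefinite
print(looks_convex(np.linalg.norm))          # a norm: True
print(looks_convex(lambda x: x @ A @ x))     # PSD quadratic: True
print(looks_convex(lambda x: np.sin(x[0])))  # sine: False (not convex)
```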
Verifying convexity

Is $f(x) = \|x\|$ convex? Pick any $\alpha \in [0, 1]$ and any $x, x' \in \mathbb{R}^d$:
$$\begin{aligned}
f\left((1-\alpha)x + \alpha x'\right) &= \left\|(1-\alpha)x + \alpha x'\right\| \\
&\le \left\|(1-\alpha)x\right\| + \left\|\alpha x'\right\| && \text{(triangle inequality)} \\
&= (1-\alpha)\|x\| + \alpha\|x'\| && \text{(homogeneity)} \\
&= (1-\alpha)f(x) + \alpha f(x').
\end{aligned}$$
Yes, $f$ is convex.
Convexity of differentiable functions

Differentiable functions. If $f : \mathbb{R}^d \to \mathbb{R}$ is differentiable, then $f$ is convex if and only if
$$f(x) \ge f(x_0) + \nabla f(x_0)^T (x - x_0) \quad \text{for all } x, x_0 \in \mathbb{R}^d.$$

[Figure: a convex $f$ lying above its affine approximation $a(x) = f(x_0) + f'(x_0)(x - x_0)$ at $x_0$.]

Twice-differentiable functions. If $f : \mathbb{R}^d \to \mathbb{R}$ is twice-differentiable, then $f$ is convex if and only if $\nabla^2 f(x) \succeq 0$ for all $x \in \mathbb{R}^d$ (i.e., the Hessian, the matrix of second derivatives, is positive semidefinite for all $x$).
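In code, the second-order condition is often the easiest to check, via the eigenvalues of the Hessian. A minimal sketch (my example, not from the slides) for the squared-loss objective $f(w) = \frac{1}{n}\|Xw - y\|_2^2$, whose Hessian is $\frac{2}{n} X^T X$ for every $w$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 4))

hessian = 2 * X.T @ X / n                  # constant in w for squared loss
eigenvalues = np.linalg.eigvalsh(hessian)  # eigvalsh: for symmetric matrices
print(np.all(eigenvalues >= -1e-10))       # all >= 0, so the Hessian is PSD
```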
Verifying convexity of differentiable functions

Is $f(x) = x^4$ convex? Use the second-order condition for convexity:
$$\frac{\partial f(x)}{\partial x} = 4x^3, \qquad \frac{\partial^2 f(x)}{\partial x^2} = 12x^2 \ge 0.$$
Yes, $f$ is convex.

Is $f(x) = e^{\langle a, x \rangle}$ convex? Use the first-order condition for convexity:
$$\nabla f(x) = e^{\langle a, x \rangle} \, \nabla\{\langle a, x \rangle\} = e^{\langle a, x \rangle} a \quad \text{(chain rule)}.$$
The difference between $f$ and its affine approximation is
$$\begin{aligned}
f(x) - \left(f(x_0) + \langle \nabla f(x_0), x - x_0 \rangle\right)
&= e^{\langle a, x \rangle} - \left(e^{\langle a, x_0 \rangle} + e^{\langle a, x_0 \rangle} \langle a, x - x_0 \rangle\right) \\
&= e^{\langle a, x_0 \rangle}\left(e^{\langle a, x - x_0 \rangle} - \left(1 + \langle a, x - x_0 \rangle\right)\right) \\
&\ge 0 \qquad \text{(because } 1 + z \le e^z \text{ for all } z \in \mathbb{R}\text{)}.
\end{aligned}$$
Yes, $f$ is convex.
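The first-order condition just verified can also be spot-checked numerically: the affine approximation at any $x_0$ should never exceed $f$. A sketch (my illustration, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=4)
f = lambda x: np.exp(a @ x)
grad = lambda x: np.exp(a @ x) * a   # the gradient derived above (chain rule)

for _ in range(1000):
    x, x0 = rng.normal(size=4), rng.normal(size=4)
    # f lies above its affine approximation at x0 (up to float tolerance).
    assert f(x) >= f(x0) + grad(x0) @ (x - x0) - 1e-9
```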
3. Convex optimization problems
Optimization problems

A typical optimization problem (in standard form) is written as
$$\min_{x \in \mathbb{R}^d} \; f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0 \;\text{ for all } i = 1, \dots, n.$$

- $f_0 : \mathbb{R}^d \to \mathbb{R}$ is the objective function;
- $f_1, \dots, f_n : \mathbb{R}^d \to \mathbb{R}$ are the constraint functions;
- the inequalities $f_i(x) \le 0$ are the constraints;
- $\mathcal{A} := \left\{x \in \mathbb{R}^d : f_i(x) \le 0 \text{ for all } i = 1, 2, \dots, n\right\}$ is the feasible region.

Goal: find $x \in \mathcal{A}$ so that $f_0(x)$ is as small as possible.

- The (optimal) value of the optimization problem is the smallest value $f_0(x)$ achieved by a feasible point $x \in \mathcal{A}$.
- A point $x \in \mathcal{A}$ achieving the optimal value is a (global) minimizer.
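Standard-form problems map directly onto generic solvers. A minimal sketch (my example, not from the slides), assuming scipy is available; note that scipy.optimize.minimize expects inequality constraints in the form $g(x) \ge 0$, so one passes $g = -f_1$:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize f0(x) = ||x - (2, 2)||_2^2 subject to f1(x) = x_1 + x_2 - 1 <= 0.
f0 = lambda x: (x[0] - 2) ** 2 + (x[1] - 2) ** 2
f1 = lambda x: x[0] + x[1] - 1

result = minimize(f0, x0=np.zeros(2),
                  constraints=[{"type": "ineq", "fun": lambda x: -f1(x)}])
print(result.x)  # approx (0.5, 0.5): the feasible point closest to (2, 2)
```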
Convex optimization problems

Standard form of a convex optimization problem:
$$\min_{x \in \mathbb{R}^d} \; f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0 \;\text{ for all } i = 1, \dots, n,$$
where $f_0, f_1, \dots, f_n : \mathbb{R}^d \to \mathbb{R}$ are convex functions.

Fact: the feasible set
$$\mathcal{A} := \left\{x \in \mathbb{R}^d : f_i(x) \le 0 \text{ for all } i = 1, 2, \dots, n\right\}$$
is a convex set. (Each set $\{x : f_i(x) \le 0\}$ is convex because $f_i$ is convex, and $\mathcal{A}$ is an intersection of these convex sets.)
Example: Soft-margin SVM

$$\begin{aligned}
\min_{w \in \mathbb{R}^d,\; \xi_1, \dots, \xi_n \in \mathbb{R}} \quad & \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^n \xi_i \\
\text{s.t.} \quad & y_i x_i^T w \ge 1 - \xi_i \;\text{ for all } i = 1, \dots, n, \\
& \xi_i \ge 0 \;\text{ for all } i = 1, \dots, n.
\end{aligned}$$

Bringing it to standard form:
$$\begin{aligned}
\min_{w \in \mathbb{R}^d,\; \xi_1, \dots, \xi_n \in \mathbb{R}} \quad & \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^n \xi_i && \text{(sum of two convex functions, also convex)} \\
\text{s.t.} \quad & 1 - \xi_i - y_i x_i^T w \le 0 \;\text{ for all } i = 1, \dots, n && \text{(linear constraints)} \\
& -\xi_i \le 0 \;\text{ for all } i = 1, \dots, n && \text{(linear constraints)}
\end{aligned}$$

Which of Zero-one loss ERM, Squared loss ERM, Logistic regression MLE, Ridge regression, Lasso, and Sparse SVM are convex? (A solver sketch for the soft-margin SVM follows below.)
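Since the objective is convex and the constraints are linear, the problem can be handed to a disciplined convex solver essentially as written. A minimal sketch (my own, assuming the cvxpy package and synthetic data):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 5, 0.1
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))

w = cp.Variable(d)
xi = cp.Variable(n)                   # slack variables xi_1, ..., xi_n
objective = cp.Minimize(lam / 2 * cp.sum_squares(w) + cp.sum(xi) / n)
constraints = [cp.multiply(y, X @ w) >= 1 - xi,  # y_i x_i^T w >= 1 - xi_i
               xi >= 0]
cp.Problem(objective, constraints).solve()
print(w.value)
```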
Local minimizers

Consider an optimization problem (not necessarily convex):
$$\min_{x \in \mathbb{R}^d} \; f_0(x) \quad \text{s.t.} \quad x \in \mathcal{A}.$$

We say $x \in \mathcal{A}$ is a local minimizer if there is an open ball
$$U := \left\{x' \in \mathbb{R}^d : \|x' - x\|_2 < r\right\}$$
of positive radius $r > 0$ such that $x$ is a global minimizer for
$$\min_{x' \in \mathbb{R}^d} \; f_0(x') \quad \text{s.t.} \quad x' \in \mathcal{A} \cap U.$$
In words: nothing looks better than $x$ in the immediate vicinity of $x$.

[Figure: a locally optimal point on a non-convex objective.]
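To see why the distinction matters for non-convex problems, here is a small illustration (mine, not from the slides): gradient descent on $f(x) = x^4 - 3x^2 + x$ lands in different minimizers depending on where it starts:

```python
# f(x) = x^4 - 3x^2 + x has two local minimizers; only one is global.
f_grad = lambda x: 4 * x**3 - 6 * x + 1

def descend(x, step=0.01, iters=5000):
    for _ in range(iters):
        x -= step * f_grad(x)
    return x

print(descend(-2.0))  # approx -1.30: the global minimizer
print(descend(+2.0))  # approx +1.13: a local minimizer that is not global
```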
Local minimizers of convex problems

If the optimization problem is convex, and $x \in \mathcal{A}$ is a local minimizer, then it is also a global minimizer.

[Figure: for a convex objective, a local minimizer is also the global minimizer.]
Key takeaways

1. Formulation of learning methods as optimization problems.
2. Convex sets, convex functions, and ways to check whether a function is convex.
3. Standard form of convex optimization problems; local and global minimizers.