Convexity II: Optimization Basics


1 Convexity II: Optimization Basics
Lecturer: Ryan Tibshirani
Convex Optimization 10-725/36-725
See the supplements for reviews of basic multivariate calculus and basic linear algebra.

2 Last time: convex sets and functions
Convex calculus makes it easy to check convexity. Tools:
- Definitions of convex sets and functions, classic examples
- Key properties (e.g., first- and second-order characterizations for functions)
- Operations that preserve convexity (e.g., affine composition)

[Figure: the chord from (x, f(x)) to (y, f(y)) lying above the graph of a convex function]

E.g., is max{ log(1/(a^T x + b)^7), ‖Ax + b‖_1^5 } convex?

3 Outline
Today:
- Optimization terminology
- Properties and first-order optimality
- Equivalent transformations

4 Optimization terminology
Reminder: a convex optimization problem (or program) is

min_{x ∈ D} f(x)
subject to g_i(x) ≤ 0, i = 1, ..., m
           Ax = b

where f and g_i, i = 1, ..., m, are all convex, and the optimization domain is D = dom(f) ∩ ⋂_{i=1}^m dom(g_i) (often we do not write D)
- f is called the criterion or objective function
- g_i is called an inequality constraint function
- If x ∈ D, g_i(x) ≤ 0 for i = 1, ..., m, and Ax = b, then x is called a feasible point
- The minimum of f(x) over all feasible points is called the optimal value, written f*
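
To make the standard form concrete, here is a minimal sketch in cvxpy (the modeling library, the data, and the specific f and g_i below are illustrative additions, not part of the lecture):

```python
# A made-up convex program in standard form, posed and solved with cvxpy.
import cvxpy as cp
import numpy as np

np.random.seed(0)
n = 5
A = np.random.randn(2, n)
b = A @ np.random.randn(n)                 # chosen so that Ax = b is feasible

x = cp.Variable(n)
f = cp.sum_squares(x - 1)                  # a convex criterion f(x)
g = [cp.norm(x, 1) - 3, cp.max(x) - 2]     # convex g_i(x); constraints g_i(x) <= 0

prob = cp.Problem(cp.Minimize(f), [gi <= 0 for gi in g] + [A @ x == b])
prob.solve()
print(prob.value)   # the optimal value f*
print(x.value)      # a feasible point attaining it, i.e., a solution
```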

5 If x is feasible and f(x) = f*, then x is called optimal; also called a solution, or a minimizer¹
If x is feasible and f(x) ≤ f* + ɛ, then x is called ɛ-suboptimal
If x is feasible and g_i(x) = 0, then we say g_i is active at x

Convex minimization

min_x f(x) subject to g_i(x) ≤ 0, i = 1, ..., m; Ax = b

can be reposed as concave maximization

max_x −f(x) subject to g_i(x) ≤ 0, i = 1, ..., m; Ax = b

Both are called convex optimization problems

¹ Note: a convex optimization problem need not have solutions, i.e., need not attain its minimum, but we will not be careful about this

6 Convex solution sets
Let X_opt be the set of all solutions of a convex problem, written

X_opt = argmin_x f(x) subject to g_i(x) ≤ 0, i = 1, ..., m; Ax = b

Key property: X_opt is a convex set

Proof: use the definitions. If x, y are solutions, then for 0 ≤ t ≤ 1:
- tx + (1 − t)y ∈ D
- g_i(tx + (1 − t)y) ≤ t g_i(x) + (1 − t) g_i(y) ≤ 0
- A(tx + (1 − t)y) = tAx + (1 − t)Ay = b
- f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y) = f*

Therefore tx + (1 − t)y is also a solution

Another key property: if f is strictly convex, then the solution is unique, i.e., X_opt contains one element

7 Example: lasso
Given y ∈ R^n and X ∈ R^{n×p}, consider the lasso problem:

min_β ‖y − Xβ‖_2^2 subject to ‖β‖_1 ≤ s

Is this convex? What is the criterion function? The inequality and equality constraints? The feasible set? Is the solution unique, when:
- n ≥ p and X has full column rank?
- p > n (the "high-dimensional" case)?

How do our answers change if we change the criterion to the Huber loss:

∑_{i=1}^n ρ(y_i − x_i^T β), where ρ(z) = z^2/2 if |z| ≤ δ, and δ(|z| − δ/2) otherwise?
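
As a hedged sketch (cvxpy and the synthetic data are additions for illustration, not the lecture's), the constrained-form lasso and a Huber variant can be posed directly; note that cvxpy's huber(z, M) equals z^2 for |z| ≤ M and 2M|z| − M^2 otherwise, i.e., twice the ρ above with δ = M:

```python
# Constrained-form lasso on synthetic data, plus a Huber-loss variant.
import cvxpy as cp
import numpy as np

np.random.seed(0)
n, p, s = 50, 20, 3.0
X = np.random.randn(n, p)
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.randn(n)

beta = cp.Variable(p)
constraint = [cp.norm(beta, 1) <= s]

lasso = cp.Problem(cp.Minimize(cp.sum_squares(y - X @ beta)), constraint)
lasso.solve()
print(np.round(beta.value, 3))       # typically sparse

huber = cp.Problem(cp.Minimize(cp.sum(cp.huber(y - X @ beta, M=1.0))), constraint)
huber.solve()                        # still a convex problem: Huber loss is convex
```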

8 Example: support vector machines
Given y ∈ {−1, 1}^n and X ∈ R^{n×p} with rows x_1, ..., x_n, consider the support vector machine or SVM problem:

min_{β, β_0, ξ} (1/2)‖β‖_2^2 + C ∑_{i=1}^n ξ_i
subject to ξ_i ≥ 0, i = 1, ..., n
           y_i(x_i^T β + β_0) ≥ 1 − ξ_i, i = 1, ..., n

Is this convex? What are the criterion, constraints, and feasible set? Is the solution (β, β_0, ξ) unique? What if we changed the criterion to

(1/2)‖β‖_2^2 + (1/2)β_0^2 + C ∑_{i=1}^n ξ_i^{1.01} ?

For the original criterion, what about just the β component, at the solution?
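
A sketch of this primal SVM in cvxpy (synthetic data; the value of C is an arbitrary choice, and none of this code is from the slides):

```python
# Primal SVM: 0.5 ||beta||^2 + C * sum(xi), with margin and slack constraints.
import cvxpy as cp
import numpy as np

np.random.seed(1)
n, p, C = 40, 2, 1.0
X = np.random.randn(n, p)
y = np.sign(X @ np.array([1.0, -1.0]) + 0.1 * np.random.randn(n))

beta, beta0, xi = cp.Variable(p), cp.Variable(), cp.Variable(n)
margins = cp.multiply(y, X @ beta + beta0)      # y_i (x_i^T beta + beta_0)
prob = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(beta) + C * cp.sum(xi)),
    [xi >= 0, margins >= 1 - xi])
prob.solve()
print(beta.value, beta0.value)
```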

9 Local minima are global minima
For a convex problem, a feasible point x is called locally optimal if there is some R > 0 such that

f(x) ≤ f(y) for all feasible y with ‖x − y‖_2 ≤ R

Reminder: for convex optimization problems, local optima are global optima. The proof simply follows from the definitions.

[Figure: a convex function, whose only local minimum is global, versus a nonconvex function with a non-global local minimum]

10 Rewriting constraints
The optimization problem

min_x f(x)
subject to g_i(x) ≤ 0, i = 1, ..., m
           Ax = b

can be rewritten as

min_x f(x) subject to x ∈ C

where C = {x : g_i(x) ≤ 0, i = 1, ..., m, Ax = b} is the feasible set. Hence the above formulation is completely general

With I_C the indicator function of C, we can write this in unconstrained form as

min_x f(x) + I_C(x)

11 First-order optimality condition
For a convex problem

min_x f(x) subject to x ∈ C

with differentiable f, a feasible point x is optimal if and only if

∇f(x)^T (y − x) ≥ 0 for all y ∈ C

This is called the first-order condition for optimality

In words: all feasible directions from x are aligned with the gradient ∇f(x)

Important special case: if C = R^n (unconstrained optimization), then the optimality condition reduces to the familiar ∇f(x) = 0

12 Example: quadratic minimization
Consider minimizing the quadratic function

f(x) = (1/2) x^T Q x + b^T x + c

where Q ⪰ 0. The first-order condition says that the solution satisfies

∇f(x) = Qx + b = 0

Cases:
- if Q ≻ 0, then there is a unique solution x = −Q^{-1} b
- if Q is singular and b ∉ col(Q), then there is no solution (i.e., min_x f(x) = −∞)
- if Q is singular and b ∈ col(Q), then there are infinitely many solutions,

x = −Q^+ b + z, z ∈ null(Q)

where Q^+ is the pseudoinverse of Q
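
A small numpy check of these cases (Q and b below are an illustrative example where Q is singular but b ∈ col(Q)):

```python
import numpy as np

Q = np.array([[2.0, 0.0],
              [0.0, 0.0]])       # positive semidefinite but singular
b = np.array([-2.0, 0.0])        # lies in col(Q), so solutions exist

# If Q were positive definite: x = -np.linalg.solve(Q, b) would be unique.
x = -np.linalg.pinv(Q) @ b       # one solution, via the pseudoinverse
print(Q @ x + b)                 # gradient Qx + b = 0, confirming optimality
# Every x + z with z in null(Q) (here, any multiple of e_2) is also a solution.
```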

13 Example: equality-constrained minimization
Consider the equality-constrained convex problem:

min_x f(x) subject to Ax = b

with f differentiable. Let's prove the Lagrange multiplier optimality condition

∇f(x) + A^T u = 0 for some u

According to first-order optimality, the solution x satisfies Ax = b and

∇f(x)^T (y − x) ≥ 0 for all y such that Ay = b

This is equivalent to

∇f(x)^T v = 0 for all v ∈ null(A)

The result follows because null(A)^⊥ = row(A)
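
As a sketch of this condition in action, take f(x) = (1/2)‖x‖_2^2 (an example chosen here, not from the slides), so that ∇f(x) = x; stacking the Lagrange condition with Ax = b gives a linear system we can solve for x and u:

```python
# Solve min 0.5 ||x||^2 s.t. Ax = b via the system [I A^T; A 0][x; u] = [0; b].
import numpy as np

np.random.seed(2)
m, n = 2, 4
A = np.random.randn(m, n)        # full row rank (holds a.s. for random A)
b = np.random.randn(m)

K = np.block([[np.eye(n), A.T],
              [A, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([np.zeros(n), b]))
x, u = sol[:n], sol[n:]
print(np.allclose(x + A.T @ u, 0))   # Lagrange condition: grad f(x) + A^T u = 0
print(np.allclose(A @ x, b))         # feasibility
```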

14 Example: projection onto a convex set
Consider projection onto a convex set C:

min_x ‖a − x‖_2^2 subject to x ∈ C

The first-order optimality condition says that the solution x satisfies

∇f(x)^T (y − x) = 2(x − a)^T (y − x) ≥ 0 for all y ∈ C

Equivalently, this says that

a − x ∈ N_C(x)

where recall N_C(x) is the normal cone to C at x
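
For intuition, here is a tiny numerical check (an illustration assuming C = [0, 1]^n, where the projection is just a coordinatewise clip), verifying the first-order condition at random feasible points:

```python
import numpy as np

np.random.seed(3)
n = 5
a = np.random.randn(n)
x = np.clip(a, 0.0, 1.0)             # projection of a onto the box [0,1]^n

Y = np.random.rand(200, n)           # random points y in C
# first-order condition: (x - a)^T (y - x) >= 0 for every y in C
print(np.all((Y - x) @ (x - a) >= -1e-12))   # True
```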

15 Partial optimization
Reminder: g(x) = min_{y ∈ C} f(x, y) is convex in x, provided that f is convex in (x, y) and C is a convex set

Therefore we can always partially optimize a convex problem and retain convexity

E.g., if we decompose x = (x_1, x_2) ∈ R^{n_1 + n_2}, then

min_{x_1, x_2} f(x_1, x_2)
subject to g_1(x_1) ≤ 0
           g_2(x_2) ≤ 0

is equivalent to

min_{x_1} f̃(x_1) subject to g_1(x_1) ≤ 0

where f̃(x_1) = min_{x_2} { f(x_1, x_2) : g_2(x_2) ≤ 0 }. The second problem is convex if the first one is

16 Example: hinge form of SVMs
Recall the SVM problem

min_{β, β_0, ξ} (1/2)‖β‖_2^2 + C ∑_{i=1}^n ξ_i
subject to ξ_i ≥ 0, y_i(x_i^T β + β_0) ≥ 1 − ξ_i, i = 1, ..., n

Rewrite the constraints as ξ_i ≥ max{0, 1 − y_i(x_i^T β + β_0)}. Indeed we can argue that we have equality at the solution

Therefore plugging in for the optimal ξ gives the hinge form of SVMs:

min_{β, β_0} (1/2)‖β‖_2^2 + C ∑_{i=1}^n [1 − y_i(x_i^T β + β_0)]_+

where a_+ = max{0, a} is called the hinge function
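
A sketch of the hinge form in cvxpy (same synthetic setup as the earlier SVM sketch; cp.pos implements the hinge a_+ = max{0, a}):

```python
# Unconstrained hinge-loss SVM: the slack variables have been minimized out.
import cvxpy as cp
import numpy as np

np.random.seed(1)
n, p, C = 40, 2, 1.0
X = np.random.randn(n, p)
y = np.sign(X @ np.array([1.0, -1.0]) + 0.1 * np.random.randn(n))

beta, beta0 = cp.Variable(p), cp.Variable()
hinge = cp.pos(1 - cp.multiply(y, X @ beta + beta0))  # [1 - y_i(x_i^T beta + beta_0)]_+
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta) + C * cp.sum(hinge)))
prob.solve()   # matches the constrained form's solution in (beta, beta0)
```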

17 Transformations and change of variables
If h : R → R is a monotone increasing transformation, then

min_x f(x) subject to x ∈ C  ⟺  min_x h(f(x)) subject to x ∈ C

Similarly, inequality or equality constraints can be transformed to yield equivalent optimization problems. We can use this to reveal the hidden convexity of a problem

If φ : R^n → R^m is one-to-one, and its image covers the feasible set C, then we can change variables in an optimization problem:

min_x f(x) subject to x ∈ C  ⟺  min_y f(φ(y)) subject to φ(y) ∈ C

18 Example: geometric programming
A monomial is a function f : R^n_{++} → R of the form

f(x) = γ x_1^{a_1} x_2^{a_2} ⋯ x_n^{a_n}

for γ > 0 and a_1, ..., a_n ∈ R. A posynomial is a sum of monomials,

f(x) = ∑_{k=1}^p γ_k x_1^{a_{k1}} x_2^{a_{k2}} ⋯ x_n^{a_{kn}}

A geometric program is of the form

min_x f(x)
subject to g_i(x) ≤ 1, i = 1, ..., m
           h_j(x) = 1, j = 1, ..., r

where f and g_i, i = 1, ..., m, are posynomials and h_j, j = 1, ..., r, are monomials. This is nonconvex

19 Let's prove that a geometric program is equivalent to a convex one. Given f(x) = γ x_1^{a_1} x_2^{a_2} ⋯ x_n^{a_n}, let y_i = log x_i and rewrite this as

γ (e^{y_1})^{a_1} (e^{y_2})^{a_2} ⋯ (e^{y_n})^{a_n} = e^{a^T y + b}

for b = log γ. Likewise, a posynomial can be written as ∑_{k=1}^p e^{a_k^T y + b_k}. With this variable substitution, and after taking logs, a geometric program is equivalent to

min_y log( ∑_{k=1}^{p_0} e^{a_{0k}^T y + b_{0k}} )
subject to log( ∑_{k=1}^{p_i} e^{a_{ik}^T y + b_{ik}} ) ≤ 0, i = 1, ..., m
           c_j^T y + d_j = 0, j = 1, ..., r

This is convex, recalling the convexity of soft max (log-sum-exp) functions
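
For what it's worth, cvxpy automates exactly this transformation: passing gp=True solves a geometric program by the log change of variables above (the tiny GP below is a made-up illustration, not from the slides):

```python
import cvxpy as cp

x = cp.Variable(pos=True)
y = cp.Variable(pos=True)

# minimize the posynomial x + y; x*y >= 4 rewrites as 4 x^-1 y^-1 <= 1
prob = cp.Problem(cp.Minimize(x + y), [x * y >= 4])
prob.solve(gp=True)                     # log transform + log-sum-exp internally
print(prob.value, x.value, y.value)     # ~4.0, attained at x = y = 2
```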

20 Many interesting problems are geometric programs, e.g., floor planning:

[Figure: rectangles of width w_i and height h_i at positions (x_i, y_i), with areas C_i, packed inside a bounding box of width W and height H]

See Boyd et al. (2007), "A tutorial on geometric programming", and also Chapter 8.8 of the B & V book

21 Example floor planning program:

min_{W, H, x, y, w, h} WH
subject to 0 ≤ x_i ≤ W, i = 1, ..., n
           0 ≤ y_i ≤ H, i = 1, ..., n
           x_i + w_i ≤ x_j, (i, j) ∈ L
           y_i + h_i ≤ y_j, (i, j) ∈ B
           w_i h_i = C_i, i = 1, ..., n

Check: why is this a geometric program?

22 Eliminating equality constraints
An important special case of change of variables: eliminating equality constraints. Given the problem

min_x f(x)
subject to g_i(x) ≤ 0, i = 1, ..., m
           Ax = b

we can always express any feasible point as x = My + x_0, where Ax_0 = b and col(M) = null(A). Hence the above is equivalent to

min_y f(My + x_0) subject to g_i(My + x_0) ≤ 0, i = 1, ..., m

Note: this is fully general but not always a good idea (practically)
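
A sketch of building this parametrization numerically (using scipy's null_space and a least-squares particular solution are implementation choices here, with illustrative A and b):

```python
import numpy as np
from scipy.linalg import null_space

np.random.seed(4)
m, n = 2, 5
A = np.random.randn(m, n)
b = np.random.randn(m)

M = null_space(A)                           # columns of M span null(A)
x0 = np.linalg.lstsq(A, b, rcond=None)[0]   # a particular solution, A x0 = b

y = np.random.randn(M.shape[1])             # any y yields a feasible x
x = M @ y + x0
print(np.allclose(A @ x, b))                # True
```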

23 Introducing slack variables
Essentially the opposite of eliminating equality constraints: introducing slack variables. Given the problem

min_x f(x)
subject to g_i(x) ≤ 0, i = 1, ..., m
           Ax = b

we can transform the inequality constraints via

min_{x, s} f(x)
subject to s_i ≥ 0, i = 1, ..., m
           g_i(x) + s_i = 0, i = 1, ..., m
           Ax = b

Note: this is no longer convex unless g_i, i = 1, ..., m, are affine

24 Relaxing nonaffine equality constraints
Given an optimization problem

min_x f(x) subject to x ∈ C

we can always take an enlarged constraint set C̃ ⊇ C and consider

min_x f(x) subject to x ∈ C̃

This is called a relaxation, and its optimal value is always smaller than or equal to that of the original problem

Important special case: relaxing nonaffine equality constraints, i.e.,

h_j(x) = 0, j = 1, ..., r

where the h_j, j = 1, ..., r, are convex but nonaffine, are replaced with

h_j(x) ≤ 0, j = 1, ..., r

25 Example: maximum utility problem
The maximum utility problem models investment/consumption:

max_{x, b} ∑_{t=1}^T α_t u(x_t)
subject to b_{t+1} = b_t + f(b_t) − x_t, t = 1, ..., T
           0 ≤ x_t ≤ b_t, t = 1, ..., T

Here b_t is the budget and x_t is the amount consumed at time t; f is an investment return function and u is a utility function, both concave and increasing

Is this a convex problem? What if we replace the equality constraints with inequalities:

b_{t+1} ≤ b_t + f(b_t) − x_t, t = 1, ..., T ?

26 Example: principal components analysis
Given X ∈ R^{n×p}, consider the low rank approximation problem:

min_R ‖X − R‖_F^2 subject to rank(R) = k

Here ‖A‖_F^2 = ∑_{i=1}^n ∑_{j=1}^p A_{ij}^2, the entrywise squared ℓ_2 norm, and rank(A) denotes the rank of A

This is also called the principal components analysis or PCA problem. Given X = UDV^T, the singular value decomposition or SVD, the solution is

R = U_k D_k V_k^T

where U_k, V_k are the first k columns of U, V, and D_k contains the first k diagonal elements of D. I.e., R is the reconstruction of X from its first k principal components

This problem is not convex. Why?
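
A quick numpy sketch of this SVD solution on synthetic X, checking that the optimal value equals the sum of the squared trailing singular values:

```python
import numpy as np

np.random.seed(5)
n, p, k = 30, 10, 3
X = np.random.randn(n, p)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
R = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]     # R = U_k D_k V_k^T
err = np.linalg.norm(X - R, 'fro')**2
print(np.allclose(err, np.sum(d[k:]**2)))     # True: the optimal value
```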

27 We can recast the PCA problem in a convex form. First rewrite it as

min_{Z ∈ S^p} ‖X − XZ‖_F^2 subject to rank(Z) = k, Z is a projection

⟺ max_{Z ∈ S^p} tr(SZ) subject to rank(Z) = k, Z is a projection

where S = X^T X. Hence the constraint set is the nonconvex set

C = {Z ∈ S^p : λ_i(Z) ∈ {0, 1}, i = 1, ..., p, tr(Z) = k}

where λ_i(Z), i = 1, ..., p, are the eigenvalues of Z. The solution in this formulation is

Z = V_k V_k^T

where V_k gives the first k columns of V

28 Now consider relaxing the constraint set to F_k = conv(C), its convex hull. Note

F_k = {Z ∈ S^p : λ_i(Z) ∈ [0, 1], i = 1, ..., p, tr(Z) = k}
    = {Z ∈ S^p : 0 ⪯ Z ⪯ I, tr(Z) = k}

Recall this is called the Fantope of order k

Hence the linear maximization over the Fantope, namely

max_{Z ∈ F_k} tr(SZ)

is convex. Remarkably, this is equivalent to the nonconvex PCA problem (it admits the same solution)!

(Famous result: Fan (1949), "On a theorem of Weyl concerning eigenvalues of linear transformations")
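
As a sanity check of Fan's result, here is a hedged cvxpy sketch of the Fantope program; it is a semidefinite program, so an SDP-capable solver (e.g., the SCS solver bundled with cvxpy) is assumed:

```python
import cvxpy as cp
import numpy as np

np.random.seed(6)
n, p, k = 50, 6, 2
X = np.random.randn(n, p)
S = X.T @ X

Z = cp.Variable((p, p), symmetric=True)
prob = cp.Problem(cp.Maximize(cp.trace(S @ Z)),
                  [Z >> 0, np.eye(p) - Z >> 0, cp.trace(Z) == k])
prob.solve()

top_k = np.sort(np.linalg.eigvalsh(S))[-k:].sum()
print(prob.value, top_k)   # both are (approximately) the sum of the top k eigenvalues
```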

29 References and further reading
- S. Boyd and L. Vandenberghe (2004), Convex Optimization, Chapter 4
- O. Guler (2010), Foundations of Optimization, Chapter 4
