Lecture 26: April 22nd

Size: px

Start display at page:

Download "Lecture 26: April 22nd"

Theresa Barber
6 years ago
Views:

1 10-725/36-725: Conve Optimization Spring 2015 Lecture 26: April 22nd Lecturer: Ryan Tibshirani Scribes: Eric Wong, Jerzy Wieczorek, Pengcheng Zhou Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor. The set of conve optimization is only a small subset of all conve optimizations problems. Most problems are non-conve! Non-conve problems are typically harder to solve and analyze and have higher variance, but some can be solved eactly (to global optimality) Non-conve problems Non-conve problems typically have higher variance. A heuristic argument for this is as follows: Suppose we have a conve problem and a non-conve problem. If we perturb the input data by some smooth function, a non-conve surface could result in substantially different local ima depending on the realization of the input data. On the other hand, a conve surface is more likely to result in the same or similar (perturbed) ima. Today s lecture is largely about a few special cases of non-conve problems that happen to be tractable. It does not cover general strategies for arbitrary non-conve problems Definition A nonconve problem is defined as subject to f() h i () 0, i = 1,..., m l j () = 0, j = 1,..., r (26.1) where at least one of the conditions for a conve problem is not met (f and h i are conve, l j are affine). This definition includes non-conve formulations of conve problems. For eample f() + g() subject to g() = 0 (26.2) where f is conve, and g is non-conve can be reduced to a conve problem, but is a non-conve formulation. From this point on, we consider non-conve problems that are not trivially equivalent to conve ones. Nonconve problems can have local ima. By solving a nonconve problem, we mean finding the global imizer. We also implicitly mean doing it efficiently, i.e., in polynomial time. 26-1

2 26-2 Lecture 26: April 22nd Linear fractional programs A linear fraction program is c T + d e T + f subject to G h, e T + f > 0 A = b (26.3) This is not conve because the ratio of linear functions is not conve, but quasiconve. Using a transformation, this is actually equivalent to a linear program. Interestingly, a place where this shows up are the knots in the LASSO solution path. Every knot is the solution of some linear-fractional program. (A quasiconve function is one where all the sublevel sets are conve. For eample, f() = is not conve, but is quasiconve: for any t > 0, the set { : f() t} is a conve set.) Takeaway: try adding a variable to transform the non-conve problem into a conve one Geometric programs A geometric program is subject to f() h i () 1, i = 1,..., m l j () = 1, j = 1,..., r (26.4) where f, h i are posynomials and l j are monomials. We can transform this to a conve problem with y i = log i. Takeaway: try transforg your variables to convert the non-conve problem into a conve one Conve equality constraints If equality constraints are conve but not affine, we can rela them from l() = 0 to l() 0. This is very useful when l() is conve but not affine. Then, if l( ) = 0 at any solution, then the relaation is equivalent. This idea is used in the maimum utility problem, which we also saw in HW1 Q1. Takeaway: try a conve relaation of your constraints to turn the non-conve problem into a conve one Two quadratic functions Consider the following problem T A 0 + 2b T 0 + c 0 subject to T A 1 + 2b T 1 + c 1 0 (26.5) Note this is not conve since A 0, A 1 are not necessarily positive definite. However, its dual problem can be cast as a semidefinite program and is conve, and it can be shown that strong duality holds. Takeaway: try deriving the dual, which will be conve; if strong duality holds, solving the (conve) dual may solve your (non-conve) primal problem.

3 Lecture 26: April 22nd Eigen Problems Now, we move on from classical non-conve problems Principal Component Analysis PCA is the problem of given a matri Y : Y X X 2 F subject to rank(x) = k This has the well-known solution of taking a singular value decomposition, and truncating all but the top k singular values. This is not a conve problem due to the constraint (the objective is conve). The rank constraint is nonconve: for eample, let J 11, J 22 be 2 2 zero matrices with a single 1 in the upper left corner and lower right corner respectively. Then, each of these have rank 1, but their average has rank 2. This rank constraint is a matri analogue to the l 0 norm on vectors (which is also non-conve). The l 0 norm counts the number of nonzero entries; the rank constraint counts the number of nonzero singular values. Here is another (equivalent) characterization of the SVD. Given a symmetric matri S (which you can think of as Y T Y or Cov(Y ) for the Y in the problem above): S Z Z 2 F subject to rank(z) = k, Z is a projection This is also non-conve, since both rank and projection constraints are non-conve. The solution is Ẑ = V kv T k where the columns of V k are the first k eigenvectors of S. This is equivalent to a conve problem: first reepress the constraint set C as C = {Z R p p : Z = Z T, λ i (Z) {0, 1} for i = 1,..., p, tr(z) = k} The only offending (non-conve) constraint here is λ i (Z) {0, 1}. We can rela this to λ i (Z) (0, 1) (this is equivalent to relaing the problem to its conve hull). Now we have C = {Z R p p : Z = Z T, 0 Z 1, tr(z) = k} which is a conve set called the Fantope of order k. Relaing this Fantope to its conve hull actually admits the same solution to the original, but it is now a conve formulation. Vu et al. (2013) use this conve relaation via the Fantope to derive a sparse version of PCA. Adding a l 1 penalty term to the original non-conve problem is intractable. Adding a l 1 penalty term to the Fantopebased conve relaation is still conve, and hence tractable. However, the solution to this penalized-andrelaed problem is no longer necessarily the same as the solution to the original penalized problem... but maybe that is not a big concern since the original wasn t tractable anyway Other problems MDS (multidimensional scaling) and generalized eigenvalue problems are other eigen problems, but we will skip them for the sake of time.

4 26-4 Lecture 26: April 22nd 26.3 Graph problems Min cut The -cut problem is given a graph G = (V, E), two nodes s, t V, and costs c ij 0 on the edges (i, j) E, imizes the cost of a partition between s and t. This is not conve; in fact is is an integer program. If we throw out some of the hard constraints, we end up with a linear program which is the dual of the ma flow linear program (see lecture 12 for more detail) Shortest paths Finding the shortest path between nodes is also a non-conve problem. Dijkstra s algorithm solves this and more, in a reasonable time O( E log V ). We skip details for sake of time Non-conve proimal operators Consider a weighted l 0 norm: (y i i ) 2 + λ i 1{ i 0} (26.6) By inspection, the solution is to hard-threshold y: i = y i if y 2 i > λ i, and 0 otherwise. If we switch the loss to y X 2 2, this is called best subset selection. This is generally NP hard! l 0 segmentation Given a sequence y i, find a sequence i that imizes the number of jumps. specifically, n 1 (y i i ) 2 + λ 1{ i i+1 } (26.7) Recall that we get fused lasso if we replace 1{ i i+1 } with i i+1. Both of these can be solved with dynamic programg. Note that this is different from typical dynamic programg since this is over a continuous range of variables. An algorithm by Johnson is more efficient, and a version by Bellman is more general Tree-leaves projection Given a target u R n, tree g on R n and label y {0, 1}, consider z u z λ 1{g(z) y} In other words, we want to find a z such that the the label under the tree is the same as y. We can argue that the solution is either ẑ = u or ẑ = P S (u), the projection onto the set S = g 1 (1) = {z : g(z) = y}. This is the projection onto a set of tree leaves, which is very non-conve.

5 Lecture 26: April 22nd 26-5 S is a union of boes (tree leaves), but projection onto a set of boes is non-conve. We could project onto each bo individually, but this is epensive. Instead, label every node with the list of its leaves and the bounding bo of its leaves. Now perform DFS and prune nodes that do not contain a leaf labeled y, or if the bounding bo is further away from the current closest Discrete problems Skipped Infinite-dimensional problems Skipped Statistical problems Sparse underdetered linear systems Given X such that X i 2 = 1 for i = 1,..., n and y, consider 0 subject to X = y This is non-conve and also NP hard without additional restrictions on the form of X. The natural approach is to rela to the l 1 norm: 1 subject to X = y These two problems are interestingly connected. If p > n very large, there is a threshold ρ(p/n) such that for most matrices X if we solve the l 1 problem and find a solution with less than ρn non-zero components, then this is the unique solution of the l 0 problem. If we have greater than ρn non-zero components, then there is no solution with less than ρn non-zero components. In this case, maybe we shouldn t care about it. In this case, most refers to with high probability when drawing X uniformly at random.

Lecture 4: September 12

10-725/36-725: Conve Optimization Fall 2016 Lecture 4: September 12 Lecturer: Ryan Tibshirani Scribes: Jay Hennig, Yifeng Tao, Sriram Vasudevan Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: