Convex relaxation

The art and science of convex relaxation revolves around taking a non-convex problem that you want to solve and replacing it with a convex problem which you can actually solve; the solution to the convex program gives information about (usually a lower bound on) the solution to the original program. Usually this is done by either convexifying the constraints or convexifying the functional; we will see examples of both below.

MINCUT

Previously, we have looked at the problem of finding the minimum cut of a directed graph. We have vertices indexed by 1, ..., N; vertex 1 is the source and vertex N is the sink. Between each pair of vertices (i, j) there is a capacity C_{i,j} ≥ 0; if there is no edge from i to j, we take C_{i,j} = 0. A cut partitions the vertices into two sets: a set S which contains the source, and a set S^c which contains the sink. The capacity of the cut is the sum of the capacities of all the edges that originate in S and terminate in S^c. In the example below, we have N = 6, and the cut we are considering has S = {1, 2, 4, 5}:

[Figure: a directed graph on 6 vertices with edge capacities, with the cut S drawn around vertices 1, 2, 4, and 5.]

Georgia Tech ECE 8823a Notes by J. Romberg. Last updated April 3, 2017.
The capacity of this cut, summing the capacities of the three edges that cross from S to S^c, is 7. In general, the capacity associated with a cut S is

    Σ_{i ∈ S, j ∈ S^c} C_{i,j}.

If we take the vector ν ∈ R^N as

    ν_i = { 1,  i ∈ S,
            0,  i ∈ S^c, }

then we can write the problem of finding the minimum cut as

    minimize_ν   Σ_{i=1}^N Σ_{j=1}^N C_{i,j} max(ν_i − ν_j, 0)
    subject to   ν_i ∈ {0, 1},  i = 1, ..., N,
                 ν_1 = 1,  ν_N = 0.

To make the functional linear, we introduce λ_{i,j}, and the minimum cut program can be rewritten as

    (MINCUT)   minimize_{Λ,ν}   Σ_{i=1}^N Σ_{j=1}^N λ_{i,j} C_{i,j}
               subject to       λ_{i,j} = max(ν_i − ν_j, 0),
                                ν_i ∈ {0, 1},
                                ν_1 = 1,  ν_N = 0,

where it is understood that the constraints should hold for all i, j = 1, ..., N. As it is stated, there are two things making this program nonconvex: we have non-affine equality constraints relating λ_{i,j} to ν_i and ν_j, and we have binary constraints on ν_i. If we simply drop the integer constraint, and relax λ_{i,j} = max(ν_i − ν_j, 0) to λ_{i,j} ≥ ν_i − ν_j and λ_{i,j} ≥ 0,
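As a sanity check, the functional Σ_{i,j} C_{i,j} max(ν_i − ν_j, 0) charges exactly the edges leaving S. Here is a minimal numpy sketch (the graph and capacities are made up for illustration, not those of the figure above) that evaluates it and brute-forces the minimum cut:

```python
import itertools

import numpy as np

# Capacity matrix of a small directed graph (values made up for
# illustration); vertex 0 is the source, vertex N-1 is the sink.
C = np.array([[0, 3, 2, 0],
              [0, 0, 0, 2],
              [0, 1, 0, 3],
              [0, 0, 0, 0]], dtype=float)
N = C.shape[0]

def cut_capacity(nu):
    # sum_{i,j} C_{i,j} max(nu_i - nu_j, 0): charges edges from S to S^c
    return np.sum(C * np.maximum(nu[:, None] - nu[None, :], 0))

# Brute force over all binary nu with nu_source = 1, nu_sink = 0
best = min(cut_capacity(np.array([1.0, *bits, 0.0]))
           for bits in itertools.product([0.0, 1.0], repeat=N - 2))
print(best)  # minimum cut capacity of this toy graph
```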
we are left with the linear program

    (LP-relax)   minimize_{Λ,ν}   ⟨Λ, C⟩
                 subject to       λ_{i,j} ≥ ν_i − ν_j,
                                  λ_{i,j} ≥ 0,
                                  ν_1 = 1,  ν_N = 0.

Note that the domain we are optimizing over in the LP relaxation is larger than the domain in the original formulation; this means that every valid cut (feasible Λ, ν for the original program) is feasible in the LP relaxation. So at the very least we know that

    LP-relax ≤ MINCUT.

But the semi-amazing thing is that the solutions to the two programs turn out to agree. We show this by establishing that for every solution of the relaxation, there is at least one cut with value less than or equal to LP-relax. We do this by generating a random cut (with the associated probabilities carefully chosen) and showing that in expectation, its capacity is at most LP-relax.

Let Z be a uniform random variable on [0, 1]. Let Λ*, ν* be solutions to (LP-relax). Create a cut S with the rule: if ν*_n > Z, then take n ∈ S.
The probability that a particular edge i → j is in this cut is

    P(i ∈ S, j ∈ S^c) = P(ν*_j ≤ Z < ν*_i)
                      ≤ { max(ν*_i − ν*_j, 0),   1 < i, j < N,
                          1 − ν*_j,              i = 1;  j = 2, ..., N,
                          ν*_i,                  i = 1, ..., N−1;  j = N,
                          1,                     i = 1;  j = N, }
                      ≤ λ*_{i,j},

where the last inequality follows simply from the constraints in (LP-relax) (recall that ν*_1 = 1 and ν*_N = 0). This cut is random, so its capacity is a random variable, and its expectation is

    E[capacity(S)] = Σ_{i,j} C_{i,j} P(i ∈ S, j ∈ S^c)
                   ≤ Σ_{i,j} C_{i,j} λ*_{i,j}
                   = LP-relax.

Thus there must be a cut whose capacity is at most LP-relax. This establishes that MINCUT ≤ LP-relax. Of course, combining this with the result above means that MINCUT = LP-relax. This is an example of a wonderful situation where convex relaxation costs us nothing, but makes solving the program computationally tractable.
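To see the relaxation in action, here is a sketch that sets (LP-relax) up for scipy.optimize.linprog (assuming SciPy is available) on a made-up four-vertex graph, and compares its optimal value to the brute-force integral minimum cut; the two coincide, as the argument above predicts:

```python
import itertools

import numpy as np
from scipy.optimize import linprog

# Toy capacity matrix (values made up); source is vertex 0, sink is N-1
C = np.array([[0, 3, 2, 0],
              [0, 0, 0, 2],
              [0, 1, 0, 3],
              [0, 0, 0, 0]], dtype=float)
N = C.shape[0]
edges = [(i, j) for i in range(N) for j in range(N) if C[i, j] > 0]
E = len(edges)

# Variable vector: [nu_0, ..., nu_{N-1}, lambda_e for each edge e]
cost = np.concatenate([np.zeros(N), np.array([C[i, j] for i, j in edges])])

# lambda_e >= nu_i - nu_j   <=>   nu_i - nu_j - lambda_e <= 0
A_ub = np.zeros((E, N + E))
for e, (i, j) in enumerate(edges):
    A_ub[e, i], A_ub[e, j], A_ub[e, N + e] = 1.0, -1.0, -1.0
b_ub = np.zeros(E)

# Pin the source and sink: nu_0 = 1, nu_{N-1} = 0
A_eq = np.zeros((2, N + E))
A_eq[0, 0], A_eq[1, N - 1] = 1.0, 1.0
b_eq = np.array([1.0, 0.0])

bounds = [(None, None)] * N + [(0, None)] * E   # lambda >= 0, nu free
res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)

# Brute-force integral minimum cut for comparison
def cut_capacity(nu):
    return np.sum(C * np.maximum(nu[:, None] - nu[None, :], 0))

mincut = min(cut_capacity(np.array([1.0, *bits, 0.0]))
             for bits in itertools.product([0.0, 1.0], repeat=N - 2))
print(res.fun, mincut)  # the LP optimum matches the integral minimum cut
```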
MAXCUT

A good resource for this section and the next is Ben-Tal and Nemirovski [BTN01]. This problem has a very similar setup to the MINCUT problem, but it is different in subtle ways. We are given a graph; this time the edges are undirected, and have positive weights A_{i,j} associated with them. Since the graph is undirected, A_{i,j} = A_{j,i}, and so A is symmetric. We will also assume that A_{i,i} = 0 for all i. As before, a cut partitions the vertices into two sets, S and S^c; these sets can be arbitrary; there is no notion of source and sink here.

[Figure: an undirected graph on 6 vertices with edge weights A_{i,j} and a cut S; two edges cross the cut.]

For example, the value of the cut shown is the sum of the weights of the two edges crossing it. The problem is to find the cut that maximizes the total weight of the edges going between the two partitions. We can specify a cut of the graph with a binary-valued vector x of length N, where each x_n ∈ {−1, +1}. If x_n = 1 when vertex n is in S and x_n = −1 when vertex n is in S^c, then the value of the cut is

    cut(S) = (1/4) Σ_{i=1}^N Σ_{j=1}^N A_{i,j} (1 − x_i x_j).
Note that if x_i ≠ x_j, then (1 − x_i x_j) = 2, while if x_i = x_j, then (1 − x_i x_j) = 0. The factor of 1/4 in front comes from the fact that (1 − x_i x_j) = 2 for edges in the cut, and that we are counting every edge twice (from i to j and again from j to i). Notice that we can write this value as a quadratic function of x:

    cut(S) = (1/4) ( 1^T A 1 − x^T A x ).

The MAXCUT problem is to find the cut with the largest value:

    (MAXCUT)   maximize_{x ∈ R^N}   (1/4) 1^T A 1 − (1/4) x^T A x
               subject to           x_i ∈ {−1, +1}.

Right now, this looks pretty gnarly, as A has no guarantee of being PSD, and we have integer constraints on x. We can address the first concern by rewriting this as a search for a matrix X = xx^T. As now x^T A x = ⟨X, A⟩, we have

    maximize_{X ∈ R^{N×N}}   (1/4) 1^T A 1 − (1/4) ⟨X, A⟩
    subject to                X ⪰ 0,
                              X_{i,i} = 1,  i = 1, ..., N,
                              rank(X) = 1.

You should be able to convince yourself that X is feasible above if and only if it can be written as X = xx^T, where the entries of x are ±1. The recast program looks like an SDP, except for the rank constraint. The relaxation, then, is to simply drop it and solve

    (MAXCUT-relax)   maximize_{X ∈ R^{N×N}}   (1/4) 1^T A 1 − (1/4) ⟨X, A⟩
                     subject to                X ⪰ 0,
                                               X_{i,i} = 1,  i = 1, ..., N.
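A quick numpy check (weights made up for illustration) that the quadratic expression for cut(S) matches direct edge counting, and that the lifted variable X = xx^T reproduces x^T A x as ⟨X, A⟩:

```python
import itertools

import numpy as np

# Symmetric weight matrix with zero diagonal (weights made up)
A = np.array([[0, 2, 1, 0],
              [2, 0, 3, 1],
              [1, 3, 0, 2],
              [0, 1, 2, 0]], dtype=float)
N = A.shape[0]
ones = np.ones(N)

def cut_value(x):
    # (1/4) (1^T A 1 - x^T A x)
    return 0.25 * (ones @ A @ ones - x @ A @ x)

def cut_value_direct(x):
    # count each crossing edge exactly once
    return sum(A[i, j] for i in range(N) for j in range(i + 1, N)
               if x[i] != x[j])

for bits in itertools.product([-1.0, 1.0], repeat=N):
    x = np.array(bits)
    X = np.outer(x, x)  # lifted variable: x^T A x = <X, A>
    assert np.isclose(x @ A @ x, np.sum(X * A))
    assert np.isclose(cut_value(x), cut_value_direct(x))

# Brute-force MAXCUT value of this toy graph
maxcut = max(cut_value(np.array(b))
             for b in itertools.product([-1.0, 1.0], repeat=N))
print(maxcut)
```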
As we are optimizing over a larger set, the optimal value of MAXCUT-relax will in general be larger than MAXCUT:

    MAXCUT-relax ≥ MAXCUT.

But there is a classic result [GW95] that shows it will not be too much larger:

    MAXCUT ≥ (0.87856) MAXCUT-relax.

The argument again relies on looking at the expected value of a random cut. Let X* be a solution to MAXCUT-relax. Since X* is PSD, it can be factored as

    X* = V^T V,

with v_j as the jth column of V; this means X*_{i,j} = ⟨v_i, v_j⟩. Since along the diagonal we have X*_{i,i} = 1, this means that ‖v_i‖_2 = 1 as well. We can associate one column v_i with each vertex in the original problem. To create the cut, we draw a vector z from the unit sphere (so ‖z‖_2 = 1) uniformly at random¹, and set

    S = {i : ⟨v_i, z⟩ ≥ 0}.

It should be clear that the probability that any fixed vertex is in S is 1/2. But what is the probability that vertex i and vertex j are on different sides? The probability of this is simply the ratio of the angle between v_i and v_j to π:

    P(i ∈ S, j ∈ S^c) + P(i ∈ S^c, j ∈ S) = arccos⟨v_i, v_j⟩ / π = arccos(X*_{i,j}) / π.

¹ In practice, you could do this by drawing each entry Normal(0, 1) independently, then normalizing.
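The arccos formula for the separation probability can be checked by simulation. Here is a small sketch (the angle and dimension are arbitrary) estimating, for two fixed unit vectors, how often a random hyperplane through the origin separates them; as in the footnote, z is drawn with Gaussian entries, and normalization is skipped since it does not change the signs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two fixed unit vectors at a known angle theta (chosen arbitrarily)
theta = 2.0
v_i = np.array([1.0, 0.0, 0.0])
v_j = np.array([np.cos(theta), np.sin(theta), 0.0])

# Monte Carlo: how often does the hyperplane {w : <w, z> = 0} put
# v_i and v_j on opposite sides?
trials = 200000
z = rng.standard_normal((trials, 3))
sep_freq = np.mean((z @ v_i >= 0) != (z @ v_j >= 0))
print(sep_freq, np.arccos(v_i @ v_j) / np.pi)  # both close to theta/pi
```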
Thus the expectation of the cut value is

    E[cut(S)] = (1/2) Σ_{i=1}^N Σ_{j=1}^N A_{i,j} ( P(i ∈ S, j ∈ S^c) + P(i ∈ S^c, j ∈ S) )
              = (1/2) Σ_{i=1}^N Σ_{j=1}^N A_{i,j} arccos(X*_{i,j}) / π.

There must be at least one cut that has a value greater than or equal to the mean, so we know that MAXCUT ≥ E[cut(S)]. Let's compare the terms in this sum to those in the objective function for MAXCUT-relax. We know that the entries in X* have at most unit magnitude², |X*_{i,j}| ≤ 1, and it is a fact that

    arccos(t) / π ≥ (0.87856) (1 − t) / 2,   for t ∈ [−1, 1].

[Figure: a little proof by MATLAB of this fact; a plot over t ∈ [−1, 1] in which the blue curve arccos(t)/π lies above the red curve (0.87856)(1 − t)/2.]

² This follows from X*_{i,j} = ⟨v_i, v_j⟩, ‖v_i‖_2 = 1, and Cauchy-Schwarz.
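The same proof-by-plot can be repeated numerically; here is a sketch checking the inequality on a fine grid, with α taken slightly below the Goemans-Williamson constant to leave room for floating-point error:

```python
import numpy as np

# Numerical check of arccos(t)/pi >= alpha * (1 - t)/2 on [-1, 1]
alpha = 0.878  # slightly below the Goemans-Williamson constant 0.87856
t = np.linspace(-1.0, 1.0, 100001)
gap = np.arccos(t) / np.pi - alpha * (1.0 - t) / 2.0
ok = bool(np.all(gap >= -1e-12))
print(ok)
```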
Thus

    MAXCUT ≥ E[cut(S)] = (1/2) Σ_{i=1}^N Σ_{j=1}^N A_{i,j} arccos(X*_{i,j}) / π
           ≥ (0.87856) (1/4) Σ_{i=1}^N Σ_{j=1}^N A_{i,j} (1 − X*_{i,j})
           = (0.87856) MAXCUT-relax.
Quadratic equality constraints

The integer constraint x_i ∈ {−1, +1} in the example above might also be interpreted as a quadratic equality constraint:

    x_i ∈ {−1, +1}   ⟺   x_i² = 1.

As we are well aware, quadratic (or any other nonlinear) equality constraints make the feasibility region nonconvex. We consider general nonconvex quadratic programs of the form

    minimize_{x ∈ R^N}   x^T A_0 x + ⟨x, b_0⟩ + c_0
    subject to           x^T A_m x + ⟨x, b_m⟩ + c_m = 0,   m = 1, ..., M,

where the A_m are symmetric, but not necessarily ⪰ 0. We will show how to recast these problems as optimization over the SDP cone with an additional (nonconvex) rank constraint. Then we will have a natural convex relaxation by dropping the rank constraint. This general methodology works for equality or (possibly nonconvex) inequality constraints, but for the sake of simplicity, we will just look at equality constraints.

We can turn a quadratic form into a trace inner product with a rank-1 matrix as follows. It is clear that

    x^T A x + b^T x + c = [x^T 1] F [x; 1] = trace(F X) = ⟨X, F⟩,

    where F = [[A, b/2], [b^T/2, c]]  and  X = [x; 1][x^T 1] = [[xx^T, x], [x^T, 1]].
This means we can write the nonconvex quadratic program as

    minimize_X   trace(F_0 X)
    subject to   trace(F_m X) = 0,   m = 1, ..., M.

With

    F_m = [[A_m, b_m/2], [b_m^T/2, c_m]],

we see that this program is equivalent to

    minimize_{X ∈ R^{(N+1)×(N+1)}}   ⟨X, F_0⟩
    subject to                       ⟨X, F_m⟩ = 0,   m = 1, ..., M,
                                     X ⪰ 0,
                                     rank(X) = 1.

Again, we can get a convex relaxation simply by dropping the rank constraint. How well this works depends on the particulars of the problem. There are certain situations where it is exact; one of these is when there is a single non-convex inequality constraint³. There are other situations where it is provably good; one example is MAXCUT above. There are other situations where it is arbitrarily bad.

³ This is very closely related to the dual derivations we did when analyzing robust least-squares.
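As a sanity check of the lifting step, one consistent convention puts b/2 in the off-diagonal blocks of F so that the two cross terms sum to exactly b^T x. The sketch below builds F and X this way and verifies the identity on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
A = rng.standard_normal((N, N))
A = (A + A.T) / 2          # symmetric, not necessarily PSD
b = rng.standard_normal(N)
c = rng.standard_normal()
x = rng.standard_normal(N)

# F carries b/2 off the diagonal so that [x^T 1] F [x; 1] has cross
# terms (b/2)^T x + x^T (b/2) = b^T x
F = np.block([[A, b[:, None] / 2],
              [b[None, :] / 2, np.array([[c]])]])
xt = np.append(x, 1.0)
X = np.outer(xt, xt)       # X = [x; 1][x^T 1], PSD and rank one

lhs = x @ A @ x + b @ x + c
rhs = np.trace(F @ X)      # = <X, F>
err = abs(lhs - rhs)
print(err)  # should be at machine-precision level
```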
Example: Phase retrieval

In coherent imaging applications, a not uncommon problem is to reconstruct an unknown vector x from measurements of the magnitude of a series of linear functionals. We observe

    y_m = |⟨x, a_m⟩|²  (+ noise),   m = 1, ..., M.

For instance, if the a_m are Fourier vectors, we are observing samples of the magnitude of the Fourier transform of x. If we also measured the phase, then recovering x would be a standard linear inverse problem (and if we had a complete set of samples in the Fourier domain, you could just take an inverse Fourier transform). But since we do not get to see the phase, we have to estimate it along with the underlying x; this problem is often referred to as phase retrieval.

We can rewrite the measurements as

    y_m = ⟨a_m, x⟩⟨x, a_m⟩ = x^H a_m a_m^H x = trace(a_m a_m^H x x^H) = ⟨X, A_m⟩_F,

where A_m = a_m a_m^H and X = xx^H. So solving the phase retrieval problem is the same as finding an N × N matrix X with the following properties:

    ⟨X, A_m⟩_F = y_m,   m = 1, ..., M,
    X ⪰ 0,
    rank(X) = 1.

The first condition just says that X obeys a certain set of linear equality constraints; the second, that X is in the SDP cone; the third is a nonconvex constraint. One convex relaxation for this problem simply drops the rank constraint and finds a feasible point that obeys the first two conditions. Under certain conditions on the a_m, there will only be one point in this intersection once M is mildly larger than N.
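A quick numerical check of the identity y_m = ⟨X, A_m⟩_F on random complex data (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 6, 12
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
a = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))

# Noiseless magnitude-only measurements y_m = |<x, a_m>|^2 = |a_m^H x|^2
y = np.abs(a.conj() @ x) ** 2

# Lifted variable X = x x^H; each A_m = a_m a_m^H gives a linear functional
X = np.outer(x, x.conj())
err = max(abs(np.real(np.trace(np.outer(a[m], a[m].conj()) @ X)) - y[m])
          for m in range(M))
print(err)  # the linear functionals <X, A_m>_F reproduce the y_m
```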
Example: Maximum likelihood MIMO decoding

In MIMO (multiple input, multiple output) digital communications, we have N transmit antennas and M receive antennas. Each transmit antenna sends a bit (±1); these bits are collected together into an N-vector x. The receive antennas observe

    y = Hx + v,

where H is a known M × N fading matrix (or channel matrix), and v ~ Normal(0, σ² I).

1. The maximum likelihood decoder finds the binary-valued vector x that makes the observations y most likely:

       x̂_ML = arg max_{x ∈ {−1,+1}^N} p(y | x).

   Show how this can be written as a least-squares problem with integer constraints.

2. Show how the optimization program above can be relaxed into a semidefinite program.
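As a numerical preview of part 1 (the dimensions, channel, and noise level below are all invented): with Gaussian noise, p(y|x) is proportional to exp(−‖y − Hx‖²/(2σ²)), so maximizing the likelihood over ±1 vectors amounts to minimizing the least-squares residual, which for small N can be brute-forced:

```python
import itertools

import numpy as np

rng = np.random.default_rng(3)
N, M, sigma = 4, 6, 0.5
H = rng.standard_normal((M, N))          # known channel matrix
x_true = rng.choice([-1.0, 1.0], size=N)
y = H @ x_true + sigma * rng.standard_normal(M)

# p(y|x) is proportional to exp(-||y - Hx||^2 / (2 sigma^2)), so the ML
# decoder minimizes ||y - Hx||_2^2 over x in {-1, +1}^N
candidates = [np.array(b) for b in itertools.product([-1.0, 1.0], repeat=N)]
x_ml = min(candidates, key=lambda x: np.sum((y - H @ x) ** 2))

# Sanity check: the ML residual is no worse than the true bits' residual
print(np.sum((y - H @ x_ml) ** 2) <= np.sum((y - H @ x_true) ** 2))
```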
References

[BTN01] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization. SIAM, 2001.

[GW95] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. Assoc. Comp. Mach., 42(6):1115-1145, November 1995.