Semidefinite and Second Order Cone Programming Seminar    Fall 2001
Lecture 5
Instructor: Farid Alizadeh    Scribe: Anton Riabov    10/08/2001

1 Overview

We continue studying the maximum eigenvalue SDP and generalize it to an affine set of matrices. Next, we define the Lovász theta function and prove Lovász's sandwich theorem. Finally, we introduce a further extension of the eigenvalue SDP: finding the minimum sum of the k largest eigenvalues over an affine set of matrices.

2 Finding the Largest Eigenvalue (Continued)

In the previous lecture we introduced the sets

    S := {X : tr X = 1, X ⪰ 0},    E := {qq^T : ‖q‖_2 = 1}.

We showed that E ⊆ S and that the members of E are extreme points of S. For a fixed symmetric matrix A we defined the largest eigenvalue SDP as

    λ_1(A) = min z    s.t.  zI ⪰ A.

The dual of this problem is

    λ_1(A) = max A • Y    s.t.  tr Y = 1,  Y ⪰ 0.
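As a quick numerical sanity check (my own sketch with numpy, not part of the original notes): for a random symmetric A, the dual feasible point Y = qq^T built from the top eigenvector attains λ_1(A), while any other unit vector gives a smaller objective value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                       # a random symmetric matrix

lam1 = np.linalg.eigvalsh(A)[-1]        # largest eigenvalue (eigvalsh sorts ascending)

# Y = qq^T for the top eigenvector q is dual feasible (tr Y = 1, Y >= 0)
# and attains the optimum A . Y = lambda_1(A).
q = np.linalg.eigh(A)[1][:, -1]
Y = np.outer(q, q)
print(np.isclose(np.trace(A @ Y), lam1))            # True

# Any other dual feasible Y = uu^T with ||u|| = 1 gives A . Y <= lambda_1(A).
u = rng.standard_normal(n)
u /= np.linalg.norm(u)
print(np.trace(A @ np.outer(u, u)) <= lam1 + 1e-12)  # True
```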
Note that S is exactly the feasible set of the dual. As we know, the optimal value is attained at an extreme point of this set, so some optimal solution Y of the dual lies in E. Thus there exists q with ‖q‖_2 = 1 such that Y = qq^T, and we can write

    λ_1(A) = max_Y A • Y = max_{‖q‖_2 = 1} qq^T • A = max_{‖q‖_2 = 1} q^T A q.

Thus,

    λ_1(A) = max_{‖q‖_2 = 1} q^T A q,    (1)

which is a well-known result from linear algebra, here proven using semidefinite programming duality.

Note 1 Similarly, the smallest eigenvalue λ_n(A) can be expressed as

    λ_n(A) = min_{‖q‖_2 = 1} q^T A q.

3 Lovász Theta Function

In this section we describe extensions of the largest eigenvalue problem. Interested students are referred to the classic book Geometric Algorithms and Combinatorial Optimization by M. Grötschel, L. Lovász and A. Schrijver, Chapter 9.

3.1 Optimizing the Largest Eigenvalue Over an Affine Set

Suppose now that we are solving the same largest eigenvalue problem, but the matrix A is not fixed, and we are looking for a way to minimize the largest eigenvalue over a set of matrices. For example, consider the affine combination of symmetric matrices

    A := A_0 + x_1 A_1 + ... + x_m A_m.

The problem in this case translates into the unconstrained optimization problem

    min_{x ∈ R^m} λ_1(A_0 + x_1 A_1 + ... + x_m A_m).

This function is not linear. Furthermore, it is not smooth, which makes optimization a difficult task. However, we will prove that this function is convex, and therefore the task of finding its minimum should be tractable.

Proposition 1 The function f(x) := λ_1(A_0 + x_1 A_1 + ... + x_m A_m) is convex.

Proof: We will prove that λ_1(A + B) ≤ λ_1(A) + λ_1(B), and the result will follow, since A((1 − t)x + ty) = (1 − t)A(x) + tA(y) and λ_1(cM) = c λ_1(M) for any c ≥ 0. Using (1), there exists q with ‖q‖_2 = 1 such that

    λ_1(A + B) = q^T (A + B) q = q^T A q + q^T B q.
The vector q is feasible, but not necessarily optimal, when (1) is applied to A and B separately. Therefore

    q^T A q + q^T B q ≤ λ_1(A) + λ_1(B),

and the result follows.

We can rewrite the problem as an equivalent semidefinite programming problem:

    min z    s.t.  zI − x_1 A_1 − ... − x_m A_m ⪰ A_0.

The dual of this problem is:

    max A_0 • Y    s.t.  tr Y = 1,  A_i • Y = 0 (i = 1, ..., m),  Y ⪰ 0.

People have studied this problem from the point of view of unconstrained optimization, and formulated that way it turns out to be very difficult: the graph of the function is not smooth, and Newton's method cannot be applied directly, since the derivative does not always exist. Semidefinite programs, in turn, can be solved efficiently to any given precision. We will describe algorithms for solving SDPs later in the course.

3.2 Maximum Clique, Stable Set and Graph Coloring Problems

Assume G = (V, E) is an undirected graph with no loops attached to vertices, i.e. edges (i, i) are not allowed in G.

Definition 1 Maximum clique problem: find a maximum fully connected subgraph of G (a clique). That is, find a subgraph G_0 = (V_0, E_0), V_0 ⊆ V, E_0 ⊆ E, such that for all i, j ∈ V_0 with i ≠ j we have (i, j) ∈ E_0, and |V_0| is maximal.

The maximum clique problem is NP-complete. In fact, it was one of the original problems shown to be NP-complete.

Example 1 (Maximum Clique) Consider the graph in Figure 1. Its maximal cliques with more than 2 vertices are {2,3,4,5} and {1,2,5}. The former is the maximum clique.

Definition 2 Maximum independent (or stable) set problem: find a maximum set of vertices no two of which are connected by an edge.
Figure 1: An example of an undirected graph.

The maximum independent set problem is in some sense a dual of the maximum clique problem: the maximum independent set of a graph is the maximum clique of its complement.

Definition 3 The complement graph Ḡ = (V, Ē) of a graph G = (V, E) is the graph on the same set of vertices satisfying the following property: every pair of distinct vertices that is connected in the original graph is not connected in the complement, and vice versa.

Example 2 (Stable Set) For the graph in Figure 1, the optimal solutions are {1,4} and {1,3}.

Figure 2: Complement of the graph in Figure 1.

Example 3 (Complement Graph) Figure 2 shows the complement graph for the graph shown in Figure 1.

Note 2 G is the complement of Ḡ.

Note 3 Every solution of the maximum stable set problem on a graph G is a solution of the maximum clique problem on the graph Ḡ, and vice versa.

Definition 4 Minimum graph coloring problem: assign colors to vertices so that every pair of adjacent vertices has different colors, using the smallest possible total number of colors. The number of colors used in an optimal solution is referred to as the chromatic number of the graph.
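On graphs as small as the running example, all three problems can be solved by brute force. Below is a sketch of mine (not from the notes); the vertex and edge lists are read off Figure 1, and the helper names are my own.

```python
from itertools import combinations, product

V = [1, 2, 3, 4, 5]
# Edge list of the Figure 1 graph (its cliques are {2,3,4,5} and {1,2,5}).
E = {(1, 2), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)}

def is_edge(i, j):
    return (min(i, j), max(i, j)) in E

def clique_number():
    # largest vertex subset in which every pair is connected
    return max(len(S) for n in range(1, len(V) + 1)
               for S in combinations(V, n)
               if all(is_edge(i, j) for i, j in combinations(S, 2)))

def stability_number():
    # largest vertex subset with no edges inside
    return max(len(S) for n in range(1, len(V) + 1)
               for S in combinations(V, n)
               if not any(is_edge(i, j) for i, j in combinations(S, 2)))

def chromatic_number():
    # smallest number of colors admitting a proper coloring
    for c in range(1, len(V) + 1):
        for assignment in product(range(c), repeat=len(V)):
            color = dict(zip(V, assignment))
            if all(color[i] != color[j] for i, j in E):
                return c

print(clique_number(), stability_number(), chromatic_number())  # 4 2 4
```

Note that here the clique number and chromatic number coincide (both are 4), consistent with the inequality proved below.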
Finding the chromatic number of a graph is also NP-complete. In fact, it can be proven that no ε-approximation algorithm exists for this problem: for any given ε ≥ 1, unless P = NP, there is no polynomial algorithm that for every graph G finds a number guaranteed to be at most ε·χ(G).

We will use the following notation for the optimal values of these problems:

    ω(G) - size of the maximum clique in graph G;
    α(G) - size of the largest stable set in graph G;
    χ(G) - chromatic number of graph G.

Proposition 2 ω(G) ≤ χ(G).

Proof: The vertices of each clique must have different colors.

Example 4 (Strict Inequality) However, equality does not hold all the time. The example of a cycle on 5 vertices shows that in some cases the inequality above is strict.

Figure 3: Odd cycle example.

It is easy to see that the graph shown in Figure 3 cannot be colored with fewer than 3 colors, so χ(G) = 3. However, the largest clique in this graph has only 2 vertices, so ω(G) = 2 < 3 = χ(G). The same holds for any odd cycle of length at least 5.

Proposition 3 If a symmetric matrix A ∈ R^{n×n} and its submatrix B ∈ R^{k×k} are such that

    A = [ B    C
          C^T  D ],

then λ_1(B) ≤ λ_1(A).

Proof: From (1) it follows that there exists q ∈ R^k with ‖q‖_2 = 1 such that

    λ_1(B) = q^T B q.
Recall that the vector q at which the maximum in (1) is achieved is the eigenvector corresponding to the eigenvalue λ_1(B). Further,

    λ_1(B) = q^T B q = (q^T  0) [ B    C ] (q)
                                [ C^T  D ] (0)  ≤ λ_1(A),

where the inequality follows from the fact that the vector (q^T 0)^T is feasible, but not necessarily optimal, for the entire matrix A in terms of (1).

Corollary 1 If A ∈ R^{n×n} is a symmetric matrix and its submatrix B ∈ R^{k×k} is composed of rows and columns corresponding to the same index subset, i.e. for each column i of A included in B, the row i of A is included in B as well, and vice versa, then λ_1(B) ≤ λ_1(A).

Proof: This corollary states that the submatrix B does not have to sit in the corner of A; it can be distributed over A, as long as corresponding rows and columns participate. We make use of permutation matrices. The matrix A can be transformed into the structure required by Proposition 3 via multiplication by permutation matrices. A permutation matrix is a square matrix of 0 and 1 entries having exactly one nonzero in each row and in each column. If the permutation matrix P rearranges the columns of A when multiplying on the right, then multiplying by P^T on the left rearranges the corresponding rows. Therefore there exists a permutation matrix P such that P^T A P has the required structure. It is known from linear algebra that permutation matrices are orthogonal, P^T = P^{−1}. Therefore the permutation does not change eigenvalues, and P^T A P has the same eigenvalues as A. Thus,

    λ_1(B) ≤ λ_1(P^T A P) = λ_1(A),

where the inequality follows from Proposition 3.

Let us now apply this result to the adjacency matrix A of an arbitrary graph G. The adjacency matrix is defined as A = (a_ij), where a_ij = 1 if there exists an edge (i, j), and a_ij = 0 otherwise. We will illustrate the procedure on the example graph in Figure 1.
For this graph, the adjacency matrix A is

    A = [ 0 1 0 0 1
          1 0 1 1 1
          0 1 0 1 1
          0 1 1 0 1
          1 1 1 1 0 ]

and

    A + I = [ 1 1 0 0 1
              1 1 1 1 1
              0 1 1 1 1
              0 1 1 1 1
              1 1 1 1 1 ].

It is easy to see that cliques in the graph correspond to all-ones submatrices of A + I composed of the corresponding rows and columns. Now we can apply Corollary 1. Thus, for each such clique submatrix J_k = 11^T of size k × k,

    λ_1(J_k) ≤ λ_1(A + I).    (2)
The eigenvalues of J_k are easy to compute. It is a rank-1 matrix, so it has only one nonzero eigenvalue; it is easy to see that the vector 1 is the corresponding eigenvector, and the eigenvalue itself equals k. Thus λ_1(J_k) = k. Substituting into (2),

    k ≤ λ_1(A + I).

Since this holds for every clique size k, it also holds for the maximum clique size:

    ω(G) ≤ λ_1(A + I).    (3)

3.3 Lovász Theta Function

Let us see whether inequality (3) can be made tighter, so that we obtain a better estimate of ω(G). Note that Corollary 1 can still be applied in the same way, and (3) still holds, even if the zeros in A + I are replaced with arbitrary values, as long as the matrix remains symmetric. The reason is that there are no zeros inside any clique submatrix J_k, so the part corresponding to B in Corollary 1 remains unchanged. Formally, let

    [A(x)]_ij = { 1,               if (i, j) ∈ E or i = j;
                { x_ij = x_ji,     otherwise.

Now we can write a stronger version of inequality (3):

    ω(G) ≤ min_x λ_1(A(x)).    (4)

Definition 5 The right-hand side of (4) is referred to as the Lovász θ-function of a graph. It is denoted

    θ(G) = min_x λ_1(A(x)).

Note that A(x) defined above can be written as A(x) = A_0 + x_1 A_1 + x_2 A_2 + ... + x_m A_m. The example below illustrates how this can be done.

Example 5 (Rewriting A(x) as a sum)

    [ 1 x y
      x 1 1
      y 1 1 ] = x(E_12 + E_21) + y(E_13 + E_31) + (A + I),

where E_ij denotes the matrix with a 1 in position (i, j) and zeros elsewhere.

We know that ω(G) ≤ χ(G). Now we can formulate the following theorem, which makes a stronger statement. It is also known as Lovász's sandwich theorem.

Theorem 1 ω(G) ≤ θ(G) ≤ χ(G).
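The sandwich can be illustrated numerically on the 5-cycle, where ω = 2, χ = 3, and the value of θ is known to be √5 ≈ 2.236. The sketch below is my own (numpy and scipy assumed); a general-purpose minimizer over the free entries of A(x) stands in for a proper SDP solver, so it only approximates θ.

```python
import numpy as np
from scipy.optimize import minimize

# C5, the 5-cycle: omega = 2, chi = 3, theta(C5) = sqrt(5) ~ 2.236 (known value).
n = 5
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

# Bound (3): omega(G) <= lambda_1(A + I); for C5 the right-hand side equals 3.
lam = np.linalg.eigvalsh(A + np.eye(n))[-1]

# Bound (4): minimize lambda_1(A(x)) over the free (non-edge, off-diagonal) entries.
nonedges = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i, j] == 0]

def lam1(x):
    M = A + np.eye(n)
    for (i, j), v in zip(nonedges, x):
        M[i, j] = M[j, i] = v
    return np.linalg.eigvalsh(M)[-1]

res = minimize(lam1, np.zeros(len(nonedges)), method="Nelder-Mead")
# res.fun approximates theta(C5) = sqrt(5), sandwiched between omega = 2 and chi = 3.
print(lam, res.fun)
```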
Before we prove this theorem, let us make several observations regarding graph sums.

Definition 6 Given graphs G_1 = (V_1, E_1) and G_2 = (V_2, E_2) on disjoint vertex sets, the sum of graphs G_1 ⊕ G_2 is the graph formed by putting G_1 and G_2 side by side. In other words, G_1 ⊕ G_2 = (V_1 ∪ V_2, E_1 ∪ E_2).

The following propositions are straightforward to prove, so the proofs are not given.

Proposition 4 ω(G_1 ⊕ G_2) = max{ω(G_1), ω(G_2)}.

Proposition 5 χ(G_1 ⊕ G_2) = max{χ(G_1), χ(G_2)}.

Although, given the two propositions above, the following proposition may be anticipated, its proof is not as obvious.

Proposition 6 θ(G_1 ⊕ G_2) = max{θ(G_1), θ(G_2)}.

Proof: Write an SDP for finding the value of θ(G):

    θ(G) = min z    s.t.  zI ⪰ X,  x_ij = 1 if i = j or (i, j) ∈ E

(the remaining entries of the symmetric matrix X are free). The dual of this problem is:

    max (A + I) • Y = Σ_i y_ii + 2 Σ_{(i,j)∈E, i<j} y_ij
    s.t.  tr Y = 1,
          y_ij = 0 for (i, j) ∉ E, i ≠ j,
          Y ⪰ 0.

Now let us write the dual for the graph sum G_1 ⊕ G_2. Since there are no edges between the two graphs, the dual variable has the block structure

    Y = [ Y_1  0
          0    Y_2 ].

Thus, the dual program is

    max (A_1 + I) • Y_1 + (A_2 + I) • Y_2
    s.t.  tr Y_1 + tr Y_2 = 1    (5)
          y_ij = 0 for (i, j) ∉ E,
          Y_1, Y_2 ⪰ 0.
The formulation above resembles the original dual formulation. It would separate into two independent problems (for graphs G_1 and G_2), proving the proposition immediately, were it not for constraint (5), which couples Y_1 and Y_2. Note that if we replace Y by αY in the dual problem for some α > 0, the objective is simply scaled by α. Now let us choose α with 0 ≤ α ≤ 1, replace (5) by the two constraints

    tr Y_1 = α,    tr Y_2 = 1 − α,

and solve the two resulting independent dual problems. Their optimal values are, as we know, αθ(G_1) and (1 − α)θ(G_2) respectively. The optimal value of the dual problem for G_1 ⊕ G_2 is then

    max_{0 ≤ α ≤ 1} [ αθ(G_1) + (1 − α)θ(G_2) ] = max{θ(G_1), θ(G_2)},

where the equality follows from the linearity of the expression in α.

Now we are ready to prove Theorem 1.

Proof: Denote k = χ(G). The nodes of the graph G can be partitioned into k classes according to their color. We can make all classes contain the same number of nodes, r, by introducing additional singleton nodes that are not connected to any node in the graph and can therefore receive any color. This procedure does not change θ(G), by Proposition 6, since θ of a single vertex is 1. Denote the new graph G_1. Then θ(G) = θ(G_1) and |V(G_1)| = kr. Consider the matrix A(α) ∈ R^{kr×kr},

    A(α) := [ αJ_r + (1−α)I_r   J_r               ...   J_r
              J_r               αJ_r + (1−α)I_r   ...   J_r
              ...                                       ...
              J_r               J_r               ...   αJ_r + (1−α)I_r ].

Note that αJ_r + (1−α)I_r is the symmetric r × r matrix with ones on the diagonal and α everywhere else:

    αJ_r + (1−α)I_r = [ 1 α ... α
                        α 1 ... α
                        ...
                        α α ... 1 ].

J_r, as before, is the r × r matrix of all ones. It is easy to see that A(α) is a special case of the matrix A(x) for the graph G_1. Vertices within a color class are not connected, and if we rename the vertices so that the classes follow one another, we will have zero matrices of size r × r
on the diagonal of the adjacency matrix of G_1. Now we can replace all zeros with variables, and we choose to replace some of them with α, which preserves symmetry. Since A(α) is a special case of A(x), by the definition of θ(G_1),

    θ(G_1) ≤ λ_1(A(α)).

Note that we can rewrite A(α) as

    A(α) = J_kr − (1 − α)(J_r ⊕ J_r ⊕ ... ⊕ J_r) + (1 − α)I,    (6)

where J_r ⊕ ... ⊕ J_r denotes the block-diagonal matrix with k copies of J_r. Now we would like to find λ_1(A(α)) as a function of α in closed form, and choose α such that λ_1(A(α)) ≤ k = χ(G_1) = χ(G). Note that α does not have to be positive.

The eigenvalues of a sum of matrices are sums of eigenvalues of the terms when the matrices commute. It is known from linear algebra that for matrices A and B, AB = BA if and only if A and B share a common system of eigenvectors. For symmetric matrices this is almost obvious from the eigenvalue decomposition: if A = Q^T D Q and B = Q^T C Q with the same orthogonal Q, then A + B = Q^T (D + C) Q.

We claim that J_kr, (J_r ⊕ ... ⊕ J_r) and I commute. For I this is easy, since it commutes with every matrix. For J_kr, the vector of all ones, together with any orthogonal basis of its orthogonal complement, can be taken as a system of eigenvectors. The vector of all ones is also an eigenvector of (J_r ⊕ ... ⊕ J_r); its remaining eigenvectors can be chosen orthogonal to 1, and hence they are also eigenvectors of J_kr. This proves the claim.

In a common eigenbasis, the eigenvalues of the three terms of (6) combine as follows:

    - on the all-ones vector (1 eigenvalue):  kr − (1 − α)r + (1 − α);
    - on vectors constant on each block and orthogonal to 1 (k − 1 eigenvalues):  −(1 − α)r + (1 − α);
    - on vectors summing to zero within each block (kr − k eigenvalues):  1 − α.

Thus, in order to complete the proof, we need to show that there exists an α satisfying:

    kr − (1 − α)r + (1 − α) ≤ k,
    −(1 − α)r + (1 − α) ≤ k,
    1 − α ≤ k.
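The eigenvalue bookkeeping above can be checked numerically for small k and r. The sketch below is mine (numpy assumed): it assembles A(α) from its blocks and compares the spectrum with the three predicted families.

```python
import numpy as np

def A_alpha(k, r, alpha):
    """Block matrix: diagonal blocks alpha*J_r + (1-alpha)*I_r, off-diagonal J_r."""
    Jr = np.ones((r, r))
    diag = alpha * Jr + (1 - alpha) * np.eye(r)
    return np.block([[diag if i == j else Jr for j in range(k)] for i in range(k)])

k, r, alpha = 3, 2, -0.5
spec = np.linalg.eigvalsh(A_alpha(k, r, alpha))

# The three families: on 1; on block-constant sum-zero vectors; within-block contrasts.
predicted = sorted([k*r - (1 - alpha)*r + (1 - alpha)]
                   + [-(1 - alpha)*r + (1 - alpha)] * (k - 1)
                   + [1 - alpha] * (k*(r - 1)))
print(np.allclose(spec, predicted))   # True

# For the choice alpha = 1 - k all three families are <= k, and lambda_1(A(alpha)) = k.
top = np.linalg.eigvalsh(A_alpha(k, r, 1 - k))[-1]
print(np.isclose(top, k))             # True
```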
Solving the first inequality for α, we obtain α ≤ 1 − k (note that dividing by 1 − r < 0 flips the inequality). Solving the last one, we get α ≥ 1 − k. Setting α = 1 − k satisfies the first and third with equality, and the second becomes k(1 − r) ≤ k, which also holds. Hence θ(G) = θ(G_1) ≤ λ_1(A(1 − k)) = k = χ(G), completing the proof.

As we have just shown, this theorem proves a bound on a very hard integer program (IP) using results obtained through semidefinite programming (SDP) analysis. Later we will show that any IP can be relaxed to an SDP in a similar way. SDP relaxations are typically tighter than relaxations to linear programs, and we will give a general framework for creating such relaxations.

4 Computing Sums of Eigenvalues

Consider a further generalization of the optimization problem we have been working with. As before, define A(x) = A_0 + x_1 A_1 + x_2 A_2 + ... + x_m A_m. But now we want the sum of a given number of largest eigenvalues, instead of just the largest eigenvalue:

    f(x_1, x_2, ..., x_m) = (λ_1 + λ_2 + ... + λ_k)(A_0 + x_1 A_1 + x_2 A_2 + ... + x_m A_m).

And, as in the previous sections, we would like to find the unconstrained minimum:

    min_x f(x_1, x_2, ..., x_m).

We can write an SDP for this problem, but the straightforward approach (writing a constraint for every possible sum of k eigenvalues) results in an exponential number of constraints. A more sophisticated approach is needed. We will first illustrate the ideas we are going to employ on a small linear program.

Example 6 (Linear Programming) Consider the following linear program (LP):

    max w_1 x_1 + ... + w_n x_n
    s.t. x_1 + ... + x_n = k,    (7)
         0 ≤ x_i ≤ 1.

Suppose k is an integer. This program finds the k largest values of w and sets the corresponding entries of x to 1. All extreme points of the feasible set of this LP are integral. This can be proven, for example, using complementary slackness. Let us write the dual of this problem:

    min kz + Σ_i y_i
    s.t. z + y_i ≥ w_i,
         y_i ≥ 0.

Now, using complementary slackness, we can show that a fractional solution is either non-optimal or not an extreme point.
Another way to show that all the extreme points of the LP are integral is to notice that the constraint matrix of this problem is totally unimodular (all determinants of its square submatrices are 0, 1, or −1). Either way, it can be shown that the extreme points of this polytope are 0-1 vectors having exactly k ones. Introducing just one constraint on the sum (equation (7) in the LP) greatly restricts the degrees of freedom of x.
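The LP of Example 6 can be checked with an off-the-shelf solver. A sketch of mine (scipy's linprog assumed; the weights w and the value of k are made up for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# LP (7): maximize w.x  subject to  sum(x) = k,  0 <= x_i <= 1.
w = np.array([5.0, 1.0, 3.0, 2.0, 4.0])   # made-up weights
k = 2

res = linprog(c=-w,                        # linprog minimizes, so negate w
              A_eq=np.ones((1, len(w))), b_eq=[float(k)],
              bounds=[(0, 1)] * len(w))

# The optimum is the 0-1 indicator of the k largest weights: x = (1, 0, 0, 0, 1),
# with objective value 5 + 4 = 9.
print(res.x, -res.fun)
```

Since the weights are distinct, the optimum is a unique vertex of the polytope, so the solver returns an integral solution, as the total unimodularity argument predicts.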