Chapter 3. Some Applications


Having developed the basic theory of cone programming, it is time to apply it to our actual subject, namely semidefinite programming. Indeed, any semidefinite program

    Maximize    Tr(C^T X)
    subject to  A(X) = b
                X ⪰ 0

assumes the form of a cone program

    Maximize    ⟨c, x⟩
    subject to  b − A(x) ∈ L
                x ∈ K,

where K = {X ∈ S_n : X ⪰ 0} ⊆ V = S_n, the space of symmetric n×n matrices, equipped with the scalar product ⟨X, Y⟩ = Tr(X^T Y). (At this point, a sceptical reader might want to convince him- or herself that this is indeed a scalar product.) We have shown in Lemma 2.1.2 that K is a closed convex cone. Furthermore, L = {0} ⊆ W = R^m in our application.

3.1 The Cone of Positive Semidefinite Matrices

Let us from now on use the notation

    S_n^+ := {X ∈ S_n : X ⪰ 0}

for the cone of positive semidefinite n×n matrices. This notation is deliberately chosen in analogy with the notation R^n_+ for the cone {x ∈ R^n : x ≥ 0}, the nonnegative orthant in R^n. In fact, there are many similarities between the two cones R^n_+ and S_n^+, some of which we will get to next.
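These objects are easy to experiment with. The following sketch is not part of the original text; it assumes NumPy is available and merely checks membership in S_n^+ via eigenvalues and evaluates the scalar product Tr(X^T Y):

```python
import numpy as np

def is_psd(X, tol=1e-9):
    """Membership test for S_n^+: symmetric with nonnegative eigenvalues."""
    return np.allclose(X, X.T) and np.linalg.eigvalsh(X).min() >= -tol

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
X = A @ A.T                                           # A A^T is always positive semidefinite
Y = rng.standard_normal((4, 4)); Y = (Y + Y.T) / 2    # symmetric, typically indefinite

print(is_psd(X), is_psd(Y))                           # True, (usually) False
# the scalar product <X, Y> = Tr(X^T Y) on S_n is just the entrywise sum of products
print(np.isclose(np.trace(X.T @ Y), np.sum(X * Y)))   # True
```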

3.1.1 Self-Duality

We start by showing that the cone S_n^+ is self-dual, just like R^n_+. This statement is also known as Fejér's Trace Theorem.

3.1.1 Lemma. (S_n^+)* = S_n^+.

Proof. We first prove an auxiliary

Claim. Let M, M′ ∈ S_n^+. Then Tr(M^T M′) ≥ 0.

It is instructive to consider the following non-proof first: if M, M′ ∈ S_n^+, then also M^T M′ ∈ S_n^+. Therefore all diagonal elements of M^T M′, and thus also the trace, are nonnegative. What makes this a non-proof is the observation that the product of positive semidefinite matrices need not even be symmetric, let alone an element of S_n^+.

Here is the real proof: we diagonalize M and write it in the form M = SDS^T, where S is an orthogonal matrix (i.e., S^{−1} = S^T) and D is a diagonal matrix with the nonnegative eigenvalues λ_1,...,λ_n of M on the diagonal. Let s_i denote the i-th column of S. Using Tr(AB) = Tr(BA), we now compute

    Tr(M^T M′) = Tr(S D S^T M′) = Tr(D S^T M′ S) = Σ_{i=1}^n λ_i s_i^T M′ s_i ≥ 0,

since s_i^T M′ s_i ≥ 0 by M′ ∈ S_n^+, see Fact 1.4.1(ii).

Using this claim, S_n^+ ⊆ (S_n^+)* follows. The other direction is even simpler. Choose M ∈ (S_n^+)*. For all x ∈ R^n, the matrix xx^T is positive semidefinite, so M ∈ (S_n^+)* implies

    0 ≤ Tr(M^T xx^T) = x^T M x   for all x ∈ R^n,

meaning that M ∈ S_n^+.

3.1.2 Generating Matrices

Every vector in R^n is a linear combination of the n unit vectors e_i, i = 1,...,n, and R^n_+ can be characterized as the set of nonnegative linear combinations. In that sense, the n unit vectors generate both R^n and R^n_+. Similarly, and as we show next, every matrix in S_n is a linear combination of n matrices of the form ss^T with ||s|| = 1, and S_n^+ can be characterized as the set of nonnegative linear combinations. In that sense, the matrices ss^T generate both S_n and S_n^+. The essential difference to the case of R^n and R^n_+ is that there is no finite set of generators.
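Both the diagonalization used in the proof above and the generator viewpoint just described are easy to see numerically. The sketch below (illustrative only; NumPy assumed) decomposes a random symmetric matrix as Σ λ_i s_i s_i^T and checks the claim Tr(M^T M′) ≥ 0 for two positive semidefinite matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n)); M = (M + M.T) / 2   # a random symmetric matrix

lam, S = np.linalg.eigh(M)                           # M = S diag(lam) S^T, S orthogonal
recon = sum(lam[i] * np.outer(S[:, i], S[:, i]) for i in range(n))
print(np.allclose(M, recon))                         # M = sum_i lam_i s_i s_i^T

A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
M1, M2 = A @ A.T, B @ B.T                            # two positive semidefinite matrices
print(np.trace(M1.T @ M2) >= 0)                      # the claim from the proof: True
```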

3.1.2 Lemma. Let M be an n×n matrix. We have M ∈ S_n (M ∈ S_n^+, respectively) if and only if there are unit-length vectors s_1,...,s_n ∈ S^{n−1} and real numbers (nonnegative real numbers, respectively) λ_1,...,λ_n such that

    M = Σ_{i=1}^n λ_i s_i s_i^T.

Proof. The "if" directions are clear: all the s_i s_i^T are in S_n^+ ⊆ S_n, so every linear combination is in S_n. Since S_n^+ is a convex cone, every nonnegative linear combination is again in S_n^+.

For the "only if" directions, we again diagonalize M as M = SDS^T, where S is orthogonal and D is the diagonal matrix of eigenvalues λ_1,...,λ_n (which are nonnegative if M ∈ S_n^+). With D^(i) being the matrix whose entry at position (i, i) is λ_i (and all other entries are zero), we get

    M = S (Σ_{i=1}^n D^(i)) S^T = Σ_{i=1}^n S D^(i) S^T = Σ_{i=1}^n λ_i s_i s_i^T,

where s_i is the i-th column of S. By orthogonality of S, ||s_i|| = 1 for all i.

3.2 Largest Eigenvalue

The following cannot really be called an application, but it nicely illustrates the use of duality in a very simple case. We will reprove the following well-known fact from linear algebra, using cone programming duality. There exists in fact a much shorter proof of this theorem, but the point here is to see duality in action (and to prepare the ground for a more general statement, see Exercise 3.4.4).

3.2.1 Theorem. Let C ∈ S_n. Then the largest eigenvalue of C is equal to

    λ := max{x^T C x : x ∈ R^n, ||x|| = 1}.

We note that the maximum exists, since we are optimizing a continuous function x^T C x over a compact subset of R^n.

Proof. We first rewrite x^T C x as Tr(C^T xx^T) and ||x|| = 1 as Tr(xx^T) = 1. This means that λ is the value of the constrained optimization problem

    Maximize    Tr(C^T xx^T)
    subject to  Tr(xx^T) = 1.    (3.1)

This program is obtained from the semidefinite program

    Maximize    Tr(C^T X)
    subject to  Tr(X) = 1
                X ⪰ 0    (3.2)

by adding the constraint that X has rank 1. Indeed, the positive semidefinite matrices of rank 1 are exactly the ones of the form xx^T (Exercise 3.4.2). Equivalently, (3.2) can be considered as a relaxation of (3.1). Note that Tr is a linear operator from S_n to R.

The crucial fact is that (3.1) and (3.2) have the same value λ. Indeed, if X is any feasible solution of (3.2), Lemma 3.1.2 lets us write

    X = Σ_{i=1}^n λ_i x_i x_i^T,

with nonnegative λ_i's. Since the x_i's are unit vectors, all matrices x_i x_i^T have trace 1, so linearity of Tr implies Σ_{i=1}^n λ_i = 1. But then we have

    Tr(C^T X) = Tr(C^T Σ_{i=1}^n λ_i x_i x_i^T) = Σ_{i=1}^n λ_i Tr(C^T x_i x_i^T) ≤ max_{i=1,...,n} Tr(C^T x_i x_i^T) ≤ λ,

since the x_i are feasible solutions of (3.1). Conversely, Tr(C^T X) ≥ λ for some feasible X follows from the fact that (3.2) is a relaxation of (3.1).

Now we are prepared to apply the strong Duality Theorem 2.5.4 of cone programming, in the version for equational form. Written as a cone program in equational form, (3.2) has K = S_n^+ ⊆ S_n, c = C ∈ S_n, b = 1 ∈ R, and A = Tr. In order to be able to write down the dual, we just need to determine the adjoint operator A^T : R → S_n. From the requirement that

    ⟨y, A(X)⟩ = y Tr(X) = Tr(A^T(y) X) = ⟨A^T(y), X⟩   for all X ∈ S_n, y ∈ R,

we infer that A^T(y) := y I_n is an adjoint, so that the dual of (3.2) is

    Minimize    y
    subject to  y I_n − C ⪰ 0.    (3.3)

Since the primal program (3.2) has an interior point (choose X = I_n/n, for example), the duality theorem applies and shows that the optimal value of (3.3) is also λ. But what is this value? If C has eigenvalues λ_1,...,λ_n, then y I_n − C has eigenvalues y − λ_1,..., y − λ_n, and the constraint y I_n − C ⪰ 0 requires all of them to be nonnegative. Therefore the optimal value of (3.3), the smallest y for which y I_n − C ⪰ 0 holds, equals the largest eigenvalue of C. This proves the theorem.

Exercise 3.4.4 discusses semidefinite programming formulations for the sum of the k largest eigenvalues.
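Programs (3.2) and (3.3) can be handed to an off-the-shelf SDP solver. The sketch below is illustrative only and assumes CVXPY with an SDP-capable solver (such as the bundled SCS) is installed; it checks that primal, dual, and the largest eigenvalue agree:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n = 5
C = rng.standard_normal((n, n)); C = (C + C.T) / 2

# primal (3.2): maximize Tr(C X) subject to Tr(X) = 1, X PSD
X = cp.Variable((n, n), symmetric=True)
primal = cp.Problem(cp.Maximize(cp.trace(C @ X)), [cp.trace(X) == 1, X >> 0])
primal.solve()

# dual (3.3): minimize y subject to y*I - C PSD
y = cp.Variable()
dual = cp.Problem(cp.Minimize(y), [y * np.eye(n) - C >> 0])
dual.solve()

print(primal.value, dual.value, np.linalg.eigvalsh(C).max())  # agree up to solver tolerance
```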

3.3 Shannon Capacity and Theta Function of a Graph

Figure 3.1: Left: the recognition graph (on the letters E, F, I, J, L) connects an input letter v to an output letter w if v may be recognized as w. Right: the similarity graph connects two input letters if they may be recognized as the same output letter.

Suppose that you have just bought some optical character recognition system. Your goal as a citizen of the internet age may be to digitize all the books that you have, so that you can safely throw them away afterwards. The system is not perfect, though, and it sometimes gets letters wrong. For example, the letter E might mistakenly be recognized as an F. In general, there are input letters (the ones in the book) and output letters (the ones being recognized). Input and output letters may come from the same, but also from different alphabets. We can encode the possible behaviors of the system as a directed bipartite recognition graph that connects an input letter v to an output letter w if the system may recognize v as w; see Figure 3.1 (left) for an example with 5 input letters and the same 5 output letters.

Two input letters v, v′ are called similar if there is an output letter w such that both v and v′ may be recognized as w. In the example, v = E and v′ = F are similar, with w = F witnessing their similarity. The letters J and L are also similar, since both of them could be recognized as I. Finally, each letter is similar to itself by definition. We can record this information in an (undirected) similarity graph that connects two distinct input letters if they are similar, see Figure 3.1 (right). The information that every letter is similar to itself is implicit.

If the similarity graph is empty, the system can correctly scan all your books: for every recognized output letter w, there is exactly one matching input letter v, and assuming that the system knows its recognition graph, the correct input letter v can be reconstructed. But already with a relatively sparse but nonempty similarity graph, the system may get a lot of words wrong. For example, a word with many E's is pretty likely to get corrupted, since it suffices if just one of the E's is mistakenly recognized as an F. For this reason, the system has an error correction: it comes with a built-in similarity-free dictionary of allowed input words. This means that no two distinct allowed words are similar in the sense that they may be recognized as the same output word.
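The passage from the recognition graph to the similarity graph is mechanical. The sketch below is illustrative only; the recognition relation used here is a guess consistent with the examples in the text (E may be read as F, and J and L may both be read as I), not necessarily the exact relation of Figure 3.1:

```python
from itertools import combinations

# an assumed recognition relation: input letter -> set of possible output letters
recognizes = {
    "E": {"E", "F"},
    "F": {"F"},
    "I": {"I"},
    "J": {"J", "I"},
    "L": {"L", "I"},
}

# two distinct input letters are similar if some output letter witnesses both of them
similarity_edges = {
    frozenset({v, w})
    for v, w in combinations(recognizes, 2)
    if recognizes[v] & recognizes[w]
}
print(sorted(tuple(sorted(e)) for e in similarity_edges))
# [('E', 'F'), ('I', 'J'), ('I', 'L'), ('J', 'L')]
```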

Formally, two k-letter words v_1...v_k and v′_1...v′_k are similar if and only if v_i is similar to v′_i for all i. Indeed, if the dictionary is similarity-free, then error correction works, since for every recognized word w_1...w_k, there is exactly one word v_1...v_k in the dictionary such that v_i may be recognized as w_i for all i, and this word must be the correct input word. (If there is no such word v_1...v_k in the dictionary, then the correct input word was not an allowed word.)

While you are waiting for your next book to be scanned, your mind is drifting off and you start asking a theoretical question. What is the largest similarity-free dictionary of k-letter words?

For k = 1 (the words are just letters), this is easy to answer: the dictionary must be an independent set in the similarity graph. The largest similarity-free dictionary of 1-letter words is therefore a maximum independent set in the similarity graph. For k > 1, we can easily form a graph G^k whose edges characterize similarity between k-letter words. The vertices of G^k are the words of length k, and there is an edge between two words v_1...v_k and v′_1...v′_k if they are similar, meaning that v_i is similar to v′_i for all i. This leads to the following observation.

3.3.1 Observation. Let α(G) denote the independence number of a graph G, i.e. the size of a maximum independent set in G. Then the largest similarity-free dictionary of k-letter words has size α(G^k).

It is known that the independence number of a graph is NP-hard to compute [4, Section 3.1.3], so finding the size of the largest similarity-free dictionary is hard even for 1-letter words. However, this is not our main concern here, since we want to study the sequence (α(G^k))_{k∈N} in its entirety. We start by showing that the sequence is super-multiplicative.

3.3.2 Lemma. For all k, l ∈ N, α(G^{k+l}) ≥ α(G^k) α(G^l).

Proof. If I is an independent set in G^k, and J is an independent set in G^l, then the set of |I| · |J| words

    {v_1...v_k w_1...w_l : v_1...v_k ∈ I, w_1...w_l ∈ J}

is independent in G^{k+l}. Indeed, no two distinct words in this set can be similar, as this would imply that at least one of I and J contains two distinct similar words. If |I| = α(G^k) and |J| = α(G^l), the statement follows.
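Observation 3.3.1 and Lemma 3.3.2 can be explored by brute force on tiny graphs. The following sketch is illustrative only (it enumerates subsets of words, so it is exponential in both the vertex count and k) and uses nothing beyond the Python standard library:

```python
from itertools import combinations, product

def alpha_strong_power(vertices, edges, k):
    """Brute-force independence number of G^k (only feasible for tiny graphs)."""
    def similar(a, b):                      # similar = equal or adjacent in G
        return a == b or frozenset((a, b)) in edges

    words = list(product(vertices, repeat=k))

    def independent(cand):                  # no two distinct words coordinate-wise similar
        return all(not all(similar(a, b) for a, b in zip(u, w))
                   for u, w in combinations(cand, 2))

    best = 0
    for size in range(1, len(words) + 1):
        if any(independent(c) for c in combinations(words, size)):
            best = size
        else:
            break
    return best

C5_edges = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]}
print(alpha_strong_power(range(5), C5_edges, 1))   # 2
print(alpha_strong_power(range(5), C5_edges, 2))   # 5, consistent with Lemma 3.3.2
```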

The inequality in Lemma 3.3.2 is in general strict. For the 5-cycle C_5 we have α(C_5) = 2, but

    α(C_5^2) ≥ 5 > 4 = α(C_5)^2.

To see this, we use the interpretation of α(C_5^2) as the size of a largest similarity-free dictionary of 2-letter words. Suppose that the letters around the cycle C_5 are A, B, C, D, E. Then it is easy to check that the following five 2-letter words are pairwise non-similar: AA, BC, CE, DB, and ED. This example is actually best possible, and α(C_5^2) = 5.

3.3.1 The Shannon Capacity

We may view a dictionary as a set of messages, encoded by k-letter words. The goal is to safely transmit any given message over a noisy channel whose input/output behavior induces a similarity graph G as in Figure 3.1 (right). If the dictionary is similarity-free w.r.t. G, we can indeed correct all errors being made during transmission.

Using 1-letter words, we can thus safely transmit α(G) different messages. This means that every letter carries (roughly) log(α(G)) bits of information (the logarithm is binary). Using k-letter words, we can transmit α(G^k) different messages, meaning that each of the k letters carries (1/k) log α(G^k) bits of information. We are interested in the best k, the one that leads to the highest information-per-letter ratio. It easily follows from our above considerations on C_5 that k = 1 is not always the best choice. Indeed, we have

    log α(C_5) = 1 < (1/2) log α(C_5^2) = log √5 ≈ 1.16.

Consequently, let us define the Shannon capacity of a graph G as

    σ(G) = sup{(1/k) log α(G^k) : k ∈ N},    (3.4)

the (asymptotically) highest information-per-letter ratio that can be achieved. This definition is due to Claude Shannon [5].

3.3.3 Lemma. For every graph G = (V, E), σ(G) is bounded and satisfies

    σ(G) = lim_{k→∞} (1/k) log α(G^k).

Proof. Since G^k has |V|^k vertices, we obviously have α(G^k) ≤ |V|^k, which implies that σ(G) ≤ log |V|. Taking logarithms in Lemma 3.3.2, we see that the sequence (x_k)_{k∈N} = (log α(G^k))_{k∈N} is super-additive, meaning that x_{k+l} ≥ x_k + x_l for all k, l.

Now we use Fekete's Lemma, which states that for every super-additive sequence (x_k)_{k∈N}, the sequence (x_k/k)_{k∈N} converges to its supremum (Exercise 3.4.5 asks you to prove this).

Shannon already remarked in his original paper [5] in 1956 that it can be quite difficult to compute σ(G) even for small graphs G, and he in particular failed to determine σ(C_5). We know that

    σ(C_5) ≥ (1/2) log 5,

but it is absolutely not clear whether k = 2 yields the best possible information-per-letter ratio. Only in 1979, Lovász was able to determine σ(C_5) = (1/2) log 5, showing that the lower bound obtained from 2-letter encodings is tight [1]. Lovász did this by deriving the theta function, a new upper bound on σ(G) (computable with semidefinite programming, as we will see), and by showing that this upper bound matches the known lower bound for σ(C_5). Instead of σ(G), Lovász uses the following equivalent quantity.

3.3.4 Definition. Let G be a graph. The Lovász-Shannon capacity of G is

    Θ(G) = 2^{σ(G)} = lim_{k→∞} α(G^k)^{1/k}.    (3.5)

We remark that the Lovász-Shannon capacity is lower-bounded by α(G) (by super-multiplicativity) and upper-bounded by |V|. After this transformation, the statement σ(C_5) = (1/2) log 5 reads as Θ(C_5) = √5.

3.3.2 The Theta Function

We first pinpoint our earlier notion of similarity.

3.3.5 Definition. Let G = (V, E) be a graph. Vertices v and v′ are called similar in G if either v = v′ or {v, v′} ∈ E.

3.3.6 Definition. An orthonormal representation of a graph G = (V, E) with |V| = n is a set U = {u_v : v ∈ V} of unit vectors in S^{n−1} such that

    u_v^T u_{v′} = 0   if v and v′ are not similar in G.    (3.6)

It is clear that every graph has such a representation, since we may take the n pairwise orthogonal unit vectors e_1,...,e_n.
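Definition 3.3.6 is easy to check mechanically. The sketch below is illustrative only (NumPy assumed; the helper name is ours) and tests the orthogonality condition (3.6), here for the trivial representation by standard unit vectors:

```python
import numpy as np

def is_orthonormal_representation(U, edges, tol=1e-9):
    """U: dict vertex -> vector; edges: set of frozensets of adjacent pairs.
    Checks unit length and u_v . u_w = 0 for distinct non-similar v, w (condition (3.6))."""
    if any(abs(np.linalg.norm(u) - 1.0) > tol for u in U.values()):
        return False
    return all(abs(U[v] @ U[w]) <= tol
               for v in U for w in U
               if v < w and frozenset((v, w)) not in edges)

C5_edges = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]}
trivial = {v: np.eye(5)[v] for v in range(5)}              # u_v = e_v
print(is_orthonormal_representation(trivial, C5_edges))    # True, for any graph
```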

But we are looking for a better representation, if possible. Intuitively, a representation is good if there exists a unit vector that is far from being orthogonal to any of the u_v. Formally, and at first glance somewhat arbitrarily, we define the value of an orthonormal representation U = {u_v : v ∈ V} as

    ϑ(U) := min_{||c||=1} max_{v∈V} 1/(c^T u_v)^2.    (3.7)

The minimum exists, since we can cast the problem as the minimization of a continuous function over a compact set (the unit sphere S^{n−1}, with suitable open neighborhoods of the singular vectors c, those orthogonal to some u_v, removed). A vector c that attains the minimum is called a handle of U.

3.3.7 Definition. The theta function ϑ(G) of G is the smallest value ϑ(U) over all orthonormal representations U of G.

Again, the minimum exists, as (i) ϑ(U) is continuous, and (ii) the set of orthonormal representations is the compact set (S^{n−1})^n, intersected with closed sets of the form {u_v^T u_{v′} = 0} (which again yields a compact set).

3.3.3 The Lovász Bound

In this section we show that ϑ(G) is an upper bound for the Lovász-Shannon capacity Θ(G). This requires two elementary lemmas. With the definition of the graph G^k given earlier, we want to prove that ϑ(G^k) ≤ ϑ(G)^k (recall that the reverse inequality holds for the independence number α, by Lemma 3.3.2). For this, we handle the case k = 2 first, in the following more general form.

3.3.8 Definition. Let G = (V, E) and H = (W, F) be graphs. The strong product of G and H is the graph G ⊠ H with vertex set V × W, and an edge between (v, w) and (v′, w′) if and only if v is similar to v′ in G and w is similar to w′ in H.

3.3.9 Lemma. For all graphs G and H, ϑ(G ⊠ H) ≤ ϑ(G) ϑ(H).

Since G^k is isomorphic to the k-fold strong product (···((G ⊠ G) ⊠ G) ⊠ ··· ⊠ G), we obtain

3.3.10 Corollary. ϑ(G^k) ≤ ϑ(G)^k.
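The proof of Lemma 3.3.9, given next, hinges on the tensor product identity (x ⊗ y)^T (x′ ⊗ y′) = (x^T x′)(y^T y′), stated as (3.8) below. A quick numerical sanity check (illustrative only; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
x, xp = rng.standard_normal(3), rng.standard_normal(3)
y, yp = rng.standard_normal(4), rng.standard_normal(4)

lhs = np.kron(x, y) @ np.kron(xp, yp)     # (x tensor y)^T (x' tensor y')
rhs = (x @ xp) * (y @ yp)                 # (x^T x')(y^T y')
print(np.isclose(lhs, rhs))               # True
```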

Proof of Lemma 3.3.9. Let U = {u_v : v ∈ V} and V = {v_w : w ∈ W} be optimal orthonormal representations of G = (V, E) and H = (W, F), with handles c and d. We will from this construct an orthonormal representation of G ⊠ H with value at most ϑ(G) ϑ(H). The construction is simple: the orthonormal representation is obtained by taking all tensor products of vectors u_v with vectors v_w, and an upper bound for its value is computed using the tensor product of the handles c and d.

The tensor product of two vectors x ∈ R^m and y ∈ R^n is the (column) vector x ⊗ y ∈ R^{mn} defined by

    x ⊗ y = (x_1 y_1,..., x_1 y_n, x_2 y_1,..., x_2 y_n,..., x_m y_1,..., x_m y_n) ∈ R^{mn}.

Equivalently, the tensor product is the matrix xy^T, written as one long vector (row by row). We have that

    (x ⊗ y)^T (x′ ⊗ y′) = Σ_{i=1}^m Σ_{j=1}^n x_i y_j x′_i y′_j = Σ_{i=1}^m Σ_{j=1}^n x_i x′_i y_j y′_j = (x^T x′)(y^T y′).    (3.8)

Now we can prove that the vectors u_v ⊗ v_w indeed form an orthonormal representation U ⊗ V of G ⊠ H. As a direct consequence of (3.8), all of them are unit vectors. Moreover, if (v, w) and (v′, w′) are not similar in G ⊠ H, then v is not similar to v′ in G, or w is not similar to w′ in H. In both cases, (3.8) implies that

    (u_v ⊗ v_w)^T (u_{v′} ⊗ v_{w′}) = (u_v^T u_{v′})(v_w^T v_{w′}) = 0,

since U and V are orthonormal representations of G and H. Thus, we have an orthonormal representation of G ⊠ H. By definition, ϑ(G ⊠ H) is bounded by

    ϑ(U ⊗ V) ≤ max_{v∈V, w∈W} 1/((c ⊗ d)^T (u_v ⊗ v_w))^2 = max_{v∈V, w∈W} 1/((c^T u_v)^2 (d^T v_w)^2) = ϑ(G) ϑ(H).

Here is the second lemma that we need: the theta function ϑ(G) is, like the Lovász-Shannon capacity Θ(G), an upper bound for the independence number of G.

3.3.11 Lemma. For any graph G, α(G) ≤ ϑ(G).

Proof. Let I ⊆ V(G) be a maximum independent set in G, and let U = {u_v : v ∈ V} be an optimal orthonormal representation with handle c. We know that the vectors u_v, v ∈ I, are pairwise orthogonal, which implies (Exercise 3.4.8) that

    c^T c ≥ Σ_{v∈I} (c^T u_v)^2.

We thus have

    1 = c^T c ≥ Σ_{v∈I} (c^T u_v)^2 ≥ |I| · min_{v∈I} (c^T u_v)^2 = α(G) · min_{v∈I} (c^T u_v)^2.

This in turn means that

    α(G) ≤ 1 / min_{v∈I} (c^T u_v)^2 = max_{v∈I} 1/(c^T u_v)^2 ≤ max_{v∈V} 1/(c^T u_v)^2 = ϑ(G).

The main result of this section now easily follows: the theta function ϑ(G) is an upper bound for the Lovász-Shannon capacity Θ(G), meaning that we have α(G) ≤ Θ(G) ≤ ϑ(G).

3.3.12 Theorem. For any graph G, Θ(G) ≤ ϑ(G).

Proof. By Lemma 3.3.11 and Corollary 3.3.10, we have α(G^k) ≤ ϑ(G^k) ≤ ϑ(G)^k. It follows that

    α(G^k)^{1/k} ≤ ϑ(G),

hence

    Θ(G) = lim_{k→∞} α(G^k)^{1/k} ≤ ϑ(G).

3.3.4 The 5-Cycle

Using the bound of Theorem 3.3.12, we can now determine the Lovász-Shannon capacity of the 5-cycle C_5. We already know that Θ(C_5) ≥ √5, by using 2-letter encodings. The fact that this is best possible follows from the next lemma, together with Theorem 3.3.12.

3.3.13 Lemma. ϑ(C_5) ≤ √5.

Proof. We need to find an orthonormal representation of C_5 with value at most √5. Let the vertices of C_5 be 0, 1, 2, 3, 4 in cyclic order. Here is Lovász's umbrella construction that yields vectors u_0,...,u_4 in S^2 (we can add two zero coordinates to lift them into S^4). Imagine an umbrella with unit handle c = (0, 0, 1) and five unit ribs of the form

    u_i = (1/√(1 + z^2)) (cos(2πi/5), sin(2πi/5), z),   i = 0,...,4.

Figure 3.2: A flat five-rib umbrella (ribs u_0,...,u_4, handle c, angle 2π/5 between consecutive ribs), top view.

If z = 0, the umbrella is completely flat (see Figure 3.2 for a top view in which c collapses to the origin), and letting z grow to ∞ corresponds to the process of folding up the umbrella. Keep folding the umbrella until the angle between u_0 and u_2 becomes π/2, meaning that the vectors become orthogonal. This will eventually happen, since we start with angle 4π/5 > π/2 in flat position and converge to angle 0 as z → ∞. We can compute the value of z for which we get orthogonality: we must have

    0 = u_0^T u_2, i.e., 0 = (1, 0, z)^T (cos(4π/5), sin(4π/5), z) = cos(4π/5) + z^2.

Hence,

    z = √(−cos(4π/5)),   u_0 = (1/√(1 − cos(4π/5))) (1, 0, √(−cos(4π/5))).

For this value of z, symmetry implies that we do have an orthonormal representation U: every u_i is orthogonal to the two opposite vectors u_{i+2} and u_{i−2} of its two non-neighbors in C_5 (indices are modulo 5). Recalling that c = (0, 0, 1), we have that

    ϑ(C_5) ≤ ϑ(U) ≤ max_{i=0,...,4} 1/(c^T u_i)^2 = 1/(c^T u_0)^2 = (1 − cos(4π/5)) / (−cos(4π/5))

by symmetry. Exercise 3.4.9 asks you to prove that this number is √5.
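The umbrella construction can be checked numerically. A small sketch (illustrative only; NumPy assumed) that builds u_0,...,u_4, verifies the orthogonality required by (3.6) for C_5, and evaluates 1/(c^T u_0)^2:

```python
import numpy as np

z = np.sqrt(-np.cos(4 * np.pi / 5))
ribs = []
for i in range(5):
    v = np.array([np.cos(2 * np.pi * i / 5), np.sin(2 * np.pi * i / 5), z])
    ribs.append(v / np.linalg.norm(v))     # the unit ribs u_0,...,u_4
c = np.array([0.0, 0.0, 1.0])              # the handle

# each rib is orthogonal to the ribs of its two non-neighbors in C_5
print(all(abs(ribs[i] @ ribs[(i + 2) % 5]) < 1e-12 for i in range(5)))   # True

value = 1.0 / (c @ ribs[0]) ** 2           # = (1 - cos(4*pi/5)) / (-cos(4*pi/5))
print(value, np.sqrt(5))                   # both approximately 2.2360679...
```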

3.3.5 Two Semidefinite Programs for the Theta Function

The value of Θ(C_5) was unknown for more than 20 years after Shannon had given the lower bound Θ(C_5) ≥ √5. Together with the Lovász bounds Θ(C_5) ≤ ϑ(C_5) ≤ √5, we get Θ(C_5) = ϑ(C_5) = √5.

Here we want to discuss how ϑ(G) can be computed for an arbitrary graph G. The above method for C_5 was somewhat ad hoc, and only in hindsight did it turn out that the umbrella construction yields an optimal orthonormal representation. In general, the definition of ϑ(G) does not give rise to an efficient algorithm. However, Lovász proved that ϑ(G) can alternatively be expressed as the value of a semidefinite program. This implies that ϑ(G) is efficiently computable, up to any desired precision.

In fact, there are various semidefinite programs that can be used to compute ϑ(G). The first one is obtained by more or less just rewriting the definition. Recall that ϑ(G) is the smallest value of

    ϑ(U) = min_{||c||=1} max_{v∈V} 1/(c^T u_v)^2

over all orthonormal representations U. By replacing u_v with −u_v if necessary, we may assume c^T u_v ≥ 0 for all v. But then,

    1/√ϑ(G) = max_U 1/√ϑ(U) = max_U max_{||c||=1} min_{v∈V} c^T u_v.

With t ∈ R_+ being an additional variable for the minimum, we see that 1/√ϑ(G) is the value of the program

    Maximize    t
    subject to  u_v^T u_{v′} = 0   if v and v′ are not similar in G
                c^T u_v ≥ t,   v ∈ V
                ||u_v|| = 1,   v ∈ V
                ||c|| = 1.    (3.9)

This does not yet have the form of a semidefinite program in equational form, but it can be brought into this form; see Exercise 3.4.7, and observe the remark after Theorem 3.3.14.

3.3.14 Theorem. For any graph G = (V, E) with V = {1,..., n}, the theta function ϑ(G) is the value of the following semidefinite program in the matrix variable Y ∈ S_n and the real variable t.

    Minimize    t
    subject to  y_ij = −1      if i is not similar to j in G
                y_ii = t − 1   for all i = 1,..., n
                Y ⪰ 0.    (3.10)
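Program (3.10) translates directly into a modeling language. The following sketch is not part of the text; it assumes CVXPY with an SDP-capable solver (e.g. SCS) and computes ϑ(C_5), which should come out near √5 ≈ 2.236:

```python
import numpy as np
import cvxpy as cp

n = 5
edges = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)}

def similar(i, j):
    return i == j or (i, j) in edges or (j, i) in edges

Y = cp.Variable((n, n), symmetric=True)
t = cp.Variable()
constraints = [Y >> 0]
for i in range(n):
    constraints.append(Y[i, i] == t - 1)           # y_ii = t - 1
    for j in range(i + 1, n):
        if not similar(i, j):
            constraints.append(Y[i, j] == -1)      # y_ij = -1 on non-similar pairs

prob = cp.Problem(cp.Minimize(t), constraints)
prob.solve()
print(prob.value, np.sqrt(5))   # theta(C_5) = sqrt(5), up to solver tolerance
```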

One remark is in order. We can easily write this as a cone program in equational form, with cone K = S_n^+ × R_+ (t ≥ 0 follows from y_ii = t − 1 and Y ⪰ 0). But this can also be simulated by a semidefinite program: append to Y one more row and column, add the constraints y_{n+1,i} = y_{i,n+1} = 0 for i ≤ n, and replace t with y_{n+1,n+1} throughout. Then the larger matrix is positive semidefinite if and only if Y ⪰ 0 and y_{n+1,n+1} ≥ 0, so (3.10) is a proper semidefinite program. In the same way, we can integrate an arbitrary number of nonnegative real variables into any semidefinite program.

Proof. We first show that the value of (3.10) is at most ϑ(G). Let U = {u_1,...,u_n} be an optimal orthonormal representation of G with handle c. Now we define a matrix Ỹ ∈ S_n by

    ỹ_ij = u_i^T u_j / ((c^T u_i)(c^T u_j)) − 1,   i ≠ j,

and

    ỹ_ii = ϑ(G) − 1,   i = 1,...,n.

Since U is an orthonormal representation, we have ỹ_ij = −1 for i not similar to j. If we can show that Ỹ ⪰ 0, we know that the pair (Ỹ, ϑ(G)) is a feasible solution of (3.10), meaning that the program's value is at most ϑ(G). To see Ỹ ⪰ 0, we first observe (a simple calculation) that

    ỹ_ij = (c − u_i/(c^T u_i))^T (c − u_j/(c^T u_j)),   i ≠ j,

and (by definition of ϑ(G))

    ỹ_ii = ϑ(G) − 1 ≥ 1/(c^T u_i)^2 − 1 = (c − u_i/(c^T u_i))^T (c − u_i/(c^T u_i)).

This means that Ỹ is of the form Ỹ = D + U^T U, where D is a diagonal matrix with nonnegative entries, and U is the matrix whose i-th column is the vector c − u_i/(c^T u_i). Thus, Ỹ ⪰ 0.

To show that the value of (3.10) is at least ϑ(G), we let (Ỹ, t) be any feasible solution of (3.10) with the property that t is minimal subject to the ỹ_ij being fixed for i ≠ j. This implies that Ỹ has one eigenvalue equal to 0 (otherwise we could decrease t) and is therefore singular. Note that t ≥ 1. We now perform a Cholesky decomposition Ỹ = S^T S, see Fact 1.4.1(iii). Let s_1,...,s_n be the columns of S. Since Ỹ is singular, S is singular as well, and the s_i span a proper subspace of R^n. Consequently, there exists a unit vector c that is orthogonal to all the s_i. Next we define

    u_i := (1/√t)(c + s_i),   i = 1,...,n,

and we intend to show that U = {u_1,...,u_n} is an orthonormal representation of G. For this, we compute

    u_i^T u_j = (1/t)(c + s_i)^T (c + s_j) = (1/t)(c^T c + c^T s_j + s_i^T c + s_i^T s_j) = (1/t)(1 + ỹ_ij),

using c^T c = 1 and c^T s_j = s_i^T c = 0. It follows that

    ||u_i|| = 1   for all i = 1,...,n   (from ỹ_ii = t − 1),
    u_i^T u_j = 0   if i is not similar to j   (from ỹ_ij = −1),

so we have indeed found an orthonormal representation of G. Since we further have

    (c^T u_i)^2 = ((1/√t) c^T (c + s_i))^2 = (1/t)(c^T (c + s_i))^2 = 1/t,   i = 1,...,n,

we get

    ϑ(G) ≤ ϑ(U) ≤ max_{i=1,...,n} 1/(c^T u_i)^2 = t,

which completes the proof.

3.3.6 The Sandwich Theorem and Perfect Graphs

We know that ϑ(G) is lower-bounded by α(G), the independence number of the graph G. But we can also upper-bound ϑ(G) in terms of another graph parameter. This bound will also shed some more light on the geometric interpretation of the semidefinite program (3.10) for ϑ(G).

3.3.15 Definition. Let G = (V, E) be a graph.

(i) A clique in G is a subset K ⊆ V of vertices such that {v, w} ∈ E for all distinct v, w ∈ K. The clique number ω(G) of G is the size of a largest clique in G.

(ii) A k-coloring of G is a mapping c : V → {1,..., k} such that c(v) ≠ c(w) if {v, w} ∈ E. The chromatic number χ(G) of G is the smallest k such that G has a k-coloring.

(iii) The complementary graph Ḡ = (V, Ē) of G is defined by letting Ē consist of all two-element subsets of V that are not edges of G.

According to this definition, an independent set in G is a clique in Ḡ, and vice versa. Consequently,

    α(G) = ω(Ḡ).    (3.11)

Here is the promised upper bound on ϑ(G). Together with the already known lower bound, we obtain the Sandwich Theorem that bounds ϑ(G) in terms of the clique number and the chromatic number of the complementary graph.

3.3.16 Theorem. For every graph G = (V, E),

    ω(Ḡ) ≤ ϑ(G) ≤ χ(Ḡ).

Proof. For the lower bound, we use Lemma 3.3.11 together with (3.11). For the upper bound, let us suppose that ϑ(G) > 1, as the bound is trivial for ϑ(G) = 1. But then χ(Ḡ) ≥ 2, since a 1-coloring is possible only for Ē = ∅, in which case ϑ(G) = 1. Now, let us rescale (3.10) into the following equivalent form (we assume that V = {1,..., n}):

    Minimize    t
    subject to  y_ij = −1/(t − 1)   if {i, j} ∈ Ē
                y_ii = 1            for all i = 1,..., n
                Y ⪰ 0.    (3.12)

At the same time, we have replaced the condition "i is not similar to j" by the equivalent condition that {i, j} is an edge in the complementary graph Ḡ = (V, Ē). If we write Y ⪰ 0 as Y = S^T S for S a matrix with columns s_1,...,s_n, the equality constraints of (3.12) translate as follows:

    y_ij = −1/(t − 1)   ⟺   s_i^T s_j = −1/(t − 1),
    y_ii = 1            ⟺   ||s_i|| = 1.

Lemma 3.3.18 below shows that if Ḡ has a k-coloring, then we actually find vectors s_i that satisfy the latter equations, for t = k. This implies that (Y, t) = (S^T S, k) is a feasible solution of (3.12), and hence k ≥ ϑ(G), the value of (3.12). The upper bound follows if we choose k = χ(Ḡ).

The vectors s_i constructed in the latter proof can be regarded as a vector k-coloring of Ḡ.

3.3.17 Definition. For k ∈ R, a vector k-coloring of a graph G = (V, E) is a mapping γ : V → S^{n−1} such that

    γ(v)^T γ(w) = −1/(k − 1)   for all {v, w} ∈ E.
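Lemma 3.3.18 below builds a vector k-coloring from an ordinary k-coloring out of k unit vectors with pairwise scalar products −1/(k − 1). The following sketch (illustrative only; NumPy assumed) constructs such vectors and checks their Gram matrix:

```python
import numpy as np

def simplex_vectors(k):
    """k unit vectors in R^k with pairwise scalar products -1/(k-1)."""
    centered = np.eye(k) - np.ones((k, k)) / k                # rows e_i - (1/k) * sum_l e_l
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

U = simplex_vectors(4)
gram = U @ U.T
print(np.allclose(np.diag(gram), 1.0))                        # unit length
print(np.allclose(gram[~np.eye(4, dtype=bool)], -1.0 / 3.0))  # -1/(k-1) off the diagonal
```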

For a k-coloring, we require that adjacent vertices have different colors. For a vector k-coloring, we require the colors of adjacent vertices to have a large angle. The proof of Theorem 3.3.16 shows that ϑ(G) is the smallest k such that Ḡ has a vector k-coloring. The upper bound ϑ(G) ≤ χ(Ḡ) then follows from the fact that the notion of vector k-coloring is a relaxation of the notion of k-coloring:

3.3.18 Lemma. If a graph G has a k-coloring, then it also has a vector k-coloring.

Proof. We construct k unit-length vectors u_1,...,u_k such that

    u_i^T u_j = −1/(k − 1),   i ≠ j.

Given a k-coloring c of G, a vector k-coloring can then be obtained via γ(v) = u_{c(v)}, v ∈ V. The k vectors form the vertices of a regular simplex centered at the origin, see Figure 3.3 for the case k = 3. In general, we define

    u_i = (e_i − (1/k) Σ_{l=1}^k e_l) / ||e_i − (1/k) Σ_{l=1}^k e_l||,   i = 1,...,k.

Figure 3.3: k unit-length vectors with pairwise scalar products −1/(k − 1), shown for k = 3 (any two of u_1, u_2, u_3 form an angle of 120°).

Perfect graphs. We know that the clique number ω(G) is NP-hard to compute for general graphs. The same can be said about the chromatic number χ(G). But there is a class of graphs for which Theorem 3.3.16 makes both values computable in polynomial time.

3.3.19 Definition. A graph G is called perfect if ω(G′) = χ(G′) for every induced subgraph G′ of G.

There are many known families of perfect graphs, including bipartite graphs as an easy example. Indeed, every induced subgraph of a bipartite graph is again bipartite, and every bipartite graph with at least one edge has clique number and chromatic number equal to 2 (an edgeless graph has both equal to 1). Other examples are interval graphs (intersection graphs of closed intervals on the real line), and more generally, chordal graphs (every cycle of length at least four has an edge connecting two vertices that are not neighbors along the cycle).

For perfect graphs, Theorem 3.3.16 implies

    ω(G) = ϑ(Ḡ) = χ(G),

meaning that maximum cliques and minimum colorings can be computed for perfect graphs in polynomial time through semidefinite programming. Indeed, since we are looking for an integer, it suffices to solve (3.12) (for the complementary graph) up to accuracy ε < 1/2. Moreover, due to y_ii = 1, all entries of a feasible Y are scalar products of unit vectors and hence in [−1, 1]. This means that our requirements for polynomial-time solvability (see the beginning of Section 1.3) are satisfied.

One can also compute the independence number α(G) of a perfect graph G in polynomial time: recall that α(G) = ω(Ḡ), and since Ḡ is perfect as well (the weak perfect graph conjecture, proved by Lovász in 1972), the statement follows.

3.4 Exercises

3.4.1 Exercise. Let M ∈ S_n be a symmetric n×n matrix with eigenvalues λ_1,...,λ_n. Prove that

    Tr(M) = Σ_{j=1}^n λ_j.

3.4.2 Exercise. Prove that a matrix M ∈ S_n has rank 1 if and only if M = ±ss^T for some nonzero vector s ∈ R^n. In particular, M ∈ S_n^+ has rank 1 if and only if M = ss^T for some nonzero vector s.

3.4.3 Exercise. Given a symmetric matrix C ∈ S_n, we are looking for a matrix Y ∈ S_n^+ such that Y − C ∈ S_n^+. Prove that the trace of any such matrix is at least the sum of the positive eigenvalues of C. Moreover, there exists a matrix Ỹ ∈ S_n^+ with Ỹ − C ∈ S_n^+ and Tr(Ỹ) equal to the sum of the positive eigenvalues of C.

3.4.4 Exercise. Let C ∈ S_n.

(a) Prove that the value of the following cone program is the sum of the k largest eigenvalues of C.

    Minimize    ky + Tr(Y)
    subject to  y I_n + Y − C ∈ S_n^+
                (Y, y) ∈ S_n^+ × R.

Hint: You may use the statement of Exercise 3.4.3.

(b) Derive the dual program and show that its value is also the sum of the k largest eigenvalues of C. In doing this, you have (almost) proved Fan's Theorem.

3.4.5 Exercise. Let (x_k)_{k∈N} be a sequence of real numbers such that x_{k+l} ≥ x_k + x_l for all k, l. We say that the sequence is super-additive. Prove that

    lim_{k→∞} x_k/k = sup{x_k/k : k ∈ N},

where both the limit and the supremum may be ∞.

3.4.6 Exercise. What is the value of the orthonormal representation U = {e_1,...,e_n}?

3.4.7 Exercise. Prove that the program (3.9) can be rewritten as a semidefinite program in equational form, with the same value.

3.4.8 Exercise. Let u_1,...,u_k be pairwise orthogonal unit vectors in R^n. Prove that c^T c ≥ Σ_{i=1}^k (c^T u_i)^2 for all c ∈ R^n.

3.4.9 Exercise. Prove that

    (1 − cos(4π/5)) / (−cos(4π/5)) = √5.

3.4.10 Exercise. Prove that for all graphs G and H, ϑ(G ⊠ H) = ϑ(G) ϑ(H).

Hint: First write down the program dual to (3.10) and show that it also has value ϑ(G). Then look at the dual programs for G and H with optimal solutions; from them, construct a feasible solution of the dual program for G ⊠ H with value ϑ(G) ϑ(H). This shows that ϑ(G ⊠ H) ≥ ϑ(G) ϑ(H), and the other inequality is Lemma 3.3.9.