Communication Complexity 6:98:67        2/5/200

Lecture 6

Lecturer: Nikos Leonardos        Scribe: Troy Lee

1 Information theory lower bounds

1.1 Entropy basics

Let $\Omega$ be a finite set and $P$ a probability distribution on $\Omega$. The entropy of a random variable $X$ distributed according to $P$ is defined as
$$H(X) = -\sum_{x \in \Omega} P(x) \log P(x).$$
Intuitively, entropy quantifies the amount of uncertainty in a distribution. If the distribution is concentrated on a single element the entropy is zero; for the uniform distribution on $\Omega$ the entropy is maximal, $\log |\Omega|$. These examples set the bounds for the range of entropy: $0 \le H(X) \le \log |\Omega|$.

Entropy can also be interpreted in terms of compression: Shannon's source coding theorem states that the optimal expected codeword length for the outcomes of a random variable $X$ is, up to one bit, the entropy of $X$.

We will also make use of the conditional entropy. First consider conditioning on a single outcome:
$$H(X \mid Y = y) = -\sum_x P(x \mid y) \log P(x \mid y).$$
This quantity can actually be larger than $H(X)$. Imagine the case where
$$X = \begin{cases} \text{random bit} & \text{if } Y = 0,\\ 0 & \text{if } Y = 1.\end{cases}$$
Then $H(X) < 1$, yet $H(X \mid Y = 0) = 1$. The conditional entropy is the expectation of this quantity over $Y$:
$$H(X \mid Y) = \mathbb{E}_y[H(X \mid Y = y)] \le H(X).$$
This quantity is at most the entropy of $X$. Finally, the joint entropy of $X, Y$ is given by the chain rule
$$H(X, Y) = H(X) + H(Y \mid X).$$
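As a quick numerical check of these definitions, the following Python sketch (ours, not part of the original notes) computes the entropies in the example above, assuming $Y$ is a uniform random bit (the notes leave $Y$'s distribution implicit).

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Joint distribution of (X, Y): Y is a uniform bit, X is a fresh random bit
# when Y = 0 and X = 0 when Y = 1.
joint = {(0, 0): 1/4, (1, 0): 1/4, (0, 1): 1/2}

# Marginal of X: P(X=0) = 3/4, P(X=1) = 1/4.
marginal_x = {0: 3/4, 1: 1/4}
print(entropy(marginal_x))        # H(X) ~ 0.811 < 1

# Conditioned on Y = 0, X is uniform, so the entropy *increases* to 1.
print(entropy({0: 1/2, 1: 1/2}))  # H(X | Y=0) = 1

# On average, though, conditioning cannot increase entropy.
h_x_given_y = 0.5 * entropy({0: 1/2, 1: 1/2}) + 0.5 * entropy({0: 1})
print(h_x_given_y)                # H(X | Y) = 0.5 <= H(X)

# Chain rule: H(X, Y) = H(Y) + H(X | Y).
print(entropy(joint))             # 1.5 = 1 + 0.5
```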
1.2 Mutual Information

For lower bounds we will use mutual information. For two random variables $Z, \Pi$ this is defined as
$$I(Z; \Pi) = H(Z) - H(Z \mid \Pi) = H(\Pi) - H(\Pi \mid Z) = H(\Pi) + H(Z) - H(\Pi, Z).$$
In applications to communication complexity, typically $Z$ will be a distribution over inputs $X \times Y$ and $\Pi$ will be a distribution over protocol transcripts.

To see how this can work to show lower bounds, let us look at a warmup example, the index function. The index function is defined as $\mathrm{Index} \colon \{0,1\}^n \times [n] \to \{0,1\}$ where $\mathrm{Index}(x, i) = x_i$.

Consider the one-way complexity of the index function from Alice to Bob. Let $\Pi(X, R)$ be a random variable over the messages of Alice, which depends on the distribution $X$ over Alice's inputs and Alice's random coins $R$. Notice that $H(\Pi)$ is a lower bound on the maximum length of a message of Alice. We will actually lower bound the potentially smaller quantity $I(X; \Pi)$.

Take the uniform distribution over Alice's inputs. Then we have
$$I(X; \Pi) = H(X) - H(X \mid \Pi) = n - H(X \mid \Pi).$$
Now it remains to upper bound $H(X \mid \Pi)$. Using the fact that $H(Y, Z) = H(Y) + H(Z \mid Y)$ (together with the fact that conditioning cannot increase entropy) we have
$$H(X \mid \Pi) \le \sum_j H(X_j \mid \Pi).$$
As we are dealing with one-way communication, conditioning on a particular input to Bob does not change Alice's behavior. Thus we have $H(X_j \mid \Pi) = H(X_j \mid \Pi, B = j)$, that is, given that Bob's input is actually $j$. Finally, by correctness of the protocol this entropy must be small: by Fano's inequality it is at most $H(\epsilon)$ if the error probability is $\epsilon$, where $H(\epsilon)$ denotes the binary entropy function. Putting everything together, we have
$$R_\epsilon^{A \to B}(\mathrm{Index}) \ge I(X; \Pi) \ge (1 - H(\epsilon))n.$$
This example comes from Ablayev, "Lower bounds for one-way probabilistic communication complexity and their application to space complexity," Theoretical Computer Science 157(2), pp. 139–159, 1996.
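To get a feel for the strength of this bound, here is a small sketch (our illustration, with a hypothetical choice of $n = 100$) that evaluates the lower bound $(1 - H(\epsilon))n$ for a few error rates; even at error $1/3$ the bound remains linear in $n$.

```python
import math

def binary_entropy(eps):
    """Binary entropy H(eps) in bits."""
    if eps in (0, 1):
        return 0.0
    return -eps * math.log2(eps) - (1 - eps) * math.log2(1 - eps)

# The one-way bound R >= (1 - H(eps)) * n, evaluated for n = 100.
n = 100
for eps in (0.01, 0.1, 1/3):
    print(f"eps = {eps:.2f}: at least {(1 - binary_entropy(eps)) * n:.1f} bits")
```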
2 Set Intersection Lower bound

We will now show an $\Omega(n)$ lower bound on the randomized two-party complexity of the set intersection problem. We will follow the proof of Bar-Yossef, Jayram, Kumar, and Sivakumar, "An information statistics approach to data stream and communication complexity," JCSS 68(4), pp. 702–732, 2004. We refer the reader there for full details.

Consider again a random variable $\Pi((X, Y), R_A, R_B)$ over transcripts. This variable depends on the distribution of inputs $(X, Y)$, Alice's random bits $R_A$, and Bob's random bits $R_B$. It is actually important for this framework that we work with private-coin complexity. Define the information cost of the protocol $\Pi$ as
$$IC(\Pi) = I((X, Y); \Pi).$$
The information cost of a function is then
$$IC(f) = \min_{\Pi \ \mathrm{correct}} IC(\Pi).$$

2.1 Direct sum

The idea of the proof is to show a direct sum theorem for the information cost measure: the information cost of set intersection must be $n$ times the information cost of the one-bit AND function. To do this we will use the following fact.

Fact 1. If $Z = (Z_1, \ldots, Z_n)$ are mutually independent then
$$I(Z; \Pi) \ge I(Z_1; \Pi) + \cdots + I(Z_n; \Pi).$$

We will define a distribution on inputs where $(X_1, Y_1), \ldots, (X_n, Y_n)$ are mutually independent. In this case,
$$I((X, Y); \Pi) \ge I((X_1, Y_1); \Pi) + \cdots + I((X_n, Y_n); \Pi).$$
The goal will now be to relate $I((X_i, Y_i); \Pi)$ to the information cost of the one-bit AND function.

The distribution we will use on $(X_i, Y_i)$ is $P(0,0) = 1/2$, $P(1,0) = 1/4$, $P(0,1) = 1/4$. Notice that this distribution only gives weight to inputs which evaluate to zero under AND. This is still OK: in the definition of information cost we minimize only over protocols that are correct on every input, so the trivial protocol which always outputs zero, although it works for this distribution, is excluded.

It will be useful to view this distribution another way, as a mixture of product distributions. Introduce another variable $D_i$ uniformly distributed over $\{0, 1\}$, and define
$$X_i = \begin{cases} 0 & \text{if } D_i = 0\\ \text{random bit} & \text{if } D_i = 1 \end{cases} \qquad\qquad Y_i = \begin{cases} \text{random bit} & \text{if } D_i = 0\\ 0 & \text{if } D_i = 1. \end{cases}$$
Now for a fixed value of $D_i$ we have a product distribution on $(X_i, Y_i)$, and the mixture of these two product distributions is the distribution defined above.
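As a sanity check (our own, using exact arithmetic), the sketch below expands the mixture over $D_i$ and confirms that it reproduces $P(0,0) = 1/2$, $P(1,0) = 1/4$, $P(0,1) = 1/4$, with no mass on $(1,1)$.

```python
from fractions import Fraction
from collections import defaultdict

half = Fraction(1, 2)
mixture = defaultdict(Fraction)

# D_i is a uniform bit; the "free" coordinate is a uniform bit, the other is 0.
for d in (0, 1):
    for bit in (0, 1):
        x, y = (0, bit) if d == 0 else (bit, 0)
        mixture[(x, y)] += half * half  # P(D_i = d) * P(random bit = bit)

print(dict(mixture))  # {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 1/4}
```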
The key to the direct sum property is the following claim.

Claim 2. $I((X_i, Y_i); \Pi \mid D) \ge IC(\mathrm{AND})$.

Proof. We design a protocol for AND by simulating the protocol for set intersection with the variables outside of $(X_i, Y_i)$ fixed. For notational convenience, let $i = 1$. Then
$$I((X_1, Y_1); \Pi \mid D) = \mathbb{E}_{d_2, \ldots, d_n}\big[I((X_1, Y_1); \Pi \mid D_1, D_2 = d_2, \ldots, D_n = d_n)\big].$$
We will show that each term of this expectation is at least $IC(\mathrm{AND})$, which then gives the claim. Consider a fixing $D_2 = d_2, \ldots, D_n = d_n$. We design a protocol for $\mathrm{AND}(A, B)$ using the protocol $\Pi$. As $D_j$ is fixed for $j = 2, \ldots, n$, the distribution over $(X_j, Y_j)$ is a product distribution, and so it can be sampled by Alice and Bob without communication using their private random bits. Alice and Bob then run the protocol $\Pi$ on the input $(A, B), (x_2, y_2), \ldots, (x_n, y_n)$. Since every other coordinate $(x_j, y_j)$ avoids $(1, 1)$ under this distribution, the sets intersect exactly when $\mathrm{AND}(A, B) = 1$, so the simulation is correct.

This claim gives that $IC(SI) \ge n \cdot IC(\mathrm{AND})$. Now we just have to show a lower bound on the information complexity of the AND function on one bit.

2.2 One-bit AND function

We want to lower bound
$$I((A, B); \Pi \mid D) = \tfrac{1}{2} I((A, B); \Pi \mid D = 0) + \tfrac{1}{2} I((A, B); \Pi \mid D = 1).$$
As these two terms are symmetric, we can focus on $I((A, B); \Pi \mid D = 1) = I(A; \Pi(A, 0) \mid D = 1)$. If the distributions on transcripts $\Pi(0, 0)$ and $\Pi(1, 0)$ are very different, then we will be able to determine $A$ by looking at the transcript, implying that the information complexity is large.

To do this, we will transform the problem from one about mutual information to one about a metric, the Hellinger distance. Instead of viewing $\Pi(a, b)$ as a probability distribution, consider instead $\Psi(a, b)$, the unit vector which is the entrywise square root of $\Pi(a, b)$. With this transformation, Hellinger distance simply becomes a scaled version of Euclidean distance:
$$h(\Psi_1, \Psi_2) = \tfrac{1}{\sqrt{2}}\, \|\Psi_1 - \Psi_2\|.$$
We will need three key properties of Hellinger distance and its relation to mutual information; see the above paper for proofs of these statements.

1. Mutual information and Hellinger distance: Let $u, v \in \{0,1\}^2$ be two inputs to AND, and let $U \in_R \{u, v\}$. As before let $\Psi(u)$ be the unit vector formed by the entrywise square root of $\Pi(u)$. Then
$$I(U; \Pi) \ge \tfrac{1}{2} \|\Psi(u) - \Psi(v)\|^2.$$

2. Soundness: If $\mathrm{AND}(u) \ne \mathrm{AND}(v)$ and $\Pi$ is a protocol with error at most $\epsilon$, then
$$\tfrac{1}{2} \|\Psi(u) - \Psi(v)\|^2 \ge 1 - 2\sqrt{\epsilon}.$$

3. Cut and paste: Let $u = (x, y)$, $v = (x', y')$ and $u' = (x, y')$, $v' = (x', y)$. Then
$$\|\Psi(u) - \Psi(v)\| = \|\Psi(u') - \Psi(v')\|.$$
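These properties can be observed numerically on a toy protocol. In the sketch below (our own construction, not from the paper) Alice sends her bit flipped with probability $p$ using her private coins, and Bob replies with the AND of the received bit and his own; the script verifies the cut-and-paste property for this protocol.

```python
import numpy as np

def transcript_dist(a, b, p=0.25):
    """Toy private-coin protocol: Alice sends her bit flipped with prob. p,
    Bob deterministically replies with AND of the received bit and b.
    Returns the distribution over transcripts (m1, m2)."""
    dist = {}
    for m1, q in ((a, 1 - p), (1 - a, p)):
        transcript = (m1, m1 & b)
        dist[transcript] = dist.get(transcript, 0.0) + q
    return dist

def psi(a, b, support):
    """Entrywise square root of Pi(a, b), as a vector over `support`."""
    d = transcript_dist(a, b)
    return np.array([np.sqrt(d.get(t, 0.0)) for t in support])

support = [(0, 0), (0, 1), (1, 0), (1, 1)]
Psi = {u: psi(*u, support) for u in [(0, 0), (0, 1), (1, 0), (1, 1)]}

# Cut and paste: swapping Bob's inputs across the two vectors preserves distance.
print(np.linalg.norm(Psi[(0, 0)] - Psi[(1, 1)]))  # ~1.0649
print(np.linalg.norm(Psi[(0, 1)] - Psi[(1, 0)]))  # same value
```

Both printed values agree, as cut and paste predicts, even though the two pairs of inputs are different.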
Given these three properties we can quickly finish the proof:
$$\begin{aligned}
2\big(\|\Psi(0,0) - \Psi(1,0)\|^2 + \|\Psi(0,0) - \Psi(0,1)\|^2\big)
&\ge \big(\|\Psi(0,0) - \Psi(1,0)\| + \|\Psi(0,0) - \Psi(0,1)\|\big)^2\\
&\ge \|\Psi(1,0) - \Psi(0,1)\|^2\\
&= \|\Psi(0,0) - \Psi(1,1)\|^2\\
&\ge 2(1 - 2\sqrt{\epsilon}),
\end{aligned}$$
where the first inequality follows by Cauchy-Schwarz, the second by the triangle inequality, the third step by cut and paste, and the last by soundness. Combining this with the first property,
$$I((A, B); \Pi \mid D) \ge \tfrac{1}{4}\big(\|\Psi(0,0) - \Psi(1,0)\|^2 + \|\Psi(0,0) - \Psi(0,1)\|^2\big) \ge \tfrac{1}{4}(1 - 2\sqrt{\epsilon}),$$
so the information cost of AND is bounded below by a positive constant for any constant error $\epsilon < 1/4$, and the direct sum gives the claimed $\Omega(n)$ bound for set intersection.
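Finally, the same toy protocol can be used to sanity-check the chain of inequalities. With flip probability $p = 0.01$ the protocol errs with probability at most $0.01$ (Bob's reply doubles as the output), and the sketch below (again our illustration) confirms that the left-hand side dominates $2(1 - 2\sqrt{\epsilon})$.

```python
import numpy as np

def transcript_dist(a, b, p=0.01):
    """Same toy protocol as above, with flip probability p = 0.01 (error <= 0.01)."""
    dist = {}
    for m1, q in ((a, 1 - p), (1 - a, p)):
        t = (m1, m1 & b)
        dist[t] = dist.get(t, 0.0) + q
    return dist

support = [(0, 0), (0, 1), (1, 0), (1, 1)]

def psi(a, b):
    d = transcript_dist(a, b)
    return np.array([np.sqrt(d.get(t, 0.0)) for t in support])

eps = 0.01
lhs = 2 * (np.linalg.norm(psi(0, 0) - psi(1, 0)) ** 2
           + np.linalg.norm(psi(0, 0) - psi(0, 1)) ** 2)
rhs = 2 * (1 - 2 * np.sqrt(eps))
print(lhs, rhs)  # ~3.24 >= 1.6
```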