1st Semester. Solutions to Homework Set #: Sanov's Theorem, Rate Distortion

1. Sanov's theorem: Prove the simple version of Sanov's theorem for binary random variables, i.e., let X_1, X_2, ..., X_n be a sequence of binary random variables, drawn i.i.d. according to the distribution

    Pr(X = 1) = q,  Pr(X = 0) = 1 − q.    (1)

Let the proportion of 1's in the sequence X_1, X_2, ..., X_n be p̄_{X^n}, i.e.,

    p̄_{X^n} = (1/n) Σ_{i=1}^n X_i.    (2)

By the law of large numbers, we would expect p̄_{X^n} to be close to q for large n. Sanov's theorem deals with the probability that p̄_{X^n} is far away from q. In particular, for concreteness, if we take p > q > 1/2, Sanov's theorem states that

    lim_{n→∞} −(1/n) log Pr{(X_1, X_2, ..., X_n) : p̄_{X^n} ≥ p} = p log(p/q) + (1 − p) log((1 − p)/(1 − q)) = D((p, 1 − p) ‖ (q, 1 − q)).    (3)

Justify the following steps:

    Pr{(X_1, X_2, ..., X_n) : p̄_{X^n} ≥ p} = Σ_{i=⌈np⌉}^{n} (n choose i) q^i (1 − q)^{n−i}.    (4)

Argue that the term corresponding to i = ⌈np⌉ is the largest term in the sum on the right-hand side of the last equation.

Show that this term is approximately 2^{−nD}.

Prove an upper bound on the probability in Sanov's theorem using the above steps. Use similar arguments to prove a lower bound and complete the proof of Sanov's theorem.
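Before working through the proof, the exponent claimed in (3) can be sanity-checked numerically. The sketch below is illustrative only: the parameters p = 0.7, q = 0.5 and the sample sizes are assumptions, not part of the problem. It computes the exact binomial tail probability in log space (to avoid underflow for large n) and compares the normalized exponent with D((p, 1−p) ‖ (q, 1−q)):

```python
import math

def kl_binary(p, q):
    """Binary divergence D((p, 1-p) || (q, 1-q)) in bits."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

def log2_tail(n, p, q):
    """log2 Pr{ (1/n) sum X_i >= p } for X_i i.i.d. Bernoulli(q).
    Computed in log space so the terms do not underflow for large n."""
    lq, l1q = math.log2(q), math.log2(1 - q)
    terms = [math.log2(math.comb(n, i)) + i * lq + (n - i) * l1q
             for i in range(math.ceil(n * p), n + 1)]
    m = max(terms)
    return m + math.log2(sum(2 ** (t - m) for t in terms))

p, q = 0.7, 0.5          # illustrative choice satisfying p > q > 1/2
for n in (100, 500, 2000):
    exponent = -log2_tail(n, p, q) / n
    print(f"n={n:5d}  -(1/n) log2 Pr = {exponent:.4f}   D = {kl_binary(p, q):.4f}")
```

As n grows, the normalized exponent approaches D(p‖q) ≈ 0.1187 bits from above, with a gap shrinking like (log n)/n, consistent with the polynomial factor n(1 − p) + 1 that appears in the upper bound derived below.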
Solution: Sanov's theorem. Since nX̄_n has a binomial distribution, we have

    Pr(nX̄_n = i) = (n choose i) q^i (1 − q)^{n−i},    (5)

and therefore

    Pr{(X_1, X_2, ..., X_n) : p̄_{X^n} ≥ p} = Σ_{i=⌈np⌉}^{n} (n choose i) q^i (1 − q)^{n−i}.    (6)

The ratio of successive terms in the sum is

    Pr(nX̄_n = i + 1) / Pr(nX̄_n = i) = [(n choose i+1) q^{i+1} (1 − q)^{n−i−1}] / [(n choose i) q^i (1 − q)^{n−i}] = (n − i)/(i + 1) · q/(1 − q).    (7)

This ratio is less than 1 if (n − i)/(i + 1) < (1 − q)/q, i.e., if i > nq − (1 − q). Thus, since p > q, the maximum of the terms in the sum occurs at i = ⌈np⌉. From Example 11.1.3,

    (n choose ⌈np⌉) ≐ 2^{nH(p)},    (8)

and hence the largest term in the sum is

    (n choose ⌈np⌉) q^{⌈np⌉} (1 − q)^{n−⌈np⌉} ≐ 2^{n(−p log p − (1−p) log(1−p)) + np log q + n(1−p) log(1−q)} = 2^{−nD(p‖q)}.    (9)

From the above results, it follows that

    Pr{(X_1, ..., X_n) : p̄_{X^n} ≥ p} = Σ_{i=⌈np⌉}^{n} (n choose i) q^i (1 − q)^{n−i}    (10)
        ≤ (n − ⌈np⌉ + 1) (n choose ⌈np⌉) q^{⌈np⌉} (1 − q)^{n−⌈np⌉}    (11)
        ≤ (n(1 − p) + 1) 2^{−nD(p‖q)},    (12)

where the first inequality follows from the fact that the sum is at most the largest term times the number of terms, and the second from (9). Taking the
logarithm, dividing by n, and taking the limit as n → ∞, we obtain

    lim_{n→∞} −(1/n) log Pr{(X_1, ..., X_n) : p̄_{X^n} ≥ p} ≥ D(p‖q).    (13)

Similarly, using the fact that the sum is larger than its largest term, we obtain

    Pr{(X_1, ..., X_n) : p̄_{X^n} ≥ p} = Σ_{i=⌈np⌉}^{n} (n choose i) q^i (1 − q)^{n−i}    (14)
        ≥ (n choose ⌈np⌉) q^{⌈np⌉} (1 − q)^{n−⌈np⌉}    (15)
        ≐ 2^{−nD(p‖q)},    (16)

and hence

    lim_{n→∞} −(1/n) log Pr{(X_1, ..., X_n) : p̄_{X^n} ≥ p} ≤ D(p‖q).    (17)

Combining these two results, we obtain the special case of Sanov's theorem

    lim_{n→∞} −(1/n) log Pr{(X_1, ..., X_n) : p̄_{X^n} ≥ p} = D(p‖q).    (18)

2. Strong typicality: Let X^n be drawn i.i.d. ~ P(x). Prove that for each x^n ∈ T_δ(X),

    2^{−n(H(X)+δ′)} ≤ P^n(x^n) ≤ 2^{−n(H(X)−δ′)}

for some δ′ = δ′(δ) that vanishes as δ → 0.

Solution: Strong typicality. From our familiar trick, we have

    P^n(x^n) = 2^{−n Σ_{a∈X} P_{x^n}(a) log(1/P(a))}.

But, since x^n ∈ T_δ(X), we have |P_{x^n}(a) − P(a)| < δ for all a. Therefore,

    H(X) − δ Σ_{a∈X} log(1/P(a)) < Σ_{a∈X} P_{x^n}(a) log(1/P(a)) < H(X) + δ Σ_{a∈X} log(1/P(a)).
Hence,

    2^{−n(H(X)+δ′)} ≤ P^n(x^n) ≤ 2^{−n(H(X)−δ′)},

where δ′ = δ Σ_{a∈X} log(1/P(a)).

3. Weak typicality vs. strong typicality: In this problem, we compare the weakly typical set A_ǫ(P) and the strongly typical set T_δ(P). To recall, the definitions of the two sets are as follows:

    A_ǫ(P) = { x^n ∈ X^n : |−(1/n) log P^n(x^n) − H(P)| ≤ ǫ },
    T_δ(P) = { x^n ∈ X^n : max_{a∈X} |P_{x^n}(a) − P(a)| ≤ δ }.

(a) Suppose P is such that P(a) > 0 for all a ∈ X. Then there is an inclusion relationship between the weakly typical set A_ǫ(P) and the strongly typical set T_δ(P) for an appropriate choice of ǫ. Which of the statements is true: A_ǫ(P) ⊆ T_δ(P) or A_ǫ(P) ⊇ T_δ(P)? What is the appropriate relation between δ and ǫ?

(b) Give a description of the sequences that belong to A_ǫ(P) vs. the sequences that belong to T_δ(P) when the source is uniformly distributed, i.e., P(a) = 1/|X| for all a ∈ X. (Assume δ is sufficiently small.)

(c) Can you explain why T_δ(P) is called the strongly typical set and A_ǫ(P) the weakly typical set?

Solution: Weak typicality vs. strong typicality.

(a) From Problem 2, we can see that if x^n ∈ T_δ(P), then

    |−(1/n) log P^n(x^n) − H(P)| ≤ δ Σ_{a∈X} log(1/P(a)).

Therefore, if ǫ > δ Σ_{a∈X} log(1/P(a)), then A_ǫ(P) ⊇ T_δ(P). To show that the inclusion does not hold in the other direction, see the next part.
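The inclusion established in part (a) can be verified exhaustively for a small alphabet and block length. A minimal sketch, in which the source P, the block length n = 8, and δ = 0.1 are illustrative assumptions: it enumerates all |X|^n sequences and checks that every strongly typical sequence is weakly typical with ǫ = δ Σ_a log(1/P(a)), while the reverse inclusion fails.

```python
import itertools
import math

P = {"a": 0.5, "b": 0.3, "c": 0.2}   # assumed example source
n, delta = 8, 0.1
eps = delta * sum(math.log2(1 / p) for p in P.values())
H = -sum(p * math.log2(p) for p in P.values())

def in_T(x):
    """Strong typicality: every letter's empirical frequency is within delta of P."""
    return all(abs(x.count(a) / n - P[a]) <= delta for a in P)

def in_A(x):
    """Weak typicality: per-letter log-likelihood is within eps of the entropy."""
    logp = sum(math.log2(P[a]) for a in x)
    return abs(-logp / n - H) <= eps

strong = [x for x in itertools.product(P, repeat=n) if in_T(x)]
weak = [x for x in itertools.product(P, repeat=n) if in_A(x)]
print(len(strong), len(weak))            # the strongly typical set is smaller
assert all(in_A(x) for x in strong)      # T_delta is contained in A_eps
assert any(not in_T(x) for x in weak)    # the reverse inclusion fails
```

For instance, the constant sequence "aaaaaaaa" lands in A_ǫ(P) here (its per-letter log-likelihood is 1 bit, within ǫ of H(P) ≈ 1.485) but its type is far from P, so it is not in T_δ(P).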
(b) When P is the uniform distribution, A_ǫ(P) includes every possible sequence x^n. To see this, note that P^n(x^n) = (1/|X|)^n, and thus −(1/n) log P^n(x^n) = log|X| = H(P). Hence x^n ∈ A_ǫ(P) for all x^n and every ǫ > 0. However, among the sequences in A_ǫ(P), only those whose type satisfies 1/|X| − δ ≤ P_{x^n}(a) ≤ 1/|X| + δ for each letter a ∈ X are in T_δ(P). Therefore, for sufficiently small δ, there always exist sequences that are in A_ǫ(P) but not in T_δ(P).

(c) From the above parts, we see that the strongly typical set is contained in the weakly typical set for an appropriate choice of δ and ǫ. Thus the definition of the strongly typical set is stronger than that of the weakly typical set.

4. Rate distortion for a uniform source with Hamming distortion: Consider a source X uniformly distributed on the set {1, 2, ..., m}. Find the rate distortion function for this source with Hamming distortion, i.e.,

    d(x, x̂) = 0 if x = x̂, and 1 if x ≠ x̂.

Solution: Rate distortion for a uniform source with Hamming distortion. X is uniformly distributed on the set {1, 2, ..., m}, and the distortion measure is as above. Consider any joint distribution that satisfies the distortion constraint D. Since the expected distortion is Pr(X ≠ X̂) ≤ D, Fano's inequality (together with the monotonicity of H(D) + D log(m − 1) for D ≤ 1 − 1/m) gives

    H(X|X̂) ≤ H(D) + D log(m − 1),    (19)

and hence

    I(X; X̂) = H(X) − H(X|X̂)    (20)
        ≥ log m − H(D) − D log(m − 1).    (21)
We can achieve this lower bound by choosing p(x̂) to be the uniform distribution and the conditional distribution p(x|x̂) to be

    p(x|x̂) = 1 − D if x̂ = x, and D/(m − 1) if x̂ ≠ x.    (22)

It is easy to verify that this gives the right distribution on X and satisfies the bound with equality for D ≤ 1 − 1/m. Hence

    R(D) = log m − H(D) − D log(m − 1) if 0 ≤ D ≤ 1 − 1/m, and 0 if D > 1 − 1/m.    (23)

5. Erasure distortion: Consider X ~ Bernoulli(1/2), and let the distortion measure be given by the matrix

    d(x, x̂) = [ 0  1  ∞ ; ∞  1  0 ],    (24)

with rows indexed by x ∈ {0, 1} and columns by x̂ ∈ {0, e, 1}. Calculate the rate distortion function for this source. Can you suggest a simple scheme to achieve any value of the rate distortion function for this source?

Solution: Erasure distortion. The infinite distortions constrain p(x = 0, x̂ = 1) = p(x = 1, x̂ = 0) = 0. Hence by symmetry the joint distribution of (X, X̂) is of the form shown in Figure 1: X = 0 maps to x̂ = 0 with probability 1 − a and to the erasure symbol e with probability a, and symmetrically X = 1 maps to x̂ = 1 with probability 1 − a and to e with probability a.

[Figure 1: Joint distribution for erasure rate distortion of a binary source.]
For this joint distribution, it is easy to calculate that the distortion is D = a and that I(X; X̂) = H(X) − H(X|X̂) = 1 − a. Hence we have R(D) = 1 − D for 0 ≤ D ≤ 1, and R(D) = 0 for D > 1.

It is very easy to see how we could achieve this rate distortion function. If D is rational, say k/n, then we send only the first n − k of any block of n bits. We reproduce these bits exactly and reproduce the remaining bits as erasures. Hence we can send information at rate 1 − D and achieve distortion D. If D is irrational, we can get arbitrarily close to D by using longer and longer block lengths.

6. Rate distortion: A memoryless source U is uniformly distributed on {0, ..., r − 1}. The distortion function is given by

    d(u, v) = 0 if u = v;  1 if u = v ± 1 mod r;  ∞ otherwise.

Show that the rate distortion function is

    R(D) = log r − h(D) − D if D ≤ 2/3, and log r − log 3 if D > 2/3.

Solution: Rate distortion. From the symmetry of the problem, we can assume the conditional distribution p(v|u) has the form

    p(v|u) = 1 − p if v = u;  p/2 if v = u ± 1 mod r;  0 otherwise.

Then E d(U, V) = p. Therefore, the rate distortion function is

    R(D) = min_{p ≤ D} I(U; V).

Now, we know that

    I(U; V) = H(V) − H(V|U)
            = log r − H(1 − p, p/2, p/2),
since U is uniform and, due to symmetry, V is also uniform. We know that H(1 − p, p/2, p/2) ≤ log 3, with equality when p = 2/3. Therefore,

    R(D) = log r − log 3, if D > 2/3.

Now, let us consider the case D ≤ 2/3. Denote

    f(p) = H(1 − p, p/2, p/2)
         = −(1 − p) log(1 − p) − p log(p/2)
         = −(1 − p) log(1 − p) − p log p + p.

We know that f(p) is a concave function. Differentiating with respect to p,

    df(p)/dp = log(1 − p) − log p + 1 = log( 2(1 − p)/p ),

and setting df(p)/dp = 0, we see that f(p) attains its maximum at p = 2/3. Therefore, for D ≤ 2/3, f(p) is increasing on [0, D], so the minimum of I(U; V) = log r − f(p) over p ≤ D is attained at p = D. Thus,

    R(D) = log r − h(D) − D, if D ≤ 2/3.

7. Adding a column to the distortion matrix: Let R(D) be the rate distortion function for an i.i.d. process with probability mass function p(x) and distortion function d(x, x̂), x ∈ X, x̂ ∈ X̂. Now suppose that we add a new reproduction symbol x̂_0 to X̂ with associated distortion d(x, x̂_0), x ∈ X. Can this increase R(D)? Explain.

Solution: Adding a column. Let the new rate distortion function be denoted R̃(D), and note that we can still achieve R(D) by restricting the support of p(x, x̂), i.e., by simply ignoring the new symbol. Thus R̃(D) ≤ R(D), and adding a column cannot increase the rate distortion function. Finally, note the duality to the earlier problem in which adding a row to the channel transition matrix could not decrease the capacity (Problem 7.).

8. Simplification: Suppose X = {1, 2, 3, 4}, X̂ = {1, 2, 3, 4}, p(i) = 1/4, i = 1, 2, 3, 4, and X_1, X_2, ... are i.i.d. ~ p(x). The distortion matrix
d(x, x̂) is given by

         x̂=1  x̂=2  x̂=3  x̂=4
    x=1   0    0    1    1
    x=2   0    0    1    1
    x=3   1    1    0    0
    x=4   1    1    0    0

(a) Find R(0), the rate necessary to describe the process with zero distortion.

(b) Find the rate distortion function R(D). (Hint: The distortion measure allows you to simplify the problem into one you have already seen.)

(c) Suppose we have the nonuniform distribution p(i) = p_i, i = 1, 2, 3, 4. What is R(D)?

Solution: Simplification.

(a) We can achieve zero distortion if we output X̂ = 1 when X = 1 or 2, and X̂ = 3 when X = 3 or 4. Thus if we set Y = 1 when X = 1 or 2, and Y = 2 when X = 3 or 4, we can recover Y exactly if the rate is greater than H(Y) = 1 bit. It is also not hard to see that any zero-distortion code must recover Y exactly, and thus R(0) = 1.

(b) If we define Y as in the previous part, and Ŷ similarly from X̂, we can see that the distortion between X and X̂ equals the Hamming distortion between Y and Ŷ. Therefore, if the rate is greater than the Hamming rate distortion function for Y, we can recover X to distortion D. Thus R(D) = 1 − H(D) for 0 ≤ D ≤ 1/2, and 0 otherwise.

(c) If the distribution of X is not uniform, the same arguments hold: Y has distribution (p_1 + p_2, p_3 + p_4), and the rate distortion function is

    R(D) = H(p_1 + p_2) − H(D).

9. Rate distortion for two independent sources: Can one simultaneously compress two independent sources better than compressing the sources individually? The following problem addresses this question. Let the pairs {(X_i, Y_i)} be i.i.d. ~ p(x, y). The distortion measure for X is d(x, x̂) and its rate distortion function is R_X(D). Similarly, the
distortion measure for Y is d(y, ŷ) and its rate distortion function is R_Y(D). Suppose we now wish to describe the process {(X_i, Y_i)} subject to the distortion constraints lim_{n→∞} E d(X^n, X̂^n) ≤ D_1 and lim_{n→∞} E d(Y^n, Ŷ^n) ≤ D_2. Our rate distortion theorem can be shown to extend naturally to this setting and imply that the minimum rate required to achieve these distortions is given by

    R_{X,Y}(D_1, D_2) = min_{p(x̂,ŷ|x,y): E d(X,X̂) ≤ D_1, E d(Y,Ŷ) ≤ D_2} I(X, Y; X̂, Ŷ).

Now suppose the {X_i} process and the {Y_i} process are independent of each other.

(a) Show that

    R_{X,Y}(D_1, D_2) ≥ R_X(D_1) + R_Y(D_2).

(b) Does equality hold? Now answer the question.

Solution: Rate distortion for two independent sources.

(a) Given that X and Y are independent, we have

    p(x, y, x̂, ŷ) = p(x) p(y) p(x̂, ŷ|x, y).    (25)

Then

    I(X, Y; X̂, Ŷ) = H(X, Y) − H(X, Y|X̂, Ŷ)    (26)
        = H(X) + H(Y) − H(X|X̂, Ŷ) − H(Y|X, X̂, Ŷ)    (27)
        ≥ H(X) + H(Y) − H(X|X̂) − H(Y|Ŷ)    (28)
        = I(X; X̂) + I(Y; Ŷ),    (29)

where the inequality follows from the fact that conditioning reduces entropy. Therefore

    R_{X,Y}(D_1, D_2) = min_{p(x̂,ŷ|x,y): E d(X,X̂) ≤ D_1, E d(Y,Ŷ) ≤ D_2} I(X, Y; X̂, Ŷ)    (30)
        ≥ min_{p(x̂,ŷ|x,y): E d(X,X̂) ≤ D_1, E d(Y,Ŷ) ≤ D_2} ( I(X; X̂) + I(Y; Ŷ) )    (31)
        = min_{p(x̂|x): E d(X,X̂) ≤ D_1} I(X; X̂) + min_{p(ŷ|y): E d(Y,Ŷ) ≤ D_2} I(Y; Ŷ)    (32)
        = R_X(D_1) + R_Y(D_2).    (33)
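The key inequality in part (a), I(X, Y; X̂, Ŷ) ≥ I(X; X̂) + I(Y; Ŷ), holds for every test channel p(x̂, ŷ|x, y) once X and Y are independent. A minimal numerical sketch over random binary distributions (the alphabets, trial count, and random seed are illustrative assumptions):

```python
import itertools
import math
import random

random.seed(1)

def mi(joint):
    """Mutual information (bits) of a joint pmf given as {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def rand_pmf(k):
    w = [random.random() + 1e-9 for _ in range(k)]
    s = sum(w)
    return [v / s for v in w]

A = (0, 1)
min_gap = float("inf")
for _ in range(200):
    px, py = rand_pmf(2), rand_pmf(2)                           # independent X, Y
    cond = {xy: rand_pmf(4) for xy in itertools.product(A, A)}  # p(xh, yh | x, y)
    jxy, jx, jy = {}, {}, {}
    for x, y in itertools.product(A, A):
        for k, (xh, yh) in enumerate(itertools.product(A, A)):
            p = px[x] * py[y] * cond[(x, y)][k]
            jxy[((x, y), (xh, yh))] = p
            jx[(x, xh)] = jx.get((x, xh), 0.0) + p
            jy[(y, yh)] = jy.get((y, yh), 0.0) + p
    min_gap = min(min_gap, mi(jxy) - mi(jx) - mi(jy))

print("smallest I(X,Y;Xh,Yh) - I(X;Xh) - I(Y;Yh) over 200 trials:", min_gap)
assert min_gap >= -1e-9
```

The gap is nonnegative on every trial, as (26)-(29) predict; a random search like this is only a spot check, not a proof.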
(b) If

    p(x, y, x̂, ŷ) = p(x) p(y) p(x̂|x) p(ŷ|y),    (34)

then

    I(X, Y; X̂, Ŷ) = H(X, Y) − H(X, Y|X̂, Ŷ)    (35)
        = H(X) + H(Y) − H(X|X̂, Ŷ) − H(Y|X, X̂, Ŷ)    (36)
        = H(X) + H(Y) − H(X|X̂) − H(Y|Ŷ)    (37)
        = I(X; X̂) + I(Y; Ŷ).    (38)

Let p(x, x̂) be a distribution that achieves the rate distortion R_X(D_1) at distortion D_1, and let p(y, ŷ) be a distribution that achieves the rate distortion R_Y(D_2) at distortion D_2. Then for the product distribution p(x, y, x̂, ŷ) = p(x, x̂) p(y, ŷ), whose component distributions achieve the points (D_1, R_X(D_1)) and (D_2, R_Y(D_2)), the mutual information corresponding to the product distribution is R_X(D_1) + R_Y(D_2). Thus

    R_{X,Y}(D_1, D_2) = min_{p(x̂,ŷ|x,y): E d(X,X̂) ≤ D_1, E d(Y,Ŷ) ≤ D_2} I(X, Y; X̂, Ŷ) = R_X(D_1) + R_Y(D_2).    (39)

Thus, by using the product distribution, we can achieve the sum of the rates, and equality holds. Therefore the total rate at which we encode two independent sources together with distortions D_1 and D_2 is the same as if we encoded each of them separately.

10. One-bit quantization of a single Gaussian random variable: Let X ~ N(0, σ²) and let the distortion measure be squared error. Here we do not allow block descriptions. Show that the optimum reproduction points for one-bit quantization are ±√(2/π) σ, and that the expected distortion for one-bit quantization is (1 − 2/π)σ². Compare this with the distortion rate bound D = σ² 2^{−2R} for R = 1.

Solution: One-bit quantization of a single Gaussian random variable. With one-bit quantization, the obvious reconstruction regions are the positive and negative real axes. The reconstruction point is the centroid
of each region. For example, for the positive real line, the centroid a is given by

    a = ∫_0^∞ x · (2/√(2πσ²)) e^{−x²/(2σ²)} dx
      = σ √(2/π) ∫_0^∞ e^{−y} dy
      = σ √(2/π),

using the substitution y = x²/(2σ²). The expected distortion for one-bit quantization is

    D = ∫_{−∞}^0 (x + σ√(2/π))² (1/√(2πσ²)) e^{−x²/(2σ²)} dx
        + ∫_0^∞ (x − σ√(2/π))² (1/√(2πσ²)) e^{−x²/(2σ²)} dx
      = 2 ∫_0^∞ ( x² + (2/π)σ² ) (1/√(2πσ²)) e^{−x²/(2σ²)} dx
        − 4 ∫_0^∞ x σ√(2/π) (1/√(2πσ²)) e^{−x²/(2σ²)} dx
      = σ² + (2/π)σ² − (4/π)σ²
      = (1 − 2/π) σ²
      ≈ 0.3634 σ²,

which is much larger than the distortion rate bound D(1) = σ²/4.

11. Side information: A memoryless source generates i.i.d. pairs of random variables (U_i, S_i), i = 1, 2, ..., on finite alphabets, according to

    p(u^n, s^n) = ∏_{i=1}^n p(u_i, s_i).
We are interested in describing U when S, called side information, is known both at the encoder and at the decoder. Prove that the rate distortion function is given by

    R_{U|S}(D) = min_{p(v|u,s): E[d(U,V)] ≤ D} I(U; V|S).

Compare R_{U|S}(D) with the ordinary rate distortion function R_U(D) without any side information. What can you say about the influence of the correlation between U and S on R_{U|S}(D)?

Solution: Side information. Since both encoder and decoder have the common side information S, we can be helped by the correlation between our source U and the side information S. The idea for achievability is to use a different rate distortion code for the source symbols corresponding to each value of the side information. Thus, the symbols with side information S = s are covered by a rate distortion code of rate

    R_{U|S=s}(D) = min_{p(v|u,S=s): E[d(U,V)] ≤ D} I(U; V|S = s)

and expected distortion D. The total rate is then

    R = Σ_{s∈S} P_{s^n}(s) R_{U|S=s}(D),

where P_{s^n}(s) is the empirical distribution of S^n. As n becomes large, P_{s^n}(s) will be very close to the true distribution p(s); that is, with high probability S^n will be in the strongly typical set T_P(δ), where

    T_P(δ) = { s^n : max_{s∈S} |P_{s^n}(s) − p(s)| < δ/|S| }.

Therefore, for n sufficiently large and δ sufficiently small, we can achieve the rate

    R = Σ_{s∈S} p(s) I(U; V|S = s) = I(U; V|S).
It is clear that the overall rate distortion code has expected distortion D. For the converse, we have

    nR ≥ H(V^n|S^n)
       ≥ I(U^n; V^n|S^n)
       = H(U^n|S^n) − H(U^n|V^n, S^n)
       = Σ_{i=1}^n ( H(U_i|S_i) − H(U_i|U^{i−1}, V^n, S^n) )
       ≥ Σ_{i=1}^n ( H(U_i|S_i) − H(U_i|V_i, S_i) )
       = Σ_{i=1}^n I(U_i; V_i|S_i)
       ≥ Σ_{i=1}^n R_{U|S}(E d(U_i, V_i))
       ≥ n R_{U|S}(E d(U^n, V^n))
       ≥ n R_{U|S}(D),

where each step has the same justification as in the original converse. Now, to compare R_{U|S}(D) and R_U(D), consider the following:

    R_{U|S}(D) = min_{p(v|u,s): E[d(U,V)] ≤ D} I(U; V|S)
        ≤ min_{p(v|u): E[d(U,V)] ≤ D} I(U; V|S)    (40)
        ≤ min_{p(v|u): E[d(U,V)] ≤ D} I(U; V)    (41)
        = R_U(D).

Here, the inequality in (40) follows from the fact that the minimization is over a smaller set. The inequality in (41) follows from the fact that S → U → V forms a Markov chain under the joint distribution obtained from the minimization in (40). Also, we have

    I(U, S; V) = I(U; V) + I(S; V|U) = I(S; V) + I(U; V|S).
Since I(S; V|U) = 0 and I(S; V) ≥ 0, we have the inequality in (41). Therefore, we can only gain from the side information. Intuitively, this also makes sense: we can always just ignore the side information and use the original rate distortion code, but the resulting rate may be worse. The two rates are equal if and only if U and S are independent, which achieves equality in both (40) and (41).
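The fact used at the end of the solution, that I(U; V|S) ≤ I(U; V) whenever S → U → V forms a Markov chain, can likewise be spot-checked numerically. A minimal sketch over random binary joint distributions (the alphabets, trial count, and random seed are illustrative assumptions):

```python
import itertools
import math
import random

random.seed(2)

def rand_pmf(k):
    w = [random.random() + 1e-9 for _ in range(k)]
    s = sum(w)
    return [v / s for v in w]

def mi(joint):
    """Mutual information (bits) of a joint pmf given as {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

B = (0, 1)
for _ in range(200):
    pus = dict(zip(itertools.product(B, B), rand_pmf(4)))  # joint p(u, s)
    pvu = {u: rand_pmf(2) for u in B}                      # p(v|u): S -> U -> V Markov
    # I(U;V) from the (u, v) marginal
    juv = {}
    for (u, s), w in pus.items():
        for v in B:
            juv[(u, v)] = juv.get((u, v), 0.0) + w * pvu[u][v]
    i_uv = mi(juv)
    # I(U;V|S) = sum_s p(s) I(U;V | S=s), with p(u,v|s) = p(u|s) p(v|u)
    i_uv_s = 0.0
    for s in B:
        ps = sum(w for (u, t), w in pus.items() if t == s)
        cond = {(u, v): pus[(u, s)] / ps * pvu[u][v] for u in B for v in B}
        i_uv_s += ps * mi(cond)
    assert i_uv_s <= i_uv + 1e-9   # conditioning on S cannot increase I(U;V) here

print("checked I(U;V|S) <= I(U;V) on 200 random S-U-V chains")
```

Note the direction of the inequality relies on the Markov structure I(S; V|U) = 0; for a general p(v|u, s) it can fail, which is exactly why side information helps.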