EE5139R: Problem Set 3. Assigned: 24/08/16, Due: 31/08/16.

1. Cover and Thomas, Problem 2.30 (Maximum Entropy):

Solution: We are required to maximize $H(P_X)$ over all distributions $P_X$ on the non-negative integers satisfying
\[ \sum_{n=0}^{\infty} n P_X(n) = A \]
and also the normalization constraint $\sum_{n=0}^{\infty} P_X(n) = 1$ (which we ignore without loss of generality). Now, construct the Lagrangian
\[ L(P_X, \lambda) = -\sum_{n=0}^{\infty} P_X(n) \log P_X(n) + \lambda \left( \sum_{n=0}^{\infty} n P_X(n) - A \right). \]
Differentiating with respect to $P_X(n)$ (assuming natural logs and interchanging differentiation and the infinite sum), we obtain
\[ -\log P_X(n) - 1 + \lambda n = 0, \]
so we have
\[ P_X(n) = \exp(-1 + \lambda n), \quad n \ge 0. \]
We immediately recognize that this is a geometric distribution with mean $A$, i.e., $P_X$ can be written alternatively as
\[ P_X(n) = (1-p)^n p, \quad n \ge 0, \]
where
\[ A = \frac{1-p}{p}. \]
From direct calculations, the entropy is
\[ H(P_X) = \frac{H_b(p)}{p}. \]

2. (Optional): Cover and Thomas, Problem 2.38 (The Value of a Question):

Solution:
\[ H(X) - H(X|Y) = I(X;Y) = H(Y) - H(Y|X) = H_b(\alpha) - H(Y|X) = H_b(\alpha), \]
since $H(Y|X) = 0$.
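The two closed forms in Problem 1 can be sanity-checked numerically. The sketch below (an illustration, not part of the original solutions; function names are mine) verifies that the geometric pmf $(1-p)^n p$ has mean $(1-p)/p$ and entropy $H_b(p)/p$:

```python
import math

def geometric_stats(p, tol=1e-15):
    """Mean and entropy (in bits) of the geometric pmf P(n) = (1-p)^n * p,
    summed until the remaining tail mass is below tol."""
    mean, entropy, n, tail = 0.0, 0.0, 0, 1.0
    while tail > tol:
        prob = (1 - p) ** n * p
        mean += n * prob
        entropy -= prob * math.log2(prob)
        tail -= prob
        n += 1
    return mean, entropy

def Hb(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.25
mean, ent = geometric_stats(p)
print(mean, (1 - p) / p)   # mean should equal A = (1-p)/p = 3
print(ent, Hb(p) / p)      # entropy should equal H_b(p)/p
```

The agreement of the truncated sums with $(1-p)/p$ and $H_b(p)/p$ illustrates the claimed identities.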
3. Fano's inequality for list decoding: Recall the proof of Fano's inequality. Now develop a generalization of Fano's inequality for list decoding. Let $(X, Y) \sim P_{XY}$ and let $L(Y) \subseteq \mathcal{X}$ be a set of size $L$ (compare this to an estimator $\hat{X}(Y) \in \mathcal{X}$, which is a set of size $L = 1$). Lower bound the probability of error $\Pr(X \notin L(Y))$ in terms of $L$, $H(X \,|\, L(Y))$ and $|\mathcal{X}|$. You should be able to recover the standard Fano inequality if you set $L = 1$.

Solution: Define the error random variable
\[ E = \begin{cases} 1 & X \notin L(Y) \\ 0 & X \in L(Y). \end{cases} \]
Now consider
\[ H(X, E \,|\, L(Y)) = H(X \,|\, E, L(Y)) + H(E \,|\, L(Y)) = H(E \,|\, X, L(Y)) + H(X \,|\, L(Y)). \]
Let $P_e := \Pr(X \notin L(Y))$. Now clearly, $H(E \,|\, X, L(Y)) = 0$, and $H(E \,|\, L(Y)) \le H(E) = H_b(P_e)$. Now, we examine the term $H(X \,|\, E, L(Y))$. We have
\[ H(X \,|\, E, L(Y)) = \Pr(E = 0) H(X \,|\, E = 0, L(Y)) + \Pr(E = 1) H(X \,|\, E = 1, L(Y)) \le (1 - P_e) \log L + P_e \log(|\mathcal{X}| - L), \]
since if we know that $E = 0$, the number of values that $X$ can take on is no more than $L$, and if $E = 1$, the number of values that $X$ can take on is no more than $|\mathcal{X}| - L$. Putting everything together and upper bounding $H_b(P_e)$ by $1$, we have
\[ P_e \ge \frac{H(X \,|\, L(Y)) - \log L - 1}{\log \frac{|\mathcal{X}| - L}{L}}. \]

4. (Optional): Data Processing Inequality for KL Divergence: Let $P_X, Q_X$ be pmfs on the same alphabet $\mathcal{X}$. Assume for the sake of simplicity that $P_X(x), Q_X(x) > 0$ for all $x \in \mathcal{X}$. Let $W(y|x) = \Pr(Y = y \,|\, X = x)$ be a channel from $\mathcal{X}$ to $\mathcal{Y}$. Define
\[ P_Y(y) = \sum_x W(y|x) P_X(x), \quad \text{and} \quad Q_Y(y) = \sum_x W(y|x) Q_X(x). \]
Show that
\[ D(P_X \| Q_X) \ge D(P_Y \| Q_Y). \]
You may use the log-sum inequality. This problem shows that processing does not increase divergence.

Solution: Starting from the definition of $D(P_Y \| Q_Y)$, we have
\begin{align*}
D(P_Y \| Q_Y) &= \sum_y P_Y(y) \log \frac{P_Y(y)}{Q_Y(y)} \\
&= \sum_y \left( \sum_x W(y|x) P_X(x) \right) \log \frac{\sum_x W(y|x) P_X(x)}{\sum_x W(y|x) Q_X(x)} \\
&\le \sum_y \sum_x W(y|x) P_X(x) \log \frac{W(y|x) P_X(x)}{W(y|x) Q_X(x)} \\
&= \sum_y \sum_x W(y|x) P_X(x) \log \frac{P_X(x)}{Q_X(x)} \\
&= \sum_x P_X(x) \log \frac{P_X(x)}{Q_X(x)} = D(P_X \| Q_X),
\end{align*}
where the inequality follows from the log-sum inequality.
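The data-processing inequality of Problem 4 is easy to probe numerically. The following sketch (an illustration, not part of the original solutions; all names are mine) draws a random pmf pair and a random channel and checks $D(P_X \| Q_X) \ge D(P_Y \| Q_Y)$:

```python
import random
from math import log2

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def kl(p, q):
    """KL divergence in bits, assuming full support."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q))

random.seed(0)
nx, ny = 4, 3
P = normalize([random.random() for _ in range(nx)])
Q = normalize([random.random() for _ in range(nx)])
# W[x][y] = Pr(Y = y | X = x): each row is a conditional pmf
W = [normalize([random.random() for _ in range(ny)]) for _ in range(nx)]

# push both input pmfs through the channel
PY = [sum(W[x][y] * P[x] for x in range(nx)) for y in range(ny)]
QY = [sum(W[x][y] * Q[x] for x in range(nx)) for y in range(ny)]

print(kl(P, Q), kl(PY, QY))
assert kl(P, Q) >= kl(PY, QY)  # processing does not increase divergence
```

Rerunning with different seeds and alphabet sizes never violates the inequality, as the log-sum argument guarantees.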
5. Typical-Set Calculations 1:

(a) Suppose a DMS emits h and t with probability 1/2 each. For $\varepsilon = 0.01$ and $n = 5$, what is $A_\varepsilon^{(n)}$?

Solution: In this case, $H(X) = 1$. All source sequences are equally likely, each with probability $2^{-5} = 2^{-nH(X)}$. Hence, all sequences satisfy the condition for being typical,
\[ 2^{-n(H(X)+\varepsilon)} \le p_{X^n}(x^n) \le 2^{-n(H(X)-\varepsilon)}, \]
for any $\varepsilon > 0$. Hence, all 32 sequences are typical.

(b) Repeat if $\Pr(h) = 0.2$, $\Pr(t) = 0.8$, $n = 5$, and $\varepsilon = 0.0001$.

Solution: Consider a sequence with $m$ heads and $n - m$ tails. Then, the probability of occurrence of this sequence is $p^m (1-p)^{n-m}$, where $p = \Pr(h)$. For such a sequence to be typical,
\[ 2^{-n(H(X)+\varepsilon)} \le p^m (1-p)^{n-m} \le 2^{-n(H(X)-\varepsilon)}, \]
which translates to
\[ \left| \left( \frac{m}{n} - p \right) \log \frac{1-p}{p} \right| \le \varepsilon. \]
Plugging in the value of $p = 0.2$, we get
\[ \left| \frac{m}{5} - \frac{1}{5} \right| \le \frac{\varepsilon}{2}. \]
Since $m = 0, \ldots, 5$, this condition will be satisfied for the given $\varepsilon$ only for $m = 1$, i.e., when there is one H in the sequence. Thus,
\[ A_\varepsilon^{(n)} = \{(HTTTT), (THTTT), (TTHTT), (TTTHT), (TTTTH)\}. \]

6. Typical-Set Calculations 2: Consider a DMS with a two symbol alphabet $\{a, b\}$ where $p_X(a) = 2/3$ and $p_X(b) = 1/3$. Let $X^n = (X_1, \ldots, X_n)$ be a string of chance variables from the source with $n = 100{,}000$.

(a) Let $W(X_j)$ be the log pmf random variable for the $j$-th source output, i.e., $W(X_j) = -\log 2/3$ for $X_j = a$ and $-\log 1/3$ for $X_j = b$. Find the variance of $W(X_j)$.

Solution: For notational convenience, we will denote the log pmf random variable by $W$. Now, note that $W$ takes on the value $-\log 2/3$ with probability 2/3 and $-\log 1/3$ with probability 1/3. Hence,
\[ \mathrm{Var}(W) = E[W^2] - E[W]^2 = \frac{2}{9}. \]

(b) For $\varepsilon = 0.01$, evaluate the bound on the probability of the typical set using $\Pr(X^n \notin A_\varepsilon^{(n)}) \le \sigma_W^2/(n\varepsilon^2)$.

Solution: The bound on the typical set, as derived using Chebyshev's inequality, is
\[ \Pr(X^n \notin A_\varepsilon^{(n)}) \le \frac{\sigma_W^2}{n \varepsilon^2}. \]
Substituting the values of $n = 10^5$ and $\varepsilon = 0.01$, we obtain
\[ \Pr(X^n \in A_\varepsilon^{(n)}) \ge 1 - \frac{1}{45} = \frac{44}{45}. \]
Loosely speaking, this means that if we were to look at sequences of length 100,000 generated from our DMS, more than 97% of the time the sequence will be typical.
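Looking back at Problem 5(b), the typicality condition there is small enough to check by brute force. The sketch below (an illustration, not part of the original solutions) enumerates all $2^5$ sequences and confirms that exactly the five one-head sequences are typical:

```python
from itertools import product
from math import log2

# Parameters from Problem 5(b)
p, n, eps = 0.2, 5, 0.0001
H = -p * log2(p) - (1 - p) * log2(1 - p)  # source entropy in bits

typical = []
for seq in product("ht", repeat=n):
    m = seq.count("h")
    prob = p**m * (1 - p)**(n - m)
    # typicality: empirical log pmf per symbol within eps of H(X)
    if abs(-log2(prob) / n - H) <= eps:
        typical.append("".join(seq))

print(len(typical))  # 5
assert all(s.count("h") == 1 for s in typical)
```

Only $m = 1$ survives, matching the hand calculation.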
(c) Let $N_a$ be the number of a's in the string $X^n = (X_1, \ldots, X_n)$. The random variable (rv) $N_a$ is the sum of $n$ iid rv's. Show what these rv's are.

Solution: The rv $N_a$ is the sum of $n$ iid rv's $Y_i$, i.e., $N_a = \sum_{i=1}^n Y_i$, where the $Y_i$'s are Bernoulli with $\Pr(Y_i = 1) = 2/3$.

(d) Express the rv $W(X^n)$ as a function of the rv $N_a$. Note how this depends on $n$.

Solution: The probability of a particular sequence $x^n$ with $N_a$ a's is $(2/3)^{N_a} (1/3)^{n - N_a}$. Hence,
\[ W(X^n) = -\log p_{X^n}(x^n) = -\log\left[(2/3)^{N_a} (1/3)^{n - N_a}\right] = n \log 3 - N_a. \]

(e) Express the typical set in terms of bounds on $N_a$ (i.e., $A_\varepsilon^{(n)} = \{x^n : \alpha < N_a < \beta\}$) and calculate $\alpha$ and $\beta$.

Solution: For a sequence $x^n$ to be typical, it must satisfy
\[ \left| -\frac{1}{n} \log p_{X^n}(x^n) - H(X) \right| < \varepsilon. \]
From (a) the source entropy is $H(X) = E[W(X)] = \log 3 - 2/3$, and substituting in $\varepsilon$ and $W(X^n)$ from part (d), we get
\[ \left| \frac{N_a}{n} - \frac{2}{3} \right| < 0.01. \]
Note the intuitive appeal of this condition! It says that for a sequence to be typical, the proportion of a's in that sequence must be very close to the probability that the DMS generates an a. Plugging in the value of $n$ in the above equation, we get the bounds
\[ 65{,}667 \le N_a \le 67{,}666. \]

(f) Find the mean and variance of $N_a$. Approximate $\Pr(X^n \in A_\varepsilon^{(n)})$ by the central limit theorem approximation, i.e., evaluate $\Pr(X^n \in A_\varepsilon^{(n)})$ assuming that $N_a$ is Gaussian with the mean and variance of the actual $N_a$. Recall that for a sequence of iid rvs $C_1, \ldots, C_n$, the central limit theorem asserts that
\[ \Pr\left( \frac{1}{\sqrt{n}} \sum_{i=1}^n (C_i - \mu_C) \le t \right) \to \Phi\left( \frac{t}{\sigma_C} \right), \]
where $\mu_C$ and $\sigma_C$ are the mean and standard deviation of the $C_i$'s and $\Phi(z) = \int_{-\infty}^z \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right) du$ is the cdf of the standard Gaussian.

Solution: $N_a$ is a binomial r.v. (which is a sum of independent Bernoulli r.v.'s, as we have shown in part (c)).
The mean and variance are
\[ E[N_a] = \frac{2}{3} \times 10^5, \quad \mathrm{Var}(N_a) = \frac{2}{9} \times 10^5. \]
Note that we can calculate the exact probability of the typical set $A_\varepsilon^{(n)}$:
\[ \Pr(A_\varepsilon^{(n)}) = \Pr(65{,}667 \le N_a \le 67{,}666) = \sum_{N_a = 65{,}667}^{67{,}666} \binom{10^5}{N_a} \left( \frac{2}{3} \right)^{N_a} \left( \frac{1}{3} \right)^{10^5 - N_a}. \]
But this is computationally intensive, so we approximate $\Pr(A_\varepsilon^{(n)})$ with the central limit theorem. We can use the CLT because $N_a$ is the sum of $n$ iid r.v.'s, so in the limit of large $n$ its cumulative distribution approaches that of a Gaussian r.v. with the mean and variance of $N_a$:
\[ \Pr(65{,}667 \le N_a \le 67{,}666) \approx \int_\alpha^\beta \frac{1}{\sqrt{2\pi \mathrm{Var}(N_a)}} \exp\left( -\frac{(x - E[N_a])^2}{2 \mathrm{Var}(N_a)} \right) dx = \Phi(6.704) - \Phi(-6.706),
\]
where $\Phi(x)$ is the integral of the unit Gaussian density over $(-\infty, x)$. Thus the CLT approximation tells us that approximately all of the sequences we observe from the output of the DMS will be typical, whereas Chebyshev gave us a bound that more than 97% of the sequences that we observe will be typical.

7. (Optional): Typical-Set Calculations 3: For the random variables in the previous problem, find $\Pr(N_a = i)$ for $i = 0, 1, 2$. Find the probability of each individual string $x^n$ for those values of $i$. Find the particular string $x^n$ that has maximum probability over all sample values of $X^n$. What are the next most probable $n$-strings? Give a brief discussion of why the most probable $n$-strings are not regarded as typical strings.

Solution: We know from the previous problem that
\[ \Pr(N_a = i) = \binom{10^5}{i} \left( \frac{2}{3} \right)^i \left( \frac{1}{3} \right)^{10^5 - i}. \]
For $i = 0, 1, 2$, $\Pr(N_a = i)$ is approximately zero. The string with the maximal probability is the string with all a's. The next most probable strings are the sequences with $n - 1$ a's and one b, and so forth. From the definition of the typical set, we see that the typical set is a fairly small set which contains most of the probability, and the probability of each sequence in the typical set is almost the same. The most probable sequences and the least probable sequences lie in the tails of the distribution of the sample mean of the log pmf (they are the furthest from the mean), so they are not regarded as typical strings. In fact, the aggregate probability of all the most likely sequences and all the least likely sequences is very small. The only case where the most likely sequence is regarded as typical is when every sequence is typical and every sequence is most likely (as in problem Typical-Set Calculations 1). However, this is not the case in general. As we have seen in problem Typical-Set Calculations 2, for very long sequences a typical sequence will contain roughly the same proportion of each symbol as the probability of that symbol.
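Both the CLT approximation in part (f) and the vanishing probabilities in Problem 7 can be computed directly. The sketch below is an illustration, not part of the original solutions:

```python
from math import comb, erf, log10, sqrt

def Phi(z):
    """Standard Gaussian cdf."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 10**5, 2 / 3
mean, var = n * p, n * p * (1 - p)

# (i) CLT approximation for Pr(65,667 <= N_a <= 67,666), N_a ~ Binomial(n, p)
approx = Phi((67_666 - mean) / sqrt(var)) - Phi((65_667 - mean) / sqrt(var))
print(approx)  # essentially 1: almost every observed sequence is typical

# (ii) Problem 7: Pr(N_a = i) for i = 0, 1, 2, computed in log10 to avoid
# floating-point underflow -- the most probable individual strings belong
# to an event of astronomically small total probability
for i in range(3):
    log_prob = log10(comb(n, i)) + i * log10(p) + (n - i) * log10(1 - p)
    print(i, log_prob)  # on the order of -47,700: effectively zero
```

This makes the discussion concrete: the all-a string maximizes the per-string probability, yet the event $N_a \approx n$ carries essentially no probability mass.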
(Optional): AEP and Mutual Information: Let $(X_i, Y_i)$ be i.i.d. $\sim p_{X,Y}(x, y)$. We form the log-likelihood ratio of the hypothesis that $X$ and $Y$ are independent vs the hypothesis that $X$ and $Y$ are dependent. What is the limit of
\[ \frac{1}{n} \log \frac{p_{X^n}(X^n)\, p_{Y^n}(Y^n)}{p_{X^n, Y^n}(X^n, Y^n)}? \]
What is the limit of $\frac{p_{X^n}(X^n)\, p_{Y^n}(Y^n)}{p_{X^n, Y^n}(X^n, Y^n)}$ if $X_i$ and $Y_i$ are independent for all $i$?

Solution: Let
\[ L = \frac{1}{n} \log \frac{p_{X^n}(X^n)\, p_{Y^n}(Y^n)}{p_{X^n, Y^n}(X^n, Y^n)}. \]
Since the $(X_i, Y_i)$ are i.i.d. $\sim p_{X,Y}(x, y)$, we have
\[ L = \frac{1}{n} \sum_{i=1}^n \underbrace{\log \frac{p_X(X_i)\, p_Y(Y_i)}{p_{X,Y}(X_i, Y_i)}}_{W(X_i, Y_i)}. \]
Each of the terms is a function of $(X_i, Y_i)$, which are independent across $i = 1, \ldots, n$. Thus, the following convergence in probability is observed:
\[ L \to E[W(X, Y)] = E_{(X,Y) \sim p_{X,Y}}\left[ \log \frac{p_X(X)\, p_Y(Y)}{p_{X,Y}(X, Y)} \right] = -I(X; Y). \]
Hence, the limit of $2^{nL} = \frac{p_{X^n}(X^n)\, p_{Y^n}(Y^n)}{p_{X^n, Y^n}(X^n, Y^n)}$ is $2^{-nI(X;Y)}$, which converges to one if $X$ and $Y$ are independent, because then $I(X; Y) = 0$.
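The convergence $L \to -I(X;Y)$ can be observed in simulation. The sketch below (an illustration under an arbitrarily chosen joint pmf, not part of the original solutions) draws i.i.d. pairs and compares $-L$ to $I(X;Y)$:

```python
import random
from math import log2

random.seed(1)
# an arbitrary dependent joint pmf on {0,1}^2 with uniform marginals
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {0: 0.5, 1: 0.5}
py = {0: 0.5, 1: 0.5}

# mutual information in bits
I = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in pxy.items())

# sample n i.i.d. pairs and average the per-symbol log-likelihood ratio
n = 200_000
pairs = random.choices(list(pxy), weights=list(pxy.values()), k=n)
L = sum(log2(px[x] * py[y] / pxy[(x, y)]) for x, y in pairs) / n

print(I, -L)  # -L should be close to I(X;Y)
```

By the weak law of large numbers, $-L$ concentrates around $I(X;Y) \approx 0.278$ bits for this pmf.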
9. Piece of Cake: A cake is sliced roughly in half, the largest piece being chosen each time, the other pieces discarded. We will assume that a random cut creates pieces of proportions
\[ P = \begin{cases} (2/3, 1/3) & \text{w.p. } 3/4 \\ (2/5, 3/5) & \text{w.p. } 1/4. \end{cases} \]
Thus, for example, the first cut (and choice of largest piece) may result in a piece of size 3/5. Cutting and choosing from this piece might reduce it to size $(3/5)(2/3)$ at time 2, and so on. Let $T_n$ be the fraction of cake left after $n$ cuts. Find the limit (in probability) of $\frac{1}{n} \log T_n$.

Solution: Let $C_i$ be the fraction of the current piece of cake that is kept at the $i$-th cut, and let $T_n$ be the fraction of cake left after $n$ cuts. Then we have $T_n = C_1 C_2 \cdots C_n$. Hence, by the weak law of large numbers,
\[ \frac{1}{n} \log T_n = \frac{1}{n} \sum_{i=1}^n \log C_i \to E[\log C_1] = \frac{3}{4} \log \frac{2}{3} + \frac{1}{4} \log \frac{3}{5}. \]

10. Two Typical Sets: Let $X_i$ be a sequence of real-valued random variables independent and identically distributed according to $P_X(x)$, $x \in \mathcal{X}$. Let $\mu = E[X]$ and denote the entropy of $X$ as $H(X) = -\sum_x P_X(x) \log P_X(x)$. Define the two sets
\[ A_n = \left\{ x^n \in \mathcal{X}^n : \left| -\frac{1}{n} \log P_{X^n}(x^n) - H(X) \right| \le \varepsilon \right\}, \quad B_n = \left\{ x^n \in \mathcal{X}^n : \left| \frac{1}{n} \sum_{i=1}^n x_i - \mu \right| \le \varepsilon \right\}. \]

(a) (1 point) $\Pr(X^n \in A_n) \to 1$ as $n \to \infty$. True or false? Justify your answer.

Solution: True. This follows by Chebyshev's inequality: indeed,
\[ \Pr(X^n \in A_n^c) \le \frac{\sigma_0^2}{n \varepsilon^2} \to 0, \]
where $\sigma_0^2 = \mathrm{Var}(-\log P_X(X))$. Consequently,
\[ \Pr(X^n \in A_n) \to 1, \]
as desired.

(b) (1 point) $\Pr(X^n \in A_n \cap B_n) \to 1$ as $n \to \infty$. True or false? Justify your answer.

Solution: True. By Chebyshev's inequality and the same logic as the above,
\[ \Pr(X^n \in B_n) \to 1. \]
So by De Morgan's theorem and the union bound,
\[ \Pr(X^n \in A_n \cap B_n) = 1 - \Pr(X^n \in A_n^c \cup B_n^c) \ge 1 - \Pr(X^n \in A_n^c) - \Pr(X^n \in B_n^c). \]
Since the latter two terms tend to zero, we know that
\[ \Pr(X^n \in A_n \cap B_n) \to 1, \]
as desired.

(c) (1 point) Show that $|A_n \cap B_n| \le 2^{n(H(X)+\varepsilon)}$ for all $n$.

Solution:
\[ |A_n \cap B_n| \le |A_n| \le 2^{n(H+\varepsilon)}, \]
where the final inequality comes from the AEP, shown in class.
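Stepping back to Problem 9, the claimed limit can be checked by Monte Carlo simulation. This is an illustrative sketch, not part of the original solutions:

```python
import random
from math import log2

random.seed(2)
n = 100_000
log_T = 0.0
for _ in range(n):
    # each cut keeps the larger piece: 2/3 w.p. 3/4, or 3/5 w.p. 1/4
    log_T += log2(2 / 3) if random.random() < 3 / 4 else log2(3 / 5)

expected = (3 / 4) * log2(2 / 3) + (1 / 4) * log2(3 / 5)
print(log_T / n, expected)  # the two values should nearly agree
```

With $n = 10^5$ cuts, $\frac{1}{n}\log T_n$ lands within a few thousandths of $E[\log C_1] \approx -0.623$ bits per cut.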
(d) (1 point) Show that $|A_n \cap B_n| \ge \frac{1}{2} \, 2^{n(H(X)-\varepsilon)}$ for $n$ sufficiently large.

Solution: From part (b), $\Pr(X^n \in A_n \cap B_n) \ge \frac{1}{2}$ for $n$ sufficiently large. Thus
\[ \frac{1}{2} \le \sum_{x^n \in A_n \cap B_n} P_{X^n}(x^n) \le \sum_{x^n \in A_n \cap B_n} 2^{-n(H-\varepsilon)} = |A_n \cap B_n| \, 2^{-n(H-\varepsilon)}, \]
and we are done.

11. Entropy Inequalities: Let $X$ and $Y$ be real-valued random variables that take on discrete values in $\mathcal{X} = \{1, \ldots, r\}$ and $\mathcal{Y} = \{1, \ldots, s\}$. Let $Z = X + Y$.

(a) (1 point) Show that $H(Z|X) = H(Y|X)$. Justify your answer carefully.

Solution: Consider
\begin{align*}
H(Z|X) &= \sum_x P_X(x) H(Z|X = x) \\
&= -\sum_x P_X(x) \sum_z P_{Z|X}(z|x) \log P_{Z|X}(z|x) \\
&= -\sum_x P_X(x) \sum_z P_{Y|X}(z - x|x) \log P_{Y|X}(z - x|x) \\
&= -\sum_x P_X(x) \sum_y P_{Y|X}(y|x) \log P_{Y|X}(y|x) = H(Y|X).
\end{align*}

(b) (1 point) It is now known that $X$ and $Y$ are independent. Which of the following is true in general: (i) $H(X) \le H(Z)$; (ii) $H(X) \ge H(Z)$? Justify your answer.

Solution: From the above, note that $X$ and $Y$ are symmetrical, so given what we have proved in (a), we also know that
\[ H(Z|Y) = H(X|Y). \]
Now, we have
\[ H(Z) \ge H(Z|Y) = H(X|Y) = H(X), \]
where the inequality is because conditioning reduces entropy, and the final equality holds by the independence of $X$ and $Y$. So the first assertion is true.

(c) (1 point) Now, in addition to $Z = X + Y$ and that $X$ and $Y$ are independent, it is also known that $X = f_1(Z)$ and $Y = f_2(Z)$ for some functions $f_1$ and $f_2$. Find $H(Z)$ in terms of $H(X)$ and $H(Y)$.

Solution: On one hand,
\[ H(Z) = H(X + Y) \le H(X, Y) = H(X) + H(Y), \]
where the final equality is by independence of $X$ and $Y$. On the other hand,
\[ H(X) + H(Y) = H(X, Y) = H(f_1(Z), f_2(Z)) \le H(Z). \]
Hence all inequalities above are equalities and we have $H(Z) = H(X) + H(Y)$.
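The identities in Problem 11 are easy to confirm on a small example. The sketch below (an illustration with arbitrarily chosen pmfs, not part of the original solutions) verifies $H(Z|X) = H(Y|X)$ and $H(Z) \ge H(X)$ for independent $X, Y$ with $Z = X + Y$:

```python
from itertools import product
from math import log2

# arbitrary marginals for independent X and Y
pX = {1: 0.5, 2: 0.3, 3: 0.2}
pY = {1: 0.6, 2: 0.4}

def H(pmf):
    """Entropy in bits of a pmf given as a dict of probabilities."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

# joint pmf of (X, Z) with Z = X + Y
pXZ = {}
for (x, px_), (y, py_) in product(pX.items(), pY.items()):
    pXZ[(x, x + y)] = pXZ.get((x, x + y), 0.0) + px_ * py_

# marginal of Z
pZ = {}
for (x, z), p in pXZ.items():
    pZ[z] = pZ.get(z, 0.0) + p

H_Z_given_X = H(pXZ) - H(pX)  # chain rule: H(Z|X) = H(X,Z) - H(X)
H_Y_given_X = H(pY)           # Y independent of X, so H(Y|X) = H(Y)

assert abs(H_Z_given_X - H_Y_given_X) < 1e-9  # part (a)
assert H(pZ) >= H(pX)                         # part (b): H(Z) >= H(X)
```

The chain-rule computation of $H(Z|X)$ matching $H(Y)$ exactly mirrors the change of variables $z = x + y$ in the written proof.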