EE376A: Homework #2 Solutions
Due by 11:59pm Thursday, February 1st, 2018

Please submit the solutions on Gradescope. Some definitions that may be useful:

Definition 1: A sequence of random variables X_1, X_2, ..., X_n, ... converges to a random variable X in probability if for any ε > 0,

    lim_{n→∞} P{|X_n − X| < ε} = 1.    (1)

1. Inequalities. Let X, Y and Z be jointly distributed random variables. Prove the following inequalities and find conditions for equality.

(a) H(X, Y | Z) ≥ H(X | Z).
(b) I(X, Y; Z) ≥ I(X; Z).
(c) H(X, Y, Z) − H(X, Y) ≤ H(X, Z) − H(X).
(d) I(X; Z | Y) ≥ I(Z; Y | X) − I(Z; Y) + I(X; Z).

Solution: Inequalities.

(a) Using the chain rule for conditional entropy,

    H(X, Y | Z) = H(X | Z) + H(Y | X, Z) ≥ H(X | Z),

with equality iff H(Y | X, Z) = 0, that is, when Y is a function of X and Z.

(b) Using the chain rule for mutual information,

    I(X, Y; Z) = I(X; Z) + I(Y; Z | X) ≥ I(X; Z),

with equality iff I(Y; Z | X) = 0, that is, when Y and Z are conditionally independent given X.

(c) Using first the chain rule for entropy and then the definition of conditional mutual information,

    H(X, Y, Z) − H(X, Y) = H(Z | X, Y)
                         = H(Z | X) − I(Y; Z | X)
                         ≤ H(Z | X)
                         = H(X, Z) − H(X),

with equality iff I(Y; Z | X) = 0, that is, when Y and Z are conditionally independent given X.

Homework 2 Page 1 of 9
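Parts (a)–(c) can be sanity-checked numerically on a random joint pmf over {0,1}^3. A minimal sketch (the helper functions `H` and `marginal` are ours, not part of the assignment):

```python
import itertools
import random
from math import log2

def H(pmf):
    """Entropy in bits of a pmf given as a dict {outcome: prob}."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

def marginal(pmf, axes):
    """Marginalize a joint pmf over triples (x, y, z), keeping the given axes."""
    out = {}
    for xyz, p in pmf.items():
        key = tuple(xyz[i] for i in axes)
        out[key] = out.get(key, 0.0) + p
    return out

# Random joint distribution over {0,1}^3.
random.seed(0)
support = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in support]
total = sum(weights)
pxyz = {s: w / total for s, w in zip(support, weights)}

Hxyz = H(pxyz)
Hxy  = H(marginal(pxyz, (0, 1)))
Hxz  = H(marginal(pxyz, (0, 2)))
Hx   = H(marginal(pxyz, (0,)))
Hz   = H(marginal(pxyz, (2,)))

# (a) H(X,Y|Z) >= H(X|Z), using H(A|B) = H(A,B) - H(B).
assert Hxyz - Hz >= Hxz - Hz - 1e-12
# (b) I(X,Y;Z) >= I(X;Z), using I(A;B) = H(A) + H(B) - H(A,B).
assert Hxy + Hz - Hxyz >= Hx + Hz - Hxz - 1e-12
# (c) H(X,Y,Z) - H(X,Y) <= H(X,Z) - H(X).
assert Hxyz - Hxy <= Hxz - Hx + 1e-12
```

Changing the seed exercises the same checks on a different joint distribution.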
(d) Using the chain rule for mutual information,

    I(X; Z | Y) + I(Z; Y) = I(X, Y; Z) = I(Z; Y | X) + I(X; Z),

and therefore

    I(X; Z | Y) = I(Z; Y | X) − I(Z; Y) + I(X; Z).

We see that this inequality is actually an equality in all cases.

2. AEP. Let X_i be i.i.d. ~ p(x), x ∈ {1, 2, ..., m}. For ε > 0, let µ = E[X] and H = −Σ_x p(x) log p(x). Let

    A_n = {x^n ∈ X^n : |−(1/n) log p(x^n) − H| ≤ ε}

and

    B_n = {x^n ∈ X^n : |(1/n) Σ_{i=1}^n x_i − µ| ≤ ε}.

(a) Does P{X^n ∈ A_n} → 1?
(b) Does P{X^n ∈ A_n ∩ B_n} → 1?
(c) Show that |A_n ∩ B_n| ≤ 2^{n(H+ε)} for all n.
(d) Show that |A_n ∩ B_n| ≥ (1/2) 2^{n(H−ε)} for n sufficiently large.

Solution: AEP.

(a) Yes. By the AEP for discrete random variables, the probability that X^n is typical goes to 1.

(b) Yes. By the Law of Large Numbers, P(X^n ∈ B_n) → 1. Fix any δ > 0 (we write δ to avoid confusion with the ε that defines A_n and B_n). There exists N_1 such that P(X^n ∈ A_n) > 1 − δ/2 for all n > N_1, and there exists N_2 such that P(X^n ∈ B_n) > 1 − δ/2 for all n > N_2. So for all n > max(N_1, N_2):

    P(X^n ∈ A_n ∩ B_n) = P(X^n ∈ A_n) + P(X^n ∈ B_n) − P(X^n ∈ A_n ∪ B_n)
                       > 1 − δ/2 + 1 − δ/2 − 1 = 1 − δ.

So for any δ > 0 there exists N = max(N_1, N_2) such that P(X^n ∈ A_n ∩ B_n) > 1 − δ for all n > N; therefore P(X^n ∈ A_n ∩ B_n) → 1.

(c) By the law of total probability, Σ_{x^n ∈ A_n ∩ B_n} p(x^n) ≤ 1. Also, for x^n ∈ A_n, Theorem 3.1.2 in the text gives p(x^n) ≥ 2^{−n(H+ε)}. Combining these two facts,

    1 ≥ Σ_{x^n ∈ A_n ∩ B_n} p(x^n) ≥ Σ_{x^n ∈ A_n ∩ B_n} 2^{−n(H+ε)} = |A_n ∩ B_n| 2^{−n(H+ε)}.

Multiplying through by 2^{n(H+ε)} gives the result |A_n ∩ B_n| ≤ 2^{n(H+ε)}.

(d) Since from (b) P{X^n ∈ A_n ∩ B_n} → 1, there exists N such that P{X^n ∈ A_n ∩ B_n} ≥ 1/2 for all n > N. From Theorem 3.1.2 in the text, for x^n ∈ A_n, p(x^n) ≤ 2^{−n(H−ε)}. Combining these two facts, for n > N,

    1/2 ≤ Σ_{x^n ∈ A_n ∩ B_n} p(x^n) ≤ Σ_{x^n ∈ A_n ∩ B_n} 2^{−n(H−ε)} = |A_n ∩ B_n| 2^{−n(H−ε)}.

Multiplying through by 2^{n(H−ε)} gives the result |A_n ∩ B_n| ≥ (1/2) 2^{n(H−ε)} for n sufficiently large.
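The convergence in part (b) can also be illustrated empirically. A minimal sketch, with an arbitrarily chosen three-symbol source and ε = 0.2 (all parameter choices are ours):

```python
import random
from math import log2

# A small source on {1, 2, 3}; the pmf is an arbitrary choice for illustration.
symbols = [1, 2, 3]
probs = [0.5, 0.3, 0.2]
p_of = dict(zip(symbols, probs))

H = -sum(p * log2(p) for p in probs)             # entropy in bits
mu = sum(s * p for s, p in zip(symbols, probs))  # mean
eps = 0.2

def in_An_and_Bn(xs):
    """Is the sequence xs in A_n (entropy-typical) and in B_n (mean-typical)?"""
    n = len(xs)
    avg_logp = sum(log2(p_of[x]) for x in xs) / n
    return abs(-avg_logp - H) <= eps and abs(sum(xs) / n - mu) <= eps

random.seed(1)
trials = 2000
for n in [10, 100, 1000]:
    hits = sum(in_An_and_Bn(random.choices(symbols, probs, k=n))
               for _ in range(trials))
    print(n, hits / trials)  # empirical P(X^n in A_n ∩ B_n), approaching 1 as n grows
```

By the AEP and the LLN, the printed fractions rise toward 1 as n increases.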
3. An AEP-like limit and the AEP

(a) Let X_1, X_2, ... be i.i.d. drawn according to probability mass function p(x). Find the limit in probability as n → ∞ of p(X_1, X_2, ..., X_n)^{1/n}.

(b) Let X_1, X_2, ... be an i.i.d. sequence of discrete random variables with entropy H(X). Let

    C_n(t) = {x^n ∈ X^n : p(x^n) ≥ 2^{−nt}}

denote the subset of n-length sequences with probabilities ≥ 2^{−nt}.

i. Show that |C_n(t)| ≤ 2^{nt}.
ii. What is lim_{n→∞} P(X^n ∈ C_n(t)) when t < H(X)? And when t > H(X)?

Solution: An AEP-like limit and the AEP.

(a) By the AEP, we know that for every δ > 0,

    lim_{n→∞} P( H(X) − δ ≤ −(1/n) log p(X_1, X_2, ..., X_n) ≤ H(X) + δ ) = 1.

Now, fix ε > 0 (sufficiently small) and choose δ = min{ log(1 + 2^{H(X)} ε), −log(1 − 2^{H(X)} ε) }. Then 2^{−H(X)}(2^δ − 1) ≤ ε and 2^{−H(X)}(1 − 2^{−δ}) ≤ ε. Thus,

    H(X) − δ ≤ −(1/n) log p(X_1, ..., X_n) ≤ H(X) + δ
    ⟹ 2^{−H(X)} 2^{−δ} ≤ p(X_1, ..., X_n)^{1/n} ≤ 2^{−H(X)} 2^{δ}
    ⟹ −2^{−H(X)}(1 − 2^{−δ}) ≤ p(X_1, ..., X_n)^{1/n} − 2^{−H(X)} ≤ 2^{−H(X)}(2^{δ} − 1)
    ⟹ |p(X_1, ..., X_n)^{1/n} − 2^{−H(X)}| ≤ ε.

This along with the AEP implies that P(|p(X_1, ..., X_n)^{1/n} − 2^{−H(X)}| ≤ ε) → 1 for all ε > 0, and hence p(X_1, X_2, ..., X_n)^{1/n} converges to 2^{−H(X)} in probability.

This proof can be shortened by directly invoking the continuous mapping theorem, which says that if Z_n converges to Z in probability and f is a continuous function, then f(Z_n) converges to f(Z) in probability.

Alternate proof (using the strong LLN): X_1, X_2, ... are i.i.d. ~ p(x). Hence the log p(X_i) are also i.i.d., and

    lim p(X_1, ..., X_n)^{1/n} = lim 2^{(1/n) log p(X_1, ..., X_n)} = 2^{lim (1/n) Σ_i log p(X_i)} = 2^{E[log p(X)]} = 2^{−H(X)},
where the second equality uses the continuity of the function 2^x and the third equality uses the strong law of large numbers. Thus, p(X_1, ..., X_n)^{1/n} converges to 2^{−H(X)} almost surely, and hence in probability.

(b) i. We have

    1 ≥ Σ_{x^n ∈ C_n(t)} p(x^n) ≥ Σ_{x^n ∈ C_n(t)} 2^{−nt} = |C_n(t)| 2^{−nt}.

Thus, |C_n(t)| ≤ 2^{nt}.

ii. The AEP immediately implies that lim_{n→∞} P(X^n ∈ C_n(t)) = 0 for t < H(X) and lim_{n→∞} P(X^n ∈ C_n(t)) = 1 for t > H(X).

4. Prefix and Uniquely Decodable Codes

Consider the following code:

    u | Codeword
    a | 10
    b | 00
    c | 11
    d | 110

(a) Is this a prefix code?
(b) Argue that this code is uniquely decodable, by providing an algorithm for the decoding.

Solution: Prefix and Uniquely Decodable Codes

(a) No. The codeword of c (11) is a prefix of the codeword of d (110).

(b) We decode the encoded bits from left to right. At any stage:
- If the next two bits are 10, output a and advance past them.
- If the next two bits are 00, output b and advance past them.
- If the next two bits are 11, look at the third bit:
  - If it is 1, output c and continue from that third bit.
  - If it is 0, count the number of 0's following the 11:
    - If even (say 2m zeros), decode to c b...b with m b's and continue from the bit after the 0's.
    - If odd (say 2m + 1 zeros), decode to d b...b with m b's and continue from the bit after the 0's.

Some examples with their decodings:
- 11011: It is not possible to split this string as 11 | 0 | 11 because there is no codeword 0. Therefore the only way is 110 | 11, which decodes to d c.
- 1110: It is not possible to split this string as 1 | 11 | 0 or 1 | 110 because there is no codeword 1 or 0. Therefore the only way is 11 | 10, which decodes to c a.
- 110010: It is not possible to split this string as 110 | 0 | 10 because there is no codeword 0. Therefore the only way is 11 | 00 | 10, which decodes to c b a.

For a more elaborate discussion of this topic, read Problem 5.27 in the text¹, where the Sardinas–Patterson test of unique decodability is explained.

5. Information in the Sequence Length

How much information does the length of a sequence give about the content of the sequence? Consider an i.i.d. Bernoulli(1/2) process {X_i}. Stop the process when the first 1 appears. Let N designate this stopping time. Thus X^N is a random element of the set of all finite-length binary sequences {0, 1}* = {0, 1, 00, 01, 10, 11, 000, ...}.

(a) Find H(X^N | N).
(b) Find H(X^N).
(c) Find I(N; X^N).

Let's now consider a different stopping time. For this part, again assume X_i ~ Bernoulli(1/2), but stop at time N = 6 with probability 1/3 and at time N = 12 with probability 2/3. Let this stopping time be independent of the sequence X_1 X_2 ... X_12.

(d) Find H(X^N | N).
(e) Find H(X^N).
(f) Find I(N; X^N).

Solution: Sequence length.

(a) Given N, the only possible X^N is 00...01 with N − 1 zeros. Thus, H(X^N | N) = 0.

(b) H(X^N) = I(X^N; N) + H(X^N | N) = H(N) − H(N | X^N) + H(X^N | N). Now, H(N | X^N) = H(X^N | N) = 0 and H(N) = −Σ_{i=1}^∞ (1/2)^i log(1/2)^i = 2. Thus, H(X^N) = 2 bits.

¹ From: T.M. Cover and J.A. Thomas, Elements of Information Theory, Second Edition, 2006.
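Returning to Problem 4(b), the case analysis there translates directly into a short decoder. A minimal sketch (function and variable names are ours):

```python
CODEBOOK = {"a": "10", "b": "00", "c": "11", "d": "110"}

def decode(bits):
    """Decode a bit string produced by the code {a:10, b:00, c:11, d:110},
    using the lookahead rule from the solution of Problem 4(b)."""
    out, i, n = [], 0, len(bits)
    while i < n:
        pair = bits[i:i + 2]
        if pair == "10":
            out.append("a"); i += 2
        elif pair == "00":
            out.append("b"); i += 2
        elif pair == "11":
            # Count the run of 0's following the 11.
            j = i + 2
            while j < n and bits[j] == "0":
                j += 1
            zeros = j - (i + 2)
            if zeros % 2 == 0:   # even run: the 11 is c, the zeros pair up as b's
                out.append("c")
            else:                # odd run: 110 is d, the remaining zeros are b's
                out.append("d")
            out.extend("b" * (zeros // 2))
            i = j
        else:
            raise ValueError("invalid encoding")
    return "".join(out)

# The examples from the solution:
print(decode("11011"))   # dc
print(decode("1110"))    # ca
print(decode("110010"))  # cba
```

Encoding any string with `CODEBOOK` and decoding it back recovers the original, which is unique decodability in action.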
(c) I(N; X^N) = H(N) − H(N | X^N) = 2 bits.

(d) H(X^N | N) = P(N = 6) H(X^N | N = 6) + P(N = 12) H(X^N | N = 12) = (1/3)·6 + (2/3)·12 = 10 bits.

(e) Again, we have H(N | X^N) = 0. Therefore,

    I(N; X^N) = H(N) − H(N | X^N) = H(N) = −(1/3) log(1/3) − (2/3) log(2/3) = log_2 3 − 2/3,

and so

    H(X^N) = I(X^N; N) + H(X^N | N) = log_2 3 − 2/3 + 10 bits.

(f) From above, I(N; X^N) = log_2 3 − 2/3 bits.

6. Coin Toss Experiment and Golomb Codes

Kedar, Mikel and Naroa have been instructed to record the outcomes of a coin toss experiment. Consider the coin toss experiment X_1, X_2, X_3, ..., where the X_i are i.i.d. Bern(p) (the probability of a head H is p), with p = 15/16.

(a) Kedar decides to use Huffman coding to represent the outcome of each coin toss separately. What is the resulting scheme? What compression rate does it achieve?

(b) Mikel suggests he can do a better job by applying Huffman coding on blocks of r tosses. Construct the Huffman code for r = 2.

(c) Will Mikel's scheme approach the optimum expected number of bits per description of source symbol (coin toss outcome) with increasing r? How does the space required to represent the codebook increase as we increase r?

(d) Naroa suggests that, as the occurrence of T is so rare, we should just record the number of tosses it takes for each T to occur. To be precise, if Y_k represents the number of trials until the k-th T occurred (inclusive), then Naroa records the sequence

    Z_k = Y_k − Y_{k−1}, k ≥ 1,    (2)

where Y_0 = 0.

i. What is the distribution of Z_k, i.e., P(Z_k = j), j = 1, 2, 3, ...?
ii. Compute the entropy and expectation of Z_k.
iii. How does the ratio between the entropy and the expectation of Z_k compare to the entropy of X_k? Give an intuitive interpretation.

(e) Consider the following scheme for encoding Z_k, which is a specific case of Golomb Coding. We are showing the first 10 codewords.
    Z  | Quotient | Remainder | Code
    1  | 0        | 1         | 1 01
    2  | 0        | 2         | 1 10
    3  | 0        | 3         | 1 11
    4  | 1        | 0         | 0 1 00
    5  | 1        | 1         | 0 1 01
    6  | 1        | 2         | 0 1 10
    7  | 1        | 3         | 0 1 11
    8  | 2        | 0         | 00 1 00
    9  | 2        | 1         | 00 1 01
    10 | 2        | 2         | 00 1 10

i. Can you guess the general coding scheme? Compute the expected codelength of this code.
ii. What is the decoding rule for this Golomb code?
iii. How can you efficiently encode and decode with this codebook?

Solution:

(a) As we have just two source symbols (H and T), the optimal Huffman code is {0, 1} for {H, T}. The compression rate (expected codelength) achieved is 1 bit/source symbol.

(b) One possible Huffman code for r = 2 is HH → 0, HT → 10, TH → 110, TT → 111. This has an average codelength of 303/512 bits/source symbol.

(c) We know that, using Shannon codes on extended symbols of length r, the expected codelength per source symbol satisfies

    H(X) ≤ l_avg ≤ H(X) + 1/r.    (3)

As Huffman coding is the optimal prefix code, it will always perform at least as well as the Shannon code. Thus, as r increases, we approach the optimal average codelength of H(X) = H_b(1/16). As r increases, the number of codewords grows as 2^r, and the average codeword length grows as rH(X). Thus, effectively, the amount of space required to store the codebook increases exponentially in r.

(d) First of all, let us convince ourselves that if Y_k = n, then {Z_1, ..., Z_k} represents the same information as {X_1, ..., X_n}. Also, the Z_k are independent of each other (since the coin tosses are independent) and identically distributed.

i. Z_k has a geometric distribution. Specifically, P(Z_k = j) = p^{j−1}(1 − p), j = 1, 2, 3, .... To see this, consider Z_1 first for ease. For Z_1 = j, the first j − 1 tosses must be H, while the j-th toss must be T. This gives P(Z_1 = j) = p^{j−1}(1 − p). Since the Z_k are identically distributed, the result generalizes to all Z_k.
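The claims in part (d) for p = 15/16 can be checked numerically by truncating the geometric sums at a large cutoff (the cutoff and variable names are our own choices):

```python
from math import log2

p = 15 / 16   # probability of heads
q = 1 - p     # P(T) = 1/16

# P(Z = j) = p^(j-1) * (1-p); truncate the infinite sums at a large cutoff N.
N = 10_000
pmf = [p ** (j - 1) * q for j in range(1, N + 1)]

H_Z = -sum(pj * log2(pj) for pj in pmf)               # entropy of Z
E_Z = sum(j * pj for j, pj in enumerate(pmf, start=1))  # expectation of Z

H_b = -(p * log2(p) + q * log2(q))                    # binary entropy of one toss

print(H_Z)        # ≈ 5.3966 bits, matching H_b(p)/(1-p)
print(E_Z)        # ≈ 16 = 1/(1-p)
print(H_Z / E_Z)  # ≈ 0.3373 bits/toss = H_b(1/16), as claimed in (d)iii
```

The truncation error is negligible here since p^N is astronomically small.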
ii. H(Z_k) = H_b(p)/(1 − p). As Z_k has a geometric distribution, E[Z_k] = 1/(1 − p).

iii. We observe that H(X) = H(Z_k)/E[Z_k]. Intuitively, we can explain this as follows: Z_k represents the same amount of information as X_{Y_{k−1}+1}, ..., X_{Y_k}. As the X_i are i.i.d., the information content of Z_k is the same as the information content of E[Z_k] i.i.d. random variables X_i. Thus, H(Z_k) = E[Z_k] H(X).

(e) As an aside, the aim of the question was to introduce Golomb codes. Golomb codes are in fact optimal for geometrically distributed sources, despite having a very simple construction. We have presented a specific case of Golomb codes in our example.

i. We write the quotient of Z with respect to 4 in unary (as a run of 0's) and the remainder in binary, concatenating them with a separator bit (the bit 1 in this case). To calculate the expected codelength, observe that the length of the codeword for Z = 4k + m (where 0 ≤ m ≤ 3) is 3 + k. Thus, adding terms in groups of 4 and subtracting the term corresponding to Z = 0, we get

    l̄ = Σ_{k=0}^∞ (1 − p) p^{4k−1} (1 + p + p^2 + p^3)(3 + k) − 3(1 − p)/p
      = ((1 − p)(1 + p + p^2 + p^3)/p) Σ_{k=0}^∞ p^{4k}(3 + k) − 3(1 − p)/p
      = (1 − p)(1 + p + p^2 + p^3) · (3 − 2p^4)/(p(1 − p^4)^2) − 3(1 − p)/p
      = (3 − 2p^4)/(p(1 − p^4)) − 3(1 − p)/p.

For the current source (p = 15/16), this evaluates to 6.62 bits, as compared to the entropy H(Z) = 5.397 bits.

ii. It is easy to show that the code is prefix-free. Thus, decoding can proceed by the standard technique of constructing a prefix tree and using it for decoding. However, note that in this case the codebook size itself is unbounded, so we would need to truncate the tree at some maximum depth. There is, however, a better and simpler way, which does not require us to store the codebook explicitly: parse the bit stream until we see the bit 1; the number of 0's read gives the quotient q. Then read the next 2 bits, which give the remainder r. Using q and r, we decode Z = 4q + r.

iii.
The good thing about Golomb codes is that, even though the codebook size is unbounded, we do not need to store the codebook explicitly; a simple procedure generates the codewords, and an equally simple one decodes them.

7. Bad codes.
Which of these codes cannot be a Huffman code for any probability assignment? Justify your answers.

(a) {0, 10, 11}.
(b) {00, 01, 10, 110}.
(c) {01, 10}.

Solution: Bad codes.

(a) {0, 10, 11} is a Huffman code for the distribution (1/2, 1/4, 1/4).

(b) The code {00, 01, 10, 110} can be shortened to {00, 01, 10, 11} without losing its prefix-free property, and is therefore not optimal, so it cannot be a Huffman code. Alternatively, it is not a Huffman code because there is a unique longest codeword.

(c) The code {01, 10} can be shortened to {0, 1} without losing its prefix-free property, and is therefore not optimal and not a Huffman code.
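The shortening argument in (b) and (c) can be mechanized: if dropping the last bit of some codeword leaves the set prefix-free, the code is suboptimal and hence not a Huffman code. A minimal sketch (helper names are ours):

```python
def is_prefix_free(code):
    """True if no codeword in the set is a proper prefix of another."""
    return not any(a != b and b.startswith(a) for a in code for b in code)

def has_shortenable_codeword(code):
    """True if some codeword's last bit can be dropped with the set staying
    prefix-free. Such a code is suboptimal, hence not a Huffman code."""
    for w in code:
        shortened = (set(code) - {w}) | {w[:-1]}
        if is_prefix_free(shortened):
            return True
    return False

print(has_shortenable_codeword({"0", "10", "11"}))          # False: can be Huffman
print(has_shortenable_codeword({"00", "01", "10", "110"}))  # True: 110 -> 11
print(has_shortenable_codeword({"01", "10"}))               # True: e.g. 01 -> 0
```

Note this test is one-directional: a shortenable code is certainly not a Huffman code, but passing the test does not by itself prove a code is Huffman for some distribution.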