EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018


Please submit the solutions on Gradescope.

1. Optimal codeword lengths. Although the codeword lengths of an optimal variable length code are complicated functions of the message probabilities $\{p_1, p_2, \ldots, p_m\}$, it can be said that less probable symbols are encoded into longer codewords. Suppose that the message probabilities are given in decreasing order $p_1 > p_2 \geq \cdots \geq p_m$.

(a) Prove that for any binary Huffman code, if the most probable message symbol has probability $p_1 < 1/3$, then that symbol must be assigned a codeword of length at least 2.

(b) Prove that for any binary Huffman code, if the most probable message symbol has probability $p_1 > 2/5$, then that symbol must be assigned a codeword of length 1.

Solution: Optimal codeword lengths. Let $\{c_1, c_2, \ldots, c_m\}$ be codewords of respective lengths $\{l_1, l_2, \ldots, l_m\}$ corresponding to the probabilities $\{p_1, p_2, \ldots, p_m\}$.

(a) Suppose, for the sake of contradiction, that $l_1 = 1$. Without loss of generality, assume that $c_1 = 0$. For $x, y \in \{0, 1\}$ let $C_{xy}$ denote the set of codewords beginning with $xy$. The total probability of $C_{10}$ and $C_{11}$ is $1 - p_1 > 2/3$, so at least one of these two sets (without loss of generality, $C_{10}$) has probability greater than $1/3$. We can now obtain a better code by interchanging the subtree of the decoding tree beginning with 0 with the subtree beginning with 10; that is, we replace codewords of the form $10x\ldots$ by $0x\ldots$ and we let $c_1 = 10$. This improvement contradicts the optimality of the code under the assumption $l_1 = 1$, and so $l_1 \geq 2$.

(b) We prove that if $p_1 > p_2$ and $p_1 > 2/5$ then $l_1 = 1$. Suppose, for the sake of contradiction, that $l_1 \geq 2$. Then there are no codewords of length 1; otherwise $c_1$ would not be the shortest codeword. Without loss of generality, we can assume that $c_1$ begins with 00. For $x, y \in \{0, 1\}$ let $C_{xy}$ denote the set of codewords beginning with $xy$. Then the sets $C_{01}$, $C_{10}$, and $C_{11}$ have total probability $1 - p_1 < 3/5$, so some two of these sets (without loss of generality, $C_{10}$ and $C_{11}$) have total probability less than $2/5$. We can now obtain a better code by interchanging the subtree of the decoding tree beginning with 1 with the subtree beginning with 00; that is, we replace codewords of the form $1x\ldots$ by $00x\ldots$ and codewords of the form $00y\ldots$ by $1y\ldots$. This improvement contradicts the assumption that $l_1 \geq 2$, and so $l_1 = 1$. (Note that $p_1 > p_2$ was a hidden assumption for this problem; otherwise, for example, the probabilities $\{.49, .49, .02\}$ have the optimal code $\{00, 1, 01\}$.)
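As a quick sanity check of (a) and (b), one can build binary Huffman codes for randomly drawn PMFs and inspect the length assigned to the most probable symbol. The Python sketch below is illustrative only and not part of the graded solution; the helper name `huffman_lengths` is an ad hoc choice. It builds one valid Huffman code per PMF, which suffices here since the claims hold for every Huffman code.

```python
import heapq, random

def huffman_lengths(probs):
    """Return the binary Huffman codeword length of each symbol of the PMF `probs`."""
    # Heap entries: (subtree probability, unique tie-breaker, symbols in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:              # merging two subtrees pushes their symbols one level deeper
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

random.seed(0)
for _ in range(10000):
    m = random.randint(3, 8)
    w = sorted((random.random() for _ in range(m)), reverse=True)
    p = [x / sum(w) for x in w]
    l1 = huffman_lengths(p)[0]         # codeword length of the most probable symbol
    if p[0] < 1/3:
        assert l1 >= 2                 # claim (a)
    if p[0] > 2/5:                     # p[0] > p[1] holds almost surely for random weights
        assert l1 == 1                 # claim (b)
print("claims in 1(a) and 1(b) hold on all sampled PMFs")
```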

2. Shannon-Fano-Elias and Arithmetic Coding. Let $p(x)$ be a PMF on the alphabet $\mathcal{X} = \{1, 2, 3, \ldots, m\}$. Assume that $p(x) > 0$ for all $x \in \mathcal{X}$. We define $F(x)$ and $\bar{F}(x)$ as
$$F(x) = \sum_{a < x} p(a), \qquad \bar{F}(x) = F(x) + \frac{1}{2} p(x).$$
An example of $\bar{F}(x)$ is shown in Figure 1. We first discuss the construction of Shannon-Fano-Elias codes, which form the basis for Arithmetic coding.

[Figure 1: plot of $\bar{F}(x)$ for $x \in \{1, 2, 3\}$ and of $\bar{F}(x^2)$ for $x^2 \in \{11, 12, \ldots, 33\}$ in lexicographic order, for $\mathcal{X} = \{1, 2, 3\}$.]

(a) Show that you can decode $x$ if $\bar{F}(x)$ is known.

(b) In general $\bar{F}(x)$ is a real number in $[0, 1]$. Thus, for storing $\bar{F}(x)$, we need to truncate it appropriately. Let $\bar{F}_T(x) = \lfloor \bar{F}(x) \rfloor_{l(x)}$ be the truncation of $\bar{F}(x)$, written in binary, to $l(x)$ bits, where
$$l(x) = \left\lceil \log \frac{1}{p(x)} \right\rceil + 1.$$
Show that one can decode $x$ from $\bar{F}_T(x)$.

(c) Let $C(x)$ be the codeword consisting of the $l(x)$ bits of $\bar{F}_T(x)$ after the binary point. Show that $C(x)$ forms a prefix code.

(d) Show that the average codelength $L$ of the code $C(x)$ satisfies
$$L < H(X) + 2.$$

(e) Consider a sequence $x^n = (x_1, x_2, \ldots, x_n)$ over the extended alphabet $\mathcal{X}^n$, distributed i.i.d. according to the distribution $p(x)$. Then show that
$$p(x^n) = p(x^{n-1}) p(x_n),$$
$$F(x^n) = F(x^{n-1}) + p(x^{n-1}) F(x_n),$$
where $F(x^n) = \sum_{a^n < x^n} p(a^n)$, assuming lexicographic order over $a^n \in \mathcal{X}^n$. (Hint: see Figure 1.)

(f) Given a sequence $x^n$, give an efficient way to perform the Shannon-Fano-Elias encoding $C(x^n)$ of the sequence $x^n$.

(g) Suggest an efficient decoding algorithm for $C(x^n)$ which does not explicitly store the prefix codebook. (Hint: can you recursively decode $x_1, x_2, \ldots, x_n$ from $\bar{F}_T(x^n)$?)

(h) The efficient algorithm described above for computing Shannon-Fano-Elias codes for the alphabet $\mathcal{X}^n$ is known as Arithmetic Coding. Show that the average codelength $L_{avg} = E[l(C(x^n))]/n$ for Arithmetic coding satisfies
$$L_{avg} < H(X) + \frac{2}{n}.$$
Due to their efficiency and near optimality, Arithmetic codes are widely used for compression and are building blocks in compressors including GZIP, JPEG and MP4.

Solution: Shannon-Fano-Elias and Arithmetic Coding.

(a) Since $\bar{F}(x)$ is a strictly increasing function of $x$, it is invertible. Thus we can decode $x$ if $\bar{F}(x)$ is known.

(b) First, observe that $2^{-l(x)} = 2^{-\lceil \log \frac{1}{p(x)} \rceil - 1} \leq p(x)/2$, and truncation to $l(x)$ bits reduces $\bar{F}(x)$ by less than $2^{-l(x)}$; hence $\bar{F}_T(x) \in (\bar{F}(x) - p(x)/2,\, \bar{F}(x)]$. Since $\bar{F}(x-1) < F(x) = \bar{F}(x) - p(x)/2$, the intervals $(\bar{F}(x) - p(x)/2,\, \bar{F}(x)]$ are disjoint for different $x$'s. Thus, by looking at the interval containing $\bar{F}_T(x)$, we can decode $x$.

(c) As shown above, $\bar{F}_T(x) \in (\bar{F}(x) - p(x)/2,\, \bar{F}(x)]$. Now any real number whose binary expansion begins with the codeword $C(x)$ lies in $[\bar{F}_T(x),\, \bar{F}_T(x) + 2^{-l(x)}) \subseteq [\bar{F}_T(x),\, \bar{F}_T(x) + p(x)/2)$, which is contained in $(\bar{F}(x) - p(x)/2,\, \bar{F}(x) + p(x)/2)$. Again these intervals are disjoint for different $x$'s, and hence no codeword can be a prefix of another.

(d) The average codelength satisfies
$$L = \sum_{x \in \mathcal{X}} p(x) l(x) = \sum_{x \in \mathcal{X}} p(x)\left(\left\lceil \log \frac{1}{p(x)} \right\rceil + 1\right) < \sum_{x \in \mathcal{X}} p(x)\left(\log \frac{1}{p(x)} + 2\right) = H(X) + 2.$$

(e) $p(x^n) = p(x^{n-1}) p(x_n)$ follows from the definition of independence. For the other equality, first note that in lexicographic ordering, $a^n < x^n$ if and only if either $a^{n-1} < x^{n-1}$, or $a^{n-1} = x^{n-1}$ and $a_n < x_n$.

Thus
$$F(x^n) = \sum_{a^n < x^n} p(a^n) = \sum_{a^{n-1} < x^{n-1}} p(a^{n-1}) + \sum_{a_n < x_n} p(x^{n-1}) p(a_n) = F(x^{n-1}) + p(x^{n-1}) F(x_n).$$

(f) Use the recursion in (e) to compute $F(x^n)$ and $p(x^n)$ in linear time. Then use the encoding given in parts (b) and (c). Note that computing and storing the entire codebook over $\mathcal{X}^n$ would be exponential in $n$, but we do not need to do that to encode $x^n$.

(g) We show how we can iteratively decode $x_1, x_2, \ldots, x_n$ from $\bar{F}(x^n)$. The decoding starting from $\bar{F}_T(x^n)$ is along the same lines, but slightly more involved. Observe that $F(x_1) \leq F(x^n) < F(x_1 + 1)$ (e.g., see Figure 1). We can decode $x_1$ by using this inequality. Once we know $x_1$, we can use the recursion relation $F(x^n) = F(x_1) + p(x_1) F(x_2^n)$, where $x_2^n = (x_2, \ldots, x_n)$. This relation can be derived by noting that $a^n < x^n$ if and only if either $a_1 < x_1$, or $a_1 = x_1$ and $a_2^n < x_2^n$, and then proceeding as in part (e). Continuing this recursion, we can obtain $x_2, \ldots, x_n$ as well.

(h) Applying the result in part (d) to the alphabet $\mathcal{X}^n$, we get
$$L_{avg} = \frac{E[l(C(x^n))]}{n} < \frac{1}{n}\left(H(X^n) + 2\right) = \frac{1}{n}\left(nH(X) + 2\right) = H(X) + \frac{2}{n}.$$
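The following Python sketch (assumed, not the official solution code) makes parts (b), (c), and (e)-(g) concrete: it encodes a sequence with the recursions from (e) using exact rational arithmetic and decodes it recursively as in (g). For $n = 1$ it reduces to the single-symbol Shannon-Fano-Elias code of parts (b)-(c). The toy PMF is dyadic so that the floating-point probabilities convert to exact fractions; a production arithmetic coder would instead use finite-precision integer intervals and would also transmit the sequence length $n$ or an end-of-stream symbol.

```python
from fractions import Fraction
from math import ceil, log2

def encode(seq, pmf):
    """Return the Shannon-Fano-Elias / arithmetic codeword for `seq` under `pmf`."""
    cumF = {x: sum(Fraction(pmf[a]) for a in pmf if a < x) for x in pmf}  # F(x)
    F, p = Fraction(0), Fraction(1)
    for x in seq:                              # recursion from part (e)
        F = F + p * cumF[x]                    # F(x^k) = F(x^{k-1}) + p(x^{k-1}) F(x_k)
        p = p * Fraction(pmf[x])               # p(x^k) = p(x^{k-1}) p(x_k)
    l = ceil(log2(1 / p)) + 1                  # l(x^n) = ceil(log 1/p(x^n)) + 1, exact for dyadic PMFs
    Fbar = F + p / 2                           # F-bar(x^n)
    return format(int(Fbar * 2**l), f"0{l}b")  # first l bits of F-bar, as in parts (b)-(c)

def decode(code, pmf, n):
    """Recover n symbols from the codeword, following the recursion in part (g)."""
    cumF = {x: sum(Fraction(pmf[a]) for a in pmf if a < x) for x in pmf}
    U = Fraction(int(code, 2), 2**len(code))   # the real number 0.code
    out = []
    for _ in range(n):
        x = max(a for a in pmf if cumF[a] <= U)   # uses F(x_1) <= U < F(x_1 + 1)
        out.append(x)
        U = (U - cumF[x]) / Fraction(pmf[x])      # invert F = F(x_1) + p(x_1) F(rest)
    return out

pmf = {1: 0.5, 2: 0.25, 3: 0.25}
seq = [1, 3, 2, 2, 1, 1, 3]
code = encode(seq, pmf)
print(code, decode(code, pmf, len(seq)) == seq)
```

Note that the codeword has $\lceil \log \frac{1}{p(x^n)} \rceil + 1$ bits, i.e., within 2 bits of the ideal $\log \frac{1}{p(x^n)}$, which is exactly what drives the $H(X) + 2/n$ bound in part (h).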

3. Generate Discrete Distribution from Fair Coin Tosses. Suppose we have a fair coin which we can toss infinitely many times. Our target is to use this coin to generate some random variable $X$ which follows a desired discrete distribution $P$.

(a) Can you generate the discrete distribution $(\frac{1}{4}, \frac{3}{4})$? How many coin tosses do you need?

(b) Can you generate the discrete distribution $(\frac{1}{3}, \frac{2}{3})$? How many coin tosses do you need in expectation? Can you generate it within a fixed number of coin tosses?

(c) For general (possibly irrational) $p \in (0, 1)$, can you generate the discrete distribution $(p, 1 - p)$? How many coin tosses do you need in expectation? (Hint: consider the binary expansion of $p$.)

(d) For general discrete distributions $P = (p_1, \ldots, p_m)$, propose a scheme to generate $P$ using fair coin tosses. (Hint: understanding Shannon-Fano-Elias coding from Q2 might be useful.)

(e) (Bonus) In the setting of (d), propose a scheme to generate $P$ with the required number of coin tosses $T$ satisfying
$$E[T] \leq H(P) + 2 = -\sum_{i=1}^m p_i \log p_i + 2.$$

(f) Show that for any scheme that generates $P = (p_1, \ldots, p_m)$, the required number of coin tosses $T$ must satisfy $E[T] \geq H(P)$.

Solution: Generate Discrete Distribution from Fair Coin Tosses.

(a) Toss the coin twice, output $X = 1$ if the outcome is HH, and output $X = 2$ otherwise. Obviously $X$ follows the discrete distribution $(\frac{1}{4}, \frac{3}{4})$, and we need 2 tosses.

(b) Toss the coin twice, output $X = 1$ if the outcome is HH, output $X = 2$ if the outcome is HT or TH, and repeat the process if the outcome is TT. Then
$$P(X = 1) = \frac{P(HH)}{P(HH) + P(TH) + P(HT)} = \frac{1}{3}.$$
The number of repetitions follows a geometric distribution with success probability $\frac{3}{4}$, so the expected number of tosses is $2 \cdot \frac{4}{3} = \frac{8}{3}$. We cannot generate it within a fixed number (say, $N$) of coin tosses: the probability of each outcome would then have to be an integer multiple of $2^{-N}$, while $\frac{1}{3}$ is not.

(c) Consider the binary expansion $p = 0.a_1 a_2 a_3 \ldots$. Let $b_i \in \{0, 1\}$ be the $i$-th coin toss outcome, define the stopping time $T$ by
$$T = \min\{n : 0.b_1 b_2 \cdots b_n \neq 0.a_1 a_2 \cdots a_n\},$$
and output $X = 1$ if $0.b_1 b_2 \cdots b_T < 0.a_1 a_2 \cdots a_T$, and $X = 2$ otherwise. One can think of this process as tossing the coin infinitely many times to generate $U = 0.b_1 b_2 b_3 \ldots$, and stopping as soon as we are sure whether $U < p$ or $U > p$. Clearly $U$ follows the uniform distribution on $[0, 1]$, so $P(X = 1) = P(U < p) = p$, as desired. For the expected number of coin tosses, note that $T > n$ iff $0.b_1 b_2 \cdots b_n = 0.a_1 a_2 \cdots a_n$, which occurs with probability $2^{-n}$. Hence,
$$E[T] = \sum_{n=0}^{\infty} P(T > n) = \sum_{n=0}^{\infty} 2^{-n} = 2.$$

(d) Similar to (c), we consider $s_i = \sum_{j=1}^{i} p_j$ for $i = 0, 1, \ldots, m$, and toss the coin infinitely many times. We stop as soon as we are sure that $U \in (s_{i-1}, s_i)$ for some $i = 1, \ldots, m$, and output $X = i$ in that case. Clearly, $P(X = i) = P(s_{i-1} < U < s_i) = s_i - s_{i-1} = p_i$. Algorithmically, we stop at the first time $T$ at which $0.b_1 b_2 \cdots b_T$ does not coincide with the first $T$ bits of any of $s_1, \ldots, s_{m-1}$.

(e) We show that the scheme in (d) satisfies the weaker inequality $E[T] \leq H(X) + 3$. By (d), $T > n$ if and only if $0.b_1 b_2 \cdots b_n$ coincides with the first $n$ bits of at least one of $s_1, \ldots, s_{m-1}$, which occurs with probability $\frac{A_n}{2^n}$, where $A_n$ is the number of distinct first $n$-bit strings of $s_1, \ldots, s_{m-1}$. We can upper bound $A_n$ as
$$A_n \leq 2^n - \sum_{i=1}^m \max\{\lfloor 2^n s_i \rfloor - \lfloor 2^n s_{i-1} \rfloor - 1,\ 0\} \leq 2^n - \sum_{i=1}^m \max\{\lfloor 2^n p_i \rfloor - 1,\ 0\},$$
where the last inequality follows from $\lfloor x + y \rfloor \geq \lfloor x \rfloor + \lfloor y \rfloor$. As a result (using $\sum_{i=1}^m p_i = 1$ to split the bound symbol by symbol),
$$E[T] = \sum_{n=0}^{\infty} P(T > n) = \sum_{n=0}^{\infty} \frac{A_n}{2^n} \leq \sum_{i=1}^m \sum_{n=0}^{\infty} \left( p_i - \frac{\max\{\lfloor 2^n p_i \rfloor - 1,\ 0\}}{2^n} \right).$$
The inner sum can be further upper bounded (each term is at most $p_i$, and for $n \geq \lceil \log \frac{1}{p_i} \rceil + 1$ each term is at most $\frac{2}{2^n}$) as
$$\sum_{n=0}^{\infty} \left( p_i - \frac{\max\{\lfloor 2^n p_i \rfloor - 1,\ 0\}}{2^n} \right) \leq \sum_{n=0}^{\lceil \log \frac{1}{p_i} \rceil} p_i + \sum_{n=\lceil \log \frac{1}{p_i} \rceil + 1}^{\infty} \frac{2}{2^n} = p_i \left( \left\lceil \log \frac{1}{p_i} \right\rceil + 1 \right) + 2^{1 - \lceil \log \frac{1}{p_i} \rceil}.$$
Consider the function $f(p) = p \log \frac{1}{p} + 3p - (n+1)p - 2^{-(n-1)}$ on $[2^{-n}, 2^{-n+1})$ (where $n$ is an integer); clearly $f(p)$ is concave and thus quasi-concave in $p$. As a result, $f(p) \geq \min\{f(2^{-n}), f(2^{-n+1})\} = 0$, implying that
$$p_i \left( \left\lceil \log \frac{1}{p_i} \right\rceil + 1 \right) + 2^{1 - \lceil \log \frac{1}{p_i} \rceil} \leq p_i \log \frac{1}{p_i} + 3 p_i.$$
Hence,
$$E[T] \leq \sum_{i=1}^m \sum_{n=0}^{\infty} \left( p_i - \frac{\max\{\lfloor 2^n p_i \rfloor - 1,\ 0\}}{2^n} \right) \leq \sum_{i=1}^m \left( p_i \log \frac{1}{p_i} + 3 p_i \right) = H(X) + 3.$$
For a better bound with a different scheme, see Theorem 5.11.3 and the preceding discussion in Cover & Thomas.

(f) For any scheme, consider the shortest coin sequence $c_i$, with length $l_i$, under which $X = i$ is outputted. The sequences $c_1, \ldots, c_m$ must be prefix-free (why?). Since entropy serves as a lower bound for the average length of any prefix code, we have
$$E[T] \geq \sum_{i=1}^m p_i l_i \geq H(X).$$
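As an empirical illustration of the scheme in (d) and of the bounds $H(P) \leq E[T] \leq H(P) + 3$, the sketch below (illustrative only; the function name and the test PMF are assumptions, not part of the graded solution) tracks the dyadic interval known to contain $U = 0.b_1 b_2 \ldots$ and stops once that interval sits inside a single $(s_{i-1}, s_i)$. Floating-point arithmetic stands in for the exact binary expansions, which is adequate for this rough check.

```python
import random
from math import log2

def sample_from_fair_coin(P):
    """Sample X in {1,...,m} with P(X = i) = P[i-1]; return (X, number of tosses T)."""
    s = [0.0]
    for p in P:                              # cumulative sums s_0, s_1, ..., s_m
        s.append(s[-1] + p)
    low, high, tosses = 0.0, 1.0, 0
    while True:
        tosses += 1
        b = random.getrandbits(1)            # one fair coin toss
        mid = (low + high) / 2
        low, high = (mid, high) if b else (low, mid)
        for i in range(1, len(s)):
            if s[i - 1] <= low and high <= s[i]:   # sure that U lies in (s_{i-1}, s_i)
                return i, tosses

random.seed(0)
P = [0.5, 0.2, 0.2, 0.1]
H = -sum(p * log2(p) for p in P)
runs = [sample_from_fair_coin(P) for _ in range(200000)]
freq1 = sum(x == 1 for x, _ in runs) / len(runs)
avg_T = sum(t for _, t in runs) / len(runs)
print(f"P(X=1) estimate = {freq1:.3f} (target 0.5)")
print(f"H(P) = {H:.3f} <= empirical E[T] = {avg_T:.3f} <= H(P) + 3 = {H + 3:.3f}")
```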

4. Channel capacity. Find the capacity of the following channels with probability transition matrices:

(a) $\mathcal{X} = \mathcal{Y} = \{0, 1, 2\}$,
$$p(y|x) = \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{bmatrix}$$

(b) $\mathcal{X} = \mathcal{Y} = \{0, 1, 2\}$,
$$p(y|x) = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 0 & 1/2 & 1/2 \\ 1/2 & 0 & 1/2 \end{bmatrix}$$

(c) $\mathcal{X} = \mathcal{Y} = \{0, 1\}$ (the Z-channel),
$$p(y|x) = \begin{bmatrix} 1 & 0 \\ 1/2 & 1/2 \end{bmatrix}$$

Solution: Channel capacity.

(a) This is a symmetric channel, and by the known result for symmetric channels (Section 7.2 in Cover & Thomas),
$$C = \log |\mathcal{Y}| - H(\mathbf{r}) = \log 3 - \log 3 = 0,$$
where $\mathbf{r}$ denotes a row of the transition matrix. In this case, the output is independent of the input.

(b) Again the channel is symmetric:
$$C = \log |\mathcal{Y}| - H(\mathbf{r}) = \log 3 - \log 2 = 0.58 \text{ bits}.$$

(c) First we express $I(X; Y)$, the mutual information between the input and output of the Z-channel, as a function of $\alpha = \Pr(X = 1)$:
$$H(Y|X) = \Pr(X = 0) \cdot 0 + \Pr(X = 1) \cdot 1 = \alpha,$$
$$H(Y) = H(\Pr(Y = 1)) = H(\alpha/2),$$
$$I(X; Y) = H(Y) - H(Y|X) = H(\alpha/2) - \alpha.$$
Since $I(X; Y)$ is strictly concave in $\alpha$ (why?) and $I(X; Y) = 0$ when $\alpha = 0$ and $\alpha = 1$, the maximum mutual information is obtained for some value of $\alpha$ such that $0 < \alpha < 1$. Using elementary calculus, we determine that
$$\frac{d}{d\alpha} I(X; Y) = \frac{1}{2} \log \frac{1 - \alpha/2}{\alpha/2} - 1,$$
which is equal to zero for $\alpha = 2/5$. (It is reasonable that $\Pr(X = 1) < 1/2$, since $X = 1$ is the noisy input to the channel.) So the capacity of the Z-channel in bits is $H(1/5) - 2/5 = 0.722 - 0.4 = 0.322$.
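As a rough numerical cross-check of part (c) (illustrative only, not part of the graded solution), one can maximize $I(\alpha) = H(\alpha/2) - \alpha$ over a grid of $\alpha$ values and compare with the closed-form answer $H(1/5) - 2/5$ at $\alpha = 2/5$:

```python
from math import log2

def Hb(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

def I(alpha):
    """Mutual information of the Z-channel when Pr(X = 1) = alpha."""
    return Hb(alpha / 2) - alpha

steps = 10**5
alpha_star = max((i / steps for i in range(steps + 1)), key=I)
print(alpha_star)                      # approximately 2/5
print(I(alpha_star), Hb(1/5) - 2/5)    # both approximately 0.322 bits
```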

5. Choice of channels. Let $\mathcal{C}_1 = (\mathcal{X}_1, p_1(y_1|x_1), \mathcal{Y}_1)$ and $\mathcal{C}_2 = (\mathcal{X}_2, p_2(y_2|x_2), \mathcal{Y}_2)$ be two channels with capacities $C_1$ and $C_2$, respectively. Assume the output alphabets are distinct and do not intersect. Consider a channel $\mathcal{C}$ which is the union of the two channels $\mathcal{C}_1$ and $\mathcal{C}_2$: at each time, one can send a symbol over $\mathcal{C}_1$ or over $\mathcal{C}_2$, but not both.

(a) Let
$$X = \begin{cases} X_1 & \text{with probability } \alpha, \\ X_2 & \text{with probability } 1 - \alpha, \end{cases}$$
where $X_1$ and $X_2$ are random variables taking values in $\mathcal{X}_1$ and $\mathcal{X}_2$, respectively. Then show that
$$I(X; Y) = H_b(\alpha) + \alpha I(X_1; Y_1) + (1 - \alpha) I(X_2; Y_2),$$
where $H_b(\alpha)$ represents the binary entropy corresponding to $\alpha$.

(b) Let $C$ be the capacity of the channel $\mathcal{C}$. Use the result in part (a) to show that $2^C = 2^{C_1} + 2^{C_2}$.

(c) Let $C_1 = C_2$. Then show that $C = C_1 + 1$, and give an intuitive explanation.

(d) Calculate the capacity of the following channel:

[Channel diagram for part (d): inputs $\{0, 1, 2\}$ and outputs $\{0, 1, 2\}$. Inputs 0 and 1 pass through a BSC with crossover probability $p$ (output equals input with probability $1 - p$ and is flipped with probability $p$), while input 2 is received as output 2 with probability 1.]

Solution: Choice of channels.

(a) Let
$$\theta = \begin{cases} 1 & \text{if } X = X_1, \\ 2 & \text{if } X = X_2. \end{cases}$$
Since the output alphabets $\mathcal{Y}_1$ and $\mathcal{Y}_2$ are disjoint, $\theta$ is a function of $Y$ as well, i.e., $X \to Y \to \theta$. By the chain rule,
$$I(X; Y, \theta) = I(X; \theta) + I(X; Y | \theta) = I(X; Y) + I(X; \theta | Y).$$
Since $X \to Y \to \theta$, we have $I(X; \theta | Y) = 0$. Therefore,
$$I(X; Y) = I(X; \theta) + I(X; Y | \theta) = H(\theta) - H(\theta | X) + \alpha I(X_1; Y_1) + (1 - \alpha) I(X_2; Y_2) = H_b(\alpha) + \alpha I(X_1; Y_1) + (1 - \alpha) I(X_2; Y_2).$$

(b) It follows from (a) that
$$C = \sup_{\alpha} \left\{ H_b(\alpha) + \alpha C_1 + (1 - \alpha) C_2 \right\}.$$
Maximizing over $\alpha$, one gets the desired result. The maximum occurs where $H_b'(\alpha) + C_1 - C_2 = 0$, i.e., at $\alpha = 2^{C_1}/(2^{C_1} + 2^{C_2})$. (A numerical check of this identity appears after part (d) below.)

(c) The result follows from (b). Intuitively, if we have two identical channels, then in each transmission we can transmit one extra bit through our choice of the channel. As an extreme example, suppose both channels are BSC(0.5) channels. Then $C_1 = C_2 = 0$, but $C = 1$. This is because we have two possible channels and we can communicate 1 bit/transmission by sending through channel 1 if the input bit is 0 and sending through channel 2 if the input bit is 1.

(d) This channel consists of the union of a BSC($p$) and a zero-capacity channel. Thus
$$C = \log \left( 2^{1 - H(p)} + 1 \right).$$
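The short sketch below (illustrative only; the transition matrix $W$ is written out from the diagram above, and the helper names are assumptions, not part of the graded solution) numerically checks the identity in part (b) for a pair of sample capacities, and then recovers the part (d) formula by brute-force maximization of $I(X; Y)$ over input distributions on $\{0, 1, 2\}$:

```python
from math import log2

def Hb(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

def mutual_information(px, W):
    """I(X;Y) in bits for input pmf px and row-stochastic transition matrix W."""
    py = [sum(px[i] * W[i][j] for i in range(len(px))) for j in range(len(W[0]))]
    return sum(px[i] * W[i][j] * log2(W[i][j] / py[j])
               for i in range(len(px)) for j in range(len(W[0]))
               if px[i] > 0 and W[i][j] > 0)

# Part (b): sup_alpha { H_b(alpha) + alpha*C1 + (1-alpha)*C2 } vs. log2(2^C1 + 2^C2).
C1, C2 = 0.322, 0.585                 # e.g. the two nonzero capacities found in Problem 4
steps = 10**5
sup_val = max(Hb(a / steps) + (a / steps) * C1 + (1 - a / steps) * C2
              for a in range(steps + 1))
print(sup_val, log2(2**C1 + 2**C2))   # agree up to the grid resolution

# Part (d): BSC(p) on inputs {0,1}, plus input 2 received noiselessly as output 2.
p = 0.2
W = [[1 - p, p, 0.0],
     [p, 1 - p, 0.0],
     [0.0, 0.0, 1.0]]
grid = 200
best = max(mutual_information([a / grid, b / grid, 1 - (a + b) / grid], W)
           for a in range(grid + 1) for b in range(grid + 1 - a))
print(best, log2(2**(1 - Hb(p)) + 1))  # both approximately 1.146 bits for p = 0.2
```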