MASSACHUSETTS INSTITUTE OF TECHNOLOGY
Department of Electrical Engineering and Computer Science
6.441 Transmission of Information
Spring 2006

Homework 2 Solution
name username
April 4, 2006

Reading: Chapters 3 to 5 of Cover.

1. Initial conditions [Prob. 7 on p. 75 of Cover]. Show, for a Markov chain, that
   $$H(X_0 \mid X_n) \ge H(X_0 \mid X_{n-1}).$$
   Thus the initial condition $X_0$ becomes more difficult to recover as the future $X_n$ unfolds.

2. AEP [modified Prob. 5 on p. 58 of Cover]. Let $X_1, X_2, \dots$ be independent, identically distributed random variables drawn according to the probability mass function $p(x)$, $x \in \{1, 2, \dots, m\}$. Thus $p(x_1, x_2, \dots, x_n) = \prod_{i=1}^n p(x_i)$. We know that $-\frac{1}{n} \log p(X_1, \dots, X_n) \to H(X)$ in probability. Let $q(x_1, \dots, x_n) = \prod_{i=1}^n q(x_i)$, where $q$ is another probability mass function on $\{1, 2, \dots, m\}$.

   (a) Evaluate $\lim -\frac{1}{n} \log q(X_1, \dots, X_n)$, where $X_1, X_2, \dots$ are i.i.d. $\sim p(x)$.

   (b) Now evaluate the limit of the log likelihood ratio $\frac{1}{n} \log \frac{q(X_1, \dots, X_n)}{p(X_1, \dots, X_n)}$ when $X_1, X_2, \dots$ are i.i.d. $\sim p(x)$. Thus the odds favoring $q$ are exponentially small when $p$ is true. Give and interpret a scenario in which the log likelihood ratio is infinite.

3. Random box size [Prob. 6 on p. 58 of Cover]. An $n$-dimensional rectangular box with sides $X_1, X_2, X_3, \dots, X_n$ is to be constructed. The volume is $V_n = \prod_{i=1}^n X_i$. The edge length $l$ of an $n$-cube with the same volume as the random box is $l = V_n^{1/n}$. Let $X_1, X_2, \dots$ be i.i.d. uniform random variables over the unit interval $[0, 1]$. Find $\lim_{n \to \infty} V_n^{1/n}$ and compare it to $(E V_n)^{1/n}$. Clearly, the expected edge length does not capture the idea of the volume of the box.
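The limit in Problem 3 can be anticipated: by the strong law of large numbers, $\frac{1}{n} \sum_i \ln X_i \to E[\ln X_1] = -1$, so $V_n^{1/n} \to e^{-1} \approx 0.368$, while $E V_n = 2^{-n}$ gives $(E V_n)^{1/n} = 1/2$. A minimal numerical sketch (the sample size and seed are arbitrary choices):

```python
import math
import random

random.seed(0)
n = 100_000

# Work in the log domain: V_n itself underflows to 0.0 for large n,
# but (1/n) * sum(ln X_i) converges to E[ln X] = -1 by the strong law.
log_volume = sum(math.log(random.random()) for _ in range(n))
edge = math.exp(log_volume / n)      # V_n^{1/n}, should approach e^{-1}

expected_edge = 0.5                  # (E V_n)^{1/n} = ((1/2)^n)^{1/n} = 1/2

print(edge, math.exp(-1), expected_edge)
```

The gap between $e^{-1} \approx 0.37$ and $1/2$ is exactly the point of the problem: $E V_n$ is dominated by rare boxes with atypically large volume.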
4. The AEP and source coding [Prob. 3 on p. 57 in Cover]. (optional) A discrete memoryless source emits a sequence of statistically independent binary digits with probabilities $p(1) = 0.005$ and $p(0) = 0.995$. The digits are taken 100 at a time, and a binary codeword is provided for every sequence of 100 digits containing three or fewer ones.

   (a) Assuming that all codewords are the same length, find the minimum length required to provide codewords for all sequences with three or fewer ones.

   (b) Calculate the probability of observing a source sequence for which no codeword has been assigned.

   (c) Use Chebyshev's inequality to bound the probability of observing a source sequence for which no codeword has been assigned. Compare this bound with the actual probability computed in part (b).

5. Proof of Theorem 3.3.1 [modified Prob. 7 on p. 58 of Cover]. Let $X_1, X_2, \dots, X_n$ be i.i.d. $\sim p(x)$. Let $B_\delta^{(n)} \subseteq \mathcal{X}^n$ be such that $\Pr(B_\delta^{(n)}) > 1 - \delta$, $\delta \in [0, 1)$.

   (a) Given any two sets $A$, $B$ such that $\Pr(A) > 1 - \epsilon_1$ and $\Pr(B) > 1 - \epsilon_2$, show that $\Pr(A \cap B) > 1 - \epsilon_1 - \epsilon_2$. Hence, $\Pr(A_\epsilon^{(n)} \cap B_\delta^{(n)}) > 1 - \epsilon - \delta$.

   (b) Using the above result, show that
   $$(\forall \epsilon > 0)(\forall \delta \in [0, 1))(\forall B_\delta^{(n)})(\exists N \in \mathbb{N})(\forall n > N) \quad \frac{1}{n} \log |B_\delta^{(n)}| > H - \epsilon.$$
   Hence,
   $$(\forall \delta \in [0, 1)) \quad \min\left\{ |B_\delta^{(n)}| : B_\delta^{(n)} \subseteq \mathcal{X}^n,\ \Pr(B_\delta^{(n)}) > 1 - \delta \right\} \doteq |A_\epsilon^{(n)}| \quad \text{as } \epsilon \to 0.$$

6. Entropy rates of Markov chains [Prob. 5 on p. 74 of Cover]. (optional)

   (a) Find the entropy rate of the two-state Markov chain with transition matrix
   $$P = \begin{bmatrix} 1 - p_{01} & p_{01} \\ p_{10} & 1 - p_{10} \end{bmatrix}.$$
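The entropy rate of a stationary Markov chain is $\sum_i \mu_i H(P_{i\cdot})$, so for the chain of part (a) it is $\mu_0 H_b(p_{01}) + \mu_1 H_b(p_{10})$, where $H_b$ is the binary entropy function and $\mu = (p_{10}, p_{01})/(p_{01} + p_{10})$ is the stationary distribution. A sketch that cross-checks this closed form against exhaustively computed block entropies (the values $p_{01} = 0.3$, $p_{10} = 0.6$ are arbitrary example parameters):

```python
from itertools import product
from math import log2

p01, p10 = 0.3, 0.6                  # arbitrary example transition probabilities
P = {(0, 0): 1 - p01, (0, 1): p01, (1, 0): p10, (1, 1): 1 - p10}
mu = (p10 / (p01 + p10), p01 / (p01 + p10))   # stationary distribution

def h_bin(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

# Closed form: H = mu_0 * H_b(p01) + mu_1 * H_b(p10)
rate_formula = mu[0] * h_bin(p01) + mu[1] * h_bin(p10)

def block_entropy(k):
    """H(X_1, ..., X_k) for the stationary chain, by exhaustive enumeration."""
    h = 0.0
    for xs in product((0, 1), repeat=k):
        pr = mu[xs[0]]
        for a, b in zip(xs, xs[1:]):
            pr *= P[(a, b)]
        h -= pr * log2(pr)
    return h

# For a stationary Markov chain, H(X^k) - H(X^{k-1}) = H(X_k | X_{k-1})
# exactly, so the difference should match the closed form.
rate_diff = block_entropy(12) - block_entropy(11)
print(rate_formula, rate_diff)
```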
   (b) What values of $p_{01}$, $p_{10}$ maximize the rate of part (a)?

   (c) Find the entropy rate of the two-state Markov chain with transition matrix
   $$P = \begin{bmatrix} 1 - p & p \\ 1 & 0 \end{bmatrix}.$$

   (d) Find the maximum value of the entropy rate of the Markov chain of part (c). We expect that the maximizing value of $p$ should be less than $1/2$, since the 0 state permits more information to be generated than the 1 state.

   (e) Let $N(t)$ be the number of allowable state sequences of length $t$ for the Markov chain of part (c). Find $N(t)$ and calculate
   $$H_0 := \lim_{t \to \infty} \frac{1}{t} \log N(t).$$
   [Hint: Find a linear recurrence that expresses $N(t)$ in terms of $N(t-1)$ and $N(t-2)$.] Why is $H_0$ an upper bound on the entropy rate of the Markov chain? Compare $H_0$ with the maximum entropy found in part (d).

   (f) How could we optimally encode an arbitrary random source with the allowable state sequences, assuming that the AEP holds? What are the minimum and expected codeword lengths per source symbol in terms of the source entropy rate $H$ and the quantity $H_0$?

7. The entropy rate of a dog looking for a bone [Prob. 4 on p. 75 of Cover]. A dog walks on the integers, possibly reversing direction at each step with probability $p = 0.1$. Let $X_0 = 0$. The first step is equally likely to be positive or negative. A typical walk might look like this:
   $$(X_0, X_1, \dots) = (0, -1, -2, -3, -4, -3, -2, -1, 0, 1, \dots)$$

   (a) Find $H(X^n)$, where $X^n := (X_1, X_2, \dots, X_n)$.
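As a check on part (a): given $X^{i-1}$, the only uncertainty in $X_i$ (after the first step) is whether the dog reverses, so $H(X^n) = H(X_1) + (n-1) H_b(0.1) = 1 + (n-1) H_b(0.1)$ bits. Equivalently, $-\frac{1}{n} \log_2 p(X^n)$ should converge to $H_b(0.1) \approx 0.469$ by the AEP. A simulation sketch (seed and walk length are arbitrary; the path probability depends only on the reversal pattern, so positions need not be tracked):

```python
import math
import random

p = 0.1
h_b = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # binary entropy ~ 0.469 bits
# Closed form: H(X^n) = H(X_1) + (n-1) * h_b = 1 + (n-1) * h_b.

random.seed(2)
n = 500_000
log_p = math.log2(0.5)               # first step: +1 or -1 equally likely
for _ in range(n - 1):
    if random.random() < p:
        log_p += math.log2(p)        # the dog reverses direction
    else:
        log_p += math.log2(1 - p)    # the dog keeps going
aep_estimate = -log_p / n            # -> entropy rate H_b(0.1) by the AEP
print(aep_estimate, h_b)
```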
   (b) Find the entropy rate of this browsing dog.

   (c) What is the expected number of steps the dog takes before reversing direction?

8. General AEP. Consider the stochastic process $\{X_i\}_{i \in \mathbb{Z}^+}$.

   (a) Independent sequence. Suppose that the $X_i$ are independent (not necessarily identically distributed) with $\mathrm{Var}[\log p(X_i)]$ bounded above by some constant $\alpha$.

      i. Express the expectation $H_n := E[-\frac{1}{n} \log p(X^n)]$ in terms of the entropies $H(X_i)$.

      ii. Argue that the AEP holds if the limit $H := \lim_{n \to \infty} H_n$ exists. Construct an example where the limit $H$ does not exist (i.e. specify the sequence $\{H_n\}$ in terms of $n$ such that the average does not converge).

      iii. Consider the $X_i$ to be deterministically 0 if $i \in \bigcup_{k \in \mathbb{N}} \{2^{2k}, \dots, 2^{2k+1}\}$, and i.i.d. Bernoulli random variables with probability $1/2$ otherwise. Give $H$ if it exists. If not, briefly explain why.

   (b) (difficult) Hidden Markov. Let $\{S_i\}_{i \in \mathbb{Z}^+}$ be the stationary 2-state Markov process with transition probabilities $p_{01} = 2 p_{10} = 0.5$. If $S_i = 0$, the $X_i$ are independent Bernoulli distributed random variables with probability $1/2$. Otherwise, $X_i$ is deterministically 0.5. Calculate the entropy rate of $\{X_i\}$. [Hint: apply the chain rule to $H(X^n, S^n)$ and see if any term is 0.]

   (c) Briefly explain whether the following statements are true/false.

      i. The optimal compression scheme that minimizes the expected codeword length for a general stochastic process is not known if the process does not satisfy the AEP.
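For part (a)iii above, each symbol outside the deterministic blocks contributes one bit, so $H_n$ is simply the fraction of indices in $\{1, \dots, n\}$ that lie outside the blocks. A bookkeeping sketch (treating the blocks $\{2^{2k}, \dots, 2^{2k+1}\}$, $k = 0, 1, \dots$, as the deterministic indices, which is my reading of the problem) shows $H_n$ oscillating between roughly $1/3$ and $2/3$, so the limit $H$ does not exist:

```python
def h_n(n):
    """H_n = (1/n) * sum_{i<=n} H(X_i): each Bernoulli(1/2) index contributes
    1 bit, each deterministic index (in some block {4^k, ..., 2*4^k}) 0 bits."""
    det = 0
    k = 0
    while 4 ** k <= n:
        det += min(2 * 4 ** k, n) - 4 ** k + 1   # block k clipped to [1, n]
        k += 1
    return (n - det) / n

# Evaluate H_n at the ends of deterministic blocks (local minima, -> 1/3)
# and at the starts of blocks (local maxima, -> 2/3).
lows = [h_n(2 * 4 ** k) for k in range(3, 9)]
highs = [h_n(4 ** k) for k in range(3, 9)]
print(lows, highs)
```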
      ii. (See the historical notes on p. 59 of Cover.) The most general process for which the AEP holds is the continuous stationary ergodic process, and the type of convergence is almost sure convergence.

   (d) (difficult and optional) Random length. Let $\{L_i\}_{i \in \mathbb{Z}^+}$ be a set of i.i.d. geometric random variables with probability $1/2$. $X_j$ is a Bernoulli random variable with probability $1/L_k$ for $j \in [1 + \sum_{i=1}^{k-1} L_i, \sum_{i=1}^{k} L_i]$, where $k \in \mathbb{Z}^+$. Briefly explain what is wrong with the following calculation of the entropy rate of $\{X_i\}$, assuming that the AEP holds for this process:

   "Consider $L^m$. As $m \to \infty$, there will almost surely be a fraction $2^{-l}$ of the $L_i$ taking value $l$. Then there will almost surely be a fraction $\frac{l 2^{-l}}{\sum_{l > 0} l 2^{-l}}$ of the $X_i$ being Bernoulli($1/l$) random variables. Hence, we can obtain the entropy rate as follows:
   $$H(X) = \frac{\sum_{l > 0} l 2^{-l} H[1/l]}{\sum_{l > 0} l 2^{-l}} = \frac{1}{2} \sum_{l > 0} l 2^{-l} H[1/l] \approx 0.447."$$

9. Information divergence. Let $\{X_i\}_{i \in \mathbb{Z}^+}$ be a set of i.i.d. random variables with probability distribution $Q$, which is unknown. We would like to compress $\{X_i\}$ by defining the hypothesized typical set $A_\epsilon^{(n)}$ based on the estimate $P_{x^n}$ of $Q$ after observing the $n$-sequence $x^n$.

   (a) Suppose the $X_i$ are Bernoulli random variables, where the actual generating distribution is $Q := \mathrm{Bernoulli}(q)$, the hypothesized distribution based on $x^n$ is $P_{x^n} := \mathrm{Bernoulli}(p := \sum_i x_i / n)$, and the typical set defined using $P_{x^n}$ is
   $$A_\epsilon^{(n)} := \left\{ \tilde{x}^n : \left| -\frac{1}{n} \log P_{x^n}(\tilde{x}^n) - H[p] \right| \le \epsilon \right\}$$

      i. Show that $x^n \in A_\epsilon^{(n)}$.

      ii. If $p \ne 1/2$, show that as $\epsilon \to 0$,
      $$\Pr(A_\epsilon^{(n)}) := Q(A_\epsilon^{(n)}) \doteq 2^{-n D[p \| q]}.$$
      Briefly explain whether this result holds for $p = 1/2$.

   (b) (difficult and optional) Strong typicality. Consider a general distribution $Q$ on a finite support set $\mathcal{X}$. Let $N(\alpha; x^n)$ be the number of occurrences of $\alpha \in \mathcal{X}$ in the sample sequence $x^n$. We
define the distribution $P_{x^n} : \alpha \in \mathcal{X} \mapsto \frac{N(\alpha; x^n)}{n}$ to be the type (or empirical distribution) of $x^n$. We define $A_\epsilon^{(n)}$ to be the strong typical set based on $P_{x^n}$ as follows:
$$A_\epsilon^{(n)} := \left\{ \tilde{x}^n : \forall \alpha \in \mathcal{X},\ \left| P_{\tilde{x}^n}(\alpha) - P_{x^n}(\alpha) \right| < \frac{\epsilon}{|\mathcal{X}|} \right\}$$
As a side note, the strong typical set has all the desired properties of the (weak) typical set. With this stronger version, the following parts will generalize the interpretation of information divergence beyond the Bernoulli distribution, and avoid the problem raised in part (a)ii.

   i. Define $\mathcal{P}_n := \{ P : \exists x^n \in \mathcal{X}^n,\ P = P_{x^n} \}$ to be the class of all types. Show that
   $$\binom{n + |\mathcal{X}| - 1}{|\mathcal{X}| - 1} = |\mathcal{P}_n| \le (n + 1)^{|\mathcal{X}|}.$$

   ii. Define $T_P := \{ x^n \in \mathcal{X}^n : P_{x^n} = P \}$ to be the type class of $P \in \mathcal{P}_n$. Show that
   $$P_{x^n}(T_{P_{x^n}}) = \max_{P \in \mathcal{P}_n} P_{x^n}(T_P) \ge \frac{1}{|\mathcal{P}_n|}.$$

   iii. Show using the previous result that
   $$|T_{P_{x^n}}| = P_{x^n}(T_{P_{x^n}})\, 2^{n H[P_{x^n}]} \doteq 2^{n H[P_{x^n}]}.$$

   iv. Finally, show that
   $$Q(T_{P_{x^n}}) \doteq 2^{-n D[P_{x^n} \| Q]},$$
   and so $\Pr(A_\epsilon^{(n)}) := Q(A_\epsilon^{(n)}) \doteq 2^{-n D[P_{x^n} \| Q]}$ as $\epsilon \to 0$.
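The counting claims in parts (i) and (ii) can be verified exhaustively for a tiny alphabet, and the exponent in part (iv) can be checked numerically for a Bernoulli type class (the exponent of $\binom{m}{k} q^k (1-q)^{m-k}$ approaches $D[P \| Q]$, with an $O(\frac{\log m}{m})$ polynomial gap). All concrete values below ($\mathcal{X}$, $n$, $m$, $k$, $q$) are arbitrary choices for illustration:

```python
from itertools import product
from math import comb, factorial, log2

# -- parts (i)-(ii): exhaustive check on a tiny alphabet --
X = (0, 1, 2)        # |X| = 3
n = 6

# Enumerate all types: distinct count vectors over length-n sequences.
types = {tuple(xs.count(a) for a in X) for xs in product(X, repeat=n)}
n_types = len(types)   # stars and bars: C(n+|X|-1, |X|-1), and <= (n+1)^|X|

def own_type_prob(counts):
    """P_{x^n}(T_{P_{x^n}}): probability of a type class under its own type."""
    size = factorial(n)
    for c in counts:
        size //= factorial(c)         # |T_P| is a multinomial coefficient
    prob = float(size)
    for c in counts:
        prob *= (c / n) ** c          # note 0**0 == 1 in Python
    return prob

min_prob = min(own_type_prob(c) for c in types)   # part (ii): >= 1/|P_n|

# -- part (iv): exponent of Q(T_P) for a Bernoulli example --
m, k, q = 1000, 300, 0.5
p = k / m
kl = p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))   # D[P || Q] in bits
q_type_class = comb(m, k) * q**k * (1 - q)**(m - k)        # exact Q(T_P)
exponent = -log2(q_type_class) / m                          # ~ D[P || Q]
print(n_types, min_prob, exponent, kl)
```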