EE376A: Homework #4 Solutions
Due on Thursday, February 22, 2018. Please submit on Gradescope. Start every question on a new page.

1. Maximum Differential Entropy

(a) Show that among all distributions supported in an interval $[a,b]$, the uniform distribution maximizes differential entropy.

(b) Let $X$ be a continuous random variable with $E[X^4] \le \sigma^4$ and let $Y$ be a continuous random variable with probability density function $g(y) = c \exp\left(-\frac{y^4}{4\sigma^4}\right)$, where $c = \frac{1}{\int \exp\left(-\frac{y^4}{4\sigma^4}\right)dy}$. Show that $h(X) \le h(Y)$, with equality if and only if $X$ is distributed as $Y$. [Hint: you can use the fact that $E[Y^4] = \sigma^4$.]

Solution: Maximum Differential Entropy

(a) Denote by $u(x)$ the uniform density on $[a,b]$, i.e. $u(x) = \frac{1}{b-a}$ if $x \in [a,b]$ and $u(x) = 0$ otherwise. Let $g(x)$ be any density supported in the interval $[a,b]$. Then we have
\[
0 \le D(g\|u) = \int g(x) \log\frac{g(x)}{u(x)}\,dx = \int g(x) \log\big((b-a)g(x)\big)\,dx = \log(b-a) + \int g(x)\log g(x)\,dx = \log(b-a) - h(X),
\]
which implies $h(X) \le \log(b-a)$. On the other hand, note that if $X$ is uniformly distributed on $[a,b]$, we have
\[
h(X) = -\int u(x)\log u(x)\,dx = \int u(x)\log(b-a)\,dx = \log(b-a),
\]
which finishes the proof.
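As a quick numerical sanity check of part (a) (not part of the original solution), the sketch below evaluates the differential entropy of the triangular density $f(x) = x/2$ on $[0,2]$, an arbitrary example density, and confirms that it stays below $\log(b-a)$; logarithms are natural, so entropies are in nats.

```python
import numpy as np

# Numerical check of 1(a): any density on [a, b] has h(X) <= log(b - a).
# Example density (arbitrary choice): the triangular density f(x) = x/2 on [0, 2].
a, b = 0.0, 2.0
x = np.linspace(1e-9, b, 400001)       # start slightly above 0 to avoid 0 * log(0)
dx = x[1] - x[0]
f = x / 2.0
h = -np.sum(f * np.log(f)) * dx        # differential entropy (Riemann sum), in nats
print(f"h(X) = {h:.4f} nats  <=  log(b - a) = {np.log(b - a):.4f} nats")
```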
(b) Since $E[X^4] \le \sigma^4 = E[Y^4]$, we have
\[
0 \le D(f_X\|g) = E\left[\log\frac{f_X(X)}{g(X)}\right] = -h(X) + E[-\log g(X)] = -h(X) + E\left[-\log c + \frac{X^4}{4\sigma^4}\log e\right] \le -h(X) + E\left[-\log c + \frac{Y^4}{4\sigma^4}\log e\right] = -h(X) + E[-\log g(Y)] = -h(X) + h(Y).
\]
Therefore, $h(Y) \ge h(X) + D(f_X\|g) \ge h(X)$, with equality if and only if $D(f_X\|g) = 0$, i.e. $X$ is distributed as $Y$. (A numerical check of this bound is sketched after the statement of Problem 2 below.)

2. Cascaded BSCs. Consider the two discrete memoryless channels $(\mathcal{X}, p_1(y|x), \mathcal{Y})$ and $(\mathcal{Y}, p_2(z|y), \mathcal{Z})$. Let $p_1(y|x)$ and $p_2(z|y)$ be binary symmetric channels with crossover probabilities $\lambda_1$ and $\lambda_2$ respectively.

[Figure: cascade of two binary symmetric channels, $X \to Y \to Z$, with crossover probabilities $\lambda_1$ and $\lambda_2$.]

(a) What is the capacity $C_1$ of $p_1(y|x)$?

(b) What is the capacity $C_2$ of $p_2(z|y)$?

(c) We now cascade these channels. Thus $p_3(z|x) = \sum_y p_1(y|x)\,p_2(z|y)$. What is the capacity $C_3$ of $p_3(z|x)$?

(d) Now let us actively intervene between channels 1 and 2, rather than passively transmit $y^n$. What is the capacity of channel 1 followed by channel 2 if you are allowed to decode the output $y^n$ of channel 1 and then re-encode it as $\tilde{y}^n$ for transmission over channel 2? (Think $W \to x^n(W) \to y^n \to \tilde{y}^n(y^n) \to z^n \to \hat{W}$.)

(e) What is the capacity of the cascade in part (c) if the receiver can view both $Y$ and $Z$?
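As a numerical sanity check of the inequality $h(X) \le h(Y)$ from Problem 1(b) (a sketch, not part of the original solution): it takes $X$ to be a zero-mean Gaussian calibrated so that $E[X^4] = \sigma^4$, an arbitrary test case, and compares its differential entropy with that of the quartic density $g$ computed by numerical integration (natural logarithms throughout).

```python
import numpy as np

sigma = 1.0
y = np.linspace(-6.0, 6.0, 200001)           # exp(-y^4/4) is negligible beyond this range
dy = y[1] - y[0]
unnorm = np.exp(-y**4 / (4 * sigma**4))
c = 1.0 / (np.sum(unnorm) * dy)              # normalizing constant of g
g = c * unnorm
h_Y = -np.sum(g * np.log(g)) * dy            # h(Y) in nats (Riemann sum)

# Test case for X: zero-mean Gaussian with E[X^4] = 3 s^4 = sigma^4, i.e. s^2 = sigma^2 / sqrt(3)
s2 = sigma**2 / np.sqrt(3)
h_X = 0.5 * np.log(2 * np.pi * np.e * s2)    # Gaussian differential entropy in nats

print(f"h(X) = {h_X:.4f} nats, h(Y) = {h_Y:.4f} nats, h(X) <= h(Y): {h_X <= h_Y}")
```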
Solution: Cascaded BSCs

(a) $C_1$ is just the capacity of a BSC($\lambda_1$). Thus, $C_1 = 1 - H(\lambda_1)$.

(b) Similarly, $C_2 = 1 - H(\lambda_2)$.

(c) First observe that the cascaded channel is also a BSC. Since the new BSC has crossover probability $p_3 = \lambda_1(1-\lambda_2) + (1-\lambda_1)\lambda_2 = \lambda_1 + \lambda_2 - 2\lambda_1\lambda_2$, we get $C_3 = 1 - H(\lambda_1 + \lambda_2 - 2\lambda_1\lambda_2)$. Note that the new channel is noisier than either of the original two since, by concavity of $H(p)$,
\[
H\big((1-\lambda_1)\lambda_2 + \lambda_1(1-\lambda_2)\big) \ge \lambda_2 H(1-\lambda_1) + (1-\lambda_2)H(\lambda_1) = H(\lambda_1),
\]
and similarly $H(p_3) \ge H(\lambda_2)$. Thus, $C_3 \le \min\{C_1, C_2\}$.

(d) Since we are allowed to decode the intermediate outputs and re-encode them prior to the second transmission, any rate less than both $C_1$ and $C_2$ is achievable, and at the same time any rate greater than either $C_1$ or $C_2$ will cause $P_e^{(n)} \to 1$ exponentially. Hence, the overall capacity is the minimum of the two capacities, $\min(C_1, C_2) = \min(1 - H(\lambda_1),\, 1 - H(\lambda_2))$.

(e) Note that $Z$ becomes irrelevant once we observe $Y$. Thus, the capacity of this channel is just $C_1 = 1 - H(\lambda_1)$. Alternatively, $X \to Y \to (Y,Z)$ forms a Markov chain, so that $I(X;Y) \ge I(X;Y,Z)$. On the other hand, $I(X;Y) \le I(X;Y,Z)$ since we can always ignore the observation $Z$. (Or: $X \to (Y,Z) \to Y$ also forms a Markov chain.) Hence, $I(X;Y) = I(X;Y,Z)$ and the capacity in this case is $C_1$.

3. Tensor Power Trick

We have seen the proof of Kraft's inequality for uniquely decodable codes via the tensor power trick: we upper bound $\left(\sum_i 2^{-l_i}\right)^k$ and then let $k \to \infty$. This is a powerful tool in various problems (e.g., harmonic analysis) where some product structure is available. In this problem we look at another application in information theory.

Let $(X_1, Y_1), \dots, (X_n, Y_n) \sim (X,Y)$ be i.i.d. discrete random variables. For any $\epsilon > 0$, define the following $\epsilon$-typical sets:
\[
A_\epsilon^{(n)}(X) = \left\{(x^n, y^n) : \left|-\tfrac{1}{n}\log p(x^n) - H(X)\right| \le \epsilon\right\},
\]
\[
A_\epsilon^{(n)}(Y) = \left\{(x^n, y^n) : \left|-\tfrac{1}{n}\log p(y^n) - H(Y)\right| \le \epsilon\right\},
\]
\[
A_\epsilon^{(n)}(X,Y) = \left\{(x^n, y^n) : \left|-\tfrac{1}{n}\log p(x^n, y^n) - H(X,Y)\right| \le \epsilon\right\},
\]
and define $A_\epsilon^{(n)} = A_\epsilon^{(n)}(X) \cap A_\epsilon^{(n)}(Y) \cap A_\epsilon^{(n)}(X,Y)$.
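A small numerical check of parts (a)-(c) of Problem 2 (not part of the original solution); the crossover probabilities below are arbitrary example values.

```python
import numpy as np

def Hb(p):
    """Binary entropy in bits."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

lam1, lam2 = 0.1, 0.2                         # arbitrary example crossover probabilities
p3 = lam1 * (1 - lam2) + (1 - lam1) * lam2    # effective crossover probability of the cascade
C1, C2, C3 = 1 - Hb(lam1), 1 - Hb(lam2), 1 - Hb(p3)
print(f"C1 = {C1:.4f}, C2 = {C2:.4f}, C3 = {C3:.4f}, C3 <= min(C1, C2): {C3 <= min(C1, C2)}")
```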
(a) Show that $P\big((X^n, Y^n) \in A_\epsilon^{(n)}\big) \to 1$ as $n \to \infty$.

(b) Show that for $n$ large enough, we have
\[
(1-\epsilon)\,2^{n(H(X,Y)-\epsilon)} \le |A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}\,2^{n(H(Y)+\epsilon)}.
\]

(c) Conclude from (b) that $H(X,Y) \le H(X) + H(Y)$ by taking $n \to \infty$ and $\epsilon \to 0$. This gives another proof of $I(X;Y) \ge 0$ without using any convexity/concavity of mutual information and/or KL divergence.

Solution: Tensor Power Trick

(a) (From Lecture 9, courtesy of the scribes.) We apply the WLLN (convergence in probability) to each of the three conditions defining the jointly typical set. That is, there exist $n_1, n_2, n_3$ such that for all $n > n_1$ we have
\[
P\left(\left|-\tfrac{1}{n}\log p(X^n) - H(X)\right| > \epsilon\right) < \epsilon/3,
\]
for all $n > n_2$ we have
\[
P\left(\left|-\tfrac{1}{n}\log p(Y^n) - H(Y)\right| > \epsilon\right) < \epsilon/3,
\]
and for all $n > n_3$ we have
\[
P\left(\left|-\tfrac{1}{n}\log p(X^n, Y^n) - H(X,Y)\right| > \epsilon\right) < \epsilon/3.
\]
All three hold for $n$ greater than the largest of $n_1, n_2, n_3$. By the union bound, the probability that at least one of the three typicality conditions is violated is then less than $\epsilon$, so for $n$ sufficiently large the probability of the set $A_\epsilon^{(n)}$ is greater than $1-\epsilon$.

(b) (Lower bound from Lecture 9, courtesy of the scribes.)

Upper bound: First suppose we have $S_x \subseteq \mathcal{X}^n$ and $S_y \subseteq \mathcal{Y}^n$. Then
\[
S_x \times S_y = \{(x^n, y^n) : x^n \in S_x,\ y^n \in S_y\} \subseteq \mathcal{X}^n \times \mathcal{Y}^n
\]
and $|S_x \times S_y| = |S_x|\,|S_y|$. Now, define
\[
S_x = \left\{x^n : \left|-\tfrac{1}{n}\log p(x^n) - H(X)\right| \le \epsilon\right\}, \qquad
S_y = \left\{y^n : \left|-\tfrac{1}{n}\log p(y^n) - H(Y)\right| \le \epsilon\right\}.
\]
Then by the AEP, we know that $|S_x| \le 2^{n(H(X)+\epsilon)}$ and $|S_y| \le 2^{n(H(Y)+\epsilon)}$. Also observe that $S_x \times S_y = A_\epsilon^{(n)}(X) \cap A_\epsilon^{(n)}(Y)$, and hence we have
\[
|A_\epsilon^{(n)}| \le |A_\epsilon^{(n)}(X) \cap A_\epsilon^{(n)}(Y)| = |S_x \times S_y| = |S_x|\,|S_y| \le 2^{n(H(X)+\epsilon)}\,2^{n(H(Y)+\epsilon)}.
\]

Lower bound: By part (a), $P\big((X^n, Y^n) \in A_\epsilon^{(n)}\big) \ge 1-\epsilon$ for $n$ large enough. Since $A_\epsilon^{(n)} \subseteq A_\epsilon^{(n)}(X,Y)$, every $(x^n,y^n) \in A_\epsilon^{(n)}$ satisfies $p(x^n,y^n) \le 2^{-n(H(X,Y)-\epsilon)}$. Thus, for large $n$:
\[
1 - \epsilon \le P\big((X^n, Y^n) \in A_\epsilon^{(n)}\big) \le \sum_{(x^n, y^n) \in A_\epsilon^{(n)}} 2^{-n(H(X,Y)-\epsilon)} = 2^{-n(H(X,Y)-\epsilon)}\,|A_\epsilon^{(n)}|,
\]
which gives $|A_\epsilon^{(n)}| \ge (1-\epsilon)\,2^{n(H(X,Y)-\epsilon)}$.

(c) First take logarithms in the bounds of part (b) to get
\[
\log(1-\epsilon) + n(H(X,Y) - \epsilon) \le n(H(X) + \epsilon) + n(H(Y) + \epsilon).
\]
Dividing by $n$,
\[
\frac{\log(1-\epsilon)}{n} + (H(X,Y) - \epsilon) \le (H(X) + \epsilon) + (H(Y) + \epsilon).
\]
Letting $n \to \infty$ for fixed $\epsilon$,
\[
H(X,Y) - \epsilon \le (H(X) + \epsilon) + (H(Y) + \epsilon).
\]
This holds for all $\epsilon > 0$, so letting $\epsilon \to 0$ we get $H(X,Y) \le H(X) + H(Y)$.

4. Channel with uniformly distributed noise. Consider an additive channel whose input alphabet is $\mathcal{X} = \{-2, -1, 0, 1, 2\}$, and whose output is $Y = X + Z$, where $Z$ is uniformly distributed over the interval $[-1, 1]$. Thus the input of the channel is a discrete random variable, while the output is continuous. Calculate the capacity $C = \max_{p(x)} I(X;Y)$ of this channel.

Solution: Channel with uniformly distributed noise

We can expand the mutual information as
\[
I(X;Y) = h(Y) - h(Y|X) = h(Y) - h(Z),
\]
and $h(Z) = \log 2$, since $Z \sim U(-1, 1)$. The output $Y$ is the sum of a discrete and a continuous random variable, and if the probabilities of $X$ are $p_{-2}, p_{-1}, \dots, p_{2}$, then the output density of $Y$ is piecewise constant: it equals $p_{-2}/2$ for $-3 \le Y \le -2$, $(p_{-2}+p_{-1})/2$ for $-2 \le Y \le -1$, and so on. Given that $Y$ ranges over $[-3, 3]$, the maximum differential entropy it can have is that of a uniform distribution over this range. This is achieved when the distribution of $X$ is $(1/3, 0, 1/3, 0, 1/3)$. Then $h(Y) = \log 6$ and the capacity of this channel is
\[
C = \log 6 - \log 2 = \log 3 \text{ bits}.
\]
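A numerical check of this answer (not part of the original solution): the sketch below builds the output density of $Y$ for the input distribution $(1/3, 0, 1/3, 0, 1/3)$ as a mixture of uniform pieces and evaluates $I(X;Y) = h(Y) - \log 2$ by numerical integration, which should come out close to $\log_2 3 \approx 1.585$ bits.

```python
import numpy as np

# Y = X + Z with Z ~ Uniform[-1, 1] and X on {-2, -1, 0, 1, 2}.
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
pX = np.array([1/3, 0.0, 1/3, 0.0, 1/3])       # candidate capacity-achieving input
y = np.linspace(-3.5, 3.5, 700001)
dy = y[1] - y[0]
fY = np.zeros_like(y)
for xv, p in zip(xs, pX):
    fY += p * 0.5 * ((y >= xv - 1) & (y <= xv + 1))   # mixture of Uniform[x-1, x+1] pieces

mask = fY > 0
hY = -np.sum(fY[mask] * np.log2(fY[mask])) * dy        # h(Y) in bits (Riemann sum)
I = hY - 1.0                                           # subtract h(Z) = log2(2) = 1 bit
print(f"I(X;Y) = {I:.4f} bits, log2(3) = {np.log2(3):.4f} bits")
```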
5. Exponential Noise Channel and Exponential Source

Recall that $X \sim \mathrm{Exp}(\lambda)$ is to say that $X$ is a continuous non-negative random variable with density
\[
f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}
\]
or, equivalently, that $X$ is a random variable with characteristic function
\[
\varphi_X(t) = E\left[e^{itX}\right] = \frac{1}{1 - it/\lambda}.
\]
Recall also that in this case $EX = 1/\lambda$.

(a) Find the differential entropy of $X \sim \mathrm{Exp}(\lambda)$.

(b) Prove that $\mathrm{Exp}(\lambda)$ uniquely maximizes the differential entropy among all non-negative random variables confined to $EX \le 1/\lambda$. Hint: Recall our proof of an analogous fact for the Gaussian distribution.

Fix positive scalars $a$ and $b$. Let $X^*$ be the non-negative random variable of mean $a$ formed by taking $X^* = 0$ with probability $\frac{b}{a+b}$ and, with probability $\frac{a}{a+b}$, drawing from an exponential distribution $\mathrm{Exp}(1/(a+b))$. Equivalently stated, $X^*$ is the random variable with characteristic function
\[
\varphi_{X^*}(t) = \frac{b}{a+b} + \frac{a}{a+b}\cdot\frac{1}{1 - it(a+b)}.
\]
Let $N \sim \mathrm{Exp}(1/b)$ be independent of $X^*$.

(c) What is the distribution of $X^* + N$? Tip: simplest would be to compute the characteristic function of $X^* + N$ by recalling the relation $\varphi_{X^*+N}(t) = \varphi_{X^*}(t)\,\varphi_N(t)$.

(d) Find $I(X^*; X^* + N)$.

(e) Consider the problem of communication over the additive exponential noise channel $Y = X + N$, where $N \sim \mathrm{Exp}(1/b)$, independent of the channel input $X$, which is confined to being non-negative and satisfying the moment constraint $EX \le a$. Find $C(a) = \max I(X; X+N)$, where the maximization is over all non-negative $X$ satisfying $EX \le a$. What is the capacity-achieving distribution? Hint: Using findings from previous parts, show that for any non-negative random variable $X$, independent of $N$, with $EX \le a$, we have $I(X; X+N) \le I(X^*; X^*+N)$.
Solution: Exponential Noise Channel and Exponential Source

(a)
\[
h(X) = -\int_0^\infty f_X(x)\log f_X(x)\,dx = -\int_0^\infty \lambda e^{-\lambda x}\log(\lambda e^{-\lambda x})\,dx = -\int_0^\infty \lambda e^{-\lambda x}\log\lambda\,dx + \lambda\int_0^\infty x\,\lambda e^{-\lambda x}\,dx = 1 - \log\lambda,
\]
where $\log$ denotes the natural logarithm, so the entropy is in nats.

(b) Let the probability density of any such non-negative random variable be $f_X$, while $g_X$ is the density of $\mathrm{Exp}(\lambda)$ as in part (a) above. Then
\[
h(X) = -\int f_X(x)\log f_X(x)\,dx = -\int f_X(x)\log\frac{f_X(x)}{g_X(x)}\,dx - \int f_X(x)\log(\lambda e^{-\lambda x})\,dx = -D(f_X\|g_X) - \log\lambda\int f_X(x)\,dx + \lambda\int x f_X(x)\,dx \le 1 - \log\lambda - D(f_X\|g_X) \le 1 - \log\lambda,
\]
where the first inequality uses $EX \le 1/\lambda$ and the last inequality is due to the fact that $D(f_X\|g_X) \ge 0$; equality holds if and only if $X \sim \mathrm{Exp}(\lambda)$.

(c)
\[
\varphi_{X^*+N}(t) = \varphi_{X^*}(t)\,\varphi_N(t) = \left(\frac{b}{a+b} + \frac{a}{a+b}\cdot\frac{1}{1-it(a+b)}\right)\frac{1}{1-itb} = \frac{a + b - itab - itb^2}{(a+b)(1-it(a+b))(1-itb)} = \frac{(a+b)(1-itb)}{(a+b)(1-it(a+b))(1-itb)} = \frac{1}{1-it(a+b)},
\]
which is the characteristic function of $\mathrm{Exp}(1/(a+b))$. Thus $X^* + N$ is distributed as $\mathrm{Exp}(1/(a+b))$.

(d) Since $N$ is independent of $X^*$,
\[
I(X^*; X^*+N) = h(X^*+N) - h(X^*+N \mid X^*) = h(X^*+N) - h(N) = 1 + \log(a+b) - (1 + \log b) = \log\left(1 + \frac{a}{b}\right).
\]
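A quick Monte Carlo check of part (c) (not part of the original solution), using arbitrary example values of $a$ and $b$: sample $X^* + N$ and compare its empirical CDF with that of $\mathrm{Exp}(1/(a+b))$.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 1.0                 # arbitrary example values
n = 500_000

# X*: 0 with probability b/(a+b); otherwise drawn from Exp(1/(a+b)), i.e. mean a+b.
is_exp = rng.random(n) < a / (a + b)
x_star = np.where(is_exp, rng.exponential(scale=a + b, size=n), 0.0)
noise = rng.exponential(scale=b, size=n)       # N ~ Exp(1/b), i.e. mean b
s = x_star + noise

# Empirical CDF of X* + N vs. the Exp(1/(a+b)) CDF at a few points.
for t in [0.5, 1.0, 2.0, 5.0]:
    print(f"t = {t}: empirical {np.mean(s <= t):.4f} vs exact {1 - np.exp(-t / (a + b)):.4f}")
```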
(e) For any feasible $X$, note that $X + N$ is a non-negative random variable and $E[X+N] = E[X] + E[N] \le a + b$; thus by the result of part (b) above, $h(X+N) \le 1 + \log(a+b)$. Hence,
\[
I(X; X+N) = h(X+N) - h(X+N \mid X) = h(X+N) - h(N) \overset{(*)}{\le} 1 + \log(a+b) - h(N) = 1 + \log(a+b) - (1 + \log b) = \log\left(1 + \frac{a}{b}\right) = I(X^*; X^*+N),
\]
where the second equality uses the independence of $N$ and $X$. Thus $C(a) \le \log(1 + a/b)$. Equality in $(*)$ holds if $X = X^*$, proving $C(a) = \log(1 + a/b)$. The maximizing distribution is that of $X^*$.
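As a numerical illustration of the converse (not part of the original solution): for the suboptimal input $X \sim \mathrm{Exp}(1/a)$ (mean $a$, an arbitrary feasible choice), the density of $X+N$ is $(e^{-y/a} - e^{-y/b})/(a-b)$ for $a \ne b$, and the resulting mutual information falls strictly below the capacity $\log(1 + a/b)$.

```python
import numpy as np

a, b = 2.0, 1.0                                     # arbitrary example values (a != b)
y = np.linspace(1e-9, 80.0, 800001)
dy = y[1] - y[0]
f = (np.exp(-y / a) - np.exp(-y / b)) / (a - b)     # density of Exp(1/a) + Exp(1/b)
h_sum = -np.sum(f * np.log(f)) * dy                 # h(X + N) in nats (Riemann sum)
I = h_sum - (1.0 + np.log(b))                       # I(X; X+N) = h(X+N) - h(N)
print(f"I(X; X+N) = {I:.4f} nats  <  C(a) = {np.log(1 + a / b):.4f} nats")
```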