EE5139R: Problem Set 3 Assigned: 24/08/16, Due: 31/08/16


1. Cover and Thomas, Problem 2.30 (Maximum Entropy):

Solution: We are required to maximize H(P_X) over all distributions P_X on the non-negative integers satisfying

    sum_{n>=0} n P_X(n) = A,

together with the normalization constraint sum_{n>=0} P_X(n) = 1 (which we may ignore without loss of generality and enforce at the end). Construct the Lagrangian

    L(P_X, lambda) = -sum_{n>=0} P_X(n) log P_X(n) + lambda ( sum_{n>=0} n P_X(n) - A ).

Differentiating with respect to P_X(n) (using natural logs and interchanging differentiation and the infinite sum), we obtain

    -log P_X(n) - 1 + lambda n = 0,

so the optimizer is P*_X(n) = exp(-1 + lambda n) for n >= 0. We immediately recognize this as a geometric distribution with mean A, i.e., P*_X can be written alternatively as

    P*_X(n) = (1 - p)^n p,  n >= 0,  where A = (1 - p)/p.

From a direct calculation, the entropy is H(P*_X) = H_b(p)/p.

2. (Optional) Cover and Thomas, Problem 2.38 (The Value of a Question):

Solution: (Here Y is the answer to a yes/no question about X with Pr(Y = 1) = alpha, so Y is a deterministic function of X.) Then

    H(X) - H(X | Y) = I(X; Y) = H(Y) - H(Y | X) = H_b(alpha) - H(Y | X) = H_b(alpha),

since H(Y | X) = 0.
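Remark (numerical check, not part of the original solution): the short Python sketch below builds the geometric pmf of Problem 1 for an illustrative mean A = 3 and compares its entropy with H_b(p)/p and with an arbitrary two-point pmf of the same mean. The value of A and the comparison pmf are assumptions made purely for illustration.

import numpy as np

def entropy_bits(pmf):
    """Shannon entropy in bits of a pmf given as a numpy array."""
    pmf = pmf[pmf > 0]
    return -np.sum(pmf * np.log2(pmf))

A = 3.0                      # illustrative target mean (assumption, not from the problem)
p = 1.0 / (1.0 + A)          # geometric parameter: mean (1 - p)/p = A
n = np.arange(0, 2000)       # truncated support; the tail mass is negligible here
geom = (1 - p) ** n * p      # candidate maximizer P*_X(n) = (1 - p)^n p

Hb = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
print(entropy_bits(geom))    # numerically matches H_b(p)/p ...
print(Hb / p)                # ... which is about 3.245 bits for A = 3

# Any other pmf with the same mean should have smaller entropy, e.g. an
# arbitrary two-point pmf on {0, 6} with mean 3:
other = np.zeros_like(geom)
other[0], other[6] = 0.5, 0.5
print(entropy_bits(other))   # 1 bit, well below H_b(p)/p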

3. Fano's inequality for list decoding: Recall the proof of Fano's inequality. Now develop a generalization of Fano's inequality for list decoding. Let (X, Y) ~ P_XY and let L(Y) be a list of size L drawn from X (compare this to an estimator X-hat(Y) in X, which corresponds to a list of size L = 1). Lower bound the probability of error Pr(X not in L(Y)) in terms of L, H(X | L(Y)) and |X|. You should be able to recover the standard Fano inequality by setting L = 1.

Solution: Define the error random variable

    E = 1 if X not in L(Y),  E = 0 if X in L(Y).

Now consider the two expansions of the joint conditional entropy:

    H(X, E | L(Y)) = H(X | E, L(Y)) + H(E | L(Y)) = H(E | X, L(Y)) + H(X | L(Y)).

Let P_e := Pr(X not in L(Y)). Clearly H(E | X, L(Y)) = 0, and H(E | L(Y)) <= H(E) = H_b(P_e). Now we examine the term H(X | E, L(Y)). We have

    H(X | E, L(Y)) = Pr(E = 0) H(X | E = 0, L(Y)) + Pr(E = 1) H(X | E = 1, L(Y))
                  <= (1 - P_e) log L + P_e log(|X| - L),

since if we know that E = 0 the number of values X can take on is no more than L, and if E = 1 the number of values X can take on is no more than |X| - L. Putting everything together and upper bounding H_b(P_e) by 1, we have

    P_e >= [ H(X | L(Y)) - log L - 1 ] / log( (|X| - L) / L ).

4. (Optional) Data Processing Inequality for KL Divergence: Let P_X, Q_X be pmfs on the same alphabet X. Assume for simplicity that P_X(x), Q_X(x) > 0 for all x in X. Let W(y | x) = Pr(Y = y | X = x) be a channel from X to Y. Define

    P_Y(y) = sum_x W(y | x) P_X(x)   and   Q_Y(y) = sum_x W(y | x) Q_X(x).

Show that

    D(P_X || Q_X) >= D(P_Y || Q_Y).

You may use the log-sum inequality. This problem shows that processing does not increase divergence.

Solution: Starting from the definition of D(P_Y || Q_Y), we have

    D(P_Y || Q_Y) = sum_y P_Y(y) log [ P_Y(y) / Q_Y(y) ]
                  = sum_y ( sum_x W(y | x) P_X(x) ) log [ ( sum_x W(y | x) P_X(x) ) / ( sum_x W(y | x) Q_X(x) ) ]
                 <= sum_y sum_x W(y | x) P_X(x) log [ W(y | x) P_X(x) / ( W(y | x) Q_X(x) ) ]
                  = sum_y sum_x W(y | x) P_X(x) log [ P_X(x) / Q_X(x) ]
                  = sum_x P_X(x) log [ P_X(x) / Q_X(x) ]
                  = D(P_X || Q_X),

where the inequality follows from the log-sum inequality.
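Remark (numerical check, not part of the original solution): the data-processing inequality of Problem 4 is easy to spot-check. The sketch below draws a random pair of input pmfs and a random channel and verifies D(P_X || Q_X) >= D(P_Y || Q_Y); the alphabet sizes, Dirichlet sampling and seed are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

def kl_bits(p, q):
    """D(p || q) in bits, assuming both pmfs have full support."""
    return float(np.sum(p * np.log2(p / q)))

nx, ny = 4, 3                            # arbitrary alphabet sizes
P_X = rng.dirichlet(np.ones(nx))         # random pmf on X
Q_X = rng.dirichlet(np.ones(nx))
W = rng.dirichlet(np.ones(ny), size=nx)  # W[x, y] = Pr(Y = y | X = x)

P_Y = P_X @ W                            # P_Y(y) = sum_x W(y|x) P_X(x)
Q_Y = Q_X @ W

print(kl_bits(P_X, Q_X), kl_bits(P_Y, Q_Y))
assert kl_bits(P_X, Q_X) >= kl_bits(P_Y, Q_Y) - 1e-12   # processing cannot increase divergence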

5. Typical-Set Calculations 1:

(a) Suppose a DMS emits h and t with probability 1/2 each. For eps = 0.01 and n = 5, what is A_eps^(n)?

Solution: In this case H(X) = 1. All source sequences are equally likely, each with probability 2^{-5} = 2^{-nH(X)}. Hence every sequence satisfies the condition for being typical,

    2^{-n(H(X)+eps)} <= p_{X^n}(x^n) <= 2^{-n(H(X)-eps)},

for any eps > 0. Hence all 32 sequences are typical.

(b) Repeat if Pr(h) = 0.2, Pr(t) = 0.8, n = 5, and eps = 0.0001.

Solution: Consider a sequence with m heads and n - m tails. The probability of occurrence of this sequence is p^m (1 - p)^{n-m}, where p = Pr(h). For such a sequence to be typical we need

    2^{-n(H(X)+eps)} <= p^m (1 - p)^{n-m} <= 2^{-n(H(X)-eps)},

which translates to

    | (m/n - p) log((1 - p)/p) | <= eps.

Plugging in the value of p = 0.2, we get |m/5 - 1/5| <= eps/2. Since m = 0, ..., 5, this condition is satisfied for the given eps only for m = 1, i.e., when there is exactly one H in the sequence. Thus

    A_eps^(n) = {(HTTTT), (THTTT), (TTHTT), (TTTHT), (TTTTH)}.

6. Typical-Set Calculations 2: Consider a DMS with a two-symbol alphabet {a, b} where p_X(a) = 2/3 and p_X(b) = 1/3. Let X^n = (X_1, ..., X_n) be a string of chance variables from the source with n = 100,000.

(a) Let W(X_j) be the log-pmf random variable for the j-th source output, i.e., W(X_j) = -log 2/3 for X_j = a and -log 1/3 for X_j = b. Find the variance of W(X_j).

Solution: For notational convenience, we denote the log-pmf random variable by W. Note that W takes the value -log 2/3 with probability 2/3 and -log 1/3 with probability 1/3. Hence

    Var(W) = E[W^2] - E[W]^2 = 2/9.

(b) For eps = 0.01, evaluate the bound on the probability of the typical set given by Pr(X^n not in A_eps^(n)) <= sigma_W^2 / (n eps^2).

Solution: The bound on the typical set, as derived using Chebyshev's inequality, is

    Pr(X^n not in A_eps^(n)) <= sigma_W^2 / (n eps^2).

Substituting n = 10^5 and eps = 0.01, we obtain

    Pr(X^n in A_eps^(n)) >= 1 - 1/45 = 44/45.

Loosely speaking, this means that if we were to look at sequences of length 100,000 generated by our DMS, more than 97% of the time the sequence would be typical.
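Remark (numerical check, not part of the original solution): Problem 5(b) can be verified by brute force, since there are only 2^5 = 32 sequences. The sketch below enumerates them and keeps those satisfying the typicality condition; it prints exactly the five strings with a single H.

import itertools
import math

p, n, eps = 0.2, 5, 1e-4                 # Pr(h) = 0.2, n = 5, eps = 0.0001
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

typical = []
for seq in itertools.product("HT", repeat=n):
    m = seq.count("H")                   # number of heads in the sequence
    prob = p**m * (1 - p)**(n - m)
    if abs(-math.log2(prob) / n - H) <= eps:
        typical.append("".join(seq))

print(typical)                           # the five strings containing exactly one H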

(c) Let N_a be the number of a's in the string X^n = (X_1, ..., X_n). The random variable (rv) N_a is the sum of n iid rvs. Show what these rvs are.

Solution: The rv N_a is the sum of n iid rvs Y_i, namely N_a = sum_{i=1}^n Y_i, where the Y_i are Bernoulli with Pr(Y_i = 1) = 2/3.

(d) Express the rv W(X^n) as a function of the rv N_a. Note how this depends on n.

Solution: The probability of a particular sequence x^n containing N_a a's is (2/3)^{N_a} (1/3)^{n - N_a}. Hence

    W(X^n) = -log p_{X^n}(X^n) = -log[ (2/3)^{N_a} (1/3)^{n - N_a} ] = n log 3 - N_a.

(e) Express the typical set in terms of bounds on N_a (i.e., A_eps^(n) = {x^n : alpha < N_a < beta}) and calculate alpha and beta.

Solution: For a sequence x^n to be typical, it must satisfy

    | -(1/n) log p_{X^n}(x^n) - H(X) | < eps.

From (a), the source entropy is H(X) = E[W(X)] = log 3 - 2/3. Substituting eps and W(X^n) from part (d), we get

    | N_a/n - 2/3 | < 0.01.

Note the intuitive appeal of this condition: it says that for a sequence to be typical, the proportion of a's in that sequence must be very close to the probability that the DMS generates an a. Plugging in the value of n, we get the bounds 65,667 <= N_a <= 67,666.

(f) Find the mean and variance of N_a. Approximate Pr(X^n in A_eps^(n)) by the central limit theorem (CLT) approximation. The CLT approximation is to evaluate Pr(X^n in A_eps^(n)) assuming that N_a is Gaussian with the mean and variance of the actual N_a. Recall that for a sequence of iid rvs C_1, ..., C_n, the central limit theorem asserts that

    Pr( (1/sqrt(n)) sum_{i=1}^n (C_i - mu_C) <= t ) -> Phi(t / sigma_C),

where mu_C and sigma_C are the mean and standard deviation of the C_i and Phi(z) = integral_{-infty}^{z} (1/sqrt(2 pi)) exp(-u^2/2) du is the cdf of the standard Gaussian.

Solution: N_a is a binomial rv (a sum of independent Bernoulli rvs, as shown in part (c)). The mean and variance are

    E[N_a] = (2/3) * 10^5,   Var(N_a) = (2/9) * 10^5.

Note that we can calculate the exact probability of the typical set A_eps^(n):

    Pr(A_eps^(n)) = Pr(65,667 <= N_a <= 67,666) = sum_{N_a = 65,667}^{67,666} (10^5 choose N_a) (2/3)^{N_a} (1/3)^{10^5 - N_a}.

But this is computationally intensive, so we approximate Pr(A_eps^(n)) with the central limit theorem. We can use the CLT because N_a is the sum of n iid rvs, so in the limit of large n its cumulative distribution approaches that of a Gaussian rv with the mean and variance of N_a:

    Pr(65,667 <= N_a <= 67,666) ~ integral_alpha^beta (1 / sqrt(2 pi Var(N_a))) exp( -(x - E[N_a])^2 / (2 Var(N_a)) ) dx ~ Phi(6.704) - Phi(-6.706),
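Remark (numerical check, not part of the original solution): the "computationally intensive" exact binomial sum in part (f) is easy with scipy, so we can compare it against the Chebyshev bound from part (b) and the CLT approximation. The sketch below assumes scipy is available; the variable names are illustrative.

from scipy.stats import binom, norm

n, p, eps = 100_000, 2/3, 0.01
lo, hi = 65_667, 67_666                  # bounds on N_a from part (e)

mean = n * p                             # E[N_a] = (2/3) * 1e5
var = n * p * (1 - p)                    # Var(N_a) = (2/9) * 1e5

exact = binom.cdf(hi, n, p) - binom.cdf(lo - 1, n, p)
clt = norm.cdf((hi - mean) / var**0.5) - norm.cdf((lo - mean) / var**0.5)
cheb = 1 - (2/9) / (n * eps**2)          # Chebyshev lower bound from part (b): 44/45

print(exact, clt, cheb)                  # exact and CLT are essentially 1; Chebyshev gives ~0.978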

where Phi(x) is the cdf of the standard Gaussian, i.e., the integral of the standard Gaussian density over (-infty, x). Thus the CLT approximation tells us that approximately all of the sequences we observe from the output of the DMS will be typical, whereas Chebyshev gave us the weaker bound that more than 97% of the observed sequences will be typical.

7. (Optional) Typical-Set Calculations 3: For the random variables in the previous problem, find Pr(N_a = i) for i = 0, 1, 2. Find the probability of each individual string x^n for those values of i. Find the particular string x^n that has maximum probability over all sample values of X^n. What are the next most probable n-strings? Give a brief discussion of why the most probable n-strings are not regarded as typical strings.

Solution: We know from the previous problem that

    Pr(N_a = i) = (10^5 choose i) (2/3)^i (1/3)^{10^5 - i}.

For i = 0, 1, 2, Pr(N_a = i) is approximately zero. The string with the maximal probability is the string of all a's. The next most probable strings are the sequences with n - 1 a's and one b, and so forth. From the definition of the typical set, we see that the typical set is a fairly small set which contains most of the probability, and the probability of each sequence in the typical set is almost the same. The most probable sequences and the least probable sequences lie in the tails of the distribution of the sample mean of the log pmf (they are the furthest from the mean), so they are not regarded as typical strings. In fact, the aggregate probability of all the most likely sequences together with all the least likely sequences is very small. The only case in which the most likely sequence is regarded as typical is when every sequence is typical and every sequence is most likely (as in Typical-Set Calculations 1). However, this is not the case in general. As we saw in Typical-Set Calculations 2, for very long sequences a typical sequence contains roughly the same proportion of each symbol as the probability of that symbol.

8. (Optional) AEP and Mutual Information: Let (X_i, Y_i) be iid ~ p_{X,Y}(x, y). We form the log-likelihood ratio of the hypothesis that X and Y are independent vs. the hypothesis that X and Y are dependent. What is the limit of

    (1/n) log [ p_{X^n}(X^n) p_{Y^n}(Y^n) / p_{X^n,Y^n}(X^n, Y^n) ]?

What is the limit of p_{X^n}(X^n) p_{Y^n}(Y^n) / p_{X^n,Y^n}(X^n, Y^n) if X_i and Y_i are independent for all i?

Solution: Let

    L = (1/n) log [ p_{X^n}(X^n) p_{Y^n}(Y^n) / p_{X^n,Y^n}(X^n, Y^n) ].

Since the (X_i, Y_i) are iid ~ p_{X,Y}(x, y), we have

    L = (1/n) sum_{i=1}^n log [ p_X(X_i) p_Y(Y_i) / p_{X,Y}(X_i, Y_i) ] =: (1/n) sum_{i=1}^n W(X_i, Y_i).

Each of the terms is a function of (X_i, Y_i), which are independent across i = 1, ..., n. Thus, by the weak law of large numbers, the following convergence in probability holds:

    L -> E[W(X, Y)] = E_{(X,Y) ~ p_{X,Y}} [ log ( p_X(X) p_Y(Y) / p_{X,Y}(X, Y) ) ] = -I(X; Y).

Hence the limit of 2^{nL} = p_{X^n}(X^n) p_{Y^n}(Y^n) / p_{X^n,Y^n}(X^n, Y^n) is 2^{-nI(X;Y)}, which converges to one if X and Y are independent, because then I(X; Y) = 0.
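Remark (numerical illustration, not part of the original solution): the convergence in Problem 8 can be seen by simulation. The sketch below uses an arbitrary dependent joint pmf on {0,1} x {0,1} (an assumption for illustration), draws n iid pairs, and checks that the empirical log-likelihood ratio (1/n) sum log[p_X p_Y / p_XY] is close to -I(X;Y).

import numpy as np

rng = np.random.default_rng(1)

# An arbitrary joint pmf on {0,1} x {0,1} with dependent coordinates.
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])
Px, Py = P.sum(axis=1), P.sum(axis=0)
I = sum(P[x, y] * np.log2(P[x, y] / (Px[x] * Py[y]))
        for x in range(2) for y in range(2))        # I(X;Y) in bits

n = 200_000
idx = rng.choice(4, size=n, p=P.ravel())            # sample pairs (X_i, Y_i)
X, Y = idx // 2, idx % 2

L = np.mean(np.log2(Px[X] * Py[Y] / P[X, Y]))       # (1/n) log of the likelihood ratio
print(L, -I)                                        # L is close to -I(X;Y), about -0.278 bits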

9. Piece of Cake: A cake is sliced roughly in half, the largest piece being chosen each time, the other pieces discarded. We will assume that a random cut creates pieces of proportions

    (2/3, 1/3) with probability 3/4,   (2/5, 3/5) with probability 1/4.

Thus, for example, the first cut (and choice of largest piece) may result in a piece of size 3/5. Cutting and choosing from this piece might reduce it to size (3/5)(2/3) at time 2, and so on. Let T_n be the fraction of cake left after n cuts. Find the limit (in probability) of

    (1/n) log T_n.

Solution: Let C_i be the fraction of the remaining cake that is retained at the i-th cut, and let T_n be the fraction of cake left after n cuts. Then T_n = C_1 C_2 ... C_n. Hence, by the weak law of large numbers,

    (1/n) log T_n = (1/n) sum_{i=1}^n log C_i -> E[log C_1] = (3/4) log(2/3) + (1/4) log(3/5).

10. Two Typical Sets: Let X_1, X_2, ... be a sequence of real-valued random variables, independent and identically distributed according to P_X(x), x in X. Let mu = E[X] and denote the entropy of X by H(X) = -sum_x P_X(x) log P_X(x). Define the two sets

    A_n = { x^n in X^n : | -(1/n) log P_{X^n}(x^n) - H(X) | <= eps },
    B_n = { x^n in X^n : | (1/n) sum_{i=1}^n x_i - mu | <= eps }.

(a) (1 point) Pr(X^n in A_n) -> 1 as n -> infinity. True or false? Justify your answer.

Solution: True. This follows by Chebyshev's inequality: indeed,

    Pr(X^n in A_n^c) <= sigma_0^2 / (n eps^2) -> 0,

where sigma_0^2 = Var(-log P_X(X)). Consequently Pr(X^n in A_n) -> 1, as desired.

(b) (1 point) Pr(X^n in A_n intersect B_n) -> 1 as n -> infinity. True or false? Justify your answer.

Solution: True. By Chebyshev's inequality and the same logic as above, Pr(X^n in B_n) -> 1. So by De Morgan's law and the union bound,

    Pr(X^n in A_n intersect B_n) = 1 - Pr(X^n in A_n^c union B_n^c) >= 1 - Pr(X^n in A_n^c) - Pr(X^n in B_n^c).

Since the latter two terms tend to zero, we conclude that Pr(X^n in A_n intersect B_n) -> 1, as desired.

(c) (1 point) Show that |A_n intersect B_n| <= 2^{n(H(X)+eps)} for all n.

Solution:

    |A_n intersect B_n| <= |A_n| <= 2^{n(H(X)+eps)},

where the final inequality comes from the AEP, shown in class.
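Remark (numerical illustration, not part of the original solution): a short simulation makes the limit in Problem 9 concrete. The sketch below averages (1/n) log T_n over several runs and compares it with (3/4) log(2/3) + (1/4) log(3/5); the number of cuts, the number of trials and the seed are arbitrary choices.

import numpy as np

rng = np.random.default_rng(2)

n, trials = 10_000, 50                                # arbitrary simulation sizes
limit = 0.75 * np.log2(2/3) + 0.25 * np.log2(3/5)     # E[log C_1], about -0.62 bits per cut

# The retained fraction C_i is 2/3 w.p. 3/4 (cut (2/3, 1/3)) and 3/5 w.p. 1/4 (cut (2/5, 3/5)).
C = rng.choice([2/3, 3/5], size=(trials, n), p=[0.75, 0.25])
est = np.log2(C).sum(axis=1) / n                      # (1/n) log T_n for each trial

print(est.mean(), limit)                              # the two numbers agree closely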

(d) (1 point) Show that |A_n intersect B_n| >= (1/2) 2^{n(H(X)-eps)} for n sufficiently large.

Solution: From part (b), Pr(X^n in A_n intersect B_n) >= 1/2 for n sufficiently large. Thus

    1/2 <= sum_{x^n in A_n intersect B_n} P_{X^n}(x^n) <= sum_{x^n in A_n intersect B_n} 2^{-n(H(X)-eps)} = |A_n intersect B_n| 2^{-n(H(X)-eps)},

and we are done.

11. Entropy Inequalities: Let X and Y be real-valued random variables that take on discrete values in X = {1, ..., r} and Y = {1, ..., s}. Let Z = X + Y.

(a) (1 point) Show that H(Z | X) = H(Y | X). Justify your answer carefully.

Solution: Consider

    H(Z | X) = sum_x P_X(x) H(Z | X = x)
             = -sum_x P_X(x) sum_z P_{Z|X}(z | x) log P_{Z|X}(z | x)
             = -sum_x P_X(x) sum_z P_{Y|X}(z - x | x) log P_{Y|X}(z - x | x)
             = -sum_x P_X(x) sum_y P_{Y|X}(y | x) log P_{Y|X}(y | x)
             = H(Y | X).

(b) (1 point) It is now known that X and Y are independent. Which of the following is true in general: (i) H(X) <= H(Z); (ii) H(X) >= H(Z)? Justify your answer.

Solution: From the above, note that X and Y enter symmetrically. So given what we have proved in (a), we also know that H(Z | Y) = H(X | Y). Now we have

    H(Z) >= H(Z | Y) = H(X | Y) = H(X),

where the inequality holds because conditioning reduces entropy and the final equality holds by the independence of X and Y. So the first assertion is true.

(c) (1 point) Now, in addition to Z = X + Y and the independence of X and Y, it is also known that X = f_1(Z) and Y = f_2(Z) for some functions f_1 and f_2. Find H(Z) in terms of H(X) and H(Y).

Solution:

    H(Z) = H(X + Y) <= H(X, Y) = H(X) + H(Y),

where the final equality is by the independence of X and Y. On the other hand,

    H(X) + H(Y) = H(X, Y) = H(f_1(Z), f_2(Z)) <= H(Z).

Hence all inequalities above are equalities, and we have H(Z) = H(X) + H(Y).
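Remark (numerical check, not part of the original solution): the conclusion of Problem 11(c) can be spot-checked with an arbitrary pair of independent pmfs whose supports are chosen so that Z = X + Y determines both summands. The supports and probabilities below are assumptions made only for illustration.

import numpy as np
from collections import Counter

def H(pmf):
    """Entropy in bits of a pmf given as a dict of probabilities."""
    p = np.array([v for v in pmf.values() if v > 0])
    return float(-np.sum(p * np.log2(p)))

# Independent X and Y; the supports are chosen so that z = x + y reveals both
# summands, e.g. f1(z) = 10 * (z // 10) and f2(z) = z % 10, as in part (c).
px = {0: 0.5, 10: 0.5}
py = {1: 0.2, 2: 0.3, 3: 0.5}

pz = Counter()
for x, p in px.items():
    for y, q in py.items():
        pz[x + y] += p * q               # pmf of Z = X + Y under independence

print(H(pz), H(px) + H(py))              # equal, as shown in part (c)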