Entropy, Relative Entropy and Mutual Information Exercises

Exercise 2.1: Coin Flips. A fair coin is flipped until the first head occurs. Let $X$ denote the number of flips required.

(a) Find the entropy $H(X)$ in bits. The following expressions may be useful:
\[
\sum_{n=1}^{\infty} r^n = \frac{r}{1-r}, \qquad
\sum_{n=1}^{\infty} n r^n = \frac{r}{(1-r)^2}.
\]
(b) A random variable $X$ is drawn according to this distribution. Find an efficient sequence of yes-no questions of the form, "Is $X$ contained in the set $S$?" Compare $H(X)$ to the expected number of questions required to determine $X$.

(a) The distribution of the random variable is given by $P\{X = i\} = 0.5^i$. Hence,
\[
H(X) = -\sum_{i=1}^{\infty} p_i \log p_i
     = -\sum_{i=1}^{\infty} 0.5^i \log(0.5^i)
     = -\log(0.5) \sum_{i=1}^{\infty} i \, 0.5^i
     = \frac{0.5}{(1-0.5)^2}
     = 2 \text{ bits}.
\]
(b) An efficient strategy asks "Is $X = 1$?", then "Is $X = 2$?", and so on, stopping at the first yes. Determining $X = i$ takes $i$ questions, so the expected number of questions is $\sum_{i=1}^{\infty} i \, 0.5^i = 2$, which equals $H(X)$.
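As a numerical sanity check of this closed form, both sums can be evaluated directly; the short Python sketch below (illustrative, with the infinite sums truncated at a depth where the tail is negligible) confirms the two values:

```python
import math

# Entropy of the geometric distribution P{X = i} = 0.5**i, truncated at a
# depth where the tail contribution is negligible.
entropy_bits = -sum(0.5**i * math.log2(0.5**i) for i in range(1, 200))
print(entropy_bits)  # -> 2.0 (up to floating-point error)

# Expected number of yes-no questions "Is X = 1?", "Is X = 2?", ...
expected_questions = sum(i * 0.5**i for i in range(1, 200))
print(expected_questions)  # -> 2.0, matching H(X)
```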
Exercise 2.3: Minimum entropy. What is the minimum value of $H(p_1, \ldots, p_n) = H(\mathbf{p})$ as $\mathbf{p}$ ranges over the set of $n$-dimensional probability vectors? Find all $\mathbf{p}$'s which achieve this minimum.

Since $H(\mathbf{p}) \ge 0$ and $\sum_i p_i = 1$, the minimum value of $H(\mathbf{p})$ is $0$, which is achieved when $p_i = 1$ for some $i$ and $p_j = 0$ for all $j \ne i$.

Exercise: Average entropy. Let $H(p) = -p \log_2 p - (1-p) \log_2(1-p)$ be the binary entropy function.

(a) Evaluate $H(1/4)$.
(b) Calculate the average entropy $H(p)$ when the probability $p$ is chosen uniformly in the range $0 \le p \le 1$.

(a) $H(1/4) = -\tfrac{1}{4} \log_2 \tfrac{1}{4} - \left(1 - \tfrac{1}{4}\right) \log_2\left(1 - \tfrac{1}{4}\right) = 0.811$ bits.

(b) Now,
\[
\bar{H} = E[H(p)] = \int_0^1 H(p) f(p) \, dp,
\qquad
f(p) = \begin{cases} 1, & 0 \le p \le 1, \\ 0, & \text{otherwise.} \end{cases}
\]
So, substituting $q = 1 - p$ in the second term,
\[
\bar{H} = -\int_0^1 \big( p \log p + (1-p) \log(1-p) \big) \, dp
        = -\left[ \int_0^1 p \log p \, dp + \int_0^1 q \log q \, dq \right]
        = -2 \int_0^1 p \log p \, dp.
\]
Letting $u = \ln p$ and $dv = p \, dp$ (so $v = p^2/2$) and integrating by parts, we have
\[
\bar{H} = -\frac{2}{\ln 2} \int_0^1 p \ln p \, dp
        = -\frac{2}{\ln 2} \left[ \frac{p^2}{2} \ln p - \frac{p^2}{4} \right]_0^1
        = \frac{1}{2 \ln 2} \approx 0.721 \text{ bits}.
\]
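The closed form $1/(2 \ln 2) \approx 0.721$ can be checked numerically; a minimal sketch using a midpoint Riemann sum (the helper `binary_entropy` is introduced here for illustration):

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) = -p log2 p - (1-p) log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.25))  # -> 0.811, matching part (a)

# Midpoint Riemann-sum approximation of the average entropy over uniform p.
n = 100_000
avg = sum(binary_entropy((k + 0.5) / n) for k in range(n)) / n
print(avg, 1 / (2 * math.log(2)))  # both -> 0.7213, matching part (b)
```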
Exercise 2.6: Example of joint entropy. Let $p(x, y)$ be given by

    X \ Y     0      1
      0      1/3    1/3
      1       0     1/3

Find
(a) $H(X)$, $H(Y)$.
(b) $H(X|Y)$, $H(Y|X)$.
(c) $H(X, Y)$.
(d) $H(Y) - H(Y|X)$.
(e) $I(X; Y)$.
(f) Draw a Venn diagram for the quantities in (a) through (e).

(a) The marginals are $p(x) = (2/3, 1/3)$ and $p(y) = (1/3, 2/3)$, so
\[
H(X) = -\tfrac{2}{3} \log \tfrac{2}{3} - \tfrac{1}{3} \log \tfrac{1}{3}
     = \log 3 - \tfrac{2}{3} = 0.918 \text{ bits},
\]
\[
H(Y) = -\tfrac{1}{3} \log \tfrac{1}{3} - \tfrac{2}{3} \log \tfrac{2}{3} = 0.918 \text{ bits}.
\]
(b)
\[
H(X|Y) = \sum_{x, y} p(x, y) \log \frac{p(y)}{p(x, y)}
       = \tfrac{1}{3} \log 1 + \tfrac{1}{3} \log 2 + 0 + \tfrac{1}{3} \log 2
       = \tfrac{2}{3} \log 2 = \tfrac{2}{3} \text{ bits},
\]
\[
H(Y|X) = \sum_{x, y} p(x, y) \log \frac{p(x)}{p(x, y)}
       = \tfrac{1}{3} \log 2 + \tfrac{1}{3} \log 2 + 0 + \tfrac{1}{3} \log 1
       = \tfrac{2}{3} \log 2 = \tfrac{2}{3} \text{ bits}.
\]
(c)
\[
H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y)
        = -\left[ \tfrac{1}{3} \log \tfrac{1}{3} + \tfrac{1}{3} \log \tfrac{1}{3} + \tfrac{1}{3} \log \tfrac{1}{3} \right]
        = \log 3 = 1.585 \text{ bits}.
\]
(d)
\[
H(Y) - H(Y|X) = \left( \log 3 - \tfrac{2}{3} \right) - \tfrac{2}{3}
              = \log 3 - \tfrac{4}{3} \approx 0.251 \text{ bits}.
\]
(e)
\[
I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
        = \tfrac{1}{3} \log \tfrac{3}{2} + \tfrac{1}{3} \log \tfrac{3}{4} + \tfrac{1}{3} \log \tfrac{3}{2}
        = \log 3 - \tfrac{4}{3} \approx 0.251 \text{ bits},
\]
in agreement with (d).
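All of these quantities can be verified mechanically from the table; a minimal Python sketch (the dictionary layout and the helper `H` are illustrative choices, not from the text):

```python
import math

# Joint distribution from the table: p[(x, y)].
p = {(0, 0): 1/3, (0, 1): 1/3, (1, 0): 0.0, (1, 1): 1/3}

def H(dist):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(q * math.log2(q) for q in dist if q > 0)

px = [p[(0, 0)] + p[(0, 1)], p[(1, 0)] + p[(1, 1)]]  # marginal of X: (2/3, 1/3)
py = [p[(0, 0)] + p[(1, 0)], p[(0, 1)] + p[(1, 1)]]  # marginal of Y: (1/3, 2/3)

HX, HY = H(px), H(py)          # 0.918, 0.918  -> part (a)
HXY = H(p.values())            # log 3 = 1.585 -> part (c)
print(HX, HY, HXY)
print(HXY - HY, HXY - HX)      # H(X|Y) = H(Y|X) = 2/3 -> part (b)
print(HX + HY - HXY)           # I(X;Y) = log 3 - 4/3 = 0.251 -> parts (d), (e)
```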
Exercise 2.8: Entropy of a sum. Let $X$ and $Y$ be random variables that take on values $x_1, x_2, \ldots, x_r$ and $y_1, y_2, \ldots, y_s$, respectively. Let $Z = X + Y$.

(a) Show that $H(Z|X) = H(Y|X)$. Argue that if $X$, $Y$ are independent then $H(Y) \le H(Z)$ and $H(X) \le H(Z)$. Thus the addition of independent random variables adds uncertainty.
(b) Give an example (of necessarily dependent random variables) in which $H(X) > H(Z)$ and $H(Y) > H(Z)$.
(c) Under what conditions does $H(Z) = H(X) + H(Y)$?

(a) Given $X = x$, $Z = x + Y$ is a one-to-one function of $Y$, so $p(z|x) = p(y|x)$ for $z = x + y$. Hence
\[
H(Z|X) = -\sum_x p(x) \sum_z p(z|x) \log p(z|x)
       = -\sum_x p(x) \sum_y p(y|x) \log p(y|x)
       = H(Y|X).
\]
If $X$, $Y$ are independent, then $H(Y|X) = H(Y)$. Now
\[
H(Z) \ge H(Z|X) = H(Y|X) = H(Y).
\]
Similarly, $H(X) \le H(Z)$.
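A small numerical illustration of parts (b) and (c); the particular alphabets below ($\{0, 1\}$ and $\{0, 2\}$ for the independent case, and the $Y = -X$ construction for the dependent one) are illustrative choices of ours:

```python
import math
from itertools import product

def H(dist):
    """Entropy in bits of a distribution given as a dict of probabilities."""
    return -sum(q * math.log2(q) for q in dist.values() if q > 0)

def sum_dist(pxy):
    """Distribution of Z = X + Y from a joint distribution p(x, y)."""
    pz = {}
    for (x, y), q in pxy.items():
        pz[x + y] = pz.get(x + y, 0.0) + q
    return pz

# Independent X on {0,1}, Y on {0,2}: the four sums 0, 1, 2, 3 are all
# distinct, so the part (c) conditions hold and H(Z) = H(X) + H(Y) = 2 bits.
indep = {(x, y): 0.25 for x, y in product([0, 1], [0, 2])}
print(H(sum_dist(indep)))  # -> 2.0

# Dependent example from part (b): Y = -X, so Z = 0 with probability 1.
dep = {(1, -1): 0.5, (-1, 1): 0.5}
print(H(sum_dist(dep)))    # -> 0.0, although H(X) = H(Y) = 1 bit
```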
(b) If $X$, $Y$ are dependent such that $\Pr\{Y = -x \mid X = x\} = 1$, then $\Pr\{Z = 0\} = 1$, so that $H(X) = H(Y) > 0$ but $H(Z) = 0$. Hence $H(X) > H(Z)$ and $H(Y) > H(Z)$. Another example is the sum of the two opposite faces of a die, which always add to seven.

(c) The random variables $X$, $Y$ must be independent and satisfy $x_i + y_j \ne x_m + y_n$ whenever $(i, j) \ne (m, n)$; i.e., the two random variables never sum to the same value from distinct pairs, so the alphabet of $Z$ has size $rs$. The proof is as follows. Notice that $Z = X + Y = \phi(X, Y)$. Now
\[
H(Z) = H(\phi(X, Y)) \le H(X, Y) = H(X) + H(Y|X) \le H(X) + H(Y).
\]
If $\phi(\cdot)$ is a bijection (i.e., only one pair $(x, y)$ maps to each value of $z$), then $H(\phi(X, Y)) = H(X, Y)$, and if $X$, $Y$ are independent then $H(Y|X) = H(Y)$. Hence, with these two conditions, $H(Z) = H(X) + H(Y)$.

Exercise 2.2: Data processing. Let $X_1 \to X_2 \to X_3 \to \cdots \to X_n$ form a Markov chain in this order; i.e., let
\[
p(x_1, x_2, \ldots, x_n) = p(x_1) \, p(x_2|x_1) \cdots p(x_n|x_{n-1}).
\]
Reduce $I(X_1; X_2, \ldots, X_n)$ to its simplest form.

By the chain rule for entropy and the Markov property, $H(X_i | X_{i-1}, \ldots, X_1) = H(X_i | X_{i-1})$, so
\[
\begin{aligned}
I(X_1; X_2, \ldots, X_n)
  &= H(X_1) - H(X_1 | X_2, \ldots, X_n) \\
  &= H(X_1) - \left[ H(X_1, X_2, \ldots, X_n) - H(X_2, \ldots, X_n) \right] \\
  &= H(X_1) - \left[ \sum_{i=1}^n H(X_i | X_{i-1}, \ldots, X_1) - \sum_{i=2}^n H(X_i | X_{i-1}, \ldots, X_2) \right] \\
  &= H(X_1) - \left[ \left( H(X_1) + \sum_{i=2}^n H(X_i | X_{i-1}) \right) - \left( H(X_2) + \sum_{i=3}^n H(X_i | X_{i-1}) \right) \right] \\
  &= H(X_2) - H(X_2 | X_1) \\
  &= I(X_2; X_1) = I(X_1; X_2).
\end{aligned}
\]
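The identity can be checked on a concrete chain; a sketch assuming a binary symmetric chain $X_1 \to X_2 \to X_3$ with crossover probability $e = 0.1$ (an illustrative choice):

```python
import math
from itertools import product

# A short binary Markov chain X1 -> X2 -> X3 with symmetric crossover e.
e = 0.1
p1 = {0: 0.5, 1: 0.5}
T = {(a, b): 1 - e if a == b else e for a, b in product([0, 1], repeat=2)}

# Joint p(x1, x2, x3) = p(x1) T(x2|x1) T(x3|x2).
joint = {(a, b, c): p1[a] * T[(a, b)] * T[(b, c)]
         for a, b, c in product([0, 1], repeat=3)}

def I(pxy):
    """Mutual information I(X;Y) in bits from a joint distribution p(x, y)."""
    px, py = {}, {}
    for (x, y), q in pxy.items():
        px[x] = px.get(x, 0) + q
        py[y] = py.get(y, 0) + q
    return sum(q * math.log2(q / (px[x] * py[y]))
               for (x, y), q in pxy.items() if q > 0)

# I(X1; (X2, X3)) versus I(X1; X2): equal, as the reduction above shows.
p1_23 = {(a, (b, c)): q for (a, b, c), q in joint.items()}
p1_2 = {}
for (a, b, c), q in joint.items():
    p1_2[(a, b)] = p1_2.get((a, b), 0) + q
print(I(p1_23), I(p1_2))  # both -> 0.531 bits
```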
Exercise: Fano's inequality. Let $\Pr(X = i) = p_i$, $i = 1, 2, \ldots, m$, and let $p_1 \ge p_2 \ge p_3 \ge \cdots \ge p_m$. The minimal probability of error predictor of $X$ is $\hat{X} = 1$, with resulting probability of error $P_e = 1 - p_1$. Maximize $H(\mathbf{p})$ subject to the constraint $1 - p_1 = P_e$ to find a bound on $P_e$ in terms of $H$. This is Fano's inequality in the absence of conditioning.

We want to maximize $H(\mathbf{p}) = -\sum_{i=1}^m p_i \log p_i$ subject to the constraints $1 - p_1 = P_e$ and $\sum_i p_i = 1$. Form the Lagrangian
\[
L = H(\mathbf{p}) + \lambda (P_e - 1 + p_1) + \mu \Big( \sum_i p_i - 1 \Big)
\]
and take the partial derivatives with respect to each $p_i$ and the Lagrange multipliers:
\[
\frac{\partial L}{\partial p_i} = -(\log p_i + 1) + \mu, \quad i \ne 1, \qquad
\frac{\partial L}{\partial p_1} = -(\log p_1 + 1) + \lambda + \mu,
\]
\[
\frac{\partial L}{\partial \lambda} = P_e - 1 + p_1, \qquad
\frac{\partial L}{\partial \mu} = \sum_i p_i - 1.
\]
Setting these equations to zero, we have
\[
p_i = 2^{\mu - 1}, \quad i \ne 1, \qquad
p_1 = 2^{\lambda + \mu - 1}, \qquad
p_1 = 1 - P_e, \qquad
\sum_i p_i = 1.
\]
We proceed by eliminating $\mu$. Since the probabilities must sum to one,
\[
1 - P_e + (m - 1) \, 2^{\mu - 1} = 1,
\]
hence
\[
2^{\mu - 1} = \frac{P_e}{m - 1}, \qquad \mu = 1 + \log \frac{P_e}{m - 1},
\]
and therefore
\[
p_1 = 1 - P_e, \qquad p_i = \frac{P_e}{m - 1}, \quad i \ne 1.
\]
Since the entropy is maximized by these probabilities, for any $\mathbf{p}$ satisfying the constraint,
\[
H(\mathbf{p}) \le -(1 - P_e) \log(1 - P_e) - (m - 1) \frac{P_e}{m - 1} \log \frac{P_e}{m - 1}
            = -(1 - P_e) \log(1 - P_e) - P_e \log P_e + P_e \log(m - 1),
\]
that is,
\[
H(\mathbf{p}) \le H(P_e) + P_e \log(m - 1),
\]
from which we get Fano's inequality in the absence of conditioning.
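A quick check that the maximizing distribution attains the bound, with illustrative values $m = 5$ and $P_e = 0.3$ (our own choices):

```python
import math

def H(dist):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(q * math.log2(q) for q in dist if q > 0)

# Entropy of the maximizing distribution for m = 5, Pe = 0.3.
m, Pe = 5, 0.3
p = [1 - Pe] + [Pe / (m - 1)] * (m - 1)
bound = H([1 - Pe, Pe]) + Pe * math.log2(m - 1)
print(H(p), bound)  # both -> 1.481: H(p) meets H(Pe) + Pe log(m-1) exactly

# Any other distribution with p1 = 1 - Pe stays below the bound.
q = [1 - Pe, 0.15, 0.1, 0.04, 0.01]
print(H(q) <= bound)  # -> True
```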