LECTURE 2
Information Measures

2.1 ENTROPY

Let $X$ be a discrete random variable on an alphabet $\mathcal{X}$ drawn according to the probability mass function (pmf) $p(x) = \mathrm{P}\{X = x\}$, $x \in \mathcal{X}$, denoted in short as $X \sim p(x)$. The uncertainty about the outcome of $X$, or equivalently, the amount of information gained by observing $X$, is measured by its entropy
\[
H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = \mathrm{E}\Bigl[\log \frac{1}{p(X)}\Bigr].
\]
By continuity, we use the convention $0 \log 0 = 0$ in the above summation. Sometimes we denote $H(X)$ by $H(p(x))$, highlighting the fact that $H(X)$ is a functional of the pmf $p(x)$.

Example 2.1. If $X$ is a Bernoulli random variable with parameter $p = \mathrm{P}\{X = 1\} \in [0,1]$ (in short, $X \sim \mathrm{Bern}(p)$), then
\[
H(X) = p \log \frac{1}{p} + (1-p) \log \frac{1}{1-p}.
\]
With a slight abuse of notation, we denote this quantity by $H(p)$ and refer to it as the binary entropy function.

The entropy $H(X)$ satisfies the following properties.
1. $H(X) \ge 0$.
2. $H(X)$ is a concave function in $p(x)$.
3. $H(X) \le \log|\mathcal{X}|$.

The first property is trivial. The proof of the second property is left as an exercise. For the proof of the third property, we recall the following.

Lemma 2.1 (Jensen's inequality). If $f(x)$ is convex, then $\mathrm{E}(f(X)) \ge f(\mathrm{E}(X))$. If $f(x)$ is concave, then $\mathrm{E}(f(X)) \le f(\mathrm{E}(X))$.

Now by the concavity of the logarithm function and Jensen's inequality,
\[
H(X) = \mathrm{E}\Bigl[\log \frac{1}{p(X)}\Bigr] \le \log \mathrm{E}\Bigl[\frac{1}{p(X)}\Bigr] \le \log|\mathcal{X}|,
\]
where the last inequality follows since
\[
\mathrm{E}\Bigl[\frac{1}{p(X)}\Bigr] = \sum_{x \colon p(x) > 0} p(x) \cdot \frac{1}{p(x)} = |\{x \colon p(x) > 0\}| \le |\mathcal{X}|.
\]
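These definitions are easy to evaluate numerically. The following minimal Python sketch (the helper names and the five-letter pmf are made up for illustration, not part of the original development) computes the binary entropy function of Example 2.1 and checks the bound $H(X) \le \log|\mathcal{X}|$.

```python
import numpy as np

def entropy_bits(p):
    """H(X) in bits for a pmf given as an array, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def binary_entropy(p):
    """Binary entropy function H(p) = p log(1/p) + (1-p) log(1/(1-p)) in bits."""
    return entropy_bits([p, 1.0 - p])

print(binary_entropy(0.5))                    # 1 bit: a fair coin is maximally uncertain
print(binary_entropy(0.11))                   # roughly 0.5 bits
p = np.array([0.7, 0.1, 0.1, 0.05, 0.05])     # an arbitrary pmf on a 5-letter alphabet
print(entropy_bits(p) <= np.log2(len(p)))     # Property 3: H(X) <= log |X|
```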
Let $(X,Y)$ be a pair of discrete random variables. Then the conditional entropy of $Y$ given $X$ is defined as
\[
H(Y \mid X) = \sum_{x} p(x) H(p(y \mid x)) = \mathrm{E}\Bigl[\log \frac{1}{p(Y \mid X)}\Bigr],
\]
where $p(y \mid x) = p(x,y)/p(x)$ is the conditional pmf of $Y$ given $\{X = x\}$. We sometimes use the notation $H(Y \mid X = x) = H(p(y \mid x))$, $x \in \mathcal{X}$.

By the concavity of $H(p(y))$ in $p(y)$ and Jensen's inequality,
\[
\sum_{x} p(x) H(p(y \mid x)) \le H\Bigl(\sum_{x} p(x) p(y \mid x)\Bigr) = H(p(y)),
\]
where the inequality holds with equality if $p(y \mid x) \equiv p(y)$, or equivalently, $X$ and $Y$ are independent. We summarize this relationship between the conditional and unconditional entropies as follows.

Conditioning reduces entropy.
\[
H(Y \mid X) \le H(Y) \tag{2.1}
\]
with equality if $X$ and $Y$ are independent.

Let $(X,Y) \sim p(x,y)$ be a pair of discrete random variables. Their joint entropy is
\[
H(X,Y) = \mathrm{E}\Bigl[\log \frac{1}{p(X,Y)}\Bigr].
\]
By the chain rule of probability $p(x,y) = p(x)p(y \mid x) = p(y)p(x \mid y)$, we have the chain rule of entropy
\[
H(X,Y) = \mathrm{E}\Bigl[\log \frac{1}{p(X)}\Bigr] + \mathrm{E}\Bigl[\log \frac{1}{p(Y \mid X)}\Bigr] = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y).
\]
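As a quick numerical check of these definitions, the sketch below (with a made-up joint pmf; the variable names are illustrative only) verifies the two-variable chain rule and the conditioning-reduces-entropy inequality (2.1).

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

# A made-up joint pmf p(x, y) on a 2 x 3 alphabet; rows index x, columns index y.
pxy = np.array([[0.10, 0.25, 0.05],
                [0.30, 0.10, 0.20]])
px = pxy.sum(axis=1)                      # marginal p(x)
py = pxy.sum(axis=0)                      # marginal p(y)

H_XY = entropy_bits(pxy.ravel())          # joint entropy H(X, Y)
H_X, H_Y = entropy_bits(px), entropy_bits(py)
# conditional entropy H(Y|X) = sum_x p(x) H(p(y|x))
H_Y_given_X = sum(px[i] * entropy_bits(pxy[i] / px[i]) for i in range(len(px)))

print(np.isclose(H_XY, H_X + H_Y_given_X))   # chain rule: H(X,Y) = H(X) + H(Y|X)
print(H_Y_given_X <= H_Y)                    # conditioning reduces entropy, (2.1)
```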
More generally, for an $n$-tuple of random variables $X^n = (X_1, X_2, \ldots, X_n)$, we have the following.

Chain rule of entropy.
\[
H(X^n) = H(X_1) + H(X_2 \mid X_1) + \cdots + H(X_n \mid X^{n-1}) = \sum_{i=1}^{n} H(X_i \mid X^{i-1}),
\]
where $X^0$ is set to be an unspecified constant by convention.

By the chain rule and (2.1), we can upper bound the joint entropy as
\[
H(X^n) \le \sum_{i=1}^{n} H(X_i)
\]
with equality if $X_1, \ldots, X_n$ are mutually independent.

2.2 RELATIVE ENTROPY

Let $p(x)$ and $q(x)$ be a pair of pmfs on $\mathcal{X}$. The extent of discrepancy between $p(x)$ and $q(x)$ is measured by their relative entropy (also referred to as Kullback--Leibler divergence)
\[
D(p \,\|\, q) = D(p(x) \,\|\, q(x)) = \mathrm{E}_p\Bigl[\log \frac{p(X)}{q(X)}\Bigr] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}, \tag{2.2}
\]
where the expectation is taken w.r.t. $X \sim p(x)$. Note that this quantity is well defined only when $p(x)$ is absolutely continuous w.r.t. $q(x)$, namely, $p(x) = 0$ whenever $q(x) = 0$. Otherwise, we define $D(p \,\|\, q) = \infty$, which follows by adopting the convention $1/0 = \infty$ as well.

The relative entropy $D(p \,\|\, q)$ satisfies the following properties.
1. $D(p \,\|\, q) \ge 0$ with equality if and only if (iff) $p \equiv q$, namely, $p(x) = q(x)$ for every $x \in \mathcal{X}$.
2. $D(p \,\|\, q)$ is not symmetric, i.e., $D(p \,\|\, q) \ne D(q \,\|\, p)$ in general.
3. $D(p \,\|\, q)$ is convex in $(p,q)$, i.e., for any $(p_1, q_1)$, $(p_2, q_2)$, and $\lambda$, $\bar{\lambda} = 1 - \lambda \in [0,1]$,
\[
\lambda D(p_1 \,\|\, q_1) + \bar{\lambda} D(p_2 \,\|\, q_2) \ge D(\lambda p_1 + \bar{\lambda} p_2 \,\|\, \lambda q_1 + \bar{\lambda} q_2).
\]
4. Chain rule. For any $p(x,y)$ and $q(x,y)$,
\[
\begin{aligned}
D(p(x,y) \,\|\, q(x,y)) &= D(p(x) \,\|\, q(x)) + \sum_{x} p(x) D(p(y \mid x) \,\|\, q(y \mid x)) \\
&= D(p(x) \,\|\, q(x)) + \mathrm{E}_p\bigl[D(p(y \mid X) \,\|\, q(y \mid X))\bigr].
\end{aligned}
\]

The proof of the first three properties is left as an exercise. For the fourth property, consider
\[
D(p(x,y) \,\|\, q(x,y)) = \mathrm{E}_p\Bigl[\log \frac{p(X,Y)}{q(X,Y)}\Bigr] = \mathrm{E}_p\Bigl[\log \frac{p(X)}{q(X)}\Bigr] + \mathrm{E}_p\Bigl[\log \frac{p(Y \mid X)}{q(Y \mid X)}\Bigr].
\]
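The nonnegativity and asymmetry properties can likewise be checked on small examples. A minimal sketch with made-up pmfs, handling the absolute-continuity convention explicitly:

```python
import numpy as np

def relative_entropy_bits(p, q):
    """D(p || q) in bits; +inf when p is not absolutely continuous w.r.t. q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((q == 0) & (p > 0)):
        return np.inf
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(relative_entropy_bits(p, q), relative_entropy_bits(q, p))   # nonnegative and unequal
print(relative_entropy_bits(p, p))                                # zero iff the pmfs coincide
print(relative_entropy_bits([0.5, 0.5, 0.0], [0.5, 0.0, 0.5]))    # infinite: p not abs. cont. w.r.t. q
```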
The notion of relative entropy can be extended to arbitrary probability measures $P$ and $Q$ defined on the same sample space and set of events as
\[
D(P \,\|\, Q) = \int \log \frac{dP}{dQ} \, dP,
\]
where $dP/dQ$ is the Radon--Nikodym derivative of $P$ w.r.t. $Q$. (If $P$ is not absolutely continuous w.r.t. $Q$, then $D(P \,\|\, Q) = \infty$.) In particular, if $P$ and $Q$ have respective densities $p$ and $q$ w.r.t. a $\sigma$-finite measure $\mu$ (such as the Lebesgue and counting measures), then
\[
D(P \,\|\, Q) = D(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, d\mu(x). \tag{2.3}
\]
This expression generalizes (2.2) since probability mass functions can be viewed as densities w.r.t. the counting measure. When $\mu$ is the Lebesgue measure (or equivalently, $p$ and $q$ are densities of continuous distributions on the Euclidean space), we follow the standard convention of denoting $d\mu(x)$ by $dx$ in (2.3).

2.3 f-DIVERGENCE

We digress a bit to generalize the notion of relative entropy. Let $f \colon [0,\infty) \to \mathbb{R}$ be convex with $f(1) = 0$. Then the $f$-divergence between a pair of densities $p$ and $q$ w.r.t. $\mu$ is defined as
\[
D_f(p \,\|\, q) = \int q(x) f\Bigl(\frac{p(x)}{q(x)}\Bigr) d\mu(x) = \mathrm{E}_q\Bigl[f\Bigl(\frac{p(X)}{q(X)}\Bigr)\Bigr].
\]

Example 2.2. Let $f(u) = u \log u$. Then
\[
D_f(p \,\|\, q) = \int q(x) \frac{p(x)}{q(x)} \log \frac{p(x)}{q(x)} \, d\mu(x) = \int p(x) \log \frac{p(x)}{q(x)} \, d\mu(x) = D(p \,\|\, q).
\]

Example 2.3. Now let $f(u) = -\log u$. Then
\[
D_f(p \,\|\, q) = -\int q(x) \log \frac{p(x)}{q(x)} \, d\mu(x) = \int q(x) \log \frac{q(x)}{p(x)} \, d\mu(x) = D(q \,\|\, p).
\]

Example 2.4. Combining the above two cases, let $f(u) = (u-1) \log u$. Then
\[
D_f(p \,\|\, q) = D(p \,\|\, q) + D(q \,\|\, p),
\]
which is symmetric in $(p,q)$.

Many basic distance functions on probability measures can be represented as $f$-divergences; see, for example, Liese and Vajda (2006).
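For discrete pmfs with full support, the $f$-divergence is a finite sum, so Examples 2.2 through 2.4 can be verified directly. A minimal sketch (the pmfs are illustrative; the divergences come out in nats because np.log is the natural logarithm):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) f(p(x)/q(x)) for pmfs with full support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

p = np.array([0.5, 0.3, 0.2])             # illustrative pmfs with full support
q = np.array([0.2, 0.5, 0.3])

kl     = lambda u: u * np.log(u)          # Example 2.2: recovers D(p || q)
rev_kl = lambda u: -np.log(u)             # Example 2.3: recovers D(q || p)
sym    = lambda u: (u - 1) * np.log(u)    # Example 2.4: symmetrized divergence

print(f_divergence(p, q, kl))                          # equals D(p || q) in nats
print(f_divergence(p, q, rev_kl))                      # equals D(q || p) in nats
print(np.isclose(f_divergence(p, q, sym),
                 f_divergence(q, p, sym)))             # symmetric in (p, q)
print(np.isclose(f_divergence(p, q, sym),
                 f_divergence(p, q, kl) + f_divergence(p, q, rev_kl)))
```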
2.4 MUTUAL INFORMATION

Let $(X,Y)$ be a pair of discrete random variables with joint pmf $p(x,y) = p(x)p(y \mid x)$. The amount of information about one provided by the other is measured by their mutual information
\[
\begin{aligned}
I(X;Y) &= D(p(x,y) \,\|\, p(x)p(y)) \\
&= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \\
&= \sum_{x} p(x) \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{p(y)} \\
&= \sum_{x} p(x) D(p(y \mid x) \,\|\, p(y)).
\end{aligned}
\]

The mutual information $I(X;Y)$ satisfies the following properties.
1. $I(X;Y)$ is a nonnegative function of $p(x,y)$.
2. $I(X;Y) = 0$ iff $X$ and $Y$ are independent, i.e., $p(x,y) \equiv p(x)p(y)$.
3. As a function of $(p(x), p(y \mid x))$, $I(X;Y)$ is concave in $p(x)$ for a fixed $p(y \mid x)$, and convex in $p(y \mid x)$ for a fixed $p(x)$.
4. Mutual information and entropy.
\[
I(X;X) = H(X)
\]
and
\[
I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y).
\]
5. Variational characterization.
\[
I(X;Y) = \min_{q(y)} \sum_{x} p(x) D(p(y \mid x) \,\|\, q(y)), \tag{2.4}
\]
where the minimum is attained by $q(y) \equiv p(y)$.

The proof of the first four properties is left as an exercise. For the fifth property, consider
\[
\begin{aligned}
I(X;Y) &= \sum_{x,y} p(x,y) \log \frac{p(y \mid x)}{p(y)} \\
&= \sum_{x,y} p(x) p(y \mid x) \log \Bigl(\frac{p(y \mid x)}{q(y)} \cdot \frac{q(y)}{p(y)}\Bigr) \\
&= \sum_{x} p(x) D(p(y \mid x) \,\|\, q(y)) - \sum_{y} p(y) \log \frac{p(y)}{q(y)} \\
&\le \sum_{x} p(x) D(p(y \mid x) \,\|\, q(y)),
\end{aligned}
\]
where the last inequality follows since the subtracted term is equal to $D(p(y) \,\|\, q(y)) \ge 0$, and holds with equality iff $p(y) \equiv q(y)$.

Sometimes we are interested in the maximum mutual information $\max_{p(x)} I(X;Y)$ of a conditional pmf $p(y \mid x)$, which is referred to as the information capacity (or the capacity in short). By the variational characterization in (2.4), the information capacity can be expressed as
\[
\max_{p(x)} I(X;Y) = \max_{p(x)} \min_{q(y)} \sum_{x} p(x) D(p(y \mid x) \,\|\, q(y)),
\]
which can be viewed as a game between two players, one choosing $p(x)$ first and the other choosing $q(y)$ next, with the payoff function $f(p(x), q(y)) = \sum_{x} p(x) D(p(y \mid x) \,\|\, q(y))$. Using the following fundamental result in game theory, we show that the order of plays can be exchanged without affecting the outcome of the game.

Minimax theorem (Sion 1958). Suppose that $U$ and $V$ are compact convex subsets of the Euclidean space, and that a real-valued continuous function $f(u,v)$ on $U \times V$ is concave in $u$ for each $v$ and convex in $v$ for each $u$. Then
\[
\max_{u \in U} \min_{v \in V} f(u,v) = \min_{v \in V} \max_{u \in U} f(u,v).
\]

Since $f(p(x), q(y))$ is linear (thus concave) in $p(x)$ and convex in $q(y)$ (recall Property 3 of relative entropy), we can apply the minimax theorem and conclude that
\[
\begin{aligned}
\max_{p(x)} I(X;Y) &= \max_{p(x)} \min_{q(y)} \sum_{x} p(x) D(p(y \mid x) \,\|\, q(y)) \\
&= \min_{q(y)} \max_{p(x)} \sum_{x} p(x) D(p(y \mid x) \,\|\, q(y)) \\
&= \min_{q(y)} \max_{x} D(p(y \mid x) \,\|\, q(y)),
\end{aligned}
\]
where the last equality follows by noting that the maximum expectation is attained by putting all the weight on the value of $x$ that maximizes $D(p(y \mid x) \,\|\, q(y))$. Furthermore, if $p^*(x)$ attains the maximum, then by the optimality condition of (2.4)
\[
q^*(y) \equiv \sum_{x} p^*(x) p(y \mid x)
\]
attains the minimum.
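The variational characterization (2.4) and the minimax expression above suggest computing the capacity by alternating the two players' moves: for the current $p(x)$, take the minimizing $q(y) = \sum_x p(x)p(y \mid x)$, then reweight $p(x)$ toward inputs with large $D(p(y \mid x) \,\|\, q(y))$. This alternating scheme is the Blahut--Arimoto algorithm, which is not developed in this lecture; the sketch below applies it to a binary symmetric channel, whose capacity is known in closed form to be $1 - H(0.1) \approx 0.531$ bits, and reports the lower bound $\sum_x p(x) D(p(y \mid x) \,\|\, q(y))$ and the upper bound $\max_x D(p(y \mid x) \,\|\, q(y))$ from the minimax expression.

```python
import numpy as np

def divergences_nats(W, q):
    """D(p(y|x) || q(y)) in nats for every input x, where W[x, y] = p(y|x)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        logs = np.where(W > 0, np.log(W / q), 0.0)
    return np.sum(W * logs, axis=1)

def capacity_bits(W, n_iter=200):
    """Approximate max_p I(X;Y) by alternating optimization (Blahut-Arimoto).
    Returns the input pmf and lower/upper bounds on the capacity in bits."""
    p = np.full(W.shape[0], 1.0 / W.shape[0])   # start from the uniform input pmf
    for _ in range(n_iter):
        q = p @ W                               # inner minimizer: q(y) = sum_x p(x) p(y|x)
        d = divergences_nats(W, q)
        p = p * np.exp(d)                       # reweight inputs with large divergence
        p /= p.sum()
    q = p @ W
    d = divergences_nats(W, q)
    return p, float(p @ d) / np.log(2), float(d.max()) / np.log(2)

# Binary symmetric channel with crossover 0.1; capacity = 1 - H(0.1), about 0.531 bits.
W = np.array([[0.9, 0.1],
              [0.1, 0.9]])
p_star, lower, upper = capacity_bits(W)
print(p_star, lower, upper)
```

For this symmetric channel the uniform input is already optimal, so the two bounds agree almost immediately; for asymmetric channels the reweighting step moves $p(x)$ toward the capacity-achieving input distribution.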
2.5 ENTROPY RATE

Let $X = (X_n)_{n=1}^{\infty}$ be a random process on a finite alphabet $\mathcal{X}$. The amount of uncertainty per symbol is measured by its entropy rate
\[
H(X) = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n),
\]
if the limit exists.

Example 2.5. If $X$ is stationary, then the limit exists and $H(X) = \lim_{n \to \infty} H(X_n \mid X^{n-1})$.

Example 2.6. If $X$ is an aperiodic irreducible Markov chain, then the limit exists and
\[
H(X) = \lim_{n \to \infty} H(X_n \mid X_{n-1}) = \sum_{x_1} \pi(x_1) H(p(x_2 \mid x_1)),
\]
where $\pi$ is the unique stationary distribution of the chain.

Example 2.7. If $X_1, X_2, \ldots$ are independent and identically distributed (i.i.d.), then $H(X) = H(X_1)$.

Example 2.8. Let $Y = (Y_n)_{n=1}^{\infty}$ be a stationary Markov chain and $X_n = f(Y_n)$, $n = 1, 2, \ldots$. The resulting random process $X = (X_n)_{n=1}^{\infty}$ is hidden Markov and its entropy rate satisfies
\[
H(X_n \mid X^{n-1}, Y_1) \le H(X) \le H(X_n \mid X^{n-1})
\]
and
\[
H(X) = \lim_{n \to \infty} H(X_n \mid X^{n-1}, Y_1) = \lim_{n \to \infty} H(X_n \mid X^{n-1}).
\]
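Example 2.6 above reduces the entropy rate of an aperiodic irreducible Markov chain to a finite computation: find the stationary distribution $\pi$ and average the entropies of the rows of the transition matrix. A minimal sketch with an illustrative two-state chain (the transition probabilities are made up):

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

# Illustrative two-state chain with transition matrix P[i, j] = p(x2 = j | x1 = i).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# The stationary distribution pi solves pi = pi P: take the left eigenvector for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()

# Entropy rate of the stationary chain: H(X) = sum_x1 pi(x1) H(p(x2 | x1))   (Example 2.6)
rate = sum(pi[i] * entropy_bits(P[i]) for i in range(len(pi)))
print(pi, rate)
```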
2.6 RELATIVE ENTROPY RATE

Let $P$ and $Q$ be two probability measures on $\mathcal{X}^{\infty}$ with $n$-th order densities $p(x^n)$ and $q(x^n)$, respectively. The normalized discrepancy between them is measured by their relative entropy rate
\[
D(P \,\|\, Q) = \lim_{n \to \infty} \frac{1}{n} D(p(x^n) \,\|\, q(x^n))
\]
if the limit exists.

Example 2.9. If $P$ is stationary, $Q$ is stationary finite-order Markov, and $P$ is absolutely continuous w.r.t. $Q$, then the limit exists and
\[
D(P \,\|\, Q) = \lim_{n \to \infty} \int p(x^{n-1}) D(p(x_n \mid x^{n-1}) \,\|\, q(x_n \mid x^{n-1})) \, dx^{n-1}. \tag{2.5}
\]
See, for example, Barron (1985) and Gray (2011, Lemma 10.1).

Example 2.10. Similarly, if $P$ and $Q$ are stationary hidden Markov and $P$ is absolutely continuous w.r.t. $Q$, then the limit exists and (2.5) holds (Juang and Rabiner 1985, Leroux 1992, Ephraim and Merhav 2002).

PROBLEMS

2.1. Prove Property 2 of entropy in Section 2.1 and find the equality condition for Property 3.

2.2. Prove Properties 1 through 3 of relative entropy in Section 2.2.

2.3. Entropy and relative entropy. Let $\mathcal{X}$ be a finite alphabet and $q(x)$ be the uniform pmf on $\mathcal{X}$. Show that for any pmf $p(x)$ on $\mathcal{X}$,
\[
D(p \,\|\, q) = \log|\mathcal{X}| - H(p(x)).
\]

2.4. The total variation distance between two pmfs $p(x)$ and $q(x)$ is defined as
\[
\delta(p,q) = \frac{1}{2} \sum_{x \in \mathcal{X}} |p(x) - q(x)|.
\]
(a) Show that this distance is an $f$-divergence by finding the corresponding $f$.
(b) Show that $\delta(p,q) = \max_{A \subseteq \mathcal{X}} |P(A) - Q(A)|$, where $P$ and $Q$ are corresponding probability measures, e.g., $P(A) = \sum_{x \in A} p(x)$.
2.5. Pinsker's inequality. Show that
\[
\delta(p,q) \le \sqrt{\frac{1}{2 \log e} D(p \,\|\, q)},
\]
where the logarithm has the same base as the relative entropy. (Hint: First consider the case that $\mathcal{X}$ is binary.)

2.6. Let $p(x,y)$ be a joint pmf on $\mathcal{X} \times \mathcal{Y}$. Show that $p(x,y)$ is absolutely continuous w.r.t. $p(x)p(y)$.

2.7. Prove Properties 1 through 4 of mutual information in Section 2.4.

2.8. Let $X = (X_n)_{n=1}^{\infty}$ be a stationary random process. Show that
\[
H(X) = \lim_{n \to \infty} H(X_n \mid X^{n-1}).
\]

2.9. Let $Y = (Y_n)_{n=1}^{\infty}$ be a stationary Markov chain and $X = \{g(Y_n)\}$ be a hidden Markov process. Show that
\[
H(X_n \mid X^{n-1}, Y_1) \le H(X) \le H(X_n \mid X^{n-1})
\]
and conclude that
\[
H(X) = \lim_{n \to \infty} H(X_n \mid X^{n-1}, Y_1) = \lim_{n \to \infty} H(X_n \mid X^{n-1}).
\]

2.10. Recurrence time. Let $X_0, X_1, X_2, \ldots$ be i.i.d. copies of $X \sim p(x)$, and let $N = \min\{n \ge 1 \colon X_n = X_0\}$ be the waiting time to the next occurrence of $X_0$.
(a) Show that $\mathrm{E}(N) = |\mathcal{X}|$.
(b) Show that $\mathrm{E}(\log N) \le H(X)$.

2.11. The past and the future. Let $X = (X_n)_{n=1}^{\infty}$ be stationary. Show that
\[
\lim_{n \to \infty} \frac{1}{n} I(X_1, \ldots, X_n; X_{n+1}, \ldots, X_{2n}) = 0.
\]

2.12. Variable-duration symbols. A discrete memoryless source has the alphabet $\{1, 2\}$, where symbol 1 has duration 1 and symbol 2 has duration 2. Let $X = (X_n)_{n=1}^{\infty}$ be the resulting random process.
(a) Find its entropy rate $H(X)$ in terms of the probability $p$ of symbol 1.
(b) Find the maximum entropy by optimizing over $p$.
Bibliography

Barron, A. R. (1985). The strong ergodic theorem for densities: Generalized Shannon--McMillan--Breiman theorem. Ann. Probab., 13(4), 1292--1303.

Ephraim, Y. and Merhav, N. (2002). Hidden Markov processes. IEEE Trans. Inf. Theory, 48(6), 1518--1569.

Gray, R. M. (2011). Entropy and Information Theory. Springer, New York.

Juang, B.-H. F. and Rabiner, L. R. (1985). A probabilistic distance measure for hidden Markov models. AT&T Tech. J., 64(2), 391--408.

Leroux, B. G. (1992). Maximum-likelihood estimation for hidden Markov models. Stoch. Process. Appl., 40(1), 127--143.

Liese, F. and Vajda, I. (2006). On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory, 52(10), 4394--4412.

Sion, M. (1958). On general minimax theorems. Pacific J. Math., 8, 171--176.