LECTURE 2
Information Measures

2.1 ENTROPY

Let $X$ be a discrete random variable on an alphabet $\mathcal{X}$ drawn according to the probability mass function (pmf) $p(x) = P(X = x)$, $x \in \mathcal{X}$, denoted in short as $X \sim p(x)$. The uncertainty about the outcome of $X$, or equivalently, the amount of information gained by observing $X$, is measured by its entropy

$$H(X) = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} = \mathrm{E}\Bigl[\log \frac{1}{p(X)}\Bigr].$$

By continuity, we use the convention $0 \log 0 = 0$ in the above summation. Sometimes we denote $H(X)$ by $H(p(x))$, highlighting the fact that $H(X)$ is a functional of the pmf $p(x)$.

Example 2.1. If $X$ is a Bernoulli random variable with parameter $p = P\{X = 1\} \in [0,1]$ (in short, $X \sim \mathrm{Bern}(p)$), then

$$H(X) = p \log \frac{1}{p} + (1-p) \log \frac{1}{1-p}.$$

With a slight abuse of notation, we denote this quantity by $H(p)$ and refer to it as the binary entropy function.

The entropy $H(X)$ satisfies the following properties.

1. $H(X) \ge 0$.
2. $H(X)$ is a concave function in $p(x)$.
3. $H(X) \le \log |\mathcal{X}|$.

The first property is trivial. The proof of the second property is left as an exercise. For the proof of the third property, we recall the following.

Lemma 2.1 (Jensen's inequality). If $f(x)$ is convex, then $\mathrm{E}(f(X)) \ge f(\mathrm{E}(X))$. If $f(x)$ is concave, then $\mathrm{E}(f(X)) \le f(\mathrm{E}(X))$.

Now by the concavity of the logarithm function and Jensen's inequality,

$$H(X) = \mathrm{E}\Bigl[\log \frac{1}{p(X)}\Bigr] \le \log \mathrm{E}\Bigl[\frac{1}{p(X)}\Bigr] \le \log |\mathcal{X}|,$$

where the last inequality follows since

$$\mathrm{E}\Bigl[\frac{1}{p(X)}\Bigr] = \sum_{x \colon p(x) \ne 0} p(x)\,\frac{1}{p(x)} = |\{x \colon p(x) \ne 0\}| \le |\mathcal{X}|.$$

Let $(X, Y)$ be a pair of discrete random variables. Then the conditional entropy of $Y$ given $X$ is defined as

$$H(Y|X) = \sum_{x} p(x)\, H(p(y|x)) = \mathrm{E}\Bigl[\log \frac{1}{p(Y|X)}\Bigr],$$

where $p(y|x) = p(x,y)/p(x)$ is the conditional pmf of $Y$ given $\{X = x\}$. We sometimes use the notation $H(Y|X = x) = H(p(y|x))$, $x \in \mathcal{X}$.

By the concavity of $H(p(y))$ in $p(y)$ and Jensen's inequality,

$$\sum_{x} p(x)\, H(p(y|x)) \le H\Bigl(\sum_{x} p(x)\, p(y|x)\Bigr) = H(p(y)),$$

where the inequality holds with equality if $p(y|x) \equiv p(y)$, or equivalently, $X$ and $Y$ are independent. We summarize this relationship between the conditional and unconditional entropies as follows.

Conditioning reduces entropy.

$$H(Y|X) \le H(Y) \tag{2.1}$$

with equality if $X$ and $Y$ are independent.

Let $(X,Y) \sim p(x,y)$ be a pair of discrete random variables. Their joint entropy is

$$H(X,Y) = \mathrm{E}\Bigl[\log \frac{1}{p(X,Y)}\Bigr].$$

By the chain rule of probability $p(x,y) = p(x)\,p(y|x) = p(y)\,p(x|y)$, we have the chain rule of entropy

$$H(X,Y) = \mathrm{E}\Bigl[\log \frac{1}{p(X)}\Bigr] + \mathrm{E}\Bigl[\log \frac{1}{p(Y|X)}\Bigr] = H(X) + H(Y|X) = H(Y) + H(X|Y).$$

More generally, for an $n$-tuple of random variables $X^n = (X_1, X_2, \ldots, X_n)$, we have the following.

Chain rule of entropy.

$$H(X^n) = H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X^{n-1}) = \sum_{i=1}^{n} H(X_i|X^{i-1}),$$

where $X^0$ is set to be an unspecified constant by convention.

By the chain rule and (2.1), we can upper bound the joint entropy as

$$H(X^n) \le \sum_{i=1}^{n} H(X_i)$$

with equality if $X_1, \ldots, X_n$ are mutually independent.
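These quantities are straightforward to evaluate for small alphabets. The following is a minimal numerical sketch (the joint pmf and all function names are illustrative, not from the text) of the binary entropy function, the chain rule $H(X,Y) = H(X) + H(Y|X)$, and the fact that conditioning reduces entropy:

```python
import numpy as np

def entropy(p):
    """H in bits for a pmf p, using the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def binary_entropy(p):
    """Binary entropy function H(p) = p log(1/p) + (1 - p) log(1/(1 - p))."""
    return entropy([p, 1.0 - p])

# Illustrative joint pmf p(x, y); rows index x, columns index y.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(p_xy)
H_Y_given_X = H_XY - H_X            # chain rule: H(X,Y) = H(X) + H(Y|X)

print(binary_entropy(0.5))          # 1.0 bit, the maximum of the binary entropy function
print(H_Y_given_X <= H_Y)           # True: conditioning reduces entropy (2.1)
```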

2.2 RELATIVE ENTROPY

Let $p(x)$ and $q(x)$ be a pair of pmfs on $\mathcal{X}$. The extent of discrepancy between $p(x)$ and $q(x)$ is measured by their relative entropy (also referred to as Kullback–Leibler divergence)

$$D(p\|q) = D(p(x)\|q(x)) = \mathrm{E}_p\Bigl[\log \frac{p(X)}{q(X)}\Bigr] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}, \tag{2.2}$$

where the expectation is taken w.r.t. $X \sim p(x)$. Note that this quantity is well defined only when $p(x)$ is absolutely continuous w.r.t. $q(x)$, namely, $p(x) = 0$ whenever $q(x) = 0$. Otherwise, we define $D(p\|q) = \infty$, which follows by adopting the convention $1/0 = \infty$ as well.

The relative entropy $D(p\|q)$ satisfies the following properties.

1. $D(p\|q) \ge 0$ with equality if and only if (iff) $p \equiv q$, namely, $p(x) = q(x)$ for every $x \in \mathcal{X}$.
2. $D(p\|q)$ is not symmetric, i.e., $D(p\|q) \ne D(q\|p)$ in general.
3. $D(p\|q)$ is convex in $(p,q)$, i.e., for any $(p_1,q_1)$, $(p_2,q_2)$, and $\lambda$, $\bar\lambda = 1 - \lambda \in [0,1]$,
$$\lambda D(p_1\|q_1) + \bar\lambda D(p_2\|q_2) \ge D(\lambda p_1 + \bar\lambda p_2 \,\|\, \lambda q_1 + \bar\lambda q_2).$$
4. Chain rule. For any $p(x,y)$ and $q(x,y)$,
$$D(p(x,y)\|q(x,y)) = D(p(x)\|q(x)) + \sum_{x} p(x)\, D(p(y|x)\|q(y|x)) = D(p(x)\|q(x)) + \mathrm{E}_p\bigl[D(p(y|X)\|q(y|X))\bigr].$$

The proof of the first three properties is left as an exercise. For the fourth property, consider

$$D(p(x,y)\|q(x,y)) = \mathrm{E}_p\Bigl[\log \frac{p(X,Y)}{q(X,Y)}\Bigr] = \mathrm{E}_p\Bigl[\log \frac{p(X)}{q(X)}\Bigr] + \mathrm{E}_p\Bigl[\log \frac{p(Y|X)}{q(Y|X)}\Bigr].$$
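Properties 1 and 2 can be seen on a small example. A minimal numerical sketch (the pmfs and function names are illustrative):

```python
import numpy as np

def kl_divergence(p, q):
    """D(p||q) in bits; infinite when p is not absolutely continuous w.r.t. q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.any((q == 0) & (p > 0)):
        return np.inf
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))   # nonnegative and, in general, not symmetric
print(kl_divergence(p, p))                        # 0.0: D(p||q) = 0 iff p = q
print(kl_divergence(q, [1.0, 0.0]))               # inf: q is not absolutely continuous w.r.t. [1, 0]
```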

The notion of relative entropy can be extended to arbitrary probability measures $P$ and $Q$ defined on the same sample space and set of events as

$$D(P\|Q) = \int \log \frac{dP}{dQ}\, dP,$$

where $dP/dQ$ is the Radon–Nikodym derivative of $P$ w.r.t. $Q$. (If $P$ is not absolutely continuous w.r.t. $Q$, then $D(P\|Q) = \infty$.) In particular, if $P$ and $Q$ have respective densities $p$ and $q$ w.r.t. a $\sigma$-finite measure $\mu$ (such as the Lebesgue and counting measures), then

$$D(P\|Q) = D(p\|q) = \int p(x) \log \frac{p(x)}{q(x)}\, d\mu(x). \tag{2.3}$$

This expression generalizes (2.2) since probability mass functions can be viewed as densities w.r.t. the counting measure. When $\mu$ is the Lebesgue measure (or equivalently, $p$ and $q$ are densities of continuous distributions on the Euclidean space), we follow the standard convention of denoting $d\mu(x)$ by $dx$ in (2.3).

2.3 f-DIVERGENCE

We digress a bit to generalize the notion of relative entropy. Let $f\colon [0,\infty) \to \mathbb{R}$ be convex with $f(1) = 0$. Then the $f$-divergence between a pair of densities $p$ and $q$ w.r.t. $\mu$ is defined as

$$D_f(p\|q) = \int q(x)\, f\Bigl(\frac{p(x)}{q(x)}\Bigr)\, d\mu(x) = \mathrm{E}_q\Bigl[f\Bigl(\frac{p(X)}{q(X)}\Bigr)\Bigr].$$

Example 2.2. Let $f(u) = u \log u$. Then

$$D_f(p\|q) = \int q(x)\, \frac{p(x)}{q(x)} \log \frac{p(x)}{q(x)}\, d\mu(x) = \int p(x) \log \frac{p(x)}{q(x)}\, d\mu(x) = D(p\|q).$$
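For pmfs, i.e., densities w.r.t. the counting measure, the $f$-divergence reduces to a finite sum, so Example 2.2 can be checked numerically. A minimal sketch (illustrative names; strictly positive pmfs are used to sidestep the boundary conventions):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p||q) = sum_x q(x) f(p(x)/q(x)) for strictly positive pmfs p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(q * f(p / q))

def kl(p, q):
    """Relative entropy D(p||q) in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

# Example 2.2: f(u) = u log u recovers the relative entropy.
print(np.isclose(f_divergence(p, q, lambda u: u * np.log(u)), kl(p, q)))   # True
```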

Example 2.3. Now let $f(u) = -\log u$. Then

$$D_f(p\|q) = -\int q(x) \log \frac{p(x)}{q(x)}\, d\mu(x) = \int q(x) \log \frac{q(x)}{p(x)}\, d\mu(x) = D(q\|p).$$

Example 2.4. Combining the above two cases, let $f(u) = (u-1)\log u$. Then

$$D_f(p\|q) = D(p\|q) + D(q\|p),$$

which is symmetric in $(p,q)$.

Many basic distance functions on probability measures can be represented as $f$-divergences; see, for example, Liese and Vajda (2006).

2.4 MUTUAL INFORMATION

Let $(X,Y)$ be a pair of discrete random variables with joint pmf $p(x,y) = p(x)\,p(y|x)$. The amount of information about one provided by the other is measured by their mutual information

$$I(X;Y) = D(p(x,y)\|p(x)p(y)) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} = \sum_{x} p(x) \sum_{y} p(y|x) \log \frac{p(y|x)}{p(y)} = \sum_{x} p(x)\, D(p(y|x)\|p(y)).$$

The mutual information $I(X;Y)$ satisfies the following properties.

1. $I(X;Y)$ is a nonnegative function of $p(x,y)$.
2. $I(X;Y) = 0$ iff $X$ and $Y$ are independent, i.e., $p(x,y) \equiv p(x)p(y)$.
3. As a function of $(p(x), p(y|x))$, $I(X;Y)$ is concave in $p(x)$ for a fixed $p(y|x)$, and convex in $p(y|x)$ for a fixed $p(x)$.
4. Mutual information and entropy.
$$I(X;X) = H(X)$$
and
$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y).$$
5. Variational characterization.
$$I(X;Y) = \min_{q(y)} \sum_{x} p(x)\, D(p(y|x)\|q(y)), \tag{2.4}$$
where the minimum is attained by $q(y) \equiv p(y)$.

The proof of the first four properties is left as an exercise. For the fifth property, consider

$$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(y|x)}{p(y)} = \sum_{x,y} p(x)\,p(y|x) \log \Bigl(\frac{p(y|x)}{q(y)} \cdot \frac{q(y)}{p(y)}\Bigr) = \sum_{x} p(x)\, D(p(y|x)\|q(y)) - \sum_{y} p(y) \log \frac{p(y)}{q(y)} \le \sum_{x} p(x)\, D(p(y|x)\|q(y)),$$

where the last inequality follows since the subtracted term is equal to $D(p(y)\|q(y)) \ge 0$, and the inequality holds with equality iff $p(y) \equiv q(y)$.
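The defining expression and the identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$ in Property 4 can be checked on a small joint pmf. A minimal sketch (the joint pmf and names are illustrative):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def mutual_information(p_xy):
    """I(X;Y) = D(p(x,y) || p(x)p(y)) in bits for a joint pmf given as a matrix."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz]))

p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
I = mutual_information(p_xy)
H_X, H_Y, H_XY = entropy(p_xy.sum(1)), entropy(p_xy.sum(0)), entropy(p_xy)
print(np.isclose(I, H_X + H_Y - H_XY))   # True: Property 4
```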

Sometimes we are interested in the maximum mutual information $\max_{p(x)} I(X;Y)$ of a conditional pmf $p(y|x)$, which is referred to as the information capacity (or the capacity in short). By the variational characterization in (2.4), the information capacity can be expressed as

$$\max_{p(x)} I(X;Y) = \max_{p(x)} \min_{q(y)} \sum_{x} p(x)\, D(p(y|x)\|q(y)),$$

which can be viewed as a game between two players, one choosing $p(x)$ first and the other choosing $q(y)$ next, with the payoff function

$$f(p(x), q(y)) = \sum_{x} p(x)\, D(p(y|x)\|q(y)).$$

Using the following fundamental result in game theory, we show that the order of plays can be exchanged without affecting the outcome of the game.

Minimax theorem (Sion 1958). Suppose that $U$ and $V$ are compact convex subsets of the Euclidean space, and that a real-valued continuous function $f(u,v)$ on $U \times V$ is concave in $u$ for each $v$ and convex in $v$ for each $u$. Then

$$\max_{u \in U} \min_{v \in V} f(u,v) = \min_{v \in V} \max_{u \in U} f(u,v).$$

Since $f(p(x), q(y))$ is linear (thus concave) in $p(x)$ and convex in $q(y)$ (recall Property 3 of relative entropy), we can apply the minimax theorem and conclude that

$$\max_{p(x)} I(X;Y) = \max_{p(x)} \min_{q(y)} \sum_{x} p(x)\, D(p(y|x)\|q(y)) = \min_{q(y)} \max_{p(x)} \sum_{x} p(x)\, D(p(y|x)\|q(y)) = \min_{q(y)} \max_{x} D(p(y|x)\|q(y)),$$

where the last equality follows by noting that the maximum expectation is attained by putting all the weight on a value of $x$ that maximizes $D(p(y|x)\|q(y))$. Furthermore, if $p^*(x)$ attains the maximum, then by the optimality condition of (2.4),

$$q^*(y) \equiv \sum_{x} p^*(x)\, p(y|x)$$

attains the minimum.
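The max-min expression also suggests a simple numerical procedure for the information capacity of a discrete memoryless channel, in the spirit of the Blahut–Arimoto algorithm (not developed in these notes): alternately set $q(y)$ to the output pmf induced by the current $p(x)$, which is the minimizer in (2.4), and re-weight $p(x)$ toward inputs with large $D(p(y|x)\|q(y))$. For the current input pmf, $\sum_x p(x) D(p(y|x)\|q(y))$ equals $I(X;Y)$ and lower-bounds the capacity, while $\max_x D(p(y|x)\|q(y))$ upper-bounds it by the min-max expression above. The sketch below is only an illustration; all names are ours, and the channel matrix is assumed strictly positive to keep the logarithms finite.

```python
import numpy as np

def capacity_bounds(W, n_iter=200):
    """Lower and upper bounds on max_p I(X;Y) in nats for a DMC with
    W[x, y] = p(y|x), via Blahut-Arimoto-style alternating updates."""
    W = np.asarray(W, dtype=float)
    p = np.full(W.shape[0], 1.0 / W.shape[0])     # start from the uniform input pmf
    for _ in range(n_iter):
        q = p @ W                                 # induced output pmf: the minimizer in (2.4)
        d = np.sum(W * np.log(W / q), axis=1)     # d[x] = D(p(y|x) || q(y))
        p = p * np.exp(d)                         # re-weight toward inputs with large divergence
        p /= p.sum()
    q = p @ W
    d = np.sum(W * np.log(W / q), axis=1)
    return float(p @ d), float(d.max())           # (lower bound, upper bound)

# Binary symmetric channel with crossover 0.1; capacity = log 2 - H(0.1) ~= 0.368 nats.
W = np.array([[0.9, 0.1],
              [0.1, 0.9]])
print(capacity_bounds(W))
```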

2.5 ENTROPY RATE

Let $\mathbf{X} = (X_n)_{n=1}^{\infty}$ be a random process on a finite alphabet $\mathcal{X}$. The amount of uncertainty per symbol is measured by its entropy rate

$$H(\mathbf{X}) = \lim_{n\to\infty} \frac{1}{n} H(X_1, \ldots, X_n),$$

if the limit exists.

Example 2.5. If $\mathbf{X}$ is stationary, then the limit exists and

$$H(\mathbf{X}) = \lim_{n\to\infty} H(X_n | X^{n-1}).$$

Example 2.6. If $\mathbf{X}$ is an aperiodic irreducible Markov chain, then the limit exists and

$$H(\mathbf{X}) = \lim_{n\to\infty} H(X_n | X_{n-1}) = \sum_{x_1} \pi(x_1)\, H(p(x_2 | x_1)),$$

where $\pi$ is the unique stationary distribution of the chain.

Example 2.7. If $X_1, X_2, \ldots$ are independent and identically distributed (i.i.d.), then $H(\mathbf{X}) = H(X_1)$.

Example 2.8. Let $\mathbf{Y} = (Y_n)_{n=1}^{\infty}$ be a stationary Markov chain and $X_n = f(Y_n)$, $n = 1, 2, \ldots$. The resulting random process $\mathbf{X} = (X_n)_{n=1}^{\infty}$ is hidden Markov and its entropy rate satisfies

$$H(X_n | X^{n-1}, Y_1) \le H(\mathbf{X}) \le H(X_n | X^{n-1})$$

and

$$H(\mathbf{X}) = \lim_{n\to\infty} H(X_n | X^{n-1}, Y_1) = \lim_{n\to\infty} H(X_n | X^{n-1}).$$
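Example 2.6 is easy to evaluate numerically once the stationary distribution is found. A minimal sketch for a finite-state chain (assuming aperiodicity and irreducibility; names are illustrative):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def markov_entropy_rate(P):
    """Entropy rate in bits/symbol of an aperiodic irreducible Markov chain with
    transition matrix P[x1, x2] = p(x2 | x1), as in Example 2.6."""
    P = np.asarray(P, dtype=float)
    # Stationary distribution pi: the left eigenvector of P for eigenvalue 1.
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    pi = pi / pi.sum()
    return sum(pi[x] * entropy(P[x]) for x in range(len(pi)))

# Two-state chain with stationary distribution (0.8, 0.2).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
print(markov_entropy_rate(P))   # 0.8 * H(0.1) + 0.2 * H(0.4), about 0.57 bits/symbol
```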

2.6 RELATIVE ENTROPY RATE

Let $P$ and $Q$ be two probability measures on $\mathcal{X}^{\infty}$ with $n$-th order densities $p(x^n)$ and $q(x^n)$, respectively. The normalized discrepancy between them is measured by their relative entropy rate

$$D(P\|Q) = \lim_{n\to\infty} \frac{1}{n} D(p(x^n)\|q(x^n))$$

if the limit exists.

Example 2.9. If $P$ is stationary, $Q$ is stationary finite-order Markov, and $P$ is absolutely continuous w.r.t. $Q$, then the limit exists and

$$D(P\|Q) = \lim_{n\to\infty} \int p(x^{n-1})\, D\bigl(p(x_n|x^{n-1})\,\big\|\,q(x_n|x^{n-1})\bigr)\, dx^{n-1}. \tag{2.5}$$

See, for example, Barron (1985) and Gray (2011, Lemma 10.1).

Example 2.10. Similarly, if $P$ and $Q$ are stationary hidden Markov and $P$ is absolutely continuous w.r.t. $Q$, then the limit exists and (2.5) holds (Juang and Rabiner 1985, Leroux 1992, Ephraim and Merhav 2002).

PROBLEMS

2.1. Prove Property 2 of entropy in Section 2.1 and find the equality condition for Property 3.

2.2. Prove Properties 1 through 3 of relative entropy in Section 2.2.

2.3. Entropy and relative entropy. Let $\mathcal{X}$ be a finite alphabet and $q(x)$ be the uniform pmf on $\mathcal{X}$. Show that for any pmf $p(x)$ on $\mathcal{X}$,
$$D(p\|q) = \log|\mathcal{X}| - H(p(x)).$$

2.4. The total variation distance between two pmfs $p(x)$ and $q(x)$ is defined as
$$\delta(p,q) = \frac{1}{2} \sum_{x \in \mathcal{X}} |p(x) - q(x)|.$$
(a) Show that this distance is an $f$-divergence by finding the corresponding $f$.
(b) Show that
$$\delta(p,q) = \max_{A \subseteq \mathcal{X}} |P(A) - Q(A)|,$$
where $P$ and $Q$ are the corresponding probability measures, e.g., $P(A) = \sum_{x \in A} p(x)$.

2.5. Pinsker's inequality. Show that
$$\delta(p,q) \le \sqrt{\frac{1}{2\log e}\, D(p\|q)},$$
where the logarithm has the same base as the relative entropy. (Hint: First consider the case that $\mathcal{X}$ is binary.)

2.6. Let $p(x,y)$ be a joint pmf on $\mathcal{X} \times \mathcal{Y}$. Show that $p(x,y)$ is absolutely continuous w.r.t. $p(x)p(y)$.

2.7. Prove Properties 1 through 4 of mutual information in Section 2.4.

2.8. Let $\mathbf{X} = (X_n)_{n=1}^{\infty}$ be a stationary random process. Show that
$$H(\mathbf{X}) = \lim_{n\to\infty} H(X_n | X^{n-1}).$$

2.9. Let $\mathbf{Y} = (Y_n)_{n=1}^{\infty}$ be a stationary Markov chain and $\mathbf{X} = \{g(Y_n)\}$ be a hidden Markov process. Show that
$$H(X_n | X^{n-1}, Y_1) \le H(\mathbf{X}) \le H(X_n | X^{n-1})$$
and conclude that
$$H(\mathbf{X}) = \lim_{n\to\infty} H(X_n | X^{n-1}, Y_1) = \lim_{n\to\infty} H(X_n | X^{n-1}).$$

2.10. Recurrence time. Let $X_0, X_1, X_2, \ldots$ be i.i.d. copies of $X \sim p(x)$, and let $N = \min\{n \ge 1 \colon X_n = X_0\}$ be the waiting time to the next occurrence of $X_0$.
(a) Show that $\mathrm{E}(N) = |\mathcal{X}|$.
(b) Show that $\mathrm{E}(\log N) \le H(X)$.

2.11. The past and the future. Let $\mathbf{X} = (X_n)_{n=1}^{\infty}$ be stationary. Show that
$$\lim_{n\to\infty} \frac{1}{n} I(X_1, \ldots, X_n; X_{n+1}, \ldots, X_{2n}) = 0.$$

2.12. Variable-duration symbols. A discrete memoryless source has the alphabet $\{1, 2\}$, where symbol 1 has duration 1 and symbol 2 has duration 2. Let $\mathbf{X} = (X_n)_{n=1}^{\infty}$ be the resulting random process.
(a) Find its entropy rate $H(\mathbf{X})$ in terms of the probability $p$ of symbol 1.
(b) Find the maximum entropy rate by optimizing over $p$.
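The identity in Problem 2.3 and the bound in Problem 2.5 lend themselves to quick numerical spot-checks (which, of course, prove nothing). A minimal sketch in natural logarithms, where Pinsker's inequality reads $\delta(p,q) \le \sqrt{D(p\|q)/2}$; the random pmfs and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):        # relative entropy in nats (strictly positive pmfs)
    return float(np.sum(p * np.log(p / q)))

def tv(p, q):        # total variation distance
    return 0.5 * float(np.sum(np.abs(p - q)))

def entropy(p):      # entropy in nats
    return -float(np.sum(p * np.log(p)))

# Problem 2.3: D(p || uniform) = log|X| - H(p).
p = rng.dirichlet(np.ones(5))
u = np.full(5, 1.0 / 5)
assert np.isclose(kl(p, u), np.log(5) - entropy(p))

# Problem 2.5 (Pinsker, natural logs): delta(p, q) <= sqrt(D(p||q) / 2).
for _ in range(1000):
    p, q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
    assert tv(p, q) <= np.sqrt(kl(p, q) / 2)
print("all spot-checks passed")
```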

Bibliography

Barron, A. R. (1985). The strong ergodic theorem for densities: Generalized Shannon–McMillan–Breiman theorem. Ann. Probab., 13(4), 1292–1303.

Ephraim, Y. and Merhav, N. (2002). Hidden Markov processes. IEEE Trans. Inf. Theory, 48(6), 1518–1569.

Gray, R. M. (2011). Entropy and Information Theory. Springer, New York.

Juang, B.-H. F. and Rabiner, L. R. (1985). A probabilistic distance measure for hidden Markov models. AT&T Tech. J., 64(2), 391–408.

Leroux, B. G. (1992). Maximum-likelihood estimation for hidden Markov models. Stoch. Proc. Appl., 40(1), 127–143.

Liese, F. and Vajda, I. (2006). On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory, 52(10), 4394–4412.

Sion, M. (1958). On general minimax theorems. Pacific J. Math., 8, 171–176.