2. Information measures: mutual information

2.1 Divergence: main inequality

Theorem 2.1 (Information Inequality). D(P‖Q) ≥ 0; D(P‖Q) = 0 iff P = Q.

Proof. Let ϕ(x) ≜ x log x, which is strictly convex, and use Jensen's Inequality:

D(P‖Q) = Σ_{x∈X} P(x) log (P(x)/Q(x)) = Σ_{x∈X} Q(x) ϕ(P(x)/Q(x)) ≥ ϕ( Σ_{x∈X} Q(x) P(x)/Q(x) ) = ϕ(1) = 0.

2.2 Conditional divergence

The main objects in our course are random variables. The main operation for creating new random variables, and also for defining relations between random variables, is that of a random transformation:

Definition 2.1. A conditional probability distribution (aka random transformation, transition probability kernel, Markov kernel, channel) K(·|·) has two arguments: the first argument is a measurable subset of Y, the second argument is an element of X. It must satisfy:

1. For any x ∈ X: K(·|x) is a probability measure on Y.
2. For any measurable A: the function x ↦ K(A|x) is measurable on X.

In this case we will say that K acts from X to Y. In fact, we will abuse notation and write P_{Y|X} instead of K to suggest what the spaces X and Y are.¹ Furthermore, if X and Y are connected by the random transformation P_{Y|X} we will write X →(P_{Y|X})→ Y.

Remark 2.1. (Very technical!) Unfortunately, condition 2 (standard for probability textbooks) will frequently not be sufficiently strong for this course. The main reason is that we want Radon-Nikodym derivatives such as dP_{Y|X=x}/dQ_Y (y) to be jointly measurable in (x, y). See Section 2.6* for more.

Example:
1. deterministic system: Y = f(X) ⇔ P_{Y|X=x} = δ_{f(x)}
2. decoupled system: Y ⊥ X ⇔ P_{Y|X=x} = P_Y

¹ Another reason for writing P_{Y|X} is that from any joint distribution P_{X,Y} (on standard Borel spaces) one can extract a random transformation by conditioning on X.
3. additive noise (convolution): Y = X + Z with Z ⊥ X ⇔ P_{Y|X=x} = P_{x+Z}

Multiplication: given P_X and a kernel P_{Y|X}, multiply them to get the joint distribution P_{X,Y} = P_X P_{Y|X}:

P_{X,Y}(x, y) = P_{Y|X}(y|x) P_X(x).

Composition (Marginalization): P_Y = P_{Y|X} ∘ P_X, that is, P_{Y|X} acts on P_X to produce P_Y:

P_Y(y) = Σ_{x∈X} P_{Y|X}(y|x) P_X(x).

We will also write P_X →(P_{Y|X})→ P_Y.

Definition 2.2 (Conditional divergence).

D(P_{Y|X}‖Q_{Y|X}|P_X) = E_{x∼P_X}[D(P_{Y|X=x}‖Q_{Y|X=x})]   (2.1)
= Σ_{x∈X} P_X(x) D(P_{Y|X=x}‖Q_{Y|X=x}).   (2.2)

Note: H(X|Y) = log|A| − D(P_{X|Y}‖U_X|P_Y), where U_X is the uniform distribution on X.

Theorem 2.2 (Properties of Divergence).

1. D(P_{Y|X}‖Q_{Y|X}|P_X) = D(P_X P_{Y|X} ‖ P_X Q_{Y|X})

2. (Simple chain rule) D(P_{X,Y}‖Q_{X,Y}) = D(P_{Y|X}‖Q_{Y|X}|P_X) + D(P_X‖Q_X)

3. (Monotonicity) D(P_{X,Y}‖Q_{X,Y}) ≥ D(P_Y‖Q_Y)

4. (Full chain rule)
D(P_{X_1⋯X_n}‖Q_{X_1⋯X_n}) = Σ_{i=1}^n D(P_{X_i|X^{i−1}}‖Q_{X_i|X^{i−1}}|P_{X^{i−1}})
In the special case of Q_{X^n} = Π_i Q_{X_i} we have
D(P_{X_1⋯X_n}‖Q_{X_1}⋯Q_{X_n}) = D(P_{X_1⋯X_n}‖P_{X_1}⋯P_{X_n}) + Σ_i D(P_{X_i}‖Q_{X_i})

5. (Conditioning increases divergence) Let P_{Y|X} and Q_{Y|X} be two kernels, and let P_Y = P_{Y|X} ∘ P_X and Q_Y = Q_{Y|X} ∘ P_X. Then
D(P_Y‖Q_Y) ≤ D(P_{Y|X}‖Q_{Y|X}|P_X)
with equality iff D(P_{X|Y}‖Q_{X|Y}|P_Y) = 0.
Pictorially: the same input P_X is fed into the two kernels, P_X →(P_{Y|X})→ P_Y and P_X →(Q_{Y|X})→ Q_Y.
6. (Data-processing for divergences) Let

P_Y(·) = ∫ P_{Y|X}(·|x) dP_X(x) and Q_Y(·) = ∫ P_{Y|X}(·|x) dQ_X(x).

Then

D(P_Y‖Q_Y) ≤ D(P_X‖Q_X).   (2.3)

Pictorially: P_X →(P_{Y|X})→ P_Y and Q_X →(P_{Y|X})→ Q_Y, and D(P_X‖Q_X) ≥ D(P_Y‖Q_Y).

Proof. We only illustrate these results for the case of finite alphabets. The general case follows by doing a careful analysis of Radon-Nikodym derivatives, introduction of regular branches of conditional probability, etc. For certain cases (e.g. separable metric spaces), however, we can simply discretize alphabets and take the granularity of discretization to 0. This method will become clearer in Lecture 4, once we understand continuity of D.

1. E_{x∼P_X}[D(P_{Y|X=x}‖Q_{Y|X=x})] = E_{(X,Y)∼P_X P_{Y|X}}[log (P_X P_{Y|X})/(P_X Q_{Y|X})] = D(P_X P_{Y|X} ‖ P_X Q_{Y|X}).

2. Disintegration: E_{(X,Y)}[log P_{X,Y}/Q_{X,Y}] = E_{(X,Y)}[log P_{Y|X}/Q_{Y|X} + log P_X/Q_X].

3. Apply 2 with X and Y interchanged and use D(·‖·) ≥ 0.

4. Telescoping: P_{X^n} = Π_{i=1}^n P_{X_i|X^{i−1}} and Q_{X^n} = Π_{i=1}^n Q_{X_i|X^{i−1}}.

5. The inequality follows from monotonicity. To get conditions for equality, notice that by the chain rule for D (applied with Q_{X,Y} = P_X Q_{Y|X}):

D(P_{X,Y}‖Q_{X,Y}) = D(P_{Y|X}‖Q_{Y|X}|P_X) + D(P_X‖P_X)   [the last term = 0]
= D(P_{X|Y}‖Q_{X|Y}|P_Y) + D(P_Y‖Q_Y),

and hence we get the claimed result from positivity of D.

6. This again follows from monotonicity.

Corollary 2.1. D(P_{X_1⋯X_n}‖Q_{X_1}⋯Q_{X_n}) ≥ Σ_i D(P_{X_i}‖Q_{X_i}), with equality iff P_{X^n} = Π_{j=1}^n P_{X_j}.

Note: In general we can have D(P_{X,Y}‖Q_{X,Y}) ≷ D(P_X‖Q_X) + D(P_Y‖Q_Y). For example, if X = Y under both P and Q, then D(P_{X,Y}‖Q_{X,Y}) = D(P_X‖Q_X) < 2D(P_X‖Q_X) = D(P_X‖Q_X) + D(P_Y‖Q_Y). Conversely, if P_X = Q_X and P_Y = Q_Y but P_{X,Y} ≠ Q_{X,Y}, we have D(P_{X,Y}‖Q_{X,Y}) > 0 = D(P_X‖Q_X) + D(P_Y‖Q_Y).

Corollary 2.2. Y = f(X) ⇒ D(P_Y‖Q_Y) ≤ D(P_X‖Q_X), with equality if f is 1-1.

Note: D(P_Y‖Q_Y) = D(P_X‖Q_X) does not imply that f is 1-1. Example: P_X = Gaussian, Q_X = Laplace, Y = |X|.
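The finite-alphabet statements above are easy to sanity-check numerically. Below is a minimal NumPy sketch, not part of the original notes: the helper D and the specific random distributions and kernel are illustrative only. It verifies Theorem 2.1, the simple chain rule and monotonicity from Theorem 2.2, and the data-processing inequality (2.3).

```python
import numpy as np

def D(P, Q):
    """KL divergence D(P||Q) in bits for pmfs on a finite alphabet;
    +inf if P is not dominated by Q (some P(x) > 0 where Q(x) = 0)."""
    P, Q = np.asarray(P, float).ravel(), np.asarray(Q, float).ravel()
    m = P > 0                                    # 0 log 0 = 0 convention
    return np.inf if np.any(m & (Q == 0)) else float(np.sum(P[m] * np.log2(P[m] / Q[m])))

rng = np.random.default_rng(0)

# Theorem 2.1: D >= 0, with D(P||P) = 0.
P = rng.random(6); P /= P.sum()
Q = rng.random(6); Q /= Q.sum()
assert D(P, Q) >= 0 and D(P, P) == 0

# Theorem 2.2 (simple chain rule and monotonicity) on random joint pmfs PXY[x, y].
PXY = rng.random((3, 4)); PXY /= PXY.sum()
QXY = rng.random((3, 4)); QXY /= QXY.sum()
PX, QX = PXY.sum(axis=1), QXY.sum(axis=1)
# D(P_{Y|X} || Q_{Y|X} | P_X) computed term by term over x:
cond = sum(PX[x] * D(PXY[x] / PX[x], QXY[x] / QX[x]) for x in range(3))
assert np.isclose(D(PXY, QXY), cond + D(PX, QX))                    # chain rule
assert D(PXY, QXY) >= D(PXY.sum(axis=0), QXY.sum(axis=0)) - 1e-12   # monotonicity

# Data processing (2.3): pushing P_X, Q_X through the same kernel cannot increase D.
W = rng.random((6, 4)); W /= W.sum(axis=1, keepdims=True)  # kernel W[x, y] = P_{Y|X}(y|x)
assert D(P @ W, Q @ W) <= D(P, Q) + 1e-12
```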
Corollary 2.3 (Large deviations estimate). For any subset E ⊆ X we have

d(P_X[E] ‖ Q_X[E]) ≤ D(P_X‖Q_X).

Proof. Consider Y = 1{X ∈ E}.

2.3 Mutual information

Definition 2.3 (Mutual information).

I(X;Y) = D(P_{X,Y} ‖ P_X P_Y).

Note: Intuitively, I(X;Y) measures the dependence between X and Y, or, the information about X (resp. Y) provided by Y (resp. X). It was defined by Shannon (in a different form), and in this form by Fano. It is not restricted to discrete X, Y. I(X;Y) is a functional of the joint distribution P_{X,Y}, or equivalently, of the pair (P_X, P_{Y|X}).

Theorem 2.3 (Properties of I).

1. I(X;Y) = D(P_{X,Y}‖P_X P_Y) = D(P_{Y|X}‖P_Y|P_X) = D(P_{X|Y}‖P_X|P_Y)
2. Symmetry: I(X;Y) = I(Y;X)
3. Positivity: I(X;Y) ≥ 0; I(X;Y) = 0 iff X ⊥ Y
4. I(f(X);Y) ≤ I(X;Y); if f is one-to-one, then I(f(X);Y) = I(X;Y)
5. "More data ⇒ More info": I(X_1, X_2; Z) ≥ I(X_1; Z)

Proof.

1. I(X;Y) = E log P_{X,Y}/(P_X P_Y) = E log P_{Y|X}/P_Y = E log P_{X|Y}/P_X.

2. Apply the data-processing inequality twice to the map (x, y) ↦ (y, x) to get D(P_{X,Y}‖P_X P_Y) = D(P_{Y,X}‖P_Y P_X).

3. By definition.

4. We will use the data-processing property of mutual information (to be proved shortly, see Theorem 2.5). Consider the chain of data processing: (x, y) ↦ (f(x), y) ↦ (f⁻¹(f(x)), y). Then

I(X;Y) ≥ I(f(X);Y) ≥ I(f⁻¹(f(X));Y) = I(X;Y).

5. Consider f(X_1, X_2) = X_1.
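As with divergence, the basic properties in Theorem 2.3 can be checked on a random joint pmf. The sketch below is illustrative only (the helpers D and I are ad hoc, bits throughout): it verifies positivity, symmetry, vanishing under independence, and the equivalent conditional-divergence form in property 1.

```python
import numpy as np

def D(P, Q):
    P, Q = np.asarray(P, float).ravel(), np.asarray(Q, float).ravel()
    m = P > 0
    return np.inf if np.any(m & (Q == 0)) else float(np.sum(P[m] * np.log2(P[m] / Q[m])))

def I(PXY):
    """I(X;Y) = D(P_{X,Y} || P_X P_Y) in bits for a joint pmf PXY[x, y]."""
    PX, PY = PXY.sum(axis=1), PXY.sum(axis=0)
    return D(PXY, np.outer(PX, PY))

rng = np.random.default_rng(1)
PXY = rng.random((4, 5)); PXY /= PXY.sum()
PX, PY = PXY.sum(axis=1), PXY.sum(axis=0)

assert I(PXY) >= 0                               # positivity
assert np.isclose(I(PXY), I(PXY.T))              # symmetry I(X;Y) = I(Y;X)
assert np.isclose(I(np.outer(PX, PY)), 0.0)      # I = 0 when X and Y are independent
# Property 1: the conditional-divergence form D(P_{Y|X} || P_Y | P_X) agrees.
cond_form = sum(PX[x] * D(PXY[x] / PX[x], PY) for x in range(PXY.shape[0]))
assert np.isclose(I(PXY), cond_form)
```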
Theorem 2.4 (I v.s. H).

1. I(X;X) = H(X) if X is discrete, and I(X;X) = +∞ otherwise.

2. If X, Y are discrete, then I(X;Y) = H(X) + H(Y) − H(X,Y). If only X is discrete, then I(X;Y) = H(X) − H(X|Y).

3. If X, Y are real-valued vectors, have a joint pdf and all three differential entropies are finite, then I(X;Y) = h(X) + h(Y) − h(X,Y). If X has marginal pdf p_X and conditional pdf p_{X|Y}(x|y), then I(X;Y) = h(X) − h(X|Y).

4. If X or Y is discrete, then I(X;Y) ≤ min(H(X), H(Y)), with equality iff H(X|Y) = 0 or H(Y|X) = 0, i.e., one is a deterministic function of the other.

Proof.

1. By definition, I(X;X) = D(P_{X,X}‖P_X P_X) = E_{x∼P_X} D(δ_x‖P_X). If P_X is discrete, then D(δ_x‖P_X) = log (1/P_X(x)) and I(X;X) = H(X). If P_X is not discrete, then let A = {x : P_X({x}) > 0} denote the set of atoms of P_X. Let Δ = {(x, x) : x ∉ A} ⊆ X × X. Then P_{X,X}(Δ) = P_X(A^c) > 0, but since

(P_X × P_X)(E) ≜ ∫∫ P_X(dx_1) P_X(dx_2) 1{(x_1, x_2) ∈ E},

we have, by taking E = Δ, that (P_X × P_X)(Δ) = 0. Thus P_{X,X} is not absolutely continuous w.r.t. P_X × P_X, and thus I(X;X) = D(P_{X,X}‖P_X P_X) = +∞.

2. E log P_{X,Y}/(P_X P_Y) = E[log 1/P_X + log 1/P_Y − log 1/P_{X,Y}].

3. Similarly, when P_{X,Y}, P_X and P_Y have densities p_{X,Y}, p_X and p_Y, we have

D(P_{X,Y}‖P_X P_Y) = E[log p_{X,Y}/(p_X p_Y)] = h(X) + h(Y) − h(X,Y).

4. Follows from 2.

Corollary 2.4 (Conditioning reduces entropy). X discrete: H(X|Y) ≤ H(X), with equality iff X ⊥ Y.

Intuition: the amount of entropy reduction = mutual information.

Example: X = U OR Y, where U, Y are i.i.d. Bern(1/2). Then X ∼ Bern(3/4) and H(X) = h(1/4) < 1 bit = H(X|Y = 0), i.e., conditioning on the event Y = 0 increases entropy. But on average, H(X|Y) = P[Y = 0] H(X|Y = 0) + P[Y = 1] H(X|Y = 1) = 1/2 bit < H(X), by the strong concavity of h(·).
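A short numerical check of the OR example, writing out the joint pmf explicitly (the helper H is ad hoc, bits throughout; this sketch is not part of the notes): conditioning on the event Y = 0 increases the entropy of X, but conditioning on Y on average brings it below H(X).

```python
import numpy as np

def H(p):
    """Entropy in bits of a finite pmf."""
    p = np.asarray(p, float); p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# X = U OR Y with U, Y i.i.d. Bern(1/2); joint pmf of (X, Y):
# P[X=0,Y=0]=1/4, P[X=1,Y=0]=1/4, P[X=0,Y=1]=0, P[X=1,Y=1]=1/2.
PXY = np.array([[0.25, 0.0],
                [0.25, 0.5]])        # rows: X = 0/1, columns: Y = 0/1
PX, PY = PXY.sum(axis=1), PXY.sum(axis=0)

HX = H(PX)                                             # = h(1/4) ~ 0.811 bits
HX_given_y0 = H(PXY[:, 0] / PY[0])                     # = 1 bit  > H(X)
HX_given_Y = sum(PY[y] * H(PXY[:, y] / PY[y]) for y in range(2))   # = 0.5 bit
assert HX_given_y0 > HX > HX_given_Y
print(HX, HX_given_y0, HX_given_Y)
```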
Note: Information, entropy and Venn diagrams:

1. The following Venn diagram illustrates the relationship between entropy, conditional entropy, joint entropy, and mutual information.

[Venn diagram: two overlapping circles representing H(X) and H(Y); their union is H(X,Y), their intersection is I(X;Y), and the non-overlapping parts are H(X|Y) and H(Y|X).]

2. If you do the same for 3 variables, you will discover that the triple intersection corresponds to

H(X_1) + H(X_2) + H(X_3) − H(X_1,X_2) − H(X_2,X_3) − H(X_1,X_3) + H(X_1,X_2,X_3),   (2.4)

which is sometimes denoted I(X;Y;Z). It can be both positive and negative (why?).

3. In general, one can treat random variables as sets (so that the r.v. X_i corresponds to a set E_i and (X_1,X_2) corresponds to E_1 ∪ E_2). Then we can define a unique signed measure µ on the finite algebra generated by these sets so that every information quantity is found by replacing I/H ↦ µ; , ↦ ∪; ; ↦ ∩; | ↦ ∖. As an example, we have

H(X_1|X_2,X_3) = µ(E_1 ∖ (E_2 ∪ E_3)),   (2.5)
I(X_1,X_2; X_3|X_4) = µ(((E_1 ∪ E_2) ∩ E_3) ∖ E_4).   (2.6)

By inclusion-exclusion, quantity (2.4) corresponds to µ(E_1 ∩ E_2 ∩ E_3), which explains why µ is not necessarily a positive measure.

Example: Bivariate Gaussian. For X, Y jointly Gaussian,

I(X;Y) = (1/2) log (1/(1 − ρ_{XY}²)),

where ρ_{XY} ≜ E[(X − EX)(Y − EY)]/(σ_X σ_Y) ∈ [−1, 1] is the correlation coefficient.

[Plot: I(X;Y) as a function of ρ_{XY} ∈ [−1, 1]; zero at ρ_{XY} = 0, growing without bound as |ρ_{XY}| → 1.]

Proof. WLOG, by shifting and scaling if necessary, we can assume EX = EY = 0 and EX² = EY² = 1. Then ρ = E[XY]. By joint Gaussianity, Y = ρX + Z for some Z ∼ N(0, 1 − ρ²) independent of X. Then, using the divergence formula for Gaussians (1.16), we get

I(X;Y) = D(P_{Y|X}‖P_Y|P_X) = E D(N(ρX, 1 − ρ²) ‖ N(0, 1))
= E[(1/2) log (1/(1 − ρ²)) + (log e)/2 ((ρX)² + 1 − ρ² − 1)]
= (1/2) log (1/(1 − ρ²)).
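The proof above can be mimicked numerically: sample X ∼ N(0,1) and average D(N(ρX, 1−ρ²) ‖ N(0,1)) by Monte Carlo. The sketch below is illustrative only and assumes the standard closed form for the divergence between two scalar Gaussians (in nats); the sample average matches ½ log 1/(1−ρ²) up to sampling error.

```python
import numpy as np

def D_gauss(m1, v1, m0, v0):
    """D(N(m1, v1) || N(m0, v0)) in nats (standard scalar-Gaussian formula)."""
    return 0.5 * (np.log(v0 / v1) + (v1 + (m1 - m0) ** 2) / v0 - 1.0)

rho = 0.6
closed_form = 0.5 * np.log(1.0 / (1.0 - rho ** 2))     # I(X;Y) in nats

# Proof route: I(X;Y) = E_X D(N(rho*X, 1 - rho^2) || N(0, 1)) with X ~ N(0, 1).
rng = np.random.default_rng(2)
X = rng.standard_normal(200_000)
monte_carlo = D_gauss(rho * X, 1 - rho ** 2, 0.0, 1.0).mean()
print(closed_form, monte_carlo)     # the two agree up to Monte Carlo error
```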
Note: Similar to the role of mutual information, the correlation coefficient also measures the dependency between random variables which are real-valued (more generally, valued in an inner-product space) in a certain sense. However, mutual information is invariant to bijections and more general: it can be defined not just for numerical random variables, but also for apples and oranges.

Example: Additive white Gaussian noise (AWGN) channel. Let X ⊥ N be independent Gaussians and Y = X + N. Then

I(X; X + N) = (1/2) log (1 + σ_X²/σ_N²),

where σ_X²/σ_N² is the signal-to-noise ratio (SNR).

Example: Gaussian vectors. For X ∈ R^m, Y ∈ R^n jointly Gaussian,

I(X;Y) = (1/2) log (det Σ_X det Σ_Y / det Σ_{[X,Y]}),

where Σ_X ≜ E[(X − EX)(X − EX)^T] denotes the covariance matrix of X ∈ R^m, and Σ_{[X,Y]} denotes the covariance matrix of the random vector [X, Y] ∈ R^{m+n}. In the special case of additive noise, Y = X + N for N ⊥ X, we have

I(X; X + N) = (1/2) log (det(Σ_X + Σ_N) / det Σ_N),

since (why?) det Σ_{[X,X+N]} = det [ Σ_X, Σ_X ; Σ_X, Σ_X + Σ_N ] = det Σ_X det Σ_N.

Example: Binary symmetric channel (BSC). Let X ∼ Bern(1/2), N ∼ Bern(δ) with N ⊥ X, and Y = X + N.

[Transition diagram of BSC(δ): 0 → 0 and 1 → 1 with probability 1 − δ; 0 → 1 and 1 → 0 with probability δ.]

I(X;Y) = log 2 − h(δ).

Example: Addition over finite groups. X is uniform on G and independent of Z. Then

I(X; X + Z) = log |G| − H(Z).

Proof. Show that X + Z is uniform on G regardless of Z.
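The last two examples are easy to verify by brute force. The sketch below (ad hoc helpers H and I, bits throughout, not part of the notes) builds the joint pmf of (X, X+Z) for X uniform on Z_m and an arbitrary independent Z, and checks I(X; X+Z) = log|G| − H(Z); the BSC example is recovered with m = 2 and Z ∼ Bern(δ).

```python
import numpy as np

def H(p):
    p = np.asarray(p, float).ravel(); p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def I(PAB):
    """I(A;B) = H(A) + H(B) - H(A,B) for a joint pmf PAB[a, b]."""
    return H(PAB.sum(axis=1)) + H(PAB.sum(axis=0)) - H(PAB)

m = 5
rng = np.random.default_rng(3)
PZ = rng.random(m); PZ /= PZ.sum()          # arbitrary noise distribution on Z_m

# Joint pmf of (X, Y) with X uniform on Z_m, Y = X + Z (mod m), Z independent of X.
PXY = np.zeros((m, m))
for x in range(m):
    for z in range(m):
        PXY[x, (x + z) % m] += PZ[z] / m

assert np.isclose(I(PXY), np.log2(m) - H(PZ))   # I(X; X+Z) = log|G| - H(Z)
# The BSC example is the case m = 2, PZ = [1 - delta, delta]: I = log 2 - h(delta).
```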
2.4 Conditional mutual information and conditional independence

Definition 2.4 (Conditional mutual information).

I(X;Y|Z) = D(P_{X,Y|Z} ‖ P_{X|Z} P_{Y|Z} | P_Z)   (2.7)
= E_{z∼P_Z}[I(X;Y|Z = z)],   (2.8)

where the product of two random transformations is (P_{X|Z=z} P_{Y|Z=z})(x, y) ≜ P_{X|Z}(x|z) P_{Y|Z}(y|z), under which X and Y are independent conditioned on Z.

Note: I(X;Y|Z) is a functional of P_{X,Y,Z}.

Remark 2.2 (Conditional independence). A family of distributions can be represented by a directed acyclic graph. A simple example is a Markov chain (line graph), which represents distributions that factor as P_{X,Y,Z} = P_X P_{Y|X} P_{Z|Y}:

X → Y → Z ⇔ P_{X,Y,Z} = P_X P_{Y|X} P_{Z|Y}
⇔ P_{Z|X,Y} = P_{Z|Y}
⇔ P_{X,Z|Y} = P_{X|Y} P_{Z|Y}
⇔ X and Z are conditionally independent given Y (cond. indep., notation X ⊥ Z | Y)
⇔ Z → Y → X.

When this holds we say that X, Y, Z form a Markov chain, with notation X → Y → Z.

Theorem 2.5 (Further properties of Mutual Information).

1. I(X;Z|Y) ≥ 0, with equality iff X → Y → Z.

2. (Kolmogorov identity or small chain rule) For any P_{X,Y,Z}:
I(X,Y;Z) = I(X;Z) + I(Y;Z|X) = I(Y;Z) + I(X;Z|Y).

3. (Data processing) If X → Y → Z, then
a) I(X;Z) ≤ I(X;Y);
b) I(X;Y|Z) ≤ I(X;Y).

4. (Full chain rule) I(X^n;Y) = Σ_{k=1}^n I(X_k;Y|X^{k−1}).
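Before turning to the proof, here is a quick numerical sanity check, illustrative only (ad hoc entropy helper H, bits throughout), of positivity of conditional mutual information and of the Kolmogorov identity, with all quantities expressed through joint entropies of a random pmf on three ternary variables.

```python
import numpy as np

def H(P):
    """Joint entropy in bits of a pmf given as an array of any shape."""
    P = np.asarray(P, float).ravel(); P = P[P > 0]
    return float(-np.sum(P * np.log2(P)))

rng = np.random.default_rng(5)
P = rng.random((3, 3, 3)); P /= P.sum()          # generic joint pmf P[x, y, z]

# Marginals obtained by summing out the remaining coordinates.
PXY, PXZ, PYZ = P.sum(axis=2), P.sum(axis=1), P.sum(axis=0)
PX, PY, PZ = PXY.sum(axis=1), PXY.sum(axis=0), PXZ.sum(axis=0)

# I(A;B) = H(A) + H(B) - H(A,B);  I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C).
I_XY_Z  = H(PXY) + H(PZ) - H(P)                  # I(X,Y;Z)
I_X_Z   = H(PX) + H(PZ) - H(PXZ)                 # I(X;Z)
I_Y_Z_X = H(PXY) + H(PXZ) - H(P) - H(PX)         # I(Y;Z|X)
I_Y_Z   = H(PY) + H(PZ) - H(PYZ)                 # I(Y;Z)
I_X_Z_Y = H(PXY) + H(PYZ) - H(P) - H(PY)         # I(X;Z|Y)

assert np.isclose(I_XY_Z, I_X_Z + I_Y_Z_X)       # Kolmogorov identity
assert np.isclose(I_XY_Z, I_Y_Z + I_X_Z_Y)
assert I_X_Z_Y >= -1e-12                         # property 1: I(X;Z|Y) >= 0
```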
Proof.

1. By definition and Theorem 2.3.3.

2. Use the identity

P_{X,Y,Z}/(P_{X,Y} P_Z) = [P_{X,Z}/(P_X P_Z)] · [P_{X,Y,Z} P_X/(P_{X,Y} P_{X,Z})];

taking log and expectation over P_{X,Y,Z} gives I(X,Y;Z) = I(X;Z) + I(Y;Z|X).

3. Apply the Kolmogorov identity to I(Y,Z;X):

I(Y,Z;X) = I(X;Y) + I(X;Z|Y)   [the last term = 0 since X → Y → Z]
= I(X;Z) + I(X;Y|Z).

4. Recursive application of the Kolmogorov identity.

Example: for a 1-to-1 function f, I(X;Y) = I(X;f(Y)).

Note: In general, I(X;Y|Z) ≷ I(X;Y). Examples:

a) ">": Conditioning does not always decrease mutual information. To find counterexamples when X, Y, Z do not form a Markov chain, notice that there is only one directed acyclic graph non-isomorphic to X → Y → Z, namely X → Y ← Z. Then a counterexample is: X, Z i.i.d. ∼ Bern(1/2) and Y = X ⊕ Z, for which

I(X;Y|Z) = I(X; X ⊕ Z | Z) = H(X) = log 2, while I(X;Y) = 0 since X ⊥ Y.

b) "<": Take Z = Y. Then I(X;Y|Z) = I(X;Y|Y) = 0.

Note: (Chain rule for I ⇒ Chain rule for H) Set Y = X^n. Then

H(X^n) = I(X^n;X^n) = Σ_{k=1}^n I(X_k;X^n|X^{k−1}) = Σ_{k=1}^n H(X_k|X^{k−1}),

since H(X_k|X^n, X^{k−1}) = 0.

Remark 2.3 (Data processing for mutual information via data processing of divergence). We proved data processing for mutual information in Theorem 2.5 using Kolmogorov's identity. In fact, data processing for mutual information is implied by the data processing for divergence:

I(X;Z) = D(P_{Z|X}‖P_Z|P_X) ≤ D(P_{Y|X}‖P_Y|P_X) = I(X;Y),

where we note that for each x we have P_{Y|X=x} →(P_{Z|Y})→ P_{Z|X=x} and P_Y →(P_{Z|Y})→ P_Z. Therefore, if we have a bivariate functional of distributions D(P‖Q) which satisfies data processing, then we can define an M.I.-like quantity via

I_D(X;Y) ≜ D(P_{Y|X}‖P_Y|P_X) ≜ E_{x∼P_X} D(P_{Y|X=x}‖P_Y),

which will satisfy data processing on Markov chains. A rich class of examples arises by taking D = D_f (an f-divergence, defined in (1.15)). That f-divergences satisfy data processing is going to be shown in Remark 4.2.

2.5 Strong data-processing inequalities

For many random transformations P_{Y|X}, it is possible to improve the data-processing inequality (2.3): For any P_X, Q_X we have

D(P_Y‖Q_Y) ≤ η_KL D(P_X‖Q_X),

where η_KL < 1 depends on the channel P_{Y|X} only. Similarly, this gives an improvement in the data-processing inequality for mutual information: For any P_{U,X} we have

U → X → Y  ⇒  I(U;Y) ≤ η_KL I(U;X).

For example, for P_{Y|X} = BSC(δ) we have η_KL = (1 − 2δ)². Strong data-processing inequalities quantify the intuitive observation that noise inside the channel P_{Y|X} must reduce the information that Y carries about the data U, regardless of how smart the hookup U → X is. This is an active area of research; see [PW15] for a short summary.
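A numerical illustration of the strong data-processing inequality for the BSC, assuming the contraction coefficient η_KL = (1 − 2δ)² stated above: for a random U → X chain followed by a BSC(δ), the sketch below (ad hoc helpers, bits, not part of the notes) checks I(U;Y) ≤ η_KL I(U;X).

```python
import numpy as np

def H(p):
    p = np.asarray(p, float).ravel(); p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def I(PAB):
    return H(PAB.sum(axis=1)) + H(PAB.sum(axis=0)) - H(PAB)

delta = 0.2
eta = (1 - 2 * delta) ** 2                       # contraction coefficient of BSC(delta)
BSC = np.array([[1 - delta, delta], [delta, 1 - delta]])

rng = np.random.default_rng(6)
PU = rng.random(2); PU /= PU.sum()
PX_given_U = rng.random((2, 2)); PX_given_U /= PX_given_U.sum(axis=1, keepdims=True)

PUX = PU[:, None] * PX_given_U                   # joint pmf of (U, X)
PUY = PUX @ BSC                                  # push X through the BSC: joint of (U, Y)
assert I(PUY) <= eta * I(PUX) + 1e-12
print(I(PUY), eta * I(PUX))
```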
2.6* How to avoid measurability problems?

As we mentioned in Remark 2.1, the conditions imposed by Definition 2.1 on P_{Y|X} are insufficient. Namely, we get the following two issues:

1. Radon-Nikodym derivatives such as dP_{Y|X=x}/dQ_Y (y) may not be jointly measurable in (x, y).
2. The set {x : P_{Y|X=x} ≪ Q_Y} may not be measurable.

The easiest way to avoid all such problems is the following:

Agreement A1: All conditional kernels P_{Y|X} : X → Y in these notes will be assumed to be defined by choosing a σ-finite measure µ_2 on Y and a measurable function ρ(y|x) ≥ 0 on X × Y such that

P_{Y|X}(A|x) = ∫_A ρ(y|x) µ_2(dy)

for all x and measurable sets A, and ∫_Y ρ(y|x) µ_2(dy) = 1 for all x.

Notes:

1. Given another kernel Q_{Y|X} specified via ρ′(y|x) and µ′_2, we may first replace µ_2 and µ′_2 by µ″_2 = µ_2 + µ′_2 and thus assume that both P_{Y|X} and Q_{Y|X} are specified in terms of the same dominating measure µ_2. (This modifies ρ(y|x) to ρ(y|x) dµ_2/dµ″_2 (y).)

2. Given two kernels P_{Y|X} and Q_{Y|X} specified in terms of the same dominating measure µ_2 and functions ρ_P(y|x) and ρ_Q(y|x), respectively, we may set

dP_{Y|X}/dQ_{Y|X} ≜ ρ_P(y|x)/ρ_Q(y|x)

outside of {ρ_Q = 0}. When P_{Y|X=x} ≪ Q_{Y|X=x}, the above gives a version of the Radon-Nikodym derivative, which is automatically measurable in (x, y).

3. Given Q_Y specified as dQ_Y = q(y) dµ_2, we may set

A_0 ≜ {x : ∫_{{q=0}} ρ(y|x) µ_2(dy) = 0}.

This set plays the role of {x : P_{Y|X=x} ≪ Q_Y}. Unlike the latter, A_0 is guaranteed to be measurable by Fubini [Ç11, Prop. 6.9]. By "plays the role" we mean that it allows us to prove statements like: for any P_X, the joint distribution P_X P_{Y|X} is dominated by P_X × Q_Y if and only if P_X[A_0] = 1.

So, while our agreement resolves the two measurability problems above, it introduces a new one. Indeed, given a joint distribution P_{X,Y} on standard Borel spaces, it is always true that one can extract a conditional distribution P_{Y|X} satisfying Definition 2.1 (this is called disintegration). However, it is not guaranteed that P_{Y|X} will satisfy Agreement A1. To work around this issue as well, we add another agreement:
Agreement A2: All joint distributions P_{X,Y} are specified by means of the following data: σ-finite measures µ_1, µ_2 on X and Y, respectively, and a measurable function λ(x, y) such that

P_{X,Y}(E) ≜ ∫_E λ(x, y) µ_1(dx) µ_2(dy).

Notes:

1. Again, given a finite or countable collection of joint distributions P_{X,Y}, Q_{X,Y}, ... satisfying A2, we may without loss of generality assume they are defined in terms of a common µ_1, µ_2.

2. Given P_{X,Y} satisfying A2, we can disintegrate it into a conditional (satisfying A1) and a marginal:

P_{Y|X}(A|x) = ∫_A ρ(y|x) µ_2(dy),   ρ(y|x) ≜ λ(x, y)/p(x),   (2.9)
P_X(A) = ∫_A p(x) µ_1(dx),   p(x) ≜ ∫_Y λ(x, η) µ_2(dη),   (2.10)

with ρ(y|x) defined arbitrarily for those x for which p(x) = 0.

Remark 2.4. The first problem can also be resolved with the help of Doob's version of the Radon-Nikodym theorem [Ç11, Chapter V.4, Theorem 4.44]: If the σ-algebra on Y is separable (satisfied whenever Y is a Polish space, for example) and P_{Y|X=x} ≪ Q_{Y|X=x}, then there exists a jointly measurable version of the Radon-Nikodym derivative (x, y) ↦ dP_{Y|X=x}/dQ_{Y|X=x}(y).
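To make the disintegration formulas (2.9)-(2.10) concrete, here is a minimal discrete sketch (illustrative only, not part of the notes) in which µ_1, µ_2 are counting measures on finite alphabets, so λ(x, y) is just a joint pmf; this toy case has none of the measurability subtleties discussed above.

```python
import numpy as np

# With counting measures, (2.9)-(2.10) read:
#   p(x) = sum_y lambda(x, y),   rho(y|x) = lambda(x, y) / p(x)   (arbitrary if p(x) = 0).
rng = np.random.default_rng(7)
lam = rng.random((3, 4)); lam /= lam.sum()      # lambda(x, y): a joint pmf

p = lam.sum(axis=1)                             # marginal density p(x), eq. (2.10)
rho = np.where(p[:, None] > 0, lam / p[:, None], 1.0 / lam.shape[1])   # eq. (2.9)

# Multiplying the pieces back together recovers the joint: P_X x P_{Y|X} = P_{X,Y}.
assert np.allclose(p[:, None] * rho, lam)
```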
MIT OpenCourseWare
https://ocw.mit.edu

6.441 Information Theory
Spring 2016

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.