Application of Information Theory, Lecture 7 Relative Entropy Handout Mode Iftach Haitner Tel Aviv University. December 1, 2015 Iftach Haitner (TAU) Application of Information Theory, Lecture 7 December 1, 2015 1 / 36
Part I Statistical Distance
Statistical distance
Let p = (p_1, ..., p_m) and q = (q_1, ..., q_m) be distributions over [m]. Their statistical distance (also known as variation distance) is defined by
SD(p, q) := (1/2) Σ_{i ∈ [m]} |p_i − q_i|
This is simply half the L_1 distance between the distribution vectors. We will see another distance measure for distributions next lecture.
For X ∼ p and Y ∼ q, let SD(X, Y) := SD(p, q).
Claim (HW): SD(p, q) = max_{S ⊆ [m]} (Σ_{i ∈ S} p_i − Σ_{i ∈ S} q_i).
Hence, SD(p, q) = max_D (Pr_{X ∼ p}[D(X) = 1] − Pr_{X ∼ q}[D(X) = 1]).
Interpretation
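The two equivalent forms of SD can be checked directly. A minimal Python sketch (the distributions p and q below are illustrative, not from the lecture):

```python
import itertools

# Hypothetical example distributions over [4].
p = [0.5, 0.25, 0.25, 0.0]
q = [0.25, 0.25, 0.25, 0.25]

# Definition: half the L1 distance.
sd_l1 = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Equivalent form (the HW claim): maximum advantage over all events S.
m = len(p)
sd_max = max(
    sum(p[i] - q[i] for i in S)
    for r in range(m + 1)
    for S in itertools.combinations(range(m), r)
)

print(sd_l1, sd_max)  # both equal 0.25
```

The second form is exponential in m, of course; it is only meant to illustrate that the maximizing event is S = {i : p_i > q_i}.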
Distance from the uniform distribution
Let X be a rv over [m]:
- H(X) ≤ log m
- H(X) = log m ⟺ X is uniform over [m]
Theorem 1 (this lecture)
Let X be a rv over [m] with H(X) ≥ log m − ε. Then SD(X, U_[m]) ≤ √((ln 2 / 2) · ε) = O(√ε).
Part II Relative Entropy
Section 1 Definition and Basic Facts
Definition
For p = (p_1, ..., p_m) and q = (q_1, ..., q_m), let
D(p‖q) := Σ_{i=1}^m p_i log(p_i / q_i),
with the conventions 0 · log(0/0) = 0 and p · log(p/0) = ∞ for p > 0.
The relative entropy of a pair of rv's is the relative entropy of their distributions.
Names: entropy of p relative to q, relative entropy, information divergence, Kullback-Leibler (KL) divergence/distance.
Many different interpretations. Main interpretation: the information we gained about X, if we originally thought X ∼ q and now we learned X ∼ p.
Numerical Example
D(p‖q) = Σ_{i=1}^m p_i log(p_i / q_i)
p = (1/4, 1/2, 1/4, 0), q = (1/2, 1/4, 1/8, 1/8)
D(p‖q) = (1/4) log((1/4)/(1/2)) + (1/2) log((1/2)/(1/4)) + (1/4) log((1/4)/(1/8)) + 0 · log(0/(1/8))
= (1/4)·(−1) + (1/2)·1 + (1/4)·1 = 1/2
D(q‖p) = (1/2) log((1/2)/(1/4)) + (1/4) log((1/4)/(1/2)) + (1/8) log((1/8)/(1/4)) + (1/8) log((1/8)/0)
= (1/2)·1 + (1/4)·(−1) + (1/8)·(−1) + ∞ = ∞
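The example can be verified mechanically. A small sketch in Python, implementing the slide's conventions 0 · log(0/0) = 0 and p · log(p/0) = ∞:

```python
from math import log2, inf

def kl(p, q):
    """D(p||q) = sum p_i log2(p_i/q_i), with 0*log(0/q) = 0 and p*log(p/0) = inf."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # 0 * log(0/q) contributes 0
        if qi == 0:
            return inf        # p * log(p/0) = infinity for p > 0
        total += pi * log2(pi / qi)
    return total

p = [1/4, 1/2, 1/4, 0]
q = [1/2, 1/4, 1/8, 1/8]
print(kl(p, q))  # 0.5
print(kl(q, p))  # inf -- relative entropy is not symmetric
```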
Supporting the interpretation
X rv over [m]:
- H(X): a measure for the amount of information we do not have about X
- log m − H(X): a measure for the information we do have about X (just by knowing its distribution)
Example: X = (X_1, X_2) ∼ (1/2, 0, 0, 1/2) over {00, 01, 10, 11}. H(X) = 1, log m − H(X) = 2 − 1 = 1. Indeed, we know X_1 = X_2.
H(U_[m]) − H(p_1, ..., p_m) = log m − H(p_1, ..., p_m) = log m + Σ_i p_i log p_i = Σ_i p_i (log p_i − log(1/m)) = Σ_{i ∈ [m]} p_i log(p_i / (1/m)) = D(p‖U_[m])
D(X‖U_[m]) measures the information we gained about X, if we originally thought it is U_[m] and now we learned it is p.
Supporting the interpretation, cont.
Generally, D(p‖q) ≠ H(q) − H(p).
H(q) − H(p) is not a good measure for information change. Example: q = (0.01, 0.99) and p = (0.99, 0.01). We were almost sure that X = 1, but learned that X is almost surely 0. Yet H(q) − H(p) = 0. Also, H(q) − H(p) might be negative.
We understand D(p‖q) as the information we gained about X, if we originally thought it is q and now we learned it is p.
Changing distribution
What does it mean: originally thought X ∼ q and now we learned X ∼ p? How can a distribution change? Typically, this happens by learning additional information: q_i = Pr[X = i] and p_i = Pr[X = i | E].
Example: X ∼ (1/2, 1/4, 1/4, 0); someone saw X and tells us that X ≤ 2. The distribution changes to X ∼ (2/3, 1/3, 0, 0).
Another example, the joint distribution of (X, Y):
X \ Y | 1    2    3    4
0     | 1/4  1/4  0    0
1     | 1/4  0    1/4  0
Y ∼ (1/2, 1/4, 1/4, 0), but Y ∼ (1/2, 1/2, 0, 0) conditioned on X = 0, and Y ∼ (1/2, 0, 1/2, 0) conditioned on X = 1.
Generally, a distribution can change if we condition on an event E.
Additional properties
- 0 · log(0/0) = 0 and p · log(p/0) = ∞ for p > 0: if ∃ i s.t. p_i > 0 and q_i = 0, then D(p‖q) = ∞. If originally Pr[X = i] = 0, then it cannot be more than 0 after we learned something. Hence, it makes sense to think of it as an infinite amount of information learnt. Alternatively, we can define D(p‖q) only for distributions with q_i = 0 ⟹ p_i = 0 (recall that Pr[X = i] = 0 ⟹ Pr[X = i | E] = 0, for any event E).
- If p_i is large and q_i is small, then D(p‖q) is large.
- D(p‖q) ≥ 0, with equality iff p = q (hw).
Example
Let q = (q_1, ..., q_m) with Σ_{i=1}^n q_i = 2^{−k} (for some n < m), and let
p_i = { q_i / 2^{−k},  1 ≤ i ≤ n
      { 0,             otherwise.
That is, p = (p_1, ..., p_m) is the distribution of q conditioned on the event i ∈ [n].
D(p‖q) = Σ_{i=1}^n p_i log(p_i / q_i) = Σ_{i=1}^n p_i log 2^k = Σ_{i=1}^n p_i · k = k
We gained k bits of information.
Example: Σ_{i=1}^n q_i = 1/2, and we were told that i ≤ n or i > n; we got one bit of information.
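A quick numerical check of this slide's claim; the particular q below is a hypothetical choice with n = 2 and k = 2:

```python
from math import log2

# Conditioning on an event of probability 2^{-k} yields exactly k bits:
# the first n = 2 entries of q sum to 1/4 = 2^{-2}, so k = 2.
q = [1/8, 1/8, 1/4, 1/2]
n, k = 2, 2

# p is q conditioned on the event i <= n (renormalized by 2^{-k}).
p = [qi / 2**-k if i < n else 0.0 for i, qi in enumerate(q)]

d = sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
print(d)  # 2.0 -- we gained k = 2 bits
```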
Section 2 Axiomatic Derivation
Axiomatic derivation
Let D̃ be a continuous function, symmetric (wrt permutations of each distribution), such that
1. D̃(p‖U_[m]) = log m − H(p)
2. D̃((p_1, ..., p_m)‖(q_1, ..., q_m)) = D̃((p_1, ..., p_{m−1}, αp_m, (1−α)p_m)‖(q_1, ..., q_{m−1}, αq_m, (1−α)q_m)), for any α ∈ [0, 1],
then D̃ = D.
Interpretation
Proof: Let p and q be distributions over [m], and assume q_i ∈ ℚ \ {0}. Applying property 2 repeatedly,
D̃(p‖q) = D̃((α_{1,1}p_1, ..., α_{1,k_1}p_1, ..., α_{m,1}p_m, ..., α_{m,k_m}p_m)‖(α_{1,1}q_1, ..., α_{1,k_1}q_1, ..., α_{m,1}q_m, ..., α_{m,k_m}q_m)),
for any Σ_j α_{i,j} = 1 with α_{i,j} ≥ 0. Taking the α's s.t. α_{i,1} = α_{i,2} = ... = α_{i,k_i} = α_i and α_i · q_i = 1/M, the right-hand distribution is U_[M], and it follows that
D̃(p‖q) = log M − H((α_{1,1}p_1, ..., α_{m,k_m}p_m)) = Σ_i p_i log M + Σ_i p_i log(α_i p_i) = Σ_i p_i (log M + log(p_i / (q_i M))) = Σ_i p_i log(p_i / q_i).
Zeros and non-rational q_i's are dealt with by continuity.
Section 3 Relation to Mutual Information
Mutual information as expected relative entropy
Claim 2
E_{y ∼ Y}[D(X|_{Y=y}‖X)] = I(X; Y).
Proof: Let X ∼ q = (q_1, ..., q_m) over [m], and let Y be a rv over {0, 1} (the general case is similar). Let (X|_{Y=j}) ∼ p^j = (p_{j,1}, ..., p_{j,m}), i.e., p_{j,i} = Pr[X = i | Y = j].
E_Y[D(p^Y‖q)]
= Pr[Y = 0] · D((p_{0,1}, ..., p_{0,m})‖(q_1, ..., q_m)) + Pr[Y = 1] · D((p_{1,1}, ..., p_{1,m})‖(q_1, ..., q_m))
= Pr[Y = 0] · Σ_i p_{0,i} log(p_{0,i}/q_i) + Pr[Y = 1] · Σ_i p_{1,i} log(p_{1,i}/q_i)
= Pr[Y = 0] · Σ_i p_{0,i} log p_{0,i} + Pr[Y = 1] · Σ_i p_{1,i} log p_{1,i} − Σ_i (Pr[Y = 0] · p_{0,i} + Pr[Y = 1] · p_{1,i}) log q_i
= −H(X|Y) + H(X) = I(X; Y).
Equivalent definition for mutual information
Claim 3
Let (X, Y) ∼ p. Then I(X; Y) = D(p‖p_X × p_Y).
Proof:
D(p‖p_X × p_Y) = Σ_{x,y} p(x, y) log(p(x, y) / (p_X(x) · p_Y(y)))
= Σ_{x,y} p(x, y) log(p_{X|Y}(x|y) / p_X(x))
= −Σ_{x,y} p(x, y) log p_X(x) + Σ_{x,y} p(x, y) log p_{X|Y}(x|y)
= H(X) + Σ_y p_Y(y) Σ_x p_{X|Y}(x|y) log p_{X|Y}(x|y)
= H(X) − H(X|Y) = I(X; Y).
We will later relate the above two claims.
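Both Claim 2 and Claim 3 can be confirmed on a concrete joint distribution; the one below is the (X, Y) table from the "Changing distribution" slide:

```python
from math import log2

# Joint distribution: rows X in {0,1}, columns Y in {1,2,3,4}.
joint = {(0, 1): 1/4, (0, 2): 1/4, (1, 1): 1/4, (1, 3): 1/4}

def marg(joint, idx):
    """Marginal of the idx-th coordinate of a joint dict-distribution."""
    m = {}
    for xy, pr in joint.items():
        m[xy[idx]] = m.get(xy[idx], 0.0) + pr
    return m

pX, pY = marg(joint, 0), marg(joint, 1)

# Claim 3: I(X;Y) = D(p || pX x pY)
i_kl = sum(pr * log2(pr / (pX[x] * pY[y])) for (x, y), pr in joint.items())

# Claim 2: I(X;Y) = E_{y~Y}[ D(X|_{Y=y} || X) ]
i_exp = 0.0
for y, py in pY.items():
    cond = {x: joint.get((x, y), 0.0) / py for x in pX}   # X|_{Y=y}
    i_exp += py * sum(px_y * log2(px_y / pX[x])
                      for x, px_y in cond.items() if px_y > 0)

print(i_kl, i_exp)  # both equal 0.5
```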
Section 4 Relation to Data Compression
Wrong code
Theorem 4
Let p and q be distributions over [m], and let C be a code with ℓ(i) := |C(i)| = ⌈log(1/q_i)⌉. Then
H(p) + D(p‖q) ≤ E_{i∼p}[ℓ(i)] ≤ H(p) + D(p‖q) + 1.
Recall that H(q) ≤ E_{i∼q}[ℓ(i)] ≤ H(q) + 1.
Proof of upper bound (the lower bound is proved similarly):
E_{i∼p}[ℓ(i)] = Σ_i p_i ⌈log(1/q_i)⌉ < Σ_i p_i (log(1/q_i) + 1)
= 1 + Σ_i p_i log(p_i / q_i) + Σ_i p_i log(1/p_i)
= 1 + D(p‖q) + H(p).
Can there be a (close to) optimal code for q that is better for p? HW
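A sketch verifying Theorem 4's bounds, reusing the p and q of the numerical-example slide (here the lower bound happens to be tight, since all q_i are powers of 2):

```python
from math import log2, ceil

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Code lengths optimized for q, but data drawn from p.
p = [1/4, 1/2, 1/4, 0]
q = [1/2, 1/4, 1/8, 1/8]

lengths = [ceil(log2(1 / qi)) for qi in q]      # l(i) = ceil(log 1/q_i)
expected_len = sum(pi * li for pi, li in zip(p, lengths))

lo = entropy(p) + kl(p, q)                      # H(p) + D(p||q)
print(lo, expected_len, lo + 1)  # lo <= expected_len <= lo + 1
```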
Section 5 Conditional Relative Entropy
Conditional relative entropy
For a dist. p over X × Y, let p_X and p_{Y|X} be its marginal and conditional dist.
Definition 5
For two distributions p and q over X × Y:
D(p_{Y|X}‖q_{Y|X}) := Σ_{x ∈ X} p_X(x) Σ_{y ∈ Y} p_{Y|X}(y|x) log(p_{Y|X}(y|x) / q_{Y|X}(y|x))
Equivalently, D(p_{Y|X}‖q_{Y|X}) = E_{(x,y) ∼ p}[log(p_{Y|X}(y|x) / q_{Y|X}(y|x))].
Let (X_p, Y_p) ∼ p and (X_q, Y_q) ∼ q. Then D(p_{Y|X}‖q_{Y|X}) = E_{x ∼ X_p}[D(Y_p|_{X_p=x}‖Y_q|_{X_q=x})].
Numerical example:
p:
X \ Y | 0    1
0     | 1/8  1/8
1     | 1/4  1/2
q:
X \ Y | 0    1
0     | 1/8  1/4
1     | 1/2  1/8
D(p_{Y|X}‖q_{Y|X}) = (1/4) · D((1/2, 1/2)‖(1/3, 2/3)) + (3/4) · D((1/3, 2/3)‖(4/5, 1/5)) = ...
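The slide's conditional relative entropy can be computed directly from the two joint tables:

```python
from math import log2

def kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# The slide's joint distributions over X x Y, rows indexed by X.
p = {0: [1/8, 1/8], 1: [1/4, 1/2]}
q = {0: [1/8, 1/4], 1: [1/2, 1/8]}

d = 0.0
for x in p:
    px = sum(p[x])                           # p_X(x)
    p_cond = [v / px for v in p[x]]          # p_{Y|X=x}
    qx = sum(q[x])
    q_cond = [v / qx for v in q[x]]          # q_{Y|X=x}
    d += px * kl(p_cond, q_cond)

# d = 1/4 * D((1/2,1/2)||(1/3,2/3)) + 3/4 * D((1/3,2/3)||(4/5,1/5))
print(d)
```

Note that the averaging weight is p_X, while the conditional q_{Y|X} comes from q; this asymmetry is exactly what the definition prescribes.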
Chain rule
Claim 6
For any two distributions p and q over X × Y, it holds that D(p‖q) = D(p_X‖q_X) + D(p_{Y|X}‖q_{Y|X}).
Proof:
D(p‖q) = Σ_{(x,y) ∈ X×Y} p(x, y) log(p(x, y) / q(x, y))
= Σ_{(x,y) ∈ X×Y} p(x, y) log((p_X(x) · p_{Y|X}(y|x)) / (q_X(x) · q_{Y|X}(y|x)))
= Σ_{(x,y) ∈ X×Y} p(x, y) log(p_X(x) / q_X(x)) + Σ_{(x,y) ∈ X×Y} p(x, y) log(p_{Y|X}(y|x) / q_{Y|X}(y|x))
= D(p_X‖q_X) + D(p_{Y|X}‖q_{Y|X}).
Hence, for (X, Y) ∼ p:
I(X; Y) = D(p‖p_X × p_Y) = D(p_X‖p_X) + E_{x ∼ X}[D(p_{Y|X=x}‖p_Y)] = E_{x ∼ X}[D(p_{Y|X=x}‖p_Y)].
Section 6 Data-processing inequality
Data-processing inequality
Claim 7
For any rv's X and Y and function f, it holds that D(f(X)‖f(Y)) ≤ D(X‖Y).
Analogous to H(X) ≥ H(f(X)).
Proof: On one hand, D(X, f(X)‖Y, f(Y)) = D(X‖Y). On the other hand, by the chain rule,
D(X, f(X)‖Y, f(Y)) = D(f(X)‖f(Y)) + E_{z ∼ f(X)}[D(X|_{f(X)=z}‖Y|_{f(Y)=z})] ≥ D(f(X)‖f(Y)).
Hence, D(f(X)‖f(Y)) ≤ D(X‖Y).
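A numeric sanity check of Claim 7; the distributions and the function f (parity) are hypothetical:

```python
from math import log2

def kl(p, q):
    """Relative entropy between distributions given as dicts over the same support."""
    return sum(pr * log2(pr / q[x]) for x, pr in p.items() if pr > 0)

def push(dist, f):
    """Distribution of f(X) for X ~ dist."""
    out = {}
    for x, pr in dist.items():
        out[f(x)] = out.get(f(x), 0.0) + pr
    return out

# Hypothetical distributions over {0,1,2,3}; f keeps only the parity.
p = {0: 0.4, 1: 0.1, 2: 0.3, 3: 0.2}
q = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
f = lambda x: x % 2

print(kl(push(p, f), push(q, f)), "<=", kl(p, q))
```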
Section 7 Relation to Statistical Distance
Relation to statistical distance
D(p‖q) is used many times to measure the distance from p to q. It is not a distance in the mathematical sense: D(p‖q) ≠ D(q‖p), and there is no triangle inequality. However,
Theorem 8
SD(p, q) ≤ √((ln 2 / 2) · D(p‖q)).
Corollary: For a rv X over [m] with H(X) ≥ log m − ε, it holds that
SD(X, U_[m]) ≤ √((ln 2 / 2) · (log m − H(X))) ≤ √((ln 2 / 2) · ε).
The other direction is incorrect: SD(p, q) might be small but D(p‖q) = ∞.
Does SD(p, U_[m]) being small imply D(p‖U_[m]) = log m − H(p) is small? HW
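Theorem 8 can be spot-checked on random distribution pairs; the sketch below assumes base-2 logarithms for D, so the bound reads SD ≤ √((ln 2 / 2) · D):

```python
import random
from math import log2, log, sqrt

def sd(p, q):
    """Statistical distance: half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl(p, q):
    """Relative entropy in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

random.seed(0)
for _ in range(1000):
    p = [random.random() for _ in range(5)]
    q = [random.random() for _ in range(5)]
    p = [x / sum(p) for x in p]   # normalize to a distribution
    q = [x / sum(q) for x in q]
    # Theorem 8: SD(p,q) <= sqrt((ln 2 / 2) * D(p||q))
    assert sd(p, q) <= sqrt(log(2) / 2 * kl(p, q)) + 1e-12
print("bound verified on 1000 random pairs")
```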
Proving Thm 8, Boolean case
Let p = (α, 1−α) and q = (β, 1−β), and assume α ≥ β. Then SD(p, q) = α − β. We will show that
D(p‖q) = α log(α/β) + (1−α) log((1−α)/(1−β)) ≥ (2/ln 2) · (α − β)² = (2/ln 2) · SD(p, q)².
Let g(x, y) = x log(x/y) + (1−x) log((1−x)/(1−y)) − (2/ln 2) · (x − y)². Then
∂g(x, y)/∂y = −x/(y ln 2) + (1−x)/((1−y) ln 2) + 4(x − y)/ln 2 = (y − x)/(y(1−y) ln 2) − 4(y − x)/ln 2.
Since y(1−y) ≤ 1/4, ∂g(x, y)/∂y ≤ 0 for y < x. Since g(x, x) = 0, it follows that g(x, y) ≥ 0 for y ≤ x.
Proving Thm 8, general case
Let U = Supp(p) ∪ Supp(q), and let S = {u ∈ U : p(u) > q(u)}. Then SD(p, q) = Pr_p[S] − Pr_q[S] (by homework).
Let P ∼ p, and let the indicator P̂ be 1 iff P ∈ S. Let Q ∼ q, and let the indicator Q̂ be 1 iff Q ∈ S. Then
SD(P̂, Q̂) = Pr[P ∈ S] − Pr[Q ∈ S] = SD(p, q).
Hence,
D(p‖q) ≥ D(P̂‖Q̂)   (data-processing inequality)
≥ (2/ln 2) · SD(P̂, Q̂)²   (the Boolean case)
= (2/ln 2) · SD(p, q)².
Section 8 Conditioned Distributions
Main theorem
Theorem 9
Let X_1, ..., X_k be iid over U, and let Y = (Y_1, ..., Y_k) be a rv over U^k. Then
Σ_{j=1}^k D(Y_j‖X_j) ≤ D(Y‖(X_1, ..., X_k)).
For a rv Z, let Z(z) := Pr[Z = z]. We prove the case k = 2; the general case follows similar lines. Let X = (X_1, X_2).
D(Y‖X) = Σ_{y ∈ U²} Y(y) log(Y(y) / X(y))
= Σ_{y=(y_1,y_2)} Y(y) log((Y_1(y_1) · Y_2(y_2)) / (X_1(y_1) · X_2(y_2)) · Y(y) / (Y_1(y_1) · Y_2(y_2)))
= Σ_{y=(y_1,y_2)} Y(y) log(Y_1(y_1)/X_1(y_1)) + Σ_{y=(y_1,y_2)} Y(y) log(Y_2(y_2)/X_2(y_2)) + Σ_{y=(y_1,y_2)} Y(y) log(Y(y) / (Y_1(y_1) · Y_2(y_2)))
= D(Y_1‖X_1) + D(Y_2‖X_2) + I(Y_1; Y_2)
≥ D(Y_1‖X_1) + D(Y_2‖X_2).
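A check of Theorem 9 for k = 2, with X_1, X_2 iid uniform bits and a hypothetical correlated Y; by the proof, the slack in the inequality is exactly I(Y_1; Y_2):

```python
from math import log2

def kl(p, q):
    return sum(p[a] * log2(p[a] / q[a]) for a in p if p[a] > 0)

# X1, X2 iid uniform over {0,1}; Y = (Y1, Y2) is a correlated pair.
Y = {(0, 0): 0.4, (1, 1): 0.4, (0, 1): 0.1, (1, 0): 0.1}
X = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}

def marginal(joint, idx):
    m = {}
    for y, pr in joint.items():
        m[y[idx]] = m.get(y[idx], 0.0) + pr
    return m

# Here both Y-marginals are uniform, so the left-hand side is 0,
# while D(Y||X) > 0: the gap is the mutual information I(Y1;Y2).
lhs = sum(kl(marginal(Y, j), marginal(X, j)) for j in (0, 1))
rhs = kl(Y, X)
print(lhs, "<=", rhs)
```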
Conditioning distributions, relative entropy case
Theorem 10
Let X_1, ..., X_k be iid over X, let X = (X_1, ..., X_k), and let W be an event (i.e., a Boolean rv). Then
Σ_{j=1}^k D((X_j|W)‖X_j) ≤ D((X|W)‖X) ≤ log(1/Pr[W]).
Proof:
Σ_{j=1}^k D((X_j|W)‖X_j) ≤ D((X|W)‖X)   (Thm 9)
= Σ_{x ∈ X^k} (X|W)(x) log((X|W)(x) / X(x))
= Σ_{x ∈ X^k} (X|W)(x) log(Pr[W | X = x] / Pr[W])   (Bayes)
= log(1/Pr[W]) + Σ_{x ∈ X^k} (X|W)(x) log Pr[W | X = x]
≤ log(1/Pr[W]).
Conditioning distributions, statistical distance case
Theorem 11
Let X_1, ..., X_k be iid over X and let W be an event. Then
Σ_{j=1}^k SD((X_j|W), X_j)² ≤ log(1/Pr[W]).
Proof: follows by Thm 8 and Thm 9 (using ln 2 / 2 ≤ 1).
Using (Σ_{j=1}^k a_j)² ≤ k · Σ_{j=1}^k a_j², it follows that
Corollary 12
Σ_{j=1}^k SD((X_j|W), X_j) ≤ √(k · log(1/Pr[W])), and E_{j ∼ [k]} SD((X_j|W), X_j) ≤ √((1/k) · log(1/Pr[W])).
Extraction
Numerical example
Let X = (X_1, ..., X_40) be iid uniform over {0, 1}, and let f : {0, 1}^40 → {0, 1} be such that Pr[f(X) = 0] = 2^{−10}. Then
E_{j ∼ [40]} SD((X_j|_{f(X)=0}), U_{{0,1}}) ≤ √(10/40) = 1/2.
Typical bits are not too biased, even when conditioning on a very unlikely event.
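The example can be instantiated with a concrete (hypothetical) f, say f(x) = 0 iff the first 10 bits are all zero; the conditioned distribution is then explicit, and the average bias sits well below the bound:

```python
from math import sqrt

# f(x) = 0 iff the first 10 of k = 40 uniform bits are all zero, so Pr[W] = 2^{-10}.
k, fixed = 40, 10

# Conditioned on W: bits 1..10 become constantly 0 (SD from uniform = 1/2),
# bits 11..40 stay uniform (SD = 0).
sds = [0.5] * fixed + [0.0] * (k - fixed)
avg_sd = sum(sds) / k
bound = sqrt(fixed / k)   # sqrt(log(1/Pr[W]) / k) = sqrt(10/40) = 1/2

print(avg_sd, "<=", bound)  # 0.125 <= 0.5
```

This f concentrates all the bias on a few bits; Corollary 12 says no f can make the average bias exceed 1/2 here.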
Extension
Theorem 13
Let X = (X_1, ..., X_k), T and V be rv's over X^k, T and V, respectively. Let W be an event and assume that the X_i's are iid conditioned on T. Then
Σ_{j=1}^k D((T V X_j)|W ‖ (T V)|W × X_j(T)) ≤ log(1/Pr[W]) + log |Supp(V|W)|,
where X_j(t) is distributed according to X_j|_{T=t}.
Interpretation.
Proving Thm 13
Let X = (X_1, ..., X_k), T and V be rv's over X^k, T and V, respectively, such that the X_i's are iid conditioned on T. Let W be an event and let X_j(t) be distributed according to X_j|_{T=t}.
Σ_{j=1}^k D((T V X_j)|W ‖ (T V)|W × X_j(T))
= E_{(t,v) ∼ (TV)|W} [Σ_{j=1}^k D(X_j|_{W, V=v, T=t} ‖ X_j|_{T=t})]   (chain rule)
≤ E_{(t,v) ∼ (TV)|W} [log(1 / Pr[W ∧ V=v | T=t])]   (Thm 10)
≤ log E_{(t,v) ∼ (TV)|W} [1 / Pr[W ∧ V=v | T=t]]   (Jensen's inequality)
= log Σ_{(t,v) ∈ Supp((TV)|W)} Pr[T=t] / Pr[W]
≤ log(|Supp(V|W)| / Pr[W]) = log(1/Pr[W]) + log |Supp(V|W)|.