Application of Information Theory, Lecture 7. Relative Entropy. Handout Mode. Iftach Haitner. Tel Aviv University.


Application of Information Theory, Lecture 7: Relative Entropy (handout mode). Iftach Haitner, Tel Aviv University. December 1, 2015.

Part I: Statistical Distance

Statistical distance. Let p = (p_1, ..., p_m) and q = (q_1, ..., q_m) be distributions over [m]. Their statistical distance (also known as variation distance) is defined by
SD(p, q) := (1/2) Σ_{i∈[m]} |p_i − q_i|.
This is simply half the L_1 distance between the distribution vectors. We will see other distance measures for distributions next lecture.
For X ∼ p and Y ∼ q, let SD(X, Y) := SD(p, q).
Claim (HW): SD(p, q) = max_{S⊆[m]} ( Σ_{i∈S} p_i − Σ_{i∈S} q_i ).
Hence, SD(p, q) = max_D ( Pr_{X∼p}[D(X) = 1] − Pr_{X∼q}[D(X) = 1] ), the maximum taken over all distinguishers D.
Interpretation: no distinguisher can tell p apart from q with advantage better than SD(p, q).
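
As a quick aside (my addition, not part of the original handout; the helper names are mine), the two characterizations above can be checked numerically. The pair (p, q) below is the one used in the numerical example later in this lecture.

```python
# A minimal sketch: SD(p, q) computed two ways,
# as half the L1 distance and as the maximum advantage over subsets S of [m].
from itertools import chain, combinations

def sd_l1(p, q):
    """SD(p, q) = (1/2) * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def sd_subsets(p, q):
    """SD(p, q) = max over S of (p(S) - q(S)), by brute force over all subsets."""
    m = len(p)
    subsets = chain.from_iterable(combinations(range(m), r) for r in range(m + 1))
    return max(sum(p[i] - q[i] for i in S) for S in subsets)

p = (1/4, 1/2, 1/4, 0)
q = (1/2, 1/4, 1/8, 1/8)
print(sd_l1(p, q), sd_subsets(p, q))   # both print 0.375
```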

Distance from the uniform distribution. Let X be a rv over [m], and let U_[m] denote the uniform distribution over [m]. Then H(X) ≤ log m, with H(X) = log m iff X is uniform over [m].
Theorem 1 (this lecture): Let X be a rv over [m] and assume H(X) ≥ log m − ε. Then SD(X, U_[m]) ≤ √(ε · ln 2 / 2) = O(√ε).

Part II: Relative Entropy

Section 1: Definition and Basic Facts

Definition. For p = (p_1, ..., p_m) and q = (q_1, ..., q_m), let
D(p ‖ q) := Σ_{i=1}^m p_i log(p_i / q_i),
with the conventions 0 · log(0/0) = 0 and p · log(p/0) = ∞ for p > 0.
The relative entropy of a pair of rv's is the relative entropy of their distributions.
Names: entropy of p relative to q, relative entropy, information divergence, Kullback-Leibler (KL) divergence/distance.
Many different interpretations. Main interpretation: the information we gained about X, if we originally thought X ∼ q and now we learned X ∼ p.

Numerical example. D(p ‖ q) = Σ_{i=1}^m p_i log(p_i / q_i). Let p = (1/4, 1/2, 1/4, 0) and q = (1/2, 1/4, 1/8, 1/8).
D(p ‖ q) = (1/4) log((1/4)/(1/2)) + (1/2) log((1/2)/(1/4)) + (1/4) log((1/4)/(1/8)) + 0 · log(0/(1/8)) = (1/4)(−1) + (1/2)(1) + (1/4)(1) = 1/2.
D(q ‖ p) = (1/2) log((1/2)/(1/4)) + (1/4) log((1/4)/(1/2)) + (1/8) log((1/8)/(1/4)) + (1/8) log((1/8)/0) = (1/2)(1) + (1/4)(−1) + (1/8)(−1) + ∞ = ∞.
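
A small sanity check of the worked example (my addition), with logs in base 2 and the 0·log(0/q) = 0 and p·log(p/0) = ∞ conventions handled explicitly:

```python
from math import log2, inf

def kl(p, q):
    """D(p || q) in bits, with the 0 log 0/0 = 0 and p log p/0 = inf conventions."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue           # 0 * log(0/q) = 0
        if qi == 0:
            return inf         # p * log(p/0) = infinity for p > 0
        d += pi * log2(pi / qi)
    return d

p = (1/4, 1/2, 1/4, 0)
q = (1/2, 1/4, 1/8, 1/8)
print(kl(p, q))   # 0.5 bits
print(kl(q, p))   # inf, since q_4 > 0 while p_4 = 0
```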

Supporting the interpretation. Let X be a rv over [m]. H(X) is a measure for the amount of information we do not have about X; log m − H(X) is a measure for the information we do have about X (just by knowing its distribution).
Example: X = (X_1, X_2) ∼ (1/2, 0, 0, 1/2) over {00, 01, 10, 11}. Then H(X) = 1 and log m − H(X) = 2 − 1 = 1. Indeed, we know that X_1 = X_2.
H(U_[m]) − H(p_1, ..., p_m) = log m − H(p_1, ..., p_m) = log m + Σ_i p_i log p_i = Σ_{1≤i≤m} p_i (log p_i − log(1/m)) = D(p ‖ U_[m]).
D(X ‖ U_[m]) measures the information we gained about X, if we originally thought it is U_[m] and now we learned it is p.
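
The identity log m − H(p) = D(p ‖ U_[m]) is easy to confirm numerically; here is a short check (my addition) on the slide's example distribution.

```python
from math import log2

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = (1/2, 0, 0, 1/2)              # the example over {00, 01, 10, 11}
m = len(p)
uniform = (1/m,) * m
print(log2(m) - entropy(p))       # 1.0
print(kl(p, uniform))             # 1.0, the same quantity
```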

Supporting the interpretation, cont. In general, D(p ‖ q) ≠ H(q) − H(p); H(q) − H(p) is not a good measure for information change.
Example: q = (0.01, 0.99) and p = (0.99, 0.01). We were almost sure that X = 1 but learned that X is almost surely 0, yet H(q) − H(p) = 0. Also, H(q) − H(p) might be negative.
We understand D(p ‖ q) as the information we gained about X, if we originally thought it is q and now we learned it is p.

Changing distribution. What does it mean: originally thought X ∼ q and now we learned X ∼ p? How can a distribution change? Typically, this happens by learning additional information: q_i = Pr[X = i] and p_i = Pr[X = i | E].
Example: X ∼ (1/2, 1/4, 1/4, 0); someone saw X and tells us that X ≤ 2. The distribution changes to X ∼ (2/3, 1/3, 0, 0).
Another example, with joint distribution of (X, Y):
  X \ Y:   1     2     3     4
    0:     1/4   1/4   0     0
    1:     1/4   0     1/4   0
Y ∼ (1/2, 1/4, 1/4, 0), but Y ∼ (1/2, 1/2, 0, 0) conditioned on X = 0, and Y ∼ (1/2, 0, 1/2, 0) conditioned on X = 1.
Generally, a distribution can change if we condition on an event E.

Additional properties. Recall the conventions 0 · log(0/0) = 0 and p · log(p/0) = ∞ for p > 0. Hence, if there exists i s.t. p_i > 0 and q_i = 0, then D(p ‖ q) = ∞.
If originally Pr[X = i] = 0, then it cannot become positive after we learn something. Hence, it makes sense to think of p_i > 0 while q_i = 0 as an infinite amount of information learnt.
Alternatively, we can define D(p ‖ q) only for distributions with q_i = 0 ⟹ p_i = 0 (recall that Pr[X = i] = 0 implies Pr[X = i | E] = 0, for any event E).
If p_i is large and q_i is small, then D(p ‖ q) is large.
D(p ‖ q) ≥ 0, with equality iff p = q (HW).

Example. Let q = (q_1, ..., q_m) with Σ_{i=1}^n q_i = 2^{−k} (for some n < m), and let
p_i = q_i / 2^{−k} for 1 ≤ i ≤ n, and p_i = 0 otherwise.
That is, p = (p_1, ..., p_m) is the distribution q conditioned on the event i ∈ [n]. Then
D(p ‖ q) = Σ_{i=1}^n p_i log(p_i / q_i) = Σ_{i=1}^n p_i log 2^k = Σ_{i=1}^n p_i · k = k.
We gained k bits of information. Example: if Σ_{i=1}^n q_i = 1/2, and we were told whether i ≤ n or i > n, we got one bit of information.
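
A quick numeric illustration (my addition, with an arbitrarily chosen q): conditioning q on an event of probability 2^{−k} yields a distribution p with D(p ‖ q) = k.

```python
from math import log2

def kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q = (1/8, 1/8, 1/4, 1/4, 1/4)     # the first n = 2 entries sum to 1/4 = 2^{-2}
n, mass = 2, 1/4
p = tuple(qi / mass if i < n else 0.0 for i, qi in enumerate(q))
print(kl(p, q))                    # 2.0, i.e. k = 2 bits of information
```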

Section 2: Axiomatic Derivation

Axiomatic derivation. Let D̃ be a continuous, symmetric (wrt each distribution) function such that
1. D̃(p ‖ U_[m]) = log m − H(p), and
2. D̃((p_1, ..., p_m) ‖ (q_1, ..., q_m)) = D̃((p_1, ..., p_{m−1}, αp_m, (1−α)p_m) ‖ (q_1, ..., q_{m−1}, αq_m, (1−α)q_m)), for any α ∈ [0, 1].
Then D̃ = D. Interpretation.
Proof: Let p and q be distributions over [m], and assume first that q_i ∈ ℚ \ {0} for every i. By property 2,
D̃(p ‖ q) = D̃((α_{1,1} p_1, ..., α_{1,k_1} p_1, ..., α_{m,1} p_m, ..., α_{m,k_m} p_m) ‖ (α_{1,1} q_1, ..., α_{1,k_1} q_1, ..., α_{m,1} q_m, ..., α_{m,k_m} q_m)), for any α_{i,j} ≥ 0 with Σ_j α_{i,j} = 1.
Taking the α's s.t. α_{i,1} = α_{i,2} = ... = α_{i,k_i} = α_i and α_i q_i = 1/M, the right-hand distribution becomes U_[M], and it follows that
D̃(p ‖ q) = log M − H((α_{1,1} p_1, ..., α_{m,k_m} p_m)) = Σ_i p_i log M + Σ_i p_i log(α_i p_i) = Σ_i p_i (log M + log(p_i / (q_i M))) = Σ_i p_i log(p_i / q_i).
Zeros and non-rational q_i's are dealt with by continuity.

Section 3: Relation to Mutual Information

Mutual information as expected relative entropy.
Claim 2: E_{y∼Y}[ D( X|_{Y=y} ‖ X ) ] = I(X; Y).
Proof (for Boolean Y; the general case is similar): Let X ∼ (q_1, ..., q_m) over [m] and let Y be a rv over {0, 1}. Write (X|_{Y=j}) ∼ p^j = (p_{j,1}, ..., p_{j,m}), where p_{j,i} = Pr[X = i | Y = j]. Then
E_Y[ D(p^Y ‖ q) ] = Pr[Y = 0] · D((p_{0,1}, ..., p_{0,m}) ‖ (q_1, ..., q_m)) + Pr[Y = 1] · D((p_{1,1}, ..., p_{1,m}) ‖ (q_1, ..., q_m))
= Pr[Y = 0] Σ_i p_{0,i} log(p_{0,i}/q_i) + Pr[Y = 1] Σ_i p_{1,i} log(p_{1,i}/q_i)
= Pr[Y = 0] Σ_i p_{0,i} log p_{0,i} + Pr[Y = 1] Σ_i p_{1,i} log p_{1,i} − Σ_i (Pr[Y = 0] p_{0,i} + Pr[Y = 1] p_{1,i}) log q_i
= −H(X|Y) − Σ_i q_i log q_i = −H(X|Y) + H(X) = I(X; Y).
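
Claim 2 can be verified on a concrete joint distribution; the sketch below (my addition) reuses the (X, Y) table from the "changing distribution" slide and compares E_{y∼Y}[D(X|Y=y ‖ X)] with I(X; Y) = H(X) − H(X|Y).

```python
from math import log2

joint = {(0, 1): 1/4, (0, 2): 1/4, (1, 1): 1/4, (1, 3): 1/4}   # Pr[X=x, Y=y]
xs, ys = (0, 1), (1, 2, 3)

pX = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
pY = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}

def kl(p, q):
    return sum(v * log2(v / q[a]) for a, v in p.items() if v > 0)

def H(p):
    return -sum(v * log2(v) for v in p.values() if v > 0)

# E_{y ~ Y} [ D( X|Y=y || X ) ]
lhs = sum(pY[y] * kl({x: joint.get((x, y), 0.0) / pY[y] for x in xs}, pX) for y in ys)

# I(X;Y) = H(X) - H(X|Y)
H_X_given_Y = sum(pY[y] * H({x: joint.get((x, y), 0.0) / pY[y] for x in xs}) for y in ys)
print(lhs, H(pX) - H_X_given_Y)    # both 0.5
```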

Equivalent definition for mutual information.
Claim 3: Let (X, Y) ∼ p. Then I(X; Y) = D(p ‖ p_X × p_Y).
Proof:
D(p ‖ p_X × p_Y) = Σ_{x,y} p(x, y) log( p(x, y) / (p_X(x) p_Y(y)) )
= Σ_{x,y} p(x, y) log( p_{X|Y}(x|y) / p_X(x) )
= −Σ_{x,y} p(x, y) log p_X(x) + Σ_{x,y} p(x, y) log p_{X|Y}(x|y)
= H(X) + Σ_y p_Y(y) Σ_x p_{X|Y}(x|y) log p_{X|Y}(x|y)
= H(X) − H(X|Y) = I(X; Y).
We will later relate the above two claims.
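
The same joint distribution also confirms Claim 3 (my addition): D(p ‖ p_X × p_Y) equals the value of I(X; Y) computed in the previous check.

```python
from math import log2

joint = {(0, 1): 1/4, (0, 2): 1/4, (1, 1): 1/4, (1, 3): 1/4}
pX = {x: sum(v for (xx, _), v in joint.items() if xx == x) for x in (0, 1)}
pY = {y: sum(v for (_, yy), v in joint.items() if yy == y) for y in (1, 2, 3)}

d = sum(v * log2(v / (pX[x] * pY[y])) for (x, y), v in joint.items() if v > 0)
print(d)   # 0.5 = I(X;Y)
```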

Section 4: Relation to Data Compression

Wrong code.
Theorem 4: Let p and q be distributions over [m], and let C be a code with ℓ(i) = |C(i)| = ⌈log(1/q_i)⌉. Then
H(p) + D(p ‖ q) ≤ E_{i∼p}[ℓ(i)] ≤ H(p) + D(p ‖ q) + 1.
Recall that H(q) ≤ E_{i∼q}[ℓ(i)] ≤ H(q) + 1.
Proof of the upper bound (the lower bound is proved similarly):
E_{i∼p}[ℓ(i)] = Σ_i p_i ⌈log(1/q_i)⌉ < Σ_i p_i (log(1/q_i) + 1)
= 1 + Σ_i p_i log(p_i/q_i) + Σ_i p_i log(1/p_i)
= 1 + D(p ‖ q) + H(p).
Can there be a (close to) optimal code for q that is better for p? HW
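
A short check (my addition) of Theorem 4 on the earlier example pair (p, q): the code built for q, with lengths ℓ(i) = ⌈log(1/q_i)⌉, has expected length under p between H(p) + D(p ‖ q) and H(p) + D(p ‖ q) + 1.

```python
from math import log2, ceil

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = (1/4, 1/2, 1/4, 0)
q = (1/2, 1/4, 1/8, 1/8)
lengths = [ceil(log2(1 / qi)) for qi in q]            # l(i) = ceil(log 1/q_i)
expected_len = sum(pi * li for pi, li in zip(p, lengths))

lower = entropy(p) + kl(p, q)
print(lower, "<=", expected_len, "<=", lower + 1)     # 2.0 <= 2.0 <= 3.0
```

Here the lower bound is attained with equality because q is dyadic, so ⌈log(1/q_i)⌉ = log(1/q_i).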

Section 5: Conditional Relative Entropy

Conditional relative entropy. For a distribution p over 𝒳 × 𝒴, let p_X and p_{Y|X} be its marginal and conditional distributions.
Definition 5: For two distributions p and q over 𝒳 × 𝒴:
D(p_{Y|X} ‖ q_{Y|X}) := Σ_{x∈𝒳} p_X(x) Σ_{y∈𝒴} p_{Y|X}(y|x) log( p_{Y|X}(y|x) / q_{Y|X}(y|x) )
Equivalently, D(p_{Y|X} ‖ q_{Y|X}) = E_{(X,Y)∼p}[ log( p_{Y|X}(Y|X) / q_{Y|X}(Y|X) ) ].
Let (X_p, Y_p) ∼ p and (X_q, Y_q) ∼ q; then D(p_{Y|X} ‖ q_{Y|X}) = E_{x∼X_p}[ D( (Y_p)|_{X_p=x} ‖ (Y_q)|_{X_q=x} ) ].
Numerical example:
  p:  X \ Y   0     1          q:  X \ Y   0     1
        0     1/8   1/8              0     1/8   1/4
        1     1/4   1/2              1     1/2   1/8
D(p_{Y|X} ‖ q_{Y|X}) = (1/4) D((1/2, 1/2) ‖ (1/3, 2/3)) + (3/4) D((1/3, 2/3) ‖ (4/5, 1/5)) = ...
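
To finish the numerical example, the conditional relative entropy can be computed directly from the two joint tables; this sketch is my addition.

```python
from math import log2

p = {(0, 0): 1/8, (0, 1): 1/8, (1, 0): 1/4, (1, 1): 1/2}   # Pr_p[X=x, Y=y]
q = {(0, 0): 1/8, (0, 1): 1/4, (1, 0): 1/2, (1, 1): 1/8}   # Pr_q[X=x, Y=y]

pX = {x: p[(x, 0)] + p[(x, 1)] for x in (0, 1)}
qX = {x: q[(x, 0)] + q[(x, 1)] for x in (0, 1)}

d = 0.0
for (x, y), pxy in p.items():
    if pxy == 0:
        continue
    p_cond = pxy / pX[x]           # p_{Y|X}(y|x)
    q_cond = q[(x, y)] / qX[x]     # q_{Y|X}(y|x)
    d += pxy * log2(p_cond / q_cond)
print(d)   # D(p_{Y|X} || q_{Y|X}) ~= 0.574
```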

Chain rule.
Claim 6: For any two distributions p and q over 𝒳 × 𝒴, it holds that D(p ‖ q) = D(p_X ‖ q_X) + D(p_{Y|X} ‖ q_{Y|X}).
Proof:
D(p ‖ q) = Σ_{(x,y)∈𝒳×𝒴} p(x, y) log( p(x, y) / q(x, y) )
= Σ_{(x,y)} p(x, y) log( p_X(x) p_{Y|X}(y|x) / (q_X(x) q_{Y|X}(y|x)) )
= Σ_{(x,y)} p(x, y) log( p_X(x) / q_X(x) ) + Σ_{(x,y)} p(x, y) log( p_{Y|X}(y|x) / q_{Y|X}(y|x) )
= D(p_X ‖ q_X) + D(p_{Y|X} ‖ q_{Y|X}).
Hence, for (X, Y) ∼ p:
I(X; Y) = D(p ‖ p_X × p_Y) = D(p_X ‖ p_X) + E_{x∼X}[ D( p_{Y|X=x} ‖ p_Y ) ] = E_{x∼X}[ D( p_{Y|X=x} ‖ p_Y ) ],
relating Claims 2 and 3.
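
The chain rule is easy to confirm on the same pair of joint tables (my addition): the joint divergence splits exactly into the marginal and conditional parts.

```python
from math import log2

p = {(0, 0): 1/8, (0, 1): 1/8, (1, 0): 1/4, (1, 1): 1/2}
q = {(0, 0): 1/8, (0, 1): 1/4, (1, 0): 1/2, (1, 1): 1/8}

pX = {x: p[(x, 0)] + p[(x, 1)] for x in (0, 1)}
qX = {x: q[(x, 0)] + q[(x, 1)] for x in (0, 1)}

d_joint = sum(v * log2(v / q[k]) for k, v in p.items() if v > 0)
d_marg = sum(v * log2(v / qX[x]) for x, v in pX.items() if v > 0)
d_cond = sum(v * log2((v / pX[x]) / (q[(x, y)] / qX[x]))
             for (x, y), v in p.items() if v > 0)
print(d_joint, "=", d_marg + d_cond)   # the two sides agree
```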

Section 6: Data-Processing Inequality

Data-processing inequality.
Claim 7: For any rv's X and Y and any function f, it holds that D(f(X) ‖ f(Y)) ≤ D(X ‖ Y).
Analogous to H(X) ≥ H(f(X)).
Proof:
D((X, f(X)) ‖ (Y, f(Y))) = D(X ‖ Y) (since f(X) is determined by X, and f(Y) by Y)
D((X, f(X)) ‖ (Y, f(Y))) = D(f(X) ‖ f(Y)) + E_{z∼f(X)}[ D( X|_{f(X)=z} ‖ Y|_{f(Y)=z} ) ] ≥ D(f(X) ‖ f(Y)) (chain rule)
Hence, D(f(X) ‖ f(Y)) ≤ D(X ‖ Y).
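
A tiny experiment (my addition, with an arbitrary deterministic f of my choosing) showing the data-processing inequality: pushing both distributions through f can only shrink the divergence.

```python
from math import log2
from collections import defaultdict

def kl(p, q):
    return sum(v * log2(v / q[a]) for a, v in p.items() if v > 0)

def push_forward(p, f):
    """Distribution of f(Z) for Z ~ p."""
    out = defaultdict(float)
    for a, v in p.items():
        out[f(a)] += v
    return dict(out)

pX = {0: 1/4, 1: 1/2, 2: 1/4, 3: 0.0}
pY = {0: 1/2, 1: 1/4, 2: 1/8, 3: 1/8}
f = lambda a: a % 2                    # an arbitrary deterministic map

print(kl(push_forward(pX, f), push_forward(pY, f)), "<=", kl(pX, pY))
```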

Section 7: Relation to Statistical Distance

Relation to statistical distance. D(p ‖ q) is used many times to measure the distance from p to q. It is not a distance in the mathematical sense: D(p ‖ q) ≠ D(q ‖ p), and there is no triangle inequality. However:
Theorem 8: SD(p, q) ≤ √( (ln 2 / 2) · D(p ‖ q) ).
Corollary: For a rv X over [m] with H(X) ≥ log m − ε, it holds that SD(X, U_[m]) ≤ √( (ln 2 / 2) · (log m − H(X)) ) ≤ √( (ln 2 / 2) · ε ).
The other direction is incorrect: SD(p, q) might be small but D(p ‖ q) = ∞. Does SD(p, U_[m]) being small imply that D(p ‖ U_[m]) = log m − H(p) is small? HW
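
A numeric illustration (my addition) of Theorem 8, again on the running example pair (p, q), with D measured in bits:

```python
from math import log, log2, sqrt

def sd(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = (1/4, 1/2, 1/4, 0)
q = (1/2, 1/4, 1/8, 1/8)
print(sd(p, q), "<=", sqrt(log(2) / 2 * kl(p, q)))   # 0.375 <= ~0.416
```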

Proving Thm 8, Boolean case. Let p = (α, 1−α) and q = (β, 1−β), and assume α ≥ β, so SD(p, q) = α − β. We will show that
D(p ‖ q) = α log(α/β) + (1−α) log((1−α)/(1−β)) ≥ (4/(2 ln 2)) (α − β)² = (2/ln 2) · SD(p, q)².
Let g(x, y) = x log(x/y) + (1−x) log((1−x)/(1−y)) − (4/(2 ln 2)) (x − y)². Then
∂g(x, y)/∂y = −x/(y ln 2) + (1−x)/((1−y) ln 2) − (4/(2 ln 2)) · 2(y − x) = (y − x)/(y(1−y) ln 2) − 4(y − x)/ln 2.
Since y(1−y) ≤ 1/4, we have ∂g(x, y)/∂y ≤ 0 for y < x. Since g(x, x) = 0, it follows that g(x, y) ≥ 0 for y < x.

Proving Thm 8, general case. Let U = Supp(p) ∪ Supp(q) and let S = {u ∈ U : p(u) > q(u)}, so SD(p, q) = Pr_p[S] − Pr_q[S] (by homework).
Let P ∼ p and let the indicator P̂ be 1 iff P ∈ S; let Q ∼ q and let the indicator Q̂ be 1 iff Q ∈ S. Then SD(P̂, Q̂) = Pr[P ∈ S] − Pr[Q ∈ S] = SD(p, q). Hence,
D(p ‖ q) ≥ D(P̂ ‖ Q̂) (data-processing inequality)
≥ (2/ln 2) · SD(P̂, Q̂)² (the Boolean case)
= (2/ln 2) · SD(p, q)².

Section 8: Conditioned Distributions

Main theorem.
Theorem 9: Let X_1, ..., X_k be iid over 𝒰, and let Y = (Y_1, ..., Y_k) be a rv over 𝒰^k. Then Σ_{j=1}^k D(Y_j ‖ X_j) ≤ D(Y ‖ (X_1, ..., X_k)).
For a rv Z, let Z(z) := Pr[Z = z]. We prove the case k = 2; the general case follows similar lines. Let X = (X_1, X_2); since X_1 and X_2 are independent, X(y) = X_1(y_1) X_2(y_2), and
D(Y ‖ X) = Σ_{y∈𝒰²} Y(y) log( Y(y) / X(y) )
= Σ_{y=(y_1,y_2)} Y(y) log( Y_1(y_1) Y_2(y_2) / (X_1(y_1) X_2(y_2)) ) + Σ_{y=(y_1,y_2)} Y(y) log( Y(y) / (Y_1(y_1) Y_2(y_2)) )
= Σ_y Y(y) log( Y_1(y_1) / X_1(y_1) ) + Σ_y Y(y) log( Y_2(y_2) / X_2(y_2) ) + Σ_y Y(y) log( Y(y) / (Y_1(y_1) Y_2(y_2)) )
= D(Y_1 ‖ X_1) + D(Y_2 ‖ X_2) + I(Y_1; Y_2)
≥ D(Y_1 ‖ X_1) + D(Y_2 ‖ X_2).
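
A numeric check (my addition) of Theorem 9 for k = 2, with X_1, X_2 iid uniform bits and an arbitrarily chosen correlated Y = (Y_1, Y_2); the gap between the two sides is exactly I(Y_1; Y_2), as in the proof.

```python
from math import log2

x_marg = {0: 1/2, 1: 1/2}                                   # X1, X2 iid uniform bits
Y = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}    # a correlated pair (Y1, Y2)

Y1 = {a: Y[(a, 0)] + Y[(a, 1)] for a in (0, 1)}
Y2 = {b: Y[(0, b)] + Y[(1, b)] for b in (0, 1)}

def kl(p, q):
    return sum(v * log2(v / q[a]) for a, v in p.items() if v > 0)

joint_kl = sum(v * log2(v / (x_marg[a] * x_marg[b])) for (a, b), v in Y.items() if v > 0)
print(kl(Y1, x_marg) + kl(Y2, x_marg), "<=", joint_kl)      # 0.0 <= ~0.278
```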

Conditioning distributions, relative entropy case.
Theorem 10: Let X_1, ..., X_k be iid over 𝒳, let X = (X_1, ..., X_k), and let W be an event (i.e., a Boolean rv). Then Σ_{j=1}^k D( (X_j|_W) ‖ X_j ) ≤ D( (X|_W) ‖ X ) ≤ log(1/Pr[W]).
Proof:
Σ_{j=1}^k D( (X_j|_W) ‖ X_j ) ≤ D( (X|_W) ‖ X ) (Thm 9)
= Σ_{x∈𝒳^k} (X|_W)(x) log( (X|_W)(x) / X(x) )
= Σ_{x∈𝒳^k} (X|_W)(x) log( Pr[W | X = x] / Pr[W] ) (Bayes)
= log(1/Pr[W]) + Σ_{x∈𝒳^k} (X|_W)(x) log Pr[W | X = x]
≤ log(1/Pr[W]).

Conditioning distributions, statistical distance case.
Theorem 11: Let X_1, ..., X_k be iid over 𝒳 and let W be an event. Then Σ_{j=1}^k SD( (X_j|_W), X_j )² ≤ log(1/Pr[W]).
Proof: follows by combining Thm 8 with Thm 10.
Using ( Σ_{j=1}^k a_j )² ≤ k · Σ_{j=1}^k a_j², it follows that
Corollary 12: Σ_{j=1}^k SD( (X_j|_W), X_j ) ≤ √( k · log(1/Pr[W]) ), and E_{j∼[k]} SD( (X_j|_W), X_j ) ≤ √( (1/k) · log(1/Pr[W]) ).
Extraction.

Numerical example. Let X = (X_1, ..., X_40) be uniform over {0, 1}^40 and let f: {0, 1}^40 → {0, 1} be such that Pr[f(X) = 0] = 2^{−10}. Then
E_{j∼[40]} SD( (X_j|_{f(X)=0}), U_{{0,1}} ) ≤ √(10/40) = 1/2.
Typical bits are not too biased, even when conditioning on a very unlikely event.

Extension.
Theorem 13: Let X = (X_1, ..., X_k), T and V be rv's over 𝒳^k, 𝒯 and 𝒱, respectively. Let W be an event and assume that the X_i's are iid conditioned on T. Then
Σ_{j=1}^k D( (T V X_j)|_W ‖ (T V)|_W X_j(T) ) ≤ log(1/Pr[W]) + log |Supp(V|_W)|,
where X_j(t) is distributed according to X_j|_{T=t}. Interpretation.

Proving Thm 13. Let X = (X_1, ..., X_k), T and V be rv's over 𝒳^k, 𝒯 and 𝒱, respectively, such that the X_i's are iid conditioned on T. Let W be an event and let X_j(t) be distributed according to the distribution of X_j|_{T=t}. Then
Σ_{j=1}^k D( (T V X_j)|_W ‖ (T V)|_W X_j(T) )
= E_{(t,v)∼(TV)|_W}[ Σ_{j=1}^k D( X_j|_{W, V=v, T=t} ‖ X_j|_{T=t} ) ] (chain rule)
≤ E_{(t,v)∼(TV)|_W}[ log( 1 / Pr[W ∧ V=v | T=t] ) ] (Thm 10, applied with the event "W ∧ V=v" in the space conditioned on T=t)
≤ log E_{(t,v)∼(TV)|_W}[ 1 / Pr[W ∧ V=v | T=t] ] (Jensen's inequality)
= log Σ_{(t,v)∈Supp((TV)|_W)} Pr[T = t] / Pr[W]
≤ log( |Supp(V|_W)| / Pr[W] ).