Entropy and Ergodic Theory Lecture 5: Joint typicality and conditional AEP

Etropy ad Ergodic Theory Lecture 5: Joit typicality ad coditioal AEP 1 Notatio: from RVs back to distributios Let (Ω, F, P) be a probability space, ad let X ad Y be A- ad B-valued discrete RVs, respectively. The p X,Y Prob(A B), ad we may write it usig the multiplicatio rule for coditioal probabilities: p X,Y (a, b) = p X (a) p Y X (b a). I vaious applicatios of the theory, we eed to fix oe of the two factors o the right of this equatio, ad cosider all the possible joit distributios that ca be obtaied by varyig the other factor. Defiitio 1.1. A kerel from A to B is a A B matrix θ = (θ(b a)) a,b such that the row vector (θ(b a)) b B is a probability vector o B for each a A. Later i this course we exted this defiitio to more geeral measurable spaces A ad B. The th extesio of a kerel θ is the kerel θ from A to B defied by θ (y 1 y x 1 x ) = θ(y i x i ). Each of the fuctios θ ( x) for x A is a product probability distributio o B. A kerel θ may be see as specifyig a collectio of coditioal probabilities θ( a), a A, but ot a probability distributio o A itself. If (X, Y ) are as above, the we say that p Y X agrees with a kerel θ if p Y X (b a) = θ(b a) wheever p X (a) > 0. 1

If p Prob(A) ad θ is a kerel from A to B, the they ca be combied to give a probability distributio o A B: q(a, b) := p(a) θ(b a). (1) This is sometimes called the iput-output distributio of p ad θ, or the hookup of p with θ. We sometimes deote if by the (o-stadard) otatio p θ, which is meat to suggest a kid of geeralizatio of a product measure. If (X, Y ) are a (A B)-valued RV with this distributio, the p X = p ad p Y X agrees with θ. Ay pair of RVs (X, Y ) with this distributio is called a radomizatio of p ad θ. If (X, Y ) is a radomizatio of p ad θ, the the coditioal etropy H(Y X) clearly depeds oly o the joit distributio p θ. We sometimes use the alterative otatio H(θ p) := H(Y X) = a A p(a) H(θ( a)). Sice this equals H(X, Y ) H(X) = H(p θ) H(p), it is clearly a cotiuous fuctio of the joit distributio p θ. 2 Joit typicality ad the coditioal AEP Let q Prob(A B), ad let p A ad p B its margials o A ad B. I order to emphasize that q is a distributio o a product of two alphabets, we sometimes refer to elemets of T,δ (q) as joitly δ-typical for q. This omeaclature also emphasizes the followig importat poit: If x T,δ (p A ) ad y T,δ (p B ), it does ot follow that (x, y) is joitly ε-typical for q for some small ε. Example. Let x {0, 1} be δ-typical for the uiform distributio o {0, 1}, ad let y = x. The (x, y) is very far from typical for the uiform distributio o {0, 1} {0, 1}, sice N(01 (x, y)) = N(10 (x, y)) = 0 /4. Istead, (x, y) is typical for the distributio p = 1 2 (δ 00 +δ 11 ) o {0, 1} {0, 1}. 2

Our ext goal is a coutig-problem itepretatio of coditioal etropy, aalogous to the meaig established for ucoditioal etropy i Lecture 1. I provig it, we will also meet coditioal versios of the LLN for types ad the AEP. We build up all these results followig roughly the same sequece as i the ucoditioal settig of Lecture 1. Let us ow write q = p A θ for some kerel θ from A to B. We fix all of these otatios for the rest of the lecture. 2.1 Coditioal typicality Defiitio 2.1. Let x A ad y B. The the coditioal type of y give x is the collectio of values p y x (b a) := N( (a, b) (x, y) ), N(a x) defied for all a A such that N(a x) > 0. Heceforth, we usually abbreviate N(ab xy) := N ( (a, b) (x, y) ). Ituitively, the quatity p y x (b a) does the followig: amog all i {1, 2,..., }, we cosider oly those for which x i = a, ad amog those we record the fractio which also satisfy y i = b. Sometimes we refer to the coditioal type as beig idexed by all a A, without worryig about whether N(a x) > 0. We do this oly i cases where it causes o real problems. The first part of our iterpretatio of coditioal etropy is the followig. It is the aalog of the upper boud i Theorem 3.1 from Lecture 1. Lemma 2.2. Fix a kerel θ ad a strig x A. The umber of strigs y B for which p y x agrees with θ is at most 2 H(θ px). Proof. Exercise: mimic the proof of the upper boud i Lecture 1, Theorem 3.1. The key fact is that, if p y x agrees with θ, the N(ab xy) = θ(b a)p x (a) for a A, b B. 3

Like Theorem 3.1 i Lecture 1, there is a matchig lower boud for this lemma, but we leave it aside here. Istead let us itroduce a approximate form of the problem, ad prove upper ad lower bouds for that. The approximate form is more importat for later applicatios. Defiitio 2.3. Let δ > 0 ad x A. The y B is coditioally δ-typical for θ give x if px,y p x θ < δ. The set of all such y is deoted by T,δ (θ, x). More expasively, y T,δ (θ, x) if N(ab xy) a,b N(a x) θ(b a) < δ. Beware that some authors use slightly differet defiitios, as i the case of ucoditioal tylicality. This defiitio requires just a little care for the followig reaso: if θ ad λ are two kerels satisfyig θ(b a) = λ(b a) wheever N(a x) > 0, the y is coditioally δ-typical for θ give x if ad oly if the same holds for λ. As i the ucoditioal settig, the upper boud of Lemma 2.2 is easily tured ito a upper boud for the problem of coutig approximately coditioally typical strigs. Corollary 2.4. For x A ad δ > 0 we have T,δ (θ, x) 2 H(θ px)+ (δ)+o(), where the estimates i the two error terms of the expoet are idepedet of x. Proof. The quatity H(θ p) is cotiuous as a fuctio of the joit distributio p θ. From this ad Lemma 2.2 it follows that {y : p y x agrees with λ} 2 H(λ px) H(θ px)+ (δ) 2 wheever p x λ p x θ < δ. The values p y x (b a) take by the coditioal type of y give x are all ratios of itegers betwee 0 ad, ad there are at most A B of these values. Therefore 4

we may choose a set C of at most ( + 1) 2 A B ratioal stochastic matrices such that every coditioal type betwee strigs of legth agrees with some λ C. Havig doe so, we obtai T,δ (θ, x) λ C p x λ p x θ < δ λ C p x λ p x θ < δ H(λ px) 2 2 H(θ px)+ (δ) 2 H(θ px)+ (δ)+o(). I the remaider of this sectio, we itroduce a LLN ad AEP for coditioal types, ad obtai a lower boud which complemets the previous corollary. 2.2 A LLN for coditioal types We eed a versio of the weak law of large umbers for idepedet RVs of possibly differet distributios. We give a very simple versio which assumes a uiform boud o the RVs. It ca be proved by the same applicatio of Chebyshev s iequality as i the i.i.d. case (which i fact requires oly a uiform boud o the variaces). Propositio 2.5 (Weak LLN for RVs with differig distributios). Fix M > 0. For ay ε > 0 there is a 0 N, depedig o M ad ε, with the followig property: wheever 0 ad X 1,... X are idepedet [ M, M]-valued RVs, we have { 1 P [ 1 X i E ] } X i > ε < ε. Propositio 2.6 (LLN for coditioal types). For ay δ > 0, we have as. mi θ( T,δ (θ, x) ) x 1 x A 5

Proof. Fix x A, ad let Y B be draw from the distributio θ ( x). This meas that Y 1,..., Y are idepedet ad that Y i has distributio θ( x i ). For ay (a, b) A B we have p x,y (a, b) = N(ab xy) = 1 1 {xi =a, Y i =b}. The RVs 1 {xi =a, Y i =b} are idepedet, but i geeral ot idetically distributed: for those i which satisfy x i = a the distributio is Beroulli(θ(b a)), ad for the other i this RV is idetically zero. However, it does follow that they are all bouded by 1, ad so Propositio 2.5 gives { px,y P (a, b) E [ p x,y (a, b) ] } < δ 1 δ > 0, where the rate of covergece does ot deped o the specific choice of x. This gives the correct limitig behaviour for p x,y (a, b) because E [ p x,y (a, b) ] = 1 i: x i =a E[1 {Yi =b}] = 1 N(a x)θ(b a) = (p x θ)(a, b). Combiig this covergece for the fiitely may choices of (a, b) completes the proof. 2.3 Coditioal etropy typicality ad AEP There is also a coditioal versio of etropy typicality. Defiitio 2.7. A strig y B is coditioally ε-etropy typical for θ give x if 2 H(θ px) ε < θ (y x) < 2 H(θ px)+ε. The set of all such y is deoted by T et,ε (θ, x). The choice of the costat H(θ p x ) here is justified by the followig. Theorem 2.8 (Coditioal AEP). For ay ε > 0 we have as. mi θ( T et x A,ε (θ, x) ) x 1 6

Proof. Let Y θ ( x) as before. Takig logarithms gives log 2 θ (Y x) = Z i, where Z i = log 2 θ(y i x i ). These are idepedet, fiite-valued RVs. Amog them, there are at most A -may differet possible distributios, oe for each of the possible values of x i. This list of possible distributios depeds o θ, but the specific choice of x determies oly which distributio is chose for each Z i. Therefore Propositio 2.5 gives { 1 [ 1 } P log 2 θ (Y x) E log 2 θ (Y x)] < δ 1 δ > 0, where the rate of covergece agai does ot deped o the specific choice of x. This completes the proof, because E[log 2 θ (Y x)] = = m E[log 2 θ(y i x i )] = H(θ( x i )) = a θ(b x i ) log 2 θ(b x i ) b p x (a) H(θ( a)) = H(θ P x ). From this we obtai a coverig-umber corollary exactly as i the ucoditioal settig. Corollary 2.9. For ay ε > 0, we have cov ε (θ ( x)) = 2 H(θ px)+oε(), where the estimate i o ε () also depeds o θ but ot o x. This ow implies the lower boud o the umber of coditioally typical strigs which accompaies Corollary 2.4 Corollary 2.10. For ay x A we have T,δ (θ, x) 2 H(θ px) o δ(), where the estimate i o δ () depeds o θ but ot o x. Proof. Combie Propositio 2.6 with the previous corollary. 7

2.4 Ituitive picture of the above results Let (X, Y ) be a radomizatio of q. The origial AEP lets us roughly visualize q as a uiform distributio o typical strigs. If we wish to thik about q as a joit distributio for pairs of strigs (x, y), the the results of this sectio give us a ehaced versio of that picture. To explai this, we start with the followig. The followig lemma gives some simple but useful relatios betwee coditioal ad ucoditioal typicality. Lemma 2.11. 1. If x T,δ (p) ad y T,δ (θ, x) the (x, y) T,2δ (p θ). 2. O the other had, if (x, y) T,δ (p θ), the y T,2δ (θ, x). Proof. The key is that for ay p Prob(A) we have p θ p θ = p (a)θ(b a) p(a)θ(b a) a,b = a ( ) p (a) p(a) θ(b a) = p p. b Part 1. Usig the calculatio above, our first pair of assumptios gives p x,y p θ p x,y p x θ + p x θ p θ = p x,y p x θ + p x p < 2δ. Part 2. Clearly if (x, y) T,δ (p θ), the x T,δ (p). Give this, we obtai p x,y p x θ p x,y p θ + p x θ p θ < 2δ. More succictly, this lemma asserts that T,δ/2 (p θ) ( {x} T,δ (θ, x) ) T,2δ (p θ) δ > 0. (2) x T,δ (p) Usig (2), we see that q is somethig like the uiform distributio o the subset of the Cartesia product where moreover T,δ (q) T,δ (p A ) T,δ (p B ), 8

1. most vertical slices T,δ (q) ({x} B ) have size roughly 2 H(Y X) (where we restrict attetio to x which is itself typical for p A ), ad similarly 2. most horizotal slices T,δ (q) (A {y}) have size roughly 2 H(X Y ). This ituitive picture gives us a ice way to visualize the chai rule. There are some messy εs ad δs to keep track of, but the idea is this: 2 H(X,Y ) T,δ (q) = T,δ (q) ({x} B ) x T,δ (p A ) T,δ (p A ) (typical size T,δ (q) ({x} B ) ) 2 H(X) 2 H(Y X), where the approximatio betwee the first ad the secod lies uses property 1 above. 3 Notes ad remarks Basic sources for this lecture: [CT91, Chapter 10, Exercise 16]. Further readig: the book [CK11] icludes a much more thorough accout of typicality ad coditioal typicality. Refereces [CK11] Imre Csiszár ad Jáos Körer. Iformatio theory. Cambridge Uiversity Press, Cambridge, secod editio, 2011. Codig theorems for discrete memoryless systems. [CT91] Thomas M. Cover ad Joy A. Thomas. Elemets of iformatio theory. Wiley Series i Telecommuicatios. Joh Wiley & Sos, Ic., New York, 1991. A Wiley-Itersciece Publicatio. TIM AUSTIN COURANT INSTITUTE OF MATHEMATICAL SCIENCES, NEW YORK UNIVERSITY, 251 MERCER ST, NEW YORK NY 10012, USA Email: tim@cims.yu.edu URL: cims.yu.edu/ tim 9