Entropy and Ergodic Theory Lecture 5: Joint typicality and conditional AEP


1 Notation: from RVs back to distributions

Let (Ω, F, P) be a probability space, and let X and Y be A- and B-valued discrete RVs, respectively. Then p_{X,Y} ∈ Prob(A × B), and we may write it using the multiplication rule for conditional probabilities:

    p_{X,Y}(a, b) = p_X(a) p_{Y|X}(b|a).

In various applications of the theory, we need to fix one of the two factors on the right of this equation, and consider all the possible joint distributions that can be obtained by varying the other factor.

Definition 1.1. A kernel from A to B is an A × B matrix θ = (θ(b|a))_{a,b} such that the row vector (θ(b|a))_{b∈B} is a probability vector on B for each a ∈ A.

Later in this course we extend this definition to more general measurable spaces A and B.

The nth extension of a kernel θ is the kernel θ^n from A^n to B^n defined by

    θ^n(y_1 ⋯ y_n | x_1 ⋯ x_n) = ∏_{i=1}^n θ(y_i | x_i).

Each of the functions θ^n(·|x) for x ∈ A^n is a product probability distribution on B^n.

A kernel θ may be seen as specifying a collection of conditional probabilities θ(·|a), a ∈ A, but not a probability distribution on A itself. If (X, Y) are as above, then we say that p_{Y|X} agrees with a kernel θ if p_{Y|X}(b|a) = θ(b|a) whenever p_X(a) > 0.
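
As a quick illustration (not from the notes themselves), the nth extension of a finite kernel can be sketched in Python; the alphabets and the kernel θ below are hypothetical choices:

```python
from itertools import product
from math import prod

# A hypothetical kernel from A = {0, 1} to B = {0, 1}: each row theta[a]
# is a probability vector on B.
theta = {0: {0: 0.9, 1: 0.1},
         1: {0: 0.1, 1: 0.9}}

def kernel_extension(theta, x, y):
    """theta^n(y_1...y_n | x_1...x_n) = prod_i theta(y_i | x_i)."""
    assert len(x) == len(y)
    return prod(theta[a][b] for a, b in zip(x, y))

# For each fixed x, theta^n(. | x) is a product probability distribution on B^n:
x = (0, 1, 1)
total = sum(kernel_extension(theta, x, y) for y in product((0, 1), repeat=len(x)))
# total equals 1 up to floating-point error
```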

If p ∈ Prob(A) and θ is a kernel from A to B, then they can be combined to give a probability distribution on A × B:

    q(a, b) := p(a) θ(b|a).    (1)

This is sometimes called the input-output distribution of p and θ, or the hookup of p with θ. We sometimes denote it by the (non-standard) notation p ⊗ θ, which is meant to suggest a kind of generalization of a product measure. If (X, Y) is an (A × B)-valued RV with this distribution, then p_X = p and p_{Y|X} agrees with θ. Any pair of RVs (X, Y) with this distribution is called a randomization of p and θ.

If (X, Y) is a randomization of p and θ, then the conditional entropy H(Y|X) clearly depends only on the joint distribution p ⊗ θ. We sometimes use the alternative notation

    H(θ|p) := H(Y|X) = Σ_{a∈A} p(a) H(θ(·|a)).

Since this equals H(X, Y) − H(X) = H(p ⊗ θ) − H(p), it is clearly a continuous function of the joint distribution p ⊗ θ.

2 Joint typicality and the conditional AEP

Let q ∈ Prob(A × B), and let p_A and p_B be its marginals on A and B. In order to emphasize that q is a distribution on a product of two alphabets, we sometimes refer to elements of T_{n,δ}(q) as jointly δ-typical for q. This nomenclature also emphasizes the following important point:

    If x ∈ T_{n,δ}(p_A) and y ∈ T_{n,δ}(p_B), it does not follow that (x, y) is jointly ε-typical for q for some small ε.

Example. Let x ∈ {0,1}^n be δ-typical for the uniform distribution on {0,1}, and let y = x. Then (x, y) is very far from typical for the uniform distribution on {0,1} × {0,1}, since

    N(01 | (x, y)) = N(10 | (x, y)) = 0 ≠ n/4.

Instead, (x, y) is typical for the distribution p = ½(δ_00 + δ_11) on {0,1} × {0,1}.
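
To make the hookup and the identity H(θ|p) = H(p ⊗ θ) − H(p) concrete, here is a small numerical check (not part of the original notes; the distribution p and kernel θ are hypothetical):

```python
from math import log2

p = {0: 0.5, 1: 0.5}              # a hypothetical input distribution on A
theta = {0: {0: 0.9, 1: 0.1},     # a hypothetical kernel from A to B
         1: {0: 0.2, 1: 0.8}}

def hookup(p, theta):
    """The input-output distribution q(a, b) = p(a) theta(b | a)."""
    return {(a, b): p[a] * theta[a][b] for a in p for b in theta[a]}

def entropy(dist):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(w * log2(w) for w in dist.values() if w > 0)

def cond_entropy(theta, p):
    """H(theta | p) = sum_a p(a) H(theta(. | a))."""
    return sum(p[a] * entropy(theta[a]) for a in p)

q = hookup(p, theta)
# cond_entropy(theta, p) agrees with entropy(q) - entropy(p), i.e. H(X,Y) - H(X).
```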

Our next goal is a counting-problem interpretation of conditional entropy, analogous to the meaning established for unconditional entropy in Lecture 1. In proving it, we will also meet conditional versions of the LLN for types and the AEP. We build up all these results following roughly the same sequence as in the unconditional setting of Lecture 1.

Let us now write q = p_A ⊗ θ for some kernel θ from A to B. We fix all of these notations for the rest of the lecture.

2.1 Conditional typicality

Definition 2.1. Let x ∈ A^n and y ∈ B^n. Then the conditional type of y given x is the collection of values

    p_{y|x}(b|a) := N((a, b) | (x, y)) / N(a | x),

defined for all a ∈ A such that N(a|x) > 0.

Henceforth, we usually abbreviate N(ab|xy) := N((a, b) | (x, y)).

Intuitively, the quantity p_{y|x}(b|a) does the following: among all i ∈ {1, 2, ..., n}, we consider only those for which x_i = a, and among those we record the fraction which also satisfy y_i = b.

Sometimes we refer to the conditional type as being indexed by all a ∈ A, without worrying about whether N(a|x) > 0. We do this only in cases where it causes no real problems.

The first part of our interpretation of conditional entropy is the following. It is the analog of the upper bound in Theorem 3.1 from Lecture 1.

Lemma 2.2. Fix a kernel θ and a string x ∈ A^n. Then the number of strings y ∈ B^n for which p_{y|x} agrees with θ is at most 2^{n H(θ|p_x)}.

Proof. Exercise: mimic the proof of the upper bound in Lecture 1, Theorem 3.1. The key fact is that, if p_{y|x} agrees with θ, then N(ab|xy) = n θ(b|a) p_x(a) for a ∈ A, b ∈ B. □
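
The conditional type can be computed directly from a pair of strings. A minimal sketch (the strings here are arbitrary examples, not from the notes):

```python
from collections import Counter

def conditional_type(x, y):
    """p_{y|x}(b | a) = N(ab | xy) / N(a | x), for a with N(a | x) > 0."""
    n_a = Counter(x)
    n_ab = Counter(zip(x, y))
    return {(a, b): n_ab[(a, b)] / n_a[a] for (a, b) in n_ab}

x = (0, 0, 0, 1, 1)
y = (0, 0, 1, 1, 1)
pt = conditional_type(x, y)
# Among the three positions with x_i = 0, two also have y_i = 0,
# so pt[(0, 0)] == 2/3; likewise pt[(0, 1)] == 1/3 and pt[(1, 1)] == 1.
```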

Like Theorem 3.1 in Lecture 1, there is a matching lower bound for this lemma, but we leave it aside here. Instead let us introduce an approximate form of the problem, and prove upper and lower bounds for that. The approximate form is more important for later applications.

Definition 2.3. Let δ > 0 and x ∈ A^n. Then y ∈ B^n is conditionally δ-typical for θ given x if

    ‖p_{x,y} − p_x ⊗ θ‖ < δ.

The set of all such y is denoted by T_{n,δ}(θ, x). More expansively, y ∈ T_{n,δ}(θ, x) if

    Σ_{a,b} | N(ab|xy)/n − (N(a|x)/n) θ(b|a) | < δ.

Beware that some authors use slightly different definitions, as in the case of unconditional typicality.

This definition requires just a little care for the following reason: if θ and λ are two kernels satisfying θ(b|a) = λ(b|a) whenever N(a|x) > 0, then y is conditionally δ-typical for θ given x if and only if the same holds for λ.

As in the unconditional setting, the upper bound of Lemma 2.2 is easily turned into an upper bound for the problem of counting approximately conditionally typical strings.

Corollary 2.4. For x ∈ A^n and δ > 0 we have

    |T_{n,δ}(θ, x)| ≤ 2^{n H(θ|p_x) + n ε(δ) + o(n)},

where ε(δ) → 0 as δ → 0, and the estimates in the two error terms of the exponent are independent of x.

Proof. The quantity H(θ|p) is continuous as a function of the joint distribution p ⊗ θ. From this and Lemma 2.2 it follows that

    |{y : p_{y|x} agrees with λ}| ≤ 2^{n H(λ|p_x)} ≤ 2^{n H(θ|p_x) + n ε(δ)}

whenever ‖p_x ⊗ λ − p_x ⊗ θ‖ < δ.

The values p_{y|x}(b|a) taken by the conditional type of y given x are all ratios of integers between 0 and n, and there are at most |A||B| of these values. Therefore

we may choose a set C_n of at most (n + 1)^{2|A||B|} rational stochastic matrices such that every conditional type between strings of length n agrees with some λ ∈ C_n. Having done so, we obtain

    |T_{n,δ}(θ, x)| ≤ Σ_{λ ∈ C_n : ‖p_x⊗λ − p_x⊗θ‖ < δ} |{y : p_{y|x} agrees with λ}|
                   ≤ Σ_{λ ∈ C_n : ‖p_x⊗λ − p_x⊗θ‖ < δ} 2^{n H(λ|p_x)}
                   ≤ |C_n| · 2^{n H(θ|p_x) + n ε(δ)}
                   = 2^{n H(θ|p_x) + n ε(δ) + o(n)}.  □

In the remainder of this section, we introduce a LLN and AEP for conditional types, and obtain a lower bound which complements the previous corollary.

2.2 A LLN for conditional types

We need a version of the weak law of large numbers for independent RVs with possibly different distributions. We give a very simple version which assumes a uniform bound on the RVs. It can be proved by the same application of Chebyshev's inequality as in the i.i.d. case (which in fact requires only a uniform bound on the variances).

Proposition 2.5 (Weak LLN for RVs with differing distributions). Fix M > 0. For any ε > 0 there is an n_0 ∈ N, depending on M and ε, with the following property: whenever n ≥ n_0 and X_1, ..., X_n are independent [−M, M]-valued RVs, we have

    P{ | (1/n) Σ_{i=1}^n X_i − E[(1/n) Σ_{i=1}^n X_i] | > ε } < ε.

Proposition 2.6 (LLN for conditional types). For any δ > 0, we have

    min_{x ∈ A^n} θ^n( T_{n,δ}(θ, x) | x ) → 1  as n → ∞.

Proof. Fix x ∈ A^n, and let Y ∈ B^n be drawn from the distribution θ^n(·|x). This means that Y_1, ..., Y_n are independent and that Y_i has distribution θ(·|x_i). For any (a, b) ∈ A × B we have

    p_{x,Y}(a, b) = N(ab|xY)/n = (1/n) Σ_{i=1}^n 1_{x_i = a, Y_i = b}.

The RVs 1_{x_i = a, Y_i = b} are independent, but in general not identically distributed: for those i which satisfy x_i = a the distribution is Bernoulli(θ(b|a)), and for the other i this RV is identically zero. However, it does follow that they are all bounded by 1, and so Proposition 2.5 gives

    P{ | p_{x,Y}(a, b) − E[p_{x,Y}(a, b)] | < δ } → 1  for every δ > 0,

where the rate of convergence does not depend on the specific choice of x. This gives the correct limiting behaviour for p_{x,Y}(a, b) because

    E[p_{x,Y}(a, b)] = (1/n) Σ_{i : x_i = a} E[1_{Y_i = b}] = (1/n) N(a|x) θ(b|a) = (p_x ⊗ θ)(a, b).

Combining this convergence for the finitely many choices of (a, b) completes the proof. □

2.3 Conditional entropy typicality and AEP

There is also a conditional version of entropy typicality.

Definition 2.7. A string y ∈ B^n is conditionally ε-entropy typical for θ given x if

    2^{−n H(θ|p_x) − nε} < θ^n(y|x) < 2^{−n H(θ|p_x) + nε}.

The set of all such y is denoted by T^{ent}_{n,ε}(θ, x).

The choice of the constant H(θ|p_x) here is justified by the following.

Theorem 2.8 (Conditional AEP). For any ε > 0 we have

    min_{x ∈ A^n} θ^n( T^{ent}_{n,ε}(θ, x) | x ) → 1  as n → ∞.

Proof. Let Y ∼ θ^n(·|x) as before. Taking logarithms gives

    −log_2 θ^n(Y|x) = Σ_{i=1}^n Z_i,  where Z_i = −log_2 θ(Y_i|x_i).

These are independent, finite-valued RVs. Among them, there are at most |A|-many different possible distributions, one for each of the possible values of x_i. This list of possible distributions depends on θ, but the specific choice of x determines only which distribution is chosen for each Z_i. Therefore Proposition 2.5 gives

    P{ | −(1/n) log_2 θ^n(Y|x) − E[−(1/n) log_2 θ^n(Y|x)] | < δ } → 1  for every δ > 0,

where the rate of convergence again does not depend on the specific choice of x. This completes the proof, because

    E[−log_2 θ^n(Y|x)] = Σ_{i=1}^n E[−log_2 θ(Y_i|x_i)]
                       = Σ_{i=1}^n H(θ(·|x_i))
                       = Σ_{i=1}^n Σ_b −θ(b|x_i) log_2 θ(b|x_i)
                       = n Σ_a p_x(a) H(θ(·|a)) = n H(θ|p_x).  □

From this we obtain a covering-number corollary exactly as in the unconditional setting.

Corollary 2.9. For any ε > 0, we have

    cov_ε(θ^n(·|x)) = 2^{n H(θ|p_x) + o_ε(n)},

where the estimate in o_ε(n) also depends on θ but not on x.

This now implies the lower bound on the number of conditionally typical strings which accompanies Corollary 2.4.

Corollary 2.10. For any x ∈ A^n we have

    |T_{n,δ}(θ, x)| ≥ 2^{n H(θ|p_x) − o_δ(n)},

where the estimate in o_δ(n) depends on θ but not on x.

Proof. Combine Proposition 2.6 with the previous corollary. □
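
The LLN for conditional types and the conditional AEP can both be checked empirically by simulation. A minimal sketch (not from the notes; the kernel θ and the string x are hypothetical choices, and only the standard library is used): drawing Y ∼ θ^n(·|x), the Definition 2.3 distance between p_{x,Y} and p_x ⊗ θ should be small, and −(1/n) log_2 θ^n(Y|x) should be close to H(θ|p_x).

```python
import random
from collections import Counter
from math import log2

random.seed(1)

# A hypothetical kernel from A = {0, 1} to B = {0, 1}.
theta = {0: {0: 0.9, 1: 0.1},
         1: {0: 0.2, 1: 0.8}}

n = 50000
x = tuple(i % 2 for i in range(n))  # a fixed input string with p_x = (1/2, 1/2)

# Draw Y ~ theta^n(. | x): independent coordinates with Y_i ~ theta(. | x_i).
y = tuple(random.choices((0, 1), weights=[theta[a][0], theta[a][1]])[0] for a in x)

# Definition 2.3 distance between the joint type p_{x,y} and p_x ⊗ theta:
n_a, n_ab = Counter(x), Counter(zip(x, y))
dist = sum(abs(n_ab[(a, b)] / n - (n_a[a] / n) * theta[a][b])
           for a in (0, 1) for b in (0, 1))

# Conditional AEP: -(1/n) log2 theta^n(y | x) should be close to H(theta | p_x).
log_lik = -sum(log2(theta[a][b]) for a, b in zip(x, y)) / n
H = sum((n_a[a] / n) * -sum(t * log2(t) for t in theta[a].values() if t > 0)
        for a in (0, 1))
```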

2.4 Intuitive picture of the above results

Let (X, Y) be a randomization of q. The original AEP lets us roughly visualize q^n as a uniform distribution on typical strings. If we wish to think about q^n as a joint distribution for pairs of strings (x, y), then the results of this section give us an enhanced version of that picture. To explain this, we start with the following lemma, which gives some simple but useful relations between conditional and unconditional typicality.

Lemma 2.11.
1. If x ∈ T_{n,δ}(p) and y ∈ T_{n,δ}(θ, x), then (x, y) ∈ T_{n,2δ}(p ⊗ θ).
2. On the other hand, if (x, y) ∈ T_{n,δ}(p ⊗ θ), then y ∈ T_{n,2δ}(θ, x).

Proof. The key is that for any p′ ∈ Prob(A) we have

    ‖p′ ⊗ θ − p ⊗ θ‖ = Σ_{a,b} | p′(a) θ(b|a) − p(a) θ(b|a) |
                     = Σ_a ( | p′(a) − p(a) | Σ_b θ(b|a) )
                     = ‖p′ − p‖.

Part 1. Using the calculation above, our first pair of assumptions gives

    ‖p_{x,y} − p ⊗ θ‖ ≤ ‖p_{x,y} − p_x ⊗ θ‖ + ‖p_x ⊗ θ − p ⊗ θ‖
                      = ‖p_{x,y} − p_x ⊗ θ‖ + ‖p_x − p‖ < 2δ.

Part 2. Clearly if (x, y) ∈ T_{n,δ}(p ⊗ θ), then x ∈ T_{n,δ}(p). Given this, we obtain

    ‖p_{x,y} − p_x ⊗ θ‖ ≤ ‖p_{x,y} − p ⊗ θ‖ + ‖p_x ⊗ θ − p ⊗ θ‖ < 2δ.  □

More succinctly, this lemma asserts that

    T_{n,δ/2}(p ⊗ θ) ⊆ ∪_{x ∈ T_{n,δ}(p)} ( {x} × T_{n,δ}(θ, x) ) ⊆ T_{n,2δ}(p ⊗ θ)  for all δ > 0.  (2)

Using (2), we see that q^n is something like the uniform distribution on the subset

    T_{n,δ}(q) ⊆ T_{n,δ}(p_A) × T_{n,δ}(p_B)

of the Cartesian product, where moreover

1. most vertical slices T_{n,δ}(q) ∩ ( {x} × B^n ) have size roughly 2^{n H(Y|X)} (where we restrict attention to x which is itself typical for p_A), and similarly

2. most horizontal slices T_{n,δ}(q) ∩ ( A^n × {y} ) have size roughly 2^{n H(X|Y)}.

This intuitive picture gives us a nice way to visualize the chain rule. There are some messy εs and δs to keep track of, but the idea is this:

    2^{n H(X,Y)} ≈ |T_{n,δ}(q)| = Σ_{x ∈ T_{n,δ}(p_A)} |T_{n,δ}(q) ∩ ({x} × B^n)|
                 ≈ |T_{n,δ}(p_A)| · (typical size of |T_{n,δ}(q) ∩ ({x} × B^n)|)
                 ≈ 2^{n H(X)} · 2^{n H(Y|X)},

where the approximation between the first and the second lines uses property 1 above.

3 Notes and remarks

Basic sources for this lecture: [CT91, Chapter 10, Exercise 16].

Further reading: the book [CK11] includes a much more thorough account of typicality and conditional typicality.

References

[CK11] Imre Csiszár and János Körner. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, Cambridge, second edition, 2011.

[CT91] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley Series in Telecommunications. John Wiley & Sons, Inc., New York, 1991.

TIM AUSTIN
COURANT INSTITUTE OF MATHEMATICAL SCIENCES, NEW YORK UNIVERSITY, 251 MERCER ST, NEW YORK NY 10012, USA
Email: tim@cims.nyu.edu
URL: cims.nyu.edu/~tim