Lecture 5: Asymptotic Equipartition Property

Size: px

Start display at page:

Download "Lecture 5: Asymptotic Equipartition Property"

Buck Price
5 years ago
Views:

1 Lecture 5: Asymptotic Equipartition Property Law of large number for product of random variables AEP and consequences Dr. Yao Xie, ECE587, Information Theory, Duke University

2 Stock market Initial investment Y 0, daily return ratio r i, in t-th day, your money is Y t = Y 0 r 1... r t. Now if returns ratio r i are i.i.d., with r i = { 4, w.p. 1/2 0, w.p. 1/2. So you think the expected return ratio is Er i = 2, and then EY t = E(Y 0 r 1... r t ) = Y 0 (Er i ) t = Y 0 2 t? Dr. Yao Xie, ECE587, Information Theory, Duke University 1

Is optimized really optimal? With Y 0 = 1, actual return Y t goes like 1 4 16 0 0 0... Optimize expected return is not optimal?

3 Is optimized really optimal? With Y 0 = 1, actual return Y t goes like Optimize expected return is not optimal? Fundamental reason: products does not behave the same as addition Dr. Yao Xie, ECE587, Information Theory, Duke University 2

4 (Weak) Law of large number Theorem. For independent, identically distributed (i.i.d.) random variables X i, X n = 1 n n X i EX, i=1 in probability. Convergence in probability if for every ϵ > 0, P { X n X > ϵ} 0. Proof by Markov inequality. So this means P { X n EX ϵ} 1, n. Dr. Yao Xie, ECE587, Information Theory, Duke University 3

5 In mean square if as n Other types of convergence E(X n X) 2 0 With probability 1 (almost surely) if as n In distribution if as n { } P lim X n = X = 1 n lim n F n F, where F n and F are the cumulative distribution function of X n and X. Dr. Yao Xie, ECE587, Information Theory, Duke University 4

6 Product of random variables How does this behave? n n i=1 X i Geometric mean n n i=1 X i arithmetic mean 1 n n i=1 X i Examples: Volume V of a random box, each dimension X i, V = X 1... X n Stock return Y t = Y 0 r 1... r t Joint distribution of i.i.d. RVs: p(x 1,..., x n ) = n i=1 p(x i) Dr. Yao Xie, ECE587, Information Theory, Duke University 5

7 Law of large number for product of random variables We can write Hence So from LLN n n i=1 X i = e log X i n n X i = en 1 ni=1 log X i i=1 X i e E(log X) e log EX = EX. Dr. Yao Xie, ECE587, Information Theory, Duke University 6

8 Stock example: E log r i = 1 2 log log 0 = E(Y t ) Y 0 e E log r i = 0, t. Example { a, w.p. 1/2 X = b, w.p. 1/2. n E n X i ab a + b 2 i=1 Dr. Yao Xie, ECE587, Information Theory, Duke University 7

9 Asymptotic equipartition property (AEP) LLN states that 1 n n X i EX i=1 AEP states that most sequences 1 n log 1 p(x 1, X 2,..., X n ) H(X) p(x 1, X 2,..., X n ) 2 nh(x) Analyze using LLN for product of random variables Dr. Yao Xie, ECE587, Information Theory, Duke University 8

10 AEP lies in the heart of information theory. Proof for lossless source coding Proof for channel capacity and more... Dr. Yao Xie, ECE587, Information Theory, Duke University 9

11 AEP Theorem. If X 1, X 2,... are i.i.d. p(x), then 1 n log p(x 1, X 2,..., X n ) H(X), in probability. Proof: 1 n log p(x 1, X 2,, X n ) = 1 n n log p(x i ) i=1 E log p(x) = H(X). There are several consequences. Dr. Yao Xie, ECE587, Information Theory, Duke University 10

12 Typical set A typical set A (n) ϵ contains all sequences (x 1, x 2,..., x n ) X n with the property 2 n(h(x)+ϵ) p(x 1, x 2,..., x n ) 2 n(h(x) ϵ). Dr. Yao Xie, ECE587, Information Theory, Duke University 11

13 Not all sequences are created equal Coin tossing example: X {0, 1}, p(1) = 0.8 p(1, 0, 1, 1, 0, 1) = p X i (1 p) 5 X i = p 4 (1 p) 2 = p(0, 0, 0, 0, 0, 0) = p X i (1 p) 5 X i = p 4 (1 p) 2 = In this example, if (x 1,..., x n ) A (n) ϵ, H(X) ϵ 1 n log p(x 1,..., X n ) H(X) + ϵ. This means a binary sequence is in typical set is the frequency of heads is approximately k/n Dr. Yao Xie, ECE587, Information Theory, Duke University 12

14 Dr. Yao Xie, ECE587, Information Theory, Duke University 13

15 p = 0.6, n = 25, k = number of 1 s k ( n k ) ( n k ) p k (1 p) n k 1 n log p(xn ) Dr. Yao Xie, ECE587, Information Theory, Duke University 14

16 Consequences of AEP Theorem. 1. If (x 1, x 2,..., x n ) A (n) ϵ, then for n sufficiently large: H(X) ϵ 1 n log p(x 1, x 2,..., x n ) H(X) + ϵ 2. P {A (n) ϵ } 1 ϵ. 3. A (n) ϵ 2 n(h(x)+ϵ). 4. A (n) ϵ (1 ϵ)2 n(h(x) ϵ). Dr. Yao Xie, ECE587, Information Theory, Duke University 15

17 If (x 1, x 2,..., x n ) A ϵ (n), then Property 1 H(X) ϵ 1 n log p(x 1, x 2,..., x n ) H(X) + ϵ. Proof from definition: (x 1, x 2,..., x n ) A (n) ϵ, if 2 n(h(x)+ϵ) p(x 1, x 2,..., x n ) 2 n(h(x) ϵ). The number of bits used to describe sequences in typical set is approximately nh(x). Dr. Yao Xie, ECE587, Information Theory, Duke University 16

18 Property 2 P {A ϵ (n) } 1 ϵ for n sufficiently large. Proof: From AEP: because 1 n log p(x 1,..., X n ) H(X) in probability, this means for a given ϵ > 0, when n is sufficiently large p{ 1 n log p(x 1,..., X n ) H(X) ϵ } 1 ϵ. }{{} A (n) ϵ High probability: sequences in typical set are most typical. These sequences almost all have same probability - equipartition. Dr. Yao Xie, ECE587, Information Theory, Duke University 17

19 Property 3 and 4: size of typical set Proof: (1 ϵ)2 n(h(x) ϵ) A (n) ϵ 2 n(h(x)+ϵ) 1 = (x 1,...,x n ) (x 1,...,x n ) A (n) ϵ (x 1,...,x n ) A (n) ϵ p(x 1,..., x n ) = A (n) ϵ 2 n(h(x)+ϵ). p(x 1,..., x n ) p(x 1,..., x n )2 n(h(x)+ϵ) Dr. Yao Xie, ECE587, Information Theory, Duke University 18

20 On the other hand, P {A (n) ϵ } 1 ϵ for n, so 1 ϵ < (x 1,...,x n ) A (n) ϵ A (n) ϵ 2 n(h(x) ϵ). p(x 1,..., x n ) Size of typical set depends on H(X). When p = 1/2 in coin tossing example, H(X) = 1, 2 nh(x) = 2 n : all sequences are typical sequences. Dr. Yao Xie, ECE587, Information Theory, Duke University 19

21 Typical set diagram This enables us to divide all sequences into two sets Typical set: high probability to occur, sample entropy is close to true entropy so we will focus on analyzing sequences in typical set Non-typical set: small probability, can ignore in general n : n elements Non-typical set Typical set A (n) : 2 n(h + ) elements Dr. Yao Xie, ECE587, Information Theory, Duke University 20

22 Data compression scheme from AEP Let X 1, X 2,..., X n be i.i.d. RV drawn from p(x) We wish to find short descriptions for such sequences of RVs Dr. Yao Xie, ECE587, Information Theory, Duke University 21

23 Divide all sequences in X n into two sets Non-typical set Description: n log + 2 bits Typical set Description: n(h + ) + 2 bits Dr. Yao Xie, ECE587, Information Theory, Duke University 22

24 Use one bit to indicate which set Typical set A (n) ϵ use prefix 1 Since there are no more than 2 n(h(x)+ϵ) sequences, indexing requires no more than (H(X) + ϵ) + 1 (plus one extra bit) Non-typical set use prefix 0 Since there are at most X n sequences, indexing requires no more than n log X + 1 Notation: x n = (x 1,..., x n ), l(x n ) = length of codeword for x n We can prove E [ ] 1 n l(xn ) H(X) + ϵ Dr. Yao Xie, ECE587, Information Theory, Duke University 23

25 Summary of AEP Almost everything is almost equally probable. Reasons that AEP has H(X) 1 n log p(xn ) H(X), in probability n(h(x) ± ϵ) suffices to describe that random sequence on average 2 H(X) is the effective alphabet size Typical set is the smallest set with probability near 1 Size of typical set 2 nh(x) The distance of elements in the set nearly uniform Dr. Yao Xie, ECE587, Information Theory, Duke University 24

26 Next Time AEP is about the property of independent sequences What about the dependence processes? Answer is entropy rate - next time. Dr. Yao Xie, ECE587, Information Theory, Duke University 25

Lecture 22: Final Review

Lecture 22: Final Review Nuts and bolts Fundamental questions and limits Tools Practical algorithms Future topics Dr Yao Xie, ECE587, Information Theory, Duke University Basics Dr Yao Xie, ECE587, Information