Chapter 3: Asymptotic Equipartition Property

Chapter 3 outline:
- Strong vs. weak typicality
- Convergence of random variables
- The asymptotic equipartition property theorem
- High-probability sets and the typical set
- Consequences of the AEP: data compression
Strong versus Weak Typicality

Intuition behind typicality? Another example of typicality: bit-sequences of length n = 8, with prob(1) = p and prob(0) = 1 - p. Which sequences are strongly typical? Which are weakly typical? What if p = 0.5?
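A minimal sketch (not from the slides; the parameters p = 0.3 and tolerance eps = 0.1 are my choices) contrasting the two notions by brute-force enumeration of all 2^8 strings: strong typicality asks that the empirical fraction of ones be near p, while weak typicality asks that the per-symbol log-probability be near H(p).

```python
# Sketch: strong vs. weak typicality for length-8 Bernoulli(p) bit strings,
# checked by enumerating all 256 strings. Parameters are illustrative.
from itertools import product
from math import log2

n, p, eps = 8, 0.3, 0.1
H = -p * log2(p) - (1 - p) * log2(1 - p)   # binary entropy H(p)

strong, weak = [], []
for x in product([0, 1], repeat=n):
    k = sum(x)                              # number of ones
    logp = k * log2(p) + (n - k) * log2(1 - p)
    if abs(k / n - p) <= eps:               # empirical frequency near p
        strong.append(x)
    if abs(-logp / n - H) <= eps:           # per-symbol log-prob near H(p)
        weak.append(x)

print(len(strong), len(weak))
# For p = 0.5 every string has -log2 P(x) / n = 1 = H(0.5), so ALL strings
# are weakly typical, but only those with roughly n/2 ones are strongly typical.
```

For a binary i.i.d. source both notions depend on x only through its count of ones, so here the two sets coincide; for larger alphabets strong typicality is strictly more demanding.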
Convergence of random variables

Weak Law of Large Numbers + the AEP
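The AEP states that -(1/n) log2 p(X1, ..., Xn) converges in probability to H(X). A small simulation (my own illustration, with a Bernoulli(0.1) source) shows the concentration as n grows.

```python
# Sketch: watch -(1/n) log2 p(X^n) concentrate around H(X) for a
# Bernoulli(p) source as n grows (the AEP / weak law of large numbers).
import random
from math import log2

random.seed(0)
p = 0.1
H = -p * log2(p) - (1 - p) * log2(1 - p)   # ~0.469 bits/symbol

def per_symbol_logprob(n):
    """Draw one length-n sequence and return -(1/n) log2 of its probability."""
    k = sum(random.random() < p for _ in range(n))   # number of ones
    return -(k * log2(p) + (n - k) * log2(1 - p)) / n

for n in [10, 100, 1000, 10000]:
    print(n, round(per_symbol_logprob(n), 3), "vs H =", round(H, 3))
```

The printed values wander for small n and settle near 0.469 for large n, which is exactly the convergence the theorem asserts.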
Typical sets: intuition

What's the point? Consider i.i.d. bit-strings of length N = 100 with prob(1) = p1 = 0.1.

- Probability of a given string x with r ones: P(x) = p1^r (1 - p1)^(N-r)
- Number of strings with r ones: n(r) = C(N, r)
- Distribution of r, the number of ones in a string of length N, is thus
  n(r) P(x) = C(N, r) p1^r (1 - p1)^(N-r)

[Figure: plots of this distribution of r versus r for N = 100 and N = 1000, showing r concentrating around N p1 (r ≈ 10 and r ≈ 100, respectively).]
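A short check (my own, matching the slide's N = 100, p1 = 0.1) computes the distribution of r directly and confirms that the probability mass concentrates around N * p1 = 10, which is what the figure illustrates.

```python
# Sketch: compute P(r ones) = C(N, r) p1^r (1 - p1)^(N - r) for N = 100,
# p1 = 0.1, and confirm concentration of r around N * p1 = 10.
from math import comb

N, p1 = 100, 0.1
pr = [comb(N, r) * p1**r * (1 - p1)**(N - r) for r in range(N + 1)]

mode = max(range(N + 1), key=lambda r: pr[r])      # most likely value of r
mass_near_mean = sum(pr[r] for r in range(5, 16))  # P(5 <= r <= 15)
print("mode =", mode, " P(5 <= r <= 15) =", round(mass_near_mean, 3))
```

Almost all the probability sits in a narrow band of r values, even though any single string in that band is far less probable than the all-zeros string.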
Typical sets: intuition

What's the point? Consider i.i.d. bit-strings of length N = 100 with prob(1) = p1 = 0.1.

[Figure 4.10: The top 15 strings are samples from X^100, where p1 = 0.1 and p0 = 0.9; their -log2 P(x) values range from about 37 to 66 bits. The bottom two are the most probable string (all zeros, -log2 P(x) = 15.2) and the least probable string (all ones, -log2 P(x) = 332.1) in this ensemble. The final column shows the log-probabilities of the random strings, which may be compared with the entropy H(X^100) = 46.9 bits. [MacKay textbook, pg. 78]]

Definition: weak typicality
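The figure's numbers are easy to reproduce; the sketch below (my own, mirroring MacKay's setup) samples strings from X^100 and prints -log2 P(x) next to N*H(X) = 46.9 bits.

```python
# Sketch reproducing MacKay's Fig. 4.10 numerically: sample strings from
# X^100 with p1 = 0.1 and compare -log2 P(x) against N * H(X).
import random
from math import log2

random.seed(1)
N, p1 = 100, 0.1
H = -p1 * log2(p1) - (1 - p1) * log2(1 - p1)   # ~0.469 bits/symbol

for _ in range(5):
    x = [int(random.random() < p1) for _ in range(N)]
    r = sum(x)
    neg_logp = -(r * log2(p1) + (N - r) * log2(1 - p1))
    print(f"r = {r:2d}  -log2 P(x) = {neg_logp:5.1f}")

print("N*H(X) =", round(N * H, 1))   # 46.9: random strings land near this
# The all-zeros string is the MOST probable (-log2 P(x) = 15.2 bits) and the
# all-ones string the LEAST probable (~332.2 bits); neither is typical.
```

Random draws cluster near 46.9 bits, while the single most probable string sits far below it: being probable and being typical are different things.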
Properties of the typical set (continued)

|A_ε^(n)| ≤ 2^{n(H(X)+ε)}.   (3.12)

Finally, for sufficiently large n, Pr{A_ε^(n)} > 1 - ε, so that

1 - ε < Pr{A_ε^(n)}   (3.13)
      ≤ Σ_{x ∈ A_ε^(n)} 2^{-n(H(X)-ε)}   (3.14)
      = |A_ε^(n)| 2^{-n(H(X)-ε)},   (3.15)

where the second inequality follows from (3.6). Hence

|A_ε^(n)| ≥ (1 - ε) 2^{n(H(X)-ε)},   (3.16)

which completes the proof of the properties of A_ε^(n).

The typical set visually

Bit sequences of length 100, prob(1) = 0.1.

[MacKay Fig. 4.12, pg. 81: all strings in the ensemble X^N ranked by probability, with N H(X) and the typical set T_{Nβ} marked; the most and least likely sequences are NOT in the typical set.]

3.2 CONSEQUENCES OF THE AEP: DATA COMPRESSION

Let X1, X2, ..., Xn be independent, identically distributed random variables drawn from the probability mass function p(x). We wish to find short descriptions for such sequences of random variables. We divide all sequences in X^n into two sets: the typical set A_ε^(n) and its complement, as shown in Figure 3.1.

[FIGURE 3.1. Typical sets and source coding: X^n has |X|^n elements; it splits into the non-typical set and the typical set A_ε^(n), which has at most 2^{n(H+ε)} elements. [Cover+Thomas pg. 60]]

The asymptotic equipartition principle is equivalent to Shannon's source coding theorem (verbal statement): N i.i.d. random variables each with entropy H(X) can be compressed into more than N H(X) bits with negligible risk of information loss, as N → ∞; conversely, if they are compressed into fewer than N H(X) bits it is virtually certain that information will be lost. These two theorems are equivalent because we can define a compression algorithm that gives a distinct name of length N H(X) bits to each x in the typical set.
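The three properties (3.12), (3.13), and (3.16) can be checked numerically for a binary source, since all sequences with the same number of ones k share one probability and can be counted with C(n, k). This sketch is my own illustration (parameters n = 2000, ε = 0.05 chosen so that Pr{A_ε^(n)} > 1 - ε actually holds).

```python
# Sketch: numerically check the typical-set properties for a Bernoulli(0.1)
# source by grouping length-n sequences into types (same count of ones k).
from math import comb, log2

p, n, eps = 0.1, 2000, 0.05
H = -p * log2(p) - (1 - p) * log2(1 - p)     # ~0.469 bits/symbol

size, prob = 0, 0.0
for k in range(n + 1):
    per_sym = -(k * log2(p) + (n - k) * log2(1 - p)) / n
    if abs(per_sym - H) <= eps:              # sequences of this type are typical
        size += comb(n, k)                   # C(n, k) sequences share this type
        # probability of the whole type, in log domain to avoid float overflow
        prob += 2.0 ** (log2(comb(n, k)) - per_sym * n)

print("Pr{A} =", round(prob, 3))                               # (3.13)
print("upper bound holds:", log2(size) <= n * (H + eps))       # (3.12)
print("lower bound holds:",
      log2(size) >= log2(1 - eps) + n * (H - eps))             # (3.16)
```

Note log2 |A_ε^(n)| ≈ 938 here while n log2|X| = 2000: the typical set is a vanishing fraction of all sequences yet carries almost all the probability.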
University of Illinois at Chicago ECE 534, Fall 2009, Natasha Devroye

4.5 Proofs [MacKay pg. 81]

This section may be skipped if found tough going.

The law of large numbers

Our proof of the source coding theorem uses the law of large numbers. The mean and variance of a real random variable u are E[u] = ū = Σ_u P(u) u and var(u) = σ_u² = E[(u - ū)²] = Σ_u P(u)(u - ū)².

Technical note: strictly, I am assuming here that u is a function u(x) of a sample x from a finite discrete ensemble X. Then the summations Σ_u P(u) f(u) should be written Σ_x P(x) f(u(x)). This means that P(u) is a finite sum of delta functions. This restriction guarantees that the mean and variance of u do exist, which is not necessarily the case for general P(u).

Chebyshev's inequality 1. Let t be a non-negative real random variable, and let α be a positive real number. Then

P(t ≥ α) ≤ t̄ / α.   (4.30)

Proof: P(t ≥ α) = Σ_{t ≥ α} P(t). We multiply each term by t/α ≥ 1 and add the (non-negative) missing terms to obtain: P(t ≥ α) ≤ Σ_{t ≥ α} P(t) t/α ≤ Σ_t P(t) t/α = t̄/α.
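Inequality (4.30), which MacKay labels "Chebyshev's inequality 1", is usually called Markov's inequality. The sketch below (my own, on a made-up finite ensemble) verifies it for every threshold α, following the proof's tail-sum computation exactly.

```python
# Sketch: verify P(t >= alpha) <= E[t] / alpha  (MacKay eq. 4.30, i.e.
# Markov's inequality for a non-negative t) on a small discrete ensemble.
vals  = [0, 1, 2, 5, 10]            # support of the non-negative r.v. t
probs = [0.4, 0.3, 0.15, 0.1, 0.05]

mean = sum(p * v for p, v in zip(probs, vals))   # t-bar = E[t]
for alpha in [1, 2, 5, 10]:
    tail = sum(p for p, v in zip(probs, vals) if v >= alpha)
    assert tail <= mean / alpha                  # the inequality itself
    print(f"alpha={alpha:2d}  P(t>=alpha)={tail:.2f}  bound={mean/alpha:.2f}")
```

Applying this with t = (sample mean - ū)² and α = ε² gives the usual Chebyshev bound, which is how the weak law of large numbers is proved from it.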
Consequences of the AEP

The typical set contains almost all the probability; this is useful for source coding (compression).

[FIGURE 3.2. Source code using the typical set: non-typical set, description length n log|X| + 2 bits; typical set, description length n(H + ε) + 2 bits.]

We order all elements in each set according to some order (e.g., lexicographic order). Then we can represent each sequence of A_ε^(n) by giving the index of the sequence in the set. Since there are ≤ 2^{n(H+ε)} sequences in A_ε^(n), the indexing requires no more than n(H + ε) + 1 bits. [The extra bit may be necessary because n(H + ε) may not be an integer.] We prefix all these sequences by a 0, giving a total length of n(H + ε) + 2 bits to represent each sequence in A_ε^(n) (see Figure 3.2). Similarly, we can index each sequence not in A_ε^(n) by using not more than n log|X| + 1 bits. Prefixing these indices by 1, we have a code for all the sequences in X^n.

Note the following features of the above coding scheme:
- The code is one-to-one and easily decodable. The initial bit acts as a flag bit to indicate the length of the codeword that follows.
- We have used a brute-force enumeration of the atypical set A_ε^(n)c without taking into account the fact that the number of elements in A_ε^(n)c is less than the number of elements in X^n. Surprisingly, this is good enough to yield an efficient description.
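The flag-bit scheme can be written out directly for a source small enough to enumerate. This is my own sketch (n = 12, ε = 0.35 are illustrative choices; the encoder names are mine, not from the text), following the construction above: lexicographic index within the typical set after a '0' flag, raw index after a '1' flag otherwise.

```python
# Sketch of Cover & Thomas's two-part typical-set code for a tiny binary
# source, with lexicographic indexing and a flag bit, as described above.
from itertools import product
from math import log2, ceil

n, p, eps = 12, 0.1, 0.35
H = -p * log2(p) - (1 - p) * log2(1 - p)   # ~0.469 bits/symbol

def typical(x):
    r = sum(x)
    return abs(-(r * log2(p) + (n - r) * log2(1 - p)) / n - H) <= eps

seqs = list(product([0, 1], repeat=n))
A = sorted(x for x in seqs if typical(x))            # lexicographic order
others = sorted(x for x in seqs if not typical(x))
rank_A = {x: i for i, x in enumerate(A)}
rank_o = {x: i for i, x in enumerate(others)}

idx_bits_typ = ceil(n * (H + eps))                   # enough bits to index A
assert len(A) <= 2 ** idx_bits_typ

def encode(x):
    if x in rank_A:                                   # '0' flag + short index
        return "0" + format(rank_A[x], f"0{idx_bits_typ}b")
    return "1" + format(rank_o[x], f"0{n}b")          # '1' flag + raw n bits

codeword = encode(tuple([0] * n))                     # all-zeros string
print(len(codeword), codeword[0])    # 1 flag bit + 10 index bits, flag '0'
```

At this toy size the savings are modest (11 vs. 13 bits for typical strings), but the code is one-to-one, and for large n the typical branch dominates the expected length.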
By enumeration: the typical sequences have short descriptions of length ≈ nH. We use the notation x^n to denote a sequence x1, x2, ..., xn. Let l(x^n) be the length of the codeword corresponding to x^n. If n is sufficiently large so that Pr{A_ε^(n)} ≥ 1 - ε, the expected length of the codeword is

E[l(X^n)] = Σ_{x^n} p(x^n) l(x^n).   (3.17)
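Expectation (3.17) can be computed exactly for the binary source by summing over types rather than over all 2^n sequences. This sketch (my own; n = 100, ε = 0.1 are illustrative) evaluates E[l(X^n)] for the flag-bit scheme described above.

```python
# Sketch: evaluate E[l(X^n)] (eq. 3.17) for the two-part typical-set code
# by summing over types, and compare with the entropy H.
from math import comb, log2, ceil

n, p, eps = 100, 0.1, 0.1
H = -p * log2(p) - (1 - p) * log2(1 - p)   # ~0.469 bits/symbol

len_typ = ceil(n * (H + eps)) + 1          # '0' flag + index into A
len_atyp = n + 1                            # '1' flag + raw n-bit index

E_l = 0.0
for r in range(n + 1):                      # r = number of ones (the type)
    pr = comb(n, r) * p**r * (1 - p)**(n - r)
    per_sym = -(r * log2(p) + (n - r) * log2(1 - p)) / n
    E_l += pr * (len_typ if abs(per_sym - H) <= eps else len_atyp)

print(round(E_l / n, 3), "bits/symbol vs H =", round(H, 3))
```

At n = 100 the atypical mass is still sizable, so E[l]/n sits well above H but below 1 bit/symbol; as n grows with ε fixed, the expected rate approaches H + ε, matching the n(H + ε') bound in the text.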
AEP and data compression: "high-probability set" vs. "typical set"

The typical set is a small set of outcomes that contains most of the probability. Is it the smallest such set?
Some notation

Bit sequences of length 100, prob(1) = 0.1.

[Figure: strings in X^N ranked by probability, from the all-ones string down to the all-zeros string, annotated with N H(X), -log2 P(x), and the typical set T_{Nβ}.]

What's the difference? Why use the "typical set" rather than the "high-probability set"?
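The two sets can be compared concretely. The sketch below (my own; δ = 0.05 and β = 0.1 are illustrative choices) builds the smallest set of probability ≥ 1 - δ by taking the most probable strings first, and compares its size with the typical set T for N = 100, p1 = 0.1.

```python
# Sketch: smallest high-probability subset (greedy, most probable strings
# first) vs. the typical set, for N = 100, p1 = 0.1, grouped by type.
from math import comb, log2

N, p1, delta, beta = 100, 0.1, 0.05, 0.1
H = -p1 * log2(p1) - (1 - p1) * log2(1 - p1)   # ~0.469 bits/symbol

types = []
for r in range(N + 1):
    logp_one = r * log2(p1) + (N - r) * log2(1 - p1)  # log2 P of ONE string
    types.append((logp_one, comb(N, r), r))

# smallest 1-delta set: add strings in decreasing order of probability
types.sort(reverse=True)                    # fewer ones = more probable
mass, hp_size = 0.0, 0
for logp_one, count, r in types:
    mass += count * 2.0 ** logp_one
    hp_size += count
    if mass >= 1 - delta:
        break

# typical set: strings whose per-symbol log-prob is within beta of H
typ_size = sum(c for lp, c, r in types if abs(-lp / N - H) <= beta)
print("log2|smallest set| =", round(log2(hp_size), 1),
      " log2|typical set| =", round(log2(typ_size), 1))
```

The two sets differ (the high-probability set includes the all-zeros string; the typical set does not), but the log of either size scales like N H(X) as N grows. The typical set is preferred because its members all have nearly equal probability, which is what makes the equipartition-based proofs work.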