Chapter 11 Information Theory and Statistics
Peng-Hua Wang
Graduate Inst. of Comm. Engineering, National Taipei University
Chapter Outline
Chap. 11 Information Theory and Statistics
11.1 Method of Types
11.2 Law of Large Numbers
11.3 Universal Source Coding
11.4 Large Deviation Theory
11.5 Examples of Sanov's Theorem
11.6 Conditional Limit Theorem
11.7 Hypothesis Testing
11.8 Chernoff-Stein Lemma
11.9 Chernoff Information
11.10 Fisher Information and the Cramér-Rao Inequality
Peng-Hua Wang, May 21, 2012, Information Theory, Chap. 11
11.1 Method of Types
Definitions
Let X_1, X_2, ... be a sequence of n symbols from an alphabet X = {a_1, a_2, ..., a_M}, where M = |X| is the alphabet size.
x (also written x^n) denotes a sequence x_1, x_2, ..., x_n.
The type P_x (or empirical probability distribution) of a sequence x_1, x_2, ..., x_n is the relative frequency of each symbol of x:
P_x(a) = N(a|x)/n for all a ∈ X,
where N(a|x) is the number of times the symbol a occurs in the sequence x.
Example. Let X = {a, b, c} and x = aabca. Then the type P_x = P_aabca is
P_x(a) = 3/5, P_x(b) = 1/5, P_x(c) = 1/5, or P_x = (3/5, 1/5, 1/5).
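As a quick illustration of the definition, the following sketch computes the type of a sequence with exact fractions; the helper name empirical_type is ours, not from the text.

```python
from collections import Counter
from fractions import Fraction

def empirical_type(x):
    """Type (empirical distribution) P_x of a sequence: P_x(a) = N(a|x)/n."""
    n = len(x)
    return {a: Fraction(count, n) for a, count in Counter(x).items()}

P = empirical_type("aabca")
print(P)  # {'a': Fraction(3, 5), 'b': Fraction(1, 5), 'c': Fraction(1, 5)}
```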
Definitions
The type class T(P) is the set of sequences that have the same type:
T(P) = {x : P_x = P}.
Example. Let X = {a, b, c} and x = aabca. Then the type P_x = P_aabca is
P_x(a) = 3/5, P_x(b) = 1/5, P_x(c) = 1/5.
The type class T(P_x) is the set of all length-5 sequences that have 3 a's, 1 b, and 1 c:
T(P_x) = {aaabc, aabca, abcaa, bcaaa, ...}.
The number of elements in T(P_x) is
|T(P_x)| = (5 choose 3, 1, 1) = 5!/(3! 1! 1!) = 20.
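The type class of aabca is small enough to enumerate by brute force and check against the multinomial count; this sketch is ours.

```python
from itertools import permutations
from math import factorial

# All distinct rearrangements of aabca form its type class T(P_x).
type_class = {"".join(p) for p in permutations("aabca")}

# Multinomial coefficient 5!/(3! 1! 1!)
multinomial = factorial(5) // (factorial(3) * factorial(1) * factorial(1))
print(len(type_class), multinomial)  # 20 20
```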
Definitions
Let P_n denote the set of types with denominator n. For example, if X = {a, b, c},
P_n = {(x_1/n, x_2/n, x_3/n) : x_1 + x_2 + x_3 = n, x_1 ≥ 0, x_2 ≥ 0, x_3 ≥ 0},
where x_1 = nP(a), x_2 = nP(b), x_3 = nP(c).
Theorem. |P_n| ≤ (n+1)^M.
Proof. P_n = {(x_1/n, x_2/n, ..., x_M/n)} where 0 ≤ x_k ≤ n. Since there are n+1 choices for each x_k, the result follows.
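The bound (n+1)^M is loose but polynomial, which is all the method of types needs. For comparison, the exact count of types is the number of ways to write n as an ordered sum of M nonnegative integers, C(n+M-1, M-1); the helper name below is ours.

```python
from math import comb

def num_types(n, M):
    """Exact |P_n|: compositions of n into M nonnegative parts."""
    return comb(n + M - 1, M - 1)

n, M = 5, 3
print(num_types(n, M), (n + 1) ** M)  # 21 216
```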
Observations
The number of sequences of length n is M^n (exponential in n).
The number of types of sequences of length n is at most (n+1)^M (polynomial in n).
Therefore, at least one type has exponentially many sequences in its type class. In fact, the largest type class has essentially the same number of elements, to first order in the exponent, as the entire set of sequences.
Theorem
Theorem. If X_1, X_2, ..., X_n are drawn i.i.d. according to Q(x), the probability of x depends only on its type and is given by
Q^n(x) = 2^{-n(H(P_x) + D(P_x||Q))},
where
Q^n(x) = Pr(x) = ∏_{i=1}^n Pr(x_i) = ∏_{i=1}^n Q(x_i).
Proof.
Q^n(x) = ∏_{i=1}^n Q(x_i) = ∏_{a∈X} Q(a)^{N(a|x)} = ∏_{a∈X} Q(a)^{nP_x(a)}
= ∏_{a∈X} 2^{nP_x(a) log Q(a)} = 2^{n Σ_{a∈X} P_x(a) log Q(a)}.
Theorem
Proof. (cont.) Since
Σ_{a∈X} P_x(a) log Q(a)
= Σ_{a∈X} (P_x(a) log Q(a) + P_x(a) log P_x(a) - P_x(a) log P_x(a))
= -H(P_x) - D(P_x||Q),
we have
Q^n(x) = 2^{-n(H(P_x) + D(P_x||Q))}.
Corollary. If x is in the type class of Q, then Q^n(x) = 2^{-nH(Q)}.
Proof. If x ∈ T(Q), then P_x = Q and D(P_x||Q) = 0.
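The identity Q^n(x) = 2^{-n(H(P_x)+D(P_x||Q))} is easy to check numerically. In this sketch the distribution Q and the helper names are ours, chosen arbitrarily for illustration.

```python
from math import log2
from collections import Counter

def seq_prob(x, Q):
    """Direct probability Q^n(x) = prod of Q(x_i)."""
    p = 1.0
    for s in x:
        p *= Q[s]
    return p

def type_formula(x, Q):
    """2^{-n(H(P_x) + D(P_x||Q))} computed from the type of x."""
    n = len(x)
    Px = {a: c / n for a, c in Counter(x).items()}
    H = -sum(p * log2(p) for p in Px.values())
    D = sum(p * log2(p / Q[a]) for a, p in Px.items())
    return 2 ** (-n * (H + D))

Q = {"a": 0.5, "b": 0.3, "c": 0.2}
x = "aabca"
print(seq_prob(x, Q), type_formula(x, Q))  # the two values agree
```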
Size of T(P)
Next, we estimate the size of T(P). The exact size of T(P) is the multinomial coefficient
|T(P)| = ( n over nP(a_1), nP(a_2), ..., nP(a_M) ).
This value is hard to manipulate, so we give simple bounds on |T(P)|. We need the following lemmas.
Size of T(P)
Lemma. m!/n! ≥ n^{m-n}.
Proof. For m ≥ n,
m!/n! = (n+1)(n+2)···m ≥ n · n ··· n (m-n times) = n^{m-n}.
For m < n,
m!/n! = 1/((m+1)(m+2)···n) ≥ 1/(n · n ··· n) (n-m times) = n^{m-n}.
Size of T(P)
Lemma. The type class T(P) has the highest probability among all type classes under the probability distribution P:
P^n(T(P)) ≥ P^n(T(P̂)) for all P̂ ∈ P_n.
Proof.
P^n(T(P)) / P^n(T(P̂))
= [ |T(P)| ∏_{a∈X} P(a)^{nP(a)} ] / [ |T(P̂)| ∏_{a∈X} P(a)^{nP̂(a)} ]
= [ ( n over nP(a_1), ..., nP(a_M) ) / ( n over nP̂(a_1), ..., nP̂(a_M) ) ] ∏_{a∈X} P(a)^{n(P(a) - P̂(a))}
= ∏_{a∈X} [ (nP̂(a))! / (nP(a))! ] P(a)^{n(P(a) - P̂(a))}.
Size of T(P)
Proof. (cont.) Applying the lemma m!/n! ≥ n^{m-n} with m = nP̂(a) and n = nP(a),
≥ ∏_{a∈X} (nP(a))^{nP̂(a) - nP(a)} P(a)^{n(P(a) - P̂(a))}
= ∏_{a∈X} n^{n(P̂(a) - P(a))}
= n^{n Σ_a P̂(a) - n Σ_a P(a)} = n^{n-n} = 1.
Size of T(P)
Theorem.
(1/(n+1)^M) 2^{nH(P)} ≤ |T(P)| ≤ 2^{nH(P)}.
Note. The exact size of T(P) is
|T(P)| = ( n over nP(a_1), nP(a_2), ..., nP(a_M) ).
This value is hard to manipulate.
Proof. (upper bound) If X_1, X_2, ..., X_n are drawn i.i.d. from P, then
1 ≥ P^n(T(P)) = Σ_{x∈T(P)} ∏_{a∈X} P(a)^{nP(a)} = |T(P)| ∏_{a∈X} 2^{nP(a) log P(a)}
= |T(P)| 2^{n Σ_a P(a) log P(a)} = |T(P)| 2^{-nH(P)}.
Thus, |T(P)| ≤ 2^{nH(P)}.
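The two bounds can be checked against the exact multinomial for the running example P = (3/5, 1/5, 1/5); the helper name is ours.

```python
from math import factorial, log2

def type_class_size(counts):
    """Exact |T(P)| = n! / prod(counts!)."""
    n = sum(counts)
    size = factorial(n)
    for c in counts:
        size //= factorial(c)
    return size

counts = (3, 1, 1)            # type P = (3/5, 1/5, 1/5), n = 5, M = 3
n, M = sum(counts), len(counts)
H = -sum((c / n) * log2(c / n) for c in counts)
size = type_class_size(counts)
lower = 2 ** (n * H) / (n + 1) ** M
upper = 2 ** (n * H)
print(lower, size, upper)     # lower <= 20 <= upper
```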
Size of T(P)
Proof. (lower bound)
1 = Σ_{Q∈P_n} P^n(T(Q)) ≤ Σ_{Q∈P_n} max_Q P^n(T(Q))
= Σ_{Q∈P_n} P^n(T(P)) (by the previous lemma)
≤ (n+1)^M P^n(T(P)) = (n+1)^M |T(P)| 2^{-nH(P)}.
Probability of type class
Theorem. For any P ∈ P_n and any distribution Q, the probability of the type class T(P) under Q^n satisfies
(1/(n+1)^M) 2^{-nD(P||Q)} ≤ Q^n(T(P)) ≤ 2^{-nD(P||Q)}.
Proof.
Q^n(T(P)) = Σ_{x∈T(P)} Q^n(x) = Σ_{x∈T(P)} 2^{-n(H(P) + D(P||Q))} = |T(P)| 2^{-n(H(P) + D(P||Q))},
since P_x = P for every x ∈ T(P). Since
(1/(n+1)^M) 2^{nH(P)} ≤ |T(P)| ≤ 2^{nH(P)},
the bounds follow.
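A numerical check of these bounds for a small ternary example; the distribution Q here is an arbitrary choice of ours.

```python
from math import factorial, log2

# Verify (n+1)^{-M} 2^{-nD(P||Q)} <= Q^n(T(P)) <= 2^{-nD(P||Q)}.
counts = (3, 1, 1)                      # type P = (3/5, 1/5, 1/5)
Q = (0.5, 0.3, 0.2)
n, M = sum(counts), len(counts)
P = [c / n for c in counts]

size = factorial(n)
for c in counts:
    size //= factorial(c)

prob_one_seq = 1.0
for c, q in zip(counts, Q):
    prob_one_seq *= q ** c
Qn_TP = size * prob_one_seq             # every x in T(P) has the same probability

D = sum(p * log2(p / q) for p, q in zip(P, Q))
lower = 2 ** (-n * D) / (n + 1) ** M
upper = 2 ** (-n * D)
print(lower, Qn_TP, upper)
```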
Summary
|P_n| ≤ (n+1)^M.
Q^n(x) = 2^{-n(H(P_x) + D(P_x||Q))}.
(1/n) log |T(P)| → H(P) as n → ∞.
(1/n) log Q^n(T(P)) → -D(P||Q) as n → ∞.
If X_i ~ Q, the probability of sequences with type P ≠ Q approaches 0 as n → ∞. The typical sequences are those in T(Q).
11.2 Law of Large Numbers
Typical Sequences
Given ε > 0, the typical set T_Q^ε for the distribution Q^n is defined as
T_Q^ε = {x : D(P_x||Q) ≤ ε}.
The probability that x is nontypical is
1 - Q^n(T_Q^ε) = Σ_{P: D(P||Q)>ε} Q^n(T(P))
≤ Σ_{P: D(P||Q)>ε} 2^{-nD(P||Q)}
≤ Σ_{P: D(P||Q)>ε} 2^{-nε}
≤ (n+1)^M 2^{-nε}
= 2^{-n(ε - M log(n+1)/n)}.
Theorem
Theorem. Let X_1, X_2, ... be i.i.d. ~ P(x). Then
Pr(D(P_x||P) > ε) ≤ 2^{-n(ε - M log(n+1)/n)}.
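A quick simulation (ours, seeded for reproducibility) illustrates the resulting convergence D(P_x||P) → 0 as n grows.

```python
import random
from math import log2
from collections import Counter

random.seed(0)
P = [1 / 3, 1 / 3, 1 / 3]     # uniform distribution on {0, 1, 2}

def empirical_divergence(n):
    """Draw n i.i.d. samples from P and return D(P_x || P) in bits."""
    x = random.choices(range(3), weights=P, k=n)
    counts = Counter(x)
    return sum((c / n) * log2((c / n) / P[a]) for a, c in counts.items())

for n in (10, 100, 10000):
    print(n, empirical_divergence(n))   # divergence shrinks as n grows
```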
11.3 Universal Source Coding
Introduction
An i.i.d. source with a known distribution p(x) can be compressed to its entropy H(X) by Huffman coding.
If the wrong code, designed for an incorrect distribution q(x), is used, a penalty of D(p||q) bits is incurred.
Is there a universal code of rate R that is sufficient to compress every i.i.d. source with entropy H(X) < R?
Concept
There are at most 2^{nH(P)} sequences of type P.
There are no more than (n+1)^{|X|} types (polynomial in n).
Hence there are no more than (n+1)^{|X|} 2^{nH(P)} sequences to describe.
If H(P) < R, there are no more than (n+1)^{|X|} 2^{nR} sequences to describe, which requires about nR bits as n → ∞.
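The counting argument suggests a two-part universal description: first the type of x (at most M·⌈log2(n+1)⌉ bits), then the index of x within its type class (at most nH(P_x) + 1 bits, since |T(P)| ≤ 2^{nH(P)}). A sketch of the resulting length, with our helper name:

```python
from math import factorial, ceil, log2

def universal_code_length(counts):
    """Bits for a two-part code: describe the type, then the index
    of the sequence within its type class."""
    n, M = sum(counts), len(counts)
    size = factorial(n)                 # |T(P)| = n! / prod(counts!)
    for c in counts:
        size //= factorial(c)
    return M * ceil(log2(n + 1)) + ceil(log2(size))

# A length-100 binary sequence with 30 ones: about nH(0.3) bits plus
# a polynomial-in-n overhead for the type.
bits = universal_code_length((70, 30))
print(bits)
```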
11.4 Large Deviation Theory
Large Deviation Theory
If X_i is i.i.d. Bernoulli with Pr(X_i = 1) = 1/3, what is the probability that (1/n) Σ_{i=1}^n X_i is near 1/3? This is a small deviation. (Deviation means deviation from the expected outcome.) The probability is near 1.
What is the probability that (1/n) Σ_{i=1}^n X_i is greater than 3/4? This is a large deviation. The probability is exponentially small.
We might estimate the exponent using the central limit theorem, but this is a poor approximation beyond a few standard deviations.
We note that (1/n) Σ_{i=1}^n X_i = 3/4 is equivalent to P_x = (3/4, 1/4), listing (Pr(1), Pr(0)). Thus, the probability is approximately
2^{-nD(P_x||Q)} = 2^{-nD((3/4, 1/4) || (1/3, 2/3))}.
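The exponent can be compared against the exact binomial tail for a moderate n; this check (our sketch) shows the empirical exponent -(1/n) log2 Pr approaching D((3/4, 1/4)||(1/3, 2/3)).

```python
from math import comb, log2

def kl(p, q):
    """D((p,1-p) || (q,1-q)) in bits for two Bernoulli parameters."""
    return p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))

q, a = 1 / 3, 3 / 4
n = 400
tail = sum(comb(n, k) * q ** k * (1 - q) ** (n - k)
           for k in range(int(n * a), n + 1))   # Pr(sum >= 3n/4)
exponent = -log2(tail) / n
print(exponent, kl(a, q))   # empirical exponent vs. D((3/4,1/4)||(1/3,2/3))
```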
Definition
Let E be a subset of the set of probability mass functions. We write (with a slight abuse of notation)
Q^n(E) = Q^n(E ∩ P_n) = Σ_{x: P_x ∈ E ∩ P_n} Q^n(x).
Why do we say this is a slight abuse of notation? The reason is that Q^n(·) in its original meaning represents the probability of a set of sequences, but here we borrow the notation for a set of probability mass functions.
For example, let X = {0, 1} and let E_1 be the set of probability mass functions with mean 1. Then E_1 = {(0, 1)}, a single point mass.
Let E_2 be the set of probability mass functions with mean √2/2. Then E_2 ∩ P_n = ∅. (Why? Every type has rational components.)
Definition
If E contains a relative entropy neighborhood of Q, then by the weak law of large numbers Q^n(E) → 1. Specifically, if {P : D(P||Q) < ε} ⊆ E, then
Q^n(E) ≥ Pr(D(P_x||Q) < ε) ≥ 1 - 2^{-n(ε - |X| log(n+1)/n)} → 1.
Otherwise, Q^n(E) → 0 exponentially fast. We will use the method of types to calculate the exponent (rate function).
Example
Suppose we observe that the sample average of g(X) is greater than or equal to α. This event is equivalent to the event P_X ∈ E ∩ P_n, where
E = {P : Σ_{a∈X} g(a)P(a) ≥ α}.
Because
(1/n) Σ_{i=1}^n g(x_i) ≥ α ⟺ Σ_{a∈X} P_x(a) g(a) ≥ α ⟺ P_x ∈ E ∩ P_n,
we have
Pr( (1/n) Σ_{i=1}^n g(X_i) ≥ α ) = Q^n(E ∩ P_n) = Q^n(E).
Sanov's Theorem
Theorem. Let X_1, X_2, ..., X_n be i.i.d. ~ Q(x). Let E ⊆ P be a set of probability distributions. Then
Q^n(E) = Q^n(E ∩ P_n) ≤ (n+1)^{|X|} 2^{-nD(P*||Q)},
where
P* = argmin_{P∈E} D(P||Q)
is the distribution in E that is closest to Q in relative entropy. If, in addition, E ∩ P_n ≠ ∅ for all n ≥ n_0 for some n_0, then
(1/n) log Q^n(E) → -D(P*||Q).
Proof of Upper Bound
Q^n(E) = Σ_{P∈E∩P_n} Q^n(T(P))
≤ Σ_{P∈E∩P_n} 2^{-nD(P||Q)}
≤ Σ_{P∈E∩P_n} max_{P∈E∩P_n} 2^{-nD(P||Q)}
≤ Σ_{P∈E∩P_n} max_{P∈E} 2^{-nD(P||Q)}
= Σ_{P∈E∩P_n} 2^{-n min_{P∈E} D(P||Q)}
= Σ_{P∈E∩P_n} 2^{-nD(P*||Q)}
≤ (n+1)^{|X|} 2^{-nD(P*||Q)}.
Proof of Lower Bound
Since E ∩ P_n ≠ ∅ for all n ≥ n_0, we can find a sequence of distributions P_n ∈ E ∩ P_n such that D(P_n||Q) → D(P*||Q), and
Q^n(E) = Σ_{P∈E∩P_n} Q^n(T(P)) ≥ Q^n(T(P_n)) ≥ (1/(n+1)^{|X|}) 2^{-nD(P_n||Q)}.
Accordingly, D(P*||Q) is the large deviation rate function.
Example 1
Suppose that we toss a fair die n times. What is the probability that the average of the throws is greater than or equal to 4? From Sanov's theorem, the large deviation rate function is D(P*||Q), where P* minimizes D(P||Q) over all distributions P that satisfy
Σ_{i=1}^6 i p_i ≥ 4, Σ_{i=1}^6 p_i = 1.
By Lagrange multipliers, we construct the cost function
J = Σ_{i=1}^6 p_i ln(p_i/q_i) + λ Σ_{i=1}^6 i p_i + μ Σ_{i=1}^6 p_i.
Example 1
Setting ∂J/∂p_i = 0 gives
ln(6 p_i) + 1 + iλ + μ = 0  ⟹  p_i = (1/6) e^{-1-μ} e^{-iλ}.
Substituting into the constraints,
Σ_{i=1}^6 i p_i = (e^{-1-μ}/6) Σ_{i=1}^6 i e^{-iλ} = 4,
Σ_{i=1}^6 p_i = (e^{-1-μ}/6) Σ_{i=1}^6 e^{-iλ} = 1.
We can solve numerically: e^{-λ} = 1.190804264, and
P* = (0.1031, 0.1227, 0.1461, 0.1740, 0.2072, 0.2468).
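The numerical solution above amounts to finding r = e^{-λ} such that the tilted distribution p_i ∝ r^i (i = 1..6) has mean 4; a simple bisection (our sketch) reproduces it.

```python
# Find r such that p_i proportional to r^i (i = 1..6) has mean 4.
def mean_of_tilted(r):
    w = [r ** i for i in range(1, 7)]
    return sum(i * wi for i, wi in zip(range(1, 7), w)) / sum(w)

lo, hi = 1.0, 2.0            # mean is 3.5 at r = 1 and increases with r
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_of_tilted(mid) < 4:
        lo = mid
    else:
        hi = mid
r = (lo + hi) / 2
w = [r ** i for i in range(1, 7)]
P_star = [wi / sum(w) for wi in w]
print(r)                               # ~1.1908
print([round(p, 4) for p in P_star])   # ~(0.1031, ..., 0.2468)
```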
Example 1
The probability that the average of 10000 throws is greater than or equal to 4 is about
2^{-nD(P*||Q)} ≈ 2^{-624}.
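Recomputing the exponent n·D(P*||Q) from the rounded P* values above (our check; the rounding shifts the result by a unit or two around the quoted 624):

```python
from math import log2

P_star = [0.1031, 0.1227, 0.1461, 0.1740, 0.2072, 0.2468]
n = 10000
# D(P*||Q) in bits, with Q uniform over the six faces: q_i = 1/6.
D = sum(p * log2(6 * p) for p in P_star)
print(n * D)   # close to the quoted exponent of 624
```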