Chapter 11. Information Theory and Statistics

Chapter 11. Information Theory and Statistics
Peng-Hua Wang, Graduate Inst. of Comm. Engineering, National Taipei University

Chapter Outline. Chap. 11 Information Theory and Statistics: 11.1 Method of Types, 11.2 Law of Large Numbers, 11.3 Universal Source Coding, 11.4 Large Deviation Theory, 11.5 Examples of Sanov's Theorem, 11.6 Conditional Limit Theorem, 11.7 Hypothesis Testing, 11.8 Chernoff-Stein Lemma, 11.9 Chernoff Information, 11.10 Fisher Information and the Cramér-Rao Inequality.

11.1 Method of Types

Definitions. Let X_1, X_2, ..., X_n be a sequence of n symbols from an alphabet X = {a_1, a_2, ..., a_M}, where M = |X| is the alphabet size. We write x (or x^n) for the sequence x_1, x_2, ..., x_n. The type P_x (or empirical probability distribution) of a sequence x_1, x_2, ..., x_n is the relative frequency of each symbol of X: P_x(a) = N(a|x)/n for all a ∈ X, where N(a|x) is the number of times the symbol a occurs in the sequence x. Example. Let X = {a, b, c} and x = aabca. Then the type P_x = P_aabca is P_x(a) = 3/5, P_x(b) = 1/5, P_x(c) = 1/5, that is, P_x = (3/5, 1/5, 1/5).
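As a concrete illustration (not part of the original slides), here is a minimal Python sketch that computes the type of a sequence; the example string aabca follows the slide above.

from collections import Counter

def type_of(sequence):
    """Return the type (empirical distribution) P_x(a) = N(a|x)/n."""
    n = len(sequence)
    counts = Counter(sequence)
    return {a: counts[a] / n for a in sorted(counts)}

print(type_of("aabca"))  # {'a': 0.6, 'b': 0.2, 'c': 0.2}, i.e. P_x = (3/5, 1/5, 1/5)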

Definitions. The type class T(P) is the set of sequences that have the same type P: T(P) = {x : P_x = P}. Example. Let X = {a, b, c} and x = aabca. Then the type P_x = P_aabca is P_x(a) = 3/5, P_x(b) = 1/5, P_x(c) = 1/5. The type class T(P_x) is the set of length-5 sequences that contain 3 a's, 1 b, and 1 c: T(P_x) = {aaabc, aabca, abcaa, bcaaa, ...}. The number of elements in T(P_x) is |T(P_x)| = (5 choose 3, 1, 1) = 5!/(3! 1! 1!) = 20.
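A short sketch of mine (standard library only) that counts the type class both by the multinomial coefficient and by brute force, reproducing |T(P_aabca)| = 20.

from collections import Counter
from itertools import permutations
from math import factorial, prod

def type_class_size(sequence):
    """|T(P_x)| = n! / prod_a N(a|x)! for the type of the given sequence."""
    counts = Counter(sequence)
    return factorial(len(sequence)) // prod(factorial(c) for c in counts.values())

def type_class_size_brute(sequence):
    """Brute force for short sequences: count distinct rearrangements."""
    return len(set(permutations(sequence)))

print(type_class_size("aabca"), type_class_size_brute("aabca"))  # 20 20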

Definitions. Let P_n denote the set of types with denominator n. For example, if X = {a, b, c}, then P_n = {(x_1/n, x_2/n, x_3/n) : x_1 + x_2 + x_3 = n, x_1 ≥ 0, x_2 ≥ 0, x_3 ≥ 0}, where x_1/n = P(a), x_2/n = P(b), x_3/n = P(c). Theorem. |P_n| ≤ (n+1)^M. Proof. P_n = {(x_1/n, x_2/n, ..., x_M/n)} where 0 ≤ x_k ≤ n. Since there are n+1 choices for each x_k, the result follows.
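To make the bound concrete, a small sketch of mine enumerates the count vectors defining P_n for an assumed ternary alphabet (M = 3) and compares |P_n| with (n+1)^M.

def types_with_denominator(n, M):
    """All (x_1, ..., x_M) with nonnegative integer entries summing to n."""
    if M == 1:
        return [(n,)]
    return [(x1,) + rest
            for x1 in range(n + 1)
            for rest in types_with_denominator(n - x1, M - 1)]

M = 3
for n in (1, 5, 10):
    count = len(types_with_denominator(n, M))
    print(n, count, (n + 1) ** M)  # |P_n| stays well below the (n+1)^M bound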

Observations. The number of sequences of length n is M^n (exponential in n). The number of types of length n is at most (n+1)^M (polynomial in n). Therefore, at least one type has exponentially many sequences in its type class. In fact, the largest type class has essentially the same number of elements, to first order in the exponent, as the entire set of sequences.

Theorem. If X_1, X_2, ..., X_n are drawn i.i.d. according to Q(x), the probability of x depends only on its type and is given by Q^n(x) = 2^{-n(H(P_x) + D(P_x||Q))}, where Q^n(x) = Pr(x) = Π_{i=1}^n Pr(x_i) = Π_{i=1}^n Q(x_i). Proof. Q^n(x) = Π_{i=1}^n Q(x_i) = Π_{a∈X} Q(a)^{N(a|x)} = Π_{a∈X} Q(a)^{nP_x(a)} = Π_{a∈X} 2^{nP_x(a) log Q(a)} = 2^{n Σ_{a∈X} P_x(a) log Q(a)}.

Theorem. Proof (cont.) Since Σ_{a∈X} P_x(a) log Q(a) = Σ_{a∈X} (P_x(a) log Q(a) + P_x(a) log P_x(a) − P_x(a) log P_x(a)) = −H(P_x) − D(P_x||Q), we have Q^n(x) = 2^{-n(H(P_x) + D(P_x||Q))}. Corollary. If x is in the type class of Q, then Q^n(x) = 2^{-nH(Q)}. Proof. If x ∈ T(Q), then P_x = Q and D(P_x||Q) = 0.
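The identity can be checked numerically. Below is a minimal sketch of mine for the sequence aabca; the distribution Q = (0.5, 0.3, 0.2) is an assumed example, not from the slides.

from collections import Counter
from math import log2, prod

x = "aabca"
Q = {"a": 0.5, "b": 0.3, "c": 0.2}              # assumed source distribution
n = len(x)
P = {a: c / n for a, c in Counter(x).items()}    # type of x

direct = prod(Q[s] for s in x)                          # Q^n(x) = product of Q(x_i)
H = -sum(p * log2(p) for p in P.values())               # H(P_x)
D = sum(p * log2(p / Q[a]) for a, p in P.items())       # D(P_x||Q)
print(direct, 2 ** (-n * (H + D)))                      # both ~0.0075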

Size of T(P). Next, we estimate the size of T(P). The exact size of T(P) is the multinomial coefficient |T(P)| = (n choose nP(a_1), nP(a_2), ..., nP(a_M)) = n! / ((nP(a_1))! (nP(a_2))! ... (nP(a_M))!). This value is hard to manipulate, so we give a simple bound on |T(P)|. We need the following lemmas.

Size of T(P). Lemma. m!/n! ≥ n^{m−n}. Proof. For m ≥ n, m!/n! = (n+1)(n+2)⋯m ≥ n · n ⋯ n (m−n times) = n^{m−n}. For m < n, m!/n! = 1/((m+1)(m+2)⋯n) ≥ 1/(n · n ⋯ n) (n−m factors) = n^{m−n}.

Size of T(P). Lemma. The type class T(P) has the highest probability among all type classes under the probability distribution P^n: P^n(T(P)) ≥ P^n(T(P̂)) for all P̂ ∈ P_n. Proof. P^n(T(P)) / P^n(T(P̂)) = [ |T(P)| Π_{a∈X} P(a)^{nP(a)} ] / [ |T(P̂)| Π_{a∈X} P(a)^{nP̂(a)} ] = [ n! / Π_{a∈X} (nP(a))! ] / [ n! / Π_{a∈X} (nP̂(a))! ] · Π_{a∈X} P(a)^{n(P(a) − P̂(a))} = Π_{a∈X} [ (nP̂(a))! / (nP(a))! ] P(a)^{n(P(a) − P̂(a))}.

Size of T(P). Proof (cont.) Applying the lemma m!/n! ≥ n^{m−n} with m = nP̂(a) and n = nP(a) to each factor, P^n(T(P)) / P^n(T(P̂)) ≥ Π_{a∈X} (nP(a))^{nP̂(a) − nP(a)} P(a)^{n(P(a) − P̂(a))} = Π_{a∈X} n^{nP̂(a) − nP(a)} (the powers of P(a) cancel) = n^{n Σ_{a∈X} P̂(a) − n Σ_{a∈X} P(a)} = n^{n − n} = 1.

Size of T(P). Theorem. (1/(n+1)^M) 2^{nH(P)} ≤ |T(P)| ≤ 2^{nH(P)}. Note. The exact size of T(P) is |T(P)| = n! / ((nP(a_1))! (nP(a_2))! ... (nP(a_M))!), which is hard to manipulate. Proof (upper bound). If X_1, X_2, ..., X_n are drawn i.i.d. from P, then 1 ≥ P^n(T(P)) = Σ_{x∈T(P)} Π_{a∈X} P(a)^{nP(a)} = |T(P)| Π_{a∈X} 2^{nP(a) log P(a)} = |T(P)| 2^{n Σ_{a∈X} P(a) log P(a)} = |T(P)| 2^{-nH(P)}. Thus |T(P)| ≤ 2^{nH(P)}.

Size of T(P). Proof (lower bound). 1 = Σ_{Q∈P_n} P^n(T(Q)) ≤ Σ_{Q∈P_n} max_{Q∈P_n} P^n(T(Q)) = Σ_{Q∈P_n} P^n(T(P)) (by the previous lemma) ≤ (n+1)^M P^n(T(P)) = (n+1)^M |T(P)| 2^{-nH(P)}. Hence |T(P)| ≥ (1/(n+1)^M) 2^{nH(P)}.
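A quick numerical check of the two bounds (a sketch of mine, with an assumed ternary type P = (3/5, 1/5, 1/5) and a few block lengths n chosen so that every nP(a) is an integer):

from math import factorial, log2, prod

P = (0.6, 0.2, 0.2)                 # assumed type; pick n so nP(a) is an integer
M = len(P)
H = -sum(p * log2(p) for p in P)    # H(P)

for n in (5, 25, 50):
    counts = [round(n * p) for p in P]
    size = factorial(n) // prod(factorial(c) for c in counts)  # exact |T(P)|
    lower = 2 ** (n * H) / (n + 1) ** M
    upper = 2 ** (n * H)
    print(n, lower <= size <= upper)  # True for each n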

Probability of a type class. Theorem. For any P ∈ P_n and any distribution Q, the probability of the type class T(P) under Q^n satisfies (1/(n+1)^M) 2^{-nD(P||Q)} ≤ Q^n(T(P)) ≤ 2^{-nD(P||Q)}. Proof. Q^n(T(P)) = Σ_{x∈T(P)} Q^n(x) = Σ_{x∈T(P)} 2^{-n(H(P) + D(P||Q))} = |T(P)| 2^{-n(H(P) + D(P||Q))}, since P_x = P for every x ∈ T(P). Since (1/(n+1)^M) 2^{nH(P)} ≤ |T(P)| ≤ 2^{nH(P)}, the bounds follow.

Summary.
|P_n| ≤ (n+1)^M.
Q^n(x) = 2^{-n(H(P_x) + D(P_x||Q))}.
(1/n) log |T(P)| → H(P) as n → ∞.
(1/n) log Q^n(T(P)) → −D(P||Q) as n → ∞.
If X_i ~ Q, the probability of sequences with type P ≠ Q approaches 0 as n → ∞. The typical sequences are those in T(Q).

11.2 Law of Large Numbers

Typical Sequences. Given ε > 0, the typical set T_Q^ε for the distribution Q^n is defined as T_Q^ε = {x : D(P_x||Q) ≤ ε}. The probability that x is nontypical is 1 − Q^n(T_Q^ε) = Σ_{P : D(P||Q) > ε} Q^n(T(P)) ≤ Σ_{P : D(P||Q) > ε} 2^{-nD(P||Q)} ≤ Σ_{P : D(P||Q) > ε} 2^{-nε} ≤ (n+1)^M 2^{-nε} = 2^{-n(ε − M log(n+1)/n)}.

Theorem. Let X_1, X_2, ... be i.i.d. ~ P(x). Then Pr(D(P_x||P) > ε) ≤ 2^{-n(ε − M log(n+1)/n)}, which tends to 0 as n → ∞.
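The bound is only nontrivial once ε exceeds M log(n+1)/n. A small sketch of mine, with assumed values M = 3 and ε = 0.1, shows how large n must be and how fast the bound then decays:

from math import log2

M, eps = 3, 0.1                      # assumed alphabet size and threshold
for n in (100, 500, 1000, 5000):
    exponent = n * (eps - M * log2(n + 1) / n)
    print(n, 2 ** (-exponent) if exponent > 0 else "bound still trivial (>= 1)")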

11.3 Universal Source Coding

Introduction. An i.i.d. source with a known distribution p(x) can be compressed to its entropy H(X) by Huffman coding. If the wrong code, designed for an incorrect distribution q(x), is used, a penalty of D(p||q) bits per symbol is incurred. Is there a universal code of rate R that is sufficient to compress every i.i.d. source with entropy H(X) < R?

Concept. There are at most 2^{nH(P)} sequences of type P. There are no more than (n+1)^{|X|} types (polynomially many in n). Hence there are no more than (n+1)^{|X|} 2^{nH(P)} sequences of any single type to describe, and if we only need the types with H(P) < R, there are no more than (n+1)^{|X|} 2^{nR} sequences to describe. This requires about nR bits as n → ∞, since the polynomial factor contributes only |X| log(n+1) bits. A two-part description along these lines is sketched below.
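The following sketch (my own illustration, not code from the lecture) encodes a sequence by first indexing its type and then indexing its position within the type class; the total length is at most |X| log(n+1) + nH(P_x) + 2 bits. The example string is an assumption.

from collections import Counter
from math import ceil, comb, factorial, log2, prod

def two_part_length(x, alphabet):
    """Bits to describe x as (index of its type, index within its type class)."""
    n, M = len(x), len(alphabet)
    counts = Counter(x)
    num_types = comb(n + M - 1, M - 1)   # exact |P_n|, at most (n+1)^M
    class_size = factorial(n) // prod(factorial(counts[a]) for a in alphabet)
    return ceil(log2(num_types)) + ceil(log2(class_size))

x = "aabcaabbacabaacab"
print(two_part_length(x, "abc"), "bits for", len(x), "symbols")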

11.4 Large Deviation Theory

Large Deviation Theory. If the X_i are i.i.d. Bernoulli with Pr(X_i = 1) = 1/3, what is the probability that (1/n) Σ_{i=1}^n X_i is near 1/3? This is a small deviation (deviation means deviation from the expected outcome), and the probability is near 1. What is the probability that (1/n) Σ_{i=1}^n X_i is greater than 3/4? This is a large deviation, and the probability is exponentially small. We might estimate the exponent using the central limit theorem, but this is a poor approximation beyond a few standard deviations. We note that (1/n) Σ_{i=1}^n X_i = 3/4 is equivalent to P_x = (3/4, 1/4), listing the probabilities of 1 and 0 respectively. Thus, the probability is approximately 2^{-nD(P_x||Q)} = 2^{-nD((3/4, 1/4)||(1/3, 2/3))}.
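A minimal sketch of mine computing this estimate: D((3/4, 1/4)||(1/3, 2/3)) in bits and the resulting probability approximation for an assumed sample size n = 1000.

from math import log2

def kl(p, q):
    """Relative entropy D(p||q) in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = (3/4, 1/4)   # type with sample mean 3/4 (first entry = fraction of ones)
Q = (1/3, 2/3)   # Bernoulli(1/3) source, same ordering
D = kl(P, Q)
n = 1000         # assumed block length, just for illustration
print(D, 2 ** (-n * D))  # D ~ 0.52 bits, probability ~ 2^-524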

Definition. Let E be a subset of the set of probability mass functions. We write (with a slight abuse of notation) Q^n(E) = Q^n(E ∩ P_n) = Σ_{x : P_x ∈ E ∩ P_n} Q^n(x). Why is this a slight abuse of notation? Q^n(·) in its original meaning represents the probability of a set of sequences, but here we borrow the notation for a set of probability mass functions. For example, let |X| = 2, say X = {0, 1}, and let E_1 be the set of probability mass functions with mean 1; then E_1 contains only the point mass on 1, which is a type in P_n for every n. Now let E_2 be the set of probability mass functions with mean √2/2; then E_2 ∩ P_n = ∅ for every n. (Why? A type has rational probabilities, so its mean is rational, but √2/2 is irrational.)

Definition. If E contains a relative entropy neighborhood of Q, then by the weak law of large numbers Q^n(E) → 1. Specifically, if E ⊇ {P : D(P||Q) ≤ ε} for some ε > 0, then Q^n(E) ≥ Q^n({x : D(P_x||Q) ≤ ε}) = Pr(D(P_x||Q) ≤ ε) ≥ 1 − 2^{-n(ε − |X| log(n+1)/n)} → 1. Otherwise, Q^n(E) → 0 exponentially fast. We will use the method of types to calculate the exponent (the rate function).

Example. Suppose we observe that the sample average of g(X) is greater than or equal to α. This event is equivalent to the event P_x ∈ E ∩ P_n, where E = {P : Σ_{a∈X} g(a)P(a) ≥ α}, because (1/n) Σ_{i=1}^n g(x_i) ≥ α ⇔ Σ_{a∈X} P_x(a) g(a) ≥ α ⇔ P_x ∈ E ∩ P_n. Thus, Pr((1/n) Σ_{i=1}^n g(X_i) ≥ α) = Q^n(E ∩ P_n) = Q^n(E).

Sanov's Theorem. Theorem. Let X_1, X_2, ..., X_n be i.i.d. ~ Q(x). Let E ⊆ P be a set of probability distributions. Then Q^n(E) = Q^n(E ∩ P_n) ≤ (n+1)^{|X|} 2^{-nD(P*||Q)}, where P* = arg min_{P∈E} D(P||Q) is the distribution in E that is closest to Q in relative entropy. If, in addition, E ∩ P_n ≠ ∅ for all n ≥ n_0 for some n_0, then (1/n) log Q^n(E) → −D(P*||Q).

Proof of Upper Bound. Q^n(E) = Σ_{P∈E∩P_n} Q^n(T(P)) ≤ Σ_{P∈E∩P_n} 2^{-nD(P||Q)} ≤ Σ_{P∈E∩P_n} max_{P∈E∩P_n} 2^{-nD(P||Q)} ≤ Σ_{P∈E∩P_n} max_{P∈E} 2^{-nD(P||Q)} = Σ_{P∈E∩P_n} 2^{-n min_{P∈E} D(P||Q)} = Σ_{P∈E∩P_n} 2^{-nD(P*||Q)} ≤ (n+1)^{|X|} 2^{-nD(P*||Q)}.

Proof of Lower Bound. Since E ∩ P_n ≠ ∅ for all n ≥ n_0, we can find a sequence of distributions P_n ∈ E ∩ P_n such that D(P_n||Q) → D(P*||Q), and Q^n(E) = Σ_{P∈E∩P_n} Q^n(T(P)) ≥ Q^n(T(P_n)) ≥ (1/(n+1)^{|X|}) 2^{-nD(P_n||Q)}. Hence (1/n) log Q^n(E) ≥ −D(P_n||Q) − |X| log(n+1)/n → −D(P*||Q), matching the upper bound. Accordingly, D(P*||Q) is the large deviation rate function.

Example 1. Suppose that we toss a fair die n times. What is the probability that the average of the throws is greater than or equal to 4? From Sanov's theorem, the large deviation rate function is D(P*||Q), where P* minimizes D(P||Q) over all distributions P that satisfy Σ_{i=1}^6 i p_i ≥ 4 and Σ_{i=1}^6 p_i = 1. Using Lagrange multipliers, we construct the functional J(P) = Σ_{i=1}^6 p_i ln(p_i/q_i) + λ Σ_{i=1}^6 i p_i + µ Σ_{i=1}^6 p_i.

Example 1. Setting ∂J/∂p_i = 0 gives ln(6 p_i) + 1 + iλ + µ = 0, so p_i = (1/6) e^{-1-µ} e^{-iλ}. Substituting into the constraints,
Σ_{i=1}^6 i p_i = (e^{-1-µ}/6) Σ_{i=1}^6 i e^{-iλ} = 4,
Σ_{i=1}^6 p_i = (e^{-1-µ}/6) Σ_{i=1}^6 e^{-iλ} = 1.
Solving numerically gives e^{-λ} = 1.190804264, and P* = (0.1031, 0.1227, 0.1461, 0.1740, 0.2072, 0.2468).

Example 1. The probability that the average of 10,000 throws is greater than or equal to 4 is about 2^{-nD(P*||Q)} ≈ 2^{-624}.
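The numerical solution can be reproduced with a short script (a sketch of mine using a simple bisection search, not the lecture's code): solve Σ i e^{-iλ} / Σ e^{-iλ} = 4 for the ratio e^{-λ}, recover P*, and evaluate D(P*||Q) and 2^{-nD} for n = 10000.

from math import log2

def mean_given_ratio(r):
    """Mean of the tilted pmf p_i proportional to r**i, for faces i = 1..6."""
    w = [r ** i for i in range(1, 7)]
    return sum(i * wi for i, wi in zip(range(1, 7), w)) / sum(w)

# Bisection for r = e^{-lambda} such that the constrained mean equals 4.
lo, hi = 1.0, 2.0
for _ in range(60):
    mid = (lo + hi) / 2
    if mean_given_ratio(mid) < 4:
        lo = mid
    else:
        hi = mid
r = (lo + hi) / 2

w = [r ** i for i in range(1, 7)]
P_star = [wi / sum(w) for wi in w]
D = sum(p * log2(6 * p) for p in P_star)   # D(P*||uniform) in bits

print(round(r, 9))                          # ~1.190804264, matching the slide
print([round(p, 4) for p in P_star])        # ~[0.1031, 0.1227, ..., 0.2468]
print(D, 2 ** (-10000 * D))                 # ~0.0624 bits; about 2^-624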