Chapter 3: Asymptotic Equipartition Property


Chapter 3 outline
- Strong vs. weak typicality
- Convergence of random variables
- The Asymptotic Equipartition Property theorem
- High-probability sets and the typical set
- Consequences of the AEP: data compression

Strong versus Weak Typicality

What is the intuition behind typicality? As an example, consider bit sequences of length n = 8 with prob(1) = p and prob(0) = 1 - p. Which sequences are strongly typical? Which are weakly typical? What happens when p = 0.5? (See the numerical sketch below.)
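A minimal numerical sketch of the distinction (Python; the values of p, n and eps below are illustrative choices, not fixed by the slides): a length-n sequence is taken to be weakly typical when its per-symbol log-likelihood is within eps of H(p), and strongly typical when its empirical fraction of ones is within eps of p.

# Weak vs. strong typicality for i.i.d. Bernoulli(p) bit sequences of length n.
# Illustrative sketch: p, n and eps are assumptions, not values fixed by the slides.
from itertools import product
from math import log2

def entropy(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

def classify(n=8, p=0.1, eps=0.1):
    H = entropy(p)
    weak, strong = [], []
    for bits in product([0, 1], repeat=n):
        r = sum(bits)                               # number of ones
        logp = r * log2(p) + (n - r) * log2(1 - p)  # log2 P(x)
        if abs(-logp / n - H) <= eps:               # weakly typical
            weak.append(bits)
        if abs(r / n - p) <= eps:                   # strongly typical (binary alphabet)
            strong.append(bits)
    return weak, strong

weak, strong = classify()
print(f"weakly typical: {len(weak)}, strongly typical: {len(strong)} (out of 256)")
# With p = 0.5 every sequence has -log2 P(x) / n = 1 = H(0.5), so for any eps > 0
# all 2^n sequences are weakly typical -- the set no longer singles anything out.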

Convergence of random variables: the weak law of large numbers and the AEP (both stated below).
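For reference, the two statements this slide puts side by side, in the notation used later in the chapter (a brief sketch: the AEP is the WLLN applied to the i.i.d. random variables \(\log_2 \frac{1}{p(X_i)}\), whose mean is \(H(X)\)):

\[
\frac{1}{n}\sum_{i=1}^{n} X_i \;\to\; E[X] \quad \text{in probability} \qquad \text{(weak law of large numbers)},
\]
\[
-\frac{1}{n}\log_2 p(X_1,\ldots,X_n) \;=\; \frac{1}{n}\sum_{i=1}^{n} \log_2\frac{1}{p(X_i)} \;\to\; E\!\left[\log_2\frac{1}{p(X)}\right] \;=\; H(X) \quad \text{in probability} \qquad \text{(AEP)}.
\]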

Typical sets: intuition

What's the point? Consider i.i.d. bit strings of length N = 100 with prob(1) = p_1 = 0.1.

The probability of a given string x with r ones is
\[ P(x) = p_1^r (1-p_1)^{N-r}. \]
The number of strings with r ones is
\[ n(r) = \binom{N}{r}, \]
so the distribution of r, the number of ones in a string of length N, is
\[ n(r)\,P(x) = \binom{N}{r}\, p_1^r (1-p_1)^{N-r}. \]

[Figure: the distribution of r for N = 100 and N = 1000, with prob(1) = p_1 = 0.1; the probability mass concentrates around r = N p_1.]
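A short sketch (Python) that reproduces the numbers behind those plots for N = 100 and N = 1000; only summary statistics are printed, the plotting itself is omitted.

# Distribution of r, the number of ones in an i.i.d. Bernoulli(p1) string of length N:
#   n(r) P(x) = C(N, r) p1^r (1 - p1)^(N - r).
from math import comb

def count_distribution(N, p1=0.1):
    return [comb(N, r) * p1**r * (1 - p1)**(N - r) for r in range(N + 1)]

for N in (100, 1000):
    dist = count_distribution(N)
    mode = max(range(N + 1), key=dist.__getitem__)
    sigma = (N * 0.1 * 0.9) ** 0.5
    near_mean = sum(dist[r] for r in range(N + 1) if abs(r - N * 0.1) <= 2 * sigma)
    print(f"N={N}: mode at r={mode}, {near_mean:.2f} of the mass within 2 std devs of N*p1")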

Typical sets: intuition (continued)

Consider i.i.d. bit strings of length N = 100 with prob(1) = p_1 = 0.1, and for each sampled string x compare \(-\log_2 P(x)\) with the entropy of the whole block.

[Figure 4.10, MacKay p. 78: the top 15 strings are samples from X^100, where p_1 = 0.1 and p_0 = 0.9; the bottom two are the most and least probable strings in this ensemble (the all-zeros string, with \(-\log_2 P(x) = 15.2\) bits, and the all-ones string, with 332.1 bits). The final column shows \(-\log_2 P(x)\) for the random strings, which may be compared with the entropy H(X^100) = 46.9 bits.]

Definition: weak typicality (stated below).
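The definition the slide is pointing to, in the Cover & Thomas notation used in the rest of the chapter (ε > 0 arbitrary):

\[
A_\epsilon^{(n)} \;=\; \left\{ (x_1,\ldots,x_n) \in \mathcal{X}^n \;:\; \left| -\frac{1}{n}\log_2 p(x_1,\ldots,x_n) - H(X) \right| \le \epsilon \right\},
\]

i.e. a sequence is weakly typical when its per-symbol log-likelihood is within ε of the entropy, equivalently when \(2^{-n(H(X)+\epsilon)} \le p(x_1,\ldots,x_n) \le 2^{-n(H(X)-\epsilon)}\). (MacKay's \(T_{N\beta}\) is the same set, with β playing the role of ε.)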

Properties of the typical set (completion of the proof)

\[ \left|A_\epsilon^{(n)}\right| \le 2^{\,n(H(X)+\epsilon)}. \tag{3.12} \]
Finally, for sufficiently large n, \(\Pr\{A_\epsilon^{(n)}\} > 1-\epsilon\), so that
\[ 1-\epsilon < \Pr\{A_\epsilon^{(n)}\} \tag{3.13} \]
\[ \le \sum_{x \in A_\epsilon^{(n)}} 2^{-n(H(X)-\epsilon)} \tag{3.14} \]
\[ = \left|A_\epsilon^{(n)}\right| 2^{-n(H(X)-\epsilon)}, \tag{3.15} \]
where the second inequality follows from (3.6). Hence
\[ \left|A_\epsilon^{(n)}\right| \ge (1-\epsilon)\, 2^{\,n(H(X)-\epsilon)}, \tag{3.16} \]
which completes the proof of the properties of \(A_\epsilon^{(n)}\). [Cover & Thomas, Sec. 3.1]

The typical set visually

Bit sequences of length 100, prob(1) = 0.1.

[Figure 4.12, MacKay p. 81: schematic diagram showing all strings in the ensemble X^N ranked by their probability, and the typical set T_{Nβ}; the most and least likely sequences are NOT in the typical set.]

The asymptotic equipartition principle is equivalent to Shannon's source coding theorem (verbal statement): N i.i.d. random variables each with entropy H(X) can be compressed into more than N H(X) bits with negligible risk of information loss, as N → ∞; conversely, if they are compressed into fewer than N H(X) bits it is virtually certain that information will be lost. These two theorems are equivalent because we can define a compression algorithm that gives a distinct name of length N H(X) bits to each x in the typical set. [MacKay, p. 81]

Consequences of the AEP: data compression

Let X_1, X_2, ..., X_n be independent, identically distributed random variables drawn from the probability mass function p(x). We wish to find short descriptions for such sequences of random variables. We divide all sequences in \(\mathcal{X}^n\) into two sets: the typical set \(A_\epsilon^{(n)}\) and its complement.

[Figure 3.1, Cover & Thomas p. 60: typical sets and source coding. Of the \(|\mathcal{X}|^n\) sequences of length n, the typical set \(A_\epsilon^{(n)}\) contains at most \(2^{n(H+\epsilon)}\) elements; the remaining sequences form the non-typical set.]

Proofs (MacKay Sec. 4.5; this section may be skipped if found tough going)

The proof of the source coding theorem uses the law of large numbers. The mean and variance of a real random variable u are
\[ E[u] = \bar{u} = \sum_u P(u)\,u, \qquad \operatorname{var}(u) = \sigma_u^2 = E[(u-\bar{u})^2] = \sum_u P(u)\,(u-\bar{u})^2. \]
(Technical note: strictly, u is assumed to be a function u(x) of a sample x from a finite discrete ensemble X, so the sums \(\sum_u P(u) f(u)\) should be read as \(\sum_x P(x) f(u(x))\) and P(u) is a finite sum of delta functions. This restriction guarantees that the mean and variance of u exist, which is not necessarily the case for general P(u).)

Chebyshev's inequality 1: let t be a non-negative real random variable, and let α be a positive real number. Then
\[ P(t \ge \alpha) \le \frac{E[t]}{\alpha}. \tag{4.30} \]
Proof: \(P(t \ge \alpha) = \sum_{t \ge \alpha} P(t)\). We multiply each term by \(t/\alpha \ge 1\) and add the (non-negative) missing terms to obtain \(P(t \ge \alpha) \le \sum_t P(t)\, t/\alpha = E[t]/\alpha\).
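A small Monte Carlo check (Python) of the property that \(\Pr\{A_\epsilon^{(n)}\}\) exceeds \(1-\epsilon\) once n is large; the values of p, eps and the number of trials are illustrative assumptions.

# Estimate Pr{ A_eps^(n) } for an i.i.d. Bernoulli(p) source by sampling sequences
# and testing whether -(1/n) log2 p(x^n) is within eps of H(p).
import random
from math import log2

def entropy(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

def prob_typical(n, p=0.1, eps=0.05, trials=5000, seed=0):
    rng = random.Random(seed)
    H = entropy(p)
    hits = 0
    for _ in range(trials):
        r = sum(rng.random() < p for _ in range(n))    # number of ones in the sample
        logp = r * log2(p) + (n - r) * log2(1 - p)     # log2 p(x^n)
        hits += abs(-logp / n - H) <= eps
    return hits / trials

for n in (100, 500, 2000):
    print(f"n={n:5d}: estimated Pr{{A_eps^(n)}} ~ {prob_typical(n):.3f}")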

Consequences of the AEP: source coding with the typical set

The typical set contains almost all the probability, and this is what makes it useful for source coding (compression). Each sequence in the typical set gets a description of \(n(H+\epsilon) + 2\) bits; each sequence in the non-typical set gets a description of \(n\log_2|\mathcal{X}| + 2\) bits.

[Figure 3.2, Cover & Thomas p. 61: source code using the typical set.]

We order all elements in each set according to some order (e.g., lexicographic order). Then we can represent each sequence of \(A_\epsilon^{(n)}\) by giving the index of the sequence in the set. Since there are at most \(2^{n(H+\epsilon)}\) sequences in \(A_\epsilon^{(n)}\), the indexing requires no more than \(n(H+\epsilon) + 1\) bits. [The extra bit may be necessary because \(n(H+\epsilon)\) may not be an integer.] We prefix all these sequences by a 0, giving a total length of at most \(n(H+\epsilon) + 2\) bits to represent each sequence in \(A_\epsilon^{(n)}\) (see Figure 3.2). Similarly, we can index each sequence not in \(A_\epsilon^{(n)}\) by using not more than \(n\log_2|\mathcal{X}| + 1\) bits. Prefixing these indices by 1, we have a code for all the sequences in \(\mathcal{X}^n\).

Note the following features of the above coding scheme:
- The code is one-to-one and easily decodable. The initial bit acts as a flag bit to indicate the length of the codeword that follows.
- We have used a brute-force enumeration of the atypical set \(A_\epsilon^{(n)c}\) without taking into account the fact that the number of elements in \(A_\epsilon^{(n)c}\) is less than the number of elements in \(\mathcal{X}^n\). Surprisingly, this is good enough to yield an efficient description.
- The typical sequences have short descriptions of length approximately nH.

We use the notation \(x^n\) to denote a sequence \(x_1, x_2, \ldots, x_n\). Let \(l(x^n)\) be the length of the codeword corresponding to \(x^n\). If n is sufficiently large so that \(\Pr\{A_\epsilon^{(n)}\} \ge 1-\epsilon\), the expected length of the codeword is
\[ E[l(X^n)] = \sum_{x^n} p(x^n)\, l(x^n), \tag{3.17} \]
which works out to at most \(n(H+\epsilon')\), where \(\epsilon' = \epsilon + \epsilon\log_2|\mathcal{X}| + 2/n\) can be made arbitrarily small. [Cover & Thomas, pp. 60-61]
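A toy Python implementation of the two-part code described above, for an alphabet and block length small enough that the typical set can be enumerated explicitly. The source pmf, n and eps are illustrative assumptions; the point is the counting argument (flag bit plus index), not an efficient compressor, and for such a small n the bound E[l(X^n)] ≤ n(H + ε') is still loose.

# Two-part code from the AEP argument: '0' + index within A_eps^(n) for typical
# sequences, '1' + brute-force index within X^n for everything else.
from itertools import product
from math import ceil, log2

def typical_set_code(pmf, n, eps):
    """Codewords for every x^n over the alphabet of `pmf` (dict: symbol -> probability)."""
    H = -sum(p * log2(p) for p in pmf.values())
    seqs = list(product(pmf, repeat=n))
    logp = {x: sum(log2(pmf[s]) for s in x) for x in seqs}
    typical = sorted(x for x in seqs if abs(-logp[x] / n - H) <= eps)   # A_eps^(n), ordered
    typ_set = set(typical)
    atypical = sorted(x for x in seqs if x not in typ_set)
    bits_typ = max(1, ceil(log2(max(2, len(typical)))))   # <= n(H + eps) + 1 bits
    bits_atyp = ceil(n * log2(len(pmf)))                  # <= n log|X| + 1 bits
    code = {x: "0" + format(i, f"0{bits_typ}b") for i, x in enumerate(typical)}
    code.update({x: "1" + format(i, f"0{bits_atyp}b") for i, x in enumerate(atypical)})
    return code, typical, logp

pmf = {"a": 0.9, "b": 0.1}          # illustrative source, H(X) ~ 0.47 bits
n, eps = 8, 0.15
code, typical, logp = typical_set_code(pmf, n, eps)
# Expected codeword length E[l(X^n)] = sum_{x^n} p(x^n) l(x^n), cf. (3.17).
exp_len = sum(len(cw) * 2 ** logp[x] for x, cw in code.items())
print(f"|A_eps^(n)| = {len(typical)}, E[l(X^n)] = {exp_len:.2f} bits (n = {n})")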

AEP and data compression: "high-probability set" vs. "typical set"

The typical set is a relatively small set of outcomes that contains most of the probability. Is it the smallest such set?
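The standard answer (Cover & Thomas, Section 3.3, "High-probability sets and the typical set"): not exactly, but to first order in the exponent it might as well be. Let \(B_\delta^{(n)}\) denote the smallest set of sequences with probability at least \(1-\delta\). Then for \(\delta < \tfrac{1}{2}\) and any \(\delta' > 0\),

\[
\frac{1}{n}\log_2 \left|B_\delta^{(n)}\right| \;>\; H(X) - \delta' \quad \text{for } n \text{ sufficiently large,}
\]

so \(B_\delta^{(n)}\) and the typical set \(A_\epsilon^{(n)}\) both contain roughly \(2^{nH(X)}\) sequences. The typical set is preferred because membership can be checked directly from \(-\frac{1}{n}\log_2 p(x^n)\), without ranking all \(|\mathcal{X}|^n\) sequences by probability.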

Some notation

Bit sequences of length 100, prob(1) = 0.1. MacKay writes N H(X) for the entropy of the block, \(-\log_2 P(x)\) for the information content of a string x, and \(T_{N\beta}\) for the typical set.

[Figure 4.12, MacKay: all strings in the ensemble X^N ranked by their probability, from strings with many ones down to the all-zeros string, with the typical set \(T_{N\beta}\) marked; the entropy N H(X) fixes where the typical set sits on the \(-\log_2 P(x)\) axis.]

What's the difference? Why use the "typical set" rather than the "high-probability set"?