4F5: Advanced Communications and Coding Handout 2: The Typical Set, Compression, Mutual Information


Ramji Venkataramanan
Signal Processing and Communications Lab, Department of Engineering
ramji.v@eng.cam.ac.uk
Michaelmas Term 2015

The Typical Set

Recall the AEP: if X_1, X_2, … are i.i.d. ~ P, then for any ε > 0,

Pr( | -(1/n) log P(X_1, X_2, …, X_n) - H(X) | < ε ) → 1 as n → ∞.

The typical set A_{ε,n} with respect to P is the set of sequences (x_1, …, x_n) ∈ X^n with the property

2^{-n(H(X)+ε)} ≤ P(x_1, …, x_n) ≤ 2^{-n(H(X)-ε)}.

- These are the sequences whose probability is concentrated around 2^{-nH(X)}.
- Note the dependence on n and ε.
- A sequence belonging to the typical set is called an ε-typical sequence.
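The AEP lends itself to a quick numerical check. The following is a minimal sketch (my own, not part of the handout), assuming a small example pmf on four symbols and using numpy to draw i.i.d. samples and compute -(1/n) log2 P(X^n), which should concentrate around H(X) as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.5, 0.25, 0.125, 0.125])    # example pmf (my choice, not from the handout)
H = -np.sum(P * np.log2(P))                # H(X) = 1.75 bits for this pmf

def normalised_log_prob(n):
    """Draw X_1,...,X_n i.i.d. ~ P and return -(1/n) log2 P(X_1,...,X_n)."""
    xs = rng.choice(len(P), size=n, p=P)
    return -np.mean(np.log2(P[xs]))

for n in (10, 100, 1000, 10000):
    print(n, round(H, 3), round(normalised_log_prob(n), 3))
# The sample value approaches H(X) as n grows, as the AEP predicts.
```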

Properties of the Typical Set

Notation: we will use X^n to denote the vector (X_1, X_2, …, X_n).

Property 1. If X^n = (X_1, …, X_n) is generated i.i.d. ~ P, then Pr(X^n ∈ A_{ε,n}) → 1 as n → ∞.

Proof: From the definition of A_{ε,n}, note that

X^n ∈ A_{ε,n}  ⇔  2^{-n(H(X)+ε)} ≤ P(X^n) ≤ 2^{-n(H(X)-ε)}.   (1)

The AEP says that

Pr(X^n ∈ A_{ε,n}) = Pr( H(X) - ε ≤ -(1/n) log P(X^n) ≤ H(X) + ε ) → 1 as n → ∞.

Property 2. |A_{ε,n}| ≤ 2^{n(H(X)+ε)}. (|A_{ε,n}| is the number of elements in the set A_{ε,n}.)

Proof:

1 = Σ_{x^n ∈ X^n} P(x^n) ≥ Σ_{x^n ∈ A_{ε,n}} P(x^n) ≥(a) Σ_{x^n ∈ A_{ε,n}} 2^{-n(H(X)+ε)} = 2^{-n(H(X)+ε)} |A_{ε,n}|.

Hence |A_{ε,n}| ≤ 2^{n(H(X)+ε)}. (Inequality (a) follows from the definition of the typical set.)

Property 3. For sufficiently large n, |A_{ε,n}| ≥ (1 - ε) 2^{n(H(X)-ε)}.

Proof: From Property 1, Pr(X^n ∈ A_{ε,n}) → 1 as n → ∞. This means that for any ε > 0 and sufficiently large n, we have Pr(X^n ∈ A_{ε,n}) > 1 - ε. Thus, for sufficiently large n:

1 - ε < Pr(X^n ∈ A_{ε,n}) = Σ_{x^n ∈ A_{ε,n}} P(x^n) ≤(b) Σ_{x^n ∈ A_{ε,n}} 2^{-n(H(X)-ε)} = 2^{-n(H(X)-ε)} |A_{ε,n}|.

Hence |A_{ε,n}| ≥ (1 - ε) 2^{n(H(X)-ε)}. (Inequality (b) follows from the definition of the typical set.)

Properties of the Typical Set: Summary

For large n, suppose you generate X_1, …, X_n i.i.d. ~ P. Then:
- With high probability, the sequence you obtain will be typical, i.e., its probability is close to 2^{-nH(X)}.
- The number of typical sequences is close to 2^{nH(X)}.

How is all this relevant to communication? We will soon answer this. First, data compression.
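For a small alphabet and modest n, the three properties can be checked by brute force. The sketch below is my own toy setup (a binary pmf with ε = 0.2 and n = 14, all chosen for illustration rather than taken from the handout): it enumerates every sequence in X^n, builds A_{ε,n} from the definition, and prints the quantities appearing in Properties 1-3.

```python
import itertools
import math

P = {0: 0.7, 1: 0.3}                               # assumed binary pmf
H = -sum(p * math.log2(p) for p in P.values())     # about 0.881 bits
eps, n = 0.2, 14

prob_typical, size_typical = 0.0, 0
for xs in itertools.product(P, repeat=n):          # all 2^n sequences
    logp = sum(math.log2(P[x]) for x in xs)        # log2 P(x^n) for the i.i.d. source
    if -n * (H + eps) <= logp <= -n * (H - eps):   # definition of A_{eps,n}
        size_typical += 1
        prob_typical += 2.0 ** logp

print("Pr(X^n in A)   :", prob_typical)                            # Property 1: tends to 1 as n grows
print("|A|, upper bnd :", size_typical, 2 ** (n * (H + eps)))      # Property 2: |A| <= 2^{n(H+eps)}
print("|A|, lower bnd :", size_typical, (1 - eps) * 2 ** (n * (H - eps)))  # Property 3 (n large enough)
```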

Compression

GOAL: to compress a source producing symbols X_1, X_2, … that are i.i.d. ~ P.

For concreteness, consider English text: X = {a, b, …, z, ',', '.', space, ';', '@', '#'}, so |X| = 32. (English text is not really i.i.d., but for now assume it is.) Assume that we know the source entropy H(X). H(X) can be estimated by measuring the frequency of each symbol; e.g. by measuring the frequencies of a, b, etc. separately, the entropy estimate for English text is 4 bits.

Naïve representation: list all the |X|^n possible length-n sequences and index these as {0, 1, …, |X|^n - 1} using ⌈log |X|^n⌉ bits. Number of bits per sequence = n log |X| (5n for English).

Compression via the Typical Set

[Figure: the typical set A_{ε,n}, containing about 2^{n(H(X)+ε)} sequences, shown as a small subset of all |X|^n sequences.]

Compression scheme: there are at most 2^{n(H(X)+ε)} ε-typical sequences (2^{n(4+ε)} for our example).
- Index each sequence in A_{ε,n} using ⌈log 2^{n(H(X)+ε)}⌉ bits, and prefix each of these by a flag bit 0. Bits per typical sequence = ⌈n(H(X) + ε)⌉ + 1 ≤ n(H(X) + ε) + 2.
- Index each sequence not in A_{ε,n} using ⌈log |X|^n⌉ bits, and prefix each of these by a flag bit 1. Bits per non-typical sequence = ⌈n log |X|⌉ + 1 ≤ n log |X| + 2.

This code assigns a unique codeword to each sequence in X^n.
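As an illustration only, here is a toy implementation of the two-class code just described, for a hypothetical 3-symbol source (the pmf, n and ε are my own choices, not from the handout). A typical sequence gets a flag bit 0 followed by a ⌈n(H(X)+ε)⌉-bit index into A_{ε,n}; any other sequence gets a flag bit 1 followed by a ⌈n log|X|⌉-bit index into X^n.

```python
import itertools
import math

P = {'a': 0.8, 'b': 0.1, 'c': 0.1}                 # hypothetical source pmf
H = -sum(p * math.log2(p) for p in P.values())     # about 0.92 bits
eps, n = 0.1, 10

def is_typical(xs):
    logp = sum(math.log2(P[x]) for x in xs)
    return -n * (H + eps) <= logp <= -n * (H - eps)

all_seqs = list(itertools.product(P, repeat=n))    # all |X|^n = 3^10 sequences
typical = [xs for xs in all_seqs if is_typical(xs)]
typ_index = {xs: i for i, xs in enumerate(typical)}
any_index = {xs: i for i, xs in enumerate(all_seqs)}

bits_typ = math.ceil(n * (H + eps))                # index width for the typical class (11 bits here)
bits_all = math.ceil(n * math.log2(len(P)))        # index width for the fallback class (16 bits here)

def encode(xs):
    if xs in typ_index:
        return '0' + format(typ_index[xs], f'0{bits_typ}b')
    return '1' + format(any_index[xs], f'0{bits_all}b')

print(len(encode(typical[0])))                     # 12-bit codeword for a typical sequence
print(len(encode(('a',) * n)))                     # 17 bits: all-'a' is too probable to be typical here
```

Note that the encoder has to enumerate and index A_{ε,n} explicitly, which is exactly the source of the computational complexity discussed a couple of slides below.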

Average code length

Let l(x^n) be the length of the codeword assigned to sequence x^n. Then

E[l(X^n)] = Σ_{x^n} P(x^n) l(x^n)
          = Σ_{x^n ∈ A_{ε,n}} P(x^n) l(x^n) + Σ_{x^n ∉ A_{ε,n}} P(x^n) l(x^n)
          ≤ Σ_{x^n ∈ A_{ε,n}} P(x^n) (n(H(X) + ε) + 2) + Σ_{x^n ∉ A_{ε,n}} P(x^n) (n log |X| + 2)
          ≤ 1 · n(H(X) + ε) + ε · n log |X| + 2   (using Pr(X^n ∈ A_{ε,n}) ≤ 1 and, for n large enough, Pr(X^n ∉ A_{ε,n}) ≤ ε)
          = n(H(X) + ε) + εn log |X| + 2
          = n(H(X) + ε'),

where ε' = ε + ε log |X| + 2/n. ε' can be made arbitrarily small by picking ε small enough and then n sufficiently large.

Fundamental Limit of Compression

We have just shown that we can represent sequences X^n using approximately nH(X) bits on the average. More precisely:

Let X^n be i.i.d. ~ P. Fix any ε > 0. For n sufficiently large, there exists a code that maps sequences x^n of length n into binary strings such that the mapping is one-to-one and

E[ (1/n) l(X^n) ] ≤ H(X) + ε.

In fact, more is true: you cannot do any better than H(X), i.e., the expected length of any uniquely decodable code satisfies

E[ (1/n) l(X^n) ] ≥ H(X).

(For a proof of this, see [Cover & Thomas, Chapter 5]; also in the 3F1 notes.)

Entropy is the fundamental limit of lossless compression.
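As a quick sanity check with my own numbers (ε = 0.01 and n = 1000, combined with the English-text values H(X) ≈ 4 bits and log |X| = 5 from above), the bound gives

ε' = ε + ε log |X| + 2/n = 0.01 + 0.05 + 0.002 = 0.062,

so the scheme uses at most H(X) + ε' ≈ 4.06 bits/symbol on average, compared with 5 bits/symbol for the naïve representation.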

[Figure: the typical set A_{ε,n} (about 2^{n(H(X)+ε)} sequences) shown again as a small subset of all |X|^n sequences.]

The typical set is a very small subset of the set of all sequences, but it contains almost all the probability! This is the reason that, even with an i.i.d. model, English text can be compressed to about 4 bits/sample. We can compress even more if we consider correlations in the text, e.g. q always followed by u. (What kind of source cannot be compressed at all?)

Is this scheme practical? No: to find the codeword for any x^n, we need to go through a table of about 2^{nH} entries, which is computationally complex. Practical schemes like Huffman coding and Lempel-Ziv achieve rates close to the entropy with much lower complexity (see 3F1).

Relative Entropy

The relative entropy or Kullback-Leibler (KL) divergence between two pmfs P and Q is

D(P‖Q) = Σ_{x ∈ X} P(x) log [P(x)/Q(x)].

(Note: P and Q are defined on the same alphabet X.)
- It is a measure of distance between the distributions P and Q.
- It is not a true distance; for example, D(P‖Q) ≠ D(Q‖P) in general.
- D(P‖Q) ≥ 0, with equality if and only if P = Q.
  Proof: first use log a = ln a / ln 2, then use the fact that ln a ≤ a - 1, with equality iff a = 1.

Mutual Information

Consider two random variables X and Y with joint pmf P_{XY}. The mutual information between X and Y is defined as

I(X; Y) = H(X) - H(X|Y) bits.

It is the reduction in the uncertainty of X when you observe Y.

Property 1. I(X; Y) = H(X) + H(Y) - H(X, Y) = H(Y) - H(Y|X).

X says as much about Y as Y says about X.

Proof: From the chain rule of entropy, H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y). In the definition of I(X; Y), use H(X|Y) = H(X, Y) - H(Y).
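The definitions above translate directly into code. Below is a minimal sketch (my own helper functions, not from the handout) that computes D(P‖Q) and I(X; Y) for finite pmfs represented as numpy arrays, with quick checks of the stated properties: non-negativity and asymmetry of the KL divergence, and I(X; Y) = 0 when X and Y are independent.

```python
import numpy as np

def entropy(p):
    """H(p) in bits; zero-probability entries are skipped (0 log 0 = 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    """D(p || q) in bits, assuming q(x) > 0 wherever p(x) > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def mutual_information(pxy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint pmf pxy[x, y]."""
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    return entropy(px) + entropy(py) - entropy(pxy.flatten())

p, q = np.array([0.5, 0.5]), np.array([0.9, 0.1])
print(kl_divergence(p, q), kl_divergence(q, p))   # both >= 0, and unequal: D is not symmetric
print(kl_divergence(p, p))                        # 0, since the two pmfs are identical
print(mutual_information(np.outer(p, q)))         # ~0: product joint means X, Y independent
```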

Venn Diagram

[Figure: two overlapping circles labelled H(X) and H(Y); their intersection is I(X; Y), and the two circles together represent H(X, Y).]

Questions
1. What is I(X; Y) when X and Y are independent? Ans: 0.
2. What is I(X; Y) when Y = X? Ans: H(X).
3. What is I(X; Y) when Y = f(X)? Ans: H(Y) = H(f(X)).

Example

X is the event that tomorrow is cloudy; Y is the event that it will rain tomorrow. Joint pmf P_{XY}:

                Rain    No Rain
Cloudy          3/8     3/8
Not cloudy      1/16    3/16

In Handout 1, we calculated: H(X, Y) = 1.764, H(X) = 0.811, H(Y|X) = 0.953.

To compute I(X; Y), we need to compute H(Y) (or H(X|Y)):

P(Y = rainy) = 3/8 + 1/16 = 7/16,  P(Y = not rainy) = 9/16,  so H(Y) = 0.989.

I(X; Y) = H(Y) - H(Y|X) = 0.036.

Verify that you get the same answer by computing H(X|Y) and using I(X; Y) = H(X) - H(X|Y).
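The example's numbers are easy to reproduce. The short script below (my own check, reusing the joint pmf from the table above) recomputes H(X, Y), H(X), H(Y) and I(X; Y) = H(Y) - H(Y|X).

```python
import numpy as np

# Joint pmf from the table above: rows index X (cloudy, not cloudy),
# columns index Y (rain, no rain).
pxy = np.array([[3/8, 3/8],
                [1/16, 3/16]])

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

px, py = pxy.sum(axis=1), pxy.sum(axis=0)
Hxy, Hx, Hy = H(pxy.flatten()), H(px), H(py)
print(round(Hxy, 3), round(Hx, 3), round(Hy, 3))   # 1.764, 0.811, 0.989
print(round(Hy - (Hxy - Hx), 3))                   # I(X;Y) = H(Y) - H(Y|X) = 0.036
```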

Property 2 of Mutual Information

I(X; Y) = D(P_{XY} ‖ P_X P_Y),

the relative entropy between the joint pmf and the product of the marginals.

Proof:

I(X; Y) = H(X) - H(X|Y)
        = -Σ_x P_X(x) log P_X(x) + Σ_{x,y} P_{XY}(x, y) log P_{X|Y}(x|y)
        = Σ_{x,y} P_{XY}(x, y) log [ P_{X|Y}(x|y) / P_X(x) ]
        = Σ_{x,y} P_{XY}(x, y) log [ P_{X|Y}(x|y) P_Y(y) / (P_X(x) P_Y(y)) ]
        = D(P_{XY} ‖ P_X P_Y).

Property 3. I(X; Y) ≥ 0.

Proof: Follows from Property 2, because D(P‖Q) ≥ 0 for any pair of pmfs P, Q.

Implication: H(X|Y) ≤ H(X) and H(Y|X) ≤ H(Y). Knowing another random variable Y can only reduce the average uncertainty in X.

Preview: let X be the input to a communication channel, and Y the output. We will show that I(X; Y) is key to understanding how much information can be transmitted over the channel: it is the reduction in the uncertainty of the channel input X when you observe the output Y.
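As a quick numerical confirmation of Property 2 (my own check, reusing the weather pmf from the example above), the KL divergence between the joint pmf and the product of its marginals indeed comes out to 0.036 bits, matching I(X; Y):

```python
import numpy as np

pxy = np.array([[3/8, 3/8],
                [1/16, 3/16]])                      # weather joint pmf from the example
px_py = np.outer(pxy.sum(axis=1), pxy.sum(axis=0))  # product of the marginals P_X P_Y
kl = np.sum(pxy * np.log2(pxy / px_py))             # D(P_XY || P_X P_Y)
print(round(kl, 3))                                 # 0.036 bits, equal to I(X;Y) above
```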

You can now do Questions 1-10 on Examples Paper 1.