
ECE 587 / STA 563: Lecture 5
Lossless Compression
Information Theory
Duke University, Fall 2017
Author: Galen Reeves
Last Modified: October 18, 2017

Outline of lecture:
  5.1 Introduction to Lossless Source Coding
      5.1.1 Motivating Example
      5.1.2 Definitions
  5.2 Fundamental Limits of Compression
      5.2.1 Uniquely Decodable Codes & Kraft Inequality
      5.2.2 Codes on Trees
      5.2.3 Prefix-Free Codes & Kraft Inequality
  5.3 Shannon Code
  5.4 Huffman Code
      5.4.1 Optimality of Huffman Code
  5.5 Coding Over Blocks
  5.6 Coding with Unknown Distributions
      5.6.1 Minimax Redundancy
      5.6.2 Coding with Unknown Alphabet
      5.6.3 Lempel-Ziv Code

5.1 Introduction to Lossless Source Coding

5.1.1 Motivating Example

• Example: Consider assigning binary phone numbers to your friends.

    friend   probability   code (i)   code (ii)   code (iii)   code (iv)   code (v)   code (vi)
    Alice    1/4           0011       001101      0            00          0          10
    Bob      1/2           0011       001110      1            11          11         0
    Carol    1/4           1100       110000      10           10          10         11

• Analysis of the codes:
  (i) Alice and Bob have the same number. Does not work.
  (ii) Works, but the phone numbers are too long.
  (iii) Not decodable: 10 could mean Carol, or could mean Bob, Alice.
  (iv) Works, but why do we need two zeros for Alice? After the first zero it is clear who we want.
  (v) OK, but Alice has a shorter code than Bob.
  (vi) This is the optimal code. Once you are finished dialing you can be connected immediately.
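As a quick illustration (added here, not part of the original notes), the sketch below decodes a dialed digit stream with code (vi). Because no codeword is a prefix of another, a greedy decoder can connect a call the instant the buffered digits match a codeword.

```python
# Decoding with the prefix-free code (vi) from the table above.
CODE_VI = {"10": "Alice", "0": "Bob", "11": "Carol"}

def decode_stream(bits):
    """Emit a name as soon as the dialed digits form a complete codeword."""
    decoded, buffer = [], ""
    for b in bits:
        buffer += b
        if buffer in CODE_VI:          # complete codeword -> connect immediately
            decoded.append(CODE_VI[buffer])
            buffer = ""
    assert buffer == "", "stream ended in the middle of a codeword"
    return decoded

print(decode_stream("100011"))   # ['Alice', 'Bob', 'Bob', 'Carol']
```

This greedy rule is exactly why prefix-free codes are called instantaneous: no lookahead is ever required.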

Desirable properties of a code:
(1) Uniquely decodable.
(2) Efficient, i.e., minimize the average codeword length
        E[l(X)] = Σ_{x∈X} p(x) l(x)
    where l(x) is the length of the codeword associated with symbol x.
(3) Prefix-free, i.e., no codeword is the prefix of another codeword.

5.1.2 Definitions

• A source code is a mapping C from a source alphabet X to D-ary sequences:
  - D* is the set of finite-length strings of symbols from the D-ary alphabet D = {1, 2, ..., D}, i.e., D* = D ∪ D^2 ∪ D^3 ∪ ...
  - C(x) ∈ D* is the codeword for x ∈ X.
  - l(x) is the length of C(x).
• A code is nonsingular if x ≠ x' ⇒ C(x) ≠ C(x').
• The extension of the code C is the mapping from finite-length strings of X to finite-length strings of D*:
        C(x_1 x_2 ... x_n) = C(x_1) C(x_2) ... C(x_n)
  where x_1 x_2 ... x_n is the input (source) and C(x_1) C(x_2) ... C(x_n) is the output (code).
• A code C is uniquely decodable if its extension is nonsingular, i.e., for all n,
        (x_1, x_2, ..., x_n) ≠ (x'_1, x'_2, ..., x'_n)  ⇒  C(x_1) C(x_2) ... C(x_n) ≠ C(x'_1) C(x'_2) ... C(x'_n)
• A code is prefix-free if no codeword is prefixed by another codeword. Such codes are also known as prefix codes or instantaneous codes.
• [Venn diagram of instantaneous, uniquely decodable, and nonsingular codes]
• Given a distribution p(x) on the input symbols, the goal is to minimize the expected length per symbol
        E[l(X)] = Σ_{x∈X} l(x) p(x)

5.2 Fundamental Limits of Compression

• This section considers the limits of lossless compression and proves the following result.
• Theorem: For any source distribution p(x), the expected length E[l(X)] of the optimal uniquely decodable D-ary code obeys
        H(X)/log D ≤ E[l(X)] < H(X)/log D + 1
  Furthermore, there exists a prefix-free code which is optimal.

5.2.1 Uniquely Decodable Codes & Kraft Inequality

• Let l(x) be the length function associated with a code C, i.e., l(x) is the length of the codeword C(x) for all x ∈ X.
• A code C satisfies the Kraft inequality if and only if
        Σ_{x∈X} D^{-l(x)} ≤ 1        (Kraft Inequality)
• Theorem: Every uniquely decodable code satisfies the Kraft inequality, i.e.,
        uniquely decodable ⇒ Kraft inequality
• Proof:
  - Let C be a uniquely decodable source code. For a source sequence x^n, the length of the extended codeword C(x^n) is given by
        l(x^n) = Σ_{i=1}^{n} l(x_i)
    and thus the length function of the extended codeword obeys
        n l_min ≤ l(x^n) ≤ n l_max
    where l_min = min_{x∈X} l(x) and l_max = max_{x∈X} l(x).
  - Let A_k be the number of source sequences of length n for which l(x^n) = k, i.e.,
        A_k = #{ x^n ∈ X^n : l(x^n) = k }
    Since the code is uniquely decodable, the number of source sequences with codewords of length k cannot exceed the number of D-ary sequences of length k, and so A_k ≤ D^k.

  - The extended codeword lengths must obey
        Σ_{x^n ∈ X^n} D^{-l(x^n)} = Σ_{x^n ∈ X^n} Σ_k 1(l(x^n) = k) D^{-k}
                                  = Σ_k ( Σ_{x^n ∈ X^n} 1(l(x^n) = k) ) D^{-k}
                                  = Σ_{k = n l_min}^{n l_max} A_k D^{-k}
                                  ≤ Σ_{k = n l_min}^{n l_max} D^k D^{-k}        (since uniquely decodable)
                                  ≤ n l_max
  - The extended codeword lengths must also obey
        Σ_{x^n ∈ X^n} D^{-l(x^n)} = Σ_{x_1 ∈ X} D^{-l(x_1)} Σ_{x_2 ∈ X} D^{-l(x_2)} ... Σ_{x_n ∈ X} D^{-l(x_n)} = [ Σ_{x∈X} D^{-l(x)} ]^n
  - Combining the above displays shows that
        [ Σ_{x∈X} D^{-l(x)} ]^n ≤ n l_max   for all n
    If the code does not satisfy the Kraft inequality, then the left-hand side blows up exponentially as n becomes large, and this inequality is violated. Thus, the code must satisfy the Kraft inequality.

• Theorem: For any source distribution p(x), the expected codeword length of every D-ary uniquely decodable code obeys the lower bound
        E[l(X)] ≥ H(X)/log D
• Proof:
        E[l(X)] − H(X)/log D = Σ_x p(x) [ l(x) + log_D p(x) ]
                             = −Σ_x p(x) log_D D^{-l(x)} + Σ_x p(x) log_D p(x)
                             = Σ_x p(x) log_D ( p(x) / D^{-l(x)} )
                             ≥ log_D(e) Σ_x p(x) ( 1 − D^{-l(x)} / p(x) )        (Fundamental Inequality)
                             = log_D(e) ( 1 − Σ_x D^{-l(x)} )
                             ≥ 0                                                 (Kraft Inequality)
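These bounds are easy to check numerically. The snippet below (an added illustration, not from the notes) evaluates the Kraft sum and the entropy lower bound for three of the codes from the motivating example in Section 5.1.1: code (iii) violates the Kraft inequality (so it cannot be uniquely decodable), while the optimal code (vi) meets the lower bound E[l(X)] = H(X) with equality.

```python
# Kraft sums and expected lengths for the phone-number codes of Section 5.1.1.
import math

p = {"Alice": 0.25, "Bob": 0.5, "Carol": 0.25}
codes = {
    "(iii)": {"Alice": "0",  "Bob": "1",  "Carol": "10"},   # not uniquely decodable
    "(v)":   {"Alice": "0",  "Bob": "11", "Carol": "10"},   # prefix-free, suboptimal
    "(vi)":  {"Alice": "10", "Bob": "0",  "Carol": "11"},   # prefix-free, optimal
}

H = -sum(px * math.log2(px) for px in p.values())           # entropy in bits
for name, C in codes.items():
    kraft = sum(2 ** -len(w) for w in C.values())
    avg_len = sum(p[x] * len(C[x]) for x in p)
    print(f"code {name}: Kraft sum = {kraft:.2f}, E[l] = {avg_len:.2f}, H = {H:.2f}")
```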

5.2.2 Codes on Trees

• Any D-ary code can be represented as a D-ary tree.
  - A D-ary tree consists of a root with branches, nodes, and leaves. The root and every node has exactly D children.
  - [Example of a binary tree]
  - The depth of a leaf (i.e., the number of steps it takes to reach the root) corresponds to the length of the codeword.
• Lemma: A code is prefix-free if, and only if, each of its codewords is a leaf:
        code is prefix-free ⇔ every codeword is a leaf

5.2.3 Prefix-Free Codes & Kraft Inequality

• Theorem: There exists a prefix-free code with length function l(x) if and only if l(x) satisfies the Kraft inequality, i.e.,
        l(x) is the length function of a prefix-free code ⇔ Σ_{x∈X} D^{-l(x)} ≤ 1
• Proof of ⇒:
  - One option is to recognize that a prefix-free code is uniquely decodable and recall that all uniquely decodable codes must satisfy the Kraft inequality. A different proof is given below.
  - Let l(x) be the length function of a prefix-free code and let l_max = max_x l(x).
  - The tree corresponding to the code has maximum depth l_max, and thus the total number of leaves in the tree is upper bounded by D^{l_max}. Note that this number is attained only if all codewords are the same length.
  - Since the code is prefix-free, any codeword of length l(x) < l_max is a leaf, and thus prunes (i.e., removes) D^{l_max − l(x)} leaves from the total number of possible leaves.
  - Since the number of leaves that is removed cannot exceed the total number of possible leaves, we have
        Σ_{x∈X} D^{l_max − l(x)} ≤ D^{l_max}

  - Dividing both sides by D^{l_max} yields Σ_{x∈X} D^{-l(x)} ≤ 1.
• Proof of ⇐:
  - Let l(x) be a length function that satisfies the Kraft inequality. The goal is to create a D-ary tree where the depths of the leaves correspond to l(x).
  - It suffices to show that, for each integer k, after all codewords of length l(x) < k have been assigned, there remain enough unpruned nodes on level k to handle the codewords with length l(x) = k. That is, we need to show that for each k,
        D^k − Σ_{x : l(x) < k} D^{k − l(x)}  ≥  #{ x : l(x) = k }
    where the left-hand side is the number of remaining nodes after assigning the short codewords and the right-hand side is the number needed for the codewords of length k.
  - The right-hand side can be written as
        #{ x : l(x) = k } = Σ_{x : l(x) = k} D^{k − l(x)}
    So, to succeed on level k we need
        D^k ≥ Σ_{x : l(x) < k} D^{k − l(x)} + Σ_{x : l(x) = k} D^{k − l(x)}
  - Dividing both sides by D^k yields
        1 ≥ Σ_{x : l(x) ≤ k} D^{-l(x)}
  - Since the lengths satisfy the Kraft inequality,
        Σ_{x : l(x) ≤ k} D^{-l(x)} ≤ Σ_{x∈X} D^{-l(x)} ≤ 1
    and thus we have shown that there always exist enough remaining nodes to handle the codewords of length k.
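The ⇐ direction is constructive and simple to carry out. The sketch below (an added example, assuming a binary alphabet D = 2; it is not code from the notes) assigns codewords level by level, always taking the next available unpruned node, which is the same idea used by canonical prefix codes.

```python
# Build a binary prefix-free code from a length function satisfying Kraft.
def prefix_code_from_lengths(lengths):
    """lengths: dict {symbol: codeword length}. Returns {symbol: codeword}."""
    assert sum(2 ** -l for l in lengths.values()) <= 1, "Kraft inequality violated"
    code, next_node, prev_len = {}, 0, 0
    # Assign shorter codewords first; each one prunes the subtree below it.
    for sym, l in sorted(lengths.items(), key=lambda kv: kv[1]):
        next_node <<= (l - prev_len)             # descend to depth l
        code[sym] = format(next_node, "b").zfill(l)
        next_node += 1                           # step to the next unpruned node
        prev_len = l
    return code

print(prefix_code_from_lengths({"a": 1, "b": 2, "c": 3, "d": 3}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

Processing the lengths in increasing order matters: a short codeword must be placed before the level-k nodes beneath it are counted as available.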

• Theorem: For any source distribution p(x), there exists a D-ary prefix-free code whose expected length satisfies the upper bound
        E[l(X)] < H(X)/log D + 1
• Proof: (This proof is nonintuitive; the next section gives an explicit construction.)
  - By the previous theorem, it suffices to show that there exists a length function l(x) that satisfies the Kraft inequality and the stated inequality.
  - Consider the length function
        l(x) = ⌈ log_D (1/p(x)) ⌉
    where ⌈·⌉ denotes the ceiling function (i.e., round up to the nearest integer). Then
        log_D (1/p(x)) ≤ l(x) < log_D (1/p(x)) + 1
  - Since
        Σ_{x∈X} D^{-l(x)} ≤ Σ_{x∈X} D^{log_D p(x)} = Σ_{x∈X} p(x) = 1
    this length function satisfies the Kraft inequality, and there exists a prefix-free code with length function l(x).
  - The expected codeword length is given by
        E[l(X)] = E[ ⌈ log_D (1/p(X)) ⌉ ] < E[ log_D (1/p(X)) ] + 1 = H(X)/log D + 1

5.3 Shannon Code

• We now investigate how to construct codes with nice properties. These include:
  - short expected code length → better compression
  - prefix-free → can decode instantaneously
  - efficient representation → no need for a huge lookup table for encoding and decoding
• Intuitively, the key idea is to assign shorter codewords to more likely source symbols.
• The results of the previous section show that there exists a prefix-free code such that:
  - The length function l(x) is given by l(x) = ⌈ log (1/p(x)) ⌉
  - The expected length obeys E[l(X)] < H(X)/log D + 1
• In 1948, Shannon proposed a specific way to build this code. The resulting code is also known as the Shannon-Fano-Elias code.
• Without loss of generality, let the source alphabet be X = {1, 2, ..., m}. The cumulative distribution function (cdf) of the source distribution is
        F(x) = Σ_{k ≤ x} p(k)
  [Illustration of the cdf]

• Construction of the Shannon code:
  - For x ∈ {1, 2, ..., m}, let F̄(x) be the midpoint of the interval [F(x−1), F(x)), i.e.,
        F̄(x) = ( F(x−1) + F(x) ) / 2 = F(x−1) + p(x)/2
    Note that F̄(x) is a real number between zero and one that uniquely identifies x.
  - The codeword C(x) corresponds to the D-ary expansion of the real number F̄(x), truncated at the point where the codeword is unique (i.e., cannot be confused with the midpoint of any other interval):
        C(x) = D-ary expansion of F̄(x) truncated such that | F̄(x) − 0.C(x) | < (1/2) p(x)
  - If l(x) terms are retained, then the codeword is given by
        F̄(x) = 0. z_1 z_2 ... z_{l(x)} z_{l(x)+1} z_{l(x)+2} ...     and     C(x) = z_1 z_2 ... z_{l(x)}
  - Note that it is sufficient to retain the first l(x) terms where
        l(x) = ⌈ log_D (1/p(x)) ⌉ + 1
    since this implies that
        | F̄(x) − 0.C(x) | ≤ D^{-l(x)} ≤ p(x)/D ≤ (1/2) p(x)
  - Thus, the expected length of the Shannon code obeys
        E[l(X)] < H(X)/log D + 2
• Example: Consider the following binary Shannon code.

    x   p(x)   F(x)   F̄(x)    F̄(x) in binary    ⌈log 1/p(x)⌉ + 1   C(x)
    1   0.25   0.25   0.125   0.001             3                  001
    2   0.25   0.5    0.375   0.011             3                  011
    3   0.2    0.7    0.6     0.10011...        4                  1001
    4   0.15   0.85   0.775   0.1100011...      4                  1100
    5   0.15   1      0.925   0.11101100...     4                  1110

  The entropy is H(X) ≈ 2.2855 bits and the expected length is E[l(X)] = 3.5.
• Note that there is a one-to-one mapping between p(x) and C(x). In fact, one can encode and decode without representing the code explicitly.
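The construction can be carried out in a few lines. The sketch below (added for illustration; not the notes' code) computes the binary Shannon-Fano-Elias codewords for the example distribution and reproduces the table.

```python
# Binary Shannon-Fano-Elias code: truncate the binary expansion of the interval
# midpoint Fbar(x) to ceil(log2(1/p(x))) + 1 bits.
import math

def sfe_code(p):
    F, codes = 0.0, []
    for px in p:
        Fbar = F + px / 2                        # midpoint of [F(x-1), F(x))
        l = math.ceil(math.log2(1 / px)) + 1     # number of bits to retain
        bits, frac = "", Fbar
        for _ in range(l):                       # truncated binary expansion
            frac *= 2
            bit = int(frac)
            bits += str(bit)
            frac -= bit
        codes.append(bits)
        F += px
    return codes

print(sfe_code([0.25, 0.25, 0.2, 0.15, 0.15]))   # ['001', '011', '1001', '1100', '1110']
```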

5.4 Huffman Code

• The Shannon code described in the previous section is good, but it is not necessarily optimal.
• Recall that the Kraft inequality is a:
  - necessary condition for unique decodability
  - sufficient condition for the existence of a prefix-free code
• The search for the optimal code can be stated as the following optimization problem. Given p(x), find a length function l(x) that minimizes the expected length and satisfies the Kraft inequality:
        min_{l(·)}   Σ_{x∈X} p(x) l(x)
        s.t.         Σ_{x∈X} D^{-l(x)} ≤ 1,   l(x) is an integer
• The optimal code was discovered by David Huffman, who was a graduate student in an information theory course (1952).
• Construction of the Huffman code:
  (1) Take the two least probable symbols. These are assigned the longest codewords, which have equal length and differ only in the last digit.
  (2) Merge these two symbols into a new symbol with the combined probability mass and repeat.
• Example: Consider the following source distribution.

    x   p(x)    code
    1   0.4     0
    2   0.15    110
    3   0.15    100
    4   0.1     101
    5   0.1     1110
    6   0.05    11110
    7   0.04    111110
    8   0.01    111111

  The entropy is H(X) ≈ 2.48 bits and the expected length is E[l(X)] = 2.55.
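A compact implementation sketch of this procedure (added here; assuming a binary alphabet) uses a min-heap to repeatedly merge the two least probable subtrees. Because of ties among the probabilities, the individual codewords it produces may differ from the table above, but the expected length is the same.

```python
# Huffman coding via repeated merging of the two least probable nodes.
import heapq

def huffman(probs):
    """probs: dict {symbol: probability}. Returns dict {symbol: codeword}."""
    heap = [(pr, i, {sym: ""}) for i, (sym, pr) in enumerate(probs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)                # two least probable subtrees
        p2, i2, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}   # prepend branch labels
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, i2, merged))
    return heap[0][2]

p = {1: 0.4, 2: 0.15, 3: 0.15, 4: 0.1, 5: 0.1, 6: 0.05, 7: 0.04, 8: 0.01}
code = huffman(p)
print(code)
print("E[l] =", sum(p[s] * len(code[s]) for s in p))   # ~2.55, matching the table
```

The integer tiebreaker in each heap entry only prevents Python from comparing dictionaries when two probabilities are equal; any tie-breaking rule yields an optimal code.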

5.4.1 Optimality of Huffman Code

• Let X = {1, 2, ..., m} and let l_i = l(i), p_i = p(i), and C_i = C(i). Without loss of generality, assume the probabilities are in descending order p_1 ≥ p_2 ≥ ... ≥ p_m.
• Lemma 1: In an optimal code, shorter codewords are assigned to larger probabilities, i.e.,
        p_i > p_j ⇒ l_i ≤ l_j
  Proof: Assume otherwise, that is, l_i > l_j and p_i > p_j. Then, by exchanging these codewords, the expected length will decrease, and thus the code is not optimal.
• Lemma 2: There exists an optimal code for which the codewords assigned to the two smallest probabilities are siblings (i.e., they have the same length and differ only in the last symbol).
  Proof: Consider any optimal code. By Lemma 1, the codeword C_m has the longest length. Assume, for the sake of contradiction, that its sibling is not a codeword. Then the expected length can be decreased by moving C_m to its parent. Thus, the code is not optimal and a contradiction is reached. Now we know the sibling of C_m is a codeword. If it is C_{m−1}, we are done. Otherwise, assume it is some C_i with i ≠ m−1 and the code is optimal. By Lemma 1, this implies p_i = p_{m−1}. Therefore, C_i and C_{m−1} can be exchanged without changing the expected length.
• Theorem: Huffman's algorithm produces an optimal code tree.
• Proof of optimality of the Huffman code:
  - Let l(x) be the length function of an optimal code. By Lemmas 1 and 2, C_{m−1} and C_m are siblings and the longest codewords.
  - Let p'_1 ≥ p'_2 ≥ ... ≥ p'_{m−1} denote the ordered probabilities after merging p_{m−1} and p_m, and let l'(x') be the length function of the resulting code for this new distribution. (Note that the new distribution has support of size m − 1.)
  - Let E[l(X)] be the expected length of the original code and E[l'(X')] the expected length of the reduced code. Then
        E[l(X)] = E[l'(X')] + (p_{m−1} + p_m)
    where p_{m−1} + p_m is the probability of the merged symbol.
  - Thus, l(x) is the length function of an optimal code if and only if l'(x') is the length function of an optimal code. Therefore, we have reduced the problem to finding an optimal code tree for p'_1, ..., p'_{m−1}. Again, merge, and continue the process.
  - Thus, the Huffman algorithm yields an optimal code in a greedy fashion (there may be other optimal codes).

5.5 Coding Over Blocks

• Let X_1, X_2, ... be an iid source with finite alphabet X. This is known as a discrete memoryless source.
• One issue with symbol codes is that there is a penalty for using integer codeword lengths.
• Example: Suppose that X_1, X_2, ... are iid Bernoulli(p) with p very small. The optimal code is given by
        C(x) = 0 if x = 0,   and   C(x) = 1 if x = 1

  The expected length is E[l(X)] = 1, but the entropy obeys
        H(X) = H_b(p) ≈ p log(1/p) → 0   as p → 0
• We can overcome the integer effects by coding over blocks of input symbols.
  - Group inputs into blocks of size n to create a new source X'_1, X'_2, ... where
        X'_1 = [X_1, X_2, ..., X_n]
        X'_2 = [X_{n+1}, X_{n+2}, ..., X_{2n}]
        ...
        X'_i = [X_{(i−1)n+1}, X_{(i−1)n+2}, ..., X_{in}]
  - Each length-n vector can be viewed as a symbol from the alphabet X' = X^n. This new source alphabet has size |X|^n.
  - The new probabilities are given by
        p'(x') = Π_{k=1}^{n} p(x_k)
  - The entropy of the new source distribution is
        H(X') = H(X_1, X_2, ..., X_n) = n H(X)
  - The expected length of the optimal code for the source distribution p'(x') obeys
        n H(X) ≤ E[l'(X')] < n H(X) + 1
    since H(X') = n H(X).
• To encode the source X_1, X_2, ..., it is sufficient to encode the new source X'_1, X'_2, .... If we use a prefix-free code, then once the codeword C(X'_1) is received, we can decode X'_1, and thus recover the first n source symbols X_1, ..., X_n.
• The expected codeword length per source symbol is given by the expected codeword length E[l'(X')] per block, normalized by the block length. It obeys
        H(X) ≤ (1/n) E[l'(X')] < H(X) + 1/n
• Thus, the integer effects become negligible as we increase the block length! However, by coding over an input block of length n we have introduced delay into the system. Furthermore, we have increased the complexity of the code.
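The shrinking integer penalty is easy to see numerically. The sketch below (an added illustration) applies Shannon-type lengths l(x') = ⌈log2(1/p'(x'))⌉, which exist by the Kraft argument above, to blocks of an iid Bernoulli(p) source; the per-symbol expected length approaches H_b(p) as the block length n grows.

```python
# Per-symbol cost of ceiling-of-log lengths applied to blocks of a Bernoulli(p) source.
import itertools, math

p = 0.05
Hb = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for n in [1, 2, 4, 8]:
    exp_len = 0.0
    for block in itertools.product([0, 1], repeat=n):
        prob = p ** sum(block) * (1 - p) ** (n - sum(block))
        exp_len += prob * math.ceil(math.log2(1 / prob))   # Shannon-type length
    print(f"n = {n}: E[l']/n = {exp_len / n:.3f}   (entropy H_b(p) = {Hb:.3f})")
```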

5.6 Coding with Unknown Distributions

5.6.1 Minimax Redundancy

• Suppose X is drawn according to a distribution p_θ(x) with unknown parameter θ belonging to a set Θ.
• If θ is known, then we can construct a code that achieves the optimal expected length
        Σ_x p_θ(x) l(x) = H(p_θ)
• The redundancy of coding a distribution p with the optimal code for a distribution q (i.e., l(x) = log(1/q(x))) is given by the actual length minus the optimal length:
        R(p, q) = Σ_x p(x) l(x) − H(p)
                = Σ_x p(x) ( log(1/q(x)) − log(1/p(x)) )
                = Σ_x p(x) log( p(x)/q(x) )
                = D(p || q)
• The minimax redundancy is defined by
        R = min_q max_{θ∈Θ} R(p_θ, q) = min_q max_{θ∈Θ} D(p_θ || q)
• Intuitively, the distribution q that leads to a code minimizing the minimax redundancy is the distribution at the center of the information ball of radius R.
• Minimax Theorem: Let f(x, y) be a continuous function that is convex in x and concave in y, and let X and Y be compact convex sets. Then
        min_{x∈X} max_{y∈Y} f(x, y) = max_{y∈Y} min_{x∈X} f(x, y)
  This is a classic result in game theory. There are many extensions, such as Sion's minimax theorem, which applies when f(x, y) is quasi-convex-concave and at least one of the sets is compact.
• Recall that D(p || q) is convex in the pair (p, q), i.e., for all λ ∈ [0, 1],
        D( λp_1 + (1−λ)p_2 || λq_1 + (1−λ)q_2 ) ≤ λ D(p_1 || q_1) + (1−λ) D(p_2 || q_2)
• Let Π be the set of all distributions on θ. Note that Π is a convex set, i.e., for all λ ∈ [0, 1],
        π_1, π_2 ∈ Π ⇒ λπ_1 + (1−λ)π_2 ∈ Π
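A small worked example (added; not from the notes) makes the identity R(p, q) = D(p || q) concrete, using the idealized non-integer lengths l(x) = log2(1/q(x)).

```python
# Redundancy of compressing X ~ p with lengths designed for a mismatched q.
import math

p = [0.5, 0.25, 0.25]          # true source distribution
q = [1/3, 1/3, 1/3]            # assumed (mismatched) distribution

H_p = -sum(px * math.log2(px) for px in p)
avg_len = sum(px * math.log2(1 / qx) for px, qx in zip(p, q))
kl = sum(px * math.log2(px / qx) for px, qx in zip(p, q))

print(f"H(p) = {H_p:.3f}, E[l] = {avg_len:.3f}, D(p||q) = {kl:.3f}")
print("E[l] - H(p) =", round(avg_len - H_p, 3))   # equals D(p||q)
```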

• Since the maximum of a convex function over a convex set is attained at an extreme point of the set, the maximum of R(p_θ, q) with respect to θ ∈ Θ is equal to the worst-case expectation with respect to π ∈ Π:
        max_{θ∈Θ} D(p_θ || q) = max_{π∈Π} Σ_{θ∈Θ} π(θ) D(p_θ || q)
• This means that the minimax redundancy can be expressed equivalently as
        R = min_q max_{π∈Π} Σ_{θ∈Θ} π(θ) D(p_θ || q)
• Note that the objective is linear (and hence both convex and concave) in π and convex in q. Applying the minimax theorem yields
        R = max_{π∈Π} min_q Σ_{θ∈Θ} π(θ) D(p_θ || q)
• For each distribution π we want to find the optimal distribution q. As an educated guess, consider the distribution induced on x by p_θ when θ is drawn according to π, i.e.,
        q_π(x) = Σ_{θ∈Θ} π(θ) p_θ(x)
• To see that q_π achieves the minimum, observe that
        Σ_θ π(θ) D(p_θ || q)
          = Σ_θ π(θ) D(p_θ || q) − D(q_π || q) + D(q_π || q)
          = Σ_θ Σ_x π(θ) p_θ(x) log( p_θ(x)/q(x) ) − Σ_x q_π(x) log( q_π(x)/q(x) ) + D(q_π || q)
          = Σ_θ Σ_x π(θ) p_θ(x) [ log( p_θ(x)/q(x) ) − log( q_π(x)/q(x) ) ] + D(q_π || q)
          = Σ_θ Σ_x π(θ) p_θ(x) log( p_θ(x)/q_π(x) ) + D(q_π || q)
          = Σ_θ π(θ) D(p_θ || q_π) + D(q_π || q)
  where the third equality uses q_π(x) = Σ_θ π(θ) p_θ(x). Since D(q_π || q) is nonnegative and equal to zero if and only if q = q_π, we see that q_π is the optimal distribution.
• To make the expression more interpretable, consider the notation
        p(θ) = π(θ),   p(x | θ) = p_θ(x),   p(x) = q_π(x)
  Then we have shown that the minimax redundancy can be expressed as
        R = max_{p(θ)} Σ_θ Σ_x p(θ) p(x | θ) log( p(x | θ) / p(x) ) = max_{p(θ)} I(θ; X)
• Theorem: The minimax redundancy is equal to the maximum mutual information between the parameter θ and the source X.
• In other words, the code which minimizes the minimax redundancy has length function l(x) = log(1/p(x)), where p(x) is the distribution of X ~ p(· | θ) when θ is drawn according to the distribution that maximizes the mutual information I(θ; X).

5.6.2 Coding with Unknown Alphabet

• We want to compress integers x ∈ Z without specifying a probability distribution:
        c(x) = ??
• First consider the setting where we have an upper bound N on the integer, and thus X = {1, 2, ..., N}. We can simply send ⌈log N⌉ bits. For example, if N = 7, then we send three bits per integer:
        3, 7 → c(3)c(7) = 011 111
• To analyze the minimax redundancy of this approach, consider the set of distributions
        p_θ(x) = 1 if x = θ and 0 if x ≠ θ,        X = Θ = {1, 2, ..., N}
  The minimax redundancy is given by
        R = max_{p(θ)} I(θ; X) = max_{p(x)} H(X) = log N
  since the uniform distribution maximizes entropy on a finite set.
• But we want the code to be universal and work for any integer.
• A unary code sends a sequence of x − 1 zeros followed by a 1 to mark the end of the codeword. For example,
        3, 7 → c(3)c(7) = 001 0000001
  The unary code requires x bits to represent each symbol x. This seems wasteful.
• Idea: First use a unary code to describe how many bits are needed for the binary code, and then send the binary code:
        c_universal(x) = ( c_unary(l_binary(x)), c_binary(x) )
  For example, suppose we want to compress 9. The binary code is c_binary(9) = 1001, and its length is l_binary(9) = 4, so the universal code is
        c_universal(9) = 0001 1001
  where 0001 is the header and 1001 is the number. This universal code requires about log2(x) + log2(x) = 2 log2(x) bits.
• In fact, we can do better by repeating the process, using the universal code itself to compress the header:
        c^(2)_universal(x) = ( c^(1)_universal(l_binary(x)), c_binary(x) )
  The number of bits this scheme requires obeys
        l^(2)_universal(x) = l_binary(x) + l^(1)_universal(l_binary(x)) ≤ log2(x) + 2 log2(log2(x)) + 4
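The two-part scheme just described is essentially an Elias-gamma-style code. The sketch below (added for illustration; the helper names mirror the notation above but the code itself is not from the notes) reproduces the example c_universal(9) = 0001 1001.

```python
# Two-part universal code: unary header giving the bit length, then the binary digits.
def c_binary(x):
    return format(x, "b")                    # e.g. 9 -> '1001'

def c_unary(k):
    return "0" * (k - 1) + "1"               # k -> (k-1) zeros followed by a one

def c_universal(x):
    b = c_binary(x)
    return c_unary(len(b)) + b               # header + number

for x in [1, 3, 9, 100]:
    cw = c_universal(x)
    print(f"x = {x:3d}: {cw}   ({len(cw)} bits, roughly 2*log2(x))")
```

A decoder reads zeros until it sees a 1 (recovering the length), then reads that many bits, so the code is prefix-free.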

• It is interesting to note that this length function obeys the Kraft inequality. Thus, the length function may be viewed as a universal prior
        p_universal(x) = 2^{−l^(2)_universal(x)} ≈ 1 / ( x (log2 x)^2 )        (5.1)
  Recall that Σ_n 1/(n (log n)^p) diverges for p = 1 but converges for p > 1.
• It is also interesting to note that the entropy of this distribution is infinite:
        H(p_universal) = Σ_x ( 1 / (x (log2 x)^2) ) log2( x (log2 x)^2 ) = +∞
• In this case, we have θ = X, and so the minimax redundancy corresponds to a distribution which maximizes I(X; X) = H(X).

5.6.3 Lempel-Ziv Code

• Lempel-Ziv (LZ) codes are a key component of many modern data compression algorithms, including:
  - compress, gzip, pkzip, and the ZIP file format
  - Graphics Interchange Format (GIF)
  - Portable Document Format (PDF)
• There are several variations; some are patented, some are not.
• Basic idea: Compress a string using references to its past. The more redundant the string, the better this process works.
• Roughly speaking, Lempel-Ziv codes are optimal in two senses:
  (1) They approach the entropy rate if the source is generated from a stationary and ergodic distribution.
  (2) They are competitive against all possible finite-state machines.
• There are two different variations, LZ77 and LZ78. The gzip algorithm uses LZ77 followed by a Huffman code.
• Construction of the Lempel-Ziv code:
  - Input: a string of source symbols x_1, x_2, x_3, ...
  - Output: a sequence of codewords c(x_1), c(x_2), c(x_3), ...
  - Assume that we have compressed the string from x_1 to x_{i−1}. The goal is to find the longest possible match between the next symbols (the new bits) and a sequence in the previously seen string (the previous bits). In other words, we want to find the largest integer k such that
        [ x_i, x_{i+1}, ..., x_{i+k} ] = [ x_j, x_{j+1}, ..., x_{j+k} ]   for some j ≤ i − 1
  - Thus, this matching phrase can be represented by a pointer to the index i − j and its length k.
  - If no match is found, we send the next symbol uncompressed. Use a flag to distinguish the two cases:
        match found → send (1, pointer, length)
        no match    → send (0, x_i)
• Example: Compress the following sequence with window size W = 4.
        Sequence:       ABBABBBAABBBA
        Parsed string:  A, B, B, ABB, BA, ABBBA
        Output:         (0, A), (0, B), (1, 1, 1), (1, 3, 3), (1, 4, 2), (1, 5, 5)
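The greedy parsing can be sketched in a few lines (added for illustration; this is not a reference implementation, and real LZ77 variants differ in window handling, pointer encoding, and tie-breaking). For simplicity, the sketch searches the entire past rather than a length-W window, which reproduces the parse in the example above.

```python
# Greedy LZ77-style parsing with an unbounded window.
def lz77_parse(s):
    i, out = 0, []
    while i < len(s):
        best_len, best_ptr = 0, 0
        for j in range(i):                         # candidate match start positions
            k = 0
            while i + k < len(s) and s[j + k] == s[i + k]:
                k += 1                             # overlapping matches are allowed
            if k > best_len:
                best_len, best_ptr = k, i - j
        if best_len > 0:
            out.append((1, best_ptr, best_len))    # (flag, pointer, match length)
            i += best_len
        else:
            out.append((0, s[i]))                  # (flag, uncompressed symbol)
            i += 1
    return out

print(lz77_parse("ABBABBBAABBBA"))
# [(0, 'A'), (0, 'B'), (1, 1, 1), (1, 3, 3), (1, 4, 2), (1, 5, 5)]
```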

• Theorem: If a process X_1, X_2, ... is stationary and ergodic, then the per-symbol expected codeword length of the Lempel-Ziv code asymptotically achieves the entropy rate of the source.
• Proof sketch:
  - Assume an infinite window size. Assume that we only consider matches of exactly length n, and that the sequence has been running long enough that all possible strings of length n have occurred previously.
  - Given a new sequence of length n, how far back in time must we look to find a match? The return time is defined by
        R_n(X_1, X_2, ..., X_n) = min{ j ≥ 1 : [X_{1−j}, X_{2−j}, ..., X_{n−j}] = [X_1, X_2, ..., X_n] }
  - Using the universal integer code, we can describe R_n with log2 R_n + 2 log2 log2 R_n + 4 bits. Thus, the expected per-symbol length of our code is given by
        (1/n) E[ log2 R_n + 2 log2 log2 R_n + 4 ]
  - Observe that if the sequence is iid, then the return time of a sequence x_1^n is geometrically distributed with success probability p(x_1^n), and thus the expected wait time is
        E[ R_n(X_1^n) | X_1^n = x_1^n ] = 1 / p(x_1^n)
  - Kac's Lemma: If X_1, X_2, ... is a stationary ergodic process, then
        E[ R_n(X_1^n) | X_1^n = x_1^n ] = 1 / p(x_1^n)
  - To conclude the proof, we use Jensen's inequality:
        E[ log R_n ] = E_{X_1^n}[ E[ log R_n(X_1^n) | X_1^n ] ]
                     ≤ E_{X_1^n}[ log( E[ R_n(X_1^n) | X_1^n ] ) ]
                     = E_{X_1^n}[ log( 1 / p(X_1^n) ) ]
                     = H(X_1, ..., X_n)
  - By the AEP for stationary ergodic processes, (1/n) H(X_1, ..., X_n) converges to the entropy rate of the source.
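A small simulation (added; not from the notes) illustrates the return-time argument for an iid Bernoulli(p) source: the average description cost (1/n) E[log2 R_n] comes out below the per-symbol entropy H_b(p), consistent with the Jensen-inequality step above.

```python
# Estimate E[log2 R_n]/n for an iid Bernoulli(p) source by searching backwards
# for the most recent previous occurrence of the last n symbols.
import math, random

random.seed(0)
p, n, L, trials = 0.3, 5, 20000, 300
Hb = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

avg_log_return = 0.0
for _ in range(trials):
    seq = [int(random.random() < p) for _ in range(L)]
    i = L - n                       # the block X_1^n we want to describe
    target = seq[i:]
    j = 1
    while j <= i and seq[i - j:i - j + n] != target:
        j += 1                      # R_n = how far back the block last occurred
    avg_log_return += math.log2(j) / trials

print(f"(1/n) E[log2 R_n] ~ {avg_log_return / n:.3f}   vs   H_b(p) = {Hb:.3f}")
```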