ECE 587 / STA 563: Lecture 5 — Lossless Compression
Information Theory, Duke University, Fall 2017
Author: Galen Reeves. Last Modified: October 18, 2017

Outline of lecture:
5.1 Introduction to Lossless Source Coding
    5.1.1 Motivating Example
    5.1.2 Definitions
5.2 Fundamental Limits of Compression
    5.2.1 Uniquely Decodable Codes & Kraft Inequality
    5.2.2 Codes on Trees
    5.2.3 Prefix-Free Codes & Kraft Inequality
5.3 Shannon Code
5.4 Huffman Code
    5.4.1 Optimality of Huffman Code
5.5 Coding Over Blocks
5.6 Coding with Unknown Distributions
    5.6.1 Minimax Redundancy
    5.6.2 Coding with Unknown Alphabet
    5.6.3 Lempel-Ziv Code

5.1 Introduction to Lossless Source Coding

5.1.1 Motivating Example

Example: Consider assigning binary phone numbers to your friends.

    friend   probability   code (i)   code (ii)   code (iii)   code (iv)   code (v)   code (vi)
    Alice    1/4           0011       001101      0            00          0          10
    Bob      1/2           0011       001110      1            11          11         0
    Carol    1/4           1100       110000      10           10          10         11

Analysis of codes:
(i) Alice and Bob have the same number. Does not work.
(ii) Works, but the phone numbers are too long.
(iii) Not decodable: 10 could mean Carol, or could mean Bob, Alice.
(iv) Works, but why do we need two zeros for Alice? After the first zero it is clear who we want.
(v) Works, but Alice has a shorter code than Bob even though Bob is more likely.
(vi) This is the optimal code. Once you are finished dialing, you can be connected immediately.
Desirable properties of a code:
(1) Uniquely decodable.
(2) Efficient, i.e., minimizes the average codeword length

        E[l(X)] = Σ_x p(x) l(x)

    where l(x) is the length of the codeword associated with symbol x.
(3) Prefix-free, i.e., no codeword is the prefix of another codeword.

5.1.2 Definitions

A source code is a mapping C from a source alphabet X to D-ary sequences D*:
- D* is the set of finite-length strings of symbols from the D-ary alphabet D = {1, 2, ..., D}, i.e., D* = D ∪ D^2 ∪ D^3 ∪ ...
- C(x) ∈ D* is the codeword for x ∈ X
- l(x) is the length of C(x)

A code is nonsingular if x ≠ x' implies C(x) ≠ C(x').

The extension of the code C is the mapping from finite-length strings of X (the input, or source) to finite-length strings of D (the output, or code):

    C(x_1 x_2 ... x_n) = C(x_1) C(x_2) ... C(x_n)

A code C is uniquely decodable if its extension is nonsingular, i.e., for all n,

    (x_1, ..., x_n) ≠ (x'_1, ..., x'_n)  ⟹  C(x_1) C(x_2) ... C(x_n) ≠ C(x'_1) C(x'_2) ... C(x'_n)

A code is prefix-free if no codeword is prefixed by another codeword. Such codes are also known as prefix codes or instantaneous codes.

Venn diagram: the prefix-free (instantaneous) codes are a subset of the uniquely decodable codes, which in turn are a subset of the nonsingular codes.

Given a distribution p(x) on the input symbols, the goal is to minimize the expected length per symbol

    E[l(X)] = Σ_{x ∈ X} l(x) p(x)
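These definitions are mechanical to check. The following sketch (function names are mine, not from the notes) tests nonsingularity and the prefix-free property for codes (i), (iii), and (vi) from the motivating example:

```python
def is_nonsingular(code):
    """Nonsingular: distinct symbols receive distinct codewords."""
    return len(set(code.values())) == len(code)

def is_prefix_free(code):
    """Prefix-free: no codeword is a prefix of another.
    After sorting, any prefix relation must appear between adjacent words."""
    words = sorted(code.values())
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

code_i   = {"Alice": "0011", "Bob": "0011", "Carol": "1100"}  # singular
code_iii = {"Alice": "0",    "Bob": "1",    "Carol": "10"}    # nonsingular, not prefix-free
code_vi  = {"Alice": "10",   "Bob": "0",    "Carol": "11"}    # prefix-free

print(is_nonsingular(code_i))    # False
print(is_prefix_free(code_iii))  # False
print(is_prefix_free(code_vi))   # True
```

Note that code (iii) passes the nonsingularity test but fails the prefix-free test, matching the analysis above.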
5.2 Fundamental Limits of Compression

This section considers the limits of lossless compression and proves the following result.

Theorem: For any source distribution p(x), the expected length E[l(X)] of the optimal uniquely decodable D-ary code obeys

    H(X)/log D ≤ E[l(X)] < H(X)/log D + 1

Furthermore, there exists a prefix-free code which is optimal.

5.2.1 Uniquely Decodable Codes & Kraft Inequality

Let l(x) be the length function associated with a code C, i.e., l(x) is the length of codeword C(x) for all x ∈ X.

A code C satisfies the Kraft inequality if and only if

    Σ_{x ∈ X} D^{-l(x)} ≤ 1        (Kraft Inequality)

Theorem: Every uniquely decodable code satisfies the Kraft inequality, i.e.,

    uniquely decodable ⟹ Kraft inequality

Proof: Let C be a uniquely decodable source code. For a source sequence x^n, the length of the extended codeword C(x^n) is given by

    l(x^n) = Σ_{i=1}^n l(x_i)

and thus the length function of the extended codeword obeys

    n l_min ≤ l(x^n) ≤ n l_max

where l_min = min_{x ∈ X} l(x) and l_max = max_{x ∈ X} l(x). Let A_k be the number of source sequences of length n for which l(x^n) = k, i.e.,

    A_k = #{x^n ∈ X^n : l(x^n) = k}

Since the code is uniquely decodable, the number of source sequences with codewords of length k cannot exceed the number of D-ary sequences of length k, and so

    A_k ≤ D^k
The extended codeword lengths must obey

    Σ_{x^n ∈ X^n} D^{-l(x^n)} = Σ_{x^n ∈ X^n} Σ_k 1(l(x^n) = k) D^{-k}
                              = Σ_k ( Σ_{x^n ∈ X^n} 1(l(x^n) = k) ) D^{-k}
                              = Σ_{k = n l_min}^{n l_max} A_k D^{-k}
                              ≤ Σ_{k = n l_min}^{n l_max} D^k D^{-k}        (since uniquely decodable)
                              ≤ n l_max

The extended codeword lengths must also obey

    Σ_{x^n ∈ X^n} D^{-l(x^n)} = Σ_{x_1 ∈ X} D^{-l(x_1)} Σ_{x_2 ∈ X} D^{-l(x_2)} ... Σ_{x_n ∈ X} D^{-l(x_n)} = [ Σ_{x ∈ X} D^{-l(x)} ]^n

Combining the above displays shows that

    [ Σ_{x ∈ X} D^{-l(x)} ]^n ≤ n l_max    for all n

If the code does not satisfy the Kraft inequality, then the left-hand side blows up exponentially as n becomes large, and this inequality is violated. Thus, the code must satisfy the Kraft inequality. ∎

Theorem: For any source distribution p(x), the expected codeword length of every D-ary uniquely decodable code obeys the lower bound

    E[l(X)] ≥ H(X)/log D

Proof:

    E[l(X)] − H(X)/log D = Σ_x p(x) [ l(x) + log_D p(x) ]
                         = − Σ_x p(x) log_D ( D^{-l(x)} / p(x) )
                         ≥ log_D(e) Σ_x p(x) ( 1 − D^{-l(x)} / p(x) )        (Fundamental Inequality: ln z ≤ z − 1)
                         = log_D(e) ( 1 − Σ_x D^{-l(x)} )
                         ≥ 0        (Kraft Inequality)  ∎
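The Kraft inequality gives a quick certificate of non-decodability: if the Kraft sum exceeds 1, the theorem above says the code cannot be uniquely decodable. A small sketch (illustrative only; exact arithmetic via `Fraction`) applied to the codeword lengths from the phone-number example:

```python
from fractions import Fraction

def kraft_sum(lengths, D=2):
    """Compute the Kraft sum  sum_x D^{-l(x)}  exactly."""
    return sum(Fraction(1, D ** l) for l in lengths)

# Codeword lengths for codes (iii)-(vi) of the phone-number example
lengths = {
    "iii": [1, 1, 2],  # {0, 1, 10}:  sum = 5/4 > 1, so NOT uniquely decodable
    "iv":  [2, 2, 2],  # {00, 11, 10}: sum = 3/4 <= 1
    "v":   [1, 2, 2],  # {0, 11, 10}:  sum = 1
    "vi":  [2, 1, 2],  # {10, 0, 11}:  sum = 1
}
for name, ls in lengths.items():
    print(name, kraft_sum(ls), kraft_sum(ls) <= 1)
```

Note the direction of the implication: a Kraft sum ≤ 1 does not by itself prove unique decodability of a given code (code (v)'s lengths also satisfy it), but a sum > 1 rules it out, as for code (iii).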
5.2.2 Codes on Trees

Any D-ary code can be represented as a D-ary tree. A D-ary tree consists of a root with branches, nodes, and leaves. The root and every internal node has exactly D children.

(Example of a binary tree.)

The depth of a leaf (i.e., the number of steps it takes to reach the root) corresponds to the length of the codeword.

Lemma: A code is prefix-free if and only if each of its codewords is a leaf:

    code is prefix-free ⟺ every codeword is a leaf

5.2.3 Prefix-Free Codes & Kraft Inequality

Theorem: There exists a prefix-free code with length function l(x) if and only if l(x) satisfies the Kraft inequality, i.e.,

    l(x) is the length function of a prefix-free code ⟺ Σ_x D^{-l(x)} ≤ 1

Proof of ⟹: One option is to recognize that a prefix-free code is uniquely decodable and recall that all uniquely decodable codes must satisfy the Kraft inequality. A different proof is given below.

Let l(x) be the length function of a prefix-free code and let l_max = max_x l(x). The tree corresponding to the code has maximum depth l_max, and thus the total number of leaves in the tree is upper bounded by D^{l_max}. Note that this number is attained only if all codewords are the same length.

Since the code is prefix-free, any codeword of length l(x) < l_max is a leaf, and thus prunes (i.e., removes) D^{l_max − l(x)} leaves from the total number of possible leaves. Since the number of leaves that is removed cannot exceed the total number of possible leaves, we have

    Σ_{x ∈ X} D^{l_max − l(x)} ≤ D^{l_max}
Dividing both sides by D^{l_max} yields Σ_{x ∈ X} D^{-l(x)} ≤ 1.

Proof of ⟸: Let l(x) be a length function that satisfies the Kraft inequality. The goal is to create a D-ary tree where the depths of the leaves correspond to l(x). It suffices to show that, for each integer k, after all codewords of length l(x) < k have been assigned, there remain enough unpruned nodes at level k to handle the codewords with length l(x) = k. That is, we need to show that for each k,

    D^k − Σ_{x : l(x) < k} D^{k − l(x)}  ≥  #{x : l(x) = k}

(the number of nodes remaining after assigning the short codes must be at least the number needed for codes of length k). The right-hand side can be written as

    #{x : l(x) = k} = Σ_{x : l(x) = k} D^{k − l(x)}

So, to succeed at level k we need

    D^k ≥ Σ_{x : l(x) < k} D^{k − l(x)} + Σ_{x : l(x) = k} D^{k − l(x)}

Dividing both sides by D^k yields

    1 ≥ Σ_{x : l(x) ≤ k} D^{-l(x)}

Since the lengths satisfy the Kraft inequality,

    Σ_{x : l(x) ≤ k} D^{-l(x)} ≤ Σ_{x ∈ X} D^{-l(x)} ≤ 1

And thus we have shown that there always exist enough remaining nodes to handle the codewords of length k. ∎

Theorem: For any source distribution p(x), there exists a D-ary prefix-free code whose expected length satisfies the upper bound

    E[l(X)] < H(X)/log D + 1

Proof: (This proof is nonintuitive; the next section gives an explicit construction.) By the previous theorem, it suffices to show that there exists a length function l(x) that satisfies the Kraft inequality and the stated inequality.
Consider the length function

    l(x) = ⌈ log_D (1/p(x)) ⌉

where ⌈·⌉ denotes the ceiling function (i.e., round up to the nearest integer). Then

    log_D (1/p(x)) ≤ l(x) < log_D (1/p(x)) + 1

Since

    Σ_x D^{-l(x)} ≤ Σ_x D^{log_D p(x)} = Σ_x p(x) = 1

this length function satisfies the Kraft inequality, and so there exists a prefix-free code with length function l(x). The expected codeword length is given by

    E[l(X)] = E[ ⌈ log_D (1/p(X)) ⌉ ] < E[ log_D (1/p(X)) ] + 1 = H(X)/log D + 1  ∎

5.3 Shannon Code

We now investigate how to construct codes with nice properties. These include:
- short expected code length ⟹ better compression
- prefix-free ⟹ can decode instantaneously
- efficient representation ⟹ don't need a huge lookup table for encoding and decoding

Intuitively, the key idea is to assign shorter codewords to more likely source symbols. The results of the previous section show that there exists a prefix-free code such that:
- The length function is l(x) = ⌈ log (1/p(x)) ⌉
- The expected length obeys E[l(X)] < H(X)/log D + 1

In 1948, Shannon proposed a specific way to build such a code. The resulting code is also known as the Shannon-Fano-Elias code. Without loss of generality, let the source alphabet be X = {1, 2, ..., m}. The cumulative distribution function (cdf) of the source distribution is

    F(x) = Σ_{k ≤ x} p(k)

(Illustration of cdf.)
Construction of the Shannon Code:

For x ∈ {1, 2, ..., m}, let F̄(x) be the midpoint of the interval [F(x−1), F(x)), i.e.,

    F̄(x) = (F(x−1) + F(x)) / 2 = F(x−1) + p(x)/2

Note that F̄(x) is a real number between zero and one that uniquely identifies x. The codeword C(x) corresponds to the D-ary expansion of the real number F̄(x), truncated at the point where the codeword is unique (i.e., cannot be confused with the midpoint of any other interval):

    C(x) = D-ary expansion of F̄(x) such that |C(x) − F̄(x)| < (1/2) p(x)

If l(x) terms are retained, then the codeword consists of the first l(x) digits of the expansion:

    F̄(x) = 0. z_1 z_2 ... z_{l(x)} z_{l(x)+1} z_{l(x)+2} ...
               └────── C(x) ─────┘

Note that it is sufficient to retain the first l(x) terms where

    l(x) = ⌈ log_D (1/p(x)) ⌉ + 1

since this implies that

    |C(x) − F̄(x)| ≤ D^{-l(x)} ≤ p(x)/D ≤ (1/2) p(x)

Thus, the expected length of the Shannon code obeys

    E[l(X)] < H(X)/log D + 2

Example: Consider the following binary Shannon code:

    x   p(x)   F(x)   F̄(x)    F̄(x) in binary   ⌈log 1/p(x)⌉ + 1   C(x)
    1   0.25   0.25   0.125   0.001...          3                  001
    2   0.25   0.5    0.375   0.011...          3                  011
    3   0.2    0.7    0.6     0.10011...        4                  1001
    4   0.15   0.85   0.775   0.1100011...      4                  1100
    5   0.15   1      0.925   0.11101100...     4                  1110

The entropy is H(X) ≈ 2.2855 bits and the expected length is E[l(X)] = 3.5.

Note that there is a one-to-one mapping between p(x) and C(x). In fact, one can encode and decode without representing the code explicitly.
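The construction above is direct to implement. A minimal sketch for the binary case (D = 2; function name is mine, and floating-point arithmetic is assumed accurate enough for this example) that reproduces the table:

```python
from math import ceil, log2

def shannon_code(probs):
    """Binary Shannon (Shannon-Fano-Elias) code.
    probs: dict symbol -> probability; the dict order defines the cdf."""
    code = {}
    F = 0.0                              # F(x-1), the cdf up to the previous symbol
    for x, p in probs.items():
        mid = F + p / 2                  # midpoint Fbar(x) of [F(x-1), F(x))
        l = ceil(log2(1 / p)) + 1        # number of binary digits to retain
        bits, frac = "", mid
        for _ in range(l):               # binary expansion of Fbar(x), truncated
            frac *= 2
            bit = int(frac)
            bits += str(bit)
            frac -= bit
        code[x] = bits
        F += p
    return code

probs = {1: 0.25, 2: 0.25, 3: 0.2, 4: 0.15, 5: 0.15}
print(shannon_code(probs))
# {1: '001', 2: '011', 3: '1001', 4: '1100', 5: '1110'}
```

The output matches the table, and the expected length Σ p(x) l(x) works out to 3.5 bits.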
5.4 Huffman Code

The Shannon code described in the previous section is good, but it is not necessarily optimal. Recall that the Kraft inequality is a:
- necessary condition for unique decodability
- sufficient condition for the existence of a prefix-free code

The search for the optimal code can be stated as the following optimization problem. Given p(x), find a length function l(x) that minimizes the expected length and satisfies the Kraft inequality:

    min_{l(·)}  Σ_{x ∈ X} p(x) l(x)
    s.t.        Σ_{x ∈ X} D^{-l(x)} ≤ 1,  l(x) an integer

The optimal code was discovered by David Huffman, who was a graduate student in an information theory course (1952).

Construction of the Huffman Code:
(1) Take the two least probable symbols. These are assigned the longest codewords, which have equal length and differ only in the last digit.
(2) Merge these two symbols into a new symbol with the combined probability mass and repeat.

Example: Consider the following source distribution:

    x   p(x)   code
    1   0.4    0
    2   0.15   110
    3   0.15   100
    4   0.1    101
    5   0.1    1110
    6   0.05   11110
    7   0.04   111110
    8   0.01   111111

The entropy is H(X) ≈ 2.45 bits and the expected length is E[l(X)] = 2.55.

5.4.1 Optimality of Huffman Code

Let X = {1, 2, ..., m} and let l_i = l(i), p_i = p(i), and C_i = C(i). Without loss of generality, assume the probabilities are in descending order: p_1 ≥ p_2 ≥ ... ≥ p_m.

Lemma 1: In an optimal code, shorter codewords are assigned to larger probabilities, i.e.,

    p_i > p_j ⟹ l_i ≤ l_j

Proof: Assume otherwise, that is, l_i > l_j and p_i > p_j. Then, by exchanging these codewords the expected length will decrease, and thus the code is not optimal. ∎
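The merge-and-repeat construction can be sketched with a binary heap. This is an illustrative implementation (names are mine, not from the notes); ties between equal probabilities may be broken differently than in the example table, but any Huffman code attains the same optimal expected length:

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Binary Huffman code; probs: dict symbol -> probability."""
    tiebreak = count()  # unique counter keeps heap entries comparable
    heap = [(p, next(tiebreak), {x: ""}) for x, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least probable nodes
        p2, _, c2 = heapq.heappop(heap)
        merged = {x: "0" + w for x, w in c1.items()}   # extend codewords by one digit
        merged.update({x: "1" + w for x, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

probs = {1: 0.4, 2: 0.15, 3: 0.15, 4: 0.1, 5: 0.1, 6: 0.05, 7: 0.04, 8: 0.01}
code = huffman_code(probs)
avg = sum(p * len(code[x]) for x, p in probs.items())
print(avg)  # 2.55, matching the example
```

The particular codewords depend on how ties are broken during the merges, but the expected length is always 2.55 bits for this distribution, by the optimality argument below.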
Lemma 2: There exists an optimal code for which the codewords assigned to the two smallest probabilities are siblings (i.e., they have the same length and differ only in the last symbol).

Proof: Consider any optimal code. By Lemma 1, codeword C_m has the longest length. Assume, for the sake of contradiction, that its sibling is not a codeword. Then the expected length can be decreased by moving C_m to its parent. Thus, the code is not optimal and a contradiction is reached.

Now we know the sibling of C_m is a codeword. If it is C_{m−1}, we are done. Otherwise, it is some C_i with i ≠ m−1. By Lemma 1, this implies p_i = p_{m−1}. Therefore, C_i and C_{m−1} can be exchanged without changing the expected length. ∎

Theorem: Huffman's algorithm produces an optimal code tree.

Proof of optimality of the Huffman code: Let l(x) be the length function of an optimal code. By Lemmas 1 and 2, we may take C_{m−1} and C_m to be siblings and the longest codewords. Let p̃_1 ≥ p̃_2 ≥ ... ≥ p̃_{m−1} denote the ordered probabilities after merging p_{m−1} and p_m, and let l̃(x̃) be the length function of the resulting code for this new distribution. (Note that the new distribution has support of size m − 1.) Let E[l(X)] be the expected length of the original code and E[l̃(X̃)] the expected length of the reduced code. Then

    E[l(X)] = E[l̃(X̃)] + (p_{m−1} + p_m)

since the merged symbol (with probability p_{m−1} + p_m) is the only one whose codeword is one symbol longer in the original code. Since this difference does not depend on the code, l(x) is the length function of an optimal code if and only if l̃(x̃) is the length function of an optimal code. Therefore, we have reduced the problem to finding an optimal code tree for p̃_1, ..., p̃_{m−1}. Again merge the two least probable symbols, and continue the process.

Thus, the Huffman algorithm yields an optimal code in a greedy fashion (there may be other optimal codes). ∎

5.5 Coding Over Blocks

Let X_1, X_2, ... be an iid source with finite alphabet X. This is known as a discrete memoryless source.

One issue with symbol codes is that there is a penalty for using integer codeword lengths.
Example: Suppose that X_1, X_2, ... are iid Bernoulli(p) with p very small. The optimal code is given by

    C(x) = { 0,  x = 0
           { 1,  x = 1
The expected length is E[l(X)] = 1, but the entropy obeys

    H(X) = H_b(p) ≈ p log(1/p) → 0    as p → 0

We can overcome the integer effects by coding over blocks of input symbols. Group the inputs into blocks of size n to create a new source X̃_1, X̃_2, ... where

    X̃_1 = [X_1, X_2, ..., X_n]
    X̃_2 = [X_{n+1}, X_{n+2}, ..., X_{2n}]
    ...
    X̃_i = [X_{(i−1)n+1}, X_{(i−1)n+2}, ..., X_{in}]

Each length-n vector can be viewed as a symbol from the alphabet X̃ = X^n. This new source alphabet has size |X|^n. The new probabilities are given by

    p̃(x̃) = Π_{k=1}^n p(x_k)

The entropy of the new source distribution is

    H(X̃) = H(X_1, X_2, ..., X_n) = n H(X)

The expected length of the optimal code for the source distribution p̃(x̃) obeys

    n H(X) ≤ E[l(X̃)] < n H(X) + 1

where n H(X) = H(X̃). To encode the source X_1, X_2, ... it is sufficient to encode the new source X̃_1, X̃_2, .... If we use a prefix-free code, then once the codeword C(X̃_1) is received, we can decode X̃_1 and thus recover the first n source symbols X_1, ..., X_n.

The expected codeword length per source symbol is given by the expected codeword length E[l(X̃)] per block, normalized by the block length. It obeys

    H(X) ≤ (1/n) E[l(X̃)] < H(X) + 1/n

Thus, the integer effects become negligible as we increase the block length! However, by coding over an input block of length n we have introduced delay into the system. Furthermore, we have increased the complexity of the code.
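The decay of the integer-length penalty can be checked numerically. The sketch below (my own illustration, not from the notes) uses the suboptimal but sufficient Shannon lengths l(x̃) = ⌈log_2(1/p̃(x̃))⌉, which satisfy the Kraft inequality, so the per-symbol expected length lies in [H(X), H(X) + 1/n):

```python
from math import ceil, log2
from itertools import product

def per_symbol_length(p, n):
    """Expected per-symbol Shannon-code length for blocks of n iid Bernoulli(p) symbols."""
    total = 0.0
    for block in product([0, 1], repeat=n):
        prob = 1.0
        for b in block:
            prob *= p if b else (1 - p)
        total += prob * ceil(log2(1 / prob))   # Shannon length of this block
    return total / n

p = 0.01
H = -(p * log2(p) + (1 - p) * log2(1 - p))    # binary entropy, about 0.081 bits
for n in (1, 2, 4, 8):
    print(n, per_symbol_length(p, n))
```

For p = 0.01 the per-symbol length drops from more than 1 bit at n = 1 toward the entropy as n grows, illustrating how blocking absorbs the rounding penalty.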
5.6 Coding with Unknown Distributions

5.6.1 Minimax Redundancy

Suppose X is drawn according to a distribution p_θ(x) with unknown parameter θ belonging to a set Θ. If θ is known, then we can construct a code that achieves the optimal expected length

    Σ_x p_θ(x) l(x) = H(p_θ)

The redundancy of coding a distribution p with the optimal code for a distribution q (i.e., l(x) = log(1/q(x))) is the actual length minus the optimal length:

    R(p, q) = Σ_x p(x) l(x) − H(p)
            = Σ_x p(x) [ log(1/q(x)) − log(1/p(x)) ]
            = Σ_x p(x) log( p(x)/q(x) )
            = D(p ‖ q)

The minimax redundancy is defined by

    R* = min_q max_{θ ∈ Θ} R(p_θ, q) = min_q max_{θ ∈ Θ} D(p_θ ‖ q)

Intuitively, the distribution q that minimizes the maximum redundancy is the distribution at the center of the information ball of radius R*.

Minimax Theorem: Let f(x, y) be a continuous function that is convex in x and concave in y, and let X and Y be compact convex sets. Then:

    min_{x ∈ X} max_{y ∈ Y} f(x, y) = max_{y ∈ Y} min_{x ∈ X} f(x, y)

This is a classic result in game theory. There are many extensions, such as Sion's minimax theorem, which applies when f(x, y) is quasi-convex-concave and at least one of the sets is compact.

Recall that D(p ‖ q) is convex in the pair (p, q), i.e., for all λ ∈ [0, 1],

    D(λ p_1 + (1−λ) p_2 ‖ λ q_1 + (1−λ) q_2) ≤ λ D(p_1 ‖ q_1) + (1−λ) D(p_2 ‖ q_2)

Let Π be the set of all distributions on θ. Note that Π is a convex set, i.e., for all λ ∈ [0, 1],

    π_1, π_2 ∈ Π ⟹ λ π_1 + (1−λ) π_2 ∈ Π
Since the maximum of a convex function over a convex set is attained at an extreme point of the set, the maximum of R(p_θ, q) with respect to θ ∈ Θ is equal to the worst-case expectation with respect to π ∈ Π:

    max_{θ ∈ Θ} D(p_θ ‖ q) = max_{π ∈ Π} Σ_{θ ∈ Θ} π(θ) D(p_θ ‖ q)

This means that the minimax redundancy can be expressed equivalently as

    R* = min_q max_{π ∈ Π} Σ_{θ ∈ Θ} π(θ) D(p_θ ‖ q)

Note that the objective is linear (and hence both convex and concave) in π and convex in q. Applying the minimax theorem yields:

    R* = max_{π ∈ Π} min_q Σ_{θ ∈ Θ} π(θ) D(p_θ ‖ q)

For each distribution π we want to find the optimal distribution q. As an educated guess, consider the distribution induced on x by p_θ when θ is drawn according to π, i.e.,

    q_π(x) = Σ_{θ ∈ Θ} π(θ) p_θ(x)

To see that q_π achieves the minimum, observe that

    Σ_θ π(θ) D(p_θ ‖ q) = Σ_θ Σ_x π(θ) p_θ(x) log( p_θ(x)/q(x) )
                        = Σ_θ Σ_x π(θ) p_θ(x) [ log( p_θ(x)/q_π(x) ) + log( q_π(x)/q(x) ) ]
                        = Σ_θ Σ_x π(θ) p_θ(x) log( p_θ(x)/q_π(x) ) + D(q_π ‖ q)

Since D(q_π ‖ q) is nonnegative and equal to zero if and only if q = q_π, we see that q_π is the optimal distribution.

To make the expression more interpretable, adopt the notation

    p(θ) = π(θ),   p(x | θ) = p_θ(x),   p(x) = q_π(x)

Then we have shown that the minimax redundancy can be expressed as

    R* = max_{p(θ)} Σ_θ Σ_x p(θ) p(x | θ) log( p(x | θ) / p(x) ) = max_{p(θ)} I(θ; X)

Theorem: The minimax redundancy is equal to the maximum mutual information between the parameter θ and the source X.

In other words, the code which minimizes the minimax redundancy has length function l(x) = log(1/p(x)), where p(x) is the distribution of X ~ p(· | θ) when θ is drawn according to the distribution that maximizes the mutual information I(θ; X).
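The maximization max_{p(θ)} I(θ; X) is exactly a channel-capacity computation, with p(x|θ) playing the role of the channel. It can be carried out numerically with the classical Blahut-Arimoto iteration, which is not covered in these notes; the sketch below is an illustration under the assumption that Θ and X are finite, with function names of my own choosing:

```python
from math import log2

def blahut_arimoto(channel, iters=500):
    """channel[t][x] = p(x | theta=t). Returns (capacity in bits, optimal prior)."""
    T, X = len(channel), len(channel[0])
    r = [1.0 / T] * T                                    # prior pi(theta), start uniform
    for _ in range(iters):
        # induced output distribution q_pi(x)
        q = [sum(r[t] * channel[t][x] for t in range(T)) for x in range(X)]
        # D(p_theta || q) in bits, with the convention 0 log 0 = 0
        d = [sum(p * log2(p / q[x]) for x, p in enumerate(channel[t]) if p > 0)
             for t in range(T)]
        w = [r[t] * 2 ** d[t] for t in range(T)]         # multiplicative update
        s = sum(w)
        r = [v / s for v in w]
    q = [sum(r[t] * channel[t][x] for t in range(T)) for x in range(X)]
    cap = sum(r[t] * sum(p * log2(p / q[x]) for x, p in enumerate(channel[t]) if p > 0)
              for t in range(T))
    return cap, r

cap, prior = blahut_arimoto([[0.9, 0.1], [0.1, 0.9]])
print(cap)  # about 0.531 bits, i.e., 1 - H_b(0.1)
```

For the point-mass family of the next subsection the channel matrix is the identity, and the computed capacity is log N, matching the direct argument given there.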
5.6.2 Coding with Unknown Alphabet

We want to compress integers x ∈ {1, 2, 3, ...} without specifying a probability distribution: c(x) = ??

First consider the setting where we have an upper bound N on the integer, so that X = {1, 2, ..., N}. We can simply send ⌈log N⌉ bits per integer. For example, if N = 7, then we send three bits per integer:

    3, 7 ⟹ c(3) c(7) = 011 111

To analyze the minimax redundancy of this approach, consider the set of distributions

    p_θ(x) = { 1,  x = θ
             { 0,  x ≠ θ        where X = Θ = {1, 2, ..., N}

The minimax redundancy is given by

    R* = max_{p(θ)} I(θ; X) = max_{p(x)} H(X) = log N

since the uniform distribution maximizes entropy on a finite set.

But we want the code to be universal and work for any integer. A unary code sends x − 1 zeros followed by a 1 to mark the end of the codeword. For example,

    3, 7 ⟹ c(3) c(7) = 001 0000001

The unary code requires x bits to represent the symbol x. This seems wasteful.

Idea: First use a unary code to describe how many bits are needed for the binary code, and then send the binary code:

    c_universal(x) = ( c_unary(l_binary(x)), c_binary(x) )

For example, suppose we want to compress 9. The binary code is c_binary(9) = 1001, whose length is l_binary(9) = 4, so the universal code is

    c_universal(9) = 0001 1001        (header, then the number)

This universal code requires approximately log_2(x) + log_2(x) = 2 log_2(x) bits. In fact, we can do better by repeating the process, using the universal code itself to compress the length of the binary code:

    c^(2)_universal(x) = ( c^(1)_universal(l_binary(x)), c_binary(x) )

The number of bits this scheme requires obeys

    l^(2)_universal(x) = log_2(x) + 2 log_2(⌈log_2(x)⌉) + O(1) ≤ log_2(x) + 2 log_2(log_2(x)) + 4
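The first-level universal code is easy to implement and to invert, since the unary header is self-terminating and then announces exactly how many binary digits follow. A minimal sketch (names are mine; integers are assumed ≥ 1):

```python
def unary(n):
    """Unary code: n-1 zeros followed by a terminating 1."""
    return "0" * (n - 1) + "1"

def encode(x):
    b = bin(x)[2:]                 # binary expansion, e.g. 9 -> "1001"
    return unary(len(b)) + b       # header announces the binary length

def decode(bits):
    """Decode one integer from the front of a bit string; return (value, rest)."""
    i = bits.index("1")            # end of the unary header
    length = i + 1                 # header value = binary length
    start = i + 1
    return int(bits[start:start + length], 2), bits[start + length:]

print(encode(9))  # 00011001, matching the example above
```

Because the code is prefix-free, a concatenated stream of codewords can be decoded left to right with no separators, which is what makes it usable inside other schemes (such as describing match distances below).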
It is interesting to note that this length function obeys the Kraft inequality. Thus, the length function may be viewed as defining a universal prior

    p_universal(x) = 2^{−l^(2)_universal(x)} ≈ 1 / ( x (log_2 x)^2 )        (5.1)

Recall that Σ_n 1/(n (log n)^p) diverges for p = 1 but converges for p > 1, so the normalization above is what makes the Kraft sum finite.

It is also interesting to note that the entropy of this distribution is infinite:

    H(p_universal) ≈ Σ_{x ≥ 2} ( 1 / (x (log x)^2) ) log( x (log x)^2 ) = +∞

In this case, we have θ = X, and so the minimax redundancy corresponds to a distribution which maximizes I(X; X) = H(X).

5.6.3 Lempel-Ziv Code

Lempel-Ziv (LZ) codes are a key component of many modern data compression algorithms, including:
- compress, gzip, pkzip, and the ZIP file format
- Graphics Interchange Format (GIF)
- Portable Document Format (PDF)

There are several variations; some are patented, some are not.

Basic idea: Compress a string using references to its past. The more redundant the string, the better this process works. Roughly speaking, Lempel-Ziv codes are optimal in two senses:
(1) They approach the entropy rate if the source is generated from a stationary and ergodic distribution.
(2) They are competitive against all possible finite-state machines.

There are two main variations, LZ77 and LZ78. The gzip algorithm uses LZ77 followed by a Huffman code.

Construction of the Lempel-Ziv code:
- Input: a string of source symbols x_1, x_2, x_3, ...
- Output: a sequence of codewords c(x_1), c(x_2), c(x_3), ...

Assume that we have compressed the string from x_1 to x_{i−1}. The goal is to find the longest possible match between the next symbols (the new bits) and a subsequence of the previous symbols (the previous bits). In other words, we want to find the largest integer k such that

    [x_i, x_{i+1}, ..., x_{i+k}] = [x_j, x_{j+1}, ..., x_{j+k}]        for some j ≤ i − 1

This matching phrase can then be represented by a pointer to the offset i − j and its length k.
If no match is found, we send the next symbol uncompressed. A flag distinguishes the two cases:

    match found ⟹ send (1, pointer, length)
    no match    ⟹ send (0, x_i)

Example: Compress the following sequence with window size W = 4:

    Sequence:      ABBABBBAABBBA
    Parsed string: A, B, B, ABB, BA, ABBBA
    Output:        (0, A), (0, B), (1, 1, 1), (1, 3, 3), (1, 4, 2), (1, 5, 5)

Theorem: If a process X_1, X_2, ... is stationary and ergodic, then the per-symbol expected codeword length of the Lempel-Ziv code asymptotically achieves the entropy rate of the source.

Proof sketch: Assume an infinite window size. Assume that we only consider matches of exactly length n, and that the sequence has been running long enough that all possible strings of length n have occurred previously.

Given a new sequence of length n, how far back in time must we look to find a match? The return time is defined by

    R_n(X_1, X_2, ..., X_n) = min{ j ≥ 1 : [X_{1−j}, X_{2−j}, ..., X_{n−j}] = [X_1, X_2, ..., X_n] }

Using the universal integer code, we can describe R_n with log_2 R_n + 2 log_2 log_2 R_n + 4 bits. Thus, the expected per-symbol length of our code is given by

    (1/n) E[ log_2 R_n + 2 log_2 log_2 R_n + 4 ]

Observe that if the sequence is iid, then the return time of a sequence x_1^n is geometrically distributed with success probability p(x_1^n), and thus the expected wait time is

    E[ R_n(X_1^n) | X_1^n = x_1^n ] = 1 / p(x_1^n)

Kac's Lemma: If X_1, X_2, ... is a stationary ergodic process, then

    E[ R_n(X_1^n) | X_1^n = x_1^n ] = 1 / p(x_1^n)

To conclude the proof, we use Jensen's inequality:

    E[log R_n] = E_{X_1^n}[ E[ log R_n(X_1^n) | X_1^n ] ]
               ≤ E_{X_1^n}[ log E[ R_n(X_1^n) | X_1^n ] ]
               = E_{X_1^n}[ log( 1 / p(X_1^n) ) ]
               = H(X_1, ..., X_n)

By the AEP for stationary ergodic processes, (1/n) H(X_1, ..., X_n) converges to the entropy rate of the source.
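A toy version of this matching scheme can be written directly from the description above. This is a sketch (names are mine; the window is unbounded by default, and the greedy offsets chosen may differ from those in the worked example), but the decoder recovers the string exactly:

```python
def lz77_encode(s, window=float("inf")):
    """Greedy LZ77-style parse: emit (0, symbol) or (1, offset, length)."""
    out, i = [], 0
    while i < len(s):
        best_len, best_off = 0, 0
        lo = max(0, i - window) if window != float("inf") else 0
        for j in range(lo, i):
            k = 0
            # the match may run into the lookahead, so j + k can reach i
            while i + k < len(s) and s[j + k] == s[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len > 0:
            out.append((1, best_off, best_len))
            i += best_len
        else:
            out.append((0, s[i]))   # no match: send the symbol uncompressed
            i += 1
    return out

def lz77_decode(tokens):
    s = []
    for t in tokens:
        if t[0] == 0:
            s.append(t[1])
        else:
            _, off, length = t
            for _ in range(length):   # copy one symbol at a time,
                s.append(s[-off])     # which handles overlapping matches
    return "".join(s)

s = "ABBABBBAABBBA"
print(lz77_decode(lz77_encode(s)) == s)  # True
```

Copying one symbol at a time in the decoder is what allows a match to extend past the point where it started, e.g. `"AAAA"` parses as the literal `A` followed by a length-3 copy at offset 1.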