
Data Compression
Limit of Information Compression

Radu Trîmbiţaş

October, 2012

Contents

1 Examples of codes
2 Kraft Inequality
  2.1 Kraft Inequality
  2.2 Kraft inequality - infinite case
3 Optimal codes
  3.1 Construction of optimal codes
  3.2 Bounds on the optimal code length
4 Kraft Inequality for Uniquely Decodable Codes
5 Huffman Codes
  5.1 Huffman codes
  5.2 Optimality of Huffman codes
6 Shannon-Fano-Elias Coding
  6.1 The code
  6.2 Competitive optimality of the Shannon code

1 Examples of codes

Let X be a random variable taking values in an alphabet X with pmf p(x), and let D* denote the set of finite-length strings of symbols from a D-ary alphabet D.

Definition 1. A source code C for a random variable X is a mapping C : X → D*; C(x) is the codeword corresponding to x, and l(x) is the length of C(x).

Example. C(red) = 00, C(blue) = 11 is a source code for X = {red, blue} with alphabet D = {0, 1}.

Definition 2. The expected length L(C) of a source code C(x) for a random variable X with pmf p(x) is

    L(C) = \sum_{x \in X} p(x) l(x),    (1)

where l(x) is the length of the codeword associated with x.

Without loss of generality we can assume D = {0, 1, ..., D-1}.

Example 3. Let X take the values 1, 2, 3, 4 with probabilities 1/2, 1/4, 1/8, 1/8, and take the codewords C(1) = 0, C(2) = 10, C(3) = 110, C(4) = 111. The entropy of X is H(X) = 1.75 bits, and the expected length is L(C) = E(l(X)) = 1.75 bits. Any sequence of bits can be uniquely decoded into a sequence of symbols of X; for instance, the bit string 0110111100110 is decoded as 134213.

Example 4. Let X take the values 1, 2, 3, each with probability 1/3, with codewords C(1) = 0, C(2) = 10, C(3) = 11. The code is uniquely decodable. Here H(X) = log 3 = 1.58 bits, but L(C) = 1.66 bits > H(X).

Example 5 (Morse code). A code for the English alphabet built from four symbols: dot, dash, letter space, word space. Short sequences represent frequent letters (e.g., a single dot is E); long sequences represent infrequent letters (dash, dash, dot, dash is Q). It is not optimal.
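A quick numerical check of Examples 3 and 4, sketched in Python (the helper names are ad hoc, not from the text):

```python
import math

def entropy(p):
    """Entropy in bits of a probability vector p."""
    return -sum(pi * math.log2(pi) for pi in p)

def expected_length(p, codewords):
    """Expected codeword length sum_i p_i * l_i."""
    return sum(pi * len(c) for pi, c in zip(p, codewords))

# Example 3: dyadic probabilities, so L(C) equals H(X) = 1.75 bits
p3, c3 = [1/2, 1/4, 1/8, 1/8], ["0", "10", "110", "111"]
print(entropy(p3), expected_length(p3, c3))   # 1.75 1.75

# Example 4: uniform over 3 symbols, L(C) = 5/3 > H(X) = log2(3)
p4, c4 = [1/3, 1/3, 1/3], ["0", "10", "11"]
print(entropy(p4), expected_length(p4, c4))   # ~1.585 ~1.667
```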

Definition 6. A code is nonsingular if every element of X maps into a different string in D*, i.e.,

    x \ne x' \Rightarrow C(x) \ne C(x').    (2)

Nonsingularity suffices for an unambiguous description of a single value of X. For sequences of values we could ensure decodability by adding a special symbol (a "comma") between any two codewords, but this is inefficient.

Definition 7. The extension C* of a code C is the mapping C* : X^n → D* defined by

    C*(x_1 x_2 ... x_n) = C(x_1) C(x_2) ... C(x_n).    (3)

Example 8. If C(x_1) = 00 and C(x_2) = 11, then C(x_1 x_2) = 0011.

Definition 9. A code is uniquely decodable if its extension is nonsingular.

Any encoded string in a uniquely decodable code has only one possible source string producing it. However, one may have to look at the entire string to determine even the first symbol of the corresponding source string.

Definition 10. A code is called a prefix code or an instantaneous code if no codeword is a prefix of any other codeword.

An instantaneous code can be decoded without reference to future codewords: the symbol x_i can be decoded as soon as we come to the end of the codeword corresponding to it. An instantaneous code is a self-punctuating code: we can look at the sequence of code symbols and add commas to separate the codewords without looking at later symbols. For example, with the code of Example 3 the string 01011111010 is parsed as 0,10,111,110,10.

Figure 1 shows the relationship between these classes of codes.

Figure 1: Classes of codes
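A minimal sketch of instantaneous decoding for the prefix code of Example 3 (the function name and dictionary layout are my own); each codeword is emitted as soon as it is complete:

```python
def decode_prefix(bits, codebook):
    """Decode a bit string with a prefix code, emitting each symbol
    as soon as its codeword is complete (instantaneous decoding)."""
    inverse = {code: sym for sym, code in codebook.items()}
    symbols, current = [], ""
    for b in bits:
        current += b
        if current in inverse:            # end of a codeword reached
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("leftover bits: " + current)
    return symbols

code = {1: "0", 2: "10", 3: "110", 4: "111"}
print(decode_prefix("0110111100110", code))   # [1, 3, 4, 2, 1, 3]
```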

Table 1: Classes of codes

    x   Singular   Nonsingular, but not     Uniquely decodable, but    Instantaneous
                   uniquely decodable       not instantaneous
    1   0          0                        10                         0
    2   0          010                      00                         10
    3   0          01                       11                         110
    4   0          10                       110                        111

Examples 11. For each of the following codes one can ask whether it is instantaneous (I) and whether it is uniquely decodable (U):

  - C(E, F, G, H) = (0, , 00, )
  - C(E, F) = (0, 0)
  - C(E, F) = ( , 0)
  - C(E, F, G, H) = (00, 0, 0, )
  - C(E, F, G, H) = (0, 0, 0, )

2 Kraft Inequality

2.1 Kraft Inequality

Aim: to construct instantaneous codes of minimum expected length. We cannot assign short codewords to all source symbols and still be prefix-free; the set of codeword lengths possible for an instantaneous code is limited by the following inequality.

Theorem 12 (Kraft inequality). For any instantaneous code (prefix code) over an alphabet of size D, the codeword lengths l_1, l_2, ..., l_m must satisfy the inequality

    \sum_i D^{-l_i} \le 1.    (4)

Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths.

Proof. Necessity. We construct a D-ary tree in which the branches represent the symbols of the codewords. Each codeword is represented by a leaf of the tree; the path from the root traces out the symbols of the codeword. See an example for D = 2 in Figure 2. The prefix condition means that no codeword is an ancestor of any other codeword, so each codeword eliminates its descendants as possible codewords. Let l_max be the length of the longest codeword and consider all nodes at level l_max; each of them is a codeword, a descendant of a codeword, or neither (unused). A codeword at level l_i has D^{l_max - l_i} descendants at level l_max. These descendant sets are disjoint, and the total number of nodes in these sets is at most D^{l_max}. Summing over all codewords,

    \sum_i D^{l_max - l_i} \le D^{l_max}  \Longrightarrow  \sum_i D^{-l_i} \le 1.

Sufficiency. Conversely, given any set of codeword lengths l_1, l_2, ..., l_m that satisfy the Kraft inequality, we can construct a tree like the one in Figure 2. Label the first node (lexicographically) of depth l_1 as codeword 1, and remove its descendants from the tree. Then label the first remaining node of depth l_2 as codeword 2, and so on. Proceeding in this way, we construct a prefix code with the specified lengths l_1, l_2, ..., l_m.

Figure 2: Code tree for the Kraft inequality (D = 2)
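The sufficiency argument is constructive. A small sketch of that construction (assignment of the lexicographically first free node at each depth; function name and details are my own):

```python
def prefix_code_from_lengths(lengths, D=2):
    """Build codewords with the given lengths by assigning, for each length
    in nondecreasing order, the first available node of that depth, as in
    the sufficiency part of the Kraft inequality proof."""
    assert sum(D ** -l for l in lengths) <= 1, "lengths violate the Kraft inequality"
    codewords, next_node, prev_len = [], 0, 0
    for l in sorted(lengths):
        next_node *= D ** (l - prev_len)        # descend to depth l
        digits, n = [], next_node
        for _ in range(l):                      # write next_node in base D with l digits
            digits.append(str(n % D))
            n //= D
        codewords.append("".join(reversed(digits)))
        next_node += 1                          # skip the subtree of this codeword
        prev_len = l
    return codewords

print(prefix_code_from_lengths([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```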

2.2 Kraft inequality - infinite case

Theorem 13 (Extended Kraft inequality). For any countably infinite set of codewords that form a prefix code, the codeword lengths satisfy the extended Kraft inequality

    \sum_{i=1}^{\infty} D^{-l_i} \le 1.    (5)

Conversely, given any l_1, l_2, ... satisfying the extended Kraft inequality, we can construct a prefix code with these codeword lengths.

Proof. Let D = {0, 1, ..., D-1}. The i-th codeword y_1 y_2 ... y_{l_i} corresponds to the number

    0.y_1 y_2 ... y_{l_i} = \sum_{j=1}^{l_i} y_j D^{-j},    (6)

and hence to the interval

    [ 0.y_1 y_2 ... y_{l_i}, 0.y_1 y_2 ... y_{l_i} + D^{-l_i} ) \subset [0, 1],

the set of all real numbers whose D-ary expansion begins with 0.y_1 y_2 ... y_{l_i}. These intervals are disjoint, due to the prefix condition, so the sum of their lengths is at most 1, that is,

    \sum_{i=1}^{\infty} D^{-l_i} \le 1.

Proof - continuation. Conversely, if the lengths l_1, l_2, ... satisfy the extended Kraft inequality, we reorder the indexing so that l_1 \le l_2 \le ... and assign the intervals in order from the low end of the unit interval. For example, if we wish to construct a binary code with l_1 = 1, l_2 = 2, ..., we assign the intervals [0, 1/2), [1/2, 3/4), ... to the symbols, with the corresponding codewords 0, 10, ....
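A sketch of the interval picture used in the proof: each codeword is mapped to a D-ary subinterval of [0, 1), and the prefix condition appears as disjointness (binary case; the helper is illustrative only):

```python
from fractions import Fraction

def codeword_interval(codeword, D=2):
    """Interval [0.y1...yl, 0.y1...yl + D^-l) associated with a codeword."""
    low = sum(Fraction(int(d), D ** (j + 1)) for j, d in enumerate(codeword))
    return low, low + Fraction(1, D ** len(codeword))

code = ["0", "10", "110", "111"]              # the prefix code of Example 3
for c in code:
    print(c, codeword_interval(c))
# intervals [0,1/2), [1/2,3/4), [3/4,7/8), [7/8,1): disjoint, total length 1
```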

3 Optimal codes

3.1 Construction of optimal codes

We look for the prefix code with the minimum expected length, i.e., we minimize

    L = \sum_{i=1}^{m} p_i l_i    (7)

over all integers l_1, l_2, ..., l_m satisfying

    \sum_i D^{-l_i} \le 1.    (8)

We ignore the integer constraint on the l_i, assume equality in (8), and use the method of Lagrange multipliers: minimize

    J = \sum_i p_i l_i + \lambda \sum_i D^{-l_i}.    (9)

Differentiating with respect to l_i, we obtain

    \partial J / \partial l_i = p_i - \lambda D^{-l_i} \ln D.

Setting this to 0,

    D^{-l_i} = \frac{p_i}{\lambda \ln D}.

Substituting into (8) we find \lambda = 1 / \ln D and p_i = D^{-l_i}, yielding the optimal code lengths

    l_i^* = -\log_D p_i.    (10)

These (in general noninteger) codeword lengths yield the expected codeword length

    L^* = \sum_i p_i l_i^* = -\sum_i p_i \log_D p_i = H_D(X).    (11)

Rather than demonstrating that l_i^* = -\log_D p_i is a global minimum, we verify optimality directly in the proof of the following theorem.

Theorem 14. The expected length L of any instantaneous D-ary code for a random variable X is greater than or equal to the entropy H_D(X); that is,

    L \ge H_D(X),    (12)

with equality iff D^{-l_i} = p_i.

Proof. We write

    L - H_D(X) = \sum_i p_i l_i + \sum_i p_i \log_D p_i = -\sum_i p_i \log_D D^{-l_i} + \sum_i p_i \log_D p_i.

Letting r_i = D^{-l_i} / \sum_j D^{-l_j} and c = \sum_i D^{-l_i}, we obtain

    L - H_D(X) = \sum_i p_i \log_D \frac{p_i}{r_i} - \log_D c = D(p \| r) + \log_D \frac{1}{c} \ge 0

by the nonnegativity of relative entropy and the Kraft inequality (c \le 1). Hence L \ge H_D(X), with equality iff p_i = D^{-l_i} (i.e., iff -\log_D p_i is an integer for all i).
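A numerical illustration of the decomposition used in the proof, L - H_D(X) = D(p‖r) + log_D(1/c) ≥ 0, sketched in Python for D = 2 with the code of Example 4:

```python
import math

p       = [1/3, 1/3, 1/3]          # source distribution of Example 4
lengths = [1, 2, 2]                # lengths of the prefix code 0, 10, 11

H = -sum(pi * math.log2(pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))

c  = sum(2 ** -li for li in lengths)              # Kraft sum, c <= 1
r  = [2 ** -li / c for li in lengths]             # normalized 2^{-l_i}
KL = sum(pi * math.log2(pi / ri) for pi, ri in zip(p, r))

print(L - H)                      # ~0.0817 bits
print(KL + math.log2(1 / c))      # the same value, as in the proof
```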

Definition 15. A probability distribution is called D-adic if each of its probabilities is equal to D^{-n} for some integer n.

Thus, we have equality in the theorem iff the distribution of X is D-adic. The preceding proof also indicates a procedure for finding an optimal code: find the D-adic distribution that is closest (in the relative entropy sense) to the distribution of X; this distribution provides the set of codeword lengths; then construct the code by choosing the first available node, as in the proof of the Kraft inequality. We then have an optimal code for X. However, this procedure is not easy, since the search for the closest D-adic distribution is not obvious.

3.2 Bounds on the optimal code length

Since \log_D \frac{1}{p_i} may not be an integer, we round up:

    l_i = \lceil \log_D \frac{1}{p_i} \rceil.

These lengths satisfy the Kraft inequality, since

    \sum_i D^{-\lceil \log_D \frac{1}{p_i} \rceil} \le \sum_i D^{-\log_D \frac{1}{p_i}} = \sum_i p_i = 1.

Moreover,

    \log_D \frac{1}{p_i} \le l_i < \log_D \frac{1}{p_i} + 1.

Multiplying by p_i and summing, we obtain

    H_D(X) \le L < H_D(X) + 1.    (13)

An optimal code can only do better than this code.

Theorem 16. Let l_1^*, l_2^*, ..., l_m^* be optimal codeword lengths for a source distribution p and a D-ary alphabet, and let L^* be the associated expected length of an optimal code (L^* = \sum_i p_i l_i^*). Then

    H_D(X) \le L^* < H_D(X) + 1.    (14)

Proof. Let l_i = \lceil \log_D \frac{1}{p_i} \rceil. Then these l_i satisfy the Kraft inequality, and from (13) the corresponding code has expected length L = \sum_i p_i l_i < H_D(X) + 1. Since the optimal code can do no worse, L^* \le L < H_D(X) + 1, while L^* \ge H_D(X) by Theorem 14. This gives the conclusion.
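A quick check of the bound (13) for the rounded-up ("Shannon") lengths; the distribution below is an arbitrary illustrative choice:

```python
import math

def shannon_lengths(p, D=2):
    """Rounded-up code lengths l_i = ceil(log_D 1/p_i)."""
    return [math.ceil(-math.log(pi, D)) for pi in p]

p = [0.4, 0.3, 0.2, 0.1]
lengths = shannon_lengths(p)
H = -sum(pi * math.log2(pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))

print(lengths)                                # [2, 2, 3, 4]
print(sum(2 ** -li for li in lengths) <= 1)   # Kraft inequality holds
print(H <= L < H + 1)                         # True: 1.846... <= 2.4 < 2.846...
```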

Assume D = 2. The overhead is at most 1 bit per symbol. We can do better by encoding blocks of n symbols; a block is a supersymbol from the alphabet X^n.

Define L_n, the expected codeword length per input symbol,

    L_n = \frac{1}{n} \sum p(x_1, ..., x_n) l(x_1, ..., x_n) = \frac{1}{n} E\big( l(X_1, ..., X_n) \big).    (15)

Applying the previous bounds to the block gives

    H(X_1, ..., X_n) \le E\big( l(X_1, ..., X_n) \big) < H(X_1, ..., X_n) + 1.    (16)

If X_1, ..., X_n are i.i.d., then H(X_1, ..., X_n) = \sum_i H(X_i) = n H(X), so

    H(X) \le L_n < H(X) + \frac{1}{n}.

Using large block lengths, we can therefore achieve an expected codelength per symbol arbitrarily close to the entropy.

For a stochastic process (X_i) that is not necessarily i.i.d. we have:

Theorem 17. The minimum expected codeword length per symbol satisfies

    \frac{H(X_1, ..., X_n)}{n} \le L_n^* < \frac{H(X_1, ..., X_n)}{n} + \frac{1}{n}.    (17)

Moreover, if X_1, X_2, ... is a stationary stochastic process,

    L_n^* \to H(\mathcal{X}),    (18)

where H(\mathcal{X}) is the entropy rate of the process.

Proof. In this case we still have the bound (16). Dividing by n and defining L_n^* to be the minimum expected description length per symbol, we obtain

    \frac{H(X_1, ..., X_n)}{n} \le L_n^* < \frac{H(X_1, ..., X_n)}{n} + \frac{1}{n}.

If the stochastic process is stationary, then H(X_1, ..., X_n) / n \to H(\mathcal{X}) as n \to \infty.
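A sketch of how the per-symbol overhead shrinks with the block length, using the rounded-up lengths ceil(log 1/p) on blocks of an i.i.d. source (the two-symbol distribution is an illustrative choice):

```python
import math
from itertools import product

p = {"a": 0.7, "b": 0.3}                       # i.i.d. source, H(X) ~ 0.881 bits
H = -sum(q * math.log2(q) for q in p.values())

for n in (1, 2, 4, 8):
    # expected length per symbol, with lengths ceil(log2 1/p) on n-blocks
    Ln = sum(
        math.prod(p[s] for s in block)
        * math.ceil(-math.log2(math.prod(p[s] for s in block)))
        for block in product(p, repeat=n)
    ) / n
    print(n, round(Ln, 4))   # stays within 1/n of H(X) = 0.8813
```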

If the code is designed for the wrong distribution q(x) (for example, the wrong distribution may be the best estimate of the unknown true distribution that we can make), we have:

Theorem 18 (Wrong code). The expected length under p(x) of the code assignment l(x) = \lceil \log \frac{1}{q(x)} \rceil satisfies

    H(p) + D(p \| q) \le E_p\big( l(X) \big) < H(p) + D(p \| q) + 1.    (19)

Proof. The expected codelength is

    E_p\big( l(X) \big) = \sum_x p(x) \lceil \log \frac{1}{q(x)} \rceil
      < \sum_x p(x) \left( \log \frac{1}{q(x)} + 1 \right)
      = \sum_x p(x) \log \frac{p(x)}{q(x)} + \sum_x p(x) \log \frac{1}{p(x)} + 1
      = D(p \| q) + H(p) + 1.

The lower bound is obtained similarly.

Thus, believing that the distribution is q(x) when the true distribution is p(x) incurs a penalty of D(p \| q) in the average description length.
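A numerical sketch of the wrong-code penalty; p and q below are illustrative choices:

```python
import math

p = [0.5, 0.25, 0.125, 0.125]        # true distribution
q = [0.25, 0.25, 0.25, 0.25]         # assumed (wrong) distribution

lengths = [math.ceil(-math.log2(qi)) for qi in q]   # code designed for q
Ep = sum(pi * li for pi, li in zip(p, lengths))     # expected length under p

H = -sum(pi * math.log2(pi) for pi in p)
D = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(H + D <= Ep < H + D + 1)       # True: 2.0 <= 2.0 < 3.0
```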

4 Kraft Inequality for Uniquely Decodable Codes

The class of uniquely decodable codes is larger than the class of instantaneous codes. Can we achieve a lower expected codeword length for this larger class? No!

Theorem 19 (McMillan). The codeword lengths of any uniquely decodable D-ary code must satisfy the Kraft inequality

    \sum_i D^{-l_i} \le 1.    (20)

Conversely, given a set of codeword lengths that satisfy this inequality, it is possible to construct a uniquely decodable code with these codeword lengths.

Proof. The codeword length of the extension code is

    l(x_1, ..., x_k) = \sum_{i=1}^{k} l(x_i).    (21)

We want to show that

    \sum_{x \in X} D^{-l(x)} \le 1.    (22)

Trick: consider the k-th power of the left-hand side of (22),

    \left( \sum_{x \in X} D^{-l(x)} \right)^k = \sum_{x_1 \in X} \cdots \sum_{x_k \in X} D^{-l(x_1)} \cdots D^{-l(x_k)} = \sum_{x^k \in X^k} D^{-l(x^k)}.

Proof - continuation. The last relation follows from (21). Gathering terms by word length,

    \sum_{x^k \in X^k} D^{-l(x^k)} = \sum_{m=1}^{k l_max} a(m) D^{-m},

where l_max is the maximum codeword length and a(m) is the number of source sequences x^k mapping into codewords of length m. Since the code is uniquely decodable, a(m) \le D^m, and we have

    \left( \sum_{x \in X} D^{-l(x)} \right)^k \le \sum_{m=1}^{k l_max} D^m D^{-m} = k l_max.    (23)

Proof - continuation. Inequality (23) is equivalent to

    \sum_j D^{-l_j} \le (k l_max)^{1/k}.    (24)

Since \lim_{k \to \infty} (k l_max)^{1/k} = 1, we have

    \sum_j D^{-l_j} \le 1.

Conversely, given any set of l_1, l_2, ..., l_m satisfying the Kraft inequality, we can construct an instantaneous code as in the proof of the Kraft inequality. Since every instantaneous code is uniquely decodable, we have also constructed a uniquely decodable code.

Corollary 20. A uniquely decodable code for a countably infinite source alphabet X also satisfies the Kraft inequality.

Proof. For infinite X the preceding proof breaks down at (24). Fix: any subset of a uniquely decodable code is uniquely decodable, so any finite subset satisfies the Kraft inequality; hence

    \sum_{i=1}^{\infty} D^{-l_i} = \lim_{N \to \infty} \sum_{i=1}^{N} D^{-l_i} \le 1.

Conversely, given any set of l_1, l_2, ... satisfying the Kraft inequality, we can construct an instantaneous code as in the proof of the Kraft inequality. Since every instantaneous code is uniquely decodable, we have also constructed a uniquely decodable code with an infinite number of codewords. Hence the McMillan inequality also applies to infinite alphabets.

5 Huffman Codes

5.1 Huffman codes

Huffman gave in [1] an algorithm for the construction of an optimal code. An optimal binary instantaneous code must satisfy:

1. p(x_i) > p(x_j) implies l(x_i) \le l(x_j) (else swap the codewords).
2. The two longest codewords have the same length (else chop a bit off the longer codeword).
3. The two longest codewords differ only in the last bit (else chop a bit off all of them).

Huffman code construction:

1. Take the two smallest probabilities p(x_i) and assign each a different last bit; then merge the two symbols into a single symbol.
2. Repeat step 1 until only one symbol remains.

Huffman coding is used in JPEG, MP3, ...

Example 21. Consider a source X taking the values 1, 2, 3, 4, 5 with probabilities 0.25, 0.25, 0.2, 0.15, 0.15. We can combine the symbols 4 and 5 into a single source symbol with probability assignment 0.30. Proceeding this way, combining the two least likely symbols into one symbol until we are finally left with only one symbol, and then assigning codewords to the symbols, we obtain the codeword lengths 2, 2, 2, 3, 3 (for instance the codewords 01, 10, 11, 000, 001). This code has average length L = 2.3 bits, while H(X) = 2.286 bits.

For a D-ary code, first add extra zero-probability (dummy) symbols until |X| - 1 is a multiple of D - 1, and then group D symbols at a time.
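A compact sketch of the construction for the binary case, using a heap (the function and variable names are my own; the codeword labels depend on tie-breaking, but the lengths and the average length do not). Applied to the distribution of Example 21 it gives lengths 2, 2, 2, 3, 3 and L = 2.3 bits:

```python
import heapq

def huffman_code(probs):
    """Binary Huffman code for a dict {symbol: probability}: repeatedly merge
    the two least likely entries, prepending a different last bit to each."""
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)                      # tie-breaker for equal probabilities
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # the two smallest probabilities
        p1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

p = {1: 0.25, 2: 0.25, 3: 0.2, 4: 0.15, 5: 0.15}
code = huffman_code(p)
L = sum(p[s] * len(c) for s, c in code.items())
print(code)   # symbol -> codeword, e.g. {3: '00', 1: '01', 2: '10', 4: '110', 5: '111'}
print(L)      # 2.3 (up to floating-point rounding)
```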

5.2 Optimality of Huffman codes

We prove by induction that the binary Huffman code is optimal. Note that there are many optimal codes: inverting all the bits or exchanging two codewords of the same length gives another optimal code; the Huffman procedure constructs one such optimal code. Assume w.l.o.g. that p_1 \ge p_2 \ge ... \ge p_m. A code is optimal if \sum_i p_i l_i is minimal.

Lemma 22. For any distribution, there exists an optimal instantaneous code (with minimum expected length) that satisfies the following properties:

1. The lengths are ordered inversely with the probabilities (i.e., if p_j > p_k, then l_j \le l_k).
2. The two longest codewords have the same length.
3. Two of the longest codewords differ only in the last bit and correspond to the two least likely symbols.

Proof. The proof amounts to swapping, trimming, and rearranging, as shown in Figure 3.

Figure 3: Properties of optimal codes

Proof - continuation. We assume that p_1 \ge p_2 \ge ... \ge p_m. A possible instantaneous code is given in Figure 3(a). By trimming branches without siblings, we improve the code to (b). We now rearrange the tree as shown in (c), so that the word lengths are ordered by increasing length from top to bottom. Finally, we swap probability assignments to improve the expected depth of the tree, as shown in (d). Every optimal code can be rearranged and swapped into canonical form as in (d), where l_1 \le l_2 \le ... \le l_m, l_{m-1} = l_m, and the last two codewords differ only in the last bit.

Consider an optimal code C_m.

1. If p_j > p_k, then l_j \le l_k. (Swap.) Let C_m' be C_m with the codewords of j and k interchanged. Then

    L(C_m') - L(C_m) = \sum_i p_i l_i' - \sum_i p_i l_i = p_j l_k + p_k l_j - p_j l_j - p_k l_k = (p_j - p_k)(l_k - l_j).

But p_j - p_k > 0, and since C_m is optimal, L(C_m') - L(C_m) \ge 0. Hence we must have l_k \ge l_j. Thus C_m itself satisfies property 1.

2. The two longest codewords are of the same length. (Trim.) Otherwise, one can delete the last bit of the longer one, preserving the prefix property and achieving a lower expected codeword length. By property 1, the longest codewords must belong to the least probable source symbols.

3. The two longest codewords differ only in the last bit and correspond to the two least likely symbols. (Rearrange.) Not all optimal codes satisfy this property, but by rearranging we can find an optimal code that does. If there is a maximal-length codeword without a sibling, we can delete the last bit of that codeword and still satisfy the prefix property; this reduces the average codeword length and contradicts the optimality of the code. Hence every maximal-length codeword in any optimal code has a sibling. Now we can exchange the longest codewords so that the two lowest-probability source symbols are associated with two siblings on the tree. This does not change the expected length \sum_i p_i l_i. Thus the codewords for the two lowest-probability source symbols have maximal length and agree in all but the last bit.

We have shown that there exists an optimal code satisfying the properties of the lemma. We call such codes canonical codes.

Theorem 23. Huffman coding is optimal; that is, if C* is a Huffman code and C' is any other uniquely decodable code, then L(C*) \le L(C').

Proof. For any probability mass function on an alphabet of size m, p = (p_1, p_2, ..., p_m) with p_1 \ge p_2 \ge ... \ge p_m, we define the Huffman reduction p' = (p_1, p_2, ..., p_{m-2}, p_{m-1} + p_m) over an alphabet of size m - 1 (Figure 4). Let C*_{m-1}(p') be an optimal code for p', and let C*_m(p) be the canonical optimal code for p.

The induction has two steps:
1. expand an optimal code for p' to construct a code for p (see the table below);
2. condense an optimal canonical code for p to construct a code for the Huffman reduction p'.

Comparing the average codeword lengths of the two codes establishes that an optimal code for p can be obtained by extending an optimal code for p'.

From p' we construct an extension code for m elements: take the codeword of C*_{m-1} associated with the weight p_{m-1} + p_m and extend it by adding a 0 to form the codeword for symbol m - 1 and by adding a 1 to form the codeword for symbol m:

    p'                 C*_{m-1}(p')              expanded code for p
    p_1                w'_1, l'_1                w_1 = w'_1,          l_1 = l'_1
    p_2                w'_2, l'_2                w_2 = w'_2,          l_2 = l'_2
    ...                ...                       ...
    p_{m-2}            w'_{m-2}, l'_{m-2}        w_{m-2} = w'_{m-2},  l_{m-2} = l'_{m-2}
    p_{m-1} + p_m      w'_{m-1}, l'_{m-1}        w_{m-1} = w'_{m-1}0, l_{m-1} = l'_{m-1} + 1
                                                 w_m = w'_{m-1}1,     l_m = l'_{m-1} + 1
    (25)

Calculating the average length \sum_i p_i l_i of this expanded code,

    L(p) = L^*(p') + p_{m-1} + p_m.    (26)

Similarly, from the canonical optimal code for p we construct a code for p' by merging the codewords of the two lowest-probability symbols m - 1 and m (with probabilities p_{m-1} and p_m), which are siblings by the properties of the canonical code. The new code for p' has average length

    L(p') = \sum_{i=1}^{m-2} p_i l_i^* + p_{m-1}(l_{m-1}^* - 1) + p_m(l_m^* - 1)
          = \sum_{i=1}^{m} p_i l_i^* - p_{m-1} - p_m
          = L^*(p) - p_{m-1} - p_m.    (27)

Adding (26) and (27),

    L(p') + L(p) = L^*(p') + L^*(p),    (28)

or

    \big( L(p') - L^*(p') \big) + \big( L(p) - L^*(p) \big) = 0.    (29)

The expressions in parentheses are nonnegative, so both must be zero; in particular L(p) = L^*(p), i.e., the expansion of an optimal code for p' is optimal for p. By induction on the alphabet size, Huffman coding is optimal.

Figure 4: Induction step for Huffman coding. A canonical optimal code is illustrated in (a). Combining the two lowest probabilities gives (b). Rearranging the probabilities in decreasing order gives (c), the reduced distribution on m - 1 symbols.
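A small numerical check of the recursion (26) on the distribution of Example 21 and its Huffman reduction (a sketch; the helper recomputes Huffman expected lengths by summing merge weights, a standard equivalent of building the tree):

```python
import heapq

def huffman_expected_length(probs):
    """Expected length of a binary Huffman code for a list of probabilities:
    each merge of two groups adds one bit to every symbol inside them, so the
    expected length is the sum of all merged weights."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

p         = [0.25, 0.25, 0.2, 0.15, 0.15]
p_reduced = [0.25, 0.25, 0.2, 0.30]        # Huffman reduction: 0.15 + 0.15 merged

L, Lr = huffman_expected_length(p), huffman_expected_length(p_reduced)
print(L, Lr + 0.15 + 0.15)   # both ~2.3, as equation (26) asserts
```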

6 Shannon-Fano-Elias Coding

6.1 The code

Let X = {1, 2, ..., m} and assume p(x) > 0 for all x. The cumulative distribution function (cdf) is

    F(x) = \sum_{a \le x} p(a).    (30)

Figure 5: cdf and Shannon-Fano-Elias coding

Consider the modified cumulative distribution function (see Figure 5)

    \bar{F}(x) = \sum_{a < x} p(a) + \frac{1}{2} p(x).    (31)

Since a \ne b implies \bar{F}(a) \ne \bar{F}(b), we can determine x if we know \bar{F}(x); thus \bar{F}(x) can be used as a code for x.

In general \bar{F}(x) is a real number, and writing it down exactly may require an infinite number of bits, so we use \lfloor \bar{F}(x) \rfloor_{l(x)}, i.e., \bar{F}(x) truncated to l(x) bits, with

    l(x) = \lceil \log \frac{1}{p(x)} \rceil + 1.

Then

    \bar{F}(x) - \lfloor \bar{F}(x) \rfloor_{l(x)} < \frac{1}{2^{l(x)}}    (32)

and

    \frac{1}{2^{l(x)}} \le \frac{p(x)}{2} = \bar{F}(x) - F(x - 1),    (33)

so \lfloor \bar{F}(x) \rfloor_{l(x)} lies within the step of the cdf corresponding to x.

The code is prefix-free. A codeword z_1 z_2 ... z_l corresponds to the interval [0.z_1 z_2 ... z_l, 0.z_1 z_2 ... z_l + 2^{-l}); the code is prefix-free iff these intervals are disjoint. The interval length is 2^{-l(x)}, which by (33) is less than half of the step corresponding to x, so the intervals are indeed disjoint.

Since l(x) = \lceil \log \frac{1}{p(x)} \rceil + 1, the expected length of the code is

    L = \sum_x p(x) l(x) = \sum_x p(x) \left( \lceil \log \frac{1}{p(x)} \rceil + 1 \right) < H(X) + 2.    (34)

The procedure does not require the probabilities to be ordered.

Example 24. We consider an example where all the probabilities are dyadic.

    x   p(x)    F(x)    F̄(x)     F̄(x) in binary   l(x)   Codeword
    1   0.25    0.25    0.125    0.001            3      001
    2   0.5     0.75    0.5      0.10             2      10
    3   0.125   0.875   0.8125   0.1101           4      1101
    4   0.125   1.0     0.9375   0.1111           4      1111

Here L = 2.75 bits, while H(X) = 1.75 bits.

Example 25. Here the distribution is not dyadic, so the binary representations of F̄(x) are infinite.

    x   p(x)   F(x)   F̄(x)    F̄(x) in binary   l(x)   Codeword
    1   0.25   0.25   0.125   0.001            3      001
    2   0.25   0.5    0.375   0.011            3      011
    3   0.2    0.7    0.6     0.10011...       4      1001
    4   0.15   0.85   0.775   0.110001...      4      1100
    5   0.15   1.0    0.925   0.111011...      4      1110
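A sketch of the encoder (the function name and the digit-extraction loop are my own); on the dyadic distribution of Example 24 it reproduces the codewords above:

```python
import math

def sfe_code(p):
    """Shannon-Fano-Elias codewords for a list of probabilities p(1), ..., p(m):
    truncate the modified cdf Fbar(x) to l(x) = ceil(log2 1/p(x)) + 1 bits."""
    codewords, F = [], 0.0                     # F holds F(x-1)
    for px in p:
        Fbar = F + px / 2                      # modified cdf
        l = math.ceil(-math.log2(px)) + 1
        bits, frac = "", Fbar
        for _ in range(l):                     # binary expansion truncated to l bits
            frac *= 2
            bit, frac = divmod(frac, 1)
            bits += str(int(bit))
        codewords.append(bits)
        F += px
    return codewords

print(sfe_code([0.25, 0.5, 0.125, 0.125]))     # ['001', '10', '1101', '1111']
```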

Proof. P ( l(x) l (X) + c ) ( ) = P log l (X) + c p(x) ( ) P log p(x) l (X) + c = P (p(x) ) 2 l (X) c+ = :p() 2 l (X) c+ p() 2 l (X) (c ) :p() 2 l (X) c+ 2 l (X) 2 (c ) Kraft inequality 2 (c ) Stronger result if p() is dyadic: Theorem 27. For a dyadic pmf p(), if l() = log is the length of the binary p() Shannon code and l () is the length of any other uniquely decodable binary code, then P ( l(x) < l (X) ) P ( l(x) > l (X) ), (36) with equality iff l () = l() for all. Thus, the code length assignment l() = is uniquely competitively optimal. log p() Proof. Consider It can be shown graphically that if t > 0 sgn(t) = 0 if t = 0 if t < 0 sgn(t) 2 t, t Z. (37) P ( l (X) < l(x) ) P ( l (X) > l(x) ) = :l ()<l() p() = p()sgn ( l() l () ) = E ( sgn ( l(x) l (X) )) p() ( ) 2 l() l () = ( 2 l() 2 l() l () ) = 2 l () Kraft inequality = 0. bound on sgn 2 l () 2 l() :l ()>l() p() 9

A stronger result holds if p(x) is dyadic:

Theorem 27. For a dyadic pmf p(x), if l(x) = \log \frac{1}{p(x)} is the length of the binary Shannon code and l'(x) is the length of any other uniquely decodable binary code, then

    P\big( l(X) < l'(X) \big) \ge P\big( l(X) > l'(X) \big),    (36)

with equality iff l'(x) = l(x) for all x. Thus, the code length assignment l(x) = \log \frac{1}{p(x)} is uniquely competitively optimal.

Proof. Consider the sign function

    sgn(t) = 1 if t > 0,  0 if t = 0,  -1 if t < 0.

It can be shown graphically (Figure 6) that

    sgn(t) \le 2^t - 1,  t \in Z.    (37)

Then

    P( l'(X) < l(X) ) - P( l'(X) > l(X) ) = \sum_{x : l'(x) < l(x)} p(x) - \sum_{x : l'(x) > l(x)} p(x)
      = \sum_x p(x) \, sgn( l(x) - l'(x) )
      = E\big( sgn( l(X) - l'(X) ) \big)
      \le \sum_x p(x) \big( 2^{l(x) - l'(x)} - 1 \big)        [bound (37) on sgn]
      = \sum_x 2^{-l(x)} \big( 2^{l(x) - l'(x)} - 1 \big)     [p dyadic: p(x) = 2^{-l(x)}]
      = \sum_x 2^{-l'(x)} - \sum_x 2^{-l(x)}
      \le 1 - 1 = 0,                                          [Kraft inequality for l'; \sum_x 2^{-l(x)} = \sum_x p(x) = 1]

which proves (36).

Figure 6: sgn function and bound

We have equality in the above chain iff t = 0 or t = 1 in sgn(t) (i.e., l(x) = l'(x) or l(x) = l'(x) + 1 for all x) and l'(x) satisfies the Kraft inequality with equality, that is, iff l'(x) is the length assignment of an optimal code (i.e., l'(x) = l(x) for all x).

Corollary 28. For nondyadic probability mass functions,

    E\big( sgn( l(X) - l'(X) - 1 ) \big) \le 0,    (38)

where l(x) = \lceil \log \frac{1}{p(x)} \rceil and l'(x) is any other code for the source.

Proof. Along the same lines as the preceding proof.

Hence Shannon coding is optimal under a large variety of criteria; it is robust with respect to the payoff function. In particular, for dyadic p, E(l - l') \le 0 and E( sgn(l - l') ) \le 0, and by use of inequality (37), E( f(l - l') ) \le 0 for any function f satisfying f(t) \le 2^t - 1, t = 0, ±1, ±2, ....

References

[1] Thomas M. Cover, Joy A. Thomas, Elements of Information Theory, 2nd edition, Wiley, 2006.
[2] David J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003.
[3] Robert M. Gray, Entropy and Information Theory, Springer, 2009.

Figure 7: David A. Huffman (1925–1999)

Figure 8: Robert Fano (1917– )

References

[1] D. A. Huffman, A method for the construction of minimum redundancy codes, Proc. IRE, 40: 1098–1101, 1952.