Chapter 5. Data Compression

Size: px

Start display at page:

Download "Chapter 5. Data Compression"

Logan Terry
5 years ago
Views:

1 Chapter 5 Data Compression Peng-Hua Wang Graduate Inst. of Comm. Engineering National Taipei University

2 Chapter Outline Chap. 5 Data Compression 5.1 Example of Codes 5.2 Kraft Inequality 5.3 Optimal Codes 5.4 Bound on Optimal Code Length 5.5 Kraft Inequality for Uniquely Decodable Codes 5.6 Huffman Codes 5.7 Some Comments on Huffman Codes 5.8 Optimality of Huffman Codes 5.9 Shannon-Fano-Elias Coding 5.10 Competitive Optimality of the Shannon Code 5.11 Generation of Discrete Distributions from Fair Coins Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 2/41

3 5.1 Example of Codes Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 3/41

4 Source code Definition (Source code) A source codec for a random variablex is a mapping fromx, the range ofx, tod, the set of finite-length strings of symbols from ad-ary alphabet. LetC(x) denote the codeword corresponding toxand letl(x) denote the length ofc(x). For example, C(red) = 00, C(blue) = 11 is a source code with mapping fromx = {red, blue} tod 2 with alphabetd = {0,1}. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 4/41

5 Source code Definition (Expected length) The expected lengthl(c) of a source codec(x) for a random variablex with probability mass functionp(x) is given by L(C) = x X p(x)l(x). wherel(x) is the length of the codeword associated withx. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 5/41

6 Example Example LetX be a random variable with the following distribution and codeword assignment Pr{X = 1} = 1, codewordc(1) = 0 2 Pr{X = 2} = 1, codewordc(2) = 10 4 Pr{X = 3} = 1, codewordc(3) = Pr{X = 4} = 1, codewordc(4) = H(X) = 1.75 bits. El(x) = 1.75 bits. uniquely decodable Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 6/41

7 Example Example Consider following example. Pr{X = 1} = 1, codewordc(1) = 0 3 Pr{X = 2} = 1, codewordc(2) = 10 3 Pr{X = 3} = 1, codewordc(3) = 11 3 H(X) = 1.58 bits. El(x) = 1.66 bits. uniquely decodable Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 7/41

8 Source code Definition (non-singular) A code is said to be nonsingular if every element of the range ofx maps into a different string ind ; that is, x x C(x) C(x ) Definition (extension code) The extensionc of a codec is the mapping from finite length-strings ofx to finite-length strings ofd, defined by C(x 1 x 2 x n ) = C(x 1 )C(x 2 ) C(x n ) wherec(x 1 )C(x 2 ) C(x n ) indicates concatenation of the corresponding codewords. Example IfC(x 1 ) = 00 andc(x 2 ) = 11, thenc(x 1 x 2 ) = Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 8/41

9 Source code Definition (uniquely decodable) A code is called uniquely decodable if its extension is nonsingular. Definition (prefix code) A code is called a prefix code or an instantaneous code if no codeword is a prefix of any other codeword. For an instantaneous code, the symbolx i can be decoded as soon as we come to the end of the codeword corresponding to it. For example, the binary string produced by the code of Example is parsed as0,10,111,110,10. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 9/41

10 Source code X Singular Nonsingular, but not UD, UD But Not Inst. Inst Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 10/41

11 Decoding Tree Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 11/41

12 5.2 Kraft Inequality Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 12/41

13 Kraft Inequality Theorem (Kraft Inequality) For any instantaneous code (prefix code) over an alphabet of sized, the codeword lengthsl 1,l 2,...,l m must satisfy the inequality D l i 1. i Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 13/41

14 Extended Kraft Inequality Theorem (Extended Kraft Inequality) For any countably infinite set of codewords that form a prefix code, the codeword lengths satisfy the extended Kraft inequality, i=1 D l i 1. Conversely, given anyl 1,l 2,... satisfying the extended Kraft inequality, we can construct a prefix code with these codeword lengths. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 14/41

15 5.3 Optimal Codes Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 15/41

16 Minimize expected length Problem Given the source pmfpp 1,p 2,...p m, find the code length l 1,l 2,...,l m such that the expected code length is minimized L = p i l i with constraint D l i 1. l 1,l 2,...,l m are integers. We first relax the original integer programming problem. The restriction of integer length is relaxed to real number. Solve by Lagrange multipliers. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 16/41

17 Solve the relaxed problem J = ( ) p i l i +λ D l i J, = p i λd l i lnd l i J = 0 D l i = p i l i λlnd D l i 1 λ = 1 lnd p i = D l i optimal code lengthli = log D p i The expected code length is L = p i l i = p i log D p i = H D (X) Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 17/41

18 Expected code length Theorem The expected lengthlof any instantaneousd-ary code for a random variablex is greater than or equal to the entropy H D (X); that is, L H D (X) with equality if and only ifd l i = p i. Proof. L H D (X) = p i l i + p i log D p i = p i log D D l i + p i log D p i = D(p q) 0 whereq i = D li. WRONG!! Because D l i 1 may not be a valid distribution. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 18/41

19 Expected code length Proof. L H D (X) = p i l i + p i log D p i Letc = j D l j,r i = D l i /c, = p i log D D l i + p i log D p i L H D (X) = p i log D p i r i log D c = D(p r)+log D 1 c 0 sinced(p r) 0 andc 1 by Kraft inequality. L H D (X) with equality iffp i = D l i. That is, iff log D p i is an integer for alli. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 19/41

20 D-adic Definition (D-adic) A probability distribution is calledd-adic if each of the probabilities is equal tod n for somen. L = H D (X) if and only if the distribution ofx isd-adic. How to find the optimal code? Find thed-adic distribution that is closest (in the relative entropy sense) to the distribution ofx. What is the upper bound of the optimal code? Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 20/41

21 5.4 Bound on Optimal Code Length Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 21/41

22 Optimal code length Theorem Letl 1,l 2,...,l m be optimal codeword lengths for a source distributionpand sad-ary alphabet, and letl be the associated expected length of an optimal code (L ast = p i l i ). Then H D (X) L < H D (X)+1. Proof. Letl i = log D 1 p i where x is the smallest integer x. These lengths satisfy the Kraft inequality since D log D 1 p i D log D 1 p i = p i = 1. These choice of codeword lengths satisfies log D 1 p i l i log D 1 p i +1. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 22/41

23 Optimal code length Multiplying byp i and summing overi, we obtain H D (X) L < H D (X)+1. SinceL is the expected length of the optimal code, L L < H D (X)+1. On another hand, from Theorem 5.3.1, L H D (X). Therefore, H D (X) L < H D (X)+1. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 23/41

24 Optimal code length Consider a system in which we send a sequence ofnsymbols fromx. DefineL n to be the expected codeword length per input symbol, We have L n = 1 p(x1,x 2,...,x n )l(x 1,x 2,...,x n ) n = 1 n E[l(X 1,X 2,...,X n )] H(X 1,X 2,...,X n ) E[l(X 1,X 2,...,X n )] < H(X 1,X 2,...,X n )+1 IfX 1,X 2,...,X n are i.i.d., we have H(X) L n < H(X)+ 1 n Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 24/41

25 Optimal code length Consider a system in which we send a sequence ofnsymbols fromx. DefineL n to be the expected codeword length per input symbol, We have L n = 1 p(x1,x 2,...,x n )l(x 1,x 2,...,x n ) n = 1 n E[l(X 1,X 2,...,X n )] H(X 1,X 2,...,X n ) E[l(X 1,X 2,...,X n )] < H(X 1,X 2,...,X n )+1 Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 25/41

26 Optimal code length IfX 1,X 2,...,X n are independent but not identically distributed, we have 1 n H(X 1,X 2,...,X n ) L n < 1 n H(X 1,X 2,...,X n )+1 If the random process is stationary 1 n H(X 1,X 2,...,X n ) H(X) Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 26/41

27 Optimal code length Theorem The minimum expected codeword length per symbol satisfies 1 n H(X 1,X 2,...,X n ) L n < 1 n H(X 1,X 2,...,X n )+1 IfX 1,X 2,...,X n is a stationary random process, L n H(X) whereh(x) is the entropy rate of the random process. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 27/41

28 Wrong Code Theorem (Wrong Code) The expected length underp(x) of the code assignmentl(x) = log 1 q(x) satisfies H(p)+D(p q) E p [l(x)] < H(p)+D(p q)+1. Proof. The expected codelength is E[l(X)] = x = x = x p(x) log 1 q(x) < x p(x) p(x)log 1 q(x) +1 ( p(x) log p(x) ) q(x) logp(x) +1 = H(p)+D(p q)+1 ( log 1 ) q(x) +1 The lower bound can be derived similarly. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 28/41

29 5.5 Kraft Inequality for Uniquely Decodable Codes Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 29/41

30 Uniquely Decodable Codes Theorem (Wrong Code) The codeword lengths of any uniquely decodabled-ary code must satisfy the Kraft inequality D l i 1. Conversely, given a set of codeword lengths that satisfy this inequality, it is possible to construct a uniquely decodable code with these codeword lengths. Proof. Consider ( ) k D l(x) = D l(x1) l(xk) D l(x2) D x x 1 x 2 x k = D l(x1) D l(x2) D l(xk) = x 1,...,x k X k x k X k D l(x k ) Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 30/41

31 Uniquely Decodable Codes We now gather the terms by word lengths to obtain x k X k D l(xk) = kl max m=1 a(m)d m wherel max is the maximum codeword length anda(m) is the number of source sequences mapping into codewords of lengthm. Since the code is uniquely decodable, so there is at most one sequence mapping into each codem-sequence and there are at mostd m code m-sequences. Thus, a(m) D m. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 31/41

32 Uniquely Decodable Codes Therefore, ( D l(x) ) k = kl max a(m)d m kl max D m D m = kl max x m=1 m=1 or D l j (kl max ) 1/k j Since this inequality is true for allk, it is true in the limit ask. Since(kl max ) 1/k 1, we have D l j 1. j Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 32/41

33 5.6 Huffman Codes Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 33/41

34 Example X = {1,2,3,4,5},p = {0.25,0.25,0.2,0.15,0.15}, binary. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 34/41

35 Example X = {1,2,3,4,5},p = {0.25,0.25,0.2,0.15,0.15}, ternary. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 35/41

36 Example X = {1,2,3,4,5,6},p = {0.25,0.25,0.2,0.1,0.1,0.1}, ternary. Atkth stage, the total number of symbols is1+k(d 1). Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 36/41

37 Shannon Code Using codeword lengths of log 1 p i May be much worse than the optimal code. For example, p 1 = 1 1/1024 andp 2 = 1/1024. log 1 p 1 = 1 and log 1 p 2 = 10. However, we can use exactly 1 bit. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 37/41

38 5.8 Optimality of Huffman Codes Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 38/41

39 Properties of Optimal Codes Lemma For any distribution, there exists an optimal instantaneous code that satisfies the following properties: 1. Ifp j > p k, thenl j l k. 2. The two longest codewords have the same length. 3. Two of the longest codewords differ only in the last bits. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 39/41

40 Optimality of Huffman codes Lemma LetC be the codes with distribution p 1 p 2 p K 1 p K. C is the codes with distribution p 1,p 2,...,p K 1 +p K. IfC is optimal with code assignment p 1 w 1,p 2 w 2,...,p K 1 +p K w K 1, thenc is also optimal with code assignment p 1 w 1,p 2 w 2,...,p K 1 w K 1 0,p K w K 1 1. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 40/41

41 Optimality of Huffman codes Proof. The average length forc is L(C ) = p 1 l 1 +p 2 l 2 + +(p K 1 +p K )l K 1. The average length forc is L(C) = p 1 l 1 +p 2 l 2 + +p K 1 (l K 1 +1)+p K (l K 1 +1). We have L(C) = L(C )+p K 1 +p K. That is, we can minimizel(c) by minimizingl(c ) sincep K 1 +p K is a constant. Peng-Hua Wang, April 2, 2012 Information Theory, Chap. 5 - p. 41/41

Chapter 5: Data Compression

Chapter 5: Data Compression Definition. A source code C for a random variable X is a mapping from the range of X to the set of finite length strings of symbols from a D-ary alphabet. ˆX: source alphabet,