Huffman Coding
C.M. Liu
Perceptual Lab, College of Computer Science, National Chiao-Tung University
http://www.csie.nctu.edu.tw/~cmliu/courses/compression/
Office: EC538 (03)573877 cmliu@cs.nctu.edu.tw
Overview
- Huffman Coding Algorithm: the procedure to build Huffman codes
- Extended Huffman Codes
- Adaptive Huffman Coding: update procedure, decoding procedure
- Golomb Codes
Shannon-Fano Coding
- The first code based on Shannon's theory
- Suboptimal (it took a graduate student to fix it!)
- Algorithm
  1. Start with empty codes.
  2. Compute frequency statistics for all symbols.
  3. Order the symbols in the set by frequency.
  4. Split the set into two subsets so that the difference between their total frequencies is minimized.
  5. Add 0 to the codes in the first subset and 1 to the codes in the rest.
  6. Recursively assign the remaining code bits within each of the two subsets, until the sets cannot be split.
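The algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `shannon_fano` and the example counts are assumptions made for the sketch, and ties between equally good splits are broken in favor of the earliest cut.

```python
def shannon_fano(freqs):
    """Return {symbol: code} for a {symbol: frequency} mapping."""
    # Order the symbols by decreasing frequency (stable for equal counts).
    symbols = sorted(freqs, key=lambda s: freqs[s], reverse=True)
    codes = {s: "" for s in symbols}

    def split(group):
        if len(group) < 2:
            return  # a single symbol cannot be split further
        total = sum(freqs[s] for s in group)
        # Find the cut that minimizes the difference between the two halves.
        best_cut, best_diff, running = 1, None, 0
        for i in range(1, len(group)):
            running += freqs[group[i - 1]]
            diff = abs(total - 2 * running)
            if best_diff is None or diff < best_diff:
                best_cut, best_diff = i, diff
        first, rest = group[:best_cut], group[best_cut:]
        for s in first:
            codes[s] += "0"
        for s in rest:
            codes[s] += "1"
        split(first)
        split(rest)

    split(symbols)
    return codes


# Hypothetical symbol counts, chosen only for illustration.
codes = shannon_fano({"a": 35, "b": 17, "c": 17, "d": 16, "e": 15})
```

With these counts the first split is {a, b} versus {c, d, e}, so a, b, and c end up with 2-bit codes and d, e with 3-bit codes.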
Shannon-Fano Coding (2)-(10)
[Figure sequence: the algorithm applied step by step to six symbols a-f ordered by frequency. The set is first split into {a, b} and {c, d, e, f}; {c, d, e, f} is then split into {c, d} and {e, f}; each split appends one more bit to the codes until every subset holds a single symbol. The tree diagrams, frequency values, and code table did not survive extraction.]
Shannon-Fano Coding: Remarks
- Shannon-Fano does not always produce optimal prefix codes; a standard counterexample is the set of probabilities {0.35, 0.17, 0.17, 0.16, 0.15}.
- Huffman coding is almost as computationally simple and produces prefix codes that always achieve the lowest expected codeword length, under the constraint that each symbol is represented by a code formed of an integral number of bits.
- Symbol-by-symbol Huffman coding is only optimal (i.e., meets the entropy) if the symbols are independent and each probability is some power of one half, i.e. $1/2^n$.
Optimum Prefix Codes
- Key observations on optimal codes
  1. Symbols that occur more frequently will have shorter codewords.
  2. The two least frequent symbols will have codewords of the same length.
- Proofs
  1. Assume the opposite: a more frequent symbol has the longer codeword. Swapping the two codewords would lower the average length, so the code is clearly sub-optimal.
  2. Assume the opposite: let X, Y be the least frequent symbols with |code(X)| = k and |code(Y)| = k+1. By unique decodability (UD), code(X) cannot be a prefix of code(Y); also, all other codes are shorter. Dropping the last bit of code(Y) would therefore generate a new, shorter, uniquely decodable code, which contradicts the optimality assumption.
Huffman Coding
- David Huffman (1951), grad student of Robert M. Fano (MIT); the code came out of a term paper(!)
- Explained by example

Letter  Code  Probability
a             0.2
b             0.4
c             0.2
d             0.1
e             0.1
Huffman Coding by Example
- Init: create a set out of each letter.
- Repeat until a single set remains:
  1. Sort the sets according to probability (lowest first).
  2. Insert prefix 1 into the codes of the letters in the top set.
  3. Insert prefix 0 into the codes of the letters in the second set.
  4. Merge the top two sets.

Round  Sorted sets (set prob)               Codes after steps 2-3
1      d 0.1, e 0.1, a 0.2, c 0.2, b 0.4    d = 1, e = 0
2      de 0.2, a 0.2, c 0.2, b 0.4          d = 11, e = 10, a = 0
3      c 0.2, dea 0.4, b 0.4                c = 1, a = 00, d = 011, e = 010
4      b 0.4, cdea 0.6                      b = 1, c = 01, a = 000, d = 0011, e = 0010

Final code (The END):

Letter  Code  Probability
a       000   0.2
b       1     0.4
c       01    0.2
d       0011  0.1
e       0010  0.1
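The sort / prefix / merge loop of the example can be sketched as follows. This is an illustrative sketch rather than the lecture's implementation: the name `huffman_by_sets` is hypothetical, the probabilities are assumed to be 0.2, 0.4, 0.2, 0.1, 0.1 for a-e and are scaled to integer tenths to avoid floating-point ties, and a newly merged set is placed ahead of other sets of equal weight.

```python
def huffman_by_sets(weights):
    """Build codes by repeatedly merging the two lowest-weight sets."""
    codes = {s: "" for s in weights}
    sets = [(w, [s]) for s, w in weights.items()]
    while len(sets) > 1:
        # 1. Sort sets by weight, lowest first (Python's sort is stable).
        sets.sort(key=lambda item: item[0])
        (w1, top), (w2, second) = sets[0], sets[1]
        # 2. Insert prefix 1 into the codes of the top set.
        for s in top:
            codes[s] = "1" + codes[s]
        # 3. Insert prefix 0 into the codes of the second set.
        for s in second:
            codes[s] = "0" + codes[s]
        # 4. Merge the top two sets; the merged set goes to the front.
        sets = [(w1 + w2, top + second)] + sets[2:]
    return codes


# Assumed example probabilities, in tenths.
codes = huffman_by_sets({"a": 2, "b": 4, "c": 2, "d": 1, "e": 1})
```

Under these assumptions the sketch yields b = 1, c = 01, a = 000, d = 0011, e = 0010. A different tie-breaking rule gives different codewords with the same average length.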
Example Summary
- Average code length: $l = 0.4 \times 1 + 0.2 \times 2 + 0.2 \times 3 + 0.1 \times 4 + 0.1 \times 4 = 2.2$ bits/symbol
- Entropy: $H = -\sum_{s=a..e} P(s) \log_2 P(s) = 2.122$ bits/symbol
- Redundancy: $l - H = 0.078$ bits/symbol
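The three figures in the summary can be checked numerically. The sketch below assumes the example's probabilities (0.2, 0.4, 0.2, 0.1, 0.1 for a-e) and codeword lengths (3, 1, 2, 4, 4); the variable names are chosen for illustration only.

```python
import math

# Assumed example source and codeword lengths.
probs = {"a": 0.2, "b": 0.4, "c": 0.2, "d": 0.1, "e": 0.1}
lengths = {"a": 3, "b": 1, "c": 2, "d": 4, "e": 4}

# Average code length: sum of P(s) * |code(s)|.
avg_length = sum(p * lengths[s] for s, p in probs.items())

# First-order entropy: -sum of P(s) * log2 P(s).
entropy = -sum(p * math.log2(p) for p in probs.values())

# Redundancy: how far the code is from the entropy bound.
redundancy = avg_length - entropy

print(f"l = {avg_length:.3f}, H = {entropy:.3f}, l - H = {redundancy:.3f}")
```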
Huffman Tree / Building a Huffman Tree
[Figure sequence: the Huffman tree for the example, built bottom-up. Leaves e (0.1) and d (0.1) are merged into a node of weight 0.2; that node and a (0.2) form a node of weight 0.4; that node and c (0.2) form a node of weight 0.6; finally that node and b (0.4) form the root of weight 1.0. Each branch is labeled 0 or 1, and a letter's code is read along the path from the root to its leaf.]
An Alternative Huffman Tree
[Figure sequence: a different merge order for the same source. e (0.1) and d (0.1) are merged into 0.2, a (0.2) and c (0.2) into 0.4, these two nodes into 0.6, and finally 0.6 with b (0.4) into the root. b gets a 1-bit code; a, c, d, e get 3-bit codes.]
- Average code length: $l = 0.4 \times 1 + (0.2 + 0.2 + 0.1 + 0.1) \times 3 = 2.2$ bits/symbol
Yet Another Tree
[Figure: a third tree for the same source, in which a, b, and c get 2-bit codes and d, e get 3-bit codes.]
- Average code length: $l = 0.4 \times 2 + (0.2 + 0.2) \times 2 + (0.1 + 0.1) \times 3 = 2.2$ bits/symbol
Design Examples
Min Variance Huffman Trees
- Huffman codes are not unique; all versions yield the same average length.
- Which one should we choose? The one with the minimum variance in codeword lengths, i.e. the one with the minimum-height tree.
- Why? It will ensure the least amount of variability in the encoded stream.
- How to achieve it? During sorting, break ties by placing smaller sets higher. Alternatively, place newly merged sets as low as possible.
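To make the "same average, different variance" point concrete, the sketch below compares three sets of codeword lengths for the five-letter example source. The probabilities and the three length assignments are assumptions taken from the average-length formulas of the three example trees; the names `trees` and `mean_and_variance` are hypothetical.

```python
# Assumed example source.
probs = {"a": 0.2, "b": 0.4, "c": 0.2, "d": 0.1, "e": 0.1}

# Codeword lengths read off the three example trees (assumed).
trees = {
    "first": {"a": 3, "b": 1, "c": 2, "d": 4, "e": 4},
    "alternative": {"a": 3, "b": 1, "c": 3, "d": 3, "e": 3},
    "third": {"a": 2, "b": 2, "c": 2, "d": 3, "e": 3},
}


def mean_and_variance(lengths):
    """Probability-weighted mean and variance of the codeword length."""
    mean = sum(probs[s] * n for s, n in lengths.items())
    var = sum(probs[s] * (n - mean) ** 2 for s, n in lengths.items())
    return mean, var


stats = {name: mean_and_variance(lengths) for name, lengths in trees.items()}
for name, (mean, var) in stats.items():
    print(f"{name}: mean = {mean:.2f} bits, variance = {var:.2f}")
```

All three trees average 2.2 bits/symbol, but the variances come out as roughly 1.36, 0.96, and 0.16, so the third (shallowest) tree is the minimum-variance choice.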
Extended Huffman Codes
- Consider the source A = {a, b, c} with P(a) = 0.8, P(b) = 0.02, P(c) = 0.18
- H = 0.816 bits/symbol
- Huffman code: a = 0, b = 11, c = 10
- l = 1.2 bits/symbol
- Redundancy = 0.384 bits/symbol (47%!)
- Q: Could we do better?
Extended Huffman Codes (2)
- Idea: consider encoding sequences of two letters as opposed to single letters.

Letter  Probability  Code
aa      0.6400       0
ab      0.0160       10101
ac      0.1440       11
ba      0.0160       101000
bb      0.0004       10100101
bc      0.0036       1010011
ca      0.1440       100
cb      0.0036       10100100
cc      0.0324       1011

- l = 1.7228/2 = 0.8614 bits/symbol
- Redundancy = 0.8614 - 0.816 = 0.045 bits/symbol
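The gain from pairing letters can be reproduced with a small experiment. The sketch assumes the source P(a) = 0.8, P(b) = 0.02, P(c) = 0.18 and uses a generic heap-based Huffman length computation; `huffman_lengths` and `average_length` are hypothetical helper names, not part of the lecture.

```python
import heapq
import itertools


def huffman_lengths(probs):
    """Return {symbol: codeword length} for a Huffman code over probs."""
    counter = itertools.count()  # tie-breaker so lists are never compared
    heap = [(p, next(counter), [s]) for s, p in probs.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    while len(heap) > 1:
        p1, _, g1 = heapq.heappop(heap)
        p2, _, g2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper.
        for s in g1 + g2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, next(counter), g1 + g2))
    return lengths


def average_length(probs):
    lengths = huffman_lengths(probs)
    return sum(p * lengths[s] for s, p in probs.items())


# Assumed single-letter source.
single = {"a": 0.8, "b": 0.02, "c": 0.18}
# Extended alphabet: all two-letter sequences of an independent source.
pairs = {x + y: single[x] * single[y]
         for x, y in itertools.product(single, repeat=2)}

per_symbol_single = average_length(single)
per_symbol_pairs = average_length(pairs) / 2  # two source letters per codeword
```

Under these assumptions the per-symbol rate drops from 1.2 to about 0.8614 bits, close to the 0.816-bit entropy.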
Extended Huffman Codes (3)
- The idea can be extended further: consider all possible $n^m$ sequences of m letters over an n-letter alphabet (we did $3^2$).
- In theory, by considering more sequences we can improve the coding.
- In reality, the exponential growth of the alphabet makes this impractical. E.g., for length-3 ASCII sequences: $256^3 = 2^{24} \approx 16$M entries, and most sequences would have zero frequency.
- Other methods are needed.
Adaptive Huffman Coding
- Problem
  - Huffman requires probability estimates.
  - This could turn it into a two-pass procedure:
    1. Collect statistics, generate codewords.
    2. Perform actual encoding.
  - Not practical in many situations, e.g. compressing network transmissions.
- Theoretical solution
  - Start with equal probabilities.
  - Based on the statistics of the first k symbols (k = 1, 2, ...), regenerate the codewords and encode the (k+1)-st symbol.
  - Too expensive in practice.
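The "theoretical solution" is easy to prototype, which also shows why it is expensive: the whole code is rebuilt for every symbol. The sketch below is an assumption-laden illustration, not the tree-update method of the following slides: every symbol starts with a count of 1 (equal probabilities), the decoder is told how many symbols to expect, and `build_codes`, `encode`, and `decode` are hypothetical names.

```python
import heapq


def build_codes(counts):
    """Deterministic Huffman codes from integer counts."""
    # Ties are broken by symbol order so encoder and decoder agree.
    heap = [(c, i, [s]) for i, (s, c) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    codes = {s: "" for s in counts}
    tick = len(heap)
    while len(heap) > 1:
        c1, _, g1 = heapq.heappop(heap)
        c2, _, g2 = heapq.heappop(heap)
        for s in g1:
            codes[s] = "0" + codes[s]
        for s in g2:
            codes[s] = "1" + codes[s]
        heapq.heappush(heap, (c1 + c2, tick, g1 + g2))
        tick += 1
    return codes


def encode(text, alphabet):
    counts = {s: 1 for s in alphabet}  # start with equal probabilities
    out = []
    for ch in text:
        out.append(build_codes(counts)[ch])  # code from the first k symbols
        counts[ch] += 1                      # then update the statistics
    return "".join(out)


def decode(bits, alphabet, n_symbols):
    counts = {s: 1 for s in alphabet}
    out, pos = [], 0
    for _ in range(n_symbols):
        inverse = {code: s for s, code in build_codes(counts).items()}
        end = pos + 1
        while bits[pos:end] not in inverse:  # extend until a codeword matches
            end += 1
            if end > len(bits):
                raise ValueError("truncated bit stream")
        ch = inverse[bits[pos:end]]
        out.append(ch)
        counts[ch] += 1
        pos = end
    return "".join(out)


bits = encode("aardvark", "adkrv")
restored = decode(bits, "adkrv", 8)
```

Because both sides apply the same deterministic rebuild after every symbol, no code table has to be transmitted.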
Adaptive Huffman Coding (2)
- Basic idea
    Alphabet A = {a_1, ..., a_n}
    Pick a fixed default binary code for every symbol
    Start with an empty Huffman tree
    Repeat
        Read symbol s from source
        If NYT(s)                 // Not Yet Transmitted
            Send NYT, default(s)
            Update tree (and keep it Huffman)
        Else
            Send codeword for s
            Update tree
    Until done
- Notes
  - Codewords will change as a function of symbol frequencies.
  - Encoder & decoder follow the same procedure, so they stay in sync.
Adaptive Huffman Tree
- The tree has at most 2n - 1 nodes.
- Node attributes
  - symbol, left, right, parent, siblings, leaf
  - weight: if $x_k$ is a leaf, then weight($x_k$) = frequency of symbol($x_k$); else weight($x_k$) = weight(left($x_k$)) + weight(right($x_k$))
  - id, assigned as follows: if weight($x_1$) ≤ weight($x_2$) ≤ ... ≤ weight($x_{2n-1}$), then id($x_1$) ≤ id($x_2$) ≤ ... ≤ id($x_{2n-1}$)
  - also, parent($x_{2k-1}$) = parent($x_{2k}$), for 1 ≤ k < n
- Together, the last two conditions are the sibling property.
Updating the Tree
- Assign id(root) = 2n - 1, weight(NYT) = 0.
- Start with an NYT node.
- Whenever a new symbol is seen, a new node is formed by splitting the NYT node.
- Maintaining the sibling property: whenever node x is updated
    Repeat
        If weight(x) < weight(y), for all y ∈ siblings(x)
            weight(x)++
            exit
        Else
            swap(x, z), where z is the rightmost sibling with weight(x) == weight(z)
            weight(x)++
            x = parent(x)
    Until x == root
Adaptive Huffman Encoding
- Input: aardvark
- Default codes: the 5-bit binary index of the letter (slightly more efficient default codes are possible with a 4-/5-bit combination).

a 00000   f 00101   k 01010   p 01111   u 10100
b 00001   g 00110   l 01011   q 10000   v 10101
c 00010   h 00111   m 01100   r 10001   w 10110
d 00011   i 01000   n 01101   s 10010   x 10111
e 00100   j 01001   o 01110   t 10011   y 11000

[Figure sequence: the tree after each input symbol, including the intermediate node swaps. Node ids count down from 51 at the root; each new symbol splits the NYT node into a new NYT node and a leaf for the symbol.]

Step  Symbol  Bits sent                    Codes after the update
1     a       00000 (default)              NYT 0, a 1
2     a       1                            NYT 0, a 1
3     r       0 10001 (NYT + default)      NYT 00, a 1, r 01
4     d       00 00011 (NYT + default)     NYT 000, a 1, r 01, d 001
5     v       000 10101 (NYT + default)    NYT 1100, a 0, r 10, d 111, v 1101
6     a       0                            unchanged
7     r       10                           unchanged
8     k       1100 01010 (NYT + default)   -

- Output: 00000 1 0 10001 00 00011 000 10101 0 10 1100 01010
Adaptive Huffman Decoding
- Input: 00000 1 0 10001 00 00011 000 10101 0 10 1100 01010
- The decoder starts from the same single-NYT tree and the same default code table as the encoder.
- It walks the tree bit by bit. On reaching a symbol leaf it outputs that symbol; on reaching the NYT leaf it reads the next 5 bits as a default code. After every decoded symbol it applies the same tree update as the encoder.
- Output after each step: a, aa, aar, aard, aardv, aardva, aardvar, aardvark
[Figure sequence: the decoder's tree after each decoded symbol; it mirrors the encoder's tree at every step.]
Dealing with Counter Overflow
- Over time, counters can overflow. E.g., a 32-bit counter holds ~4 billion: BIG, but still finite, and it can overflow on long network connections.
- Solution? Rescale all frequency counts (of leaf nodes) when the limit is reached, e.g. divide all of them by two, then re-compute the rest of the tree (keep it Huffman!).
- Note: after rescaling, new symbols will count twice as much as old ones! This is mostly a feature, not a bug: data tends to have strong local correlation, i.e. what happened a long time ago is not as important as what happened more recently.
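A rescaling step along these lines might look as follows. This is a sketch under stated assumptions: the function name `rescale` and the `limit` parameter are hypothetical, and keeping every count at 1 or above (so that an already-seen symbol does not drop out of the tree) is a design choice of the sketch, not something the slide specifies.

```python
def rescale(counts, limit):
    """Halve all leaf counts once any of them reaches `limit`."""
    if max(counts.values()) < limit:
        return dict(counts)  # nothing to do yet
    # Integer halving; the floor of 1 is an assumption of this sketch.
    return {s: max(1, c // 2) for s, c in counts.items()}


# Hypothetical leaf counts with an artificially small limit of 8.
before = {"a": 8, "b": 3, "c": 1}
after = rescale(before, limit=8)
```

After the call the counts are a = 4, b = 1, c = 1, and the internal node weights would then be recomputed from the leaves.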
Huffman Image Compression
- Example images: 256x256 pixels, 8 bits/pixel, 65,536 bytes (Sena, Sensin, Earth, Omaha)
- Huffman coding of pixel values

Image   Bits/pixel  Size (bytes)  Compression Ratio
Sena    7.01        57,504        1.14
Sensin  7.49        61,430        1.07
Earth   4.94        40,534        1.62
Omaha   7.12        58,374        1.12
Huffman Image Compression (2)
- Basic observations
  - Plain Huffman yields modest gains, except in the Earth case, where lots of black skews the pixel distribution nicely.
  - We are not taking into account obvious correlations of pixel values.
- Huffman coding of pixel differences

Image   Bits/pixel  Size (bytes)  Compression Ratio
Sena    4.02        32,968        1.99
Sensin  4.70        38,541        1.70
Earth   4.13        33,880        1.93
Omaha   6.42        52,643        1.24
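Why differencing helps can be seen on a toy scan line. The sketch below uses a synthetic ramp, not any of the test images, and measures first-order entropy as a stand-in for the achievable Huffman rate; `entropy`, `pixels`, and `diffs` are names chosen for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """First-order entropy in bits per value."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Hypothetical scan line: a smooth ramp, each gray level repeated twice.
pixels = [i // 2 for i in range(512)]
# Keep the first pixel, then store each pixel minus its left neighbor.
diffs = [pixels[0]] + [b - a for a, b in zip(pixels, pixels[1:])]

h_pixels = entropy(pixels)  # all 256 levels equally likely
h_diffs = entropy(diffs)    # only the values 0 and 1 occur
```

The raw ramp uses all 256 gray levels equally often (8 bits/pixel), while the differences take only two values (just under 1 bit/pixel), even though both sequences carry the same information.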
Two-pass Huffman vs. Adaptive Huffman

Two-pass
Image   Bits/pixel  Size (bytes)  Compression Ratio
Sena    4.02        32,968        1.99
Sensin  4.70        38,541        1.70
Earth   4.13        33,880        1.93
Omaha   6.42        52,643        1.24

Adaptive
Image   Bits/pixel  Size (bytes)  Compression Ratio
Sena    3.93        32,261        2.03
Sensin  4.63        37,896        1.73
Earth   4.82        39,504        1.66
Omaha   6.39        52,321        1.25
Huffman Text Compression
[Figure: PDF(letters), US Constitution vs. Chapter 3. Bar chart of letter probabilities for A-Z with two series, P(Constitution) and P(Chapter); x-axis: Letter, y-axis: Probability.]
Huffman Audio Compression
- Huffman coding: 16-bit CD audio (44,100 Hz) x 2 channels

File Name  Original File Size (bytes)  Entropy (bits)  Est. Compressed File Size (bytes)  Compression Ratio
Mozart     939,862                     12.8            725,420                            1.30
Cohn       402,442                     13.8            349,300                            1.15
Mir        884,020                     13.7            759,540                            1.16

- Difference Huffman coding

File Name  Original File Size (bytes)  Entropy of Diff. (bits)  Est. Compressed File Size (bytes)  Compression Ratio
Mozart     939,862                     9.7                      569,792                            1.65
Cohn       402,442                     10.4                     261,590                            1.54
Mir        884,020                     10.9                     602,240                            1.47
Golomb Codes
- Invented by Solomon W. Golomb in the 1960s.
- Golomb coding is optimal for the geometric distribution.
- Rice coding: a Golomb code has a tunable parameter that can be any positive integer; Rice codes are those in which the tunable parameter is a power of two.
- Unary code: the number n is represented by n ones followed by a 0.

n  Code
0  0
1  10
2  110
3  1110

- Identical to the Huffman code for {1, 2, 3, ...} with $P(k) = 1/2^k$; optimal for that probability model.
Golomb Codes (2)
- Uses a tunable parameter m to divide an input value into a quotient and a remainder.
- To represent n, we compute $q = \lfloor n/m \rfloor$ (quotient) and $r = n - qm$ (remainder).
- Represent q in unary code, followed by r in $\log_2 m$ bits.
- If m is not a power of 2, then we can use $\lceil \log_2 m \rceil$ bits, or better, truncated binary encoding:
  - a $\lfloor \log_2 m \rfloor$-bit representation of r for $0 \le r \le 2^{\lceil \log_2 m \rceil} - m - 1$
  - a $\lceil \log_2 m \rceil$-bit representation of $r + 2^{\lceil \log_2 m \rceil} - m$ for the rest
Golomb Codes: Truncated Binary Coding
- Truncated binary coding
  - An entropy encoding typically used for uniform probability distributions with a finite alphabet.
  - A more general form of binary encoding for when n is not a power of two.
- Coding (a prefix code)
  - For $2^k \le n < 2^{k+1}$, there are $u = 2^{k+1} - n$ unused entries.
  - k-bit codes for $0 \le r \le u - 1$.
  - (k+1)-bit codes for the rest, given by r + u.

Example for n = 5 (k = 2, u = 3):

Value  Standard binary  Truncated binary
0      000              00
1      001              01
2      010              10
3      011              110
4      100              111
5-7    unused
Golomb Codes: Truncated Binary Coding (2)

n = 7
Input value  Offset value  Standard binary  Truncated binary
0            0             000              00
1            2             001              010
2            3             010              011
3            4             011              100
4            5             100              101
5            6             101              110
6            7             110              111

n = 10
Input value  Offset value  Standard binary  Truncated binary
0            0             0000             000
1            1             0001             001
2            2             0010             010
3            3             0011             011
4            4             0100             100
5            5             0101             101
6            12            0110             1100
7            13            0111             1101
8            14            1000             1110
9            15            1001             1111
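The truncated binary rule fits in a few lines. The sketch below is an illustration with a hypothetical function name, `truncated_binary(x, n)`, that encodes a value x in the range 0..n-1 for an alphabet of size n.

```python
def truncated_binary(x, n):
    """Truncated binary code of x, where 0 <= x < n."""
    k = n.bit_length() - 1        # floor(log2 n), so 2^k <= n < 2^(k+1)
    u = (1 << (k + 1)) - n        # number of unused (k+1)-bit entries
    if x < u:
        # The first u values get the short k-bit code.
        return format(x, "0{}b".format(k)) if k > 0 else ""
    # The rest are offset by u and written with k+1 bits.
    return format(x + u, "0{}b".format(k + 1))


codes_5 = [truncated_binary(x, 5) for x in range(5)]
codes_7 = [truncated_binary(x, 7) for x in range(7)]
codes_10 = [truncated_binary(x, 10) for x in range(10)]
```

For n = 5 this gives 00, 01, 10, 110, 111. When n is a power of two, u equals n and every value simply gets its plain k-bit binary code.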
Golomb Code Example
- m = 6, $\lfloor \log_2 m \rfloor = 2$, $\lceil \log_2 m \rceil = 3$
- 2-bit codes for $0 \le r \le 2^{\lceil \log_2 6 \rceil} - 6 - 1$, i.e. $0 \le r \le 1$
- 3-bit codes of $r + 2^{\lceil \log_2 6 \rceil} - 6$ for the rest, i.e. of r + 2

n  q  r  Codeword      n   q  r  Codeword
0  0  0  000           6   1  0  1000
1  0  1  001           7   1  1  1001
2  0  2  0100          8   1  2  10100
3  0  3  0101          9   1  3  10101
4  0  4  0110          10  1  4  10110
5  0  5  0111          11  1  5  10111
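The table for m = 6 can be regenerated with a short encoder. This is a minimal sketch with a hypothetical function name, `golomb(n, m)`: the quotient is written in unary (q ones followed by a zero) and the remainder in truncated binary.

```python
def golomb(n, m):
    """Golomb codeword of the non-negative integer n with parameter m >= 1."""
    q, r = divmod(n, m)
    unary = "1" * q + "0"             # quotient in unary
    b = (m - 1).bit_length()          # ceil(log2 m)
    if b == 0:
        return unary                  # m == 1: the remainder is always 0
    cutoff = (1 << b) - m             # remainders below this get b-1 bits
    if r < cutoff:
        return unary + format(r, "0{}b".format(b - 1))
    return unary + format(r + cutoff, "0{}b".format(b))


table = [golomb(n, 6) for n in range(12)]
```

For m = 6 the cutoff is 2, so r = 0 and r = 1 take 2 bits while r = 2..5 are sent as the 3-bit codes of r + 2. For a power-of-two m the cutoff is 0 and the code reduces to a Rice code.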
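Golomb Codes: Choosing m
- Assume a binary string (zeroes & ones).
- It can be encoded by counting the runs of identical bits (either zeroes or ones), a.k.a. run-length encoding (RLE).
- E.g., a string with 35 zeroes and 12 ones, encoded as the lengths of its runs of zeroes: P(0) = 35/(35 + 12) = 0.745
- $m = \left\lceil -\dfrac{\log_2(1 + p)}{\log_2 p} \right\rceil = \left\lceil -\dfrac{\log_2(1 + 0.745)}{\log_2 0.745} \right\rceil = 2$
[Figure: the example bit string with its runs of zeroes marked and the resulting list of run lengths; the individual run lengths did not survive extraction.]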
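The choice of m can be computed directly from the zero probability. The sketch below assumes the example's counts (35 zeroes, 12 ones); `golomb_parameter` is a hypothetical helper name.

```python
import math


def golomb_parameter(p):
    """m = ceil(-log2(1 + p) / log2(p)) for a zero-probability 0 < p < 1."""
    return math.ceil(-math.log2(1 + p) / math.log2(p))


# Assumed example: 35 zeroes and 12 ones.
p_zero = 35 / (35 + 12)
m = golomb_parameter(p_zero)
```

Here p is about 0.745 and the ratio is about 1.89, which rounds up to m = 2. For p = 0.5 the same formula gives m = 1, i.e. plain unary coding of the run lengths.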
Summary
- Early: Shannon-Fano code
- Huffman code
  - Original (two-pass) version: collect symbol statistics and assign codes, then perform the actual encoding of the source.
  - Extended version: group multiple symbols to reduce the entropy estimate.
  - Adaptive version: the most practical. Builds the Huffman tree on the fly in a single pass, uses escape codes for NYT symbols, keeps encoder & decoder synchronized, and is more sensitive to local variation (it tends to forget older data).
- Homeworks (pp. 78) 4, 5, 6, 0.