COMP9319 Web Data Compression and Search
Lecture 2: Adaptive Huffman, BWT
Original readings
Login to your CSE account: cd ~cs9319/papers
Original readings for each lecture will be placed there.
Course schedule
- Data compression
- Search
- Data compression + Search
- Web data compression + Search
- Optional topics
Huffman coding
Symbol  Freq     Huffman code
a       100000   0
b       6        10
c       2        110
d       1        1110
e       1        1111
Total   100010
[Figure: the corresponding Huffman tree over a, b, c, d, e]
Huffman not optimal
H = (100000/100010) log(100010/100000) + (6/100010) log(100010/6)
  + (2/100010) log(100010/2) + 2 * (1/100010) log(100010)
  ≈ 0.0016 bits/symbol
L = (100000*1 + 6*2 + 2*3 + 1*4 + 1*4) / 100010 ≈ 1 bit/symbol
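The two quantities can be checked directly (a small sketch; the frequencies and code lengths are those of the example table):

```python
import math

# Frequencies and Huffman code lengths from the example
freqs = {'a': 100000, 'b': 6, 'c': 2, 'd': 1, 'e': 1}
code_len = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 4}

total = sum(freqs.values())  # 100010

# Entropy: H = sum over symbols of p * log2(1/p)
H = sum(f / total * math.log2(total / f) for f in freqs.values())

# Average Huffman code length: L = sum over symbols of p * |code|
L = sum(f * code_len[s] for s, f in freqs.items()) / total

print(round(H, 4), round(L, 4))  # 0.0016 1.0002
```

Huffman spends a whole bit on 'a' even though its self-information is only about 0.00014 bits, hence the large gap between L and H.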
Problems of Huffman coding
Huffman codes use an integral number of bits per symbol. E.g., log2(3) ≈ 1.585, while Huffman may need 2 bits.
Noticeable non-optimality when the probability of a symbol is high.
=> Arithmetic coding
Problems of static coding
Needs statistics, and stays static: e.g., a single pass over the data just to collect statistics, and the statistics remain unchanged during encoding.
To decode, the statistics table needs to be transmitted. The table size can be significant for small messages.
=> Adaptive compression, e.g., adaptive Huffman
Adaptive compression
Encoder:
  Initialize the model
  Repeat for each input char:
    Encode char
    Update the model
Decoder:
  Initialize the model
  Repeat for each input char:
    Decode char
    Update the model
Make sure both sides use the same "initialize model" and "update model" algorithms.
Adaptive Huffman coding (dummy)
Encoder:
  Reset the statistics
  Repeat for each input char:
    Encode char
    Update the statistics
    Rebuild the Huffman tree
Decoder:
  Reset the statistics
  Repeat for each input char:
    Decode char
    Update the statistics
    Rebuild the Huffman tree
This works, but it is too slow!
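The dummy scheme can be made concrete in a few lines (a sketch, not the FGK/Vitter algorithm: to avoid the NYT machinery it assumes every symbol of a known alphabet starts with count 1, and it rebuilds the code from scratch after every symbol):

```python
import heapq
from itertools import count

def build_code(freq):
    """Rebuild a Huffman code from scratch for the current counts."""
    tiebreak = count()  # unique tags keep heap comparisons well-defined
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate one-symbol alphabet
        return {heap[0][2]: '0'}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:                               # leaf: a symbol
            code[node] = prefix
    walk(heap[0][2], '')
    return code

def encode(msg, alphabet):
    freq = {s: 1 for s in alphabet}         # assumed initial model
    bits = []
    for ch in msg:
        bits.append(build_code(freq)[ch])   # encode with the current tree
        freq[ch] += 1                       # update the statistics
    return ''.join(bits)

def decode(bits, alphabet):
    freq = {s: 1 for s in alphabet}         # identical initial model
    msg, i = [], 0
    while i < len(bits):
        inv = {c: s for s, c in build_code(freq).items()}
        j = i + 1
        while bits[i:j] not in inv:         # extend until a codeword matches
            j += 1
        ch = inv[bits[i:j]]
        msg.append(ch)
        freq[ch] += 1                       # identical update rule
        i = j
    return ''.join(msg)
```

Encoder and decoder stay in lockstep because they start from the same counts and apply the same update; rebuilding the whole tree per symbol is exactly what makes this version too slow.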
Adaptive Huffman (algorithm outline)
1. If the current symbol is NYT, add two child nodes to the NYT node: one becomes the new NYT node, the other a leaf node for our symbol. Increase the weight of the new leaf node and of the old NYT node, and go to step 4. Otherwise, go to the symbol's leaf node.
2. If this node does not have the highest number in its block, swap it with the node having the highest number (unless that node is its parent).
3. Increase the weight of the current node.
4. If this is not the root node, go to the parent node and then go to step 2. If this is the root, end.
The update procedure: from Introduction to Data Compression by Khalid Sayood.
Also, Wikipedia provides a good summary, example and explanation (i.e., http://en.wikipedia.org/wiki/Adaptive_huffman_coding)
Adaptive Huffman
abbbbba: 01100001011000100110001001100010011000100110001001100001 (8-bit ASCII, 56 bits)
abbbbba: 011000010011000100111101 (adaptive Huffman, 24 bits)
a: 01100001, b: 01100010
Modified from Wikipedia
More example (FGK as more a's arrive)
[Tree figures; node numbers and weights at each step:]
1. 256:W=17 (root); 254:W=7; 252:W=3; 253:W=4; e@255:W=10; a@248:W=1; b@249:W=2; c@250:W=2; d@251:W=2. More a's coming.
2. After one more a: 256:W=18; 254:W=8; 252:W=4; a@248:W=2.
3. After another a (a swaps with d): 256:W=19; 254:W=9; 253:W=5; a@251:W=3; d@248:W=2.
4. After another a: 256:W=20; 254:W=10; 253:W=6; a@251:W=4.
5. a swaps up to node 252's position (+1 propagates): 256:W=20; a@252:W=5; 253:W=6; 251:W=4.
6. After another a (e's subtree swaps up): 256:W=21; e@254:W=10; 255:W=11; a@252:W=5; 253:W=6.
Adaptive Huffman (FGK)
Adaptive Huffman (FGK): when f is inserted
Adaptive Huffman (FGK vs Vitter)
1. FGK: (explicit) node numbering. Vitter: implicit numbering.
2. Vitter's invariant: among nodes of the same weight, all leaves precede all internal nodes in the implicit numbering.
"aa bbb c" (Huffman)
Frequencies: a:2, b:3, c:1, sp:2
[Figures: two alternative Huffman trees for these frequencies]
Code 1: a=01, b=1, c=000, sp=001 (total 16 bits)
Code 2: a=10, b=11, c=00, sp=01 (total 16 bits)
Adaptive Huffman (Vitter's invariant)
Adaptive Huffman (Vitter 1987)
abbbbba: 01100001011000100110001001100010011000100110001001100001 (8-bit ASCII)
abbbbba: 011000010011000100111101
a: 01100001, b: 01100010
Modified from Wikipedia

Adaptive Huffman (Vitter 1987)
abbbbba: 01100001011000100110001001100010011000100110001001100001 (8-bit ASCII)
abbbbba: 011000010011000101111101
[Figure: tree with nodes 255, 254 and leaf a]
a: 01100001, b: 01100010
You can correct the Wikipedia article. Modified from Wikipedia
Adaptive Huffman (Vitter '87)
Adaptive Huffman
Question: adaptive Huffman vs static Huffman?
Compared with static Huffman
- Dynamic, and can offer better compression (cf. Vitter's experiments next), i.e., the tree can be smaller (and hence the codes shorter) before the whole bitstream has been received.
- Works when prior statistics are unavailable.
- Saves the symbol table overhead (cf. Vitter's experiments next).
Vitter's experiments
[Tables: one includes overheads such as symbol tables / leaf node codes etc. (95 ASCII chars + <end-of-line>); the other excludes such overheads.]
From Vitter's paper. You know where it is.
More experiments
Next: BWT
BWT: Burrows-Wheeler Transform
It is a transform, not a compression method; but it usually helps compression (especially text compression).
Excerpted from Wikipedia
Recall Lecture 1's RLE and BWT example:
rabcabcababaabacabcabcabcababaa$
aabbbbccacccrcbaaaaaaaaaabbbbba$   (after BWT)
aab4ccac3rcba10b5a$                (after RLE)
A simple example
Input: #BANANAS
Excerpted from Wikipedia

All rotations:
#BANANAS
S#BANANA
AS#BANAN
NAS#BANA
ANAS#BAN
NANAS#BA
ANANAS#B
BANANAS#

Sort the rows:
#BANANAS
ANANAS#B
ANAS#BAN
AS#BANAN
BANANAS#
NANAS#BA
NAS#BANA
S#BANANA

Output (the last column): SBNN#AAA
Exercise: try it yourself on the Lecture 1 example
rabcabcababaabacabcabcabcababaa$
aabbbbccacccrcbaaaaaaaaaabbbbba$
Now the inverse
Input: S B N N # A A A

First add:  S, B, N, N, #, A, A, A
Then sort:  #, A, A, A, B, N, N, S
Add again:  S#, BA, NA, NA, #B, AN, AN, AS
Then sort:  #B, AN, AN, AS, BA, NA, NA, S#
Then add:   S#B, BAN, NAN, NAS, #BA, ANA, ANA, AS#
Then sort:  #BA, ANA, ANA, AS#, BAN, NAN, NAS, S#B
Then add:   S#BA, BANA, NANA, NAS#, #BAN, ANAN, ANAS, AS#B
Then sort:  #BAN, ANAN, ANAS, AS#B, BANA, NANA, NAS#, S#BA
Then add:   S#BAN, BANAN, NANAS, NAS#B, #BANA, ANANA, ANAS#, AS#BA
Then sort:  #BANA, ANANA, ANAS#, AS#BA, BANAN, NANAS, NAS#B, S#BAN
Then add:   S#BANA, BANANA, NANAS#, NAS#BA, #BANAN, ANANAS, ANAS#B, AS#BAN
Then sort:  #BANAN, ANANAS, ANAS#B, AS#BAN, BANANA, NANAS#, NAS#BA, S#BANA
Then add:   S#BANAN, BANANAS, NANAS#B, NAS#BAN, #BANANA, ANANAS#, ANAS#BA, AS#BANA
Then sort:  #BANANA, ANANAS#, ANAS#BA, AS#BANA, BANANAS, NANAS#B, NAS#BAN, S#BANAN
Then add:   S#BANANA, BANANAS#, NANAS#BA, NAS#BANA, #BANANAS, ANANAS#B, ANAS#BAN, AS#BANAN
Then sort:  #BANANAS, ANANAS#B, ANAS#BAN, AS#BANAN, BANANAS#, NANAS#BA, NAS#BANA, S#BANANA
The table is now the full sorted rotation table; the row ending with the end-of-string marker recovers the original string.
Implementation
Do we need to represent the full table in the encoder? No: a single pointer (the rotation's start index) per row is enough.
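A sketch of the pointer idea: sort rotation start indices rather than materializing the rows (`bwt_by_index` is an illustrative name; a production encoder would build a suffix array instead of comparing whole rotations):

```python
def bwt_by_index(s):
    n = len(s)
    # One pointer (start index) per row; sorting the pointers by the
    # rotation they denote gives the sorted table implicitly.
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    # The last column is the character just before each rotation's start.
    return ''.join(s[(i - 1) % n] for i in order)
```

For example, `bwt_by_index("BANANAS#")` returns "SBNN#AAA" without ever storing the 8x8 table.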
BWT(S)
function BWT(string s):
  create a table whose rows are all possible rotations of s
  sort the rows alphabetically
  return the last column of the table
InverseBWT(S)
function inverseBWT(string s):
  create an empty table
  repeat length(s) times:
    insert s as a column before the first column of the table  // the first insert creates the first column
    sort the rows of the table alphabetically
  return the row that ends with the 'EOF' character
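The two functions transcribe directly into runnable Python (a sketch; the end-of-string marker '#' is placed at the end of the input so that the "row that ends with EOF" rule applies, and the BWT string is the same as the slides' example since rotations of a cyclic string coincide):

```python
def bwt(s):
    # Rows are all rotations of s; s must contain a unique EOF marker
    table = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(row[-1] for row in table)

def inverse_bwt(r, eof='#'):
    table = [''] * len(r)
    for _ in range(len(r)):
        # Prepend the transformed string as a column, then sort the rows
        table = sorted(r[i] + table[i] for i in range(len(r)))
    return next(row for row in table if row.endswith(eof))
```

`bwt("BANANAS#")` gives "SBNN#AAA", and `inverse_bwt("SBNN#AAA")` recovers "BANANAS#".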
Move-to-Front (MTF)
Reduces entropy based on local frequency correlation.
Usually applied to BWT output before an entropy-encoding step.
Author and details: original paper at cs9319/papers; http://www.arturocampos.com/ac_mtf.html
Example: abaabacad
Symbol  Code  List (after update)
a       0     abcde..
b       1     bacde..
a       1     abcde..
a       0     abcde..
b       1     bacde..
a       1     abcde..
c       2     cabde..
a       1     acbde..
d       3     dacbe..
To transform a general file, the list holds all 256 ASCII symbols.
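A sketch of MTF over a small alphabet (the list updates and output match the table's trace):

```python
def mtf_encode(s, alphabet):
    lst = list(alphabet)
    out = []
    for ch in s:
        i = lst.index(ch)          # position in the current list is the code
        out.append(i)
        lst.insert(0, lst.pop(i))  # move the symbol to the front
    return out

def mtf_decode(codes, alphabet):
    lst = list(alphabet)
    out = []
    for i in codes:
        ch = lst[i]
        out.append(ch)
        lst.insert(0, lst.pop(i))  # mirror the encoder's update
    return ''.join(out)
```

`mtf_encode("abaabacad", "abcde")` returns [0, 1, 1, 0, 1, 1, 2, 1, 3], matching the Code column above; runs of one symbol (as BWT tends to produce) map to runs of small codes.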
BWT compressor vs ZIP
ZIP (i.e., LZW based) vs BWT+RLE+MTF+AC
[Figure: compression results]
From http://marknelson.us/1996/09/01/bwt/
Other ways to reverse BWT
Consider L = BWT(S) composed of the symbols V_0 ... V_N-1. The transformed string may be parsed to obtain:
- For each position i, the number of symbols in the substring V_0 ... V_i-1 that are identical to V_i.
- For each unique symbol V_i in L, the number of symbols that are lexicographically less than that symbol.
Example (L = BNN[AA]A, the BWT of "[BANANA]")

Position  Symbol  # Matching
0         B       0
1         N       0
2         N       1
3         [       0
4         A       0
5         A       1
6         ]       0
7         A       2

Symbol  # LessThan
A       0
B       3
N       4
[       6
]       7
Decoding one symbol at a time, right to left ("?" = not yet decoded):
???????]
??????A]
?????NA]
????ANA]
???NANA]
??ANANA]
?BANANA]
[BANANA]
(the same Position / # Matching and Symbol / # LessThan tables are used at every step)
[BANANA] fully decoded.
# Matching is also known as Occ / Rank; # LessThan is the C array.
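These two tables are exactly what is needed to invert L without building the rotation table. A sketch (assuming the input string ends with a unique end-of-string symbol; `inverse_bwt_lf` is an illustrative name):

```python
def inverse_bwt_lf(l, eof):
    # "# Matching" (Occ / Rank): occurrences of l[i] within l[:i]
    seen = {}
    rank = []
    for ch in l:
        rank.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # "# LessThan" (the C array): symbols of l that sort before each symbol
    less = {}
    total = 0
    for ch in sorted(seen):
        less[ch] = total
        total += seen[ch]
    # Walk backwards: the row whose last symbol is eof is the original
    # string; LF(i) = less[l[i]] + rank[i] jumps to the preceding symbol.
    i = l.index(eof)
    out = []
    for _ in range(len(l)):
        out.append(l[i])
        i = less[l[i]] + rank[i]
    return ''.join(reversed(out))
```

On the slide's example, `inverse_bwt_lf("BNN[AA]A", ']')` returns "[BANANA]". Each output symbol costs O(1) once the two tables are built, versus the repeated sorting of the table-based inverse.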
An illustration
First: A A A B N N [ ]
Last:  B N N [ A A ] A
Following the Last-to-First pointers, the string grows right to left:
A]
NA]
ANA]
NANA]
ANANA]
BANANA]
[BANANA]
(First: A A A B N N [ ]; Last: B N N [ A A ] A at every step)
Dynamic BWT?
Instead of recomputing the BWT from scratch, perform a local reordering of the original BWT.
Details: Salson M, Lecroq T, Léonard M and Mouchard L (2009). "A Four-Stage Algorithm for Updating a Burrows-Wheeler Transform". Theoretical Computer Science 410 (43): 4350.