COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT

Tamsin Porter
6 years ago
1 COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT

2 Original readings: Log in to your CSE account and run: cd ~cs9319/papers. The original readings for each lecture will be placed there.

3 Course schedule: Data compression; Search; Data compression + Search; Web data compression + Search; Optional topics

4 Huffman coding [table and tree figure: columns S (symbol), Freq and Huffman code for the symbols a, b, c, d, e; the values did not survive extraction]

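5 Huffman not optimal [worked example comparing the entropy H with the average Huffman code length L; most numeric values did not survive extraction: H = log ... log ... / log ..., L = (100000*1 + ...)/...]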

6 Problems of Huffman coding: Huffman codes have an integral number of bits. E.g., log2(3) ≈ 1.585, while Huffman may need 2 bits. The non-optimality is noticeable when the probability of a symbol is high. => Arithmetic coding
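A small sketch of the gap described above, assuming two toy sources that are not from the slides: three equiprobable symbols, and a two-symbol source with probabilities 0.99 and 0.01. The helper names `entropy` and `average_length` are illustrative. The code lengths passed in are the best a prefix code with whole-bit codewords can do for these sources.

```python
import math

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def average_length(probs, code_lengths):
    """Expected codeword length for the given per-symbol code lengths."""
    return sum(p * n for p, n in zip(probs, code_lengths))

# Three equiprobable symbols: entropy is log2(3), but a prefix code
# must use whole bits, e.g. lengths 1, 2, 2.
uniform3 = [1 / 3, 1 / 3, 1 / 3]
print(entropy(uniform3), average_length(uniform3, [1, 2, 2]))

# Highly skewed source: each symbol still costs at least 1 bit.
skewed = [0.99, 0.01]
print(entropy(skewed), average_length(skewed, [1, 1]))
```

In the skewed case the code spends 1 bit per symbol while the entropy is under a tenth of a bit, which is the situation arithmetic coding addresses.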

7 Problems of static coding: It needs statistics, and the statistics are static: e.g., a separate pass over the data just to collect the statistics, which then stay unchanged during encoding. To decode, the statistics table needs to be transmitted, and the table size can be significant for small messages. => Adaptive compression, e.g., adaptive Huffman

8 Adaptive compression
Encoder:
    Initialize the model
    Repeat for each input char (
        Encode char
        Update the model
    )
Decoder:
    Initialize the model
    Repeat for each input char (
        Decode char
        Update the model
    )
Make sure both sides have the same model initialization and update algorithms.

9 Adaptive Huffman coding (dummy)
Encoder:
    Reset the stat
    Repeat for each input char (
        Encode char
        Update the stat
        Rebuild Huffman tree
    )
Decoder:
    Reset the stat
    Repeat for each input char (
        Decode char
        Update the stat
        Rebuild Huffman tree
    )

10 Adaptive Huffman coding (dummy)
Encoder:
    Reset the stat
    Repeat for each input char (
        Encode char
        Update the stat
        Rebuild Huffman tree
    )
Decoder:
    Reset the stat
    Repeat for each input char (
        Decode char
        Update the stat
        Rebuild Huffman tree
    )
This works, but it is too slow!
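The dummy scheme above can be sketched as follows. This is a minimal illustration, not the course's reference code. It assumes the alphabet is known to both sides in advance and that every count starts at 1 (one simple way to "reset the stat"). The names `build_codes`, `encode` and `decode` are hypothetical. The tree is rebuilt from scratch for every character, which is exactly why the approach is slow.

```python
import heapq
from itertools import count

def build_codes(freq):
    """Build a Huffman code table from a {symbol: weight} dict.
    Ties are broken deterministically so both sides get the same tree."""
    tick = count()
    heap = [(w, next(tick), sym) for sym, w in sorted(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tick), (a, b)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                             # leaf: a symbol
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

def encode(text, alphabet):
    freq = {s: 1 for s in alphabet}       # reset the stat
    out = []
    for ch in text:
        out.append(build_codes(freq)[ch]) # rebuild tree, encode char
        freq[ch] += 1                     # update the stat
    return "".join(out)

def decode(bits, alphabet):
    freq = {s: 1 for s in alphabet}       # same reset as the encoder
    out, pos = [], 0
    while pos < len(bits):
        table = {code: sym for sym, code in build_codes(freq).items()}
        code = ""
        while code not in table:          # codes are prefix-free
            code += bits[pos]
            pos += 1
        out.append(table[code])
        freq[table[code]] += 1            # same update as the encoder
    return "".join(out)

print(decode(encode("abbbbba", "ab"), "ab"))
```

Because both sides apply the same reset and the same update after every character, no frequency table is transmitted.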

11 Adaptive Huffman (algorithm outline)
1. If the current symbol is NYT, add two child nodes to the NYT node: one is a new NYT node, the other is a leaf node for our symbol. Increase the weight of the new leaf node and of the old NYT node, then go to step 4. If not, go to the symbol's leaf node.
2. If this node does not have the highest number in its block (the nodes with the same weight), swap it with the node having the highest number.
3. Increase the weight of the current node.
4. If this is not the root node, go to the parent node and then go to step 2. If this is the root, end.
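A compact sketch of the outline above, under several assumptions that are not in the slides: node numbers are kept implicitly as positions in a list (`order[0]` is the highest-numbered node), a node is never swapped with its own parent, new symbols are sent as the NYT code followed by 8 raw bits, and the new NYT node is the left child. The class and function names are illustrative.

```python
class Node:
    def __init__(self, symbol=None, parent=None):
        self.weight, self.symbol, self.parent = 0, symbol, parent
        self.left = self.right = None

class FGKTree:
    def __init__(self):
        self.root = self.nyt = Node()
        self.order = [self.root]          # index 0 = highest node number
        self.leaves = {}

    def code_of(self, node):
        bits = []
        while node.parent is not None:
            bits.append("1" if node.parent.right is node else "0")
            node = node.parent
        return "".join(reversed(bits))

    def _leader(self, node):
        """Highest-numbered node in node's block, excluding its parent."""
        best = node
        for cand in reversed(self.order[: self.order.index(node)]):
            if cand.weight != node.weight:
                break
            if cand is not node.parent:
                best = cand
        return best

    def _swap(self, a, b):
        i, j = self.order.index(a), self.order.index(b)
        self.order[i], self.order[j] = b, a
        pa, pb = a.parent, b.parent
        a_left, b_left = pa.left is a, pb.left is b
        setattr(pa, "left" if a_left else "right", b)
        setattr(pb, "left" if b_left else "right", a)
        a.parent, b.parent = pb, pa

    def update(self, symbol):
        node = self.leaves.get(symbol)
        if node is None:                  # step 1: split the NYT node
            old = self.nyt
            leaf, self.nyt = Node(symbol, old), Node(None, old)
            old.left, old.right = self.nyt, leaf
            self.order += [leaf, self.nyt]
            self.leaves[symbol] = leaf
            leaf.weight = old.weight = 1
            node = old.parent             # continue at step 4
        while node is not None:
            leader = self._leader(node)   # step 2
            if leader is not node:
                self._swap(node, leader)
            node.weight += 1              # step 3
            node = node.parent            # step 4

def encode(text):
    tree, out = FGKTree(), []
    for ch in text:
        leaf = tree.leaves.get(ch)
        if leaf is None:
            out.append(tree.code_of(tree.nyt) + format(ord(ch), "08b"))
        else:
            out.append(tree.code_of(leaf))
        tree.update(ch)
    return "".join(out)

def decode(bits):
    tree, out, pos = FGKTree(), [], 0
    while pos < len(bits):
        node = tree.root
        while node.left is not None:      # walk down to a leaf
            node = node.right if bits[pos] == "1" else node.left
            pos += 1
        if node is tree.nyt:              # new symbol: 8 raw bits follow
            ch = chr(int(bits[pos:pos + 8], 2))
            pos += 8
        else:
            ch = node.symbol
        out.append(ch)
        tree.update(ch)
    return "".join(out)

print(decode(encode("abbbbba")))
```

The list lookups make each update linear in the number of nodes; a real implementation would track block leaders directly. The sketch only handles characters whose code point fits in 8 bits.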

12 The update procedure is from Introduction to Data Compression by Khalid Sayood. Also, Wikipedia provides a good summary, example and explanation (i.e., en.wikipedia.org/wiki/Adaptive_huffman_coding)

13 Adaptive Huffman: worked example for the input abbbbba (figure; modified from Wikipedia)

14 More example (more aaaa... coming). Tree listed level by level as node number: weight
    256: W=17
    254: W=7, e 255: W=10
    252: W=3, 253: W=4
    a 248: W=1, b 249: W=2, c 250: W=2, d 251: W=2

15 More example
    256: W=18
    254: W=8, e 255: W=10
    252: W=4, 253: W=4
    a 248: W=2, b 249: W=2, c 250: W=2, d 251: W=2

16 More example
    256: W=19
    254: W=9, e 255: W=10
    252: W=4, 253: W=5
    d 248: W=2, b 249: W=2, c 250: W=2, a 251: W=3

17 More example
    256: W=20
    254: W=10, e 255: W=10
    252: W=4, 253: W=6
    d 248: W=2, b 249: W=2, c 250: W=2, a 251: W=4

18 More example
    256: W=20
    254: W=10, e 255: W=10
    252: W=4, 253: W=6
    d 248: W=2, b 249: W=2, c 250: W=2, a 251: W=4

19 More example
    256: W=20
    254: W=10, e 255: W=10
    a 252: W=5, 253: W=6
    251: W=4, c 250: W=2
    d 248: W=2, b 249: W=2

20 More example
    256: W=21
    e 254: W=10, 255: W=11
    a 252: W=5, 253: W=6
    251: W=4, c 250: W=2
    d 248: W=2, b 249: W=2

21 Adaptive Huffman (FGK)

22 Adaptive Huffman (FGK): when f is inserted

23 Adaptive Huffman (FGK vs Vitter) 1. FGK: (explicit) node numbering; Vitter: implicit numbering. 2. Vitter's invariant:

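24 aa bbb c (Huffman). Symbol frequencies: a 2, b 3, c 1, sp 2. Sorted by frequency: c 1, sp 2, a 2, b 3.

25 aa bbb c (Huffman). Symbol frequencies: a 2, b 3, c 1, sp 2. One Huffman code: a 01, b 1, c 000, sp 001; total 16 bits. Another: a 10, b 11, c 00, sp 01; also 16 bits in total.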
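The 16-bit total above can be checked with a short static Huffman sketch. The function name `huffman_codes` and its tie-breaking rule (symbols in sorted order, earlier entries first) are assumptions for illustration. A different tie-break gives different codewords but the same total length.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(text):
    """Static Huffman code table for the characters of text."""
    freq = Counter(text)
    tick = count()                         # tie-breaker for equal weights
    heap = [(w, next(tick), sym) for sym, w in sorted(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)     # two lightest subtrees
        w2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tick), (a, b)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):        # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"    # single-symbol input gets "0"
    walk(heap[0][2], "")
    return codes

text = "aa bbb c"
codes = huffman_codes(text)
total_bits = sum(len(codes[ch]) for ch in text)
print(codes, total_bits)
```

With this tie-break the table comes out as the second code on the slide (a 10, b 11, c 00, sp 01).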

26 Adaptive Huffman (Vitter's invariant)

27 Adaptive Huffman (Vitter 1987): worked example for the input abbbbba (figure; modified from Wikipedia)

28 Adaptive Huffman (Vitter 1987): worked example for the input abbbbba (figure; modified from Wikipedia). You can correct the Wikipedia article.

29 Adaptive Huffman (Vitter 87)

30 Adaptive Huffman. Question: adaptive Huffman vs static Huffman

31 Compared with static Huffman: It is dynamic and can offer better compression (cf. Vitter's experiments next), i.e., the tree can be smaller (hence the codes shorter) before the whole bitstream is received. It works when prior statistics are unavailable. It saves the symbol table overhead (cf. Vitter's experiments next).

32 Vitter's experiments: Include overheads such as symbol tables / leaf node codes etc. 95 ASCII chars + <end-of-line>. Exclude overheads such as symbol tables / leaf node codes etc. From Vitter's paper. You know where it is.

33 More experiments

34 Next: BWT. BWT: Burrows-Wheeler Transform. It is a transform, not a compression; but it usually helps compression (esp. text compression). Excerpted from Wikipedia

35 Recall from Lecture 1's RLE and BWT example: input rabcabcababaabacabcabcabcababaa$; after BWT aabbbbccacccrcbaaaaaaaaaabbbbba$; after RLE aab4ccac3rcba10b5a$
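The RLE step in this recap can be reproduced with a short sketch. The rule used here is inferred from the example output and is an assumption: runs of three or more characters become the character followed by the run length, shorter runs are copied verbatim. The function name `rle` and the `min_run` parameter are illustrative.

```python
from itertools import groupby

def rle(text, min_run=3):
    """Run-length encode: runs of at least min_run become char + count."""
    out = []
    for ch, group in groupby(text):
        n = len(list(group))
        out.append(f"{ch}{n}" if n >= min_run else ch * n)
    return "".join(out)

# The BWT output from the slide, written run by run to avoid miscounting.
bwt_output = "aa" + "b" * 4 + "cc" + "a" + "c" * 3 + "rcb" + "a" * 10 + "b" * 5 + "a$"
print(rle(bwt_output))
```

The BWT groups equal characters into long runs, which is what makes the run-length step pay off on this input.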

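36 A simple example. Input: #BANANAS. Excerpted from Wikipedia

37 All rotations: #BANANAS, S#BANANA, AS#BANAN, NAS#BANA, ANAS#BAN, NANAS#BA, ANANAS#B, BANANAS#

38 Sort the rows: #BANANAS, ANANAS#B, ANAS#BAN, AS#BANAN, BANANAS#, NANAS#BA, NAS#BANA, S#BANANA

39 Output (the last column of the sorted rows): #BANANAS, ANANAS#B, ANAS#BAN, AS#BANAN, BANANAS#, NANAS#BA, NAS#BANA, S#BANANA, giving SBNN#AAA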

40 Exercise: you can try the example: input rabcabcababaabacabcabcabcababaa$, expected output aabbbbccacccrcbaaaaaaaaaabbbbba$

41 Now the inverse. Input: S B N N # A A A

42 First add: S, B, N, N, #, A, A, A

43 Then sort: #, A, A, A, B, N, N, S

44 Add again: S#, BA, NA, NA, #B, AN, AN, AS

45 Then sort: #B, AN, AN, AS, BA, NA, NA, S#

46 Then add: S#B, BAN, NAN, NAS, #BA, ANA, ANA, AS#

47 Then sort: #BA, ANA, ANA, AS#, BAN, NAN, NAS, S#B

48 Then add: S#BA, BANA, NANA, NAS#, #BAN, ANAN, ANAS, AS#B

49 Then sort: #BAN, ANAN, ANAS, AS#B, BANA, NANA, NAS#, S#BA

50 Then add: S#BAN, BANAN, NANAS, NAS#B, #BANA, ANANA, ANAS#, AS#BA

51 Then sort: #BANA, ANANA, ANAS#, AS#BA, BANAN, NANAS, NAS#B, S#BAN

52 Then add: S#BANA, BANANA, NANAS#, NAS#BA, #BANAN, ANANAS, ANAS#B, AS#BAN

53 Then sort: #BANAN, ANANAS, ANAS#B, AS#BAN, BANANA, NANAS#, NAS#BA, S#BANA

54 Then add: S#BANAN, BANANAS, NANAS#B, NAS#BAN, #BANANA, ANANAS#, ANAS#BA, AS#BANA

55 Then sort: #BANANA, ANANAS#, ANAS#BA, AS#BANA, BANANAS, NANAS#B, NAS#BAN, S#BANAN

56 Then add: S#BANANA, BANANAS#, NANAS#BA, NAS#BANA, #BANANAS, ANANAS#B, ANAS#BAN, AS#BANAN

57 Then sort (?): #BANANAS, ANANAS#B, ANAS#BAN, AS#BANAN, BANANAS#, NANAS#BA, NAS#BANA, S#BANANA

58 Implementation: Do we need to represent the table in the encoder? No, only a single pointer for each row is needed.

59 BWT(S)
function BWT (string s)
    create a table, rows are all possible rotations of s
    sort rows alphabetically
    return (last column of the table)
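A minimal sketch of the forward transform that keeps one start index per row instead of building the table, in the spirit of the "single pointer for each row" remark. The function name `bwt` is illustrative, and plain character-code ordering is assumed for "alphabetically". The sort key still slices out each rotation for comparison, so this is a simple illustration rather than an efficient suffix-sorting implementation.

```python
def bwt(s):
    """Burrows-Wheeler transform via sorted rotation start indices."""
    n = len(s)
    doubled = s + s                        # rotation i is doubled[i:i+n]
    starts = sorted(range(n), key=lambda i: doubled[i:i + n])
    # The last character of the rotation starting at i is s[i - 1]
    # (for i == 0 this wraps around to the final character).
    return "".join(s[i - 1] for i in starts)

print(bwt("#BANANAS"))
```

For the input #BANANAS this yields the last column of the sorted rotations, SBNN#AAA.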

60 InverseBWT(S)
function inverseBWT (string s)
    create empty table
    repeat length(s) times
        insert s as a column of table before first column of the table // first insert creates first column
        sort rows of the table alphabetically
    return (row that ends with the 'EOF' character)
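The pseudocode above translates almost line for line into the following sketch. The function name and the `eof` parameter are illustrative. The caller has to say which character marks the end of the original string, and that character is assumed to occur exactly once.

```python
def inverse_bwt_naive(last_column, eof):
    """Rebuild the sorted rotation table column by column."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepend the BWT string as a new first column, then sort the rows.
        table = sorted(ch + row for ch, row in zip(last_column, table))
    return next(row for row in table if row.endswith(eof))

print(inverse_bwt_naive("ANNB$AA", "$"))
```

The repeated sorting makes this quadratic or worse in practice; it mirrors the add-and-sort walkthrough rather than an efficient decoder. For the #BANANAS example the original string has no dedicated end marker, but its final character S is unique, so `inverse_bwt_naive("SBNN#AAA", "S")` recovers it.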

61 Move to Front (MTF): Reduces entropy based on local frequency correlation. Usually used for BWT before an entropy-encoding step. Author and detail: original paper at cs9319/papers

62 Example: abaabacad
Symbol  Code  List
a       0     abcde..
b       1     bacde..
a       1     abcde..
a       0     abcde..
b       1     bacde..
a       1     abcde..
c       2     cabde..
a       1     acbde..
d       3     dacbe..
To transform a general file, the list has 256 ASCII symbols.
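The table above can be reproduced with a few lines. This is a sketch with illustrative function names. The initial list is passed in explicitly; `abcde` is used here to match the example, whereas a general file would start from all 256 byte values.

```python
def mtf_encode(text, alphabet):
    """Move-to-front: emit each symbol's current position, then move it to the front."""
    symbols = list(alphabet)
    codes = []
    for ch in text:
        i = symbols.index(ch)
        codes.append(i)
        symbols.insert(0, symbols.pop(i))
    return codes

def mtf_decode(codes, alphabet):
    """Inverse transform: replay the same list updates from the codes."""
    symbols = list(alphabet)
    out = []
    for i in codes:
        ch = symbols.pop(i)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)

print(mtf_encode("abaabacad", "abcde"))
```

Recently seen symbols get small codes, so the output of a BWT (long runs and local clusters) turns into a stream dominated by small numbers.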

63 BWT compressor vs ZIP: ZIP (i.e., LZW based) vs BWT+RLE+MTF+AC. From

64 Other ways to reverse BWT: Consider that L = BWT(S) is composed of the symbols V_0 ... V_{N-1}. The transformed string may be parsed to obtain:
- For each position i, the number of symbols in the substring V_0 ... V_{i-1} that are identical to V_i.
- For each unique symbol V_i in L, the number of symbols in L that are lexicographically less than that symbol.

65 Example
Position  Symbol  # Matching
0         B       0
1         N       0
2         N       1
3         [       0
4         A       0
5         A       1
6         ]       0
7         A       2

Symbol  # LessThan
A       0
B       3
N       4
[       6
]       7

66 ???????] (same two tables as slide 65)

67 ??????A] (same two tables as slide 65)

68 ?????NA] (same two tables as slide 65)

69 ????ANA] (same two tables as slide 65)

70 ???NANA] (same two tables as slide 65)

71 ??ANANA] (same two tables as slide 65)

72 ?BANANA] (same two tables as slide 65)

73 [BANANA] (same two tables as slide 65)

74 [BANANA] (same two tables as slide 65). The "# Matching" column is Occ / Rank; the "# LessThan" table is C[ ].
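The backward walk shown on these slides can be sketched with the two tables computed directly from L. The names `occ` (the "# Matching" column) and `less_than` (the "# LessThan" table, C[ ]) and the function name are illustrative. The end-of-string character is assumed to be known and unique.

```python
def inverse_bwt_lf(last, eof):
    """Invert a BWT using the Occ and C tables instead of re-sorting."""
    # occ[i]: how many symbols in last[:i] are identical to last[i].
    seen, occ = {}, []
    for ch in last:
        occ.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # less_than[c]: how many symbols in last are smaller than c.
    less_than, total = {}, 0
    for ch in sorted(seen):
        less_than[ch] = total
        total += seen[ch]
    # Start at the end marker and walk backwards through the text.
    i = last.index(eof)
    out = []
    for _ in range(len(last)):
        out.append(last[i])
        i = less_than[last[i]] + occ[i]
    return "".join(reversed(out)), occ, less_than

text, occ, less_than = inverse_bwt_lf("BNN[AA]A", "]")
print(text, occ, less_than)
```

Each step costs a table lookup rather than a sort, so the whole inversion is linear in the length of L once the tables are built.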

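75 An illustration. First column: A A A B N N [ ]. Last column: B N N [ A A ] A.

76 A] (First and Last columns as on slide 75)

77 NA] (First and Last columns as on slide 75)

78 ANA] (First and Last columns as on slide 75)

79 NANA] (First and Last columns as on slide 75)

80 ANANA] (First and Last columns as on slide 75)

81 BANANA] (First and Last columns as on slide 75)

82 [BANANA] (First and Last columns as on slide 75)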

83 Dynamic BWT? Instead of reconstructing the BWT, apply local reordering to the original BWT. Details: Salson M, Lecroq T, Léonard M and Mouchard L (2009). "A Four-Stage Algorithm for Updating a Burrows-Wheeler Transform". Theoretical Computer Science 410 (43):
