Data Compression Techniques

Tobias Whitehead
6 years ago

1 Data Compression Techniques. Part 1: Entropy Coding. Lecture 4: Asymmetric Numeral Systems. Juha Kärkkäinen

2 Asymmetric Numeral Systems

Asymmetric numeral systems (ANS) is a recent entropy coding technique that combines the advantages of Huffman and arithmetic coding:
- Like arithmetic coding, ANS gets very close to the entropy.
- Like Huffman coding, ANS can be implemented to run much faster than arithmetic coding.

The basic ANS coding procedure is: source string ↔ integer ↔ code string.

The mapping from integers to code strings is simple: represent the integer in binary using only as many bits as needed, then remove the leading bit, which is always 1. The opposite direction is equally simple: add a 1 to the front of the bit string and interpret the result as an integer. For example, 13 is 1101 in binary, so its code string is 101.

The missing piece is the mapping from source strings to integers.
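
The integer to code string mapping described above can be sketched in a few lines of Python. The function names `int_to_code` and `code_to_int` are illustrative choices, not names used in the lecture.

```python
def int_to_code(x):
    """Binary representation of x >= 1 with the leading 1 removed."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    return bin(x)[3:]  # bin(x) looks like '0b1...'; drop '0b' and the leading 1


def code_to_int(bits):
    """Put a 1 in front of the bit string and read it as a binary number."""
    return int("1" + bits, 2)


print(int_to_code(13))     # 101
print(code_to_int("101"))  # 13
print(repr(int_to_code(1)))  # '' (the integer 1 is the empty code string)
```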

3 The source string to integer mapping is based on the following observations:
- An integer $x$ corresponds to a code string of length $\lfloor \log x \rfloor$ and thus should correspond to a source string of probability about $2^{-\lfloor \log x \rfloor} \approx 1/x$ to get close to the entropy.
- A code string in which each bit has probability 0.5 has the right probability as a string. Thus the code string to integer mapping is the right kind of mapping.

We want to generalize the mapping to handle larger alphabets and arbitrary probabilities.

4 Let us define the code string to integer mapping using arithmetic operations without relying directly on the binary representation of integers.

Let $x$ be an integer encoding some binary string $B$. The encoding function $C(x, s)$ adds the bit $s \in \{0, 1\}$ to the end of the string and returns the new integer:

$C(x, s) = 2x + s$

Starting with $x = 1$, which encodes the empty string, we can construct the encoding for any string by repeated application of the encoding function.

The decoding function $D(x)$ is the reverse mapping that extracts a bit from the end and returns the new integer and the extracted bit:

$D(x) = (\lfloor x/2 \rfloor, x \bmod 2)$

The bits are extracted until $x$ becomes 1.
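
As a minimal sketch, the two functions and their repeated application can be written as follows. `C` and `D` follow the definitions above; the helper names `encode_bits` and `decode_bits` are assumptions made for illustration.

```python
def C(x, s):
    """Append bit s to the string encoded by x."""
    return 2 * x + s


def D(x):
    """Extract the last bit: returns (new integer, extracted bit)."""
    return x // 2, x % 2


def encode_bits(bits):
    x = 1  # x = 1 encodes the empty string
    for s in bits:
        x = C(x, s)
    return x


def decode_bits(x):
    out = []
    while x > 1:  # bits come out last-to-first
        x, s = D(x)
        out.append(s)
    out.reverse()
    return out


print(encode_bits([1, 0, 1]))  # 13
print(decode_bits(13))         # [1, 0, 1]
print(C(3, 1), D(7))           # 7 (3, 1)
```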

5 The encoding and decoding functions can be described using the following ANS coding table. A dash marks an empty cell.

C(x, s)     0  1  2  3  4  5  6  7 ...
s = 0:  x   0  -  1  -  2  -  3  - ...
s = 1:  x   -  0  -  1  -  2  -  3 ...

Using the table, the functions can be computed as follows:
- To compute $C(x, s)$, find $x$ on the row corresponding to $s$ and return the value on the first row in that column.
- To compute $D(x)$, find $x$ on the first row and return the value below it as well as the bit on the row containing that value.

Example: $C(3, 1) = 7$ and $D(7) = (3, 1)$.

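6 By modifying the table, we can represent unequal probabilities. For example, the probabilities $P(0) = 1/4$ and $P(1) = 3/4$ could be represented with the following table.

C(x, s)     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 ...
s = 0:  x   0  -  -  -  1  -  -  -  2  -  -  -  3  -  -  - ...
s = 1:  x   -  0  1  2  -  3  4  5  -  6  7  8  -  9 10 11 ...

Example: Using the above table, the bit string 0111 is encoded by the integer 13:

1 → 4 → 6 → 9 → 13 (appending 0, 1, 1, 1)

and the integer 12 decodes into the bit string 110 (the bits are extracted in reverse order):

12 → 3 → 2 → 1 (extracting 0, 1, 1)

The generalization to larger alphabets can be accomplished by adding more rows, one for each symbol.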

7 Each symbol row in the coding table contains 0, 1, 2, .... Thus the coding table can be described by specifying, for each column, which symbol row contains a value in that column. Let $E : \mathbb{N} \to \Sigma$ be such a mapping. In other words, $E(x)$ is the last symbol encoded by the integer $x$, or equivalently the first symbol extracted during decoding, i.e., $D(x) = (\cdot, E(x))$.

We can also consider $E$ to be the (infinite) string $E(0)E(1)E(2)\ldots$, which we call the ANS coding sequence. For example, the coding table on the preceding slide corresponds to the string $E = (0111)^\omega$, i.e., 0111 repeated indefinitely.

Example: For $\Sigma = \{a, b, c\}$ with the probabilities $P(a) = 0.2$, $P(b) = 0.5$ and $P(c) = 0.3$, we could choose $E = (aabbbbbccc)^\omega$ as the ANS coding sequence and obtain the following coding table.

C(x, s)   0  1  2  3  4  5  6  7  8  9 10 11 12 ...
a         0  1  -  -  -  -  -  -  -  -  2  3  - ...
b         -  -  0  1  2  3  4  -  -  -  -  -  5 ...
c         -  -  -  -  -  -  -  0  1  2  -  -  - ...

8 Define the following operations on the sequence $E$:
- $\text{p-rank}_E(x)$ = the number of occurrences of $E(x)$ in $E[0..x-1]$
- $\text{select}_E(x, s)$ = the position of the $(x+1)$th occurrence of $s$ in $E$

Example:

x                0  1  2  3  4  5  6  7  8  9 10 11 12 ...
E(x)             a  a  b  b  b  b  b  c  c  c  a  a  b ...
p-rank_E(x)      0  1  0  1  2  3  4  0  1  2  2  3  5 ...
select_E(x, b)   2  3  4  5  6 12 13 14 15 16 22 23 24 ...

Then the encoding and decoding functions can be defined for any coding sequence as follows:

$C(x, s) = \text{select}_E(x, s)$
$D(x) = (\text{p-rank}_E(x), E(x))$
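
The following is a minimal sketch of these definitions for a periodic coding sequence, using naive linear scans rather than anything efficient. The factory `make_ans` and the helpers `encode` and `decode` are hypothetical names introduced here; the coding sequence is assumed to be one period repeated forever.

```python
from itertools import count


def make_ans(period):
    """Return (C, D) for the coding sequence E = period repeated forever."""

    def E(x):
        return period[x % len(period)]

    def select(x, s):
        # Position of the (x+1)-th occurrence of s in E (naive scan).
        if s not in period:
            raise ValueError(f"symbol {s!r} does not occur in E")
        seen = 0
        for pos in count():
            if E(pos) == s:
                if seen == x:
                    return pos
                seen += 1

    def p_rank(x):
        # Number of occurrences of E(x) in E[0..x-1].
        return sum(1 for i in range(x) if E(i) == E(x))

    def C(x, s):
        return select(x, s)

    def D(x):
        return p_rank(x), E(x)

    return C, D


def encode(msg, C):
    x = 1
    for s in msg:
        x = C(x, s)
    return x


def decode(x, D):
    out = []
    while x > 1:  # symbols are extracted last-to-first
        x, s = D(x)
        out.append(s)
    return "".join(reversed(out))


C, D = make_ans("0111")       # P(0) = 1/4, P(1) = 3/4
print(encode("0111", C))      # 13
print(decode(12, D))          # 110

C3, D3 = make_ans("aabbbbbccc")
print(C3(5, "b"), D3(12))     # 12 (5, 'b')
```

The scans make each call take time linear in the result, which is fine for checking small tables by hand but not for real coding.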

9 A good coding sequence $E$ for a probability distribution $P$ over an alphabet $\Sigma$ should satisfy, for all $x \in \mathbb{N}$ and $s \in \Sigma$,

$C(x, s) = \text{select}_E(x, s) \approx x / P(s)$

Then a string $S \in \Sigma^*$ is encoded with an integer of value about $1/P(S)$ and the code string has a length of about $\log(1/P(S))$.

There are many possible ANS coding schemes. We will look at the two standard examples, uabs and rans, and then at some implementation techniques leading to a practical variant called tans.
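
To see why this condition leads to the stated code length, here is a short derivation sketch. It assumes that the symbols of $S = s_1 s_2 \ldots s_n$ are independent, so that $P(S) = \prod_i P(s_i)$; the notation $x_i$ for the integer after encoding $i$ symbols is introduced here and is not from the slides.

```latex
\begin{align*}
x_0 &= 1, \qquad x_i = C(x_{i-1}, s_i) \approx \frac{x_{i-1}}{P(s_i)} \\
x_n &\approx \prod_{i=1}^{n} \frac{1}{P(s_i)} = \frac{1}{P(S)} \\
\lfloor \log x_n \rfloor &\approx \log \frac{1}{P(S)} = \sum_{i=1}^{n} \log \frac{1}{P(s_i)}
\end{align*}
```

The last line is the code string length, since the integer $x_n$ maps to a code string of $\lfloor \log x_n \rfloor$ bits.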

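10 uabs (uniform Asymmetric Binary Systems)

Let $\Sigma = \{0, 1\}$ with $P(1) = p$ and $P(0) = 1 - p$. Define the coding sequence by

$E(x) = \lceil (x + 1)p \rceil - \lceil xp \rceil$

Then the encoding function is

$C(x, s) = \lceil (x + 1)/(1 - p) \rceil - 1$ if $s = 0$
$C(x, s) = \lfloor x/p \rfloor$ if $s = 1$

and the decoding function is

$D(x) = (x - \lceil xp \rceil, E(x))$ if $E(x) = 0$
$D(x) = (\lceil xp \rceil, E(x))$ if $E(x) = 1$

There is also a variant using $\lfloor \cdot \rfloor$ instead of $\lceil \cdot \rceil$.

Example ($p = 0.3$): by the definition above, the coding sequence is $E = (1001001000)^\omega$, so that for instance $C(3, 1) = 10$ and $D(10) = (3, 1)$.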
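
A minimal sketch of the uabs formulas, using exact rational arithmetic so that the ceilings and floors are not affected by floating-point rounding. The factory name `make_uabs` and the use of `fractions.Fraction` are choices made for this illustration.

```python
from fractions import Fraction
from math import ceil, floor


def make_uabs(p):
    """Return (E, C, D) for uabs with P(1) = p, where p is a Fraction in (0, 1)."""

    def E(x):
        return ceil((x + 1) * p) - ceil(x * p)

    def C(x, s):
        if s == 0:
            return ceil(Fraction(x + 1) / (1 - p)) - 1
        return floor(x / p)

    def D(x):
        s = E(x)
        rank_of_ones = ceil(x * p)  # number of 1s in E[0..x-1]
        return (rank_of_ones if s == 1 else x - rank_of_ones), s

    return E, C, D


E, C, D = make_uabs(Fraction(3, 10))
print([E(x) for x in range(10)])  # [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
print(C(3, 1), D(10))             # 10 (3, 1)
print(C(6, 0), D(9))              # 9 (6, 0)
```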

11 rans (range ANS)

Let $\Sigma = \{s_1, s_2, \ldots, s_\sigma\}$. Choose an integer $L$ and integers $f_i$, $i \in [1..\sigma]$, so that $f_i / L \approx P(s_i)$ and $\sum_{i \in [1..\sigma]} f_i = L$. In other words, we approximate the probabilities with rational numbers $f_i / L$. Then set

$E = (s_1^{f_1} s_2^{f_2} \cdots s_\sigma^{f_\sigma})^\omega$

See slide 7 for an example.

Setting $F_i = \sum_{j \in [1..i-1]} f_j$, the encoding function is

$C(x, s_i) = \lfloor x / f_i \rfloor L + (x \bmod f_i) + F_i$

The decoding function is

$D(x) = (\lfloor x / L \rfloor f_i + (x \bmod L) - F_i, s_i)$ for $i$ such that $s_i = E(x)$

In practice, $L$ should be a power of two to speed up decoding.
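
The rans formulas can be sketched as follows, checked against the slide 7 example with frequencies 2, 5, 3 and L = 10. The factory `make_rans` and its dictionary-based interface are assumptions made for this illustration, and all frequencies are assumed to be positive.

```python
from bisect import bisect_right
from itertools import accumulate


def make_rans(freqs):
    """Return (C, D) for rans; freqs maps each symbol to a positive integer f_i."""
    symbols = list(freqs)
    f = [freqs[s] for s in symbols]
    L = sum(f)
    F = [0] + list(accumulate(f))  # F[i] = f[0] + ... + f[i-1]
    index = {s: i for i, s in enumerate(symbols)}

    def C(x, s):
        i = index[s]
        return (x // f[i]) * L + x % f[i] + F[i]

    def D(x):
        r = x % L
        i = bisect_right(F, r) - 1  # symbol whose range [F[i], F[i+1]) contains r
        return (x // L) * f[i] + r - F[i], symbols[i]

    return C, D


C, D = make_rans({"a": 2, "b": 5, "c": 3})
print(C(5, "b"), D(12))  # 12 (5, 'b')
print(C(3, "a"), D(11))  # 11 (3, 'a')
```

With L a power of two, the division and remainder by L in `D` reduce to a shift and a mask, which is the speedup mentioned above.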

12 Renormalization

To avoid dealing with very large integers, we occasionally need to renormalize, similarly to arithmetic coding. This is done with the decoding function.
- During encoding, we use the standard code string decoding function $D(x) = (\lfloor x/2 \rfloor, x \bmod 2)$ and output the resulting bit.
- During decoding, we use the source string decoding function and output the resulting source symbols.

With arithmetic coding, renormalization does not change the eventual output. With ANS, however, the decoding function extracts bits in reverse order, so the extracted bits/symbols come from the middle of the sequence, not from the beginning as in arithmetic coding.

This is not a problem, though, as long as the encoder-decoder symmetry is maintained:
- Whenever the encoder reads source symbols and encodes them into the integer, the decoder extracts source symbols from the integer and writes them to the output.
- Whenever the encoder renormalizes and outputs code bits, the decoder reads code bits and encodes them into the integer.

13 Let $R$ be a suitable integer:
- For uabs, $p = P(1)$ should be a multiple of $1/R$.
- For rans, $R$ should be a multiple of $L$.

The decoding procedure under renormalization is:
- If $x \ge R$, extract and output a source symbol.
- If $x < R$, read a code bit and encode it to the integer.

Let $C_s$ be the source encoding function and $D_c$ the code decoding function. The encoding procedure under renormalization is as follows when $s$ is the next source symbol to be encoded (the comparisons use the integer component of $D_c$):
- If $D_c(C_s(x, s)) \ge R$, extract and output a code bit (renormalize).
- If $D_c(C_s(x, s)) < R$, encode $s$ to the integer.

The symmetry between encoder and decoder is maintained because:
- The encoding rules ensure that $x$ never grows too big. Thus, if the encoder renormalizes computing $y = D_c(x)$, then $y < R$, so that the decoder does the inverse given the integer $y$.
- The conditions on $R$ ensure that, if the encoder computes $y = C_s(x, s)$, then $y \ge R$, so that the decoder does the inverse given the integer $y$.

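14 Example: For rans with $L = 8$ let $E = (abbbbbcc)^\omega$, i.e., $P(a) \approx 1/8$, $P(b) \approx 5/8$ and $P(c) \approx 2/8$. The coding table is

C(x, s)   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 ...
a         0  -  -  -  -  -  -  -  1  -  -  -  -  -  -  - ...
b         -  0  1  2  3  4  -  -  -  5  6  7  8  9  -  - ...
c         -  -  -  -  -  -  0  1  -  -  -  -  -  -  2  3 ...

We use renormalization with $R = 8$. The string cab is encoded as follows. The encoding starts with $x = 2$, allowing immediate encoding of the first symbol, and ends with extracting code bits until $x$ is 1. Each arrow is labeled with the source symbol read (in) or the code bit written (out).

2 --in c--> 14 --out 0--> 7 --out 1--> 3 --out 1--> 1 --in a--> 8 --in b--> 12 --out 0--> 6 --out 0--> 3 --out 1--> 1

The code string is 011001. The decoder starts with $x = 1$, works in reverse order, and ends after outputting $n = 3$ symbols.

1 --in 1--> 3 --in 0--> 6 --in 0--> 12 --out b--> 8 --out a--> 1 --in 1--> 3 --in 1--> 7 --in 0--> 14 --out c--> 2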
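
The renormalized rans procedure can be sketched as below and checked against this example. This is a sketch under stated assumptions, not the lecture's implementation: the function names are hypothetical, and the initial value x = (R / L) * f for the first symbol is an assumption chosen so that the first symbol can be encoded immediately (it gives x = 2 for the string cab).

```python
def make_rans(freqs):
    """rans encoding/decoding functions; freqs maps symbols to positive f_i."""
    symbols = list(freqs)
    L = sum(freqs.values())
    start, acc = {}, 0
    for s in symbols:  # start[s] = F_i, the sum of the earlier frequencies
        start[s] = acc
        acc += freqs[s]

    def C(x, s):
        return (x // freqs[s]) * L + x % freqs[s] + start[s]

    def D(x):
        r = x % L
        s = next(t for t in reversed(symbols) if start[t] <= r)
        return (x // L) * freqs[s] + r - start[s], s

    return C, D, L


def encode(msg, freqs, R):
    C, _, L = make_rans(freqs)
    if R % L != 0:
        raise ValueError("R should be a multiple of L")
    x = (R // L) * freqs[msg[0]]  # assumed start value, see lead-in
    bits = []
    for s in msg:
        while C(x, s) >= 2 * R:   # same test as D_c(C_s(x, s)) >= R
            bits.append(str(x % 2))
            x //= 2
        x = C(x, s)
    while x > 1:                  # flush the final state
        bits.append(str(x % 2))
        x //= 2
    return "".join(bits)


def decode(code, n, freqs, R):
    _, D, _ = make_rans(freqs)
    bits = list(code)             # consumed from the end
    x, out = 1, []
    while len(out) < n:
        if x >= R:
            x, s = D(x)
            out.append(s)
        else:
            x = 2 * x + int(bits.pop())
    return "".join(reversed(out))


freqs = {"a": 1, "b": 5, "c": 2}
code = encode("cab", freqs, R=8)
print(code)                        # 011001
print(decode(code, 3, freqs, R=8)) # cab
```

The decoder needs the number of symbols n, because the integer left over at the end (2 in this example) carries no stop marker.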

15 tans (table ANS)

rans with renormalization has the property that the integer $x$ is always in the range $[R..2R-1]$
- right after encoding a source symbol during encoding, and
- right before decoding a source symbol during decoding.

The values of $x$ in this range can be thought of as states of the computation. The state transitions are then:
- Encoding: zero or more renormalization steps followed by a source symbol encoding.
- Decoding: a source symbol extraction followed by zero or more code bit encodings.

The idea of tans is to store all possible transitions in a table, and then do encoding and decoding with table lookups.

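16 Example: Let $E = (abbbbbcc)^\omega$ and $L = R = 8$ (see slide 14). The tans encoding table is given below. Each entry lists the output bits followed by the next state; (none) means no bits are output.

State   a           b             c
8       000 → 8     (none) → 12   00 → 14
9       100 → 8     (none) → 13   10 → 14
10      010 → 8     0 → 9         01 → 14
11      110 → 8     1 → 9         11 → 14
12      001 → 8     0 → 10        00 → 15
13      101 → 8     1 → 10        10 → 15
14      011 → 8     0 → 11        01 → 15
15      111 → 8     1 → 11        11 → 15

The string cab is encoded as follows. The first symbol is read just to set the initial state without output.

c sets state 14 --a, out 011--> 8 --b, no output--> 12

The last bits of output identify the final state 12. The full output is 011001, the same as on slide 14.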

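17 Example (continued)

Decoding table:

State   Bits read    Symbol   Next state
8       b1 b2 b3     a        8 + 4b1 + 2b2 + b3
9       b1           b        10 + b1
10      b1           b        12 + b1
11      b1           b        14 + b1
12      (none)       b        8
13      (none)       b        9
14      b1 b2        c        8 + 2b1 + b2
15      b1 b2        c        12 + 2b1 + b2

Decoding starts with reading $\log R = 3$ bits to establish the initial state as $8 + 4b_1 + 2b_2 + b_3$. Then decoding proceeds according to the table. The last transition reads no bits.

For the code string 011001, read from the end: the bits 1, 0, 0 give the initial state 12, and then

12 --out b--> 8 --out a, read 1 1 0--> 14 --out c--> end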
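
As a sketch of how such tables could be produced, the code below tabulates the renormalized rans transitions for every state in [R, 2R) and then codes with lookups only. All function names are hypothetical, R is assumed to be a multiple of L, and the initial state R + F_i for the first symbol is an assumption that reproduces the state 14 used for cab.

```python
def build_tans_tables(freqs, R):
    """Return (enc, dec, initial) tables for states in [R, 2R)."""
    symbols = list(freqs)
    L = sum(freqs.values())
    start, acc = {}, 0
    for s in symbols:  # start[s] = F_i
        start[s] = acc
        acc += freqs[s]

    def C(x, s):
        return (x // freqs[s]) * L + x % freqs[s] + start[s]

    def D(x):
        r = x % L
        s = next(t for t in reversed(symbols) if start[t] <= r)
        return (x // L) * freqs[s] + r - start[s], s

    enc, dec = {}, {}
    for x in range(R, 2 * R):
        for s in symbols:
            y, bits = x, ""
            while C(y, s) >= 2 * R:  # renormalize, low bit first
                bits += str(y % 2)
                y //= 2
            enc[x, s] = (bits, C(y, s))
        y, s = D(x)
        k = 0
        while (y << k) < R:          # bits needed to get back into [R, 2R)
            k += 1
        dec[x] = (s, k, y << k)      # next state = base + value of k bits
    initial = {s: R + start[s] for s in symbols}
    return enc, dec, initial


def tans_encode(msg, enc, initial):
    x = initial[msg[0]]              # first symbol only sets the state
    out = ""
    for s in msg[1:]:
        bits, x = enc[x, s]
        out += bits
    while x > 1:                     # final bits identify the last state
        out += str(x % 2)
        x //= 2
    return out


def tans_decode(code, n, dec, R):
    bits = list(code)                # consumed from the end
    x = 1
    while x < R:
        x = 2 * x + int(bits.pop())
    out = []
    for j in range(n):
        s, k, x = dec[x]
        out.append(s)
        if j < n - 1:                # the last transition reads no bits
            v = 0
            for _ in range(k):
                v = 2 * v + int(bits.pop())
            x += v
    return "".join(reversed(out))


enc, dec, initial = build_tans_tables({"a": 1, "b": 5, "c": 2}, R=8)
print(enc[14, "a"], dec[15])         # ('011', 8) ('c', 2, 12)
code = tans_encode("cab", enc, initial)
print(code, tans_decode(code, 3, dec, 8))  # 011001 cab
```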

18 Large Code Alphabet

ANS can also be modified for a larger code alphabet $\Gamma = [0..\gamma-1]$. The coding function is then

$C(x, s) = \gamma x + s$

and the decoding function is

$D(x) = (\lfloor x/\gamma \rfloor, x \bmod \gamma)$

With tans, the set of states becomes $[R..\gamma R - 1]$.

A larger code alphabet can speed up coding by avoiding dealing with individual bits. In particular, $\gamma = 2^8$ allows byte-level processing.
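
A minimal sketch of the code string mapping with a byte-sized code alphabet. It only illustrates the two functions above; the helper names `pack` and `unpack` are assumptions, and no source coding or renormalization is involved.

```python
GAMMA = 2 ** 8  # code alphabet size: each code symbol is one byte


def C(x, s):
    return GAMMA * x + s


def D(x):
    return divmod(x, GAMMA)  # (x // GAMMA, x % GAMMA)


def pack(data):
    """Encode a byte string into one integer, starting from x = 1."""
    x = 1
    for s in data:
        x = C(x, s)
    return x


def unpack(x):
    """Extract bytes until x becomes 1; they come out last-to-first."""
    out = bytearray()
    while x > 1:
        x, s = D(x)
        out.append(s)
    out.reverse()
    return bytes(out)


x = pack(b"ANS")
print(x == int.from_bytes(b"\x01ANS", "big"))  # True
print(unpack(x))                               # b'ANS'
```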

19 Summary: Entropy Coding

Huffman coding, arithmetic coding and ANS coding address the same problem: encoding data using as few bits as possible given a probability distribution over the data. Such methods are often called entropy coding. ANS, offering fast coding and close-to-entropy compression, is the state of the art in entropy coding.

Some codes, such as the Elias γ code, are not explicitly constructed from a probability distribution. However, since the codeword lengths can be interpreted as probabilities, there is an underlying implicit probability distribution (see also Exercise 2.3).

Entropy coding assumes that the probability distribution of each symbol is independent of the other symbols. When this is not the case, such as with natural language text, other techniques are needed. However, even then entropy coding is usually applied in the last stage of compression to produce the actual bits.

Data Compression Techniques

Data Compression Techniques Part 2: Text Compression Lecture 5: Context-Based Compression Juha Kärkkäinen 14.11.2017 1 / 19 Text Compression We will now look at techniques for text compression. These techniques