SIGNAL COMPRESSION Lecture Shannon-Fano-Elias Codes and Arithmetic Coding


1. SIGNAL COMPRESSION Lecture: Shannon-Fano-Elias Codes and Arithmetic Coding

2. Shannon-Fano-Elias Coding

We discuss how to encode the symbols $\{a_1, a_2, \ldots, a_m\}$, knowing their probabilities, by using as code a (truncated) binary representation of the cumulative distribution function.

Consider the random variable $X$ taking as values the $m$ letters of the alphabet $\{a_1, a_2, \ldots, a_m\}$, with probability mass function $p(X = a_i) = p(a_i) > 0$ for the letter $a_i$. The (cumulative) distribution function is
$$F(x) = \mathrm{Prob}(X \le x) = \sum_{a_k \le x} p(a_k),$$
where we assume the lexicographic ordering relation $a_i < a_j$ if $i < j$. Note that if one changes the ordering, the cumulative distribution function will be different.

$y = F(x)$ is a staircase function, with jumps at $x = a_k$ (see the plot on the next page). Even though there is no inverse function $x = F^{-1}(y)$, we may define a partial inverse as follows: if all $p(a_i) > 0$, an arbitrary value $y \in [0, 1)$ uniquely determines a symbol $a_k$, namely the symbol that obeys $F(a_{k-1}) \le y < F(a_k)$. We may use the plot of $F(x)$ to identify the value $a_k$ for which $F(a_{k-1}) \le y < F(a_k)$. Note that $F(a_k) = F(a_{k-1}) + p(a_k)$, which is a fast way to compute $F(a_k)$.
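As an illustration, here is a minimal Python sketch of the cumulative sums and of the partial inverse described above; the helper names `cdf` and `find_symbol` and the example distribution are only illustrative assumptions, not part of the lecture.

```python
from bisect import bisect_right

def cdf(probs):
    """Cumulative sums F(a_1), ..., F(a_m), using F(a_k) = F(a_{k-1}) + p(a_k)."""
    F, total = [], 0.0
    for p in probs:
        total += p
        F.append(total)
    return F

def find_symbol(F, y):
    """Partial inverse: 0-based index of the symbol whose step [F(a_{k-1}), F(a_k)) contains y."""
    return bisect_right(F, y)

probs = [0.25, 0.5, 0.125, 0.125]   # assumed example distribution
F = cdf(probs)                      # [0.25, 0.75, 0.875, 1.0]
print(find_symbol(F, 0.3))          # 1: the second symbol, since 0.25 <= 0.3 < 0.75
```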

3. To avoid dealing with interval boundaries, define
$$\bar{F}(a_i) = \sum_{k=1}^{i-1} p(a_k) + \frac{1}{2}\, p(a_i).$$
The values $\bar{F}(a_i)$ are the midpoints of the steps in the distribution plot. If $\bar{F}(a_i)$ is given, one can find $a_i$. The same is true if one gives an approximation of $\bar{F}(a_i)$, as long as it does not go outside the interval $F(a_{i-1}) \le y < F(a_i)$. Therefore the number $\bar{F}(a_i)$, or an approximation of it, can be used as a code for $a_i$.

Since the real number $\bar{F}(a_i)$ may happen to have an infinite binary representation, we sometimes have to look for numbers close to it, but having shorter binary representations. From Shannon codes we know that a good code for $a_i$ needs about $\log \frac{1}{p(a_i)}$ bits, therefore $\bar{F}(x)$ needs to be represented in about $\log \frac{1}{p(x)}$ bits.
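A quick numerical check of the claim that an approximation of $\bar F(a_i)$ still identifies $a_i$: the sketch below truncates $\bar F(a_i)$ to $l(a_i) = \lceil \log_2 \frac{1}{p(a_i)} \rceil + 1$ fractional bits, as introduced on the next slides, and verifies that the result stays inside the step $[F(a_{i-1}), F(a_i))$. The example distribution is an assumption.

```python
import math

probs = [0.25, 0.5, 0.125, 0.125]            # assumed example distribution

F, total = [0.0], 0.0                        # F[0] = F(a_0) = 0
for p in probs:
    total += p
    F.append(total)                          # F[i] = F(a_i)

for i, p in enumerate(probs, start=1):
    Fbar = F[i - 1] + p / 2                  # midpoint of the i-th step
    l = math.ceil(math.log2(1 / p)) + 1      # l(a_i) = ceil(log2 1/p) + 1
    truncated = math.floor(Fbar * 2**l) / 2**l   # keep only l fractional bits
    assert F[i - 1] <= truncated < F[i]      # truncation stays inside the step
    print(i, l, truncated)
```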

4. [Figure] Probability mass function and cumulative distribution: the probability masses $p(a_i)$ and the staircase plot of $F(x)$.

5. Probability mass function and cumulative distribution for strings

To extend the previous reasoning from symbols to strings of symbols $x$, we have to:

- compute for each string of $n$ symbols the probability mass $p(x)$ (such that $\sum_x p(x) = 1$),
- define a lexicographic ordering for any two strings $x$ and $y$ (each with $n$ symbols), denoted by the ordering symbol $x < y$,
- define the cumulative probability $\bar{F}(y) = \sum_{x < y} p(x) + \frac{1}{2}\, p(y)$.

The code for $x$ is obtained as follows: we truncate (floor operation) $\bar{F}(x)$ to $l(x)$ bits to obtain $\lfloor \bar{F}(x) \rfloor_{l(x)}$, where $l(x) = \lceil \log \frac{1}{p(x)} \rceil + 1$.

Important notation distinction: $\lfloor \bar{F}(x) \rfloor_k$ is the binary representation of the sub-unitary number $\bar{F}(x)$, using $k$ bits for the fractional part; $\lceil \log \frac{1}{p(x)} \rceil$ denotes, as usual, the smallest integer not smaller than $\log \frac{1}{p(x)}$.

The codeword used for encoding the string $x$ is $\lfloor \bar{F}(x) \rfloor_{l(x)}$.
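A minimal sketch of the codeword construction for strings, assuming for illustration a binary i.i.d. source with $p(1) = 0.3$ and block length $n = 3$; the names and parameter values are assumptions, not part of the lecture.

```python
import math
from itertools import product

def sfe_codeword(p_of, strings, x):
    """Shannon-Fano-Elias codeword for string x: bar-F(x) truncated to l(x) bits."""
    Fbar = sum(p_of(y) for y in strings if y < x) + p_of(x) / 2
    l = math.ceil(math.log2(1 / p_of(x))) + 1
    bits = math.floor(Fbar * 2**l)           # integer whose binary digits form the codeword
    return format(bits, f"0{l}b")

# Binary strings of length 3 from an i.i.d. source with p(1) = 0.3 (assumed)
theta = 0.3
p_of = lambda s: (theta ** s.count("1")) * ((1 - theta) ** s.count("0"))
strings = ["".join(bits) for bits in product("01", repeat=3)]   # lexicographic order

for x in strings:
    print(x, sfe_codeword(p_of, strings, x))
```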

6. Property 1. The code is well defined, i.e., from $\lfloor \bar F(x) \rfloor_{l(x)}$ we uniquely identify $x$.

Proof: By the definition of truncation to $l(x)$ fractional bits,
$$\bar F(x) - \lfloor \bar F(x) \rfloor_{l(x)} < \frac{1}{2^{l(x)}}, \qquad \text{equivalently} \qquad \bar F(x) - \frac{1}{2^{l(x)}} < \lfloor \bar F(x) \rfloor_{l(x)} \le \bar F(x).$$
Now we use the fact that $l(x) = \lceil \log \frac{1}{p(x)} \rceil + 1$, so that
$$2^{l(x)} = 2 \cdot 2^{\lceil \log \frac{1}{p(x)} \rceil} \ge 2 \cdot 2^{\log \frac{1}{p(x)}} = \frac{2}{p(x)},
\qquad \text{hence} \qquad
\frac{1}{2^{l(x)}} \le \frac{p(x)}{2} = \bar F(x) - F(x-1),$$
where $x-1$ denotes the string preceding $x$ in the ordering. Therefore
$$F(x-1) \le \bar F(x) - \frac{1}{2^{l(x)}} < \lfloor \bar F(x) \rfloor_{l(x)} \le \bar F(x) < F(x).$$
Finally, the uniqueness of $x$ given $\lfloor \bar F(x) \rfloor_{l(x)}$ follows from $F(x-1) < \lfloor \bar F(x) \rfloor_{l(x)} \le \bar F(x) < F(x)$: looking at the plot of the cumulative function $z = F(x)$, the value $z = \lfloor \bar F(x) \rfloor_{l(x)}$ falls on the step at $x$, between the base of the step and its middle.

7. Property 2. The code is prefix free.

Associate to each codeword $z_1 z_2 \ldots z_l$ the closed interval $\left[0.z_1 z_2 \ldots z_l;\ 0.z_1 z_2 \ldots z_l + \frac{1}{2^l}\right]$. Any number outside this closed interval differs from $0.z_1 z_2 \ldots z_l$ in at least one of the first $l$ fractional bits, and therefore $z_1 z_2 \ldots z_l$ is not a prefix of the binary expansion of any number outside the interval. Extending the reasoning to all codewords, the code is prefix free if and only if all intervals corresponding to codewords are disjoint. The interval corresponding to the codeword of $x$ has length $2^{-l(x)}$.

A prefix of the codeword is e.g. $z_1 z_2 \ldots z_{l-1}$. Can that prefix be a codeword itself? If $z_1 z_2 \ldots z_{l-1}$ is a codeword, then it represents the interval $\left[0.z_1 z_2 \ldots z_{l-1};\ 0.z_1 z_2 \ldots z_{l-1} + \frac{1}{2^{l-1}}\right]$. But the interval $\left[0.z_1 z_2 \ldots z_l;\ 0.z_1 z_2 \ldots z_l + \frac{1}{2^l}\right]$ is necessarily contained in $\left[0.z_1 z_2 \ldots z_{l-1};\ 0.z_1 z_2 \ldots z_{l-1} + \frac{1}{2^{l-1}}\right]$, therefore the two intervals would overlap.

We already have $F(x-1) < \lfloor \bar F(x) \rfloor_{l(x)}$, and similarly
$$\frac{1}{2^{l(x)}} \le \frac{p(x)}{2} = F(x) - \bar F(x)
\quad\Longrightarrow\quad
F(x) \ge \bar F(x) + \frac{1}{2^{l(x)}} \ge \lfloor \bar F(x) \rfloor_{l(x)} + \frac{1}{2^{l(x)}},$$
and therefore the interval $\left[\lfloor \bar F(x) \rfloor_{l(x)},\ \lfloor \bar F(x) \rfloor_{l(x)} + \frac{1}{2^{l(x)}}\right]$ is totally included in the interval $[F(x-1), F(x)]$. Since the intervals $[F(x-1), F(x)]$ of different strings are disjoint, the codeword intervals are disjoint as well, so an overlap of intervals is impossible. Consequently the Shannon-Fano-Elias code is prefix free.
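The prefix-free property can also be checked numerically. A small sketch that generates the Shannon-Fano-Elias codewords for an assumed distribution and verifies that no codeword is a prefix of another:

```python
import math
from itertools import combinations

def sfe_code(probs):
    """Shannon-Fano-Elias codewords for a distribution given in a fixed order."""
    codewords, F = [], 0.0
    for p in probs:
        Fbar = F + p / 2
        l = math.ceil(math.log2(1 / p)) + 1
        codewords.append(format(math.floor(Fbar * 2**l), f"0{l}b"))
        F += p
    return codewords

probs = [0.25, 0.25, 0.2, 0.15, 0.15]        # assumed example distribution
code = sfe_code(probs)
print(code)
# No codeword may be a prefix of another one:
assert not any(a.startswith(b) or b.startswith(a) for a, b in combinations(code, 2))
```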

8. Average length of Shannon-Fano-Elias Codes

We use $l(x) = \lceil \log \frac{1}{p(x)} \rceil + 1$ bits to represent $x$. The expected codelength is
$$L = \sum_x p(x)\, l(x) = \sum_x p(x) \left( \left\lceil \log \frac{1}{p(x)} \right\rceil + 1 \right) < H + 2 \qquad (1)$$
where the entropy is $H = \sum_x p(x) \log \frac{1}{p(x)}$.

Example 1. All probabilities are integer powers of $\frac{1}{2}$. The table lists, for each $x$: $p(x)$, $F(x)$, $\bar F(x)$, $\bar F(x)$ in binary, $l(x) = \lceil \log \frac{1}{p(x)} \rceil + 1$, and the codeword (a consistent reconstruction is sketched after slide 9).

9. Average length of Shannon-Fano-Elias Codes

In Example 1 the average codelength is 2.75 bits, while the entropy is 1.75 bits. Since all probabilities are powers of two, the Huffman code attains the entropy. One can remove the last bit in the last two codewords of the Shannon-Fano-Elias code in Example 1!

Example 2. The probabilities are not integer powers of $\frac{1}{2}$. The table again lists, for each $x$: $p(x)$, $F(x)$, $\bar F(x)$, $\bar F(x)$ in binary, $l(x) = \lceil \log \frac{1}{p(x)} \rceil + 1$, and the codeword. The Huffman code is on average 1.2 bits shorter than the Shannon-Fano-Elias code in Example 2.
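The numerical entries of the two example tables can be regenerated under an assumption about the distributions. The sketch below assumes $(0.25, 0.5, 0.125, 0.125)$ for Example 1 and $(0.25, 0.25, 0.2, 0.15, 0.15)$ for Example 2 (classical textbook choices consistent with the figures quoted above) and prints the full tables:

```python
import math

def sfe_table(probs):
    """Shannon-Fano-Elias code table: (p, F, Fbar, l, codeword) per symbol."""
    rows, F = [], 0.0
    for p in probs:
        Fbar = F + p / 2
        l = math.ceil(math.log2(1 / p)) + 1
        cw = format(math.floor(Fbar * 2**l), f"0{l}b")
        rows.append((p, F + p, Fbar, l, cw))
        F += p
    return rows

for name, probs in [("Example 1", [0.25, 0.5, 0.125, 0.125]),
                    ("Example 2", [0.25, 0.25, 0.2, 0.15, 0.15])]:
    print(name)
    rows = sfe_table(probs)
    for p, F, Fbar, l, cw in rows:
        print(f"  p={p:.3f}  F={F:.3f}  Fbar={Fbar:.4f}  l={l}  codeword={cw}")
    L = sum(p * l for p, _, _, l, _ in rows)             # average SFE codelength
    H = -sum(p * math.log2(p) for p in probs)            # entropy
    print(f"  average length = {L:.2f} bits, entropy = {H:.2f} bits")
# Example 1 gives an average length of 2.75 bits and entropy 1.75 bits, as quoted above.
```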

10. Motivation for using arithmetic codes

Huffman codes are optimal codes for a given probability distribution of the source. However, their average length is longer than the entropy, within a 1-bit distance. To reach an average codelength closer to the entropy, Huffman coding is applied to blocks of symbols instead of individual symbols. The size of the Huffman table needed to store the code increases exponentially with the length of the block.

If during encoding we improve our knowledge of the symbol probabilities, we have either to redesign the Huffman table, or to use an adaptive variant of Huffman (but everybody agrees adaptive Huffman is neither elegant nor computationally attractive).

When encoding binary images, the probability of one symbol may be extremely small, therefore the entropy is close to zero. Without blocking the symbols, the Huffman average length is 1 bit! Long blocks are strictly necessary in this application.

Whenever somebody needs to encode long blocks of symbols, or wants to change the code to make it optimal for a new distribution, the solution is arithmetic coding. Its principle is similar to Shannon-Fano-Elias coding, i.e. handling the cumulative distribution to find codes. However, arithmetic coding is better engineered, allowing very efficient implementations (in speed and compression ratio) and an easy adaptation mechanism.

11. Principle of arithmetic codes

Essential idea: efficiently calculate the probability mass function $p(x^n)$ and the cumulative distribution function $F(x^n)$ for the source sequence $x^n = x_1 x_2 \ldots x_n$. Then, similarly to Shannon-Fano-Elias codes, use a number in the interval $[F(x^n) - p(x^n);\ F(x^n)]$ as the code for $x^n$.

A sketch: expressing $F(x^n)$ with an accuracy of $\lceil \log \frac{1}{p(x^n)} \rceil$ bits will give a code for the source, so the codewords for different sequences are different. But there is no guarantee that the codewords are prefix free. As in Shannon-Fano-Elias codes, we may use $\lceil \log \frac{1}{p(x^n)} \rceil + 1$ bits to round $F(x^n)$, in which case the prefix condition is satisfied.

A simplified variant: consider a binary source alphabet, and assume we have a fixed block length $n$ that is known to both the encoder and the decoder. We assume we have a simple procedure to calculate $p(x_1 x_2 \ldots x_n)$ for any string $x_1 x_2 \ldots x_n$. We will use the natural lexicographic order on strings: a string $x$ is greater than a string $y$ if $x_i = 1$, $y_i = 0$ for the first $i$ such that $x_i \ne y_i$. Equivalently, $x > y$ if $\sum_i x_i 2^{-i} > \sum_i y_i 2^{-i}$, i.e. the binary numbers satisfy $0.x > 0.y$. The strings can be arranged as leaves in a tree of depth $n$ (a parsing tree, not a coding tree!). In the tree, the order $x > y$ of two strings means that $x$ is to the right of $y$.
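A tiny check of the claimed equivalence between the lexicographic order on binary strings and the numerical order of $0.x$, for an assumed block length $n = 6$:

```python
from itertools import product

def as_fraction(s):
    """Interpret the binary string s as the number 0.s = sum_i s_i 2^{-i}."""
    return sum(int(b) * 2.0**-(i + 1) for i, b in enumerate(s))

n = 6
strings = ["".join(bits) for bits in product("01", repeat=n)]
for x in strings:
    for y in strings:
        # lexicographic comparison of equal-length strings == comparison of 0.x and 0.y
        assert (x > y) == (as_fraction(x) > as_fraction(y))
print("lexicographic order matches the order of 0.x for all strings of length", n)
```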

12. We need to compute the cumulative distribution $F(x^n)$ for a string $x^n$, i.e. to add all $p(y^n)$ for which $y^n < x^n$. However, there is a much smarter way to perform the sum, described next.

Let $T_{x_1 x_2 \ldots x_{k-1} 0}$ be the subtree whose leaves are the strings starting with $x_1 x_2 \ldots x_{k-1} 0$. The probability of the subtree is
$$P(T_{x_1 x_2 \ldots x_{k-1} 0}) = \sum_{z_{k+1} \ldots z_n} p(x_1 x_2 \ldots x_{k-1} 0\, z_{k+1} \ldots z_n) = p(x_1 x_2 \ldots x_{k-1} 0).$$
The cumulative probability can therefore be computed as
$$F(x^n) = \sum_{y^n < x^n} p(y^n) = \sum_{T:\ T \text{ is to the left of } x^n} P(T) = \sum_{k:\ x_k = 1} p(x_1 x_2 \ldots x_{k-1} 0). \qquad (2)$$

Example (see Fig. 2 in the file Lectures4Figures.pdf): for a Bernoulli source with $\theta = p(1)$ we have
$$F(01110) = P(T_1) + P(T_2) + P(T_3) = p(00) + p(010) + p(0110) = (1-\theta)^2 + \theta(1-\theta)^2 + \theta^2(1-\theta)^2.$$

To encode the next bit of the source sequence, we need to calculate $p(x^i x_{i+1})$ and update $F(x^i x_{i+1})$. To decode the sequence, we use the same procedure to calculate $p(x^i x_{i+1})$ and update $F(x^i x_{i+1})$ for the various candidate values of $x_{i+1}$, and check when the cumulative distribution exceeds the value corresponding to the codeword.
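A minimal sketch that compares formula (2) with the brute-force sum over all smaller strings, for a Bernoulli source with an assumed value of $\theta$, and reproduces the $F(01110)$ expression:

```python
from itertools import product

theta = 0.3                                   # Bernoulli parameter p(1), assumed value
p = lambda s: theta ** s.count("1") * (1 - theta) ** s.count("0")

def F_bruteforce(x):
    """Sum of p(y) over all strings y of the same length with y < x."""
    return sum(p("".join(y)) for y in product("01", repeat=len(x)) if "".join(y) < x)

def F_subtrees(x):
    """Formula (2): sum over positions k with x_k = 1 of p(x_1 ... x_{k-1} 0)."""
    return sum(p(x[:k] + "0") for k in range(len(x)) if x[k] == "1")

x = "01110"
print(F_bruteforce(x), F_subtrees(x))
# Both equal p(00) + p(010) + p(0110) = (1-theta)^2 + theta*(1-theta)^2 + theta^2*(1-theta)^2:
print((1 - theta)**2 * (1 + theta + theta**2))
```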

13. The most used mechanisms for computing the probabilities are i.i.d. sources and Markov sources.

For i.i.d. sources: $p(x^n) = \prod_{i=1}^{n} p(x_i)$.

For Markov sources of first order: $p(x^n) = p(x_1) \prod_{i=2}^{n} p(x_i \mid x_{i-1})$.

Encoding is efficient if the distribution used by the arithmetic coder is close to the true distribution. The adaptation of the probability distribution will be discussed in a separate lecture. The implementation issues are related to computational accuracy, buffer sizes, and speed.
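A small sketch of the sequential probability computations for the two source models; the numerical parameter values are assumptions chosen only for illustration.

```python
def prob_iid(x, p):
    """p(x^n) = prod_i p(x_i) for an i.i.d. source, p = {symbol: probability}."""
    result = 1.0
    for s in x:
        result *= p[s]
    return result

def prob_markov1(x, p1, trans):
    """p(x^n) = p(x_1) * prod_{i>=2} p(x_i | x_{i-1}) for a first-order Markov source."""
    result = p1[x[0]]
    for prev, cur in zip(x, x[1:]):
        result *= trans[prev][cur]
    return result

# Assumed example models over the binary alphabet:
p_iid = {"0": 0.7, "1": 0.3}
p1 = {"0": 0.5, "1": 0.5}
trans = {"0": {"0": 0.9, "1": 0.1}, "1": {"0": 0.4, "1": 0.6}}

print(prob_iid("01110", p_iid))        # 0.7 * 0.3 * 0.3 * 0.3 * 0.7
print(prob_markov1("01110", p1, trans))
```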

14. Statistical modelling + Arithmetic coding = Modern data compression

[Block diagram] The input image feeds a statistical modeller, which supplies the next symbol and the cumulative distribution of symbols to the arithmetic encoder; the encoder outputs the stream of bits.

15. Arithmetic coding

Message: BILL GATES

Character   Probability   Range
SPACE       1/10          [0.00, 0.10)
A           1/10          [0.10, 0.20)
B           1/10          [0.20, 0.30)
E           1/10          [0.30, 0.40)
G           1/10          [0.40, 0.50)
I           1/10          [0.50, 0.60)
L           2/10          [0.60, 0.80)
S           1/10          [0.80, 0.90)
T           1/10          [0.90, 1.00)

Encoding the new characters B, I, L, L, SPACE, G, A, T, E, S one by one narrows down an interval of (low value, high value) pairs; the successive values are reproduced by the sketch following slide 16. The final interval for BILL GATES is
$$(\,\mathrm{41D8F565_H} \cdot 2^{-32},\ \mathrm{41D8F567_H} \cdot 2^{-32}\,),$$
i.e. it takes 32 bits to specify a number inside it.

$$H = -\sum_{i} p_i \log_2(p_i) = 3.12 \text{ bits/character}$$

Shannon: to encode BILL GATES we need at least 31.2 bits.
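A short sketch that rebuilds the probability table and the entropy figure from the message itself; the ordering of the characters (space first, then alphabetical) is an assumption made to match the table above.

```python
import math
from collections import Counter

message = "BILL GATES"
counts = Counter(message)
n = len(message)

# Probability and cumulative range per character, in the table's order.
order = sorted(counts, key=lambda c: (c != " ", c))   # SPACE first, then alphabetical
low = 0.0
for c in order:
    p = counts[c] / n
    print(repr(c), f"{counts[c]}/{n}", f"[{low:.2f}, {low + p:.2f})")
    low += p

H = -sum((counts[c] / n) * math.log2(counts[c] / n) for c in counts)
print(f"H = {H:.2f} bits/character, lower bound = {n * H:.1f} bits for the whole message")
```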

16. Encoding principle

    Set low to 0.0
    Set high to 1.0
    While there are still input symbols do
        get an input symbol
        range = high - low
        high = low + range * high_range(symbol)
        low  = low + range * low_range(symbol)
    End of While
    output low
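The following sketch applies this encoding loop to the BILL GATES example with the ranges assumed above, printing the successive low and high values of the interval; the function and variable names are illustrative.

```python
from collections import Counter

def build_ranges(message):
    """Per-character [low_range, high_range) intervals: space first, then alphabetical."""
    counts, n = Counter(message), len(message)
    ranges, low = {}, 0.0
    for c in sorted(counts, key=lambda c: (c != " ", c)):
        ranges[c] = (low, low + counts[c] / n)
        low += counts[c] / n
    return ranges

def encode(message, ranges):
    """Floating-point arithmetic encoder following the pseudocode above."""
    low, high = 0.0, 1.0
    for symbol in message:
        rng = high - low
        low, high = low + rng * ranges[symbol][0], low + rng * ranges[symbol][1]
        print(symbol, low, high)            # trace of the successive low/high values
    return low                              # any number in [low, high) identifies the message

message = "BILL GATES"
code_value = encode(message, build_ranges(message))
print("code value:", code_value)
```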

17. Arithmetic coding: Decoding principle

    get encoded number
    Do
        find symbol whose range straddles the encoded number
        output the symbol
        range = symbol high value - symbol low value
        subtract symbol low value from encoded number
        divide encoded number by range
    until no more symbols

The decoding table (encoded number, output symbol, low, high, range at each step) recovers the symbols B, I, L, L, SPACE, G, A, T, E, S in turn; the sketch below reproduces it.
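A matching decoder sketch, reusing the same assumed ranges; it decodes a value taken from inside the final interval (here its midpoint, approximately 0.2572167754 under the assumed ranges), with the message length known to the decoder.

```python
from collections import Counter

def build_ranges(message):
    """Same assumed ranges as in the encoder sketch: space first, then alphabetical."""
    counts, n = Counter(message), len(message)
    ranges, low = {}, 0.0
    for c in sorted(counts, key=lambda c: (c != " ", c)):
        ranges[c] = (low, low + counts[c] / n)
        low += counts[c] / n
    return ranges

def decode(value, ranges, n_symbols):
    """Floating-point arithmetic decoder following the pseudocode above."""
    out = []
    for _ in range(n_symbols):
        # find the symbol whose range straddles the encoded number
        symbol = next(s for s, (lo, hi) in ranges.items() if lo <= value < hi)
        out.append(symbol)
        lo, hi = ranges[symbol]
        value = (value - lo) / (hi - lo)     # remove the effect of this symbol
    return "".join(out)

ranges = build_ranges("BILL GATES")
print(decode(0.2572167754, ranges, 10))      # -> BILL GATES
```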

18. References on practical implementations of arithmetic coding

[0] Moffat, Neal and Witten (1998). Source code: ftp://munnari.oz.au/pub/arith_coder/
[1] Witten, Neal and Cleary (1987). Source code: ftp://ftp.cpsc.ucalgary.ca/pub/projects/ar.cod/cacm-87.shar

19. [Block diagram, repeated] Statistical modeller and arithmetic encoder: input image, next symbol, cumulative distribution of symbols, stream of bits.

20. [FIG. 2] The subtrees T1, T2, T3 to the left of the path of the encoded string, used in the cumulative-probability example of slide 12.
