SGN-2306 Signal Compression. 1. Simple Codes

Similar documents
Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria

Optimal codes - I. A code is optimal if it has the shortest codeword length L. i i. This can be seen as an optimization problem. min.

Basic Principles of Lossless Coding. Universal Lossless coding. Lempel-Ziv Coding. 2. Exploit dependences between successive symbols.

SIGNAL COMPRESSION Lecture 7. Variable to Fix Encoding

Lecture 4 : Adaptive source coding algorithms

Chapter 3 Source Coding. 3.1 An Introduction to Source Coding 3.2 Optimal Source Codes 3.3 Shannon-Fano Code 3.4 Huffman Code

Chapter 2: Source coding

Run-length & Entropy Coding. Redundancy Removal. Sampling. Quantization. Perform inverse operations at the receiver EEE

Lecture 1 : Data Compression and Entropy

Chapter 2 Date Compression: Source Coding. 2.1 An Introduction to Source Coding 2.2 Optimal Source Codes 2.3 Huffman Code

Multimedia. Multimedia Data Compression (Lossless Compression Algorithms)

Introduction to information theory and coding

10-704: Information Processing and Learning Fall Lecture 10: Oct 3

CS4800: Algorithms & Data Jonathan Ullman

Multimedia Communications. Mathematical Preliminaries for Lossless Compression

Text Compression. Jayadev Misra The University of Texas at Austin December 5, A Very Incomplete Introduction to Information Theory 2

CSEP 521 Applied Algorithms Spring Statistical Lossless Data Compression

Entropy as a measure of surprise

4. Quantization and Data Compression. ECE 302 Spring 2012 Purdue University, School of ECE Prof. Ilya Pollak

SIGNAL COMPRESSION Lecture Shannon-Fano-Elias Codes and Arithmetic Coding

1 Introduction to information theory

Huffman Coding. C.M. Liu Perceptual Lab, College of Computer Science National Chiao-Tung University

Stream Codes. 6.1 The guessing game

Summary of Last Lectures

Data Compression. Limit of Information Compression. October, Examples of codes 1

Lecture 1: Shannon s Theorem

SIGNAL COMPRESSION. 8. Lossy image compression: Principle of embedding

Information Theory and Statistics Lecture 2: Source coding

ECE 587 / STA 563: Lecture 5 Lossless Compression

Source Coding Techniques

repetition, part ii Ole-Johan Skrede INF Digital Image Processing

1 Ex. 1 Verify that the function H(p 1,..., p n ) = k p k log 2 p k satisfies all 8 axioms on H.

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1

Lecture 3. Mathematical methods in communication I. REMINDER. A. Convex Set. A set R is a convex set iff, x 1,x 2 R, θ, 0 θ 1, θx 1 + θx 2 R, (1)

Lecture 3 : Algorithms for source coding. September 30, 2016

Lec 03 Entropy and Coding II Hoffman and Golomb Coding

Information Theory CHAPTER. 5.1 Introduction. 5.2 Entropy

Shannon-Fano-Elias coding

ECE 587 / STA 563: Lecture 5 Lossless Compression

Reduce the amount of data required to represent a given quantity of information Data vs information R = 1 1 C

Autumn Coping with NP-completeness (Conclusion) Introduction to Data Compression

Data Compression Techniques

Multimedia Information Systems

Image and Multidimensional Signal Processing

Computing Techniques for Parallel and Distributed Systems with an Application to Data Compression. Sergio De Agostino Sapienza University di Rome

Lecture 10 : Basic Compression Algorithms

Chapter 9 Fundamental Limits in Information Theory

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT

CMPT 365 Multimedia Systems. Lossless Compression

Entropy Coding. Connectivity coding. Entropy coding. Definitions. Lossles coder. Input: a set of symbols Output: bitstream. Idea

Information and Entropy

CSEP 590 Data Compression Autumn Dictionary Coding LZW, LZ77

Compression and Coding

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT

On Universal Types. Gadiel Seroussi Hewlett-Packard Laboratories Palo Alto, California, USA. University of Minnesota, September 14, 2004

Coding of memoryless sources 1/35

Ch 0 Introduction. 0.1 Overview of Information Theory and Coding

3F1 Information Theory, Lecture 3

2018/5/3. YU Xiangyu

BASIC COMPRESSION TECHNIQUES

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018

MARKOV CHAINS A finite state Markov chain is a sequence of discrete cv s from a finite alphabet where is a pmf on and for

3F1 Information Theory, Lecture 3

ELEC 515 Information Theory. Distortionless Source Coding

Digital communication system. Shannon s separation principle

lossless, optimal compressor

Motivation for Arithmetic Coding

Information Theory with Applications, Math6397 Lecture Notes from September 30, 2014 taken by Ilknur Telkes

Lecture 2: Introduction to Audio, Video & Image Coding Techniques (I) -- Fundaments

Coding for Discrete Source

Chapter 2 Source Models and Entropy. Any information-generating process can be viewed as. computer program in executed form: binary 0

COS597D: Information Theory in Computer Science October 19, Lecture 10

PART III. Outline. Codes and Cryptography. Sources. Optimal Codes (I) Jorge L. Villar. MAMME, Fall 2015

ITCT Lecture IV.3: Markov Processes and Sources with Memory

Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet)

Homework Set #2 Data Compression, Huffman code and AEP

Module 5 EMBEDDED WAVELET CODING. Version 2 ECE IIT, Kharagpur

Algorithm Design and Analysis

EE5139R: Problem Set 4 Assigned: 31/08/16, Due: 07/09/16

Lecture 2: Introduction to Audio, Video & Image Coding Techniques (I) -- Fundaments. Tutorial 1. Acknowledgement and References for lectures 1 to 5

UNIT I INFORMATION THEORY. I k log 2

State of the art Image Compression Techniques

Information Theory. Week 4 Compressing streams. Iain Murray,

Algorithms: COMP3121/3821/9101/9801

A Mathematical Theory of Communication

Dynamic Programming. Shuang Zhao. Microsoft Research Asia September 5, Dynamic Programming. Shuang Zhao. Outline. Introduction.

17.1 Binary Codes Normal numbers we use are in base 10, which are called decimal numbers. Each digit can be 10 possible numbers: 0, 1, 2, 9.

Data Compression Techniques (Spring 2012) Model Solutions for Exercise 2

! Where are we on course map? ! What we did in lab last week. " How it relates to this week. ! Compression. " What is it, examples, classifications

Intro to Information Theory

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Exercises with solutions (Set B)

An O(N) Semi-Predictive Universal Encoder via the BWT

CSE 421 Greedy: Huffman Codes

6.02 Fall 2012 Lecture #1

Digital Image Processing Lectures 25 & 26

CHAPTER 8 COMPRESSION ENTROPY ESTIMATION OF HEART RATE VARIABILITY AND COMPUTATION OF ITS RENORMALIZED ENTROPY

Data Compression Using a Sort-Based Context Similarity Measure

COMM901 Source Coding and Compression. Quiz 1

CSEP 590 Data Compression Autumn Arithmetic Coding

Transcription:

SGN-236 Signal Compression. Simple Codes. Signal Representation versus Signal Compression.2 Prefix Codes.3 Trees associated with prefix codes.4 Kraft inequality.5 A lower bound on the average length of optimal prefix codes.6 Shannon codes.7 Encoding a binary tree. Signal Representation versus Signal Compression An example of image representation The luminance at one pixel varies from black to white, in 256 gray levels. Alphabet: A = {,, 2, 3,..., 255} having 256 symbols. Each symbol is represented using 8 bits = byte. Image Size: 52 rows 52 columns, in total 26244 pixels, thus the file size is 26244 bytes Empirical probability of the symbol x (symbol frequency, or histogram) h(x) = Number of occurences of the symbol x Number of pixels in the image, for all symbols x A Original image Lena Symbol frequency (histogram).2 Original Image..8 h(x) (histogram).6.4.2 5 5 2 25 3 x (graylevel value) 2

Ready made tools for signal compression applied to Lena image Comparing several widely available compression programs for general use (Original file size Compressed file size) Unix compress utility (based on Ziv-Lempel method) File size: 26244 bytes 24462 bytes (7.46 bits/pixel) Unix gzip utility (a variation on Ziv-Lempel method) with option - (faster than with option -9) File size: 26244 bytes 225666 bytes (6.88 bits/pixel) Unix gzip utility with option -9 (takes longer than option -, but usually gives better compression) File size: 26244 bytes 23422 bytes (7.4 bits/pixel) Unix bzip2 utility (Burrows-Wheeler method) with option - (block size= k) File size: 26244 bytes 85244 bytes (5.65 bits/pixel) Unix bzip2 utility with option -9 (block size= 9k) File size: 26244 bytes 8462 bytes (5.6 bits/pixel) 3 Store the first column of lena, D(:, ). A simple (revertible) image transformation for Lena For the second column, D(:, 2), do not store it, but store instead the differences E(:, 2) = D(:, 2) D(:, ). Having available D(:, ) and E(:, 2) we can recover the second column D(:, 2) = E(:, 2) + D(:, ). For the third column, D(:, 3), instead of storing it, store the differences E(:, 3) = D(:, 3) D(:, 2). Continue to store in the same way E(:,k) = D(:,k) D(:,k ). The difference image is shown in the middle, below. Histogram of the original D, the difference image E, and histogram of the transformed image..2 Original Image.9 Difference Image ( pixel horiz. transl.).8..7.8.6 h(x) (histogram).6 h(x) (histogram).5.4.4.3.2.2. 5 5 2 25 3 x (graylevel value) 5 5 2 25 3 x (graylevel value) Entropy= 7.44 Difference image Entropy= 5.5 Note: the alphabet of the image E(:, :) is ideally A = { 255, 254,..., 3, 2,,,, 2, 3,...,255}, but for image lena it is A = { 25, 254,..., 3, 2,,,, 2, 3,...,75}. We added to each pixel in the difference image the value 25, in order to be able to display the negative values, finally using imagesc function in matlab for showing the values. 4

Shannon lossless coding theorem The basic fact in Data Compression If a source generates N symbols independently, with probabilities 2... N p p p 2... p N then, the lowest number of bits/symbol needed for coding is H = N i= p i log 2 p i () Example: For original data represented with 8 bits, the symbols are {,, 2,...,255}. If all symbols are independent and uniformly distributed, p i = const = /256, and then H = N i= p i log 2 p i = N (/256) ( 8) = (/256) 8 256 = 8 (2) i= thus data is incompressible (the original representation is as good as one can get). If the value 28 occurs with probability p 28 =.99 while the rest of the values occur with probability p i = const =./255, then the entropy is H = N and therefore data is compressible 5: (since 8/.6 = 5). i= p i log 2 p i =.6, (3) 5 The Ways of Signal Compression If the symbols are generated independently, nobody can encode a signal at a bitrate lower than the entropy of the independent source. The techniques for designing the codes with bitrates close to entropy are described in the Data compression part of the course (lectures to 6). If the symbols are dependent, the decorrelation of the source (using any of the following techniques: prediction, transforms, filterbanks) will improve the compression. If the original signal is transformed in a lossy manner (using scalar/vector quantization in conjunction with the following techniques: prediction, transforms, filterbanks) the compression ratio of the signal can be orders of magnitude better. 6

.2 Prefix codes A code is a set of codewords which are used to represent the symbols (or strings of symbols) of the original alphabet. Consider the original alphabet as the integers I 8 = {,,...,255}. The binary representation using 8 bits is the most simple code. A gray level image has the pixel values usually in the set I 8 and we refer to an uncompressed image as the file containing, in the scanning order left-right and up-down, the values of the pixels represented in binary form. Consider the ASCII set of symbols. The ASCII table specifies 7-bit binary codes for all English alphabet letters, punctuation signs, digits and several other special symbols. We refer usually to an original (uncompressed) text as the file containing the text stored as a sequence of ASCII symbols. The above codes have a common feature, all their codewords have the same length in bits. They satisfy an essential property of coding: reading a sequence of codes one can uniquely decode the message, without the need of separators between codewords. But even with unequal length of codewords the instantaneous decoding property may hold, this being the case of prefix codes. 7.3 Trees associated with prefix codes We will visualize prefix codes as leaves in binary (or D-ary) trees. The binary trees will have their nodes labelled by the sequence of bits read on the path from the root to the node (with the convention that a left branch carry a, and a right branch carry a ). In the sequel we will usually identify the codewords with the leaves in a tree. Example of a code for the alphabet {a,b, c,d} DEPTH DEPTH DEPTH 2 ROOT 2 2 2 2 INTERIOR NODE LEAF CODEWORD Table Symbol Codeword Codeword length a 2 b 2 c 2 d 2 BINARY TREE FOR THE CODE IN TABLE The string d,d, c,a, a, b is coded as. To decode correctly there is no need of codeword separators! 8

Example of a code for the alphabet {a,b, c,d, e, f, g} 2 2 2 2 3 3 3 3 4 4 Table 2 Symbol Codeword Codeword length a 3 b 3 c 4 d 4 e 3 f 2 g 2 BINARY TREE FOR THE CODE IN TABLE 2 The string eeabgf is coded as. To decode correctly there is no need of codeword separators! 9 Example of a code for the alphabet {,, 2, 3, 4, 5, 6} (continued) Assume now that the frequencies of the symbols are as follows: Symbol i Codeword Codeword length l(i) Symbol frequency p(i) (a) 3.25=/8 (b) 3.25=/8 2(c) 4.625=/6 3(d) 4.625=/6 4(e) 3.25=/8 5(f) 2.25=/4 6(g) 2.25=/4 The average codelength is L = 6 i= p(i)l(i) = 3 (.25 3) + 2 (.625 4) + 2 (.25 2) = 2.625 When we encode a very long string, of n symbols, nl is approximately the length of the encoded string, in bits. Why? The usual binary representation of {,, 2, 3, 4, 5, 6} requires 3 bits, resulting in an average codelength of 3 bits. The code of Table 2 has a shorter average codelength than the binary representation.

In fact for the given p(i) the code in Table 2 achieves the shortest possible codelength in the class of prefix codes. This will be obvious at the end of present lecture (exercise!). The code in Table 2 obeys the following rule: the more probable a symbol, the shorter its codelength..4 Kraft inequality The prefix condition introduces constraints on the set of codeword lengths, e.g. there is no prefix code with five codewords such that w = w 2 = w 3 = w 4 = w 5 = 2, where w denotes the length of the codeword w. These constraints on possible codeword lengths are completely characterized by the Kraft inequality for prefix codes, very easy to prove and visualize using the associated trees. Theorem (Kraft inequality) a) Given a set of numbers l,...,l n which satisfy the inequality then there is a prefix code with w = l,..., w n = l n. n i= 2 l i (4) b) Reciprocally, for every prefix free code having the codeword lengths w = l,..., w n = l n, the inequality (4) is true. 2

Proof (by a simple example). Depth = Depth = Depth =2 Depth =3 Depth =4 (Maximum)Depth =5 Partition of nodes 2 (5 3) + 2 +2 +2 (5 4) + 2 (5 2) + 2 + 2 + 2 (5 4) + 2 (5 3) + 2 (5 2) = 2 5 The tree in red has the set of codewords (the leafs marked with red asterisk): (,,,,,,,,,), having the codelengths l i (3,5,5,4,2,5,5,4,3,2). Each leaf has exactly 2 5 l i descendants on the last level, at maximum depth 5. The total number of these descendants can t exceed the total number of nodes at depth 5, which is 2 5. From i 2 5 l i 2 5 it results i 2 l i. 3 A second proof of Part a) (showing how to construct a prefix tree once Kraft inequality is satisfied) Without loss of generality suppose the numbers l,...,l n are ordered increasingly, and denote k i the number of occurrences of the integer i in the sequence l,...,l n. The sum in (4) can be alternatively written n i= 2 l i = l n j=l k j 2 j. We construct a prefix code in the following way: take k j leaves at depth j = l (which is possible since k j 2 j, i.e. k j 2 j, and obviously there are 2 j nodes at depth j in the tree). At depth j + we need to place k j + leaves, and from (4) one gets easily k j + 2 j + 2k j, which shows that we have to place at depth j + fewer leaves than the available places (at depth j + there are 2 j + nodes, but 2k j are not free, namely the siblings of the k j nodes which were declared leaves at depth j ). Continuing the same way, the inequality (4) will constrain the number of leaves to be placed at depth j to k j 2 j 2 j j k j... 2k j, which makes the placement possible. A second proof of part b) The proof is by successively reducing the associated tree to smaller depth trees. Consider the maximal depth of the original tree l max = l n (there are m n leaves with length l max, which contributes m n times the terms 2 l max in the sum (4)). 4

We create another tree, with maximal depth l max, by pruning two by two the leaves which are siblings to create shorter words, now of length l max, or if a leaf in the original tree does not have a sibling simply replace it by its parent. When pruning the siblings w and w to get the parent node w we have 2 w = 2 w + 2 w, while when replacing a son wb (b is either or ) by the parent w we have 2 w > 2 wb, therefore the overall sum w W lmax 2 w over the leaves of the new tree is greater or equal than the overall sum over the original tree w W lmax 2 w. By repeating the process, we find the chain of inequalities 2 w 2 w... 2 w = 2 + 2 = w W lmax w W lmax w W If the code is D-ary (the codewords are represented using D symbols, then the Kraft inequality reads n D l i (5) i= and Theorem holds true with (5) replacing (4). Kraft inequality holds true also for any uniquely decodable codes (not only for prefix codes, or instantaneous codes). Example: The code in Table 2 satisfies Kraft inequality with equality: n i= 2 l i = 2 2 2 + 3 2 3 + 2 2 4 = 5.5 A lower bound on the average length of optimal prefix codes The average length of the codewords is L = n p il i (6) i= which is the quantity we want to minimize for a given probability distribution p,p 2,...,p n. The prefix code strategy will not allow the lengths l,...,l n to be selected freely, they will be constrained by the Kraft inequality. The optimization problem is an integer linear programming problem, but with nonlinear constraints. To get a hint to the optimal solution we relax the integer requirement for the code length. 6

A lower bound on the average length of optimal prefix codes (continued) We minimize the extended criterion p il i + λ( n i i= 2 l i ) (7) where λ is an extra variable (Lagrange multiplier) to be determined such that the (Kraft) constraint is fulfilled. Taking derivative of (7) with respect to the real parameters l j and then equating to zero we get p j + λ2 l j ln 2 =, or p j λ ln 2 = 2 l j introducing it in the constraint equality n i= p i λ ln 2 = we obtain the optimal value λ = n p i i= ln 2 = ln 2. Finally we get 2 l j = p j λ ln 2 = p j or l j = log 2 p j (8) The optimal length of the code for the symbol j is therefore log 2 p j, quantity which is called the self-information of the symbol j. The average length of the optimal codewords is n i= p i log 2 p i (9) 7 and it is called entropy. By the optimality of l j = log 2 p j that we proved, no code can get a better average length than the entropy. In the case when the values log 2 p j are integers, there is a prefix code actually attaining the entropy bound. Finding the prefix code which minimizes the average codelength L for the case when at least for one j the value log 2 p j is non-integer is solved by the simple construction discovered by Huffman in 952 (to be discussed later in the lecture). 8

.6 Shannon code Shannon s code is derived from the result of the optimization solution (8). If the self-information log 2 p j is not integer, a natural attempt to allocate the code lengths is: where x denotes the smallest integer greater than x. l j = log 2 p j () To show that the lengths () define a prefix code we need to check the Kraft inequality, which results at once: n i= 2 l i = n 2 log 2 p i n 2 log 2 p i = n p i = () i= i= the first inequality being due to log 2 p i log 2 p i. The average codelength of the prefix code defined by () is L = i p i log 2 p i and can be lower and upper bounded by using log 2 p j log 2 p j log 2 p j + and therefore i p i log 2 p i i p i log 2 p i i i= p i log 2 + p i 9 which immediately gives H(p) L H(p) + stating that the average codelength is within bit of the entropy. 2

Example of a Shannon code for the alphabet {,, 2, 3, 4, 5, 6} Exercise: consider the probabilities given in Table 3 and draw in the template below one (out of many possible) prefix trees corresponding to the Shannon code: 2 2 2 2 3 3 3 3 4 4 5 5 Template where you should draw the Shanon code 3 3 4 4 Table 3 Symbol i probability p(i) l(i) = log 2 p(i). 4.5 3 2.5 5 3.9 4 4.4 3 5.27 2 6.2 3 2 The entropy of the given source is H = 6 p i log 2 p i = 2.643 i= The average codelength of the Shannon code is L = i p i l(i) = 3.2 A slightly better code has the codelengths l = [4 3 4 4 3 2 3] and the average codelength L = 2.97 The trivial binary representation of the symbols requires only 3 bits, which is better than the Shannon code. A simple method to code with at a bitrate closer to the entropy is to block the alphabet symbols. 22

Coding blocks of symbols We start with an example: consider as symbols of the alphabet all the pairs (s,s 2 ) for the source in Table 3. The new alphabet has now 7 7 = 49 symbols, in the set {(, ), (, ),...,(5, 6), (6, 6)} and their probabilities are p(s,s 2 ) = p(s )p(s 2 ) The entropy of the pairs is 2 times the entropy found for a single symbol: H(p(s,s 2 )) = p(s i,s j ) log 2 p(s i,s j ) = p(s i )p(s j ) log 2 p(s i )p(s j ) (i,j) i j = p(s i )p(s j )(log 2 p(s i ) + log 2 p(s j )) = p(s i ) p(s j ) log 2 p(s j ) i j i j p(s j ) p(s i ) log 2 p(s i ) = H(p(s)) p(s i ) + H(p(s)) p(s j ) = 2H(p(s)) j i i j The length of the Shannon code for the new alphabet is bounded: L((s,s 2 )) < H(p(s,s 2 ))+ = 2H(p(s))+, therefore the number of bits per symbol will be 2 L((s,s 2 )) < H(p(s))+ 2. Taking now blocks of n symbols and constructing a Shannon code for them, the number of bits per symbol will be n L((s,...,s n )) < H(p(s)) + n. Increasing the block length, one can be as close to the entropy as one wishes. Exercise: Extend the alphabet of the source to blocks of ten symbols and then design the Shannon code and find its actual length (you need a computer to solve this). 23.7 Encoding a binary tree Suppose we have a n symbol alphabet, for which we associated by a code design procedure a binary tree, having n leaves. How can we enumerate all binary trees which have n leaves? First we define an order in which we are going to inspect the interior nodes and leaves of the tree: We start from the root, than we go through the nodes/leaves at the first depth level from left to right, continue with second depth level and continue until the last depth level of the tree. When inspecting the nodes, we write down a string, one bit for each node. If the node is an interior node we write a, if it is a leaf we write down a. When somebody else is reading the string, he will be able to draw the tree! A tree with n leaves has n interior nodes (this can be proven easily by mathematical induction). The number of bits necessary to encode a binary tree with n leaves is therefore n + (n ) = 2n. Not all combinations of 2n bits will correspond to a valid binary tree, therefore there are at most 2 2n binary trees with n leaves. 24

Encoding a binary tree: example Inspect the tree and insert a in the interior nodes, a zero in the leaf nodes. Then write the bits in top-down, left -right scan. Encoding the structure of a binary tree The binary string associated with the tree is:,,,,, Commas have been added for clarity, to see the different depth levels, they are not needed if one wants to decode the string of bits to a tree structure. There are n = leaves and 2n = 9 bits in the sequence. 25 The complete message for you (the prefix tree is also included and is listed first): Space c d g o u k L 26

SIGNAL COMPRESSION 2. Huffman Codes 2. Huffman algorithm for binary coding alphabets 2.2 Canonical Huffman coding 2.3 Huffman algorithm for D-ary coding alphabets 2.4 Huffman codes for infinite alphabets 27 A recap of the previous lecture A prefix code can be represented by a binary tree The length of the codewords should be small for frequently occurring symbols Kraft inequality constrains the lengths of the symbols The average codelength is a good measure of coding performance The ideal codelength for a source emitting symbol i with probability p(i) is log 2 p(i) (it minimizes the average codelength). If the ideal log 2 p(i) codelength is not integer, there is a need of an optimization algorithm. This is curent lecture s topic: Huffman algorithm. 28

2. Huffman algorithm for binary sources Huffman coding associates to each symbol i, i {, 2,...,n}, a code, based on the known symbol probabilities p i, p i {p,p 2,...,p n }, such that the average codelength is minimized. No code assignment can be better than Huffman s code if the source is independent. Procedure. Arrange symbol probabilities p i in increasing order; Associate to them a list of n nodes at the zero th (initial) level of a tree.. While there is more than one node in the list.. Merge the two nodes with smallest probability from the list into a new node, which will have assigned as probability the sum of probabilities in the merged nodes. Delete from the list the two nodes which were merged..2 Arbitrarily assign and to the branches from the new node to the nodes which were merged. 2. Associate to each symbol the code formed by the sequence of s and s read from the root node to the leaf node. 29 Memory requirement: Once the Huffman codes have been associated to symbols, they are stored in a table. This table must be stored and be used both at encoder and decoder (e.g. it may be transmitted along with the data, at the beginning of the message). Extremely fast: Coding (and encoding) takes basically one table look up. 3

Example of using Huffman algorithm We consider the same example which was used to illustrate the Shannon code. The probabilities are given in Table 3. The steps of the algorithm are illustrated in the next two pages HUFFMAN.4.5.9 c= d=.29.27.24.2.5..4 b= a= e=.56.44 f= g= Table 3 Symbol probability Huffman code Shannon code i p(i) l H (i) l Sh (i) = log 2 p(i) a (). 3 4 b (2).5 3 3 c (3).5 4 5 d (4).9 4 4 e (5).4 3 3 f (6).27 2 2 g (7).2 2 3 Huffman code assigns shorter codewords than Shannon codewords to the symbols a, c and g. 3 Initial step of the algorithm.5.9..4.5.2.27 c d a e b g f Step of the algorithm.4.5.9..4.5.2.27 c d a e b g f Step 2 of the algorithm.4 Step 5 of the algorithm.56.44.29.24.4.24.5.9..4.5.2.27 c d a e b g f Step 3 of the algorithm.29.4.24.5.9..4.5.2.27 c d a e b g f Step 4 of the algorithm.44.29.5.9..4.5.2.27 c d a e b g f Step 6 of the algorithm.56.44.29.4.24.5.9..4.5.2.27 c d a e b g f.4.24.5.9..4.5.2.27 c d a e b g f 32

The entropy of the given source is H = 6 p i log 2 p i = 2.643 The average codelength of the Shannon code is L = p i l Sh (i) = 3.2 i i= The average codelength of the Huffman code is L = i p i l H (i) = 2.67 A simple method to code with at a bitrate closer to the entropy is to block the alphabet symbols. As an example, consider the new alphabet as all pairs of the symbols {(a,a),...,(g,f), (g, g)}. Building the Huffman code for this source we get the average codelength L 2 = (i,j) p (i,j) l H ((i,j)) = 5.336 which gives a per symbol average codelength 2 L 2 = 2.6568, better than the Huffman average codelength when coding symbol by symbol. For blocks of three symbols one gets a per symbol average codelength 3 L 3 = 2.6538. For blocks of four symbols one gets a per symbol average codelength 4 L 4 = 2.656. Note that the better compression is paid by a higher complexity: Coding symbol by symbol we need to store a code table with 7 entries, while when coding blocks of 4 symbols at a time we need a table with 24 entries. 33 Optimality of the Huffman code Proposition In the tree W we prune (merge) the symbols i n and i n to obtain the tree W P having the following symbols: (a) the n 2 symbols i,...,i n 2 and (b)the extra-symbol obtained by merging the symbols i n and i n (this extra-node having probability p(i n ) + p(i n )). Then the average lengths over the two trees are linked by Properties of the optimal code: L(W) = L(W P ) + p(i n ) + p(i n ) (a) If for two symbols i and i 2 we have p(i ) < p(i 2 ) then l(i ) l(i 2 ). (b) The longest two codewords in an optimal code have to be the sons of the same node and have to be assigned to the two symbols having the smallest and second smallest probability. Proof: (a) If l(i ) < l(i 2 ) and p(i ) < p(i 2 ) just switch the codewords for i and i 2. The resulting codelength minus the original codelength will be p(i 2 )(l(i ) l(i 2 )) + p(i )(l(i 2 ) l(i )) = (p(i 2 ) p(i ))(l(i ) l(i 2 )), which is negative, therefore the original code was not optimal. (b) By (a) we know the longest two codewords in an optimal code have to be assigned to the two symbols having the smallest and second smallest probability. If they don t have the same length, the lengthier can be shortened, therefore the code was not optimal (this can be seen also from the fact that the optimal tree has to be complete, i.e., each node has no descendents, or exactly two descendent; having only one descendent is not possible in an optimal tree). By a swapping argument always the smallest and second smallest can be arranged to be sons of the same node. Recursively applying property (b) one gets the Huffman algorithm, which results to provide the optimal tree. 34

Huffman procedure function [Tree, code,l_h,el_h,h]=huffman2(p) % The probabilities of the symbols...n are given %% Output: %% The codewords for the original symbols may be displayed by using %% for i=:n %% bitget(code(i),l_h(i):-:) %% end %% The Tree structure uses the original symbols :n and indices %% for interior nodes (n+):(2n-) %% The rows in Tree are nodes at same depth n=length(p); %% START THE HUFFMAN CONSTRUCTION % Initial step list_h=:n; plist=p; m=n+; pred=zeros(,2*n-); % Will keep indices of the nodes in the list % Will keep the probabilities of the nodes in list_h % Index of the first new node % pred(i) will show which node is the parent of node i while(length(list_h)>) 35 [psort,ind]=sort(plist); i= ind() ; i2= ind(2) ; % nodes ind() and ind(2) will be merged pnew=plist(i)+plist(i2); % Probability of the new node pred(list_h(i))=m; % the predecessor of list_h(i) is m pred(list_h(i2))=m; list_h([i,i2])=[]; % remove from the list nodes ind() and ind(2) plist([i,i2])=[]; % remove the probabilities of the deleted nodes list_h=[list_h m]; % add node m to the list plist=[plist pnew]; % specify the probability of node m m=m+; % Index of the new node end % end while %% END OF HUFFMAN CONSTRUCTION %% We extract now the structure of the coding tree Tree=[];nTree=[]; Tree(,)=2*n-; depth=; ind=find(pred==2*n-); % These are the sons of the root Tree(depth+,:2)=ind; ntree(depth+)=2; code(ind())=; code(ind(2))=; ntogo=2*n-4; 36

while(ntogo>) ntree(depth+2)=; for i=:ntree(depth+) father=tree(depth+,i); ind=find(pred==father); % These are the sons of the father if(~isempty(ind)) Tree(depth+2,nTree(depth+2)+[ 2])=ind; code(ind())=+2*code(father); code(ind(2))=+2*code(father); ntree(depth+2)=ntree(depth+2)+2; ntogo=ntogo-2; end % end if end % end for depth=depth+; end % end while %% We have the tree code structure and we find the %% codes and lengths for the symbols for i=:n [i,i2]=find(tree==i); L_H(i)=i-; bitget(code(i),l_h(i):-:) end 37 % Huffman s average code length EL_H=sum(p.*L_H(:n)) % Find the entropy of the source H indp=find(p>); H=-sum(p(indp).*log(p(indp)))/log(2); 38

2.2 Canonical Huffman coding Huffman construction may generate many different (but equivalent) trees When assigning and to the branches from a father to the sons, any choice is allowed. The resulting trees will be different, the codes will be different, but the length of a given codeword is the same for all trees. Therefore the optimal code is not unique. At every interior node we can switch the decision / of the two branches. There are n interior nodes, therefore Huffman algorithm can produce at least 2 n different codes. At one step in the algorithm, there may be ties, cases when the probabilities of the symbols are equal, and we may choose freely which symbol to consider first. This is a second source of non-unicity. Table 4 Symbol probability Huffman code Huffman code 2 Canonical Huffman i p(i) w(i) w(i) w(i) a (). b (2).5 c (3).5 d (4).9 e (5).4 f (6).27 g (7).2 39 Huffman construction may generate many different but equivalent trees HUFFMAN.56.44.44.56 HUFFMAN 2.4.29.27.24.2 f= g=.5..4 b= a= e=.2 g=.24.27 f=.4. e= a=.5 b=.29.4.5.9 c= d= CANONICAL HUFFMAN.9.5 d= c=..5.4 a= b= e=.27.2 f= g=.5.9 c= d= 4

The Canonical Huffman code in Table 5 is listed in Table 6 such that the codelengths are decreasing, and for a given codelength the symbols are listed in lexicographic order and their codewords are consecutive binary numbers. Table 5 Symbol Canonical Huffman i w(i) a () b (2) c (3) d (4) e (5) f (6) g (7) Table 6 Symbol Canonical Huffman i w(i) c (3) d (4) a () b (2) e (3) f (6) g (7) With Canonical Huffman codes all we need to store is the string from the first column of Table 6, and the table cdabefg Symbol Canonical Huffman i w(i) c (3) a () f (6) 4 Assigning the codewords in Canonical Huffman coding Given: the symbols {i}, for which the optimal lengths {l(i)} are known. Denote maxlength=max i {l(i)} /* Find how many codewords of length j =,...,maxlength are in {l(i)} */ For l = to maxlength Set num[l] = For i = to n Set num[l i ] = num[l i ] + 2 /* The integer for first code of length l is stored in firstcode(l). */ Set firstcode(maxlength) For l = maxlength- downto Set firstcode(l) (firstcode(l + ) +num[l + ] )/2 3 For l = to maxlength Set nextcode(l) firstcode(l) 4 For i = to n Set codeword(i) nextcode(l i ) Set symbol[l i,nextcode(l i ) firstcode(l i )] i Set nextcode(l i ) nextcode(l i ) + 42

Example of building firstcode[l] and Symbol[l i,j] for canonical Huffman coding Table 7 Symbol Code length codeword[i] Bit pattern l-bit prefix i l i 2 3 4 5 (a) 2 2(b) 5 3(c) 5 4(d) 3 5(e) 2 2 2 6(f) 5 2 2 7(g) 5 3 3 8(h) 2 3 3 num[l] 3 4 firstcode[l] 2 2 The array Symbol[l i,j] 2 3 2 (a) 5(e) 8(h) 3 4(d) 4 5 2(b) 3(c) 6(f) 7(g) 43 The Canonical Huffman code from Table 7 in a tree form 44

Decoding using a Canonical Huffman code Set v nextinputbit(). Set l. 2 While v < firstcode[l] do (a) Set v 2v + nextinputbit() (b) Set l l + At the end of while loop the integer v is a valid code, of l bits 3 Return symbol[l, v firstcode[l]] This is the index of the decoded symbol. 45 The codewords of length l occupy consecutive positions when read as integers. When decoding, we receive the bits one after another, and interpret them according to Decoding algorithm. Example for the canonical Huffman from Table 7: After we receive the first bit, b, we find it to be smaller than firstcode[] = 2, so we read the next bit, b 2. After we receive the second bit, if the binary number v = b b 2 is larger or equal to firstcode[2] = we are done, and read the decoded symbol from the table symbol[2,v firstcode[2]]. If the binary number v = b b 2 is smaller than firstcode[2] = we have to read a new bit, b 3. After we receive the third bit, if the binary number v = b b 2 b 3 is larger or equal to firstcode[3] = we are done, and read the decoded symbol from the table symbol[3,v firstcode[3]] = d. If the binary number v = b b 2 b 3 is smaller than firstcode[3] = we have to read a new bit, b 4. After we receive the fourth bit, b 4, we find it to be smaller than firstcode[4] = 2, so we read the next bit, b 5. After we receive the fifth bit, we find that the binary number v = b b 2 b 3 b 4 b 5 is larger or equal to firstcode[5] =, so we are done, and read the decoded symbol from the table symbol[5,v firstcode[5]]. As an example, if the bitstream received is... First stage: Read b =, initialize v = b =. Since v = < firstcode[] = 2 we read the next bit, b 2 =, and update v = 2 + =. Now v = < firstcode[2] =, so we read the next bit, b 3 =, and update v = 2 + =. Now v = = firstcode[3], so we decode by looking into symbol[3,v firstcode[3]] = symbol[3, ] = d. Second stage: We read now b 4 = and initialize v = b 4 =. Since v = < firstcode[] = 2 we read the next bit, b 5 =, and update v = 2 + = 2. Now v = 2 > firstcode[2] = so we decode by reading in symbol[2, v firstcode[2]] = symbol[2, ] = a. 46

2.3 Huffman algorithm for D-ary coding alphabets A prefix code for D ary coding alphabets can be represented as a D ary tree. A D ary complete tree has the following property: denote the number of interior nodes n i and number of leaves n. Then n = (D ) n i + (2) The proof is simple. Start with the complete tree having one interior node (the root), n i =, and n = D leaves; it obeys n = (D ) n i +. Now with any complete tree satisfying n = (D ) n i + we split one leaf, getting an extra interior node and D leaves, therefore after the split n new = n + D = (D ) n i + + D = (D ) (n i + ) + = (D ) n new i +. With D ary trees, the optimal codes have clearly the following properties: (a) If for two symbols i and i 2 we have p(i ) < p(i 2 ) then l(i ) l(i 2 ). (b) The longest codeword should not be the unique son of an interior node (at least another codeword must have the same length as the longest codeword). Huffman coding associates to each symbol a code, based on the known symbol probabilities {p,p 2,...,p n }. Procedure. Complete with n zero probability nodes the list of initial nodes {,...,n} such that in total there are n+n = (D ) n i + with n i an integer (i.e. we can build a complete D-ary tree having n + n leaves). 47. Arrange symbol probabilities p i in increasing order; Associate to them a list of n nodes at (initial) level of a tree.. While there is more than one node in the list.. Merge the D nodes with smallest probability from the list into a new node, which will have assigned as probability the sum of probabilities in the merged nodes. Delete from the list the D nodes which were merged..2 Arbitrarily assign,,...,d to the branches from the new node to the nodes which were merged. 2. Associate to each symbol the code formed by the sequence of,,...,d read from the root node to the leaf node. 48

2.4 Huffman codes for infinite alphabets Huffman algorithm starts building the code tree from the largest depth. If the alphabet is infinite, Huffman algorithm can not be directly applied. We consider as alphabet the natural numbers {,, 2,...} where the symbol probability is geometric with an arbitrary parameter θ (, ). P(i) = ( θ)θ i, i =,, 2,... (3) An example of geometric distribution is when the source is binary, Bernoulli i.i.d., with θ the probability of zero. The runs of s are to be encoded using one code, the runs of s using another code. The probability of i zero symbols followed by a one is P(i) = ( θ)θ i, hence the need to encode integers i having geometric distribution. More on run length coding latter! Consider a source with parameter θ. There is an integer l (sometimes denoted l(θ)) which satisfies both inequalities θ l + θ l+ < θ l + θ l (4) Proof: take θ =,θ,θ 2,θ 3..., which is monotonically decreasing to zero. The sequence θ + θ,θ + θ 2,θ 2 + θ 3..., is also monotonically decreasing to zero, and has the first term greater than. Therefore there is an integer l satisfying (4). The construction which follows will demonstrate a Huffman code, therefore an optimal code, for the geometric distribution with parameter θ. The same code will be optimal also for any other θ which satisfies θ l + θ l+ < θ l + θ l for l = l(θ). Therefore the code will be specific to a l value, not to a θ value. 49 Consider a source with parameter θ and the value l(θ) associated to it. For any m define the m-reduced source which has m + l + symbols, with following probabilities: Easily we see m+l i= P m (i) =. P m (i) = ( θ)θ i i m m < i m + l ( θ)θ i θ l The probabilities ( θ)θ i are monotonically decreasing when i increases, the same being true for ( θ)θi θ l. We show that the two symbols with smallest probabilities are m and m + l. First we show P m (m + l) P m (m ), or ( θ)θ m+l θ l ( θ)θ m which is a straight consequence of left side of (4). second we show that P m (m + l ) > P m (m), or ( θ)θ m+l > ( θ)θ m θ l which is a straight consequence of right side of (4) Knowing now that the two symbols with smallest probabilities are m and m + l, we apply Huffman algorithm, which merges the two symbols into a new one, with probability ( θ)θ m+l + ( θ)θ m ( θ)θm = (θ l + θ l ( θ)θm ) = = P θ l θ l θ l m (m) (6) 5 (5)

The source after merging the two nodes will have m + l symbols, with probabilities P m (i) P m (i) = ( θ)θ i i m m < i m + l ( θ)θ i θ l which is exactly the reduced source of order m. With the same type of merging two nodes, we reduce consecutively the original source until m =. At this last stage we merge two symbols to get a l alphabet with probabilities P (i) = (7) ( θ)θi θ l, i l (8) Little algebra will provide the optimal code for the reduced source. For convenience of notations we present now only the case when l is an integer power of two: l = 2 k. In this case the optimal code for (8) is as follows: use wordlength k for all symbols i =,...,l, i.e. use the binary representation using k bits for i. The overall code, named Golomb-Rice code, has the following rules: Any integer i is coded in two parts: First, the last k bits of i are written to the codeword (i.e. the binary representation of (i mod 2 k )). Then i 2 is computed and we append exactly i k 2 bits with value to the codeword, followed by a single. k Summary: the symbols m and m+l are merged at successive steps of the Huffman algorithm, until there is no more pair (m, m + l), i.e when there are only l symbols. The Huffman code for these l symbols is trivial when l = 2 k, it is just the binary representation of the l symbols. The overall code can be understood tracking back all operations, as shown in the next example. Example: take k = 2,l = 2 k = 4. The reduced tree and the final code tree are presented on the next page. 5 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 m-reduced source for l=4 and m=5 Based on the previous construction, Golomb-Rice code with k = 2 and l = 4 is optimal for all geometric 52

sources with θ obeying the inequalities: θ 4 + θ 5 < θ 3 + θ 4 i.e for θ (.892,.8567). Golomb-Rice codes: fast encoding and decoding The overall code, named Golomb-Rice code, has the following rules: Any integer i is coded in two parts: First, the last k bits of i are written to the codeword (i.e. the binary representation of (i mod 2 k )). Then i 2 is computed and we append exactly i k 2 bits with value to the codeword, followed by a single. k The GR codes are thus build of two distinct prefix codes: the last k bits of i form a first prefix code (a totally balanced tree of depth k), and the unary representation of i 2 (exactly i k 2 bits of followed by k a delimiter, ) forms a second prefix code. One can swap the two codes, so that the unary representation comes first and the k bits of i come next, and the code is still decodeable. 53 SGN-236 Signal Compression 3. Lempel-Ziv Coding 3. Dictionary methods 3.2 The LZ77 family of adaptive dictionary coders (Ziv-Lempel 77) 3.3 The gzip variant of Ziv-Lempel 77 3.4 The LZ78 family of adaptive dictionary coders (Ziv-Lempel 78) 3.5 The LZW variant of Ziv-Lempel 78 3.6 Statistical analysis of a simplified Ziv-Lempel algorithm 54

3. Dictionary methods Replace a substring in the file with a codeword that identifies the substring in a dictionary (or codebook). Static dictionary. One first builds a suitable dictionary, which will be used for all compression tasks. Examples: digram coding, where some of most frequently occurring pairs of letters are stored in the dictionary. Example: A reasonable small dictionary: 28 ASCII individual characters, followed by 28 pairs (properly selected out of the possible 2 4 pairs) of ASCII symbols. In clear an ASCII character needs 7 bits. With the above dictionary, the favorable cases are encoded by digrams (4 bits/character) while in the unfavorable cases, to encoding a single character one needs 8 bits/character instead of 7 bits/character). The dictionary may be enlarged, by adding longer words (phrases) to it (e.g. and, the). Unfortunately using a dictionary with long phrases will make it well adapted and efficient for a certain type of texts, but very inefficient for other texts (compare the dictionaries suitable for a mathematical textbook or for a collection of parliamentary speeches). 55 Semi-static dictionaries: one can build a dictionary well suited for a text. First, the dictionary is sent as side information, and afterwards the text is sent, encoded with the optimal dictionary. This has two drawbacks: (a) the overhead of side information may be very high, for short texts, and (b) at encoder we need to pass two times through the text (read two times a large file). Adaptive dictionary is both elegant and simple. The dictionary is built on the fly (or it need not to be built at all, it exists only implicitly) using the text seen so far. Advantages: (a) there is only one pass through the text, (b) the dictionary is changing all the time, following the specificity of the recently seen text. A substring of a text is replaced by a pointer to where it has occurred previously. Almost all dictionary methods are variations of two methods, developed by Jacob Ziv and Abraham Lempel in 977 and 978, respectively. In both methods, the same principle is used: the dictionary is essentially all or part of the text seen before (prior to the current position), the codewords specify two types of information: a) pointers to previous positions and b) the length of the text to be copied from the past. The variants of Ziv-Lempel coding differ in how pointers are represented and in the limitations they impose on what is referred to by pointers. 56

A (cartoon-like) example of encoding with an adaptive dictionary is given in the image below. The decoder has to figure out what to put in each empty box, by following each arrow, and taking the amount of text suggested by the size of each box. Pease porridge hot, Pease porridge cold, Pease porridge in a pot Nine days old. Some like it hot, Some like it cold, Some like it in a pot Nine days old. 57 Pease porridge hot, Pease porridge cold, Pease porridge in a pot Nine days old. Some like it hot, Some like it cold, Some like it in a pot Nine days old. 58

3.2 The LZ77 family of adaptive dictionary coders (Ziv-Lempel 77) The algorithm was devised such that decoding is fast and the memory requirements are low (the compression ratio was sacrificed in favor of low complexity). Any string of characters is first transformed into a strings of triplets, where components have the following significance: The first component of a triplet says how far back to look in the previous text to find the next phrase. The second component records how long the phrase is. The first and second components form a pointer to a phrase in the past text. The third component gives the character which will follow the next phrase. This is absolutely necessary if there is no phrase match in the past. It is included in every triplet for uniformity of decoding. We start with a decoding example. Suppose the encoded bitstream contains the triplets <,,a ><,,b >< 2,,a >< 3, 2,b >< 5, 3,b ><,,a > When the triplet < 5, 3,b > is received, the previous decoded text is abaabab. The pointer < 5, 3, > tells to copy the past phrase aab after abaabab. The character <,, b > tells to append a b after abaababaab 59 When the triplet <,,a > is received, it tells to copy characters starting with the last available character. This is a recursive reference, but fortunately it can be solved easily. We find that the characters are in fact bbbbbbbbbb. Thus recursive references are similar to run-length coding (to be discussed in a later course). In LZ77 there are limitations on how far back a pointer can refer and the maximum size of the string referred to. Usually the window for search is limited to a few thousand characters. Example: with 3 bits one can address 892 previous positions (several book pages). The length of the phrase is limited to about 6 characters. Longer pointers are expensive in bits, without a significant improvement of the compression. If the length of the phrase is the position is not relevant The decoder is very simple and fast, because each character decoded requires only a table look-up (the size of the array is usually smaller than the cache size). The decoding program is sometimes included with the data at very little cost, such that a compressed file can be downloaded from the network without any software. When executed, the program generates the original file. 6

Example of LZ77 compression encoder output <,,a > <,,b > < 2,,a > < 3, 2,b > < 5, 3,b > <,,a > decoder output a b aa bab aabb bbbbbbbbbba 6 Encoding procedure Goal: Given the text S[...N], and the window length W. Produce a stream of triplets < f, l, c > position-length-next character (the binary codes for f, l,c are discussed latter).. Set p. /* (S(p) is the next character to be encoded) */ 2. While (p N) /* while we did not reach the end of the text */ 2. Search for the longest match for S[p,p+,...] in S[p W,...,p ]. Denote m the position and l the length of the match S[m,m+,...,m+l ] S[p,p+,...,p+l ]. 2.2 Write in the output stream the triplet <position,length,next character>), i.e. < p m,l, S[p + l] > 2.3 Set p p + l +. /*Continue encoding from S[p + l + ])*/ 62

Decoding procedure Given a stream of triplets < f, l, c > (the binary codes for f, l,c are discussed latter). Set p. /* (S(p) is the next character to be decoded) */ 2 While there are non-decoded triplets < f, l,c > /* while we did not reach the end of the text */ 2. Read the triplet < f,l, c >. 2.2 Set S[p,p +,...,p + l ] = S[p f,p f +,...,p f + l ]. 2.3 Set S[p + l] c. 2.4 Set p p + l +. /*Continue encoding from S[p + l + ])*/ 63 3.3 The gzip variant of Ziv-Lempel 77 Distributed by Gnu Free Software Foundation (author Gailly, 993) Gzip uses a simple technique to speed up at the encoder the search for the best match in the past. The next three characters are used as addresses in a look up table, which contains a linked list showing where the next three characters have occurred in the past. The length of the list is restricted in size, by a parameter selected by the user before starting the encoding. If there are long runs of the same characters, limiting the size of the list helps removing the unhelpful references in the list. Recent occurrences are stored at the beginning of the list. Binary encoding of the triplets <position,length,next character> In gzip the encoding is done slightly differently than classical LZ77: instead of sending all the time the triplet <position,length,character>, gzip sends either a pair <length,position>, when a match is found, or it sends <character>, when no match was found in the past. Therefore a previous match is represented by a pointer consisting in position and length. The position is Huffman coded such that more frequent positions (usually recent ones) are encoded using fewer bits than older positions. 64

The match length and the next character are encoded with a single Huffman code (more efficient than separately Huffman encoding the length and the character and adding an extra bit to signal that what follows is length or character). The Huffman codes are generated semi-statically: blocks of up to 64Kbytes from the input file are processed at a time. The canonical Huffman codes are generated for the pointers and raw characters, and a code table is placed at the beginning of the compressed form of the block. The program does not need to read twice the file (64 Kbytes can be kept in memory). With its fast list search method and compact Huffman representation of pointers and characters on Huffman codes, gzip is faster and compresses better than other Ziv-Lempel methods. However, faster versions exist, but their compression ratio is smaller. 65 3.4 The LZ78 family of adaptive dictionary coders In LZ77 pointers can refer to any substring in the window of previous text. This may be inefficient, since the same substring may appear many times in the window, and we spare multiple codewords for the same substring. In LZ78 only some substrings can be referenced, but now there is no window restriction in the previous text. The encoded stream consists of pairs <index, character>, where <index, > points in a table to the longest substring matching the current one, and < character> is the character following the matched substring. Example We want to encode the string abaababaa The encoder goes along the string and creates a table where it dynamically adds new entries. When encoding a new part of the string, the encoder searches the existing table to find a match for the new part, and if there are many such matches, it selects the longest one. Then it will add to the encoded stream the address in the table of the longest match. Additionally, he will add to the bitstream the code for the next character. 66

When starting to encode the string abaababaa, the table is empty, so there is no match in it, and the encoder adds to the output bitstream the pair <,a > ( for no match found in the table, and a for the next character). After this, the encoder adds to the dictionary an entry for the string a, which will have address. Continuing to encode the rest of the string, baababaa, the table has the single entry a, so no match is found in the table. The encoder adds to the output bitstream the pair <,b > ( for no match found in the table, and b for the next character). After this, the encoder adds to the dictionary an entry for the string b, which will have address 2. Continuing to encode the rest of the string, aababaa, we can find in the table the entry a, which is the longest match now. The encoder adds to the output bitstream the pair <,a > ( for match found in the first entry of the table, and a for the next character). After this, the encoder adds to the dictionary an entry for the string aa, which will have address 3. 67 How the decoder works. encoder output <,a > <,b > <,a > < 2,a > < 4,a > < 4,b > < 2,b > < 7,b > < 8,b > decoder output a, b, aa, ba, baa, Table entries a b aa ba baa Table addresses 2 3 4 5 In the example the encoded stream is <,a ><,b ><,a >< 2,a >< 4,a >< 4,b >< 2,b >< 7,b >< 8,b >. After decoding the first 5 pairs we have found the original text a,b, aa, ba,baa. When processing the sixth pair, < 4,b >, which represents the phrase 4 (i.e. ba) and the following character is b, therefore we complete the decoded string to a,b, aa, ba,baa, bab and bab is added to the dictionary as the phrase 6. The rest of the decoded string is a run of b s. The separation of the input string in substrings (the separation by commas in the previous example) is called a parsing strategy. The parsing strategy can be implemented in a trie structure. The characters of each phrase specify the path from the root to the node labeled by the index of the phrase. 68

TRIE DATA FOR LZ78 CODING a a 3 b a 2 4 a b 5 6 69 the data structure in LZ78 grows without any bounds, so the growth must be stopped to avoid the use of too much memory. At the stopping moment the trie can be removed and re-initialized. Or it can be partly rebuilt using a few hundred of the recent bytes. Encoding with LZ78 may be faster than with LZ77, but decoding is slower, since we have to rebuild the dictionaries (tables) at decoding time. 7

3.5 The Lempel-Ziv-Welch (LZW) variant of LZ78 LZW is more popular than Ziv-Lempel coding, it is the basis of Unix compress program. LZW encodes only phrase numbers and does not have explicit characters in the encoded stream. This is possible by initializing the list of phrases to include all characters, say the entries to 28, such that a has address 97 and b has address 98. a new phrase is built from an existing one by appending the first character of the next phrase to it. encoder input a b a ab ab ba aba abaa encoder output 97 98 97 28 28 29 3 34 New entry added ab ba aa aba abb baa abaa address of new entry 28 29 3 3 32 33 34 in the example the encoder output is formed of the indexes in the dictionary 97, 98, 97, 28, 28, 29, 3, 34. Decoding 97, 98, 97, 28, 28 we find the original text a,b, a,ab, ab and construct the new entries in the dictionary 28, 29, 3, 3. We explain in detail the decoding starting from the next received index, 29. First we read from the encoded stream the entry 29 to be ba, which can be appended to 7 the decoded string, a,b, a, ab,ab, ba. At this moment the new phrase to be added to the dictionary is phrase 32 = abb. Then we read from the encoded stream the entry 3. This is found to be aba and added to the decoded string, a,b, a, ab,ab, ba, aba. We also add to the dictionary the new phrase 33=baa. the lag in the construction of the dictionary creates a problem when the encoder references a phrase index which is not yet available to the decoder. This is the case when 34 is received in the encoded stream: there is no 34 index in the dictionary yet. However, we know that 34 should start with aba and contains an extra character. Therefore, we add to the decoded string a,b, a,ab, ab, ba,aba, aba?. Now we are able to say what is the phrase 34, namely abaa, and after this we can substitute? by a. There are several variants of LZW. Unix compress is using an increasing number of bits for the indices: fewer when there are fewer entries (other variants are using for the same file the maximum number of bits necessary to encode all parsed substrings of the file). When a specified number of phrases are exceeded (full dictionary) the adaptation is stopped. The compression performance is monitored, and if it deteriorates significantly, the dictionary is rebuilt from scratch. 72

Encoding procedure for LZW

Given the text S[1...N].
1. Set p ← 1.                 /* S(p) is the next character to be encoded */
2. For each character d ∈ {0, ..., q−1} in the alphabet do   /* initial dictionary */
       Set D[d] ← character d.
3. Set d ← q − 1.             /* d points to the last entry in the dictionary */
4. While there is still text remaining to be coded do
   4.1 Search for the longest match for S[p, p+1, ...] in D.
       Suppose the match occurs at entry c, with length l.
   4.2 Output the code of c.
   4.3 Set d ← d + 1.         /* add an entry to the dictionary */
   4.4 Set p ← p + l.
   4.5 Set D[d] ← D[c] ++ S[p].   /* the new entry is the match concatenated with the next character */

Decoding procedure for LZW

1. Set p ← 1.                 /* S(p) is the next character to be decoded */
2. For each character d ∈ {0, ..., q−1} in the alphabet do   /* initial dictionary */
       Set D[d] ← character d.
3. Set d ← q − 1.             /* d points to the last entry in the dictionary */
4. For each code c in the input do
   4.1 If d > q − 1 then      /* the first code is an exception */
          Set last character of D[d] ← first character of D[c].
   4.2 Output D[c].
   4.3 Set d ← d + 1.         /* add an entry to the dictionary */
   4.4 Set D[d] ← D[c] ++ ?   /* add an entry by concatenation; its last character is not yet known */
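The two procedures can be written as runnable Python roughly as follows (an illustrative sketch; the dictionary is initialized with 128 single characters, so a and b get the codes 97 and 98 and new phrases start at 128, matching the example above):

def lzw_encode(text, alphabet_size=128):
    # LZW: output only dictionary indices, no explicit characters.
    dictionary = {chr(i): i for i in range(alphabet_size)}
    w, codes = "", []
    for ch in text:
        if w + ch in dictionary:
            w += ch                                # keep extending the current match
        else:
            codes.append(dictionary[w])            # emit the code of the longest match
            dictionary[w + ch] = len(dictionary)   # new phrase = match + next character
            w = ch
    if w:
        codes.append(dictionary[w])
    return codes

def lzw_decode(codes, alphabet_size=128):
    dictionary = {i: chr(i) for i in range(alphabet_size)}
    w = dictionary[codes[0]]
    out = [w]
    for c in codes[1:]:
        if c in dictionary:
            entry = dictionary[c]
        else:                                      # index not yet in the dictionary
            entry = w + w[0]                       # it must be w plus its own first character
        dictionary[len(dictionary)] = w + entry[0]
        out.append(entry)
        w = entry
    return "".join(out)

codes = lzw_encode("abaababbaabaabaa")
print(codes)                 # [97, 98, 97, 128, 128, 129, 131, 134]
print(lzw_decode(codes))     # abaababbaabaabaa   (code 134 triggers the special case)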

3.6 Statistical analysis of a simplified Ziv-Lempel algorithm

For the universal data compression system: the binary source sequence is sequentially parsed into strings that have not appeared so far. Let c(n) be the number of phrases in the parsing of the input n-sequence. We need log c(n) bits to describe the location of the prefix to the phrase and 1 bit to describe the last bit. The above two-pass algorithm may be changed to a one-pass algorithm, which allocates fewer bits for coding the prefix location. The modifications do not change the asymptotic behavior.

Parse the source string into segments. Collect a dictionary of segments. Add to the dictionary a segment one symbol longer than the longest match so far found.

Coding: transmit the index of the matching segment in the dictionary plus the terminal bit.

Example: parsing the string 010001000 gives

Index in dictionary   Segment   Transmitted message
1                     0         (0, 0)
2                     1         (0, 1)
3                     00        (1, 0)
4                     01        (1, 1)
5                     000       (3, 0)

The length of the code, with increasing sizes of the segment indices, is

L = Σ_{j=1}^{number of segments} ⌈log₂ j⌉ + number of segments.

If we assign the worst-case length to all segment indices, and if the number of segments is c(n), with n the total length of the input string, the length is

l = c(n)(1 + log c(n)),

and the average length per input symbol is

l/n = c(n)(1 + log c(n)) / n.
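A compact sketch of this incremental parsing and of the resulting code length, applied to the string parsed in the table above (illustrative Python; the helper names are mine, and the j-th pointer is charged ⌈log₂ j⌉ bits plus 1 bit for the new symbol, as in the formula above):

from math import ceil, log2

def incremental_parse(bits):
    # Parse a binary string into phrases never seen before (LZ78-style)
    # and return the transmitted (prefix_index, last_bit) pairs.
    index = {"": 0}
    pairs = []
    phrase = ""
    for b in bits:
        phrase += b
        if phrase not in index:                # first time this phrase appears
            index[phrase] = len(index)
            pairs.append((index[phrase[:-1]], b))
            phrase = ""
    return pairs

def code_length(pairs):
    # ceil(log2 j) bits for the j-th pointer, plus 1 bit per phrase.
    return sum(ceil(log2(j)) for j in range(1, len(pairs) + 1)) + len(pairs)

pairs = incremental_parse("010001000")
print(pairs)                 # [(0, '0'), (0, '1'), (1, '0'), (1, '1'), (3, '0')]
print(code_length(pairs))    # pointers cost 0+1+2+2+3 = 8 bits, plus 5 bits -> 13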

Definition. A parsing of a binary string x₁x₂...x_n is a division of the string into phrases, separated by commas. A distinct parsing is a parsing such that no two phrases are identical.

Lemma (Lempel and Ziv). The number of phrases c(n) in a distinct parsing of a binary sequence x₁x₂...x_n satisfies

c(n) ≤ n / ((1 − ε_n) log n),   where ε_n = min(1, (log(log n) + 4) / log n).

Theorem. Let {X_n} be a stationary ergodic process with entropy rate H(X) and let c(n) be the number of distinct phrases in a distinct parsing of a sample of length n from this process. Then, with probability 1,

lim sup_{n→∞} c(n) log c(n) / n ≤ H(X).

Theorem. Let {X_n} be a stationary ergodic process with entropy rate H(X). Let l(X₁, X₂, ..., X_n) be the Lempel-Ziv codeword length associated with X₁, X₂, ..., X_n. Then, with probability 1,

lim sup_{n→∞} (1/n) l(X₁, X₂, ..., X_n) ≤ H(X).

Proof. We know that l(X₁, X₂, ..., X_n) = c(n)(1 + log c(n)). By the Lempel-Ziv lemma, c(n)/n → 0, and thus

lim sup_{n→∞} (1/n) l(X₁, X₂, ..., X_n) = lim sup_{n→∞} ( c(n) log c(n)/n + c(n)/n ) ≤ H(X).

SGN-2306 Signal Compression

4. Shannon-Fano-Elias Codes and Arithmetic Coding

4.1 Shannon-Fano-Elias Coding
4.2 Arithmetic Coding

4.1 Shannon-Fano-Elias Coding

We discuss how to encode the symbols {a₁, a₂, ..., a_m}, knowing their probabilities, by using as code a (truncated) binary representation of the cumulative distribution function.

Consider the random variable X taking as values the m letters of the alphabet, {a₁, a₂, ..., a_m}, and for the letter a_i the probability mass function is p(X = a_i) = p(a_i) > 0. The (cumulative) distribution function is

F(x) = Prob(X ≤ x) = Σ_{a_k ≤ x} p(a_k),

where we assumed the lexicographic ordering relation a_i < a_j if i < j. Note that if one changes the ordering, the cumulative distribution function will be different.

y = F(x) is a function whose plot looks like stairs, with jumps at x = a_k (see the plot on the next page). Even though there is no inverse function x = F⁻¹(y), we may define a partial inverse as follows: if all p(a_i) > 0, an arbitrary value y ∈ [0, 1) uniquely determines a symbol a_k, namely the symbol that obeys F(a_{k−1}) ≤ y < F(a_k). We may use the plot of F(x) to identify the value a_k for which F(a_{k−1}) ≤ y < F(a_k). Note that F(a_k) = F(a_{k−1}) + p(a_k), which is a fast way to compute F(a_k).

To avoid dealing with interval boundaries, define

F̄(a_i) = Σ_{k=1}^{i−1} p(a_k) + (1/2) p(a_i).

The values F̄(a_i) are the midpoints of the steps in the distribution plot. If F̄(a_i) is given, one can find a_i. The same is true if one gives an approximation of F̄(a_i), as long as it does not go outside the interval F(a_{i−1}) ≤ y < F(a_i). Therefore the number F̄(a_i), or an approximation of it, can be used as a code for a_i.

Since the real number F̄(a_i) may happen to have an infinite binary representation, we have to look sometimes for numbers close to it, but having shorter binary representations. From Shannon codes we know that a good code for a_i needs to be represented in about log 1/p(a_i) bits, therefore F̄(x) needs to be represented in about log 1/p(x) bits.

[Figure: the probability mass function p(a_i) and the cumulative distribution F(x) for a five-symbol alphabet; F(x) is a staircase with a jump of height p(a_i) at each x = a_i.]

Probability mass function and cumulative distribution for strings

To extend the previous reasoning from symbols to strings of n symbols, x, we have to:

- compute for each of the strings of n symbols the probability mass p(x) (such that Σ_x p(x) = 1),
- define a lexicographic ordering for any two strings (each with n symbols) x and y, denoted by the ordering symbol x < y,
- define the cumulative probability F̄(y) = Σ_{x<y} p(x) + (1/2) p(y).

The code for x is obtained as follows: we truncate (floor operation) F̄(x) to l(x) bits to obtain ⌊F̄(x)⌋_{l(x)}, where l(x) = ⌈log 1/p(x)⌉ + 1.

Important notation distinction: ⌊F̄(x)⌋_k is the binary representation of the sub-unitary number F̄(x), using k bits for the fractional part; ⌈log 1/p(x)⌉ denotes as usual the smallest integer greater than or equal to log 1/p(x).

The codeword to be used for encoding the string x is ⌊F̄(x)⌋_{l(x)}.

Property 1. The code is well defined, i.e. from ⌊F̄(x)⌋_{l(x)} we uniquely identify x.

Proof: By the definition of truncation,

0 ≤ F̄(x) − ⌊F̄(x)⌋_{l(x)} < 2^{−l(x)},   and therefore   F̄(x) − 2^{−l(x)} < ⌊F̄(x)⌋_{l(x)} ≤ F̄(x).

Now we use the fact that l(x) = ⌈log 1/p(x)⌉ + 1, so

2^{l(x)} = 2 · 2^{⌈log 1/p(x)⌉} ≥ 2 · 2^{log 1/p(x)} = 2/p(x),   i.e.   2^{−l(x)} ≤ p(x)/2 = F̄(x) − F(x−1),

where x−1 denotes the string preceding x in the ordering, so that F(x−1) = F(x) − p(x). Combining the two displays,

F(x−1) ≤ F̄(x) − 2^{−l(x)} < ⌊F̄(x)⌋_{l(x)} ≤ F̄(x) ≤ F(x).

Finally, the uniqueness of x given ⌊F̄(x)⌋_{l(x)} follows from F(x−1) < ⌊F̄(x)⌋_{l(x)} ≤ F(x) (i.e., looking at the plot of the cumulative function z = F(x), the value z = ⌊F̄(x)⌋_{l(x)} falls on the step at x, between the basis of the step and its middle).
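A small sketch of the construction (illustrative Python for the symbol case; the string case is identical once p(x) is taken to be the probability of the whole string, and the function name sfe_code is mine):

from math import ceil, log2

def sfe_code(probs):
    # Shannon-Fano-Elias codewords: truncate Fbar(a_i) to l(a_i) bits.
    codewords = {}
    F_prev = 0.0                       # F(a_{i-1})
    for symbol, p in probs:
        Fbar = F_prev + p / 2          # midpoint of the step of this symbol
        l = ceil(log2(1 / p)) + 1      # l(x) = ceil(log 1/p(x)) + 1
        bits, frac = "", Fbar
        for _ in range(l):             # binary expansion truncated to l bits
            frac *= 2
            bits += "1" if frac >= 1 else "0"
            frac -= int(frac)
        codewords[symbol] = bits
        F_prev += p
    return codewords

# Probabilities of Example 1 on the next slide: 0.25, 0.5, 0.125, 0.125
print(sfe_code([(1, 0.25), (2, 0.5), (3, 0.125), (4, 0.125)]))
# {1: '001', 2: '10', 3: '1101', 4: '1111'}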

Property 2. The code is prefix free.

Associate to each codeword z₁z₂...z_l the closed interval [0.z₁z₂...z_l ; 0.z₁z₂...z_l + 2^{−l}]. Any number outside this interval differs from 0.z₁z₂...z_l in at least one of the bits 1 to l, and therefore z₁z₂...z_l is not a prefix of any number outside the interval. Extending the reasoning to all codewords, the code is prefix free if and only if all intervals corresponding to codewords are disjoint.

The interval corresponding to any codeword has length 2^{−l(x)}. A proper prefix of the codeword is, e.g., z₁z₂...z_{l−1}. Can that prefix be a codeword itself? If z₁z₂...z_{l−1} is a codeword, then it represents the interval [0.z₁z₂...z_{l−1} ; 0.z₁z₂...z_{l−1} + 2^{−(l−1)}]. But the number 0.z₁z₂...z_l necessarily belongs to that interval, therefore there is an overlap of the intervals.

We already have F(x−1) < ⌊F̄(x)⌋_{l(x)}, and similarly, since 2^{−l(x)} ≤ p(x)/2 = F(x) − F̄(x),

F(x) = F̄(x) + p(x)/2 ≥ F̄(x) + 2^{−l(x)} ≥ ⌊F̄(x)⌋_{l(x)} + 2^{−l(x)},

and therefore the interval [⌊F̄(x)⌋_{l(x)} ; ⌊F̄(x)⌋_{l(x)} + 2^{−l(x)}] is totally included in the interval [F(x−1), F(x)]. Since the intervals [F(x−1), F(x)] are disjoint for different strings, the codeword intervals are disjoint as well, so no overlap can occur. Consequently the Shannon-Fano-Elias code is prefix free.

Average length of Shannon-Fano-Elias codes

We use l(x) = ⌈log 1/p(x)⌉ + 1 bits to represent x. The expected codelength is

L = Σ_x p(x) l(x) = Σ_x p(x) (⌈log 1/p(x)⌉ + 1) < H + 2,        (9)

where the entropy is H = Σ_x p(x) log 1/p(x).

Example 1. All probabilities are integer powers of 2.

x   p(x)    F(x)    F̄(x)     F̄(x) in binary   l(x) = ⌈log 1/p(x)⌉ + 1   Codeword
1   0.25    0.25    0.125    0.001            3                          001
2   0.5     0.75    0.5      0.10             2                          10
3   0.125   0.875   0.8125   0.1101           4                          1101
4   0.125   1.0     0.9375   0.1111           4                          1111
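The average length and entropy of Example 1 (quoted on the next slide) can be checked directly (an illustrative Python check):

from math import ceil, log2

p = [0.25, 0.5, 0.125, 0.125]                 # Example 1
l = [ceil(log2(1 / pi)) + 1 for pi in p]      # SFE codeword lengths
L = sum(pi * li for pi, li in zip(p, l))      # average codeword length
H = -sum(pi * log2(pi) for pi in p)           # entropy
print(l, L, H)          # [3, 2, 4, 4]  2.75  1.75,  so indeed H <= L < H + 2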

Average length of Shannon-Fano-Elias codes

The average codelength in Example 1 is 2.75 bits, while the entropy is 1.75 bits. Since all probabilities are powers of two, the Huffman code attains the entropy. One can remove the last bit in the last two codewords of the Shannon-Fano-Elias code in Example 1!

Example 2. Not all probabilities are integer powers of 2. The Huffman code is on average 1.2 bits shorter than the Shannon-Fano-Elias code in Example 2.

x   p(x)   F(x)   F̄(x)    F̄(x) in binary   l(x) = ⌈log 1/p(x)⌉ + 1   Codeword
1   0.25   0.25   0.125   0.001            3                          001
2   0.25   0.5    0.375   0.011            3                          011
3   0.2    0.7    0.6     0.1001...        4                          1001
4   0.15   0.85   0.775   0.1100...        4                          1100
5   0.15   1.0    0.925   0.1110...        4                          1110

4.2 Arithmetic coding

Motivation for using arithmetic codes

- Huffman codes are optimal codes for a given probability distribution of the source. However, their average length is longer than the entropy, within a distance of 1 bit. To reach an average codelength closer to the entropy, Huffman coding is applied to blocks of symbols instead of individual symbols. The size of the Huffman table needed to store the code increases exponentially with the length of the block.
- If during encoding we improve our knowledge of the symbol probabilities, we have either to redesign the Huffman table, or to use an adaptive variant of Huffman (but everybody agrees adaptive Huffman is neither elegant nor computationally attractive).
- When encoding binary images, the probability of one symbol may be extremely small, therefore the entropy is close to zero. Without blocking the symbols, the Huffman average length is 1 bit! Long blocks are strictly necessary in this application.

Whenever somebody needs to encode long blocks of symbols, or wants to change the code to make it optimal for a new distribution, the solution is arithmetic coding. Its principle is similar to Shannon-Fano-Elias coding, i.e. handling the cumulative distribution to find codes. However, arithmetic coding is better engineered, allowing very efficient implementations (in speed and compression ratio) and an easy adaptation mechanism.

Principle of arithmetic codes

Essential idea: efficiently calculate the probability mass function p(x^n) and the cumulative distribution function F(x^n) for the source sequence x^n = x₁x₂...x_n. Then, similarly to Shannon-Fano-Elias codes, use a number in the interval [F(x^n) − p(x^n); F(x^n)] as the code for x^n.

A sketch: expressing F(x^n) with an accuracy of log 1/p(x^n) bits will give a code for the source, so the codewords for different sequences are different. But there is no guarantee that the codewords are prefix free. As in Shannon-Fano-Elias codes, we may use ⌈log 1/p(x^n)⌉ + 1 bits to round F(x^n), in which case the prefix condition is satisfied.

A simplified variant. Consider a binary source alphabet, and assume we have a fixed block length n that is known to both the encoder and decoder. We assume we have a simple procedure to calculate p(x₁x₂...x_n) for any string x₁x₂...x_n. We will use the natural lexicographic order on strings: a string x is greater than a string y if x_i = 1, y_i = 0 for the first i such that x_i ≠ y_i. Equivalently, x > y if Σ_i x_i 2^{−i} > Σ_i y_i 2^{−i}, i.e. the binary numbers satisfy 0.x > 0.y.

The strings can be arranged as leaves in a tree of depth n (a parsing tree, not a coding tree!). In the tree, the order x > y of two strings means that x is to the right of y.

We need to compute the cumulative distribution F(x^n) for a string x^n, i.e. to add all p(y^n) for which y^n < x^n. There is, however, a much smarter way to perform the sum, described next. Let T_{x₁x₂...x_k} be the subtree starting with x₁x₂...x_k. The probability of the subtree is

P(T_{x₁x₂...x_k}) = Σ_{z_{k+1}...z_n} p(x₁x₂...x_k z_{k+1}...z_n) = p(x₁x₂...x_k).

The cumulative probability can therefore be computed as

F(x^n) = Σ_{y^n < x^n} p(y^n) = Σ_{T: T is to the left of x^n} P(T) = Σ_{k: x_k = 1} p(x₁x₂...x_{k−1} 0).      (2)
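A small numerical sketch of formula (2) for a Bernoulli(θ) source (illustrative Python; theta stands for p(1) and the helper names are mine):

def bernoulli_p(x, theta):
    # Probability of the binary string x under an i.i.d. Bernoulli(theta) source.
    return theta ** x.count("1") * (1 - theta) ** x.count("0")

def cumulative_left(x, theta):
    # Formula (2): one term p(x_1 ... x_{k-1} 0) for every position k with x_k = 1,
    # i.e. the probability of every subtree lying to the left of x.
    return sum(bernoulli_p(x[:k] + "0", theta)
               for k in range(len(x)) if x[k] == "1")

theta = 0.3
print(cumulative_left("0111", theta))
print((1 - theta)**2 + theta*(1 - theta)**2 + theta**2*(1 - theta)**2)   # same value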

Example: for a Bernoulli source with θ = p(1) we have

F(0111) = P(T₁) + P(T₂) + P(T₃) = p(00) + p(010) + p(0110) = (1−θ)² + θ(1−θ)² + θ²(1−θ)².

To encode the next bit of the source sequence, we need to calculate p(x₁...x_i x_{i+1}) and update F(x₁...x_i x_{i+1}). To decode the sequence, we use the same procedure to calculate p(x₁...x_i x_{i+1}) and update F(x₁...x_i x_{i+1}) for the possible values of x_{i+1}, and check when the cumulative distribution exceeds the value corresponding to the codeword.

The most used mechanisms for computing the probabilities are i.i.d. sources and Markov sources.

For i.i.d. sources:               p(x^n) = Π_{i=1}^{n} p(x_i).

For Markov sources of first order: p(x^n) = p(x₁) Π_{i=2}^{n} p(x_i | x_{i−1}).

Encoding is efficient if the distribution used by the arithmetic coder is close to the true distributions. The adaptation of the probability distribution will be discussed in a separate lecture. The implementation issues are related to the computational accuracy, buffer sizes, and speed.
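Both probability models are one-line products (an illustrative Python sketch; the probability tables passed in are arbitrary examples, not values from the lecture):

def iid_prob(x, p):
    # i.i.d. source: p(x_1 ... x_n) = product of p(x_i).
    prob = 1.0
    for s in x:
        prob *= p[s]
    return prob

def markov_prob(x, p1, p_cond):
    # First-order Markov source: p(x_1) * product of p(x_i | x_{i-1}).
    prob = p1[x[0]]
    for prev, cur in zip(x, x[1:]):
        prob *= p_cond[(prev, cur)]
    return prob

print(iid_prob("0111", {"0": 0.7, "1": 0.3}))              # 0.7 * 0.3**3
print(markov_prob("0111", {"0": 0.5, "1": 0.5},
                  {("0", "0"): 0.8, ("0", "1"): 0.2,
                   ("1", "0"): 0.4, ("1", "1"): 0.6}))     # 0.5 * 0.2 * 0.6 * 0.6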

Statistical modelling + Arithmetic coding = Modern data compression

[Block diagram: the input image feeds a statistical modeller, which supplies the cumulative distribution of the symbols; the arithmetic encoder receives the next symbol together with this distribution and produces the output stream of bits.]

Arithmetic coding example. Message: BILL GATES

Character   Probability   Range
SPACE       1/10          0.00 - 0.10
A           1/10          0.10 - 0.20
B           1/10          0.20 - 0.30
E           1/10          0.30 - 0.40
G           1/10          0.40 - 0.50
I           1/10          0.50 - 0.60
L           2/10          0.60 - 0.80
S           1/10          0.80 - 0.90
T           1/10          0.90 - 1.00

New character   Low value      High value
                0.0            1.0
B               0.2            0.3
I               0.25           0.26
L               0.256          0.258
L               0.2572         0.2576
SPACE           0.2572         0.25724
G               0.257216       0.25722
A               0.2572164      0.2572168
T               0.25721676     0.2572168
E               0.257216772    0.257216776
S               0.2572167752   0.2572167756

BILL GATES corresponds to the interval (0.2572167752, 0.2572167756) ≈ (41D8F565h · 2^−32, 41D8F567h · 2^−32), i.e. about 32 bits.

H = − Σ_i p_i log₂(p_i) = 3.12 bits / character.

Shannon: to encode BILL GATES we need at least 31.2 bits.
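The entropy figure can be verified in a couple of lines (illustrative Python):

from math import log2

probs = [0.1] * 8 + [0.2]                 # eight characters with probability 1/10, L with 2/10
H = -sum(p * log2(p) for p in probs)
print(H, 10 * H)                          # about 3.12 bits/character, 31.2 bits for the message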

Encoding principle

Set low to 0.0.
Set high to 1.0.
While there are still input symbols do
    get an input symbol
    range = high − low
    high = low + range * high_range(symbol)
    low  = low + range * low_range(symbol)
End of While
output low

Arithmetic coding: decoding principle

get encoded number
Do
    find the symbol whose range straddles the encoded number
    output the symbol
    range = symbol high value − symbol low value
    subtract symbol low value from encoded number
    divide encoded number by range
until no more symbols

Encoded number   Output symbol   Low   High   Range
0.2572167752     B               0.2   0.3    0.1
0.572167752      I               0.5   0.6    0.1
0.72167752       L               0.6   0.8    0.2
0.6083876        L               0.6   0.8    0.2
0.041938         SPACE           0.0   0.1    0.1
0.41938          G               0.4   0.5    0.1
0.1938           A               0.1   0.2    0.1
0.938            T               0.9   1.0    0.1
0.38             E               0.3   0.4    0.1
0.8              S               0.8   0.9    0.1
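The two procedures above become the following toy floating-point coder (an illustrative Python sketch; real coders use integer arithmetic with renormalization, and here the encoder returns the midpoint of the final interval rather than low so that the floating-point version decodes robustly):

RANGES = {                          # symbol -> (low_range, high_range), as in the table
    " ": (0.0, 0.1), "A": (0.1, 0.2), "B": (0.2, 0.3), "E": (0.3, 0.4),
    "G": (0.4, 0.5), "I": (0.5, 0.6), "L": (0.6, 0.8), "S": (0.8, 0.9),
    "T": (0.9, 1.0),
}

def encode(message):
    low, high = 0.0, 1.0
    for symbol in message:
        rng = high - low
        low, high = low + rng * RANGES[symbol][0], low + rng * RANGES[symbol][1]
    return (low + high) / 2             # any number inside [low, high) identifies the message

def decode(number, n_symbols):
    out = []
    for _ in range(n_symbols):
        for symbol, (lo, hi) in RANGES.items():
            if lo <= number < hi:       # the range that straddles the encoded number
                out.append(symbol)
                number = (number - lo) / (hi - lo)
                break
    return "".join(out)

code = encode("BILL GATES")
print(code)                 # about 0.2572167754, inside (0.2572167752, 0.2572167756)
print(decode(code, 10))     # BILL GATES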

References on practical implementations of arithmetic coding

Moffat, Neal and Witten (1998). Source code: ftp://munnari.oz.au/pub/arith_coder/

Witten, Neal and Cleary (1987). Source code: ftp://ftp.cpsc.ucalgary.ca/pub/projects/ar.cod/cacm-87.shar

SGN-2306 Signal Compression

5. Adaptive Models for Arithmetic Coding

5.1 Adaptive arithmetic coding
5.2 Models for data compression
5.3 Prediction by partial matching

5.1 Adaptive arithmetic coding: an example

We want to encode the data string bccb from the ternary alphabet {a, b, c}, using arithmetic coding; the decoder knows that we want to send 4 symbols. We will use an adaptive zero-order model (which only counts the frequencies of occurrence of the symbols, with no conditioning, the so-called zero memory model). To evaluate the probabilities of the symbols p(a), p(b) and p(c), we denote by N(a) the number of occurrences of the character a in the already received substring, by N(b) the number of occurrences of the character b, and similarly N(c), and then assign the probabilities

p(a) = (N(a) + 1) / (N(a) + N(b) + N(c) + 3);   p(b) = (N(b) + 1) / (N(a) + N(b) + N(c) + 3);   p(c) = (N(c) + 1) / (N(a) + N(b) + N(c) + 3).      (2)

When encoding the first b, the history available to the decoder is empty, and consequently there is no knowledge about which of the three symbols is more frequent; therefore we assume p(a) = p(b) = p(c) = 1/3, which is consistent with formula (2), since N(a) = N(b) = N(c) = 0.

The arithmetic coder will finally provide us a small subinterval of (0, 1), and virtually any number in that interval will be a codeword for the encoded string. During encoding, the original interval (0, 1) is successively reduced, and we denote by low and high the current limits of the interval (initially low = 0, high = 1).

At the first encoded symbol, b, the interval (0, 1) is split according to the probability distribution into three equal parts (see Fig. 2.8); the interval needed to specify b is (0.3333, 0.6667), therefore low becomes 0.3333 and high becomes 0.6667.

When encoding c (the second character in bccb), we have N(a) = 0, N(b) = 1, N(c) = 0 and therefore p(a) = p(c) = 1/4 and p(b) = 2/4. To specify c, the arithmetic coder will change low to 0.5834, and high will be set to 0.6667.

When encoding c (the third character in bccb), we have N(a) = 0, N(b) = 1, N(c) = 1 and therefore p(a) = 1/5 and p(b) = p(c) = 2/5. To specify c, the arithmetic coder will change low to 0.6334, and high will be set to 0.6667.

When encoding b (the fourth character in bccb), we have N(a) = 0, N(b) = 1, N(c) = 2 and therefore p(a) = 1/6, p(b) = 2/6 and p(c) = 3/6. To specify b, the arithmetic coder will change low to 0.639, and high will be set to 0.6501.

Now the encoder has found the interval representing the string bccb, and may choose whatever number in the interval (0.639, 0.6501) to be sent to the decoder. If we assume the encoding is done in decimal, 0.64 will be a suitable value to choose, and the encoder has to send the message 64.
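The interval narrowing above can be reproduced with a few lines (an illustrative Python sketch; exact rational arithmetic is used instead of the 4-digit rounding of the slide, so the final interval comes out as [0.63889, 0.65] rather than [0.639, 0.6501]):

from fractions import Fraction

def adaptive_encode(message, alphabet=("a", "b", "c")):
    # Adaptive zero-order model: p(s) = (N(s) + 1) / (total + q), as in formula (2).
    counts = {s: 0 for s in alphabet}
    low, high = Fraction(0), Fraction(1)
    for ch in message:
        total = sum(counts.values()) + len(alphabet)
        cum = Fraction(0)
        for s in alphabet:                       # walk the cumulative probabilities
            p = Fraction(counts[s] + 1, total)
            if s == ch:
                width = high - low
                low, high = low + width * cum, low + width * (cum + p)
                break
            cum += p
        counts[ch] += 1
    return low, high

low, high = adaptive_encode("bccb")
print(float(low), float(high))      # 0.6388...  0.65  (the slide rounds at every step)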

What would be the suitable number to send if we assume hexadecimal coding, knowing the hexadecimal representations of the decimal numbers involved? (What about the binary number?)

Decimal       0.3333       0.6667       0.5834       0.6334       0.639        0.6501
Hexadecimal   0.55532617   0.AAACD9E8   0.9559B3D0   0.A226809D   0.A3958106   0.A66CF41F
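The hexadecimal expansions in the table can be checked with a short helper (illustrative Python; frac_to_base is my own name):

def frac_to_base(x, base=16, digits=8):
    # First `digits` digits of the fractional part of x in the given base.
    out = ""
    for _ in range(digits):
        x *= base
        d = int(x)
        out += "0123456789ABCDEF"[d]
        x -= d
    return out

for v in (0.3333, 0.6667, 0.5834, 0.6334, 0.639, 0.6501):
    print(v, "->", frac_to_base(v))     # e.g. 0.639 -> A3958106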