Data Compression Using a Sort-Based Context Similarity Measure


HIDETOSHI YOKOO
Department of Computer Science, Gunma University, Kiryu, Gunma 376, Japan
Email: yokoo@cs.gunma-u.ac.jp

Every symbol in the data can be predicted by taking the immediately preceding symbols, or context, into account. This paper proposes a new adaptive data-compression method based on a context similarity measure. We measure the similarity of contexts using a context sorting mechanism. The aim of context sorting is to store a set of contexts in a specific order so that contexts more similar to the current one are more accessible. The proposed method predicts the next symbol by ranking the previous context-symbol pairs in order of context similarity. The codeword for the next symbol represents the rank of the symbol in this ordered sequence. The compression performance is evaluated both analytically and empirically. Although the proposed method uses no probability distribution to make a prediction, it gains good compression. It also reveals a strong relation between symbol-ranking compression and the Ziv–Lempel textual substitution method.

Received June 28, 1996; revised April 30, 1997

1. INTRODUCTION

Every symbol in the data can be predicted by taking the immediately preceding text, or context, into account. We propose a new context-based data compression method, which utilizes a similarity measure of contexts in order to predict the incoming symbol in anticipation that similar symbols occur in similar contexts. The method is universal in the sense that it requires no prior knowledge about the data, and is particularly effective for text compression.

Most previous methods for the same purpose are grouped into two classes [1]: Ziv–Lempel-type dictionary methods and statistical modelling techniques. Our method, however, belongs to neither class, and in this respect we can emphasize its novelty.¹ In most context-based methods proposed so far, whether intended for practical use, e.g. PPM [2], or for theoretical modelling, e.g. [3], both context selection and arithmetic coding are supposed to be essential components. However, the problem of adaptive context selection is so difficult that there is no widely accepted theoretical basis for it. This difficulty has led to a recent recognition [4] that the notion of context selection is not as natural as we have believed. Another problem of context-based methods lies in the use of arithmetic coding. Arithmetic coding is powerful and flexible, but more complex in time and implementation than the Ziv–Lempel algorithm. As a result, in spite of its compression performance, no context-based method has yet been commonly used in practice. In contrast, even though the proposed method shares the same idea as the context-based techniques, it assumes neither context selection nor arithmetic coding.

¹ This work was presented in preliminary form at the IEEE Data Compression Conference, Snowbird, UT, March-April 1996.

The existing method most related to the present one is the block-sorting method of Burrows and Wheeler [5] (see also [6]). What most distinguishes it from the proposed method is that the former is not adaptive while ours is on-line adaptive. The evaluation of compression performance by Burrows and Wheeler is less analytic. In addition, the present method can make direct use of an internal state of the encoder, which makes it possible to tune a code character by character.
These two methods can be classified into the family of symbol-ranking text compressors, which maintain a list of symbol candidates ranked in the order of their likelihood of appearing in every context. The symbol-ranking text compression family has been studied intensively by Fenwick [7-9], who pointed out its relation with Shannon's methods of measuring the information of English [10]. Howard and Vitter [11] also proposed a symbol-ranking mechanism to avoid explicit frequency counting in a PPM-style compression system. Apart from Shannon's work, these existing symbol-ranking compressors are motivated primarily by pragmatic demands on text compression. The main purpose of the present paper is to bring together practical and theoretical aspects of the symbol-ranking family, including its relation with the Ziv–Lempel algorithm.

The proposed method encodes one symbol at a time in principle. The encoding of every symbol consists of several operations. As an illustration of the operations, Figure 1a shows all the context-symbol pairs obtained after the sample string bacacaba has been processed.

In Figure 1, the null string of length 0 and the next symbol, which is about to be encoded, are denoted by λ and x respectively. Here, we measure the similarity between two contexts by the length of their common suffixes. In the present example, by matching the previous contexts against the current one bacacaba, we can group the existing context-symbol pairs into three equivalence classes, which are shown in Figure 1b. Then, we enumerate candidates for the next symbol in the order of the rightmost column of the figure. The ranks from 1 to 3 in the column are obtained by counting only distinct symbols. If the actual next symbol x is equal to one of these three symbols, the corresponding rank is then encoded. Otherwise, the virtual rank 4 is encoded, followed by a transmission of x as a raw symbol. For the actual encoding of ranks, we can use representations of the integers [12].

(a) context-symbol pairs after bacacaba:

    context:  λ  b  ba  bac  baca  bacac  bacaca  bacacab  bacacaba
    symbol:   b  a  c   a    c     a      b       a        x

(b) the pairs grouped by similarity to the current context bacacaba:

    similarity   context    symbol   rank
    2            ba         c        1
    1            baca       c
    1            bacaca     b        2
    0            λ          b
    0            b          a        3
    0            bac        a
    0            bacac      a
    0            bacacab    a

FIGURE 1. Ranking symbol candidates based on context similarity.

It should be noted that it is insufficient to measure the similarity of contexts only by the length of their common suffixes. We may have more than one context which shares the same similarity to the current context. This causes a problem when we wish to obtain a unique ranking of context-symbol pairs. In order to overcome the problem, we define the notion of context sorting, which gives a totally ordered set of ranks to symbol candidates. The resulting similarity measure may be regarded as a refinement of that used in our previous method [13].

In the following, we first describe context sorting with a bounded context length. It provides not only a similarity measure on contexts but also a concrete method for giving a rank to an input symbol. Following that we give a theoretical analysis of the compression performance, sample coding methods of ranks and their empirical evaluation. This analysis and evaluation show that, in spite of its simplicity, our method gives good compression performance both theoretically and practically. Finally, we make a brief remark on the possibility of improvements to the Ziv–Lempel methods by context sorting.
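As a concrete illustration of the grouping in Figure 1, the following sketch (written for this description, not code from the paper) computes the common-suffix similarity of every previous context of bacacaba and assigns ranks by counting distinct symbols in decreasing order of similarity.

```python
def common_suffix_len(a, b):
    """Similarity of two contexts: the length of their common suffix."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

processed = "bacacaba"               # symbols encoded so far
current = processed                  # context of the symbol about to be encoded
pairs = [(processed[:j], processed[j]) for j in range(len(processed))]

# Group the (context, symbol) pairs by similarity to the current context.
pairs.sort(key=lambda cs: -common_suffix_len(cs[0], current))
ranks, seen = {}, []
for context, symbol in pairs:
    if symbol not in seen:
        seen.append(symbol)
        ranks[symbol] = len(seen)
print(ranks)                         # {'c': 1, 'b': 2, 'a': 3}
```

Within a similarity class the order is still arbitrary here; the context sorting mechanism defined in the next section is what fixes it.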
2. CONTEXT SORTING WITH A BOUNDED CONTEXT LENGTH

Let

    A = {a_0, a_1, ..., a_{α−1}}                                   (1)

be an input alphabet of α symbols. An ordering relation ≺ on A is denoted by

    a_i ≺ a_j   for a_i, a_j ∈ A and 0 ≤ i < j ≤ α − 1.            (2)

Let A* denote the set of all finite length strings over A. A string x of length n on A is an ordered n-tuple x = x_1 x_2 ... x_n of symbols from A. To indicate a substring of x which starts at position i and ends at position j, we write x[i...j]. When i ≤ j, x[i...j] = x_i x_{i+1} ... x_j, but when i > j, we take x[i...j] = λ, the null string.

Let x[1...m] and y[1...n] be two strings in A*. In order to measure the similarity of the two strings, we introduce lexicographic order on reversed strings of a bounded length. For any integer M greater than 0, we write

    x[1...m] ≺_M y[1...n],                                          (3)

if either of the following two conditions holds.

1. We have x_{m−i} ≺ y_{n−i} for an integer i such that 0 ≤ i < min{m, n, M}, and x_{m−j} = y_{n−j} for any integer j such that 0 ≤ j < i.
2. We have m < n, m < M and x_{m−i} = y_{n−i} for any integer i such that 0 ≤ i < m.

When we read strings backwards, the relation ≺_M corresponds to a lexicographic order of at most M symbols. In particular, we write x[1...m] =_M y[1...n] if they have exactly the same last M symbols, and x[1...m] ≼_M y[1...n] if x[1...m] ≺_M y[1...n] or x[1...m] =_M y[1...n].

Let x[1...N] = x_1 x_2 ... x_N (x_i ∈ A) be a text being compressed. For the ith symbol x_i of x[1...N], its preceding i − 1 symbols form its context

    Cont_x(i) = x[1...i−1].                                         (4)

We define Cont_x(1) = λ. The null string λ satisfies the relation λ ≺_M x for any non-null string x.
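A direct way to check the definition is to transcribe the two conditions into a comparison routine and confirm that it coincides with ordinary string comparison of the reversed, M-truncated suffixes. The sketch below is an illustration written under that reading of the definition, not code from the paper.

```python
import random

def le_M(x, y, M):
    """x <=_M y, transcribed from conditions 1 and 2 plus equality =_M."""
    for i in range(min(len(x), len(y), M)):
        if x[-1 - i] != y[-1 - i]:
            return x[-1 - i] < y[-1 - i]          # condition 1 (or its negation)
    if len(x) < len(y) and len(x) < M:
        return True                                # condition 2
    return min(len(x), M) == min(len(y), M)        # x =_M y

def key(x, M):
    """The last (at most M) symbols of x, read backwards."""
    return x[::-1][:M]

# Sanity check: <=_M is just string order on the reversed truncated suffixes.
random.seed(1)
for _ in range(10000):
    M = random.randint(1, 4)
    x, y = ("".join(random.choice("ab") for _ in range(random.randint(0, 6)))
            for _ in range(2))
    assert le_M(x, y, M) == (key(x, M) <= key(y, M))
```

This equivalence is what allows a structure keyed by reversed suffixes (or a binary search tree kept in symmetric order, as used below) to realize the relation ≼_M.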

The goal of context sorting is, for a given M, to sort a set of contexts {Cont_x(j) | j = 1, ..., i} in ascending order of the relation ≼_M. As an example, consider the string

    x[1...8] = bacacaba,                                            (5)

which has the following eight contexts:

    Cont_x(1) = λ,       Cont_x(5) = baca,
    Cont_x(2) = b,       Cont_x(6) = bacac,
    Cont_x(3) = ba,      Cont_x(7) = bacaca,
    Cont_x(4) = bac,     Cont_x(8) = bacacab.

If we use the English alphabetic order as ≺ and take account of only the last M symbols of each context, with M = 3, then we have

    Cont_x(1) ≺_3 Cont_x(3) ≺_3 Cont_x(5) =_3 Cont_x(7) ≺_3 Cont_x(2) ≺_3 Cont_x(8) ≺_3 Cont_x(4) ≺_3 Cont_x(6).

In order to describe our compression method, assume that an initial segment x[1...i−1] of x[1...N] has already been encoded and the ith symbol x_i is about to be encoded. We further assume that i is greater than 1; the first symbol x_1 is transmitted as it is. First, sort the so-far observed contexts {Cont_x(j) | j = 1, ..., i−1} in order of ≼_M, and write the resulting sequence as

    C_1 ≺_M C_2 ≺_M ... ≺_M C_K,                                    (6)

where C_1 = Cont_x(1) = λ. Here, we assume that the sequence (6) has no equality. If equality occurs, we eliminate the older context that is =_M to the newer one. Thus, we have K ≤ min{α^M, i − 1}. This assumption is not essential, but is made in order to simplify the encoding procedure. Second, in order to find existing contexts that are similar to the current one, we search for a position p in (6) such that

    C_p ≼_M z ≼_M C_{p+1},   1 ≤ p ≤ K,                             (7)

for z = x[1...i−1]. Then, C_p or C_{p+1} (or both) is among the previous contexts most similar to the current one. By merging the lists C_p, C_{p−1}, ... and C_{p+1}, C_{p+2}, ..., we can obtain a linear sequence of context-symbol pairs sorted in order of context similarity. What we need to do to get the rank of x_i is to count the number of distinct symbols prior to x_i in this sorted sequence of context-symbol pairs. If we fix a merging method in the course of encoding and decoding, we can rank the symbol candidates uniquely.

Here, we define the most similar context (MSC) to the current context as follows. If the similarity between C_{p+1} and the current context is greater than that of C_p, then C_{p+1} is the MSC to the current context. Otherwise, C_p is defined to be the MSC to the current context. Note that the similarity between a context and its MSC may be zero because the MSC is only defined in terms of the lexicographic order of reversed contexts. Any position i except for i = 1 in the data string has a unique previous context as the MSC to its context Cont_x(i).

In our sample implementation we use a binary search tree to store context-symbol pairs in order of contexts. This is quite similar to the method for storing strings in lexicographic order [14], which was proposed as an implementation of the Ziv–Lempel data compression method. As is easily seen, symmetric order (in-order) in the tree corresponds to the order of contexts. We link the nodes in the tree by a doubly linked list in symmetric order. Two pointers which follow the list in opposite directions enable us to compute the rank of the next symbol easily. Figure 2 shows an example of the data structure, which corresponds to the sample string (5).

FIGURE 2. Sample data structure corresponding to the string (5). Bold arrows represent links of a doubly linked list.

After deciding the rank, we insert the current context-symbol pair into an appropriate position in the tree. If the current context is =_M to one of the existing contexts, we store only the current pair by deleting the older one.
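The whole procedure (keep the contexts in ≼_M order, locate the current context between C_p and C_{p+1}, and merge outwards by similarity) can be sketched with a sorted list in place of the paper's binary search tree. The tie-breaking rule in the merge (prefer the C_p side when both neighbours are equally similar) is an assumption made here so that the ranking is unique; the paper only requires that some merging rule be fixed.

```python
import bisect

def sort_key(context, M):
    """Reversed, M-truncated suffix: sorting by this key realizes <=_M."""
    return context[::-1][:M]

def similarity(a, b, M):
    """Length of the common suffix of a and b, counted up to M symbols."""
    n = 0
    while n < M and n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def rank_of_next_symbol(text, i, M):
    """Rank of text[i] given that text[:i] has already been processed.

    A sorted list stands in for the binary search tree with its in-order
    linked list; everything is rebuilt from scratch, so this sketch is not
    the incremental structure described in the text.
    """
    table = {}                                   # newest context per =_M class
    for j in range(i):
        table[sort_key(text[:j], M)] = j
    keys = sorted(table)                         # the sequence C_1, ..., C_K

    current = text[:i]
    p = bisect.bisect_right(keys, sort_key(current, M))   # C_p <= z <= C_{p+1}

    left, right = p - 1, p
    seen = []
    while left >= 0 or right < len(keys):
        sim_l = similarity(text[:table[keys[left]]], current, M) if left >= 0 else -1
        sim_r = similarity(text[:table[keys[right]]], current, M) if right < len(keys) else -1
        if sim_l >= sim_r:                       # assumed tie-break: C_p side first
            j, left = table[keys[left]], left - 1
        else:
            j, right = table[keys[right]], right + 1
        symbol = text[j]                         # symbol that followed that context
        if symbol not in seen:
            seen.append(symbol)
        if symbol == text[i]:
            return len(seen)                     # rank = distinct symbols so far
    return len(seen) + 1                         # virtual rank for a new symbol

for x in "cba":                                  # reproduces Figure 1: c->1, b->2, a->3
    print(x, rank_of_next_symbol("bacacaba" + x, 8, M=8))
```

With the incremental tree and doubly linked list of the paper, the two scan pointers start at p and move outwards exactly as left and right do here.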
In the above tree-based implementation, every node in the tree stores just a pointer to its corresponding position in the data string instead of an actual context-symbol pair. Therefore, its memory requirement is proportional to the data length; it never depends on the maximal context length M. On the other hand, we need at most O(M) time to compare two contexts, so the time complexity is O(M² log α) both in inserting a context-symbol pair into the tree and in obtaining the position p in (7), provided that the tree is reasonably balanced. Furthermore, we need extra time to get a rank by searching from the position p, which is relatively negligible because the rank actually concentrates around p. Our sample implementation with M = 8 requires about 0.3 s Kbyte⁻¹ to encode on a Sun SPARCstation 10. This is more than 20 times slower than Gzip. There still remains work to be done on reducing the time complexity.

3. ANALYTICAL BOUND ON COMPRESSION PERFORMANCE

Before describing the actual encoding of the rank, we analyse a bound on the compression performance of the proposed method. The rank r of an incoming symbol is a positive integer which takes a value in a pre-determined range.

We can, of course, make use of the range in both encoding and decoding, as will be shown in the next section. In this section, however, in order to simplify the analysis and to allow an extended alphabet of an arbitrary order, we encode r by an infinite prefix-free codeword set. It is known that we can construct a representation of the integers which has the length function

    L(r) = log r + O(log(1 + log r)).                               (8)

(All logarithms in this paper are taken to base 2. In Equation (8) we can think of O(log(1 + log r)) as O(log log r) in the usual sense of the O-notation; we have adopted the above form so that the argument inside O has a finite value even for r = 1.) The function L(r) is convex-∩ (concave) and non-decreasing in r. A typical example of such a representation is the δ code by Elias [12]. In this section, we assume that the rank r is coded by a codeword whose length is L(r) bits.

Suppose M > N for simplicity. We also assume that the data string is generated from a finite-order Markov source. Thus, there is a finite k which characterizes a probability distribution:

    Pr[X_i = a_m] = P(a_m | x[i−k ... i−1]),   m = 0, ..., α−1,      (9)

where X_i is a random variable corresponding to the ith symbol x_i. For a fixed context C = x[i−k ... i−1], define the conditional entropy H(C) by

    H(C) = − Σ_{m=0}^{α−1} p_m log p_m,                              (10)

where p_m = P(a_m | C), m = 0, ..., α−1.

Let S be the set of thus-far occurred contexts that are =_k to C, that is,

    S = {Cont_x(j) | Cont_x(j) =_k x[i−k ... i−1], 1 ≤ j < i}.

Define |S| to be the number of contexts in S. Note that, since M > N, there is no context in S that is =_M to any other context in S. The sequence obtained by ranking the existing contexts with respect to x[1...i−1] has the top |S| context-symbol pairs from S. Assume that, in this sorted sequence, the first occurrence of the symbol a_m is in the nth pair and its rank is r_i. Then, we have

    r_i ≤ n.                                                        (11)

Thus, the average rank and the first position of a_m in the sorted sequence satisfy

    E[r_i] ≤ E[n],                                                  (12)

where the expectation E is with respect to the data string. The probability that the first occurrence of a_m is in the nth place in the sequence is given by

    Pr[first occurrence of a_m is in the nth place]
        = (1 − p_m)^{n−1} p_m   for 1 ≤ n ≤ |S|,
        = (1 − p_m)^{|S|}       for |S| < n.

The size |S| increases in linear order of i, and therefore

    lim_{N→+∞} lim_{i→N} E[n] = Σ_{n=1}^{+∞} n (1 − p_m)^{n−1} p_m = 1/p_m.

Thus, we have

    lim_{N→+∞} lim_{i→N} E[r_i] ≤ 1/p_m.                             (13)

The expected length of a codeword representing a symbol a_m in the context C tends to

    E[L(r_i)] = E[log r_i + O(log(1 + log r_i))]
              ≤ log E[r_i] + O(log(1 + log E[r_i]))
              ≤ log(1/p_m) + O(log(1 + log(1/p_m)))

as i → N and N → +∞. The above inequality comes from the concavity (convexity-∩) of the logarithmic function. It follows from this that the expected codeword length per symbol in the context C is asymptotically bounded above by

    Σ_{m=0}^{α−1} p_m E[L(r_i)] ≤ − Σ_{m=0}^{α−1} p_m log p_m + Σ_{m=0}^{α−1} p_m O(log(1 + log(1/p_m)))
                                ≤ H(C) + O(log(H(C) + 1)).

This means that the average codeword length representing a symbol in any context approaches the conditional entropy within a loss of logarithmic order of itself. This depends neither on the order k nor on a specific probability distribution.
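The key step is that, when the symbols of the ranked pairs behave like independent draws with probability p_m, the first occurrence of a_m sits at expected position 1/p_m, so a code of length about log₂ r spends about −log₂ p_m bits on a_m. The following small simulation (an illustration written for this description, with an arbitrary toy distribution; it is not the paper's experiment and ignores the O(log(1 + log r)) term) makes that visible.

```python
import math, random

random.seed(0)
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}   # assumed toy distribution
symbols, weights = zip(*probs.items())

def mean_first_position(target, trials=20000):
    """Average position of the first occurrence of `target` in i.i.d. draws."""
    total = 0
    for _ in range(trials):
        n = 1
        while random.choices(symbols, weights)[0] != target:
            n += 1
        total += n
    return total / trials

entropy = -sum(p * math.log2(p) for p in probs.values())
bits = 0.0
for s, p in probs.items():
    n = mean_first_position(s)
    bits += p * math.log2(n)          # rank <= first position, cf. (11)-(13)
    print(f"{s}: E[n] = {n:5.2f},  1/p = {1/p:5.2f}")
print(f"sum of p*log2(E[n]) = {bits:.3f} bits;  H = {entropy:.3f} bits")
```

The two printed totals agree up to simulation noise, which is the leading term of the bound H(C) + O(log(H(C) + 1)) above.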
The above discussion can be applied to any extension of the alphabet. The entropy of such an extension is proportional to the order d of the extension. On the other hand, the difference between the rate attained by our method and the source entropy increases in logarithmic order of d. Therefore, by encoding a block of d symbols at a time, it is possible to achieve the entropy bound asymptotically as closely as desired.

4. COMPRESSION EXPERIMENTS

In implementing our compression method we must take two issues into account. The first issue is related to the time-consuming steps of context sorting. In Section 2 we have outlined a possible solution to the problem. We now focus on the second problem, the actual encoding of ranks. Although, as shown in the previous section, there exists a good code, good at least in the asymptotic sense, the development of practically efficient codes is another problem.

Once we have obtained the rank r of the current symbol, we convert it to a codeword of a uniquely decodable code. Let D_i denote the highest possible rank of the ith symbol x_i. The actual rank r of x_i satisfies the inequalities 1 ≤ r ≤ min{α, D_i + 1}, where D_i + 1 is a virtual rank. This range of r can also be known in the course of decoding. Denoting the range generally by

    1 ≤ r ≤ R,                                                      (14)

we give three coding methods to encode a value in this range.

TABLE 1. Various codes for the ranks from 1 to 10: for each rank r, the codewords of TGcode(r, R) with R = 10, of Method B, and of Method C in the case of (18).

Method A (Truncated γ code). The first code, called the truncated γ code or simply TGcode, has the same length function as that of the representation γ by Elias [12] in 1 ≤ r ≤ 2^⌊log R⌋ − 1. Therefore, this code approaches an equivalent of γ as R tends to +∞. A specific procedure for this code is given by:

TGcode(r, R): Let 1s (a leading 1 followed by the string s) be the standard binary representation of r. If 1 ≤ r ≤ 2^⌊log R⌋ − 1 then output 1^|s| 0 s, where 1^|s| is a concatenation of the same number of 1s as the length of s; this is actually ⌊log r⌋ bits. If 2^⌊log R⌋ ≤ r ≤ R then output 1^⌊log R⌋ PBcode(r − 2^⌊log R⌋, R − 2^⌊log R⌋ + 1).

In the above procedure, PBcode(j, J) encodes an integer j over the interval [0, J − 1] using a minimum number of bits of almost equal length (see Section A.2 in [1]). As an example, the codewords of TGcode(r, R) with R = 10 are shown in Table 1. It is easy to see that any TGcode(r, R) code is a prefix-free complete code. Elias's γ code does not share the length function of (8), and therefore Method A never attains the asymptotic bound given in the previous section. However, it matches quite well with empirical distributions of ranks, especially on text files.
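A direct transcription of the TGcode procedure is given below. The PBcode routine is an assumption: the paper only says that PBcode(j, J) uses a minimum number of bits of almost equal length, so a common phased-in binary code is used here, and the printed codewords follow from this sketch rather than being copied from Table 1.

```python
from math import floor, log2

def pbcode(j, J):
    """Phased-in binary code for j in [0, J-1] (assumed variant): the first
    2**k - J values use k-1 bits, the rest use k bits, k = ceil(log2 J)."""
    if J == 1:
        return ""
    k = floor(log2(J - 1)) + 1
    t = 2 ** k - J
    return format(j, "b").zfill(k - 1) if j < t else format(j + t, "b").zfill(k)

def tgcode(r, R):
    """Truncated gamma code for 1 <= r <= R (Method A)."""
    s = format(r, "b")[1:]                 # binary of r without its leading 1
    k = floor(log2(R))
    if r <= 2 ** k - 1:
        return "1" * len(s) + "0" + s      # gamma-equivalent codeword
    return "1" * k + pbcode(r - 2 ** k, R - 2 ** k + 1)

for r in range(1, 11):
    print(r, tgcode(r, 10))
```

For R = 10 this yields 0, 100, 101, 11000, ..., 11011 for r = 1, ..., 7 and 1110, 11110, 11111 for r = 8, 9, 10; the Kraft sum of these ten codewords is exactly 1, consistent with the claim that TGcode is a prefix-free complete code.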

Method B. We have observed a high frequency of r = 1. Figure 3 shows an example of an empirical distribution of ranks.

FIGURE 3. Empirical distribution of ranks (relative frequency against r) on the file bib in Table 2.

A proper treatment of the high frequency of r = 1 is to give a special method for the encoding of runs of r = 1. In Method B, instead, we extend this idea in a more general way. For the description of the method, we assume that the parameter M is sufficiently large. Let r_i denote the rank of x_i. For i ≥ 2 and sufficiently large M, we can show the following propositions.

LEMMA 4.1. We have r_i = 1 iff Cont_x(j), 1 ≤ j < i, is the MSC to the current context Cont_x(i) and x_i = x_j.

Proof. It is straightforward from the definitions.

LEMMA 4.2. Let Cont_x(j) be the MSC to Cont_x(i) (1 ≤ j < i). If r_i = 1, then Cont_x(j + 1) is the MSC to Cont_x(i + 1).

Proof. In this lemma, it is essential that the parameter M is sufficiently large. Since Cont_x(j) is the MSC to Cont_x(i), we have no previous context C that satisfies

    Cont_x(j) ≺_M C ≺_M Cont_x(i)                                   (15)
or
    Cont_x(i) ≺_M C ≺_M Cont_x(j).                                  (16)

Note that, for sufficiently large M, the inequalities (15) and (16) are proper and have no equality. It follows from Lemma 4.1 that r_i = 1 yields x_i = x_j, and there exists the MSC to Cont_x(i + 1), which is denoted by C x_i for a string C ∈ A*. If this MSC is not equal to Cont_x(j + 1), then it must satisfy either

    Cont_x(j + 1) ≺_M C x_i ≺_M Cont_x(i + 1)
or
    Cont_x(i + 1) ≺_M C x_i ≺_M Cont_x(j + 1).

This means that we have a context C which satisfies (15) or (16). This contradicts the assumption. Therefore, the MSC to Cont_x(i + 1) should be the context Cont_x(j + 1).

LEMMA 4.3. Let Cont_x(j) be the MSC to Cont_x(i) (1 ≤ j < i). If r_i = r_{i+1} = 1, then x[j ... j + 1] = x[i ... i + 1].

Proof. Let Cont_x(j′ + 1) denote the MSC to Cont_x(i + 1). From the condition of the lemma and from Lemma 4.1 we have both x_i = x_j and x_{i+1} = x_{j′+1}. The uniqueness of the MSC can be combined with Lemma 4.2, which yields j = j′. Thus, the proof is completed.

LEMMA 4.4. Let Cont_x(j) be the MSC to Cont_x(i) (1 ≤ j < i). If r_i = r_{i+1} = ... = r_{i+l−1} = 1 and the similarity between Cont_x(j) and Cont_x(i) is more than 0, then x[j − 1 ... j + l − 1] = x[i − 1 ... i + l − 1].

Proof. Since the similarity between Cont_x(j) and Cont_x(i) is more than 0, the two symbols x_{j−1} and x_{i−1} are identical. The equality x[j ... j + l − 1] = x[i ... i + l − 1] comes from a direct extension of Lemma 4.3. Therefore, we have x[j − 1 ... j + l − 1] = x[i − 1 ... i + l − 1].

Lemma 4.4 shows that there may exist a previous match which is longer than the corresponding run of rank 1s. For example, the string obladioblada has a sequence of ranks ..., 6, 1, 1, 1, 1, ... (the rank 6 corresponds to the second o). While the length of the run of successive 1s is 4, the length of the match oblad is 5. That is to say, we can encode a longer substring at a time when we encode a matched substring than when we encode a run of rank 1s. This observation leads to the following method.

At the ith symbol x_i it holds either that its rank r is virtual or that there exists a previous position j such that x_i = x_j and the (Cont_x(j) : x_j) pair gives the current rank r. In both cases, we first encode the rank r. After that, if the rank is virtual, then x_i is transmitted as a raw symbol. In the latter case, we find l ≥ 1 such that x[i ... i + l − 1] = x[j ... j + l − 1] and x_{i+l} ≠ x_{j+l}, and encode this longest match x[i ... i + l − 1] by its length l. Specifically, if the rank r is less than 6, it is encoded by the corresponding codeword in the second column in Table 1. For r > 5, we adopt TGcode(r − 1, R − 1). The match length l is coded by 0 for l = 1 and by a concatenation of 1 and TGcode(l − 1, +∞) for l ≥ 2. These codes are again designed in accordance with the actual frequency of ranks and lengths. The codeword for rank 1 is relatively longer than the other codewords in Table 1. This is because Method B encodes most rank-1 symbols as parts of parsed strings, which decreases the relative frequency of r = 1. In this method, the ⟨rank, length⟩ pair serves as a pointer to a previous match. This is similar to a version of the LZ77 scheme [15]. This similarity suggests a unifying approach to the improvement of the Ziv–Lempel code by context sorting. This issue will be discussed in the next section.
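The match-extension step of Method B is simple once the rank computation has supplied the previous position j. The sketch below (positions are 0-based; j = 0 is taken from the obladioblada example above rather than computed) finds the length l that would be sent together with the rank.

```python
def match_length(text, i, j):
    """Length l of the longest match x[i...i+l-1] = x[j...j+l-1], j < i."""
    l = 0
    while i + l < len(text) and text[j + l] == text[i + l]:
        l += 1
    return l

# At the second 'o' of obladioblada (i = 6) the ranked pair points back to
# the first 'o' (j = 0); the match 'oblad' has length 5, one more than the
# run of four rank-1 symbols that follows the rank-6 symbol.
print(match_length("obladioblada", 6, 0))   # -> 5
```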
Method C. We can classify the cases of r = 1 into two types. Let (C : a) and (C′ : a′) denote the first and second pairs in the sequence of context-symbol pairs sorted in order of context similarity. Namely, the context C is the MSC to the current context. For the two symbols a ∈ A and a′ ∈ A, which occurred in the contexts C and C′ respectively, we have either

    a = a′                                                          (17)
or
    a ≠ a′.                                                         (18)

We have observed that in the case of (17) the frequency of r = 1 is extremely high (typically more than 80%), while in the case of (18) the difference between the frequencies of r = 1 and r = 2 is not so large. Since we can distinguish these two cases both in encoding and decoding, we apply slightly different codes to the two cases.

In the case of (17): output TGcode(r, R) (as in Method A).

In the case of (18): represent r = 1 by 00 and r = 2 by 01; for 3 ≤ r ≤ R, output TGcode(r − 1, R − 1).

The codewords in the case of (18) with R = 10 are included in Table 1.

We have implemented our data compression scheme, which uses the above three codes, in order to investigate their practical performance. Table 2 shows compression results for M = 8 and M = +∞ on some files from the Calgary Corpus [1]. Here, compression figures are in bits per input symbol. To provide a comparative performance evaluation, the table also includes compression results of the UNIX Gzip utility, PPMC by Moffat [16] and the block-sorting method of Burrows and Wheeler [5].

TABLE 2. Compression comparisons (bit/symbol) of Methods A, B and C (with M = 8 and M = +∞), Gzip, PPMC and block sorting on Calgary Corpus files: the text files bib, book1, book2, news, paper1, paper2, progc, progl, progp and the binary files geo, obj1, obj2.

Figure 4 shows an example of the effect of the parameter M on compression performance.

FIGURE 4. Compression performance of Method A versus the parameter M on the file bib in Table 2.

Table 2 shows that, in general, Method C with M = +∞ is superior to the other proposed codes.

However, the difference in compression performance between M = 8 and M = +∞ is minor. Although the compression performance tends to improve with an increase of M, it reaches a sufficient level at a small value of M. In most files, a practically suitable value of M in Methods A and C is between 6 and 10. Even though the proposed method outperforms Gzip for sufficiently long files, it is not quite as good as the best competitors. In particular, its performance declines on non-textual data.

The above three codes are useful for showing the different natures of our method. For example, Method B reveals the relation of our system to the Ziv–Lempel algorithm. Method C is designed to show that we can use alternative codes depending on the internal state of the encoder. Despite such significance of the codes, they may still be insufficient to realize the compression power of our system. In order to improve the compression performance, we need further careful optimization in the code design.

5. PERSPECTIVE

As noted in the last paragraph of Method B, our method suggests an improvement of the Ziv–Lempel code by context sorting. We give a more detailed description of this possibility for improvement. We assume again that the parameter M is sufficiently large, as we assumed in Method B. As an example, consider the encoding of the string x[1...15] = baabaaabaabaabb. Suppose that we have already processed x[1...9] and we are going to encode an incoming string after x_9. Since the MSC to the current context is Cont_x(6) and we have x_6 = x_10 = a and x_7 ≠ x_11, Method B encodes the substring x[10...10] by the pair of r = 1 and l = 1. A version of LZ77, e.g. [17], on the other hand, searches the already-processed text x[1...9] for the longest match with the incoming string. In this case, the longest match x[10...14] is coded by the pointer pair of the previous position 7 and the match length 5 (see Figure 5).

FIGURE 5. A version of LZ77: the longest match is encoded by the pair ⟨position, length⟩.

In some other versions, e.g. [14], the match position is better represented as the offset, or the relative displacement, which indicates how far back to look into the text to find the longest match. In addition, the offset may be encoded by a sophisticated code, e.g. a Huffman code, in order to represent more frequent offsets in fewer bits. Thus, the encoding of a pointer to a previous match is one of the most refined parts of the Ziv–Lempel methods.

We also modify the encoding of a position pointer. To describe our modification, we continue with the above example. We first arrange the so-far occurring contexts in x[1...9] by context sorting and sort them in total order of similarity to the current context Cont_x(10). In Table 3, the ordered contexts are numbered serially in the rank column.

TABLE 3. Totally ordered contexts ranked by the context similarity (current context = baabaaaba)

    Similarity   Context                    Rank
    4            Cont_x(6) = baaba          1
    2            Cont_x(3) = ba             2
    1            Cont_x(7) = baabaa         3
    1            Cont_x(4) = baa            4
    1            Cont_x(8) = baabaaa        5
    0            Cont_x(1) = λ              6
    0            Cont_x(2) = b              7
    0            Cont_x(9) = baabaaab       8
    0            Cont_x(5) = baab           9

If a context Cont_x(i) has a rank r, then the position i in the previous text is converted to r, as shown in Figure 6.

FIGURE 6. Position indexes are rearranged according to their associated similarities to the current context.

In our modified version of LZ77, the longest match x[10...14] is then encoded by the ⟨3, 5⟩ pair. Thus, the difference between the current version and other versions of LZ77 lies only in the encoding of the position component of a previous match. The most remarkable point in our modification is that it assigns position indexes on a previous text in dynamically changing order. Furthermore, this approach can be applied not only to a specific version of LZ77 but also to many other variations [1] of the Ziv–Lempel code.

In order to describe the approach more generally, assume that a dictionary D of the Ziv–Lempel code can be divided into two parts: D = D_0 ∪ D_1, where D_0 is an initial state of the dictionary and D_1 is a varying set of strings parsed from the input data. Namely, any dictionary entry t ∈ D_1 can be represented by t = x[i ... i + |t| − 1] with a position i and the length |t|. For any t = x[i ... i + |t| − 1], let C(t) denote its context Cont_x(i). Assume D_1 to be an ordered set D_1 = {t_1, t_2, ..., t_{|D_1|}} and π to be a permutation that rearranges the corresponding context set {C(t_1), C(t_2), ..., C(t_{|D_1|})} to another context set which is ordered by the context similarity to the current context. Then, we use a varying dictionary D_0 ∪ πD_1 instead of D.

The above idea is in a sense a refinement of the multiple-dictionary approach to the Ziv–Lempel code (see, e.g. [18]). A multiple-dictionary method uses separate dictionaries for each previous context, which usually consists of a single character. If we increase the length of a conditioning context, we must maintain a good number of dictionaries, each of which in turn has a fewer number of terms. Thus, the increase of the context length does not necessarily contribute to the improvement of compression performance. On the other hand, our method can keep the dictionary size unchanged without depending on the context length.
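The ⟨7, 5⟩ to ⟨3, 5⟩ conversion of the example above can be traced with the following sketch, which ranks the previous positions by the similarity of their contexts to Cont_x(10) (the same outward merge as in Section 2; the tie-breaking rule is again an assumption) and then rewrites the position component of the LZ77 pair.

```python
import bisect

def similarity(a, b):
    """Length of the common suffix of a and b (M is taken to be unbounded)."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def longest_match(text, i):
    """LZ77-style longest (possibly overlapping) match for text[i:]."""
    best = (0, 0)
    for j in range(i):
        l = 0
        while i + l < len(text) and text[j + l] == text[i + l]:
            l += 1
        best = max(best, (l, j))
    return best[1], best[0]                      # (position, length), 0-based

def position_ranks(text, i):
    """Map each previous position (1-based) to its context-similarity rank."""
    keys = sorted(range(i), key=lambda j: text[:j][::-1])
    p = bisect.bisect_right([text[:j][::-1] for j in keys], text[:i][::-1])
    order, left, right = [], p - 1, p
    while left >= 0 or right < i:
        sl = similarity(text[:keys[left]], text[:i]) if left >= 0 else -1
        sr = similarity(text[:keys[right]], text[:i]) if right < i else -1
        if sl >= sr:
            order.append(keys[left]); left -= 1
        else:
            order.append(keys[right]); right += 1
    return {j + 1: r + 1 for r, j in enumerate(order)}

text = "baabaaabaabaabb"
j, l = longest_match(text, 9)                    # encoding starts at x_10
ranks = position_ranks(text, 9)                  # reproduces the order of Table 3
print((j + 1, l), "->", (ranks[j + 1], l))       # (7, 5) -> (3, 5)
```

The permutation π of the general formulation corresponds to the mapping returned by position_ranks, recomputed as the current context changes.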
6. CONCLUSIONS

We have proposed a new adaptive lossless data compression method based on the notion of context similarity. The method maintains previously seen contexts via context sorting, which assigns a similarity to each context that indicates how closely it matches the current context. In spite of the simplicity of the context sorting mechanism, the proposed method has both theoretically and practically good compression performance.

The proposed method is essentially a context-based symbol-ranking compressor. Yet it also has a strong relation to the Ziv–Lempel-type dictionary methods. We have outlined a unifying approach to the improvement of the Ziv–Lempel code by context sorting. Furthermore, we should emphasize that both in encoding and decoding we can know not only the value of the similarity rank but also an underlying state. This is in contrast to the block-sorting method, which is an off-line variant of sort-based symbol-ranking compressors. In these respects, the proposed method is essentially rich in variation, which should be explored in order to develop better encoding and a faster implementation.

ACKNOWLEDGEMENTS

The author wishes to thank M. Takahashi for his programming support. He is also grateful to A. Moffat, the Editor of this issue, who pointed out some existing work on symbol-ranking compression, and to the referees, who made several useful suggestions.

REFERENCES

[1] Bell, T. C., Cleary, J. G. and Witten, I. H. (1990) Text Compression. Prentice Hall, Englewood Cliffs, NJ.
[2] Cleary, J. G. and Witten, I. H. (1984) Data compression using adaptive coding and partial string matching. IEEE Trans. Commun., COM-32.
[3] Weinberger, M. J., Rissanen, J. J. and Feder, M. (1995) A universal finite memory source. IEEE Trans. Inform. Theory, IT-41.
[4] Willems, F. M. J., Shtarkov, Y. M. and Tjalkens, T. J. (1995) The context-tree weighting method: basic properties. IEEE Trans. Inform. Theory, IT-41.
[5] Burrows, M. and Wheeler, D. J. (1994) A Block-Sorting Lossless Data Compression Algorithm. SRC Research Report 124.
[6] Cleary, J. G., Teahan, W. J. and Witten, I. H. (1995) Unbounded length contexts for PPM. In Storer, J. A. and Cohn, M. (eds), DCC '95: Proc. Data Compression Conf., Snowbird, UT, March. IEEE Computer Society Press, Los Alamitos, CA.
[7] Fenwick, P. M. (1996) Block sorting text compression. In ACSC '96: Proc. 19th Australasian Computer Science Conf., Melbourne, Australia.
[8] Fenwick, P. M. (1996) Block-sorting text compression: Final Report. Technical Report 130, Department of Computer Science, University of Auckland, Auckland, New Zealand. Available at: ftp://auckland.ac.nz/out/peter-f/techrep130.ps.
[9] Fenwick, P. M. (1996) Symbol ranking text compression. Technical Report 132, Department of Computer Science, University of Auckland, Auckland, New Zealand. Available at: ftp://auckland.ac.nz/out/peter-f/techrep132.ps.
[10] Shannon, C. E. (1951) Prediction and entropy of printed English. Bell Syst. Tech. J., 30.
[11] Howard, P. G. and Vitter, J. S. (1993) Design and analysis of fast text compression based on quasi-arithmetic coding. In Storer, J. A. and Cohn, M. (eds), DCC '93: Proc. Data Compression Conf., Snowbird, UT. IEEE Computer Society Press, Los Alamitos, CA.
[12] Elias, P. (1975) Universal codeword sets and representations of the integers. IEEE Trans. Inform. Theory, IT-21.
[13] Yokoo, H. and Takahashi, M. (1996) Data compression by context sorting. IEICE Trans. Fundamentals, E79-A.
[14] Bell, T. C. (1986) Better OPM/L text compression. IEEE Trans. Commun., COM-34.
[15] Ziv, J. and Lempel, A. (1977) A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory, IT-23, 337-343.
[16] Moffat, A. (1990) Implementing the PPM data compression scheme. IEEE Trans. Commun., COM-38.
[17] Ziv, J. (1978) Coding theorems for individual sequences. IEEE Trans. Inform. Theory, IT-24.
[18] Hoang, D. T., Long, P. M. and Vitter, J. S. (1995) Multiple-dictionary compression using partial matching. In Storer, J. A. and Cohn, M. (eds), DCC '95: Proc. Data Compression Conf., Snowbird, UT. IEEE Computer Society Press, Los Alamitos, CA.


More information

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1 Kraft s inequality An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if N 2 l i 1 Proof: Suppose that we have a tree code. Let l max = max{l 1,...,

More information

SGN-2306 Signal Compression. 1. Simple Codes

SGN-2306 Signal Compression. 1. Simple Codes SGN-236 Signal Compression. Simple Codes. Signal Representation versus Signal Compression.2 Prefix Codes.3 Trees associated with prefix codes.4 Kraft inequality.5 A lower bound on the average length of

More information

Fault Tolerance Technique in Huffman Coding applies to Baseline JPEG

Fault Tolerance Technique in Huffman Coding applies to Baseline JPEG Fault Tolerance Technique in Huffman Coding applies to Baseline JPEG Cung Nguyen and Robert G. Redinbo Department of Electrical and Computer Engineering University of California, Davis, CA email: cunguyen,

More information

Lecture 18 April 26, 2012

Lecture 18 April 26, 2012 6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture 18 April 26, 2012 1 Overview In the last lecture we introduced the concept of implicit, succinct, and compact data structures, and

More information

CSEP 590 Data Compression Autumn Arithmetic Coding

CSEP 590 Data Compression Autumn Arithmetic Coding CSEP 590 Data Compression Autumn 2007 Arithmetic Coding Reals in Binary Any real number x in the interval [0,1) can be represented in binary as.b 1 b 2... where b i is a bit. x 0 0 1 0 1... binary representation

More information

CMPT 365 Multimedia Systems. Final Review - 1

CMPT 365 Multimedia Systems. Final Review - 1 CMPT 365 Multimedia Systems Final Review - 1 Spring 2017 CMPT365 Multimedia Systems 1 Outline Entropy Lossless Compression Shannon-Fano Coding Huffman Coding LZW Coding Arithmetic Coding Lossy Compression

More information

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS PROBABILITY AND INFORMATION THEORY Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Probability space Rules of probability

More information

ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION

ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION PAOLO FERRAGINA, IGOR NITTO, AND ROSSANO VENTURINI Abstract. One of the most famous and investigated lossless data-compression schemes is the one introduced

More information

ECE533 Digital Image Processing. Embedded Zerotree Wavelet Image Codec

ECE533 Digital Image Processing. Embedded Zerotree Wavelet Image Codec University of Wisconsin Madison Electrical Computer Engineering ECE533 Digital Image Processing Embedded Zerotree Wavelet Image Codec Team members Hongyu Sun Yi Zhang December 12, 2003 Table of Contents

More information

ELEC 515 Information Theory. Distortionless Source Coding

ELEC 515 Information Theory. Distortionless Source Coding ELEC 515 Information Theory Distortionless Source Coding 1 Source Coding Output Alphabet Y={y 1,,y J } Source Encoder Lengths 2 Source Coding Two coding requirements The source sequence can be recovered

More information

Information. Abstract. This paper presents conditional versions of Lempel-Ziv (LZ) algorithm for settings where compressor and decompressor

Information. Abstract. This paper presents conditional versions of Lempel-Ziv (LZ) algorithm for settings where compressor and decompressor Optimal Universal Lossless Compression with Side Information Yeohee Im, Student Member, IEEE and Sergio Verdú, Fellow, IEEE arxiv:707.0542v cs.it 8 Jul 207 Abstract This paper presents conditional versions

More information

ECE 587 / STA 563: Lecture 5 Lossless Compression

ECE 587 / STA 563: Lecture 5 Lossless Compression ECE 587 / STA 563: Lecture 5 Lossless Compression Information Theory Duke University, Fall 28 Author: Galen Reeves Last Modified: September 27, 28 Outline of lecture: 5. Introduction to Lossless Source

More information

Indexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile

Indexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile Indexing LZ77: The Next Step in Self-Indexing Gonzalo Navarro Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl Part I: Why Jumping off the Cliff The Past Century Self-Indexing:

More information

CS 229r Information Theory in Computer Science Feb 12, Lecture 5

CS 229r Information Theory in Computer Science Feb 12, Lecture 5 CS 229r Information Theory in Computer Science Feb 12, 2019 Lecture 5 Instructor: Madhu Sudan Scribe: Pranay Tankala 1 Overview A universal compression algorithm is a single compression algorithm applicable

More information

ASYMMETRIC NUMERAL SYSTEMS: ADDING FRACTIONAL BITS TO HUFFMAN CODER

ASYMMETRIC NUMERAL SYSTEMS: ADDING FRACTIONAL BITS TO HUFFMAN CODER ASYMMETRIC NUMERAL SYSTEMS: ADDING FRACTIONAL BITS TO HUFFMAN CODER Huffman coding Arithmetic coding fast, but operates on integer number of bits: approximates probabilities with powers of ½, getting inferior

More information

Autumn Coping with NP-completeness (Conclusion) Introduction to Data Compression

Autumn Coping with NP-completeness (Conclusion) Introduction to Data Compression Autumn Coping with NP-completeness (Conclusion) Introduction to Data Compression Kirkpatrick (984) Analogy from thermodynamics. The best crystals are found by annealing. First heat up the material to let

More information

Coding for Discrete Source

Coding for Discrete Source EGR 544 Communication Theory 3. Coding for Discrete Sources Z. Aliyazicioglu Electrical and Computer Engineering Department Cal Poly Pomona Coding for Discrete Source Coding Represent source data effectively

More information

CSEP 590 Data Compression Autumn Dictionary Coding LZW, LZ77

CSEP 590 Data Compression Autumn Dictionary Coding LZW, LZ77 CSEP 590 Data Compression Autumn 2007 Dictionary Coding LZW, LZ77 Dictionary Coding Does not use statistical knowledge of data. Encoder: As the input is processed develop a dictionary and transmit the

More information

Multimedia Communications. Mathematical Preliminaries for Lossless Compression

Multimedia Communications. Mathematical Preliminaries for Lossless Compression Multimedia Communications Mathematical Preliminaries for Lossless Compression What we will see in this chapter Definition of information and entropy Modeling a data source Definition of coding and when

More information

Data Compression. Limit of Information Compression. October, Examples of codes 1

Data Compression. Limit of Information Compression. October, Examples of codes 1 Data Compression Limit of Information Compression Radu Trîmbiţaş October, 202 Outline Contents Eamples of codes 2 Kraft Inequality 4 2. Kraft Inequality............................ 4 2.2 Kraft inequality

More information

Fibonacci Coding for Lossless Data Compression A Review

Fibonacci Coding for Lossless Data Compression A Review RESEARCH ARTICLE OPEN ACCESS Fibonacci Coding for Lossless Data Compression A Review Ezhilarasu P Associate Professor Department of Computer Science and Engineering Hindusthan College of Engineering and

More information

A Faster Grammar-Based Self-Index

A Faster Grammar-Based Self-Index A Faster Grammar-Based Self-Index Travis Gagie 1 Pawe l Gawrychowski 2 Juha Kärkkäinen 3 Yakov Nekrich 4 Simon Puglisi 5 Aalto University Max-Planck-Institute für Informatik University of Helsinki University

More information

CHAPTER 8 COMPRESSION ENTROPY ESTIMATION OF HEART RATE VARIABILITY AND COMPUTATION OF ITS RENORMALIZED ENTROPY

CHAPTER 8 COMPRESSION ENTROPY ESTIMATION OF HEART RATE VARIABILITY AND COMPUTATION OF ITS RENORMALIZED ENTROPY 108 CHAPTER 8 COMPRESSION ENTROPY ESTIMATION OF HEART RATE VARIABILITY AND COMPUTATION OF ITS RENORMALIZED ENTROPY 8.1 INTRODUCTION Klimontovich s S-theorem offers an approach to compare two different

More information

Lecture 3: Error Correcting Codes

Lecture 3: Error Correcting Codes CS 880: Pseudorandomness and Derandomization 1/30/2013 Lecture 3: Error Correcting Codes Instructors: Holger Dell and Dieter van Melkebeek Scribe: Xi Wu In this lecture we review some background on error

More information

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT COMP9319 Web Data Compression and Search Lecture 2: daptive Huffman, BWT 1 Original readings Login to your cse account:! cd ~cs9319/papers! Original readings of each lecture will be placed there. 2 Course

More information

4. Quantization and Data Compression. ECE 302 Spring 2012 Purdue University, School of ECE Prof. Ilya Pollak

4. Quantization and Data Compression. ECE 302 Spring 2012 Purdue University, School of ECE Prof. Ilya Pollak 4. Quantization and Data Compression ECE 32 Spring 22 Purdue University, School of ECE Prof. What is data compression? Reducing the file size without compromising the quality of the data stored in the

More information

Binary Convolutional Codes of High Rate Øyvind Ytrehus

Binary Convolutional Codes of High Rate Øyvind Ytrehus Binary Convolutional Codes of High Rate Øyvind Ytrehus Abstract The function N(r; ; d free ), defined as the maximum n such that there exists a binary convolutional code of block length n, dimension n

More information