Data Compression Using a Sort-Based Context Similarity Measure


HIDETOSHI YOKOO
Department of Computer Science, Gunma University, Kiryu, Gunma 376, Japan
Email: yokoo@cs.gunma-u.ac.jp

Every symbol in the data can be predicted by taking the immediately preceding symbols, or context, into account. This paper proposes a new adaptive data-compression method based on a context similarity measure. We measure the similarity of contexts using a context sorting mechanism. The aim of context sorting is to store a set of contexts in a specific order so that contexts more similar to the current one are more accessible. The proposed method predicts the next symbol by ranking the previous context-symbol pairs in order of context similarity. The codeword for the next symbol represents the rank of the symbol in this ordered sequence. The compression performance is evaluated both analytically and empirically. Although the proposed method uses no probability distribution to make a prediction, it gains good compression. It also reveals a strong relation between symbol-ranking compression and the Ziv–Lempel textual substitution method.

Received June 28, 1996; revised April 30, 1997

1. INTRODUCTION

Every symbol in the data can be predicted by taking the immediately preceding text, or context, into account. We propose a new context-based data compression method, which utilizes a similarity measure of contexts in order to predict the incoming symbol in anticipation that similar symbols occur in similar contexts. The method is universal in the sense that it requires no prior knowledge about the data, and is particularly effective for text compression.

Most previous methods for the same purpose are grouped into two classes [1]: Ziv–Lempel-type dictionary methods and statistical modelling techniques. Our method, however, belongs to neither class, and in this respect we can emphasize its novelty.¹ In most context-based methods proposed so far, whether intended for practical use, e.g. PPM [2], or for theoretical modelling, e.g. [3], both context selection and arithmetic coding are supposed to be essential components. However, the problem of adaptive context selection is so difficult that there is no widely accepted theoretical basis for it. This difficulty has led to a recent recognition [4] that the notion of context selection is not as natural as we have believed. Another problem of context-based methods lies in the use of arithmetic coding. Arithmetic coding is powerful and flexible, but more complex in time and implementation than the Ziv–Lempel algorithm. As a result, in spite of its compression performance, no context-based method has yet been commonly used in practice. In contrast, even though the proposed method shares the same idea as the context-based techniques, it assumes neither context selection nor arithmetic coding.

¹ This work was presented in preliminary form at the IEEE Data Compression Conference, Snowbird, UT, March-April 1996.

The existing method most related to the present one is the block-sorting method of Burrows and Wheeler [5] (see also [6]). What most distinguishes it from the proposed method is that the former is not adaptive while ours is on-line adaptive. The evaluation of compression performance by Burrows and Wheeler is less analytic. In addition, the present method can make direct use of an internal state of the encoder, which makes it possible to tune a code character by character.
These two methods can be classified into the family of symbol-ranking text compressors, which maintain a list of symbol candidates ranked in the order of their likelihood of appearing in every context. The symbol-ranking text compression family has been studied intensively by Fenwick [7-9], who pointed out its relation with Shannon's methods of measuring the information of English [10]. Howard and Vitter [11] also proposed a symbol-ranking mechanism to avoid explicit frequency counting in a PPM-style compression system. Apart from Shannon's work, these existing symbol-ranking compressors are motivated primarily by pragmatic demands on text compression. The main purpose of the present paper is to bring together practical and theoretical aspects of the symbol-ranking family, including its relation with the Ziv–Lempel algorithm.

The proposed method encodes one symbol at a time in principle. The encoding of every symbol consists of several operations. As an illustration of the operations, Figure 1a shows all the context-symbol pairs obtained after the sample string bacacaba has been processed.

In Figure 1, the null string of length 0 and the next symbol, which is about to be encoded, are denoted by λ and x respectively. Here, we measure the similarity between two contexts by the length of their common suffixes. In the present example, by matching the previous contexts against the current one bacacaba, we can group the existing context-symbol pairs into three equivalence classes, which are shown in Figure 1b. Then, we enumerate candidates for the next symbol in the order of the rightmost column of the figure. The ranks from 1 to 3 in the column are obtained by counting only distinct symbols. If the actual next symbol x is equal to one of these three symbols, the corresponding rank is then encoded. Otherwise, the virtual rank 4 is encoded, followed by a transmission of x as a raw symbol. For the actual encoding of ranks, we can use representations of the integers [12].

(a) context-symbol pairs after bacacaba:

    context:  λ  b  ba  bac  baca  bacac  bacaca  bacacab  bacacaba
    symbol:   b  a  c   a    c     a      b       a        x

(b) the pairs grouped by similarity to the current context bacacaba:

    similarity   context    symbol   rank
    2            ba         c        1
    1            baca       c
    1            bacaca     b        2
    0            λ          b
    0            b          a        3
    0            bac        a
    0            bacac      a
    0            bacacab    a

FIGURE 1. Ranking symbol candidates based on context similarity.

It should be noted that it is insufficient to measure the similarity of contexts only by the length of their common suffixes. We may have more than one context which shares the same similarity to the current context. This causes a problem when we wish to obtain a unique ranking of context-symbol pairs. In order to overcome the problem, we define the notion of context sorting, which gives a totally ordered set of ranks to symbol candidates. The resulting similarity measure may be regarded as a refinement of that used in our previous method [13].

In the following, we first describe context sorting with a bounded context length. It provides not only a similarity measure on contexts but also a concrete method for giving a rank to an input symbol. Following that we give a theoretical analysis of the compression performance, sample coding methods of ranks and their empirical evaluation. This analysis and evaluation show that, in spite of its simplicity, our method gives good compression performance both theoretically and practically. Finally, we make a brief remark on the possibility of improvements to the Ziv–Lempel methods by context sorting.
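As a concrete illustration of the grouping in Figure 1, the following sketch (written for this description, not code from the paper) computes the common-suffix similarity of every previous context of bacacaba and assigns ranks by counting distinct symbols in decreasing order of similarity.

```python
def common_suffix_len(a, b):
    """Similarity of two contexts: the length of their common suffix."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

processed = "bacacaba"               # symbols encoded so far
current = processed                  # context of the symbol about to be encoded
pairs = [(processed[:j], processed[j]) for j in range(len(processed))]

# Group the (context, symbol) pairs by similarity to the current context.
pairs.sort(key=lambda cs: -common_suffix_len(cs[0], current))
ranks, seen = {}, []
for context, symbol in pairs:
    if symbol not in seen:
        seen.append(symbol)
        ranks[symbol] = len(seen)
print(ranks)                         # {'c': 1, 'b': 2, 'a': 3}
```

Within a similarity class the order is still arbitrary here; the context sorting mechanism defined in the next section is what fixes it.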
2. CONTEXT SORTING WITH A BOUNDED CONTEXT LENGTH

Let

    A = {a_0, a_1, ..., a_{α−1}}                                   (1)

be an input alphabet of α symbols. An ordering relation ≺ on A is denoted by

    a_i ≺ a_j   for a_i, a_j ∈ A and 0 ≤ i < j ≤ α − 1.            (2)

Let A* denote the set of all finite length strings over A. A string x of length n on A is an ordered n-tuple x = x_1 x_2 ... x_n of symbols from A. To indicate a substring of x which starts at position i and ends at position j, we write x[i...j]. When i ≤ j, x[i...j] = x_i x_{i+1} ... x_j, but when i > j, we take x[i...j] = λ, the null string.

Let x[1...m] and y[1...n] be two strings in A*. In order to measure the similarity of the two strings, we introduce lexicographic order on reversed strings of a bounded length. For any integer M greater than 0, we write

    x[1...m] ≺_M y[1...n],                                          (3)

if either of the following two conditions holds.

1. We have x_{m−i} ≺ y_{n−i} for an integer i such that 0 ≤ i < min{m, n, M}, and x_{m−j} = y_{n−j} for any integer j such that 0 ≤ j < i.
2. We have m < n, m < M and x_{m−i} = y_{n−i} for any integer i such that 0 ≤ i < m.

When we read strings backwards, the relation ≺_M corresponds to a lexicographic order of at most M symbols. In particular, we write x[1...m] =_M y[1...n] if they have exactly the same last M symbols, and x[1...m] ≼_M y[1...n] if x[1...m] ≺_M y[1...n] or x[1...m] =_M y[1...n].

Let x[1...N] = x_1 x_2 ... x_N (x_i ∈ A) be a text being compressed. For the ith symbol x_i of x[1...N], its preceding i − 1 symbols form its context

    Cont_x(i) = x[1...i−1].                                         (4)

We define Cont_x(1) = λ. The null string λ satisfies the relation λ ≺_M x for any non-null string x.
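A direct way to check the definition is to transcribe the two conditions into a comparison routine and confirm that it coincides with ordinary string comparison of the reversed, M-truncated suffixes. The sketch below is an illustration written under that reading of the definition, not code from the paper.

```python
import random

def le_M(x, y, M):
    """x <=_M y, transcribed from conditions 1 and 2 plus equality =_M."""
    for i in range(min(len(x), len(y), M)):
        if x[-1 - i] != y[-1 - i]:
            return x[-1 - i] < y[-1 - i]          # condition 1 (or its negation)
    if len(x) < len(y) and len(x) < M:
        return True                                # condition 2
    return min(len(x), M) == min(len(y), M)        # x =_M y

def key(x, M):
    """The last (at most M) symbols of x, read backwards."""
    return x[::-1][:M]

# Sanity check: <=_M is just string order on the reversed truncated suffixes.
random.seed(1)
for _ in range(10000):
    M = random.randint(1, 4)
    x, y = ("".join(random.choice("ab") for _ in range(random.randint(0, 6)))
            for _ in range(2))
    assert le_M(x, y, M) == (key(x, M) <= key(y, M))
```

This equivalence is what allows a structure keyed by reversed suffixes (or a binary search tree kept in symmetric order, as used below) to realize the relation ≼_M.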

The goal of context sorting is, for a given M, to sort a set of contexts {Cont_x(j) | j = 1, ..., i} in ascending order of the relation ≼_M. As an example, consider the string

    x[1...8] = bacacaba,                                            (5)

which has the following eight contexts:

    Cont_x(1) = λ,       Cont_x(5) = baca,
    Cont_x(2) = b,       Cont_x(6) = bacac,
    Cont_x(3) = ba,      Cont_x(7) = bacaca,
    Cont_x(4) = bac,     Cont_x(8) = bacacab.

If we use the English alphabetic order as ≺ and take account of only the last M symbols of each context, with M = 3, then we have

    Cont_x(1) ≺_3 Cont_x(3) ≺_3 Cont_x(5) =_3 Cont_x(7) ≺_3 Cont_x(2) ≺_3 Cont_x(8) ≺_3 Cont_x(4) ≺_3 Cont_x(6).

In order to describe our compression method, assume that an initial segment x[1...i−1] of x[1...N] has already been encoded and the ith symbol x_i is about to be encoded. We further assume that i is greater than 1; the first symbol x_1 is transmitted as it is. First, sort the so-far observed contexts {Cont_x(j) | j = 1, ..., i−1} in order of ≼_M, and write the resulting sequence as

    C_1 ≺_M C_2 ≺_M ... ≺_M C_K,                                    (6)

where C_1 = Cont_x(1) = λ. Here, we assume that the sequence (6) has no equality. If equality occurs, we eliminate the older context that is =_M to the newer one. Thus, we have K ≤ min{α^M, i − 1}. This assumption is not essential, but is made in order to simplify the encoding procedure. Second, in order to find existing contexts that are similar to the current one, we search for a position p in (6) such that

    C_p ≼_M z ≼_M C_{p+1},   1 ≤ p ≤ K,                             (7)

for z = x[1...i−1]. Then, C_p or C_{p+1} (or both) is among the previous contexts most similar to the current one. By merging the lists C_p, C_{p−1}, ... and C_{p+1}, C_{p+2}, ..., we can obtain a linear sequence of context-symbol pairs sorted in order of context similarity. What we need to do to get the rank of x_i is to count the number of distinct symbols prior to x_i in this sorted sequence of context-symbol pairs. If we fix a merging method in the course of encoding and decoding, we can rank the symbol candidates uniquely.

Here, we define the most similar context (MSC) to the current context as follows. If the similarity between C_{p+1} and the current context is greater than that of C_p, then C_{p+1} is the MSC to the current context. Otherwise, C_p is defined to be the MSC to the current context. Note that the similarity between a context and its MSC may be zero because the MSC is only defined in terms of the lexicographic order of reversed contexts. Any position i except for i = 1 in the data string has a unique previous context as the MSC to its context Cont_x(i).

In our sample implementation we use a binary search tree to store context-symbol pairs in order of contexts. This is quite similar to the method for storing strings in lexicographic order [14], which was proposed as an implementation of the Ziv–Lempel data compression method. As is easily seen, symmetric order (in-order) in the tree corresponds to the order of contexts. We link the nodes in the tree by a doubly linked list in symmetric order. Two pointers which follow the list in opposite directions enable us to compute the rank of the next symbol easily. Figure 2 shows an example of the data structure, which corresponds to the sample string (5).

FIGURE 2. Sample data structure corresponding to the string (5). Bold arrows represent links of a doubly linked list.

After deciding the rank, we insert the current context-symbol pair into an appropriate position in the tree. If the current context is =_M to one of the existing contexts, we store only the current pair by deleting the older one.
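The whole procedure (keep the contexts in ≼_M order, locate the current context between C_p and C_{p+1}, and merge outwards by similarity) can be sketched with a sorted list in place of the paper's binary search tree. The tie-breaking rule in the merge (prefer the C_p side when both neighbours are equally similar) is an assumption made here so that the ranking is unique; the paper only requires that some merging rule be fixed.

```python
import bisect

def sort_key(context, M):
    """Reversed, M-truncated suffix: sorting by this key realizes <=_M."""
    return context[::-1][:M]

def similarity(a, b, M):
    """Length of the common suffix of a and b, counted up to M symbols."""
    n = 0
    while n < M and n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def rank_of_next_symbol(text, i, M):
    """Rank of text[i] given that text[:i] has already been processed.

    A sorted list stands in for the binary search tree with its in-order
    linked list; everything is rebuilt from scratch, so this sketch is not
    the incremental structure described in the text.
    """
    table = {}                                   # newest context per =_M class
    for j in range(i):
        table[sort_key(text[:j], M)] = j
    keys = sorted(table)                         # the sequence C_1, ..., C_K

    current = text[:i]
    p = bisect.bisect_right(keys, sort_key(current, M))   # C_p <= z <= C_{p+1}

    left, right = p - 1, p
    seen = []
    while left >= 0 or right < len(keys):
        sim_l = similarity(text[:table[keys[left]]], current, M) if left >= 0 else -1
        sim_r = similarity(text[:table[keys[right]]], current, M) if right < len(keys) else -1
        if sim_l >= sim_r:                       # assumed tie-break: C_p side first
            j, left = table[keys[left]], left - 1
        else:
            j, right = table[keys[right]], right + 1
        symbol = text[j]                         # symbol that followed that context
        if symbol not in seen:
            seen.append(symbol)
        if symbol == text[i]:
            return len(seen)                     # rank = distinct symbols so far
    return len(seen) + 1                         # virtual rank for a new symbol

for x in "cba":                                  # reproduces Figure 1: c->1, b->2, a->3
    print(x, rank_of_next_symbol("bacacaba" + x, 8, M=8))
```

With the incremental tree and doubly linked list of the paper, the two scan pointers start at p and move outwards exactly as left and right do here.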
In the above tree-based implementation, every node in the tree stores just a pointer to its corresponding position in the data string instead of an actual context-symbol pair. Therefore, its memory requirement is proportional to the data length; it never depends on the maximal context length M. On the other hand, we need at most O(M) time to compare two contexts, so the time complexity is O(M² log α) both in inserting a context-symbol pair into the tree and in obtaining the position p in (7), provided that the tree is reasonably balanced. Furthermore, we need extra time to get a rank by searching from the position p, which is relatively negligible because the rank actually concentrates around p. Our sample implementation with M = 8 requires about 0.3 s Kbyte⁻¹ to encode on a Sun SPARCstation 10. This is more than 20 times slower than Gzip. There still remains work to be done on reducing the time complexity.

3. ANALYTICAL BOUND ON COMPRESSION PERFORMANCE

Before describing the actual encoding of the rank, we analyse a bound on the compression performance of the proposed method. The rank r of an incoming symbol is a positive integer which takes a value in a pre-determined range.

We can, of course, make use of the range in both encoding and decoding, as will be shown in the next section. In this section, however, in order to simplify the analysis and to allow an extended alphabet of an arbitrary order, we encode r by an infinite prefix-free codeword set. It is known that we can construct a representation of the integers which has the length function

    L(r) = log r + O(log(1 + log r)).                               (8)

(All logarithms in this paper are taken to base 2. In Equation (8) we can think of O(log(1 + log r)) as O(log log r) in the usual sense of the O-notation; we have adopted the above form so that the argument inside O has a finite value even for r = 1.) The function L(r) is convex-∩ (concave) and non-decreasing in r. A typical example of such a representation is the δ code by Elias [12]. In this section, we assume that the rank r is coded by a codeword whose length is L(r) bits.

Suppose M > N for simplicity. We also assume that the data string is generated from a finite-order Markov source. Thus, there is a finite k which characterizes a probability distribution:

    Pr[X_i = a_m] = P(a_m | x[i−k ... i−1]),   m = 0, ..., α−1,      (9)

where X_i is a random variable corresponding to the ith symbol x_i. For a fixed context C = x[i−k ... i−1], define the conditional entropy H(C) by

    H(C) = − Σ_{m=0}^{α−1} p_m log p_m,                              (10)

where p_m = P(a_m | C), m = 0, ..., α−1.

Let S be the set of thus-far occurred contexts that are =_k to C, that is,

    S = {Cont_x(j) | Cont_x(j) =_k x[i−k ... i−1], 1 ≤ j < i}.

Define |S| to be the number of contexts in S. Note that, since M > N, there is no context in S that is =_M to any other context in S. The sequence obtained by ranking the existing contexts with respect to x[1...i−1] has the top |S| context-symbol pairs from S. Assume that, in this sorted sequence, the first occurrence of the symbol a_m is in the nth pair and its rank is r_i. Then, we have

    r_i ≤ n.                                                        (11)

Thus, the average rank and the first position of a_m in the sorted sequence satisfy

    E[r_i] ≤ E[n],                                                  (12)

where the expectation E is with respect to the data string. The probability that the first occurrence of a_m is in the nth place in the sequence is given by

    Pr[first occurrence of a_m is in the nth place]
        = (1 − p_m)^{n−1} p_m   for 1 ≤ n ≤ |S|,
        = (1 − p_m)^{|S|}       for |S| < n.

The size |S| increases in linear order of i, and therefore

    lim_{N→+∞} lim_{i→N} E[n] = Σ_{n=1}^{+∞} n (1 − p_m)^{n−1} p_m = 1/p_m.

Thus, we have

    lim_{N→+∞} lim_{i→N} E[r_i] ≤ 1/p_m.                             (13)

The expected length of a codeword representing a symbol a_m in the context C tends to

    E[L(r_i)] = E[log r_i + O(log(1 + log r_i))]
              ≤ log E[r_i] + O(log(1 + log E[r_i]))
              ≤ log(1/p_m) + O(log(1 + log(1/p_m)))

as i → N and N → +∞. The above inequality comes from the concavity (convexity-∩) of the logarithmic function. It follows from this that the expected codeword length per symbol in the context C is asymptotically bounded above by

    Σ_{m=0}^{α−1} p_m E[L(r_i)] ≤ − Σ_{m=0}^{α−1} p_m log p_m + Σ_{m=0}^{α−1} p_m O(log(1 + log(1/p_m)))
                                ≤ H(C) + O(log(H(C) + 1)).

This means that the average codeword length representing a symbol in any context approaches the conditional entropy within a loss of logarithmic order of itself. This depends neither on the order k nor on a specific probability distribution.
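The key step is that, when the symbols of the ranked pairs behave like independent draws with probability p_m, the first occurrence of a_m sits at expected position 1/p_m, so a code of length about log₂ r spends about −log₂ p_m bits on a_m. The following small simulation (an illustration written for this description, with an arbitrary toy distribution; it is not the paper's experiment and ignores the O(log(1 + log r)) term) makes that visible.

```python
import math, random

random.seed(0)
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}   # assumed toy distribution
symbols, weights = zip(*probs.items())

def mean_first_position(target, trials=20000):
    """Average position of the first occurrence of `target` in i.i.d. draws."""
    total = 0
    for _ in range(trials):
        n = 1
        while random.choices(symbols, weights)[0] != target:
            n += 1
        total += n
    return total / trials

entropy = -sum(p * math.log2(p) for p in probs.values())
bits = 0.0
for s, p in probs.items():
    n = mean_first_position(s)
    bits += p * math.log2(n)          # rank <= first position, cf. (11)-(13)
    print(f"{s}: E[n] = {n:5.2f},  1/p = {1/p:5.2f}")
print(f"sum of p*log2(E[n]) = {bits:.3f} bits;  H = {entropy:.3f} bits")
```

The two printed totals agree up to simulation noise, which is the leading term of the bound H(C) + O(log(H(C) + 1)) above.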
The above discussion can be applied to any extension of the alphabet. The entropy of such an extension is proportional to the order d of the extension. On the other hand, the difference between the rate attained by our method and the source entropy increases in logarithmic order of d. Therefore, by encoding a block of d symbols at a time, it is possible to achieve the entropy bound asymptotically as closely as desired.

4. COMPRESSION EXPERIMENTS

In implementing our compression method we must take two issues into account. The first issue is related to the time-consuming steps of context sorting. In Section 2 we have outlined a possible solution to the problem. We now focus on the second problem, the actual encoding of ranks. Although, as shown in the previous section, there exists a good code, good at least in the asymptotic sense, the development of practically efficient codes is another problem.

Once we have obtained the rank r of the current symbol, we convert it to a codeword of a uniquely decodable code. Let D_i denote the highest possible rank of the ith symbol x_i. The actual rank r of x_i satisfies the inequalities 1 ≤ r ≤ min{α, D_i + 1}, where D_i + 1 is a virtual rank. This range of r can also be known in the course of decoding. Denoting the range generally by

    1 ≤ r ≤ R,                                                      (14)

we give three coding methods to encode a value in this range.

TABLE 1. Various codes for the ranks from 1 to 10: for each rank r, the codewords of TGcode(r, R) with R = 10, of Method B, and of Method C in the case of (18).

Method A (Truncated γ code). The first code, called the truncated γ code or simply TGcode, has the same length function as that of the representation γ by Elias [12] in 1 ≤ r ≤ 2^⌊log R⌋ − 1. Therefore, this code approaches an equivalent of γ as R tends to +∞. A specific procedure for this code is given by:

TGcode(r, R): Let 1s (a leading 1 followed by the string s) be the standard binary representation of r. If 1 ≤ r ≤ 2^⌊log R⌋ − 1 then output 1^|s| 0 s, where 1^|s| is a concatenation of the same number of 1s as the length of s; this is actually ⌊log r⌋ bits. If 2^⌊log R⌋ ≤ r ≤ R then output 1^⌊log R⌋ PBcode(r − 2^⌊log R⌋, R − 2^⌊log R⌋ + 1).

In the above procedure, PBcode(j, J) encodes an integer j over the interval [0, J − 1] using a minimum number of bits of almost equal length (see Section A.2 in [1]). As an example, the codewords of TGcode(r, R) with R = 10 are shown in Table 1. It is easy to see that any TGcode(r, R) code is a prefix-free complete code. Elias's γ code does not share the length function of (8), and therefore Method A never attains the asymptotic bound given in the previous section. However, it matches quite well with empirical distributions of ranks, especially on text files.
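A direct transcription of the TGcode procedure is given below. The PBcode routine is an assumption: the paper only says that PBcode(j, J) uses a minimum number of bits of almost equal length, so a common phased-in binary code is used here, and the printed codewords follow from this sketch rather than being copied from Table 1.

```python
from math import floor, log2

def pbcode(j, J):
    """Phased-in binary code for j in [0, J-1] (assumed variant): the first
    2**k - J values use k-1 bits, the rest use k bits, k = ceil(log2 J)."""
    if J == 1:
        return ""
    k = floor(log2(J - 1)) + 1
    t = 2 ** k - J
    return format(j, "b").zfill(k - 1) if j < t else format(j + t, "b").zfill(k)

def tgcode(r, R):
    """Truncated gamma code for 1 <= r <= R (Method A)."""
    s = format(r, "b")[1:]                 # binary of r without its leading 1
    k = floor(log2(R))
    if r <= 2 ** k - 1:
        return "1" * len(s) + "0" + s      # gamma-equivalent codeword
    return "1" * k + pbcode(r - 2 ** k, R - 2 ** k + 1)

for r in range(1, 11):
    print(r, tgcode(r, 10))
```

For R = 10 this yields 0, 100, 101, 11000, ..., 11011 for r = 1, ..., 7 and 1110, 11110, 11111 for r = 8, 9, 10; the Kraft sum of these ten codewords is exactly 1, consistent with the claim that TGcode is a prefix-free complete code.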

Method B. We have observed a high frequency of r = 1. Figure 3 shows an example of an empirical distribution of ranks.

FIGURE 3. Empirical distribution of ranks (relative frequency against r) on the file bib in Table 2.

A proper treatment of the high frequency of r = 1 is to give a special method for the encoding of runs of r = 1. In Method B, instead, we extend this idea in a more general way. For the description of the method, we assume that the parameter M is sufficiently large. Let r_i denote the rank of x_i. For i ≥ 2 and sufficiently large M, we can show the following propositions.

LEMMA 4.1. We have r_i = 1 iff Cont_x(j), 1 ≤ j < i, is the MSC to the current context Cont_x(i) and x_i = x_j.

Proof. It is straightforward from the definitions.

LEMMA 4.2. Let Cont_x(j) be the MSC to Cont_x(i) (1 ≤ j < i). If r_i = 1, then Cont_x(j + 1) is the MSC to Cont_x(i + 1).

Proof. In this lemma, it is essential that the parameter M is sufficiently large. Since Cont_x(j) is the MSC to Cont_x(i), we have no previous context C that satisfies

    Cont_x(j) ≺_M C ≺_M Cont_x(i)                                   (15)
or
    Cont_x(i) ≺_M C ≺_M Cont_x(j).                                  (16)

Note that, for sufficiently large M, the inequalities (15) and (16) are proper and have no equality. It follows from Lemma 4.1 that r_i = 1 yields x_i = x_j, and there exists the MSC to Cont_x(i + 1), which is denoted by C x_i for a string C ∈ A*. If this MSC is not equal to Cont_x(j + 1), then it must satisfy either

    Cont_x(j + 1) ≺_M C x_i ≺_M Cont_x(i + 1)
or
    Cont_x(i + 1) ≺_M C x_i ≺_M Cont_x(j + 1).

This means that we have a context C which satisfies (15) or (16). This contradicts the assumption. Therefore, the MSC to Cont_x(i + 1) should be the context Cont_x(j + 1).

LEMMA 4.3. Let Cont_x(j) be the MSC to Cont_x(i) (1 ≤ j < i). If r_i = r_{i+1} = 1, then x[j ... j + 1] = x[i ... i + 1].

Proof. Let Cont_x(j′ + 1) denote the MSC to Cont_x(i + 1). From the condition of the lemma and from Lemma 4.1 we have both x_i = x_j and x_{i+1} = x_{j′+1}. The uniqueness of the MSC can be combined with Lemma 4.2, which yields j = j′. Thus, the proof is completed.

LEMMA 4.4. Let Cont_x(j) be the MSC to Cont_x(i) (1 ≤ j < i). If r_i = r_{i+1} = ... = r_{i+l−1} = 1 and the similarity between Cont_x(j) and Cont_x(i) is more than 0, then x[j − 1 ... j + l − 1] = x[i − 1 ... i + l − 1].

Proof. Since the similarity between Cont_x(j) and Cont_x(i) is more than 0, the two symbols x_{j−1} and x_{i−1} are identical. The equality x[j ... j + l − 1] = x[i ... i + l − 1] comes from a direct extension of Lemma 4.3. Therefore, we have x[j − 1 ... j + l − 1] = x[i − 1 ... i + l − 1].

Lemma 4.4 shows that there may exist a previous match which is longer than the corresponding run of rank 1s. For example, the string obladioblada has a sequence of ranks ..., 6, 1, 1, 1, 1, ... (the rank 6 corresponds to the second o). While the length of the run of successive 1s is 4, the length of the match oblad is 5. That is to say, we can encode a longer substring at a time when we encode a matched substring than when we encode a run of rank 1s. This observation leads to the following method.

At the ith symbol x_i it holds either that its rank r is virtual or that there exists a previous position j such that x_i = x_j and the (Cont_x(j) : x_j) pair gives the current rank r. In both cases, we first encode the rank r. After that, if the rank is virtual, then x_i is transmitted as a raw symbol. In the latter case, we find l ≥ 1 such that x[i ... i + l − 1] = x[j ... j + l − 1] and x_{i+l} ≠ x_{j+l}, and encode this longest match x[i ... i + l − 1] by its length l. Specifically, if the rank r is less than 6, it is encoded by the corresponding codeword in the second column in Table 1. For r > 5, we adopt TGcode(r − 1, R − 1). The match length l is coded by 0 for l = 1 and by a concatenation of 1 and TGcode(l − 1, +∞) for l ≥ 2. These codes are again designed in accordance with the actual frequency of ranks and lengths. The codeword for rank 1 is relatively longer than the other codewords in Table 1. This is because Method B encodes most rank-1 symbols as parts of parsed strings, which decreases the relative frequency of r = 1. In this method, the ⟨rank, length⟩ pair serves as a pointer to a previous match. This is similar to a version of the LZ77 scheme [15]. This similarity suggests a unifying approach to the improvement of the Ziv–Lempel code by context sorting. This issue will be discussed in the next section.
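The match-extension step of Method B is simple once the rank computation has supplied the previous position j. The sketch below (positions are 0-based; j = 0 is taken from the obladioblada example above rather than computed) finds the length l that would be sent together with the rank.

```python
def match_length(text, i, j):
    """Length l of the longest match x[i...i+l-1] = x[j...j+l-1], j < i."""
    l = 0
    while i + l < len(text) and text[j + l] == text[i + l]:
        l += 1
    return l

# At the second 'o' of obladioblada (i = 6) the ranked pair points back to
# the first 'o' (j = 0); the match 'oblad' has length 5, one more than the
# run of four rank-1 symbols that follows the rank-6 symbol.
print(match_length("obladioblada", 6, 0))   # -> 5
```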
Method C. We can classify the cases of r = 1 into two types. Let (C : a) and (C′ : a′) denote the first and second pairs in the sequence of context-symbol pairs sorted in order of context similarity. Namely, the context C is the MSC to the current context. For the two symbols a ∈ A and a′ ∈ A, which occurred in the contexts C and C′ respectively, we have either

    a = a′                                                          (17)
or
    a ≠ a′.                                                         (18)

We have observed that in the case of (17) the frequency of r = 1 is extremely high (typically more than 80%), while in the case of (18) the difference between the frequencies of r = 1 and r = 2 is not so large. Since we can distinguish these two cases both in encoding and decoding, we apply slightly different codes to the two cases.

In the case of (17): output TGcode(r, R) (as in Method A).

In the case of (18): represent r = 1 by 00 and r = 2 by 01; for 3 ≤ r ≤ R, output TGcode(r − 1, R − 1).

The codewords in the case of (18) with R = 10 are included in Table 1.

We have implemented our data compression scheme, which uses the above three codes, in order to investigate their practical performance. Table 2 shows compression results for M = 8 and M = +∞ on some files from the Calgary Corpus [1]. Here, compression figures are in bits per input symbol. To provide a comparative performance evaluation, the table also includes compression results of the UNIX Gzip utility, PPMC by Moffat [16] and the block-sorting method of Burrows and Wheeler [5].

TABLE 2. Compression comparisons (bit/symbol) of Methods A, B and C (with M = 8 and M = +∞), Gzip, PPMC and block sorting on Calgary Corpus files: the text files bib, book1, book2, news, paper1, paper2, progc, progl, progp and the binary files geo, obj1, obj2.

Figure 4 shows an example of the effect of the parameter M on compression performance.

FIGURE 4. Compression performance of Method A versus the parameter M on the file bib in Table 2.

Table 2 shows that, in general, Method C with M = +∞ is superior to the other proposed codes.

However, the difference in compression performance between M = 8 and M = +∞ is minor. Although the compression performance tends to improve with an increase of M, it reaches a sufficient level at a small value of M. In most files, a practically suitable value of M in Methods A and C is between 6 and 10. Even though the proposed method outperforms Gzip for sufficiently long files, it is not quite as good as the best competitors. In particular, its performance declines on non-textual data.

The above three codes are useful for showing the different natures of our method. For example, Method B reveals the relation of our system to the Ziv–Lempel algorithm. Method C is designed to show that we can use alternative codes depending on the internal state of the encoder. Despite such significance of the codes, they may still be insufficient to realize the compression power of our system. In order to improve the compression performance, we need further careful optimization in the code design.

5. PERSPECTIVE

As noted in the last paragraph of Method B, our method suggests an improvement of the Ziv–Lempel code by context sorting. We give a more detailed description of this possibility for improvement. We assume again that the parameter M is sufficiently large, as we assumed in Method B. As an example, consider the encoding of the string x[1...15] = baabaaabaabaabb. Suppose that we have already processed x[1...9] and we are going to encode an incoming string after x_9. Since the MSC to the current context is Cont_x(6) and we have x_6 = x_10 = a and x_7 ≠ x_11, Method B encodes the substring x[10...10] by the pair of r = 1 and l = 1. A version of LZ77, e.g. [17], on the other hand, searches the already-processed text x[1...9] for the longest match with the incoming string. In this case, the longest match x[10...14] is coded by the pointer pair of the previous position 7 and the match length 5 (see Figure 5).

FIGURE 5. A version of LZ77: the longest match is encoded by the pair ⟨position, length⟩.

In some other versions, e.g. [14], the match position is better represented as the offset, or the relative displacement, which indicates how far back to look into the text to find the longest match. In addition, the offset may be encoded by a sophisticated code, e.g. a Huffman code, in order to represent more frequent offsets in fewer bits. Thus, the encoding of a pointer to a previous match is one of the most refined parts of the Ziv–Lempel methods.

We also modify the encoding of a position pointer. To describe our modification, we continue with the above example. We first arrange the so-far occurring contexts in x[1...9] by context sorting and sort them in total order of similarity to the current context Cont_x(10). In Table 3, the ordered contexts are numbered serially in the rank column.

TABLE 3. Totally ordered contexts ranked by the context similarity (current context = baabaaaba)

    Similarity   Context                    Rank
    4            Cont_x(6) = baaba          1
    2            Cont_x(3) = ba             2
    1            Cont_x(7) = baabaa         3
    1            Cont_x(4) = baa            4
    1            Cont_x(8) = baabaaa        5
    0            Cont_x(1) = λ              6
    0            Cont_x(2) = b              7
    0            Cont_x(9) = baabaaab       8
    0            Cont_x(5) = baab           9

If a context Cont_x(i) has a rank r, then the position i in the previous text is converted to r, as shown in Figure 6.

FIGURE 6. Position indexes are rearranged according to their associated similarities to the current context.

In our modified version of LZ77, the longest match x[10...14] is then encoded by the ⟨3, 5⟩ pair. Thus, the difference between the current version and other versions of LZ77 lies only in the encoding of the position component of a previous match. The most remarkable point in our modification is that it assigns position indexes on a previous text in dynamically changing order. Furthermore, this approach can be applied not only to a specific version of LZ77 but also to many other variations [1] of the Ziv–Lempel code.

In order to describe the approach more generally, assume that a dictionary D of the Ziv–Lempel code can be divided into two parts: D = D_0 ∪ D_1, where D_0 is an initial state of the dictionary and D_1 is a varying set of strings parsed from the input data. Namely, any dictionary entry t ∈ D_1 can be represented by t = x[i ... i + |t| − 1] with a position i and the length |t|. For any t = x[i ... i + |t| − 1], let C(t) denote its context Cont_x(i). Assume D_1 to be an ordered set D_1 = {t_1, t_2, ..., t_{|D_1|}} and π to be a permutation that rearranges the corresponding context set {C(t_1), C(t_2), ..., C(t_{|D_1|})} to another context set which is ordered by the context similarity to the current context. Then, we use a varying dictionary D_0 ∪ πD_1 instead of D.

The above idea is in a sense a refinement of the multiple-dictionary approach to the Ziv–Lempel code (see, e.g. [18]). A multiple-dictionary method uses separate dictionaries for each previous context, which usually consists of a single character. If we increase the length of a conditioning context, we must maintain a good number of dictionaries, each of which in turn has a fewer number of terms. Thus, the increase of the context length does not necessarily contribute to the improvement of compression performance. On the other hand, our method can keep the dictionary size unchanged without depending on the context length.
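The ⟨7, 5⟩ to ⟨3, 5⟩ conversion of the example above can be traced with the following sketch, which ranks the previous positions by the similarity of their contexts to Cont_x(10) (the same outward merge as in Section 2; the tie-breaking rule is again an assumption) and then rewrites the position component of the LZ77 pair.

```python
import bisect

def similarity(a, b):
    """Length of the common suffix of a and b (M is taken to be unbounded)."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def longest_match(text, i):
    """LZ77-style longest (possibly overlapping) match for text[i:]."""
    best = (0, 0)
    for j in range(i):
        l = 0
        while i + l < len(text) and text[j + l] == text[i + l]:
            l += 1
        best = max(best, (l, j))
    return best[1], best[0]                      # (position, length), 0-based

def position_ranks(text, i):
    """Map each previous position (1-based) to its context-similarity rank."""
    keys = sorted(range(i), key=lambda j: text[:j][::-1])
    p = bisect.bisect_right([text[:j][::-1] for j in keys], text[:i][::-1])
    order, left, right = [], p - 1, p
    while left >= 0 or right < i:
        sl = similarity(text[:keys[left]], text[:i]) if left >= 0 else -1
        sr = similarity(text[:keys[right]], text[:i]) if right < i else -1
        if sl >= sr:
            order.append(keys[left]); left -= 1
        else:
            order.append(keys[right]); right += 1
    return {j + 1: r + 1 for r, j in enumerate(order)}

text = "baabaaabaabaabb"
j, l = longest_match(text, 9)                    # encoding starts at x_10
ranks = position_ranks(text, 9)                  # reproduces the order of Table 3
print((j + 1, l), "->", (ranks[j + 1], l))       # (7, 5) -> (3, 5)
```

The permutation π of the general formulation corresponds to the mapping returned by position_ranks, recomputed as the current context changes.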
6. CONCLUSIONS

We have proposed a new adaptive lossless data compression method based on the notion of context similarity. The method maintains previously seen contexts via context sorting, which assigns a similarity to each context that indicates how closely it matches the current context. In spite of the simplicity of the context sorting mechanism, the proposed method has both theoretically and practically good compression performance.

The proposed method is essentially a context-based symbol-ranking compressor. Yet it also has a strong relation to the Ziv–Lempel-type dictionary methods. We have outlined a unifying approach to the improvement of the Ziv–Lempel code by context sorting. Furthermore, we should emphasize that both in encoding and decoding we can know not only the value of the similarity rank but also an underlying state. This is in contrast to the block-sorting method, which is an off-line variant of sort-based symbol-ranking compressors. In these respects, the proposed method is essentially rich in variation, which should be explored in order to develop better encoding and a faster implementation.

ACKNOWLEDGEMENTS

The author wishes to thank M. Takahashi for his programming support. He is also grateful to A. Moffat, the Editor of this issue, who pointed out some existing work on symbol-ranking compression, and to the referees, who made several useful suggestions.

REFERENCES

[1] Bell, T. C., Cleary, J. G. and Witten, I. H. (1990) Text Compression. Prentice Hall, Englewood Cliffs, NJ.
[2] Cleary, J. G. and Witten, I. H. (1984) Data compression using adaptive coding and partial string matching. IEEE Trans. Commun., COM-32.
[3] Weinberger, M. J., Rissanen, J. J. and Feder, M. (1995) A universal finite memory source. IEEE Trans. Inform. Theory, IT-41.
[4] Willems, F. M. J., Shtarkov, Y. M. and Tjalkens, T. J. (1995) The context-tree weighting method: basic properties. IEEE Trans. Inform. Theory, IT-41.
[5] Burrows, M. and Wheeler, D. J. (1994) A Block-Sorting Lossless Data Compression Algorithm. SRC Research Report 124.
[6] Cleary, J. G., Teahan, W. J. and Witten, I. H. (1995) Unbounded length contexts for PPM. In Storer, J. A. and Cohn, M. (eds), DCC '95: Proc. Data Compression Conf., Snowbird, UT, March. IEEE Computer Society Press, Los Alamitos, CA.
[7] Fenwick, P. M. (1996) Block sorting text compression. In ACSC '96: Proc. 19th Australasian Computer Science Conf., Melbourne, Australia.
[8] Fenwick, P. M. (1996) Block-sorting text compression: Final Report. Technical Report 130, Department of Computer Science, University of Auckland, Auckland, New Zealand. Available at: ftp://auckland.ac.nz/out/peter-f/techrep130.ps.
[9] Fenwick, P. M. (1996) Symbol ranking text compression. Technical Report 132, Department of Computer Science, University of Auckland, Auckland, New Zealand. Available at: ftp://auckland.ac.nz/out/peter-f/techrep132.ps.
[10] Shannon, C. E. (1951) Prediction and entropy of printed English. Bell Syst. Tech. J., 30.
[11] Howard, P. G. and Vitter, J. S. (1993) Design and analysis of fast text compression based on quasi-arithmetic coding. In Storer, J. A. and Cohn, M. (eds), DCC '93: Proc. Data Compression Conf., Snowbird, UT. IEEE Computer Society Press, Los Alamitos, CA.
[12] Elias, P. (1975) Universal codeword sets and representations of the integers. IEEE Trans. Inform. Theory, IT-21.
[13] Yokoo, H. and Takahashi, M. (1996) Data compression by context sorting. IEICE Trans. Fundamentals, E79-A.
[14] Bell, T. C. (1986) Better OPM/L text compression. IEEE Trans. Commun., COM-34.
[15] Ziv, J. and Lempel, A. (1977) A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory, IT-23, 337-343.
[16] Moffat, A. (1990) Implementing the PPM data compression scheme. IEEE Trans. Commun., COM-38.
[17] Ziv, J. (1978) Coding theorems for individual sequences. IEEE Trans. Inform. Theory, IT-24.
[18] Hoang, D. T., Long, P. M. and Vitter, J. S. (1995) Multiple-dictionary compression using partial matching. In Storer, J. A. and Cohn, M. (eds), DCC '95: Proc. Data Compression Conf., Snowbird, UT. IEEE Computer Society Press, Los Alamitos, CA.


More information

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1 Kraft s inequality An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if N 2 l i 1 Proof: Suppose that we have a tree code. Let l max = max{l 1,...,

More information

SGN-2306 Signal Compression. 1. Simple Codes

SGN-2306 Signal Compression. 1. Simple Codes SGN-236 Signal Compression. Simple Codes. Signal Representation versus Signal Compression.2 Prefix Codes.3 Trees associated with prefix codes.4 Kraft inequality.5 A lower bound on the average length of

More information

Fault Tolerance Technique in Huffman Coding applies to Baseline JPEG

Fault Tolerance Technique in Huffman Coding applies to Baseline JPEG Fault Tolerance Technique in Huffman Coding applies to Baseline JPEG Cung Nguyen and Robert G. Redinbo Department of Electrical and Computer Engineering University of California, Davis, CA email: cunguyen,

More information

Lecture 18 April 26, 2012

Lecture 18 April 26, 2012 6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture 18 April 26, 2012 1 Overview In the last lecture we introduced the concept of implicit, succinct, and compact data structures, and

More information

CSEP 590 Data Compression Autumn Arithmetic Coding

CSEP 590 Data Compression Autumn Arithmetic Coding CSEP 590 Data Compression Autumn 2007 Arithmetic Coding Reals in Binary Any real number x in the interval [0,1) can be represented in binary as.b 1 b 2... where b i is a bit. x 0 0 1 0 1... binary representation

More information

CMPT 365 Multimedia Systems. Final Review - 1

CMPT 365 Multimedia Systems. Final Review - 1 CMPT 365 Multimedia Systems Final Review - 1 Spring 2017 CMPT365 Multimedia Systems 1 Outline Entropy Lossless Compression Shannon-Fano Coding Huffman Coding LZW Coding Arithmetic Coding Lossy Compression

More information

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS PROBABILITY AND INFORMATION THEORY Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Probability space Rules of probability

More information

ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION

ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION PAOLO FERRAGINA, IGOR NITTO, AND ROSSANO VENTURINI Abstract. One of the most famous and investigated lossless data-compression schemes is the one introduced

More information

ECE533 Digital Image Processing. Embedded Zerotree Wavelet Image Codec

ECE533 Digital Image Processing. Embedded Zerotree Wavelet Image Codec University of Wisconsin Madison Electrical Computer Engineering ECE533 Digital Image Processing Embedded Zerotree Wavelet Image Codec Team members Hongyu Sun Yi Zhang December 12, 2003 Table of Contents

More information

ELEC 515 Information Theory. Distortionless Source Coding

ELEC 515 Information Theory. Distortionless Source Coding ELEC 515 Information Theory Distortionless Source Coding 1 Source Coding Output Alphabet Y={y 1,,y J } Source Encoder Lengths 2 Source Coding Two coding requirements The source sequence can be recovered

More information

Information. Abstract. This paper presents conditional versions of Lempel-Ziv (LZ) algorithm for settings where compressor and decompressor

Information. Abstract. This paper presents conditional versions of Lempel-Ziv (LZ) algorithm for settings where compressor and decompressor Optimal Universal Lossless Compression with Side Information Yeohee Im, Student Member, IEEE and Sergio Verdú, Fellow, IEEE arxiv:707.0542v cs.it 8 Jul 207 Abstract This paper presents conditional versions

More information

ECE 587 / STA 563: Lecture 5 Lossless Compression

ECE 587 / STA 563: Lecture 5 Lossless Compression ECE 587 / STA 563: Lecture 5 Lossless Compression Information Theory Duke University, Fall 28 Author: Galen Reeves Last Modified: September 27, 28 Outline of lecture: 5. Introduction to Lossless Source

More information

Indexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile

Indexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile Indexing LZ77: The Next Step in Self-Indexing Gonzalo Navarro Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl Part I: Why Jumping off the Cliff The Past Century Self-Indexing:

More information

CS 229r Information Theory in Computer Science Feb 12, Lecture 5

CS 229r Information Theory in Computer Science Feb 12, Lecture 5 CS 229r Information Theory in Computer Science Feb 12, 2019 Lecture 5 Instructor: Madhu Sudan Scribe: Pranay Tankala 1 Overview A universal compression algorithm is a single compression algorithm applicable

More information

ASYMMETRIC NUMERAL SYSTEMS: ADDING FRACTIONAL BITS TO HUFFMAN CODER

ASYMMETRIC NUMERAL SYSTEMS: ADDING FRACTIONAL BITS TO HUFFMAN CODER ASYMMETRIC NUMERAL SYSTEMS: ADDING FRACTIONAL BITS TO HUFFMAN CODER Huffman coding Arithmetic coding fast, but operates on integer number of bits: approximates probabilities with powers of ½, getting inferior

More information

Autumn Coping with NP-completeness (Conclusion) Introduction to Data Compression

Autumn Coping with NP-completeness (Conclusion) Introduction to Data Compression Autumn Coping with NP-completeness (Conclusion) Introduction to Data Compression Kirkpatrick (984) Analogy from thermodynamics. The best crystals are found by annealing. First heat up the material to let

More information

Coding for Discrete Source

Coding for Discrete Source EGR 544 Communication Theory 3. Coding for Discrete Sources Z. Aliyazicioglu Electrical and Computer Engineering Department Cal Poly Pomona Coding for Discrete Source Coding Represent source data effectively

More information

CSEP 590 Data Compression Autumn Dictionary Coding LZW, LZ77

CSEP 590 Data Compression Autumn Dictionary Coding LZW, LZ77 CSEP 590 Data Compression Autumn 2007 Dictionary Coding LZW, LZ77 Dictionary Coding Does not use statistical knowledge of data. Encoder: As the input is processed develop a dictionary and transmit the

More information

Multimedia Communications. Mathematical Preliminaries for Lossless Compression

Multimedia Communications. Mathematical Preliminaries for Lossless Compression Multimedia Communications Mathematical Preliminaries for Lossless Compression What we will see in this chapter Definition of information and entropy Modeling a data source Definition of coding and when

More information

Data Compression. Limit of Information Compression. October, Examples of codes 1

Data Compression. Limit of Information Compression. October, Examples of codes 1 Data Compression Limit of Information Compression Radu Trîmbiţaş October, 202 Outline Contents Eamples of codes 2 Kraft Inequality 4 2. Kraft Inequality............................ 4 2.2 Kraft inequality

More information

Fibonacci Coding for Lossless Data Compression A Review

Fibonacci Coding for Lossless Data Compression A Review RESEARCH ARTICLE OPEN ACCESS Fibonacci Coding for Lossless Data Compression A Review Ezhilarasu P Associate Professor Department of Computer Science and Engineering Hindusthan College of Engineering and

More information

A Faster Grammar-Based Self-Index

A Faster Grammar-Based Self-Index A Faster Grammar-Based Self-Index Travis Gagie 1 Pawe l Gawrychowski 2 Juha Kärkkäinen 3 Yakov Nekrich 4 Simon Puglisi 5 Aalto University Max-Planck-Institute für Informatik University of Helsinki University

More information

CHAPTER 8 COMPRESSION ENTROPY ESTIMATION OF HEART RATE VARIABILITY AND COMPUTATION OF ITS RENORMALIZED ENTROPY

CHAPTER 8 COMPRESSION ENTROPY ESTIMATION OF HEART RATE VARIABILITY AND COMPUTATION OF ITS RENORMALIZED ENTROPY 108 CHAPTER 8 COMPRESSION ENTROPY ESTIMATION OF HEART RATE VARIABILITY AND COMPUTATION OF ITS RENORMALIZED ENTROPY 8.1 INTRODUCTION Klimontovich s S-theorem offers an approach to compare two different

More information

Lecture 3: Error Correcting Codes

Lecture 3: Error Correcting Codes CS 880: Pseudorandomness and Derandomization 1/30/2013 Lecture 3: Error Correcting Codes Instructors: Holger Dell and Dieter van Melkebeek Scribe: Xi Wu In this lecture we review some background on error

More information

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT COMP9319 Web Data Compression and Search Lecture 2: daptive Huffman, BWT 1 Original readings Login to your cse account:! cd ~cs9319/papers! Original readings of each lecture will be placed there. 2 Course

More information

4. Quantization and Data Compression. ECE 302 Spring 2012 Purdue University, School of ECE Prof. Ilya Pollak

4. Quantization and Data Compression. ECE 302 Spring 2012 Purdue University, School of ECE Prof. Ilya Pollak 4. Quantization and Data Compression ECE 32 Spring 22 Purdue University, School of ECE Prof. What is data compression? Reducing the file size without compromising the quality of the data stored in the

More information

Binary Convolutional Codes of High Rate Øyvind Ytrehus

Binary Convolutional Codes of High Rate Øyvind Ytrehus Binary Convolutional Codes of High Rate Øyvind Ytrehus Abstract The function N(r; ; d free ), defined as the maximum n such that there exists a binary convolutional code of block length n, dimension n

More information