Fixed-Length-Parsing Universal Compression with Side Information
Yeohee Im and Sergio Verdú
Dept. of Electrical Eng., Princeton University, NJ

Abstract—This paper presents a fixed-length-parsing universal compression algorithm for settings where compressor and decompressor share the same side information. We prove the optimality of the algorithm for any jointly stationary processes. Furthermore, strategies to modify the algorithm are suggested in order to lower the data compression rate.

(This work was partially supported by the Center for Science of Information, an NSF Science and Technology Center, under Grant CCF.)

I. INTRODUCTION

Underlying the most prevalent compression applications, the Lempel-Ziv class of algorithms [1], [2] is a widely used universal compression paradigm based on parsing into variable-length phrases, which achieves Shannon's fundamental entropy-rate limit. In [3], Willems put forth a fixed-length counterpart, which he analyzed using Kac's fundamental result on repetition times [4].

In many environments, side information is accessible to both compressor and decompressor. When software has to be updated, the old version serves as side information for clients and servers. In video compression, an estimate of the previous frame is useful as side information [5]. In some image transmission applications, a noisy version of the image is given to the decompressor in the form of side information to decode the digitized image [6]. The rsync algorithm proposed in [7] conveys the data of updated files, capitalizing on the fact that both files are highly similar. A weak and cheap checksum is compared first between blocks of the old and new files, and a more expensive and stronger checksum is calculated for blocks where the weak checksum matched, in order to ferret out discrepancies. Data compression with side information is also encountered in genomic sequencing. It was pointed out in [8] that the distance between two DNA sequences can be measured with an information-theoretic approach, and in this setting a reference sequence takes the role of side information [9]. An algorithm to compress a target genome when a reference genome is given was proposed in [10]. Taking into account that a high proportion of any two human genomes is identical, the difference between a target and a reference is compressed with a sliding-window Lempel-Ziv algorithm.
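As a concrete illustration of the two-level checksum strategy of rsync described above, the following Python sketch indexes the old file by a cheap weak checksum and confirms candidate matches with a strong hash. This is only a schematic rendering of the idea in [7], not the actual rsync implementation: the block size, the choice of Adler-32 and MD5, and the block-aligned (rather than rolling) scan of the new file are all simplifying assumptions of ours.

```python
import hashlib
import zlib

BLOCK = 1024  # illustrative block size (an assumption, not rsync's default)

def block_signatures(old: bytes):
    """Index each block of the old file by its weak (Adler-32) checksum,
    keeping the strong (MD5) digest for confirmation."""
    sigs = {}
    for off in range(0, len(old), BLOCK):
        blk = old[off:off + BLOCK]
        sigs.setdefault(zlib.adler32(blk), []).append(
            (off, hashlib.md5(blk).digest()))
    return sigs

def matched_blocks(new: bytes, sigs):
    """Blocks of the new file already present in the old one: the cheap
    weak checksum filters candidates, the expensive strong checksum
    confirms them. (Real rsync slides the weak checksum over every byte
    offset; this sketch only checks block-aligned positions.)"""
    matches = []
    for off in range(0, len(new), BLOCK):
        blk = new[off:off + BLOCK]
        for old_off, strong in sigs.get(zlib.adler32(blk), []):
            if hashlib.md5(blk).digest() == strong:
                matches.append((off, old_off))
                break
    return matches
```

The two-level structure is exactly the point made above: the expensive hash is computed only where the cheap checksum has already matched.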
The problem of universal source coding with side information has been considered in various works. In [11], an algorithm with side information was designed based on the Context Tree Weighting method [12], which weighs the model distributions recursively using a binary context tree. The algorithm in [11] and the conditional Multilevel Pattern Matching code introduced in [9] are optimal when source and side information are jointly stationary and ergodic processes. In addition, there has been some research on universal source coding with side information that applied Lempel-Ziv algorithms. A conditional version of the established sliding-window Lempel-Ziv algorithm was proposed in [13] and shown to be optimal in [14] for stationary ergodic sources. The LZ78 algorithm was modified to exploit side information in [15], based on the algorithm given in [16] for universal decoding for finite-state channels.

In this paper, we propose a universal fixed-length-parsing algorithm with side information available to compressor and decompressor, motivated by the counterpart without side information devised and elegantly analyzed by Willems in [3]. Fixed-length parsing has worse transient behavior than variable-length parsing (LZ), but it is amenable to more straightforward analysis. We make the first step toward designing an algorithm with fixed-length parsing for the system with side information, which we show is optimal for any jointly stationary sequences. While the algorithm in [3] traced back to find the first match of fixed-length phrases, our algorithm looks for a match of source and side information simultaneously, and uses the side information to represent the location of the match. Section II and Section III describe and analyze the algorithms, respectively.

II. DESCRIPTION OF ALGORITHMS

In this section, we propose an extension of the Willems algorithm for fixed-length parsing which takes advantage of side information known to both the compressor and the decompressor.

Consider a source $\boldsymbol{X} = (\ldots, X_1, X_2, \ldots)$ taking values on a finite alphabet $A$. Side information $\boldsymbol{Y} = (\ldots, Y_1, Y_2, \ldots)$ is given to the compressor and the decompressor, taking values on a finite alphabet $B$. The compressor deals with the source and the side information by parsing them into phrases of fixed size $\ell$. The number of phrases of size $\ell$ in $\boldsymbol{X}$ which the compressor aims to compress is denoted by $N$.
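Before specifying the algorithms, the shared front end can be made concrete. The sketch below is our illustration (the function names are not from the paper): it parses a stream into phrases of length $\ell$ and computes the cost $L$ of describing one phrase uncompressed, as defined at the start of Section II-A below.

```python
from math import ceil, log2

def parse_phrases(seq, ell):
    """Split a sequence into consecutive phrases of fixed length ell,
    ignoring any incomplete tail."""
    return [tuple(seq[i:i + ell])
            for i in range(0, (len(seq) // ell) * ell, ell)]

def raw_phrase_bits(alphabet_size, ell):
    """L = ceil(ell * log2 |A|): cost of sending one phrase uncompressed."""
    return ceil(ell * log2(alphabet_size))

phrases = parse_phrases("abracadabra", ell=3)  # [('a','b','r'), ('a','c','a'), ('d','a','b')]
L = raw_phrase_bits(alphabet_size=26, ell=3)   # 15 bits per uncompressed phrase
```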
A. Algorithm 1

Let $L$ be defined as the number of bits needed to represent $\ell$ symbols from the alphabet $A$, i.e., $L = \lceil \ell \log_2 |A| \rceil$. Here and below, $\{a : b\} \triangleq \{j \in \mathbb{Z} : a \le j \le b\}$.

Algorithm 1
  Send $L$ bits to describe $(X_1, \ldots, X_\ell)$.
  for $i = 2, \ldots, N$ do
    $s_i = \min\{ t : 0 < t < (i-1)\ell + 1,\ (XY)^{i\ell}_{(i-1)\ell+1} = (XY)^{i\ell-t}_{(i-1)\ell+1-t} \}$
    $n_i = |\{ t \in \{1 : s_i\} : Y^{i\ell}_{(i-1)\ell+1} = Y^{i\ell-t}_{(i-1)\ell+1-t} \}|$; if no such $s_i$ exists, $n_i = 0$.
    if $n_i \in \{1 : 2^k - 1\}$ then send $h_k(n_i)$.
    else send $h_k(2^k)$; send $X^{i\ell}_{(i-1)\ell+1}$ without compression using $L$ bits.
  end for

The algorithm starts by sending $L$ bits to convey the first phrase $(X_1, \ldots, X_\ell)$ without any compression. At the $i$-th step, the compressor finds out where the same values of the $i$-th phrase, $(X_{(i-1)\ell+1}, \ldots, X_{i\ell})$ and $(Y_{(i-1)\ell+1}, \ldots, Y_{i\ell})$, appeared simultaneously before, for the last time. Tracing back to the location of that $(X, Y)$-match, it counts the number of $Y$-matches of the phrase $(Y_{(i-1)\ell+1}, \ldots, Y_{i\ell})$. As the compressor transmits the number of $Y$-matches, the decompressor can recognize the location of the $(X, Y)$-match uniquely. The number of $Y$-matches is conveyed using $h_k(\cdot)$, the universal prefix code for $\{1 : 2^k\}$ in [3], whose length is

  $l(h_k(n)) = \lceil \log_2(k+1) \rceil + \lceil \log_2 n \rceil$, if $n < 2^k$; $\quad l(h_k(n)) = \lceil \log_2(k+1) \rceil$, if $n = 2^k$,  (1)

where $l(\cdot)$ is the length function. If the number of $Y$-matches $n_i \in \{1 : 2^k - 1\}$, the compressor sends $h_k(n_i)$. When $n_i \ge 2^k$ or there is no match, $h_k(2^k)$ is transmitted, followed by the uncompressed sequence $(X_{(i-1)\ell+1}, \ldots, X_{i\ell})$.

B. Algorithm 2

If Algorithm 1 cannot find an $(X, Y)$-match within the most recent $2^k - 1$ $Y$-matches, the compressor simply sends $h_k(2^k)$ and the uncompressed phrase. There is, however, a fair chance of finding an $X$-match in the past, permitting the compressor to capitalize on the original LZ77 algorithm [1].

The compressor looks back for the $(X, Y)$-match of the $i$-th phrase and calculates the number of $Y$-matches $n_i$ until the $(X, Y)$-match appears, as in Algorithm 1. In case $n_i \in \{1 : 2^k\}$, the flag bit 0 is sent to denote that an $(X, Y)$-match was found, and $h_k(n_i)$ is delivered. Otherwise, the compressor forwards the flag bit 1 and finds out whether there exists an $X$-match. When such a match is found, the number of symbols, $r_i$, that it has to look back to reach the match is counted, and $h_m(r_i)$ is sent if $r_i \le 2^m - 1$, for some fixed integer parameter $m$. When $r_i \ge 2^m$ or no match was found, $h_m(2^m)$ is conveyed, followed by the $L$ uncompressed bits.

Algorithm 2
  Send $L$ bits to describe $(X_1, \ldots, X_\ell)$.
  for $i = 2, \ldots, N$ do
    Compute $s_i$ and $n_i$ as in Algorithm 1.
    if $n_i \in \{1 : 2^k\}$ then send a flag bit 0 and $h_k(n_i)$.
    else
      Send a flag bit 1.
      $r_i = \min\{ t : 0 < t < (i-1)\ell + 1,\ X^{i\ell}_{(i-1)\ell+1} = X^{i\ell-t}_{(i-1)\ell+1-t} \}$
      if $r_i \le 2^m - 1$ then send $h_m(r_i)$.
      else send $h_m(2^m)$; send $X^{i\ell}_{(i-1)\ell+1}$ with $L$ bits.
  end for
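Before turning to Algorithm 3, the encoding loop of Algorithm 1 can be made concrete. The following Python sketch is our illustration under the definitions above; $h_k$ is modeled only through its length function (1), since any prefix code achieving those lengths will do, and the brute-force backward scan is chosen for clarity, not efficiency.

```python
from math import ceil, log2

def hk_length(n, k):
    """Length (1) of the universal prefix code h_k(n), n in {1 : 2**k}."""
    assert 1 <= n <= 2 ** k
    if n == 2 ** k:
        return ceil(log2(k + 1))
    return ceil(log2(k + 1)) + ceil(log2(n))

def algorithm1_lengths(x, y, ell, k, L):
    """Per-phrase codeword lengths under Algorithm 1 (brute-force search).

    For each phrase we scan backwards; n_i counts the Y-matches seen up to
    and including the most recent simultaneous (X, Y)-match. If that match
    exists and n_i <= 2**k - 1, the phrase costs l(h_k(n_i)) bits; otherwise
    the escape h_k(2**k) plus the L-bit raw phrase is charged.
    """
    assert len(x) == len(y)
    lengths = [L]                        # first phrase is sent uncompressed
    for i in range(2, len(x) // ell + 1):
        lo = (i - 1) * ell               # 0-based start of the i-th phrase
        xp, yp = x[lo:lo + ell], y[lo:lo + ell]
        n_i, found = 0, False
        for t in range(1, lo + 1):       # offsets 0 < t < (i-1)*ell + 1
            if y[lo - t:lo + ell - t] == yp:
                n_i += 1
                if x[lo - t:lo + ell - t] == xp:
                    found = True         # (X, Y)-match after n_i Y-matches
                    break
        if found and n_i <= 2 ** k - 1:
            lengths.append(hk_length(n_i, k))
        else:
            lengths.append(hk_length(2 ** k, k) + L)
    return lengths
```

For a binary source one would take $L = \ell$; summing the returned list gives the total codelength, apart from conveying the parameters $\ell$ and $k$, which we assume are known to both ends. Algorithm 2 would add one flag bit per phrase and fall back to an $X$-match search (with $h_m$) on the escape branch.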
C. Algorithm 3

In Algorithm 1, the prefix code $h_k(\cdot)$ is used to convey the number of $Y$-matches found until the $(X, Y)$-match is reached. The code assigns a codeword when the number of matches is in $\{1 : 2^k\}$. However, when the sequences are not long enough, the total number of $Y$-matches can be smaller than $2^k$, in which case it is inefficient to use a code for $\{1 : 2^k\}$. The algorithm is modified so that the compressor can reduce the number of bits to send, by adopting a prefix code with another parameter.

After sending $i - 1$ phrases, the match of the $i$-th phrase in the past is found, and the number of $Y$-matches $n_i$ until the match is counted as before. In addition, the algorithm computes $p_i$, the number of $Y$-matches from the beginning until the current time. If $p_i < 2^k$, then there is no need to use the code $h_k(\cdot)$, since it is efficient only when every index $1, 2, \ldots, 2^k$ is a possible candidate to be encoded. Choose the parameter $k_i = \lceil \log_2(p_i + 1) \rceil$. The decompressor as well as the compressor can compute $k_i$, since they are both aware of every location of the $Y$-matches. The compressor conveys $h_{k_i}(n_i)$ if $n_i \in \{1 : 2^{k_i} - 1\}$, and otherwise it sends $h_{k_i}(2^{k_i})$ together with the uncompressed sequence. After several phrases have been sent, $p_i$ becomes as large as $2^k$; in that case, $h_k(\cdot)$ can be used. This modification is useful for the original Willems algorithm without side information [3] as well.

Algorithm 3
  Send $L$ bits to describe $(X_1, \ldots, X_\ell)$.
  for $i = 2, \ldots, N$ do
    Compute $s_i$ and $n_i$ as in Algorithm 1.
    $p_i = |\{ t \in \{1 : (i-1)\ell\} : Y^{i\ell}_{(i-1)\ell+1} = Y^{i\ell-t}_{(i-1)\ell+1-t} \}|$
    if $p_i < 2^k$ then
      $k_i = \lceil \log_2(p_i + 1) \rceil$
      if $n_i \in \{1 : 2^{k_i} - 1\}$ then send $h_{k_i}(n_i)$.
      else send $h_{k_i}(2^{k_i})$; send $X^{i\ell}_{(i-1)\ell+1}$ using $L$ bits.
    else
      if $n_i \in \{1 : 2^k - 1\}$ then send $h_k(n_i)$.
      else send $h_k(2^k)$; send $X^{i\ell}_{(i-1)\ell+1}$ using $L$ bits.
  end for

III. ANALYSIS

In this section we show the optimality of Algorithm 1 and compare the performance of the modified algorithms.

A. Optimality of Algorithm 1

Willems [3] showed the asymptotic optimality of his fixed-length-parsing algorithm in the sense of expected length for stationary sources. Here we show that the length per symbol of the algorithm with side information approaches the conditional entropy rate as the blocklength and the number of phrases go to infinity.

For any doubly infinite sequence $\boldsymbol{a} = (\ldots, a_{-1}, a_0, a_1, \ldots)$, the repeated recurrence times are defined as

  $T_0(\boldsymbol{a}) = 0$,
  $T_j(\boldsymbol{a}) = \min\{ t \in \mathbb{N} : t > T_{j-1}(\boldsymbol{a}),\ a^0_{1-\ell} = a^{-t}_{1-\ell-t} \}$.

A random variable $C$ indicates the number of $Y$-matches that appeared in the past since the latest $(X, Y)$-match:

  $C = |\{ t \in \mathbb{N} : t \le T_1(\boldsymbol{X}, \boldsymbol{Y}),\ Y^0_{1-\ell} = Y^{-t}_{1-\ell-t} \}|$.  (2)

The following lemma provides the relationship between the conditional expectation of $C$ and the conditional probability.

Lemma 1. For any jointly stationary processes $\boldsymbol{X}, \boldsymbol{Y}$ and sequences $x$, $y$, the conditional probability of $x$ given $y$ and the conditional expectation of $C$ satisfy

  $\mathbb{E}[\, C \mid X^0_{1-\ell} = x,\ \boldsymbol{Y} = y \,] \le \dfrac{1}{P[\, X^0_{1-\ell} = x \mid \boldsymbol{Y} = y \,]}$,  (3)

where $C$ is defined in (2).

Proof. We simplify notation as $T_j \triangleq T_j(y)$ and define the probability

  $w_n \triangleq P[\, X^{-T_j}_{1-\ell-T_j} \neq x \text{ for all } j \in \{1 : n\} \mid \boldsymbol{Y} = y \,]$.

The difference between $w_n$ and $w_{n+1}$ is the probability that an $X$-match appears for the first time at the interval $T_{n+1}$, among the locations of the $Y$-matches:

  $w_n - w_{n+1} = P[\, X^{-T_j}_{1-\ell-T_j} \neq x,\ j \in \{1:n\},\ X^{-T_{n+1}}_{1-\ell-T_{n+1}} = x \mid \boldsymbol{Y} = y \,]$,

which, by stationarity, can be interpreted as

  $w_n - w_{n+1} = P[\, X^0_{1-\ell} \neq x,\ X^{-T_j}_{1-\ell-T_j} \neq x,\ j \in \{1:n-1\},\ X^{-T_n}_{1-\ell-T_n} = x \mid \boldsymbol{Y} = y \,]$.

The joint conditional probability of $C$ and $X^0_{1-\ell}$ then becomes

  $P[\, C = n,\ X^0_{1-\ell} = x \mid \boldsymbol{Y} = y \,]$
  $= P[\, X^0_{1-\ell} = x,\ X^{-T_j}_{1-\ell-T_j} \neq x,\ j \in \{1:n-1\},\ X^{-T_n}_{1-\ell-T_n} = x \mid \boldsymbol{Y} = y \,]$
  $= P[\, X^{-T_j}_{1-\ell-T_j} \neq x,\ j \in \{1:n-1\},\ X^{-T_n}_{1-\ell-T_n} = x \mid \boldsymbol{Y} = y \,] - P[\, X^0_{1-\ell} \neq x,\ X^{-T_j}_{1-\ell-T_j} \neq x,\ j \in \{1:n-1\},\ X^{-T_n}_{1-\ell-T_n} = x \mid \boldsymbol{Y} = y \,]$
  $= (w_{n-1} - w_n) - (w_n - w_{n+1})$.

Finally, we get the desired result from

  $\mathbb{E}[\, C \mid X^0_{1-\ell}=x, \boldsymbol{Y}=y \,]\, P[\, X^0_{1-\ell}=x \mid \boldsymbol{Y}=y \,] = \lim_{n\to\infty} \sum_{j=1}^{n} j\, (w_{j-1} - 2 w_j + w_{j+1})$  (4)
  $= \lim_{n\to\infty} \big( 1 - w_n - n (w_n - w_{n+1}) \big)$  (5)
  $= \lim_{n\to\infty} (1 - w_n)$  (6)
  $= P[\, X^{-T_j}_{1-\ell-T_j} = x \text{ for some } j \in \mathbb{N} \mid \boldsymbol{Y} = y \,] \le 1$,  (7)

where (6) holds because $\sum_j (w_j - w_{j+1}) < \infty$ and, accordingly, $w_n - w_{n+1} = o(1/n)$.

Note that Lemma 1 reduces to Kac's lemma [4] in the case without side information, which implies that the mean recurrence time of a sequence is at most the reciprocal of the probability of the sequence.
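Before proceeding, the Kac-type identity behind Lemma 1 admits a quick numerical sanity check in the special case without side information, where for an ergodic source the mean recurrence time of a phrase meets the reciprocal of its probability. The following sketch (our illustration; the i.i.d. binary model and all parameters are arbitrary choices) estimates both sides empirically.

```python
import random

def mean_recurrence_time(bits, ell, phrase):
    """Average gap (in symbols) between successive occurrences of `phrase`."""
    gaps, last = [], None
    for i in range(len(bits) - ell + 1):
        if tuple(bits[i:i + ell]) == phrase:
            if last is not None:
                gaps.append(i - last)
            last = i
    return sum(gaps) / len(gaps)

random.seed(0)
p = 0.3                                         # P(bit = 1) for the i.i.d. source
bits = [1 if random.random() < p else 0 for _ in range(1_000_000)]
phrase = (1, 0, 1)
print(mean_recurrence_time(bits, 3, phrase))    # empirically close to ...
print(1 / (p * (1 - p) * p))                    # ... 1/P(101), about 15.87
```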
Based on Lemma 1, the average codelength of the algorithm with side information is analyzed next. Let $w^i(X^{i\ell} \mid Y^{i\ell})$ be the codeword constructed by Algorithm 1 to send the $i$-th phrase $X^{i\ell}_{(i-1)\ell+1}$, when the previous phrases are $X^{(i-1)\ell}_1$, the current phrase of side information is $Y^{i\ell}_{(i-1)\ell+1}$, and the previous phrases of side information are $Y^{(i-1)\ell}_1$. For convenience, the first recurrence time of the sequences, $T_1(\boldsymbol{X}, \boldsymbol{Y})$, will be denoted by $T$ from this point on, and $\imath_{X|Y}(x \mid y) \triangleq \log_2 \frac{1}{P[X^\ell_1 = x \mid Y^\ell_1 = y]}$ denotes the conditional information.

Theorem 1. The conditional average codelength of the $i$-th phrase is bounded as

  $\mathbb{E}[\, l( w^i(X^{i\ell} \mid Y^{i\ell}) ) \mid X^{i\ell}_{(i-1)\ell+1} = x,\ Y^{i\ell}_{(i-1)\ell+1} = y \,] \le \imath_{X|Y}(x \mid y) + \lceil \log_2(k+1) \rceil + 1 + \dfrac{L}{((i-1)\ell+1)\, P[\, X^\ell_1 = x,\ Y^\ell_1 = y \,]}$,  (8)

for any $x \in A^\ell$, $y \in B^\ell$.

Proof. Define the event $E_{xy} \triangleq \{X^0_{1-\ell} = x,\ Y^0_{1-\ell} = y\}$. Using stationarity, the expected length of the codeword of the $i$-th phrase is

  $\mathbb{E}[\, l( w^i(X^{i\ell} \mid Y^{i\ell}) ) \mid X^{i\ell}_{(i-1)\ell+1} = x,\ Y^{i\ell}_{(i-1)\ell+1} = y \,] = \mathbb{E}[\, l( w^i(X^0_{1-i\ell} \mid Y^0_{1-i\ell}) ) \mid E_{xy} \,]$
  $= \lceil \log_2(k+1) \rceil + \sum_{j=1}^{2^k-1} \lceil \log_2 j \rceil\, P[\, C = j,\ T \le (i-1)\ell \mid E_{xy} \,] + L\, P[\, C \ge 2^k \text{ or } T > (i-1)\ell \mid E_{xy} \,]$
  $\overset{(a)}{\le} \lceil \log_2(k+1) \rceil + 1 + \log_2 \mathbb{E}[\, C \mid E_{xy} \,] + L \sum_{j \ge 1} P[\, C = j \mid E_{xy} \,]\, P[\, T_j(\boldsymbol{Y}) \ge (i-1)\ell + 1 \mid C = j,\ E_{xy} \,]$
  $\overset{(b)}{\le} \imath_{X|Y}(x \mid y) + \lceil \log_2(k+1) \rceil + 1 + \dfrac{L\, \mathbb{E}[\, C \mid E_{xy} \,]}{((i-1)\ell+1)\, P[\, Y^0_{1-\ell} = y \,]}$
  $\overset{(c)}{\le} \imath_{X|Y}(x \mid y) + \lceil \log_2(k+1) \rceil + 1 + \dfrac{L}{((i-1)\ell+1)\, P[E_{xy}]}$,

where (a) holds by the concavity of the logarithm and Jensen's inequality, (b) follows by Lemma 1, Markov's inequality and

  $\mathbb{E}[\, T_j(\boldsymbol{Y}) \mid Y^0_{1-\ell} = y \,] = j\, \mathbb{E}[\, T_1(\boldsymbol{Y}) \mid Y^0_{1-\ell} = y \,] \le \dfrac{j}{P[\, Y^0_{1-\ell} = y \,]}$,  (9)

and inequality (c) is implied by Lemma 1.

The codeword for $X^{N\ell}$ with side information $Y^{N\ell}$ is denoted by $w(X^{N\ell} \mid Y^{N\ell})$. Utilizing the bound for the codelength of each phrase given in Theorem 1, we show that the average compression rate is asymptotically bounded by the conditional entropy rate.

Theorem 2. Algorithm 1 is asymptotically optimal in the sense that

  $\lim_{\ell \to \infty} \lim_{N \to \infty} \dfrac{\mathbb{E}[\, l( w(X^{N\ell} \mid Y^{N\ell}) ) \,]}{N\ell} \le H(\boldsymbol{X} \mid \boldsymbol{Y})$.  (10)

Proof. For any finite $N$ and $\ell$, the expected codelength is bounded as

  $\mathbb{E}[\, l( w(X^{N\ell} \mid Y^{N\ell}) ) \,] = L + \sum_{i=2}^{N} \mathbb{E}[\, l( w^i(X^{i\ell} \mid Y^{i\ell}) ) \,]$  (11)
  $= L + \sum_{i=2}^{N} \sum_{x \in A^\ell} \sum_{y \in B^\ell} P[\, X^{i\ell}_{(i-1)\ell+1} = x,\ Y^{i\ell}_{(i-1)\ell+1} = y \,]\; \mathbb{E}[\, l( w^i(X^{i\ell} \mid Y^{i\ell}) ) \mid X^{i\ell}_{(i-1)\ell+1} = x,\ Y^{i\ell}_{(i-1)\ell+1} = y \,]$  (12)
  $\le L + \sum_{i=2}^{N} \sum_{x \in A^\ell} \sum_{y \in B^\ell} P[E_{xy}] \Big( \imath_{X|Y}(x \mid y) + \lceil \log_2(k+1) \rceil + 1 + \dfrac{L}{((i-1)\ell+1)\, P[E_{xy}]} \Big)$  (13)
  $= L + (N-1)\, \lceil \log_2(k+1) \rceil + (N-1)\big( H(X^\ell \mid Y^\ell) + 1 \big) + L\, |A|^\ell |B|^\ell \sum_{i=2}^{N} \dfrac{1}{(i-1)\ell+1}$.  (14)

As the number of phrases goes to infinity, the asymptotic compression rate can be upper bounded as

  $\lim_{N \to \infty} \dfrac{\mathbb{E}[\, l( w(X^{N\ell} \mid Y^{N\ell}) ) \,]}{N\ell} \le \dfrac{\lceil \log_2(k+1) \rceil + 1}{\ell} + \dfrac{H(X^\ell \mid Y^\ell)}{\ell} + \dfrac{L\, |A|^\ell |B|^\ell}{\ell} \lim_{N \to \infty} \dfrac{1}{N} \sum_{i=2}^{N} \dfrac{1}{(i-1)\ell+1}$  (15)
  $= \dfrac{\lceil \log_2(k+1) \rceil + 1}{\ell} + \dfrac{H(X^\ell \mid Y^\ell)}{\ell}$.  (16)

Therefore, (10) follows as $\ell \to \infty$, since $\lceil \log_2(k+1) \rceil / \ell \to 0$ and $H(X^\ell \mid Y^\ell)/\ell$ converges to the conditional entropy rate $H(\boldsymbol{X} \mid \boldsymbol{Y})$ for jointly stationary processes.
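As an illustrative reading of (16) (our numbers, not taken from the paper), the per-symbol redundancy above the normalized conditional entropy shrinks like $1/\ell$:

```python
from math import ceil, log2

def overhead(ell, k):
    """Per-symbol redundancy term of (16): (ceil(log2(k+1)) + 1) / ell."""
    return (ceil(log2(k + 1)) + 1) / ell

for ell in (16, 64, 256, 1024):
    print(ell, overhead(ell, k=16))  # 0.375, 0.09375, 0.0234..., 0.0058...
```

For instance, with $k = 16$ the overhead is $6/\ell$ bits per source symbol, which vanishes as the phrase length grows.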
B. Algorithm 2

While Algorithm 1 exploits $(X, Y)$-matches only, $X$-matches are also exploited in Algorithm 2 in order to denote the match location. However, this modified strategy necessitates a flag bit to indicate whether the codeword signals an $(X, Y)$-match or an $X$-match. We show a sufficient condition for Algorithm 2 to outperform Algorithm 1. The codelengths are summarized in Table I, where $n_i$ is the number of $Y$-matches until the $(X, Y)$-match is reached, and $r_i$ is the number of symbols needed to reach the $X$-match, as before. For cases 2), 3), 4), Algorithm 2 constructs a shorter codeword than Algorithm 1 does if $m < k$, while Algorithm 2 outputs one more bit for case 1). Hence, Algorithm 2 is efficient when the frequency of case 1) is low.

TABLE I: Comparison of Codelengths

  Case                                     | Algorithm 1                                        | Algorithm 2
  1) $n_i \le 2^k - 1$                     | $\lceil\log_2(k+1)\rceil + \lceil\log_2 n_i\rceil$ | $\lceil\log_2(k+1)\rceil + \lceil\log_2 n_i\rceil + 1$
  2) $n_i = 2^k$                           | $\lceil\log_2(k+1)\rceil + L$                      | $\lceil\log_2(k+1)\rceil + 1$
  3) $n_i \ge 2^k + 1$, $r_i \le 2^m - 1$  | $\lceil\log_2(k+1)\rceil + L$                      | $\lceil\log_2(m+1)\rceil + \lceil\log_2 r_i\rceil + 1$
  4) $n_i \ge 2^k + 1$, $r_i \ge 2^m$      | $\lceil\log_2(k+1)\rceil + L$                      | $\lceil\log_2(m+1)\rceil + L + 1$

Denote

  $E_1 \triangleq \{ C \le 2^k - 1,\ T \le (i-1)\ell \}$,
  $E_2 \triangleq \{ C = 2^k,\ T \le (i-1)\ell \}$,
  $E_{3,t} \triangleq \big( \{ C \ge 2^k + 1 \} \cup \{ T \ge (i-1)\ell + 1 \} \big) \cap \{ T_1(\boldsymbol{X}) = t \}$,
  $E_4 \triangleq \big( \{ C \ge 2^k + 1 \} \cup \{ T \ge (i-1)\ell + 1 \} \big) \cap \{ T_1(\boldsymbol{X}) > c_i \}$,

where $c_i \triangleq \min(2^m - 1,\ (i-1)\ell)$. Let $w^i_2(\cdot)$ be the codeword of the $i$-th phrase constructed with Algorithm 2. Then, the difference between the average codelengths is

  $\mathbb{E}[\, l(w^i(X^{i\ell} \mid Y^{i\ell})) \mid X^{i\ell}_{(i-1)\ell+1} = x,\ Y^{i\ell}_{(i-1)\ell+1} = y \,] - \mathbb{E}[\, l(w^i_2(X^{i\ell} \mid Y^{i\ell})) \mid X^{i\ell}_{(i-1)\ell+1} = x,\ Y^{i\ell}_{(i-1)\ell+1} = y \,]$
  $= -P[E_1 \mid E_{xy}] + (L-1)\, P[E_2 \mid E_{xy}] + \sum_{t=1}^{c_i} P[E_{3,t} \mid E_{xy}] \big( \lceil\log_2(k+1)\rceil + L - \lceil\log_2(m+1)\rceil - \lceil\log_2 t\rceil - 1 \big) + P[E_4 \mid E_{xy}] \big( \lceil\log_2(k+1)\rceil - \lceil\log_2(m+1)\rceil - 1 \big)$  (17)
  $\ge -P[E_1 \mid E_{xy}] + (L-1)\, P[E_2 \mid E_{xy}] + P[\, \cup_{t \le c_i} E_{3,t} \mid E_{xy} \,] \big( \lceil\log_2(k+1)\rceil + L - \lceil\log_2(m+1)\rceil - m - 1 \big) + P[E_4 \mid E_{xy}] \big( \lceil\log_2(k+1)\rceil - \lceil\log_2(m+1)\rceil - 1 \big)$  (18)
  $\ge -P[E_1 \mid E_{xy}] + P[E_1^c \mid E_{xy}] \big( \lceil\log_2(k+1)\rceil - \lceil\log_2(m+1)\rceil - 1 \big)$.  (19)

This gap is positive if

  $P[E_1 \mid E_{xy}] \le \dfrac{\lceil\log_2(k+1)\rceil - \lceil\log_2(m+1)\rceil - 1}{\lceil\log_2(k+1)\rceil - \lceil\log_2(m+1)\rceil}$,  (20)

which means that Algorithm 2 is advantageous for relatively short sequences.

C. Algorithm 3

As described in Section II-C, Algorithm 3 improves performance by exploiting the fact that the decompressor is able to locate every match of the side information. We focus on the codelength for the $i$-th phrase, whose codeword is denoted by $w^i_3(\cdot)$. Let $\hat{B} \subseteq B^{i\ell}$ be the set composed of those $y \in B^{i\ell}$ such that the number of matches, $p_i$, of the $i$-th phrase among the $i-1$ previous phrases is smaller than $2^k$. If $y \in \hat{B}$, a new parameter $k_i = \lceil\log_2(p_i + 1)\rceil$ replaces $k$, and $C$ is certainly lower than $2^{k_i}$, given that $T \le (i-1)\ell$. The conditional average codelength of Algorithm 3 for the $i$-th phrase is

  $\mathbb{E}[\, l(w^i_3(X^{i\ell} \mid Y^{i\ell})) \mid X^{i\ell}_{(i-1)\ell+1} = x,\ Y^{i\ell}_1 = y \,] = \lceil\log_2(k_i+1)\rceil + \sum_{j=1}^{2^{k_i}-1} \lceil\log_2 j\rceil\, P[\, C = j,\ T \le (i-1)\ell \mid X^{i\ell}_{(i-1)\ell+1} = x,\ Y^{i\ell}_1 = y \,] + L\, P[\, T > (i-1)\ell \mid X^{i\ell}_{(i-1)\ell+1} = x,\ Y^{i\ell}_1 = y \,]$, for $y \in \hat{B}$.

The performance gap corresponding to the $i$-th phrase between Algorithm 1 and Algorithm 3 for this case results in

  $\mathbb{E}[\, l(w^i(X^{i\ell} \mid Y^{i\ell})) \mid X^{i\ell}_{(i-1)\ell+1} = x,\ Y^{i\ell}_1 = y \,] - \mathbb{E}[\, l(w^i_3(X^{i\ell} \mid Y^{i\ell})) \mid X^{i\ell}_{(i-1)\ell+1} = x,\ Y^{i\ell}_1 = y \,] = \lceil\log_2(k+1)\rceil - \lceil\log_2(k_i+1)\rceil \ge 0$, for $y \in \hat{B}$.

Hence, when $y \in \hat{B}$, the performance of Algorithm 3 is at least as good as that of Algorithm 1, while if $y \notin \hat{B}$, both algorithms yield the same expected codelength. Thus, Algorithm 1 always constructs codewords no shorter than those of Algorithm 3, which exploits the fact that the first several phrases typically find few side-information matches.
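Reading Table I programmatically (our sketch; the values of $k$, $m$, $L$ below are arbitrary), the function returns the (Algorithm 1, Algorithm 2) codeword-length pair for each case, and the last line evaluates the break-even frequency of case 1) given by (20):

```python
from math import ceil, log2

def table1(k, m, L, n_i=None, r_i=None):
    """Codeword lengths of Algorithms 1 and 2 for each case of Table I."""
    K, M = ceil(log2(k + 1)), ceil(log2(m + 1))
    if n_i is not None and 1 <= n_i <= 2 ** k - 1:     # case 1
        return K + ceil(log2(n_i)), K + ceil(log2(n_i)) + 1
    if n_i == 2 ** k:                                  # case 2
        return K + L, K + 1
    if r_i is not None and r_i <= 2 ** m - 1:          # case 3
        return K + L, M + ceil(log2(r_i)) + 1
    return K + L, M + L + 1                            # case 4

K, M = ceil(log2(16 + 1)), ceil(log2(4 + 1))           # k = 16, m = 4
print((K - M - 1) / (K - M))  # break-even P[case 1] from (20): here 0.5
```

With these numbers, Algorithm 2 yields a shorter expected codeword whenever the $(X, Y)$-match of case 1) occurs in fewer than half of the phrases, consistent with the observation that it pays off for relatively short sequences.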
REFERENCES

[1] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Inf. Theory, vol. 23, no. 3, pp. 337-343, May 1977.
[2] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Inf. Theory, vol. 24, no. 5, pp. 530-536, Sep. 1978.
[3] F. M. J. Willems, "Universal data compression and repetition times," IEEE Trans. Inf. Theory, vol. 35, no. 1, pp. 54-58, Jan. 1989.
[4] M. Kac, "On the notion of recurrence in discrete stochastic processes," Bulletin of the American Mathematical Society, vol. 53, pp. 1002-1010, 1947.
[5] A. Aaron, R. Zhang, and B. Girod, "Wyner-Ziv coding of motion video," in Proc. Asilomar Conf. on Signals and Systems, Pacific Grove, CA, Nov. 2002.
[6] S. S. Pradhan and K. Ramchandran, "Enhancing analog image transmission systems using digital side information: a new wavelet-based image coding paradigm," in Proc. IEEE Data Compression Conf., Snowbird, UT, Mar. 2001.
[7] A. Tridgell and P. Mackerras, "The rsync algorithm," Technical Report TR-CS-96-05, Department of Computer Science, The Australian National University, 1996.
[8] M. Li, J. H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang, "An information-based sequence distance and its application to whole mitochondrial genome phylogeny," Bioinformatics, vol. 17, no. 2, pp. 149-154, 2001.
[9] E. Yang, A. Kaltchenko, and J. C. Kieffer, "Universal lossless data compression with side information by using a conditional MPM grammar transform," IEEE Trans. Inf. Theory, vol. 47, no. 6, Sep. 2001.
[10] B. G. Chern, I. Ochoa, A. Manolakos, A. No, K. Venkat, and T. Weissman, "Reference based genome compression," in Proc. IEEE Information Theory Workshop, 2012.
[11] H. Cai, S. R. Kulkarni, and S. Verdú, "An algorithm for universal lossless compression with side information," IEEE Trans. Inf. Theory, vol. 52, no. 9, pp. 4008-4016, Sep. 2006.
[12] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting method: basic properties," IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 653-664, May 1995.
[13] P. Subrahmanya and T. Berger, "A sliding window Lempel-Ziv algorithm for differential layer encoding in progressive transmission," in Proc. IEEE Int. Symp. on Information Theory, Whistler, BC, Canada, June 1995.
[14] T. Jacob and R. K. Bansal, "On the optimality of sliding window Lempel-Ziv algorithm with side information," in Proc. 2008 Int. Symp. on Information Theory and its Applications, Auckland, New Zealand, Dec. 2008.
[15] T. Uyematsu and S. Kuzuoka, "Conditional Lempel-Ziv complexity and its application to source coding theorem with side information," IEICE Trans. Fundamentals, vol. E86-A, no. 10, Oct. 2003.
[16] J. Ziv, "Universal decoding for finite-state channels," IEEE Trans. Inf. Theory, vol. 31, no. 4, pp. 453-460, July 1985.