Information Theory and Statistics Lecture 4: Lempel-Ziv code


1 Information Theory and Statistics Lecture 4: Lempel-Ziv code

Łukasz Dębowski, Ph.D. Programme 2013/2014

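2 Entropy rate is the limiting compression rate

Theorem. For a stationary process $(X_i)_{i=1}^{\infty}$, let $L_n$ denote the minimal expected compression rate of a uniquely decodable code $B: \mathbb{X}^* \to \{0,1\}^*$ for the block of $n$ variables. That is,

$$L_n := \min_B \frac{\mathbf{E}\,|B(X_1,\dots,X_n)|}{n}.$$

We claim that $\lim_{n\to\infty} L_n = h$.

Proof. Assume that $B_n$ is the Shannon-Fano code for the block $(X_1,\dots,X_n)$. Then

$$\frac{H(X_1^n)}{n} \le L_n \le \frac{\mathbf{E}\,|B_n(X_1,\dots,X_n)|}{n} \le \frac{H(X_1^n)}{n} + \frac{1}{n}.$$

Since $H(X_1^n)/n$ tends to the entropy rate $h$, the claim follows.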
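The sandwich bound in the proof can be checked numerically. The sketch below is an illustration only, under two assumptions that are not in the slides: the process is i.i.d. Bernoulli($p$), and the Shannon-Fano code is taken to assign the length $\lceil -\log_2 P(x_1^n) \rceil$ to each block. The function name `block_stats` is hypothetical.

```python
import math

def block_stats(p, n):
    """Return (H(X_1^n)/n, E|B_n|/n) for an i.i.d. Bernoulli(p) block.

    Assumption: the Shannon-Fano code length of a block x is
    ceil(-log2 P(x)). Blocks with the same number of ones have the same
    probability, so we sum over the number of ones.
    """
    entropy = 0.0
    expected_len = 0.0
    for ones in range(n + 1):
        prob = p ** ones * (1 - p) ** (n - ones)   # probability of one such block
        mult = math.comb(n, ones)                  # number of such blocks
        info = -math.log2(prob)
        entropy += mult * prob * info
        expected_len += mult * prob * math.ceil(info)
    return entropy / n, expected_len / n

if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        h_rate, len_rate = block_stats(0.2, n)
        print(n, h_rate, len_rate, h_rate + 1.0 / n)
```

For each block length the middle value should lie between the entropy per symbol and the entropy per symbol plus $1/n$, so the gap closes as $n$ grows.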

3 The problem of universal compression

To compute the Shannon-Fano code we need to know the probability distribution of the block. Such a situation is unlikely in practical applications of data compression, where we have no prior information about the probability distribution of blocks. Fortunately, as an important corollary of the ergodic theorem, there exist universal codes whose compression rates tend to the entropy rate for any stationary process.

4 Universal codes

Definition (weakly universal code). A uniquely decodable code $B: \mathbb{X}^* \to \{0,1\}^*$ is called weakly universal if for any stationary process (not necessarily ergodic) we have

$$\lim_{n\to\infty} \frac{1}{n}\,\mathbf{E}\,|B(X_1^n)| = h.$$

Definition (strongly universal code). A uniquely decodable code $B: \mathbb{X}^* \to \{0,1\}^*$ is called strongly universal if for any stationary ergodic process the inequality

$$\limsup_{n\to\infty} \frac{1}{n}\,|B(X_1^n)| \le h$$

holds with probability 1.

5 Strongly universal codes are better

Theorem. Let code $B$ be strongly universal. If there exists a constant $K$ such that $|B(x_1^n)| \le Kn$ for each string $x_1^n$, then the code $B$ is weakly universal.

6 Compression and statistics

The problem of universal compression falls within the scope of statistics. Indeed, statisticians are interested in identifying parameters of a stochastic process based on data typical of that process. The entropy rate of an ergodic process is an example of such a parameter. When we have a universal code, we may estimate the entropy rate as the compression rate.

7 Lempel-Ziv code

The code was derived by Abraham Lempel (born 1936) and Jacob Ziv (born 1931) in 1977 and is partly implemented in gzip and compress.

Definition (LZ code). For simplicity of the algorithm description we assume that the compressed data are binary sequences, that is, $\mathbb{X} = \{0,1\}$. The Lempel-Ziv compression algorithm is as follows.

1. The compressed sequence is parsed into a sequence of shortest phrases that have not appeared before (except for the last phrase). For example, the sequence 001010010011100... is split into phrases 0, 01, 010, 0100, 1, 11, 00, ...
2. Each phrase is then described using a binary index of the longest prefix that appeared earlier and a single bit that follows that prefix. For the considered sequence, this representation is as follows: (0, 0)(1, 1)(10, 0)(11, 0)(0, 1)(101, 1)(1, 0).
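The two steps of the algorithm can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in gzip or compress; the function name `lz_parse` is hypothetical, the input is assumed to be a string of the characters "0" and "1", and prefix indices are returned as integers (index 0 stands for the empty prefix) rather than as binary strings.

```python
def lz_parse(bits):
    """Parse a bit string into shortest phrases not seen before.

    Returns (phrases, pairs), where each pair is
    (index of the longest previously seen prefix, following bit).
    Index 0 denotes the empty prefix; phrase j gets index j.
    """
    index = {"": 0}          # phrase -> its index
    phrases, pairs = [], []
    current = ""
    for b in bits:
        candidate = current + b
        if candidate in index:
            current = candidate            # still a known phrase, keep extending
        else:
            index[candidate] = len(index)  # new shortest unseen phrase
            phrases.append(candidate)
            pairs.append((index[current], b))
            current = ""
    if current:
        # Leftover last phrase: it repeats an earlier phrase.
        phrases.append(current)
        pairs.append((index[current[:-1]], current[-1]))
    return phrases, pairs

if __name__ == "__main__":
    phrases, pairs = lz_parse("001010010011100")
    print(phrases)
    print([(format(i, "b"), b) for i, b in pairs])
```

Applied to the bit string 001010010011100, the sketch yields the phrases 0, 01, 010, 0100, 1, 11, 00, with prefix indices 0, 1, 10, 11, 0, 101, 1 in binary.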

8 The length of the LZ code

Let $C_n$ be the number of phrases in the compressed block $X_1^n$. If we know $C_n$, we need $\log C_n$ bits to identify the prefix index for each phrase and 1 bit to describe the following bit. Thus the LZ code uses

$$|B(X_1^n)| = C_n\,[\log C_n + O(1)]$$

bits in total.

A splitting of a sequence into distinct phrases will be called a distinct parsing of the sequence.

Theorem. Let $(X_i)_{i=1}^{\infty}$ be a stationary ergodic process and let $C_n$ be the number of phrases in a distinct parsing of block $(X_1, X_2, \dots, X_n)$. With probability 1 we have

$$\limsup_{n\to\infty} \frac{C_n\,[\log C_n + O(1)]}{n} \le h.$$

Remark: Hence the LZ code is strongly universal. It can also be shown that the LZ code is weakly universal.
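The code length formula suggests a simple entropy rate estimator: count the phrases and divide $C_n(\lceil \log_2 C_n \rceil + 1)$ by $n$. The sketch below makes the $O(1)$ term concrete by charging exactly one extra bit per phrase, which is an assumption of this illustration; the names `count_phrases`, `lz_code_length` and `lz_rate` are hypothetical.

```python
import math
import random

def count_phrases(bits):
    """Number of phrases C_n in the LZ parsing of a bit string."""
    seen = set()
    current = ""
    c = 0
    for b in bits:
        current += b
        if current not in seen:   # shortest phrase not seen before
            seen.add(current)
            c += 1
            current = ""
    if current:                   # leftover last phrase (a repeat)
        c += 1
    return c

def lz_code_length(bits):
    """C_n * (ceil(log2 C_n) + 1) bits: a prefix index plus one bit per phrase."""
    c = count_phrases(bits)
    index_bits = (c - 1).bit_length()   # equals ceil(log2 c) for c >= 1
    return c * (index_bits + 1)

def lz_rate(bits):
    """Code length per input symbol, an estimate of the entropy rate."""
    return lz_code_length(bits) / len(bits)

if __name__ == "__main__":
    random.seed(0)
    p = 0.1
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    sample = "".join("1" if random.random() < p else "0" for _ in range(100000))
    print("entropy rate:", h, "LZ estimate:", lz_rate(sample))
```

On a highly repetitive input such as a long run of zeros the rate is small, while on random input it approaches the entropy rate only slowly as the sequence grows.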

9 The first lemma

Lemma. The number of phrases $C_n$ in any distinct parsing of block $(X_1, X_2, \dots, X_n)$ satisfies the inequality

$$\limsup_{n\to\infty} \frac{C_n \log n}{n} \le 1.$$

10 Proof of the first lemma

Let

$$n_k = \sum_{j=1}^{k} j\,2^j = (k-1)\,2^{k+1} + 2$$

be the sum of lengths of distinct phrases that are not longer than $k$. The number of phrases $C_n$ in a distinct parsing will be maximal if the phrases are as short as possible. For $n_k \le n < n_{k+1}$ this happens if we take all phrases of length $\le k$ and $\delta/(k+1)$ phrases of length $k+1$, where $\delta = n - n_k$. Then

$$C_n \le \sum_{j=1}^{k} 2^j + \frac{\delta}{k+1} \le \frac{n_k}{k-1} + \frac{\delta}{k+1} \le \frac{n}{k-1}.$$

In the following we will provide a bound for $k$ given $n$. We have $n \ge n_k = (k-1)\,2^{k+1} + 2 \ge 2^k$, so $k \le \log n$. Moreover,

$$n < n_{k+1} = k\,2^{k+2} + 2 \le (\log n + 2)\,2^{k+2}.$$

Hence

$$k + 2 > \log \frac{n}{\log n + 2}.$$

Further transformations yield

$$k - 1 > \log n - \log(\log n + 2) - 3.$$

Hence we obtain the claim.
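The counting argument is easy to check by brute force. The sketch below verifies the closed form for $n_k$ and the bound $C_n \le n/(k-1)$ for the extremal parsing that uses the shortest possible distinct binary phrases. The function names are hypothetical, and the check is restricted to $n \ge 10$ so that $k \ge 2$ and the bound is well defined.

```python
def n_k(k):
    """Total length of all distinct binary phrases of length at most k."""
    return sum(j * 2 ** j for j in range(1, k + 1))

def n_k_closed_form(k):
    """Closed form (k - 1) * 2^(k+1) + 2 for the same sum."""
    return (k - 1) * 2 ** (k + 1) + 2

def largest_k(n):
    """Largest k with n_k <= n."""
    k = 0
    while n_k(k + 1) <= n:
        k += 1
    return k

def max_distinct_phrases(n):
    """Most distinct phrases that fit in total length n: take every phrase
    of length <= k, then as many phrases of length k + 1 as still fit."""
    k = largest_k(n)
    delta = n - n_k(k)
    return (2 ** (k + 1) - 2) + delta // (k + 1)

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        k = largest_k(n)
        print(n, k, max_distinct_phrases(n), n / (k - 1))
```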

11 Ziv inequality

Let $P_k$ denote the measure of the $k$-th order Markov approximation of the process $(X_i)_{i=-\infty}^{\infty}$. That is,

$$P_k(X_1^n \mid X_{-k+1}^{0}) := \prod_{i=1}^{n} P(X_i \mid X_{i-k}^{i-1}).$$

Moreover, assume that sequence $(X_1, X_2, \dots, X_n)$ is parsed into $C$ distinct phrases $(Y_1, Y_2, \dots, Y_C)$. Let $W_i$ denote the $k$ bits preceding $Y_i$. Next, let $C_{lw}$ denote the number of phrases $Y_i$ that have length $l$ and context $W_i = w$.

Lemma (Ziv inequality). We have the inequality

$$\log P_k(X_1, X_2, \dots, X_n \mid W_1) \le -\sum_{l,w} C_{lw} \log C_{lw}.$$

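12 Proof of Ziv inequality

Proof. Observe that

$$\begin{aligned}
\log P_k(X_1, X_2, \dots, X_n \mid W_1) &= \sum_{j=1}^{C} \log P_k(Y_j \mid W_j) \\
&= \sum_{l,w} C_{lw} \sum_{j:\,|Y_j| = l,\, W_j = w} \frac{1}{C_{lw}} \log P_k(Y_j \mid W_j) \\
&\le \sum_{l,w} C_{lw} \log \left( \sum_{j:\,|Y_j| = l,\, W_j = w} \frac{1}{C_{lw}}\, P_k(Y_j \mid W_j) \right),
\end{aligned}$$

where the inequality follows from the Jensen inequality because the logarithm function is concave. Because the phrases $Y_j$ under the sum are distinct, we have

$$\sum_{j:\,|Y_j| = l,\, W_j = w} P_k(Y_j \mid W_j) \le 1.$$

Hence the claim follows.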

13 Third lemma

Lemma. Let $L$ be a nonnegative random variable taking values in integers and having expectation $\mathbf{E}\,L$. The entropy $H(L)$ is bounded by the inequality

$$H(L) \le (\mathbf{E}\,L + 1) \log(\mathbf{E}\,L + 1) - \mathbf{E}\,L \log \mathbf{E}\,L.$$

The proof of this lemma will be discussed after the lecture on maximum entropy modeling as an easy exercise.
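A quick numerical check illustrates the bound. The sketch below, with hypothetical function names and logarithms taken to base 2, compares the entropy of two distributions on nonnegative integers with the right-hand side of the lemma. The geometric distribution is included because it is the maximum entropy distribution on the nonnegative integers for a given mean, so the bound should be tight for it; the infinite sum is truncated, which is an approximation.

```python
import math

def entropy(pmf):
    """Shannon entropy in bits of a probability mass function given as a list."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def third_lemma_bound(mean):
    """(E L + 1) log(E L + 1) - E L log(E L), in bits, for E L > 0."""
    return (mean + 1) * math.log2(mean + 1) - mean * math.log2(mean)

def geometric_pmf(mean, terms=2000):
    """P(L = l) = (1 - q) q^l on l = 0, 1, ... with q = mean / (mean + 1),
    truncated after `terms` values (the neglected tail is negligible)."""
    q = mean / (mean + 1)
    return [(1 - q) * q ** l for l in range(terms)]

if __name__ == "__main__":
    uniform = [0.1] * 10                     # uniform on {0, ..., 9}, mean 4.5
    print(entropy(uniform), third_lemma_bound(4.5))
    print(entropy(geometric_pmf(3.0)), third_lemma_bound(3.0))
```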

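14 Proof that LZ code is universal

Let $L$ and $W$ be random variables such that

$$P(L = l, W = w) = \frac{C_{lw}}{C}.$$

The expectation of $L$ is

$$\mathbf{E}\,L = \sum_{l,w} \frac{l\,C_{lw}}{C} = \frac{n}{C}.$$

Hence by the third lemma, we obtain

$$H(L) \le (\mathbf{E}\,L + 1) \log(\mathbf{E}\,L + 1) - \mathbf{E}\,L \log \mathbf{E}\,L = \log\left(\frac{n}{C} + 1\right) + \frac{n}{C} \log\left(\frac{C}{n} + 1\right).$$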

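15 Proof that LZ code is universal (continued)

On the other hand, $H(W) \le k$, so

$$H(L, W) \le H(L) + H(W) \le \log\left(\frac{n}{C} + 1\right) + \frac{n}{C} \log\left(\frac{C}{n} + 1\right) + k.$$

Then by the first lemma, we have

$$\lim_{n\to\infty} \frac{C}{n}\, H(L, W) = 0.$$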

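16 Proof that LZ code is universal (finished)

Now using the first lemma again, the Ziv inequality, and the ergodic theorem, we obtain

$$\begin{aligned}
\limsup_{n\to\infty} \frac{C\,[\log C + O(1)]}{n} &= \limsup_{n\to\infty} \frac{C \log C}{n} \\
&= \limsup_{n\to\infty} \left[ \frac{C \log C}{n} - \frac{C}{n}\, H(L, W) \right] \\
&= \limsup_{n\to\infty} \left[ \frac{C}{n} \log C + \frac{C}{n} \sum_{l,w} \frac{C_{lw}}{C} \log \frac{C_{lw}}{C} \right] \\
&= \limsup_{n\to\infty} \frac{1}{n} \sum_{l,w} C_{lw} \log C_{lw} \\
&\le \lim_{n\to\infty} \left[ -\frac{1}{n} \log P_k(X_1^n \mid X_{-k+1}^{0}) \right] \\
&= \lim_{n\to\infty} \left[ -\frac{1}{n} \sum_{i=1}^{n} \log P(X_i \mid X_{i-k}^{i-1}) \right] = H(X_i \mid X_{i-k}^{i-1})
\end{aligned}$$

with probability 1. This inequality holds for any $k$. Considering $k \to \infty$, we obtain the claim.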

17 Motivation for the R measure

We can estimate the entropy rate of a stationary process as the length of the Lempel-Ziv code for a sequence of symbols drawn from the process, divided by the sequence length. This method of estimation is far from satisfactory, since the estimate of the entropy rate converges very slowly as a function of the sequence length. We will present a method which seems to be better in the case of empirical sources such as natural language.

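18 R measure

Let the frequency of substring $w_1^k$ in string $z_1^n \in \{0, 1, \dots, D-1\}^n$ be

$$c(w_1^k \mid z_1^n) = \sum_{i=0}^{n-k} \mathbf{1}\{ w_1^k = z_{i+1}^{i+k} \}.$$

Definition (R measure). Define conditional probabilities $B(x_{n+1} \mid x_1^n, -1) = 1/D$ and

$$B(x_{n+1} \mid x_1^n, k) = \frac{c(x_{n+1-k}^{n+1} \mid x_1^n) + B(x_{n+1} \mid x_1^n, k-1)}{c(x_{n+1-k}^{n} \mid x_1^{n-1}) + 1}.$$

We write $B(x_1^n, k) = \prod_{i=1}^{n} B(x_i \mid x_1^{i-1}, k)$. Let $p_k \in (0, 1)$ satisfy $\sum_{k=-1}^{\infty} p_k = 1$. The R measure is

$$Q(x_1^n) = \sum_{k=-1}^{\infty} p_k\, B(x_1^n, k).$$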
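The recursion can be turned into a short, unoptimized Python sketch. Several choices here are assumptions of the illustration and not part of the definition: strings are Python `str` objects over an alphabet of $D$ characters, the weights are $p_k = 1/(k+2) - 1/(k+3)$ for $k \ge -1$ (they are positive and sum to 1), the recursion stops once the context would be longer than the available past, and the infinite sum over $k$ is truncated at $k = n$, where $B(x_1^n, k)$ no longer changes. All function names are hypothetical.

```python
def count(w, z):
    """c(w | z): number of (possibly overlapping) occurrences of w in z."""
    k = len(w)
    return sum(1 for i in range(len(z) - k + 1) if z[i:i + k] == w)

def count_followed(ctx, past):
    """Occurrences of ctx in past that are followed by one more symbol,
    i.e. occurrences of ctx in past with its last symbol removed."""
    j = len(ctx)
    return sum(1 for i in range(len(past) - j) if past[i:i + j] == ctx)

def cond_prob(x_next, past, k, D):
    """B(x_next | past, k), computed by the recursion starting at order -1."""
    b = 1.0 / D                                   # order -1: uniform
    for j in range(0, min(k, len(past)) + 1):     # orders 0, 1, ..., k
        ctx = past[len(past) - j:]                # last j symbols of the past
        b = (count(ctx + x_next, past) + b) / (count_followed(ctx, past) + 1)
    return b

def block_prob(x, k, D):
    """B(x, k): product of the conditional probabilities along x."""
    prob = 1.0
    for i in range(len(x)):
        prob *= cond_prob(x[i], x[:i], k, D)
    return prob

def r_measure(x, D):
    """Q(x) with the assumed weights p_k = 1/(k+2) - 1/(k+3), k >= -1."""
    top = len(x)                                  # B(x, k) is constant for k >= top
    total, mass = 0.0, 0.0
    for k in range(-1, top + 1):
        p_k = 1.0 / (k + 2) - 1.0 / (k + 3)
        total += p_k * block_prob(x, k, D)
        mass += p_k
    return total + (1.0 - mass) * block_prob(x, top, D)   # remaining weight

if __name__ == "__main__":
    print(r_measure("0000000000", 2), r_measure("0110100110", 2))
```

Because every conditional distribution sums to 1 over the next symbol, the values $Q(x_1^n)$ sum to 1 over all strings of a fixed length, which is a convenient sanity check for any implementation.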

19 Universality of R measure

A probability distribution $Q$ is called weakly universal if for any stationary process (not necessarily ergodic) we have

$$\lim_{n\to\infty} \frac{1}{n}\,\mathbf{E}\,[-\log Q(X_1^n)] = h.$$

A probability distribution $Q$ is called strongly universal if for any stationary ergodic process the inequality

$$\limsup_{n\to\infty} \frac{1}{n}\,[-\log Q(X_1^n)] \le h$$

holds with probability 1.

Theorem. The R measure is both strongly and weakly universal.

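20 Proof of universality

Let $P$ be a stationary ergodic distribution. Since the alphabet of $X_i$ is finite, by the ergodic theorem the differences

$$B(X_n \mid X_1^{n-1}, k) - P(X_n \mid X_{n-k}^{n-1})$$

converge to 0 with $P$-probability 1. Hence

$$\lim_{n\to\infty} \frac{1}{n}\,[-\log B(X_1^n, k)] = \lim_{n\to\infty} \frac{1}{n} \left[ -\sum_{i=k+1}^{n} \log P(X_i \mid X_{i-k}^{i-1}) \right].$$

Applying the ergodic theorem again, we obtain

$$\lim_{n\to\infty} \frac{1}{n} \left[ -\sum_{i=k+1}^{n} \log P(X_i \mid X_{i-k}^{i-1}) \right] = \mathbf{E}\left[ -\log P(X_{k+1} \mid X_1^k) \right].$$

Hence

$$\limsup_{n\to\infty} \frac{1}{n}\,[-\log Q(X_1^n)] \le \inf_{k \in \mathbb{N}} \lim_{n\to\infty} \frac{1}{n}\,[-\log B(X_1^n, k)] = \inf_{k \in \mathbb{N}} \mathbf{E}\left[ -\log P(X_{k+1} \mid X_1^k) \right] = h.$$

Hence the distribution $Q$ is strongly universal. Since $-\log Q(X_1^n) \le -\log p_{-1} + n \log D$, distribution $Q$ is also weakly universal.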

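21 The R measure is effectively computable

Denote the maximal length of a substring that appears at least twice in $z_1^n$ as

$$L(z_1^n) := \max\left\{ k : \exists\, w_1^k : c(w_1^k \mid z_1^n) > 1 \right\}.$$

For $k > L(x_1^n)$,

$$B(x_1^n, k) = B(x_1^n, k-1).$$

Hence the R measure is

$$Q(x_1^n) = \sum_{k=-1}^{L(x_1^n)} p_k\, B(x_1^n, k) + \left( 1 - \sum_{k=-1}^{L(x_1^n)} p_k \right) B(x_1^n, L(x_1^n)).$$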
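The truncation point $L(z_1^n)$ is the length of the longest repeated substring, which a direct search can find. The sketch below is a simple quadratic-time illustration with a hypothetical function name; occurrences are allowed to overlap, matching the definition of the frequency $c$, and the function returns 0 when no nonempty substring repeats.

```python
def max_repeat_length(z):
    """L(z): largest k such that some substring of length k occurs
    at least twice in z (occurrences may overlap); 0 if none does."""
    n = len(z)
    best = 0
    for k in range(1, n):
        seen = set()
        repeated = False
        for i in range(n - k + 1):
            w = z[i:i + k]
            if w in seen:
                repeated = True
                break
            seen.add(w)
        if not repeated:
            # A repeat of length k + 1 would contain a repeat of length k,
            # so no longer repeat exists either.
            break
        best = k
    return best

if __name__ == "__main__":
    print(max_repeat_length("0100101"))
```

Since a context of length $k > L$ ending at any position has never occurred earlier, the counts in the recursion vanish and the order-$k$ estimate equals the order-$(k-1)$ estimate, which is why the infinite sum collapses to finitely many terms.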
