UC Berkeley CS 170: Efficient Algorithms and Intractable Problems          Handout 17
Lecturer: David Wagner                                                  April 3, 2003

Notes 17 for CS 170

1  The Lempel-Ziv algorithm

There is a sense in which Huffman coding was optimal, but this is under several assumptions:

1. The compression is lossless, i.e., uncompressing the compressed file yields exactly the original file. When lossy compression is permitted, as for video, other algorithms can achieve much greater compression, and this is a very active area of research because people want to be able to send video and audio over the Web.

2. We know all the frequencies f(i) with which each character appears. How do we get this information? We could make two passes over the data, the first to compute the f(i), and the second to encode the file. But this can be much more expensive than passing over the data once for large files residing on disk or tape. One way to do just one pass over the data is to assume that the fractions f(i)/n of each character in the file are similar to files you have compressed before. For example, you could assume all Java programs (or English text, or PowerPoint files, or ...) have about the same fractions of characters appearing. A second, cleverer way is to estimate the fractions f(i)/n on the fly as you process the file. One can make Huffman coding adaptive this way.

3. We know the set of characters (the alphabet) appearing in the file. This may seem obvious, but there is a lot of freedom of choice. For example, the alphabet could be the characters on a keyboard, or it could be the keywords and variable names appearing in a program. To see what difference this can make, suppose we have a file consisting of n strings "aaaa" and n strings "bbbb" concatenated in some order. If we choose the alphabet {a, b} then 8n bits are needed to encode the file. But if we choose the alphabet {aaaa, bbbb} then only 2n bits are needed.

Picking the correct alphabet turns out to be crucial in practical compression algorithms. Both the UNIX compress and GNU gzip algorithms use a greedy algorithm due to Lempel and Ziv to compute a good alphabet in one pass while compressing. Here is how it works.

If s and t are two bit strings, we will use the notation s ◦ t to mean the bit string gotten by concatenating s and t. We let f be the file we want to compress, and think of it just as a string of bits, that is, 0's and 1's. We will build an alphabet A of common bit strings encountered in f, and use it to compress f. Given A, we will break f into shorter bit strings like

    f = A(1) ◦ 0 ◦ A(2) ◦ 1 ◦ ... ◦ A(7) ◦ 0 ◦ ... ◦ A(5) ◦ 1 ◦ ... ◦ A(i) ◦ j ◦ ...

and encode this by

    1 ◦ 0 ◦ 2 ◦ 1 ◦ ... ◦ 7 ◦ 0 ◦ ... ◦ 5 ◦ 1 ◦ ... ◦ i ◦ j ◦ ...

    F = 0 0 0 1 1 1 1 0 1 0 1 1 0 1 0 0 0

    A(1) = A(0) ◦ 0 = 0
    A(2) = A(1) ◦ 0 = 00
    A(3) = A(0) ◦ 1 = 1
    A(4) = A(3) ◦ 1 = 11
    A(5) = A(3) ◦ 0 = 10
    A(6) = A(5) ◦ 1 = 101      <- set A is full
           A(6) ◦ 0 = 1010     (not added)
           A(1) ◦ 0 = 00       (not added)

    Encoded F = (0,0), (1,0), (0,1), (3,1), (3,0), (5,1), (6,0), (1,0)
              = 0000 0010 0001 0111 0110 1011 1100 0010

    Figure 1: An example of the Lempel-Ziv algorithm.

The indices i of A(i) are in turn encoded as fixed-length binary integers, and the bits j are just bits. Given the fixed length (say r) of the binary integers, we decode by taking every group of r + 1 bits of a compressed file, using the first r bits to look up a string in A, and concatenating the last bit. So when storing (or sending) an encoded file, a header containing A is also stored (or sent). Notice that while Huffman's algorithm encodes blocks of fixed size into binary sequences of variable length, Lempel-Ziv encodes blocks of varying length into blocks of fixed size.

Here is the algorithm for encoding, including building A. Typically a fixed size is available for A, and once it fills up, the algorithm stops looking for new characters.

    A = { "" }    ... start with an alphabet containing only the empty string
    i = 0         ... points to the next place in file f to start encoding
    repeat
        find the A(k) in the current alphabet that matches as many leading bits
        of f_i f_{i+1} f_{i+2} ... as possible
                  ... initially only A(0) = empty string matches
        let b be the number of bits in A(k)
        if A is not full, add A(k) ◦ f_{i+b} to A
                  ... f_{i+b} is the first bit unmatched by A(k)
        output k ◦ f_{i+b}
        i = i + b + 1
    until i > length(f)

Note that A is built greedily, based on the beginning of the file. Thus there are no optimality guarantees for this algorithm. It can perform badly if the nature of the file changes substantially after A is filled up; however, the algorithm makes only one pass through the file. (There are other possible implementations: A may be unbounded, and the index k would then be encoded with a variable-length code itself.) In Figure 1 there is an example of the algorithm running, where the alphabet A fills up after 6 characters are inserted. In this small example no compression is obtained, but if A were large, and the same long bit strings appeared frequently, compression would be substantial.
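For concreteness, here is a short Python sketch of the encoder above. The name lz_encode and the parameters r and capacity are invented for this sketch, not part of the notes; the dictionary is declared full once it holds capacity entries, and capacity = 7 (the empty string plus six inserted strings) reproduces Figure 1. The sketch also resolves a corner case the pseudocode leaves implicit: the matched A(k) is never allowed to consume the entire rest of the file, so an unmatched literal bit always follows, as in the last step of Figure 1.

    def lz_encode(f, r=3, capacity=7):
        # f        : the file, as a string of '0'/'1' characters
        # r        : bits used for each dictionary index in the packed output
        # capacity : maximum number of entries in A, counting A(0) = ""
        # Returns the list of (k, bit) pairs and the packed output bits.
        A = [""]                                 # A(0) is the empty string
        pairs = []
        i = 0
        while i < len(f):
            # longest A(k) matching leading bits of f[i:] while leaving at least
            # one bit over, so a literal unmatched bit always follows
            k = max((j for j in range(len(A))
                     if len(A[j]) < len(f) - i and f.startswith(A[j], i)),
                    key=lambda j: len(A[j]))
            b = len(A[k])
            bit = f[i + b]                       # first bit not matched by A(k)
            if len(A) < capacity:                # if A is not full, add A(k) followed by that bit
                A.append(A[k] + bit)
            pairs.append((k, bit))
            i = i + b + 1
        packed = "".join(format(k, "0{}b".format(r)) + bit for k, bit in pairs)
        return pairs, packed

    pairs, packed = lz_encode("00011110101101000")   # the 17-bit file of Figure 1
    print(pairs)    # [(0,'0'), (1,'0'), (0,'1'), (3,'1'), (3,'0'), (5,'1'), (6,'0'), (1,'0')]
    print(packed)   # 00000010000101110110101111000010  (the 32 bits of Figure 1)

Packing each pair into r + 1 = 4 bits turns the 17-bit input into 32 bits, which matches the observation that this small example achieves no compression.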

The gzip manpage claims that source code and English text are typically compressed 60%-70%. To observe an example, we took a LaTeX file of 74,892 bytes. Running Huffman's algorithm, with bytes used as blocks, we could have compressed the file to 36,757 bytes, plus the space needed to specify the code. The Unix program compress produced an encoding of size 34,385 bytes, while gzip produced an encoding of size 22,815 bytes.

2  Lower bounds on data compression

2.1  Simple Results

How much can we compress a file without loss? We present some results that give lower bounds for any compression algorithm. Let us start from a worst-case analysis.

Theorem 1  Let C : {0,1}^n → {0,1}^* be an encoding algorithm that allows lossless decoding (i.e., let C be an injective function mapping n bits into a sequence of bits). Then there is a file f ∈ {0,1}^n such that |C(f)| ≥ n.

In words, for any lossless compression algorithm there is always a file that the algorithm is unable to compress.

Proof: Suppose, by contradiction, that there is a compression algorithm C such that, for all f ∈ {0,1}^n, |C(f)| ≤ n − 1. Then the set {C(f) : f ∈ {0,1}^n} has 2^n elements because C is injective, but it is also a set of strings of length at most n − 1, and so it has at most Σ_{l=1}^{n−1} 2^l = 2^n − 2 elements, which gives a contradiction.

While the previous analysis showed the existence of incompressible files, the next theorem shows that random files are hard to compress, thus giving an average-case analysis.

Theorem 2  Let C : {0,1}^n → {0,1}^* be an encoding algorithm that allows lossless decoding (i.e., let C be an injective function mapping n bits into a sequence of bits). Let f ∈ {0,1}^n be a sequence of n randomly and uniformly selected bits. Then, for every t,

    Pr[ |C(f)| ≤ n − t ] ≤ 1 / 2^{t−1}.

For example, there is less than a chance in a million of compressing an input file of n bits into an output file of length n − 21, and less than a chance in eight million that the output will be at least 3 bytes shorter than the input.

Proof: We can write

    Pr[ |C(f)| ≤ n − t ] = |{ f : |C(f)| ≤ n − t }| / 2^n.

Regarding the numerator, it is the size of a set that contains only strings of length n − t or less, so it is no more than Σ_{l=1}^{n−t} 2^l, which is at most 2^{n−t+1} − 2 < 2^{n−t+1} = 2^n / 2^{t−1}.
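The counting in these two proofs is easy to check numerically. Here is a minimal sketch (the value n = 100 is an arbitrary choice for illustration): it counts the binary strings of length at most n − t, which bounds how many of the 2^n input files any injective C can map to outputs that are at least t bits shorter.

    # Upper-bound the fraction of n-bit files compressible by at least t bits,
    # for any injective encoding C, and compare with the 1/2**(t-1) bound.
    n = 100
    for t in (1, 2, 8, 21, 24):
        short_outputs = sum(2 ** l for l in range(1, n - t + 1))  # strings of length 1..n-t
        print(t, short_outputs / 2 ** n, 1 / 2 ** (t - 1))
    # t = 21 gives a bound just under one in a million, and t = 24 (three bytes)
    # gives a bound just under one in eight million, matching the example above.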

The following result is harder to prove, and we will just state it.

Theorem 3  Let C : {0,1}^n → {0,1}^* be a prefix-free encoding, and let f be a random file of n bits. Then E[ |C(f)| ] ≥ n.

This means that, from the average point of view, the optimum prefix-free encoding of a random file is just to leave the file as it is.

In practice, however, files are not completely random. Once we formalize the notion of a not-completely-random file, we can show that some compression is possible, but not below a certain limit. First, we observe that even if not all n-bit strings are possible files, we still have lower bounds.

Theorem 4  Let F ⊆ {0,1}^n be a set of possible files, and let C : {0,1}^n → {0,1}^* be an injective function. Then:

1. There is a file f ∈ F such that |C(f)| ≥ log_2 |F|.

2. If we pick a file f uniformly at random from F, then for every t we have

       Pr[ |C(f)| ≤ (log_2 |F|) − t ] ≤ 1 / 2^{t−1}.

3. If C is prefix-free, then when we pick a file f uniformly at random from F we have E[ |C(f)| ] ≥ log_2 |F|.

Proof: Parts 1 and 2 are proved with the same ideas as in Theorem 1 and Theorem 2. Part 3 has a more complicated proof that we omit.

2.2  Introduction to Entropy

Suppose now that we are in the following setting:

- the file contains n characters;
- there are c different characters possible;
- character i has probability p(i) of appearing in the file.

What can we say about the probable and expected length of the output of an encoding algorithm? Let us first do a very rough approximate calculation. When we pick a file according to the above distribution, very likely there will be about n·p(i) characters equal to i. Each file with these typical frequencies has a probability of about p = Π_i p(i)^{n·p(i)} of being generated. Since files with typical frequencies make up almost all the probability mass, there must be about 1/p = Π_i (1/p(i))^{n·p(i)} files with typical frequencies.

Now we are in a setting which is similar to the one of parts 2 and 3 of Theorem 4, where F is the set of files with typical frequencies. We then expect the encoding to be of length at least

    log_2 Π_i (1/p(i))^{n·p(i)} = n · Σ_i p(i) log_2 (1/p(i)).

The quantity Σ_i p(i) log_2 (1/p(i)) is the expected number of bits that it takes to encode each character, and is called the entropy of the distribution over the characters.

The notion of entropy, the discovery of several of its properties, (a formal version of) the calculation above, as well as an (inefficient) optimal compression algorithm, and much, much more, are due to Shannon, and appeared in the late 1940s in one of the most influential research papers ever written.
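As a quick illustration of the rough calculation above, the following sketch computes the entropy for an example distribution (the numbers p = [0.5, 0.25, 0.125, 0.125] and n = 10000 are assumptions made up for illustration, not taken from the notes):

    from math import log2

    p = [0.5, 0.25, 0.125, 0.125]          # assumed distribution over c = 4 characters
    n = 10000                              # file length in characters

    H = sum(q * log2(1 / q) for q in p)    # entropy: bits per character
    print(H)                               # 1.75 for this p
    print(n * H)                           # ~17500: log2 of the number of typical files,
                                           # hence roughly the least number of bits needed
    print(n * log2(len(p)))                # 20000: the naive 2 bits/character encoding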

2.3  A Calculation

Making the above calculation precise would be long, and involve a lot of ε's. Instead, we will formalize a slightly different setting. Consider the set F of files such that:

- the file contains n characters;
- there are c different characters possible;
- character i occurs n·p(i) times in the file.

We will show that F contains roughly 2^{n Σ_i p(i) log_2 (1/p(i))} files, and so a random element of F cannot be compressed to fewer than n Σ_i p(i) log_2 (1/p(i)) bits. Picking a random element of F is almost, but not quite, the setting that we described before, but it is close enough, and interesting in its own right.

Let us call f(i) = n·p(i) the number of occurrences of character i in the file. We need two results, both from Math 55. The first gives a formula for |F|:

    |F| = n! / ( f(1)! · f(2)! · · · f(c)! )

Here is a sketch of the proof of this formula. There are n! permutations of n characters, but many are the same because there are only c different characters. In particular, the f(1) appearances of character 1 are the same, so all f(1)! orderings of these locations are identical. Thus we need to divide n! by f(1)!. The same argument leads us to divide by all the other f(i)!.

Now we have an exact formula for |F|, but it is hard to interpret, so we replace it by a simpler approximation. We need a second result from Math 55, namely Stirling's formula for approximating n!:

    n! ≈ √(2π) · n^{n + .5} · e^{−n}

This is a good approximation in the sense that the ratio n! / [√(2π) n^{n+.5} e^{−n}] approaches 1 quickly as n grows. (In Math 55 we motivated this formula by the approximation log n! = Σ_{i=2}^{n} log i ≈ ∫_1^n log x dx.) We will use Stirling's formula in the form

    log_2 n! ≈ log_2 √(2π) + (n + .5) log_2 n − n log_2 e.
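Before going through the algebra, here is a small numerical sanity check of these two facts (the distribution and the length n = 800 are assumptions, chosen so that every f(i) = n·p(i) is an integer): it compares the exact value of (1/n) log_2 |F|, the Stirling-based estimate, and the entropy Σ_i p(i) log_2 (1/p(i)) that the calculation below arrives at.

    from math import factorial, log2, pi, e

    p = [0.5, 0.25, 0.125, 0.125]        # assumed example distribution
    n = 800                              # every f(i) = n * p(i) is an integer
    f = [int(n * q) for q in p]          # occurrence counts f(i)

    # exact: (1/n) * log2 |F|, with |F| = n! / (f(1)! ... f(c)!)
    size_F = factorial(n)
    for fi in f:
        size_F //= factorial(fi)
    exact = log2(size_F) / n

    # Stirling: log2 m! ~ log2 sqrt(2*pi) + (m + .5) log2 m - m log2 e
    def stirling_log2(m):
        return log2((2 * pi) ** 0.5) + (m + 0.5) * log2(m) - m * log2(e)

    approx = (stirling_log2(n) - sum(stirling_log2(fi) for fi in f)) / n
    H = sum(q * log2(1 / q) for q in p)

    print(exact, approx, H)   # the three values agree to within a few hundredths of a bit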

Stirling's formula is accurate for large arguments, so we will be interested in approximating log_2 |F| for large n. Furthermore, we will actually estimate (1/n) log_2 |F|, which can be interpreted as the average number of bits per character needed to send a long file. Here goes:

    (1/n) log_2 |F|
        = (1/n) log_2 ( n! / (f(1)! · · · f(c)!) )
        = (1/n) [ log_2 n! − Σ_{i=1}^{c} log_2 f(i)! ]
        ≈ (1/n) [ log_2 √(2π) + (n + .5) log_2 n − n log_2 e
                  − Σ_{i=1}^{c} ( log_2 √(2π) + (f(i) + .5) log_2 f(i) − f(i) log_2 e ) ]
        = (1/n) [ n log_2 n − Σ_i f(i) log_2 f(i)
                  + (1 − c) log_2 √(2π) + .5 log_2 n − .5 Σ_i log_2 f(i) ]
        = log_2 n − Σ_i (f(i)/n) log_2 f(i)
          + ((1 − c) log_2 √(2π)) / n + (.5 log_2 n) / n − (.5 Σ_{i=1}^{c} log_2 f(i)) / n

(the terms n log_2 e and Σ_i f(i) log_2 e cancel, since Σ_i f(i) = n). As n gets large, the three fractions on the last line above all go to zero: the first term looks like O(1/n), and the last two terms look like O((log_2 n)/n). This lets us simplify to get

    (1/n) log_2 |F| ≈ log_2 n − Σ_i (f(i)/n) log_2 f(i)
                    = log_2 n − Σ_i p(i) log_2 (n·p(i))
                    = log_2 n − Σ_i p(i) log_2 n − Σ_i p(i) log_2 p(i)
                    = Σ_i p(i) log_2 (1/p(i)),

using Σ_i p(i) = 1 in the last step. Normally, the quantity Σ_i p(i) log_2 (1/p(i)) is denoted by H.

How much more space can Huffman coding take to encode a file than Shannon's lower bound of n·H bits? A theorem of Gallager (1978) shows that at worst Huffman will take n(p_max + .086) bits more than n·H, where p_max is the largest of the p(i). But it often does much better. Furthermore, if we take blocks of k characters and encode them using Huffman's algorithm, then, for large k and for n tending to infinity, the average length of the encoding tends to n times the entropy.
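To see the Gallager bound in action, here is a minimal sketch (not from the notes; the distribution is an assumed example) that builds a Huffman code with Python's heapq module and checks that its expected length per character lies between the entropy H and H + p_max + .086:

    import heapq
    from math import log2

    def huffman_lengths(p):
        # Returns the codeword length of each symbol in a Huffman code for p.
        # Heap entries: (probability, tie-breaker, symbols in this subtree).
        heap = [(q, i, [i]) for i, q in enumerate(p)]
        heapq.heapify(heap)
        length = [0] * len(p)
        tie = len(p)
        while len(heap) > 1:
            q1, _, s1 = heapq.heappop(heap)
            q2, _, s2 = heapq.heappop(heap)
            for i in s1 + s2:          # each merge adds one bit to every symbol below it
                length[i] += 1
            heapq.heappush(heap, (q1 + q2, tie, s1 + s2))
            tie += 1
        return length

    p = [0.4, 0.3, 0.2, 0.05, 0.05]            # assumed example distribution
    lengths = huffman_lengths(p)
    avg = sum(q * l for q, l in zip(p, lengths))
    H = sum(q * log2(1 / q) for q in p)
    print(lengths, avg, H)                     # [1, 2, 3, 4, 4]  2.0  ~1.95
    print(H <= avg <= H + max(p) + 0.086)      # True: within Gallager's bound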