Algorithms in the Real World: Compression in the Real World

15-853: Algorithms in the Real World. Data Compression: Lectures 1 and 2.

Compression in the Real World

Generic File Compression
Files: gzip (LZ77), bzip (Burrows-Wheeler), BOA (PPM)
Archivers: ARC (LZW), PKZip (LZW+)
File systems: NTFS

Communication
Fax: ITU-T Group 3 (run-length + Huffman)
Modems: V.42bis protocol (LZW), MNP5 (run-length + Huffman)
Virtual Connections

Multimedia
Images: gif (LZW), jbig (context), jpeg-ls (residual), jpeg (transform + RL + arithmetic)
TV: HDTV (mpeg-4)
Sound: mp3

Other structures
Indexes: Google, Lycos
Meshes (for graphics): edgebreaker
Graphs
Databases

Compression Outline
Introduction: Lossless vs. lossy, model and coder, benchmarks
Information Theory: Entropy, etc.
Probability Coding: Huffman + Arithmetic Coding
Applications of Probability Coding: PPM + others
Lempel-Ziv Algorithms: LZ77, gzip, compress, ...
Other Lossless Algorithms: Burrows-Wheeler
Lossy algorithms for images: JPEG, MPEG, ...
Compressing graphs and meshes: BBK

Encoding/Decoding
We will use "message" in a generic sense to mean the data to be compressed.
Input Message -> Encoder -> Compressed Message -> Decoder -> Output Message
The encoder and decoder need to understand a common compressed format.

Lossless vs. Lossy
Lossless: Input message = Output message
Lossy: Input message ≠ Output message
Lossy does not necessarily mean loss of quality. In fact the output could be better than the input:
Drop random noise in images (dust on lens)
Drop background in music
Fix spelling errors in text, put it into better form
"Writing is the art of lossy text compression."

How much can we compress?
For lossless compression, assuming all input messages are valid, if even one string is compressed, some other must expand.

Model vs. Coder
To compress we need a bias on the probability of messages. The model determines this bias.
Messages -> Model -> Probabilities -> Coder -> Bits
Example models:
Simple: character counts, repeated strings
Complex: models of a human face
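
To make the counting argument concrete, here is a tiny sketch (not from the slides): there are 2^n distinct n-bit messages but only 2^n - 1 bit strings that are strictly shorter, so no lossless encoder can shrink every n-bit input.

```python
# Pigeonhole behind "if even one string is compressed, some other must expand".
n = 8
inputs = 2 ** n                                   # distinct n-bit messages: 256
shorter_outputs = sum(2 ** k for k in range(n))   # strings of length 0..n-1: 255
print(inputs, shorter_outputs)                    # one input always has nowhere shorter to go
```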

Quality of Compression
Runtime vs. compression vs. generality.
Several standard corpuses are used to compare algorithms, e.g. the Calgary Corpus: 2 books, 5 papers, 1 bibliography, 1 collection of news articles, 3 programs, 1 terminal session, 2 object files, 1 geophysical data set, 1 bitmap b/w image.
The Archive Comparison Test maintains a comparison of just about all algorithms publicly available.

Comparison of Algorithms
Program  Algorithm  Time     BPC   Score
RK       LZ + PPM   111+115  1.79  430
BOA      PPM Var.   94+97    1.91  407
PPMD     PPM        11+20    2.07  265
IMP      BW         10+3     2.14  254
BZIP     BW         20+6     2.19  273
GZIP     LZ77 Var.  19+5     2.59  318
LZ77     LZ77       ?        3.94  ?

Compression Outline
Introduction: Lossy vs. lossless, benchmarks, ...
Information Theory: Entropy, conditional entropy, entropy of the English language
Probability Coding: Huffman + Arithmetic Coding
Applications of Probability Coding: PPM + others
Lempel-Ziv Algorithms: LZ77, gzip, compress, ...
Other Lossless Algorithms: Burrows-Wheeler
Lossy algorithms for images: JPEG, MPEG, ...
Compressing graphs and meshes: BBK

Information Theory
An interface between modeling and coding.
Entropy: a measure of information content.
Conditional Entropy: information content based on a context.
Entropy of the English Language: how much information does each character in typical English text contain?

Entropy (Shannon 1948)
For a set of messages S with probability p(s), s ∈ S, the self information of s is
i(s) = log(1/p(s)) = -log p(s)
Measured in bits if the log is base 2. The lower the probability, the higher the information.
Entropy is the weighted average of self information:
H(S) = Σ_{s∈S} p(s) log(1/p(s))

Entropy Example
p(S) = {0.25, 0.25, 0.25, 0.125, 0.125}
H(S) = 3 × 0.25 log 4 + 2 × 0.125 log 8 = 2.25
p(S) = {0.5, 0.125, 0.125, 0.125, 0.125}
H(S) = 0.5 log 2 + 4 × 0.125 log 8 = 2
p(S) = {0.75, 0.0625, 0.0625, 0.0625, 0.0625}
H(S) = 0.75 log(4/3) + 4 × 0.0625 log 16 ≈ 1.3

Conditional Entropy
The conditional probability p(s|c) is the probability of s in a context c. The conditional self information is
i(s|c) = log(1/p(s|c)) = -log p(s|c)
The conditional information can be either more or less than the unconditional information.
The conditional entropy is the weighted average of the conditional self information:
H(S|C) = Σ_{c∈C} p(c) Σ_{s∈S} p(s|c) log(1/p(s|c))

Example of a Markov Chain
Two states w and b with transition probabilities p(w|w) = 0.9, p(b|w) = 0.1, p(w|b) = 0.2, p(b|b) = 0.8.
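
The entropy figures above are easy to verify numerically. A minimal sketch (not from the slides); for the Markov chain it weights each context by its stationary probability (w: 2/3, b: 1/3), which is an assumption since the slide only draws the chain.

```python
from math import log2

def H(ps):
    # Entropy in bits: weighted average of the self information -log2 p.
    return sum(p * log2(1 / p) for p in ps if p > 0)

print(H([0.25, 0.25, 0.25, 0.125, 0.125]))        # 2.25
print(H([0.5, 0.125, 0.125, 0.125, 0.125]))       # 2.0
print(H([0.75, 0.0625, 0.0625, 0.0625, 0.0625]))  # ~1.31

# Conditional entropy H(S|C) of the two-state chain, contexts weighted by
# the stationary distribution (assumed here): pi(w) = 2/3, pi(b) = 1/3.
contexts = [(2 / 3, [0.9, 0.1]), (1 / 3, [0.8, 0.2])]
print(sum(pc * H(cond) for pc, cond in contexts))  # ~0.55
```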

Entropy of the English Language
How can we measure the information per character?
ASCII code = 7
Entropy = 4.5 (based on character probabilities)
Huffman codes (average) = 4.7
Unix Compress = 3.5
Gzip = 2.6
Bzip = 1.9
Entropy = 1.3 (for "text compression test")
Must be less than 1.3 for the English language.

Shannon's Experiment
Asked humans to predict the next character given the whole previous text. He used these as conditional probabilities to estimate the entropy of the English language. The number of guesses required for the right answer:

# of guesses   1    2    3    4    5    >5
Probability    .79  .08  .03  .02  .02  .05

From the experiment he predicted H(English) = 0.6 - 1.3.

Compression Outline
Introduction: Lossy vs. lossless, benchmarks, ...
Information Theory: Entropy, etc.
Probability Coding: Prefix codes and relationship to entropy, Huffman codes, Arithmetic codes
Applications of Probability Coding: PPM + others
Lempel-Ziv Algorithms: LZ77, gzip, compress, ...
Other Lossless Algorithms: Burrows-Wheeler
Lossy algorithms for images: JPEG, MPEG, ...
Compressing graphs and meshes: BBK

Assumptions and Definitions
Communication (or a file) is broken up into pieces called messages.
Each message comes from a message set S = {s_1, ..., s_n} with a probability distribution p(s). Probabilities must sum to 1. The set can be infinite.
Code C(s): a mapping from a message set to codewords, each of which is a string of bits.
Message sequence: a sequence of messages.
Note: adjacent messages might be of different types and come from different probability distributions.
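
As a rough sanity check (this is not Shannon's actual bound derivation, which brackets the entropy from both sides), the entropy of the guess distribution alone already lands inside the quoted range:

```python
from math import log2

guess_probs = [0.79, 0.08, 0.03, 0.02, 0.02, 0.05]   # guesses 1, 2, 3, 4, 5, >5
H_guess = sum(p * log2(1 / p) for p in guess_probs)
print(round(H_guess, 2))                              # ~1.15 bits, within 0.6 - 1.3
```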

Discrete or Blended
We will consider two types of coding:
Discrete: each message is a fixed set of bits (Huffman coding, Shannon-Fano coding). For example, messages 1, 2, 3, 4 each get their own codeword.
Blended: bits can be shared among messages (Arithmetic coding). For example, a single bit string codes messages 1, 2, 3, and 4 together.

Uniquely Decodable Codes
A variable length code assigns a bit string (codeword) of variable length to every message value.
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence of bits 1011? Is it aba, ca, or ad?
A uniquely decodable code is a variable length code in which bit strings can always be uniquely decomposed into its codewords.

Prefix Codes
A prefix code is a variable length code in which no codeword is a prefix of another word.
e.g. a = 0, b = 110, c = 111, d = 10
All prefix codes are uniquely decodable.

Prefix Codes: as a tree
A prefix code can be viewed as a binary tree with message values at the leaves and 0s or 1s on the edges. For the code a = 0, b = 110, c = 111, d = 10: a hangs off the 0 edge of the root, d off 1-0, b off 1-1-0, and c off 1-1-1.
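
A brute-force sketch (assuming the codewords reconstructed above) that shows why the first code is not uniquely decodable while the prefix code is:

```python
def parses(bits, code):
    # All ways to split `bits` into codewords of `code` (exponential brute force).
    if not bits:
        return [""]
    out = []
    for sym, cw in code.items():
        if bits.startswith(cw):
            out += [sym + rest for rest in parses(bits[len(cw):], code)]
    return out

ambiguous = {'a': '1', 'b': '01', 'c': '101', 'd': '011'}
print(parses("1011", ambiguous))       # ['aba', 'ad', 'ca'] -- three decodings

prefix = {'a': '0', 'b': '110', 'c': '111', 'd': '10'}
print(parses("0110100111", prefix))    # ['abdac'] -- exactly one, as for any prefix code
```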

Some Prefix Codes for Integers

n   Binary   Unary    Gamma
1   ..001    0        0
2   ..010    10       100
3   ..011    110      101
4   ..100    1110     11000
5   ..101    11110    11001
6   ..110    111110   11010

(Gamma: the unary code for the number of binary digits of n, followed by those digits with the leading 1 removed.)
Many other fixed prefix codes: Golomb, phased-binary, subexponential, ...

Average Length
For a code C with associated probabilities p(c) the average length is defined as
l_a(C) = Σ_{c∈C} p(c) l(c)
where l(c) = length of the codeword c (a positive integer).
We say that a prefix code C is optimal if for all prefix codes C', l_a(C) ≤ l_a(C').

Relationship to Entropy
Theorem (lower bound): For any probability distribution p(S) with associated uniquely decodable code C,
H(S) ≤ l_a(C)
Theorem (upper bound): For any probability distribution p(S) with associated optimal prefix code C,
l_a(C) ≤ H(S) + 1

Kraft-McMillan Inequality
Theorem (Kraft-McMillan): For any uniquely decodable code C,
Σ_{c∈C} 2^(-l(c)) ≤ 1
Also, for any set of lengths L such that Σ_{l∈L} 2^(-l) ≤ 1, there is a prefix code C such that l(c_i) = l_i (i = 1, ..., |L|).
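
A sketch of the unary and gamma constructions behind the table (assuming the conventions above: unary(n) is n-1 ones followed by a zero), plus a Kraft-McMillan check on the resulting codeword lengths:

```python
def unary(n):
    # Unary code used in the table: n-1 ones followed by a terminating zero.
    return "1" * (n - 1) + "0"

def gamma(n):
    # Gamma code: unary code for the number of binary digits, then those digits minus the leading 1.
    b = bin(n)[2:]
    return unary(len(b)) + b[1:]

for n in range(1, 7):
    print(n, unary(n), gamma(n))

# Kraft-McMillan sum for the gamma codewords of 1..6: must be <= 1 for a prefix code.
print(sum(2.0 ** -len(gamma(n)) for n in range(1, 7)))   # 0.84375
```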

Proof of the Upper Bound (Part 1)
Assign each message a length: l(s) = ⌈log(1/p(s))⌉. We then have
Σ_{s∈S} 2^(-l(s)) = Σ_{s∈S} 2^(-⌈log(1/p(s))⌉) ≤ Σ_{s∈S} 2^(-log(1/p(s))) = Σ_{s∈S} p(s) = 1
So by the Kraft-McMillan inequality there is a prefix code with lengths l(s).

Proof of the Upper Bound (Part 2)
Now we can calculate the average length given l(s):
l_a(S) = Σ_{s∈S} p(s) l(s) = Σ_{s∈S} p(s) ⌈log(1/p(s))⌉ ≤ Σ_{s∈S} p(s) (1 + log(1/p(s))) = 1 + Σ_{s∈S} p(s) log(1/p(s)) = 1 + H(S)
And we are done.

Another Property of Optimal Codes
Theorem: If C is an optimal prefix code for the probabilities {p_1, ..., p_n} then p_i > p_j implies l(c_i) ≤ l(c_j).
Proof (by contradiction): Assume l(c_i) > l(c_j). Consider switching codewords c_i and c_j. If l_a is the average length of the original code, the length of the new code is
l_a' = l_a + p_j (l(c_i) - l(c_j)) + p_i (l(c_j) - l(c_i)) = l_a + (p_j - p_i)(l(c_i) - l(c_j)) < l_a
This contradicts the assumption that l_a was optimal.

Huffman Codes
Invented by Huffman as a class assignment in 1950.
Used in many, if not most, compression algorithms: gzip, bzip, jpeg (as option), fax compression, ...
Properties:
Generates optimal prefix codes
Cheap to generate codes
Cheap to encode and decode
l_a = H if probabilities are powers of 2
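
The construction in the proof can be checked numerically: assign each message the length ⌈log(1/p(s))⌉, confirm the Kraft sum stays at most 1, and confirm the resulting average length sits between H(S) and H(S) + 1. A minimal sketch using one of the earlier example distributions:

```python
from math import ceil, log2

p = [0.75, 0.0625, 0.0625, 0.0625, 0.0625]
lengths = [ceil(log2(1 / pi)) for pi in p]          # l(s) = ceil(log 1/p(s)) -> [1, 4, 4, 4, 4]
kraft = sum(2 ** -l for l in lengths)               # 0.75 <= 1, so such a prefix code exists
l_a = sum(pi * li for pi, li in zip(p, lengths))    # average length 1.75
H = sum(pi * log2(1 / pi) for pi in p)              # entropy ~1.31
print(kraft, H <= l_a <= H + 1)                     # 0.75 True
```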

Huffman Codes
Huffman Algorithm:
Start with a forest of trees, each consisting of a single vertex corresponding to a message s with weight p(s).
Repeat until one tree is left:
Select the two trees with minimum weight roots p_1 and p_2.
Join them into a single tree by adding a root with weight p_1 + p_2.

Example
p(a) = 0.1, p(b) = 0.2, p(c) = 0.2, p(d) = 0.5
Step 1: join a(.1) and b(.2) into a tree of weight (.3).
Step 2: join (.3) and c(.2) into a tree of weight (.5).
Step 3: join (.5) and d(.5) into the final tree of weight (1.0).
The resulting codewords are a = 000, b = 001, c = 01, d = 1.

Encoding and Decoding
Encoding: start at the leaf of the Huffman tree and follow the path to the root. Reverse the order of the bits and send.
Decoding: start at the root of the Huffman tree and take a branch for each bit received. When at a leaf, output the message and return to the root.
There are even faster methods that can process 8 or 32 bits at a time.

Huffman Codes are Optimal
Theorem: The Huffman algorithm generates an optimal prefix code.
Proof outline: induction on the number of messages n. Consider a message set S with n+1 messages.
1. Can make it so the two least probable messages of S are neighbors in the Huffman tree.
2. Replace the two messages with one message with probability p(m_1) + p(m_2), making S'.
3. Show that if S' is optimal, then S is optimal.
4. S' is optimal by induction.
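
A compact sketch of the algorithm above using a heap for the forest (not the lecture's own code). Tie-breaking may assign different 0/1 labels than the hand-drawn tree, but the codeword lengths, and hence the average length, come out the same as in the example.

```python
import heapq

def huffman(probs):
    # Forest of single-vertex trees; repeatedly join the two minimum-weight roots.
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tie = len(heap)                           # tie-breaker so the dicts are never compared
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        joined = {s: "0" + c for s, c in left.items()}
        joined.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p1 + p2, tie, joined))
        tie += 1
    return heap[0][2]

print(huffman({'a': 0.1, 'b': 0.2, 'c': 0.2, 'd': 0.5}))
# codeword lengths 3, 3, 2, 1 -- matching a=000, b=001, c=01, d=1 from the example
```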

Problem with Huffman Coding
Consider a message with probability 0.999. The self information of this message is
-log(0.999) = 0.00144
If we were to send 1000 such messages we might hope to use 1000 × 0.00144 ≈ 1.4 bits. Using Huffman codes we require at least one bit per message, so we would require 1000 bits.

Arithmetic Coding: Introduction
Allows "blending" of bits in a message sequence. Only requires 3 bits for the example above.
Can bound the total bits required based on the sum of self information:
l < 2 + Σ_{i=1}^{n} s_i
Used in PPM, JPEG/MPEG (as option), DMM.
More expensive than Huffman coding, but the integer implementation is not too bad.

Arithmetic Coding: Message Intervals
Assign each probability distribution to an interval range from 0 (inclusive) to 1 (exclusive), with
f(i) = Σ_{j=1}^{i-1} p(j)
e.g. for a = 0.2, b = 0.5, c = 0.3: f(a) = 0.0, f(b) = 0.2, f(c) = 0.7.
The interval for a particular message will be called the message interval (e.g. for b the interval is [0.2, 0.7)).

Arithmetic Coding: Sequence Intervals
Code a message sequence by composing intervals. For example, for bac: b narrows [0, 1) to [0.2, 0.7), a narrows that to [0.2, 0.3), and c narrows that to [0.27, 0.3).
The final interval is [0.27, 0.3). We call this the sequence interval.
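
A short sketch (assuming the reconstructed distribution p(a) = .2, p(b) = .5, p(c) = .3) that composes message intervals exactly as above and reproduces the sequence interval for bac:

```python
p = {'a': 0.2, 'b': 0.5, 'c': 0.3}
f = {'a': 0.0, 'b': 0.2, 'c': 0.7}                 # cumulative probabilities

def sequence_interval(msg):
    # Narrow [0,1): each message moves the bottom by s*f(m) and shrinks the size by p(m).
    l, s = 0.0, 1.0
    for m in msg:
        l, s = l + s * f[m], s * p[m]
    return l, l + s

print(sequence_interval("bac"))                    # approximately (0.27, 0.3)
```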

Arithmetic Coding: Sequence Intervals
To code a sequence of messages with probabilities p_i (i = 1..n) use the following recurrences:
l_1 = f_1,  l_i = l_{i-1} + s_{i-1} × f_i   (bottom of interval)
s_1 = p_1,  s_i = s_{i-1} × p_i             (size of interval)
Each message narrows the interval by a factor of p_i.
Final interval size: s_n = Π_{i=1}^{n} p_i

Warning
Three types of interval:
message interval: interval for a single message
sequence interval: composition of message intervals
code interval: interval for a specific code used to represent a sequence interval (discussed later)

Uniquely Defining an Interval
Important property: the sequence intervals for distinct message sequences of length n will never overlap.
Therefore, specifying any number in the final interval uniquely determines the sequence.
Decoding is similar to encoding, but on each step we need to determine what the message value is and then reduce the interval.

Arithmetic Coding: Decoding Example
Decoding the number 0.49, knowing the message is of length 3 (with a = 0.2, b = 0.5, c = 0.3 as before):
0.49 lies in b's interval [0.2, 0.7), so the first message is b; within [0.2, 0.7), 0.49 lies in b's sub-interval [0.3, 0.55), so the second message is b; within [0.3, 0.55), 0.49 lies in c's sub-interval [0.475, 0.55), so the third message is c.
The message is bbc.
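
And the decoding direction, following the recipe above: at each step find which message interval contains the number, output that message, and rescale. A sketch reproducing the 0.49 example (same assumed distribution):

```python
p = {'a': 0.2, 'b': 0.5, 'c': 0.3}
f = {'a': 0.0, 'b': 0.2, 'c': 0.7}

def decode(x, n):
    # Decode n messages from a number x in the final sequence interval.
    out = []
    for _ in range(n):
        for m in p:                                # find the message interval containing x
            if f[m] <= x < f[m] + p[m]:
                out.append(m)
                x = (x - f[m]) / p[m]              # rescale x back to [0, 1)
                break
    return "".join(out)

print(decode(0.49, 3))                             # 'bbc'
```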

Representing Fractions
Binary fractional representation:
.75   = .11
1/3   = .0101...
11/16 = .1011
So how about just using the smallest binary fractional representation in the sequence interval?
e.g. [0, .33) = .01, [.33, .66) = .1, [.66, 1) = .11
But what if you receive a 1? Should we wait for another 1?

Representing an Interval
Can view binary fractional numbers as intervals by considering all completions, e.g.

code   min       max       interval
.11    .1100...  .1111...  [.750, 1.0)
.101   .1010...  .1011...  [.625, .75)

We will call this the code interval.

Code Intervals: Example
For [0, .33) = .01, [.33, .66) = .1, [.66, 1) = .11, the code intervals are [.25, .5) for .01, [.5, 1) for .1, and [.75, 1) for .11; the intervals of .1 and .11 overlap, and indeed .1 is a prefix of .11.
Note that if code intervals overlap then one code is a prefix of the other.
Lemma: If a set of code intervals do not overlap then the corresponding codes form a prefix code.

Selecting the Code Interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval; e.g. the sequence interval [.61, .79) contains the code interval [.625, .75), i.e. the code .101.
Can use the fraction l + s/2 truncated to
⌈-log(s/2)⌉ = 1 + ⌈-log s⌉ bits
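
A small sketch of code intervals: the minimum completion of a bit string appends 0s and the maximum appends 1s, so the string covers a half-open interval of width 2^-length; a code works for a sequence interval exactly when its code interval is contained in it. The [.61, .79) figures are the reconstructed example values from above.

```python
def code_interval(bits):
    # All real numbers whose binary fraction begins with `bits`: from .b000... up to .b111...
    lo = sum(int(b) * 2.0 ** -(i + 1) for i, b in enumerate(bits))
    return lo, lo + 2.0 ** -len(bits)

print(code_interval("11"))      # (0.75, 1.0)
print(code_interval("101"))     # (0.625, 0.75)

def fits(bits, seq_lo, seq_hi):
    lo, hi = code_interval(bits)
    return seq_lo <= lo and hi <= seq_hi

print(fits("101", 0.61, 0.79))  # True: .101 can represent the sequence interval [.61, .79)
print(fits("1", 0.61, 0.79))    # False: [.5, 1.0) spills outside it
```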

Selecting a Code Interval: Example
[0, .33) = .001, [.33, .66) = .100, [.66, 1) = .110
e.g. for [.33, .66): l = .33, s = .33, l + s/2 = .5 = .1000..., truncated to 1 + ⌈-log s⌉ = 1 + ⌈-log(.33)⌉ = 3 bits, giving .100.
Is this the best we can do for [0, .33)?

RealArith Encoding and Decoding
RealArithEncode:
Determine l and s using the original recurrences.
Code using l + s/2 truncated to 1 + ⌈-log s⌉ bits.
RealArithDecode:
Read bits as needed so the code interval falls within a message interval, and then narrow the sequence interval.
Repeat until n messages have been decoded.

Bound on Length
Theorem: For n messages with self information {s_1, ..., s_n}, RealArithEncode will generate at most 2 + Σ_{i=1}^{n} s_i bits.
Proof:
1 + ⌈-log s⌉ = 1 + ⌈-log(Π_{i=1}^{n} p_i)⌉ = 1 + ⌈Σ_{i=1}^{n} -log p_i⌉ = 1 + ⌈Σ_{i=1}^{n} s_i⌉ < 2 + Σ_{i=1}^{n} s_i

Integer Arithmetic Coding
The problem with RealArithCode is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
Keep integers in the range [0..R) where R = 2^k.
Use rounding to generate an integer sequence interval.
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2.
This integer algorithm is an approximation of the real algorithm.
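
Putting the pieces together, a minimal RealArithEncode sketch using exact rationals (deliberately the expensive arbitrary-precision arithmetic that the integer version avoids). It outputs l + s/2 truncated to 1 + ⌈-log s⌉ bits, e.g. 7 bits for bac under the running distribution, which respects the 2 + Σ s_i bound:

```python
from fractions import Fraction
from math import ceil, log2

p = {'a': Fraction(1, 5), 'b': Fraction(1, 2), 'c': Fraction(3, 10)}
f = {'a': Fraction(0), 'b': Fraction(1, 5), 'c': Fraction(7, 10)}

def real_arith_encode(msg):
    l, s = Fraction(0), Fraction(1)
    for m in msg:                            # sequence-interval recurrences from earlier
        l, s = l + s * f[m], s * p[m]
    nbits = 1 + ceil(-log2(s))               # truncate l + s/2 to this many bits
    x, bits = l + s / 2, ""
    for _ in range(nbits):                   # binary expansion of x, truncated
        x *= 2
        bits += "1" if x >= 1 else "0"
        x -= int(x)
    return bits

code = real_arith_encode("bac")
print(code, len(code))                       # e.g. '0100100', 7 bits; the self information sums to ~5.06
```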

Integer Arithmetic (contracting)
The probability distribution as integers, with R = 256:
Probabilities as counts, e.g. c(a) = 11, c(b) = 7, c(c) = 30.
T is the sum of the counts, e.g. T = 48 (11 + 7 + 30).
Partial sums f as before, e.g. f(a) = 0, f(b) = 11, f(c) = 18.
Require that R > 4T so that probabilities do not get rounded to zero.
The contracting step keeps an integer interval [l, u] of size s = u - l + 1, starting from l = 0, s = R:
u_i = l_{i-1} + ⌊s_{i-1} (f_i + c_i) / T⌋ - 1
l_i = l_{i-1} + ⌊s_{i-1} f_i / T⌋

Integer Arithmetic (scaling)
If l ≥ R/2 then (in top half)
Output 1 followed by m 0s
m = 0
Scale the message interval by expanding by 2
If u < R/2 then (in bottom half)
Output 0 followed by m 1s
m = 0
Scale the message interval by expanding by 2
If l ≥ R/4 and u < 3R/4 then (in middle half)
Increment m
Scale the message interval by expanding by 2
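
A sketch of the encoder side under the reconstruction above (counts a = 11, b = 7, c = 30, so T = 48 and R = 256 > 4T). The contracting step uses the rounded recurrences and the scaling step implements the three cases, with the middle-half case deferring bits via the counter m. This is an illustrative sketch, not the lecture's reference implementation; the final flush is one common convention.

```python
R = 1 << 8                                   # R = 2^k = 256; requires R > 4T
counts = {'a': 11, 'b': 7, 'c': 30}
T = sum(counts.values())                     # 48
f, run = {}, 0
for sym, c in counts.items():                # partial sums: f(a)=0, f(b)=11, f(c)=18
    f[sym], run = run, run + c

def int_arith_encode(msg):
    l, u, m, out = 0, R - 1, 0, []           # integer interval [l, u], pending middle-half count m

    def emit(bit):
        nonlocal m
        out.append(bit)
        out.extend([1 - bit] * m)            # release the deferred opposite bits
        m = 0

    for sym in msg:
        s = u - l + 1                        # contracting step (rounded recurrences)
        u = l + (s * (f[sym] + counts[sym])) // T - 1
        l = l + (s * f[sym]) // T
        while True:                          # scaling step
            if u < R // 2:                   # bottom half: output 0 (plus m pending 1s)
                emit(0)
            elif l >= R // 2:                # top half: output 1 (plus m pending 0s)
                emit(1)
                l -= R // 2; u -= R // 2
            elif l >= R // 4 and u < 3 * R // 4:   # middle half: defer the decision
                m += 1
                l -= R // 4; u -= R // 4
            else:
                break
            l, u = 2 * l, 2 * u + 1          # expand the interval by a factor of 2
    m += 1                                   # flush enough bits to pin down a point in [l, u]
    emit(0 if l < R // 4 else 1)
    return out

print(int_arith_encode("bac"))
```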