Introduction to information theory and data compression

Adel Magra, Emma Gouné, Irène Woo

March 8, 2017

This is the augmented transcript of a lecture given by Luc Devroye on March 9th, 2017 for a Data Structures and Algorithms class (COMP 252). Data compression involves encoding information using fewer bits than the original representation.

Information Theory

Information theory is the study of the quantification, storage, and communication of information [Thomas and Cover, 2006]. Claude Shannon developed the mathematical theory that describes the basic aspects of communication systems. It is concerned with the construction and study of mathematical models using probability theory. In 1948, Shannon published his paper "A Mathematical Theory of Communication" in the Bell System Technical Journal [Shannon, 1948]. The paper provided a blueprint for the digital age. Figure 1 illustrates a general communication system as Shannon proposed it in his paper.

[Figure 1: Communication system diagram]

If an input sequence a is compressed into an output bit sequence b, we can calculate the compression ratio C as:

C = length(b) / length(a)

Shannon's Theory

Shannon imagined that every possible input sequence that may have to be compressed has a given probability p_i, where the p_i's sum to one. So if we transform the i-th input sequence into one having l_i bits, the expected length of the output bit sequence is ∑_i p_i l_i. One can reverse engineer a transformation (or compression) algorithm and construct a binary tree that maps every binary output back to an input. It is similar to the old decision tree we saw when arguing about lower bounds. Leaves correspond to possible inputs. What matters is to find a compression method that minimizes ∑_i p_i l_i. Theoretically, this can be done by finding the Huffman tree using the Hu-Tucker algorithm (see the section "Practice"). Since the number of possible inputs is incredibly large, one cannot possibly use Huffman to actually do it. In addition, p_i is generally unknown.
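The compression ratio C can be measured with any off-the-shelf compressor. The sketch below uses Python's standard-library zlib module as a stand-in compressor (zlib is not part of the lecture; it is just a convenient way to see C < 1 on repetitive input):

```python
import zlib

def compression_ratio(data: bytes) -> float:
    """C = length(b) / length(a), where b is the compressed form of a."""
    compressed = zlib.compress(data)
    return len(compressed) / len(data)

# Repetitive input compresses well, so C is far below 1.
text = b"abracadabra " * 100
print(compression_ratio(text))
```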
However, one can still learn things from Shannon about ∑_i p_i l_i. His main theorem is that:

E + 1 ≥ min ∑_i p_i l_i ≥ E,

where E is the binary entropy, −∑_i p_i log₂ p_i, and the minimum is over all possible binary trees (and thus, all compression algorithms). As a special case, putting p_i = 1/n, where n is the total number of possible answers for a particular algorithmic problem (such as sorting), we rediscover the decision tree lower bound seen earlier in the course: the expected number of binary oracle comparisons must be at least E = log₂ n. We will prove Shannon's theorem in the next section.

Entropy (symbol E)

In information theory, entropy (E) is a number defined to be the measure of the average information content delivered by a message. It measures the unpredictability of the outcome. The binary entropy is defined by

E = −∑_i p_i log₂ p_i ≥ 0,

where the p_i's are the probabilities of the input sequences. We will prove that

E + 1 ≥ min ∑_i p_i l_i ≥ E,

where the minimum is over all binary trees. Recall Kraft's inequality, which is valid for all binary trees: ∑_i 2^(−l_i) ≤ 1. Remark: the converse of Kraft's inequality is also true, i.e., given numbers l_1, l_2, ... with ∑_i 2^(−l_i) ≤ 1, there exists a binary tree such that its leaves have those depths.

We first show: ∑_i p_i l_i ≥ E. Observe that:

∑_i p_i l_i = −∑_i p_i log₂ 2^(−l_i)
            = −∑_i p_i log₂ ((2^(−l_i) / p_i) · p_i)
            = −∑_i p_i log₂ (2^(−l_i) / p_i) − ∑_i p_i log₂ p_i
            = −∑_i p_i log₂ (2^(−l_i) / p_i) + E.
Now,

∑_i p_i log₂ (2^(−l_i) / p_i) ≤ (1/ln 2) ∑_i p_i ((2^(−l_i) / p_i) − 1) = (1/ln 2) (∑_i 2^(−l_i) − 1) ≤ 0,

since ln x ≤ x − 1 and by Kraft's inequality. Thus, clearly ∑_i p_i l_i ≥ E.

Now we show: E + 1 ≥ ∑_i p_i l_i for the so-called Shannon-Fano code. In this code, we take l_i = ⌈log₂(1/p_i)⌉. We have 2^(−l_i) ≤ p_i, so ∑_i 2^(−l_i) ≤ 1. Thus, by the converse of Kraft's inequality, there exists a code that has length l_i for input i. That is the Shannon-Fano code. Now,

∑_i p_i l_i = ∑_i p_i ⌈log₂(1/p_i)⌉ ≤ ∑_i p_i log₂(1/p_i) + ∑_i p_i = E + 1.

So we are done. In conclusion, E, measured in "bits," corresponds to how well one can hope to compress a file given the assumption on the p_i's.

Practice

In practice, we will compress either symbols or small chunks of symbols. There is a separation problem on the part of the receiver: how do we separate a bit sequence if we transform each symbol in the input, symbol per symbol, into a small bit sequence? Indeed, when we send an encoded string, a concatenation of "codewords," the receiver might not be able to parse the string correctly in order to decode each string portion or codeword.

Example 1. Suppose we want to send a sequence of integers such as 18, 23. Its standard bit representation is 10010 and 10111. Sending the sequence 1001010111 is not very useful. Where does the first portion start or end? How large is the portion? A binary number always starts with a 1, hence one can send a prefix that starts with a 0 and indicates the length of the codeword, for example the length written in binary after a marker 0: 0101 10010 0101 10111. Another method is to add a prefix of 0's of the same length as the codeword. We have: 00000 10010 00000 10111. Still, this method is inconvenient and slightly wasteful.

Fixed-width coding, as for example in the standard 8-bits-per-character coding, gives another solution. It can be vastly improved. A first such improvement is (variable-width) prefix coding. A codeword assigns a sequence of bits to a symbol. A code is a set of codewords. One can picture a code as a trie (defined below). In a prefix code, each leaf uniquely corresponds to an input symbol. In this manner, the separation problem will be solved.
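The Shannon-Fano construction above is easy to check numerically. A small sketch (the probability vector is an assumed example, not from the lecture):

```python
import math

def entropy(p):
    """Binary entropy E = -sum_i p_i log2 p_i."""
    return -sum(pi * math.log2(pi) for pi in p)

def shannon_fano_lengths(p):
    """Code lengths l_i = ceil(log2(1/p_i)) from the proof of the upper bound."""
    return [math.ceil(math.log2(1 / pi)) for pi in p]

p = [0.5, 0.25, 0.125, 0.125]          # assumed example distribution
E = entropy(p)                          # 1.75 for this dyadic distribution
lengths = shannon_fano_lengths(p)       # [1, 2, 3, 3]

# Kraft's inequality holds, so a binary tree with these depths exists.
assert sum(2 ** -l for l in lengths) <= 1

# Shannon's bounds: E <= expected length <= E + 1.
expected = sum(pi * li for pi, li in zip(p, lengths))
assert E <= expected <= E + 1
```

For a dyadic distribution like this one, l_i = log₂(1/p_i) exactly and the expected length meets the entropy lower bound.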
Tries

A trie (pronounced "try") is a tree-based data structure for storing strings in order to support fast pattern matching. The main application of a trie is information retrieval, thus it is not surprising that the name trie comes from the word retrieval. In a binary trie, each edge represents a bit in a codeword, with an edge to a left child representing a 0 and an edge to a right child representing a 1. Each leaf is associated with a specific character.

Using a trie we can develop a strategy for the coder and decoder. The coder would search for a character and then go backwards to the root while recording the path. This step can be done using pointers to the parent node. The decoder doesn't need to have the code trie; she/he only needs to process the code to find a leaf and go backward to the root. For efficiency we usually send the trie along with the coded message. For prefix coding, the sender (or coder) has a prefix coding tree and uses either a table of codewords or parent pointers in the tree to do his coding. The receiver needs the tree (which is sent in some way), and with the tree, one can easily decode the sequence symbol by symbol, as leaves correspond to input symbols.

Example 2. A simple example we can make is to encode the alphabet {a, b, c} with bits. The leaves correspond to all the possible inputs. Here 0 maps to a, 10 maps to b and 11 maps to c.

Prefix Coding

Definition 3. Prefix coding is a coding system in which variables (words in English text) are differentiated by their prefix attribute.

We give an example with a trie and an alphabet composed of 5 letters a, b, c, d, e. Each letter is attributed a prefix code which is a proper binary sequence. Let P be the prefix code: P = {a: 00, b: 01, c: 10, d: 110, e: 111}. We clearly see that P is a valid prefix code as no binary sequence is the prefix of another in P. We can view this prefix code as a code tree. In this example, the string abbba is transformed into 0001010100. This bit sequence can be uniquely interpreted and decoded back into
abbba; in other words, we have solved the separation problem quite elegantly.

If we assume a certain probability p_i on symbol i (possibly approximated by the relative frequency of symbols in general input sequences that we would like to compress), then the expected length of a codeword, which ultimately tells us about the expected length of the compressed sequence, is again ∑_i p_i l_i. We should first of all design the code by using the Huffman tree. Such codes are called Huffman codes. By Shannon's theorem, observe that for the Huffman code, ∑_i p_i l_i ≤ E + 1.

Huffman Codes

We can optimize a prefix code by taking into consideration the probability of different code words occurring. We could then construct a Huffman tree [Cormen et al., 2009]. The Huffman coding algorithm constructs a solution step by step by picking the locally optimal choice. It is called a greedy algorithm. Given a fixed tree with leaf distances l_i and a certain assignment of symbols to the leaves, ∑_i p_i l_i is minimized by placing the symbols i and j with smallest p values furthest from the root. Therefore, since single-child nodes are obviously suboptimal, the optimal tree has i and j as children of an internal node. This permits us to create one internal node and reduce the problem size by one. The algorithm proceeds in a series of rounds.

Algorithm: First make each of the distinct characters of the string to encode the root node of a single-node binary tree. In each round, take the two binary trees with the smallest frequencies and merge them into a single binary tree. Repeat this process until only one tree is left.

Example 4. Let us show the algorithm with the following alphabet and probabilities p_i.
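The alphabet and probability table for this example appeared as a figure that is not reproduced in this transcript. As an illustrative stand-in, the rounds of greedy merging can be traced on an assumed alphabet (the letters and probabilities below are chosen for illustration, not taken from the lecture):

```python
# Assumed example data; each entry is (symbols in subtree, total probability).
probs = {"a": 0.35, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.05}
trees = sorted(probs.items(), key=lambda kv: kv[1])

# In each round, merge the two binary trees of smallest total probability.
while len(trees) > 1:
    (s1, p1), (s2, p2) = trees[0], trees[1]
    merged = (s1 + s2, p1 + p2)
    print(f"merge {s1} ({p1:.2f}) and {s2} ({p2:.2f}) -> {merged[0]} ({merged[1]:.2f})")
    trees = sorted(trees[2:] + [merged], key=lambda kv: kv[1])

# The single remaining tree is the Huffman tree, with total probability 1.
```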
As seen in the previous lecture (March 7, 2017), there is a complete algorithm for building a Huffman tree using binary heaps. Let's recall the method we used. The Huffman tree has n leaves and n − 1 internal nodes, for 2n − 1 nodes in total. We build the Huffman tree by filling the internal nodes with left and right children. We use the Hu-Tucker algorithm, which uses a priority queue H.

Hu-Tucker (n symbols with key i and probability p_i are given)
 1  MAKE_EMPTY_PRIORITY_QUEUE(H)
 2  for i from 1 to n do          (to insert the leaves first)
 3      LEFT[i] = 0
 4      RIGHT[i] = 0
 5      INSERT((p_i, i), H)
 6  for i from n+1 to 2n−1 do     (to implement the internal nodes)
 7      (p_a, a) = DELETEMIN(H)
 8      (p_b, b) = DELETEMIN(H)
 9      LEFT[i] = a
10      RIGHT[i] = b
11      INSERT((p_a + p_b, i), H)

This algorithm outputs the Huffman tree. The root is node 2n − 1. Left and right children of nodes are stored in the arrays LEFT and RIGHT. The construction of a Huffman tree takes O(n log n) time.

Finally, the entropy tells us about how well we can do. For example, E depends upon the language when the input consists of long texts.
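The pseudocode above translates almost line for line into Python, with the standard-library heapq module playing the role of the priority queue. A sketch (the arrays LEFT and RIGHT mirror the pseudocode; index 0 is unused):

```python
import heapq

def huffman(probs):
    """Build a Huffman tree as in the pseudocode: leaves are nodes 1..n,
    internal nodes are n+1..2n-1, and the root is node 2n-1."""
    n = len(probs)
    LEFT = [0] * (2 * n)
    RIGHT = [0] * (2 * n)
    H = []
    for i, p in enumerate(probs, start=1):      # insert the leaves first
        heapq.heappush(H, (p, i))
    for i in range(n + 1, 2 * n):               # fill the internal nodes
        pa, a = heapq.heappop(H)
        pb, b = heapq.heappop(H)
        LEFT[i], RIGHT[i] = a, b
        heapq.heappush(H, (pa + pb, i))
    return LEFT, RIGHT

def depths(LEFT, RIGHT, n):
    """Code lengths l_i: the depth of each leaf 1..n below the root 2n-1."""
    d = [0] * (2 * n)
    stack = [(2 * n - 1, 0)]
    while stack:
        node, depth = stack.pop()
        if node <= n:
            d[node] = depth
        else:
            stack.append((LEFT[node], depth + 1))
            stack.append((RIGHT[node], depth + 1))
    return d[1 : n + 1]

# Assumed example distribution; the optimal lengths here are 1, 2, 3, 3.
p = [0.5, 0.25, 0.125, 0.125]
LEFT, RIGHT = huffman(p)
print(sorted(depths(LEFT, RIGHT, len(p))))  # [1, 2, 3, 3]
```

Each DELETEMIN and INSERT costs O(log n) on a binary heap and there are O(n) of them, which gives the O(n log n) bound stated above.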
Final Remarks

Improvements are possible by grouping input symbols in groups of two, three, or more. One can also employ adaptive Huffman coding, where the code is changed as a text is being processed (and the frequencies of the symbols change).

References

Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 2009.

Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 623–656, 1948.

Joy A. Thomas and Thomas M. Cover. Elements of Information Theory. Wiley, New York, 2nd edition, 2006.