Introduction to Information Theory, Data Compression, Coding

Mehdi Ibm Brahim, Laura Minkova

April 5, 2018

This is the augmented transcript of a lecture given by Luc Devroye on the 13th of March 2018 for a Data Structures and Algorithms class (COMP 252) at McGill University. The subject was Information Theory, Data Compression, and Coding.

Data Compression: The efficient encoding of information. In many compression methods, input symbols are mapped to codewords (bit sequences). The set of codewords is called a code. If all codewords are of equal length, then we have a fixed-length code. Otherwise, we have a variable-length code. The most important codes are prefix codes, i.e., codes in which no codeword is the prefix of another codeword. If codewords are mapped to binary trees (a 0 corresponding to a left edge, and a 1 to a right edge), then one can associate each symbol in a prefix code with a unique leaf. It is noteworthy that the compressed (i.e., coded) sequence can be decoded to yield the input by repeatedly going down the tree until leaves are reached.

Claude E. Shannon (1916-2001) was a highly recognized American mathematician and computer scientist. He studied electrical engineering and mathematics at the University of Michigan before going on to complete a master's degree and a doctorate at MIT. The computer science and engineering community increasingly began to notice his brilliant mind after the publication of his master's thesis, "A Symbolic Analysis of Relay and Switching Circuits", written in 1936. His most notable and well-known publication, "A Mathematical Theory of Communication", was published a few years later, in 1948. Although he worked in a field in which no Nobel Prize existed, he was granted numerous prestigious prizes throughout his career. He passed away at the age of 84 after a long fight with Alzheimer's disease.

Information Theory

Information Theory is the study of information and how it can be processed and communicated [1]. Not long after beginning work at the Bell Laboratories, Claude E. Shannon published his paper "A Mathematical Theory of Communication" [2] in 1948, in the Bell System Technical Journal [3]. This paper quickly gained widespread recognition as being the groundwork for what is now known as modern-day information theory.

The main premise of the paper was an investigation into solving communication problems, discussing them both in a theoretical and a real-life sense. The greatest difference between the two is that in real life, oftentimes there is noise that can interfere with the mode of transmission of information, which he called the channel. For the purpose of this course, we consider a communication system in which no noise is present.
Figure 1: Noiseless communication system

Shannon's greatest concern was the "how" and not the "what" of information transmission. He did note, however, that in the case of data compression, how well you compress (and how easily) depends on the input you are considering. That being said, he pays no attention to the actual meaning of the input, stating "...these semantic aspects of communication are irrelevant to the engineering problem." [2]

The compression ratio, C, is defined by

    C = (number of symbols in output) / (number of symbols in input).

In order to determine the expected length of the output sequence, Shannon considered every possible input. He assumed that every input sequence that may have to be compressed has a given probability p_i, where the p_i's sum to one. If the i-th input is given some encoding of length l_i bits, then the expected length of the output bit sequence is \sum_i p_i l_i.

A binary tree proved to be very useful in representing the encoding of information. The internal nodes of this tree have no value; each leaf represents a possible input. Every left edge is labelled by a 0, and every right edge by a 1.

Things to note:
1. Input in a communication system is not limited to words, characters, etc. It can be anything!
2. Output is always binary.

(Margin figure: an example encoding tree with its corresponding translation table.)

Entropy (Symbol E)

In information theory, entropy is a quantity that measures the amount of information in a random variable. Thus entropy provides a theoretical (sometimes unachievable) limit for the efficiency of any possible encoding [4]. The binary entropy is defined as

    E = \sum_i p_i \log_2 \frac{1}{p_i} \ge 0,

where the p_i's are the probabilities of the input sequences.
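To make these two quantities concrete, here is a minimal Python sketch (not from the lecture) that computes the binary entropy E and the expected output length \sum_i p_i l_i; the probabilities and codeword lengths are made-up illustrative values.

    from math import log2

    # Hypothetical input distribution (probabilities must sum to one).
    p = [0.5, 0.25, 0.125, 0.125]

    # Hypothetical codeword lengths l_i, one per input symbol.
    l = [1, 2, 3, 3]

    # Binary entropy E = sum_i p_i * log2(1/p_i).
    E = sum(pi * log2(1 / pi) for pi in p)

    # Expected output length per symbol, sum_i p_i * l_i.
    expected_length = sum(pi * li for pi, li in zip(p, l))

    print(E)                # 1.75 bits
    print(expected_length)  # 1.75 bits

Because these probabilities are powers of 1/2, the chosen lengths meet the entropy bound exactly; in general the expected length can only be larger, as proved next.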
Shannon faced three problems:

1. Find a binary tree that minimizes \sum_i p_i l_i (solved by his student, David Huffman).
2. Prove E \le \min \sum_i p_i l_i, where "min" refers to the minimum over all binary trees. (Thus, the expected length of the output, regardless of the compression method, is at least E.)
3. Prove \sum_i p_i l_i \le E + 1 for some binary tree. (This reassures us, since we can come close to the lower bound E.)

We will first prove (2), E \le \min \sum_i p_i l_i.

Proof. Recall Kraft's inequality, which is valid for all binary trees:

    \sum_i 2^{-l_i} \le 1.

By Taylor's series expansion, \log_e x \le x - 1. Now observe that:

    \sum_i p_i l_i = \sum_i p_i \log_2 2^{l_i}                                                  (1)
                   = \sum_i p_i \log_2 \frac{1}{p_i} + \sum_i p_i \log_2 \frac{p_i}{2^{-l_i}}   (2)
                   = E - (\log_2 e) \sum_i p_i \log_e \frac{2^{-l_i}}{p_i}                      (3)
                   \ge E - (\log_2 e) \sum_i p_i \left( \frac{2^{-l_i}}{p_i} - 1 \right)        (4)
                   = E - (\log_2 e) \left( \sum_i 2^{-l_i} - 1 \right)                          (5)
                   \ge E.                                                                       (6)

Here (4) uses \log_e x \le x - 1 with x = 2^{-l_i}/p_i, (5) uses \sum_i p_i = 1, and (6) uses Kraft's inequality.

We have shown that \sum_i p_i l_i \ge E. We must now exhibit a compression method with \sum_i p_i l_i \le E + 1.

Proof. We take l_i = \lceil \log_2 (1/p_i) \rceil, so we have 2^{-l_i} \le 2^{-\log_2(1/p_i)} = p_i. So, Kraft's inequality holds:

    \sum_i 2^{-l_i} \le \sum_i p_i = 1.

By ordering the lengths l_i from small to large, and assigning the l_i's to leaves in a binary tree from left to right, one can find a prefix code with the given l_i's. This code is called the Shannon-Fano code. Now,

    \sum_i p_i l_i \le \sum_i p_i \left( 1 + \log_2 \frac{1}{p_i} \right) = 1 + E.
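As an illustration of the construction just described, the following Python sketch computes the Shannon-Fano lengths l_i = \lceil \log_2(1/p_i) \rceil and assigns codewords by ordering the lengths from small to large and filling leaves from left to right (a so-called canonical code). The probabilities are arbitrary example values; this is one possible rendering, not code from the lecture.

    from math import ceil, log2

    def shannon_fano_lengths(p):
        # l_i = ceil(log2(1/p_i)); these lengths satisfy Kraft's inequality.
        return [ceil(log2(1 / pi)) for pi in p]

    def assign_codewords(lengths):
        # Assign leaves left to right, shortest lengths first: each codeword is
        # the previous one plus one, shifted left (padded with zeros) whenever
        # the length increases. This yields a prefix code whenever Kraft holds.
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        codes = [None] * len(lengths)
        code, prev_len = 0, lengths[order[0]]
        for idx in order:
            code <<= lengths[idx] - prev_len
            codes[idx] = format(code, "0{}b".format(lengths[idx]))
            prev_len = lengths[idx]
            code += 1
        return codes

    p = [0.4, 0.3, 0.2, 0.1]                    # example probabilities
    lengths = shannon_fano_lengths(p)           # [2, 2, 3, 4]
    print(sum(2 ** -l for l in lengths) <= 1)   # Kraft's inequality holds
    print(assign_codewords(lengths))            # ['00', '01', '100', '1010']

For this example distribution the expected length is 2.4 bits, which indeed lies between E (about 1.85 bits) and E + 1.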
Huffman Tree

A Huffman tree is a binary tree that minimizes \sum_i p_i l_i, where p_i is the weight of leaf i and l_i is the distance from leaf i to the root. It has the following properties:

1. Two inputs with smallest p_i value are furthest from the root.
2. Every internal node has 2 children.
3. Two inputs with smallest p_i value can safely be made siblings.

It is important to note that Huffman trees are not unique!

The following greedy algorithm, due to Huffman, outputs a Huffman tree given a set of inputs and their p_i's. It has time complexity O(n log n). (The related Hu-Tucker algorithm solves the variant in which the left-to-right order of the leaves must be preserved.)

Setup: Let PQ be a binary heap holding pairs (i, p_i) with the smallest key p_i near the root. Assuming that there are n leaves, we can reserve n - 1 internal nodes in an array of total size 2n - 1: the internal nodes are numbered 1, ..., n - 1 and the leaves n, ..., 2n - 1. Let us use left[i] and right[i] to denote the children of node i. Node 1 is the root.

HuffmanTree:
    MAKENULL(PQ)
    for i = n to 2n - 1 do
        left[i] = right[i] = nil
        INSERT((i, p_i), PQ)
    for i = n - 1 down to 1 do
        (a, p_a) = DELETEMIN(PQ)
        (b, p_b) = DELETEMIN(PQ)
        left[i] = a
        right[i] = b
        INSERT((i, p_a + p_b), PQ)
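For concreteness, here is one way the pseudocode above might be rendered in Python, using the standard heapq module as the priority queue. It is a sketch, not the lecture's code, and it numbers the leaves 0, ..., n-1 (with internal nodes created afterwards) rather than n, ..., 2n-1.

    import heapq

    def huffman_codes(p):
        # p: list of leaf probabilities p_1, ..., p_n.
        # Returns one optimal prefix code as a list of bit strings.
        n = len(p)
        if n == 1:
            return ["0"]
        # Heap entries: (probability, node id). Leaves are 0..n-1,
        # internal nodes get ids n, n+1, ...
        heap = [(pi, i) for i, pi in enumerate(p)]
        heapq.heapify(heap)
        left, right = {}, {}
        next_id = n
        while len(heap) > 1:
            pa, a = heapq.heappop(heap)   # the two smallest probabilities
            pb, b = heapq.heappop(heap)   # become siblings
            left[next_id], right[next_id] = a, b
            heapq.heappush(heap, (pa + pb, next_id))
            next_id += 1
        root = heap[0][1]
        # Walk down the tree: 0 for a left edge, 1 for a right edge.
        codes = [None] * n
        stack = [(root, "")]
        while stack:
            node, prefix = stack.pop()
            if node < n:
                codes[node] = prefix
            else:
                stack.append((left[node], prefix + "0"))
                stack.append((right[node], prefix + "1"))
        return codes

    print(huffman_codes([1/3, 1/3, 1/3]))  # e.g. ['10', '11', '0']

On three equiprobable symbols this returns a code with lengths 1, 2, 2, matching the expected length of 5/3 bits per symbol used in the examples below.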
(Figure: an example showing how to construct a Huffman tree.)

Examples

We will now show different methods of coding and see how they compare with Shannon's lower bound. Suppose our input is x_1, x_2, ..., x_n, where the x_i are uniformly random elements of {1, 2, 3}. There are, therefore, 3^n equally likely input sequences of length n. Note that

    E = \log_2 3^n = n \log_2 3 \approx 1.58n.

1) (Fixed-width code.) We use two bits per input symbol using the fixed-width code: 1 → 00, 2 → 01, 3 → 10. So the length of the output is 2n, which is not optimal. There is room for a smaller expected output length.
2) (Huffman code.) Consider the Huffman code where symbols are coded symbol by symbol using a Huffman tree prefix code: 1 → 0, 2 → 10, 3 → 11. The expected output length is (5/3)n, since

    \sum_i p_i l_i = \frac{1}{3}(1) + \frac{1}{3}(2) + \frac{1}{3}(2) = \frac{5}{3}.

Thus, the expected output length is (5/3)n \approx 1.67n, which is still noticeably larger than E \approx 1.58n.

3) (Huffman code on groups.) Let us now make groups of fixed length d. Each group of d symbols is treated as one input symbol and coded by a Huffman code. The expected output length in number of bits will be n/d times the expected length of the Huffman tree code for one group, which we know is at most 1 + \log_2 3^d. So the overall expected length is at most

    \frac{n}{d} \left( 1 + \log_2 3^d \right) = \frac{n}{d} \left( 1 + d \log_2 3 \right) = n \left( \log_2 3 + \frac{1}{d} \right).

Finally, by choosing d large enough, we can get arbitrarily close to E. We cannot take d too large though, because computing the Huffman code would require too much space, as the Huffman tree has 3^d leaves.
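The effect of the group size d can be checked numerically. The sketch below (an illustration, not from the lecture) reuses the huffman_codes function sketched in the Huffman tree section, assuming it is in scope: it Huffman-codes all 3^d equally likely groups and prints the resulting expected number of bits per original symbol, which approaches \log_2 3 \approx 1.585 as d grows and always stays below the bound \log_2 3 + 1/d.

    from itertools import product
    from math import log2

    # Per-symbol expected length when groups of d symbols over {1, 2, 3}
    # are coded as single super-symbols (all 3**d groups equally likely).
    for d in (1, 2, 3, 4):
        groups = list(product([1, 2, 3], repeat=d))
        p = [1 / len(groups)] * len(groups)
        codes = huffman_codes(p)             # sketch from the Huffman tree section
        bits_per_symbol = sum(len(c) for c in codes) / len(groups) / d
        print(d, round(bits_per_symbol, 3), "bound:", round(log2(3) + 1 / d, 3))

For d = 1, 2, 3 this prints roughly 1.667, 1.611, and 1.605 bits per symbol, illustrating the slow convergence toward the entropy rate while the number of leaves, 3^d, grows quickly.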
References

[1] Charles E. Leiserson, Thomas H. Cormen, and Ronald L. Rivest. Introduction to Algorithms. Cambridge, MA, 2009.

[2] Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 1948.

[3] Adel Magra, Emma Goune, Irene Woo. Information theory. http://luc.devroye.org/magra-goune-woo--shannon+InformationTheory-LectureNotes-McGillUniversity-2017.pdf, March 2017. Accessed on 2018-03-20.

[4] George Markowsky. Information theory. https://www.britannica.com/science/information-theory, June 2017. Accessed on 2018-03.