Lecture 6: Coding theory
Biology 429
Carl Bergstrom
February 4, 2008

Sources: This lecture loosely follows Cover and Thomas Chapter 5 and Yeung Chapter 3. As usual, some of the text and equations are taken directly from those sources.

Coding theory is the study of how information can be packaged for transport. Let us begin with an example. Suppose that we have a sequence of symbols

A, C, E, B, B, D, E, A, C, D, D, A, B, A, E, A, B, D, C, A, ...

drawn from an alphabet $\mathcal{X} = \{A, B, C, D, E\}$, and we want to send someone a message telling them this sequence. Our transmission channel allows only binary coding, so our message has to take the form of a string of zeros and ones. How can we code this message?

One way to do it is to simply use a block code that maps each letter into a binary codeword:

A  000
B  001
C  010
D  011
E  100

Thus the message above would look like

000010100001001011100000010...
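As a minimal sketch (not part of the original notes), the block code above can be written out in Python. The names `BLOCK_CODE` and `encode_block` are illustrative choices, and only the first nine symbols of the example sequence are encoded.

```python
# Fixed-length (block) code from the table above: every letter gets 3 bits.
BLOCK_CODE = {"A": "000", "B": "001", "C": "010", "D": "011", "E": "100"}


def encode_block(symbols, code=BLOCK_CODE):
    """Concatenate the codeword of each source symbol."""
    return "".join(code[s] for s in symbols)


# First nine symbols of the example sequence: A, C, E, B, B, D, E, A, C
bits = encode_block("ACEBBDEAC")
print(bits)  # 000010100001001011100000010
```

Every symbol costs exactly three bits here, so nine symbols produce a 27-bit string.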
Since the code words are of constant length, we can easily go in and group them before decoding:

000 010 100 001 001 011 100 000 010...

One can see right away that this coding is somewhat inefficient, because we are not making use of the code words 101, 110, and 111. Indeed, we are using three bits to transmit at most $\log_2 5 \approx 2.32$ bits of information.

Another coding approach would be to build a set of variable-length codewords. One might want to use a code such as the following:

A  0
B  1
C  00
D  01
E  11

but in this case, there is no way to uniquely decode a message. For example, 00101... could be AABAB or CBAB or ADD or any number of other possibilities.

We can use variable-length code words so long as we allow the receiver to uniquely decode the message. One way to do this, for example, is to have a unique end-of-codeword symbol. We could let the number of 1s indicate the letter and use 0s to mark the ends of codewords:

A  0
B  10
C  110
D  1110
E  11110

Then our message would look like

01101111010101110111100110...

Here, we can use the 0s to group the symbols and decode:
0 110 11110 10 10 1110 11110 0 110...

In general, will this latter approach be more efficient, or less efficient, than the former approach? We need a definition of efficiency. The obvious definition is the expected codeword length per source symbol. Where $l(x)$ is the length of the codeword necessary to encode symbol $x$, we are interested in the expected codeword length

$$L(C) = \sum_{x \in \mathcal{X}} p(x)\, l(x).$$

For our example source above, the expected codeword length of the block code is simply 3, since in all cases the codewords are of length three. For the code with terminating zeros, the expected codeword length depends on the relative frequency of the symbols in the original message. If a message has lots of As, then the latter coding will be very efficient, because many of the codewords in the message will be 0, i.e., one bit long instead of three as in the former code. If instead a message has lots of Es, the latter code will be very inefficient, because then many of the codewords in the coded message will be 11110, i.e., five bits long instead of three.

Thus we see that the efficiency of a system for encoding information depends on the statistical properties of the information to be encoded. Coding theory allows us to explore this relationship, often with a focus on designing codes that will be optimal or near-optimal for any given type of source data.

Notice the close relationship between the two following problems:

- Given a source of random variables $X$ drawn from an alphabet $\mathcal{X}$ with probabilities $p(x)$, find an efficient way of coding the source using an alphabet $\mathcal{D}$.
- Given a data file, find a way of efficiently compressing this data.

With this out of the way, we begin by looking at prefix codes. Prefix codes are those codes for which one can decode the message string uniquely without having to look forward in the string, because you can always figure out when a codeword has ended without needing to look ahead to see what codeword comes next. Our code with terminating zeros is a very straightforward example of a prefix code: we always know when we've reached the end of a codeword,
because every code word ends with a zero and every zero signals the end of a codeword. Note that of course all prefix codes are uniquely decodable, though not all uniquely decodable codes are prefix codes (see Cover and Thomas Table 5.1 for an example).

Theorem 1. A code is a prefix code if and only if no codeword is a prefix (the first part) of any other codeword.

Proof: If codeword $i$ is a prefix of codeword $j$, then after receiving the symbols for codeword $i$, one needs to look forward at the subsequent symbols to determine whether the full codeword is $i$ or $j$, and thus the code is not a prefix code. This proves the "only if" direction. To prove the "if" direction, we note that when we use a code where no codeword is a prefix of any other, we will know with certainty that we have received the full codeword when we receive a string of symbols that adds up to this codeword, because no other codeword starts that same way. Thus we have a prefix code.

Now that we have this simple test for a prefix code, we can prove one of the main theorems in coding theory.

Theorem 2. For any prefix code using an alphabet of size $D$, the codeword lengths $l_1, l_2, \ldots, l_m$ satisfy

$$\sum_i D^{-l_i} \le 1.$$

Proof: Take a code with an alphabet of size $D$, and suppose that the longest code word is of length $w$. Thus for any non-singular code (where each symbol gets a unique codeword) we can have at most $D^w + D^{w-1} + \cdots + D$ codewords.
[Figure: code tree whose successive levels contain $D$, $D^2$, $D^3$, ... nodes.]

But we are looking at prefix codes here, and thus no codeword can be the prefix of any other codeword. The largest number of codewords our prefix code can have is then $D^w$, i.e., all codewords have the maximal length. If any of our codewords are shorter, say of length $l < w$, that now rules out $D^{w-l}$ potential other codewords of length $w$.

So suppose that we have codewords $1, 2, \ldots, m$ with lengths $l_1, l_2, \ldots, l_m$. Then we end up ruling out many descendant codewords by our no-prefixes rule, namely $\sum_i D^{w-l_i}$ descendants. Thus the number of length-$w$ codewords still available is

$$D^w - \sum_i D^{w-l_i} \ge 0,$$

giving us

$$D^w \ge \sum_i D^{w-l_i}.$$

Dividing through by $D^w$ (which is positive) we get the Kraft inequality:

$$1 \ge \sum_i D^{-l_i}.$$

This puts a powerful lower bound on the average codeword length $L(C)$. Next, we will see how close we can come to achieving this bound, and we will formalize the relationship between this bound and the entropy rate of the source.

We can also prove the converse: for any set of lengths satisfying the Kraft inequality, we can construct a prefix code with code words of those lengths.

Theorem 3. Given any set of codeword lengths $l_1, l_2, \ldots, l_m$ that satisfy the Kraft inequality

$$\sum_i D^{-l_i} \le 1,$$
there exists a prefix code with $m$ codewords with precisely those lengths.

The proof is by supplying a construction. To construct such a set of codewords, create a tree as we did in the proof of the Kraft inequality. Order the codeword lengths from shortest to longest, then start with the first codeword length $l_1$. Assign the first symbol to the first codeword with that length, and remove all of its descendants. Assign the second symbol to the first codeword remaining on the tree with length $l_2$. Again remove all descendants. Continue until all symbols are assigned a codeword. One will always have enough remaining branches to assign a codeword to each length, by the calculations performed in the proof of Kraft's inequality.

We can then prove a relation between the expected length $L(C)$ of a prefix code and the entropy of the source. Here we simply state the theorem.

Theorem 4. The expected length $L(C)$ of a prefix code using a $D$-symbol alphabet is greater than or equal to the base-$D$ entropy of the source:

$$L(C) \ge H_D(X).$$

So we can't do better than the entropy rate when coding a source. We can get quite close to the entropy rate, though, using a very straightforward coding procedure suggested by Shannon. This procedure, which we will explore in the next lecture, gives an expected code length below $H_D(X) + 1$.
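The results of this lecture can be checked numerically on the five-letter example. The Python sketch below is an illustration rather than part of the original notes: it computes the Kraft sum, builds a binary prefix code from a list of lengths by taking the first free node at each depth (the construction in Theorem 3, restricted here to D = 2), decodes without lookahead, and compares expected length with entropy. All function names are hypothetical, and the probability vector `p` is an assumed example distribution, not data from the lecture.

```python
from fractions import Fraction
from math import log2


def kraft_sum(lengths, D=2):
    """Exact value of sum_i D**(-l_i) for the given codeword lengths."""
    return sum(Fraction(1, D ** l) for l in lengths)


def build_prefix_code(lengths):
    """Binary codewords for the sorted lengths, shortest first (D = 2 only)."""
    if kraft_sum(lengths) > 1:
        raise ValueError("lengths violate the Kraft inequality")
    codewords, value, prev_len = [], 0, 0
    for l in sorted(lengths):
        value <<= l - prev_len          # descend to depth l in the code tree
        codewords.append(f"{value:0{l}b}")
        value += 1                      # skip this node and all its descendants
        prev_len = l
    return codewords


def is_prefix_free(codewords):
    """Theorem 1 test: no codeword is a prefix of a different codeword."""
    return not any(
        i != j and b.startswith(a)
        for i, a in enumerate(codewords)
        for j, b in enumerate(codewords)
    )


def decode_prefix(bits, code):
    """Decode left to right, emitting a symbol as soon as a codeword completes."""
    inverse = {word: symbol for symbol, word in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete codeword")
    return "".join(out)


def expected_length(probs, lengths):
    """L(C) = sum_x p(x) l(x)."""
    return sum(p * l for p, l in zip(probs, lengths))


def entropy(probs, D=2):
    """Entropy of the source in base D."""
    return -sum(p * log2(p) for p in probs if p > 0) / log2(D)


lengths = [1, 2, 3, 4, 5]                      # code with terminating zeros
code = dict(zip("ABCDE", build_prefix_code(lengths)))
p = [1/2, 1/4, 1/8, 1/16, 1/16]                # assumed example distribution
```

With these lengths the construction reproduces the code with terminating zeros (0, 10, 110, 1110, 11110), whose Kraft sum is 31/32. The ambiguous code with lengths 1, 1, 2, 2, 2 has Kraft sum 7/4, so no prefix code with those lengths exists. Under the assumed distribution, the expected length is 1.9375 bits against an entropy of 1.875 bits, consistent with Theorem 4.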