18.310 lecture notes                                                May 4, 2015

Shannon's noiseless coding theorem

Lecturer: Michel Goemans

In these notes we discuss Shannon's noiseless coding theorem, which is one of the founding results of the field of information theory. Roughly speaking, we want to answer such questions as: how much information is contained in some piece of data? One way to approach this question is to say that the data contains $n$ bits of information on average if it can be coded by a binary sequence of length $n$ on average. So information theory is closely related to data compression.

1 Some History

History of data compression. One of the earliest instances of widespread use of data compression came with telegraph code books, which were in widespread use at the beginning of the 20th century. At this time, telegrams were quite expensive; the cost of a transatlantic telegram was around $1 per word, which would be equivalent to something like $30 today. This led to the development of telegraph code books, some of which can be found in Google Books. These books gave a long list of words which encoded phrases. Some of the codewords in these books convey quite a bit of information; in the fourth edition of the ABC Code, for example, "Mirmidon" means "Lord High Chancellor has resigned" and "saturation" means "Are recovering salvage, but should bad weather set in the hull will not hold out long."¹

Information theory. In 1948, Claude Shannon published a seminal paper which founded the field of information theory.² In this paper, among other things, he set data compression on a firm mathematical ground. How did he do this? Well, he set up a model of random data, and managed to determine how much it could be compressed. This is what we will discuss now.

2 Random data and compression

First of all we need a model of data. We take our data to be a sequence of letters from a given alphabet $A$. Then we need some probabilistic setting. We will have a random source of letters. So we could say that we have random letters $X_1, X_2, X_3, \ldots$ from $A$. In these notes we will assume that we have a first-order source, that is, we assume that the random variables $X_1, X_2, X_3, \ldots$ are independent and identically distributed. So for any letter $a$ in the alphabet $A$, the probability $P(X_n = a)$ is some constant $p_a$ which depends neither on $n$ nor on the letters $X_1, X_2, \ldots, X_{n-1}$. Shannon's theory actually carries over to more complicated models of sources (Markov chains of any order), which would be more realistic models of reality; for simplicity, however, we shall only consider first-order sources in these notes.³

¹The ABC Universal Commercial Telegraph Code, Specially Adapted for the Use of Financiers, Merchants, Shipowners, Brokers, Agents, Etc., by W. Clauson-Thue, American Code Publishing Co., Fourth Edition (1899).
²Available on-line at http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
³Suppose for instance you are trying to compress English text. We might consider that we have some sample corpus of English text on hand (say, everything in the Library of Congress). Shannon considered a series of sources, each of which is a better approximation to English. The first-order source emits a letter $a$ with probability $p_a$ proportional to its frequency in the text. The probability distribution of a sequence of letters from this source is just independent random variables where the letter $a_j$ appears with some probability $p_j$. The second-order source is that where a letter is emitted with probability that depends only on the previous letter, and these probabilities are just the conditional probabilities that appear in the corpus (that is, the conditional probability of getting a "u", given that the previous letter was a "q", is derived by looking at the frequency of all letters that follow a "q" in the corpus). In the third-order source, the probability of a letter depends only on the two previous letters, and so on. High-order sources would seem to give a pretty good approximation of English, and so it would seem that a compression method that works well on this class of sources would also work well on English text.
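To make the first-order source concrete, here is a minimal sketch in Python; the alphabet and letter probabilities are illustrative choices, not taken from these notes.

```python
import random

# A hypothetical alphabet and letter distribution (illustrative values only).
alphabet = ['a', 'b', 'c', 'd']
probs = [0.5, 0.25, 0.125, 0.125]  # p_a for each letter a; must sum to 1

def first_order_source(n):
    """Emit n i.i.d. letters: X_j equals letter a with probability p_a,
    independently of the position j and of all previous letters."""
    return random.choices(alphabet, weights=probs, k=n)

print(''.join(first_order_source(20)))  # e.g. 'aabacaadbbaacaabacda'
```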
Now we need to say what data compression means. We shall encode data by binary sequences (sequences of 0s and 1s). A coding function $\phi$ for a set $S$ of messages is simply a function which associates to each element $s \in S$ a distinct binary sequence $\phi(s)$. Now if the messages $s$ in $S$ have a certain probability distribution, then the length $L$ of the binary sequence $\phi(s)$ is a random variable. We are looking for codes such that the average length $E(L)$ is as small as possible.

In our context, the random messages will be the sequences $s = (X_1, X_2, \ldots, X_n)$ consisting of the first $n$ letters coming out of the source. One way to encode these messages is to attribute distinct binary sequences of length $\lceil \log_2(|A|) \rceil$ to the letters in the alphabet $A$. Then the binary sequence $\phi(s)$ would be the concatenation of the codes of the $n$ letters, so that the length $L$ of $\phi(s)$ would be $n \lceil \log_2(|A|) \rceil$. That's a perfectly valid coding function, leading to average length $E(L) = n \lceil \log_2(|A|) \rceil$. Now the main question is: can we do better? How much better? This is what we discuss next.

3 Shannon's entropy Theorem

Consider an alphabet $A = \{a_1, \ldots, a_k\}$ and a first-order source $X$ as above: the $n$th random letter is denoted $X_n$. For all $i \in \{1, \ldots, k\}$ we denote by $p_i$ the probability of the letter $a_i$, that is, $P(X_n = a_i) = p_i$. We define the entropy of the source $X$ as
\[ H(p) = -\sum_{i=1}^{k} p_i \log_2 p_i. \]
We often denote the entropy just by $H$, without emphasizing the dependence on $p$. The entropy $H(p)$ is a nonnegative number. It can also be shown that $H(p) \le \log_2 k$ by concavity of the logarithm. This upper bound is achieved when $p_1 = p_2 = \cdots = p_k = 1/k$.

We will now discuss and prove (leaving a few details out) the following result of Shannon.

Theorem 1 (Shannon's entropy Theorem). Let $X$ be a first-order source with entropy $H$. Let $\phi$ be any coding function in binary words for the sequences $s = (X_1, X_2, \ldots, X_n)$ consisting of the first $n$ letters coming out of the source. Then the length $L$ of the code $\phi(s)$ is at least $nH$ on average, that is,
\[ E(L) \ge nH + o(n), \]
where the little-$o$ notation means that the expression divided by $n$ goes to zero as $n$ goes to infinity. Moreover, there exists a coding function $\phi$ such that $E(L) \le nH + o(n)$.
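Before sketching the proof, it may help to compute a few entropies. Here is a small Python sketch (the distributions are illustrative); note how a skewed source has entropy below the $\lceil \log_2 |A| \rceil$ bits per letter used by the naive code above.

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i * log2(p_i), with the convention 0*log2(0) = 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# The uniform distribution on k = 4 letters achieves the maximum log2(k) = 2.
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
# A skewed source is more compressible: 1.75 bits/letter instead of 2.
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75
```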
So the entropy of the source tells you how much the messages coming out of it can be compressed. Another way of interpreting this theorem is to say that the amount of information coming out of the source is $H$ bits of information per letter.

We will now give a sketch of the proof of Shannon's entropy Theorem. First, let's try to show that one cannot compress the source too much. We look at a sequence of $n$ letters from the first-order source $X$, with the probability of letter $a_i$ being $p_i$ for all $i$ in $[k]$. First observe that the number of sequences of length $n$ with exactly $n_i$ letters $a_i$ is
\[ \binom{n}{n_1, n_2, \ldots, n_k} = \frac{n!}{n_1! \, n_2! \cdots n_k!}, \]
and all these words have the same probability $p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$. Now, if we have some number $M$ of equally likely messages that must be sent, then in order to send them we need to use $\log_2 M$ bits on average (see the counting notes). So we need to send at least $\log_2 \binom{n}{n_1, n_2, \ldots, n_k}$ bits. To approximate this, we can use Stirling's formula
\[ n! \approx \sqrt{2\pi n} \left( \frac{n}{e} \right)^n. \]
It gives
\[ \log_2 n! = n \log_2 n - n \log_2 e + o(n). \]
Using this formula one obtains
\[ \log_2 \binom{n}{n_1, \ldots, n_k} = \log_2 n! - \sum_i \log_2 n_i! = n \log_2 n - \sum_i n_i \log_2 n_i + o(n) = -\sum_i n_i \log_2(n_i/n) + o(n). \]
In terms of the entropy function, this can be rewritten as:
\[ \frac{1}{n} \log_2 \binom{n}{n_1, \ldots, n_k} = H\!\left( \frac{n_1}{n}, \ldots, \frac{n_k}{n} \right) + o(1). \]
So we have a conditional lower bound on the length of the coding sequences:
\[ E(L \mid n_1, \ldots, n_k) \ge -\sum_i n_i \log_2(n_i/n) + o(n). \tag{1} \]
In particular, if $n_i = n p_i$ for all $i$, one gets
\[ E(L \mid n_1, \ldots, n_k) \ge -n \sum_{i=1}^{k} p_i \log_2 p_i + o(n) = nH + o(n). \]
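The convergence of $\frac{1}{n} \log_2 \binom{n}{n_1, \ldots, n_k}$ to the entropy can also be checked numerically. Here is a quick sketch (an illustration, not part of the original notes), using log-gamma so that large $n$ poses no difficulty:

```python
import math

def log2_multinomial(counts):
    """log2 of n!/(n_1!...n_k!), computed via lgamma to avoid huge integers."""
    n = sum(counts)
    log_e = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_e / math.log(2)

# For p = (1/2, 1/4, 1/4), H(p) = 1.5 bits.  With counts n_i = n*p_i, the
# per-letter cost (1/n) log2 multinomial approaches 1.5 as n grows:
for n in (100, 10_000, 1_000_000):
    counts = [n // 2, n // 4, n // 4]
    print(n, log2_multinomial(counts) / n)  # roughly 1.43, 1.4987, 1.49998
```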
Now we want to find what $n_i$ is in general. The expectation of $n_i$ is $n p_i$. Moreover, it can be shown that $n_i$ is very concentrated around this expectation. Actually, using Chebyshev's inequality one can prove (do it!) that for any constant $\epsilon > 0$
\[ P(|n_i - n p_i| \ge \epsilon n) \le \frac{p_i (1 - p_i)}{\epsilon^2 n}. \]
Now we fix a constant $\epsilon > 0$ and we consider two cases. We define a sequence of $n$ letters to be $\epsilon$-typical if $|n_i - n p_i| < \epsilon n$ for all $i$, and $\epsilon$-atypical otherwise. By the union bound, the above gives a bound on the probability of being $\epsilon$-atypical:
\[ P(\epsilon\text{-atypical}) \le \sum_{i=1}^{k} \frac{p_i (1 - p_i)}{\epsilon^2 n} \le \frac{1}{\epsilon^2 n}. \]
Moreover, Equation (1) gives
\[ E(L \mid \epsilon\text{-typical}) \ge -n \sum_i (p_i - \epsilon) \log_2(p_i + \epsilon) + o(n). \]
We now use the linearity of expectation to bound the length of the coded message:
\[ E(L) = E(L \mid \epsilon\text{-typical}) P(\epsilon\text{-typical}) + E(L \mid \epsilon\text{-atypical}) P(\epsilon\text{-atypical}) \ge \left( -n \sum_i (p_i - \epsilon) \log_2(p_i + \epsilon) + o(n) \right) \left( 1 - \frac{1}{\epsilon^2 n} \right). \]
Since one can take $\epsilon$ as small as one wants, this shows that $E(L) \ge nH + o(n)$. So we have proved the first part of the Shannon entropy theorem.

We will now show that one can do compression and get coded messages of length no more than $nH$ on average. We use again the linearity of expectation to bound the length of the coded messages:
\[ E(L) = E(L \mid \epsilon\text{-typical}) P(\epsilon\text{-typical}) + E(L \mid \epsilon\text{-atypical}) P(\epsilon\text{-atypical}). \]
Now we need to analyze this expression. $P(\epsilon\text{-atypical}) \le c/n$ for a constant $c$, so as long as we don't make the output in the atypical case more than length $Cn$, we can ignore the second term, as it will be bounded by a constant while the first term will be linear in $n$. What we could do is use one bit to tell the receiver whether the output was typical or atypical. If it is atypical, we can send it without compression (thus sending $\lceil \log_2 k^n \rceil = \lceil n \log_2 k \rceil$ bits), and if it is typical, we can then compress it. The dominant contribution to the expected length then occurs from typical outputs, because of the rarity of atypical ones.

How do we compress the source output if it's typical? One of the simplest ways theoretically (but this is not practical) is to calculate the number of typical outputs, and then assign a number in binary representation to each output. This compresses the source to $\log_2$ of the number of typical outputs. We will do this and get an upper bound of $nH + o(n)$ bits, where the little-$o$ notation means that the expression divided by $n$ goes to zero as $n$ goes to infinity.
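Before counting the typical outputs, it is easy to see empirically how rare atypical sequences are. A Monte Carlo sketch (with the same illustrative source as before, not from the notes) compares the observed fraction of $\epsilon$-atypical sequences with the $1/(\epsilon^2 n)$ union bound derived above:

```python
import random
from collections import Counter

alphabet = ['a', 'b', 'c', 'd']   # hypothetical source, as before
probs = [0.5, 0.25, 0.125, 0.125]
eps, n, trials = 0.05, 2000, 1000

def is_atypical(seq):
    """True if some letter count n_i deviates from n*p_i by at least eps*n."""
    counts = Counter(seq)
    return any(abs(counts[a] - n * p) >= eps * n
               for a, p in zip(alphabet, probs))

atypical = sum(is_atypical(random.choices(alphabet, weights=probs, k=n))
               for _ in range(trials))
print("observed fraction:", atypical / trials)     # typically 0.0 here
print("Chebyshev/union bound:", 1 / (eps**2 * n))  # 0.2
```

The Chebyshev bound is quite loose; the true probability of an atypical block decays much faster than $1/n$, which is why the observed fraction is usually zero.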
How do we calculate the number of typical outputs? For each typical vector of numbers $(n_1, \ldots, n_k)$, the number of outputs is $\binom{n}{n_1, n_2, \ldots, n_k}$. So an upper bound on the total number of typical outputs is
\[ \sum_{(n_i) \,:\, n(p_i - \epsilon) \le n_i \le n(p_i + \epsilon)} \binom{n}{n_1, n_2, \ldots, n_k} \]
(this is an upper bound as we haven't taken into account that $\sum_i n_i = n$). But we can use an even cruder upper bound by upper-bounding the number of terms in the summation by $n^k$ (it is not necessary to use the improved $(2\epsilon n)^k$). Thus, we get that the number of typical outputs is at most
\[ n^k \binom{n}{n(p_1 \pm \epsilon), n(p_2 \pm \epsilon), \ldots, n(p_k \pm \epsilon)}, \]
where we can choose the $n(p_i \pm \epsilon)$ to maximize the expression. Taking logs, we get that the number of bits required to send a typical output is at most
\[ k \log_2 n + n(H + c\epsilon), \]
for some constant $c$. The first term is negligible for large $n$, and we can let $\epsilon$ go to zero as $n$ goes to infinity so as to get compression to $nH + o(n)$ bits.

In summary, Shannon's noiseless theorem says that we need to transmit $nH$ bits, and this can be essentially achieved. We'll see next a much more practical way (Huffman codes) to do the compression; although it often does not quite achieve the Shannon bound, it gets fairly close to it.
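As a closing sanity check of the counting argument, one can enumerate all $k^n$ outputs for a toy $n$ and count the $\epsilon$-typical ones directly. The sketch below (illustrative parameters; real block lengths are far too large to enumerate) shows $\frac{1}{n} \log_2(\#\text{typical})$ landing close to $H$:

```python
import math
from itertools import product

alphabet = ['a', 'b']   # toy binary source
probs = [0.75, 0.25]
n, eps = 16, 0.1        # tiny n so that all 2^16 outputs can be listed
H = -sum(p * math.log2(p) for p in probs)

def is_typical(seq):
    """True if every letter count stays within eps*n of its expectation."""
    return all(abs(seq.count(a) - n * p) < eps * n
               for a, p in zip(alphabet, probs))

count = sum(is_typical(seq) for seq in product(alphabet, repeat=n))
print(f"typical outputs: {count} of {2**n}")         # 6748 of 65536
print(f"bits per letter: {math.log2(count)/n:.3f}")  # about 0.795
print(f"entropy H:       {H:.3f}")                   # about 0.811
```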