Notes on Information Theory by Jeff Steif


1 Entropy, the Shannon-McMillan-Breiman Theorem and Data Compression

These notes will contain some aspects of information theory. We will consider some REAL problems that REAL people are interested in, although we might make some mathematical simplifications so we can (mathematically) carry this out. Not only are these things inherently interesting (in my opinion, at least) because they have to do with real problems, but these problems or applications allow certain mathematical theorems (most notably the Shannon-McMillan-Breiman Theorem) to come alive. The results that we will derive will have much to do with the entropy of a process (a concept that will be explained shortly) and allow one to see the real importance of entropy, which might not at all be obvious simply from its definition.

The first aspect we want to consider is so-called data compression, which means we have data and want to compress it in order to save space and hence money. More mathematically, let $X_1, X_2, \ldots, X_n$ be a finite sequence from a stationary stochastic process $\{X_k\}_{k=-\infty}^{\infty}$.

Stationarity will always mean strong stationarity, which means that the joint distribution of $X_m, X_{m+1}, X_{m+2}, \ldots, X_{m+k}$ is independent of $m$; in other words, we have time-invariance. We also assume that the variables take on values from a finite set $S$ with $|S| = s$. Anyway, we have our $X_1, X_2, \ldots, X_n$ and want a rule which, given a sequence, assigns to it a new sequence (hopefully of shorter length) such that different sequences are mapped to different sequences (otherwise one could not recover the data and the whole thing would be useless; e.g., assigning every word to 0 is nice and short but of limited use). Assuming all words have positive probability, we clearly can't code all the words of length $n$ into words of length $n-1$, since there are $s^n$ of the former and $s^{n-1}$ of the latter. However, one perhaps can hope that the expected length of the coded word is less than the length of the original word (and by some fixed fraction). What we will eventually see is that this is always the case EXCEPT in the one case where the stationary process is i.i.d. AND uniform on $S$. It will turn out that one can code things so that the expected length of the output is $H/\ln(s)$ times the length of the input, where $H$ is the entropy of the process. (One should conclude from the above discussion that the only stationary process with entropy $\ln(s)$ is i.i.d. uniform.)

What is the entropy of a stationary process? We define this in 2 steps. First, consider a partition of a probability space into finitely many sets where (after some ordering) the sets have probabilities $p_1, \ldots, p_k$ (which are of course positive and add to 1). The entropy of this partition is defined to be $$-\sum_{i=1}^{k} p_i \ln p_i.$$

Sounds arbitrary but turns out to be very natural: it has certain natural properties (after one interprets entropy as the amount of information gained by performing an experiment with the above probability distribution) and is in fact the only (up to a multiplicative constant) function which has these natural properties. I won't discuss these but there are many references (ask me if you want). (Here $\ln$ means natural log, but sometimes in these notes it will refer to a different base.) Now, given a stationary process, look only at the random variables $X_1, X_2, \ldots, X_n$. This breaks the probability space into $s^n$ pieces, since there are this many sequences of length $n$. Let $H_n$ be the entropy (as defined above) of this partition.

Definition 1.1: The entropy of a stationary process is $\lim_n H_n/n$ where $H_n$ is defined above (this limit can be shown to always exist using subadditivity).

Hard Exercise: Define the conditional entropy $H(X \mid Y)$ of $X$ given $Y$ (where both r.v.'s take on only finitely many values) in the correct way and show that the entropy of a process $\{X_n\}$ is also $\lim_n H(X_0 \mid X_{-1}, \ldots, X_{-n})$.

Exercise: Compute the entropy first of i.i.d. uniform and then of a general i.i.d. process. If you're feeling confident, go on and look at Markov chains (they are not that hard).

Exercise: Show that entropy is uniquely maximized at i.i.d. uniform. (The previous exercise showed this to be the case if we restrict to i.i.d.'s but you need to consider all stationary processes.)

We now discuss the Shannon-McMillan-Breiman Theorem. It might seem dry and the point of it might not be so apparent, but it will come very much alive soon.
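As a numerical companion to the exercises above, here is a minimal Python sketch (the notes contain no code, so the language and all function names are my own choices). It computes the entropy of a finite partition and then, for the Markov chain exercise, the entropy rate via the stationary distribution.

```python
import numpy as np

def partition_entropy(p):
    """Entropy -sum p_i ln(p_i) of a partition with probabilities p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 ln 0 = 0
    return float(-np.sum(p * np.log(p)))

# i.i.d. uniform on s symbols: the entropy rate is ln(s), the maximum.
s = 4
print(partition_entropy([1 / s] * s), np.log(s))

# A general i.i.d. process: the entropy rate is the one-symbol entropy.
print(partition_entropy([0.7, 0.2, 0.1]))

# Two-state Markov chain: the rate is sum_i pi_i * (entropy of row i),
# where pi is the stationary distribution (matching the conditional
# entropy formula lim_n H(X_0 | X_{-1}, ..., X_{-n}) from the exercise).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
evals, evecs = np.linalg.eig(P.T)      # stationary = left eigenvector
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi /= pi.sum()
print(sum(pi[i] * partition_entropy(P[i]) for i in range(2)))  # < ln 2
```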

Before doing this, we need to make an important (and not unnatural) assumption, that our process is ergodic. What does this mean? To motivate it, let's construct a stationary process which does not satisfy the Strong Law of Large Numbers (SLLN) for trivial reasons. Let $\{X_n\}$ be i.i.d. with 0's and 1's, 0 with probability 3/4 and 1 with probability 1/4, and let $\{Y_n\}$ be i.i.d. with 0's and 1's, 0 with probability 1/4 and 1 with probability 3/4. Now construct a stationary process by first flipping a fair coin and then, if it's heads, taking a realization from the $\{X_n\}$ process, and if it's tails, taking a realization from the $\{Y_n\}$ process. (By thinking of stationary processes as measures on sequence space, we are simply taking a convex (1/2, 1/2) combination of the measures associated to the two processes $\{X_n\}$ and $\{Y_n\}$.) You should check that this resulting process is stationary with each marginal having mean 1/2. Letting $\{Z_n\}$ denote this process, we clearly don't have that $$\frac{1}{n}\sum_{i=1}^{n} Z_i \to 1/2 \quad \text{a.s.}$$ (which is what the SLLN would tell us if $\{Z_n\}$ were i.i.d.) but rather we clearly have that $$\frac{1}{n}\sum_{i=1}^{n} Z_i \to 3/4 \ (1/4) \quad \text{with probability } 1/2 \ (1/2).$$ The point of ergodicity is to rule out such stupid examples. There are two equivalent definitions of ergodicity. The first is that if $T$ is the transformation which shifts a bi-infinite sequence one step to the left, then any event which is invariant under $T$ has probability 0 or 1.

The other definition is that the measure (or distribution) on sequence space associated to the process is not a convex combination of 2 measures each also corresponding to some stationary process (i.e., invariant under $T$). It is not obvious that these are equivalent definitions, but they are.

Exercise: Show that an i.i.d. process is ergodic by using the first definition.

Theorem 1.2 (Shannon-McMillan-Breiman Theorem): Consider a stationary ergodic process with entropy $H$. Let $p(X_1, X_2, \ldots, X_n)$ be the probability that the process prints out $X_1, \ldots, X_n$. (You have to think about what this means, BUT NOTE that it is a random variable.) Then as $n \to \infty$, $$\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n} \to H \quad \text{a.s. and in } L^1.$$

NOTE: The fact that $E\left[\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n}\right] \to H$ is exactly (check this) the definition of entropy, and therefore general real analysis tells us that the a.s. convergence implies the $L^1$ convergence. We do not prove this now (we will in §2) but instead see how we can use it to do data compression. To do data compression, we will only need the above convergence in probability (which of course follows from either the a.s. or $L^1$ convergence). Thinking about what convergence in probability means, we have the following corollary.

Corollary 1.3: Consider a stationary ergodic process with entropy $H$. Then for all $\epsilon > 0$, for large $n$, we have that the words of length $n$ can be divided into two sets, with all words $C$ in the first set satisfying $$e^{-n(H+\epsilon)} < p(C) < e^{-n(H-\epsilon)}$$ and with the total measure of all words in the second set being $< \epsilon$. One should think of the first set as the good set and the second set as the bad set. (Later it will be the good words which will be compressed.)
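To make Corollary 1.3 concrete, here is a small sketch for an i.i.d. Bernoulli(1/4) process (my choice of toy example). It counts the good words and the probability carried by the bad set for a few values of $n$ and a fixed $\epsilon$; note how few good words there are compared to the $2^n$ words total, while the bad set's mass shrinks.

```python
import itertools, math

p1 = 0.25                                    # P(X = 1); P(X = 0) = 0.75
H = -(p1 * math.log(p1) + (1 - p1) * math.log(1 - p1))   # about 0.562

def good_and_bad(n, eps):
    good_count, bad_mass = 0, 0.0
    for word in itertools.product([0, 1], repeat=n):
        k = sum(word)
        prob = p1 ** k * (1 - p1) ** (n - k)
        if math.exp(-n * (H + eps)) < prob < math.exp(-n * (H - eps)):
            good_count += 1              # a "good" (typical) word
        else:
            bad_mass += prob             # probability of the bad set
    return good_count, bad_mass

for n in [8, 14, 20]:
    g, b = good_and_bad(n, eps=0.1)
    print(n, g, 2 ** n, round(b, 4))     # good words, all words, bad mass
```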

Except in the i.i.d. uniform case, the bad set will have many more elements (in terms of cardinality) than the good set, but its probability will be much lower. (Note that it is only in the i.i.d. uniform case that probabilities and cardinality are the same thing.) Here is a related result which will finally allow us to prove the data compression theorem.

Proposition 1.4: Consider a stationary ergodic process with entropy $H$. Order the words of length $n$ in decreasing order (in terms of their probabilities). Fix $\lambda \in (0,1)$. We select the words of length $n$ in the above order one at a time until their probabilities sum up to at least $\lambda$, and we let $N_n(\lambda)$ be the number of words we take to do this. Then $$\lim_n \frac{\ln(N_n(\lambda))}{n} = H.$$

Exercise: Note that one obtains the same limit for all $\lambda$. Think about this. Convince yourself however that the above convergence cannot be uniform in $\lambda$. Is the convergence uniform in $\lambda$ on compact intervals of $(0,1)$?

Proof: Fix $\lambda \in (0,1)$. Let $\epsilon > 0$ be arbitrary but $< 1 - \lambda$ and $< \lambda$. We will show both $$\limsup_n \frac{\ln(N_n(\lambda))}{n} \le H + \epsilon$$ and $$\liminf_n \frac{\ln(N_n(\lambda))}{n} \ge H - \epsilon,$$ from which the result follows.

By Corollary 1.3, we know that for large $n$, the words of length $n$ can be broken up into 2 groups, with all words $C$ in the first set satisfying $$e^{-n(H+\epsilon)} < p(C) < e^{-n(H-\epsilon)}$$ and with the total measure of all words in the second set being $< \epsilon$. We now break the second group (the bad group) into 2 sets depending on whether $p(C) \ge e^{-n(H-\epsilon)}$ (the big bad words) or whether $p(C) \le e^{-n(H+\epsilon)}$ (the small bad words). When we order the words in decreasing order, we clearly first go through the big bad words, then the good words and finally the small bad words.

We now prove the first inequality. For large $n$, the total measure of the good words is at least $1 - \epsilon$, which is bigger than $\lambda$, and hence the total measure of the big bad words together with the good words is bigger than $\lambda$. Hence $N_n(\lambda)$ is at most the total number of big bad words plus the total number of good words. Since all these words have probability at least $e^{-n(H+\epsilon)}$, there cannot be more than $e^{n(H+\epsilon)}$ of them. Hence $$\limsup_n \frac{\ln(N_n(\lambda))}{n} \le \limsup_n \frac{\ln(e^{n(H+\epsilon)} + 1)}{n} = H + \epsilon.$$

For the other inequality, let $M_n(\lambda)$ be the number of good words among the $N_n(\lambda)$ words taken when we accumulated at least $\lambda$ measure of the space. Since the total measure of the big bad words is at most $\epsilon$, the total measure of these $M_n(\lambda)$ words is at least $\lambda - \epsilon$. Since each of these words has probability at most $e^{-n(H-\epsilon)}$, there must be at least $(\lambda - \epsilon)e^{n(H-\epsilon)}$ words in $M_n(\lambda)$ (if you don't see this, we have $M_n(\lambda) \ge (\lambda - \epsilon) e^{n(H-\epsilon)}$). Hence $$\liminf_n \frac{\ln(N_n(\lambda))}{n} \ge \liminf_n \frac{\ln(M_n(\lambda))}{n} \ge \liminf_n \frac{\ln((\lambda - \epsilon)e^{n(H-\epsilon)})}{n} = H - \epsilon,$$ and we're done.

Note that by taking $\lambda$ close to 1, one can see that (as long as we are not i.i.d. uniform, i.e., as long as $H < \ln s$) we can cover almost all words (in the sense of probability) by using only a negligible percentage of the total number of words ($\approx e^{nH}$ words instead of the total number of words $e^{n \ln s}$). We finally arrive at data compression, now that we have things set up and have a little feeling about what entropy is. We now make some definitions.

Definition 1.5: An $n$-code is an injective (i.e., 1-1) mapping $\sigma$ which takes the set of words of length $n$ with alphabet $S$ to words of any finite length (of at least 1) with alphabet $S$.

Definition 1.6: Consider a stationary ergodic process $\{X_k\}_{k=-\infty}^{\infty}$. The compressibility of an $n$-code $\sigma$ is $$E\left[\frac{l(\sigma(X_1, \ldots, X_n))}{n}\right]$$ where $l(W)$ denotes the length of the word $W$. This measures how well a code compresses on average.

Theorem 1.7: Consider a stationary ergodic process $\{X_k\}_{k=-\infty}^{\infty}$ with entropy $H$. Let $\mu_n$ be the minimum compressibility over all $n$-codes. Then the limit $\mu := \lim_n \mu_n$ exists and equals $\frac{H}{\ln s}$. The quantity $\mu$ is called the compression coefficient of the process $\{X_k\}$. Since $H$ is always strictly less than $\ln s$ except in the i.i.d. uniform case, this says that data compression is always possible except in this case.

Proof: As usual, we have two directions to prove. We first show that for arbitrary $\epsilon$ and $\delta > 0$, and for $n$ large, $$\mu_n \ge (1 - \delta)\frac{H - 2\epsilon}{\ln s}.$$ Fix any $n$-code $\sigma$. Call a word $C$ short (for $\sigma$) if $l(\sigma(C)) \le \frac{n(H - 2\epsilon)}{\ln s}$. Since codes are 1-1, the number of short words is at most $$\sum_{i=1}^{\lfloor n(H-2\epsilon)/\ln s \rfloor} s^i \le \frac{s}{s-1}\, s^{n(H-2\epsilon)/\ln s} = \frac{s}{s-1}\, e^{n(H-2\epsilon)}.$$

Next, since $\frac{\ln(N_n(\delta))}{n} \to H$ by Proposition 1.4, for large $n$ we have that $N_n(\delta) > e^{n(H-\epsilon)}$. This says that for large $n$, if we take $< e^{n(H-\epsilon)}$ of the most probable words, we won't get $\delta$ total measure, and so certainly if we take $< e^{n(H-\epsilon)}$ of any of the words, we won't get $\delta$ total measure. In particular, since the number of short words is at most $\frac{s}{s-1} e^{n(H-2\epsilon)}$, which is for large $n$ less than $e^{n(H-\epsilon)}$, the short words can't cover $\delta$ total measure for large $n$, i.e., $P(\text{short word}) \le \delta$ for large $n$. Now note that this argument was valid for any code. That is, for large $n$, any $n$-code satisfies $P(\text{short word}) \le \delta$. Hence for any $n$-code $\sigma$, $$E\left[\frac{l(\sigma(C))}{n}\right] \ge (1 - \delta)\frac{H - 2\epsilon}{\ln s}$$ and so $$\mu_n \ge (1 - \delta)\frac{H - 2\epsilon}{\ln s},$$ as desired.

For the other direction, we show that for any $\epsilon > 0$, we have that for large $n$, $$\mu_n \le \frac{1}{n} + \frac{H + \epsilon}{\ln s} + \delta.$$ This will complete the argument. The number of different sequences of length exactly $\lceil \frac{n(H+\epsilon)}{\ln s} \rceil$ is at least $s^{n(H+\epsilon)/\ln s}$, which is $e^{n(H+\epsilon)}$. By Proposition 1.4, the number $N_n(1-\delta)$ of most probable words needed to cover $1 - \delta$ of the measure is $\le e^{n(H+\epsilon)}$ for large $n$. Since there are at least this many words of length $\lceil \frac{n(H+\epsilon)}{\ln s} \rceil$, we can code these high probability words into words of length $\lceil \frac{n(H+\epsilon)}{\ln s} \rceil$.

For the remaining words, we code them to themselves. For such a code $\sigma$, we have that $E[l(\sigma(X_1, \ldots, X_n))]$ is at most $\lceil \frac{n(H+\epsilon)}{\ln s} \rceil + \delta n$ and hence the compressibility of this code is at most $\frac{1}{n} + \frac{H+\epsilon}{\ln s} + \delta$.

This is all nice and dandy but EXERCISE: Show that for a real person, who really has data and really wants to compress it, the above is all useless. We will see later on (§3) a method which is not useless.

2 Proof of the Shannon-McMillan-Breiman Theorem

We provide here the classical proof of this theorem, where we will assume two major theorems, namely the ergodic theorem and the Martingale Convergence Theorem. This section is much more technical than all of the coming sections and requires more mathematical background. Feel free if you wish to move on to §3. As this section is independent of all of the others, you won't miss anything. We first show that the proof of this result is an easy consequence of the ergodic theorem in the special case of an i.i.d. process, or more generally of a multistep Markov chain. We take the middle road and prove it for Markov chains (and the reader will easily see that it extends trivially to multi-step Markov chains).

Proof of SMB for Markov Chains: First note that

$$\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n} = \frac{1}{n}\sum_{j=1}^{n} -\ln p(X_j \mid X_{j-1}),$$ where the first term $-\ln p(X_1 \mid X_0)$ is taken to be $-\ln p(X_1)$. If the first term were interpreted as $-\ln p(X_1 \mid X_0)$ instead, then the ergodic theorem would immediately tell us this limit is a.s. $E[-\ln p(X_1 \mid X_0)]$, which (you have seen as an exercise) is the entropy of the process. Of course this difference in the first term makes an error of at most $\text{const}/n$ and the result is proved. For the multistep Markov chain, rather than replacing the first term by something else (as above), we need to do so for the first $k$ terms, where $k$ is the look-back of the Markov chain. Since this $k$ is fixed, there is no problem.

To prove the general result, we isolate the main idea into a lemma which is due to Breiman and which is a generalization of the ergodic theorem which is also used in its proof. Actually, to understand the statement of the next result, one really needs to know some ergodic theory. Learn about it on your own, ask me for books, or I'll discuss what is needed, or just forget this whole section.

Proposition 2.1: Assume that $g_n(x) \to g(x)$ a.e. and that $\int \sup_n g_n(x) < \infty$. Then if $\phi$ is an ergodic measure preserving transformation, we have that $$\lim_n \frac{1}{n}\sum_{j=0}^{n-1} g_j(\phi^j(x)) = \int g \quad \text{a.e. and in } L^1.$$ Note if $g_n = g \in L^1$ for all $n$, then this is exactly the ergodic theorem.

Proof: (As explained in the above remark), the ergodic theorem tells us $$\lim_n \frac{1}{n}\sum_{j=0}^{n-1} g(\phi^j(x)) = \int g \quad \text{a.e. and in } L^1.$$

Since $$\frac{1}{n}\sum_{j=0}^{n-1} g_j(\phi^j(x)) = \frac{1}{n}\sum_{j=0}^{n-1} g(\phi^j(x)) + \frac{1}{n}\sum_{j=0}^{n-1}\left[g_j(\phi^j(x)) - g(\phi^j(x))\right],$$ we need to show that $$(*) \qquad \frac{1}{n}\sum_{j=0}^{n-1}\left|g_j(\phi^j(x)) - g(\phi^j(x))\right| \to 0 \quad \text{a.e. and in } L^1.$$ Let $f_N(x) = \sup_{j \ge N} |g_j(x) - g(x)|$. By assumption, each $f_N$ is integrable and goes to 0 monotonically a.e., from which dominated convergence gives $\int f_N \to 0$. Now, $$\frac{1}{n}\sum_{j=0}^{n-1}\left|g_j(\phi^j(x)) - g(\phi^j(x))\right| \le \frac{1}{n}\sum_{j=0}^{N-1}\left|g_j(\phi^j(x)) - g(\phi^j(x))\right| + \frac{1}{n}\sum_{j=N}^{n-1} f_N(\phi^j(x)).$$ For fixed $N$, letting $n \to \infty$, the first term goes to 0 a.e. while the second term goes to $\int f_N$ a.e. by the ergodic theorem. Since $\int f_N \to 0$, this proves the pointwise part of (*). For the $L^1$ part, again for fixed $N$, letting $n \to \infty$, the $L^1$ norm of the first term goes to 0 while the $L^1$ norm of the second term is always at most $\int f_N$ (which goes to 0) and we're done.

Proof of the SMB Theorem: To properly do this and to apply the previous proposition, we need to set things up in an ergodic theory setting. Our process $X$ gives us a measure $\mu$ on $U = \{1, \ldots, s\}^{\mathbb{Z}}$ (thought of as its distribution). We also have a transformation $T$ on $U$, called the left shift, which simply maps an infinite sequence 1 unit to the left. The fact that the process is stationary means that $T$ preserves the measure $\mu$ in that for all events $A$, $\mu(T^{-1}A) = \mu(A)$.

Next, letting $g_k(w) = -\ln p(X_1 \mid X_0, \ldots, X_{-k+1})$ (with $g_0(w) = -\ln p(X_1)$), we get immediately that $$\frac{-\ln(p(X_1(w), X_2(w), \ldots, X_n(w)))}{n} = \frac{1}{n}\sum_{j=0}^{n-1} g_j(T^j(w)).$$ (Note that $w$ here is an infinite sequence in our space $U$.) We are now in the set-up of the previous proposition, although of course a number of things need to be verified, the first being the convergence of the $g_k$'s to something. Note that for any $i$ in our alphabet $\{1, \ldots, s\}$, $g_k(w)$ on the cylinder set $X_1 = i$ is simply $-\ln P(X_1 = i \mid X_0, \ldots, X_{-k+1})$, which (here we use a second nontrivial result) converges by the Martingale Convergence Theorem (and the continuity of $\ln$) to $-\ln P(X_1 = i \mid X_0, X_{-1}, \ldots)$. We therefore get that $g_k(w) \to g(w)$ where $g(w) = -\ln P(X_1 = i \mid X_0, X_{-1}, \ldots)$ on the cylinder set $X_1 = i$, i.e., $g(w) = -\ln P(X_1 \mid X_0, X_{-1}, \ldots)$. If we can show that $\int \sup_k g_k(w) < \infty$, the previous proposition will tell us that $$\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n} \to \int g \quad \text{a.e. and in } L^1.$$ Since $E\left[\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n}\right] \to h(\{X_i\})$ by definition of entropy, it follows that $\int g = h(\{X_i\})$ and we would be done.

EXERCISE: Show directly that $E[g(w)]$ is $h(\{X_i\})$.

We now proceed with verifying $\int \sup_k g_k(w) < \infty$. Fix $\lambda > 0$. We show that $P(\sup_k g_k > \lambda) \le s e^{-\lambda}$ (where $s$ is the alphabet size), which implies the desired integrability. $P(\sup_k g_k > \lambda) = \sum_k P(E_k)$ where $E_k$ is the set where the first time $g_j$ gets larger than $\lambda$ is $k$, i.e., $E_k = \{g_j \le \lambda,\ j = 0, \ldots, k-1,\ g_k > \lambda\}$.

These are obviously disjoint for different $k$ and make up the set above. Now $$P(E_k) = \sum_i P((X_1 = i) \cap E_k) = \sum_i P((X_1 = i) \cap F_k^i)$$ where $F_k^i$ is the set where the first time $f_j^i$ gets larger than $\lambda$ is $k$, with $f_j^i = -\ln p(X_1 = i \mid X_0, \ldots, X_{-j+1})$. Since $F_k^i$ is $X_{-k+1}, \ldots, X_0$ measurable, we have $$P((X_1 = i) \cap F_k^i) = \int_{F_k^i} P(X_1 = i \mid X_{-k+1}, \ldots, X_0) = \int_{F_k^i} e^{-f_k^i} \le e^{-\lambda} P(F_k^i).$$ Since the $F_k^i$ are disjoint for different $k$ (but not for different $i$), $$\sum_k P(E_k) \le \sum_i \sum_k e^{-\lambda} P(F_k^i) \le s e^{-\lambda}.$$

We finally mention a couple of things in the higher dimensional case, that is, where we have a stationary random field, where entropy can be defined in a completely analogous way. It was unknown for some time if a (pointwise, that is, a.s.) SMB theorem could be obtained in this case, the main obstacle being that it was known that a natural candidate for a multidimensional martingale convergence theorem was in fact false, and so one could not proceed as we did above in the 1-d case. (It was however known that this theorem held in mean (i.e., in $L^1$); see below.) However, recently (in 1983), Ornstein and Weiss managed to get a pointwise SMB not only for the multidimensional lattice but in the more general setting of something called amenable groups. (They used a method which circumvents the Martingale Convergence Theorem.) The result of Ornstein and Weiss is difficult and so we don't present it. Instead we present the Shannon-McMillan Theorem (note, no Breiman here), which is the $L^1$ convergence in the SMB Theorem.

Of course, this is a weaker result than we just proved, but the point of presenting it is two-fold: first, to see how much easier it is than the pointwise version, and also to see a result which generalizes (relatively) easily to higher dimensions (this higher dimensional variant being left to the reader).

Proposition 2.2 (Mean SMB Theorem or Shannon-McMillan Theorem): The SMB theorem holds in $L^1$.

Proof: Let $g_j$ be defined as above. The Martingale Convergence Theorem implies that $g_j \to g$ in $L^1$ as $j \to \infty$. We then have (with $\|\cdot\|$ denoting the $L^1$ norm) $$\left\|\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n} - H\right\| \le \frac{1}{n}\sum_{j=0}^{n-1}\left\|g_j(T^j(w)) - g(T^j(w))\right\| + \left\|\frac{1}{n}\sum_{j=0}^{n-1} g(T^j(w)) - H\right\|.$$ The first term goes to 0 by the $L^1$ convergence in the Martingale Convergence Theorem (we don't of course always get $L^1$ convergence in the Martingale Convergence Theorem, but we do in this setting). The second term goes to 0 in $L^1$ by the ergodic theorem together with the fact that $\int g = H$, an easy computation left to the reader.

3 Universal Entropy Estimation?

This section simply raises an interesting point which will be delved into further later on.

One can consider the problem of trying to estimate the entropy of a process $\{X_n\}_{n=-\infty}^{\infty}$ after we have only seen $X_1, X_2, \ldots, X_n$, in such a way that these estimates converge to the true entropy with probability 1. (As above, we know that to have any chance of doing this, one needs to assume ergodicity.) Obviously, one could simply take the estimate to be $\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n}$, which will (by SMB) converge to the entropy. However, this is clearly undesirable in the sense that in order to use this estimate, we need to know what the distribution of the process is. It certainly would be preferable to find some estimation procedure which does not depend on the distribution, since we might not know it. We therefore want a family of functions $h_n : S^n \to \mathbb{R}$ ($h_n(x_1, \ldots, x_n)$ would be our guess of the entropy of the process if we see the data $x_1, \ldots, x_n$) such that for any given ergodic process $\{X_n\}_{n=-\infty}^{\infty}$, $$\lim_n h_n(X_1, \ldots, X_n) = h(\{X_n\}_{n=-\infty}^{\infty}) \quad \text{a.s.}$$ (Clearly the suggestion earlier is not of this form since $h_n$ depends on the process.) We call such a family of functions a universal entropy estimator, universal since it holds for all ergodic processes. There is no immediate obvious reason at all why such a thing exists, and there are similar types of things which don't in fact exist. (If entropy were a continuous function defined on processes (the latter being given the usual weak topology), a general theorem (see §9) would tell us that such a thing exists, BUT entropy is NOT continuous.) However, as it turns out, a universal entropy estimator does in fact exist. One such universal entropy estimator comes from an algorithm called the Lempel-Ziv algorithm, which was later improved by Wyner-Ziv. We will see that this algorithm also turns out to be a universal data compressor, which we will describe later. (This method is used, as far as I understand it, in the real world.)
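The Lempel-Ziv idea can be sketched in a few lines. Below is a minimal version (mine; unoptimized, and convergence is notoriously slow) of the LZ78 incremental parsing with the standard estimate $c_n \log_2(c_n)/n$, where $c_n$ is the number of distinct phrases in the parsing; for stationary ergodic sources this converges to the entropy in bits per symbol.

```python
import math, random

def lz78_entropy_estimate(seq):
    """Universal entropy estimate (bits/symbol) via LZ78 parsing:
    split seq into the shortest phrases not seen before; with c
    phrases on n symbols, return c*log2(c)/n."""
    phrases, current = set(), ""
    for symbol in seq:
        current += str(symbol)
        if current not in phrases:
            phrases.add(current)         # a brand-new phrase ends here
            current = ""
    c = len(phrases) + (1 if current else 0)
    return c * math.log2(c) / len(seq)

random.seed(0)
n = 200_000
seq = [1 if random.random() < 0.25 else 0 for _ in range(n)]
true_H = -(0.25 * math.log2(0.25) + 0.75 * math.log2(0.75))  # 0.811...
print(lz78_entropy_estimate(seq), true_H)   # rough agreement only
```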

4 20 Questions and Relative Entropy

We have all heard problems of the following sort. You have 8 coins, one of which has a different weight than all the others, which have the same weight. You are given a scale (which can determine which of two given quantities is heavier) and your job is to identify the bad coin. How many weighings do you need? Later we will do something more general, but it turns out the problem of how many questions one needs to ask to determine something has an extremely simple answer, which is given by something called Kraft's Theorem. The answer will not at all be surprising and will be exactly what you would guess, namely: if you ask questions which have $D$ possible answers, then you will need $\log_D(n)$ questions, where $n$ is the total number of possibilities. (If the latter is not an integer, you need to take the integer right above it.)

Definition 4.1: A mapping $C$ from a finite set $S$ to finite words with alphabet $\{1, \ldots, D\}$ will be called a prefix-free $D$-code on $S$ if for any $x \ne y \in S$, $C(x)$ is not a prefix of $C(y)$ (which implies of course that $C$ is injective).

Theorem 4.2 (Kraft's Inequality): Let $C$ be a prefix-free $D$-code defined on $S$. Let $\{l_1, \ldots, l_{|S|}\}$ be the lengths of the code words of $C$. Then $$\sum_{i=1}^{|S|} D^{-l_i} \le 1.$$ Conversely, if $\{l_1, \ldots, l_n\}$ are integers satisfying $$\sum_{i=1}^{n} D^{-l_i} \le 1,$$ then there exists a prefix-free $D$-code on a set with $n$ elements whose code words have lengths $\{l_1, \ldots, l_n\}$.
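As a small computational aside (mine, not part of the notes): checking Kraft's sum and, for lengths that satisfy it, running the greedy construction which is essentially how the converse half is proved. Assigning codewords in increasing order of length is exactly what makes prefix-freeness automatic.

```python
def kraft_sum(lengths, D):
    return sum(D ** (-l) for l in lengths)

def build_prefix_code(lengths, D):
    """Greedy converse of Kraft: process lengths in increasing order,
    keeping an integer v that marks the next free codeword at the
    current length; scaling v by D**(l - prev) skips exactly the
    descendants of already-assigned (shorter) codewords."""
    assert kraft_sum(lengths, D) <= 1 + 1e-12
    codewords, v, prev = [], 0, 0
    for l in sorted(lengths):
        v *= D ** (l - prev)
        digits, x = [], v
        for _ in range(l):               # write v in base D, width l
            digits.append(x % D)
            x //= D
        codewords.append("".join(str(d) for d in reversed(digits)))
        v += 1
        prev = l
    return codewords

print(kraft_sum([1, 2, 3, 3], 2))               # 1.0, so a code exists
print(build_prefix_code([1, 2, 3, 3], D=2))     # ['0', '10', '110', '111']
```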

Kraft's Inequality turns out to be quite easy to prove, so try to figure it out on your own. I'll present it in class but won't write it here.

Corollary 4.3: There is a prefix-free $D$-code on a set of $n$ elements whose maximum code word has length $\lceil \log_D(n) \rceil$, while there is no such code whose maximum code word is $< \log_D(n)$.

I claim that this corollary implies that the number of questions with $D$ possible answers one needs to ask to determine something with $n$ possibilities is $\log_D(n)$ (where if this is not an integer, it is taken to be the integer right above it). To see this, one just needs to think a little and realize that asking questions until we get the right answer is exactly a prefix-free $D$-code. (For example, if you have the code first, your first question would be "What is the first number in the codeword assigned to $x$?", which of course has $D$ possible answers.)

Actually, going back to the weighing question, we did not really answer that question at all. We understand the situation when we can ask any $D$-ary question (i.e., any partition of the outcome space into $D$ pieces), but with questions like the weighing question, while any weighing breaks the outcome space into 3 pieces, we are physically limited (I assume, I have not thought about it) from constructing all possible partitions of the outcome space into 3 pieces. So this investigation, while interesting and useful, did not allow us to answer the original question at all, but we end there nonetheless.

Now we want to move on and do something more general. Let's say that $X$ is a random variable taking on finitely many values $\mathcal{X}$ with probabilities $p_1, \ldots, p_n$ (so $|\mathcal{X}| = n$).

Now we want a successful guessing scheme (i.e., a prefix-free code) where the expected number of questions we need to ask (rather than the maximum number of questions) is not too large. The following theorem answers this question and more. For simplicity, we talk about prefix-free codes.

Theorem 4.4: Let $C$ be a prefix-free $D$-code on $\mathcal{X}$ and $N$ the length of the codeword assigned to $X$ (which is a r.v.). Then $H_D(X) \le E[N]$, where $H_D(X)$ is the $D$-entropy of $X$ given by $-\sum_{i=1}^{n} p_i \log_D p_i$. Conversely, there exists a prefix-free $D$-code satisfying $E[N] < H_D(X) + 1$. (We will prove this below but must first do a little more development.)

EXERCISE: Relate this result to Corollary 4.3.

An important concept in information theory which is useful for many purposes and which will allow us to prove the above is the concept of relative entropy. This concept comes up in many other places; below we will explain three other places where it arises, which are respectively: (1) If I construct a prefix-free code satisfying $E[N] < H_D(X) + 1$ under the assumption that the true distribution of $X$ is $q_1, \ldots, q_n$ but it turns out the true distribution is $p_1, \ldots, p_n$, how bad will $E[N]$ be? (2) (level 2) large deviation theory for the Glivenko-Cantelli Theorem (which is called Sanov's Theorem). (3) Universal data compression for i.i.d. processes. The first of these will be done in this section, the 2nd in §5, while the last will be done in §6.

Definition 4.5: If $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$ are two probability distributions on $\mathcal{X}$, the relative entropy or Kullback-Leibler distance between $p$ and $q$ is $$D(p \| q) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i}.$$ Note that if for some $i$, $p_i > 0$ and $q_i = 0$, then the relative entropy is $\infty$. One should think of this as a metric, BUT it's not: it's neither symmetric nor does it satisfy the triangle inequality. One actually needs to specify the base of the logarithm in this definition. The first thing we need is the following, whose proof is an easy application of Jensen's inequality which is left to the reader.

Theorem 4.6: Relative entropy (no matter which base we use) is nonnegative, and 0 if and only if the two distributions are the same.

Proof of Theorem 4.4: Let $C$ be a prefix-free code with $l_i$ denoting the length of the $i$th codeword and $N$ denoting the length of the codeword as a r.v. Simple manipulation gives $$E[N] - H_D(X) = \sum_i p_i \log_D\left(\frac{p_i}{D^{-l_i}/\sum_j D^{-l_j}}\right) - \log_D\left(\sum_i D^{-l_i}\right).$$ The first sum is nonnegative since it's a relative entropy, and the second term is nonnegative by Kraft's Inequality. To show there is a prefix-free code satisfying $E[N] < H_D(X) + 1$, we simply construct it. Let $l_i = \lceil \log_D(1/p_i) \rceil$. A trivial computation shows that the $l_i$'s satisfy the conditions of Kraft's Inequality, and hence there is a prefix-free $D$-code which assigns the $i$th element a codeword of length $l_i$. A trivial calculation shows that $E[N] < H_D(X) + 1$.
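To see the construction in this proof in action, here is a sketch (the distribution is my own toy example) building the Shannon code lengths $l_i = \lceil \log_D(1/p_i) \rceil$ and checking both Kraft's condition and the bounds $H_D(X) \le E[N] < H_D(X) + 1$.

```python
import math

def shannon_lengths(p, D=2):
    return [math.ceil(math.log(1 / pi, D)) for pi in p]

def H_D(p, D=2):
    return -sum(pi * math.log(pi, D) for pi in p if pi > 0)

p = [0.45, 0.25, 0.15, 0.10, 0.05]
lengths = shannon_lengths(p)
EN = sum(pi * li for pi, li in zip(p, lengths))   # expected codeword length
print(lengths)                                    # [2, 2, 3, 4, 5]
print(sum(2 ** (-l) for l in lengths) <= 1)       # Kraft holds: True
print(H_D(p) <= EN < H_D(p) + 1)                  # Theorem 4.4: True
```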

How did we know to choose $l_i = \lceil \log_D(1/p_i) \rceil$? If the $l_i$'s were allowed to be nonintegers, Lagrange multipliers show that the $l_i$'s minimizing $E[N]$ subject to the constraint of Kraft's inequality are $l_i = \log_D(1/p_i)$. There is an easy corollary to Theorem 4.4, whose proof is left to the reader, which we now give.

Corollary 4.7: The minimum expected codeword length $L_n$ per symbol satisfies $$\frac{H(X_1, \ldots, X_n)}{n} \le L_n \le \frac{H(X_1, \ldots, X_n) + 1}{n}.$$ Moreover, if $X_1, \ldots$ is a stationary process with entropy $H$, then $L_n \to H$.

This particular prefix-free code, called the Shannon code, is not necessarily optimal (although it's not bad). There is an optimal code called the Huffman code which I will not discuss. It turns out also that the Shannon code is close to optimal in a certain sense (and in a better sense than that the mean code length is at most 1 from optimal, which we know).

We now discuss our first application of relative entropy (besides our using it in the proof of Theorem 4.4). We mentioned the following question above. If I construct the Shannon prefix-free code satisfying $E[N] < H_D(X) + 1$ under the assumption that the true distribution of $X$ is $q_1, \ldots, q_n$ but it turns out the true distribution is $p_1, \ldots, p_n$, how bad will $E[N]$ be? The answer is given by the following theorem.

Theorem 4.8: The expected length $E_p[N]$ under $p(x)$ of the Shannon code $l_i = \lceil \log_D(1/q_i) \rceil$ satisfies $$H_D(p) + D(p \| q) \le E_p[N] < H_D(p) + D(p \| q) + 1.$$

Proof: Trivial calculation left to the reader.

5 Sanov's Theorem: Another Application of Relative Entropy

In this section, we give an important application (or interpretation) of relative entropy, namely (level 2) large deviation theory. Before doing this, let me remind you (or tell you) what (level 1) large deviation theory is (this is the last section of the first chapter in Durrett's probability book). (Level 1) large deviation theory simply says the following, modulo the technical assumptions. Let $X_i$ be i.i.d. with finite mean $m$ and have a moment generating function defined in some neighborhood of the origin. First, the WLLN tells us $$P\left(\left|\frac{\sum_{i=1}^{n} X_i}{n} - m\right| \ge \epsilon\right) \to 0 \text{ as } n \to \infty$$ for any $\epsilon$. (For this, we of course do not need the exponential moment.) (Level 1) large deviation theory tells us how fast the above goes to 0, and the answer is: it goes to 0 exponentially fast. More precisely, it goes like $e^{-n f(\epsilon)}$ where $f(\epsilon) > 0$ is the so-called Fenchel transform of the logarithm of the moment generating function, given by $$f(\epsilon) = \max_{\theta}\left(\theta\epsilon - \ln(E[e^{\theta X}])\right).$$
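For concreteness, here is a sketch (pure Python; the ternary search and all parameter choices are mine) computing this rate function for Bernoulli(1/2) variables after centering at the mean. The numerical maximum matches the closed form, which happens to be the relative entropy $D\left((\tfrac12+\epsilon, \tfrac12-\epsilon) \,\|\, (\tfrac12, \tfrac12)\right)$, a first hint of Sanov's Theorem below.

```python
import math

def log_mgf_centered(theta, p=0.5):
    """ln E[exp(theta*(X - m))] for X ~ Bernoulli(p), m = p."""
    return math.log(p * math.exp(theta * (1 - p))
                    + (1 - p) * math.exp(-theta * p))

def rate(eps, p=0.5):
    """f(eps) = max_theta (theta*eps - ln E[e^{theta(X-m)}]), found by
    ternary search on [0, 50] (the objective is concave in theta)."""
    g = lambda t: t * eps - log_mgf_centered(t, p)
    lo, hi = 0.0, 50.0
    for _ in range(200):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if g(m1) < g(m2):
            lo = m1
        else:
            hi = m2
    return g((lo + hi) / 2)

eps = 0.1
a = 0.5 + eps
closed_form = math.log(2) + a * math.log(a) + (1 - a) * math.log(1 - a)
print(rate(eps), closed_form)    # both about 0.0201
```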

Level two large deviation theory deals with a similar question but at the level of the empirical distribution function. Let $X_i$ be i.i.d. again and let $F_n$ be the empirical distribution of $X_1, \ldots, X_n$. Glivenko-Cantelli tells us that $F_n \to F$ a.s., where $F$ is the common distribution of the $X_i$'s. If $C$ is a closed set of measures not containing $F$, it follows (we're now doing weak convergence of probability measures on the space of probability measures on $\mathbb{R}$, so you have to think carefully) that $P(F_n \in C) \to 0$, and we want to know how fast. The answer is $$\lim_n \frac{1}{n}\ln(P(F_n \in C)) = -\min_{P \in C} D(P \| F).$$ The above is called Sanov's Theorem, which we now prove. There will be a fair development before getting to this, which I feel will be useful and worth doing (and so this might not be the quickest route to Sanov's Theorem). All of this material is taken from Chapter 12 of Cover and Thomas's book.

If $x = (x_1, \ldots, x_n)$ is a sample from an i.i.d. process with distribution $Q$, then $P_x$, which we call the type of $x$, is defined to be the empirical distribution (which I will assume you know) of $x$.

Definition 5.1: We will let $\mathcal{P}_n$ denote the set of all types (i.e., empirical distributions) from a sample of size $n$, and given $P \in \mathcal{P}_n$, we let $T(P)$ be all $x = (x_1, \ldots, x_n)$ (i.e., all samples) which have type (empirical distribution) $P$.

EXERCISE: If $Q$ has its support on $\{1, \ldots, 7\}$, find $T(P)$ where $P$ is the type of a given sample $x$.
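For the exercise, a brute-force sketch (mine): the type of a sample is just its empirical distribution, and $T(P)$ is exactly the set of distinct rearrangements of the sample, whose count is the multinomial coefficient appearing in the combinatorial proof of Lemma 5.5 below.

```python
from collections import Counter
from itertools import permutations

def type_of(x):
    """The type P_x: empirical distribution of the sample x."""
    n = len(x)
    counts = Counter(x)
    return {a: counts[a] / n for a in sorted(counts)}

def type_class(x):
    """T(P_x): all samples of the same length with the same type,
    i.e., the distinct permutations of x."""
    return set(permutations(x))

x = (1, 2, 1, 7)                  # a sample supported on {1, ..., 7}
print(type_of(x))                 # {1: 0.5, 2: 0.25, 7: 0.25}
print(len(type_class(x)))         # 4!/(2!1!1!) = 12
```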

Lemma 5.2: $|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$, where $\mathcal{X}$ is the set of possible values our process can take on. This lemma, while important, is trivial and left to the reader.

Lemma 5.3: Let $X_1, \ldots$ be i.i.d. according to $Q$ and let $P_x$ denote the type of $x$. Then $$Q^n(x) = 2^{-n(H(P_x) + D(P_x \| Q))}$$ and so the $Q^n$ probability of $x$ depends only on its type.

Proof: Trivial calculation (although a few lines) left to the reader.

Lemma 5.4: Let $X_1, \ldots$ be i.i.d. according to $Q$. If $x$ is of type $Q$, then $Q^n(x) = 2^{-nH(Q)}$.

Our next result gives us an estimate of the size of a type class $T(P)$.

Lemma 5.5: $$\frac{1}{(n+1)^{|\mathcal{X}|}} 2^{nH(P)} \le |T(P)| \le 2^{nH(P)}.$$

Proof: One method is to write down the exact cardinality of the set (using elementary combinatorics) and then apply Stirling's formula. We however proceed differently. The upper bound is trivial: since each $x \in T(P)$ has (by the previous lemma) under $P^n$ probability $2^{-nH(P)}$, there can be at most $2^{nH(P)}$ elements in $T(P)$. For the lower bound, the main step is to show that type class $P$ has the largest probability of all type classes

(note however that an element from type class $P$ does not necessarily have a higher probability than elements from other type classes, as you should check), i.e., $$P^n(T(P)) \ge P^n(T(P')) \quad \text{for all } P'.$$ I leave this to you (although it takes some work; see the book if you want). Then we argue as follows. Since there are at most $(n+1)^{|\mathcal{X}|}$ type classes, and $T(P)$ has the largest probability, its $P^n$ probability must be at least $\frac{1}{(n+1)^{|\mathcal{X}|}}$. Since each element of this type class has $P^n$ probability $2^{-nH(P)}$, we obtain the lower bound.

Combining Lemmas 5.3 and 5.5, we immediately obtain

Corollary 5.6: $$\frac{1}{(n+1)^{|\mathcal{X}|}} 2^{-nD(P \| Q)} \le Q^n(T(P)) \le 2^{-nD(P \| Q)}.$$

Remark: The above tells us that if we flip $n$ fair coins, the probability that we get exactly 50% heads, while going to 0 as $n \to \infty$, does not go to 0 exponentially fast.

The above discussion has set us up for an easy proof of the Glivenko-Cantelli Theorem (analogous to the fact that once one has level 1 large deviation theory set up, the Strong Law of Large Numbers follows trivially). The key that makes the whole thing work is the fact (Corollary 5.6) that the probability under $Q$ that a certain type class (different from $Q$) arises is exponentially small (the exponent being given by the relative entropy) and the fact that there are only polynomially many type classes.
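The coin-flip remark can be checked exactly: with $P = Q = (\tfrac12, \tfrac12)$ we have $D(P \| Q) = 0$, and $Q^n(T(P)) = \binom{n}{n/2}/2^n$, which decays only like $1/\sqrt{n}$, comfortably inside the bounds of Corollary 5.6. A quick sketch (mine):

```python
import math

def prob_half_heads(n):
    """Q^n(T(P)) for Q the fair coin, P = (1/2, 1/2), n even."""
    return math.comb(n, n // 2) / 2 ** n

for n in [10, 100, 1000, 10000]:
    q = prob_half_heads(n)
    in_bounds = (n + 1) ** (-2) <= q <= 1      # Corollary 5.6, |X| = 2
    print(n, q, in_bounds, q * math.sqrt(n))   # last column -> sqrt(2/pi)
```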

Theorem 5.7: Let $X_1, \ldots$ be i.i.d. according to $P$. Then $$P(D(P_x \| P) > \epsilon) \le 2^{-n\left(\epsilon - |\mathcal{X}| \frac{\log(n+1)}{n}\right)}$$ and hence (by Borel-Cantelli) $D(P_x \| P) \to 0$ a.s.

Proof: Using Corollary 5.6, $$P(D(P_x \| P) > \epsilon) \le \sum_{Q \in \mathcal{P}_n : D(Q \| P) > \epsilon} 2^{-nD(Q \| P)} \le (n+1)^{|\mathcal{X}|} 2^{-n\epsilon} = 2^{-n\left(\epsilon - |\mathcal{X}| \frac{\log(n+1)}{n}\right)}.$$

We are now ready to state Sanov's Theorem. We are doing everything under the assumption that the underlying distribution has finite support (i.e., there are only a finite number of values our r.v. can take on). One can get rid of this assumption, but we stick to it here since it gets rid of some technicalities. We can view the set of all probability measures (call it $\mathcal{P}$) on our finite outcome space (of size $|\mathcal{X}|$) as a subspace of $\mathbb{R}^{|\mathcal{X}|}$ (or even of $\mathbb{R}^{|\mathcal{X}|-1}$). (In the former case, it would be a simplex in the positive cone.) This immediately gives us a nice topology on these measures (which of course is nothing but the weak topology in a simple setting). Assuming that $P$ gives every $x \in \mathcal{X}$ positive measure (if not, change $\mathcal{X}$), note that $D(Q \| P)$ is a continuous function of $Q$ on $\mathcal{P}$, and hence on any closed (and hence compact) set $E$ in $\mathcal{P}$, this function assumes a minimum (which is not 0 if $P \notin E$).

Theorem 5.8 (Sanov's Theorem): Let $X_1, \ldots$ be i.i.d. with distribution $P$ and let $E$ be a closed set in $\mathcal{P}$. Then $$P^n(E) \left(= P^n(E \cap \mathcal{P}_n)\right) \le (n+1)^{|\mathcal{X}|} 2^{-nf(E)}$$ where $f(E) = \min_{Q \in E} D(Q \| P)$. ($P^n(E)$ means of course $P^n(\{x : P_x \in E\})$.) If, in addition, $E$ is the closure of its interior, then the above exponential upper bound also gives a lower bound, in that $$\lim_n \frac{1}{n} \log P^n(E) = -f(E).$$

Proof: The derivation of the upper bound follows easily from Theorem 5.7, which we leave to the reader. The lower bound is not much harder, but a little care is needed (as should be clear from the topological assumption on $E$). We need to be able to find guys in $E \cap \mathcal{P}_n$. Since the $\mathcal{P}_n$'s become more and more dense as $n$ gets large, it is easy to see (why?) that $E \cap \mathcal{P}_n \ne \emptyset$ for large $n$ (this just uses the fact that $E$ has nonempty interior) and that there exist $Q_n \in E \cap \mathcal{P}_n$ with $Q_n \to Q^*$ where $Q^*$ is some distribution which minimizes $D(Q \| P)$ over $Q \in E$. This immediately gives that $D(Q_n \| P) \to D(Q^* \| P) (= f(E))$. We now have $$P^n(E) \ge P^n(T(Q_n)) \ge \frac{1}{(n+1)^{|\mathcal{X}|}} 2^{-nD(Q_n \| P)},$$ from which one immediately concludes that $$\liminf_n \frac{1}{n} \log P^n(E) \ge \liminf_n -D(Q_n \| P) = -f(E).$$

The first part gives this as an upper bound and so we're done.

Note that the proof goes through if one of the $Q$'s which minimize $D(Q \| P)$ over $Q \in E$ (of which there must be at least one due to compactness) is in the closure of the interior of $E$. It is in fact also the case (although not at all obvious and related to other important things) that if $E$ is also convex, then there is a unique minimizing $Q$, and so there is a Hilbert space like picture here. (If $E$ is not convex, it is easy to see that there is not necessarily a unique minimizing guy.) The fact that there is a unique minimizing guy (in the convex case) is suggested by (but not implied by) the convexity of relative entropy in its two arguments.

Hard Exercise: Recover the level 1 large deviation theory from Sanov's Theorem using Lagrange multipliers.

6 Universal Entropy Estimation and Data Compression

We first use relative entropy to prove the possibility of universal data compression for i.i.d. processes. Later on, we will do the much more difficult universal data compression for general ergodic sources, the so-called Ziv-Lempel algorithm, which is used in the real world (as far as I understand it). Of course, from a practical point of view, the results of Chapter 1 are useless, since typically one does NOT know the process that one is observing, and so one wants a universal way to compress data. We will now do this for i.i.d. processes, where it is quite easy using the things we derived in §5. We have seen previously in §4 that if we know the distribution of a r.v. $x$, we can encode $x$ by assigning it a word which has length $\lceil \log(1/p(x)) \rceil$.

We have also seen that if the code is built for $q$ but the distribution is really $p$, we get a penalty of $D(p \| q)$. We now find a code of rate $R$ which suffices for all i.i.d. processes of entropy less than $R$.

Definition 6.1: A fixed rate block code of rate $R$ for a process $X_1, \ldots$ with known state space $\mathcal{X}$ consists of two mappings, the encoder $$f_n : \mathcal{X}^n \to \{1, \ldots, 2^{nR}\}$$ and the decoder $$\phi_n : \{1, \ldots, 2^{nR}\} \to \mathcal{X}^n.$$ We let $$P_e^{(n,Q)} = Q^n\left(\phi_n(f_n(X_1, \ldots, X_n)) \ne (X_1, \ldots, X_n)\right)$$ be the probability of an error in this coding scheme under the assumption that $Q$ is the true distribution of the $X_i$'s.

Theorem 6.2: There exists a fixed rate block code of rate $R$ such that for all $Q$ with $H(Q) < R$, $P_e^{(n,Q)} \to 0$ as $n \to \infty$.

Proof: Let $R_n = R - |\mathcal{X}| \frac{\log(n+1)}{n}$ and let $$A_n = \{x \in \mathcal{X}^n : H(P_x) \le R_n\}.$$ Then $$|A_n| = \sum_{P \in \mathcal{P}_n : H(P) \le R_n} |T(P)| \le (n+1)^{|\mathcal{X}|} 2^{nR_n} = 2^{nR}.$$

Now we index the elements of $A_n$ in an arbitrary way and we let $f_n(x)$ be the index of $x$ in $A_n$ if $x \in A_n$ and 0 otherwise. The decoding map $\phi_n$ takes of course the index into the corresponding word. Clearly, we obtain an error (at the $n$th stage) if and only if $x \notin A_n$. We then get $$P_e^{(n,Q)} = 1 - Q^n(A_n) = \sum_{P : H(P) > R_n} Q^n(T(P)) \le \sum_{P : H(P) > R_n} 2^{-nD(P \| Q)} \le (n+1)^{|\mathcal{X}|} 2^{-n \min_{P : H(P) > R_n} D(P \| Q)}.$$ Now $R_n \to R$ and $H(Q) < R$, and so $R_n > H(Q)$ for large $n$, and so the exponent $\min_{P : H(P) > R_n} D(P \| Q)$ is bounded away from 0, and so $P_e^{(n,Q)}$ goes exponentially to 0 for fixed $Q$. Actually, this last argument was a little sloppy and one should be a little more careful with the details. One first should note that compactness, together with the fact that $D(P \| Q)$ is 0 only when $P = Q$, implies that if we stay away from any neighborhood of $Q$, $D(P \| Q)$ is bounded away from 0. Then note that if $R > H(Q)$, the continuity of the entropy function tells us that $\{P : H(P) > R_n\}$ misses some neighborhood of $Q$, and hence $\min_{P : H(P) > R_n} D(P \| Q) > 0$, and clearly this is increasing in $n$ (look at the derivative of $\frac{\ln x}{x}$).

Remarks: (1) Technically, 0 is not an allowable image. Fix this (it's trivial). (2) For fixed $Q$, since the above probability goes to 0 exponentially fast, Borel-Cantelli tells us the coding scheme will work eventually forever (i.e., we have an a.s. statement rather than just an in-probability statement). (3) There is no uniformity in $Q$, but there is a uniformity in $Q$ for $H(Q) \le R - \epsilon$. (Show this.) (4) For $Q$ with $H(Q) > R$, the above guessing scheme works with probability going to 0. Could there be a different guessing scheme which works?
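For a Bernoulli source the error probability in this proof can be evaluated exactly, since $H(P_x)$ depends only on the number of ones; the sketch below (parameter choices mine) shows the exponential decay when $H(Q) < R$.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def error_prob(n, R, q):
    """P_e^{(n,Q)} for Q = Bernoulli(q): total Q^n-mass of words whose
    type has entropy above R_n = R - |X| log(n+1)/n, with |X| = 2."""
    Rn = R - 2 * math.log2(n + 1) / n
    return sum(math.comb(n, k) * q ** k * (1 - q) ** (n - k)
               for k in range(n + 1) if h2(k / n) > Rn)

q, R = 0.1, 0.7                     # H(Q) = h2(0.1) ~ 0.469 < R
for n in [100, 400, 1600]:
    print(n, error_prob(n, R, q))   # drops roughly exponentially in n
```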

Theorem 6.2 produces for us a universal data compressor for i.i.d. sequences.

Exercise: Construct an entropy estimator which is universal for i.i.d. sequences.

We now construct a universal entropy estimator which is universal for all ergodic processes. We will now construct something which is not technically a universal entropy estimator according to the definition given in §3 but seems analogous and can easily be modified to be one. We will later discuss how this procedure is related to a universal data compressor.

Theorem 6.3: Let $(x_1, x_2, \ldots)$ be an infinite word and let $$R_n(x) = \min\{j \ge 1 : (x_1, x_2, \ldots, x_n) = (x_{j+1}, x_{j+2}, \ldots, x_{j+n})\}$$ (which is the first time the initial block of length $n$ repeats itself). Then $$\lim_n \frac{\log R_n(X_1, X_2, \ldots)}{n} = H \quad \text{a.s.}$$ where $H$ is the entropy of the process. Rather than writing out the proof of that, we will read the paper "Entropy and Data Compression Schemes" by Ornstein and Weiss. This paper will be much more demanding than these notes have been up to now.

7 Shannon's Theorems on Capacity

In this section, we discuss and prove the fundamental theorem of channel capacity, due to Shannon in his groundbreaking work in the field. Before going into the mathematics, we should first give a small discussion. Assume that we have a channel to which one can input a 0 or a 1 and which outputs a 0 or 1, from which we want to recover the input.

Let's assume that the channel transmits the signal correctly with probability $1-p$ and corrupts it (and therefore sends the other value) with probability $p$. Assume that $p < 1/2$ (if $p = 1/2$, the channel clearly would be completely useless). Based on the output we receive, our best guess at the input would of course be the output, in which case the probability of making an error would be $p$. If we want to decrease the probability of making an error, we could do the following. We could simply send our input through the channel 3 times and then, based on the 3 outputs, we would guess the input to be that output which occurred the most times. This coding/decoding scheme will work as long as the channel does not make 2 or more errors, and the probability of this is (for small $p$) much much smaller ($\approx p^2$) than if we were to send just 1 bit through. Great! The probability is much smaller now, BUT there is a price to pay for this. Since we have to send 3 bits through the channel to have one input bit guessed, the rate of transmission has suddenly dropped to 1/3. Anyway, if we are still dissatisfied with the probability of this scheme making an error, we could send the bit through 5 times (using an analogous majority rule decoding scheme), thereby decreasing the probability of a decoding mistake much more ($\approx p^3$), but then the transmission rate would go even lower, to 1/5. It seems that in order to achieve the probability of error going to 0, we need to drop the rate of transmission closer and closer to 0. The amazing discovery of Shannon is that this very reasonable statement is simply false. It turns out that if we are simply willing to use a rate of transmission which is less than the channel capacity (which is simply some number depending only on the channel (in our example, this number is strictly positive as long as $p \ne \frac{1}{2}$, which is reasonable)), then as long as we send long strings of input, we can transmit them with arbitrarily low probability of error.

In particular, the rate need not go to 0 to achieve arbitrary reliability. We now enter the mathematics, which explains more precisely what the above is all about and proves it. The formal definition of capacity of a channel is based on a concept called the mutual information of two r.v.'s, which is denoted by $I(X;Y)$.

Definition 7.1: $I(X;Y)$, called the mutual information of $X$ and $Y$, is the relative entropy of the joint distribution of $X$ and $Y$ and the product distribution of $X$ and $Y$. Note that $I(X;Y)$ and $I(Y;X)$ are the same and that they are not infinite (although recall in general relative entropy can be infinite).

Definition 7.2: The conditional entropy of $Y$ given $X$, denoted $H(Y \mid X)$, is defined to be $\sum_x P(X = x) H(Y \mid X = x)$.

Proposition 7.3: (1) $H(X,Y) = H(X) + H(Y \mid X)$. (2) $I(X;Y) = H(X) - H(X \mid Y)$. The proofs of these are left to the reader. There is again no idea involved, simply computation.

Definition 7.4: A discrete memoryless channel (DMC) is a system which takes as input elements of an input alphabet $\mathcal{X}$ and prints out elements from an output alphabet $\mathcal{Y}$ randomly according to some $p(y \mid x)$ (i.e., if $x$ is sent in, then $y$ comes out with probability $p(y \mid x)$). Of course the numbers $p(y \mid x)$ are nonnegative and for fixed $x$ give 1 if we sum over $y$.

(Such a thing is sometimes called a kernel in probability theory and could be viewed as a Markov transition matrix if the sets $\mathcal{X}$ and $\mathcal{Y}$ were the same (which they need not be).)

Definition 7.5: The information channel capacity of a discrete memoryless channel is $$C = \max_{p(x)} I(X;Y).$$

Exercise: Compute this for different examples, like $\mathcal{X} = \mathcal{Y} = \{0,1\}$ and the channel being $p(x \mid x) = p$ for all $x$.

We assume now that if we send a sequence through the channel, the outputs are all independently chosen from $p(y \mid x)$ (i.e., the outputs occur independently, an assumption you might object to).

Definition 7.6: An $(M, n)$ code $\mathcal{C}$ consists of the following:
1. An index set $\{1, \ldots, M\}$.
2. An encoding function $X^n : \{1, \ldots, M\} \to \mathcal{X}^n$, which yields the codewords $X^n(1), \ldots, X^n(M)$; this is called the codebook.
3. A decoding function $g : \mathcal{Y}^n \to \{1, \ldots, M\}$.

For $i \in \{1, \ldots, M\}$, we let $\lambda_i$ be the probability of an error if we encode and send $i$ over the channel, that is, $P(g(Y^n) \ne i)$ when $i$ is encoded and sent over the channel. We also let $$\lambda^{(n)} = \max_i \lambda_i \quad \text{and} \quad P_e^{(n)} = \frac{1}{M}\sum_i \lambda_i$$ be the maximal error and the average probability of error. We attach a superscript $\mathcal{C}$ to all of these if the code is not clear from context.

Exercise: Given any DMC, find an $(M, n)$ code such that $\lambda_1 = 0$.

Definition 7.7: The rate of an $(M, n)$ code is $\frac{\log M}{n}$ (where we use base 2).
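For the capacity exercise, here is a brute-force sketch (mine) for the binary symmetric channel with crossover probability $p$: maximize $I(X;Y)$ over input distributions on a grid and compare with the closed form $C = 1 - h(p)$ bits, attained at the uniform input.

```python
import math

def h2(x):
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def mutual_information(a, p):
    """I(X;Y) in bits for the BSC with crossover p and P(X=1) = a:
    I = H(Y) - H(Y|X) = h2(a(1-p) + (1-a)p) - h2(p)."""
    return h2(a * (1 - p) + (1 - a) * p) - h2(p)

p = 0.11
C_grid = max(mutual_information(a / 1000, p) for a in range(1001))
print(C_grid, 1 - h2(p))        # both about 0.5 bits for p = 0.11
```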

We now arrive at one of the most important theorems in the field (as far as I understand it), which says in words that all rates below channel capacity can be achieved with arbitrarily small probability of error.

Theorem 7.8 (Shannon's Theorem on Capacity): All rates below capacity $C$ are achievable, in that given $R < C$, there exists a sequence of $(2^{nR}, n)$ codes with maximum probability of error $\lambda^{(n)} \to 0$. Conversely, if there exists a sequence of $(2^{nR}, n)$ codes with maximum probability of error $\lambda^{(n)} \to 0$, then $R \le C$.

Before presenting the proof of this result, we make a few remarks which should be of help to the reader. In the definition of an $(M, n)$ code, we have this set $\{1, \ldots, M\}$ and an encoding function $X^n$ mapping this set into $\mathcal{X}^n$. This way of defining a code is not so essential. The only thing that really matters is what the image of $X^n$ is, and so one could really have defined an $(M, n)$ code to be a subset of $\mathcal{X}^n$ consisting of $M$ elements together with the decoding function $g : \mathcal{Y}^n \to \mathcal{X}^n$. Formally, the only reason these definitions are not EXACTLY the same is that $X^n$ might not be injective, which is certainly a property you want your code to have anyway. The point is one should also think of a code in this way.

Proof: The proof of this is quite long. While the mathematics is all very simple, conceptually it is a little more difficult. The method is a randomization method where we choose a code at random (according to some distribution) and show that with high probability it will be a good code. The fact that we are choosing the code according to a special distribution will make the computation of this probability tractable. The key concept in the proof is the notion of joint typicality.

The SMB Theorem (for i.i.d.) tells us that a typical sequence $x_1, \ldots, x_n$ from an ergodic stationary process has $\frac{-\ln(p(x_1, \ldots, x_n))}{n}$ close to the entropy $H(X)$. If we have instead independent samples from a joint distribution $p(x,y)$, we want to know what $(x_1, y_1), \ldots, (x_n, y_n)$ typically looks like, and this gives us the concept of joint typicality. An important point is that $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ can be individually typical without being jointly typical.

We now give a completely heuristic proof of the theorem. Given the output, we will choose as our guess for the input something (which encodes to something) which is jointly typical with the output. Since the true input and the output will be jointly typical with high probability, there will with high probability be some input which is jointly typical with the output, namely the true input. The problem is maybe there is more than 1 possible input sequence which is jointly typical with the output. (If there is only one, there is no problem.) The probability that a possible input sequence which was not input to the channel is in fact jointly typical with the output sequence should be more or less the probability that a possible input and output sequence chosen independently are jointly typical, which is $\approx 2^{-nI(X;Y)}$ by a trivial computation. Hence if we use $2^{nR}$ with $R < I(X;Y)$ different possible input sequences, then, with high probability, none of the possible input sequences (other than the true one) will be jointly typical with the output and we will decode correctly.

We need the following lemma, whose proof is left to the reader. It is easy to prove using the methods in §1 when we looked at consequences of the SMB Theorem. An arbitrary element $(x_1, \ldots, x_n)$ of $\mathcal{X}^n$ will be denoted by $x^n$ below.

Lemma 7.9: Let $p(x,y)$ be some joint distribution. Let $A_\epsilon^n$ be the set of $\epsilon$-jointly typical sequences (with respect to $p(x,y)$) of length $n$, namely, $$A_\epsilon^n = \left\{(x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \left|\tfrac{-1}{n}\log p(x^n) - H(X)\right| < \epsilon,\ \left|\tfrac{-1}{n}\log p(y^n) - H(Y)\right| < \epsilon, \text{ and } \left|\tfrac{-1}{n}\log p(x^n, y^n) - H(X,Y)\right| < \epsilon\right\}$$ where $p(x^n, y^n) = \prod_{i=1}^{n} p(x_i, y_i)$. Then if $(X^n, Y^n)$ is drawn i.i.d. according to $p(x,y)$:
1. $P((X^n, Y^n) \in A_\epsilon^n) \to 1$ as $n \to \infty$.
2. $|A_\epsilon^n| \le 2^{n(H(X,Y)+\epsilon)}$.
3. If $(\tilde{X}^n, \tilde{Y}^n)$ is chosen i.i.d. according to $p(x)p(y)$, then $$P((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^n) \le 2^{-n(I(X;Y) - 3\epsilon)}$$ and, for sufficiently large $n$, $$P((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^n) \ge (1-\epsilon)\, 2^{-n(I(X;Y) + 3\epsilon)}.$$

We now proceed with the proof. Let $R < C$. We now construct a sequence of $(2^{nR}, n)$ codes ($2^{nR}$ means $\lceil 2^{nR} \rceil$ here) whose maximum probability of error $\lambda^{(n)}$ goes to 0. We will first prove the existence of a sequence of codes which have a small average error (in the limit) and then modify it to get something with small maximal error (in the limit). By the definition of capacity, we can choose $p(x)$ so that $R < I(X;Y)$, and then we can choose $\epsilon$ so that $R + 3\epsilon < I(X;Y)$. Let $\mathcal{C}_n$ be the set of all encoding functions $X^n : \{1, 2, \ldots, 2^{nR}\} \to \mathcal{X}^n$. Our encoding function will be chosen randomly according to $p(x)$, in that each $i \in \{1, 2, \ldots, 2^{nR}\}$ is independently sent to $x^n$, where $x^n$ is chosen according to $\prod_{i=1}^{n} p(x_i)$.

Once we have chosen some encoding function $X^n$, the decoding procedure is as follows. If we receive $y^n$, our decoder guesses that the word $W \in \{1, 2, \ldots, 2^{nR}\}$ has been sent if (1) $X^n(W)$ and $y^n$ are $\epsilon$-jointly typical and (2) there is no other $\hat{W} \in \{1, 2, \ldots, 2^{nR}\}$ with $X^n(\hat{W})$ and $y^n$ being $\epsilon$-jointly typical. We also let $\mathcal{C}_n$ denote the set of codes obtained by considering all encoding functions and their resulting codes. Let $E_n$ denote the event that a mistake is made when $W \in \{1, 2, \ldots, 2^{nR}\}$ is chosen uniformly. We first show that $P(E_n) \to 0$ as $n \to \infty$: $$P(E_n) = \sum_{C \in \mathcal{C}_n} P(C)\, P_e^{(n),C} = \sum_{C \in \mathcal{C}_n} P(C)\, \frac{1}{2^{nR}} \sum_{w=1}^{2^{nR}} \lambda_w(C) = \frac{1}{2^{nR}} \sum_{w=1}^{2^{nR}} \sum_{C \in \mathcal{C}_n} P(C)\, \lambda_w(C).$$ By symmetry, $\sum_{C \in \mathcal{C}_n} P(C)\, \lambda_w(C)$ does not depend on $w$, and hence the above is $\sum_{C \in \mathcal{C}_n} P(C)\, \lambda_1(C) = P(E_n \mid W = 1)$. One should easily see directly anyway that symmetry gives $P(E_n) = P(E_n \mid W = 1)$. Because of the independence in which the encoding function and the word $W$ were chosen, we can instead consider the procedure where we choose the code at random as above but always transmit the first word 1. In the new resulting probability space, we still let $E_n$ denote the event of a decoding mistake. We also, for each $i \in \{1, 2, \ldots, 2^{nR}\}$, have the event $$E_i = \{(X^n(i), Y^n) \in A_\epsilon^n\}$$ where $Y^n$ is the output of the channel. We clearly have $$E_n \subseteq (E_1)^c \cup \bigcup_{i=2}^{2^{nR}} E_i.$$ Now $P(E_1) \to 1$ as $n \to \infty$ by the joint typicality lemma. Next, the independence of $X^n(i)$ and $X^n(1)$ (for $i \ne 1$) implies the independence of $X^n(i)$ and $Y^n$ (for $i \ne 1$), which gives (by the joint typicality lemma) $P(E_i) \le 2^{-n(I(X;Y) - 3\epsilon)}$, and so $$P\left(\bigcup_{i=2}^{2^{nR}} E_i\right) \le 2^{nR}\, 2^{-n(I(X;Y) - 3\epsilon)} = 2^{-n(I(X;Y) - 3\epsilon - R)},$$ which goes to 0 as $n \to \infty$. Since we saw that $P(E_n)$ goes to 0 as $n \to \infty$ and


More information

Distribution of Random Samples & Limit theorems

Distribution of Random Samples & Limit theorems STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to

More information

Lecture 11: Channel Coding Theorem: Converse Part

Lecture 11: Channel Coding Theorem: Converse Part EE376A/STATS376A Iformatio Theory Lecture - 02/3/208 Lecture : Chael Codig Theorem: Coverse Part Lecturer: Tsachy Weissma Scribe: Erdem Bıyık I this lecture, we will cotiue our discussio o chael codig

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig

More information

1 Introduction to reducing variance in Monte Carlo simulations

1 Introduction to reducing variance in Monte Carlo simulations Copyright c 010 by Karl Sigma 1 Itroductio to reducig variace i Mote Carlo simulatios 11 Review of cofidece itervals for estimatig a mea I statistics, we estimate a ukow mea µ = E(X) of a distributio by

More information

The Growth of Functions. Theoretical Supplement

The Growth of Functions. Theoretical Supplement The Growth of Fuctios Theoretical Supplemet The Triagle Iequality The triagle iequality is a algebraic tool that is ofte useful i maipulatig absolute values of fuctios. The triagle iequality says that

More information

Integrable Functions. { f n } is called a determining sequence for f. If f is integrable with respect to, then f d does exist as a finite real number

Integrable Functions. { f n } is called a determining sequence for f. If f is integrable with respect to, then f d does exist as a finite real number MATH 532 Itegrable Fuctios Dr. Neal, WKU We ow shall defie what it meas for a measurable fuctio to be itegrable, show that all itegral properties of simple fuctios still hold, ad the give some coditios

More information

A Proof of Birkhoff s Ergodic Theorem

A Proof of Birkhoff s Ergodic Theorem A Proof of Birkhoff s Ergodic Theorem Joseph Hora September 2, 205 Itroductio I Fall 203, I was learig the basics of ergodic theory, ad I came across this theorem. Oe of my supervisors, Athoy Quas, showed

More information

5 Birkhoff s Ergodic Theorem

5 Birkhoff s Ergodic Theorem 5 Birkhoff s Ergodic Theorem Amog the most useful of the various geeralizatios of KolmogorovâĂŹs strog law of large umbers are the ergodic theorems of Birkhoff ad Kigma, which exted the validity of the

More information

Shannon s noiseless coding theorem

Shannon s noiseless coding theorem 18.310 lecture otes May 4, 2015 Shao s oiseless codig theorem Lecturer: Michel Goemas I these otes we discuss Shao s oiseless codig theorem, which is oe of the foudig results of the field of iformatio

More information

Math 155 (Lecture 3)

Math 155 (Lecture 3) Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,

More information

BIRKHOFF ERGODIC THEOREM

BIRKHOFF ERGODIC THEOREM BIRKHOFF ERGODIC THEOREM Abstract. We will give a proof of the poitwise ergodic theorem, which was first proved by Birkhoff. May improvemets have bee made sice Birkhoff s orgial proof. The versio we give

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

Information Theory Tutorial Communication over Channels with memory. Chi Zhang Department of Electrical Engineering University of Notre Dame

Information Theory Tutorial Communication over Channels with memory. Chi Zhang Department of Electrical Engineering University of Notre Dame Iformatio Theory Tutorial Commuicatio over Chaels with memory Chi Zhag Departmet of Electrical Egieerig Uiversity of Notre Dame Abstract A geeral capacity formula C = sup I(; Y ), which is correct for

More information

MA131 - Analysis 1. Workbook 2 Sequences I

MA131 - Analysis 1. Workbook 2 Sequences I MA3 - Aalysis Workbook 2 Sequeces I Autum 203 Cotets 2 Sequeces I 2. Itroductio.............................. 2.2 Icreasig ad Decreasig Sequeces................ 2 2.3 Bouded Sequeces..........................

More information

Lecture 3 The Lebesgue Integral

Lecture 3 The Lebesgue Integral Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/5.070J Fall 203 Lecture 3 9//203 Large deviatios Theory. Cramér s Theorem Cotet.. Cramér s Theorem. 2. Rate fuctio ad properties. 3. Chage of measure techique.

More information

Sequences, Series, and All That

Sequences, Series, and All That Chapter Te Sequeces, Series, ad All That. Itroductio Suppose we wat to compute a approximatio of the umber e by usig the Taylor polyomial p for f ( x) = e x at a =. This polyomial is easily see to be 3

More information

Discrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22

Discrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22 CS 70 Discrete Mathematics for CS Sprig 2007 Luca Trevisa Lecture 22 Aother Importat Distributio The Geometric Distributio Questio: A biased coi with Heads probability p is tossed repeatedly util the first

More information

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece 1, 1, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet

More information

Optimally Sparse SVMs

Optimally Sparse SVMs A. Proof of Lemma 3. We here prove a lower boud o the umber of support vectors to achieve geeralizatio bouds of the form which we cosider. Importatly, this result holds ot oly for liear classifiers, but

More information

The Random Walk For Dummies

The Random Walk For Dummies The Radom Walk For Dummies Richard A Mote Abstract We look at the priciples goverig the oe-dimesioal discrete radom walk First we review five basic cocepts of probability theory The we cosider the Beroulli

More information

MA131 - Analysis 1. Workbook 3 Sequences II

MA131 - Analysis 1. Workbook 3 Sequences II MA3 - Aalysis Workbook 3 Sequeces II Autum 2004 Cotets 2.8 Coverget Sequeces........................ 2.9 Algebra of Limits......................... 2 2.0 Further Useful Results........................

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

Empirical Processes: Glivenko Cantelli Theorems

Empirical Processes: Glivenko Cantelli Theorems Empirical Processes: Gliveko Catelli Theorems Mouliath Baerjee Jue 6, 200 Gliveko Catelli classes of fuctios The reader is referred to Chapter.6 of Weller s Torgo otes, Chapter??? of VDVW ad Chapter 8.3

More information

1 Convergence in Probability and the Weak Law of Large Numbers

1 Convergence in Probability and the Weak Law of Large Numbers 36-752 Advaced Probability Overview Sprig 2018 8. Covergece Cocepts: i Probability, i L p ad Almost Surely Istructor: Alessadro Rialdo Associated readig: Sec 2.4, 2.5, ad 4.11 of Ash ad Doléas-Dade; Sec

More information

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece,, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet as

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

Lecture 15: Strong, Conditional, & Joint Typicality

Lecture 15: Strong, Conditional, & Joint Typicality EE376A/STATS376A Iformatio Theory Lecture 15-02/27/2018 Lecture 15: Strog, Coditioal, & Joit Typicality Lecturer: Tsachy Weissma Scribe: Nimit Sohoi, William McCloskey, Halwest Mohammad I this lecture,

More information

Lecture 2: April 3, 2013

Lecture 2: April 3, 2013 TTIC/CMSC 350 Mathematical Toolkit Sprig 203 Madhur Tulsiai Lecture 2: April 3, 203 Scribe: Shubhedu Trivedi Coi tosses cotiued We retur to the coi tossig example from the last lecture agai: Example. Give,

More information

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4. 4. BASES I BAACH SPACES 39 4. BASES I BAACH SPACES Sice a Baach space X is a vector space, it must possess a Hamel, or vector space, basis, i.e., a subset {x γ } γ Γ whose fiite liear spa is all of X ad

More information

DS 100: Principles and Techniques of Data Science Date: April 13, Discussion #10

DS 100: Principles and Techniques of Data Science Date: April 13, Discussion #10 DS 00: Priciples ad Techiques of Data Sciece Date: April 3, 208 Name: Hypothesis Testig Discussio #0. Defie these terms below as they relate to hypothesis testig. a) Data Geeratio Model: Solutio: A set

More information

Lecture 10: Universal coding and prediction

Lecture 10: Universal coding and prediction 0-704: Iformatio Processig ad Learig Sprig 0 Lecture 0: Uiversal codig ad predictio Lecturer: Aarti Sigh Scribes: Georg M. Goerg Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved

More information

Sequences I. Chapter Introduction

Sequences I. Chapter Introduction Chapter 2 Sequeces I 2. Itroductio A sequece is a list of umbers i a defiite order so that we kow which umber is i the first place, which umber is i the secod place ad, for ay atural umber, we kow which

More information

Randomized Algorithms I, Spring 2018, Department of Computer Science, University of Helsinki Homework 1: Solutions (Discussed January 25, 2018)

Randomized Algorithms I, Spring 2018, Department of Computer Science, University of Helsinki Homework 1: Solutions (Discussed January 25, 2018) Radomized Algorithms I, Sprig 08, Departmet of Computer Sciece, Uiversity of Helsiki Homework : Solutios Discussed Jauary 5, 08). Exercise.: Cosider the followig balls-ad-bi game. We start with oe black

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

Measure and Measurable Functions

Measure and Measurable Functions 3 Measure ad Measurable Fuctios 3.1 Measure o a Arbitrary σ-algebra Recall from Chapter 2 that the set M of all Lebesgue measurable sets has the followig properties: R M, E M implies E c M, E M for N implies

More information

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss ECE 90 Lecture : Complexity Regularizatio ad the Squared Loss R. Nowak 5/7/009 I the previous lectures we made use of the Cheroff/Hoeffdig bouds for our aalysis of classifier errors. Hoeffdig s iequality

More information

ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory

ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory 1. Graph Theory Prove that there exist o simple plaar triagulatio T ad two distict adjacet vertices x, y V (T ) such that x ad y are the oly vertices of T of odd degree. Do ot use the Four-Color Theorem.

More information

Math 113 Exam 3 Practice

Math 113 Exam 3 Practice Math Exam Practice Exam will cover.-.9. This sheet has three sectios. The first sectio will remid you about techiques ad formulas that you should kow. The secod gives a umber of practice questios for you

More information

Math 216A Notes, Week 5

Math 216A Notes, Week 5 Math 6A Notes, Week 5 Scribe: Ayastassia Sebolt Disclaimer: These otes are ot early as polished (ad quite possibly ot early as correct) as a published paper. Please use them at your ow risk.. Thresholds

More information

Lecture 7: Channel coding theorem for discrete-time continuous memoryless channel

Lecture 7: Channel coding theorem for discrete-time continuous memoryless channel Lecture 7: Chael codig theorem for discrete-time cotiuous memoryless chael Lectured by Dr. Saif K. Mohammed Scribed by Mirsad Čirkić Iformatio Theory for Wireless Commuicatio ITWC Sprig 202 Let us first

More information

Frequentist Inference

Frequentist Inference Frequetist Iferece The topics of the ext three sectios are useful applicatios of the Cetral Limit Theorem. Without kowig aythig about the uderlyig distributio of a sequece of radom variables {X i }, for

More information

The Maximum-Likelihood Decoding Performance of Error-Correcting Codes

The Maximum-Likelihood Decoding Performance of Error-Correcting Codes The Maximum-Lielihood Decodig Performace of Error-Correctig Codes Hery D. Pfister ECE Departmet Texas A&M Uiversity August 27th, 2007 (rev. 0) November 2st, 203 (rev. ) Performace of Codes. Notatio X,

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

Entropies & Information Theory

Entropies & Information Theory Etropies & Iformatio Theory LECTURE I Nilajaa Datta Uiversity of Cambridge,U.K. For more details: see lecture otes (Lecture 1- Lecture 5) o http://www.qi.damtp.cam.ac.uk/ode/223 Quatum Iformatio Theory

More information

The Borel hierarchy classifies subsets of the reals by their topological complexity. Another approach is to classify them by size.

The Borel hierarchy classifies subsets of the reals by their topological complexity. Another approach is to classify them by size. Lecture 7: Measure ad Category The Borel hierarchy classifies subsets of the reals by their topological complexity. Aother approach is to classify them by size. Filters ad Ideals The most commo measure

More information

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function.

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function. MATH 532 Measurable Fuctios Dr. Neal, WKU Throughout, let ( X, F, µ) be a measure space ad let (!, F, P ) deote the special case of a probability space. We shall ow begi to study real-valued fuctios defied

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 Fuctioal Law of Large Numbers. Costructio of the Wieer Measure Cotet. 1. Additioal techical results o weak covergece

More information

Notes 5 : More on the a.s. convergence of sums

Notes 5 : More on the a.s. convergence of sums Notes 5 : More o the a.s. covergece of sums Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces: Dur0, Sectios.5; Wil9, Sectio 4.7, Shi96, Sectio IV.4, Dur0, Sectio.. Radom series. Three-series

More information

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1 Solutio Sagchul Lee October 7, 017 1 Solutios of Homework 1 Problem 1.1 Let Ω,F,P) be a probability space. Show that if {A : N} F such that A := lim A exists, the PA) = lim PA ). Proof. Usig the cotiuity

More information

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology Advaced Aalysis Mi Ya Departmet of Mathematics Hog Kog Uiversity of Sciece ad Techology September 3, 009 Cotets Limit ad Cotiuity 7 Limit of Sequece 8 Defiitio 8 Property 3 3 Ifiity ad Ifiitesimal 8 4

More information

Beurling Integers: Part 2

Beurling Integers: Part 2 Beurlig Itegers: Part 2 Isomorphisms Devi Platt July 11, 2015 1 Prime Factorizatio Sequeces I the last article we itroduced the Beurlig geeralized itegers, which ca be represeted as a sequece of real umbers

More information

Lecture 6: Integration and the Mean Value Theorem. slope =

Lecture 6: Integration and the Mean Value Theorem. slope = Math 8 Istructor: Padraic Bartlett Lecture 6: Itegratio ad the Mea Value Theorem Week 6 Caltech 202 The Mea Value Theorem The Mea Value Theorem abbreviated MVT is the followig result: Theorem. Suppose

More information

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics for CS Spring 2008 David Wagner Note 22 CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013 Large Deviatios for i.i.d. Radom Variables Cotet. Cheroff boud usig expoetial momet geeratig fuctios. Properties of a momet

More information

University of Colorado Denver Dept. Math. & Stat. Sciences Applied Analysis Preliminary Exam 13 January 2012, 10:00 am 2:00 pm. Good luck!

University of Colorado Denver Dept. Math. & Stat. Sciences Applied Analysis Preliminary Exam 13 January 2012, 10:00 am 2:00 pm. Good luck! Uiversity of Colorado Dever Dept. Math. & Stat. Scieces Applied Aalysis Prelimiary Exam 13 Jauary 01, 10:00 am :00 pm Name: The proctor will let you read the followig coditios before the exam begis, ad

More information

Lecture Notes for Analysis Class

Lecture Notes for Analysis Class Lecture Notes for Aalysis Class Topological Spaces A topology for a set X is a collectio T of subsets of X such that: (a) X ad the empty set are i T (b) Uios of elemets of T are i T (c) Fiite itersectios

More information

Rademacher Complexity

Rademacher Complexity EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for

More information

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where

More information

Lecture 1: Basic problems of coding theory

Lecture 1: Basic problems of coding theory Lecture 1: Basic problems of codig theory Error-Correctig Codes (Sprig 016) Rutgers Uiversity Swastik Kopparty Scribes: Abhishek Bhrushudi & Aditya Potukuchi Admiistrivia was discussed at the begiig of

More information

Random Variables, Sampling and Estimation

Random Variables, Sampling and Estimation Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig

More information

Discrete Mathematics and Probability Theory Summer 2014 James Cook Note 15

Discrete Mathematics and Probability Theory Summer 2014 James Cook Note 15 CS 70 Discrete Mathematics ad Probability Theory Summer 2014 James Cook Note 15 Some Importat Distributios I this ote we will itroduce three importat probability distributios that are widely used to model

More information

Riemann Sums y = f (x)

Riemann Sums y = f (x) Riema Sums Recall that we have previously discussed the area problem I its simplest form we ca state it this way: The Area Problem Let f be a cotiuous, o-egative fuctio o the closed iterval [a, b] Fid

More information

CS 330 Discussion - Probability

CS 330 Discussion - Probability CS 330 Discussio - Probability March 24 2017 1 Fudametals of Probability 11 Radom Variables ad Evets A radom variable X is oe whose value is o-determiistic For example, suppose we flip a coi ad set X =

More information

Math 2784 (or 2794W) University of Connecticut

Math 2784 (or 2794W) University of Connecticut ORDERS OF GROWTH PAT SMITH Math 2784 (or 2794W) Uiversity of Coecticut Date: Mar. 2, 22. ORDERS OF GROWTH. Itroductio Gaiig a ituitive feel for the relative growth of fuctios is importat if you really

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 6 9/23/2013. Brownian motion. Introduction

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 6 9/23/2013. Brownian motion. Introduction MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/5.070J Fall 203 Lecture 6 9/23/203 Browia motio. Itroductio Cotet.. A heuristic costructio of a Browia motio from a radom walk. 2. Defiitio ad basic properties

More information

STAT Homework 1 - Solutions

STAT Homework 1 - Solutions STAT-36700 Homework 1 - Solutios Fall 018 September 11, 018 This cotais solutios for Homework 1. Please ote that we have icluded several additioal commets ad approaches to the problems to give you better

More information

2 Banach spaces and Hilbert spaces

2 Banach spaces and Hilbert spaces 2 Baach spaces ad Hilbert spaces Tryig to do aalysis i the ratioal umbers is difficult for example cosider the set {x Q : x 2 2}. This set is o-empty ad bouded above but does ot have a least upper boud

More information

Discrete Mathematics for CS Spring 2005 Clancy/Wagner Notes 21. Some Important Distributions

Discrete Mathematics for CS Spring 2005 Clancy/Wagner Notes 21. Some Important Distributions CS 70 Discrete Mathematics for CS Sprig 2005 Clacy/Wager Notes 21 Some Importat Distributios Questio: A biased coi with Heads probability p is tossed repeatedly util the first Head appears. What is the

More information

Math 113 Exam 4 Practice

Math 113 Exam 4 Practice Math Exam 4 Practice Exam 4 will cover.-.. This sheet has three sectios. The first sectio will remid you about techiques ad formulas that you should kow. The secod gives a umber of practice questios for

More information