Notes on Information Theory by Jeff Steif


1 Entropy, the Shannon-McMillan-Breiman Theorem and Data Compression

These notes will contain some aspects of information theory. We will consider some REAL problems that REAL people are interested in, although we might make some mathematical simplifications so we can (mathematically) carry this out. Not only are these things inherently interesting (in my opinion, at least) because they have to do with real problems, but these problems or applications allow certain mathematical theorems (most notably the Shannon-McMillan-Breiman Theorem) to come alive. The results that we will derive will have much to do with the entropy of a process (a concept that will be explained shortly) and allow one to see the real importance of entropy, which might not at all be obvious simply from its definition.

The first aspect we want to consider is so-called data compression, which means we have data and want to compress it in order to save space and hence money. More mathematically, let $X_1, X_2, \ldots, X_n$ be a finite sequence from a stationary stochastic process $\{X_k\}_{k=-\infty}^{\infty}$.

Stationarity will always mean strong stationarity, which means that the joint distribution of $X_m, X_{m+1}, X_{m+2}, \ldots, X_{m+k}$ is independent of $m$; in other words, we have time-invariance. We also assume that the variables take on values from a finite set $S$ with $|S| = s$. Anyway, we have our $X_1, X_2, \ldots, X_n$ and want a rule which, given a sequence, assigns to it a new sequence (hopefully of shorter length) such that different sequences are mapped to different sequences (otherwise one could not recover the data and the whole thing would be useless; e.g., assigning every word to 0 is nice and short but of limited use). Assuming all words have positive probability, we clearly can't code all the words of length $n$ into words of length $n-1$, since there are $s^n$ of the former and $s^{n-1}$ of the latter. However, one perhaps can hope that the expected length of the coded word is less than the length of the original word (and by some fixed fraction). What we will eventually see is that this is always the case EXCEPT in the one case where the stationary process is i.i.d. AND uniform on $S$. It will turn out that one can code things so that the expected length of the output is $H/\ln(s)$ times the length of the input, where $H$ is the entropy of the process. (One should conclude from the above discussion that the only stationary process with entropy $\ln(s)$ is i.i.d. uniform.)

What is the entropy of a stationary process? We define this in 2 steps. First, consider a partition of a probability space into finitely many sets where (after some ordering) the sets have probabilities $p_1, \ldots, p_k$ (which are of course positive and add to 1). The entropy of this partition is defined to be $$-\sum_{i=1}^{k} p_i \ln p_i.$$

Sounds arbitrary but turns out to be very natural: it has certain natural properties (after one interprets entropy as the amount of information gained by performing an experiment with the above probability distribution) and is in fact the only (up to a multiplicative constant) function which has these natural properties. I won't discuss these but there are many references (ask me if you want). (Here $\ln$ means natural log, but sometimes in these notes it will refer to a different base.) Now, given a stationary process, look only at the random variables $X_1, X_2, \ldots, X_n$. This breaks the probability space into $s^n$ pieces, since there are this many sequences of length $n$. Let $H_n$ be the entropy (as defined above) of this partition.

Definition 1.1: The entropy of a stationary process is $\lim_n H_n/n$ where $H_n$ is defined above (this limit can be shown to always exist using subadditivity).

Hard Exercise: Define the conditional entropy $H(X \mid Y)$ of $X$ given $Y$ (where both r.v.'s take on only finitely many values) in the correct way and show that the entropy of a process $\{X_n\}$ is also $\lim_n H(X_0 \mid X_{-1}, \ldots, X_{-n})$.

Exercise: Compute the entropy first of i.i.d. uniform and then of a general i.i.d. process. If you're feeling confident, go on and look at Markov chains (they are not that hard).

Exercise: Show that entropy is uniquely maximized at i.i.d. uniform. (The previous exercise showed this to be the case if we restrict to i.i.d.'s but you need to consider all stationary processes.)

We now discuss the Shannon-McMillan-Breiman Theorem. It might seem dry and the point of it might not be so apparent, but it will come very much alive soon.
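As a numerical companion to the exercises above, here is a minimal Python sketch (the notes contain no code, so the language and all function names are my own choices). It computes the entropy of a finite partition and then, for the Markov chain exercise, the entropy rate via the stationary distribution.

```python
import numpy as np

def partition_entropy(p):
    """Entropy -sum p_i ln(p_i) of a partition with probabilities p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 ln 0 = 0
    return float(-np.sum(p * np.log(p)))

# i.i.d. uniform on s symbols: the entropy rate is ln(s), the maximum.
s = 4
print(partition_entropy([1 / s] * s), np.log(s))

# A general i.i.d. process: the entropy rate is the one-symbol entropy.
print(partition_entropy([0.7, 0.2, 0.1]))

# Two-state Markov chain: the rate is sum_i pi_i * (entropy of row i),
# where pi is the stationary distribution (matching the conditional
# entropy formula lim_n H(X_0 | X_{-1}, ..., X_{-n}) from the exercise).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
evals, evecs = np.linalg.eig(P.T)      # stationary = left eigenvector
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi /= pi.sum()
print(sum(pi[i] * partition_entropy(P[i]) for i in range(2)))  # < ln 2
```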

Before doing this, we need to make an important (and not unnatural) assumption, that our process is ergodic. What does this mean? To motivate it, let's construct a stationary process which does not satisfy the Strong Law of Large Numbers (SLLN) for trivial reasons. Let $\{X_n\}$ be i.i.d. with 0's and 1's, 0 with probability 3/4 and 1 with probability 1/4, and let $\{Y_n\}$ be i.i.d. with 0's and 1's, 0 with probability 1/4 and 1 with probability 3/4. Now construct a stationary process by first flipping a fair coin and then, if it's heads, taking a realization from the $\{X_n\}$ process, and if it's tails, taking a realization from the $\{Y_n\}$ process. (By thinking of stationary processes as measures on sequence space, we are simply taking a convex (1/2, 1/2) combination of the measures associated to the two processes $\{X_n\}$ and $\{Y_n\}$.) You should check that this resulting process is stationary with each marginal having mean 1/2. Letting $\{Z_n\}$ denote this process, we clearly don't have that $$\frac{1}{n}\sum_{i=1}^{n} Z_i \to 1/2 \quad \text{a.s.}$$ (which is what the SLLN would tell us if $\{Z_n\}$ were i.i.d.) but rather we clearly have that $$\frac{1}{n}\sum_{i=1}^{n} Z_i \to 3/4 \ (1/4) \quad \text{with probability } 1/2 \ (1/2).$$ The point of ergodicity is to rule out such stupid examples. There are two equivalent definitions of ergodicity. The first is that if $T$ is the transformation which shifts a bi-infinite sequence one step to the left, then any event which is invariant under $T$ has probability 0 or 1.

The other definition is that the measure (or distribution) on sequence space associated to the process is not a convex combination of 2 measures each also corresponding to some stationary process (i.e., invariant under $T$). It is not obvious that these are equivalent definitions, but they are.

Exercise: Show that an i.i.d. process is ergodic by using the first definition.

Theorem 1.2 (Shannon-McMillan-Breiman Theorem): Consider a stationary ergodic process with entropy $H$. Let $p(X_1, X_2, \ldots, X_n)$ be the probability that the process prints out $X_1, \ldots, X_n$. (You have to think about what this means, BUT NOTE that it is a random variable.) Then as $n \to \infty$, $$\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n} \to H \quad \text{a.s. and in } L^1.$$

NOTE: The fact that $E\left[\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n}\right] \to H$ is exactly (check this) the definition of entropy, and therefore general real analysis tells us that the a.s. convergence implies the $L^1$ convergence. We do not prove this now (we will in §2) but instead see how we can use it to do data compression. To do data compression, we will only need the above convergence in probability (which of course follows from either the a.s. or $L^1$ convergence). Thinking about what convergence in probability means, we have the following corollary.

Corollary 1.3: Consider a stationary ergodic process with entropy $H$. Then for all $\epsilon > 0$, for large $n$, we have that the words of length $n$ can be divided into two sets, with all words $C$ in the first set satisfying $$e^{-n(H+\epsilon)} < p(C) < e^{-n(H-\epsilon)}$$ and with the total measure of all words in the second set being $< \epsilon$. One should think of the first set as the good set and the second set as the bad set. (Later it will be the good words which will be compressed.)
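To make Corollary 1.3 concrete, here is a small sketch for an i.i.d. Bernoulli(1/4) process (my choice of toy example). It counts the good words and the probability carried by the bad set for a few values of $n$ and a fixed $\epsilon$; note how few good words there are compared to the $2^n$ words total, while the bad set's mass shrinks.

```python
import itertools, math

p1 = 0.25                                    # P(X = 1); P(X = 0) = 0.75
H = -(p1 * math.log(p1) + (1 - p1) * math.log(1 - p1))   # about 0.562

def good_and_bad(n, eps):
    good_count, bad_mass = 0, 0.0
    for word in itertools.product([0, 1], repeat=n):
        k = sum(word)
        prob = p1 ** k * (1 - p1) ** (n - k)
        if math.exp(-n * (H + eps)) < prob < math.exp(-n * (H - eps)):
            good_count += 1              # a "good" (typical) word
        else:
            bad_mass += prob             # probability of the bad set
    return good_count, bad_mass

for n in [8, 14, 20]:
    g, b = good_and_bad(n, eps=0.1)
    print(n, g, 2 ** n, round(b, 4))     # good words, all words, bad mass
```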

Except in the i.i.d. uniform case, the bad set will have many more elements (in terms of cardinality) than the good set, but its probability will be much lower. (Note that it is only in the i.i.d. uniform case that probabilities and cardinality are the same thing.) Here is a related result which will finally allow us to prove the data compression theorem.

Proposition 1.4: Consider a stationary ergodic process with entropy $H$. Order the words of length $n$ in decreasing order (in terms of their probabilities). Fix $\lambda \in (0,1)$. We select the words of length $n$ in the above order one at a time until their probabilities sum up to at least $\lambda$, and we let $N_n(\lambda)$ be the number of words we take to do this. Then $$\lim_n \frac{\ln(N_n(\lambda))}{n} = H.$$

Exercise: Note that one obtains the same limit for all $\lambda$. Think about this. Convince yourself however that the above convergence cannot be uniform in $\lambda$. Is the convergence uniform in $\lambda$ on compact intervals of $(0,1)$?

Proof: Fix $\lambda \in (0,1)$. Let $\epsilon > 0$ be arbitrary but $< 1 - \lambda$ and $< \lambda$. We will show both $$\limsup_n \frac{\ln(N_n(\lambda))}{n} \le H + \epsilon$$ and $$\liminf_n \frac{\ln(N_n(\lambda))}{n} \ge H - \epsilon,$$ from which the result follows.

By Corollary 1.3, we know that for large $n$, the words of length $n$ can be broken up into 2 groups, with all words $C$ in the first set satisfying $$e^{-n(H+\epsilon)} < p(C) < e^{-n(H-\epsilon)}$$ and with the total measure of all words in the second set being $< \epsilon$. We now break the second group (the bad group) into 2 sets depending on whether $p(C) \ge e^{-n(H-\epsilon)}$ (the big bad words) or whether $p(C) \le e^{-n(H+\epsilon)}$ (the small bad words). When we order the words in decreasing order, we clearly first go through the big bad words, then the good words and finally the small bad words.

We now prove the first inequality. For large $n$, the total measure of the good words is at least $1 - \epsilon$, which is bigger than $\lambda$, and hence the total measure of the big bad words together with the good words is bigger than $\lambda$. Hence $N_n(\lambda)$ is at most the total number of big bad words plus the total number of good words. Since all these words have probability at least $e^{-n(H+\epsilon)}$, there cannot be more than $e^{n(H+\epsilon)}$ of them. Hence $$\limsup_n \frac{\ln(N_n(\lambda))}{n} \le \limsup_n \frac{\ln(e^{n(H+\epsilon)} + 1)}{n} = H + \epsilon.$$

For the other inequality, let $M_n(\lambda)$ be the number of good words among the $N_n(\lambda)$ words taken when we accumulated at least $\lambda$ measure of the space. Since the total measure of the big bad words is at most $\epsilon$, the total measure of these $M_n(\lambda)$ words is at least $\lambda - \epsilon$. Since each of these words has probability at most $e^{-n(H-\epsilon)}$, there must be at least $(\lambda - \epsilon)e^{n(H-\epsilon)}$ words in $M_n(\lambda)$ (if you don't see this, we have $M_n(\lambda) \ge (\lambda - \epsilon) e^{n(H-\epsilon)}$). Hence $$\liminf_n \frac{\ln(N_n(\lambda))}{n} \ge \liminf_n \frac{\ln(M_n(\lambda))}{n} \ge \liminf_n \frac{\ln((\lambda - \epsilon)e^{n(H-\epsilon)})}{n} = H - \epsilon,$$ and we're done.

Note that by taking $\lambda$ close to 1, one can see that (as long as we are not i.i.d. uniform, i.e., as long as $H < \ln s$) we can cover almost all words (in the sense of probability) by using only a negligible percentage of the total number of words ($\approx e^{nH}$ words instead of the total number of words $e^{n \ln s}$). We finally arrive at data compression, now that we have things set up and have a little feeling about what entropy is. We now make some definitions.

Definition 1.5: An $n$-code is an injective (i.e., 1-1) mapping $\sigma$ which takes the set of words of length $n$ with alphabet $S$ to words of any finite length (of at least 1) with alphabet $S$.

Definition 1.6: Consider a stationary ergodic process $\{X_k\}_{k=-\infty}^{\infty}$. The compressibility of an $n$-code $\sigma$ is $$E\left[\frac{l(\sigma(X_1, \ldots, X_n))}{n}\right]$$ where $l(W)$ denotes the length of the word $W$. This measures how well a code compresses on average.

Theorem 1.7: Consider a stationary ergodic process $\{X_k\}_{k=-\infty}^{\infty}$ with entropy $H$. Let $\mu_n$ be the minimum compressibility over all $n$-codes. Then the limit $\mu := \lim_n \mu_n$ exists and equals $\frac{H}{\ln s}$. The quantity $\mu$ is called the compression coefficient of the process $\{X_k\}$. Since $H$ is always strictly less than $\ln s$ except in the i.i.d. uniform case, this says that data compression is always possible except in this case.

Proof: As usual, we have two directions to prove. We first show that for arbitrary $\epsilon$ and $\delta > 0$, and for $n$ large, $$\mu_n \ge (1 - \delta)\frac{H - 2\epsilon}{\ln s}.$$ Fix any $n$-code $\sigma$. Call a word $C$ short (for $\sigma$) if $l(\sigma(C)) \le \frac{n(H - 2\epsilon)}{\ln s}$. Since codes are 1-1, the number of short words is at most $$\sum_{i=1}^{\lfloor n(H-2\epsilon)/\ln s \rfloor} s^i \le \frac{s}{s-1}\, s^{n(H-2\epsilon)/\ln s} = \frac{s}{s-1}\, e^{n(H-2\epsilon)}.$$

Next, since $\frac{\ln(N_n(\delta))}{n} \to H$ by Proposition 1.4, for large $n$ we have that $N_n(\delta) > e^{n(H-\epsilon)}$. This says that for large $n$, if we take $< e^{n(H-\epsilon)}$ of the most probable words, we won't get $\delta$ total measure, and so certainly if we take $< e^{n(H-\epsilon)}$ of any of the words, we won't get $\delta$ total measure. In particular, since the number of short words is at most $\frac{s}{s-1} e^{n(H-2\epsilon)}$, which is for large $n$ less than $e^{n(H-\epsilon)}$, the short words can't cover $\delta$ total measure for large $n$, i.e., $P(\text{short word}) \le \delta$ for large $n$. Now note that this argument was valid for any code. That is, for large $n$, any $n$-code satisfies $P(\text{short word}) \le \delta$. Hence for any $n$-code $\sigma$, $$E\left[\frac{l(\sigma(C))}{n}\right] \ge (1 - \delta)\frac{H - 2\epsilon}{\ln s}$$ and so $$\mu_n \ge (1 - \delta)\frac{H - 2\epsilon}{\ln s},$$ as desired.

For the other direction, we show that for any $\epsilon > 0$, we have that for large $n$, $$\mu_n \le \frac{1}{n} + \frac{H + \epsilon}{\ln s} + \delta.$$ This will complete the argument. The number of different sequences of length exactly $\lceil \frac{n(H+\epsilon)}{\ln s} \rceil$ is at least $s^{n(H+\epsilon)/\ln s}$, which is $e^{n(H+\epsilon)}$. By Proposition 1.4, the number $N_n(1-\delta)$ of most probable words needed to cover $1 - \delta$ of the measure is $\le e^{n(H+\epsilon)}$ for large $n$. Since there are at least this many words of length $\lceil \frac{n(H+\epsilon)}{\ln s} \rceil$, we can code these high probability words into words of length $\lceil \frac{n(H+\epsilon)}{\ln s} \rceil$.

For the remaining words, we code them to themselves. For such a code $\sigma$, we have that $E[l(\sigma(X_1, \ldots, X_n))]$ is at most $\lceil \frac{n(H+\epsilon)}{\ln s} \rceil + \delta n$ and hence the compressibility of this code is at most $\frac{1}{n} + \frac{H+\epsilon}{\ln s} + \delta$.

This is all nice and dandy but EXERCISE: Show that for a real person, who really has data and really wants to compress it, the above is all useless. We will see later on (§3) a method which is not useless.

2 Proof of the Shannon-McMillan-Breiman Theorem

We provide here the classical proof of this theorem, where we will assume two major theorems, namely the ergodic theorem and the Martingale Convergence Theorem. This section is much more technical than all of the coming sections and requires more mathematical background. Feel free if you wish to move on to §3. As this section is independent of all of the others, you won't miss anything. We first show that the proof of this result is an easy consequence of the ergodic theorem in the special case of an i.i.d. process, or more generally of a multistep Markov chain. We take the middle road and prove it for Markov chains (and the reader will easily see that it extends trivially to multi-step Markov chains).

Proof of SMB for Markov Chains: First note that

$$\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n} = \frac{1}{n}\sum_{j=1}^{n} -\ln p(X_j \mid X_{j-1}),$$ where the first term $-\ln p(X_1 \mid X_0)$ is taken to be $-\ln p(X_1)$. If the first term were interpreted as $-\ln p(X_1 \mid X_0)$ instead, then the ergodic theorem would immediately tell us this limit is a.s. $E[-\ln p(X_1 \mid X_0)]$, which (you have seen as an exercise) is the entropy of the process. Of course this difference in the first term makes an error of at most $\text{const}/n$ and the result is proved. For the multistep Markov chain, rather than replacing the first term by something else (as above), we need to do so for the first $k$ terms, where $k$ is the look-back of the Markov chain. Since this $k$ is fixed, there is no problem.

To prove the general result, we isolate the main idea into a lemma which is due to Breiman and which is a generalization of the ergodic theorem which is also used in its proof. Actually, to understand the statement of the next result, one really needs to know some ergodic theory. Learn about it on your own, ask me for books, or I'll discuss what is needed, or just forget this whole section.

Proposition 2.1: Assume that $g_n(x) \to g(x)$ a.e. and that $\int \sup_n g_n(x) < \infty$. Then if $\phi$ is an ergodic measure preserving transformation, we have that $$\lim_n \frac{1}{n}\sum_{j=0}^{n-1} g_j(\phi^j(x)) = \int g \quad \text{a.e. and in } L^1.$$ Note if $g_n = g \in L^1$ for all $n$, then this is exactly the ergodic theorem.

Proof: (As explained in the above remark), the ergodic theorem tells us $$\lim_n \frac{1}{n}\sum_{j=0}^{n-1} g(\phi^j(x)) = \int g \quad \text{a.e. and in } L^1.$$

Since $$\frac{1}{n}\sum_{j=0}^{n-1} g_j(\phi^j(x)) = \frac{1}{n}\sum_{j=0}^{n-1} g(\phi^j(x)) + \frac{1}{n}\sum_{j=0}^{n-1}\left[g_j(\phi^j(x)) - g(\phi^j(x))\right],$$ we need to show that $$(*) \qquad \frac{1}{n}\sum_{j=0}^{n-1}\left|g_j(\phi^j(x)) - g(\phi^j(x))\right| \to 0 \quad \text{a.e. and in } L^1.$$ Let $f_N(x) = \sup_{j \ge N} |g_j(x) - g(x)|$. By assumption, each $f_N$ is integrable and goes to 0 monotonically a.e., from which dominated convergence gives $\int f_N \to 0$. Now, $$\frac{1}{n}\sum_{j=0}^{n-1}\left|g_j(\phi^j(x)) - g(\phi^j(x))\right| \le \frac{1}{n}\sum_{j=0}^{N-1}\left|g_j(\phi^j(x)) - g(\phi^j(x))\right| + \frac{1}{n}\sum_{j=N}^{n-1} f_N(\phi^j(x)).$$ For fixed $N$, letting $n \to \infty$, the first term goes to 0 a.e. while the second term goes to $\int f_N$ a.e. by the ergodic theorem. Since $\int f_N \to 0$, this proves the pointwise part of (*). For the $L^1$ part, again for fixed $N$, letting $n \to \infty$, the $L^1$ norm of the first term goes to 0 while the $L^1$ norm of the second term is always at most $\int f_N$ (which goes to 0) and we're done.

Proof of the SMB Theorem: To properly do this and to apply the previous proposition, we need to set things up in an ergodic theory setting. Our process $X$ gives us a measure $\mu$ on $U = \{1, \ldots, s\}^{\mathbb{Z}}$ (thought of as its distribution). We also have a transformation $T$ on $U$, called the left shift, which simply maps an infinite sequence 1 unit to the left. The fact that the process is stationary means that $T$ preserves the measure $\mu$ in that for all events $A$, $\mu(T^{-1}A) = \mu(A)$.

Next, letting $g_k(w) = -\ln p(X_1 \mid X_0, \ldots, X_{-k+1})$ (with $g_0(w) = -\ln p(X_1)$), we get immediately that $$\frac{-\ln(p(X_1(w), X_2(w), \ldots, X_n(w)))}{n} = \frac{1}{n}\sum_{j=0}^{n-1} g_j(T^j(w)).$$ (Note that $w$ here is an infinite sequence in our space $U$.) We are now in the set-up of the previous proposition, although of course a number of things need to be verified, the first being the convergence of the $g_k$'s to something. Note that for any $i$ in our alphabet $\{1, \ldots, s\}$, $g_k(w)$ on the cylinder set $X_1 = i$ is simply $-\ln P(X_1 = i \mid X_0, \ldots, X_{-k+1})$, which (here we use a second nontrivial result) converges by the Martingale Convergence Theorem (and the continuity of $\ln$) to $-\ln P(X_1 = i \mid X_0, X_{-1}, \ldots)$. We therefore get that $g_k(w) \to g(w)$ where $g(w) = -\ln P(X_1 = i \mid X_0, X_{-1}, \ldots)$ on the cylinder set $X_1 = i$, i.e., $g(w) = -\ln P(X_1 \mid X_0, X_{-1}, \ldots)$. If we can show that $\int \sup_k g_k(w) < \infty$, the previous proposition will tell us that $$\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n} \to \int g \quad \text{a.e. and in } L^1.$$ Since $E\left[\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n}\right] \to h(\{X_i\})$ by definition of entropy, it follows that $\int g = h(\{X_i\})$ and we would be done.

EXERCISE: Show directly that $E[g(w)]$ is $h(\{X_i\})$.

We now proceed with verifying $\int \sup_k g_k(w) < \infty$. Fix $\lambda > 0$. We show that $P(\sup_k g_k > \lambda) \le s e^{-\lambda}$ (where $s$ is the alphabet size), which implies the desired integrability. $P(\sup_k g_k > \lambda) = \sum_k P(E_k)$ where $E_k$ is the set where the first time $g_j$ gets larger than $\lambda$ is $k$, i.e., $E_k = \{g_j \le \lambda,\ j = 0, \ldots, k-1,\ g_k > \lambda\}$.

These are obviously disjoint for different $k$ and make up the set above. Now $$P(E_k) = \sum_i P((X_1 = i) \cap E_k) = \sum_i P((X_1 = i) \cap F_k^i)$$ where $F_k^i$ is the set where the first time $f_j^i$ gets larger than $\lambda$ is $k$, with $f_j^i = -\ln p(X_1 = i \mid X_0, \ldots, X_{-j+1})$. Since $F_k^i$ is $X_{-k+1}, \ldots, X_0$ measurable, we have $$P((X_1 = i) \cap F_k^i) = \int_{F_k^i} P(X_1 = i \mid X_{-k+1}, \ldots, X_0) = \int_{F_k^i} e^{-f_k^i} \le e^{-\lambda} P(F_k^i).$$ Since the $F_k^i$ are disjoint for different $k$ (but not for different $i$), $$\sum_k P(E_k) \le \sum_i \sum_k e^{-\lambda} P(F_k^i) \le s e^{-\lambda}.$$

We finally mention a couple of things in the higher dimensional case, that is, where we have a stationary random field, where entropy can be defined in a completely analogous way. It was unknown for some time if a (pointwise, that is, a.s.) SMB theorem could be obtained in this case, the main obstacle being that it was known that a natural candidate for a multidimensional martingale convergence theorem was in fact false, and so one could not proceed as we did above in the 1-d case. (It was however known that this theorem held in mean (i.e., in $L^1$); see below.) However, recently (in 1983), Ornstein and Weiss managed to get a pointwise SMB not only for the multidimensional lattice but in the more general setting of something called amenable groups. (They used a method which circumvents the Martingale Convergence Theorem.) The result of Ornstein and Weiss is difficult and so we don't present it. Instead we present the Shannon-McMillan Theorem (note, no Breiman here), which is the $L^1$ convergence in the SMB Theorem.

Of course, this is a weaker result than we just proved, but the point of presenting it is two-fold: first, to see how much easier it is than the pointwise version, and also to see a result which generalizes (relatively) easily to higher dimensions (this higher dimensional variant being left to the reader).

Proposition 2.2 (Mean SMB Theorem or Shannon-McMillan Theorem): The SMB theorem holds in $L^1$.

Proof: Let $g_j$ be defined as above. The Martingale Convergence Theorem implies that $g_j \to g$ in $L^1$ as $j \to \infty$. We then have (with $\|\cdot\|$ denoting the $L^1$ norm) $$\left\|\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n} - H\right\| \le \frac{1}{n}\sum_{j=0}^{n-1}\left\|g_j(T^j(w)) - g(T^j(w))\right\| + \left\|\frac{1}{n}\sum_{j=0}^{n-1} g(T^j(w)) - H\right\|.$$ The first term goes to 0 by the $L^1$ convergence in the Martingale Convergence Theorem (we don't of course always get $L^1$ convergence in the Martingale Convergence Theorem, but we do in this setting). The second term goes to 0 in $L^1$ by the ergodic theorem together with the fact that $\int g = H$, an easy computation left to the reader.

3 Universal Entropy Estimation?

This section simply raises an interesting point which will be delved into further later on.

One can consider the problem of trying to estimate the entropy of a process $\{X_n\}_{n=-\infty}^{\infty}$ after we have only seen $X_1, X_2, \ldots, X_n$, in such a way that these estimates converge to the true entropy with probability 1. (As above, we know that to have any chance of doing this, one needs to assume ergodicity.) Obviously, one could simply take the estimate to be $\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n}$, which will (by SMB) converge to the entropy. However, this is clearly undesirable in the sense that in order to use this estimate, we need to know what the distribution of the process is. It certainly would be preferable to find some estimation procedure which does not depend on the distribution, since we might not know it. We therefore want a family of functions $h_n : S^n \to \mathbb{R}$ ($h_n(x_1, \ldots, x_n)$ would be our guess of the entropy of the process if we see the data $x_1, \ldots, x_n$) such that for any given ergodic process $\{X_n\}_{n=-\infty}^{\infty}$, $$\lim_n h_n(X_1, \ldots, X_n) = h(\{X_n\}_{n=-\infty}^{\infty}) \quad \text{a.s.}$$ (Clearly the suggestion earlier is not of this form since $h_n$ depends on the process.) We call such a family of functions a universal entropy estimator, universal since it holds for all ergodic processes. There is no immediate obvious reason at all why such a thing exists, and there are similar types of things which don't in fact exist. (If entropy were a continuous function defined on processes (the latter being given the usual weak topology), a general theorem (see §9) would tell us that such a thing exists, BUT entropy is NOT continuous.) However, as it turns out, a universal entropy estimator does in fact exist. One such universal entropy estimator comes from an algorithm called the Lempel-Ziv algorithm, which was later improved by Wyner-Ziv. We will see that this algorithm also turns out to be a universal data compressor, which we will describe later. (This method is used, as far as I understand it, in the real world.)
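The Lempel-Ziv idea can be sketched in a few lines. Below is a minimal version (mine; unoptimized, and convergence is notoriously slow) of the LZ78 incremental parsing with the standard estimate $c_n \log_2(c_n)/n$, where $c_n$ is the number of distinct phrases in the parsing; for stationary ergodic sources this converges to the entropy in bits per symbol.

```python
import math, random

def lz78_entropy_estimate(seq):
    """Universal entropy estimate (bits/symbol) via LZ78 parsing:
    split seq into the shortest phrases not seen before; with c
    phrases on n symbols, return c*log2(c)/n."""
    phrases, current = set(), ""
    for symbol in seq:
        current += str(symbol)
        if current not in phrases:
            phrases.add(current)         # a brand-new phrase ends here
            current = ""
    c = len(phrases) + (1 if current else 0)
    return c * math.log2(c) / len(seq)

random.seed(0)
n = 200_000
seq = [1 if random.random() < 0.25 else 0 for _ in range(n)]
true_H = -(0.25 * math.log2(0.25) + 0.75 * math.log2(0.75))  # 0.811...
print(lz78_entropy_estimate(seq), true_H)   # rough agreement only
```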

4 20 Questions and Relative Entropy

We have all heard problems of the following sort. You have 8 coins, one of which has a different weight than all the others, which have the same weight. You are given a scale (which can determine which of two given quantities is heavier) and your job is to identify the bad coin. How many weighings do you need? Later we will do something more general, but it turns out the problem of how many questions one needs to ask to determine something has an extremely simple answer, which is given by something called Kraft's Theorem. The answer will not at all be surprising and will be exactly what you would guess, namely: if you ask questions which have $D$ possible answers, then you will need $\log_D(n)$ questions, where $n$ is the total number of possibilities. (If the latter is not an integer, you need to take the integer right above it.)

Definition 4.1: A mapping $C$ from a finite set $S$ to finite words with alphabet $\{1, \ldots, D\}$ will be called a prefix-free $D$-code on $S$ if for any $x \ne y \in S$, $C(x)$ is not a prefix of $C(y)$ (which implies of course that $C$ is injective).

Theorem 4.2 (Kraft's Inequality): Let $C$ be a prefix-free $D$-code defined on $S$. Let $\{l_1, \ldots, l_{|S|}\}$ be the lengths of the code words of $C$. Then $$\sum_{i=1}^{|S|} D^{-l_i} \le 1.$$ Conversely, if $\{l_1, \ldots, l_n\}$ are integers satisfying $$\sum_{i=1}^{n} D^{-l_i} \le 1,$$ then there exists a prefix-free $D$-code on a set with $n$ elements whose code words have lengths $\{l_1, \ldots, l_n\}$.
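As a small computational aside (mine, not part of the notes): checking Kraft's sum and, for lengths that satisfy it, running the greedy construction which is essentially how the converse half is proved. Assigning codewords in increasing order of length is exactly what makes prefix-freeness automatic.

```python
def kraft_sum(lengths, D):
    return sum(D ** (-l) for l in lengths)

def build_prefix_code(lengths, D):
    """Greedy converse of Kraft: process lengths in increasing order,
    keeping an integer v that marks the next free codeword at the
    current length; scaling v by D**(l - prev) skips exactly the
    descendants of already-assigned (shorter) codewords."""
    assert kraft_sum(lengths, D) <= 1 + 1e-12
    codewords, v, prev = [], 0, 0
    for l in sorted(lengths):
        v *= D ** (l - prev)
        digits, x = [], v
        for _ in range(l):               # write v in base D, width l
            digits.append(x % D)
            x //= D
        codewords.append("".join(str(d) for d in reversed(digits)))
        v += 1
        prev = l
    return codewords

print(kraft_sum([1, 2, 3, 3], 2))               # 1.0, so a code exists
print(build_prefix_code([1, 2, 3, 3], D=2))     # ['0', '10', '110', '111']
```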

Kraft's Inequality turns out to be quite easy to prove, so try to figure it out on your own. I'll present it in class but won't write it here.

Corollary 4.3: There is a prefix-free $D$-code on a set of $n$ elements whose maximum code word has length $\lceil \log_D(n) \rceil$, while there is no such code whose maximum code word is $< \log_D(n)$.

I claim that this corollary implies that the number of questions with $D$ possible answers one needs to ask to determine something with $n$ possibilities is $\log_D(n)$ (where if this is not an integer, it is taken to be the integer right above it). To see this, one just needs to think a little and realize that asking questions until we get the right answer is exactly a prefix-free $D$-code. (For example, if you have the code first, your first question would be "What is the first number in the codeword assigned to $x$?", which of course has $D$ possible answers.)

Actually, going back to the weighing question, we did not really answer that question at all. We understand the situation when we can ask any $D$-ary question (i.e., any partition of the outcome space into $D$ pieces), but with questions like the weighing question, while any weighing breaks the outcome space into 3 pieces, we are physically limited (I assume, I have not thought about it) from constructing all possible partitions of the outcome space into 3 pieces. So this investigation, while interesting and useful, did not allow us to answer the original question at all, but we end there nonetheless.

Now we want to move on and do something more general. Let's say that $X$ is a random variable taking on finitely many values $\mathcal{X}$ with probabilities $p_1, \ldots, p_n$ (so $|\mathcal{X}| = n$).

Now we want a successful guessing scheme (i.e., a prefix-free code) where the expected number of questions we need to ask (rather than the maximum number of questions) is not too large. The following theorem answers this question and more. For simplicity, we talk about prefix-free codes.

Theorem 4.4: Let $C$ be a prefix-free $D$-code on $\mathcal{X}$ and $N$ the length of the codeword assigned to $X$ (which is a r.v.). Then $H_D(X) \le E[N]$, where $H_D(X)$ is the $D$-entropy of $X$ given by $-\sum_{i=1}^{n} p_i \log_D p_i$. Conversely, there exists a prefix-free $D$-code satisfying $E[N] < H_D(X) + 1$. (We will prove this below but must first do a little more development.)

EXERCISE: Relate this result to Corollary 4.3.

An important concept in information theory which is useful for many purposes and which will allow us to prove the above is the concept of relative entropy. This concept comes up in many other places; below we will explain three other places where it arises, which are respectively: (1) If I construct a prefix-free code satisfying $E[N] < H_D(X) + 1$ under the assumption that the true distribution of $X$ is $q_1, \ldots, q_n$ but it turns out the true distribution is $p_1, \ldots, p_n$, how bad will $E[N]$ be? (2) (level 2) large deviation theory for the Glivenko-Cantelli Theorem (which is called Sanov's Theorem). (3) Universal data compression for i.i.d. processes. The first of these will be done in this section, the 2nd in §5, while the last will be done in §6.

Definition 4.5: If $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$ are two probability distributions on $\mathcal{X}$, the relative entropy or Kullback-Leibler distance between $p$ and $q$ is $$D(p \| q) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i}.$$ Note that if for some $i$, $p_i > 0$ and $q_i = 0$, then the relative entropy is $\infty$. One should think of this as a metric, BUT it's not: it's neither symmetric nor does it satisfy the triangle inequality. One actually needs to specify the base of the logarithm in this definition. The first thing we need is the following, whose proof is an easy application of Jensen's inequality which is left to the reader.

Theorem 4.6: Relative entropy (no matter which base we use) is nonnegative, and 0 if and only if the two distributions are the same.

Proof of Theorem 4.4: Let $C$ be a prefix-free code with $l_i$ denoting the length of the $i$th codeword and $N$ denoting the length of the codeword as a r.v. Simple manipulation gives $$E[N] - H_D(X) = \sum_i p_i \log_D\left(\frac{p_i}{D^{-l_i}/\sum_j D^{-l_j}}\right) - \log_D\left(\sum_i D^{-l_i}\right).$$ The first sum is nonnegative since it's a relative entropy, and the second term is nonnegative by Kraft's Inequality. To show there is a prefix-free code satisfying $E[N] < H_D(X) + 1$, we simply construct it. Let $l_i = \lceil \log_D(1/p_i) \rceil$. A trivial computation shows that the $l_i$'s satisfy the conditions of Kraft's Inequality, and hence there is a prefix-free $D$-code which assigns the $i$th element a codeword of length $l_i$. A trivial calculation shows that $E[N] < H_D(X) + 1$.
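To see the construction in this proof in action, here is a sketch (the distribution is my own toy example) building the Shannon code lengths $l_i = \lceil \log_D(1/p_i) \rceil$ and checking both Kraft's condition and the bounds $H_D(X) \le E[N] < H_D(X) + 1$.

```python
import math

def shannon_lengths(p, D=2):
    return [math.ceil(math.log(1 / pi, D)) for pi in p]

def H_D(p, D=2):
    return -sum(pi * math.log(pi, D) for pi in p if pi > 0)

p = [0.45, 0.25, 0.15, 0.10, 0.05]
lengths = shannon_lengths(p)
EN = sum(pi * li for pi, li in zip(p, lengths))   # expected codeword length
print(lengths)                                    # [2, 2, 3, 4, 5]
print(sum(2 ** (-l) for l in lengths) <= 1)       # Kraft holds: True
print(H_D(p) <= EN < H_D(p) + 1)                  # Theorem 4.4: True
```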

How did we know to choose $l_i = \lceil \log_D(1/p_i) \rceil$? If the $l_i$'s were allowed to be nonintegers, Lagrange multipliers show that the $l_i$'s minimizing $E[N]$ subject to the constraint of Kraft's inequality are $l_i = \log_D(1/p_i)$. There is an easy corollary to Theorem 4.4, whose proof is left to the reader, which we now give.

Corollary 4.7: The minimum expected codeword length $L_n$ per symbol satisfies $$\frac{H(X_1, \ldots, X_n)}{n} \le L_n \le \frac{H(X_1, \ldots, X_n) + 1}{n}.$$ Moreover, if $X_1, \ldots$ is a stationary process with entropy $H$, then $L_n \to H$.

This particular prefix-free code, called the Shannon code, is not necessarily optimal (although it's not bad). There is an optimal code called the Huffman code which I will not discuss. It turns out also that the Shannon code is close to optimal in a certain sense (and in a better sense than that the mean code length is at most 1 from optimal, which we know).

We now discuss our first application of relative entropy (besides our using it in the proof of Theorem 4.4). We mentioned the following question above. If I construct the Shannon prefix-free code satisfying $E[N] < H_D(X) + 1$ under the assumption that the true distribution of $X$ is $q_1, \ldots, q_n$ but it turns out the true distribution is $p_1, \ldots, p_n$, how bad will $E[N]$ be? The answer is given by the following theorem.

Theorem 4.8: The expected length $E_p[N]$ under $p(x)$ of the Shannon code $l_i = \lceil \log_D(1/q_i) \rceil$ satisfies $$H_D(p) + D(p \| q) \le E_p[N] < H_D(p) + D(p \| q) + 1.$$

Proof: Trivial calculation left to the reader.

5 Sanov's Theorem: Another Application of Relative Entropy

In this section, we give an important application (or interpretation) of relative entropy, namely (level 2) large deviation theory. Before doing this, let me remind you (or tell you) what (level 1) large deviation theory is (this is the last section of the first chapter in Durrett's probability book). (Level 1) large deviation theory simply says the following, modulo the technical assumptions. Let $X_i$ be i.i.d. with finite mean $m$ and have a moment generating function defined in some neighborhood of the origin. First, the WLLN tells us $$P\left(\left|\frac{\sum_{i=1}^{n} X_i}{n} - m\right| \ge \epsilon\right) \to 0 \text{ as } n \to \infty$$ for any $\epsilon$. (For this, we of course do not need the exponential moment.) (Level 1) large deviation theory tells us how fast the above goes to 0, and the answer is: it goes to 0 exponentially fast. More precisely, it goes like $e^{-n f(\epsilon)}$ where $f(\epsilon) > 0$ is the so-called Fenchel transform of the logarithm of the moment generating function, given by $$f(\epsilon) = \max_{\theta}\left(\theta\epsilon - \ln(E[e^{\theta X}])\right).$$
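For concreteness, here is a sketch (pure Python; the ternary search and all parameter choices are mine) computing this rate function for Bernoulli(1/2) variables after centering at the mean. The numerical maximum matches the closed form, which happens to be the relative entropy $D\left((\tfrac12+\epsilon, \tfrac12-\epsilon) \,\|\, (\tfrac12, \tfrac12)\right)$, a first hint of Sanov's Theorem below.

```python
import math

def log_mgf_centered(theta, p=0.5):
    """ln E[exp(theta*(X - m))] for X ~ Bernoulli(p), m = p."""
    return math.log(p * math.exp(theta * (1 - p))
                    + (1 - p) * math.exp(-theta * p))

def rate(eps, p=0.5):
    """f(eps) = max_theta (theta*eps - ln E[e^{theta(X-m)}]), found by
    ternary search on [0, 50] (the objective is concave in theta)."""
    g = lambda t: t * eps - log_mgf_centered(t, p)
    lo, hi = 0.0, 50.0
    for _ in range(200):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if g(m1) < g(m2):
            lo = m1
        else:
            hi = m2
    return g((lo + hi) / 2)

eps = 0.1
a = 0.5 + eps
closed_form = math.log(2) + a * math.log(a) + (1 - a) * math.log(1 - a)
print(rate(eps), closed_form)    # both about 0.0201
```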

Level two large deviation theory deals with a similar question but at the level of the empirical distribution function. Let $X_i$ be i.i.d. again and let $F_n$ be the empirical distribution of $X_1, \ldots, X_n$. Glivenko-Cantelli tells us that $F_n \to F$ a.s., where $F$ is the common distribution of the $X_i$'s. If $C$ is a closed set of measures not containing $F$, it follows (we're now doing weak convergence of probability measures on the space of probability measures on $\mathbb{R}$, so you have to think carefully) that $P(F_n \in C) \to 0$, and we want to know how fast. The answer is $$\lim_n \frac{1}{n}\ln(P(F_n \in C)) = -\min_{P \in C} D(P \| F).$$ The above is called Sanov's Theorem, which we now prove. There will be a fair development before getting to this, which I feel will be useful and worth doing (and so this might not be the quickest route to Sanov's Theorem). All of this material is taken from Chapter 12 of Cover and Thomas's book.

If $x = (x_1, \ldots, x_n)$ is a sample from an i.i.d. process with distribution $Q$, then $P_x$, which we call the type of $x$, is defined to be the empirical distribution (which I will assume you know) of $x$.

Definition 5.1: We will let $\mathcal{P}_n$ denote the set of all types (i.e., empirical distributions) from a sample of size $n$, and given $P \in \mathcal{P}_n$, we let $T(P)$ be all $x = (x_1, \ldots, x_n)$ (i.e., all samples) which have type (empirical distribution) $P$.

EXERCISE: If $Q$ has its support on $\{1, \ldots, 7\}$, find $T(P)$ where $P$ is the type of a given sample $x$.
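For the exercise, a brute-force sketch (mine): the type of a sample is just its empirical distribution, and $T(P)$ is exactly the set of distinct rearrangements of the sample, whose count is the multinomial coefficient appearing in the combinatorial proof of Lemma 5.5 below.

```python
from collections import Counter
from itertools import permutations

def type_of(x):
    """The type P_x: empirical distribution of the sample x."""
    n = len(x)
    counts = Counter(x)
    return {a: counts[a] / n for a in sorted(counts)}

def type_class(x):
    """T(P_x): all samples of the same length with the same type,
    i.e., the distinct permutations of x."""
    return set(permutations(x))

x = (1, 2, 1, 7)                  # a sample supported on {1, ..., 7}
print(type_of(x))                 # {1: 0.5, 2: 0.25, 7: 0.25}
print(len(type_class(x)))         # 4!/(2!1!1!) = 12
```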

Lemma 5.2: $|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$, where $\mathcal{X}$ is the set of possible values our process can take on. This lemma, while important, is trivial and left to the reader.

Lemma 5.3: Let $X_1, \ldots$ be i.i.d. according to $Q$ and let $P_x$ denote the type of $x$. Then $$Q^n(x) = 2^{-n(H(P_x) + D(P_x \| Q))}$$ and so the $Q^n$ probability of $x$ depends only on its type.

Proof: Trivial calculation (although a few lines) left to the reader.

Lemma 5.4: Let $X_1, \ldots$ be i.i.d. according to $Q$. If $x$ is of type $Q$, then $Q^n(x) = 2^{-nH(Q)}$.

Our next result gives us an estimate of the size of a type class $T(P)$.

Lemma 5.5: $$\frac{1}{(n+1)^{|\mathcal{X}|}} 2^{nH(P)} \le |T(P)| \le 2^{nH(P)}.$$

Proof: One method is to write down the exact cardinality of the set (using elementary combinatorics) and then apply Stirling's formula. We however proceed differently. The upper bound is trivial: since each $x \in T(P)$ has (by the previous lemma) under $P^n$ probability $2^{-nH(P)}$, there can be at most $2^{nH(P)}$ elements in $T(P)$. For the lower bound, the main step is to show that type class $P$ has the largest probability of all type classes

(note however that an element from type class $P$ does not necessarily have a higher probability than elements from other type classes, as you should check), i.e., $$P^n(T(P)) \ge P^n(T(P')) \quad \text{for all } P'.$$ I leave this to you (although it takes some work; see the book if you want). Then we argue as follows. Since there are at most $(n+1)^{|\mathcal{X}|}$ type classes, and $T(P)$ has the largest probability, its $P^n$ probability must be at least $\frac{1}{(n+1)^{|\mathcal{X}|}}$. Since each element of this type class has $P^n$ probability $2^{-nH(P)}$, we obtain the lower bound.

Combining Lemmas 5.3 and 5.5, we immediately obtain

Corollary 5.6: $$\frac{1}{(n+1)^{|\mathcal{X}|}} 2^{-nD(P \| Q)} \le Q^n(T(P)) \le 2^{-nD(P \| Q)}.$$

Remark: The above tells us that if we flip $n$ fair coins, the probability that we get exactly 50% heads, while going to 0 as $n \to \infty$, does not go to 0 exponentially fast.

The above discussion has set us up for an easy proof of the Glivenko-Cantelli Theorem (analogous to the fact that once one has level 1 large deviation theory set up, the Strong Law of Large Numbers follows trivially). The key that makes the whole thing work is the fact (Corollary 5.6) that the probability under $Q$ that a certain type class (different from $Q$) arises is exponentially small (the exponent being given by the relative entropy) and the fact that there are only polynomially many type classes.
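The coin-flip remark can be checked exactly: with $P = Q = (\tfrac12, \tfrac12)$ we have $D(P \| Q) = 0$, and $Q^n(T(P)) = \binom{n}{n/2}/2^n$, which decays only like $1/\sqrt{n}$, comfortably inside the bounds of Corollary 5.6. A quick sketch (mine):

```python
import math

def prob_half_heads(n):
    """Q^n(T(P)) for Q the fair coin, P = (1/2, 1/2), n even."""
    return math.comb(n, n // 2) / 2 ** n

for n in [10, 100, 1000, 10000]:
    q = prob_half_heads(n)
    in_bounds = (n + 1) ** (-2) <= q <= 1      # Corollary 5.6, |X| = 2
    print(n, q, in_bounds, q * math.sqrt(n))   # last column -> sqrt(2/pi)
```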

Theorem 5.7: Let $X_1, \ldots$ be i.i.d. according to $P$. Then $$P(D(P_x \| P) > \epsilon) \le 2^{-n\left(\epsilon - |\mathcal{X}| \frac{\log(n+1)}{n}\right)}$$ and hence (by Borel-Cantelli) $D(P_x \| P) \to 0$ a.s.

Proof: Using Corollary 5.6, $$P(D(P_x \| P) > \epsilon) \le \sum_{Q \in \mathcal{P}_n : D(Q \| P) > \epsilon} 2^{-nD(Q \| P)} \le (n+1)^{|\mathcal{X}|} 2^{-n\epsilon} = 2^{-n\left(\epsilon - |\mathcal{X}| \frac{\log(n+1)}{n}\right)}.$$

We are now ready to state Sanov's Theorem. We are doing everything under the assumption that the underlying distribution has finite support (i.e., there are only a finite number of values our r.v. can take on). One can get rid of this assumption, but we stick to it here since it gets rid of some technicalities. We can view the set of all probability measures (call it $\mathcal{P}$) on our finite outcome space (of size $|\mathcal{X}|$) as a subspace of $\mathbb{R}^{|\mathcal{X}|}$ (or even of $\mathbb{R}^{|\mathcal{X}|-1}$). (In the former case, it would be a simplex in the positive cone.) This immediately gives us a nice topology on these measures (which of course is nothing but the weak topology in a simple setting). Assuming that $P$ gives every $x \in \mathcal{X}$ positive measure (if not, change $\mathcal{X}$), note that $D(Q \| P)$ is a continuous function of $Q$ on $\mathcal{P}$, and hence on any closed (and hence compact) set $E$ in $\mathcal{P}$, this function assumes a minimum (which is not 0 if $P \notin E$).

Theorem 5.8 (Sanov's Theorem): Let $X_1, \ldots$ be i.i.d. with distribution $P$ and let $E$ be a closed set in $\mathcal{P}$. Then $$P^n(E) \left(= P^n(E \cap \mathcal{P}_n)\right) \le (n+1)^{|\mathcal{X}|} 2^{-nf(E)}$$ where $f(E) = \min_{Q \in E} D(Q \| P)$. ($P^n(E)$ means of course $P^n(\{x : P_x \in E\})$.) If, in addition, $E$ is the closure of its interior, then the above exponential upper bound also gives a lower bound, in that $$\lim_n \frac{1}{n} \log P^n(E) = -f(E).$$

Proof: The derivation of the upper bound follows easily from Theorem 5.7, which we leave to the reader. The lower bound is not much harder, but a little care is needed (as should be clear from the topological assumption on $E$). We need to be able to find guys in $E \cap \mathcal{P}_n$. Since the $\mathcal{P}_n$'s become more and more dense as $n$ gets large, it is easy to see (why?) that $E \cap \mathcal{P}_n \ne \emptyset$ for large $n$ (this just uses the fact that $E$ has nonempty interior) and that there exist $Q_n \in E \cap \mathcal{P}_n$ with $Q_n \to Q^*$ where $Q^*$ is some distribution which minimizes $D(Q \| P)$ over $Q \in E$. This immediately gives that $D(Q_n \| P) \to D(Q^* \| P) (= f(E))$. We now have $$P^n(E) \ge P^n(T(Q_n)) \ge \frac{1}{(n+1)^{|\mathcal{X}|}} 2^{-nD(Q_n \| P)},$$ from which one immediately concludes that $$\liminf_n \frac{1}{n} \log P^n(E) \ge \liminf_n -D(Q_n \| P) = -f(E).$$

The first part gives this as an upper bound and so we're done.

Note that the proof goes through if one of the $Q$'s which minimize $D(Q \| P)$ over $Q \in E$ (of which there must be at least one due to compactness) is in the closure of the interior of $E$. It is in fact also the case (although not at all obvious and related to other important things) that if $E$ is also convex, then there is a unique minimizing $Q$, and so there is a Hilbert space like picture here. (If $E$ is not convex, it is easy to see that there is not necessarily a unique minimizing guy.) The fact that there is a unique minimizing guy (in the convex case) is suggested by (but not implied by) the convexity of relative entropy in its two arguments.

Hard Exercise: Recover the level 1 large deviation theory from Sanov's Theorem using Lagrange multipliers.

6 Universal Entropy Estimation and Data Compression

We first use relative entropy to prove the possibility of universal data compression for i.i.d. processes. Later on, we will do the much more difficult universal data compression for general ergodic sources, the so-called Ziv-Lempel algorithm, which is used in the real world (as far as I understand it). Of course, from a practical point of view, the results of Chapter 1 are useless, since typically one does NOT know the process that one is observing, and so one wants a universal way to compress data. We will now do this for i.i.d. processes, where it is quite easy using the things we derived in §5. We have seen previously in §4 that if we know the distribution of a r.v. $x$, we can encode $x$ by assigning it a word which has length $\lceil \log(1/p(x)) \rceil$.

We have also seen that if the code is built for $q$ but the distribution is really $p$, we get a penalty of $D(p \| q)$. We now find a code of rate $R$ which suffices for all i.i.d. processes of entropy less than $R$.

Definition 6.1: A fixed rate block code of rate $R$ for a process $X_1, \ldots$ with known state space $\mathcal{X}$ consists of two mappings, the encoder $$f_n : \mathcal{X}^n \to \{1, \ldots, 2^{nR}\}$$ and the decoder $$\phi_n : \{1, \ldots, 2^{nR}\} \to \mathcal{X}^n.$$ We let $$P_e^{(n,Q)} = Q^n\left(\phi_n(f_n(X_1, \ldots, X_n)) \ne (X_1, \ldots, X_n)\right)$$ be the probability of an error in this coding scheme under the assumption that $Q$ is the true distribution of the $X_i$'s.

Theorem 6.2: There exists a fixed rate block code of rate $R$ such that for all $Q$ with $H(Q) < R$, $P_e^{(n,Q)} \to 0$ as $n \to \infty$.

Proof: Let $R_n = R - |\mathcal{X}| \frac{\log(n+1)}{n}$ and let $$A_n = \{x \in \mathcal{X}^n : H(P_x) \le R_n\}.$$ Then $$|A_n| = \sum_{P \in \mathcal{P}_n : H(P) \le R_n} |T(P)| \le (n+1)^{|\mathcal{X}|} 2^{nR_n} = 2^{nR}.$$

Now we index the elements of $A_n$ in an arbitrary way and we let $f_n(x)$ be the index of $x$ in $A_n$ if $x \in A_n$ and 0 otherwise. The decoding map $\phi_n$ takes of course the index into the corresponding word. Clearly, we obtain an error (at the $n$th stage) if and only if $x \notin A_n$. We then get $$P_e^{(n,Q)} = 1 - Q^n(A_n) = \sum_{P : H(P) > R_n} Q^n(T(P)) \le \sum_{P : H(P) > R_n} 2^{-nD(P \| Q)} \le (n+1)^{|\mathcal{X}|} 2^{-n \min_{P : H(P) > R_n} D(P \| Q)}.$$ Now $R_n \to R$ and $H(Q) < R$, and so $R_n > H(Q)$ for large $n$, and so the exponent $\min_{P : H(P) > R_n} D(P \| Q)$ is bounded away from 0, and so $P_e^{(n,Q)}$ goes exponentially to 0 for fixed $Q$. Actually, this last argument was a little sloppy and one should be a little more careful with the details. One first should note that compactness, together with the fact that $D(P \| Q)$ is 0 only when $P = Q$, implies that if we stay away from any neighborhood of $Q$, $D(P \| Q)$ is bounded away from 0. Then note that if $R > H(Q)$, the continuity of the entropy function tells us that $\{P : H(P) > R_n\}$ misses some neighborhood of $Q$, and hence $\min_{P : H(P) > R_n} D(P \| Q) > 0$, and clearly this is increasing in $n$ (look at the derivative of $\frac{\ln x}{x}$).

Remarks: (1) Technically, 0 is not an allowable image. Fix this (it's trivial). (2) For fixed $Q$, since the above probability goes to 0 exponentially fast, Borel-Cantelli tells us the coding scheme will work eventually forever (i.e., we have an a.s. statement rather than just an in-probability statement). (3) There is no uniformity in $Q$, but there is a uniformity in $Q$ for $H(Q) \le R - \epsilon$. (Show this.) (4) For $Q$ with $H(Q) > R$, the above guessing scheme works with probability going to 0. Could there be a different guessing scheme which works?
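For a Bernoulli source the error probability in this proof can be evaluated exactly, since $H(P_x)$ depends only on the number of ones; the sketch below (parameter choices mine) shows the exponential decay when $H(Q) < R$.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def error_prob(n, R, q):
    """P_e^{(n,Q)} for Q = Bernoulli(q): total Q^n-mass of words whose
    type has entropy above R_n = R - |X| log(n+1)/n, with |X| = 2."""
    Rn = R - 2 * math.log2(n + 1) / n
    return sum(math.comb(n, k) * q ** k * (1 - q) ** (n - k)
               for k in range(n + 1) if h2(k / n) > Rn)

q, R = 0.1, 0.7                     # H(Q) = h2(0.1) ~ 0.469 < R
for n in [100, 400, 1600]:
    print(n, error_prob(n, R, q))   # drops roughly exponentially in n
```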

Theorem 6.2 produces for us a universal data compressor for i.i.d. sequences.

Exercise: Construct an entropy estimator which is universal for i.i.d. sequences.

We now construct a universal entropy estimator which is universal for all ergodic processes. We will now construct something which is not technically a universal entropy estimator according to the definition given in §3 but seems analogous and can easily be modified to be one. We will later discuss how this procedure is related to a universal data compressor.

Theorem 6.3: Let $(x_1, x_2, \ldots)$ be an infinite word and let $$R_n(x) = \min\{j \ge 1 : (x_1, x_2, \ldots, x_n) = (x_{j+1}, x_{j+2}, \ldots, x_{j+n})\}$$ (which is the first time the initial block of length $n$ repeats itself). Then $$\lim_n \frac{\log R_n(X_1, X_2, \ldots)}{n} = H \quad \text{a.s.}$$ where $H$ is the entropy of the process. Rather than writing out the proof of that, we will read the paper "Entropy and Data Compression Schemes" by Ornstein and Weiss. This paper will be much more demanding than these notes have been up to now.

7 Shannon's Theorems on Capacity

In this section, we discuss and prove the fundamental theorem of channel capacity, due to Shannon in his groundbreaking work in the field. Before going into the mathematics, we should first give a small discussion. Assume that we have a channel to which one can input a 0 or a 1 and which outputs a 0 or 1, from which we want to recover the input.

Let's assume that the channel transmits the signal correctly with probability $1-p$ and corrupts it (and therefore sends the other value) with probability $p$. Assume that $p < 1/2$ (if $p = 1/2$, the channel clearly would be completely useless). Based on the output we receive, our best guess at the input would of course be the output, in which case the probability of making an error would be $p$. If we want to decrease the probability of making an error, we could do the following. We could simply send our input through the channel 3 times and then, based on the 3 outputs, we would guess the input to be that output which occurred the most times. This coding/decoding scheme will work as long as the channel does not make 2 or more errors, and the probability of this is (for small $p$) much much smaller ($\approx p^2$) than if we were to send just 1 bit through. Great! The probability is much smaller now, BUT there is a price to pay for this. Since we have to send 3 bits through the channel to have one input bit guessed, the rate of transmission has suddenly dropped to 1/3. Anyway, if we are still dissatisfied with the probability of this scheme making an error, we could send the bit through 5 times (using an analogous majority rule decoding scheme), thereby decreasing the probability of a decoding mistake much more ($\approx p^3$), but then the transmission rate would go even lower, to 1/5. It seems that in order to achieve the probability of error going to 0, we need to drop the rate of transmission closer and closer to 0. The amazing discovery of Shannon is that this very reasonable statement is simply false. It turns out that if we are simply willing to use a rate of transmission which is less than the channel capacity (which is simply some number depending only on the channel (in our example, this number is strictly positive as long as $p \ne \frac{1}{2}$, which is reasonable)), then as long as we send long strings of input, we can transmit them with arbitrarily low probability of error.

In particular, the rate need not go to 0 to achieve arbitrary reliability. We now enter the mathematics, which explains more precisely what the above is all about and proves it. The formal definition of capacity of a channel is based on a concept called the mutual information of two r.v.'s, which is denoted by $I(X;Y)$.

Definition 7.1: $I(X;Y)$, called the mutual information of $X$ and $Y$, is the relative entropy of the joint distribution of $X$ and $Y$ and the product distribution of $X$ and $Y$. Note that $I(X;Y)$ and $I(Y;X)$ are the same and that they are not infinite (although recall in general relative entropy can be infinite).

Definition 7.2: The conditional entropy of $Y$ given $X$, denoted $H(Y \mid X)$, is defined to be $\sum_x P(X = x) H(Y \mid X = x)$.

Proposition 7.3: (1) $H(X,Y) = H(X) + H(Y \mid X)$. (2) $I(X;Y) = H(X) - H(X \mid Y)$. The proofs of these are left to the reader. There is again no idea involved, simply computation.

Definition 7.4: A discrete memoryless channel (DMC) is a system which takes as input elements of an input alphabet $\mathcal{X}$ and prints out elements from an output alphabet $\mathcal{Y}$ randomly according to some $p(y \mid x)$ (i.e., if $x$ is sent in, then $y$ comes out with probability $p(y \mid x)$). Of course the numbers $p(y \mid x)$ are nonnegative and for fixed $x$ give 1 if we sum over $y$.

(Such a thing is sometimes called a kernel in probability theory and could be viewed as a Markov transition matrix if the sets $\mathcal{X}$ and $\mathcal{Y}$ were the same (which they need not be).)

Definition 7.5: The information channel capacity of a discrete memoryless channel is $$C = \max_{p(x)} I(X;Y).$$

Exercise: Compute this for different examples, like $\mathcal{X} = \mathcal{Y} = \{0,1\}$ and the channel being $p(x \mid x) = p$ for all $x$.

We assume now that if we send a sequence through the channel, the outputs are all independently chosen from $p(y \mid x)$ (i.e., the outputs occur independently, an assumption you might object to).

Definition 7.6: An $(M, n)$ code $\mathcal{C}$ consists of the following:
1. An index set $\{1, \ldots, M\}$.
2. An encoding function $X^n : \{1, \ldots, M\} \to \mathcal{X}^n$, which yields the codewords $X^n(1), \ldots, X^n(M)$; this is called the codebook.
3. A decoding function $g : \mathcal{Y}^n \to \{1, \ldots, M\}$.

For $i \in \{1, \ldots, M\}$, we let $\lambda_i$ be the probability of an error if we encode and send $i$ over the channel, that is, $P(g(Y^n) \ne i)$ when $i$ is encoded and sent over the channel. We also let $$\lambda^{(n)} = \max_i \lambda_i \quad \text{and} \quad P_e^{(n)} = \frac{1}{M}\sum_i \lambda_i$$ be the maximal error and the average probability of error. We attach a superscript $\mathcal{C}$ to all of these if the code is not clear from context.

Exercise: Given any DMC, find an $(M, n)$ code such that $\lambda_1 = 0$.

Definition 7.7: The rate of an $(M, n)$ code is $\frac{\log M}{n}$ (where we use base 2).
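For the capacity exercise, here is a brute-force sketch (mine) for the binary symmetric channel with crossover probability $p$: maximize $I(X;Y)$ over input distributions on a grid and compare with the closed form $C = 1 - h(p)$ bits, attained at the uniform input.

```python
import math

def h2(x):
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def mutual_information(a, p):
    """I(X;Y) in bits for the BSC with crossover p and P(X=1) = a:
    I = H(Y) - H(Y|X) = h2(a(1-p) + (1-a)p) - h2(p)."""
    return h2(a * (1 - p) + (1 - a) * p) - h2(p)

p = 0.11
C_grid = max(mutual_information(a / 1000, p) for a in range(1001))
print(C_grid, 1 - h2(p))        # both about 0.5 bits for p = 0.11
```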

We now arrive at one of the most important theorems in the field (as far as I understand it), which says in words that all rates below channel capacity can be achieved with arbitrarily small probability of error.

Theorem 7.8 (Shannon's Theorem on Capacity): All rates below capacity $C$ are achievable, in that given $R < C$, there exists a sequence of $(2^{nR}, n)$ codes with maximum probability of error $\lambda^{(n)} \to 0$. Conversely, if there exists a sequence of $(2^{nR}, n)$ codes with maximum probability of error $\lambda^{(n)} \to 0$, then $R \le C$.

Before presenting the proof of this result, we make a few remarks which should be of help to the reader. In the definition of an $(M, n)$ code, we have this set $\{1, \ldots, M\}$ and an encoding function $X^n$ mapping this set into $\mathcal{X}^n$. This way of defining a code is not so essential. The only thing that really matters is what the image of $X^n$ is, and so one could really have defined an $(M, n)$ code to be a subset of $\mathcal{X}^n$ consisting of $M$ elements together with the decoding function $g : \mathcal{Y}^n \to \mathcal{X}^n$. Formally, the only reason these definitions are not EXACTLY the same is that $X^n$ might not be injective, which is certainly a property you want your code to have anyway. The point is one should also think of a code in this way.

Proof: The proof of this is quite long. While the mathematics is all very simple, conceptually it is a little more difficult. The method is a randomization method where we choose a code at random (according to some distribution) and show that with high probability it will be a good code. The fact that we are choosing the code according to a special distribution will make the computation of this probability tractable. The key concept in the proof is the notion of joint typicality.

The SMB Theorem (for i.i.d.) tells us that a typical sequence $x_1, \ldots, x_n$ from an ergodic stationary process has $\frac{-\ln(p(x_1, \ldots, x_n))}{n}$ close to the entropy $H(X)$. If we have instead independent samples from a joint distribution $p(x,y)$, we want to know what $(x_1, y_1), \ldots, (x_n, y_n)$ typically looks like, and this gives us the concept of joint typicality. An important point is that $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ can be individually typical without being jointly typical.

We now give a completely heuristic proof of the theorem. Given the output, we will choose as our guess for the input something (which encodes to something) which is jointly typical with the output. Since the true input and the output will be jointly typical with high probability, there will with high probability be some input which is jointly typical with the output, namely the true input. The problem is maybe there is more than 1 possible input sequence which is jointly typical with the output. (If there is only one, there is no problem.) The probability that a possible input sequence which was not input to the channel is in fact jointly typical with the output sequence should be more or less the probability that a possible input and output sequence chosen independently are jointly typical, which is $\approx 2^{-nI(X;Y)}$ by a trivial computation. Hence if we use $2^{nR}$ with $R < I(X;Y)$ different possible input sequences, then, with high probability, none of the possible input sequences (other than the true one) will be jointly typical with the output and we will decode correctly.

We need the following lemma, whose proof is left to the reader. It is easy to prove using the methods in §1 when we looked at consequences of the SMB Theorem. An arbitrary element $(x_1, \ldots, x_n)$ of $\mathcal{X}^n$ will be denoted by $x^n$ below.

Lemma 7.9: Let $p(x,y)$ be some joint distribution. Let $A_\epsilon^n$ be the set of $\epsilon$-jointly typical sequences (with respect to $p(x,y)$) of length $n$, namely, $$A_\epsilon^n = \left\{(x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \left|\tfrac{-1}{n}\log p(x^n) - H(X)\right| < \epsilon,\ \left|\tfrac{-1}{n}\log p(y^n) - H(Y)\right| < \epsilon, \text{ and } \left|\tfrac{-1}{n}\log p(x^n, y^n) - H(X,Y)\right| < \epsilon\right\}$$ where $p(x^n, y^n) = \prod_{i=1}^{n} p(x_i, y_i)$. Then if $(X^n, Y^n)$ is drawn i.i.d. according to $p(x,y)$:
1. $P((X^n, Y^n) \in A_\epsilon^n) \to 1$ as $n \to \infty$.
2. $|A_\epsilon^n| \le 2^{n(H(X,Y)+\epsilon)}$.
3. If $(\tilde{X}^n, \tilde{Y}^n)$ is chosen i.i.d. according to $p(x)p(y)$, then $$P((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^n) \le 2^{-n(I(X;Y) - 3\epsilon)}$$ and, for sufficiently large $n$, $$P((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^n) \ge (1-\epsilon)\, 2^{-n(I(X;Y) + 3\epsilon)}.$$

We now proceed with the proof. Let $R < C$. We now construct a sequence of $(2^{nR}, n)$ codes ($2^{nR}$ means $\lceil 2^{nR} \rceil$ here) whose maximum probability of error $\lambda^{(n)}$ goes to 0. We will first prove the existence of a sequence of codes which have a small average error (in the limit) and then modify it to get something with small maximal error (in the limit). By the definition of capacity, we can choose $p(x)$ so that $R < I(X;Y)$, and then we can choose $\epsilon$ so that $R + 3\epsilon < I(X;Y)$. Let $\mathcal{C}_n$ be the set of all encoding functions $X^n : \{1, 2, \ldots, 2^{nR}\} \to \mathcal{X}^n$. Our encoding function will be chosen randomly according to $p(x)$, in that each $i \in \{1, 2, \ldots, 2^{nR}\}$ is independently sent to $x^n$, where $x^n$ is chosen according to $\prod_{i=1}^{n} p(x_i)$.

Once we have chosen some encoding function $X^n$, the decoding procedure is as follows. If we receive $y^n$, our decoder guesses that the word $W \in \{1, 2, \ldots, 2^{nR}\}$ has been sent if (1) $X^n(W)$ and $y^n$ are $\epsilon$-jointly typical and (2) there is no other $\hat{W} \in \{1, 2, \ldots, 2^{nR}\}$ with $X^n(\hat{W})$ and $y^n$ being $\epsilon$-jointly typical. We also let $\mathcal{C}_n$ denote the set of codes obtained by considering all encoding functions and their resulting codes. Let $E_n$ denote the event that a mistake is made when $W \in \{1, 2, \ldots, 2^{nR}\}$ is chosen uniformly. We first show that $P(E_n) \to 0$ as $n \to \infty$: $$P(E_n) = \sum_{C \in \mathcal{C}_n} P(C)\, P_e^{(n),C} = \sum_{C \in \mathcal{C}_n} P(C)\, \frac{1}{2^{nR}} \sum_{w=1}^{2^{nR}} \lambda_w(C) = \frac{1}{2^{nR}} \sum_{w=1}^{2^{nR}} \sum_{C \in \mathcal{C}_n} P(C)\, \lambda_w(C).$$ By symmetry, $\sum_{C \in \mathcal{C}_n} P(C)\, \lambda_w(C)$ does not depend on $w$, and hence the above is $\sum_{C \in \mathcal{C}_n} P(C)\, \lambda_1(C) = P(E_n \mid W = 1)$. One should easily see directly anyway that symmetry gives $P(E_n) = P(E_n \mid W = 1)$. Because of the independence in which the encoding function and the word $W$ were chosen, we can instead consider the procedure where we choose the code at random as above but always transmit the first word 1. In the new resulting probability space, we still let $E_n$ denote the event of a decoding mistake. We also, for each $i \in \{1, 2, \ldots, 2^{nR}\}$, have the event $$E_i = \{(X^n(i), Y^n) \in A_\epsilon^n\}$$ where $Y^n$ is the output of the channel. We clearly have $$E_n \subseteq (E_1)^c \cup \bigcup_{i=2}^{2^{nR}} E_i.$$ Now $P(E_1) \to 1$ as $n \to \infty$ by the joint typicality lemma. Next, the independence of $X^n(i)$ and $X^n(1)$ (for $i \ne 1$) implies the independence of $X^n(i)$ and $Y^n$ (for $i \ne 1$), which gives (by the joint typicality lemma) $P(E_i) \le 2^{-n(I(X;Y) - 3\epsilon)}$, and so $$P\left(\bigcup_{i=2}^{2^{nR}} E_i\right) \le 2^{nR}\, 2^{-n(I(X;Y) - 3\epsilon)} = 2^{-n(I(X;Y) - 3\epsilon - R)},$$ which goes to 0 as $n \to \infty$. Since we saw that $P(E_n)$ goes to 0 as $n \to \infty$ and


More information

Distribution of Random Samples & Limit theorems

Distribution of Random Samples & Limit theorems STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to

More information

Lecture 11: Channel Coding Theorem: Converse Part

Lecture 11: Channel Coding Theorem: Converse Part EE376A/STATS376A Iformatio Theory Lecture - 02/3/208 Lecture : Chael Codig Theorem: Coverse Part Lecturer: Tsachy Weissma Scribe: Erdem Bıyık I this lecture, we will cotiue our discussio o chael codig

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig

More information

1 Introduction to reducing variance in Monte Carlo simulations

1 Introduction to reducing variance in Monte Carlo simulations Copyright c 010 by Karl Sigma 1 Itroductio to reducig variace i Mote Carlo simulatios 11 Review of cofidece itervals for estimatig a mea I statistics, we estimate a ukow mea µ = E(X) of a distributio by

More information

The Growth of Functions. Theoretical Supplement

The Growth of Functions. Theoretical Supplement The Growth of Fuctios Theoretical Supplemet The Triagle Iequality The triagle iequality is a algebraic tool that is ofte useful i maipulatig absolute values of fuctios. The triagle iequality says that

More information

Integrable Functions. { f n } is called a determining sequence for f. If f is integrable with respect to, then f d does exist as a finite real number

Integrable Functions. { f n } is called a determining sequence for f. If f is integrable with respect to, then f d does exist as a finite real number MATH 532 Itegrable Fuctios Dr. Neal, WKU We ow shall defie what it meas for a measurable fuctio to be itegrable, show that all itegral properties of simple fuctios still hold, ad the give some coditios

More information

A Proof of Birkhoff s Ergodic Theorem

A Proof of Birkhoff s Ergodic Theorem A Proof of Birkhoff s Ergodic Theorem Joseph Hora September 2, 205 Itroductio I Fall 203, I was learig the basics of ergodic theory, ad I came across this theorem. Oe of my supervisors, Athoy Quas, showed

More information

5 Birkhoff s Ergodic Theorem

5 Birkhoff s Ergodic Theorem 5 Birkhoff s Ergodic Theorem Amog the most useful of the various geeralizatios of KolmogorovâĂŹs strog law of large umbers are the ergodic theorems of Birkhoff ad Kigma, which exted the validity of the

More information

Shannon s noiseless coding theorem

Shannon s noiseless coding theorem 18.310 lecture otes May 4, 2015 Shao s oiseless codig theorem Lecturer: Michel Goemas I these otes we discuss Shao s oiseless codig theorem, which is oe of the foudig results of the field of iformatio

More information

Math 155 (Lecture 3)

Math 155 (Lecture 3) Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,

More information

BIRKHOFF ERGODIC THEOREM

BIRKHOFF ERGODIC THEOREM BIRKHOFF ERGODIC THEOREM Abstract. We will give a proof of the poitwise ergodic theorem, which was first proved by Birkhoff. May improvemets have bee made sice Birkhoff s orgial proof. The versio we give

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

Information Theory Tutorial Communication over Channels with memory. Chi Zhang Department of Electrical Engineering University of Notre Dame

Information Theory Tutorial Communication over Channels with memory. Chi Zhang Department of Electrical Engineering University of Notre Dame Iformatio Theory Tutorial Commuicatio over Chaels with memory Chi Zhag Departmet of Electrical Egieerig Uiversity of Notre Dame Abstract A geeral capacity formula C = sup I(; Y ), which is correct for

More information

MA131 - Analysis 1. Workbook 2 Sequences I

MA131 - Analysis 1. Workbook 2 Sequences I MA3 - Aalysis Workbook 2 Sequeces I Autum 203 Cotets 2 Sequeces I 2. Itroductio.............................. 2.2 Icreasig ad Decreasig Sequeces................ 2 2.3 Bouded Sequeces..........................

More information

Lecture 3 The Lebesgue Integral

Lecture 3 The Lebesgue Integral Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/5.070J Fall 203 Lecture 3 9//203 Large deviatios Theory. Cramér s Theorem Cotet.. Cramér s Theorem. 2. Rate fuctio ad properties. 3. Chage of measure techique.

More information

Sequences, Series, and All That

Sequences, Series, and All That Chapter Te Sequeces, Series, ad All That. Itroductio Suppose we wat to compute a approximatio of the umber e by usig the Taylor polyomial p for f ( x) = e x at a =. This polyomial is easily see to be 3

More information

Discrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22

Discrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22 CS 70 Discrete Mathematics for CS Sprig 2007 Luca Trevisa Lecture 22 Aother Importat Distributio The Geometric Distributio Questio: A biased coi with Heads probability p is tossed repeatedly util the first

More information

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece 1, 1, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet

More information

Optimally Sparse SVMs

Optimally Sparse SVMs A. Proof of Lemma 3. We here prove a lower boud o the umber of support vectors to achieve geeralizatio bouds of the form which we cosider. Importatly, this result holds ot oly for liear classifiers, but

More information

The Random Walk For Dummies

The Random Walk For Dummies The Radom Walk For Dummies Richard A Mote Abstract We look at the priciples goverig the oe-dimesioal discrete radom walk First we review five basic cocepts of probability theory The we cosider the Beroulli

More information

MA131 - Analysis 1. Workbook 3 Sequences II

MA131 - Analysis 1. Workbook 3 Sequences II MA3 - Aalysis Workbook 3 Sequeces II Autum 2004 Cotets 2.8 Coverget Sequeces........................ 2.9 Algebra of Limits......................... 2 2.0 Further Useful Results........................

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

Empirical Processes: Glivenko Cantelli Theorems

Empirical Processes: Glivenko Cantelli Theorems Empirical Processes: Gliveko Catelli Theorems Mouliath Baerjee Jue 6, 200 Gliveko Catelli classes of fuctios The reader is referred to Chapter.6 of Weller s Torgo otes, Chapter??? of VDVW ad Chapter 8.3

More information

1 Convergence in Probability and the Weak Law of Large Numbers

1 Convergence in Probability and the Weak Law of Large Numbers 36-752 Advaced Probability Overview Sprig 2018 8. Covergece Cocepts: i Probability, i L p ad Almost Surely Istructor: Alessadro Rialdo Associated readig: Sec 2.4, 2.5, ad 4.11 of Ash ad Doléas-Dade; Sec

More information

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece,, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet as

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

Lecture 15: Strong, Conditional, & Joint Typicality

Lecture 15: Strong, Conditional, & Joint Typicality EE376A/STATS376A Iformatio Theory Lecture 15-02/27/2018 Lecture 15: Strog, Coditioal, & Joit Typicality Lecturer: Tsachy Weissma Scribe: Nimit Sohoi, William McCloskey, Halwest Mohammad I this lecture,

More information

Lecture 2: April 3, 2013

Lecture 2: April 3, 2013 TTIC/CMSC 350 Mathematical Toolkit Sprig 203 Madhur Tulsiai Lecture 2: April 3, 203 Scribe: Shubhedu Trivedi Coi tosses cotiued We retur to the coi tossig example from the last lecture agai: Example. Give,

More information

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4. 4. BASES I BAACH SPACES 39 4. BASES I BAACH SPACES Sice a Baach space X is a vector space, it must possess a Hamel, or vector space, basis, i.e., a subset {x γ } γ Γ whose fiite liear spa is all of X ad

More information

DS 100: Principles and Techniques of Data Science Date: April 13, Discussion #10

DS 100: Principles and Techniques of Data Science Date: April 13, Discussion #10 DS 00: Priciples ad Techiques of Data Sciece Date: April 3, 208 Name: Hypothesis Testig Discussio #0. Defie these terms below as they relate to hypothesis testig. a) Data Geeratio Model: Solutio: A set

More information

Lecture 10: Universal coding and prediction

Lecture 10: Universal coding and prediction 0-704: Iformatio Processig ad Learig Sprig 0 Lecture 0: Uiversal codig ad predictio Lecturer: Aarti Sigh Scribes: Georg M. Goerg Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved

More information

Sequences I. Chapter Introduction

Sequences I. Chapter Introduction Chapter 2 Sequeces I 2. Itroductio A sequece is a list of umbers i a defiite order so that we kow which umber is i the first place, which umber is i the secod place ad, for ay atural umber, we kow which

More information

Randomized Algorithms I, Spring 2018, Department of Computer Science, University of Helsinki Homework 1: Solutions (Discussed January 25, 2018)

Randomized Algorithms I, Spring 2018, Department of Computer Science, University of Helsinki Homework 1: Solutions (Discussed January 25, 2018) Radomized Algorithms I, Sprig 08, Departmet of Computer Sciece, Uiversity of Helsiki Homework : Solutios Discussed Jauary 5, 08). Exercise.: Cosider the followig balls-ad-bi game. We start with oe black

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

Measure and Measurable Functions

Measure and Measurable Functions 3 Measure ad Measurable Fuctios 3.1 Measure o a Arbitrary σ-algebra Recall from Chapter 2 that the set M of all Lebesgue measurable sets has the followig properties: R M, E M implies E c M, E M for N implies

More information

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss ECE 90 Lecture : Complexity Regularizatio ad the Squared Loss R. Nowak 5/7/009 I the previous lectures we made use of the Cheroff/Hoeffdig bouds for our aalysis of classifier errors. Hoeffdig s iequality

More information

ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory

ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory 1. Graph Theory Prove that there exist o simple plaar triagulatio T ad two distict adjacet vertices x, y V (T ) such that x ad y are the oly vertices of T of odd degree. Do ot use the Four-Color Theorem.

More information

Math 113 Exam 3 Practice

Math 113 Exam 3 Practice Math Exam Practice Exam will cover.-.9. This sheet has three sectios. The first sectio will remid you about techiques ad formulas that you should kow. The secod gives a umber of practice questios for you

More information

Math 216A Notes, Week 5

Math 216A Notes, Week 5 Math 6A Notes, Week 5 Scribe: Ayastassia Sebolt Disclaimer: These otes are ot early as polished (ad quite possibly ot early as correct) as a published paper. Please use them at your ow risk.. Thresholds

More information

Lecture 7: Channel coding theorem for discrete-time continuous memoryless channel

Lecture 7: Channel coding theorem for discrete-time continuous memoryless channel Lecture 7: Chael codig theorem for discrete-time cotiuous memoryless chael Lectured by Dr. Saif K. Mohammed Scribed by Mirsad Čirkić Iformatio Theory for Wireless Commuicatio ITWC Sprig 202 Let us first

More information

Frequentist Inference

Frequentist Inference Frequetist Iferece The topics of the ext three sectios are useful applicatios of the Cetral Limit Theorem. Without kowig aythig about the uderlyig distributio of a sequece of radom variables {X i }, for

More information

The Maximum-Likelihood Decoding Performance of Error-Correcting Codes

The Maximum-Likelihood Decoding Performance of Error-Correcting Codes The Maximum-Lielihood Decodig Performace of Error-Correctig Codes Hery D. Pfister ECE Departmet Texas A&M Uiversity August 27th, 2007 (rev. 0) November 2st, 203 (rev. ) Performace of Codes. Notatio X,

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

Entropies & Information Theory

Entropies & Information Theory Etropies & Iformatio Theory LECTURE I Nilajaa Datta Uiversity of Cambridge,U.K. For more details: see lecture otes (Lecture 1- Lecture 5) o http://www.qi.damtp.cam.ac.uk/ode/223 Quatum Iformatio Theory

More information

The Borel hierarchy classifies subsets of the reals by their topological complexity. Another approach is to classify them by size.

The Borel hierarchy classifies subsets of the reals by their topological complexity. Another approach is to classify them by size. Lecture 7: Measure ad Category The Borel hierarchy classifies subsets of the reals by their topological complexity. Aother approach is to classify them by size. Filters ad Ideals The most commo measure

More information

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function.

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function. MATH 532 Measurable Fuctios Dr. Neal, WKU Throughout, let ( X, F, µ) be a measure space ad let (!, F, P ) deote the special case of a probability space. We shall ow begi to study real-valued fuctios defied

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 Fuctioal Law of Large Numbers. Costructio of the Wieer Measure Cotet. 1. Additioal techical results o weak covergece

More information

Notes 5 : More on the a.s. convergence of sums

Notes 5 : More on the a.s. convergence of sums Notes 5 : More o the a.s. covergece of sums Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces: Dur0, Sectios.5; Wil9, Sectio 4.7, Shi96, Sectio IV.4, Dur0, Sectio.. Radom series. Three-series

More information

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1 Solutio Sagchul Lee October 7, 017 1 Solutios of Homework 1 Problem 1.1 Let Ω,F,P) be a probability space. Show that if {A : N} F such that A := lim A exists, the PA) = lim PA ). Proof. Usig the cotiuity

More information

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology Advaced Aalysis Mi Ya Departmet of Mathematics Hog Kog Uiversity of Sciece ad Techology September 3, 009 Cotets Limit ad Cotiuity 7 Limit of Sequece 8 Defiitio 8 Property 3 3 Ifiity ad Ifiitesimal 8 4

More information

Beurling Integers: Part 2

Beurling Integers: Part 2 Beurlig Itegers: Part 2 Isomorphisms Devi Platt July 11, 2015 1 Prime Factorizatio Sequeces I the last article we itroduced the Beurlig geeralized itegers, which ca be represeted as a sequece of real umbers

More information

Lecture 6: Integration and the Mean Value Theorem. slope =

Lecture 6: Integration and the Mean Value Theorem. slope = Math 8 Istructor: Padraic Bartlett Lecture 6: Itegratio ad the Mea Value Theorem Week 6 Caltech 202 The Mea Value Theorem The Mea Value Theorem abbreviated MVT is the followig result: Theorem. Suppose

More information

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics for CS Spring 2008 David Wagner Note 22 CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013 Large Deviatios for i.i.d. Radom Variables Cotet. Cheroff boud usig expoetial momet geeratig fuctios. Properties of a momet

More information

University of Colorado Denver Dept. Math. & Stat. Sciences Applied Analysis Preliminary Exam 13 January 2012, 10:00 am 2:00 pm. Good luck!

University of Colorado Denver Dept. Math. & Stat. Sciences Applied Analysis Preliminary Exam 13 January 2012, 10:00 am 2:00 pm. Good luck! Uiversity of Colorado Dever Dept. Math. & Stat. Scieces Applied Aalysis Prelimiary Exam 13 Jauary 01, 10:00 am :00 pm Name: The proctor will let you read the followig coditios before the exam begis, ad

More information

Lecture Notes for Analysis Class

Lecture Notes for Analysis Class Lecture Notes for Aalysis Class Topological Spaces A topology for a set X is a collectio T of subsets of X such that: (a) X ad the empty set are i T (b) Uios of elemets of T are i T (c) Fiite itersectios

More information

Rademacher Complexity

Rademacher Complexity EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for

More information

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where

More information

Lecture 1: Basic problems of coding theory

Lecture 1: Basic problems of coding theory Lecture 1: Basic problems of codig theory Error-Correctig Codes (Sprig 016) Rutgers Uiversity Swastik Kopparty Scribes: Abhishek Bhrushudi & Aditya Potukuchi Admiistrivia was discussed at the begiig of

More information

Random Variables, Sampling and Estimation

Random Variables, Sampling and Estimation Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig

More information

Discrete Mathematics and Probability Theory Summer 2014 James Cook Note 15

Discrete Mathematics and Probability Theory Summer 2014 James Cook Note 15 CS 70 Discrete Mathematics ad Probability Theory Summer 2014 James Cook Note 15 Some Importat Distributios I this ote we will itroduce three importat probability distributios that are widely used to model

More information

Riemann Sums y = f (x)

Riemann Sums y = f (x) Riema Sums Recall that we have previously discussed the area problem I its simplest form we ca state it this way: The Area Problem Let f be a cotiuous, o-egative fuctio o the closed iterval [a, b] Fid

More information

CS 330 Discussion - Probability

CS 330 Discussion - Probability CS 330 Discussio - Probability March 24 2017 1 Fudametals of Probability 11 Radom Variables ad Evets A radom variable X is oe whose value is o-determiistic For example, suppose we flip a coi ad set X =

More information

Math 2784 (or 2794W) University of Connecticut

Math 2784 (or 2794W) University of Connecticut ORDERS OF GROWTH PAT SMITH Math 2784 (or 2794W) Uiversity of Coecticut Date: Mar. 2, 22. ORDERS OF GROWTH. Itroductio Gaiig a ituitive feel for the relative growth of fuctios is importat if you really

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 6 9/23/2013. Brownian motion. Introduction

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 6 9/23/2013. Brownian motion. Introduction MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/5.070J Fall 203 Lecture 6 9/23/203 Browia motio. Itroductio Cotet.. A heuristic costructio of a Browia motio from a radom walk. 2. Defiitio ad basic properties

More information

STAT Homework 1 - Solutions

STAT Homework 1 - Solutions STAT-36700 Homework 1 - Solutios Fall 018 September 11, 018 This cotais solutios for Homework 1. Please ote that we have icluded several additioal commets ad approaches to the problems to give you better

More information

2 Banach spaces and Hilbert spaces

2 Banach spaces and Hilbert spaces 2 Baach spaces ad Hilbert spaces Tryig to do aalysis i the ratioal umbers is difficult for example cosider the set {x Q : x 2 2}. This set is o-empty ad bouded above but does ot have a least upper boud

More information

Discrete Mathematics for CS Spring 2005 Clancy/Wagner Notes 21. Some Important Distributions

Discrete Mathematics for CS Spring 2005 Clancy/Wagner Notes 21. Some Important Distributions CS 70 Discrete Mathematics for CS Sprig 2005 Clacy/Wager Notes 21 Some Importat Distributios Questio: A biased coi with Heads probability p is tossed repeatedly util the first Head appears. What is the

More information

Math 113 Exam 4 Practice

Math 113 Exam 4 Practice Math Exam 4 Practice Exam 4 will cover.-.. This sheet has three sectios. The first sectio will remid you about techiques ad formulas that you should kow. The secod gives a umber of practice questios for

More information