Information Theory Notes


Erkip, Fall 2009

Table of contents

Week 1 (9/14/2009): Course Overview
Week 2 (9/21/2009): Jensen's Inequality
Week 3 (9/28/2009): Data Processing Inequality
Week 4 (10/5/2009): Asymptotic Equipartition Property
Week 5 (10/19/2009): Entropy Rates for Stochastic Processes; Data Compression
Week 6 (10/26/2009): Optimal Codes; Huffman Codes
Week 7 (11/9/2009): Other Codes with Special Properties; Channel Capacity
Week 8 (11/16/2009): Channel Coding Theorem
Week 9 (11/23/2009): Differential Entropy
Week 10 (11/30/2009): Gaussian Channels; Parallel Gaussian Channels; Non-white Gaussian Noise Channels
Week 11 (12/7/2009): Rate Distortion Theory; Binary Source / Hamming Distortion; Gaussian Source with Squared-Error Distortion; Independent Gaussian Random Variables with Single-Letter Distortion

Note: This set of notes follows the lectures, which reference the text Elements of Information Theory by Cover and Thomas, Second Edition.

Week 1 (9/14/2009)

Course Overview

Communications-centric motivation: communication over noisy channels.

    Source -> Encoder -> Channel (with Noise) -> Decoder -> Destination

Examples: text over the internet, voice over the phone network, data over USB storage. In these examples the sources would be the text, voice, and data, and the channels would be the internet, the (wired or wireless) phone network, and the storage device.

Purpose: reproduce the source data at the destination (in the presence of noise in the channel). The channel can introduce noise (static), interference (multiple copies arriving at the destination at different times, from radio waves bouncing off buildings, etc.), or distortion (amplitude, phase, modulation).

Sources have redundancy: e.g. in the English language we can reconstruct words from parts (th_t -> that), and in images, individual pixel differences are not noticed by the human eye. Some sources can tolerate distortion (images).

A typical encoder is broken into three components:

    From Source -> Source Encoder -> Channel Encoder -> Modulator -> to Channel

The source encoder outputs a source codeword, an efficient representation of the source, removing redundancies. The output is taken to be a sequence of bits (binary).

The channel encoder outputs a channel codeword, introducing controlled redundancy to tolerate errors that may be introduced by the channel. The controlled redundancy is taken into account while decoding. Note the distinction from redundancies present in the source, where the redundancies may be arbitrary (many pixels in an image). The codeword is designed by examining channel properties. The simplest channel encoder involves just repeating the source codeword.

The modulator converts the bits of the channel codeword into waveforms for transmission over the channel. They may be modulated in amplitude or frequency, e.g. radio broadcasts.

The corresponding decoder can also be broken into three components that invert the operations of the encoder above:

    From Channel -> Demodulator -> Channel Decoder -> Source Decoder -> to Destination

We call the output of the demodulator the received word, and the output of the channel decoder the estimated source codeword.

Remark. Later we will see that separating the source and channel encoder/decoder as above does not cause a loss in performance under certain conditions. This allows us to concentrate on each component independently of the others. This is called the Joint Source Channel Coding Theorem.

Information Theory addresses the following questions: Given a source, how much can I compress the data? Are there any limits? Given a channel, how noisy can the channel be, or how much redundancy is necessary to minimize error in decoding? What is the maximum rate of communication?

Information theory interacts with many other fields as well: probability/statistics, computer science, economics, quantum theory, etc. See the book for details.

There are two fundamental theorems by Shannon that address these questions. First we define a few quantities. Typically the source data is unknown, and we can treat the input data as a random variable, for instance flipping coins and sending that as input across the channel to test the communication scheme.

We define the entropy of a random variable X, taking values in the alphabet X, as

    H(X) = - sum_{x in X} p(x) log p(x)

The base 2 logarithm measures the entropy in bits. The intuition is that entropy describes the compressibility of the source.

Example 1. Let X ~ unif{1, ..., 16}. Note we intuitively need 4 bits to represent the values of X, and indeed the entropy is

    H(X) = - sum_{x=1}^{16} (1/16) log(1/16) = 4 bits

Example 2. Consider 8 horses in a race with winning probabilities (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). Note

    H(X) = (1/2) log 2 + (1/4) log 4 + (1/8) log 8 + (1/16) log 16 + 4 (1/64) log 64 = 2 bits

The naive encoding for outputting the winner of the race uses 3 bits, labelling the horses from 000 to 111 in binary. However, since some horses win more frequently, it makes sense to use shorter encoding words for more frequent winners. For instance, label the horses (0, 10, 110, 1110, 111100, 111101, 111110, 111111). This particular encoding has the property that it is prefix-free: no codeword is a prefix of another codeword. This makes it easy to determine how many bits to read until we can determine which horse is being referenced. Now some codewords are longer than 3 bits, but on the average, the description length is

    (1/2)(1) + (1/4)(2) + (1/8)(3) + (1/16)(4) + 4 (1/64)(6) = 2 bits

which is exactly the entropy.

Now we are ready to state (roughly) Shannon's First Theorem:

Theorem 3. (Source Coding Theorem) Roughly speaking, H(X) is the minimum rate at which we can compress a random variable X and recover it fully.

Intuition can be drawn from the previous two examples. We will visit this again in more detail later.
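As a quick numerical check of the two examples above (a minimal sketch, not part of the original notes; the function names are ours):

    import math

    def entropy(probs, base=2):
        # H(X) = -sum p log p, with the convention 0 log 0 = 0
        return -sum(p * math.log(p, base) for p in probs if p > 0)

    print(entropy([1/16] * 16))  # 4.0 bits, the uniform example
    horses = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
    print(entropy(horses))       # 2.0 bits
    # average length of the prefix-free code (0, 10, 110, 1110, 111100, ...):
    lengths = [1, 2, 3, 4, 6, 6, 6, 6]
    print(sum(p * l for p, l in zip(horses, lengths)))  # 2.0 bits, matching H(X)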

Now we define the mutual information between two random variables (X, Y) ~ p(x, y):

    I(X; Y) = sum_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]

Later we will show that I(X; Y) = H(X) - H(X|Y), where the conditional entropy H(X|Y) is to be defined later. Here H(X) is to be interpreted as the compressibility (as stated earlier), or the amount of uncertainty in X; H(X|Y) is the amount of uncertainty in X knowing Y; and finally I(X; Y) is interpreted as the reduction in uncertainty of X due to knowledge of Y, or a measure of dependency between X and Y. For instance, if X and Y are completely dependent (knowing Y gives all information about X), then H(X|Y) = 0 (no uncertainty given Y), and so I(X; Y) = H(X) (a complete reduction of uncertainty). On the other hand, if X and Y are independent, then H(X|Y) = H(X) (still uncertain) and I(X; Y) = 0 (no reduction of uncertainty).

Now we will characterize a communications channel as one for which the channel output Y depends probabilistically on the input X; that is, the channel is characterized by the conditional probability P(Y|X). For a simple example, suppose the input is {0, 1} and the channel flips the input bit with probability eps; then P(Y = 1 | X = 0) = eps and P(Y = 0 | X = 0) = 1 - eps.

Define the capacity of the channel to be

    C = max_{p(x)} I(X; Y)

noting here that I(X; Y) depends on the joint p(x, y) = p(y|x) p(x), and thus depends on the input distribution as well as the channel. The maximization is taken over all possible input distributions. We can now roughly state Shannon's Second Theorem:

Theorem 4. (Channel Coding Theorem) The maximum rate of information over a channel with arbitrarily low probability of error is given by its capacity C. Here the rate is the number of bits per transmission.

This means that as long as the transmission rate is below the capacity, the probability of error can be made arbitrarily small. We can motivate the theorem above with a few examples:

Example 5. Consider a noiseless binary channel, where the input is {0, 1} and the output is equal to the input. We would then expect to be able to send 1 bit of data in each transmission through the channel with no probability of error. Indeed, if we compute

    I(X; Y) = H(X) - H(X|Y) = H(X)

since knowing Y gives complete information about X, then

    C = max_{p(x)} H(X) = max_p [ -p log p - (1 - p) log(1 - p) ]

and taking the derivative and setting it to zero gives

    -log p - 1 + log(1 - p) + 1 = 0

so p = 1 - p and p = 1/2. Then C = log 2 = 1 bit, as expected.
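The following sketch (ours, not from the notes) computes I(X; Y) directly from a joint pmf given as a matrix, and checks the two extreme cases just described:

    import math

    def mutual_information(joint, base=2):
        # I(X;Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )
        px = [sum(row) for row in joint]
        py = [sum(col) for col in zip(*joint)]
        return sum(pxy * math.log(pxy / (px[i] * py[j]), base)
                   for i, row in enumerate(joint)
                   for j, pxy in enumerate(row) if pxy > 0)

    # Y = X (complete dependence): I(X;Y) = H(X) = 1 bit
    print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))
    # X, Y independent: I(X;Y) = 0
    print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))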

Example 6. Consider a noisy 4-symbol channel, where the input and output X, Y take values in {0, 1, 2, 3} and

    P(Y = j | X = j) = P(Y = j + 1 | X = j) = 1/2

where j + 1 is taken modulo 4. That is, a given input is either preserved or incremented upon output, with equal probability. To make use of this noisy channel, we note that we do not have to use all the inputs when sending information. If we restrict the input to only 0 or 2, then we note that 0 is sent to 0 or 1, and 2 is sent to 2 or 3, and we can easily deduce from the output whether we sent a 0 or a 2. This allows us to send one bit of information through the channel with zero probability of error. It turns out that this is the best strategy.

Given an input distribution p_X(x), we compute the joint distribution:

    p(x, y) = { (1/2) p_X(x)   if y in {x, x+1}
              { 0              otherwise

Note that the marginal distribution of the output is

    p_Y(y) = (1/2) (p_X(y) + p_X(y - 1))

Applying the definition of conditional entropy, we have

    H(Y|X) = sum_x p(x) H(1/2) = 1

Then the mutual information is I(X; Y) = H(Y) - 1. The entropy of Y is

    H(Y) = - sum_{y=0}^{3} p_Y(y) log p_Y(y)

Since given p_Y(y) we can find a p_X corresponding to p_Y, it suffices to maximize I(X; Y) over the possibilities for p_Y. Letting p_Y(0) = p, p_Y(1) = q, p_Y(2) = r, p_Y(3) = 1 - p - q - r, we have the maximization problem

    max_{p,q,r} [ -p log p - q log q - r log r - (1 - p - q - r) log(1 - p - q - r) ]

Taking the gradient and setting it to zero, we have the equations

    -log p - 1 + log(1 - p - q - r) + 1 = 0
    -log q - 1 + log(1 - p - q - r) + 1 = 0
    -log r - 1 + log(1 - p - q - r) + 1 = 0

which reduce to

    p = 1 - p - q - r,   q = 1 - p - q - r,   r = 1 - p - q - r

so that p = q = r = 1/4. Plugging this into the above gives

    C = [ -4 (1/4) log(1/4) ] - 1 = 2 - 1 = 1 bit

as expected.

Remark. Above, to illustrate how to use the noisy 4-symbol channel, suppose the input string is 01101110. If the 4-symbol channel had no noise, then we could send 2 bits of information per transmission by using the following scheme: block the inputs two bits at a time, 01, 10, 11, 10, and send the messages 1, 2, 3, 2 across the channel. Then the channel decoder inverts back to 01, 10, 11, 10 and reproduces the input string. However, since the channel is noisy, we cannot do this without incurring some probability of error. Thus instead we decide to send one bit at a time, sending the symbol 0 for a 0 and the symbol 2 for a 1; the output of the channel is then "0 or 1" when we sent a 0 and "2 or 3" when we sent a 1, which we can easily invert to recover the input. Furthermore, by the theorem, 1 bit per transmission is the capacity of this channel, so this is the best scheme for this channel. An additional bonus is that we can even operate at capacity with zero probability of error!

In general, we can compute the capacity of a channel, but it is hard to find a nice scheme. For instance, consider the binary symmetric channel, with inputs {0, 1}, where the output is flipped with probability eps. Note that sending one bit through this channel has a probability of error eps. One possibility for using the channel involves sending the same bit repeatedly through the channel and taking the majority in the channel decoding. For instance, if we send the same bit three times, the probability of error is

    P(2 or 3 flips) = eps^3 + 3 eps^2 (1 - eps) = 3 eps^2 - 2 eps^3 = O(eps^2)

whereas the data rate becomes 1/3 bits per transmission (we needed to send 3 bits to send 1 bit of information). To lower the probability of error we would then need to send more bits. If we send n bits, then the data rate is 1/n and the probability of error is

    P(n/2 flips or more) = sum_{j = ceil(n/2)}^{n} (n choose j) eps^j (1 - eps)^{n-j}

We can compute the capacity of the channel directly in this case as well. Note that

    H(Y|X) = sum_{x=0}^{1} p_X(x) H(eps) = H(eps)

and I(X; Y) = H(Y) - H(eps) = -p log p - (1 - p) log(1 - p) - H(eps), where p = P(Y = 1); this is maximized at p = 1/2, corresponding to a channel capacity of 1 - H(eps). Note that the proposed repetition codes above do not achieve this capacity with arbitrarily low probability of error. Later we will show how to find such codes.
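A small sketch (ours) comparing the repetition scheme to the capacity 1 - H(eps); note how the error probability of the repetition code falls only as its rate 1/n goes to zero:

    import math
    from math import comb

    def h2(p):
        # binary entropy function, in bits
        return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

    def repetition_error(n, eps):
        # P(majority decoding fails) for n (odd) uses of a BSC(eps)
        return sum(comb(n, j) * eps**j * (1-eps)**(n-j) for j in range((n+1)//2, n+1))

    eps = 0.1
    print("capacity:", 1 - h2(eps))              # ~0.531 bits per transmission
    for n in (1, 3, 5, 11):
        print(n, 1/n, repetition_error(n, eps))  # rate 1/n vs. error probability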

Week 2 (9/21/2009)

Today: following Chapter 2, defining and examining properties of basic information-theoretic quantities.

Entropy is defined for a discrete random variable X, taking values in a countable alphabet X with probability mass function p(x) = P(X = x), x in X. Later we will discuss a notion of entropy for the continuous case.

Definition. The entropy of X ~ p(x) is defined to be

    H(X) = - sum_{x in X} p(x) log p(x) = E_{p(x)}[ -log p(X) ]

If we use the base 2 logarithm, the measure is in bits. If we use the natural logarithm, the measure is in nats. Also, by convention (continuity), we take 0 log 0 = 0. Note that H(X) is independent of labeling; it only depends on the masses in the pmf p(x). Also, H(X) >= 0, which follows from the fact that -log p(x) >= 0 for 0 <= p(x) <= 1.

Example 7. The entropy of a Bernoulli(p) random variable

    X = { 1  w.p. p
        { 0  w.p. 1 - p

is H(X) = -p log p - (1 - p) log(1 - p). We will often use the notation

    H(p) := -p log p - (1 - p) log(1 - p)    (binary entropy function)

for convenience.

[Figure: graph of the binary entropy function H(p) = (-p log p - (1 - p) log(1 - p)) / log 2 over 0 <= p <= 1.]

In particular, we note the following observations:

- H(p) is concave. Later we will show that entropy in general is a concave function of the underlying probability distribution.
- H(p) = 0 for p = 0, 1. This reinforces the intuition that entropy is a measure of randomness, since in the case of no randomness (p = 0, 1) there is no entropy.
- H(p) is maximized when p = 1/2. Likewise, this is the situation where a binary distribution is the most random.

- H(p) = H(1 - p).

Example 8. Let

    X = { 1  w.p. 1/2
        { 2  w.p. 1/4
        { 3  w.p. 1/4

In this case H(X) = (1/2) log 2 + 2 (1/4) log 4 = 1/2 + 1/2 + 1/2 = 3/2 bits. Last time we viewed entropy as the average length of a binary code needed to encode X. We can interpret such binary codes as a sequence of yes/no questions that are needed to determine the value of X. The entropy is then the average number of questions needed using the optimal strategy for determining X. In this case, we can use the following simple strategy, taking advantage of the fact that X = 1 occurs more often than the other cases:

    Question 1: Is X = 1?   Question 2: Is X = 2?

The number of questions needed is 1 if X = 1, which happens with probability 1/2, and 2 otherwise, with probability 1/2, so the expected number of questions needed is 1 (1/2) + 2 (1/2) = 3/2, which coincides with the entropy.

Example 9. Let X_1, ..., X_n be i.i.d. Bernoulli(p) distributed (0 with probability 1 - p, 1 with probability p). Then the joint distribution is given by

    p(x_1, ..., x_n) = p^{# of 1s} (1 - p)^{# of 0s} = p^{sum_i x_i} (1 - p)^{n - sum_i x_i}

Observe that if n is large, then by the law of large numbers sum_i x_i is close to np with high probability, so

    p(x_1, ..., x_n) ~ p^{np} (1 - p)^{n(1-p)} = 2^{-n H(p)}

The interpretation is that for large n, the probability of the joint is concentrated in a set of size about 2^{n H(p)}, on which it is approximately uniformly distributed. In this case we can exploit this for compression and use n H(p) bits to distinguish sequences in this set (per random variable we are using H(p) bits). For this reason 2^{H(p)} is the effective alphabet size of (X_1, ..., X_n).

Definition. The joint entropy of (X, Y) ~ p(x, y) is given by

    H(X, Y) = - sum_{x in X} sum_{y in Y} p(x, y) log p(x, y) = E_{p(x,y)}[ -log p(X, Y) ]

Definition. The conditional entropy of Y given X is

    H(Y|X) = sum_{x in X} p(x) H(Y|X = x)
           = - sum_{x in X} p(x) sum_{y in Y} p(y|x) log p(y|x)
           = - sum_{x,y} p(x, y) log p(y|x)
           = E_{p(x,y)}[ -log p(Y|X) ]

We expect that we can write the joint entropy in terms of the individual entropies, and this is indeed the case, in a result we call the Chain Rule for Entropy:

Proposition 10. (Chain Rule for Entropy) For two random variables X, Y,

    H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

For multiple random variables X_1, ..., X_n, we have that

    H(X_1, ..., X_n) = sum_{i=1}^{n} H(X_i | X_1, ..., X_{i-1})

Proof. We just perform a simple computation:

    H(X, Y) = - sum_{x in X} sum_{y in Y} p(x, y) log p(x, y)
            = - sum_{x in X} log p(x) sum_{y in Y} p(x, y) - sum_{x in X} sum_{y in Y} p(x, y) log p(y|x)
            = H(X) + H(Y|X)

By symmetry (switching the roles of X and Y) we get the other equality. For n random variables we proceed by induction. We have just proved the result for two random variables. Assume it is true for n - 1 random variables. Then

    H(X_1, ..., X_n) = H(X_n | X_1, ..., X_{n-1}) + H(X_1, ..., X_{n-1})
                     = H(X_n | X_1, ..., X_{n-1}) + sum_{i=1}^{n-1} H(X_i | X_1, ..., X_{i-1})
                     = sum_{i=1}^{n} H(X_i | X_1, ..., X_{i-1})

where we have used the usual chain rule and the induction hypothesis.

We also immediately get the chain rule for conditional entropy:

Corollary 11. (Chain Rule for Conditional Entropy) For three random variables X, Y, Z, we have

    H(X, Y|Z) = H(Y|Z) + H(X|Y, Z) = H(X|Z) + H(Y|X, Z)

and for random variables X_1, ..., X_n and a random variable Z, we have

    H(X_1, ..., X_n | Z) = sum_{i=1}^{n} H(X_i | Z, X_1, ..., X_{i-1})

Proof. This follows from the chain rule for entropy given Z = z, and averaging over z gives the result.

Example 12. Consider the following joint distribution for X, Y in tabular form, with the marginals:

    p(x, y)   x = 1   x = 2 | p(y)
    y = 1      1/4     1/2  |  3/4
    y = 2      1/4      0   |  1/4
    -------------------------
    p(x)       1/2     1/2  |

We compute the entropies H(X), H(Y), H(Y|X), H(X|Y), H(X, Y) below:

    H(X) = H(1/2) = 1 bit

    H(Y) = H(1/4) = (1/4) log 4 + (3/4) log(4/3) = 2 - (3/4) log 3 bits

    H(Y|X) = P(X = 1) H(Y|X = 1) + P(X = 2) H(Y|X = 2)
           = (1/2) H(1/2) = 1/2 bits

    H(X|Y) = P(Y = 1) H(X|Y = 1) + P(Y = 2) H(X|Y = 2)
           = (3/4) H(1/3)
           = (3/4) [ (1/3) log 3 + (2/3) log(3/2) ]
           = (3/4) log 3 - 1/2 bits

    H(X, Y) = 2 (1/4) log 4 + (1/2) log 2 = 3/2 bits

Above we note that H(Y|X = 2) = H(X|Y = 2) = H(0) = 0, since in each case the distribution becomes deterministic. Note that in general H(Y|X) != H(X|Y), and in this example H(X|Y) <= H(X) and H(Y|X) <= H(Y); we will show this in general later. Also, H(Y|X = x) has no relation to H(Y): it can be larger or smaller.

The conditional entropy is related to a more general quantity called relative entropy, which we now define.

Definition. For two pmfs p(x), q(x) on X, the relative entropy (Kullback-Leibler distance) is defined by

    D(p||q) = sum_{x in X} p(x) log( p(x) / q(x) )

using the conventions 0 log(0/0) = 0, 0 log(0/q) = 0, and p log(p/0) = infinity for p > 0. In particular, the distance is allowed to be infinite. Note that the distance is not symmetric, so D(p||q) != D(q||p) in general, and so it is not a metric. However, it does satisfy positivity: we will show that D(p||q) >= 0 with equality when p = q. Roughly speaking, it describes the loss from using q instead of p for compression.

Definition. The mutual information between two random variables X, Y ~ p(x, y) is

    I(X; Y) = sum_{x in X} sum_{y in Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ] = D( p(x, y) || p(x) p(y) )
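A quick verification of Example 12 (our sketch; the joint table is the one given above):

    import math

    def H(probs):
        # entropy in bits; zero masses are skipped
        return -sum(p * math.log2(p) for p in probs if p > 0)

    joint = {(1, 1): 1/4, (2, 1): 1/2, (1, 2): 1/4, (2, 2): 0.0}  # keys are (x, y)
    px = {x: sum(v for (xx, y), v in joint.items() if xx == x) for x in (1, 2)}
    py = {y: sum(v for (x, yy), v in joint.items() if yy == y) for y in (1, 2)}

    H_XY, H_X, H_Y = H(joint.values()), H(px.values()), H(py.values())
    print(H_X, H_Y, H_XY)           # 1.0, ~0.811, 1.5
    print(H_XY - H_X, H_XY - H_Y)   # H(Y|X) = 0.5, H(X|Y) ~ 0.689, by the chain rule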

Relationship between Entropy and Mutual Information

    I(X; Y) = sum_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]
            = sum_{x,y} p(x, y) log [ p(y|x) / p(y) ]
            = - sum_y ( sum_x p(x, y) ) log p(y) + sum_{x,y} p(x, y) log p(y|x)
            = H(Y) - H(Y|X)

and thus we conclude by symmetry that

    I(X; Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)

Remark. As we said before, this is the reduction in the uncertainty of X due to knowledge of Y, and we will show later that I(X; Y) >= 0 with equality if and only if X, Y are independent. Also, note that I(X; X) = H(X) - H(X|X) = H(X), and so sometimes H(X) is referred to as self-information.

The relationships between the entropies and the mutual information can be represented by a Venn diagram: H(X), H(Y) represent the two circles, the intersection is I(X; Y), the union is H(X, Y), and the set differences X minus Y and Y minus X represent H(X|Y) and H(Y|X), respectively.

We will also establish chain rules for mutual information and relative entropy. First we define the mutual information conditioned on a random variable Z:

Definition. The conditional mutual information of X, Y given Z, for (X, Y, Z) ~ p(x, y, z), is

    I(X; Y | Z) = H(X|Z) - H(X|Y, Z) = E_{p(x,y,z)} [ log ( p(X, Y|Z) / (p(X|Z) p(Y|Z)) ) ]

noting that this is also I(X; Y | Z = z) averaged over z. The chain rule for mutual information is then

    I(X_1, ..., X_n; Y) = sum_{i=1}^{n} I(X_i; Y | X_1, ..., X_{i-1})

The proof just uses the chain rules for entropy and conditional entropy:

    I(X_1, ..., X_n; Y) = H(X_1, ..., X_n) - H(X_1, ..., X_n | Y)
                        = sum_{i=1}^{n} H(X_i | X_1, ..., X_{i-1}) - sum_{i=1}^{n} H(X_i | Y, X_1, ..., X_{i-1})
                        = sum_{i=1}^{n} I(X_i; Y | X_1, ..., X_{i-1})

Now we define conditional relative entropy and prove the corresponding chain rule:

Definition. Let X be a random variable with distribution given by the pmf p(x), and let p(y|x), q(y|x) be two conditional pmfs. The conditional relative entropy of p(y|x) and q(y|x) is then defined to be

    D( p(y|x) || q(y|x) ) = sum_x p(x) D( p(y|X = x) || q(y|X = x) )
                          = sum_{x,y} p(x) p(y|x) log [ p(y|x) / q(y|x) ]
                          = E_{p(x,y)} [ log ( p(Y|X) / q(Y|X) ) ]

The chain rule for relative entropy is then

    D( p(x, y) || q(x, y) ) = E_{p(x,y)} [ log ( p(X, Y) / q(X, Y) ) ]
                            = E_{p(x,y)} [ log ( p(Y|X) / q(Y|X) ) + log ( p(X) / q(X) ) ]
                            = E_{p(x,y)} [ log ( p(Y|X) / q(Y|X) ) ] + E_{p(x)} [ log ( p(X) / q(X) ) ]
                            = D( p(y|x) || q(y|x) ) + D( p(x) || q(x) )

The various chain rules will help simplify calculations later in the course.

Jensen's Inequality

Convexity and Jensen's Inequality will be very useful for us later. First a definition:

Definition. A function f is said to be convex on [a, b] if for all a_1, b_1 in [a, b] and 0 <= lambda <= 1 we have

    f( lambda a_1 + (1 - lambda) b_1 ) <= lambda f(a_1) + (1 - lambda) f(b_1)

If the inequality is strict, then we say f is strictly convex. Note that if we have a binary random variable X ~ Bernoulli(lambda) taking the values a_1 and b_1, then we can restate convexity as

    f( E_{p(x)}(X) ) <= E_{p(x)}( f(X) )

We can generalize this to more complicated random variables, which leads to Jensen's inequality:

Theorem 13. (Jensen's Inequality) If f is convex and X is a random variable, then

    f(E(X)) <= E(f(X))

and if f is strictly convex, then equality above implies X = E(X) with probability 1, i.e. X is constant.

Proof. There are many proofs, but for finite random variables we can show this by induction and the fact that it is already true by definition for binary random variables. To extend the proof to the general case, we can use the idea of supporting lines: we note that

    f(x) + m(y - x) <= f(y)

for all x, y, where the slope m depends on x. Set x = E(X) to get

    f(E(X)) + m(y - E(X)) <= f(y)

and integrating in y (against the distribution of X) gives the result f(E(X)) <= E(f(X)). If we have equality, then E[ f(Y) - f(E(X)) - m(Y - E(X)) ] = 0, and since the integrand is nonnegative, f(y) = f(E(X)) + m(y - E(X)) for a.e. y. However, by strict convexity this can occur only if y = E(X) a.e.

Week 3 (9/28/2009)

We continue with establishing information-theoretic quantities and inequalities. We start with applications of Jensen's inequality.

Theorem 14. (Information Inequality) For any pmfs p(x), q(x), we have that

    D(p||q) >= 0

with equality if and only if p = q.

Proof. First let A = {x : p(x) > 0}, and compute

    -D(p||q) = - sum_{x in A} p(x) log( p(x) / q(x) )
             = sum_{x in A} p(x) log( q(x) / p(x) )
             <= log sum_{x in A} p(x) ( q(x) / p(x) )      (Jensen, concavity of log)
             = log sum_{x in A} q(x)
             <= log sum_{x in X} q(x)
             = log 1 = 0
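A numerical illustration (ours, not from the notes): D(p||q) is nonnegative, vanishes only at p = q, and is not symmetric.

    import math

    def kl(p, q):
        # D(p||q) in bits; infinite if q(x) = 0 while p(x) > 0
        return sum(px * math.log2(px / qx) if qx > 0 else math.inf
                   for px, qx in zip(p, q) if px > 0)

    p, q = [0.75, 0.25], [0.5, 0.5]
    print(kl(p, q), kl(q, p))  # ~0.189 and ~0.208: both positive, not equal
    print(kl(p, p))            # 0.0 exactly when the pmfs agree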

Examining the proof above, equality occurs if and only if sum_{x in A} q(x) = sum_{x in X} q(x) and equality is achieved in the application of Jensen's inequality. The first condition leads to q(x) = 0 whenever p(x) = 0. Equality in the application of Jensen's inequality, with the strict concavity of log, implies that on A the ratio q(x)/p(x) is constant; then sum_{x in A} q(x) = 1 forces the constant to be 1, and therefore q = p.

This has quite a few implications:

Corollary 15. I(X; Y) >= 0, with equality if and only if X, Y are independent.

Proof. Noting I(X; Y) = D( p(x, y) || p(x) p(y) ), we have that I(X; Y) >= 0 with equality if and only if p(x, y) = p(x) p(y), i.e. X, Y are independent.

Corollary 16. If |X| < infinity, then we have the following useful upper bound:

    H(X) <= log |X|

Proof. Let u(x) ~ unif(X), i.e. u(x) = 1/|X| for x in X, and let X ~ p(x) be any random variable taking values in X. Then

    0 <= D(p||u) = sum_x p(x) log( p(x) / u(x) ) = -H(X) + sum_x p(x) log |X| = -H(X) + log |X|

from which the result follows. Equality holds if and only if p = u.

Theorem 17. (Conditioning Reduces Entropy)

    H(X|Y) <= H(X)

with equality if and only if X, Y are independent.

Proof. Note I(X; Y) = H(X) - H(X|Y) >= 0, where equality holds if and only if X, Y are independent.

Remark. Do not forget that H(X|Y = y) is not necessarily <= H(X).

The conditional entropy inequality allows us to show the concavity of the entropy function:

Corollary 18. Suppose X ~ p(x) is a r.v. Then H(p) := H(X) is a concave function of p(x), i.e.

    H( lambda p_1 + (1 - lambda) p_2 ) >= lambda H(p_1) + (1 - lambda) H(p_2)

Note that here the p_i are pmfs, so H(p_i) is not the binary entropy function.

Proof. Let X_i ~ p_i(x) for i = 1, 2, defined on the same alphabet X. Let

    theta = { 1  w.p. lambda
            { 2  w.p. 1 - lambda

and define

    Z = X_theta = { X_1  w.p. lambda
                  { X_2  w.p. 1 - lambda

Now note that H(Z|theta) = lambda H(p_1) + (1 - lambda) H(p_2), and by the conditional entropy inequality,

    lambda H(p_1) + (1 - lambda) H(p_2) = H(Z|theta) <= H(Z) = H( lambda p_1 + (1 - lambda) p_2 )

Note that P(Z = a) = lambda p_1(a) + (1 - lambda) p_2(a).

Corollary 19.

    H(X_1, ..., X_n) <= sum_{i=1}^{n} H(X_i)

Proof. By the chain rule,

    H(X_1, ..., X_n) = sum_{i=1}^{n} H(X_i | X_1, ..., X_{i-1}) <= sum_{i=1}^{n} H(X_i)

using the conditional entropy inequality. Equality holds if and only if X_1, ..., X_n are mutually independent, following from X_i being independent of the joint (X_1, ..., X_{i-1}).

Another application of Jensen's inequality will give us another set of useful inequalities.

Theorem 20. (Log Sum Inequality) Let a_1, ..., a_n and b_1, ..., b_n be nonnegative numbers. Then

    sum_{i=1}^{n} a_i log( a_i / b_i ) >= ( sum_i a_i ) log [ ( sum_i a_i ) / ( sum_i b_i ) ]

Proof. This follows from the convexity of t log t (with second derivative 1/t > 0). Define the random variable

    X = a_i / b_i   w.p.  b_i / sum_j b_j

Then

    sum_{i=1}^{n} ( b_i / sum_j b_j ) ( a_i / b_i ) log( a_i / b_i ) = E(X log X) >= E(X) log E(X) = ( sum_i a_i / sum_j b_j ) log( sum_i a_i / sum_j b_j )

and multiplying by sum_j b_j gives the result. Note that equality holds if and only if a_i / b_i is constant, i.e. a_i = C b_i.

Then we have the following corollaries:

Corollary 21. D(p||q) is convex in the pair (p, q), i.e.

    D( lambda p_1 + (1 - lambda) p_2 || lambda q_1 + (1 - lambda) q_2 ) <= lambda D(p_1||q_1) + (1 - lambda) D(p_2||q_2)

Proof. Expanding the inequality above, we want to show that

    sum_x ( lambda p_1(x) + (1 - lambda) p_2(x) ) log [ ( lambda p_1(x) + (1 - lambda) p_2(x) ) / ( lambda q_1(x) + (1 - lambda) q_2(x) ) ]
        <= lambda sum_x p_1(x) log( p_1(x) / q_1(x) ) + (1 - lambda) sum_x p_2(x) log( p_2(x) / q_2(x) )

For a fixed x, apply the log sum inequality with a_i = lambda_i p_i(x) and b_i = lambda_i q_i(x), where lambda_1 = lambda and lambda_2 = 1 - lambda:

    ( sum_{i=1,2} lambda_i p_i(x) ) log [ ( sum_i lambda_i p_i(x) ) / ( sum_i lambda_i q_i(x) ) ] <= sum_{i=1,2} lambda_i p_i(x) log( lambda_i p_i(x) / (lambda_i q_i(x)) )

and the lambda_i cancel inside the logarithm on the right. Summing over x gives the result.

Theorem 22. Let (X, Y) ~ p(x, y) = p(x) p(y|x). Then

1. I(X; Y) is a concave function of p(x) for a fixed conditional distribution p(y|x).
2. I(X; Y) is a convex function of p(y|x) for a fixed marginal distribution p(x).

Remark. This will be useful in studying the source-channel coding theorems, since p(y|x) captures information about the channel. Concavity implies that local maxima are also global maxima, so when computing the capacity of a channel (maximizing mutual information between input and output of the channel) it will suffice to look for local maxima. Likewise, convexity implies that local minima are global minima; when we study lossy compression we will want to compute, for a fixed input distribution, the worst-case scenario, which involves minimizing the mutual information.

Proof. For (1), note that I(X; Y) = H(Y) - H(Y|X). Since p(y) = sum_x p(x) p(y|x) is a linear function of p(x) for a fixed p(y|x), and H(Y) is a concave function of p(y), we see that H(Y) is a concave function of p(x), since a composition of a concave function phi with a linear function T is concave:

    phi( T( lambda x_1 + (1 - lambda) x_2 ) ) = phi( lambda T x_1 + (1 - lambda) T x_2 ) >= lambda phi(T x_1) + (1 - lambda) phi(T x_2)

As for the second term, note that

    H(Y|X) = sum_x p(x) H(Y|X = x),   where   H(Y|X = x) = - sum_y p(y|x) log p(y|x)

which is fixed for a fixed p(y|x), and thus H(Y|X) is a linear function of p(x), which in particular is concave. Thus I(X; Y) = H(Y) - H(Y|X) is a sum of two concave functions, which is also concave:

    (phi_1 + phi_2)( lambda x + (1 - lambda) y ) >= sum_i [ lambda phi_i(x) + (1 - lambda) phi_i(y) ] = lambda (phi_1 + phi_2)(x) + (1 - lambda) (phi_1 + phi_2)(y)

This proves (1).

To prove (2), consider p_1(y|x), p_2(y|x), two conditional distributions given p(x), with corresponding joints p_1(x, y), p_2(x, y) and output marginals p_1(y), p_2(y). Then also consider

    p_lambda(y|x) = lambda p_1(y|x) + (1 - lambda) p_2(y|x)

with corresponding joint p_lambda(x, y) = lambda p_1(x, y) + (1 - lambda) p_2(x, y) and marginal p_lambda(y) = lambda p_1(y) + (1 - lambda) p_2(y). Then the mutual information I_lambda(X; Y) using p_lambda is

    I_lambda(X; Y) = D( p_lambda(x, y) || p(x) p_lambda(y) )
                   <= lambda D( p_1(x, y) || p(x) p_1(y) ) + (1 - lambda) D( p_2(x, y) || p(x) p_2(y) )
                   = lambda I_1(X; Y) + (1 - lambda) I_2(X; Y)

using the convexity of D in the pair, which shows that I_lambda is a convex function of p(y|x) for a fixed p(x).

Data Processing Inequality

We now turn to proving a result about Markov chains, which we define first:

Definition 23. The random variables X, Y, Z form a Markov chain, denoted X -> Y -> Z, if the joint factors in the following way:

    p(x, y, z) = p(x) p(y|x) p(z|y)

for all x, y, z.

Properties.

1. X -> Y -> Z if and only if X, Z are conditionally independent given Y, i.e. p(x, z|y) = p(x|y) p(z|y).

Proof. First assume X -> Y -> Z. Then

    p(x, z|y) = p(x, y, z) / p(y) = p(x, y) p(z|y) / p(y) = p(x|y) p(z|y)

Conversely, suppose p(x, z|y) = p(x|y) p(z|y). Then

    p(x, y, z) = p(y) p(x, z|y) = p(y) p(x|y) p(z|y) = p(x) p(y|x) p(z|y)

as desired.

2. X -> Y -> Z if and only if Z -> Y -> X, which follows from the previous property.

3. If Z = f(Y), then X -> Y -> Z. This also follows from the first property, since given Y, Z is independent of X:

    p(z|x, y) = p(z|y) = { 1  if z = f(y)
                         { 0  else

Now we can state the theorem:

Theorem 24. (Data Processing Inequality) Suppose X -> Y -> Z. Then I(X; Y) >= I(X; Z).

Remark. This means that along the chain, the dependency decreases. Alternatively, if we interpret I(X; Y) as the amount of knowledge Y gives about X, then any processing of Y to get Z may decrease the information that we can get about X.

Proof. Expand I(X; Y, Z) by the chain rule in two ways:

    I(X; Y, Z) = I(X; Y) + I(X; Z|Y)
    I(X; Y, Z) = I(X; Z) + I(X; Y|Z)

Note that I(X; Z|Y) = 0 since X, Z are independent given Y. Also I(X; Y|Z) >= 0, and thus

    I(X; Y) = I(X; Y, Z) = I(X; Z) + I(X; Y|Z) >= I(X; Z)

as desired.

Now we illustrate some examples.

Example 25. (Communications Example) Modeling the process of sending a (random) message W to some destination:

    W -> Encoder -> X -> Channel -> Y -> Decoder -> Z

Note that W -> X -> Y -> Z is a Markov chain; each variable depends directly only on the previous one. The data processing inequality tells us that

    I(X; Y) >= I(X; Z)

Thus, if we are not careful, the decoder may decrease mutual information (and of course cannot add any information about X). Similarly, I(X; Y) >= I(W; Z).

Example 26. (Statistics Example) In statistics, frequently we want to estimate some unknown distribution X by making measurements (observations) and then coming up with some estimator:

    X -> Measurement -> Y -> Estimator -> Z

For instance, X could be the distribution of heights of a population (Gaussian with unknown mean and standard deviation). Then Y would be some samples, and from the samples we can come up with an estimator (say the sample mean / standard deviation). X -> Y -> Z, and by the data processing inequality we then note that the estimator may decrease information. An estimator that does not decrease information about X is called a sufficient statistic.

Here is a probabilistic setup. Suppose we have a family of probability distributions {p_theta(x)}, and Theta is the parameter to be estimated. X represents samples from some distribution p_Theta in the family. Let T(X) be any function of X (for instance, the sample mean). Then

    Theta -> X -> T(X)

form a Markov chain. By the data processing inequality, I(Theta; T(X)) <= I(Theta; X).

Definition 27. T(X) is a sufficient statistic for Theta if Theta -> T(X) -> X also forms a Markov chain. Note that this implies I(Theta; T(X)) = I(Theta; X).

As an example, consider p_theta(x) ~ Bernoulli(theta), and consider X = (X_1, ..., X_n) i.i.d. ~ p_theta(x). Then T(X) = sum_{i=1}^{n} X_i is a sufficient statistic for Theta, i.e. given T(X), the distribution of X does not depend on Theta. Here T(X) counts the number of 1s. Then

    P( X = (x_1, ..., x_n) | T(X) = k, Theta = theta ) = { 1 / (n choose k)   if sum_i x_i = k
                                                         { 0                  else

where we note that P( X = (x_1, ..., x_n), T(X) = k | Theta = theta ) = theta^k (1 - theta)^{n-k} (if sum_i x_i = k) and P( T(X) = k | Theta = theta ) = (n choose k) theta^k (1 - theta)^{n-k}.

As a second example, consider f_theta(x) ~ N(theta, 1) and consider X = (X_1, ..., X_n) i.i.d. ~ f_theta(x). Then T(X) = (1/n) sum_{i=1}^{n} X_i is a sufficient statistic for Theta. Note that (1/n) sum_i X_i ~ N(theta, 1/n), so that T(X) ~ N(theta, 1/n). Consider the joint of (X_1, ..., X_n, T(X)). The density of (X_1, ..., X_n) is proportional to

    exp( -(1/2) sum_{i=1}^{n} (x_i - theta)^2 )

whose theta-dependent part is exp( theta sum_i x_i - n theta^2 / 2 ). Also, the density of T(X) at xbar is proportional to

    exp( -(n/2) (xbar - theta)^2 )

whose theta-dependent part is exp( n theta xbar - n theta^2 / 2 ). If xbar = (1/n) sum_i x_i, then n theta xbar = theta sum_i x_i, so the theta-dependent parts cancel in the conditional density f(x_1, ..., x_n | xbar, Theta); thus given T(X), X and Theta are independent.

As a final example, if g_theta ~ Unif(theta, theta + 1), and X = (X_1, ..., X_n) i.i.d. ~ g_theta, then a sufficient statistic for theta is

    T(X_1, ..., X_n) = ( max_i X_i, min_i X_i )

In this case we note that the conditional probability

    P( x_1 in I_1, ..., x_n in I_n | max_i X_i = M, min_i X_i = m )

does not depend on the parameter theta: given the max and min, we know that the remaining X_i lie (uniformly) in between, so we do not need to know theta to compute the conditional distribution.
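A numeric check of the Bernoulli example (our sketch): the conditional probability of a particular sequence given T(X) = k is 1/(n choose k), whatever theta is.

    from math import comb

    def cond_prob(x, theta):
        # P(X^n = x | T(X) = sum(x), Theta = theta) for i.i.d. Bernoulli(theta) bits
        n, k = len(x), sum(x)
        joint = theta**k * (1 - theta)**(n - k)                   # P(X^n = x, T = k)
        marginal = comb(n, k) * theta**k * (1 - theta)**(n - k)   # P(T = k)
        return joint / marginal                                   # = 1 / comb(n, k)

    x = (1, 0, 1, 0)
    for theta in (0.2, 0.5, 0.9):
        print(cond_prob(x, theta))  # 1/6 each time: independent of theta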

Now consider the problem of estimating an unknown r.v. X ~ p(x) by observing Y with conditional distribution p(y|x). Use g(Y) = Xhat as the estimate for X, and consider the error probability P_e = Pr(Xhat != X). Fano's inequality relates this error to the conditional entropy H(X|Y). This is intuitive, since the lower the entropy, the lower we expect the probability of error to be.

Theorem 28. (Fano's Inequality) Suppose we have a setup as above with X taking values in the finite alphabet X. Then

    H(P_e) + P_e log( |X| - 1 ) >= H(X|Y)

where H(P_e) is the binary entropy function -P_e log P_e - (1 - P_e) log(1 - P_e). A weaker form of the inequality is (after using H(P_e) <= 1 and log(|X| - 1) <= log |X|):

    P_e >= ( H(X|Y) - 1 ) / log |X|

Remark. Note in particular from the first inequality that if P_e = 0 then H(X|Y) = 0, so that X is a function of Y.

Proof. Let

    E = { 1  if X != Xhat
        { 0  if X = Xhat

be the indicator of error, and note E ~ Bernoulli(P_e). Examine H(E, X|Y) using the chain rule:

    H(E|Y) + H(X|E, Y) = H(E, X|Y) = H(X|Y) + H(E|X, Y)

Note that H(E|Y) <= H(E) = H(P_e). Also,

    H(X|E, Y) = P(E = 0) H(X|E = 0, Y) + P(E = 1) H(X|E = 1, Y)

where in the first component H(X|E = 0, Y) = 0, because knowing Y and that there was no error tells us that X = g(Y), so X becomes deterministic. Also, given E = 1 and Y, we know that X takes values in X \ {Xhat}, an alphabet of size |X| - 1. Then using the estimate H(X) <= log |X|, we have that

    H(X|E, Y) <= P_e log( |X| - 1 )

Finally, H(E|X, Y) = 0, since knowing X and Y allows us to determine whether there is an error. Combining,

    H(X|Y) <= P_e log( |X| - 1 ) + H(P_e)

which is the desired inequality.

Remark. The theorem is also true with g(Y) random; see the book. In fact, we just make a slight adjustment: in the proof above we have actually shown the result with H(X|Xhat) in place of H(X|Y). Then the rest is a consequence of the data processing inequality, since

    I(X; Xhat) <= I(X; Y)  =>  H(X) - H(X|Xhat) <= H(X) - H(X|Y)  =>  H(X|Y) <= H(X|Xhat)

Also, the inequality may be sharp: there exist distributions for which equality is satisfied. See the book.
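As a small illustration (ours) of the weaker form of Fano's inequality:

    import math

    def fano_lower_bound(H_cond_bits, alphabet_size):
        # weaker Fano bound: P_e >= (H(X|Y) - 1) / log|X|, clipped at 0
        return max(0.0, (H_cond_bits - 1) / math.log2(alphabet_size))

    # with |X| = 8 and H(X|Y) = 2 bits, any estimator errs at least 1/3 of the time:
    print(fano_lower_bound(2.0, 8))  # 0.333...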

Week 4 (10/5/2009)

Suppose we have X, X' i.i.d. with entropy H(X). Note P(X = X') = sum_{x in X} p(x)^2. We can bound this probability below:

Lemma 29. P(X = X') >= 2^{-H(X)}, with equality if and only if X is uniformly distributed.

Proof. Since 2^x is convex, and H(X) = E_{p(x)}[ -log p(X) ], we have that

    2^{-H(X)} = 2^{E_{p(x)} log p(X)} <= E_{p(x)} 2^{log p(X)} = E_{p(x)} p(X) = sum_{x in X} p(x)^2 = P(X = X')

With the same proof, we also get:

Lemma 30. Let X ~ p(x), X' ~ r(x). Then

    P(X = X') >= 2^{-H(p) - D(p||r)}   and   P(X = X') >= 2^{-H(r) - D(r||p)}

Proof.

    2^{-H(p) - D(p||r)} = 2^{E_{p(x)} log p(X) + E_{p(x)} log( r(X)/p(X) )} = 2^{E_{p(x)} log r(X)} <= E_{p(x)} r(X) = sum_x p(x) r(x) = P(X = X')

and swapping the roles of p(x) and r(x), the other inequality holds.

These match intuition, since the higher the entropy (uncertainty), the less likely the two random variables will agree.

Asymptotic Equipartition Property

Now we will shift gears and discuss an important result that allows for compression of random variables. First recall the different notions of convergence of random variables:

- Usual convergence (of real numbers, or deterministic r.v.): a_n -> a if for all eps there exists N s.t. for n > N, |a_n - a| <= eps.
- Convergence a.e. (with probability 1): X_n -> X a.e. if P( omega : X_n(omega) -> X(omega) ) = 1.
- Convergence in probability (weaker than the above): X_n -> X in probability if for every eps, lim_n P( |X_n(omega) - X(omega)| > eps ) = 0.
- Convergence in law/distribution: X_n -> X in distribution if P(X_n in [a, b]) -> P(X in [a, b]) for all [a, b] with P(X = a) = P(X = b) = 0 (or: cdfs converge at points of continuity, or characteristic functions converge pointwise, etc.).

Now we work towards a result known as the Law of Large Numbers for i.i.d. r.v.s (sample means tend to the expectation in the limit). First:

Lemma 31. (Chebyshev's Inequality) Let X be a r.v. with mean mu and variance sigma^2. Then

    P( |X - mu| > eps ) <= sigma^2 / eps^2

Proof.

    sigma^2 = integral |X - mu|^2 dP >= integral_{|X - mu| > eps} |X - mu|^2 dP >= eps^2 P( |X - mu| > eps )

from which we conclude the result.

Theorem 32. (Weak Law of Large Numbers (WLLN)) Let X_1, ..., X_n be i.i.d. ~ X with mean mu and variance sigma^2. Then

    lim_n P( | (1/n) sum_{i=1}^{n} X_i - mu | > eps ) = 0

i.e. (1/n) sum_i X_i -> mu in probability.

Proof. Note that letting A_n = (1/n) sum_{i=1}^{n} X_i, the mean of A_n is mu and the variance is sigma^2 / n. Then by Chebyshev's inequality,

    P( |A_n - mu| > eps ) <= sigma^2 / (n eps^2) -> 0

so that we have convergence in probability.

With a straightforward application of the law of large numbers, we arrive at the following:

Theorem 33. (Asymptotic Equipartition Property (AEP)) If X_1, ..., X_n are i.i.d. ~ p(x), then

    -(1/n) log p(X_1, ..., X_n) -> H(X)

in probability.

Proof. Since p(x_1, ..., x_n) = p(x_1) ... p(x_n), we have that

    -(1/n) log p(X_1, ..., X_n) = -(1/n) sum_{i=1}^{n} log p(X_i) -> E( -log p(X) ) = H(X)

using the Law of Large Numbers on the random variables -log p(X_i), which are still i.i.d., since (X, Y) independent implies (f(X), g(Y)) independent.

Remark. Note that this shows that p(X_1, ..., X_n) is approximately 2^{-n H(X)}, so that the joint is approximately uniform on a smaller space. This is where we get compression of the r.v., and it motivates the definition of the typical set below.
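A quick simulation (ours) of the AEP for Bernoulli(p) bits: the normalized log-probability of the observed sequence approaches H(p).

    import math, random

    def minus_log_prob_rate(bits, p):
        # -(1/n) log2 p(x_1, ..., x_n) for an i.i.d. Bernoulli(p) sample
        n, k = len(bits), sum(bits)
        return -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n

    random.seed(0)
    p = 0.3
    print(-p*math.log2(p) - (1-p)*math.log2(1-p))   # H(p) ~ 0.881
    for n in (10, 100, 10000):
        bits = [random.random() < p for _ in range(n)]
        print(n, minus_log_prob_rate(bits, p))      # -> H(p) as n grows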

Definition 34. The typical set A_eps^(n) with respect to p(x) is the set of sequences (x_1, ..., x_n) such that

    2^{-n(H(X)+eps)} <= p(x_1, ..., x_n) <= 2^{-n(H(X)-eps)}

Properties of A_eps^(n):

1. Directly from the definition, we have that

    (x_1, ..., x_n) in A_eps^(n)  <=>  H(X) - eps <= -(1/n) log p(x_1, ..., x_n) <= H(X) + eps
                                  <=>  | -(1/n) log p(x_1, ..., x_n) - H(X) | <= eps

2. P( A_eps^(n) ) >= 1 - eps for n large.

Proof. The AEP shows that for n large enough,

    P( | -(1/n) log p(X_1, ..., X_n) - H(X) | > eps ) < eps

and comparing with the first property, this gives a bound on the measure of the complement of the typical set, which gives the result.

3. (1 - eps) 2^{n(H(X)-eps)} <= |A_eps^(n)| <= 2^{n(H(X)+eps)}

Proof.

    1 = sum_{(x_1,...,x_n)} p(x_1, ..., x_n) >= sum_{A_eps^(n)} p(x_1, ..., x_n) >= |A_eps^(n)| 2^{-n(H(X)+eps)}

gives the RHS inequality, and

    1 - eps < P( A_eps^(n) ) = sum_{A_eps^(n)} p(x_1, ..., x_n) <= |A_eps^(n)| 2^{-n(H(X)-eps)}

gives the LHS inequality (only true for n large enough).

Given these properties, we can come up with a scheme for compressing data. Given X_1, ..., X_n i.i.d. ~ p(x) over the alphabet X, note that sequences from the typical set occur with very high probability, > (1 - eps). Thus we expect to be able to use shorter codes for typical sequences. The scheme is very simple. Since there are at most 2^{n(H(X)+eps)} sequences in the typical set, we need at most n(H(X) + eps) bits, and rounding up, this is n(H(X) + eps) + 1 bits. To indicate in the code that we are describing a typical sequence, we prepend a 0 to any codeword of a typical sequence. Thus to describe a typical sequence we use at most n(H(X) + eps) + 2 bits. For non-typical sequences, since they occur rarely, we can afford to use the lousy description with at most n log |X| + 1 bits, and prepending a 1 to indicate that the sequence is not typical, we describe non-typical sequences with n log |X| + 2 bits. Note that this code is one-to-one: given the codeword, we can decode by looking at the first digit to see whether it is a typical sequence, and then looking in the appropriate table to recover the sequence.

Now we compute the average length of the code. To fix notation, let x^n = (x_1, ..., x_n) and let l(x^n) denote the length of the corresponding codeword. Then the expected length of the code is

    E( l(X^n) ) = sum_{x^n} p(x^n) l(x^n)
                <= sum_{A_eps^(n)} p(x^n) [ n(H(X)+eps) + 2 ] + sum_{(A_eps^(n))^c} p(x^n) [ n log |X| + 2 ]
                <= P(A_eps^(n)) [ n(H(X)+eps) ] + (1 - P(A_eps^(n))) [ n log |X| ] + 2
                <= n(H(X)+eps) + eps n log |X| + 2
                = n( H(X) + eps' )

with eps' = eps + eps log |X| + 2/n. Note that eps' can be made arbitrarily small, since eps' -> 0 as eps -> 0 and n -> infinity. Thus

    E[ (1/n) l(X^n) ] <= H(X) + eps'

for n large, and on the average we have represented X using H(X) bits.

Question: Can one do better than this? It turns out that the answer is no, but we will only be able to answer this much later.

Remark. This scheme is also not very practical:

- For one, we cannot compress individual r.v.s; we must take n large and compress them together. The lookup table will be large and inefficient.
- This requires knowledge of the underlying distribution p(x). We will later see schemes that can estimate the underlying distribution as the input comes, a sort of universal encoding scheme.

Next time: We will discuss entropy in the context of stochastic processes.

Definition 35. A stochastic process {X_i} is an indexed set of r.v.s with arbitrary dependence. To describe a stochastic process we specify the joint p(x_1, ..., x_n) for all n. We will be focusing on stationary stochastic processes, i.e. ones for which

    p(x_1, ..., x_n) = p(x_{1+t}, ..., x_{n+t})

for all n, t, i.e. independent of time shifts.

In the special case of i.i.d. random variables, we had that H(X_1, ..., X_n) = sum_{i=1}^{n} H(X) = n H(X), and we know how the entropy behaves for large n. We will ask how H(X_1, ..., X_n) grows for a general stationary process.

Week 5 (10/19/2009)

Notes: Midterm [date]; next class starts at [time] pm.

Entropy Rates for Stochastic Processes

Continuing on, we can now define an analogous quantity for entropy:

Definition 36. The entropy rate of an (arbitrary) stochastic process {X_i} is

    H(X) = lim_{n -> infinity} (1/n) H(X_1, ..., X_n)

when the limit exists. Also define a related quantity,

    H'(X) = lim_{n -> infinity} H(X_n | X_1, ..., X_{n-1})

when the limit exists. Note that H(X) is the Cesaro limit of H(X_n | X_1, ..., X_{n-1}) (using the chain rule), whereas H'(X) is the direct limit.

Also, for i.i.d. random variables, this agrees with the usual definition of entropy, since in that case

    (1/n) H(X_1, ..., X_n) = H(X)

Theorem 37. For a stationary stochastic process, the limits H(X) and H'(X) exist and are equal.

Proof. This follows from the fact that H(X_n | X_1, ..., X_{n-1}) is monotone decreasing for a stationary process:

    H(X_{n+1} | X_1, ..., X_n) <= H(X_{n+1} | X_2, ..., X_n) = H(X_n | X_1, ..., X_{n-1})

where the inequality follows since conditioning reduces entropy, and the equality follows since the process is stationary. The sequence is monotone decreasing and also bounded below by 0 (entropy is nonnegative), and thus the limit exists; hence the Cesaro limit also exists and the two are equal.

Here is an important result that we will not prove (maybe later), by Shannon, McMillan, and Breiman:

Theorem 38. (Generalized AEP) For any stationary ergodic process,

    -(1/n) log p(X_1, ..., X_n) -> H(X)   a.e.

Thus, the results about compression and typical sets generalize to stationary ergodic processes.

Now we introduce the following notation for Markov Chains (MC), and also some review:

1. We will take X to be a discrete, at most countable alphabet, called the state space. For instance, X could be {0, 1, 2, ...}.

2. A stochastic process is a Markov Chain (MC) if

    P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, ..., X_1 = i_1) = P(X_{n+1} = j | X_n = i)

i.e. conditioning only depends on the previous time. We call this probability, which we denote P_{ij}^{n,n+1}, the one-step transition probability. If this is independent of n, then we call the process a time-invariant MC. Thus we can write X_1 -> X_2 -> ... -> X_n -> X_{n+1}, since the joint on any finite segment factorizes with dependencies only on the previous random variable.

3. For a time-invariant MC, we express the P_{ij} in a compact probability transition matrix P (which may be infinite-dimensional), where P_{ij} is the entry of the i-th row, j-th column. Note that sum_j P_{ij} = 1 for each i. Given a distribution mu on X, we can write it as a row vector

    mu = [p(0), p(1), ...]

If at time 0 the marginal distribution is mu, then at time 1 the probabilities can be found by the formula

    P(X_1 = j) = sum_i P(X_0 = i) P_{ij} = [mu P]_j

where we use the notation: row vector mu times matrix P. It turns out that a necessary and sufficient condition for stationarity is that the marginal distributions all equal some distribution mu for every n, where mu = mu P. For a time-invariant MC, the joint depends only on the initial distribution P(X_0 = i) and the time-invariant transition probability (which is fixed), and thus if P(X_0 = i) = mu_i where mu = mu P, stationarity follows.

4. Under some conditions on a general (not necessarily time-invariant) Markov chain (ergodicity, for instance), no matter what initial distribution mu_0 we start from, mu_n -> mu (convergence to a unique stationary distribution mu). Perron-Frobenius has such a result about conditions on P. An example of a not-so-nice Markov chain is a Markov chain with two isolated components: then the limits exist but depend on the initial distribution.

Now we turn to results about entropy rates. For stationary MCs, note that

    H(X) = H'(X) = lim_n H(X_n | X_1, ..., X_{n-1}) = lim_n H(X_n | X_{n-1}) = H(X_2 | X_1) = - sum_{i,j} mu_i P_{ij} log P_{ij}

Example 39. For a 2-state Markov chain with

    P = ( 1-alpha   alpha  )
        (  beta    1-beta  )

we can spot a stationary distribution to be mu_1 = beta/(alpha+beta), mu_2 = alpha/(alpha+beta). Then we can compute the entropy rate accordingly:

    H(X) = H(X_2|X_1) = ( beta/(alpha+beta) ) H(alpha) + ( alpha/(alpha+beta) ) H(beta)
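A small sketch (ours) evaluating Example 39 both ways: the closed form, and the general formula -sum_{i,j} mu_i P_{ij} log P_{ij}.

    import math

    def h2(p):
        return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

    alpha, beta = 0.2, 0.4
    mu = [beta/(alpha+beta), alpha/(alpha+beta)]   # stationary distribution
    print(mu[0]*h2(alpha) + mu[1]*h2(beta))        # closed form

    P = [[1-alpha, alpha], [beta, 1-beta]]
    print(-sum(mu[i]*P[i][j]*math.log2(P[i][j])
               for i in range(2) for j in range(2) if P[i][j] > 0))  # same value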

Example 40. (Random Walk on a Weighted Graph) For a graph with m nodes X = {1, ..., m}, assign weights w_ij >= 0, with w_ij = w_ji and w_ii = 0, to the edges of the graph. Let w_i = sum_j w_ij, the total weight of edges leaving i, and W = sum_{i<j} w_ij, the sum of all the weights. Then define transition probabilities p_ij = w_ij / w_i and examine the resulting stationary distribution. (An example random walk fixes a starting point, say node 1, which corresponds to the initial distribution mu_0 = [1, 0, ...].) The stationary distribution (which turns out to be unique) can be guessed: mu_i = w_i / 2W. To verify:

    [mu P]_j = sum_i mu_i p_ij = sum_i (w_i / 2W)(w_ij / w_i) = (1/2W) sum_i w_ij = w_j / 2W

which shows that mu P = mu. Now we can compute the entropy rate:

    H(X) = H(X_2|X_1) = - sum_{i,j} mu_i p_ij log p_ij
         = - sum_{i,j} (w_i / 2W)(w_ij / w_i) log( w_ij / w_i )
         = - sum_{i,j} (w_ij / 2W) log( w_ij / 2W ) + sum_i (w_i / 2W) log( w_i / 2W )
         = H( ..., w_ij / 2W, ... ) - H( ..., w_i / 2W, ... )

which we can interpret as the entropy of the edge distribution minus the entropy of the node distribution, so to speak. In the special case where all the weights are equal, let E_i be the number of edges from node i, and let E be the total number of edges. Then mu_i = E_i / 2E and

    H(X) = log(2E) - H( E_1/2E, ..., E_m/2E )

Example 41. (Random Walk on a Chessboard) In the book, but it is just a special case of the above.

Example 42. (Hidden Markov Model) This concerns a Markov chain {X_n} which we cannot observe (it is hidden), and functions (observations) Y_n(X_n) which we can observe. We can bound the entropy rate of Y; note that Y_n does not necessarily form a Markov chain.

Data Compression

We now work towards the goal of arguing that entropy is the limit of data compression. First let's consider properties of codes. A simple example is the Morse code, where each letter is encoded by a sequence of dots and dashes (a binary alphabet); the idea is to give frequently used letters shorter descriptions, and the descriptions have arbitrary length.

We define a code C for some source (r.v.) X to be a mapping from X to D*, the finite-length strings from a D-ary alphabet. We denote by C(x) the codeword of x and by l(x) the length of C(x). The expected codeword length is then L(C) = E(l(X)).

Reasonable properties of codes: We say a code is non-singular if it is one-to-one, i.e. each C(x) is easily decodable to recover x. Now consider the extension of a code C, defined by C(x_1, ..., x_n) = C(x_1) ... C(x_n) (concatenation of codewords), for encoding many realizations of X at once (for instance, to send across a channel). It would be good for the extension to be non-singular as well; we call a code uniquely decodable (UD) if the extension is non-singular. Even more specific, a code is prefix-free (or prefix) if no codeword is a prefix of another codeword. In this case, the moment a codeword is transmitted we can decode it immediately (without having to receive the next letter).

Example binary codes (cf. Cover & Thomas, Table 5.1):

    x | singular | non-singular but not UD | UD but not prefix | prefix
    1 |    0     |           0             |        10         |   0
    2 |    0     |          010            |        00         |   10
    3 |    0     |           01            |        11         |   110
    4 |    0     |           10            |        110        |   111

The goal is to minimize L(C), and to this end we note a useful inequality.

Theorem 43. (Kraft Inequality) For any prefix code over an alphabet of size D, the codeword lengths l_1, ..., l_m must satisfy

    sum_{i=1}^{m} D^{-l_i} <= 1

Conversely, for any set of lengths satisfying this inequality, there exists a prefix code with the specified lengths.

Proof. The idea is to examine a D-ary tree where, from the root, each branch represents a letter from the alphabet. The prefix condition implies that the codewords correspond to leaves of the D-ary tree. The argument is then to extend the tree to a full D-ary tree (add nodes until the depth is l_max, the maximum length). The number of leaves at depth l_max for a full tree is D^{l_max}, and if a codeword has length l_i, then the number of leaves at the bottom level that correspond to this word is D^{l_max - l_i}. Since these sets of leaves are disjoint, the codewords of a prefix encoding satisfy

    sum_i D^{l_max - l_i} <= D^{l_max}

and the result follows. The converse is easy to see by reversing the construction: associate each codeword to D^{l_max - l_i} consecutive leaves (which have a common ancestor).

Next time we will show that the Kraft inequality needs to hold also for UD codes, in which case there is no loss or benefit from focusing on prefix codes.
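A one-line numerical check (ours) of the Kraft inequality:

    def kraft_sum(lengths, D=2):
        # sum_i D^{-l_i}; at most 1 for any D-ary prefix (or UD) code
        return sum(D ** (-l) for l in lengths)

    print(kraft_sum([1, 2, 3, 3]))  # 1.0: achieved by the prefix code {0, 10, 110, 111}
    print(kraft_sum([1, 1, 2]))     # 1.25 > 1: no binary prefix code has these lengths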

Week 6 (10/26/2009)

First we note that we can extend the Kraft inequality discussed last week to countably many lengths as well.

Theorem 44. (Extended Kraft Inequality) For any prefix code over an alphabet of size D, the codeword lengths l_1, l_2, ... must satisfy

    sum_{i=1}^{infinity} D^{-l_i} <= 1

Conversely, for any set of lengths satisfying this inequality, there exists a prefix code with the specified lengths.

Proof. Given a prefix code, we can again consider the infinite tree with the codewords at the leaves. We still extend the tree so that the tree is full, in a different sense: adding nodes until every node has either D children or no children. Such a tree can then be associated to a probability space on the (countably infinite) leaves by the formula P(X = l) = D^{-depth(l)}, and the sum of all the probabilities is 1. Since we have only added dummy nodes, we have only increased the sum of the D^{-l_i}, and thus

    sum_{i=1}^{infinity} D^{-l_i} <= 1

Conversely, as before, we can reverse the procedure above to construct a D-ary tree with leaves at depths l_i.

It turns out that UD codes also need to satisfy Kraft's inequality.

Theorem 45. (McMillan) The codeword lengths l_1, l_2, ..., l_m of a UD code satisfy

    sum_{i=1}^{m} D^{-l_i} <= 1

and conversely, for any set of lengths satisfying this inequality, there exists a UD code with the specified lengths.

Proof. The converse is immediate from the previous Theorem 44, since we can find a prefix code satisfying the specified lengths, and prefix codes are a subset of UD codes. Now given a UD code, examine C^k, the extension of the code to k symbols, where C(x_1, ..., x_k) = C(x_1) ... C(x_k) and l(x_1, ..., x_k) = sum_{m=1}^{k} l(x_m). Now look at

    ( sum_{x in X} D^{-l(x)} )^k = sum_{x_1, ..., x_k} D^{-l(x_1, ..., x_k)} = sum_{m=1}^{k l_max} a(m) D^{-m}

where a(m) is the number of sequences with code length m; since C^k is non-singular, a(m) <= D^m. Thus

    ( sum_{x in X} D^{-l(x)} )^k <= sum_{m=1}^{k l_max} 1 = k l_max

Taking k-th roots and letting k -> infinity, we then have that

    sum_{x in X} D^{-l(x)} <= lim_k (k l_max)^{1/k} = 1

which gives the result.

Remark. Thus, there is no loss in considering prefix codes as opposed to UD codes, since for any UD code there is a prefix code with the same lengths.

Optimal Codes

Given a source with probability distribution (p_1, ..., p_m), we would want to know how to find a prefix code with the minimum expected length. The problem is then a minimization problem:

    min  L = sum_{i=1}^{m} p_i l_i
    s.t. sum_{i=1}^{m} D^{-l_i} <= 1,   l_i in Z_+

where D is a fixed positive integer (the base). This is a linear program with integer constraints. First, we ignore the integer constraint and assume equality in Kraft's inequality. Then we have a Lagrange multiplier problem. Setting the derivative of the Lagrangian J = sum_i p_i l_i + lambda sum_i D^{-l_i} to zero gives the equations

    p_i = lambda D^{-l_i} log D   =>   D^{-l_i} = p_i / (lambda log D)

and summing over i,

    1 = sum_{i=1}^{m} p_i / (lambda log D) = 1 / (lambda log D)   =>   lambda = 1 / log D   =>   D^{-l_i} = p_i

Thus the optimal codeword lengths are l_i* = log_D(1/p_i), which yields the expected code length

    L* = sum_i p_i l_i* = sum_i p_i log_D(1/p_i) = H(X)

the entropy being in base D. This gives a lower bound for the minimization problem.

Theorem 46. For any prefix code for a random variable X, the expected codeword length satisfies

    L >= H(X)

with equality if and only if D^{-l_i} = p_i.

Proof. This has already been proved partially via Lagrange multipliers (with equality constraint in Kraft), but we now turn to an information-theoretic approach. Let r_i = D^{-l_i} / c, where c = sum_i D^{-l_i}. Then

    L - H(X) = sum_i p_i l_i + sum_i p_i log p_i
             = - sum_i p_i log D^{-l_i} + sum_i p_i log p_i
             = sum_i p_i log( p_i / r_i ) - log c
             = D(p||r) - log c >= 0

since D(p||r) >= 0 and -log c >= 0 (c <= 1 by Kraft), and thus L >= H(X) as desired. Equality holds if and only if log c = 0 and p = r, in which case we have equality in Kraft's inequality (c = 1) and D^{-l_i} = p_i, which finishes the proof.

Now if the p_i are not of the form D^{-l_i}, we would like to know how close to the entropy lower bound we can get. A good suboptimal code is the Shannon-Fano code, which involves a simple rounding procedure: use the lengths l_i = ceil( log_D(1/p_i) ).
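A sketch (ours) of this rounding and the resulting expected length, for an arbitrary pmf:

    import math

    def shannon_lengths(probs, D=2):
        # Shannon-Fano lengths l_i = ceil(log_D(1/p_i)); they satisfy Kraft
        return [math.ceil(-math.log(p, D)) for p in probs]

    probs = [0.4, 0.3, 0.2, 0.1]
    lengths = shannon_lengths(probs)
    H = -sum(p * math.log2(p) for p in probs)
    L = sum(p * l for p, l in zip(probs, lengths))
    print(lengths)   # [2, 2, 3, 4]
    print(H, L)      # H <= L < H + 1, as the next theorem guarantees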

Theorem 47. For the optimal codeword lengths l_1*, ..., l_m* for a random variable X, L* = sum_i p_i l_i* satisfies

    H(X) <= L* < H(X) + 1

Proof. Recall from Theorem 46 that lengths of log_D(1/p_i) achieve the lower bound if they are integers. We now suppose we use ceil( log_D(1/p_i) ) instead. This corresponds to the Shannon code, but first we note that the lengths still satisfy Kraft's inequality:

    sum_i D^{-ceil(log_D(1/p_i))} <= sum_i D^{-log_D(1/p_i)} = sum_i p_i = 1

and also that this gives the desired upper bound:

    sum_i p_i ceil( log_D(1/p_i) ) <= sum_i p_i ( log_D(1/p_i) + 1 ) = H(X) + 1

Since the optimal code is at least as good as this Shannon code, this gives the desired upper bound H(X) + 1. The lower bound was already proved in the previous Theorem 46.

Remark 48. The overhead of 1 in the previous theorem can be made insignificant if we encode blocks (X_1, ..., X_n) to increase H(X_1, ..., X_n), effectively spreading the overhead across a block. By the previous theorem, we can find a code for (X_1, ..., X_n) with lengths l(x_1, ..., x_n) such that

    H(X_1, ..., X_n) <= E[ l(X_1, ..., X_n) ] <= H(X_1, ..., X_n) + 1

Then per outcome, we have that

    (1/n) H(X_1, ..., X_n) <= (1/n) E[ l(X_1, ..., X_n) ] <= (1/n) H(X_1, ..., X_n) + 1/n

and letting L_n = (1/n) E[ l(X_1, ..., X_n) ], we see that L_n -> H(X), the entropy rate, if X_1, ..., X_n is stationary. Note the connection to the Shannon-McMillan-Breiman Theorem (or the AEP in the case of i.i.d. X_i): there we saw that we can design codes using typical sequences to get a compression rate arbitrarily close to the entropy. This remark uses a slightly different approach, via Shannon codes.

Huffman Codes

Now we consider a particular prefix code for a probability distribution, which will turn out to be optimal. The idea is essentially to construct a prefix code tree (as in the proofs of Kraft's inequality) from the bottom up. We first look at the case D = 2 (binary codes).

Huffman Procedure:

1. Order the symbols in decreasing order of probabilities, so that x_1, ..., x_m satisfies p_1 >= p_2 >= ... >= p_m.

2. Merge the two symbols of lowest probability (the last two symbols) into a single tree, with x_{m-1} and x_m as its two leaves and associated probability p_{m-1} + p_m; denote this merged symbol x'_{m-1}. (The procedure then repeats on the reduced set of m - 1 symbols.)
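A compact sketch (ours) of this merging procedure using a heap; applied to the horse-race distribution of Week 1, it recovers an optimal code with expected length equal to the entropy, 2 bits:

    import heapq

    def huffman_code(probs):
        # binary Huffman code: repeatedly merge the two least probable nodes
        heap = [(p, i, {i: ""}) for i, p in enumerate(probs)]
        heapq.heapify(heap)
        tiebreak = len(probs)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)   # two smallest probabilities
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (p1 + p2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
    code = huffman_code(probs)
    print(code)
    print(sum(p * len(code[i]) for i, p in enumerate(probs)))  # 2.0 bits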


Maximum Likelihood Estimation and Complexity Regularization

Maximum Likelihood Estimation and Complexity Regularization ECE90 Sprig 004 Statistical Regularizatio ad Learig Theory Lecture: 4 Maximum Likelihood Estimatio ad Complexity Regularizatio Lecturer: Rob Nowak Scribe: Pam Limpiti Review : Maximum Likelihood Estimatio

More information

The Maximum-Likelihood Decoding Performance of Error-Correcting Codes

The Maximum-Likelihood Decoding Performance of Error-Correcting Codes The Maximum-Lielihood Decodig Performace of Error-Correctig Codes Hery D. Pfister ECE Departmet Texas A&M Uiversity August 27th, 2007 (rev. 0) November 2st, 203 (rev. ) Performace of Codes. Notatio X,

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Cooperative Communication Fundamentals & Coding Techniques

Cooperative Communication Fundamentals & Coding Techniques 3 th ICACT Tutorial Cooperative commuicatio fudametals & codig techiques Cooperative Commuicatio Fudametals & Codig Techiques 0..4 Electroics ad Telecommuicatio Research Istitute Kiug Jug 3 th ICACT Tutorial

More information

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering CEE 5 Autum 005 Ucertaity Cocepts for Geotechical Egieerig Basic Termiology Set A set is a collectio of (mutually exclusive) objects or evets. The sample space is the (collectively exhaustive) collectio

More information

Discrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22

Discrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22 CS 70 Discrete Mathematics for CS Sprig 2007 Luca Trevisa Lecture 22 Aother Importat Distributio The Geometric Distributio Questio: A biased coi with Heads probability p is tossed repeatedly util the first

More information

( ) = p and P( i = b) = q.

( ) = p and P( i = b) = q. MATH 540 Radom Walks Part 1 A radom walk X is special stochastic process that measures the height (or value) of a particle that radomly moves upward or dowward certai fixed amouts o each uit icremet of

More information

Lecture 7: Channel coding theorem for discrete-time continuous memoryless channel

Lecture 7: Channel coding theorem for discrete-time continuous memoryless channel Lecture 7: Chael codig theorem for discrete-time cotiuous memoryless chael Lectured by Dr. Saif K. Mohammed Scribed by Mirsad Čirkić Iformatio Theory for Wireless Commuicatio ITWC Sprig 202 Let us first

More information

UC Berkeley CS 170: Efficient Algorithms and Intractable Problems Handout 17 Lecturer: David Wagner April 3, Notes 17 for CS 170

UC Berkeley CS 170: Efficient Algorithms and Intractable Problems Handout 17 Lecturer: David Wagner April 3, Notes 17 for CS 170 UC Berkeley CS 170: Efficiet Algorithms ad Itractable Problems Hadout 17 Lecturer: David Wager April 3, 2003 Notes 17 for CS 170 1 The Lempel-Ziv algorithm There is a sese i which the Huffma codig was

More information

Lecture 12: November 13, 2018

Lecture 12: November 13, 2018 Mathematical Toolkit Autum 2018 Lecturer: Madhur Tulsiai Lecture 12: November 13, 2018 1 Radomized polyomial idetity testig We will use our kowledge of coditioal probability to prove the followig lemma,

More information

Math 525: Lecture 5. January 18, 2018

Math 525: Lecture 5. January 18, 2018 Math 525: Lecture 5 Jauary 18, 2018 1 Series (review) Defiitio 1.1. A sequece (a ) R coverges to a poit L R (writte a L or lim a = L) if for each ǫ > 0, we ca fid N such that a L < ǫ for all N. If the

More information

Lecture 2. The Lovász Local Lemma

Lecture 2. The Lovász Local Lemma Staford Uiversity Sprig 208 Math 233A: No-costructive methods i combiatorics Istructor: Ja Vodrák Lecture date: Jauary 0, 208 Origial scribe: Apoorva Khare Lecture 2. The Lovász Local Lemma 2. Itroductio

More information

Discrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 18

Discrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 18 EECS 70 Discrete Mathematics ad Probability Theory Sprig 2013 Aat Sahai Lecture 18 Iferece Oe of the major uses of probability is to provide a systematic framework to perform iferece uder ucertaity. A

More information

1 Convergence in Probability and the Weak Law of Large Numbers

1 Convergence in Probability and the Weak Law of Large Numbers 36-752 Advaced Probability Overview Sprig 2018 8. Covergece Cocepts: i Probability, i L p ad Almost Surely Istructor: Alessadro Rialdo Associated readig: Sec 2.4, 2.5, ad 4.11 of Ash ad Doléas-Dade; Sec

More information

4.1 Data processing inequality

4.1 Data processing inequality ECE598: Iformatio-theoretic methods i high-dimesioal statistics Sprig 206 Lecture 4: Total variatio/iequalities betwee f-divergeces Lecturer: Yihog Wu Scribe: Matthew Tsao, Feb 8, 206 [Ed. Mar 22] Recall

More information

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound Lecture 7 Ageda for the lecture Gaussia chael with average power costraits Capacity of additive Gaussia oise chael ad the sphere packig boud 7. Additive Gaussia oise chael Up to this poit, we have bee

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

Asymptotic Coupling and Its Applications in Information Theory

Asymptotic Coupling and Its Applications in Information Theory Asymptotic Couplig ad Its Applicatios i Iformatio Theory Vicet Y. F. Ta Joit Work with Lei Yu Departmet of Electrical ad Computer Egieerig, Departmet of Mathematics, Natioal Uiversity of Sigapore IMS-APRM

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

Notes 19 : Martingale CLT

Notes 19 : Martingale CLT Notes 9 : Martigale CLT Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces: [Bil95, Chapter 35], [Roc, Chapter 3]. Sice we have ot ecoutered weak covergece i some time, we first recall

More information

Random Variables, Sampling and Estimation

Random Variables, Sampling and Estimation Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig

More information

The Borel hierarchy classifies subsets of the reals by their topological complexity. Another approach is to classify them by size.

The Borel hierarchy classifies subsets of the reals by their topological complexity. Another approach is to classify them by size. Lecture 7: Measure ad Category The Borel hierarchy classifies subsets of the reals by their topological complexity. Aother approach is to classify them by size. Filters ad Ideals The most commo measure

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/5.070J Fall 203 Lecture 3 9//203 Large deviatios Theory. Cramér s Theorem Cotet.. Cramér s Theorem. 2. Rate fuctio ad properties. 3. Chage of measure techique.

More information

CS284A: Representations and Algorithms in Molecular Biology

CS284A: Representations and Algorithms in Molecular Biology CS284A: Represetatios ad Algorithms i Molecular Biology Scribe Notes o Lectures 3 & 4: Motif Discovery via Eumeratio & Motif Represetatio Usig Positio Weight Matrix Joshua Gervi Based o presetatios by

More information

Increasing timing capacity using packet coloring

Increasing timing capacity using packet coloring 003 Coferece o Iformatio Scieces ad Systems, The Johs Hopkis Uiversity, March 4, 003 Icreasig timig capacity usig packet colorig Xi Liu ad R Srikat[] Coordiated Sciece Laboratory Uiversity of Illiois e-mail:

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

Problem Set 2 Solutions

Problem Set 2 Solutions CS271 Radomess & Computatio, Sprig 2018 Problem Set 2 Solutios Poit totals are i the margi; the maximum total umber of poits was 52. 1. Probabilistic method for domiatig sets 6pts Pick a radom subset S

More information

Brief Review of Functions of Several Variables

Brief Review of Functions of Several Variables Brief Review of Fuctios of Several Variables Differetiatio Differetiatio Recall, a fuctio f : R R is differetiable at x R if ( ) ( ) lim f x f x 0 exists df ( x) Whe this limit exists we call it or f(

More information

Lecture 1: Basic problems of coding theory

Lecture 1: Basic problems of coding theory Lecture 1: Basic problems of codig theory Error-Correctig Codes (Sprig 016) Rutgers Uiversity Swastik Kopparty Scribes: Abhishek Bhrushudi & Aditya Potukuchi Admiistrivia was discussed at the begiig of

More information

Estimation for Complete Data

Estimation for Complete Data Estimatio for Complete Data complete data: there is o loss of iformatio durig study. complete idividual complete data= grouped data A complete idividual data is the oe i which the complete iformatio of

More information

Sequences and Series of Functions

Sequences and Series of Functions Chapter 6 Sequeces ad Series of Fuctios 6.1. Covergece of a Sequece of Fuctios Poitwise Covergece. Defiitio 6.1. Let, for each N, fuctio f : A R be defied. If, for each x A, the sequece (f (x)) coverges

More information

Lecture 10 October Minimaxity and least favorable prior sequences

Lecture 10 October Minimaxity and least favorable prior sequences STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least

More information

Generalized Semi- Markov Processes (GSMP)

Generalized Semi- Markov Processes (GSMP) Geeralized Semi- Markov Processes (GSMP) Summary Some Defiitios Markov ad Semi-Markov Processes The Poisso Process Properties of the Poisso Process Iterarrival times Memoryless property ad the residual

More information

10-704: Information Processing and Learning Spring Lecture 10: Feb 12

10-704: Information Processing and Learning Spring Lecture 10: Feb 12 10-704: Iformatio Processig ad Learig Sprig 2015 Lecture 10: Feb 12 Lecturer: Akshay Krishamurthy Scribe: Dea Asta, Kirthevasa Kadasamy Disclaimer: These otes have ot bee subjected to the usual scrutiy

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit theorems Throughout this sectio we will assume a probability space (Ω, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Seunghee Ye Ma 8: Week 5 Oct 28

Seunghee Ye Ma 8: Week 5 Oct 28 Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value

More information

Chapter 6 Principles of Data Reduction

Chapter 6 Principles of Data Reduction Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a

More information

Lecture 3 The Lebesgue Integral

Lecture 3 The Lebesgue Integral Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified

More information

ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory

ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory 1. Graph Theory Prove that there exist o simple plaar triagulatio T ad two distict adjacet vertices x, y V (T ) such that x ad y are the oly vertices of T of odd degree. Do ot use the Four-Color Theorem.

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

STAT Homework 1 - Solutions

STAT Homework 1 - Solutions STAT-36700 Homework 1 - Solutios Fall 018 September 11, 018 This cotais solutios for Homework 1. Please ote that we have icluded several additioal commets ad approaches to the problems to give you better

More information

CS 330 Discussion - Probability

CS 330 Discussion - Probability CS 330 Discussio - Probability March 24 2017 1 Fudametals of Probability 11 Radom Variables ad Evets A radom variable X is oe whose value is o-determiistic For example, suppose we flip a coi ad set X =

More information

Lecture 8: Convergence of transformations and law of large numbers

Lecture 8: Convergence of transformations and law of large numbers Lecture 8: Covergece of trasformatios ad law of large umbers Trasformatio ad covergece Trasformatio is a importat tool i statistics. If X coverges to X i some sese, we ofte eed to check whether g(x ) coverges

More information

If a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero?

If a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero? 2 Lebesgue Measure I Chapter 1 we defied the cocept of a set of measure zero, ad we have observed that every coutable set is of measure zero. Here are some atural questios: If a subset E of R cotais a

More information

7 Sequences of real numbers

7 Sequences of real numbers 40 7 Sequeces of real umbers 7. Defiitios ad examples Defiitio 7... A sequece of real umbers is a real fuctio whose domai is the set N of atural umbers. Let s : N R be a sequece. The the values of s are

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function.

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function. MATH 532 Measurable Fuctios Dr. Neal, WKU Throughout, let ( X, F, µ) be a measure space ad let (!, F, P ) deote the special case of a probability space. We shall ow begi to study real-valued fuctios defied

More information

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics for CS Spring 2008 David Wagner Note 22 CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig

More information

1 Review and Overview

1 Review and Overview DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,

More information

Learning Theory: Lecture Notes

Learning Theory: Lecture Notes Learig Theory: Lecture Notes Kamalika Chaudhuri October 4, 0 Cocetratio of Averages Cocetratio of measure is very useful i showig bouds o the errors of machie-learig algorithms. We will begi with a basic

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 6 9/24/2008 DISCRETE RANDOM VARIABLES AND THEIR EXPECTATIONS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 6 9/24/2008 DISCRETE RANDOM VARIABLES AND THEIR EXPECTATIONS MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 6 9/24/2008 DISCRETE RANDOM VARIABLES AND THEIR EXPECTATIONS Cotets 1. A few useful discrete radom variables 2. Joit, margial, ad

More information

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece 1, 1, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet

More information

ECE 564/645 - Digital Communication Systems (Spring 2014) Final Exam Friday, May 2nd, 8:00-10:00am, Marston 220

ECE 564/645 - Digital Communication Systems (Spring 2014) Final Exam Friday, May 2nd, 8:00-10:00am, Marston 220 ECE 564/645 - Digital Commuicatio Systems (Sprig 014) Fial Exam Friday, May d, 8:00-10:00am, Marsto 0 Overview The exam cosists of four (or five) problems for 100 (or 10) poits. The poits for each part

More information

Lecture 11: Pseudorandom functions

Lecture 11: Pseudorandom functions COM S 6830 Cryptography Oct 1, 2009 Istructor: Rafael Pass 1 Recap Lecture 11: Pseudoradom fuctios Scribe: Stefao Ermo Defiitio 1 (Ge, Ec, Dec) is a sigle message secure ecryptio scheme if for all uppt

More information

Notes on Information Theory by Jeff Steif

Notes on Information Theory by Jeff Steif Notes o Iformatio Theory by Jeff Steif 1 Etropy, the Shao-McMilla-Breima Theorem ad Data Compressio These otes will cotai some aspects of iformatio theory. We will cosider some REAL problems that REAL

More information

Lecture 4: April 10, 2013

Lecture 4: April 10, 2013 TTIC/CMSC 1150 Mathematical Toolkit Sprig 01 Madhur Tulsiai Lecture 4: April 10, 01 Scribe: Haris Agelidakis 1 Chebyshev s Iequality recap I the previous lecture, we used Chebyshev s iequality to get a

More information

ECE 901 Lecture 13: Maximum Likelihood Estimation

ECE 901 Lecture 13: Maximum Likelihood Estimation ECE 90 Lecture 3: Maximum Likelihood Estimatio R. Nowak 5/7/009 The focus of this lecture is to cosider aother approach to learig based o maximum likelihood estimatio. Ulike earlier approaches cosidered

More information

Sequences, Mathematical Induction, and Recursion. CSE 2353 Discrete Computational Structures Spring 2018

Sequences, Mathematical Induction, and Recursion. CSE 2353 Discrete Computational Structures Spring 2018 CSE 353 Discrete Computatioal Structures Sprig 08 Sequeces, Mathematical Iductio, ad Recursio (Chapter 5, Epp) Note: some course slides adopted from publisher-provided material Overview May mathematical

More information

Optimization Methods MIT 2.098/6.255/ Final exam

Optimization Methods MIT 2.098/6.255/ Final exam Optimizatio Methods MIT 2.098/6.255/15.093 Fial exam Date Give: December 19th, 2006 P1. [30 pts] Classify the followig statemets as true or false. All aswers must be well-justified, either through a short

More information

Math 2784 (or 2794W) University of Connecticut

Math 2784 (or 2794W) University of Connecticut ORDERS OF GROWTH PAT SMITH Math 2784 (or 2794W) Uiversity of Coecticut Date: Mar. 2, 22. ORDERS OF GROWTH. Itroductio Gaiig a ituitive feel for the relative growth of fuctios is importat if you really

More information

Output Analysis and Run-Length Control

Output Analysis and Run-Length Control IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3 MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special

More information

Math 216A Notes, Week 5

Math 216A Notes, Week 5 Math 6A Notes, Week 5 Scribe: Ayastassia Sebolt Disclaimer: These otes are ot early as polished (ad quite possibly ot early as correct) as a published paper. Please use them at your ow risk.. Thresholds

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2016 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

Intro to Learning Theory

Intro to Learning Theory Lecture 1, October 18, 2016 Itro to Learig Theory Ruth Urer 1 Machie Learig ad Learig Theory Comig soo 2 Formal Framework 21 Basic otios I our formal model for machie learig, the istaces to be classified

More information

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss ECE 90 Lecture : Complexity Regularizatio ad the Squared Loss R. Nowak 5/7/009 I the previous lectures we made use of the Cheroff/Hoeffdig bouds for our aalysis of classifier errors. Hoeffdig s iequality

More information