INFORMATION THEORY AND STATISTICS. Jüri Lember


INFORMATION THEORY AND STATISTICS
Lecture notes and exercises, Spring 2013
Jüri Lember

Literature:
1. T.M. Cover, J.A. Thomas, "Elements of information theory", Wiley, 1991 and 2006;
2. Yeung, Raymond W., "A first course in information theory", Kluwer, 2002;
3. Te Sun Han, Kingo Kobayashi, "Mathematics of information and coding", AMS, 1994;
4. Csiszár, I., Shields, P., "Information theory and statistics: a tutorial", 2004;
5. MacKay, D., "Information theory, inference and learning algorithms", Cambridge, 2004;
6. McEliece, R., "Information and coding", Cambridge, 2004;
7. Gray, R., "Entropy and information theory", Springer, 1990;
8. Gray, R., "Source coding theory", Kluwer, 1990;
9. Shields, P., "The ergodic theory of discrete sample paths", AMS, 1996;
10. Dembo, A., Zeitouni, O., "Large deviation techniques and applications", Springer.

1 Main concepts

1.1 (Shannon) entropy

In what follows, let X = {x_1, x_2, ...} be a discrete (finite or countably infinite) alphabet. Let X be a random variable taking values in X with distribution P. We shall denote p_i := P(X = x_i) = P(x_i). Thus, for every A ⊆ X,

P(A) = P(X ∈ A) = ∑_{i : x_i ∈ A} p_i = ∑_{x ∈ A} P(x).

Since X is fixed, the distribution P can be uniquely represented via the probabilities p_i, i.e. P = (p_1, p_2, ...). Recall that the support of P, denoted by X_P, is the set of letters having positive probability (atoms), i.e. X_P := {x ∈ X : P(x) > 0}. Also recall that for any g : X → R such that ∑_i p_i |g(x_i)| < ∞, the expectation of g(X) is defined as follows:

Eg(X) = ∑_i p_i g(x_i) = ∑_{x ∈ X} g(x)P(x) = ∑_{x ∈ X_P} g(x)P(x).    (1.1)

NB! In what follows, log := log_2 and 0 log 0 := 0.

1.1.1 Definition and elementary properties

Def 1.1 The (Shannon) entropy of the random variable X (distribution P) is

H(X) = −∑_i p_i log p_i = −∑_{x ∈ X} P(x) log P(x).

Remarks:
- H(X) depends on X via P only.
- By (1.1), H(X) = E(−log P(X)) = −E log P(X).
- The sum −∑_i p_i log p_i is always defined (since −p_i log p_i ≥ 0), but it can be infinite. Hence 0 ≤ H(X) ≤ ∞, and H(X) = 0 iff X = x a.s. for some letter x.
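The definition of entropy translates directly into a few lines of code. A minimal sketch in Python (the helper name `entropy` is our choice, not from the notes); zero atoms are skipped, which implements the convention 0 log 0 = 0:

```python
from math import log2

def entropy(p):
    """Shannon entropy H(P) in bits; atoms with zero probability are skipped (0 log 0 = 0)."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

# A fair coin carries exactly one bit of randomness,
# a degenerate distribution carries none.
print(entropy([0.5, 0.5]))
print(entropy([1.0, 0.0]))
```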

Entropy does not depend on the alphabet X; it depends only on the probabilities p_i. Hence we can also write H(p_1, p_2, ...).

In principle, any other logarithm log_b can be used in the definition of entropy. Such an entropy is denoted by H_b, i.e.

H_b(X) = −∑_i p_i log_b p_i = −∑_{x ∈ X} P(x) log_b P(x).

Since log_b p = (log_b a)(log_a p), it holds that H_b(X) = (log_b a) H_a(X), so that H_b(X) = (log_b 2) H(X) and H_e(X) = (ln 2) H(X). In information theory, typically log_2 is used, and such entropy is measured in bits. The entropy defined with ln is measured in nats; the entropy defined with log_10 is measured in dits.

The number −log p(x_i) can be interpreted as the amount of information one gets if X takes the value x_i. The smaller p(x_i), the bigger the amount of information. The entropy is thus the average amount of information or randomness X contains: the bigger H(X), the more random X is. The concept of entropy was introduced by C. Shannon in his seminal paper "A mathematical theory of communication" (1948).

Examples:
- Let X = {0, 1} and p = P(X = 1), i.e. X ~ B(1, p). Then

H(X) = −p log p − (1 − p) log(1 − p) =: h(p).

The function h(p) is called the binary entropy function. The function h(p) is concave, symmetric around 1/2 and has its maximum at p = 1/2:

h(1/2) = −(1/2) log(1/2) − (1/2) log(1/2) = log 2 = 1.

- Consider the distributions

P: P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/16, P(e) = 1/16;
Q: Q(a) = Q(b) = Q(c) = Q(d) = 1/4.

Then

H(P) = (1/2) log 2 + (1/4) log 4 + (1/8) log 8 + (1/16) log 16 + (1/16) log 16 = 15/8,
H(Q) = log 4 = 2.

Thus P is "less random", although its number of atoms (letters with positive probability) is bigger.
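The binary entropy function and the two example distributions can be verified numerically; `h` and `H` below are our own helper names:

```python
from math import log2

def h(p):
    """Binary entropy function h(p) = -p log p - (1-p) log(1-p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def H(dist):
    """Entropy of a distribution given as a dict letter -> probability."""
    return -sum(q * log2(q) for q in dist.values() if q > 0)

P = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/16, 'e': 1/16}
Q = {'a': 1/4, 'b': 1/4, 'c': 1/4, 'd': 1/4}

print(H(P))    # 15/8: P is "less random" despite having more atoms
print(H(Q))    # 2 bits
print(h(0.5))  # the maximum of the binary entropy function
```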

1.1.2 Axiomatic approach

The entropy has the grouping property

H(p_1, p_2, p_3, ...) = H(∑_{i=1}^k p_i, p_{k+1}, p_{k+2}, ...) + (∑_{i=1}^k p_i) H(p_1/∑_{i=1}^k p_i, ..., p_k/∑_{i=1}^k p_i).    (1.2)

The proof of (1.2) is Exercise 2. In a sense, grouping is a natural "additivity" property that a measure of information should have. It turns out that when X is finite, grouping together with symmetry and continuity implies entropy. More precisely, let for any m, P_m be the set of all probability measures on an m-dimensional alphabet, i.e.

P_m := { (p_1, ..., p_m) : p_i ≥ 0, ∑_{i=1}^m p_i = 1 }.

Suppose that for every m we have a function f_m : P_m → [0, ∞) that is a candidate for a measure of information. The function f_m is continuous if it is continuous with respect to all coordinates, and it is symmetric if its value is independent of the order of the arguments.

Theorem 1.2 Let, for every m, f_m : P_m → [0, ∞) be symmetric functions satisfying the following axioms:
A1 f_2 is normalized, i.e. f_2(1/2, 1/2) = 1;
A2 f_m is continuous for every m = 2, 3, ...;
A3 it has the grouping property: for every 1 < k < m,

f_m(p_1, p_2, ..., p_m) = f_{m−k+1}(∑_{i=1}^k p_i, p_{k+1}, ..., p_m) + (∑_{i=1}^k p_i) f_k(p_1/∑_{i=1}^k p_i, ..., p_k/∑_{i=1}^k p_i);

A4 for every m < n, it holds that f_m(1/m, ..., 1/m) ≤ f_n(1/n, ..., 1/n).

Then for every m,

f_m(p_1, ..., p_m) = −∑_{i=1}^m p_i log p_i.    (1.3)

Proof. Let, for every m,

g(m) := f_m(1/m, ..., 1/m).

By symmetry and applying A3 m times (grouping mn equal atoms into m blocks of n), we obtain

g(mn) = f_{mn}(1/mn, ..., 1/mn) = f_m(1/m, ..., 1/m) + ∑_{i=1}^m (1/m) f_n(1/n, ..., 1/n) = g(m) + g(n).

Hence, for integers n and k, g(n^k) = k g(n), and by A1, g(2^k) = k g(2) = k, i.e. g(2^k) = log(2^k) for every k. Using A4, it is possible to show that the equality above holds for every integer, i.e.

g(n) = log n,  n ∈ N.

Fix an arbitrary m and consider (p_1, ..., p_m), where all components are rational. Then there exist integers k_1, ..., k_m and a common denominator n such that p_i = k_i/n, i = 1, ..., m. In this case, grouping the n equal atoms into blocks of sizes k_1, ..., k_m,

g(n) = f_n(1/n, ..., 1/n) = f_m(k_1/n, ..., k_m/n) + ∑_{i=1}^m (k_i/n) f_{k_i}(1/k_i, ..., 1/k_i)
     = f_m(p_1, ..., p_m) + ∑_{i=1}^m p_i g(k_i) = f_m(p_1, ..., p_m) + ∑_{i=1}^m p_i log k_i.

Therefore,

f_m(p_1, ..., p_m) = log n − ∑_{i=1}^m p_i log k_i = −∑_{i=1}^m p_i log(k_i/n) = −∑_{i=1}^m p_i log p_i,

so that (1.3) holds when all p_i are rational. Now use the continuity of f_m to deduce that (1.3) always holds.

Remark: One can drop the axiom A4.

1.1.3 Entropy is strictly concave

Jensen's inequality. We shall often use Jensen's inequality. Recall that a function g : R → R is convex if for every x_1, x_2 and λ ∈ [0, 1] it holds that

g(λx_1 + (1 − λ)x_2) ≤ λ g(x_1) + (1 − λ) g(x_2).

A function g is strictly convex if (for x_1 ≠ x_2) equality holds only for λ = 1 or λ = 0. A function g is concave if −g is convex.

Theorem 1.3 (Jensen's inequality) Let g be a convex function and X a random variable such that E|g(X)| < ∞ and E|X| < ∞. Then

Eg(X) ≥ g(EX).    (1.4)

If g is strictly convex, then (1.4) is an equality if and only if X = EX a.s.

Mixture of distributions and the concavity of entropy. Let P_1 and P_2 be two distributions given on X. (Note that any two discrete distributions can be defined on a common alphabet, e.g. the union of their supports.) The mixture of P_1 and P_2 is their convex combination:

Q = λP_1 + (1 − λ)P_2,  λ ∈ (0, 1).

When X_1 ~ P_1, X_2 ~ P_2 and Z ~ B(1, λ), then the following random variable has the mixture distribution Q:

Y = X_1 if Z = 1,  X_2 if Z = 0.

Clearly Q contains the randomness of P_1 and P_2; in addition, Z is random.

Proposition 1.1 Entropy is strictly concave, i.e.

H(Q) ≥ λH(P_1) + (1 − λ)H(P_2),

and the inequality is strict except when P_1 = P_2. When the supports X_{P_1} and X_{P_2} are disjoint, then

H(Q) = λH(P_1) + (1 − λ)H(P_2) + h(λ).    (1.5)

Proof. The function f(y) = −y log y is strictly concave (y ≥ 0). Thus, for every x ∈ X,

−λP_1(x) log P_1(x) − (1 − λ)P_2(x) log P_2(x) = λ f(P_1(x)) + (1 − λ) f(P_2(x)) ≤ f(λP_1(x) + (1 − λ)P_2(x)) = −Q(x) log Q(x).

Sum over X to get λH(P_1) + (1 − λ)H(P_2) ≤ H(Q). The inequality is strict when there is at least one x ∈ X with P_1(x) ≠ P_2(x). The proof of (1.5) is Exercise 5.

Example: Let P_1 = B(1, p_1) and P_2 = B(1, p_2) (both Bernoulli distributions). Then the mixture λP_1 + (1 − λ)P_2 is B(1, λp_1 + (1 − λ)p_2). The concavity of entropy implies that the binary entropy function h(p) is strictly concave: h(λp_1 + (1 − λ)p_2) ≥ λh(p_1) + (1 − λ)h(p_2).

1.2 Joint entropy

Let X and Y be random variables taking values in discrete alphabets X and Y, respectively. Then (X, Y) is a random vector with support in X × Y = {(x, y) : x ∈ X, y ∈ Y}. Let P be the (joint) distribution of (X, Y), a probability measure on X × Y. Denote

p_ij := P(x_i, y_j) = P((X, Y) = (x_i, y_j)) = P(X = x_i, Y = y_j).

The joint distribution is often represented by the following table.
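Proposition 1.1 and formula (1.5) can be sanity-checked on a concrete mixture. The distributions below are our own toy choices; P_1 and P_2 are given disjoint supports so that (1.5) holds exactly:

```python
from math import log2

def H(p):
    """Entropy of a probability vector, in bits."""
    return -sum(x * log2(x) for x in p if x > 0)

def h(lam):
    """Binary entropy function."""
    return H([lam, 1 - lam])

lam = 0.3
P1 = [0.6, 0.4, 0.0, 0.0]   # supported on the first two letters
P2 = [0.0, 0.0, 0.7, 0.3]   # disjoint support: the last two letters
Q = [lam * a + (1 - lam) * b for a, b in zip(P1, P2)]  # the mixture

lhs = H(Q)
rhs = lam * H(P1) + (1 - lam) * H(P2)
print(lhs >= rhs)                         # concavity of entropy
print(abs(lhs - (rhs + h(lam))) < 1e-9)   # equality (1.5) for disjoint supports
```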

X\Y    y_1                  y_2                  ...   y_j   ...
x_1    P(x_1, y_1) = p_11   P(x_1, y_2) = p_12   ...   p_1j  ...   ∑_j p_1j = P(x_1)
x_2    P(x_2, y_1) = p_21   P(x_2, y_2) = p_22   ...   p_2j  ...   ∑_j p_2j = P(x_2)
...
x_i    p_i1                 p_i2                 ...   p_ij  ...   ∑_j p_ij = P(x_i)
...
       ∑_i p_i1 = P(y_1)    ∑_i p_i2 = P(y_2)    ...   ∑_i p_ij = P(y_j)   ...

In the table and in what follows (with some abuse of notation), P(x) := P(X = x) and P(y) := P(Y = y) denote the marginal laws. The random variables X and Y are independent if and only if

P(x, y) = P(x)P(y)  ∀x ∈ X, y ∈ Y.

The random vector (X, Y) can be considered as a random variable in the product alphabet X × Y, and the entropy of such a random variable is

H(X, Y) = −∑_{ij} p_ij log p_ij = −∑_{(x,y) ∈ X×Y} P(x, y) log P(x, y) = −E log P(X, Y).    (1.6)

Def 1.4 The entropy H(X, Y) as defined in (1.6) is called the joint entropy of (X, Y).

Independent X and Y. When X and Y are independent, then

H(X, Y) = −∑_{(x,y) ∈ X×Y} P(x, y) log P(x, y) = −∑_{(x,y) ∈ X×Y} P(x)P(y)(log P(x) + log P(y))
        = −∑_{x ∈ X} P(x) log P(x) − ∑_{y ∈ Y} P(y) log P(y) = H(X) + H(Y).

The argument above can be restated as follows. For every x ∈ X and y ∈ Y it holds that log P(x, y) = log P(x) + log P(y), so that

log P(X, Y) = log P(X) + log P(Y).

Expectation is linear, hence

H(X, Y) = −E log P(X, Y) = −E(log P(X) + log P(Y)) = −E log P(X) − E log P(Y) = H(X) + H(Y).

The joint entropy of several random variables. By analogy, the joint entropy of several random variables X_1, ..., X_n is defined as

H(X_1, ..., X_n) := −E log P(X_1, ..., X_n).

When all the random variables are independent, then H(X_1, ..., X_n) = ∑_{i=1}^n H(X_i).
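The additivity H(X, Y) = H(X) + H(Y) for independent X and Y is easy to confirm on a product distribution; the marginals below are arbitrary choices of ours:

```python
from math import log2

def H(p):
    """Entropy of a probability vector, in bits."""
    return -sum(x * log2(x) for x in p if x > 0)

def H_joint(pxy):
    """Joint entropy of a table of probabilities p_ij."""
    return -sum(p * log2(p) for row in pxy for p in row if p > 0)

px = [0.5, 0.5]
py = [0.25, 0.75]
pxy = [[a * b for b in py] for a in px]  # independent product distribution

print(abs(H_joint(pxy) - (H(px) + H(py))) < 1e-12)  # additivity under independence
```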

1.3 Conditional entropy

1.3.1 Definition

Let x be such that P(x) > 0. Then define the conditional probabilities

P(y|x) := P(Y = y|X = x) = P(x, y)/P(x).

The conditional distribution of Y given X = x is

y_1         y_2         y_3         ...
P(y_1|x)    P(y_2|x)    P(y_3|x)    ...

The entropy of that distribution is

H(Y|x) := H(Y|X = x) := −∑_{y ∈ Y} P(y|x) log P(y|x).

Consider the function x ↦ H(Y|x). Applying it to the random variable X ~ P, we get a new random variable (a function of X) with distribution

H(Y|x_1)    H(Y|x_2)    H(Y|x_3)    ...
P(x_1)      P(x_2)      P(x_3)      ...

and expectation ∑_{x ∈ X_P} H(Y|x)P(x).

Def 1.5 The conditional entropy of Y given X ~ P is

H(Y|X) := ∑_{x ∈ X_P} H(Y|x)P(x) = −∑_{x ∈ X_P} P(x) ∑_{y ∈ Y} P(y|x) log P(y|x)
        = −∑_{x ∈ X_P} ∑_{y ∈ Y} P(x, y) log P(y|x) = −E log P(Y|X).

Remarks:
- When X and Y are independent, then P(y|x) = P(y) for all x ∈ X_P, y ∈ Y, so that H(Y|X) = H(Y).
- In general H(X|Y) ≠ H(Y|X) (take independent X, Y such that H(X) ≠ H(Y)).
- H(Y|X) = 0 iff Y = f(X) for some function f. Indeed, H(Y|X) = 0 iff H(Y|X = x) = 0 for every x ∈ X_P. Hence there exists f(x) such that P(Y = f(x)|X = x) = 1, i.e. Y = f(X).

Joint entropy for more than two random variables. Let X, Y, Z be random variables with supports X, Y and Z. Considering the vector (X, Y) (or the vector (Y, Z)) as a random variable, we have

H(X, Y|Z) := −∑_{z ∈ Z} P(z) ∑_{(x,y) ∈ X×Y} P(x, y|z) log P(x, y|z) = −∑_{(x,y,z) ∈ X×Y×Z} P(x, y, z) log P(x, y|z) = −E log P(X, Y|Z),

H(X|Y, Z) := −∑_{(y,z) ∈ Y×Z} P(y, z) ∑_{x ∈ X} P(x|y, z) log P(x|y, z) = −∑_{(x,y,z) ∈ X×Y×Z} P(x, y, z) log P(x|y, z) = −E log P(X|Y, Z).

Moreover, given any set X_1, ..., X_n of random variables, one can similarly define conditional entropies H(X_1, ..., X_j | X_{j+1}, ..., X_n).

1.3.2 Chain rules for entropy

Lemma 1.1 (Chain rule) Let X_1, ..., X_n be random variables. Then

H(X_1, ..., X_n) = H(X_1) + H(X_2|X_1) + H(X_3|X_1, X_2) + ··· + H(X_n|X_1, ..., X_{n−1}).

Proof. For any (x_1, ..., x_n) such that P(x_1, ..., x_n) > 0, it holds that

P(x_1, ..., x_n) = P(x_1)P(x_2|x_1)P(x_3|x_1, x_2) ··· P(x_n|x_1, ..., x_{n−1}),

so that

H(X_1, ..., X_n) = −E log P(X_1, ..., X_n) = −E log P(X_1) − E log P(X_2|X_1) − ··· − E log P(X_n|X_1, ..., X_{n−1})
                 = H(X_1) + H(X_2|X_1) + ··· + H(X_n|X_1, ..., X_{n−1}).

In particular, for any random vector (X, Y),

H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).

Lemma 1.2 (Chain rule for conditional entropy) Let X_1, ..., X_n, Z be random variables. Then

H(X_1, ..., X_n|Z) = H(X_1|Z) + H(X_2|X_1, Z) + H(X_3|X_1, X_2, Z) + ··· + H(X_n|X_1, ..., X_{n−1}, Z).
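The chain rule H(X, Y) = H(X) + H(Y|X) of Lemma 1.1 can be checked on a small dependent joint distribution (the numbers below are our own toy example):

```python
from math import log2

# Joint distribution of (X, Y) as a dict (x, y) -> probability; X and Y are dependent.
pxy = {('a', 0): 0.3, ('a', 1): 0.2, ('b', 0): 0.1, ('b', 1): 0.4}

def H(dist):
    """Entropy of a distribution given as a dict, in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Marginal distribution of X.
px = {}
for (x, y), p in pxy.items():
    px[x] = px.get(x, 0.0) + p

# H(Y|X) = -sum_{x,y} P(x,y) log P(y|x)
HYgX = -sum(p * log2(p / px[x]) for (x, y), p in pxy.items())

print(abs(H(pxy) - (H(px) + HYgX)) < 1e-12)  # chain rule H(X,Y) = H(X) + H(Y|X)
```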

Proof. For every (x_1, ..., x_n, z) such that P(x_1, ..., x_n, z) > 0, it holds that

P(x_1, ..., x_n|z) = P(x_1|z)P(x_2|x_1, z)P(x_3|x_2, x_1, z) ··· P(x_n|x_1, ..., x_{n−1}, z),

so that

−log P(X_1, ..., X_n|Z) = −log P(X_1|Z) − log P(X_2|X_1, Z) − ··· − log P(X_n|X_1, ..., X_{n−1}, Z).

Now take expectations. In particular, for any random vector (X, Y, Z),

H(X, Y|Z) = H(X|Z) + H(Y|X, Z) = H(Y|Z) + H(X|Y, Z).

1.4 Kullback-Leibler distance

1.4.1 Definition

NB! In what follows, 0 log(0/q) := 0 and p log(p/0) := ∞ if p > 0.

Def 1.6 Let P and Q be two distributions on X. The Kullback-Leibler distance (Kullback-Leibler divergence, relative entropy, informational divergence) between the probability distributions P and Q is defined as

D(P‖Q) := ∑_{x ∈ X} P(x) log(P(x)/Q(x)).    (1.7)

When X ~ P, then

D(P‖Q) = E log(P(X)/Q(X)).

When X ~ P and Y ~ Q, then D(X‖Y) := D(P‖Q).

Def 1.7 Let, for any x ∈ X, P(y|x) and Q(y|x) be two (conditional) probability distributions on Y, and let P(x) be a probability distribution on X. The conditional Kullback-Leibler distance is the K-L distance of P(y|x) and Q(y|x) averaged over P:

D(P(y|x)‖Q(y|x)) := ∑_x P(x) ∑_y P(y|x) log(P(y|x)/Q(y|x)) = ∑_x ∑_y P(x, y) log(P(y|x)/Q(y|x)) = E log(P(Y|X)/Q(Y|X)),

where P(x, y) := P(y|x)P(x) and (X, Y) ~ P(x, y).

Remarks:
- Note that log(P(x)/Q(x)) is not always non-negative, so in the case of infinite X we have to show that the sum of the series in (1.7) is defined. Let us do it. Define

X^+ := {x ∈ X : P(x)/Q(x) > 1},  X^− := {x ∈ X : P(x)/Q(x) ≤ 1}.

The series over X^− is absolutely convergent:

∑_{x ∈ X^−} P(x) |log(P(x)/Q(x))| = ∑_{x ∈ X^−} P(x) log(Q(x)/P(x)) ≤ ∑_{x ∈ X^−} P(x) (Q(x)/P(x)) = ∑_{x ∈ X^−} Q(x) ≤ 1.

If ∑_{x ∈ X^+} P(x) log(P(x)/Q(x)) < ∞, then the series (1.7) is convergent; otherwise its sum is ∞.
- As we shall show below, D(P‖Q) ≥ 0 with equality only if P = Q. However, in general D(P‖Q) ≠ D(Q‖P). Hence the K-L distance is not a metric (a true "distance"). Moreover, it does not satisfy the triangle inequality (Exercise 7).
- The K-L distance measures the amount of "average surprise" that a distribution P provides us when we believe that the distribution is Q. If there is an x' ∈ X such that Q(x') = 0 (we believe x' never occurs) but P(x') > 0 (it still happens sometimes), then

P(x') log(P(x')/Q(x')) = ∞,

implying that D(P‖Q) = ∞. This matches the intuition: seeing an impossible event happen is extremely surprising (a miracle). On the other hand, if there is a letter x' ∈ X such that Q(x') > 0 (we believe it might happen) but P(x') = 0 (it actually never happens), then

P(x') log(P(x')/Q(x')) = 0.

This also matches the intuition: we are not largely surprised if something that might happen actually never does. From this point of view the asymmetry of the K-L distance is rather natural.

Example: Let P = B(1, 1/2) and Q = B(1, q). Then

D(P‖Q) = (1/2) log(1/(2q)) + (1/2) log(1/(2(1 − q))) = −(1/2) log(4q(1 − q)),  which is ∞ if q ∈ {0, 1};
D(Q‖P) = q log(2q) + (1 − q) log(2(1 − q)) = 1 − h(q).
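The example above can be computed directly; `D` below is our helper name for the K-L distance in bits, and the run exhibits the asymmetry:

```python
from math import log2

def D(p, q):
    """Kullback-Leibler distance D(P||Q) in bits; infinite if P puts mass where Q does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return float('inf')
            total += pi * log2(pi / qi)
    return total

q = 0.25
P = [0.5, 0.5]       # B(1, 1/2)
Q = [1 - q, q]       # B(1, q)

print(D(P, Q))            # equals -(1/2) log(4 q (1-q))
print(D(Q, P))            # equals 1 - h(q)
print(D(P, Q) == D(Q, P)) # False: the K-L distance is not symmetric
```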

1.4.2 The K-L distance is non-negative: Gibbs' inequality and its consequences

Proposition 1.2 (Gibbs' inequality) D(P‖Q) ≥ 0, with equality iff P = Q.

Proof. When D(P‖Q) = ∞, the inequality trivially holds. Hence consider the situation D(P‖Q) < ∞, i.e. the series (1.7) converges absolutely (when X is infinite). Let X ~ P. Define Y := Q(X)/P(X) and let g(x) := −log x. Note that g is strictly convex. We shall apply Jensen's inequality. Let us first convince ourselves that all expectations exist:

E|g(Y)| = ∑_{x ∈ X_P} |log(Q(x)/P(x))| P(x) = ∑_{x ∈ X_P} |log(P(x)/Q(x))| P(x) < ∞,
E|Y| = ∑_{x ∈ X_P} (Q(x)/P(x)) P(x) = ∑_{x ∈ X_P} Q(x) ≤ 1.

By Jensen's inequality,

D(P‖Q) = E log(P(X)/Q(X)) = E(−log(Q(X)/P(X))) = Eg(Y) ≥ g(EY) ≥ −log 1 = 0,

with D(P‖Q) = 0 if and only if Y = 1 a.s., i.e. Q(x) = P(x) for every x ∈ X_P. This implies Q(x) = P(x) for every x ∈ X.

Corollary 1.1 (log-sum inequality) Let a_1, a_2, ... and b_1, b_2, ... be non-negative numbers such that ∑ a_i < ∞ and 0 < ∑ b_i < ∞. Then

∑_i a_i log(a_i/b_i) ≥ (∑_i a_i) log(∑_i a_i / ∑_i b_i),    (1.8)

with equality iff a_i/b_i = c for all i.

Proof. Let

ā_i = a_i/∑_j a_j,  b̄_i = b_i/∑_j b_j.

Hence (ā_1, ā_2, ...) and (b̄_1, b̄_2, ...) are probability measures, so from Gibbs' inequality it follows that

0 ≤ ∑_i ā_i log(ā_i/b̄_i) = (1/∑_j a_j) [ ∑_i a_i log(a_i/b_i) − (∑_i a_i) log(∑_j a_j/∑_j b_j) ].

Since ∑_i a_i log(∑_j a_j/∑_j b_j) < ∞, the inequality (1.8) follows. We know that D((ā_1, ā_2, ...)‖(b̄_1, b̄_2, ...)) = 0 iff ā_i = b̄_i. This, however, implies that a_i/b_i = ∑_j a_j/∑_j b_j =: c for all i.

Remark: Note that the log-sum inequality and Gibbs' inequality are equivalent.

From Gibbs' (or the log-sum) inequality it also follows that for finite X the distribution with the biggest entropy is the uniform one. Note that if U is the uniform distribution over X, then H(U) = log |X|.

Corollary 1.2 Let |X| < ∞. Then, for any distribution P, it holds that

H(P) ≤ log |X|,

with equality iff P is uniform over X.

Proof. Let U be the uniform distribution over X, i.e. U(x) = 1/|X| for all x ∈ X. Then

D(P‖U) = ∑_{x ∈ X} P(x) log(P(x)/U(x)) = log |X| − H(P) ≥ 0.

The equality holds iff U(x) = P(x) for every x ∈ X, i.e. P = U.

Pinsker's inequality. There are several ways to measure the distance between different probability measures on X. In statistics, a common measure is the so-called l_1 or total variation distance: for any two probability measures P_1 and P_2 on X,

‖P_1 − P_2‖ := ∑_{x ∈ X} |P_1(x) − P_2(x)|.

It is easy to see (Exercise 8) that

‖P_1 − P_2‖ = 2 sup_{B ⊆ X} |P_1(B) − P_2(B)| = 2(P_1(A) − P_2(A)) ≤ 2,    (1.9)

where

A := {x ∈ X : P_1(x) ≥ P_2(x)}.

The convergence in total variation, i.e. ‖P_n − P‖ → 0, implies that for every B ⊆ X, P_n(B) → P(B). In particular, for any x ∈ X, P_n(x) → P(x). On the other hand, it is possible to show (Scheffé's theorem) that the convergence P_n(x) → P(x) for every x implies ‖P_n − P‖ → 0. Thus

‖P_n − P‖ → 0  ⟺  P_n(x) → P(x) ∀x ∈ X.

In what follows, the convergence P_n → P is always meant in total variation. Note that for finite X this is equivalent to convergence in the usual (Euclidean) distance. Pinsker's inequality implies that convergence in K-L distance, i.e. D(P_n‖P) → 0 or D(P‖P_n) → 0, implies P_n → P.

Theorem 1.8 (Pinsker's inequality) For any two probability measures P_1 and P_2 on X, it holds that

D(P_1‖P_2) ≥ (1/(2 ln 2)) ‖P_1 − P_2‖².    (1.10)

The proof of Pinsker's inequality is based on the log-sum inequality.
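Pinsker's inequality (1.10) is easy to test numerically; the two distributions below are arbitrary choices of ours:

```python
from math import log, log2

def D(p, q):
    """K-L distance in bits (assumes q positive on the support of p)."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    """Total variation distance ||P1 - P2|| = sum_x |P1(x) - P2(x)|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

P1 = [0.1, 0.4, 0.5]
P2 = [0.25, 0.25, 0.5]

# Pinsker: D(P1||P2) >= ||P1 - P2||^2 / (2 ln 2)
print(D(P1, P2) >= tv(P1, P2) ** 2 / (2 * log(2)))  # True
```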

Convexity of the K-L distance. Let P_1, P_2, Q_1, Q_2 be distributions on X and consider the mixtures λP_1 + (1 − λ)P_2 and λQ_1 + (1 − λ)Q_2.

Corollary 1.3

D(λP_1 + (1 − λ)P_2 ‖ λQ_1 + (1 − λ)Q_2) ≤ λ D(P_1‖Q_1) + (1 − λ) D(P_2‖Q_2).    (1.11)

Proof. Fix x ∈ X. By the log-sum inequality,

λP_1(x) log(λP_1(x)/(λQ_1(x))) + (1 − λ)P_2(x) log((1 − λ)P_2(x)/((1 − λ)Q_2(x)))
≥ (λP_1(x) + (1 − λ)P_2(x)) log((λP_1(x) + (1 − λ)P_2(x))/(λQ_1(x) + (1 − λ)Q_2(x))).

Sum over X.

Take Q_1 = Q_2 = Q. Then from (1.11) it follows that the function P ↦ D(P‖Q) is convex. Similarly one gets that Q ↦ D(P‖Q) is convex. When they are finite, both functions are also strictly convex. Indeed:

D(P‖Q) = ∑ P(x) log P(x) − ∑ P(x) log Q(x) = −∑ P(x) log Q(x) − H(P).    (1.12)

The function P ↦ −∑_x P(x) log Q(x) is linear and P ↦ H(P) is strictly concave. The difference is thus strictly convex (when finite). From (1.12) also the strict convexity of Q ↦ D(P‖Q) follows.

Continuity of the K-L distance for finite X. In a finite-dimensional space, a finite convex function is continuous. Hence, if |X| < ∞ and the function P ↦ D(P‖Q) is finite (in an open set), then it is continuous (in that set). The same holds for the function Q ↦ D(P‖Q).

Example: The finiteness is important. Let X = {a, b}, and let for every n the measure P_n be such that P_n(a) = p_n, where p_n > 0 and p_n → 0. Let P(a) = 0. Clearly P_n → P, but for every n,

∞ = D(P_n‖P) ↛ D(P‖P) = 0.

Conditioning increases the K-L distance. Let, for every x ∈ X, P_1(y|x) and P_2(y|x) be conditional probability distributions, and let P(x) be a probability measure on X. Let P_i(y) := ∑_x P_i(y|x)P(x), where i = 1, 2. Then

D(P_1(y|x)‖P_2(y|x)) ≥ D(P_1‖P_2).    (1.13)

The proof of (1.13) is Exercise 16.

1.5 Mutual information

Let (X, Y) be a random vector with distribution P(x, y), (x, y) ∈ X × Y. As usual, let P(x) and P(y) be the marginal distributions, i.e. P(x) is the distribution of X and P(y) is the distribution of Y.

Def 1.9 The mutual information I(X; Y) of X and Y is the K-L distance between the joint distribution P(x, y) and the product distribution P(x)P(y):

I(X; Y) := ∑_{x,y} P(x, y) log(P(x, y)/(P(x)P(y))) = D(P(x, y)‖P(x)P(y)) = E log(P(X, Y)/(P(X)P(Y))).

Hence I(X; Y) is the K-L distance between (X, Y) and a vector (X', Y'), where X' and Y' are distributed as X and Y but, unlike X and Y, the random variables X' and Y' are independent.

Properties:
- I(X; Y) depends on the joint distribution P(x, y).
- 0 ≤ I(X; Y) ≤ ∞.
- Mutual information is symmetric: I(X; Y) = I(Y; X).
- I(X; Y) = 0 iff X, Y are independent.

The following relation is important:

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X).    (1.14)

For the proof, note

I(X; Y) = E log(P(X, Y)/(P(X)P(Y))) = E log(P(X|Y)P(Y)/(P(X)P(Y))) = E log(P(X|Y)/P(X))
        = E log P(X|Y) − E log P(X) = H(X) − H(X|Y).

By symmetry, the roles of X and Y can be exchanged. Hence the mutual information is the reduction of the randomness of X due to the knowledge of Y. When X and Y are independent, then H(X|Y) = H(X) and I(X; Y) = 0. On the other hand, when X = f(Y), then H(X|Y) = 0, so that I(X; Y) = H(X). In particular,

I(X; X) = H(X) − H(X|X) = H(X).

Therefore entropy is sometimes referred to as self-information.

Recall the chain rule: H(X|Y) = H(X, Y) − H(Y). Hence

I(X; Y) = H(X) + H(Y) − H(X, Y).    (1.15)

Conditioning reduces entropy:

H(X|Y) ≤ H(X),  because H(X) − H(X|Y) = I(X; Y) ≥ 0.

Recall H(X|Y) = ∑_y H(X|Y = y)P(y). The fact that this sum is smaller than H(X) does not imply that H(X|Y = y) ≤ H(X) for every y. As the following little counterexample shows, that need not be the case (check!):

Y\X   a      b
u     0      3/4
v     1/8    1/8

For any random vector (X_1, ..., X_n), it holds that

H(X_1, ..., X_n) ≤ ∑_{i=1}^n H(X_i),

with equality iff all components are independent. For the proof, use the chain rule

H(X_1, ..., X_n) = H(X_1) + H(X_2|X_1) + H(X_3|X_1, X_2) + ··· + H(X_n|X_1, ..., X_{n−1})

and apply the fact that conditioning reduces entropy.

Conditional mutual information. Let X, Y, Z be random variables, and let Z be the support of Z.

Def 1.10 The conditional mutual information of X, Y given Z is

I(X; Y|Z) := H(X|Z) − H(X|Y, Z) = E log(P(X|Y, Z)/P(X|Z)) = E log(P(X|Y, Z)P(Y|Z)/(P(X|Z)P(Y|Z))) = E log(P(X, Y|Z)/(P(X|Z)P(Y|Z)))
= ∑_{x,y,z} P(x, y, z) log(P(x, y|z)/(P(x|z)P(y|z))) = ∑_z P(z) ∑_{x,y} P(x, y|z) log(P(x, y|z)/(P(x|z)P(y|z)))
= ∑_z D(P(x, y|z)‖P(x|z)P(y|z)) P(z).
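The little counterexample can be checked in a few lines: on average conditioning reduces entropy, yet observing Y = v increases the uncertainty about X. The probabilities are the ones from the table; the helper names are ours:

```python
from math import log2

# Joint distribution from the counterexample: Y in {u, v}, X in {a, b}.
pxy = {('u', 'a'): 0.0, ('u', 'b'): 3/4, ('v', 'a'): 1/8, ('v', 'b'): 1/8}

def H(p):
    """Entropy of a probability vector, in bits."""
    return -sum(x * log2(x) for x in p if x > 0)

px = {'a': 1/8, 'b': 7/8}   # marginal of X
py = {'u': 3/4, 'v': 1/4}   # marginal of Y

HX = H(px.values())
HX_u = H([pxy[('u', x)] / py['u'] for x in 'ab'])  # H(X | Y = u) = 0
HX_v = H([pxy[('v', x)] / py['v'] for x in 'ab'])  # H(X | Y = v) = 1
HXgY = py['u'] * HX_u + py['v'] * HX_v

print(HXgY <= HX)  # True: on average, conditioning reduces entropy
print(HX_v > HX)   # True: yet a single observation can increase the uncertainty
```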

Properties:
- I(X; Y|Z) ≥ 0, with equality iff X and Y are conditionally independent:

P(x, y|z) = P(x|z)P(y|z),  ∀x ∈ X, y ∈ Y, z ∈ Z.    (1.16)

For the proof, note that I(X; Y|Z) = 0 iff for every z ∈ Z it holds that

D(P(x, y|z)‖P(x|z)P(y|z)) = 0.

This means conditional independence.
- The proof of the following equalities is Exercise 18:

I(X; X|Z) = H(X|Z),
I(X; Y|Z) = H(Y|Z) − H(Y|X, Z),
I(X; Y|Z) = H(X|Z) + H(Y|Z) − H(X, Y|Z).

In addition, the following equality holds:

I(X; Y|Z) = H(X, Z) + H(Y, Z) − H(X, Y, Z) − H(Z).    (1.17)

Chain rule for mutual information:

I(X_1, ..., X_n; Y) = I(X_1; Y) + I(X_2; Y|X_1) + I(X_3; Y|X_1, X_2) + ··· + I(X_n; Y|X_1, ..., X_{n−1}).

For the proof, use the chain rules for entropy and conditional entropy:

I(X_1, ..., X_n; Y) = H(X_1, ..., X_n) − H(X_1, ..., X_n|Y)
= H(X_1) + H(X_2|X_1) + ··· + H(X_n|X_1, ..., X_{n−1}) − H(X_1|Y) − H(X_2|X_1, Y) − ··· − H(X_n|X_1, ..., X_{n−1}, Y).

Chain rule for conditional mutual information:

I(X_1, ..., X_n; Y|Z) = I(X_1; Y|Z) + I(X_2; Y|X_1, Z) + ··· + I(X_n; Y|X_1, ..., X_{n−1}, Z).

The proof is similar.

1.6 Fano's inequality

Let X be an (unknown) random variable and X̂ a related random variable, an estimate of X. Let P_e := P(X ≠ X̂) be the probability of the mistake made by the estimation. If P_e = 0, then X = X̂ a.s., so that H(X|X̂) = 0. Therefore it is natural to expect that when P_e is small, then H(X|X̂) should also be small. Fano's inequality quantifies that idea.

Theorem 1.11 (Fano's inequality) Let X and X̂ be random variables on X. Then

H(X|X̂) ≤ h(P_e) + P_e log(|X| − 1),    (1.18)

where h is the binary entropy function.

Proof. Let

E = 1 if X̂ ≠ X,  0 if X̂ = X.

Hence E = I_{X̂ ≠ X} and E ~ B(1, P_e). The chain rule for entropy gives

H(E, X|X̂) = H(X|X̂) + H(E|X, X̂) = H(X|X̂),    (1.19)

because H(E|X, X̂) = 0 (why?). On the other hand,

H(E, X|X̂) = H(E|X̂) + H(X|E, X̂) ≤ H(E) + H(X|E, X̂) = h(P_e) + H(X|E, X̂).

Note

H(X|E, X̂) = ∑_{x ∈ X} P(X̂ = x, E = 1)H(X|X̂ = x, E = 1) + ∑_{x ∈ X} P(X̂ = x, E = 0)H(X|X̂ = x, E = 0).

Given X̂ = x and E = 0, we have X = x and then H(X|X̂ = x, E = 0) = 0, so

H(X|E, X̂) = ∑_{x ∈ X} P(X̂ = x, E = 1)H(X|X̂ = x, E = 1).

If E = 1 and X̂ = x, then X ∈ X\{x}, so that H(X|X̂ = x, E = 1) ≤ log(|X| − 1). To summarize: H(X|E, X̂) ≤ P_e log(|X| − 1). From (1.19) we obtain

H(X|X̂) ≤ P_e log(|X| − 1) + h(P_e).
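Fano's inequality can be verified on any joint distribution of (X̂, X); the numbers below are our own toy example on a 3-letter alphabet:

```python
from math import log2

def h(p):
    """Binary entropy function."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Joint distribution of (Xhat, X) on a 3-letter alphabet (arbitrary numbers).
alphabet = ['a', 'b', 'c']
joint = {('a', 'a'): 0.30, ('a', 'b'): 0.10, ('a', 'c'): 0.10,
         ('b', 'a'): 0.05, ('b', 'b'): 0.20, ('b', 'c'): 0.05,
         ('c', 'a'): 0.05, ('c', 'b'): 0.05, ('c', 'c'): 0.10}

Pe = sum(p for (xh, x), p in joint.items() if xh != x)               # P(X != Xhat)
pxh = {a: sum(joint[(a, x)] for x in alphabet) for a in alphabet}    # marginal of Xhat
# H(X | Xhat) = -sum p(xhat, x) log p(x | xhat)
HXgXhat = -sum(p * log2(p / pxh[xh]) for (xh, x), p in joint.items() if p > 0)

print(HXgXhat <= h(Pe) + Pe * log2(len(alphabet) - 1))  # True: Fano's inequality (1.18)
```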

Corollary 1.4

H(X|X̂) ≤ 1 + P_e log |X|,  i.e.  P_e ≥ (H(X|X̂) − 1)/log |X|.

If |X| < ∞, then Fano's inequality implies: if P_e → 0, then H(X|X̂) → 0. When |X| = ∞, Fano's inequality is trivial and such an implication might not exist.

Example: Let Z ~ B(1, p) and let Y be a random variable such that Y > 0 and H(Y) = ∞. Define X as follows:

X = 0 if Z = 0,  Y if Z = 1.

Let X̂ = 0 a.s. Then P_e = P(X > 0) = P(X = Y) = P(Z = 1) = p. But

H(X|X̂) = H(X) ≥ H(X|Z) = pH(Y) = ∞.

Then for every p > 0, clearly H(X|X̂) = ∞, and therefore H(X|X̂) ↛ 0 when P_e → 0.

When is Fano's inequality an equality? Inspecting the proof reveals that equality holds iff for every x ∈ X,

H(X|X̂ = x, E = 1) = log(|X| − 1)    (1.20)

and

H(E|X̂) = H(E).    (1.21)

The equality (1.20) means that the conditional distribution of X given X̂ = x, X ≠ x is uniform over all of the remaining alphabet X\{x}. That, in turn, means that to every x_i ∈ X there corresponds a p_i such that P(X̂ = x_i, X = x_j) = p_i for every j ≠ i. In other words, the joint distribution of (X̂, X),

X̂\X   x_1                     x_2                     ...   x_n
x_1    P(X̂ = x_1, X = x_1)    P(X̂ = x_1, X = x_2)    ...   P(X̂ = x_1, X = x_n)
x_2    P(X̂ = x_2, X = x_1)    P(X̂ = x_2, X = x_2)    ...   P(X̂ = x_2, X = x_n)
...
x_n    P(X̂ = x_n, X = x_1)    ...                     ...   P(X̂ = x_n, X = x_n)

is such that in every row, all elements outside the main diagonal are equal (to a constant depending on the row). The relation (1.21) means that for every x ∈ X it holds that P(X ≠ x|X̂ = x) = P_e (in every row, the sum of the probabilities outside the main diagonal divided by the sum of the whole row equals P_e). A joint distribution satisfying both requirements (1.20) and (1.21) is, for example,

X̂\X   a       b       c
a      3/10    1/10    1/10
b      1/25    3/25    1/25
c      3/50    3/50    9/50

With this distribution, P_e = 2/5 and log(|X| − 1) = log 2 = 1, so that

P_e log(|X| − 1) + h(P_e) = 2/5 + (2/5) log(5/2) + (3/5) log(5/3) = log 5 − (3/5) log 3.

On the other hand,

H(X|X̂ = a) = H(X|X̂ = b) = H(X|X̂ = c) = H(3/5, 1/5, 1/5) = log 5 − (3/5) log 3,

implying that H(X|X̂) = log 5 − (3/5) log 3. Therefore Fano's inequality is an equality.

1.7 Data processing inequality

1.7.1 Finite Markov chains

Def 1.12 The random variables X_1, ..., X_n with supports X_1, ..., X_n form a Markov chain if for every x_i ∈ X_i and m = 2, ..., n − 1,

P(X_{m+1} = x_{m+1}|X_m = x_m, ..., X_1 = x_1) = P(X_{m+1} = x_{m+1}|X_m = x_m).    (1.22)

Thus X_1, ..., X_n is a Markov chain iff for every x_1, ..., x_n such that x_i ∈ X_i,

P(x_1, ..., x_n) = P(x_1, x_2)P(x_3|x_2) ··· P(x_n|x_{n−1}).

The fact that X_1, ..., X_n form a Markov chain is in information theory denoted by X_1 → X_2 → ··· → X_n. Thus X → Y → Z iff P(x, y, z) = P(x)P(y|x)P(z|y). We shall now list (without proofs) some elementary properties of Markov chains.

Properties:
- If X_1 → X_2 → ··· → X_n, then X_n → X_{n−1} → ··· → X_1 (a reversed MC is also an MC).
- Every sub-chain of a Markov chain is a Markov chain: if X_1 → X_2 → ··· → X_n, then X_{i_1} → X_{i_2} → ··· → X_{i_k} for any i_1 < i_2 < ··· < i_k.
- If X_1 → X_2 → ··· → X_n, then for every m < n and x_i ∈ X_i,

P(x_{m+1}, ..., x_n|x_m, ..., x_1) = P(x_{m+1}, ..., x_n|x_m).    (1.23)

- X_1 → ··· → X_n iff for every m = 2, ..., n − 1, the random variables X_1, ..., X_{m−1} and X_{m+1}, ..., X_n are conditionally independent given X_m: for every x_m ∈ X_m,

P(x_1, ..., x_{m−1}, x_{m+1}, ..., x_n|x_m) = P(x_1, ..., x_{m−1}|x_m)P(x_{m+1}, ..., x_n|x_m).    (1.24)

1.7.2 Data processing inequality

Lemma 1.3 (Data processing inequality) When X → Y → Z, then

I(X; Y) ≥ I(X; Z),

with equality iff X → Z → Y.

Proof. From X → Y → Z it follows that X and Z are conditionally independent given Y. This implies I(X; Z|Y) = 0, and from the chain rule for mutual information it follows that

I(X; Y, Z) = I(X; Z) + I(X; Y|Z) = I(X; Y) + I(X; Z|Y) = I(X; Y).    (1.25)

Since I(X; Y|Z) ≥ 0, we obtain I(X; Z) ≤ I(X; Y), and the equality holds iff I(X; Y|Z) = 0, i.e. the random variables X and Y are conditionally independent given Z. That means X → Z → Y.

Let X be an unknown random variable we are interested in. Instead of X, we know Y (the data), giving us I(X; Y) bits of information. Would it be possible to process the data so that the amount of information about X increases? The data can be processed deterministically by applying a deterministic function g, obtaining g(Y). Hence we have the Markov chain X → Y → g(Y), and from the data processing inequality I(X; Y) ≥ I(X; g(Y)) it follows that g(Y) does not give more information about X than Y does. Another possibility is to process Y by applying additional randomness independent of X. Since this additional randomness is independent of X, X → Y → Z is still a Markov chain, and by the data processing inequality I(X; Y) ≥ I(X; Z). Hence the data processing inequality postulates a well-known fact: it is not possible to increase information by processing the data.

Corollary 1.5 When X → Y → Z, then

H(X|Z) ≥ H(X|Y).

Proof. Exercise 23.

Corollary 1.6 When X → Y → Z, then

I(X; Z) ≤ I(Y; Z),  I(X; Y|Z) ≤ I(X; Y).

Proof. Exercise 24.

1.7.3 Sufficient statistics

Let {P_θ} be a family of probability distributions, a model. Let X^n be a random sample from the distribution P_θ. Recall that an n-element random sample can always be considered as a random variable taking values in X^n. Clearly the sample depends on the chosen distribution P_θ or, equivalently, on its index (parameter) θ. Let T(X^n) be any statistic (function of the sample) giving an estimate of the unknown parameter θ. Let us consider the Bayesian approach, where θ is a random variable with (prior) distribution π. Then θ → X^n → T(X^n) is a Markov chain, and from the data processing inequality,

I(θ; T(X^n)) ≤ I(θ; X^n).

When the inequality above is an equality, then T(X^n) gives as much information about θ as X^n, and we know that the equality implies θ → T(X^n) → X^n. By the definition of a Markov chain, then, for every sample x^n ∈ X^n,

P(X^n = x^n|T(X^n) = t, θ) = P(X^n = x^n|T(X^n) = t),

i.e. given the value of T(X^n), the distribution of the sample is independent of θ. In statistics, a statistic T(X^n) having such a property is called sufficient.

Corollary 1.7 A statistic T is sufficient iff for every distribution π of θ the following equality holds true:

I(θ; T(X^n)) = I(θ; X^n).

Example: Let {P_θ} be the family of all Bernoulli distributions. The statistic T(X^n) = ∑_{i=1}^n X_i is sufficient, because

P(X_1 = x_1, ..., X_n = x_n|T(X^n) = t, θ) = 0 if ∑_i x_i ≠ t,  1/(n choose t) if ∑_i x_i = t.

Indeed: if ∑_i x_i = t, then

P(X_1 = x_1, ..., X_n = x_n|T(X^n) = t, θ) = P(X_1 = x_1, ..., X_n = x_n, T(X^n) = t, θ)/P(T(X^n) = t, θ)
= θ^t(1 − θ)^{n−t}π(θ) / ∑_{x_1,...,x_n : ∑_i x_i = t} θ^t(1 − θ)^{n−t}π(θ) = 1/(n choose t),

because given the sum t (the number of ones) there are exactly (n choose t) possibilities for different samples.
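The sufficiency of T = ∑ X_i in the Bernoulli example can be confirmed by brute-force enumeration; the sample size and the value of θ below are arbitrary choices of ours:

```python
from itertools import product
from math import comb

# Given T = t, all samples with sum t should be equally likely, whatever theta is.
n, theta = 4, 0.3

def p_sample(x):
    """P(X_1 = x_1, ..., X_n = x_n) for an i.i.d. Bernoulli(theta) sample."""
    return theta ** sum(x) * (1 - theta) ** (n - sum(x))

t = 2
samples_t = [x for x in product([0, 1], repeat=n) if sum(x) == t]
pt = sum(p_sample(x) for x in samples_t)           # P(T = t)
cond = [p_sample(x) / pt for x in samples_t]       # P(X^n = x | T = t)

# Uniform over the (n choose t) samples, free of theta.
print(all(abs(c - 1 / comb(n, t)) < 1e-12 for c in cond))
```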

1.8 Entropy rate of a stochastic process

Let us consider a stochastic process {X_n}_{n=1}^∞.

Def 1.13 The entropy rate of a stochastic process {X_n}_{n=1}^∞ is

H_X := lim_n (1/n) H(X_1, ..., X_n),

provided the limit exists.

Examples:
- Let {X_n}_{n=1}^∞ be i.i.d. random variables with distribution P, i.e. X_i ~ P. Then

lim_n (1/n) H(X_1, ..., X_n) = lim_n (1/n) ∑_{i=1}^n H(X_i) = lim_n (n/n) H(P) = H(P).

Thus, in the i.i.d. case the entropy rate of the process equals the entropy of X_1.
- Let {X_n}_{n=1}^∞ be independent random variables. Then

H(X_1, ..., X_n) = ∑_{i=1}^n H(X_i).

The limit of (1/n) ∑_{i=1}^n H(X_i) need not always exist, so the entropy rate is not always defined for that process.
- Let X_1, X_2, ... be i.i.d. random variables, X_i ~ P, with X = Z. Consider the random walk {S_n}_{n=0}^∞ such that S_0 = 0, S_1 = X_1, S_2 = X_1 + X_2, ..., S_n = X_1 + ··· + X_n. The entropy rate of the random walk is H_S = H(P). The proof of that is Exercise 32.

The limit H'_X. Consider the limit (when it exists)

H'_X := lim_n H(X_n|X_1, ..., X_{n−1}).

We shall now show that for a large class of stochastic processes, called stationary processes, the limit H'_X always exists.

Def 1.14 A stochastic process {X_n}_{n=1}^∞ is stationary if for every n and every k the random vectors

(X_1, ..., X_n) and (X_{k+1}, ..., X_{k+n})

have the same distribution.

Hence, when {X_n}_{n=1}^∞ is stationary, all random variables X_1, X_2, ... have the same distribution, all two-dimensional random vectors (X_1, X_2), (X_2, X_3), ... have the same distribution, the vectors (X_1, X_2, X_3), (X_2, X_3, X_4), ... have the same distribution, etc.

Proposition 1.3 When {X_n}_{n=1}^∞ is stationary, the limit H'_X always exists.

Proof. Since {X_n}_{n=1}^∞ is stationary, for every n the random vectors (X_1, ..., X_n) and (X_2, ..., X_{n+1}) have the same distribution. Hence, for every n,

H(X_n|X_1, ..., X_{n−1}) = H(X_{n+1}|X_2, ..., X_n).

Therefore

H(X_{n+1}|X_1, ..., X_n) ≤ H(X_{n+1}|X_2, ..., X_n) = H(X_n|X_1, ..., X_{n−1}),

so that the sequence {H(X_n|X_1, ..., X_{n−1})} is non-negative and non-increasing. Such a sequence always has a limit.

Next we show that for a stationary process the entropy rate is always defined and equals H'_X. We need Cesàro's lemma.

Lemma 1.4 (Cesàro) Let {a_n} be non-negative real numbers with a_1 > 0 and ∑_n a_n = ∞. Denote b_n := ∑_{i=1}^n a_i. Let x_n → x be an arbitrary convergent sequence. Then

(1/b_n) ∑_{i=1}^n a_i x_i → x, when n → ∞.

In the special case a_n ≡ 1 we obtain (x_1 + ··· + x_n)/n → x.

Theorem 1.15 When {X_n}_{n=1}^∞ is a stationary process, then H_X always exists and H_X = H'_X.

Proof. From the chain rule for entropy,

(1/n) H(X_1, ..., X_n) = (1/n) ∑_{k=1}^n H(X_k|X_1, ..., X_{k−1}).

Use H(X_k|X_1, ..., X_{k−1}) → H'_X together with Cesàro's lemma to obtain

lim_n (1/n) H(X_1, ..., X_n) = lim_n (1/n) ∑_{k=1}^n H(X_k|X_1, ..., X_{k−1}) = H'_X.

Hence every stationary process has an entropy rate, and it equals H'_X. It might be 0 even if X_n is still random (can you find an example of such a process?). On the other hand, a non-stationary process might also have an entropy rate (which of the examples above was non-stationary?).

1.8.1 Entropy rate of a Markov chain

Determining the entropy rate of a stochastic process is, in general, not an easy task. In this subsection we find the entropy rate of a stationary Markov chain. Let {X_n}_{n=1}^∞ be a random process where all random variables X_i take values in a discrete alphabet X.

Def 1.16 The random process {X_n}_{n=1}^∞ is a Markov chain if for every m and x_1, ..., x_{m+1} ∈ X such that P(X_m = x_m, ..., X_1 = x_1) > 0, (1.22) holds, i.e.

P(X_{m+1} = x_{m+1}|X_m = x_m, ..., X_1 = x_1) = P(X_{m+1} = x_{m+1}|X_m = x_m).    (1.26)

In the terminology of Markov chains, the elements of X are called states, and the chain is called time-homogeneous if the right-hand side of the equality (1.26) is independent of m. In this case, for every m and x_i, x_j ∈ X,

P(X_{m+1} = x_j|X_m = x_i) = P(X_2 = x_j|X_1 = x_i) =: P_ij.

The matrix P = (P_ij) is the transition matrix of the time-homogeneous MC {X_n}. Let π(i) = π(x_i) be the initial distribution, i.e. the distribution of X_1. Then

P(X_2 = x_j) = ∑_{x_i ∈ X} P(X_2 = x_j|X_1 = x_i)P(X_1 = x_i) = ∑_i P_ij π(i),

so that the distribution of X_2 is π^T P. Similarly, the distribution of X_k is π^T P^{k−1}. Now it is not hard to see that the distribution of any finite vector (X_k, ..., X_{k+l}) is fully determined by the transition matrix P and the initial distribution π. The Markov chain {X_n} is stationary iff π is such that π^T P = π^T, i.e.

π(j) = ∑_i π(i)P_ij  ∀j.

Such an initial distribution (when it exists) is called a stationary initial distribution. Whether it exists and is unique depends on the transition matrix P.

Example: Let |X| = 2 and let the transition matrix be

( 1−α    α  )
(  β    1−β ).

The unique stationary initial distribution corresponding to that transition matrix is

( β/(α+β), α/(α+β) ).

Theorem 1.17 Let {X_n} be a stationary time-homogeneous Markov chain with transition matrix (P_ij) and (stationary) initial distribution π. Then

H_X = H(X_2|X_1) = −∑_i π(i) ∑_j P_ij log P_ij.
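Theorem 1.17 can be illustrated on the two-state chain from the example; the values of α and β below are arbitrary:

```python
from math import log2

# Two-state chain with transition matrix [[1-a, a], [b, 1-b]].
a, b = 0.2, 0.5
P = [[1 - a, a], [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]   # stationary initial distribution

# Entropy rate H_X = H(X_2 | X_1) = -sum_i pi(i) sum_j P_ij log P_ij
H_rate = -sum(pi[i] * sum(P[i][j] * log2(P[i][j]) for j in range(2) if P[i][j] > 0)
              for i in range(2))

def h(p):
    """Binary entropy function."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

# For a two-state chain the rate is a mixture of binary entropies of the rows.
print(abs(H_rate - (pi[0] * h(a) + pi[1] * h(b))) < 1e-12)
```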

Proof. From (1.26), we obtain that for every n,

H(X_n | X_1, ..., X_{n-1}) = H(X_n | X_{n-1}).

Since the chain is stationary, we get H(X_n | X_{n-1}) = H(X_2 | X_1), and by Theorem 1.5,

H_X = H̄_X = lim_n H(X_n | X_1, ..., X_{n-1}) = lim_n H(X_n | X_{n-1}) = H(X_2 | X_1).

The equation

H(X_2 | X_1) = -Σ_i π(i) Σ_j P_ij log P_ij

is Exercise 31.

1.9 Exercises

1. Let us toss a coin until the first head. Let X be the number of tosses needed. Find H(X), if the probability of head is p.

2. Prove the grouping property

H(p_1, p_2, p_3, ...) = H(p_1 + p_2, p_3, ...) + (p_1 + p_2) H( p_1/(p_1 + p_2), p_2/(p_1 + p_2) )

and deduce (1.2).

3. Let g : X → X be a function. Prove that

H(g(X)) ≤ H(X),  H(g(X) | Y) ≤ H(X | Y).

4. Find P such that H(P) = ∞.

5. Let X_1 and X_2 be random variables with disjoint supports. Let X have the mixture distribution, i.e.

X = X_1 if Z = 1,  X = X_2 if Z = 0,

where Z ~ B(1, p). Find H(X). Show that

2^{H(X)} ≤ 2^{H(X_1)} + 2^{H(X_2)}.

6. Let X ~ P. Show that

P( P(X) ≤ d ) log(1/d) ≤ H(X).

7. Find distributions P, Q and R such that D(P||Q) > D(P||R) + D(R||Q).

8. Prove (1.9).

9. Let

P = (p_1, p_2, ..., p_m, 0, 0, ...)

and for every ε > 0,

P_ε = ( (1-ε)p_1, ..., (1-ε)p_m, ε/M, ..., ε/M, 0, ... ),  (1.27)

where the mass ε is spread over M atoms and M = 2^{c/ε}, c > 0. Show that, as ε → 0,

H(P_ε) = (1-ε)H(P) + ε log_2 M + h(ε) → H(P) + c.

10. Let X be infinite. Define

P_n = ( 1 - α/log n, α/(n log n), ..., α/(n log n), 0, ... ),

where the mass α/log n is spread over n atoms and α > 0. Show that P_n → P, where P = (1, 0, ...), but H(P_n) → α. Let Q = (q_1, q_2, q_3, ...), where q_i = (1-q)q^{i-1}. Show that D(P||Q) < ∞, but D(P_n||Q) → ∞.

11. Let X = (X_1, ..., X_n) be a random vector, where X_i has a Bernoulli distribution for every i. The random variables X_i are neither independent nor identically distributed. Let R = (R_1, ..., R_k) be the run lengths of X. For example, if X = (1, 0, 0, 0, 1, 1, 0), then R = (1, 3, 2, 1). Show that

0 ≤ H(X) - H(R) ≤ min_i H(X_i).

12. Let X, Y be random variables, and let Z = X + Y. Show that H(Z | X) = H(Y | X). Show that when X and Y are independent, then H(X) ≤ H(Z) and H(Y) ≤ H(Z). Find X and Y such that H(X) > H(Z) and H(Y) > H(Z). When is H(Z) = H(X) + H(Y)?

13. Let ρ(X, Y) = H(X | Y) + H(Y | X). Show that ρ is a semi-metric. When is ρ(X, Y) = 0? Show that

ρ(X, Y) = H(X) + H(Y) - 2I(X; Y) = H(X, Y) - I(X; Y) = 2H(X, Y) - H(X) - H(Y).

14. Prove that for every n ≥ 2,

H(X_1, ..., X_n) ≥ Σ_{i=1}^n H(X_i | X_j, j ≠ i).

Show that

(1/2)[ H(X_1, X_2) + H(X_3, X_2) + H(X_1, X_3) ] ≥ H(X_1, X_2, X_3).

15. Let X, Y, Z be random variables, with Y and Z being independent. Show that

D(X||Y | Z) = H(X) - H(X | Z) + D(X||Y) = I(X; Z) + D(X||Y).

16. Using the log-sum inequality, prove (1.3).

17. (a) Let X_1 and X_2 have the same distribution. Let

ρ(X_1, X_2) := H(X_2 | X_1) / H(X_1). (1.28)

Prove that ρ is symmetric and ρ ∈ [0, 1]. When is ρ = 0? When is ρ = 1?

(b) Let (X, Y) have the following joint distribution, where ε ∈ (0, 1/4]:

Y \ X |    0    |    1
  0   | 1/2 - ε |    ε
  1   |    ε    | 1/2 - ε

Find I(X; Y) and ρ (as in (1.28)). Find cov(X, Y) and the correlation coefficient of X and Y. Note that when ε → 0, the limit of the correlation coefficient is 1.

(c) Let (X, Y) have the following joint distribution:

Y \ X |  0  |  1
  0   | 1/3 | 1/3
  1   |  0  | 1/3

Find I(X; Y) and ρ (as in (1.28)). Find cov(X, Y) and the correlation coefficient of X and Y.

18. Prove

I(X; X | Z) = H(X | Z)
I(X; Y | Z) = H(Y | Z) - H(Y | X, Z)
I(X; Y | Z) = H(X | Z) + H(Y | Z) - H(X, Y | Z)
I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z).

19. Prove

H(X, Y | Z) ≥ H(X | Z)
I(X, Y; Z) ≥ I(X; Z)
H(X, Y, Z) - H(X, Y) ≤ H(X, Z) - H(X).

When are the inequalities equalities?

20. Find X, Y, Z such that

I(X; Y | Z) > I(X; Y) = 0,
0 = I(X; Y | Z) < I(X; Y).

21. Prove that

I(X; Y | Z) = I(Y; Z | X) - I(Y; Z) + I(X; Y)

and that

H(X | g(Y)) ≥ H(X | Y).

Find (X, Y) such that X and Y are dependent, g is not one-to-one, but the inequality is an equality.

22. Let X = (X_1, ..., X_n) be a random vector with binary (0- or 1-valued) components having the following distribution:

P(x_1, ..., x_n) = 2^{-(n-1)}, when Σ_i x_i is even; 0, when Σ_i x_i is odd.

Find the distribution of X_i. Find the distribution of (X_i, X_{i+1}). Find I(X_1; X_2), I(X_2; X_3 | X_1), I(X_4; X_3 | X_1, X_2), ..., I(X_{n-1}; X_n | X_1, X_2, ..., X_{n-2}).

23. Prove that if X → Y → Z, then H(X | Z) ≥ H(X | Y), I(X; Z) ≤ I(Y; Z) and I(X; Y | Z) ≤ I(X; Y).

24. Let {P_θ} be a set of Bernoulli distributions, θ ∈ Θ, where Θ is a discrete set and π is a prior distribution of θ. Let X = (X_1, ..., X_n) be a random sample and T(X) = Σ_{i=1}^n X_i. Find H(θ | T(X)) and H(θ | X). Show that the data processing inequality is an equality.

25. Let X_1 → X_2 → X_3 → X_4. Prove I(X_1; X_4) ≤ I(X_2; X_3).

26. Let X_1 → X_2 → ... → X_n. Find I(X_1; X_2, X_3, ..., X_n).

27. Let X_1 → X_2 → X_3 be a Markov chain, where |X_1| = n, |X_2| = k, |X_3| = m, k < n and k < m. Show that the "bottleneck" decreases the mutual information between X_1 and X_3, i.e. I(X_1; X_3) ≤ log k. Show that when k = 1, then X_1 and X_3 are independent.

28. Let |X| = m and let X be a random variable taking values in X. Find a nonrandom estimate X̂ of X with the smallest error probability. Let P_e = P(X ≠ X̂). Find X such that Fano's inequality is an equality:

H(X) = P_e log(|X| - 1) + h(P_e).

29. Let P be a probability distribution with support X_P = {1, 2, ...}. Let μ be the mean of P. Prove that

H(P) ≤ μ log μ - (μ - 1) log(μ - 1),

with equality iff P has the geometric distribution. Hence, amongst such distributions, the geometric distribution has the biggest entropy.

30. Let {X_n}_{n=1}^∞ be a stationary random process. Prove

(1/n) H(X_1, ..., X_n) ≤ (1/(n-1)) H(X_1, ..., X_{n-1}),  (1/n) H(X_1, ..., X_n) ≥ H(X_n | X_1, ..., X_{n-1}).

31. Prove that for a stationary MC,

H(X_2 | X_1) = -Σ_i π(i) Σ_j P_ij log P_ij.

32. Let X_1, X_2, ... be i.i.d. random variables, X_i ~ P. Consider the random walk {S_n}_{n=0}^∞ s.t. S_0 = 0, S_1 = X_1, S_2 = X_1 + X_2, ..., S_n = X_1 + ... + X_n. Prove that the entropy rate of the random walk is H_S = H(P).

33. A dog walks on the integers: at time 0 it is at position 0. Then it starts to move, with probability 0.5 to the left and with the same probability to the right. Then it continues moving in the same direction, possibly reversing direction with probability 0.1. A typical walk might look like

(X_0, X_1, ...) = (0, 1, 2, 3, 4, 3, 2, 1, 0, 1, 2, 3, ...).

Find H_X.

34. Consider a random walk on the ring (0, 1, ..., l), i.e. l is followed by 0. Let

S_n = Σ_{i=1}^n X_i,

where X_1 has the uniform distribution on (0, 1, ..., l) and X_2, X_3, ... are i.i.d. random variables with P(X_2 = 1) = P(X_2 = 2) = 0.5. Find H_S.

2 Zero-error data compression

2.1 Codes

In this section, we suppose that besides our original alphabet X, we have another finite coding alphabet D. In what follows, |D| =: D, so that the alphabet D will be referred to as the D-ary alphabet, and without loss of generality we take D = {0, ..., D-1}. In the case D = 2, we thus speak about the binary alphabet {0, 1}, etc. The alphabet D is used in data transmission. Typically D < |X|, hence to transmit a letter x, it should be represented as a finite string of letters from D - a codeword. In what follows, let D* be the set of all finite-length strings (codewords) from D. Formally, thus

D* := ∪_{n=1}^∞ D^n,  X* := ∪_{n=1}^∞ X^n.

Def 2.1 A code is a mapping C : X → D*.

There are different codes. A classical example of a code is the Morse alphabet, where D consists of three elements: a dot, a dash and a letter space. (Actually, there is also a word space, but when coding letters only, it is not needed.) In Morse code, short codewords represent frequent letters (in English) and long sequences represent infrequent letters. This makes Morse code reasonably efficient but, as we shall see, it is not the most efficient (optimal) code. One can see this immediately by noticing that one of the three code-letters, the space, is used at the end of a word only.

Def 2.2 A code C is non-singular if it is injective, i.e. every element of X is mapped into a different codeword: if x_i ≠ x_j, then C(x_i) ≠ C(x_j).

Non-singularity is sufficient to decode single letters uniquely, but typically one needs to decode words, i.e. sequences of codewords. Then a stronger property is needed.

Def 2.3 The extension of a code C is the mapping C* from X* into D* defined as follows:

C* : X* → D*,  C*(x_1 ... x_n) := C(x_1) ... C(x_n).

Hence the extension of a code C is the concatenation of the codewords of letters to obtain a codeword for a word.

Def 2.4 A code C is uniquely decodable if its extension is non-singular.

Hence, if C is uniquely decodable, then to every codeword C(x_1) ... C(x_n) corresponds only one original word (source string) x_1 ... x_n. However, one may have to look at the entire string to determine even the first symbol in the corresponding source string.
It is natural to expect that the first letter x_1 can be decoded as soon as C(x_1) has been observed, i.e. that decoding can be performed "on-line". This means that C(x_1) cannot be the beginning (prefix) of any other codeword.

Def 2.5 A code C is a prefix code (prefix-free code, instantaneous code) if no codeword is a prefix of any other codeword, i.e. there are no different letters x_i and x_j such that C(x_i) is a prefix of C(x_j).

Clearly prefix codes are uniquely decodable, and uniquely decodable codes are non-singular.

Examples: Morse code is a prefix code, since every codeword ends with a space. Let X = {a, b, c, d} and consider the binary codes C_1, C_2, C_3 and C_4, represented in the table.

X | C_1 | C_2 | C_3 | C_4
a |  0  |  0  | 10  | 0
b |  0  | 010 | 00  | 10
c |  0  | 01  | 11  | 110
d |  0  | 10  | 110 | 111

Code C_1 is not non-singular; C_2 is non-singular but not uniquely decodable, since 010 could stand for the letter b as well as for the words ad and ca. Code C_3 is uniquely decodable but not a prefix code. Indeed, to figure out whether 110...0 is a codeword of cbb...b or dbb...b, one has to count all the 0's. Thus, one cannot decode the first letter before the whole codeword is read. This is so because the codeword C(c) = 11 is a prefix of the codeword C(d) = 110. Code C_4 is a prefix code, hence every letter can be decoded as soon as its codeword has been observed. Exercise: decode "on-line" a word coded with C_4.

2.2 Kraft inequality

Prefix code as a tree. Every prefix code can be represented as a D-ary tree, where every node has at most D children. To every branch of the tree corresponds a letter from D, to every leaf corresponds a letter from X, and the path from the root to the letter is the codeword of that letter (leaf). The length of that codeword is the length (or level) of that leaf.

Example: Let D = 3. Let us construct the code tree of a prefix code for the letters a, b, c, d, e, f, g, h. In what follows, given a code C, we shall denote by l(x) := |C(x)| the length of the codeword. In the example above, |X| = 8 and the lengths of the codewords in increasing order are l_1 = l_2 = 1, l_3 = 2, l_4 = l_5 = l_6 = l_7 = l_8 = 3.

It is clear that when C is a prefix code and can be represented as a tree, then the codeword lengths cannot be arbitrarily small. The Kraft inequality gives a nice bound.
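The "on-line" decoding of the prefix code C_4 can be sketched in a few lines; the sample word badca below is our own illustration. Each letter is emitted as soon as its codeword is complete, which is exactly the instantaneous property.

```python
# "On-line" decoding with the prefix code C_4 from the table:
# a -> 0, b -> 10, c -> 110, d -> 111.

C4 = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
decode_map = {w: x for x, w in C4.items()}

def decode_online(bits):
    out, buf = [], ''
    for bit in bits:
        buf += bit
        if buf in decode_map:          # a full codeword has been seen
            out.append(decode_map[buf])
            buf = ''
    assert buf == '', 'input ended in the middle of a codeword'
    return ''.join(out)

word = 'badca'
encoded = ''.join(C4[x] for x in word)
print(encoded, decode_online(encoded))
```

Running the same loop with the non-prefix code C_3 would fail, because a complete-looking buffer (such as 11) may still be the beginning of a longer codeword.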

Theorem 2.6 (Kraft inequality) Let C : X → D* be a prefix code and l_i = l(x_i). Then

Σ_i D^{-l_i} ≤ 1. (2.1)

Conversely, let {l_i}_{i=1}^{|X|} be integers that satisfy (2.1). Then there exists a prefix code C : X → D* such that l_i = l(x_i) for every x_i ∈ X.

Proof. Let us start with proving the first claim for the case |X| = m < ∞. Let l := max{l_1, ..., l_m} < ∞. Organize the code as a D-ary tree. A codeword at level l_i has D^{l - l_i} descendants at level l. All the descendant sets (corresponding to different l_i) must be disjoint. Therefore the total number of nodes in these sets (over all codewords) must be less than or equal to D^l:

Σ_{i=1}^m D^{l - l_i} ≤ D^l, i.e. Σ_{i=1}^m D^{-l_i} ≤ 1.

Let us now prove the same claim in the general case, where |X| ≤ ∞. Recall D = {0, ..., D-1} and consider the codeword d_1 d_2 ... d_{l_i}. Let 0.d_1 d_2 ... d_{l_i} be the real number having the D-ary expansion 0.d_1 d_2 ... d_{l_i}, i.e.

0.d_1 d_2 ... d_{l_i} = Σ_{j=1}^{l_i} d_j D^{-j}. (2.2)

Consider the interval (a sub-interval of [0, 1])

[ 0.d_1 d_2 ... d_{l_i}, 0.d_1 d_2 ... d_{l_i} + D^{-l_i} )

corresponding to the codeword d_1 d_2 ... d_{l_i}. To this interval belong all real numbers whose D-ary expansion begins with 0.d_1 d_2 ... d_{l_i}. Clearly the length of that interval is D^{-l_i}. Since C is a prefix code, the intervals corresponding to different codewords are disjoint. Since they are all sub-intervals of [0, 1], their lengths sum up to something less than or equal to 1. This means that (2.1) holds.

Let us prove the second statement: we are given the set {l_i}_{i=1}^{|X|} satisfying (2.1). We aim to construct a prefix code so that the codewords have lengths {l_i}. Since (2.1) holds, it is possible to divide the unit interval into disjoint subintervals with lengths D^{-l_i}. Indeed, order l_1 ≤ l_2 ≤ ... . Let the first interval be [0, D^{-l_1}), the second [D^{-l_1}, D^{-l_1} + D^{-l_2}), and so on. Thus the first interval corresponds to l_1. It begins with 0, which can be represented as

0.00...0 (l_1 zeros).

The first interval ends with D^{-l_1}, with D-ary expansion

0.00...01 (the digit 1 in position l_1).

Clearly the first interval consists of those real numbers whose D-ary expansion begins with 0.00...0 (with l_1 zeros). The second interval corresponds to l_2. We represent both D^{-l_1} and D^{-l_1} + D^{-l_2} as D-ary real numbers with l_2 digits after the point. Recall that l_2 ≥ l_1. If l_2 = l_1, then D^{-l_1} will be represented just like previously; otherwise it will be represented as

0.00...010...0, (2.3)

with the digit 1 in position l_1 followed by l_2 - l_1 zeros. Clearly one needs at most l_2 digits after the point to expand D^{-l_1} + D^{-l_2}. To this interval belong all real numbers whose D-ary expansion begins with (2.3). The beginning of the third interval (corresponding to l_3) can be represented as a D-ary number 0.d_1 d_2 ... d_{l_3}. Again, recall l_3 ≥ l_2, and if l_3 > l_2, then the last l_3 - l_2 digits of that representation are zero. The D-ary expansion of the endpoint of that interval, D^{-l_1} + D^{-l_2} + D^{-l_3}, obviously has at most l_3 digits after the point. We proceed similarly: the interval corresponding to l_i begins with D^{-l_1} + ... + D^{-l_{i-1}}. The D-ary expansion of that number has at most l_{i-1} digits after the point, and we use l_i digits, which is possible because l_i ≥ l_{i-1}. Hence the D-ary representation is 0.d_1 ... d_{l_i}. To this interval belong the real numbers whose D-ary expansion begins with that representation. To construct the code, assign to every l_i (to the letter x_i) the word d_1 ... d_{l_i} from the D-ary expansion of D^{-l_1} + ... + D^{-l_{i-1}} (the beginning of the interval). Since different codewords belong to different intervals, the obtained code is a prefix code.

Examples: Consider the code C_4. Then l_1 = 1, l_2 = 2, l_3 = l_4 = 3. Let us find the real numbers whose D-ary representations are 0.d_1 d_2 ... d_{l_i}. We obtain

0.0_2 = 0,  0.10_2 = 1/2 = 0.5,  0.110_2 = 1/2 + 1/4 = 0.75,  0.111_2 = 1/2 + 1/4 + 1/8 = 0.875.

Hence the intervals used in the first part of the proof are

[0, 0.5), [0.5, 0.75), [0.75, 0.875), [0.875, 1).

In this example, the Kraft inequality is an equality.

The converse: Let {1, 2, 3, 3} be the lengths of the codewords. The easiest way to construct the corresponding code is to construct a tree. The procedure used in the proof is as follows.
Let us construct the intervals:

[0, 1/2), [1/2, 1/2 + 1/4), [1/2 + 1/4, 1/2 + 1/4 + 1/8), [1/2 + 1/4 + 1/8, 1).

With binary representation, these intervals (recall that the number of digits after the point must be l_i) are

[0.0, 0.1), [0.10, 0.11), [0.110, 0.111), [0.111, 1).

Codewords: 0, 10, 110, 111.

Let the lengths of the codewords be {2, 2, 3, 3}. Note that the Kraft inequality is strict: 1/4 + 1/4 + 1/8 + 1/8 = 3/4 < 1. Intervals:

[0, 1/4), [1/4, 1/2), [1/2, 1/2 + 1/8), [1/2 + 1/8, 1/2 + 1/4).

With binary expansion these intervals are

[0.00, 0.01), [0.01, 0.10), [0.100, 0.101), [0.101, 0.110).

Codewords: 00, 01, 100, 101.

2.3 Expected length and entropy

Let us consider the case where the letters are chosen randomly according to a distribution P on X. In other words, we consider a random variable X ~ P. Given a code C, we are interested in the expected length of a codeword. Since l(x) is the length of the codeword C(x), the expected length of the code C is

L(C) = Σ_x l(x) P(x).

Example: Consider the code C_4. Let P(a) = 1/2, P(b) = 1/4, P(c) = P(d) = 1/8. Then

L(C_4) = 1·(1/2) + 2·(1/4) + 3·(1/8) + 3·(1/8) = 7/4.

Note that H(P) = 7/4.

Hence L is the average number of symbols we need to describe the outcome of X when the code C is used. Clearly, the smaller the expected length, the better the code. The expected length is obviously small when all codewords are short, i.e. l(x) is small for every x. On the other hand, we know that for a prefix code the lengths l(x) cannot be arbitrarily small, since they have to satisfy the Kraft inequality. But given lengths l(x) that satisfy the Kraft inequality, how should one choose the code with minimal expected length? We know how to find the codewords, but how do we assign these words to the letters x? Intuition correctly suggests that the expected length is small if the frequent (high-probability) letters have short codewords and the infrequent letters have longer ones. The Morse code follows the same principle, but since the symbol "space" is used only to mark the end of a word, one can figure out a three-letter prefix code with smaller expected length.
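The interval construction from the proof can be sketched directly in code. The function below is our own sketch (not part of the notes): it reads each codeword off the D-ary expansion of the cumulative sum D^(-l_1) + ... + D^(-l_(i-1)), and reproduces both worked examples above.

```python
# Construct a prefix code from codeword lengths (Kraft converse):
# given lengths with sum D^(-l_i) <= 1, each codeword is the first l_i
# D-ary digits of the left endpoint of its interval.

def kraft_code(lengths, D=2):
    lengths = sorted(lengths)
    assert sum(D ** -l for l in lengths) <= 1, 'Kraft inequality fails'
    codewords, start = [], 0.0
    for l in lengths:
        # D-ary expansion of `start` with exactly l digits after the point
        digits, x = [], start
        for _ in range(l):
            x *= D
            d = int(x)
            digits.append(str(d))
            x -= d
        codewords.append(''.join(digits))
        start += D ** -l           # move to the next interval
    return codewords

print(kraft_code([1, 2, 3, 3]))   # the lengths of C_4
print(kraft_code([2, 2, 3, 3]))   # strict Kraft inequality
```

For binary lengths the floating-point arithmetic is exact (all endpoints are dyadic rationals), so the two calls recover exactly the codewords 0, 10, 110, 111 and 00, 01, 100, 101 derived above.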

The next theorem provides a fundamental lower bound on the expected length of any prefix code. It turns out that for a D-ary code the expected length cannot be lower than H_D(P).

Theorem 2.7 Let C : X → D* be a prefix code. Then

L(C) ≥ H_D(P),

with equality if and only if l(x) = log_D (1/P(x)) for every x ∈ X.

Proof.

L(C) - H_D(P) = Σ_x l(x)P(x) - Σ_x P(x) log_D (1/P(x)) = -Σ_x P(x) log_D D^{-l(x)} + Σ_x P(x) log_D P(x).

Let

c := Σ_x D^{-l(x)},  R(x) := D^{-l(x)} / c.

Then

L(C) - H_D(P) = Σ_x P(x) log_D ( P(x) / R(x) ) - log_D c = D(P||R) + log_D (1/c) ≥ 0,

because D(P||R) ≥ 0 and, from the Kraft inequality, c ≤ 1, so log_D (1/c) ≥ 0. The inequality is an equality only if P = R and c = 1. This holds iff P(x) = D^{-l(x)} for every x ∈ X. A necessary condition is that log_D (1/P(x)) is an integer for every x ∈ X.

Optimal codes for D-adic distributions. The code with minimum expected length is called optimal. From the preceding theorem it follows that if P satisfies the condition

log_D (1/P(x)) ∈ Z for every x ∈ X, (2.4)

(such distributions are sometimes called D-adic), then an optimal prefix code is easy to construct: take l(x) = log_D (1/P(x)). The lengths l(x) satisfy the Kraft inequality (with equality), and the corresponding optimal code can be constructed by constructing the tree or by using the intervals as in the proof of the Kraft inequality. The expected length of such a code is H_D(P), and from the preceding theorem we know that it must be optimal.
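Theorem 2.7 can be checked on the dyadic example above, where P(a) = 1/2, P(b) = 1/4, P(c) = P(d) = 1/8 and C_4 has lengths 1, 2, 3, 3. The sketch below computes both sides of the bound; the alternative lengths (2, 2, 2, 2) are our own illustration of a strictly suboptimal choice.

```python
import math

# Checking L(C) >= H_D(P) (Theorem 2.7, here D = 2) on the dyadic
# distribution P = (1/2, 1/4, 1/8, 1/8) with codeword lengths of C_4.

def expected_length(P, lengths):
    return sum(p * l for p, l in zip(P, lengths))

def entropy(P, D=2):
    return -sum(p * math.log(p, D) for p in P if p > 0)

P = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]            # l(x) = log2(1/P(x)), so equality holds
print(expected_length(P, lengths), entropy(P))   # both equal 7/4

# Any other prefix-code lengths give a strictly larger expected length,
# e.g. the fixed-length choice (2, 2, 2, 2):
assert expected_length(P, [2, 2, 2, 2]) > entropy(P)
```

Since the distribution is dyadic, the equality case of the theorem is attained exactly, which is why L(C_4) = H(P) = 7/4 in the example of Section 2.3.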


More information

1 Approximating Integrals using Taylor Polynomials

1 Approximating Integrals using Taylor Polynomials Seughee Ye Ma 8: Week 7 Nov Week 7 Summary This week, we will lear how we ca approximate itegrals usig Taylor series ad umerical methods. Topics Page Approximatig Itegrals usig Taylor Polyomials. Defiitios................................................

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

PROBLEM SET 5 SOLUTIONS 126 = , 37 = , 15 = , 7 = 7 1.

PROBLEM SET 5 SOLUTIONS 126 = , 37 = , 15 = , 7 = 7 1. Math 7 Sprig 06 PROBLEM SET 5 SOLUTIONS Notatios. Give a real umber x, we will defie sequeces (a k ), (x k ), (p k ), (q k ) as i lecture.. (a) (5 pts) Fid the simple cotiued fractio represetatios of 6

More information

Random Models. Tusheng Zhang. February 14, 2013

Random Models. Tusheng Zhang. February 14, 2013 Radom Models Tusheg Zhag February 14, 013 1 Radom Walks Let me describe the model. Radom walks are used to describe the motio of a movig particle (object). Suppose that a particle (object) moves alog the

More information

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology Advaced Aalysis Mi Ya Departmet of Mathematics Hog Kog Uiversity of Sciece ad Techology September 3, 009 Cotets Limit ad Cotiuity 7 Limit of Sequece 8 Defiitio 8 Property 3 3 Ifiity ad Ifiitesimal 8 4

More information

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering CEE 5 Autum 005 Ucertaity Cocepts for Geotechical Egieerig Basic Termiology Set A set is a collectio of (mutually exclusive) objects or evets. The sample space is the (collectively exhaustive) collectio

More information

Lecture 12: November 13, 2018

Lecture 12: November 13, 2018 Mathematical Toolkit Autum 2018 Lecturer: Madhur Tulsiai Lecture 12: November 13, 2018 1 Radomized polyomial idetity testig We will use our kowledge of coditioal probability to prove the followig lemma,

More information

7 Sequences of real numbers

7 Sequences of real numbers 40 7 Sequeces of real umbers 7. Defiitios ad examples Defiitio 7... A sequece of real umbers is a real fuctio whose domai is the set N of atural umbers. Let s : N R be a sequece. The the values of s are

More information

Lecture 12: September 27

Lecture 12: September 27 36-705: Itermediate Statistics Fall 207 Lecturer: Siva Balakrisha Lecture 2: September 27 Today we will discuss sufficiecy i more detail ad the begi to discuss some geeral strategies for costructig estimators.

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 6 9/24/2008 DISCRETE RANDOM VARIABLES AND THEIR EXPECTATIONS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 6 9/24/2008 DISCRETE RANDOM VARIABLES AND THEIR EXPECTATIONS MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 6 9/24/2008 DISCRETE RANDOM VARIABLES AND THEIR EXPECTATIONS Cotets 1. A few useful discrete radom variables 2. Joit, margial, ad

More information

EE 4TM4: Digital Communications II Probability Theory

EE 4TM4: Digital Communications II Probability Theory 1 EE 4TM4: Digital Commuicatios II Probability Theory I. RANDOM VARIABLES A radom variable is a real-valued fuctio defied o the sample space. Example: Suppose that our experimet cosists of tossig two fair

More information

Spring Information Theory Midterm (take home) Due: Tue, Mar 29, 2016 (in class) Prof. Y. Polyanskiy. P XY (i, j) = α 2 i 2j

Spring Information Theory Midterm (take home) Due: Tue, Mar 29, 2016 (in class) Prof. Y. Polyanskiy. P XY (i, j) = α 2 i 2j Sprig 206 6.44 - Iformatio Theory Midterm (take home) Due: Tue, Mar 29, 206 (i class) Prof. Y. Polyaskiy Rules. Collaboratio strictly prohibited. 2. Write rigorously, prove all claims. 3. You ca use otes

More information

1 Convergence in Probability and the Weak Law of Large Numbers

1 Convergence in Probability and the Weak Law of Large Numbers 36-752 Advaced Probability Overview Sprig 2018 8. Covergece Cocepts: i Probability, i L p ad Almost Surely Istructor: Alessadro Rialdo Associated readig: Sec 2.4, 2.5, ad 4.11 of Ash ad Doléas-Dade; Sec

More information

The Growth of Functions. Theoretical Supplement

The Growth of Functions. Theoretical Supplement The Growth of Fuctios Theoretical Supplemet The Triagle Iequality The triagle iequality is a algebraic tool that is ofte useful i maipulatig absolute values of fuctios. The triagle iequality says that

More information

Math 525: Lecture 5. January 18, 2018

Math 525: Lecture 5. January 18, 2018 Math 525: Lecture 5 Jauary 18, 2018 1 Series (review) Defiitio 1.1. A sequece (a ) R coverges to a poit L R (writte a L or lim a = L) if for each ǫ > 0, we ca fid N such that a L < ǫ for all N. If the

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 Fuctioal Law of Large Numbers. Costructio of the Wieer Measure Cotet. 1. Additioal techical results o weak covergece

More information

Week 5-6: The Binomial Coefficients

Week 5-6: The Binomial Coefficients Wee 5-6: The Biomial Coefficiets March 6, 2018 1 Pascal Formula Theorem 11 (Pascal s Formula For itegers ad such that 1, ( ( ( 1 1 + 1 The umbers ( 2 ( 1 2 ( 2 are triagle umbers, that is, The petago umbers

More information

4 The Sperner property.

4 The Sperner property. 4 The Sperer property. I this sectio we cosider a surprisig applicatio of certai adjacecy matrices to some problems i extremal set theory. A importat role will also be played by fiite groups. I geeral,

More information

5 Birkhoff s Ergodic Theorem

5 Birkhoff s Ergodic Theorem 5 Birkhoff s Ergodic Theorem Amog the most useful of the various geeralizatios of KolmogorovâĂŹs strog law of large umbers are the ergodic theorems of Birkhoff ad Kigma, which exted the validity of the

More information

Notes for Lecture 11

Notes for Lecture 11 U.C. Berkeley CS78: Computatioal Complexity Hadout N Professor Luca Trevisa 3/4/008 Notes for Lecture Eigevalues, Expasio, ad Radom Walks As usual by ow, let G = (V, E) be a udirected d-regular graph with

More information

Mathematical Statistics - MS

Mathematical Statistics - MS Paper Specific Istructios. The examiatio is of hours duratio. There are a total of 60 questios carryig 00 marks. The etire paper is divided ito three sectios, A, B ad C. All sectios are compulsory. Questios

More information

Shannon s noiseless coding theorem

Shannon s noiseless coding theorem 18.310 lecture otes May 4, 2015 Shao s oiseless codig theorem Lecturer: Michel Goemas I these otes we discuss Shao s oiseless codig theorem, which is oe of the foudig results of the field of iformatio

More information

INEQUALITIES BJORN POONEN

INEQUALITIES BJORN POONEN INEQUALITIES BJORN POONEN 1 The AM-GM iequality The most basic arithmetic mea-geometric mea (AM-GM) iequality states simply that if x ad y are oegative real umbers, the (x + y)/2 xy, with equality if ad

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece,, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet as

More information

subcaptionfont+=small,labelformat=parens,labelsep=space,skip=6pt,list=0,hypcap=0 subcaption ALGEBRAIC COMBINATORICS LECTURE 8 TUESDAY, 2/16/2016

subcaptionfont+=small,labelformat=parens,labelsep=space,skip=6pt,list=0,hypcap=0 subcaption ALGEBRAIC COMBINATORICS LECTURE 8 TUESDAY, 2/16/2016 subcaptiofot+=small,labelformat=pares,labelsep=space,skip=6pt,list=0,hypcap=0 subcaptio ALGEBRAIC COMBINATORICS LECTURE 8 TUESDAY, /6/06. Self-cojugate Partitios Recall that, give a partitio λ, we may

More information

Seunghee Ye Ma 8: Week 5 Oct 28

Seunghee Ye Ma 8: Week 5 Oct 28 Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value

More information

Lecture 10 October Minimaxity and least favorable prior sequences

Lecture 10 October Minimaxity and least favorable prior sequences STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least

More information

Random Variables, Sampling and Estimation

Random Variables, Sampling and Estimation Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig

More information

ECE 901 Lecture 13: Maximum Likelihood Estimation

ECE 901 Lecture 13: Maximum Likelihood Estimation ECE 90 Lecture 3: Maximum Likelihood Estimatio R. Nowak 5/7/009 The focus of this lecture is to cosider aother approach to learig based o maximum likelihood estimatio. Ulike earlier approaches cosidered

More information

Ma 530 Infinite Series I

Ma 530 Infinite Series I Ma 50 Ifiite Series I Please ote that i additio to the material below this lecture icorporated material from the Visual Calculus web site. The material o sequeces is at Visual Sequeces. (To use this li

More information

Complex Analysis Spring 2001 Homework I Solution

Complex Analysis Spring 2001 Homework I Solution Complex Aalysis Sprig 2001 Homework I Solutio 1. Coway, Chapter 1, sectio 3, problem 3. Describe the set of poits satisfyig the equatio z a z + a = 2c, where c > 0 ad a R. To begi, we see from the triagle

More information

Sequences and Series of Functions

Sequences and Series of Functions Chapter 6 Sequeces ad Series of Fuctios 6.1. Covergece of a Sequece of Fuctios Poitwise Covergece. Defiitio 6.1. Let, for each N, fuctio f : A R be defied. If, for each x A, the sequece (f (x)) coverges

More information

Chapter 4. Fourier Series

Chapter 4. Fourier Series Chapter 4. Fourier Series At this poit we are ready to ow cosider the caoical equatios. Cosider, for eample the heat equatio u t = u, < (4.) subject to u(, ) = si, u(, t) = u(, t) =. (4.) Here,

More information

Chapter 10: Power Series

Chapter 10: Power Series Chapter : Power Series 57 Chapter Overview: Power Series The reaso series are part of a Calculus course is that there are fuctios which caot be itegrated. All power series, though, ca be itegrated because

More information

CS284A: Representations and Algorithms in Molecular Biology

CS284A: Representations and Algorithms in Molecular Biology CS284A: Represetatios ad Algorithms i Molecular Biology Scribe Notes o Lectures 3 & 4: Motif Discovery via Eumeratio & Motif Represetatio Usig Positio Weight Matrix Joshua Gervi Based o presetatios by

More information

STAT Homework 1 - Solutions

STAT Homework 1 - Solutions STAT-36700 Homework 1 - Solutios Fall 018 September 11, 018 This cotais solutios for Homework 1. Please ote that we have icluded several additioal commets ad approaches to the problems to give you better

More information

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece 1, 1, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet

More information

Randomized Algorithms I, Spring 2018, Department of Computer Science, University of Helsinki Homework 1: Solutions (Discussed January 25, 2018)

Randomized Algorithms I, Spring 2018, Department of Computer Science, University of Helsinki Homework 1: Solutions (Discussed January 25, 2018) Radomized Algorithms I, Sprig 08, Departmet of Computer Sciece, Uiversity of Helsiki Homework : Solutios Discussed Jauary 5, 08). Exercise.: Cosider the followig balls-ad-bi game. We start with oe black

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

A Proof of Birkhoff s Ergodic Theorem

A Proof of Birkhoff s Ergodic Theorem A Proof of Birkhoff s Ergodic Theorem Joseph Hora September 2, 205 Itroductio I Fall 203, I was learig the basics of ergodic theory, ad I came across this theorem. Oe of my supervisors, Athoy Quas, showed

More information

4.3 Growth Rates of Solutions to Recurrences

4.3 Growth Rates of Solutions to Recurrences 4.3. GROWTH RATES OF SOLUTIONS TO RECURRENCES 81 4.3 Growth Rates of Solutios to Recurreces 4.3.1 Divide ad Coquer Algorithms Oe of the most basic ad powerful algorithmic techiques is divide ad coquer.

More information

1 of 7 7/16/2009 6:06 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 6. Order Statistics Defiitios Suppose agai that we have a basic radom experimet, ad that X is a real-valued radom variable

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2016 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

Refinement of Two Fundamental Tools in Information Theory

Refinement of Two Fundamental Tools in Information Theory Refiemet of Two Fudametal Tools i Iformatio Theory Raymod W. Yeug Istitute of Network Codig The Chiese Uiversity of Hog Kog Joit work with Siu Wai Ho ad Sergio Verdu Discotiuity of Shao s Iformatio Measures

More information

If a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero?

If a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero? 2 Lebesgue Measure I Chapter 1 we defied the cocept of a set of measure zero, ad we have observed that every coutable set is of measure zero. Here are some atural questios: If a subset E of R cotais a

More information

REAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS

REAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS REAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS 18th Feb, 016 Defiitio (Lipschitz fuctio). A fuctio f : R R is said to be Lipschitz if there exists a positive real umber c such that for ay x, y i the domai

More information