Lecture 10: Universal coding and prediction


10-704: Information Processing and Learning                         Spring 2012

Lecture 10: Universal coding and prediction

Lecturer: Aarti Singh                                    Scribes: Georg M. Goerg

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

10.1 Universal coding and prediction

We want an encoder that works well for any distribution p ∈ P. For example, to gzip a text file we would like an encoder that works for several different languages, rather than designing a new zip code for each single language. Shannon/Huffman codes could achieve this, but their disadvantage is that in general we have to wait for the whole sequence to arrive before beginning to encode/decode. We would like to design a code that can be encoded and decoded immediately, on a symbol-per-symbol basis.

In this lecture we will focus on the duality between optimal coding and optimal prediction. For example, it is natural to require a code C to have short length L_C(x) for the symbol x; analogously, in prediction we often want to minimize a loss function loss_q(x) for a predictor q. Table 10.1 gives a general overview of several dual concepts and notions of a code C and a predictor q that we will use in this (and upcoming) lectures.

10.1.1 Weak and strong universality

A code C is universal if

    R_C = sup_{p ∈ P} E_p(R_{p,C}) → 0.                                   (10.1)

A predictor q is universally consistent if

    R_q = sup_{p ∈ P} E_p(R_{p,q}) → 0.                                   (10.2)

Just as standard calculus has pointwise and uniform convergence (of, e.g., functions), we can also differentiate between weak and strong universality. A code is

  - weakly universal if the convergence rate depends on p ∈ P, e.g. if R_C = o(n^{-γ_p}) and γ_p changes for every p ∈ P;
  - strongly universal if the convergence rate is the same for all p ∈ P, e.g. R_C = o(n^{-γ}) for all p ∈ P.

(The rate need not necessarily be polynomial; it just serves as an example.)

By Kraft's inequality we know that for prefix codes

    ∑_{x^n ∈ X^n} 2^{-L_C(x^n)} ≤ 1.                                      (10.3)
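The following short Python sketch is my own illustration, not part of the original notes; the prefix code and the source p below are assumptions chosen for concreteness. It checks Kraft's inequality (10.3) for a small single-symbol code and computes the per-symbol redundancy E_p[L_C(x)] - H(p) that appears in Table 10.1.

```python
import math

# Hypothetical prefix code over {a, b, c, d} (an assumption for illustration).
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Kraft sum: sum_x 2^{-L_C(x)} must be <= 1 for any prefix code, cf. (10.3).
kraft_sum = sum(2.0 ** (-len(w)) for w in code.values())
assert kraft_sum <= 1.0 + 1e-12
print(f"Kraft sum = {kraft_sum:.3f}")  # = 1.000 here, i.e. the code is complete

# Expected length and redundancy E_p[L_C(x)] - H(p) under an assumed source p.
p = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
exp_len = sum(p[s] * len(w) for s, w in code.items())
entropy = -sum(px * math.log2(px) for px in p.values())
print(f"E_p[L_C] = {exp_len:.3f} bits, H(p) = {entropy:.3f} bits, "
      f"redundancy = {exp_len - entropy:.3f} bits/symbol")
```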

Table 10.1: Universal coding and prediction: per-symbol properties of a code C and a predictor q for data x_1, ..., x_n from p ∈ P. (We want the code C, respectively the predictor q, to work well for all p ∈ P.)

  Per-symbol loss
    code C:      L_C(x), the length (code length) of symbol x using code C
    predictor q: loss_q(x) = -log q(x)

  Average loss on a sequence x^n
    code C:      (1/n) L_C(x^n); the ideal code length is L_C(x) ≈ -log p(x), the Shannon information content
    predictor q: -(1/n) log q(x^n) = (1/n) ∑_{i=1}^n [-log q(x_i)] if x^n is iid; the negative log-likelihood or self-information loss, i.e. the empirical log-likelihood of the data; -(1/n) log p(x^n) is the likelihood under the true model p

  Expected loss (E-loss = risk)
    code C:      E_p[(1/n) L_C(x^n)]
    predictor q: E_p[-(1/n) log q(x^n)]

  Redundancy / excess loss
    code C:      R_{p,C} = (1/n) [ L_C(x^n) - (-log p(x^n)) ]
    predictor q: R_{p,q} = (1/n) [ -log q(x^n) - (-log p(x^n)) ], the excess loss of predictor q w.r.t. the true p

  Expected redundancy / excess risk
    code C:      E_p(R_{p,C}) = (1/n) [ E_p L_C(x^n) - H(x^n) ] ≥ 0 for uniquely decodable codes
    predictor q: E_p(R_{p,q}) = (1/n) E_p log[ p(x^n)/q(x^n) ] = (1/n) D(p ‖ q) ≥ 0; excess risk = expected excess loss

  Minimax
    code C:      min_C sup_{p ∈ P} E_p(R_{p,C}) =: R_C*   (minimax expected redundancy)
    predictor q: min_q sup_{p ∈ P} (1/n) D(p ‖ q) =: R_n*  (minimax excess risk)
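The predictor column of Table 10.1 can be made concrete in a few lines of Python. This is a sketch of my own (the distributions p and q are arbitrary assumptions): it computes the self-information loss, the risk, and the expected redundancy for an iid source, and checks the identity risk = H(p) + D(p ‖ q).

```python
import math

p = {"a": 0.7, "b": 0.2, "c": 0.1}  # assumed true source
q = {"a": 0.5, "b": 0.3, "c": 0.2}  # some predictor != p

# Self-information loss of q on one symbol: loss_q(x) = -log2 q(x).
loss_q = {x: -math.log2(q[x]) for x in q}

# Risk (expected per-symbol loss) and entropy of the true source p.
risk = sum(p[x] * loss_q[x] for x in p)
entropy = -sum(px * math.log2(px) for px in p.values())

# Expected redundancy / excess risk: E_p[log2 p/q] = D(p||q) >= 0 (per symbol, iid case).
excess = sum(p[x] * math.log2(p[x] / q[x]) for x in p)
print(f"risk = {risk:.4f} bits, H(p) = {entropy:.4f} bits, D(p||q) = {excess:.4f} bits")
assert abs(risk - (entropy + excess)) < 1e-9  # risk = H(p) + D(p||q)
```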

From (10.3) we can construct the corresponding predictor by

    q(x^n) = 2^{-L_C(x^n)} / ∑_{k ∈ X^n} 2^{-L_C(k)}  ≥  2^{-L_C(x^n)},
    so that   L_C(x^n) ≥ -log q(x^n).                                     (10.4)

Similarly, for any distribution q(x^n) we can define a corresponding prefix code that satisfies

    L_C(x^n) ≤ -log q(x^n) + 1.                                           (10.5)

This must be true since we know that at least the Shannon code, with lengths ⌈-log q(x^n)⌉, can achieve this. For prefix codes it therefore follows (combine (10.1) with (10.4)) that

    R_C* ≥ min_q sup_{p ∈ P} (1/n) D(p ‖ q) =: R_n*.                      (10.6)

Let q_n* = arg min_q sup_{p ∈ P} (1/n) D(p ‖ q) be the predictor that achieves the above minimum. We can then construct a Shannon code C from this predictor. By (10.5) we know that R_C will be within 1/n bit per symbol of R_C* (or within 1 bit for the entire sequence).

10.1.2 Prediction problem

We have data x^n from p ∈ P. How much loss do we suffer from using q ≠ p instead of the true p? Here q can be any distribution; however, typically q is an estimate of p depending on the data x^n.

In general, we have to impose some restrictions on the class of distributions P to get universally consistent codes C / predictors q, i.e. to achieve error rates → 0. For example, we often assume i) iid data, ii) Markov chains, etc.

For a specific q consider R_q ≥ R_n*. Typically we try to bound these two quantities to get control over the error rates. Note that q_n* = arg min_q R_q, where R_q is the worst-case excess risk for a particular q; that is, q_n* is the model/estimate q that minimizes the expected worst-case scenario.

We will show later in the course that, in general, the optimal q_n* is a mixture distribution over the class P. In other words, for any q, there exists a mixture distribution p_mix such that the excess risk of q is always greater than or equal to the excess risk of p_mix, i.e.

    D(p ‖ q) ≥ D(p ‖ p_mix).                                              (10.7)

10.1.3 Maximum loss instead of expected loss

Now, instead of the expected loss, consider the maximum loss (maximum over all possible sequences x^n):

    R̃_n = min_q sup_{p ∈ P} max_{x^n} (1/n) log [ p(x^n) / q(x^n) ].       (10.8)
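To illustrate the code / predictor duality in (10.4)-(10.5), here is a small Python sketch of my own (the block length n = 3 and the iid predictor q are assumptions). Shannon code lengths ⌈-log2 q(x^n)⌉ satisfy Kraft's inequality, so a prefix code with these lengths exists, and they are within 1 bit of -log2 q(x^n); conversely, renormalizing 2^{-L} recovers a predictor as in (10.4).

```python
import itertools
import math

alphabet = "ab"
n = 3

def q(xs):
    """Assumed predictor: iid with P(a) = 0.8, purely for illustration."""
    return math.prod(0.8 if x == "a" else 0.2 for x in xs)

seqs = ["".join(s) for s in itertools.product(alphabet, repeat=n)]

# (10.5): Shannon code lengths built from the predictor q.
L = {s: math.ceil(-math.log2(q(s))) for s in seqs}
assert sum(2.0 ** (-L[s]) for s in seqs) <= 1.0         # Kraft holds, so a prefix code exists
assert all(L[s] <= -math.log2(q(s)) + 1 for s in seqs)  # within 1 bit of -log2 q

# (10.4): from code lengths back to a predictor by normalizing 2^{-L}.
Z = sum(2.0 ** (-L[s]) for s in seqs)
q_from_code = {s: 2.0 ** (-L[s]) / Z for s in seqs}
assert all(L[s] >= -math.log2(q_from_code[s]) for s in seqs)

print("per-sequence overhead L + log2 q (bits):",
      {s: round(L[s] + math.log2(q(s)), 3) for s in seqs[:3]})
```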

Let P_ML(x^n) = sup_{p ∈ P} p(x^n) (the MLE). Define the normalized maximum likelihood (NML) distribution as

    NML(x^n) = P_ML(x^n) / ∑_{x^n ∈ X^n} P_ML(x^n) = q̃_n*,                (10.9)

the optimal predictor under maximum loss (instead of E-loss). The normalized maximum likelihood distribution is the best universal predictor under maximum loss.

Theorem 10.1 For any class P of processes with finite alphabet,

    q̃_n* = NML(x^n)   and   R̃_n = (1/n) log ∑_{x^n ∈ X^n} P_ML(x^n).      (10.10)

For a proof, see p. 480 of Csiszár and Shields' tutorial.

10.1.3.1 Problems of NML and maximum loss for arithmetic coding

For arithmetic coding we need the conditional distribution q(x_n | x^{n-1}). But the NML distribution is not consistent under marginalization, in the sense that

    q̃_n*(x^n) ≠ ∑_{x_{n+1}} q̃_{n+1}*(x^{n+1}),                            (10.11)

or equivalently

    q̃_n*(x_1, ..., x_n) ≠ ∑_{x_{n+1}} q̃_{n+1}*(x_1, ..., x_n, x_{n+1}).    (10.12)

Remark: The right hand side in the above expressions yields a valid distribution, but it is not the distribution of x_1, ..., x_n under q̃_n*. This can be seen by recalling the definition of q̃_n*. Thus it is not possible to define a corresponding arithmetic code (Shannon and Huffman codes are possible, though).

We therefore return to the E-loss, since in that case we know that q_n* is a mixture distribution, and a mixture is consistent in the sense that

    q_n*(x_1, ..., x_n) = ∑_{x_{n+1}} q_n*(x_1, ..., x_n, x_{n+1}).          (10.13)
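The following sketch is my own numerical illustration (the iid Bernoulli class and the block length are assumptions): it computes the NML distribution (10.9) at block lengths n and n+1, evaluates the max-loss minimax value from Theorem 10.1, and shows the marginalization inconsistency (10.11) on a concrete sequence.

```python
import itertools
import math

def p_ml(xs):
    """Maximized likelihood sup_theta p_theta(x^n) for the iid Bernoulli class."""
    n, k = len(xs), sum(xs)
    th = k / n
    return (th ** k) * ((1 - th) ** (n - k))  # Python gives 0**0 == 1, as needed

def nml(n):
    seqs = list(itertools.product([0, 1], repeat=n))
    Z = sum(p_ml(s) for s in seqs)            # normalizer (sum of maximized likelihoods)
    return {s: p_ml(s) / Z for s in seqs}, Z

n = 4
nml_n, Z_n = nml(n)
nml_n1, _ = nml(n + 1)

# Per-symbol max-loss minimax value from Theorem 10.1: (1/n) log2 of the normalizer.
print(f"R~_n = {math.log2(Z_n) / n:.4f} bits/symbol")

# Inconsistency (10.11): summing NML at length n+1 over x_{n+1} does not give NML at length n.
x = (0, 0, 1, 1)
marginal = sum(nml_n1[x + (b,)] for b in (0, 1))
print(f"NML_n(x)               = {nml_n[x]:.5f}")
print(f"sum over x_5 of NML_5  = {marginal:.5f}  (different, so no arithmetic coding)")
```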

Examples of model classes and their optimal codes/predictors.

1. P is the class of iid processes with finite alphabet X. It can be shown that the optimal predictor is given by (a runnable sketch of this predictor follows the list)

       q(x^n) = ∏_{i=1}^n [ n_{i-1}(x_i) + 1/2 ] / [ i - 1 + |X|/2 ],        (10.14)

   where

       n_{i-1}(x_i) = # of occurrences of the symbol x_i in the past x^{i-1}.  (10.15)

   The term n_{i-1}(x_i)/(i-1) is simply the frequency with which the symbol was observed before time i; the additional 1/2 (and |X|/2) smoothes out the ML estimate. Thus it avoids assigning probability 0 to symbols that have not occurred yet (but may occur in the future).

   Let n_x be the number of times the symbol x occurred in the entire length-n sequence. Then (10.14) can be rewritten as (see next lecture)

       q(x^n) = ∏_{x ∈ X} [ (n_x - 1/2)(n_x - 3/2) ⋯ (1/2) ] / [ (n - 1 + |X|/2)(n - 2 + |X|/2) ⋯ (|X|/2) ]
              = ∫ π(p) p(x^n) dp,                                            (10.16)

   where π(p) is the Dirichlet(1/2, ..., 1/2) prior (https://en.wikipedia.org/wiki/Dirichlet_distribution). One can show that (for a proof see the next lecture)

       R_n* ≤ R_q ≤ (|X| - 1)/(2n) · log n  [best possible bound]  +  constant/n.   (10.17)

2. P is the class of Markov processes of order 1. Let n_i(k, j) be the count of how many times the pair (k, j) appeared in the first i symbols (x_1, ..., x_i); also let n_i(k) = ∑_j n_i(k, j) be the total number of times the symbol k occurred in the first i symbols. Then

       q(x^n) = ∏_{i=1}^n q(x_i | x^{i-1}),
       q(j | x^{i-1}) = [ n_{i-1}(k, j) + 1/2 ] / [ n_{i-1}(k) + |X|/2 ],    k = x_{i-1}.   (10.18)

   For a Markov process of order m = 1 one can show (see next lecture)

       R_n* ≤ R_q ≤ |X| (|X| - 1)/(2n) · log n  [best possible bound]  +  constant/n.   (10.19)

3. P is the class of Markov processes of order m, i.e. x_i depends on the previous m steps. Then

       q(j | x^{i-1}) = [ # times j occurred preceded by x_{i-m}^{i-1} + 1/2 ] / [ # times x_{i-m}^{i-1} occurred + |X|/2 ].   (10.20)

   Here it holds that

       R_n* ≤ R_q ≤ |X|^m (|X| - 1)/(2n) · log n  [best possible bound]  +  constant_m/n.   (10.21)

   Again, see the next lecture for detailed derivations.
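The add-1/2 predictor (10.14), known as the Krichevsky-Trofimov estimator, is easy to run sequentially. The sketch below is my own illustration on an assumed binary sequence: it compares the sequential code length -log2 q(x^n) with the hindsight ML code length -log2 P_ML(x^n) and with the leading term (|X| - 1)/(2n) log2 n of (10.17) (recall that the bound also carries an additional constant/n term).

```python
import math
from collections import Counter

def kt_sequential_logloss(xs, alphabet):
    """Total code length -log2 q(x^n) of the add-1/2 (KT) predictor in (10.14)."""
    counts = Counter()
    total_bits = 0.0
    for i, x in enumerate(xs):  # i = number of symbols already seen
        prob = (counts[x] + 0.5) / (i + len(alphabet) / 2)
        total_bits += -math.log2(prob)
        counts[x] += 1
    return total_bits

# Assumed example sequence over the binary alphabet.
xs = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
n, A = len(xs), [0, 1]
kt_bits = kt_sequential_logloss(xs, A)

# Best iid fit in hindsight (the maximized likelihood P_ML of (10.9)).
k = sum(xs)
theta_hat = k / n
ml_bits = -(k * math.log2(theta_hat) + (n - k) * math.log2(1 - theta_hat))

print(f"KT: {kt_bits:.2f} bits, hindsight ML: {ml_bits:.2f} bits, "
      f"per-symbol regret = {(kt_bits - ml_bits) / n:.3f} bits, "
      f"leading term (|A|-1)/(2n) log2 n = {(len(A) - 1) / (2 * n) * math.log2(n):.3f} bits")
```

The same counting rule, applied per preceding context as in (10.18) and (10.20), gives the Markov-class predictors; only the counts are conditioned on the previous m symbols.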