10-704: Information Processing and Learning, Spring 2015, Lecture 10: Feb 12
10-704: Information Processing and Learning                                Spring 2015
Lecture 10: Feb 12
Lecturer: Akshay Krishnamurthy          Scribes: Dean Asta, Kirthevasan Kandasamy

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

10.1 Codes

Codes are functions that convert strings over some alphabet into (typically shorter) strings over another alphabet. Recall the coding problem:

    X --> Encoder --> C(X) ∈ Σ*

Here Σ is the dictionary (e.g., for binary codes Σ = {0, 1}). Our goal is to have low expected code length with respect to the distribution p of X,

    l(C) = E_p[l(X)],

where l(x) is the length of C(x).

10.1.1 Taxonomy of codes

Let X be a random variable taking values in a set X. Let Σ* denote the Kleene closure of the dictionary, and let C : X → Σ* be the code. The extension of the code C is the code C : X* → Σ* defined by

    C(x_1 x_2 ... x_n) = C(x_1) C(x_2) ... C(x_n),    n = 0, 1, ...,   x_1, x_2, ..., x_n ∈ X.

Listed below are the types of codes we saw in class:

  Nonsingular:                            C is injective, i.e., x ≠ x' implies C(x) ≠ C(x').
  Uniquely decodable:                     The extension of the code is nonsingular.
  Prefix/Instantaneous/Self-punctuating:  No codeword prefixes another: for all distinct x, x' ∈ X, C(x') does not start with C(x).

We illustrate this below with the following example for the symbols {a, b, c, d}, taken from Chapter 5 in Cover and Thomas (the codeword entries follow Table 5.1 there):

    x | Singular | Nonsingular | Uniquely decodable | Instantaneous
    a |    0     |      0      |         10         |       0
    b |    0     |     010     |         00         |       10
    c |    0     |      01     |         11         |       110
    d |    0     |      10     |         110        |       111

We begin with the following important results.
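The defining properties in the taxonomy above can be checked mechanically. Below is a minimal sketch (not from the notes); the helper names and codeword tables are illustrative, in the spirit of the Cover and Thomas example.

```python
def is_nonsingular(code):
    """Nonsingular: the symbol-to-codeword map is injective."""
    return len(set(code.values())) == len(code)

def is_prefix_free(code):
    """Instantaneous: no codeword is a prefix of another codeword."""
    words = list(code.values())
    return not any(i != j and words[j].startswith(words[i])
                   for i in range(len(words)) for j in range(len(words)))

# Codeword tables in the spirit of the Cover and Thomas example
instantaneous = {"a": "0", "b": "10", "c": "110", "d": "111"}
nonsingular_only = {"a": "0", "b": "010", "c": "01", "d": "10"}

assert is_nonsingular(instantaneous) and is_prefix_free(instantaneous)
assert is_nonsingular(nonsingular_only) and not is_prefix_free(nonsingular_only)
```

Note that `nonsingular_only` is injective yet fails the prefix condition ("0" begins "010" and "01"), which is exactly why its extension is hard to decode.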
Theorem 10.1 (Kraft-McMillan Inequality) For any uniquely decodable code C : X → Σ*, where D = |Σ|,

    ∑_x D^{-l(x)} ≤ 1.                                                  (10.1)

Conversely, for any set {l(x)}_{x ∈ X} of numbers satisfying (10.1), there exists a prefix code C : X → {1, 2, ..., D}* such that l(x) is the length of C(x) for each x.

Theorem 10.2
1. H(X) ≤ l(C) for all uniquely decodable codes C.
2. For any ε > 0, there exists n large enough and a code C : X^n → Σ* such that (1/n) E[l(C(X_1^n))] ≤ H(X) + ε.

Proof: The proof of the first statement follows by solving the convex program

    min ∑_x p(x) l(x)   subject to   ∑_x D^{-l(x)} ≤ 1,

where ∑_x p(x) = 1. For the second, we can use the Shannon code in blocks. That is, for an n-length sequence x_1^n = x_1 ... x_n, use the code lengths l_n(x_1^n) = ⌈log_D 1/p(x_1^n)⌉, so that

    nH(X) = H(X_1^n) ≤ E[l_n(X_1^n)] ≤ H(X_1^n) + 1 = nH(X) + 1.

Proposition 10.3 The ideal codelengths for a prefix code with smallest expected codelength are

    l*(x) = log_D 1/p(x)        (Shannon information content).

Proof: In the last class, we showed that for all length functions l of prefix codes, E[l*(X)] = H_p(X) ≤ E[l(X)]. While Shannon information contents are not integer-valued and hence cannot be the lengths of codewords, the integers {⌈log_D 1/p(x)⌉}_{x ∈ X} satisfy the Kraft-McMillan inequality, and hence by Theorem 10.1 there exists some uniquely decodable code C for which

    H_p(X) ≤ E[l(X)] < H_p(X) + 1.                                      (10.2)

Such a code is called a Shannon code. Moreover, the lengths of codewords for such a code C achieve the entropy of X asymptotically if Shannon codes are constructed for strings of n symbols, where n → ∞, instead of individual symbols. Assuming X_1, X_2, ... form an iid process, for all n = 1, 2, ...

    nH(X) = H(X_1, X_2, ..., X_n) ≤ E[l(X_1, ..., X_n)] < H(X_1, X_2, ..., X_n) + 1 = nH(X) + 1

by (10.2), and hence E[l(X_1, ..., X_n)/n] → H(X). If X_1, X_2, ... form a stationary process, then a similar argument shows that E[l(X_1, ..., X_n)/n] → H(𝒳), where H(𝒳) is the entropy rate of the process.
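The Shannon code lengths ⌈log_D 1/p(x)⌉ and the sandwich (10.2) are easy to verify numerically. A minimal sketch, assuming a hypothetical four-symbol source (the function name and distribution are illustrative):

```python
import math

def shannon_lengths(p):
    """Binary Shannon code lengths l(x) = ceil(log2 1/p(x))."""
    return {x: math.ceil(-math.log2(px)) for x, px in p.items()}

# Hypothetical source distribution (not from the notes)
p = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
l = shannon_lengths(p)

H = -sum(px * math.log2(px) for px in p.values())
kraft_sum = sum(2 ** -lx for lx in l.values())
expected_len = sum(p[x] * l[x] for x in p)

assert kraft_sum <= 1             # Kraft-McMillan holds, so a prefix code exists
assert H <= expected_len < H + 1  # the sandwich (10.2)
```

Here the Kraft sum is strictly below 1, reflecting that rounding lengths up wastes some of the code space; the expected length still lands within one bit of the entropy.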
Theorem 10.4 (Shannon Source Coding Theorem) A collection of n iid random variables, each with entropy H(X), can be compressed into nH(X) bits on average with negligible loss as n → ∞. Conversely, no uniquely decodable code can compress them to less than nH(X) bits without loss of information.

10.1.2 Non-singular vs. uniquely decodable codes

Can we gain anything by giving up unique decodability and only requiring the code to be non-singular? First, the question is not really fair, because we cannot easily decode a sequence of symbols each encoded with a non-singular code. Second (as we argue below), non-singular codes only provide a small improvement in expected codelength over entropy.

Theorem 10.5 The lengths of a non-singular code satisfy ∑_x D^{-l(x)} ≤ l_max, and for any probability distribution p on X, the code has expected length

    E[l(X)] = ∑_x p(x) l(x) ≥ H_D(X) − log_D l_max.

Proof: Let a_l denote the number of unique codewords of length l. Then a_l ≤ D^l, since no codeword can be repeated due to non-singularity. Using this,

    ∑_x D^{-l(x)} = ∑_{l=1}^{l_max} a_l D^{-l} ≤ ∑_{l=1}^{l_max} D^l D^{-l} = l_max.

The expected codelength can be obtained by solving the following optimization problem:

    min ∑_x p(x) l(x)   subject to   ∑_x D^{-l(x)} ≤ l_max,

the convex non-singularity code constraint. Differentiating the Lagrangian ∑_x p_x l_x + λ ∑_x D^{-l_x} with respect to l_x and noting that at the global minimum (λ*, l*) it must be zero, we get

    p_x − λ* D^{-l*_x} ln D = 0,

which implies that D^{-l*_x} = p_x / (λ* ln D). Using complementary slackness, and noting that λ* > 0 for the above condition to make sense, we have

    ∑_x D^{-l*_x} = ∑_x p_x / (λ* ln D) = 1 / (λ* ln D) = l_max,

which implies λ* = 1/(l_max ln D), and hence D^{-l*_x} = p_x l_max, or the optimum length l*_x = −log_D(p_x l_max). This gives the expected minimum codelength for non-singular codes as

    ∑_x p_x l*_x = −∑_x p_x log_D(p_x l_max) = H_D(X) − log_D l_max.

In the last lecture, we saw an example of a non-singular code for a process which has expected length below entropy. However, this is only possible when encoding individual symbols. As a direct corollary of the above result, if symbol strings of length n are encoded using a non-singular code, then

    E[l(X_1^n)] ≥ H(X_1^n) − log_D(n l_max).
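Theorem 10.5 can be sanity-checked numerically: a natural non-singular code greedily assigns the shortest distinct nonempty binary strings to the most probable symbols, and its expected length beats the entropy while still respecting the H_D(X) − log_D l_max lower bound. The greedy scheme and the distribution below are illustrative, not from the notes.

```python
import math

def greedy_nonsingular_expected_length(p, D=2):
    """Assign the most probable symbols the shortest nonempty D-ary
    strings; non-singularity only requires codewords to be distinct."""
    probs = sorted(p, reverse=True)
    lengths, l, used = [], 1, 0
    for q in probs:
        if used == D ** l:          # all D^l strings of length l are taken
            l, used = l + 1, 0
        lengths.append(l)
        used += 1
    return sum(q * L for q, L in zip(probs, lengths)), max(lengths)

# Illustrative distribution (not from the notes)
p = [0.4, 0.2, 0.2, 0.1, 0.05, 0.05]
H = -sum(q * math.log2(q) for q in p)
EL, lmax = greedy_nonsingular_expected_length(p)

assert EL < H                       # below entropy, as claimed for single symbols
assert EL >= H - math.log2(lmax)    # Theorem 10.5's lower bound
```

Note how the code achieves an expected length below H(X), which no uniquely decodable code can do, but only by the small log_D l_max margin the theorem allows.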
Thus, the expected length per symbol cannot be much smaller than the entropy (for iid processes) or the entropy rate (for stationary processes) asymptotically, even for non-singular codes, since the second term divided by n is negligible. Thus, non-singular codes do not offer much improvement over uniquely decodable and prefix codes. In fact, any non-singular code can be converted into a prefix code while only increasing the codelength per symbol by an amount that is negligible asymptotically.

10.1.3 Huffman Coding

Is there a prefix code with expected length shorter than the Shannon code? The answer is yes. The optimal (shortest expected length) prefix code for a given distribution can be constructed by a simple algorithm due to Huffman. We introduce this optimal symbol code, called a Huffman code, which admits a simple algorithm for its implementation. We fix Σ = {0, 1} and hence consider binary codes, although the procedure described here readily adapts to more general Σ. Simply put, we define the Huffman code C : X → {0, 1}* as the coding scheme that builds a binary tree from the leaves up: it takes the two symbols having the least probabilities, assigns them equal lengths, merges them, and then reiterates the entire process. Formally, we describe the code as follows. Let X = {x_1, ..., x_N}, p_1 = p(x_1), p_2 = p(x_2), ..., p_N = p(x_N). The procedure Huff is defined as follows:

    Huff(p_1, ..., p_N):
        if N = 2 then
            C(1) ← 0, C(2) ← 1
        else
            sort so that p_1 ≥ p_2 ≥ ... ≥ p_N
            C' ← Huff(p_1, p_2, ..., p_{N−2}, p_{N−1} + p_N)
            for each i:
                if i ≤ N − 2 then C(i) ← C'(i)
                else if i = N − 1 then C(i) ← C'(N − 1) 0
                else C(i) ← C'(N − 1) 1
        return C

For example, consider a probability distribution on the symbols {a, b, c, d, e, f, g}. [The table of numeric probabilities p_i and the resulting Huffman codewords did not survive transcription.] The Huffman tree is built using the procedure described above. The two least probable symbols at the first iteration are a and f, so they are merged into one new symbol af with probability p(a) + p(f). At the second iteration, the two least probable symbols are af and g, which are then combined, and so on. The resulting Huffman tree is shown below.
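The recursive Huff procedure above can also be written iteratively with a min-heap, popping the two least probable subtrees and merging them. A minimal sketch follows; since the example's numeric table did not survive transcription, the probabilities here are made up.

```python
import heapq
import math

def huffman(p):
    """Binary Huffman code: repeatedly merge the two least probable
    subtrees, prepending a distinguishing bit to each side's codewords."""
    # heap entries: (probability, tie-breaker id, {symbol: codeword})
    heap = [(px, i, {x: ""}) for i, (x, px) in enumerate(p.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # least probable subtree
        p1, _, c1 = heapq.heappop(heap)   # second least probable subtree
        merged = {x: "0" + w for x, w in c0.items()}
        merged.update({x: "1" + w for x, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, uid, merged))
        uid += 1
    return heap[0][2]

# Made-up probabilities (the original table's values were lost)
p = {"a": 0.05, "b": 0.25, "c": 0.1, "d": 0.2, "e": 0.25, "f": 0.05, "g": 0.1}
code = huffman(p)

expected_len = sum(p[x] * len(code[x]) for x in p)
entropy = -sum(px * math.log2(px) for px in p.values())
assert entropy <= expected_len < entropy + 1   # optimal prefix code bound
```

Prepending (rather than appending) the merge bit mirrors the bottom-up construction: bits attached in later merges sit closer to the root of the final tree.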
[Figure: the resulting Huffman tree on the leaves {a, b, c, d, e, f, g}.]

The Huffman code for a symbol x in the alphabet {a, b, c, d, e, f, g} can now be read by starting from the root of the tree and traversing down the tree until x is reached; each leftwards movement suffixes a 0 bit and each rightwards movement adds a trailing 1, resulting in the code shown above in the table.

Remark 1: If more than two symbols have the same probability at any iteration, then the Huffman coding may not be unique (depending on the order in which they are merged). However, all Huffman codings on that alphabet are optimal, in the sense that they will yield the same expected codelength.

Remark 2: One might think of an alternate procedure that assigns small codelengths by building a tree top-down instead, e.g., dividing the symbols into two sets with almost equal probabilities and repeating. While intuitively appealing, this procedure is suboptimal and leads to a larger expected codelength than the Huffman encoding. You should try this on the symbol distribution described above.

Remark 3: For a D-ary encoding, the procedure is similar, except the D least probable symbols are merged at each step. Since the total number of symbols may not be enough to allow D symbols to be merged at each step, we might need to add some dummy symbols with 0 probability before constructing the Huffman tree. How many dummy symbols need to be added? Since the first iteration merges D symbols and then each iteration combines D − 1 symbols with a merged symbol, if the procedure is to last for k (some integer number of) iterations, then the total number of source symbols needed is 1 + k(D − 1). So before beginning the Huffman procedure, we add enough dummy symbols so that the total number of symbols equals 1 + k(D − 1) for the smallest possible value of k.

Now we will show that the Huffman procedure is indeed optimal, i.e., it yields the smallest expected codelength of any prefix code. Since there can be many optimal codes (e.g.,
flipping bits in a code still leads to a code with the same codelength; also, exchanging source symbols with the same codelength still yields an optimal code) and Huffman coding only finds one of them, let us first characterize some properties of optimal codes. Assume the source symbols x_1, ..., x_N ∈ X are ordered so that p_1 ≥ p_2 ≥ ... ≥ p_N. For brevity, we write l_i for l(x_i) for each i = 1, ..., N. We first observe some properties of general optimal prefix codes.

Lemma 10.6 For any distribution, an optimal prefix code exists that satisfies:
1. If p_j > p_k, then l_j ≤ l_k.
2. The two longest codewords have the same length and correspond to the two least likely symbols.
3. The two longest codewords differ only in the last bit.

Proof: The collection of prefix codes is well-ordered under expected lengths of codewords. Hence there exists a (not necessarily unique) optimal prefix code. To see (1), suppose C is an optimal prefix code. Let C' be the code interchanging C(x_j) and C(x_k) for some j < k (so that p_j ≥ p_k). Then

    0 ≤ L(C') − L(C) = ∑_i p_i l'_i − ∑_i p_i l_i = p_j l_k + p_k l_j − p_j l_j − p_k l_k = (p_j − p_k)(l_k − l_j),
and hence l_k − l_j ≥ 0, or equivalently, l_j ≤ l_k. To see (2), note that if the two longest codewords had differing lengths, a bit could be removed from the end of the longest codeword while remaining a prefix code, giving strictly lower expected length. An application of (1) yields (2), since it tells us that the longest codewords correspond to the least likely symbols.

We claim that Huffman codes are optimal, at least among all prefix codes. Because our proof involves multiple codes, we avoid ambiguity by writing L(C) for the expected length of a codeword coded by C, for each code C.

Proposition 10.7 Huffman codes are optimal prefix codes.

Proof: Define a sequence {A_N}, N = 2, ..., |X|, of sets of source symbols, with associated probabilities P_N = {p_1, p_2, ..., p_{N−1}, p_N + p_{N+1} + ... + p_{|X|}}. Let C_N denote a Huffman encoding on the set of source symbols A_N with probabilities P_N. We induct on the size N of the alphabet.

1. For the base case N = 2, the Huffman code maps x_1 and x_2 to one bit each and is hence optimal.
2. Inductively assume that the Huffman code C_{N−1} is an optimal prefix code.
3. We will show that the Huffman code C_N is also an optimal prefix code.

Notice that the code C_{N−1} is formed by taking the common prefix of the two longest codewords (least likely symbols) in {x_1, ..., x_N} and allotting it to a merged symbol with probability p_{N−1} + p_N. In other words, the Huffman tree for the merged alphabet is the merge of the Huffman tree for the original alphabet. This is true simply by the definition of the Huffman procedure. Let l_i denote the length of the codeword for symbol x_i in C_N, and let l'_i denote the length of the codeword for symbol x_i in C_{N−1}. Then

    L(C_N) = ∑_{i=1}^{N−2} p_i l_i + p_{N−1} l_{N−1} + p_N l_N
           = ∑_{i=1}^{N−2} p_i l'_i + (p_{N−1} + p_N) l'_{N−1} + (p_{N−1} + p_N)
           = L(C_{N−1}) + (p_{N−1} + p_N),

the last line following from the Huffman construction. Suppose, to the contrary, that C_N were not optimal. Let C̃_N be optimal (existence is guaranteed by the previous Lemma).
We can take C̃_{N−1} to be obtained from C̃_N by merging the two least likely symbols, which have the same length by Lemma 10.6. But then

    L(C̃_N) = L(C̃_{N−1}) + (p_{N−1} + p_N) ≥ L(C_{N−1}) + (p_{N−1} + p_N) = L(C_N),

where the inequality holds since C_{N−1} is optimal. Hence, C_N had to be optimal.

Remark 10.8 The numbers p_1, p_2, ..., p_N need not be probabilities; they can be arbitrary non-negative weights {w_i}. Huffman encoding in this case results in a code minimizing ∑_i w_i l_i.

Remark 10.9 Since Huffman codes are optimal prefix codes, they satisfy H(X) ≤ E[l(X)] < H(X) + 1, the same as the Shannon code. However, the expected length of a Huffman code is never longer than that of a Shannon code, even though for any given individual symbol either the Shannon or the Huffman code may assign the shorter codeword.
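Remark 10.8 can be illustrated directly: run the merge procedure on raw non-negative weights with no normalization; the resulting codeword lengths (and hence the minimized ∑_i w_i l_i) are unchanged by rescaling the weights. A minimal sketch with made-up weights:

```python
import heapq

def huffman_lengths(w):
    """Codeword lengths from the Huffman merge, run directly on
    non-negative weights (no normalization needed)."""
    # heap entries: (weight, tie-breaker id, list of leaf indices below)
    heap = [(wi, i, [i]) for i, wi in enumerate(w)]
    heapq.heapify(heap)
    lengths = [0] * len(w)
    uid = len(w)
    while len(heap) > 1:
        w0, _, s0 = heapq.heappop(heap)
        w1, _, s1 = heapq.heappop(heap)
        for i in s0 + s1:        # every leaf under the merge gets one bit deeper
            lengths[i] += 1
        heapq.heappush(heap, (w0 + w1, uid, s0 + s1))
        uid += 1
    return lengths

weights = [13, 7, 5, 3, 1]       # arbitrary non-negative weights
lengths = huffman_lengths(weights)

# Rescaling to probabilities leaves the tree, and hence the lengths, unchanged.
total = sum(weights)
assert huffman_lengths([w / total for w in weights]) == lengths
```

This is why the same routine serves both the probabilistic coding problem and weighted problems such as merge-cost minimization.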
Remark 10.10 Huffman codes are often undesirable in practice because they cannot easily accommodate changing source distributions. We often desire codes that can incorporate refined information on the probability distributions of strings, not just the distributions of individual source symbols (e.g., the English language). The Huffman coding tree needs to be recomputed for different source distributions (e.g., English vs. French).

10.2 Connections to Machine Learning

Consider the classical statistical learning setup. We have data Z from some distribution p. We want to have good prediction on some learning task, which is often characterized via

    f* = argmin_{f ∈ F} E_{Z∼p}[L(Z, f(Z))],

where L is some loss function and R(f) = E_{Z∼p}[L(Z, f(Z))] is the risk. For example, in linear regression Z = (X, Y) and L(Z, f(Z)) = (Y − f(X))^2, where f(X) = β^T X; then the objective becomes

    β* = argmin_β E_{X,Y∼p}[(Y − β^T X)^2].

However, the challenge is that we do not see the true distribution p. Instead we have n samples Z_1^n = {Z_1, ..., Z_n}, where Z_i ∼ p. A natural idea is to minimize the empirical risk,

    f̂ = argmin_{f ∈ F} E_{Z∼p̂}[L(Z, f(Z))] = argmin_{f ∈ F} (1/n) ∑_{i=1}^n L(Z_i, f(Z_i)).

Denote the empirical risk by R_n(f) = E_{Z∼p̂}[L(Z, f(Z))]. This procedure is called Empirical Risk Minimization (ERM). But does ERM work? We will begin this discussion by considering bounded loss functions L ∈ [0, 1]; we shall extend this in the next lecture. For this, we wish to bound the excess risk, R(f̂) − R(f*). Noting that R_n(f̂) ≤ R_n(f*), we see that

    R(f̂) − R(f*) ≤ |R(f̂) − R_n(f̂)| + |R(f*) − R_n(f*)|.

Therefore it is sufficient to have a uniform deviation bound of the form: for all f ∈ F, |R_n(f) − R(f)| ≤ ε. Using Hoeffding's inequality, we see that for any fixed f ∈ F, with probability > 1 − δ,

    |R_n(f) − R(f)| ≤ sqrt( log(2/δ) / (2n) ).

Therefore, if F is finite, we see that with probability > 1 − δ,

    for all f ∈ F,  |R_n(f) − R(f)| ≤ sqrt( log(2|F|/δ) / (2n) ),       (10.3)

which gives us the desired uniform deviation bound. Note that we got the uniform deviation bound by assigning failure probability δ/|F| to each function and then using the union bound. But to apply the union bound it is sufficient that F be countable.
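The uniform deviation bound (10.3) can be checked by simulation. Below is a minimal Monte Carlo sketch (not from the notes): each hypothetical function f in a finite class has a Bernoulli loss with success probability q_f, so R(f) = q_f, and we count how often the empirical risks of all functions stay within the union-bound radius.

```python
import math
import random

random.seed(0)

def uniform_deviation_holds(q, n, delta):
    """One trial: does max_f |R_n(f) - R(f)| stay within the bound (10.3)?"""
    eps = math.sqrt(math.log(2 * len(q) / delta) / (2 * n))
    for qf in q:   # loss of f on Z_i is Bernoulli(qf), so R(f) = qf
        Rn = sum(random.random() < qf for _ in range(n)) / n
        if abs(Rn - qf) > eps:
            return False
    return True

q = [0.1, 0.3, 0.5, 0.7, 0.9]   # risks of a hypothetical finite class F
trials = 200
hits = sum(uniform_deviation_holds(q, n=100, delta=0.05) for _ in range(trials))
assert hits / trials >= 0.95     # holds with probability at least 1 - delta
```

In practice the bound holds in nearly every trial, since Hoeffding plus the union bound is conservative for independent Bernoulli losses.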
Suppose we have a distribution ζ over F. Then we can assign a failure probability of δ(f) = δζ(f) to each f ∈ F and apply the union bound. In particular, if we can assign a prefix code to the class, then we can use the distribution ζ(f) = 2^{−l(f)}; here ∑_f ζ(f) ≤ 1 by the Kraft inequality. This gives a deviation bound of the form

    |R_n(f) − R(f)| ≤ sqrt( (1 + l(f)) log(2/δ) / (2n) ) = ε(f).
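The Kraft-weighted union bound above can be computed directly: each function's deviation radius ε(f) grows with its description length l(f), so simpler (shorter-code) functions get tighter guarantees. A small sketch; the function names and description lengths are hypothetical.

```python
import math

def kraft_weighted_radii(lengths, n, delta):
    """Deviation radius eps(f) when f has a prefix code of l(f) bits,
    i.e., weights zeta(f) = 2^{-l(f)} in the union bound."""
    assert sum(2 ** -l for l in lengths.values()) <= 1   # Kraft inequality
    return {f: math.sqrt((1 + l) * math.log(2 / delta) / (2 * n))
            for f, l in lengths.items()}

# Hypothetical description lengths for three models
lengths = {"f1": 1, "f2": 2, "f3": 2}
eps = kraft_weighted_radii(lengths, n=1000, delta=0.05)

assert eps["f1"] < eps["f2"]   # shorter descriptions yield tighter bounds
```

This is the Occam's-razor flavor of the bound: the penalty is not log |F| but the code length of the particular f, so the class may even be countably infinite.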
More informationMachine Learning Theory Tübingen University, WS 2016/2017 Lecture 11
Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract We will itroduce the otio of reproducig kerels ad associated Reproducig Kerel Hilbert Spaces (RKHS). We will cosider couple
More informationAdvanced Stochastic Processes.
Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.
More informationLecture Notes for Analysis Class
Lecture Notes for Aalysis Class Topological Spaces A topology for a set X is a collectio T of subsets of X such that: (a) X ad the empty set are i T (b) Uios of elemets of T are i T (c) Fiite itersectios
More information6.895 Essential Coding Theory October 20, Lecture 11. This lecture is focused in comparisons of the following properties/parameters of a code:
6.895 Essetial Codig Theory October 0, 004 Lecture 11 Lecturer: Madhu Suda Scribe: Aastasios Sidiropoulos 1 Overview This lecture is focused i comparisos of the followig properties/parameters of a code:
More informationSieve Estimators: Consistency and Rates of Convergence
EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS
MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak
More information# fixed points of g. Tree to string. Repeatedly select the leaf with the smallest label, write down the label of its neighbour and remove the leaf.
Combiatorics Graph Theory Coutig labelled ad ulabelled graphs There are 2 ( 2) labelled graphs of order. The ulabelled graphs of order correspod to orbits of the actio of S o the set of labelled graphs.
More informationDiscrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22
CS 70 Discrete Mathematics for CS Sprig 2007 Luca Trevisa Lecture 22 Aother Importat Distributio The Geometric Distributio Questio: A biased coi with Heads probability p is tossed repeatedly util the first
More informationHomework 9. (n + 1)! = 1 1
. Chapter : Questio 8 If N, the Homewor 9 Proof. We will prove this by usig iductio o. 2! + 2 3! + 3 4! + + +! +!. Base step: Whe the left had side is. Whe the right had side is 2! 2 +! 2 which proves
More informationOutput Analysis and Run-Length Control
IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%
More informationChapter 6 Infinite Series
Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat
More informationMachine Learning Theory Tübingen University, WS 2016/2017 Lecture 3
Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture 3 Tolstikhi Ilya Abstract I this lecture we will prove the VC-boud, which provides a high-probability excess risk boud for the ERM algorithm whe
More informationMachine Learning for Data Science (CS 4786)
Machie Learig for Data Sciece CS 4786) Lecture & 3: Pricipal Compoet Aalysis The text i black outlies high level ideas. The text i blue provides simple mathematical details to derive or get to the algorithm
More informationLecture 12: February 28
10-716: Advaced Machie Learig Sprig 2019 Lecture 12: February 28 Lecturer: Pradeep Ravikumar Scribes: Jacob Tyo, Rishub Jai, Ojash Neopae Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:
More informationSection 5.1 The Basics of Counting
1 Sectio 5.1 The Basics of Coutig Combiatorics, the study of arragemets of objects, is a importat part of discrete mathematics. I this chapter, we will lear basic techiques of coutig which has a lot of
More informationMath 299 Supplement: Real Analysis Nov 2013
Math 299 Supplemet: Real Aalysis Nov 203 Algebra Axioms. I Real Aalysis, we work withi the axiomatic system of real umbers: the set R alog with the additio ad multiplicatio operatios +,, ad the iequality
More informationLecture 3: August 31
36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,
More informationComputability and computational complexity
Computability ad computatioal complexity Lecture 4: Uiversal Turig machies. Udecidability Io Petre Computer Sciece, Åbo Akademi Uiversity Fall 2015 http://users.abo.fi/ipetre/computability/ 21. toukokuu
More informationLecture 11: Decision Trees
ECE9 Sprig 7 Statistical Learig Theory Istructor: R. Nowak Lecture : Decisio Trees Miimum Complexity Pealized Fuctio Recall the basic results of the last lectures: let X ad Y deote the iput ad output spaces
More informationLecture 2: Concentration Bounds
CSE 52: Desig ad Aalysis of Algorithms I Sprig 206 Lecture 2: Cocetratio Bouds Lecturer: Shaya Oveis Ghara March 30th Scribe: Syuzaa Sargsya Disclaimer: These otes have ot bee subjected to the usual scrutiy
More informationLecture 3 The Lebesgue Integral
Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified
More informationMATH1035: Workbook Four M. Daws, 2009
MATH1035: Workbook Four M. Daws, 2009 Roots of uity A importat result which ca be proved by iductio is: De Moivre s theorem atural umber case: Let θ R ad N. The cosθ + i siθ = cosθ + i siθ. Proof: The
More informationIntegrable Functions. { f n } is called a determining sequence for f. If f is integrable with respect to, then f d does exist as a finite real number
MATH 532 Itegrable Fuctios Dr. Neal, WKU We ow shall defie what it meas for a measurable fuctio to be itegrable, show that all itegral properties of simple fuctios still hold, ad the give some coditios
More informationInformation Theory Tutorial Communication over Channels with memory. Chi Zhang Department of Electrical Engineering University of Notre Dame
Iformatio Theory Tutorial Commuicatio over Chaels with memory Chi Zhag Departmet of Electrical Egieerig Uiversity of Notre Dame Abstract A geeral capacity formula C = sup I(; Y ), which is correct for
More informationNotes for Lecture 11
U.C. Berkeley CS78: Computatioal Complexity Hadout N Professor Luca Trevisa 3/4/008 Notes for Lecture Eigevalues, Expasio, ad Radom Walks As usual by ow, let G = (V, E) be a udirected d-regular graph with
More informationEntropy Rates and Asymptotic Equipartition
Chapter 29 Etropy Rates ad Asymptotic Equipartitio Sectio 29. itroduces the etropy rate the asymptotic etropy per time-step of a stochastic process ad shows that it is well-defied; ad similarly for iformatio,
More information1 Approximating Integrals using Taylor Polynomials
Seughee Ye Ma 8: Week 7 Nov Week 7 Summary This week, we will lear how we ca approximate itegrals usig Taylor series ad umerical methods. Topics Page Approximatig Itegrals usig Taylor Polyomials. Defiitios................................................
More informationn outcome is (+1,+1, 1,..., 1). Let the r.v. X denote our position (relative to our starting point 0) after n moves. Thus X = X 1 + X 2 + +X n,
CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 9 Variace Questio: At each time step, I flip a fair coi. If it comes up Heads, I walk oe step to the right; if it comes up Tails, I walk oe
More informationA sequence of numbers is a function whose domain is the positive integers. We can see that the sequence
Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece,, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet as
More information