Bounds on the expected entropy and KL-divergence of sampled multinomial distributions. Brandon C. Roy

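bcroy@media.mit.edu
Original: May 18, 2011. Revised: June 6, 2011. (Page footer date: June 27, 2011.)

Abstract

Information theoretic quantities calculated from a sampled multinomial distribution deviate from the true quantity as a function of the number of samples. In this report, bounds on the expected entropy and KL-divergence for a sampled distribution are derived. These bounds can be helpful in understanding the error between the empirical quantity and the true quantity for a distribution.

1 Expected entropy lower bound

Consider a multinomial distribution $p$ with $B$ bins, and estimates of $p$ obtained by sampling. Let $p_n$ be the distribution obtained by taking $n$ samples from $p$. What is the expected entropy of $p_n$ as a function of $n$? We should expect that for $n = 1$ samples, all the probability mass will be in one particular bin of $p_n$ and the entropy should be 0. As $n \to \infty$ we expect $p_n \to p$ and $H(p_n) \to H(p)$. This derivation explores this relationship.

We want to know the expected value of the entropy of $p_n$,

\[
E[H(p_n)] = E\left[ -\sum_{i=1}^{B} p_{n,i} \log p_{n,i} \right] = -\sum_{i=1}^{B} E\left[ p_{n,i} \log p_{n,i} \right] \tag{1}
\]

We now consider only the expected value term in (1). The maximum likelihood estimate of $p$ from $n$ samples is $p_{n,i} = x_i / n$ for all $i = 1 \ldots B$, where $x_i$ is the number of samples in bin $i$. Thus, for a particular bin $i$ we have

\[
E\left[ p_{n,i} \log p_{n,i} \right] = E\left[ \frac{x_i}{n} \log \frac{x_i}{n} \right] = \sum_{k=0}^{n} \Pr(x_i = k) \, \frac{k}{n} \log \frac{k}{n} \tag{2}
\]

Assuming the samples are i.i.d. draws from $p$, the count $x_i$ is binomially distributed and the expectation can be calculated as

\begin{align*}
E\left[ p_{n,i} \log p_{n,i} \right]
&= \sum_{k=0}^{n} \binom{n}{k} p_i^k (1 - p_i)^{n-k} \, \frac{k}{n} \log \frac{k}{n} \\
&= \sum_{k=0}^{n} \frac{n!}{(n-k)! \, k!} \, p_i^k (1 - p_i)^{n-k} \, \frac{k}{n} \log \frac{k}{n} \\
&= \sum_{k=1}^{n} \frac{n!}{(n-k)! \, k!} \, p_i^k (1 - p_i)^{n-k} \, \frac{k}{n} \log \frac{k}{n}
\end{align*}

Note that in this equation, $p_i$ is the true probability of bin $i$ in $p$ rather than the estimated value. Also note that the last line above results from the fact that when $k = 0$, the summand is zero. Next we simplify $k / k!$, pull an $n$ out of $n!$, and pull a $p_i$ to the front of the sum, obtaining

\[
E\left[ p_{n,i} \log p_{n,i} \right] = p_i \sum_{k=1}^{n} \frac{(n-1)!}{(n-k)! \, (k-1)!} \, p_i^{k-1} (1 - p_i)^{n-k} \log \frac{k}{n}
\]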

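Now, let $j = k - 1$, $m = n - 1$ and apply Jensen's inequality for the following derivation, where $\Pr(x_i = j)$ denotes the probability of $j$ successes under a binomial distribution with $m$ trials and success probability $p_i$:

\begin{align}
E\left[ p_{n,i} \log p_{n,i} \right]
&= p_i \sum_{j=0}^{m} \frac{m!}{(m-j)! \, j!} \, p_i^{j} (1 - p_i)^{m-j} \log \frac{j+1}{m+1} \notag \\
&= p_i \sum_{j=0}^{m} \Pr(x_i = j) \log \frac{j+1}{m+1} \tag{3} \\
&\le p_i \log \sum_{j=0}^{m} \Pr(x_i = j) \, \frac{j+1}{m+1} \tag{4} \\
&= p_i \log \frac{m p_i + 1}{m + 1} \tag{5} \\
&= p_i \log \frac{(n-1) p_i + 1}{n} \tag{6} \\
&= p_i \log \left( p_i + \frac{1 - p_i}{n} \right) \tag{7}
\end{align}

We obtain (4) from (3) using Jensen's inequality for the concave function $\log(\cdot)$. Equation (5) is just the expected value of the function $\frac{j+1}{m+1}$ for the binomial distribution. Substituting back for $m + 1$ yields (6), which simplifies to (7). Putting all this back together, we have

\begin{align}
E\left[ p_{n,i} \log p_{n,i} \right] &\le p_i \log \left( p_i + \frac{1 - p_i}{n} \right) \tag{8} \\
-E\left[ p_{n,i} \log p_{n,i} \right] &\ge -p_i \log \left( p_i + \frac{1 - p_i}{n} \right) \tag{9}
\end{align}

Recalling equation (1), and replacing the expectation with our lower bound, we have

\begin{align}
E[H(p_n)] &= -\sum_{i=1}^{B} E\left[ p_{n,i} \log p_{n,i} \right] \tag{10} \\
&\ge -\sum_{i=1}^{B} p_i \log \left( p_i + \frac{1 - p_i}{n} \right) \tag{11}
\end{align}

Note that intuitively, as $n \to \infty$, $\log\left( p_i + \frac{1 - p_i}{n} \right) \to \log p_i$ in equation (11), yielding the true entropy. When $n = 1$, $\log\left( p_i + \frac{1 - p_i}{1} \right) = \log(1) = 0$, so the bound is 0, matching $H(p_1) = 0$ as expected. Moreover, for each $n$, $-\log\left( p_i + \frac{1 - p_i}{n} \right) < -\log p_i$ and therefore contributes a smaller factor to the total entropy, implying that for small $n$ the expected entropy is smaller.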
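The bound in (11) can be checked numerically against the exact expectation, since the sum in (2) is finite. The following is a minimal sketch rather than the report's own code: the function names and the example distribution are assumptions made for illustration, and entropies are in nats.

```python
import math


def entropy(p):
    """Shannon entropy in nats, using the convention 0 log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def expected_entropy_exact(p, n):
    """Exact E[H(p_n)]: sum the per-bin binomial expectation (2) over bins."""
    total = 0.0
    for pi in p:
        for k in range(1, n + 1):  # the k = 0 summand is zero
            prob_k = math.comb(n, k) * pi**k * (1 - pi) ** (n - k)
            total -= prob_k * (k / n) * math.log(k / n)
    return total


def entropy_lower_bound(p, n):
    """Lower bound (11): -sum_i p_i log(p_i + (1 - p_i) / n)."""
    return -sum(pi * math.log(pi + (1 - pi) / n) for pi in p if pi > 0)


if __name__ == "__main__":
    p = [0.5, 0.25, 0.125, 0.125]  # hypothetical example distribution
    for n in (1, 2, 5, 20, 100):
        print(n, entropy_lower_bound(p, n), expected_entropy_exact(p, n), entropy(p))
```

For each $n$ the printed values should be ordered as lower bound, exact expectation, true entropy, with the first two both equal to zero at $n = 1$.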

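Equation (11) can also be written as

\begin{align}
-\sum_{i=1}^{B} p_i \log \left( p_i \left( 1 + \frac{1 - p_i}{n p_i} \right) \right)
&= -\sum_{i=1}^{B} p_i \log p_i - \sum_{i=1}^{B} p_i \log \left( 1 + \frac{1 - p_i}{n p_i} \right) \notag \\
&= H(p) - \sum_{i=1}^{B} p_i \log \left( 1 + \frac{1 - p_i}{n p_i} \right) \tag{12}
\end{align}

Equation (12) is useful because it isolates the component that varies with $n$. If $c_i = \frac{1 - p_i}{p_i}$, then $x = c_i / n \le 1$ whenever $n \ge c_i$, and the Taylor expansion of $\log(1 + x)$ can be applied. The Taylor series of $\log(1 + x)$ for $-1 < x \le 1$ is

\[
\log(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \ldots
\]

and so applying this for $x = c_i / n$ we have

\[
\log(1 + c_i / n) = \frac{c_i}{n} - \frac{c_i^2}{2 n^2} + \frac{c_i^3}{3 n^3} - \frac{c_i^4}{4 n^4} + \ldots
\]

Plugging this into the summation in (12) yields the first line below, and in the second line we apply the fact that $p_i c_i^k = (1 - p_i) c_i^{k-1}$ to get

\begin{align}
\sum_{i=1}^{B} p_i \log \left( 1 + \frac{c_i}{n} \right)
&= \sum_{i=1}^{B} p_i \left( \frac{c_i}{n} - \frac{c_i^2}{2 n^2} + \frac{c_i^3}{3 n^3} - \frac{c_i^4}{4 n^4} + \ldots \right) \notag \\
&= \sum_{i=1}^{B} \left( \frac{1 - p_i}{n} - \frac{(1 - p_i) c_i}{2 n^2} + \frac{(1 - p_i) c_i^2}{3 n^3} - \ldots \right) \notag \\
&\approx \frac{B - 1}{n} \tag{13}
\end{align}

with the approximation improving as $n$ increases. Also note that this preserves the bound, since including the first odd number of terms of the Taylor series (i.e. 1, 3, 5, ... terms) is always greater than the whole series. Therefore, we can write an alternative bound on the expected entropy as

\[
E[H(p_n)] \ge H(p) - \frac{B - 1}{n} \tag{14}
\]

This shows that the lower bound approaches the true entropy with $1/n$, a property that will come into play later for the KL-divergence.

We performed some simulations to obtain the expected entropy as a function of $n$ and compared this to the lower bound obtained using this derivation. Interestingly, this lower bound seems to be a good approximation to $H(p_n)$, and for $n = 1$ and $n \to \infty$ it seems that equality holds. These simulations are shown in figure 1.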
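The Taylor step behind (13) and (14) can be illustrated numerically. The sketch below is not from the report: the function names and the example distribution are hypothetical, and it assumes every $p_i > 0$ so that $c_i$ is defined. It compares the correction term from (12) with partial sums of its expansion.

```python
import math


def correction_term(p, n):
    """sum_i p_i log(1 + c_i / n) with c_i = (1 - p_i) / p_i, as in (12)."""
    return sum(pi * math.log1p((1 - pi) / (pi * n)) for pi in p)


def taylor_partial_sum(p, n, terms):
    """First `terms` terms of the Taylor expansion of the correction term."""
    total = 0.0
    for pi in p:
        x = (1 - pi) / (pi * n)  # x = c_i / n
        total += pi * sum((-1) ** (k + 1) * x**k / k for k in range(1, terms + 1))
    return total


if __name__ == "__main__":
    p = [0.5, 0.25, 0.125, 0.125]  # hypothetical example, B = 4
    for n in (10, 100, 1000):
        # With one term the partial sum is exactly (B - 1) / n.
        print(n, correction_term(p, n), taylor_partial_sum(p, n, 1), (len(p) - 1) / n)
```

With a single term the partial sum reduces to $(B - 1)/n$, and it stays above the full correction term, which is why replacing (12) by (14) keeps the inequality pointing the same way.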

[Figure 1: Simulations of the expected entropy for various sample sizes $n$ and the lower bound. Plot title: "$H(p_n)$ and lower bound". Two panels: entropy versus $n$, and log(entropy) versus $\log(n)$; each shows the expected entropy and the expected entropy lower bound.]
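A simulation of this kind can be sketched as follows. This is an illustrative Monte Carlo estimate under assumed settings (the example distribution, trial count, and seed are hypothetical) and is not the code used to produce the figure.

```python
import math
import random
from collections import Counter


def entropy(p):
    """Shannon entropy in nats, using the convention 0 log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def sampled_entropy(p, n, rng):
    """Entropy of the maximum likelihood estimate p_n from n i.i.d. draws."""
    counts = Counter(rng.choices(range(len(p)), weights=p, k=n))
    return entropy([c / n for c in counts.values()])


def simulate_expected_entropy(p, n, trials=2000, seed=0):
    """Monte Carlo estimate of E[H(p_n)], averaged over `trials` samples."""
    rng = random.Random(seed)
    return sum(sampled_entropy(p, n, rng) for _ in range(trials)) / trials


def lower_bound_11(p, n):
    """Lower bound (11)."""
    return -sum(pi * math.log(pi + (1 - pi) / n) for pi in p if pi > 0)


def lower_bound_14(p, n):
    """Alternative lower bound (14): H(p) - (B - 1) / n."""
    return entropy(p) - (len(p) - 1) / n


if __name__ == "__main__":
    p = [0.5, 0.25, 0.125, 0.125]  # hypothetical example distribution
    for n in (1, 10, 50, 200):
        print(n, lower_bound_14(p, n), lower_bound_11(p, n),
              simulate_expected_entropy(p, n), entropy(p))
```

Because the estimate is a sample average, it can fall slightly outside the bounds when the number of trials is small; increasing `trials` tightens it.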

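1.1 Expected entropy upper bound

The lower bound of the expected entropy converges to the true entropy of the distribution as $n \to \infty$. This is not surprising, as $p_n \to p$. Nevertheless, we can also find an upper bound on the expected entropy to better understand how it varies with $n$. Starting with equation (2), we have

\begin{align*}
E\left[ p_{n,i} \log p_{n,i} \right]
&= \sum_{k=0}^{n} \Pr(x_i = k) \, \frac{k}{n} \log \frac{k}{n} \\
&= \sum_{k=0}^{n} \Pr(x_i = k) \, \frac{k}{n} \log \frac{\Pr(x_i = k) \, \frac{k}{n}}{\Pr(x_i = k)}
\end{align*}

Let $a_k = \Pr(x_i = k) \, \frac{k}{n}$ and $b_k = \Pr(x_i = k)$. Then rewriting, and applying the log-sum inequality, yields

\begin{align*}
E\left[ p_{n,i} \log p_{n,i} \right]
&= \sum_{k=0}^{n} a_k \log \frac{a_k}{b_k} \\
&\ge \left( \sum_{k=0}^{n} a_k \right) \log \frac{\sum_{k=0}^{n} a_k}{\sum_{k=0}^{n} b_k} \\
&= p_i \log p_i
\end{align*}

since $\sum_k a_k = p_i$ and $\sum_k b_k = 1$. Therefore, $E\left[ p_{n,i} \log p_{n,i} \right] \ge p_i \log p_i$ or equivalently, $-E\left[ p_{n,i} \log p_{n,i} \right] \le -p_i \log p_i$. Putting this bound on equation (2) back in for equation (1) gives

\[
E[H(p_n)] \le -\sum_{i=1}^{B} p_i \log p_i = H(p)
\]

In other words, the expected entropy of the sampled distribution obtained after $n$ samples is upper bounded by the entropy of the true distribution.

It would be nice to find a tighter upper bound on the expected entropy, namely, one that varies with $n$. One crude way to show that the upper bound increases with $n$ is to find the maximum entropy $p_n$ for each $n$. The maximum entropy $p_n$ would be one where each sample lands in a new bin, and for $n \le B$ we have $H(p_n) = -\log(1/n)$. However, this is not a very satisfying upper bound. It may be more fruitful to focus on probabilistic bounds, which may instead take the variance into account.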
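The log-sum step can be checked directly for a single bin by building $a_k$ and $b_k$ from the binomial probabilities. The sketch below uses hypothetical function names and example values and is only meant to illustrate the inequality, not to reproduce any code from the report.

```python
import math


def log_sum_terms(p_i, n):
    """Return the lists a_k and b_k of Section 1.1 for one bin, k = 0..n."""
    b = [math.comb(n, k) * p_i**k * (1 - p_i) ** (n - k) for k in range(n + 1)]
    a = [b_k * k / n for k, b_k in enumerate(b)]
    return a, b


def check_bin(p_i, n):
    """Return both sides of the log-sum inequality for one bin."""
    a, b = log_sum_terms(p_i, n)
    # Left side: sum_k a_k log(a_k / b_k), skipping a_k = 0 (0 log 0 = 0).
    lhs = sum(a_k * math.log(a_k / b_k) for a_k, b_k in zip(a, b) if a_k > 0)
    # Right side: (sum a_k) log(sum a_k / sum b_k), which equals p_i log p_i.
    rhs = sum(a) * math.log(sum(a) / sum(b))
    return lhs, rhs


if __name__ == "__main__":
    for p_i in (0.125, 0.5):  # hypothetical bin probabilities
        for n in (1, 5, 40):
            lhs, rhs = check_bin(p_i, n)
            print(p_i, n, lhs, rhs, p_i * math.log(p_i))
```

The left side is $E[p_{n,i} \log p_{n,i}]$ and should never fall below the right side, which matches $p_i \log p_i$ up to floating point error.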

2 Expected cross entropy

The cross entropy between $q$ and $p$, here denoted as $H(q, p) = -\sum_{i} q_i \log p_i$, can be thought of as the cost in bits of encoding $q$ using a code for $p$. Suppose we have $q_n$, the distribution obtained by taking $n$ samples from $q$. Then what is the expected cross entropy $H(q_n, p)$? Similar to the expected entropy calculation, we seek

\begin{align*}
E[H(q_n, p)] &= E\left[ -\sum_{i=1}^{B} q_{n,i} \log p_i \right] \\
&= -\sum_{i=1}^{B} E\left[ q_{n,i} \log p_i \right] \\
&= -\sum_{i=1}^{B} E\left[ \frac{x_i}{n} \log p_i \right]
\end{align*}

The expected value $E[x_i]$ above is just the expected count for bin $i$ under the true probability distribution $q$. Thus, $E[x_i] = n q_i$ and

\[
-\sum_{i=1}^{B} \frac{1}{n} \log p_i \, E[x_i] = -\sum_{i=1}^{B} q_i \log p_i = H(q, p) \tag{15}
\]

So the expected cross entropy $E[H(q_n, p)]$ is just the true cross entropy $H(q, p)$. Note that $H(p, p) = H(p)$, and in the special case where $q = p$ we have that $E[H(p_n, p)] = H(p)$.

3 Expected KL-divergence upper bound

The KL-divergence between distributions $q$ and $p$ is written as $D(q \| p) = \sum_{i} q_i \log \frac{q_i}{p_i}$. This can be rewritten as

\begin{align}
D(q \| p) &= \sum_{i} q_i \log q_i - \sum_{i} q_i \log p_i \tag{16} \\
&= H(q, p) - H(q) \tag{17}
\end{align}

For all $q$ and $p$ of the same dimension, $D(q \| p) \ge 0$ with equality iff $q = p$. The KL-divergence can be thought of as the additional bits required to encode $q$ using a code for $p$ rather than the code for $q$.

What is the expected KL-divergence of a sampled distribution $p_n$ to the true distribution $p$? Using the results from the previous sections, we have

\begin{align}
E[D(p_n \| p)] &= E[H(p_n, p) - H(p_n)] \notag \\
&= E[H(p_n, p)] - E[H(p_n)] \notag \\
&= H(p) - E[H(p_n)] \notag \\
&\le H(p) + \sum_{i=1}^{B} p_i \log \left( p_i + \frac{1 - p_i}{n} \right) \tag{18}
\end{align}

If we instead write the KL-divergence bound using equation (14) we have

\[
E[D(p_n \| p)] \le H(p) - H(p) + \frac{B - 1}{n} = \frac{B - 1}{n}
\]

For comparing a sampled distribution $q_n$ against $p$, we have

\[
E[D(q_n \| p)] \le H(q, p) - H(q) + \frac{B - 1}{n} = D(q \| p) + \frac{B - 1}{n} \tag{19}
\]

In other words, for a given number of samples, we expect the sampled KL-divergence to be within a certain range of the true KL-divergence, depending on $n$.

We tested this upper bound by taking $n$ samples of $p$ to obtain $p_n$, computing $D(p_n \| p)$, and repeating this many times for each $n$. The average of $D(p_n \| p)$ is an estimate of the expected KL-divergence for $n$. We then computed the upper bound using equation (18) and plotted this as well. The simulation results are shown in figure 2.

[Figure 2: KL-divergence for $p_n$ sampled from $p$ for various $n$, and the upper bound on the expected value. Plot title: "$D(p_n \| p)$ and upper bound". Two panels: $D(p_n \| p)$ versus number of samples, and $\log_{10} D(p_n \| p)$ versus $\log_{10}$(number of samples); each shows the empirical average and the upper bound.]
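The test described above can be sketched along the following lines. This is an illustrative reimplementation under assumed settings (hypothetical example distribution, trial count, and seed), not the code behind the figure; divergences are in nats.

```python
import math
import random
from collections import Counter


def kl_to_true(counts, n, p):
    """D(p_n || p) for the empirical distribution p_n = counts / n."""
    return sum((c / n) * math.log((c / n) / p[i]) for i, c in counts.items())


def simulate_expected_kl(p, n, trials=2000, seed=0):
    """Monte Carlo estimate of E[D(p_n || p)] over `trials` samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        counts = Counter(rng.choices(range(len(p)), weights=p, k=n))
        total += kl_to_true(counts, n, p)
    return total / trials


def kl_bound_18(p, n):
    """Upper bound (18): H(p) + sum_i p_i log(p_i + (1 - p_i) / n)."""
    return sum(pi * (math.log(pi + (1 - pi) / n) - math.log(pi)) for pi in p if pi > 0)


def kl_bound_simple(p, n):
    """Simpler upper bound (B - 1) / n obtained from (14)."""
    return (len(p) - 1) / n


if __name__ == "__main__":
    p = [0.5, 0.25, 0.125, 0.125]  # hypothetical example distribution
    for n in (10, 50, 200, 1000):
        print(n, simulate_expected_kl(p, n), kl_bound_18(p, n), kl_bound_simple(p, n))
```

For moderate $n$ the simulated average should sit below the bound from (18), which in turn sits below $(B - 1)/n$.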