Maximum Likelihood Estimation of Functionals of Discrete Distributions


Jiantao Jiao, Student Member, IEEE, Kartik Venkat, Student Member, IEEE, Yanjun Han, Student Member, IEEE, and Tsachy Weissman, Fellow, IEEE

arXiv: v7 [cs.IT] 0 Aug 07

Abstract: We consider the problem of estimating functionals of discrete distributions, and focus on tight (up to universal multiplicative constants for each specific functional) nonasymptotic analysis of the worst case squared error risk of widely used estimators. We apply concentration inequalities to analyze the random fluctuation of these estimators around their expectations, and the theory of approximation using positive linear operators to analyze the deviation of their expectations from the true functional, namely their bias. We explicitly characterize the worst case squared error risk incurred by the Maximum Likelihood Estimator (MLE) in estimating the Shannon entropy $H(P) = \sum_{i=1}^{S} -p_i \ln p_i$, and the power sum $F_\alpha(P) = \sum_{i=1}^{S} p_i^\alpha$, $\alpha > 0$, up to universal multiplicative constants for each fixed functional, for any alphabet size $S \le \infty$ and sample size $n$ for which the risk may vanish. As a corollary, for Shannon entropy estimation, we show that it is necessary and sufficient to have $n \gg S$ observations for the MLE to be consistent. In addition, we establish that it is necessary and sufficient to consider $n \gg S^{1/\alpha}$ samples for the MLE to consistently estimate $F_\alpha(P)$, $0 < \alpha < 1$. The minimax rate-optimal estimators for both problems require $S/\ln S$ and $S^{1/\alpha}/\ln S$ samples, which implies that the MLE has a strictly sub-optimal sample complexity. When $1 < \alpha < 3/2$, we show that the worst-case squared error rate of convergence for the MLE is $n^{-2(\alpha-1)}$ for infinite alphabet size, while the minimax squared error rate is $(n \ln n)^{-2(\alpha-1)}$. When $\alpha \ge 3/2$, the MLE achieves the minimax optimal rate $n^{-1}$ regardless of the alphabet size. As an application of the general theory, we analyze the Dirichlet prior smoothing techniques for Shannon entropy estimation.
In this context, one approach is to plug the Dirichlet prior smoothed distribution into the entropy functional, while the other one is to calculate the Bayes estimator for entropy under the Dirichlet prior for squared error, which is the conditional expectation. We show that in general such estimators do not improve over the maximum likelihood estimator. No matter how we tune the parameters in the Dirichlet prior, this approach cannot achieve the minimax rates in entropy estimation. The performance of the minimax rate-optimal estimator with $n$ samples is essentially at least as good as that of Dirichlet smoothed entropy estimators with $n \ln n$ samples.

Manuscript received Month 00, 0000; revised Month 00, 0000; accepted Month 00, 0000. Date of current version Month 00, 0000. This work was supported in part by the Center for Science of Information (CSoI) under grant agreement CCF. Materials of this paper were presented in part at the 2015 International Symposium on Information Theory, Hong Kong, China. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

Jiantao Jiao, Yanjun Han, and Tsachy Weissman are with the Department of Electrical Engineering, Stanford University, CA, USA. Email: {jiantao, yjhan, tsachy}@stanford.edu. Kartik Venkat was with the Department of Electrical Engineering, Stanford University. He is currently with PDT Partners. Email: kvenkat@alumni.stanford.edu.

Index Terms: entropy estimation, maximum likelihood estimator, Dirichlet prior smoothing, approximation theory, high dimensional statistics, Rényi entropy, approximation using positive linear operators

I. INTRODUCTION

Entropy and related information measures arise in information theory, statistics, machine learning, biology, neuroscience, image processing, linguistics, secrecy, ecology, physics, and finance, among other fields. Numerous inferential tasks rely on data driven procedures to estimate these quantities (see, e.g. [] [6]).
We focus on two concrete and well-motivated examples of information measures, namely the Shannon entropy [7]

$$H(P) \triangleq \sum_{i=1}^{S} -p_i \ln p_i, \tag{1}$$

and the power sum $F_\alpha(P)$, $\alpha > 0$:

$$F_\alpha(P) \triangleq \sum_{i=1}^{S} p_i^\alpha, \quad \alpha > 0. \tag{2}$$

The power sum functional $F_\alpha(P)$ often emerges in various operational problems [8]. It is also connected to the Rényi entropy [9] $H_\alpha(P)$ via the formula $H_\alpha(P) = \frac{\ln F_\alpha(P)}{1-\alpha}$.

Consider estimating the Shannon entropy $H(P)$ based on $n$ i.i.d. samples following an unknown discrete distribution $P$ with unknown alphabet size $S$. This problem has a rich history, with extensive study in fields ranging from information theory and statistics to neuroscience, physics, psychology, and medicine. We refer the reader to [0] for a review. One of the most widely used estimators for this purpose is the Maximum Likelihood Estimator (MLE), which is simply the empirical entropy. The empirical entropy is an instantiation of the plug-in principle in functional estimation, where a point estimate of the parameter (the distribution $P$ in this case) is used to construct an estimator for a functional of the parameter. The idea of using the MLE for estimating information measures of interest (in this case entropy) is not only intuitive, but has sound justification: asymptotic efficiency. The beautiful theory of Hájek and Le Cam [] [3] shows that, as the number of observed samples grows without bound while the finite parameter dimension (e.g., alphabet size) remains fixed, the MLE performs optimally in estimating any differentiable functional when the statistical model complies with the benign LAN (Local Asymptotic Normality) condition [3]. Thus, for finite dimensional problems, the problems of parameter and functional estimation are well understood in an asymptotic sense, and the MLE appears to be not only natural but also theoretically justified.

But does it make sense to employ the MLE to estimate the entropy in most practical applications? As it turns out, while asymptotically optimal in entropy estimation, the MLE is by no means sacrosanct in many real applications, especially in regimes where the alphabet size is comparable to, or even larger than, the number of observations. It was shown that the MLE for entropy is strictly sub-optimal in the large alphabet regime [4], [5]. Therefore, classical asymptotic theory does not satisfactorily address high dimensional settings, which are becoming increasingly important in the modern era of high dimensional statistics.

There has been a wave of recent research activity focusing on analyzing existing approaches to functional estimation, as well as proposing new estimators that are provably near optimal in the large alphabet regime. Paninski [4] showed that the MLE needs $n \gg S$ samples to consistently estimate the Shannon entropy, and Paninski [5] established the existence of a non-explicit estimator that only requires a number of samples sublinear in $S$. This implies that the MLE is strictly sub-optimal in terms of sample complexity. It was Valiant and Valiant [6] who first explicitly constructed a linear programming based estimator (later modified in [7]) that achieves consistency in entropy estimation with $n \gg S/\ln S$ samples, which they also proved to be necessary. Valiant and Valiant [8] constructed another approximation based estimator that achieved better theoretical properties than the linear programming ones, but which was not yet shown to be minimax rate-optimal for all ranges of $S$ and $n$. The authors [0] constructed the first minimax rate-optimal estimators for $H(P)$ and $F_\alpha(P)$, $\alpha > 0$, based on best polynomial approximation, which are agnostic to the alphabet size $S$.
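The plug-in principle described above can be sketched in a few lines of Python. This is a minimal illustration, not code from any released package: the function names `empirical_distribution`, `mle_entropy`, and `mle_power_sum` are hypothetical, and the samples are assumed to be given as a non-empty list of hashable symbols.

```python
import math
from collections import Counter


def empirical_distribution(samples):
    """Return the empirical distribution P_n as a dict symbol -> frequency."""
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")
    return {symbol: count / n for symbol, count in Counter(samples).items()}


def mle_entropy(samples):
    """Plug-in (MLE) estimate H(P_n) of the Shannon entropy, in nats."""
    p_hat = empirical_distribution(samples)
    # Unobserved symbols have empirical mass 0 and contribute nothing.
    return -sum(p * math.log(p) for p in p_hat.values())


def mle_power_sum(samples, alpha):
    """Plug-in (MLE) estimate F_alpha(P_n) of the power sum, alpha > 0."""
    p_hat = empirical_distribution(samples)
    return sum(p ** alpha for p in p_hat.values())


if __name__ == "__main__":
    data = ["a", "a", "b", "c"]
    print(mle_entropy(data))         # empirical entropy of (1/2, 1/4, 1/4)
    print(mle_power_sum(data, 2.0))  # sum of squared empirical frequencies
```

Note that neither function needs the alphabet size $S$: symbols that never appear in the sample simply contribute zero.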
Utilizing the released MATLAB and Python packages of the estimators in [0], [9], [0] demonstrated that these minimax rate-optimal estimators can lead to significant performance boosts in various machine learning tasks. Wu and Yang [] independently applied the best polynomial approximation idea to entropy estimation and obtained the minimax rates. However, their estimator requires knowledge of the alphabet size $S$. The approximation ideas proved to be very fruitful in Acharya et al. [], Wu and Yang [3], Han, Jiao, and Weissman [4], Jiao, Han, and Weissman [5], Bu et al. [6], Orlitsky, Suresh, and Wu [7], and Wu and Yang [8].

The main contribution of this paper is an explicit characterization of the worst case squared error risk of estimating $H(P)$ and $F_\alpha(P)$ using the MLE, up to a universal multiplicative constant for each specific functional, for all ranges of $S$ and $n$ in which the risk may vanish. Understanding the benefits and limitations of the MLE in a nonasymptotic setting serves two key purposes. First, the approach is a natural benchmark for comparing other more nuanced procedures for estimation of functionals. Second, performance analysis for the MLE reveals regimes where the problem is difficult, and motivates the development of improvements, which have been validated in [0], [4] [8], [], [].

As a byproduct of the analysis, we explicitly point out an equivalence between bias analysis of functional estimators using plug-in rules and approximation theory using positive linear operators. We believe these powerful tools from approximation theory may have far reaching impact in various applications in the information theory community.

We mention that there exist numerous other approaches proposed in various disciplines to estimate entropy, many of which are difficult to analyze theoretically. Among them we mention the Miller-Madow bias-corrected estimator and its variants [9] [3], the jackknife estimator [3], the shrinkage estimator [33], the coverage adjusted estimator [34], the Best Upper Bound (BUB) estimator [4], the B-Splines estimator [35], and [36], [37] etc.
For a Bayesian statistician, a natural approach is to first impose a prior on the unknown discrete distribution before estimating entropy. The Dirichlet prior, being the conjugate prior to the multinomial distribution, appears to be particularly popular in the Bayesian approach to entropy estimation. Dirichlet smoothing may have two connotations in the context of entropy estimation:

- [38], [39] One first obtains a Bayes estimate for the discrete distribution $P$, which we denote by $\hat{P}_B$, and then plugs it into the entropy functional to obtain the entropy estimate $H(\hat{P}_B)$.
- [40] [4] One calculates the Bayes estimate for entropy $H(P)$ under the Dirichlet prior for squared error. The estimator is the conditional expectation $E[H(P) \mid X]$, where $X$ represents the samples.

Nemenman, Shafee, and Bialek [4] argued in an intuitive way why the Dirichlet prior is bad for entropy estimation and proposed to use mixtures of Dirichlet priors. Archer, Park, and Pillow [43] have come up with priors that perform better than the Dirichlet prior. Also see [44], [45].

Another contribution of this paper is an explicit characterization of the worst case squared error risk of estimating $H(P)$ using the Dirichlet prior plug-in approach, up to a universal multiplicative constant, for all ranges of $S$ and $n$ in which the risk may vanish. We show rigorously that neither of the two approaches utilizing the Dirichlet prior results in improvements over the MLE in the large alphabet regime. Specifically, these approaches require at least $n \gg S$ to be consistent, while the minimax rate-optimal estimators such as the ones in [0] [] only need $n \gg S/\ln S$ to achieve consistency.

The rest of the paper is organized as follows. We present the main results in Section III, discuss the fundamental ideas behind the proofs in Section IV, and detail the proofs in Section V and VI. Proofs of auxiliary lemmas are deferred to the appendices.

II. PRELIMINARIES

The Dirichlet distribution of order $S$ with parameters $\alpha_1, \ldots, \alpha_S > 0$ has a probability density function with respect to Lebesgue measure on the Euclidean space $\mathbb{R}^{S-1}$ given by

$$f(x_1, \ldots, x_S; \alpha_1, \ldots, \alpha_S) = \frac{1}{B(\alpha)} \prod_{i=1}^{S} x_i^{\alpha_i - 1} \tag{3}$$

on the open $(S-1)$-dimensional simplex defined by

$$x_1, \ldots, x_{S-1} > 0, \tag{4}$$
$$x_1 + \cdots + x_{S-1} < 1, \tag{5}$$
$$x_S = 1 - x_1 - \cdots - x_{S-1}, \tag{6}$$

and zero elsewhere. The normalizing constant is the multinomial Beta function, which can be expressed in terms of the Gamma function:

$$B(\alpha) = \frac{\prod_{i=1}^{S} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{S} \alpha_i\right)}, \quad \alpha = (\alpha_1, \ldots, \alpha_S). \tag{7}$$

Assume the unknown discrete distribution $P$ follows the prior distribution $P \sim \mathrm{Dir}(\alpha)$, and we observe a vector $X = (X_1, X_2, \ldots, X_S)$ with multinomial distribution $\mathrm{multi}(n; p_1, p_2, \ldots, p_S)$. Then one can show that the posterior distribution $P_{P \mid X}$ is also a Dirichlet distribution with parameters

$$\alpha + X = (\alpha_1 + X_1, \alpha_2 + X_2, \ldots, \alpha_S + X_S). \tag{8}$$

Furthermore, the posterior mean (conditional expectation) of $p_i$ given $X$ is given by [46, Example 5.4.4]

$$\delta_i(X) \triangleq E[p_i \mid X] = \frac{\alpha_i + X_i}{n + \sum_{i=1}^{S} \alpha_i}. \tag{9}$$

The estimator $\delta_i(X)$ is widely used in practice for various choices of $\alpha$. For example, if $\alpha_i = \sqrt{n}/S$, then the corresponding $(\delta_1(X), \delta_2(X), \ldots, \delta_S(X))$ is the minimax estimator for $P$ under squared loss [46, Example 5.4.5]. However, it is no longer minimax under other loss functions such as $\ell_1$ loss, which was investigated in [47].

Note that the estimator $\delta_i(X)$ subsumes the MLE $\hat{p}_i = X_i/n$ as a special case, since we can take the limit $\alpha \to 0$ for $\delta_i(X)$ to obtain the MLE. We denote the empirical distribution by $P_n = (\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_S)$. The Dirichlet prior smoothed distribution estimate is denoted as $\hat{P}_B$, where

$$\hat{P}_B = \frac{n}{n + \sum_{i=1}^{S} \alpha_i} P_n + \frac{\sum_{i=1}^{S} \alpha_i}{n + \sum_{i=1}^{S} \alpha_i} \cdot \frac{\alpha}{\sum_{i=1}^{S} \alpha_i}. \tag{10}$$

Note that the smoothed distribution $\hat{P}_B$ can be viewed as a convex combination of the empirical distribution $P_n$ and the prior distribution $\alpha / \sum_{i=1}^{S} \alpha_i$. We call the estimator $H(\hat{P}_B)$ the Dirichlet prior smoothed plug-in estimator.

Another way to apply the Dirichlet prior in entropy estimation is to compute the Bayes estimator for $H(P)$ under squared error, given that $P$ follows the Dirichlet prior. It is well known that the Bayes estimator under squared error is the conditional expectation. It was shown in Wolpert and Wolf [40] that

$$\hat{H}^{\mathrm{Bayes}} \triangleq E[H(P) \mid X] = \psi\left(\sum_{i=1}^{S} (\alpha_i + X_i) + 1\right) - \sum_{i=1}^{S} \frac{\alpha_i + X_i}{\sum_{i=1}^{S} (\alpha_i + X_i)} \psi(\alpha_i + X_i + 1), \tag{11}$$

where $\psi(z) \triangleq \Gamma'(z)/\Gamma(z)$ is the digamma function. We call the estimator $\hat{H}^{\mathrm{Bayes}}$ the Bayes estimator under Dirichlet prior.

Throughout this paper, we observe $n$ i.i.d.
samples from an unknown discrete distribution $P = (p_1, p_2, \ldots, p_S)$. We denote the samples as i.i.d. random variables $\{Z_i\}_{1 \le i \le n}$ taking values in $\mathcal{Z} = \{1, 2, \ldots, S\}$ with probabilities $(p_1, p_2, \ldots, p_S)$. Defining

$$X_i \triangleq \sum_{j=1}^{n} \mathbb{1}(Z_j = i), \quad 1 \le i \le S, \tag{12}$$

we know that $(X_1, X_2, \ldots, X_S)$ follows a multinomial distribution with parameter $(n; p_1, p_2, \ldots, p_S)$. Denote

$$h_j \triangleq \sum_{i=1}^{S} \mathbb{1}(X_i = j), \quad 0 \le j \le n.$$

The Maximum Likelihood Estimators (MLE) for $H(P)$ and $F_\alpha(P)$ are defined, respectively, as $H(P_n)$ and $F_\alpha(P_n)$, with $P_n$ being the empirical distribution. We assume the functional $F(P)$ takes the form

$$F(P) = \sum_{i=1}^{S} f(p_i). \tag{13}$$

Then it is evident that the MLE $F(P_n)$ for estimating the functional $F(P)$ in (13) can be alternatively represented as the following linear function of $h_0, h_1, \ldots, h_n$:

$$F(P_n) = \sum_{j=0}^{n} f\left(\frac{j}{n}\right) h_j. \tag{14}$$

Recall that the risk function under squared error for any estimator $\hat{F}$ in estimating the functional $F(P)$ may be decomposed as

$$E_P\left(F(P) - \hat{F}\right)^2 = \left(E_P \hat{F} - F(P)\right)^2 + E_P\left(\hat{F} - E_P \hat{F}\right)^2, \tag{15}$$

where $(E_P \hat{F} - F(P))^2$ represents the squared bias, and $E_P(\hat{F} - E_P \hat{F})^2$ represents the variance. The subscript $P$ means that the expectation is taken with respect to the distribution $P$ that generates the i.i.d. observations. We omit the subscript of the expectation operator $E$ if the meaning of the expectation is clear from the context.

Notation: $a \wedge b$ denotes $\min\{a, b\}$, and $a \vee b$ denotes $\max\{a, b\}$. For two non-negative series $\{a_n\}, \{b_n\}$, the notation $a_n \lesssim b_n$ means that there exists a positive universal constant $C < \infty$ such that $a_n / b_n \le C$ for all $n$. The notation $a_n \asymp b_n$ is equivalent to $a_n \lesssim b_n$ and $b_n \lesssim a_n$. The notation $a_n \gg b_n$ means that $\liminf_{n \to \infty} a_n / b_n = \infty$. Throughout this paper, these notations involve absolute constants that may only depend on $\alpha$ but not on $S$ or $n$. We denote by $\mathcal{M}_S$ the space of discrete distributions with alphabet size $S$.

III. MAIN RESULTS

A. Estimating $F_\alpha(P)$

We split the upper bounds and the lower bounds into two theorems, and present their succinct summaries in Corollary 1 and 2.

Theorem 1 (Upper bounds). We have the following upper bounds on the worst case squared error risk $\sup_{P \in \mathcal{M}_S} E_P\left(F_\alpha(P_n) - F_\alpha(P)\right)^2$ of the MLE in estimating $F_\alpha(P)$:

1) $\alpha \ge 2$: sup E P F α P F α P αα α 4. (16)
2) $1 < \alpha < 2$: sup E P F α P F α P 4 3S α/ 5S C α α/ α, α 4, (17)
where C α, ω ϕ xα, / > 0 satisfies $\limsup_{n \to \infty} C_{\alpha,n} < \infty$ for $1 < \alpha < 2$, and $\omega_\varphi^2$ is the second-order Ditzian-Totik modulus of smoothness introduced in Section IV-B.
3) $1/2 \le \alpha < 1$: sup E P F α P F α P 3S α/ 5S α/ α 0S α 0 S α α α (18)
4) $0 < \alpha < 1/2$: sup E P F α P F α P 3S α/ 5S α/ α 0S 0 α α S α α. (19)

Moreover, in all the bounds presented above, the first term bounds the square of the bias, and the second term bounds the variance.

Theorem 2 (Lower bounds). We have the following lower bounds on the worst case squared error risk of the MLE in estimating $F_\alpha(P)$:

1) $\alpha \ge 3/2$: there exists a constant $c_\alpha > 0$ such that for all $n$,
$$\sup_{P \in \mathcal{M}_S} E_P\left(F_\alpha(P_n) - F_\alpha(P)\right)^2 \ge \frac{c_\alpha}{n}. \tag{20}$$
2) $1 < \alpha < 3/2$: if $S = cn$, for any $c > 0$, then
$$\liminf_{n \to \infty} n^{2(\alpha - 1)} \sup_{P \in \mathcal{M}_S} E_P\left(F_\alpha(P_n) - F_\alpha(P)\right)^2 > 0. \tag{21}$$
3) $1/2 \le \alpha < 1$: if S, then sup E P F α P F α P α α 7 α S α 64e α 4 [ S α α S α α] e /4 S α, (22)
4) $0 < \alpha < 1/2$: if S, then sup E P F α P F α P α α 36 α S. (23)

There are several interesting implications of this result, highlighted in the following corollaries.

Corollary 1. For any fixed $\alpha > 1$, there exist universal convergence rates for $F_\alpha(P_n)$:

$$\sup_{S \in \mathbb{N}} \sup_{P \in \mathcal{M}_S} E_P\left(F_\alpha(P_n) - F_\alpha(P)\right)^2 \asymp \begin{cases} n^{-2(\alpha - 1)}, & 1 < \alpha < 3/2, \\ n^{-1}, & \alpha \ge 3/2. \end{cases} \tag{24}$$

Corollary 1 implies that, when $\alpha \ge 3/2$, estimation of $F_\alpha(P)$ is extremely simple in terms of convergence rate: plug-in estimation achieves the best possible rate $n^{-1}$, as shown in the theory of regular statistical experiments of classical asymptotic theory, see [48, Chap..7.]. Results of this form have appeared in the literature. For example, Antos and Kontoyiannis [49] showed that it suffices to take $n \gg 1$ samples to consistently estimate $F_\alpha(P)$, $\alpha \ge 2$, $\alpha \in \mathbb{Z}$. However, when $1 < \alpha < 3/2$, the rate $n^{-2(\alpha - 1)}$ is considerably slower. Interestingly, there exist estimators that demonstrate better convergence rates for estimating $F_\alpha(P)$, $1 < \alpha < 3/2$. Jiao et al. [0] showed that the minimax rate in estimating $F_\alpha(P)$, $1 < \alpha < 3/2$, is $(n \ln n)^{-2(\alpha - 1)}$ as long as $S \gtrsim n \ln n$, which is achieved using the general methodology developed therein for constructing minimax rate-optimal estimators for nonsmooth functionals.
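Let us now examine the case $0 < \alpha < 1$, another interesting regime that has not been characterized before. In this regime, we observe a significant increase in the difficulty of the estimation problem. In particular, the relative scaling between the number of observations $n$ and the alphabet size $S$ for consistent estimation of $F_\alpha(P)$ exhibits a phase transition, encapsulated in the following.

Corollary 2. Fix $\alpha \in (0, 1)$. The worst case squared error risk of the MLE $F_\alpha(P_n)$ in estimating $F_\alpha(P)$ is characterized as follows when $n \gtrsim S$:

$$\sup_{P \in \mathcal{M}_S} E_P\left(F_\alpha(P_n) - F_\alpha(P)\right)^2 \asymp \begin{cases} \dfrac{S^2}{n^{2\alpha}} + \dfrac{S^{2 - 2\alpha}}{n}, & 1/2 < \alpha < 1, \\[2mm] \dfrac{S^2}{n^{2\alpha}}, & 0 < \alpha \le 1/2. \end{cases} \tag{25}$$

Corollary 2 follows directly from Theorem 1 and Theorem 2. In particular, it implies that it is necessary and sufficient to take $n \gg S^{1/\alpha}$ samples to consistently estimate $F_\alpha(P)$, $0 < \alpha < 1$, using the MLE. Thus, as one might expect, the number of measurements required for consistent estimation increases as $\alpha$ decreases. When $\alpha \to 0$, the number of samples required for the MLE grows super-polynomially in $S$, which is consistent with the intuition that $F_\alpha(P)$, $\alpha \to 0$, is essentially equivalent to the alphabet size of a distribution, whose estimation is known to be very hard when there may exist symbols with very small probabilities [50]. We exhibit some of our findings by plotting the value of $\ln n / \ln S$ required for consistent estimation of $F_\alpha(P)$ using the MLE $F_\alpha(P_n)$, as a function of $\alpha$, in Figure 1.

[Figure 1: plot of $\ln n / \ln S$ (vertical axis) against $\alpha$ (horizontal axis) with the thick curve $1/\alpha$; the region above the curve is labeled "achievable via MLE (Theorem 1)" and the region below is labeled "not achievable via MLE (Theorem 2)".]

Fig. 1: For any fixed point above the thick curve, consistent estimation of $F_\alpha(P)$ is achieved using the MLE $F_\alpha(P_n)$, as shown in Theorem 1. For any fixed point below the thick curve in the regime $0 < \alpha < 1$, Theorem 2 shows that the MLE does not have vanishing maximum squared error risk.

It turns out that one can construct estimators that are better than the MLE in terms of required sample complexity for consistent estimation in the regime $0 < \alpha < 1$. Indeed, Jiao et al. [0] showed that the minimax rate-optimal estimator requires $n \gg S^{1/\alpha}/\ln S$ samples to achieve consistency, which attains a logarithmic improvement in the sample complexity over the MLE.

B. Estimating $H(P)$

We consider not only $H(P_n)$, but also the so-called Miller-Madow bias-corrected estimator [9], defined as

$$H^{\mathrm{MM}}(P_n) = H(P_n) + \frac{S - 1}{2n}. \tag{26}$$

Theorem 3. The worst case squared error risk of $H(P_n)$ admits the following upper bound for all $S, n$:

$$\sup_{P \in \mathcal{M}_S} E_P\left(H(P_n) - H(P)\right)^2 \le \left(\ln\left(1 + \frac{S - 1}{n}\right)\right)^2 + \frac{(\ln n)^2}{n} \wedge \frac{2(\ln S + 3)^2}{n}. \tag{27}$$

If $n \ge 15S$, then

$$\sup_{P \in \mathcal{M}_S} E_P\left(H(P_n) - H(P)\right)^2 \ge \frac{1}{2}\left(\frac{S - 1}{2n} + \frac{S^2}{20 n^2} - \frac{1}{12 n^2}\right)^2 + c\,\frac{\ln^2 S}{n}. \tag{28}$$

Moreover, if $n \ge 15S$, the Miller-Madow bias-corrected estimator satisfies

$$\sup_{P \in \mathcal{M}_S} E_P\left(H^{\mathrm{MM}}(P_n) - H(P)\right)^2 \ge \frac{1}{2}\left(\frac{S^2}{20 n^2} - \frac{1}{12 n^2}\right)^2 + c\,\frac{\ln^2 S}{n}, \tag{29}$$

where the positive constant $c > 0$ in both expressions does not depend on $S$ or $n$.

Theorem 3 implies the following corollary.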
Corollary 3. The worst case squared error risk of the MLE $H(P_n)$ in estimating $H(P)$ is characterized as follows when $n \ge 15S$:

$$\sup_{P \in \mathcal{M}_S} E_P\left(H(P_n) - H(P)\right)^2 \asymp \frac{S^2}{n^2} + \frac{\ln^2 S}{n}. \tag{30}$$

Here the first term corresponds to the squared bias, and the second term corresponds to the variance.

Paninski [4] showed that if $n = cS$, where $c > 0$ is a constant, the maximum squared error risk of $H(P_n)$, and of the Miller-Madow bias-corrected estimator $H^{\mathrm{MM}}(P_n)$, would be bounded away from zero. Paninski [4] also showed that when $n \gg S$, the MLE is consistent for estimating entropy. Corollary 3 implies that it is necessary and sufficient to take $n \gg S$ samples for the MLE to be consistent for estimating entropy. Comparing the results for $H(P)$ with those for $F_\alpha(P)$, we see that the intuition that $H(P)$ can be viewed as close to $F_\alpha(P)$ when $\alpha \to 1$ is indeed approximately correct, as $H(P)$ coincides with $\alpha = 1$ on the phase transition curve shown in Figure 1.

Table I summarizes the minimax squared error rates and the worst case squared error rates of the MLE in estimating $H(P)$ and $F_\alpha(P)$, $\alpha > 0$. It is clear that the MLE cannot achieve the minimax rates for estimation of $H(P)$, and of $F_\alpha(P)$ when $0 < \alpha < 3/2$. In these cases, there exist strictly better estimators whose performance with $n$ samples is roughly the same as that of the MLE with $n \ln n$ samples. This phenomenon was termed effective sample size enlargement in [0].

Functional | Minimax squared error rates | Maximum squared error rates of MLE
$H(P)$ | $\frac{S^2}{(n \ln n)^2} + \frac{\ln^2 S}{n}$ ($n \gtrsim S/\ln S$) [0], [6], [8], [] | $\frac{S^2}{n^2} + \frac{\ln^2 S}{n}$ ($n \gtrsim S$), Corollary 3
$F_\alpha(P)$, $0 < \alpha \le 1/2$ | $\frac{S^2}{(n \ln n)^{2\alpha}}$ ($n \gtrsim S^{1/\alpha}/\ln S$, $\ln n \lesssim \ln S$) [0] | $\frac{S^2}{n^{2\alpha}}$ ($n \gtrsim S^{1/\alpha}$), Corollary 2
$F_\alpha(P)$, $1/2 < \alpha < 1$ | $\frac{S^2}{(n \ln n)^{2\alpha}} + \frac{S^{2 - 2\alpha}}{n}$ ($n \gtrsim S^{1/\alpha}/\ln S$) [0] | $\frac{S^2}{n^{2\alpha}} + \frac{S^{2 - 2\alpha}}{n}$ ($n \gtrsim S^{1/\alpha}$), Corollary 2
$F_\alpha(P)$, $1 < \alpha < 3/2$ | $(n \ln n)^{-2(\alpha - 1)}$ ($S \gtrsim n \ln n$) [0] | $n^{-2(\alpha - 1)}$ ($S \gtrsim n$), Corollary 1
$F_\alpha(P)$, $\alpha \ge 3/2$ | $n^{-1}$ | $n^{-1}$, Theorem 1

TABLE I: Summary of results in this paper and the companion [0]

C. Dirichlet prior techniques applied to entropy estimation

For symmetry, we restrict attention to the case where the parameter $\alpha$ in the Dirichlet distribution takes the form $(a, a, \ldots, a)$. In comparison to the MLE $H(P_n)$, where $P_n$ is the empirical distribution, the Dirichlet smoothing scheme $H(\hat{P}_B)$ has a disadvantage: it requires knowledge of the alphabet size $S$ in general. We define

$$\hat{p}_{B,i} = \frac{n \hat{p}_i + a}{n + Sa}, \tag{31}$$

and

$$p_{B,i} = E[\hat{p}_{B,i}] = \frac{n p_i + a}{n + Sa}. \tag{32}$$

It is clear that

$$\hat{P}_B = \frac{n}{n + Sa} P_n + \frac{Sa}{n + Sa} U_S, \tag{33}$$
$$P_B = \frac{n}{n + Sa} P + \frac{Sa}{n + Sa} U_S, \tag{34}$$

where $P_n$ stands for the empirical distribution, $P$ is the true distribution, and $U_S$ denotes the uniform distribution on the same alphabet with size $S$.

Theorem 4. If $n \ge \max\{Sa, ea\}$, then the maximum squared error risk of $H(\hat{P}_B)$ in estimating $H(P)$ is upper bounded as

sup E P HˆP B HP l S Sa Sa [ 3l Sa Sa Sa l a ] Sa a S. (35)

Here the first term bounds the squared bias, and the second term bounds the variance.

Theorem 5. If $n \ge \max\{15S, Sa, ea\}$, then the maximum $L_2$ risk of $H(\hat{P}_B)$ in estimating $H(P)$ is lower bounded as

sup E P HˆP B HP [ S a Sa 4Sa l S a 8 S 80 ] 48 c l S, (36)

where $c > 0$ is a universal constant that does not depend on $a$, $S$, or $n$. If $n < Sa$, then we have

S sup E P HˆP B HP l S. S (37)

If $n < ea$, then we have

S sup E P HˆP B HP l S. S e (38)

If < 5S, ea, then we have

sup E P HˆP B HP [ S a Sa 4Sa l /5 ], a

where $\lfloor x \rfloor$ is the largest integer that does not exceed $x$, and $(x)_+ = \max\{x, 0\}$ represents the positive part of $x$.

The following corollary immediately follows from Theorem 4 and Theorem 5.

Corollary 4. If $n \gg S$ and $a$ is upper bounded by a constant, then the maximum squared error risk of $H(\hat{P}_B)$ vanishes.
Conversely, if $n \lesssim S$, then the maximum squared error risk of $H(\hat{P}_B)$ is bounded away from zero.

The next theorem presents a lower bound on the maximum risk of the Bayes estimator under Dirichlet prior. Since we have assumed that all $\alpha_i = a$, $1 \le i \le S$, the Bayes estimator under Dirichlet prior is

$$\hat{H}^{\mathrm{Bayes}} = \psi(Sa + n + 1) - \sum_{i=1}^{S} \frac{a + X_i}{Sa + n} \psi(a + X_i + 1). \tag{40}$$
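The two Dirichlet-based estimators of this subsection, the smoothed plug-in estimator built from (31) and the Bayes estimator (40), can be sketched as follows for the symmetric prior $(a, \ldots, a)$. This is a minimal sketch under stated assumptions: the function names are hypothetical, the input is the count vector $(X_1, \ldots, X_S)$ over a known alphabet, and the digamma function is approximated by the standard recurrence $\psi(x) = \psi(x+1) - 1/x$ followed by its asymptotic series, rather than taken from a numerical library.

```python
import math


def digamma(x):
    """Approximate psi(x) for x > 0 (recurrence, then asymptotic series)."""
    if x <= 0:
        raise ValueError("this sketch only handles x > 0")
    result = 0.0
    while x < 6.0:                      # shift the argument upward
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 * (1 / 252
             - inv2 * (1 / 240 - inv2 / 132))))
    return result + math.log(x) - 0.5 / x - series


def dirichlet_plugin_entropy(counts, a):
    """H(P_B_hat): plug the smoothed distribution (31) into the entropy."""
    n, S = sum(counts), len(counts)
    total = n + S * a
    probs = [(x + a) / total for x in counts]
    return -sum(p * math.log(p) for p in probs if p > 0)


def dirichlet_bayes_entropy(counts, a):
    """Posterior mean of H(P) under a symmetric Dirichlet prior, as in (40)."""
    n, S = sum(counts), len(counts)
    total = n + S * a                   # assumed to be strictly positive
    return digamma(total + 1) - sum(
        (x + a) / total * digamma(x + a + 1) for x in counts
    )


if __name__ == "__main__":
    counts = [5, 3, 0, 0]               # S = 4 symbols, n = 8 samples
    print(dirichlet_plugin_entropy(counts, a=0.5))
    print(dirichlet_bayes_entropy(counts, a=0.5))
```

Both functions take the alphabet size from the length of `counts`, which reflects the remark above that Dirichlet smoothing requires knowledge of $S$. Setting `a=0` in `dirichlet_plugin_entropy` recovers the MLE.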

Theorem 6. If S, then

ĤBayes SaS/ sup E P HP l Sae γ, (41)

where $\gamma$ is the Euler-Mascheroni constant.

Evident from Theorem 4, 5, and 6 is the fact that in the best situation (i.e., $a$ not too large), both the Dirichlet prior smoothed plug-in estimator and the Bayes estimator under Dirichlet prior still require at least $n \gg S$ samples to be consistent, which is the same as the MLE. In contrast, the estimators in Valiant and Valiant [6] [8], Jiao et al. [0], and Wu and Yang [] are consistent if $n \gg S/\ln S$, which is the optimal sample complexity. Thus, we can conclude that the Dirichlet smoothing technique does not solve the entropy estimation problem.

IV. FUNDAMENTAL IDEAS OF OUR ANALYSIS

In this section, we discuss the fundamental tools we employed to obtain the results in Section III, as well as general recipes we suggest for analyzing the performance of functional estimators.

A. Variance

The variance characterizes the degree to which the random variable $F(\hat{P})$ fluctuates around its expectation, and the field of concentration inequalities is a natural fit for obtaining the desired results. For all the functionals we consider, it turns out that the Efron-Stein inequality [5] and the bounded differences inequality give very tight bounds. For completeness we state them below.

Lemma 1. [5, Efron-Stein inequality, Theorem 3.1] Let $Z_1, \ldots, Z_n$ be independent random variables and let $f(Z_1, Z_2, \ldots, Z_n)$ be a square integrable function. Moreover, if $Z_1', Z_2', \ldots, Z_n'$ are independent copies of $Z_1, Z_2, \ldots, Z_n$ and if we define, for every $i = 1, 2, \ldots, n$,

$$f_i' = f(Z_1, Z_2, \ldots, Z_{i-1}, Z_i', Z_{i+1}, \ldots, Z_n), \tag{42}$$

then

$$\mathrm{Var}(f) \le \frac{1}{2} \sum_{i=1}^{n} E\left[(f - f_i')^2\right]. \tag{43}$$

The following inequality, which is called the bounded differences inequality, is a useful corollary of the Efron-Stein inequality.

Lemma 2. [5, Bounded differences inequality, Corollary 3.2] If a function $f: \mathcal{Z}^n \to \mathbb{R}$ has the bounded differences property, i.e., for some nonnegative constants $c_1, c_2, \ldots, c_n$,

$$\sup_{z_1, \ldots, z_n, z_i' \in \mathcal{Z}} \left| f(z_1, \ldots, z_n) - f(z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_n) \right| \le c_i \tag{44}$$

for every $1 \le i \le n$, then

$$\mathrm{Var}\left(f(Z_1, Z_2, \ldots, Z_n)\right) \le \frac{1}{4} \sum_{i=1}^{n} c_i^2, \tag{45}$$

given that $Z_1, Z_2, \ldots, Z_n$ are independent random variables.

We refer the readers to Boucheron et al.
[5] for a modern exposition of the concentration inequality toolbox.

B. Bias

It turns out that bias analysis in estimation, albeit widely studied in statistics, still largely bears an asymptotic and expansion-based nature in the mainstream statistical literature [53], [54]. In particular, the bootstrap [55] as a method for estimating functionals was essentially only analyzed in an asymptotic setting [56]. Among asymptotic analysis techniques, probably the most popular one is the Taylor expansion. We will show that the Taylor expansion may encounter great difficulties in analyzing the bias of the MLE in information measure estimation. Then, we will introduce the field of approximation theory using positive linear operators and demonstrate that it is essentially equivalent to nonasymptotic bias analysis for plug-in functional estimators. In doing so, we present the readers with abundant handy tools from approximation theory, which could be readily applicable to many problems that may seem highly intractable with standard expansion methods.

We start from entropy estimation. In the literature, considerable effort has been devoted to understanding the nonasymptotic performance of the MLE $H(P_n)$ in estimating $H(P)$. One of the earliest investigations in this direction is due to Miller [9] in 1955, who showed that, for any fixed distribution $P$,

$$E H(P_n) = H(P) - \frac{S - 1}{2n} + O\left(\frac{1}{n^2}\right). \tag{46}$$

Equation (46) was later refined by Harris [57] using higher order Taylor series expansions to yield

$$E H(P_n) = H(P) - \frac{S - 1}{2n} + \frac{1}{12 n^2}\left(1 - \sum_{i=1}^{S} \frac{1}{p_i}\right) + O\left(\frac{1}{n^3}\right). \tag{47}$$

Harris's result reveals an undesirable consequence of the Taylor expansion method: one cannot obtain uniform bounds on the bias of the MLE. Indeed, the term $\sum_{i=1}^{S} \frac{1}{p_i}$ can be arbitrarily large for some distributions $P$. However, it is evident that both $H(P_n)$ and $H(P)$ are bounded above by $\ln S$, since the maximum entropy of any distribution supported on $S$ elements is $\ln S$. Conceivably, for a distribution $P$ that makes $\sum_{i=1}^{S} \frac{1}{p_i}$ very large, we need to compute even higher order Taylor expansions to obtain more accuracy, but even with such efforts we cannot obtain a uniform bias bound for all $P$.
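Miller's expansion (46) is what motivates the Miller-Madow correction from Section III-B: adding $(S-1)/(2n)$ cancels the leading bias term of the plug-in estimate. A minimal Python sketch follows. The function name and the `alphabet_size` argument are hypothetical, and falling back to the number of distinct observed symbols when $S$ is unknown is a common practical substitute assumed here, not something this paper prescribes.

```python
import math
from collections import Counter


def miller_madow_entropy(samples, alphabet_size=None):
    """Plug-in entropy (nats) plus the first-order correction (S - 1) / (2n).

    If alphabet_size is None, the number of distinct observed symbols is
    used in place of S (an assumption of this sketch).
    """
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")
    counts = Counter(samples)
    h_mle = -sum((c / n) * math.log(c / n) for c in counts.values())
    S = alphabet_size if alphabet_size is not None else len(counts)
    return h_mle + (S - 1) / (2 * n)


if __name__ == "__main__":
    data = ["a", "a", "b", "b"]
    print(miller_madow_entropy(data, alphabet_size=2))
```

As the higher order term in (47) indicates, this correction removes only the $1/n$ part of the bias; the remaining terms still depend on the unknown $P$.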
We gain one of our key insights into the bias of the MLE by relating it to the approximation error induced by the Bernstein polynomial approximation of the function $f$, which was first observed in Paninski [4]. To see this, we first compute the bias of $F(P_n)$ in estimating the functional $F(P)$ in (13).

Lemma 3. The bias of the estimator $F(P_n)$ is given by

$$\mathrm{Bias}(F(P_n)) \triangleq E F(P_n) - F(P) = \sum_{i=1}^{S} \left( \sum_{j=0}^{n} f\left(\frac{j}{n}\right) \binom{n}{j} p_i^j (1 - p_i)^{n - j} - f(p_i) \right). \tag{48}$$

The bias term in (48) can be equivalently expressed as

$$\mathrm{Bias}(F(P_n)) = \sum_{i=1}^{S} \left( \sum_{j=0}^{n} f\left(\frac{j}{n}\right) B_{j,n}(p_i) - f(p_i) \right) \tag{49}$$
$$= \sum_{i=1}^{S} \left( B_n[f](p_i) - f(p_i) \right), \tag{50}$$

where $B_{j,n}(x) \triangleq \binom{n}{j} x^j (1 - x)^{n - j}$ is the well-known Bernstein polynomial basis, and $B_n[f](x)$ is the so-called Bernstein polynomial for the function $f(x)$. Bernstein in 1912 [6] provided an insightful constructive proof of the Weierstrass theorem on approximation of continuous functions using polynomials, by showing that the Bernstein polynomial of any continuous function converges uniformly to that function.

From a functional analytic viewpoint, the Bernstein polynomial is an operator that maps a continuous function $f \in C[0, 1]$ to another continuous function $B_n[f] \in C[0, 1]$. This operator is linear in $f$, and is positive because $B_n[f]$ is pointwise non-negative if $f$ is pointwise non-negative. Evidently, bounding the approximation error incurred by the Bernstein polynomial is equivalent to bounding the bias of the MLE $f(X/n)$, where $X \sim \mathsf{B}(n, x)$. Fortunately, the theory of approximation using positive linear operators [6] provides us with advanced tools that are very effective for the bias analysis our problem calls for. A century ago, probability theory served Bernstein in breaking new ground in function approximation. It is therefore very satisfying that advancements in the latter have come full circle to help us better understand probability theory and statistics. We briefly review the general theory of approximation using positive linear operators below.

Approximation theory using positive linear operators: Generally speaking, for any estimator $\hat{\theta}$ of a parametric model indexed by $\theta$, the expectation $f \mapsto E_\theta f(\hat{\theta})$ is a positive linear operator for $f$, and analyzing the bias $E_\theta f(\hat{\theta}) - f(\theta)$ is equivalent to analyzing the approximation properties of the positive linear operator $E_\theta f(\hat{\theta})$ in approximating $f(\theta)$. Hence, analyzing the bias of any plug-in estimator for functionals of parameters from any parametric family can be recast as a problem of approximation theory using positive linear operators [6].
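Lemma 3 makes the bias of the MLE exactly computable for a given $P$ and $n$, without simulation: one evaluates the Bernstein polynomial $B_n[f]$ at each $p_i$ and subtracts $f(p_i)$. The following Python sketch does this directly from (48)-(50). The function names are illustrative, and the direct summation is only intended for moderate $n$ (a few hundred), since it multiplies large binomial coefficients by small powers in floating point.

```python
import math


def bernstein(f, n, x):
    """Bernstein polynomial B_n[f](x) = sum_j f(j/n) C(n,j) x^j (1-x)^(n-j)."""
    return sum(
        f(j / n) * math.comb(n, j) * x ** j * (1 - x) ** (n - j)
        for j in range(n + 1)
    )


def mle_bias(f, n, p):
    """Exact bias E F(P_n) - F(P) of the plug-in estimator, as in (50)."""
    return sum(bernstein(f, n, p_i) - f(p_i) for p_i in p)


def entropy_term(x):
    """f(x) = -x ln x, extended by continuity with f(0) = 0."""
    return -x * math.log(x) if x > 0 else 0.0


if __name__ == "__main__":
    n, S = 200, 4
    uniform = [1.0 / S] * S
    print(mle_bias(entropy_term, n, uniform))  # exact bias of H(P_n)
    print(-(S - 1) / (2 * n))                  # leading term of Miller's (46)
```

For the uniform distribution with $n$ much larger than $S$, the exact bias printed by this sketch should be close to the leading term of Miller's expansion (46).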
Conversely, given a positive linear operator $L(f)(x)$ that operates on the space of continuous functions, the Riesz-Markov-Kakutani theorem implies that under mild conditions the operator may be written as

$$L(f)(x) = \int_I f \, d\mu_x = E_{\mu_x} f(Z), \quad Z \sim \mu_x, \tag{51}$$

where $\{\mu_x\}$ is a set of probability measures parametrized by $x$, which may be viewed as a parameter. If we view the random variable $Z$ as a summary statistic to plug into the functional $f$, the positive linear operator $L(f)(x)$ is nothing but the expectation of the plug-in estimator $f(Z)$. In this sense, there exists a one-to-one correspondence between essentially the most general bias analysis problem in statistics, and the most general positive linear operator approximation problem in approximation theory. After more than a century of active research on approximation using positive linear operators, we now have highly nontrivial tools for positive linear operators of functions on one dimensional compact sets, but the general theory for vector valued multivariate functions on non-compact sets is still far from complete [6]. In the next subsection, we present a sample of existing results in approximation using positive linear operators, corollaries of which will be used to analyze the bias of the MLE for two examples: $F_\alpha(P)$ and $H(P)$.

Footnote: In the literature of combinatorics, the sum $\sum_{j=0}^{n} a_{j,n} B_{j,n}(x)$ is called the Bernoulli sum, and various approaches have been proposed to evaluate its asymptotics [58], [59], [60].

Some general results in bias analysis: First, some elementary approximation theoretic concepts need to be introduced in order to characterize the degree of smoothness of functions. For $I \subset \mathbb{R}$ an interval, the first-order modulus of smoothness $\omega^1(f, t)$, $t \ge 0$, is defined as [6]

$$\omega^1(f, t) \triangleq \sup\left\{ |f(u) - f(v)| : u, v \in I, |u - v| \le t \right\}. \tag{52}$$

The second-order modulus of smoothness $\omega^2(f, t)$, $t \ge 0$ [6], is defined as

$$\omega^2(f, t) \triangleq \sup\left\{ \left| f(u) - 2 f\left(\frac{u + v}{2}\right) + f(v) \right| : u, v \in I, |u - v| \le 2t \right\}. \tag{53}$$

Ditzian and Totik [63] introduced a class of moduli of smoothness, which proves to be extremely useful in characterizing the incurred approximation errors.
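For simplicity, for functions defined on $[0, 1]$, with $\varphi(x) = \sqrt{x(1 - x)}$, the first-order Ditzian-Totik modulus of smoothness is defined as

$$\omega_\varphi^1(f, t) \triangleq \sup\left\{ |f(u) - f(v)| : u, v \in [0, 1], |u - v| \le t \varphi\left(\frac{u + v}{2}\right) \right\}, \tag{54}$$

and the second-order Ditzian-Totik modulus of smoothness is defined as

$$\omega_\varphi^2(f, t) \triangleq \sup\left\{ \left| f(u) - 2 f\left(\frac{u + v}{2}\right) + f(v) \right| : u, v \in [0, 1], |u - v| \le 2 t \varphi\left(\frac{u + v}{2}\right) \right\}. \tag{55}$$

Recall that we denote by $e_j$, $j \in \mathbb{N} \cup \{0\}$, the monomial functions $e_j(y) = y^j$, $y \in I$. The first estimate for general positive linear operators, using the modulus $\omega^2$ and with precise constants, was given by Gonska [64]. We rephrase Păltănea [6, Cor. 2.2.1] as follows. Note that the notation $e_1 - x e_0$ denotes a continuous function on $I$ which is the difference of the linear function $y$ and a constant function with constant value $x$ over $I$. In other words, it is an abbreviation of $e_1(y) - x e_0(y)$, $y \in I$, which is a function of $y$ rather than $x$. For a positive linear functional $F$, we adopt the following notation

$$B_F(x) = F(e_1 - x e_0), \quad V_F = F\left((e_1 - F(e_1) e_0)^2\right), \tag{56}$$

which represent the bias and variance of a positive linear functional $F$.

Lemma 4. [6, Cor. 2.2.1] Let $F: C(I) \to \mathbb{R}$ be a positive linear functional, where $I \subset \mathbb{R}$ is an interval. Suppose that $F(e_0) = 1$, $t > 0$, $\mathrm{length}(I) \ge 2t$, $s \ge 2$. Then,

$$|F(f) - f(x)| \le |B_F(x)| \cdot \frac{1}{t} \omega^1(f, t) + \left(1 + \frac{1}{2 t^s} F\left(|e_1 - x e_0|^s\right)\right) \omega^2(f, t). \tag{57}$$

We remark that Lemma 4 can be applied to bound the bias of plug-in estimators in very general models. For example, consider an arbitrary statistical experiment $\{P_\theta, \theta \in I\}$, from which we obtain $n$ i.i.d. samples $X_1, X_2, \ldots, X_n \sim P_\theta$. For any estimator $\hat{\theta}$, we would like to analyze the bias of the plug-in estimator $f(\hat{\theta})$ for the functional $f(\theta)$. Suppose $\mathrm{length}(I) \ge 2t$, $s \ge 2$; then Lemma 4 implies that

$$|E_\theta f(\hat{\theta}) - f(\theta)| \le |E_\theta \hat{\theta} - \theta| \cdot \frac{1}{t} \omega^1(f, t) + \left(1 + \frac{1}{2 t^s} E_\theta |\hat{\theta} - \theta|^s\right) \omega^2(f, t). \tag{58}$$

If we further assume that $\hat{\theta}$ is an unbiased estimator for $\theta$, i.e., $E_\theta \hat{\theta} = \theta$ holds for all $\theta \in I$, then we have

$$|E_\theta f(\hat{\theta}) - f(\theta)| \le \left(1 + \frac{1}{2 t^s} E_\theta |\hat{\theta} - \theta|^s\right) \omega^2(f, t). \tag{59}$$

Taking $s = 2$ and assuming $\sqrt{\mathrm{Var}(\hat{\theta})} \le \mathrm{length}(I)/2$, we have

$$|E_\theta f(\hat{\theta}) - f(\theta)| \le \frac{3}{2} \omega^2\left(f, \sqrt{\mathrm{Var}(\hat{\theta})}\right), \tag{60}$$

after we take $t = \sqrt{E_\theta (\hat{\theta} - \theta)^2}$.

We remark that Lemma 4 is only one way to analyze the bias, which is by no means always tight. For example, the following estimate using the Ditzian-Totik modulus is significantly better than Lemma 4 for certain functions such as the entropy.

Lemma 5. [6, Thm. 2.5.1] If $F: C[0, 1] \to \mathbb{R}$ is a positive linear functional and $F(e_0) = 1$, then we have

$$|F(f) - f(x)| \le \frac{|B_F(x)|}{2 h \varphi(x)} \omega_\varphi^1(f, 2h) + \frac{5}{2} \omega_\varphi^2(f, h), \tag{61}$$

for all $f \in C[0, 1]$ and $0 < h \le 1/2$, where $\varphi(x) = \sqrt{x(1 - x)}$ and $h = \sqrt{F\left((e_1 - x e_0)^2\right)} / \varphi(x) = \sqrt{V_F + B_F(x)^2} / \varphi(x)$. The bias $B_F(x)$ and variance $V_F$ are defined in (56).

Considering the same statistical experiment $\{P_\theta, \theta \in I\}$, and the plug-in estimator $f(\hat{\theta})$ for $f(\theta)$, if $\hat{\theta}$ is unbiased for $\theta$ and $\mathrm{Var}(\hat{\theta}) \le \varphi(\theta)^2 / 4$, then it follows from Lemma 5 that

$$|E_\theta f(\hat{\theta}) - f(\theta)| \le \frac{5}{2} \omega_\varphi^2\left(f, \frac{\sqrt{\mathrm{Var}(\hat{\theta})}}{\varphi(\theta)}\right), \tag{62}$$

after we take $h = \sqrt{\mathrm{Var}(\hat{\theta})} / \varphi(\theta)$.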
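The moduli of smoothness entering Lemmas 4 and 5 can be explored numerically. The sketch below approximates $\omega^2(f, t)$ from (53) and $\omega_\varphi^2(f, t)$ from (55) on $[0, 1]$ by brute force over a uniform grid. It is only an illustration under stated assumptions: the function names and the grid size `N` are hypothetical, a small tolerance is added to the constraint to absorb rounding, and a grid search can only return a lower approximation of the true supremum.

```python
import math


def _sup_second_difference(f, allowed, N):
    """Max of |f(u) - 2 f((u+v)/2) + f(v)| over grid pairs u <= v in [0, 1]."""
    best = 0.0
    for i in range(N + 1):
        for k in range(i, N + 1):
            u, v, mid = i / N, k / N, (i + k) / (2 * N)
            if allowed(u, v, mid):
                best = max(best, abs(f(u) - 2 * f(mid) + f(v)))
    return best


def omega2(f, t, N=400):
    """Grid approximation of the second-order modulus (53) on I = [0, 1]."""
    return _sup_second_difference(
        f, lambda u, v, mid: v - u <= 2 * t + 1e-12, N
    )


def omega2_dt(f, t, N=400):
    """Grid approximation of the Ditzian-Totik modulus (55), phi = sqrt(x(1-x))."""
    def allowed(u, v, mid):
        return v - u <= 2 * t * math.sqrt(mid * (1 - mid)) + 1e-12
    return _sup_second_difference(f, allowed, N)


def entropy_term(x):
    """f(x) = -x ln x with f(0) = 0."""
    return -x * math.log(x) if x > 0 else 0.0


if __name__ == "__main__":
    for t in (0.2, 0.1, 0.05):
        print(t, omega2(entropy_term, t), omega2_dt(entropy_term, t))
```

For $f(x) = -x \ln x$, the Ditzian-Totik modulus shrinks markedly faster than the ordinary modulus as $t$ decreases, because the admissible step $2t\varphi\left(\frac{u+v}{2}\right)$ is forced to be small near the endpoint $0$, where $f$ is least smooth.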
For certain functions $f(\theta)$ and statistical models, Lemma 5 is stronger than Lemma 4. For example, suppose $f(\theta) = -\theta\ln\theta$, $\theta \in [0,1]$, and $n\hat\theta_n \sim \mathsf{B}(n,\theta)$. We will show in Lemma 8 that $\omega_\varphi^2(f,t) = \frac{t^2\ln 4}{1+t^2}$ and $\omega^2(f,t) = t\ln 4$. We also have $\mathrm{Var}_\theta(\hat\theta_n) = \theta(1-\theta)/n$. Hence, Lemma 4 gives the upper bound

$$|\mathbb{E}_\theta f(\hat\theta_n) - f(\theta)| \le \frac{3\ln 4}{2}\sqrt{\frac{\theta(1-\theta)}{n}}, \tag{63}$$

whereas Lemma 5 gives

$$|\mathbb{E}_\theta f(\hat\theta_n) - f(\theta)| \le \frac{5\ln 4}{2(n+1)}, \tag{64}$$

which is much stronger when $n$ is large and $\theta$ is not too close to the endpoints of $[0,1]$.

There also exist various estimates for the bias when the parameter lies in sets other than an interval in $\mathbb{R}$. However, the bounds we presented are in general not optimal for specific functionals, thereby leaving ample room for future development. For example, note that (63) is stronger than (64) when $\theta \lesssim 1/n$, but Han, Jiao, and Weissman [65] showed that when $\theta \lesssim 1/n$ the pointwise bound in (63) is still strictly suboptimal for the entropy functional. Unsurprisingly, to obtain the results in Section III, we need to go beyond the general results in approximation theory and incorporate the structure of specific functions.

Note: In the approximation theory literature, researchers explored the interactions between general positive linear operator approximation and its probabilistic counterpart decades ago [66]–[68]. However, in the statistics literature related to positive linear approximation, usually only specific operators are used, such as the Bernstein operator [69], and the focus may not be on obtaining the tightest bound on the bias [70], [71].

C. Lower bounds

To lower bound the worst case performance of a specific estimator, we have essentially two approaches: first, to analyze the bias or the variance of the specific estimator carefully; second, to prove a lower bound that is satisfied by all estimators, which naturally include the specific estimator we need to analyze. These two approaches have different relative advantages and disadvantages, so we utilize them together in the lower bound construction. We refer the readers to Tsybakov [7] for a nice collection of techniques to prove minimax lower bounds. One specific approach we use is the van Trees inequality, which we quote below.
Let $(\mathcal{X}, \mathcal{F}, P_\theta; \theta \in \Theta)$ be a dominated family of distributions on some sample space $\mathcal{X}$; denote the dominating measure

by $\mu$. Assume $\Theta$ is a closed interval on the real line. Let $f(x|\theta)$ denote the density of $P_\theta$ with respect to $\mu$. Let $\pi$ be some probability distribution on $\Theta$ with a density $\lambda(\theta)$ with respect to Lebesgue measure. Suppose that $\lambda(\cdot)$ and $f(x|\cdot)$ are both absolutely continuous ($\mu$-almost surely), and that $\lambda$ converges to zero at the endpoints of the interval $\Theta$. We define

$$I(\theta) = \mathbb{E}_\theta\left[\left(\frac{\partial \log f(X|\theta)}{\partial\theta}\right)^2\right], \tag{65}$$

$$I(\lambda) = \mathbb{E}\left[\left(\frac{d\log\lambda(\theta)}{d\theta}\right)^2\right], \tag{66}$$

the Fisher information for $\theta$ and for a location parameter in $\lambda$, respectively. We assume $I(\theta)$ is continuous in $\theta$. We have the following inequality.

Lemma 6 (van Trees inequality). [73] Under the assumptions above, the average risk of an arbitrary estimator $\hat\psi(X)$ in estimating an absolutely continuous functional $\psi(\theta)$ under squared error loss satisfies the following inequality:

$$\mathbb{E}\big(\hat\psi(X) - \psi(\theta)\big)^2 \ge \frac{\big(\mathbb{E}\,\psi'(\theta)\big)^2}{\mathbb{E}[I(\theta)] + I(\lambda)}, \tag{67}$$

where the expectations are taken with $\theta \sim \pi$ (and, on the left-hand side, $X \sim P_\theta$ given $\theta$).

V. PROOFS OF THE UPPER BOUNDS

In order to upper bound the maximum squared error risk of any estimator, a natural approach would be to analyze the squared bias term and the variance term separately. Then, it suffices to find proper tools to give a nonasymptotic analysis of the bias and variance.

A. Bounding the bias

We first work to bound the bias. Lemma 3 shows that the bias of $F(P_n)$ could be represented as

$$\mathrm{Bias}(F(P_n)) = \sum_{i=1}^S \big(B_n[f](p_i) - f(p_i)\big), \tag{68}$$

where $B_n[f](x)$ is the Bernstein polynomial corresponding to $f(x)$. The following lemma summarizes some state-of-the-art bounds for the approximation error of Bernstein polynomials. Lemma 7 can be derived easily from the general theory we presented in Section IV-B. We emphasize that one cannot expect the bounds in Lemma 7 to be tight for any $f \in C[0,1]$, since the Bernstein approximation error itself could be a very complicated function in $C[0,1]$, and Lemma 7 is using relatively simple functions to upper bound it.

Lemma 7. The following bounds are valid for the function approximation error incurred by Bernstein polynomials:

1) Pointwise estimate: [6, Cor...] [74] for all continuous functions $f$ on $[0,1]$,

$$|f(x) - B_n[f](x)| \le \frac{3}{2}\,\omega^2\!\left(f, \sqrt{\frac{x(1-x)}{n}}\right), \tag{69}$$

and the constant $3/2$ is shown by [74] to be the best constant;

2) Norm estimate: [6, Cor.
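4..0] for $\varphi(x) = \sqrt{x(1-x)}$ and all continuous functions $f$ on $[0,1]$, we have

$$\|B_n[f] - f\|_\infty \le \frac{5}{2}\,\omega_\varphi^2\big(f, n^{-1/2}\big); \tag{70}$$

3) [75, Eq ] for $f \in C^2[0,1]$, i.e., twice continuously differentiable,

$$|f(x) - B_n[f](x)| \le \frac{\|f''\|_\infty\, x(1-x)}{2n}. \tag{71}$$

Proof. The pointwise estimate of Lemma 7 follows from Lemma 4. The norm estimate of Lemma 7 follows from Lemma 5. Regarding the third part, suppose the random variable $X \sim \mathsf{B}(n,x)$. We have

$$|f(x) - B_n[f](x)| = |\mathbb{E}_x f(X/n) - f(x)| \tag{72}$$

$$= \left|\mathbb{E}_x\left[f'(x)(X/n - x) + \tfrac{1}{2} f''(\xi_X)(X/n - x)^2\right]\right| \tag{73}$$

$$= \tfrac{1}{2}\left|\mathbb{E}_x f''(\xi_X)(X/n - x)^2\right| \tag{74}$$

$$\le \frac{\|f''\|_\infty}{2}\,\mathbb{E}_x (X/n - x)^2 \tag{75}$$

$$= \frac{\|f''\|_\infty\, x(1-x)}{2n}, \tag{76}$$

where we used the Taylor expansion of $f(X/n)$ at the point $x$ with the Lagrange remainder. The proof is complete.

Remark. Note that although (70) is in the form of an upper bound, it has been shown to be a lower bound as well. Totik [76] showed the following equivalence property on the norm estimate of Bernstein approximation errors:

$$\|B_n[f] - f\|_\infty \asymp \omega_\varphi^2\big(f, n^{-1/2}\big). \tag{77}$$

(Note that it is a remarkable fact that (77) holds for any continuous function $f(x)$. The lower bound proof of (77) is considered one of the remarkable results in approximation theory, and currently there are no short proofs of this fact. Indeed, Ditzian [77, Section 8] mentioned that "I still would like to see a new simple proof of 8.4 [Equation (77)] which I am sure will have implications for other operators.")

It is easy to calculate the second-order modulus of smoothness and the Ditzian–Totik second-order modulus of smoothness for the functions $x^\alpha$ and $-x\ln x$. The results are presented in the following lemma.

Lemma 8. We have

For $f(x) = x^\alpha$, $0 < \alpha < 1$: $\omega^2(f,t) = (2 - 2^\alpha)\,t^\alpha$ and $\omega_\varphi^2(f,t) = \dfrac{(2 - 2^\alpha)\,t^{2\alpha}}{(1+t^2)^\alpha}$.

For $f(x) = x^\alpha$, $1 < \alpha < 2$: $\omega^2(f,t) = (2^\alpha - 2)\,t^\alpha$ and $\omega_\varphi^2(f,t) \asymp t^2$.

For $f(x) = -x\ln x$: $\omega^2(f,t) = t\ln 4$ and $\omega_\varphi^2(f,t) = \dfrac{t^2\ln 4}{1+t^2}$.

Here the second-order modulus results hold for $0 < t \le 1/2$, and the Ditzian–Totik second-order modulus results hold for $0 < t \le 1$.

1) Bias of $F_\alpha(P_n)$: We first bound the bias incurred by $F_\alpha(P_n)$.

$\alpha \ge 2$: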

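In this case, $f \in C^2[0,1]$. Applying the third part of Lemma 7,

$$|f(x) - B_n[f](x)| \le \frac{\alpha(\alpha-1)}{2n}\,x(1-x). \tag{78}$$

Thus, we have

$$|\mathrm{Bias}(F_\alpha(P_n))| \le \frac{\alpha(\alpha-1)}{2n}\sum_{i=1}^S p_i(1-p_i) \le \frac{\alpha(\alpha-1)}{2n}. \tag{79}$$

$1 < \alpha < 2$: The following lemma presents a bound on the bias of $F_\alpha(P_n)$ which does not depend on the alphabet size $S$. We note that the proof of Lemma 9 heavily utilizes the special properties of the function $x^\alpha$ and the fact that $\sum_{i=1}^S p_i = 1$.

Lemma 9. The bias of $F_\alpha(P_n)$ for estimating $F_\alpha(P)$, $1 < \alpha < 2$, is upper bounded by the following:

$$|\mathrm{Bias}(F_\alpha(P_n))| \le \frac{4}{n^{\alpha-1}}. \tag{80}$$

We also present two additional bounds involving the alphabet size. Using the pointwise estimate in Lemma 7, the bias term of the MLE is upper bounded as follows for all $0 < \alpha < 2$, $\alpha \ne 1$:

$$\frac{3}{2}\,|2^\alpha - 2| \sum_{i=1}^S \left(\frac{p_i(1-p_i)}{n}\right)^{\alpha/2} \le \frac{3\,|2^\alpha - 2|}{2n^{\alpha/2}} \sum_{i=1}^S p_i^{\alpha/2} \tag{81}$$

$$\le \frac{3\,|2^\alpha - 2|}{2n^{\alpha/2}}\, S \left(\frac{1}{S}\right)^{\alpha/2} \tag{82}$$

$$= \frac{3\,|2^\alpha - 2|\, S^{1-\alpha/2}}{2n^{\alpha/2}}. \tag{83}$$

Using the norm estimate in Lemma 7, when $1 < \alpha < 2$, the bias would be upper bounded by $\frac{5S}{2n} C_{\alpha,n}$, where $C_{\alpha,n} = n\,\omega_\varphi^2(x^\alpha, n^{-1/2})$ is a finite positive constant such that $\limsup_{n\to\infty} C_{\alpha,n} < \infty$ for $1 < \alpha < 2$. Combining Lemma 9, the pointwise estimate, and the norm estimate in Lemma 7, we know that the bias of $F_\alpha(P_n)$ for $1 < \alpha < 2$ is upper bounded as

$$|\mathrm{Bias}(F_\alpha(P_n))| \le \min\left\{ \frac{4}{n^{\alpha-1}},\ \frac{3(2^\alpha - 2)\, S^{1-\alpha/2}}{2n^{\alpha/2}},\ \frac{5S\, C_{\alpha,n}}{2n} \right\}. \tag{84}$$

$0 < \alpha < 1$: The pointwise estimate from Lemma 7 is worked out in (83). Using the norm estimate in Lemma 7, the bias would be upper bounded by $\frac{5(2 - 2^\alpha) S}{2(n+1)^\alpha}$. Combining the pointwise estimate and the norm estimate, we know that the bias of $F_\alpha(P_n)$ for $0 < \alpha < 1$ is upper bounded as

$$|\mathrm{Bias}(F_\alpha(P_n))| \le \min\left\{ \frac{3(2 - 2^\alpha)\, S^{1-\alpha/2}}{2n^{\alpha/2}},\ \frac{5(2 - 2^\alpha)\, S}{2(n+1)^\alpha} \right\}. \tag{85}$$

2) Bias of $H(P_n)$: We then bound the bias incurred by $H(P_n)$. Using the norm estimate in Lemma 7, we know

$$|\mathrm{Bias}(H(P_n))| \le \frac{5 S \ln 4}{2(n+1)}. \tag{86}$$

Using the pointwise estimate in Lemma 7, we obtain

$$|\mathrm{Bias}(H(P_n))| \le \frac{3}{2}\sqrt{\frac{S}{n}}\,\ln 4. \tag{87}$$

It was shown by Paninski [4, Prop. ] that the squared bias of the MLE $H(P_n)$ is upper bounded as

$$\big(\mathrm{Bias}(H(P_n))\big)^2 \le \ln^2\!\left(1 + \frac{S-1}{n}\right), \tag{88}$$

which is better than the two bounds we obtained using Bernstein polynomial results. However, we remark that (88) is obtained using special properties of the entropy function and connections between the KL-divergence and the $\chi^2$-divergence [7], which cannot be applied to general functions.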
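Strukov and Timan [66] also heavily exploited the structure of the functions $x^\alpha$ and $-x\ln x$ in order to analyze the Bernstein approximation error for these functions, and obtained tight-in-order results.

3) Bias of $H(\hat P_B)$: We apply the general theory of positive linear operator approximation. The following lemma is a strengthened version of Lemma 5.

Lemma 10. If $F: C[0,1] \to \mathbb{R}$ is a linear positive functional and $F(e_0) = 1$, then

$$|F(f) - f(x)| \le \omega^1\big(f, |B_F(x)|; x\big) + \frac{5}{2}\,\omega_\varphi^2(f, h_2) \tag{89}$$

for all $f \in C[0,1]$ and $0 < h_2 \le 1/2$, where $\varphi(x) = \sqrt{x(1-x)}$, $h_2 = \sqrt{V_F}/\varphi(x)$, and

$$\omega^1(f,h;x) \triangleq \sup\{ |f(u) - f(x)| : u \in [0,1],\ |u - x| \le h \}. \tag{90}$$

The bias $B_F(x)$ and variance $V_F$ are defined in (56).

Proof. Applying Lemma 5 to $x = F(e_1)$ we have

$$|F(f) - f(F(e_1))| \le \frac{5}{2}\,\omega_\varphi^2(f, h_2), \tag{91}$$

and then (89) is the direct result of the triangle inequality $|F(f) - f(x)| \le |F(f) - f(F(e_1))| + |f(F(e_1)) - f(x)|$.

We show that Lemma 10 is indeed stronger than Lemma 5. Firstly, due to $h_2 \le h_1$, we have $\omega_\varphi^2(f,h_2) \le \omega_\varphi^2(f,h_1)$. Second, for $x \le 1/2$, we have

$$\frac{|B_F(x)|}{2h_1\varphi(x)}\,\omega_\varphi^1(f, 2h_1) \approx \frac{|B_F(x)|}{2h_1\varphi(x)} \sup_{0 \le s \le 1} 2h_1\varphi(s)\,|f'(s)| \tag{92}$$

$$\ge |B_F(x)| \sup_{x \le s \le 1-x} |f'(s)| \tag{93}$$

$$\approx \sup_{x \le s \le 1-x} \omega^1\big(f, |B_F(x)|; s\big), \tag{94}$$

which is almost the supremum of $\omega^1(f, |F(e_1 - x e_0)|; s)$ over $s \in [x, 1-x]$ and is no less than the pointwise result $\omega^1(f, |F(e_1 - x e_0)|; x)$, and here we have used the inequality $\varphi(s) \ge \varphi(x)$ for $x \le s \le 1-x$. A similar argument also holds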

for $x > 1/2$. Hence, Lemma 10 transforms the first order term from the norm result in Lemma 5 to a pointwise result.

Applying Lemma 10 to the function $f(p) = -p\ln p$ and $F(f) = \mathbb{E}\left[f\!\left(\frac{n\hat p + a}{n + Sa}\right)\right]$, where $n\hat p \sim \mathsf{B}(n,p)$, we have the following lemma.

Lemma 11. If max{sa,ea,4}, then sup E P HˆP B HP 5Sl Sa Sa Sa l Sa a. (95)

Note that Lemma 11 implies a slightly weaker bias bound than Theorem 4, but it is only sub-optimal up to a multiplicative constant. The bias bound in Theorem 4 is obtained using the following lemma, whose proof only applies to the entropy function.

Lemma 12. If max{ea,Sa}, sup E P HˆP B HP l S Sa Sa Sa l Sa a. (96)

B. Bounding the variance

The next lemma follows from an application of the bounded difference inequality presented in an earlier lemma.

Lemma 13. The variance of $F(P_n)$ satisfies the following upper bound:

$$\mathrm{Var}(F(P_n)) \le n \max_{0 \le j < n} \left( f\!\left(\frac{j+1}{n}\right) - f\!\left(\frac{j}{n}\right) \right)^2. \tag{97}$$

If $f$ is monotone, then we can strengthen the bound to be

$$\mathrm{Var}(F(P_n)) \le \frac{n}{4} \max_{0 \le j < n} \left( f\!\left(\frac{j+1}{n}\right) - f\!\left(\frac{j}{n}\right) \right)^2. \tag{98}$$

We first bound the variance for $F_\alpha(P_n)$, $\alpha > 1$. We have

$$\max_{0 \le j < n} \left| \left(\frac{j+1}{n}\right)^\alpha - \left(\frac{j}{n}\right)^\alpha \right| = 1 - \left(1 - \frac{1}{n}\right)^\alpha \tag{99}$$

$$\le \frac{\alpha}{n}, \tag{100}$$

where in the last step we used Bernoulli's inequality: $(1+x)^r \ge 1 + rx$, $r \ge 1$, $x > -1$, $x \in \mathbb{R}$. Using Lemma 13, we know the variance is upper bounded by

$$\mathrm{Var}(F_\alpha(P_n)) \le \frac{\alpha^2}{4n}. \tag{101}$$

We bound the variance of $F_\alpha(P_n)$, $0 < \alpha < 1$, in the following lemma.

Lemma 14. For $0 < \alpha < 1/2$, we have sup VarF α P 0S α 3α 3α 8α α 8α S 4 e α α 0 S α. (103) For $1/2 \le \alpha < 1$, we have sup VarF α P 0S α 3α 3α 8α S α α 8α S 4 e α α 04 S α α. (105) Further, one can show that for all $\alpha \in (0,1)$, 3α 3α α 8α 8α 4 0 e α, (106) which is used in Theorem .

Regarding the variance of $H(P_n)$, we have

Lemma 15. sup VarHP l ls 3 07 ls l. (108)

The variance of $H(\hat P_B)$ is upper bounded by the following lemma.

Lemma 16. The variance of $H(\hat P_B)$ is upper bounded as follows: [ ] Sa Var HˆP B Sa 3l a S. (109)

VI. PROOFS OF THE LOWER BOUNDS

A. Lower bounds for estimation of $F_\alpha(P)$ when $\alpha \ge 3/2$

We apply the van Trees inequality as presented in Lemma 6. It suffices to consider the restricted case of $S = 2$ and prove the lower bound. Thus, the model is equivalent to observing a Binomial random variable $X \sim \mathsf{B}(n,p)$, and one aims to estimate the functional $\psi_\alpha(p) = p^\alpha + (1-p)^\alpha$. We have

$$\psi_\alpha'(p) = \alpha p^{\alpha-1} - \alpha(1-p)^{\alpha-1}. \tag{110}$$
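The reduction to $S = 2$ makes the risk of the MLE easy to evaluate exactly, which gives a quick way to see the $1/n$ rate that this subsection lower bounds. The sketch below is only a numerical illustration under stated assumptions: the values $p = 0.3$ and $\alpha \in \{1.5, 2\}$ are arbitrary, the function names are hypothetical, and the comparison constant $\psi_\alpha'(p)^2\,p(1-p)$ is the standard first-order (delta-method) approximation of $n$ times the risk at a fixed $p$, not the constant $C_\alpha$ derived in the paper.

```python
import math


def psi(p, alpha):
    """Power-sum functional for S = 2: psi_alpha(p) = p^alpha + (1 - p)^alpha."""
    return p ** alpha + (1.0 - p) ** alpha


def binom_pmf(n, k, p):
    """Binomial(n, p) pmf at k, computed in log space; needs 0 < p < 1."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)


def mle_mse(n, p, alpha):
    """Exact squared error risk of the plug-in estimate psi(X/n), X ~ Binomial(n, p)."""
    truth = psi(p, alpha)
    return sum(binom_pmf(n, k, p) * (psi(k / n, alpha) - truth) ** 2
               for k in range(n + 1))


def delta_method_limit(p, alpha):
    """First-order approximation of n * MSE: psi'(p)^2 * p * (1 - p), using (110)."""
    d = alpha * p ** (alpha - 1.0) - alpha * (1.0 - p) ** (alpha - 1.0)
    return d * d * p * (1.0 - p)


if __name__ == "__main__":
    for alpha in (1.5, 2.0):
        for n in (50, 200, 800):
            print(alpha, n, n * mle_mse(n, 0.3, alpha), delta_method_limit(0.3, alpha))
```

In this illustration $n$ times the exact risk stays close to a positive constant as $n$ grows, consistent with a squared error of order $1/n$ for $\alpha \ge 3/2$.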

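The Fisher information for the parameter $p$ under the Binomial model is $I(p) = \frac{n}{p(1-p)}$. Suppose we impose a prior $\lambda(p)$ on the parameter $p$. The van Trees inequality implies

$$\sup_P \mathbb{E}_P\big(F_\alpha(P_n) - F_\alpha(P)\big)^2 \ge \inf_{\hat F_\alpha} \sup_P \mathbb{E}_P\big(\hat F_\alpha - F_\alpha(P)\big)^2 \tag{111}$$

$$\ge \mathbb{E}\big(\mathbb{E}[F_\alpha(P)\,|\,X] - F_\alpha(P)\big)^2 \quad \text{(Bayes risk)} \tag{112}$$

$$\ge \frac{\left(\int_0^1 \big[\alpha p^{\alpha-1} - \alpha(1-p)^{\alpha-1}\big]\lambda(p)\,dp\right)^2}{\mathbb{E}_\lambda\!\left[\frac{n}{p(1-p)}\right] + I(\lambda)} \tag{113}$$

$$= \frac{\left(\int_0^1 \big[\alpha p^{\alpha-1} - \alpha(1-p)^{\alpha-1}\big]\lambda(p)\,dp\right)^2}{n\,\mathbb{E}_\lambda\!\left[\frac{1}{p(1-p)}\right] + I(\lambda)}, \tag{114}$$

where the second inequality follows from the fact that the Bayes risk under any prior is upper bounded by the minimax risk [78]. Taking $\lambda(p)$ to be the Dirichlet prior with parameter $(a,b)$, i.e.,

$$\lambda(p) = \frac{1}{B(a,b)}\,p^{a-1}(1-p)^{b-1}, \quad a > 2,\ b > 2, \tag{115}$$

we can explicitly evaluate the integrals above. Here $B(a,b)$ is the Beta function. Taking $a = 4$, $b = 3$, we have

$$\sup_P \mathbb{E}_P\big(F_\alpha(P_n) - F_\alpha(P)\big)^2 \ge \frac{\big[60\alpha\big(B(\alpha+3,3) - B(\alpha+2,4)\big)\big]^2}{5n + 45}. \tag{116}$$

Taking $C_\alpha = 72\alpha^2\big(B(\alpha+3,3) - B(\alpha+2,4)\big)^2$, we have

$$\sup_P \mathbb{E}_P\big(F_\alpha(P_n) - F_\alpha(P)\big)^2 \ge \frac{C_\alpha}{n}, \quad \text{for all } n \ge 1. \tag{117}$$

Note that $C_\alpha > 0$ for all $\alpha \ge 3/2$.

B. Lower bounds for estimation of $F_\alpha(P)$ when $1 < \alpha < 3/2$

The following lemma was proved in [69].

Lemma 17. Let $k \ge 4$ be an even number. Suppose that the $k$-th derivative of $f$ satisfies $f^{(k)} \le 0$ in $(0,1)$, and $Q_{k-1}$ is the Taylor polynomial of order $k-1$ to $f$ at some $x$ in $(0,1)$. Then for $x \in [0,1]$,

$$f(x) - B_n[f](x) \ge Q_{k-1}(x) - B_n[Q_{k-1}](x). \tag{118}$$

Consider $f_\alpha(x) = x^\alpha$, $1 < \alpha < 2$, $x \in [0,1]$. Applying Lemma 17 to $f_\alpha$, taking $k = 6$, we have the following result.

Lemma 18. Suppose $f_\alpha(x) = x^\alpha$, $1 < \alpha < 2$ on $[0,1]$. For all $x \in (0,1)$, we have f α x B [f α ]x αα xα x x α3α x α5 3α R x 3 R x 4, (119) where R x = αα α α 3xα 3 x 4 x5 αxα 4, (120) R x = αα α α 3α 4 0 x α 4 x x x x. (121)

Note that we have assumed $S = cn$, $c > 0$. If $c \le 1$, we take a uniform distribution on $S$ elements $P = (1/S, 1/S, \ldots, 1/S)$, otherwise we take distribution P = ǫ, ǫ,..., ǫ ǫ, S,..., ǫ S, where $\epsilon$ will be taken to be arbitrarily small. We first analyze the $c \le 1$ case. Applying Lemma 18, we have f α /S B [f α ]/S Note that f α x = x α = EF α P F α P αα S S α S α5 3α αα α α 3 4S α 3 3 α 4 αα α α 3α 4 0S α 4 4 o α = αα α c α 3 c α5 3α α α 3α 4 4c α 4 α α 3α 4 0c α 5 o α = αα c α α α5 3αc 4 α α 3α 4c 4 α α 3α 4c3 o α 0 αc α 4 330α85α 90α 3 α 4 0 α o α, where the first inequality follows from Lemma 18, and in the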

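last step we have taken $c = 1$ in the following expression α5 3αc α α 3α 4c 4 4 α α 3α 4c3, 0 and considered the fact that it is a monotonically decreasing function with respect to $c$ on $(0,1]$ for any $\alpha \in (1, 3/2)$. For cases when $c > 1$, since we take P = ǫ, ǫ,..., ǫ ǫ, S,..., ǫ S, by a continuity argument, the analysis is exactly the same as that above when we set $c = 1$, as we can take $\epsilon$ as small as possible. One can verify that the function α4 330α85α 90α 3 α 4 /0 is positive on the interval $(1, 3/2)$. Defining c α = αc α 4 330α 85α 90α 3 α 4 /0 > 0 when $c \le 1$, and cα = α4 330α 85α 90α 3 α 4 /0 > 0 when $c > 1$, the proof is completed.

C. Lower bounds for estimation of $F_\alpha(P)$ when $0 < \alpha < 1$

Applying Lemma 17 to the function $f_\alpha(x) = x^\alpha$, $\alpha \in (0,1)$, taking $k = 4$, we have the following result:

Lemma 19. For $f_\alpha(x) = x^\alpha$ on $[0,1]$, $\alpha \in (0,1)$, $x \in (0,1]$, we have f α x B [f α ]x α α x α x x α. 3 3 (123)

Suppose $n \ge S$. Define the distribution $W = (w_1, w_2, \ldots, w_S) \in \mathcal{M}_S$ such that

$$\forall\, 1 \le i \le S-1,\ w_i = \frac{1}{n}; \qquad w_S = 1 - \frac{S-1}{n}. \tag{124}$$

Note that $w_i \ge 1/n$, $1 \le i \le S$. It follows from Lemma 19 that

$$|F_\alpha(W) - \mathbb{E}_W F_\alpha(P_n)| \ge \frac{\alpha(1-\alpha)(S-1)}{6n^\alpha}. \tag{126}$$

Thus, we know for all $0 < \alpha < 1$,

$$\sup_P \mathbb{E}_P\big(F_\alpha(P_n) - F_\alpha(P)\big)^2 \ge \frac{\alpha^2(1-\alpha)^2(S-1)^2}{36\,n^{2\alpha}}. \tag{127}$$

It is shown in [0] that the following minimax lower bound holds for estimation of $F_\alpha(P)$, $1/2 \le \alpha < 1$.

Lemma 20. For $1/2 \le \alpha < 1$, we have if sup E P ˆF Fα P ˆF [ α S α α 3e α 4 e /4 S α S α α] S α, (128) where the infimum is taken over all possible estimators.

Since this lower bound holds for all possible estimators, it also holds for the MLE $F_\alpha(P_n)$. Since $\max\{a,b\} \ge \frac{a+b}{2}$, we have the desired lower bound.

D. Lower bounds for estimation of $H(P)$

Braess and Sauer [69] derived the following lower bound for the approximation error of Bernstein polynomials for the function $g(x) = -x\ln x$:

Lemma 21. Define $g(x) = -x\ln x$ on $[0,1]$. For $x \ge 15/n$, $x \in [0,1]$, we have

$$g(x) - B_n[g](x) \ge \frac{1-x}{2n} + \frac{1}{20n^2 x} - \frac{x}{12n^2}. \tag{129}$$

Applying Lemma 21 to the estimation of $H(P)$, we know that if $\forall\, 1 \le i \le S$, $p_i \ge 15/n$,

$$H(P) - \mathbb{E}H(P_n) \ge \frac{S-1}{2n} + \frac{1}{20n^2}\sum_{i=1}^S \frac{1}{p_i} - \frac{1}{12n^2}. \tag{130}$$

Consider the uniform distribution $P$ with $n \ge 15S$, which guarantees $p_i \ge 15/n$. Since

$$\sum_{i=1}^S \frac{1}{p_i} \ge S^2, \tag{131}$$

we have

$$\sup_P |H(P) - \mathbb{E}H(P_n)| \ge \frac{S-1}{2n} + \frac{S^2}{20n^2} - \frac{1}{12n^2}. \tag{132}$$

Thus, when $n \ge 15S$,

$$\sup_P \mathbb{E}_P\big(H(P_n) - H(P)\big)^2 \ge \left(\frac{S-1}{2n} + \frac{S^2}{20n^2} - \frac{1}{12n^2}\right)^2. \tag{133}$$

It was shown in [, Prop. ] that the following minimax lower bound holds.

Lemma 22.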
There exists a universal constant $c > 0$ such that

$$\inf_{\hat H} \sup_P \mathbb{E}_P\big(\hat H - H(P)\big)^2 \ge c\,\frac{\ln^2 S}{n}, \tag{134}$$

where the infimum is taken over all possible estimators $\hat H$.
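The bias lower bound (132) at the uniform distribution can be checked against an exact computation. The sketch below is an illustration only: it evaluates $H(P) - \mathbb{E}H(P_n)$ for the uniform distribution on $S$ symbols through the Binomial marginal of each count, and compares it with the right-hand side of (132) and with the upper bound $\ln(1 + (S-1)/n)$ from (88). The function names and the $(S, n)$ pairs are hypothetical choices that satisfy $n \ge 15S$.

```python
import math


def g(x):
    """Entropy summand g(x) = -x ln x, with the convention g(0) = 0."""
    return 0.0 if x <= 0.0 else -x * math.log(x)


def binom_pmf(n, k, p):
    """Binomial(n, p) pmf at k, computed in log space; needs 0 < p < 1."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)


def entropy_bias_uniform(n, S):
    """H(P) - E[H(P_n)] for P uniform on S >= 2 symbols.

    Each count is marginally Binomial(n, 1/S), so E[H(P_n)] = S * E[g(X/n)].
    """
    p = 1.0 / S
    expected_term = sum(binom_pmf(n, k, p) * g(k / n) for k in range(n + 1))
    return math.log(S) - S * expected_term


def bias_lower_bound(n, S):
    """Right-hand side of (132); stated for n >= 15 S."""
    return (S - 1) / (2.0 * n) + S * S / (20.0 * n * n) - 1.0 / (12.0 * n * n)


def bias_upper_bound(n, S):
    """Square root of the right-hand side of (88)."""
    return math.log(1.0 + (S - 1) / n)


if __name__ == "__main__":
    for S, n in ((2, 30), (5, 75), (10, 150), (10, 1000)):
        print(S, n, bias_lower_bound(n, S), entropy_bias_uniform(n, S),
              bias_upper_bound(n, S))
```

In these cases the exact bias sits between the two bounds, and the lower bound is already close to the exact value, which reflects that the leading term $(S-1)/(2n)$ dominates once $n \ge 15S$.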

Hence, we have

$$\sup_P \mathbb{E}_P\big(H(P_n) - H(P)\big)^2 \ge \max\left\{ \left(\frac{S-1}{2n} + \frac{S^2}{20n^2} - \frac{1}{12n^2}\right)^2,\ c\,\frac{\ln^2 S}{n} \right\} \tag{135}$$

$$\ge \frac{1}{2}\left(\frac{S-1}{2n} + \frac{S^2}{20n^2} - \frac{1}{12n^2}\right)^2 + \frac{c}{2}\,\frac{\ln^2 S}{n}. \tag{136}$$

Similar arguments can be applied to the Miller–Madow estimator.

E. Lower bounds for entropy estimation using $H(\hat P_B)$

Since $H(\hat P_B)$ is a specific estimator for entropy, the following lemma is proved via considering several specific distributions.

Lemma 23. If max{5S,Sa,ea}, EP sup HˆP B HP S a Sa 4Sa l S a 8 S If < Sa, then EP sup HˆP B HP S ls. (138) S If < ea, then EP sup HˆP B HP S ls. (139) es If < 5S, ea, then EP sup HˆP B HP S a Sa 4Sa l /5 a

The corresponding results in Theorem 5 follow from Lemma 23, Lemma 22, and the inequality $\max\{a,b\} \ge \frac{a+b}{2}$.

F. Lower bounds for entropy estimation using $\hat H^{\mathrm{Bayes}}$

We prove Theorem 6 below. Applying Lemma 25, we have Ĥ Bayes ψsa ax i ψa 4 Sa = ψsa ψa 4 Sae γ l. (143) a Since $\hat H^{\mathrm{Bayes}}$ is upper bounded by Sae l γ a for any empirical observations, the squared error it incurs in Shannon entropy estimation when the true distribution is the uniform distribution is at least SaS/ l Sae γ (144) if S.

ACKNOWLEDGMENTS

We thank Dany Leviatan, Gancho Tachev, and Radu Paltanea for very helpful discussions regarding the literature on approximation theory using positive linear operators. We thank Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi for communicating to us the independent discovery that it suffices to take samples to consistently estimate $F_\alpha(P)$ when $\alpha > 1$. We thank Maya Gupta for raising the question on the optimality of the Dirichlet prior smoothing techniques applied to entropy estimation. We thank the anonymous reviewers and the associate editor for very helpful comments that significantly improved the presentation of the paper.

APPENDIX A
AUXILIARY LEMMAS

We begin with the definition of the negative association property, which allows us to upper bound the variance by treating the components of the empirical distribution $P_n(i)$ as independent random variables.

Definition. [79, Def..]
Random variables $X_1, X_2, \ldots, X_S$ are said to be negatively associated if for any pair of disjoint subsets $A_1, A_2$ of $\{1, 2, \ldots, S\}$, and any component-wise increasing functions $f_1, f_2$,

$$\mathrm{Cov}\big(f_1(X_i, i \in A_1),\ f_2(X_j, j \in A_2)\big) \le 0. \tag{145}$$

To verify whether random variables $X_1, X_2, \ldots, X_S$ are negatively associated or not, the following lemma presents a useful criterion.

Lemma 24. [79, Thm..9] Let $X_1, X_2, \ldots, X_S$ be $S$ independent random variables with log-concave densities. Then the joint conditional distribution of $X_1, X_2, \ldots, X_S$ given $\sum_{i=1}^S X_i$ is negatively associated.

In light of the preceding lemma, we can obtain the following corollary.

Corollary 5. For any discrete probability distribution vector $P \in \mathcal{M}_S$, the random variables $X = (X_1, X_2, \ldots, X_S)$ drawn from the multinomial distribution $X \sim \mathrm{multi}(n; P)$ are negatively associated.

Proof. Consider the Poissonized model $Y_i \sim \mathrm{Poi}(np_i)$, $1 \le i \le S$, with all $Y_i$ independent. It is straightforward to verify that each $Y_i$ possesses a log-concave distribution. Then, conditioning on $\sum_{i=1}^S Y_i = n$, we know that $(Y_1, Y_2, \ldots, Y_S) \,|\, \sum_{i=1}^S Y_i = n \sim \mathrm{multi}(n; P)$, hence Lemma 24 yields the desired result.

The next lemma gives bounds on the digamma function $\psi(z) = \frac{\Gamma'(z)}{\Gamma(z)}$.

Lemma 25. [80, Lemma .7] The digamma function $\psi(z)$ is the only solution of the functional equation $F(x+1) = F(x) + \frac{1}{x}$ that is monotone, strictly concave on $\mathbb{R}_+$, and satisfies $F(1) = -\gamma$, where $\gamma$ is the Euler–Mascheroni constant.
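The properties listed in Lemma 25 can be checked numerically. The sketch below approximates $\psi$ by a central difference of `math.lgamma` from the Python standard library; this finite-difference shortcut, the step size `h`, and the hard-coded value of $\gamma$ are illustrative assumptions rather than anything used in the paper.

```python
import math

# Euler-Mascheroni constant, hard-coded to double precision.
EULER_GAMMA = 0.5772156649015329


def digamma(x, h=1e-4):
    """Approximate psi(x) = d/dx ln Gamma(x) by a central difference; needs x > h."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)


if __name__ == "__main__":
    # Normalization F(1) = -gamma.
    print(digamma(1.0), -EULER_GAMMA)
    # Functional equation F(x + 1) = F(x) + 1/x at a few points.
    for x in (0.5, 1.0, 2.5, 10.0):
        print(x, digamma(x + 1.0) - digamma(x), 1.0 / x)
```

The finite difference is accurate to roughly $h^2$, which is enough to see the functional equation, the monotonicity, and the concavity of $\psi$ on the positive axis.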
