Maximum Likelihood Estimation of Functionals of Discrete Distributions


Jiantao Jiao, Student Member, IEEE, Kartik Venkat, Student Member, IEEE, Yanjun Han, Student Member, IEEE, and Tsachy Weissman, Fellow, IEEE

arXiv: v7 [cs.IT] 0 Aug 07

Abstract: We consider the problem of estimating functionals of discrete distributions, and focus on a tight (up to universal multiplicative constants for each specific functional) nonasymptotic analysis of the worst case squared error risk of widely used estimators. We apply concentration inequalities to analyze the random fluctuation of these estimators around their expectations, and the theory of approximation using positive linear operators to analyze the deviation of their expectations from the true functional, namely their bias. We explicitly characterize the worst case squared error risk incurred by the Maximum Likelihood Estimator (MLE) in estimating the Shannon entropy $H(P) = \sum_{i=1}^S -p_i \ln p_i$, and the power sum $F_\alpha(P) = \sum_{i=1}^S p_i^\alpha$, $\alpha > 0$, up to universal multiplicative constants for each fixed functional, for any alphabet size $S \le \infty$ and sample size $n$ for which the risk may vanish. As a corollary, for Shannon entropy estimation, we show that it is necessary and sufficient to have $n \gg S$ observations for the MLE to be consistent. In addition, we establish that it is necessary and sufficient to consider $n \gg S^{1/\alpha}$ samples for the MLE to consistently estimate $F_\alpha(P)$, $0 < \alpha < 1$. The minimax rate-optimal estimators for both problems require $S/\ln S$ and $S^{1/\alpha}/\ln S$ samples, which implies that the MLE has a strictly sub-optimal sample complexity. When $1 < \alpha < 3/2$, we show that the worst-case squared error rate of convergence for the MLE is $n^{-2(\alpha-1)}$ for infinite alphabet size, while the minimax squared error rate is $(n \ln n)^{-2(\alpha-1)}$. When $\alpha \ge 3/2$, the MLE achieves the minimax optimal rate $n^{-1}$ regardless of the alphabet size. As an application of the general theory, we analyze the Dirichlet prior smoothing techniques for Shannon entropy estimation.
In this context, one approach is to plug the Dirichlet prior smoothed distribution into the entropy functional, while the other is to calculate the Bayes estimator for entropy under the Dirichlet prior for squared error, which is the conditional expectation. We show that in general such estimators do not improve over the maximum likelihood estimator. No matter how we tune the parameters in the Dirichlet prior, this approach cannot achieve the minimax rates in entropy estimation. The performance of the minimax rate-optimal estimator with $n$ samples is essentially at least as good as that of Dirichlet smoothed entropy estimators with $n \ln n$ samples.

Manuscript received Month 00, 0000; revised Month 00, 0000; accepted Month 00. Date of current version Month 00. This work was supported in part by the Center for Science of Information (CSoI) under grant agreement CCF. Materials of this paper were presented in part at the 2015 International Symposium on Information Theory, Hong Kong, China. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

Jiantao Jiao, Yanjun Han, and Tsachy Weissman are with the Department of Electrical Engineering, Stanford University, CA, USA. Email: {jiantao, yjhan, tsachy}@stanford.edu. Kartik Venkat was with the Department of Electrical Engineering, Stanford University. He is currently with PDT Partners. Email: kvenkat@alumni.stanford.edu.

Index Terms: entropy estimation, maximum likelihood estimator, Dirichlet prior smoothing, approximation theory, high dimensional statistics, Rényi entropy, approximation using positive linear operators

I. INTRODUCTION

Entropy and related information measures arise in information theory, statistics, machine learning, biology, neuroscience, image processing, linguistics, secrecy, ecology, physics, and finance, among other fields. Numerous inferential tasks rely on data driven procedures to estimate these quantities (see, e.g., []–[6]).
We focus on two concrete and well-motivated examples of information measures, namely the Shannon entropy [7]

$$H(P) \triangleq \sum_{i=1}^S -p_i \ln p_i, \tag{1}$$

and the power sum $F_\alpha(P)$, $\alpha > 0$:

$$F_\alpha(P) \triangleq \sum_{i=1}^S p_i^\alpha, \quad \alpha > 0. \tag{2}$$

The power sum functional $F_\alpha(P)$ often emerges in various operational problems [8]. It also has connections to the Rényi entropy [9] $H_\alpha(P)$ via the formula $H_\alpha(P) = \frac{\ln F_\alpha(P)}{1-\alpha}$.

Consider estimating the Shannon entropy $H(P)$ based on $n$ i.i.d. samples following an unknown discrete distribution $P$ with unknown alphabet size $S$. This problem has a rich history, with extensive study in fields including information theory, statistics, neuroscience, physics, psychology, and medicine. We refer the reader to [0] for a review. One of the most widely used estimators for this purpose is the Maximum Likelihood Estimator (MLE), which is simply the empirical entropy. The empirical entropy is an instantiation of the plug-in principle in functional estimation, where a point estimate of the parameter (the distribution $P$ in this case) is used to construct an estimator for a functional of the parameter via the plug-in approach.

The idea of using the MLE for estimating information measures of interest (in this case entropy) is not only intuitive, but has sound justification: asymptotic efficiency. The beautiful theory of Hájek and Le Cam []–[3] shows that, as the number of observed samples grows without bound while the finite parameter dimension (e.g., alphabet size) remains fixed, the MLE performs optimally in estimating any differentiable functional when the statistical model complies with the benign LAN (Local Asymptotic Normality) condition [3]. Thus, for finite dimensional problems, the problems of parameter and functional estimation are well understood in an asymptotic sense, and the MLE appears to be not only natural but also theoretically justified.

But does it make sense to employ the MLE to estimate the entropy in most practical applications? As it turns out, while asymptotically optimal in entropy estimation, the MLE is by no means sacrosanct in many real applications, especially in regimes where the alphabet size is comparable to, or even larger than, the number of observations. It was shown that the MLE for entropy is strictly sub-optimal in the large alphabet regime [4], [5]. Therefore, classical asymptotic theory does not satisfactorily address high dimensional settings, which are becoming increasingly important in the modern era of high dimensional statistics.

There has been a wave of recent research activity focusing on analyzing existing approaches to functional estimation, as well as proposing new estimators that are provably near optimal in the large alphabet regime. Paninski [4] showed that the MLE needs $n \gg S$ samples to consistently estimate the Shannon entropy, and Paninski [5] established the existence of a non-explicit estimator that only required $n \ll S$ samples. It follows that the MLE is strictly sub-optimal in terms of sample complexity. It was Valiant and Valiant [6] who first explicitly constructed a linear programming based estimator (later modified in [7]) that achieves consistency in entropy estimation with $n \gg S/\ln S$ samples, which they also proved to be necessary. Valiant and Valiant [8] constructed another approximation based estimator that achieved better theoretical properties than the linear programming ones, but which was not yet shown to be minimax rate-optimal for all ranges of $S$ and $n$. The authors [0] constructed the first minimax rate-optimal estimators for $H(P)$ and $F_\alpha(P)$, $\alpha > 0$, based on best polynomial approximation, which are agnostic to the alphabet size $S$.
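As a concrete illustration of the plug-in principle described above, the following is a minimal sketch (not the authors' released package) of the empirical entropy and the empirical power sum computed from a list of samples. The function names `empirical_entropy` and `empirical_power_sum` are hypothetical, chosen here for illustration; entropy is in nats. Symbols that never appear contribute zero to both sums, since $0 \ln 0 = 0$ and $0^\alpha = 0$ for $\alpha > 0$, so the alphabet size does not need to be known.

```python
import math
from collections import Counter


def empirical_entropy(samples):
    """Plug-in (MLE) estimate of Shannon entropy in nats: -sum p_hat * ln(p_hat)."""
    n = len(samples)
    counts = Counter(samples)
    # Only observed symbols appear in `counts`; unseen symbols contribute 0.
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def empirical_power_sum(samples, alpha):
    """Plug-in (MLE) estimate of F_alpha(P) = sum p_i^alpha for alpha > 0."""
    n = len(samples)
    counts = Counter(samples)
    return sum((c / n) ** alpha for c in counts.values())


# Illustrative data (hypothetical): empirical distribution (1/2, 1/3, 1/6).
samples = ["a", "b", "a", "c", "a", "b"]
h_hat = empirical_entropy(samples)
f2_hat = empirical_power_sum(samples, 2.0)
```

Both estimators depend on the data only through the symbol counts, which is what makes the bias and variance analysis in terms of multinomial counts possible.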
Utilizing the released MATLAB and Python packages of the estimators in [0], the works [9], [0] demonstrated that these minimax rate-optimal estimators can lead to significant performance boosts in various machine learning tasks. Wu and Yang [] independently applied the best polynomial approximation idea to entropy estimation and obtained the minimax rates. However, their estimator requires knowledge of the alphabet size $S$. The approximation ideas proved to be very fruitful in Acharya et al. [], Wu and Yang [3], Han, Jiao, and Weissman [4], Jiao, Han, and Weissman [5], Bu et al. [6], Orlitsky, Suresh, and Wu [7], Wu and Yang [8].

The main contribution of this paper is an explicit characterization of the worst case squared error risk of estimating $H(P)$ and $F_\alpha(P)$ using the MLE up to a universal multiplicative constant for each specific functional, for all ranges of $S$ and $n$ in which the risk may vanish. Understanding the benefits and limitations of the MLE in a nonasymptotic setting serves two key purposes. First, the approach is a natural benchmark for comparing other more nuanced procedures for estimation of functionals. Second, performance analysis for the MLE reveals regimes where the problem is difficult, and motivates the development of improvements, which have been validated in [0], [4]–[8], [], [].

As a byproduct of the analysis, we explicitly point out an equivalence between bias analysis of functional estimators using plug-in rules and approximation theory using positive linear operators. We believe these powerful tools from approximation theory may have far reaching impact in various applications in the information theory community.

We mention that there exist numerous other approaches proposed in various disciplines to estimate entropy, many of which are difficult to analyze theoretically. Among them we mention the Miller–Madow bias-corrected estimator and its variants [9]–[3], the jackknife estimator [3], the shrinkage estimator [33], the coverage adjusted estimator [34], the Best Upper Bound (BUB) estimator [4], the B-Splines estimator [35], and [36], [37], etc.
For a Bayesian statistician, a natural approach is to first impose a prior on the unknown discrete distribution before estimating entropy. The Dirichlet prior, being the conjugate prior to the multinomial distribution, appears to be particularly popular in the Bayesian approach to entropy estimation. Dirichlet smoothing may have two connotations in the context of entropy estimation:

- [38], [39] One first obtains a Bayes estimate for the discrete distribution $P$, which we denote by $\hat{P}_B$, and then plugs it into the entropy functional to obtain the entropy estimate $H(\hat{P}_B)$.
- [40]–[4] One calculates the Bayes estimate for entropy $H(P)$ under the Dirichlet prior for squared error. The estimator is the conditional expectation $\mathbb{E}[H(P) \mid X]$, where $X$ represents the samples.

Nemenman, Shafee, and Bialek [4] argued intuitively why the Dirichlet prior is bad for entropy estimation and proposed to use mixtures of Dirichlet priors. Archer, Park, and Pillow [43] have come up with priors that perform better than the Dirichlet prior. Also see [44], [45].

Another contribution of this paper is an explicit characterization of the worst case squared error risk of estimating $H(P)$ using the Dirichlet prior plug-in approach up to a universal multiplicative constant, for all ranges of $S$ and $n$ in which the risk may vanish. We show rigorously that neither of the two approaches utilizing the Dirichlet prior results in improvements over the MLE in the large alphabet regime. Specifically, these approaches require at least $n \gg S$ samples to be consistent, while the minimax rate-optimal estimators such as the ones in [0]–[] only need $n \gg S/\ln S$ samples to achieve consistency.

The rest of the paper is organized as follows. We present the main results in Section III, discuss the fundamental ideas behind the proofs in Section IV, and detail the proofs in Sections V and VI. Proofs of auxiliary lemmas are deferred to the appendices.

II. PRELIMINARIES

The Dirichlet distribution of order $S \ge 2$ with parameters $\alpha_1, \ldots, \alpha_S > 0$ has a probability density function with respect to Lebesgue measure on the Euclidean space $\mathbb{R}^{S-1}$ given by

$$f(x_1, \ldots, x_{S-1}; \alpha_1, \ldots, \alpha_S) = \frac{1}{B(\alpha)} \prod_{i=1}^S x_i^{\alpha_i - 1} \tag{3}$$

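on the open $(S-1)$-dimensional simplex defined by

$$x_1, \ldots, x_{S-1} > 0 \tag{4}$$
$$x_1 + \cdots + x_{S-1} < 1 \tag{5}$$
$$x_S = 1 - x_1 - \cdots - x_{S-1} \tag{6}$$

and zero elsewhere. The normalizing constant is the multinomial Beta function, which can be expressed in terms of the Gamma function:

$$B(\alpha) = \frac{\prod_{i=1}^S \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^S \alpha_i\right)}, \quad \alpha = (\alpha_1, \ldots, \alpha_S). \tag{7}$$

Assuming the unknown discrete distribution $P$ follows the prior distribution $P \sim \mathrm{Dir}(\alpha)$, and we observe a vector $X = (X_1, X_2, \ldots, X_S)$ with multinomial distribution $\mathrm{multi}(n; p_1, p_2, \ldots, p_S)$, one can show that the posterior distribution $P_{P|X}$ is also a Dirichlet distribution with parameters

$$\alpha + X = (\alpha_1 + X_1, \alpha_2 + X_2, \ldots, \alpha_S + X_S). \tag{8}$$

Furthermore, the posterior mean (conditional expectation) of $p_i$ given $X$ is given by [46, Example 5.4.4]

$$\delta_i(X) \triangleq \mathbb{E}[p_i \mid X] = \frac{\alpha_i + X_i}{n + \sum_{i=1}^S \alpha_i}. \tag{9}$$

The estimator $\delta_i(X)$ is widely used in practice for various choices of $\alpha$. For example, if $\alpha_i = \sqrt{n}/S$, then the corresponding $(\delta_1(X), \delta_2(X), \ldots, \delta_S(X))$ is the minimax estimator for $P$ under squared loss [46, Example 5.4.5]. However, it is no longer minimax under other loss functions such as $\ell_1$ loss, which was investigated in [47].

Note that the estimator $\delta_i(X)$ subsumes the MLE $\hat{p}_i = X_i/n$ as a special case, since we can take the limit $\alpha \to 0$ for $\delta_i(X)$ to obtain the MLE. We denote the empirical distribution by $P_n = (\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_S)$. The Dirichlet prior smoothed distribution estimate is denoted as $\hat{P}_B$, where

$$\hat{P}_B = \frac{n}{n + \sum_{i=1}^S \alpha_i} P_n + \frac{\sum_{i=1}^S \alpha_i}{n + \sum_{i=1}^S \alpha_i} \cdot \frac{\alpha}{\sum_{i=1}^S \alpha_i}. \tag{10}$$

Note that the smoothed distribution $\hat{P}_B$ can be viewed as a convex combination of the empirical distribution $P_n$ and the prior distribution $\alpha / \sum_{i=1}^S \alpha_i$. We call the estimator $H(\hat{P}_B)$ the Dirichlet prior smoothed plug-in estimator.

Another way to apply the Dirichlet prior in entropy estimation is to compute the Bayes estimator for $H(P)$ under squared error, given that $P$ follows the Dirichlet prior. It is well known that the Bayes estimator under squared error is the conditional expectation. It was shown in Wolpert and Wolf [40] that

$$\hat{H}^{\mathrm{Bayes}} \triangleq \mathbb{E}[H(P) \mid X] = \psi\left(\sum_{i=1}^S (\alpha_i + X_i) + 1\right) - \sum_{i=1}^S \frac{\alpha_i + X_i}{\sum_{i=1}^S (\alpha_i + X_i)} \psi(\alpha_i + X_i + 1), \tag{11}$$

where $\psi(z) \triangleq \Gamma'(z)/\Gamma(z)$ is the digamma function. We call the estimator $\hat{H}^{\mathrm{Bayes}}$ the Bayes estimator under Dirichlet prior.

Throughout this paper, we observe $n$ i.i.d.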
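samples from an unknown discrete distribution $P = (p_1, p_2, \ldots, p_S)$. We denote the samples as i.i.d. random variables $\{Z_i\}_{1 \le i \le n}$ taking values in $\mathcal{Z} = \{1, 2, \ldots, S\}$ with probabilities $(p_1, p_2, \ldots, p_S)$. Defining

$$X_i \triangleq \sum_{j=1}^n \mathbb{1}(Z_j = i), \quad 1 \le i \le S,$$

we know that $(X_1, X_2, \ldots, X_S)$ follows a multinomial distribution with parameter $(n; p_1, p_2, \ldots, p_S)$. Denote

$$h_j \triangleq \sum_{i=1}^S \mathbb{1}(X_i = j), \quad 0 \le j \le n.$$

The Maximum Likelihood Estimators (MLE) for $H(P)$ and $F_\alpha(P)$ are defined, respectively, as $H(P_n)$ and $F_\alpha(P_n)$, with $P_n$ being the empirical distribution. We assume the functional $F(P)$ takes the form

$$F(P) = \sum_{i=1}^S f(p_i). \tag{13}$$

Then it is evident that the MLE $F(P_n)$ for estimating the functional $F(P)$ in (13) can be alternatively represented as the following linear function of $(h_0, h_1, \ldots, h_n)$:

$$F(P_n) = \sum_{j=0}^n f\left(\frac{j}{n}\right) h_j. \tag{14}$$

Recall that the risk function under squared error for any estimator $\hat{F}$ in estimating the functional $F(P)$ may be decomposed as

$$\mathbb{E}_P\left(F(P) - \hat{F}\right)^2 = \left(\mathbb{E}_P \hat{F} - F(P)\right)^2 + \mathbb{E}_P\left(\hat{F} - \mathbb{E}_P \hat{F}\right)^2, \tag{15}$$

where $\left(\mathbb{E}_P \hat{F} - F(P)\right)^2$ represents the squared bias, and $\mathbb{E}_P\left(\hat{F} - \mathbb{E}_P \hat{F}\right)^2$ represents the variance. The subscript $P$ means that the expectation is taken with respect to the distribution $P$ that generates the i.i.d. observations. We omit the subscript for the expectation operator $\mathbb{E}$ if the meaning of the expectation is clear from the context.

Notation: $a \wedge b$ denotes $\min\{a, b\}$, and $a \vee b$ denotes $\max\{a, b\}$. For two non-negative series $\{a_n\}, \{b_n\}$, the notation $a_n \lesssim b_n$ means that there exists a positive universal constant $C < \infty$ such that $a_n \le C b_n$ for all $n$. The notation $a_n \asymp b_n$ is equivalent to $a_n \lesssim b_n$ and $b_n \lesssim a_n$. The notation $a_n \gg b_n$ means that $\liminf_{n \to \infty} a_n / b_n = \infty$. Throughout this paper, the notations $\lesssim$, $\gtrsim$, $\asymp$ involve absolute constants that may only depend on $\alpha$ but not on $S$ or $n$. We denote by $\mathcal{M}_S$ the space of discrete distributions with alphabet size $S$.

III. MAIN RESULTS

A. Estimating $F_\alpha(P)$

We split the upper bounds and the lower bounds into two theorems, and present their succinct summaries in Corollaries 1 and 2.

Theorem 1 (Upper bounds). We have the following upper bounds on the worst case squared error risk of the MLE in estimating $F_\alpha(P)$: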

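1) $\alpha \ge 2$:

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P\left(F_\alpha(P_n) - F_\alpha(P)\right)^2 \le \left(\frac{\alpha(\alpha-1)}{2n}\right)^2 + \frac{\alpha^2}{4n}. \tag{16}$$

2) $1 < \alpha < 2$: sup E P F α P F α P 4 3S α/ 5S C α α/ α, α 4, (17)

where C α, ω ϕ xα, / > 0 satisfies $\limsup_{n \to \infty} C_{\alpha,n} < \infty$ for $1 < \alpha < 2$, and $\omega_\varphi^2$ is the second-order Ditzian–Totik modulus of smoothness introduced in Section IV-B.

3) $1/2 \le \alpha < 1$: sup E P F α P F α P 3S α/ 5S α/ α 0S α 0 S α α α

4) $0 < \alpha < 1/2$: sup E P F α P F α P 3S α/ 5S α/ α 0S 0 α α S α α. (19)

Moreover, in all the bounds presented above, the first term bounds the square of the bias, and the second term bounds the variance.

Theorem 2 (Lower bounds). We have the following lower bounds on the worst case squared error risk of the MLE in estimating $F_\alpha(P)$:

1) $\alpha \ge 3/2$: there exists a constant $C_\alpha > 0$ such that for all $n$,

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P\left(F_\alpha(P_n) - F_\alpha(P)\right)^2 \ge \frac{C_\alpha}{n}. \tag{20}$$

2) $1 < \alpha < 3/2$: if $S = cn$ for any $c > 0$, then

$$\liminf_{n \to \infty} n^{2(\alpha-1)} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P\left(F_\alpha(P_n) - F_\alpha(P)\right)^2 > 0. \tag{21}$$

3) $1/2 \le \alpha < 1$: if $n \ge S$, then sup E P F α P F α P α α 7 α S α 64e α 4 [ S α α S α α] e /4 S α,

4) $0 < \alpha < 1/2$: if $n \ge S$, then sup E P F α P F α P α α 36 α S. (23)

There are several interesting implications of this result, highlighted in the following corollaries.

Corollary 1. For any fixed $\alpha > 1$, there exist universal convergence rates for $F_\alpha(P_n)$:

$$\sup_{S \in \mathbb{N}} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P\left(F_\alpha(P_n) - F_\alpha(P)\right)^2 \asymp \begin{cases} n^{-2(\alpha-1)} & 1 < \alpha < 3/2 \\ n^{-1} & \alpha \ge 3/2 \end{cases} \tag{24}$$

Corollary 1 implies that, when $\alpha \ge 3/2$, estimation of $F_\alpha(P)$ is extremely simple in terms of convergence rate: plug-in estimation achieves the best possible rate $n^{-1}$, as shown in the theory of regular statistical experiments of classical asymptotic theory, see [48, Chap..7.]. Results of this form have appeared in the literature; for example, Antos and Kontoyiannis [49] showed that it suffices to let the number of samples grow to consistently estimate $F_\alpha(P)$, $\alpha \ge 2$, $\alpha \in \mathbb{Z}$. However, when $1 < \alpha < 3/2$, the rate $n^{-2(\alpha-1)}$ is considerably slower. Interestingly, there exist estimators that demonstrate better convergence rates for estimating $F_\alpha(P)$, $1 < \alpha < 3/2$. Jiao et al. [0] showed that the minimax rate in estimating $F_\alpha(P)$, $1 < \alpha < 3/2$, is $(n \ln n)^{-2(\alpha-1)}$ as long as $S \gtrsim n \ln n$, which is achieved using the general methodology developed therein for constructing minimax rate-optimal estimators for nonsmooth functionals.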
Let us now examine the case $0 < \alpha < 1$, another interesting regime that has not been characterized before. In this regime, we observe a significant increase in the difficulty of the estimation problem. In particular, the relative scaling between the number of observations $n$ and the alphabet size $S$ for consistent estimation of $F_\alpha(P)$ exhibits a phase transition, encapsulated in the following.

Corollary 2. Fix $\alpha \in (0, 1)$. The worst case squared error risk of the MLE $F_\alpha(P_n)$ in estimating $F_\alpha(P)$ is characterized as follows when $n \gtrsim S$:

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P\left(F_\alpha(P_n) - F_\alpha(P)\right)^2 \asymp \begin{cases} \frac{S^2}{n^{2\alpha}} + \frac{S^{2-2\alpha}}{n} & 1/2 < \alpha < 1 \\ \frac{S^2}{n^{2\alpha}} & 0 < \alpha \le 1/2 \end{cases} \tag{25}$$

Corollary 2 follows directly from Theorem 1 and Theorem 2. In particular, it implies that it is necessary and sufficient to take $n \gg S^{1/\alpha}$ samples to consistently estimate $F_\alpha(P)$, $0 < \alpha < 1$, using the MLE. Thus, as one might expect, the scale of the number of measurements required for consistent estimation increases as $\alpha$ decreases. When $\alpha \to 0$, the number of samples required for the MLE grows super-polynomially in $S$, which is consistent with the intuition that $F_\alpha(P)$, $\alpha \to 0$, is essentially equivalent to the alphabet size of a distribution, whose estimation is known to be very hard when there may exist symbols with very small probabilities [50]. We exhibit some of our findings by plotting the required value of $\ln n / \ln S$ for consistent estimation of $F_\alpha(P)$ using the MLE $F_\alpha(P_n)$, as a function of $\alpha$, in Figure 1.

[Figure 1: plot of $\ln n / \ln S$ against $\alpha$; the thick curve is $1/\alpha$, with the region below it labeled "not achievable via MLE (Theorem 2)" and the region above it labeled "achievable via MLE (Theorem 1)".]

Fig. 1: For any fixed point above the thick curve, consistent estimation of $F_\alpha(P)$ is achieved using the MLE $F_\alpha(P_n)$, as shown in Theorem 1. For any fixed point below the thick curve in the regime $0 < \alpha < 1$, Theorem 2 shows that the MLE does not have vanishing maximum squared error risk.

It turns out that one can construct estimators that are better than the MLE in terms of the sample complexity required for consistent estimation in the regime $0 < \alpha < 1$. Indeed, Jiao et al. [0] showed that the minimax rate-optimal estimator requires $n \gg S^{1/\alpha}/\ln S$ samples to achieve consistency, which attains a logarithmic improvement in the sample complexity over the MLE.

B. Estimating $H(P)$

We consider not only $H(P_n)$, but also the so-called Miller–Madow bias-corrected estimator [9], defined as

$$H^{\mathrm{MM}}(P_n) = H(P_n) + \frac{S-1}{2n}. \tag{26}$$

Theorem 3. The worst case squared error risk of $H(P_n)$ admits the following upper bound for all $S, n$:

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P\left(H(P_n) - H(P)\right)^2 \le \left(\ln\left(1 + \frac{S-1}{n}\right)\right)^2 + \frac{(\ln n)^2}{n} \wedge \frac{2(\ln S + 3)^2}{n}. \tag{27}$$

If $n \ge 15S$, then sup E P HP HP S S 0 c l S. (28)

Moreover, if $n \ge 15S$, the Miller–Madow bias-corrected estimator satisfies sup E P H MM P HP S 0 c l S, (29)

where the positive constant $c > 0$ in both expressions does not depend on $S$ or $n$.

Theorem 3 implies the following corollary.
Corollary 3. The worst case squared error risk of the MLE $H(P_n)$ in estimating $H(P)$ is characterized as follows when $n \ge 15S$:

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P\left(H(P_n) - H(P)\right)^2 \asymp \frac{S^2}{n^2} + \frac{\ln^2 S}{n}. \tag{30}$$

Here the first term corresponds to the squared bias, and the second term corresponds to the variance.

Paninski [4] showed that if $n = cS$, where $c > 0$ is a constant, the maximum squared error risk of $H(P_n)$, and of the Miller–Madow bias-corrected estimator $H^{\mathrm{MM}}(P_n)$, would be bounded away from zero. Paninski [4] also showed that when $n \gg S$, the MLE is consistent for estimating entropy. Corollary 3 implies that it is necessary and sufficient to take $n \gg S$ samples for the MLE to be consistent for estimating entropy. Comparing the results for $H(P)$ with those for $F_\alpha(P)$, we see that the intuition that $H(P)$ can be viewed as close to $F_\alpha(P)$ when $\alpha \to 1$ is indeed approximately correct, as $H(P)$ coincides with $\alpha = 1$ on the phase transition curve shown in Figure 1.

Table I summarizes the minimax squared error rates and the worst case squared error rates of the MLE in estimating $H(P)$ and $F_\alpha(P)$, $\alpha > 0$. It is clear that the MLE cannot achieve the minimax rates for estimation of $H(P)$, and of $F_\alpha(P)$ when $0 < \alpha < 3/2$. In these cases, there exist strictly better estimators whose performance with $n$ samples is roughly the same as that of the MLE with $n \ln n$ samples. This phenomenon was termed effective sample size enlargement in [0].

TABLE I: Summary of results in this paper and the companion [0]

Functional | Minimax squared error rates | Maximum squared error rates of MLE
$H(P)$ | $\frac{S^2}{(n \ln n)^2} + \frac{\ln^2 S}{n}$ ($n \gtrsim S/\ln S$) [0], [6], [8], [] | $\frac{S^2}{n^2} + \frac{\ln^2 S}{n}$ ($n \gtrsim S$) (Corollary 3)
$F_\alpha(P)$, $0 < \alpha \le 1/2$ | $\frac{S^2}{(n \ln n)^{2\alpha}}$ ($n \gtrsim S^{1/\alpha}/\ln S$, $\ln n \lesssim \ln S$) [0] | $\frac{S^2}{n^{2\alpha}}$ ($n \gtrsim S^{1/\alpha}$) (Corollary 2)
$F_\alpha(P)$, $1/2 < \alpha < 1$ | $\frac{S^2}{(n \ln n)^{2\alpha}} + \frac{S^{2-2\alpha}}{n}$ ($n \gtrsim S^{1/\alpha}/\ln S$) [0] | $\frac{S^2}{n^{2\alpha}} + \frac{S^{2-2\alpha}}{n}$ ($n \gtrsim S^{1/\alpha}$) (Corollary 2)
$F_\alpha(P)$, $1 < \alpha < 3/2$ | $(n \ln n)^{-2(\alpha-1)}$ ($S \gtrsim n \ln n$) [0] | $n^{-2(\alpha-1)}$ ($S \gtrsim n$) (Corollary 1)
$F_\alpha(P)$, $\alpha \ge 3/2$ | $n^{-1}$ | $n^{-1}$ (Theorem)

C. Dirichlet prior techniques applying to entropy estimation

For symmetry, we restrict attention to the case where the parameter $\alpha$ in the Dirichlet distribution takes the form $(a, a, \ldots, a)$. In comparison to the MLE $H(P_n)$, where $P_n$ is the empirical distribution, the Dirichlet smoothing scheme $H(\hat{P}_B)$ has a disadvantage: it requires knowledge of the alphabet size $S$ in general. We define

$$\hat{p}_{B,i} = \frac{n \hat{p}_i + a}{n + Sa}, \tag{31}$$

and

$$p_{B,i} = \mathbb{E}[\hat{p}_{B,i}] = \frac{n p_i + a}{n + Sa}. \tag{32}$$

It is clear that

$$\hat{P}_B = \frac{n}{n + Sa} P_n + \frac{Sa}{n + Sa} U_S \tag{33}$$
$$P_B = \frac{n}{n + Sa} P + \frac{Sa}{n + Sa} U_S, \tag{34}$$

where $P_n$ stands for the empirical distribution, $P$ is the true distribution, and $U_S$ denotes the uniform distribution on the same alphabet of size $S$.

Theorem 4. If $n \ge \max\{Sa, ea\}$, then the maximum squared error risk of $H(\hat{P}_B)$ in estimating $H(P)$ is upper bounded as sup E P HˆP B HP l S Sa Sa [ 3l Sa Sa Sa l a ] Sa a S. (35)

Here the first term bounds the squared bias, and the second term bounds the variance.

Theorem 5. If $n \ge \max\{15S, Sa, ea\}$, then the maximum $L_2$ risk of $H(\hat{P}_B)$ in estimating $H(P)$ is lower bounded as sup E P HˆP B HP [ S a Sa 4Sa l S a 8 S 80 ] 48 c l S, (36)

where $c > 0$ is a universal constant that does not depend on $a$, $S$, or $n$.

If $n < Sa$, then we have S sup E P HˆP B HP l S. (37) S

If $n < ea$, then we have S sup E P HˆP B HP l S. (38) S e

If $n < 15S$, $n \ge ea$, then we have sup E P HˆP B HP [ S a Sa 4Sa l /5 ], a (39)

where $\lfloor x \rfloor$ is the largest integer that does not exceed $x$, and $(x)_+ = \max\{x, 0\}$ represents the positive part of $x$.

The following corollary immediately follows from Theorem 4 and Theorem 5.

Corollary 4. If $n \gg S$ and $a$ is upper bounded by a constant, then the maximum squared error risk of $H(\hat{P}_B)$ vanishes.
Conversely, if $n \lesssim S$, then the maximum squared error risk of $H(\hat{P}_B)$ is bounded away from zero.

The next theorem presents a lower bound on the maximum risk of the Bayes estimator under Dirichlet prior. Since we have assumed that all $\alpha_i = a$, $1 \le i \le S$, the Bayes estimator under Dirichlet prior is

$$\hat{H}^{\mathrm{Bayes}} = \psi(Sa + n + 1) - \sum_{i=1}^S \frac{a + X_i}{Sa + n} \psi(a + X_i + 1). \tag{40}$$
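The two Dirichlet-based estimators discussed in this subsection, the smoothed plug-in estimator built from (31) and the Bayes estimator (40), can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, and because the Python standard library has no digamma function, `digamma` below is a small hand-written approximation (recurrence $\psi(x) = \psi(x+1) - 1/x$ followed by the standard asymptotic series), adequate for illustration only.

```python
import math


def digamma(x):
    """Approximate psi(x) for x > 0 (illustrative helper, not a library routine)."""
    result = 0.0
    # Shift the argument upward with psi(x) = psi(x + 1) - 1/x.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # Asymptotic series: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6).
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def dirichlet_plugin_entropy(counts, a):
    """H(P_hat_B) with p_hat_{B,i} = (X_i + a) / (n + S a), as in (31)."""
    n, S = sum(counts), len(counts)
    probs = [(x + a) / (n + S * a) for x in counts]
    return -sum(p * math.log(p) for p in probs if p > 0)


def dirichlet_bayes_entropy(counts, a):
    """Posterior mean of H(P) under a symmetric Dirichlet(a) prior, as in (40)."""
    n, S = sum(counts), len(counts)
    total = S * a + n
    return digamma(total + 1) - sum(
        (a + x) / total * digamma(a + x + 1) for x in counts
    )


# Illustrative counts (hypothetical): S = 4 symbols, n = 8 samples, a = 0.5.
counts = [5, 3, 0, 0]
h_plugin = dirichlet_plugin_entropy(counts, 0.5)
h_bayes = dirichlet_bayes_entropy(counts, 0.5)
```

Note that, unlike the MLE, both functions need the full count vector including zero counts, which reflects the remark above that Dirichlet smoothing requires knowledge of the alphabet size $S$.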

Theorem 6. If S, then ĤBayes SaS/ sup E P HP l Sae γ, (41)

where $\gamma$ is the Euler–Mascheroni constant.

Evident from Theorems 4, 5, and 6 is the fact that in the best situation (i.e., $a$ not too large), both the Dirichlet prior smoothed plug-in estimator and the Bayes estimator under Dirichlet prior still require at least $n \gg S$ samples to be consistent, which is the same as the MLE. In contrast, the estimators in Valiant and Valiant [6]–[8], Jiao et al. [0], Wu and Yang [] are consistent if $n \gg S/\ln S$, which is the optimal sample complexity. Thus, we can conclude that the Dirichlet smoothing technique does not solve the entropy estimation problem.

IV. FUNDAMENTAL IDEAS OF OUR ANALYSIS

In this section, we discuss the fundamental tools we employed to obtain the results in Section III, as well as general recipes we suggest for analyzing the performance of functional estimators.

A. Variance

The variance characterizes the degree to which the random variable $F(\hat{P})$ fluctuates around its expectation, and the field of concentration inequalities is well suited to give the desired results. For all the functionals we consider, it turns out that the Efron–Stein inequality [5] and the bounded differences inequality give very tight bounds. For completeness we state them below.

Lemma 1. [5, Efron–Stein inequality, Theorem 3.] Let $Z_1, \ldots, Z_n$ be independent random variables and let $f(Z_1, Z_2, \ldots, Z_n)$ be a square integrable function. Moreover, if $Z_1', Z_2', \ldots, Z_n'$ are independent copies of $Z_1, Z_2, \ldots, Z_n$ and if we define, for every $i = 1, 2, \ldots, n$,

$$f_i' = f(Z_1, Z_2, \ldots, Z_{i-1}, Z_i', Z_{i+1}, \ldots, Z_n), \tag{42}$$

then

$$\mathrm{Var}(f) \le \frac{1}{2} \sum_{i=1}^n \mathbb{E}\left[(f - f_i')^2\right]. \tag{43}$$

The following inequality, which is called the bounded differences inequality, is a useful corollary of the Efron–Stein inequality.

Lemma 2. [5, Bounded differences inequality, Corollary 3.] If a function $f: \mathcal{Z}^n \to \mathbb{R}$ has the bounded differences property, i.e., for some nonnegative constants $c_1, c_2, \ldots, c_n$,

$$\sup_{z_1, \ldots, z_n, z_i' \in \mathcal{Z}} \left| f(z_1, \ldots, z_n) - f(z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_n) \right| \le c_i, \tag{44}$$

for every $1 \le i \le n$, then

$$\mathrm{Var}(f(Z_1, Z_2, \ldots, Z_n)) \le \frac{1}{4} \sum_{i=1}^n c_i^2, \tag{45}$$

given that $Z_1, Z_2, \ldots, Z_n$ are independent random variables.

We refer the readers to Boucheron et al.
[5] for a modern exposition of the concentration inequality toolbox.

B. Bias

It turns out that bias analysis in estimation, albeit widely studied in statistics, still largely bears an asymptotic and expansion-based nature in the mainstream statistical literature [53], [54]. In particular, the bootstrap [55] as a method for estimating functionals was essentially only analyzed in an asymptotic setting [56]. Among asymptotic analysis techniques, probably the most popular one is the Taylor expansion. We will show that the Taylor expansion may encounter great difficulties in analyzing the bias of the MLE in information measure estimation. Then, we will introduce the field of approximation theory using positive linear operators and demonstrate that it is essentially equivalent to nonasymptotic bias analysis for plug-in functional estimators. In doing so, we present the readers with abundant handy tools from approximation theory, which could be readily applicable to many problems that may seem highly intractable with standard expansion methods.

We start from entropy estimation. In the literature, considerable effort has been devoted to understanding the nonasymptotic performance of the MLE $H(P_n)$ in estimating $H(P)$. One of the earliest investigations in this direction is due to Miller [9] in 1955, who showed that, for any fixed distribution $P$,

$$\mathbb{E} H(P_n) = H(P) - \frac{S-1}{2n} + O\left(\frac{1}{n^2}\right). \tag{46}$$

Equation (46) was later refined by Harris [57] using higher order Taylor series expansions to yield

$$\mathbb{E} H(P_n) = H(P) - \frac{S-1}{2n} + \frac{1}{12 n^2}\left(1 - \sum_{i=1}^S \frac{1}{p_i}\right) + O\left(\frac{1}{n^3}\right). \tag{47}$$

Harris's result reveals an undesirable consequence of the Taylor expansion method: one cannot obtain uniform bounds on the bias of the MLE. Indeed, the term $\sum_{i=1}^S \frac{1}{p_i}$ can be arbitrarily large for some distributions $P$. However, it is evident that both $H(P_n)$ and $H(P)$ are bounded above by $\ln S$, since the maximum entropy of any distribution supported on $S$ elements is $\ln S$. Conceivably, for a distribution $P$ that makes $\sum_{i=1}^S \frac{1}{p_i}$ very large, we would need to compute even higher order Taylor expansions to obtain more accuracy, but even with such efforts we cannot obtain a uniform bias bound for all $P$.
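The failure of the expansion to give uniform bounds can be checked numerically. Since each count is binomial, $\mathbb{E} H(P_n)$ is a finite sum that can be evaluated exactly for small $n$. The sketch below (hypothetical function names, illustrative parameter choices) compares the exact bias with the first-order term in (46) and the second-order term in (47), for a uniform distribution and for a distribution with one very small probability.

```python
import math


def entropy_term(x):
    """f(x) = -x ln x with the convention f(0) = 0."""
    return -x * math.log(x) if x > 0 else 0.0


def entropy(p):
    return sum(entropy_term(pi) for pi in p)


def expected_plugin_entropy(p, n):
    """Exact E[H(P_n)]: each count X_i ~ Binomial(n, p_i), so sum binomial expectations."""
    total = 0.0
    for pi in p:
        for j in range(n + 1):
            total += entropy_term(j / n) * math.comb(n, j) * pi**j * (1 - pi) ** (n - j)
    return total


def miller_term(S, n):
    """First-order bias term -(S - 1) / (2n) from (46)."""
    return -(S - 1) / (2 * n)


def harris_term(p, n):
    """Second-order term (1 - sum_i 1/p_i) / (12 n^2) from (47)."""
    return (1 - sum(1 / pi for pi in p)) / (12 * n**2)


n = 100
uniform = [0.1] * 10
bias_uniform = expected_plugin_entropy(uniform, n) - entropy(uniform)

# One tiny probability: the second-order term explodes, the exact bias does not.
skewed = [1 - 1e-6, 1e-6]
bias_skewed = expected_plugin_entropy(skewed, n) - entropy(skewed)
harris_skewed = harris_term(skewed, n)
```

For the uniform distribution the exact bias is close to the first-order term, whereas for the skewed distribution the second-order term is several orders of magnitude larger than the exact bias, which is bounded by $\ln 2$ in absolute value.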
We gain one of our key insights into the bias of the MLE by relating it to the approximation error induced by the Bernstein polynomial approximation of the function $f$, which was first observed in Paninski [4]. To see this, we first compute the bias of $F(P_n)$ in estimating the functional $F(P)$ in (13).

Lemma 3. The bias of the estimator $F(P_n)$ is given by

$$\mathrm{Bias}(F(P_n)) \triangleq \mathbb{E} F(P_n) - F(P) = \sum_{i=1}^S \left( \sum_{j=0}^n f\left(\frac{j}{n}\right) \binom{n}{j} p_i^j (1 - p_i)^{n-j} - f(p_i) \right). \tag{48}$$

The bias term in (48) can be equivalently expressed as

$$\mathrm{Bias}(F(P_n)) = \sum_{i=1}^S \left( \sum_{j=0}^n f\left(\frac{j}{n}\right) B_{j,n}(p_i) - f(p_i) \right) \tag{49}$$
$$= \sum_{i=1}^S \left( B_n[f](p_i) - f(p_i) \right), \tag{50}$$

where $B_{j,n}(x) \triangleq \binom{n}{j} x^j (1-x)^{n-j}$ is the well-known Bernstein polynomial basis, and $B_n[f](x)$ is the so-called Bernstein polynomial for the function $f(x)$. Bernstein in 1912 [6] provided an insightful constructive proof of the Weierstrass theorem on approximation of continuous functions using polynomials, by showing that the Bernstein polynomial of any continuous function converges uniformly to that function.

From a functional analytic viewpoint, the Bernstein polynomial is an operator that maps a continuous function $f \in C[0,1]$ to another continuous function $B_n[f] \in C[0,1]$. This operator is linear in $f$, and is positive because $B_n[f]$ is pointwise non-negative if $f$ is pointwise non-negative. Evidently, bounding the approximation error incurred by the Bernstein polynomial is equivalent to bounding the bias of the MLE $f(X/n)$, where $X \sim \mathsf{B}(n, x)$. Fortunately, the theory of approximation using positive linear operators [6] provides us with advanced tools that are very effective for the bias analysis our problem calls for. A century ago, probability theory served Bernstein in breaking new ground in function approximation. It is therefore very satisfying that advancements in the latter have come full circle to help us better understand probability theory and statistics. We briefly review the general theory of approximation using positive linear operators below.

Approximation theory using positive linear operators: Generally speaking, for any estimator $\hat{\theta}$ of a parametric model indexed by $\theta$, the expectation $f \mapsto \mathbb{E}_\theta f(\hat{\theta})$ is a positive linear operator for $f$, and analyzing the bias $\mathbb{E}_\theta f(\hat{\theta}) - f(\theta)$ is equivalent to analyzing the approximation properties of the positive linear operator $\mathbb{E}_\theta f(\hat{\theta})$ in approximating $f(\theta)$. Hence, analyzing the bias of any plug-in estimator for functionals of parameters from any parametric family can be recast as a problem of approximation theory using positive linear operators [6].
Conversely, given a positive linear operator $L(f)(x)$ that operates on the space of continuous functions, the Riesz–Markov–Kakutani theorem implies that under mild conditions the operator may be written as

$$L(f)(x) = \int f \, d\mu_x = \mathbb{E}_{\mu_x} f(Z), \quad Z \sim \mu_x, \tag{51}$$

where $\{\mu_x\}$ is a set of probability measures parametrized by $x$, which may be viewed as a parameter. If we view the random variable $Z$ as a summary statistic to plug into the functional $f$, the positive linear operator $L(f)(x)$ is nothing but the expectation of the plug-in estimator $f(Z)$. In this sense, there exists a one-to-one correspondence between essentially the most general bias analysis problem in statistics, and the most general positive linear operator approximation problem in approximation theory.

Footnote: In the combinatorics literature, the sum $\sum_{j=0}^n a_{j,n} B_{j,n}(x)$ is called the Bernoulli sum, and various approaches have been proposed to evaluate its asymptotics [58], [59], [60].

After more than a century of active research on approximation using positive linear operators, we now have highly nontrivial tools for positive linear operators of functions on one dimensional compact sets, but the general theory for vector valued multivariate functions on non-compact sets is still far from complete [6]. In the next subsection, we present a sample of existing results in approximation using positive linear operators, corollaries of which will be used to analyze the bias of the MLE for two examples: $F_\alpha(P)$ and $H(P)$.

Some general results in bias analysis: First, some elementary approximation theoretic concepts need to be introduced in order to characterize the degree of smoothness of functions. For $I \subset \mathbb{R}$ an interval, the first-order modulus of smoothness $\omega^1(f, t)$, $t \ge 0$, is defined as [6]

$$\omega^1(f, t) \triangleq \sup\left\{ |f(u) - f(v)| : u, v \in I, |u - v| \le t \right\}. \tag{52}$$

The second-order modulus of smoothness $\omega^2(f, t)$, $t \ge 0$ [6], is defined as

$$\omega^2(f, t) \triangleq \sup\left\{ \left| f(u) - 2 f\left(\frac{u+v}{2}\right) + f(v) \right| : u, v \in I, |u - v| \le 2t \right\}. \tag{53}$$

Ditzian and Totik [63] introduced a class of moduli of smoothness, which proves to be extremely useful in characterizing the incurred approximation errors.
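For simplicity, for functions defined on $[0,1]$ and $\varphi(x) = \sqrt{x(1-x)}$, the first-order Ditzian–Totik modulus of smoothness is defined as

$$\omega_\varphi^1(f, t) \triangleq \sup\left\{ |f(u) - f(v)| : u, v \in [0,1], |u - v| \le 2t \varphi\left(\frac{u+v}{2}\right) \right\}, \tag{54}$$

and the second-order Ditzian–Totik modulus of smoothness is defined as

$$\omega_\varphi^2(f, t) \triangleq \sup\left\{ \left| f(u) - 2 f\left(\frac{u+v}{2}\right) + f(v) \right| : u, v \in [0,1], |u - v| \le 2t \varphi\left(\frac{u+v}{2}\right) \right\}. \tag{55}$$

We denote by $e_j$, $j \in \mathbb{N} \cup \{0\}$, the monomial functions $e_j(y) = y^j$, $y \in I$. The first estimate for general positive linear operators, using the modulus $\omega^2$ and with precise constants, was given by Gonska [64]. We rephrase Paltanea [6, Cor....] as follows. Note that the notation $e_1 - x e_0$ denotes a continuous function on $I$ which is the difference of the linear function $y$ and a constant function with constant value $x$ over $I$. In other words, it is an abbreviation of $e_1(y) - x e_0(y)$, $y \in I$, which is a function of $y$ rather than $x$. For a positive linear functional $F$, we adopt the following notation

$$B_F(x) = F(e_1) - x F(e_0), \quad V_F = F\left((e_1 - F(e_1) e_0)^2\right), \tag{56}$$

which represent the bias and variance of a positive linear functional $F$.

Lemma 4. [6, Cor....] Let $F: C(I) \to \mathbb{R}$ be a positive linear functional, where $I \subset \mathbb{R}$ is an interval. Suppose that $F(e_0) = 1$, $t > 0$, $\mathrm{length}(I) \ge 2t$, $s \ge 2$. Then,

$$|F(f) - f(x)| \le \frac{|B_F(x)|}{t} \omega^1(f, t) + \left(1 + \frac{F(|e_1 - x e_0|^s)}{2 t^s}\right) \omega^2(f, t). \tag{57}$$

We remark that Lemma 4 can be applied to bound the bias of plug-in estimators in very general models. For example, consider an arbitrary statistical experiment $\{P_\theta, \theta \in I\}$, from which we obtain $n$ i.i.d. samples $X_1, X_2, \ldots, X_n \sim P_\theta$. For any estimator $\hat{\theta}$, we would like to analyze the bias of the plug-in estimator $f(\hat{\theta})$ for the functional $f(\theta)$. Suppose $\mathrm{length}(I) \ge 2t$, $s \ge 2$; then Lemma 4 implies that

$$\left| \mathbb{E}_\theta f(\hat{\theta}) - f(\theta) \right| \le \frac{|\mathbb{E}_\theta \hat{\theta} - \theta|}{t} \omega^1(f, t) + \left(1 + \frac{\mathbb{E}|\hat{\theta} - \theta|^s}{2 t^s}\right) \omega^2(f, t). \tag{58}$$

If we further assume that $\hat{\theta}$ is an unbiased estimator for $\theta$, i.e., $\mathbb{E}_\theta \hat{\theta} = \theta$ holds for all $\theta \in I$, then we have

$$\left| \mathbb{E}_\theta f(\hat{\theta}) - f(\theta) \right| \le \left(1 + \frac{\mathbb{E}|\hat{\theta} - \theta|^s}{2 t^s}\right) \omega^2(f, t). \tag{59}$$

Taking $s = 2$ and assuming $\sqrt{\mathrm{Var}(\hat{\theta})} \le \mathrm{length}(I)/2$, we have

$$\left| \mathbb{E}_\theta f(\hat{\theta}) - f(\theta) \right| \le \frac{3}{2} \omega^2\left(f, \sqrt{\mathrm{Var}(\hat{\theta})}\right), \tag{60}$$

after we take $t = \sqrt{\mathbb{E}(\hat{\theta} - \theta)^2}$.

We remark that Lemma 4 is only one way to analyze the bias, and it is by no means always tight. For example, the following estimate using the Ditzian–Totik modulus is significantly better than Lemma 4 for certain functions such as the entropy.

Lemma 5. [6, Thm..5..] If $F: C[0,1] \to \mathbb{R}$ is a linear positive functional and $F(e_0) = 1$, then we have

$$|F(f) - f(x)| \le \frac{|B_F(x)|}{2 h \varphi(x)} \omega_\varphi^1(f, 2h) + \frac{5}{2} \omega_\varphi^2(f, h), \tag{61}$$

for all $f \in C[0,1]$ and $0 < h \le 1/2$, where $\varphi(x) = \sqrt{x(1-x)}$ and $h = \sqrt{F\left((e_1 - x e_0)^2\right)}/\varphi(x) = \sqrt{V_F + B_F(x)^2}/\varphi(x)$. The bias $B_F(x)$ and variance $V_F$ are defined in (56).

Considering the same statistical experiment $\{P_\theta, \theta \in I\}$ and the plug-in estimator $f(\hat{\theta})$ for $f(\theta)$, if $\hat{\theta}$ is unbiased for $\theta$ and $\mathrm{Var}(\hat{\theta}) \le \frac{\varphi^2(\theta)}{4}$, then it follows from Lemma 5 that

$$\left| \mathbb{E}_\theta f(\hat{\theta}) - f(\theta) \right| \le \frac{5}{2} \omega_\varphi^2\left(f, \frac{\sqrt{\mathrm{Var}(\hat{\theta})}}{\varphi(\theta)}\right), \tag{62}$$

after we take $h = \frac{\sqrt{\mathrm{Var}(\hat{\theta})}}{\varphi(\theta)}$.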
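To get a feel for the difference between the ordinary and the Ditzian–Totik second-order moduli, both suprema can be approximated by brute force on a grid. The sketch below is illustrative only: the function name `second_modulus`, the grid size, and the choice of $f(x) = -x \ln x$ and $t = 0.1$ are assumptions made here, and the constraint $|u - v| \le 2t\,\varphi((u+v)/2)$ follows the form of the second-order definition. A grid search can only under-approximate a supremum.

```python
import math


def neg_x_log_x(x):
    """f(x) = -x ln x with f(0) = 0."""
    return -x * math.log(x) if x > 0 else 0.0


def phi(x):
    """Ditzian-Totik weight on [0, 1]."""
    return math.sqrt(x * (1 - x))


def second_modulus(f, t, weight=None, grid=400):
    """Grid approximation of sup |f(u) - 2 f((u+v)/2) + f(v)| over u, v in [0, 1].

    The constraint is |u - v| <= 2t when `weight` is None (ordinary modulus)
    and |u - v| <= 2t * weight((u+v)/2) otherwise (weighted modulus).
    """
    pts = [i / grid for i in range(grid + 1)]
    best = 0.0
    for i, u in enumerate(pts):
        for v in pts[i + 1:]:
            mid = 0.5 * (u + v)
            radius = 2 * t * (weight(mid) if weight else 1.0)
            if v - u <= radius + 1e-12:  # small slack for floating point
                best = max(best, abs(f(u) - 2 * f(mid) + f(v)))
    return best


t = 0.1
w2 = second_modulus(neg_x_log_x, t)               # ordinary modulus
w2_phi = second_modulus(neg_x_log_x, t, weight=phi)  # Ditzian-Totik modulus
```

For this $f$, the pair $u = 0$, $v = 2t$ gives $|f(0) - 2f(t) + f(2t)| = t \ln 4$, which the ordinary-modulus grid search reproduces, while the weighted version is roughly an order of magnitude smaller at $t = 0.1$. This gap is why bounds stated in terms of the Ditzian–Totik modulus can be much sharper for functions that are non-smooth only near the endpoints.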
For certain functions $f(\theta)$ and statistical models, Lemma 5 is stronger than Lemma 4. For example, let $f(\theta) = -\theta\ln\theta$, $\theta \in [0,1]$, and suppose $n\hat\theta_n \sim \mathsf{B}(n,\theta)$. We will show in Lemma 8 that $\omega^2_\varphi(f,t) = \frac{t^2\ln 4}{1+t^2}$ and $\omega^2(f,t) = t\ln 4$. We also have $\mathrm{Var}(\hat\theta_n) = \frac{\theta(1-\theta)}{n}$. Hence, Lemma 4 gives the upper bound

$$|E_\theta f(\hat\theta_n) - f(\theta)| \le \frac{3\ln 4}{2}\sqrt{\frac{\theta(1-\theta)}{n}}, \tag{63}$$

whereas Lemma 5 gives

$$|E_\theta f(\hat\theta_n) - f(\theta)| \le \frac{5\ln 4}{2(n+1)}, \tag{64}$$

which is much stronger when $n$ is large and $\theta$ is not too close to the endpoints of $[0,1]$.

There also exist various estimates for the bias when the parameter lies in sets other than an interval in $\mathbb{R}$. However, the bounds we presented are in general not optimal for specific functionals, thereby leaving ample room for future development. For example, note that (63) is stronger than (64) when $\theta \lesssim 1/n$, but Han, Jiao, and Weissman [65] showed that when $\theta \lesssim 1/n$ the pointwise bound in (63) is still strictly suboptimal for the entropy functional. Unsurprisingly, to obtain the results in Section III, we need to go beyond the general results in approximation theory and incorporate the structure of specific functions.

Note: In the approximation theory literature, researchers explored the interactions between general positive linear operator approximation and its probabilistic counterpart decades ago [66] [68]. However, in the statistics literature related to positive linear approximation, usually only specific operators are used, such as the Bernstein operator [69], and the focus may not be on obtaining the tightest bound on bias [70], [7].

C. Lower bounds

To lower bound the worst case performance of a specific estimator, we have essentially two approaches: first, to analyze the bias or the variance of the specific estimator carefully; second, to prove a lower bound that is satisfied by all estimators, which naturally include the specific estimator we need to analyze. These two approaches have different relative advantages and disadvantages, so we utilize them together in the lower bound construction. We refer the readers to Tsybakov [7] for a nice collection of techniques to prove minimax lower bounds. One specific approach we use is the van Trees inequality, which we quote below.
Let $(\mathcal{X}, \mathcal{F}, P_\theta; \theta \in \Theta)$ be a dominated family of distributions on some sample space $\mathcal{X}$; denote the dominating measure by $\mu$. Assume $\Theta$ is a closed interval on the real line. Let $f(x|\theta)$ denote the density of $P_\theta$ with respect to $\mu$. Let $\pi$ be some probability distribution on $\Theta$ with a density $\lambda(\theta)$ with respect to Lebesgue measure. Suppose that $\lambda(\cdot)$ and $f(x|\cdot)$ are both absolutely continuous ($\mu$-almost surely), and that $\lambda$ converges to zero at the endpoints of the interval $\Theta$. We define

$$I(\theta) = E_\theta\left[\left(\frac{\partial \log f(X|\theta)}{\partial\theta}\right)^2\right], \tag{65}$$

$$I(\lambda) = E\left[\left(\frac{d\log\lambda(\theta)}{d\theta}\right)^2\right], \tag{66}$$

the Fisher information for $\theta$ and for a location parameter in $\lambda$, respectively. We assume $I(\theta)$ is continuous in $\theta$. We have the following inequality.

Lemma 6 (van Trees inequality). [73] Under the assumptions above, the average risk of an arbitrary estimator $\hat\psi(X)$ in estimating an absolutely continuous functional $\psi(\theta)$ under squared error loss satisfies the following inequality:

$$E\left[\left(\hat\psi(X) - \psi(\theta)\right)^2\right] \ge \frac{\left(E\,\psi'(\theta)\right)^2}{E[I(\theta)] + I(\lambda)}. \tag{67}$$

V. PROOFS OF THE UPPER BOUNDS

In order to upper bound the maximum squared error risk of any estimator, a natural approach is to analyze the squared bias term and the variance term separately. Then, it suffices to find proper tools to give a nonasymptotic analysis of the bias and variance.

A. Bounding the bias

We first work to bound the bias. Lemma 3 shows that the bias of $F(P_n)$ can be represented as

$$\mathrm{Bias}(F(P_n)) = \sum_{i=1}^{S}\left(B_n[f](p_i) - f(p_i)\right), \tag{68}$$

where $B_n[f](x)$ is the Bernstein polynomial corresponding to $f(x)$. The following lemma summarizes some state-of-the-art bounds for the approximation error of Bernstein polynomials. Lemma 7 can be derived easily from the general theory we presented in Section IV-B. We emphasize that one cannot expect the bounds in Lemma 7 to be tight for every $f \in C[0,1]$, since the Bernstein approximation error itself could be a very complicated function in $C[0,1]$, and Lemma 7 uses relatively simple functions to upper bound it.

Lemma 7. The following bounds are valid for the function approximation error incurred by Bernstein polynomials:

1) Pointwise estimate: [6, Cor...] [74] for all continuous functions $f$ on $[0,1]$,

$$|f(x) - B_n[f](x)| \le \frac{3}{2}\,\omega^2\left(f, \sqrt{\frac{x(1-x)}{n}}\right), \tag{69}$$

and the constant $3/2$ is shown by [74] to be the best constant;

2) Norm estimate: [6, Cor. 4..0] for $\varphi(x) = \sqrt{x(1-x)}$ and all continuous functions $f$ on $[0,1]$, we have

$$\|B_n[f] - f\|_\infty \le \frac{5}{2}\,\omega^2_\varphi\left(f, n^{-1/2}\right); \tag{70}$$

3) [75, Eq ] for $f \in C^2[0,1]$, i.e., twice continuously differentiable,

$$|f(x) - B_n[f](x)| \le \frac{\|f''\|_\infty\, x(1-x)}{2n}. \tag{71}$$

Proof. The pointwise estimate of Lemma 7 follows from Lemma 4. The norm estimate of Lemma 7 follows from Lemma 5. Regarding the third part, suppose the random variable $X \sim \mathsf{B}(n,x)$. We have

$$|f(x) - B_n[f](x)| = \left|E_x f(X/n) - f(x)\right| \tag{72}$$
$$= \left|E_x\left[f'(x)(X/n - x) + \tfrac{1}{2}f''(\xi_X)(X/n - x)^2\right]\right| \tag{73}$$
$$= \tfrac{1}{2}\left|E_x f''(\xi_X)(X/n - x)^2\right| \tag{74}$$
$$\le \frac{\|f''\|_\infty}{2}\,E_x(X/n - x)^2 \tag{75}$$
$$= \frac{\|f''\|_\infty\, x(1-x)}{2n}, \tag{76}$$

where we used the Taylor expansion of $f(X/n)$ at the point $x$ with the Lagrange remainder. The proof is complete.

Remark. Note that although (70) is in the form of an upper bound, it has been shown to be a lower bound as well. Totik [76] showed the following equivalence property of the norm estimate of Bernstein approximation errors:

$$\|B_n[f] - f\|_\infty \asymp \omega^2_\varphi\left(f, n^{-1/2}\right). \tag{77}$$

Note that it is a remarkable fact that (77) holds for any continuous function $f(x)$. The lower bound proof of (77) is considered one of the remarkable results in approximation theory, and currently there are no short proofs of this fact. Indeed, Ditzian [77, Section 8] mentioned that "I still would like to see a new simple proof of (8.4) [Equation (77)] which I am sure will have implications for other operators."

It is easy to calculate the second-order modulus of smoothness and the Ditzian–Totik second-order modulus of smoothness for the functions $x^\alpha$ and $-x\ln x$. The results are presented in the following lemma.

Lemma 8. We have

- for $f(x) = x^\alpha$, $0 < \alpha < 1$: $\omega^2(f,t) = |2 - 2^\alpha|\,t^\alpha$ and $\omega^2_\varphi(f,t) = \dfrac{|2 - 2^\alpha|\,t^{2\alpha}}{(1+t^2)^\alpha}$;
- for $f(x) = x^\alpha$, $1 < \alpha < 2$: $\omega^2(f,t) = |2 - 2^\alpha|\,t^\alpha$ and $\omega^2_\varphi(f,t) \asymp t^2$;
- for $f(x) = -x\ln x$: $\omega^2(f,t) = t\ln 4$ and $\omega^2_\varphi(f,t) = \dfrac{t^2\ln 4}{1+t^2}$;

where the second-order modulus results hold for $0 < t \le 1/2$, and the Ditzian–Totik second-order modulus results hold for $0 < t \le 1$.

1) Bias of $F_\alpha(P_n)$: We first bound the bias incurred by $F_\alpha(P_n)$.

$\alpha \ge 2$:

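In this case, $f \in C^2[0,1]$; applying the third part of Lemma 7,

$$|f(x) - B_n[f](x)| \le \frac{\alpha(\alpha-1)\,x(1-x)}{2n}. \tag{78}$$

Thus, we have

$$|\mathrm{Bias}(F_\alpha(P_n))| \le \frac{\alpha(\alpha-1)}{2n}\sum_{i=1}^{S} p_i(1-p_i) \le \frac{\alpha(\alpha-1)}{2n}. \tag{79}$$

$1 < \alpha < 2$: The following lemma presents a bound on the bias of $F_\alpha(P_n)$, which does not depend on the alphabet size $S$. We note that the proof of Lemma 9 heavily utilizes the special properties of the function $x^\alpha$ and the fact that $\sum_{i=1}^{S} p_i = 1$.

Lemma 9. The bias of $F_\alpha(P_n)$ for estimating $F_\alpha(P)$, $1 < \alpha < 2$, is upper bounded by the following:

$$|\mathrm{Bias}(F_\alpha(P_n))| \le \frac{4}{n^{\alpha-1}}. \tag{80}$$

We also present two additional bounds involving the alphabet size. Using the pointwise estimate in Lemma 7, the bias term of the MLE is upper bounded as follows for all $0 < \alpha < 2$, $\alpha \ne 1$:

$$|\mathrm{Bias}(F_\alpha(P_n))| \le \frac{3}{2}\sum_{i=1}^{S}\omega^2\left(x^\alpha, \sqrt{\frac{p_i(1-p_i)}{n}}\right) \le \frac{3|2-2^\alpha|}{2n^{\alpha/2}}\sum_{i=1}^{S} p_i^{\alpha/2} \tag{81}$$
$$\le \frac{3|2-2^\alpha|}{2n^{\alpha/2}}\,S\left(\frac{1}{S}\right)^{\alpha/2} \tag{82}$$
$$= \frac{3|2-2^\alpha|\,S^{1-\alpha/2}}{2n^{\alpha/2}}. \tag{83}$$

Using the norm estimate in Lemma 7, when $1 < \alpha < 2$, the bias would be upper bounded by $\frac{5S}{2n}C_{\alpha,n}$, where $C_{\alpha,n} = n\,\omega^2_\varphi(x^\alpha, n^{-1/2})$ is a finite positive constant such that $\limsup_{n\to\infty} C_{\alpha,n} < \infty$ for $1 < \alpha < 2$. Combining Lemma 9, the pointwise estimate, and the norm estimate in Lemma 7, we know that the bias of $F_\alpha(P_n)$ for $1 < \alpha < 2$ is upper bounded as

$$|\mathrm{Bias}(F_\alpha(P_n))| \le \min\left\{\frac{4}{n^{\alpha-1}},\ \frac{3|2-2^\alpha|\,S^{1-\alpha/2}}{2n^{\alpha/2}},\ \frac{5S}{2n}C_{\alpha,n}\right\}. \tag{84}$$

$0 < \alpha < 1$: The pointwise estimate from Lemma 7 is worked out in (83). Using the norm estimate in Lemma 7, the bias would be upper bounded by $\frac{5|2-2^\alpha|\,S}{2(n+1)^\alpha}$. Combining the pointwise estimate and the norm estimate, we know that the bias of $F_\alpha(P_n)$ for $0 < \alpha < 1$ is upper bounded as

$$|\mathrm{Bias}(F_\alpha(P_n))| \le \min\left\{\frac{3|2-2^\alpha|\,S^{1-\alpha/2}}{2n^{\alpha/2}},\ \frac{5|2-2^\alpha|\,S}{2(n+1)^\alpha}\right\}. \tag{85}$$

2) Bias of $H(P_n)$: We then bound the bias incurred by $H(P_n)$. Using the norm estimate in Lemma 7, we know

$$|\mathrm{Bias}(H(P_n))| \le \frac{5S\ln 4}{2(n+1)}. \tag{86}$$

Using the pointwise estimate in Lemma 7, we obtain

$$|\mathrm{Bias}(H(P_n))| \le \frac{3\ln 4}{2}\sqrt{\frac{S}{n}}. \tag{87}$$

It was shown by Paninski [4, Prop. ] that the squared bias of the MLE $H(P_n)$ is upper bounded as

$$\left(\mathrm{Bias}(H(P_n))\right)^2 \le \left(\ln\left(1 + \frac{S-1}{n}\right)\right)^2, \tag{88}$$

which is better than the two bounds we obtained using Bernstein polynomial results. However, we remark that (88) is obtained using special properties of the entropy function and connections between the KL-divergence and the $\chi^2$-divergence [7], which cannot be applied to general functions.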
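The three entropy bias bounds (86), (87), and (88) can be compared against the exact bias of the plug-in entropy estimator. The sketch below assumes the usual reading of the setting, namely that $H(P_n) = \sum_i f(\hat p_i)$ with $f(x) = -x\ln x$ and $n\hat p_i \sim \mathsf{B}(n, p_i)$ marginally, so that the exact bias follows from linearity of expectation. The function names are illustrative and not taken from the paper.

```python
import math


def f(x):
    """Entropy summand f(x) = -x ln x, with f(0) = 0 by continuity."""
    return 0.0 if x == 0.0 else -x * math.log(x)


def expected_f(n, p):
    """E f(X/n) for X ~ Binomial(n, p), computed exactly from the pmf."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) * f(k / n)
        for k in range(n + 1)
    )


def mle_entropy_bias(n, probs):
    """Exact bias E H(P_n) - H(P); only the binomial marginals are needed."""
    entropy = sum(f(p) for p in probs)
    return sum(expected_f(n, p) for p in probs) - entropy


def bias_bounds(n, S):
    """Upper bounds on |Bias(H(P_n))| in the forms of (86), (87), (88)."""
    return {
        "norm (86)": 5 * S * math.log(4) / (2 * (n + 1)),
        "pointwise (87)": 1.5 * math.log(4) * math.sqrt(S / n),
        "Paninski (88)": math.log(1 + (S - 1) / n),
    }


if __name__ == "__main__":
    S, n = 50, 100
    uniform = [1.0 / S] * S
    print("exact bias:", mle_entropy_bias(n, uniform))
    for name, value in bias_bounds(n, S).items():
        print(name, value)
```

For the uniform distribution with $S = 50$ and $n = 100$, this sketch shows the bound of the form (88) as the smallest of the three, in line with the remark above that it improves on the two Bernstein-based bounds.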
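Strukov and Timan [66] also heavily exploited the structure of the functions $x^\alpha$ and $-x\ln x$ in order to analyze the Bernstein approximation error for these functions, and obtained tight-in-order results.

3) Bias of $H(\hat P_B)$: We apply the general theory of positive linear operator approximation. The following lemma is a strengthened version of Lemma 5.

Lemma 10. If $F: C[0,1] \to \mathbb{R}$ is a linear positive functional and $F(e_0) = 1$, then

$$|F(f) - f(x)| \le \omega^1(f, |B_F(x)|; x) + \frac{5}{2}\,\omega^2_\varphi(f,h) \tag{89}$$

for all $f \in C[0,1]$ and $0 < h \le 1/2$, where $\varphi(x) = \sqrt{x(1-x)}$ and $h = \sqrt{V_F}/\varphi(x)$, and

$$\omega^1(f,h;x) \triangleq \sup\{|f(u) - f(x)| : u \in [0,1],\ |u - x| \le h\}. \tag{90}$$

The bias $B_F(x)$ and variance $V_F$ are defined in (56).

Proof. Applying Lemma 5 to $x = F(e_1)$ we have

$$|F(f) - f(F(e_1))| \le \frac{5}{2}\,\omega^2_\varphi(f,h), \tag{91}$$

and then (89) is the direct result of the triangle inequality $|F(f) - f(x)| \le |F(f) - f(F(e_1))| + |f(F(e_1)) - f(x)|$.

We show that Lemma 10 is indeed stronger than Lemma 5. Firstly, because the $h$ of Lemma 10 is no larger than the $h$ of Lemma 5, the second-order term $\omega^2_\varphi(f,h)$ in (89) is no larger than its counterpart in (61). Secondly, for $x \le 1/2$, we have

$$\frac{|B_F(x)|}{2h\varphi(x)}\,\omega^1_\varphi(f,2h) \approx \frac{|B_F(x)|}{2h\varphi(x)}\sup_{0 \le s \le 1} 2h\varphi(s)|f'(s)| \tag{92}$$
$$\ge |B_F(x)|\sup_{x \le s \le 1-x}|f'(s)| \tag{93}$$
$$\approx \sup_{x \le s \le 1-x}\omega^1(f, |B_F(x)|; s), \tag{94}$$

which is almost the supremum of $\omega^1(f, |F(e_1 - xe_0)|; s)$ over $s \in [x, 1-x]$ and is no less than the pointwise result $\omega^1(f, |F(e_1 - xe_0)|; x)$. Here we have used the inequality $\varphi(s) \ge \varphi(x)$ for $x \le s \le 1-x$. A similar argument also holds for $x > 1/2$. Hence, Lemma 10 transforms the first-order term from the norm result in Lemma 5 into a pointwise result.

Applying Lemma 10 to the function $f(p) = -p\ln p$ and $F(f) = E\left[f\left(\frac{n\hat p + a}{n + Sa}\right)\right]$, where $n\hat p \sim \mathsf{B}(n,p)$, we have the following lemma.

Lemma 11. If max{Sa,ea,4}, then sup E P HˆP B HP 5Sl Sa Sa Sa l Sa a. (95)

Note that Lemma 11 implies a slightly weaker bias bound than Theorem 4, but it is only sub-optimal up to a multiplicative constant. The bias bound in Theorem 4 is obtained using the following lemma, whose proof only applies to the entropy function.

Lemma 12. If max{ea,Sa}, sup E P HˆP B HP l S Sa Sa Sa l Sa a. (96)

B. Bounding the variance

The next lemma follows from an application of the bounded difference inequality presented earlier.

Lemma 13. The variance of $F(P_n)$ satisfies the following upper bound:

$$\mathrm{Var}(F(P_n)) \le n\max_{0 \le j < n}\left(f\left(\frac{j+1}{n}\right) - f\left(\frac{j}{n}\right)\right)^2. \tag{97}$$

If $f$ is monotone, then we can strengthen the bound to be

$$\mathrm{Var}(F(P_n)) \le \frac{n}{4}\max_{0 \le j < n}\left(f\left(\frac{j+1}{n}\right) - f\left(\frac{j}{n}\right)\right)^2. \tag{98}$$

We first bound the variance for $F_\alpha(P_n)$, $\alpha > 1$. We have

$$\max_{0 \le j < n}\left|\left(\frac{j+1}{n}\right)^\alpha - \left(\frac{j}{n}\right)^\alpha\right| = 1 - \left(1 - \frac{1}{n}\right)^\alpha \tag{99}$$
$$\le \frac{\alpha}{n}, \tag{100}$$

where in the last step we used Bernoulli's inequality: $(1+x)^r \ge 1 + rx$, $r \ge 1$, $x > -1$, $x \in \mathbb{R}$. Since $x^\alpha$ is monotone on $[0,1]$, combining (98) with (100) shows that the variance is upper bounded by

$$\mathrm{Var}(F_\alpha(P_n)) \le \frac{\alpha^2}{4n}. \tag{101}$$

We bound the variance of $F_\alpha(P_n)$, $0 < \alpha < 1$, in the following lemma.

Lemma 14. For 0 < α < /, we have sup VarF α P 0S α 3α 3α 8α α 8α S 4 e α α 0 S α. (103) For / α <, we have sup VarF α P 0S α 3α 3α 8α S α α 8α S 4 e α α 04 S α α. (105) Further, one can show that for all α 0,, 3α 3α α 8α 8α 4 0 e α, (106) which is used in Theorem .

Regarding the variance of $H(P_n)$, we have

Lemma 15. sup VarHP l ls 3 07 ls l. (108)

The variance of $H(\hat P_B)$ is upper bounded by the following lemma.

Lemma 16. The variance of $H(\hat P_B)$ is upper bounded as follows: [ ] Sa Var HˆP B Sa 3l a S. (109)

VI. PROOFS OF THE LOWER BOUNDS

A. Lower bounds for estimation of $F_\alpha(P)$ when $\alpha \ge 3/2$

We apply the van Trees inequality as presented in Lemma 6. It suffices to consider the restricted case of $S = 2$ and prove the lower bound. Thus, the model is equivalent to observing a Binomial random variable $X \sim \mathsf{B}(n,p)$, and one aims to estimate the functional $\psi_\alpha(p) = p^\alpha + (1-p)^\alpha$. We have

$$\psi'_\alpha(p) = \alpha p^{\alpha-1} - \alpha(1-p)^{\alpha-1}. \tag{110}$$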
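In the binary case $S = 2$ considered here, the risk of the plug-in estimator $\psi_\alpha(X/n)$ can be computed exactly, which gives a convenient way to exercise the variance bound (101). The following is a minimal sketch under that assumption; the helper names are illustrative and not part of the paper.

```python
import math


def psi(p, alpha):
    """Binary power sum psi_alpha(p) = p^alpha + (1 - p)^alpha."""
    return p**alpha + (1 - p) ** alpha


def mle_variance_and_mse(n, p, alpha):
    """Exact variance and mean squared error of psi_alpha(X/n), X ~ Binomial(n, p)."""
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    vals = [psi(k / n, alpha) for k in range(n + 1)]
    mean = sum(w * v for w, v in zip(pmf, vals))
    var = sum(w * (v - mean) ** 2 for w, v in zip(pmf, vals))
    mse = var + (mean - psi(p, alpha)) ** 2
    return var, mse


def worst_case_scaled_mse(n, alpha, grid=100):
    """max over p on a grid of n * MSE, a proxy for n times the worst-case risk."""
    return max(
        n * mle_variance_and_mse(n, i / grid, alpha)[1] for i in range(grid + 1)
    )


if __name__ == "__main__":
    for n in (100, 200, 400):
        print(n, worst_case_scaled_mse(n, alpha=2.0))
```

For $\alpha = 2$ the printed values of $n$ times the worst-case risk stay close to $1/4$ as $n$ grows, which illustrates a worst-case squared error of order $1/n$ in this regime.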

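The Fisher information for the parameter $p$ under the Binomial model is

$$I(p) = \frac{n}{p(1-p)}. \tag{111}$$

Suppose we impose a prior $\lambda(p)$ on the parameter $p$. The van Trees inequality implies

$$\sup_{P}E_P\left(F_\alpha(P_n) - F_\alpha(P)\right)^2 \ge \inf_{\hat F_\alpha}\sup_{P}E_P\left(\hat F_\alpha - F_\alpha(P)\right)^2$$
$$\ge E\left(E[F_\alpha(P)|X] - F_\alpha(P)\right)^2 \quad \text{(Bayes risk)} \tag{112}$$
$$\ge \frac{\left(\int\left[\alpha p^{\alpha-1} - \alpha(1-p)^{\alpha-1}\right]\lambda(p)\,dp\right)^2}{E_\lambda[I(p)] + I(\lambda)} \tag{113}$$
$$= \frac{\left(\int\left[\alpha p^{\alpha-1} - \alpha(1-p)^{\alpha-1}\right]\lambda(p)\,dp\right)^2}{E_\lambda\left[\frac{n}{p(1-p)}\right] + I(\lambda)}, \tag{114}$$

where the second inequality follows from the fact that the Bayes risk under any prior is upper bounded by the minimax risk [78]. Taking $\lambda(p)$ to be the Dirichlet prior with parameter $(a,b)$, i.e.,

$$\lambda(p) = \frac{1}{B(a,b)}\,p^{a-1}(1-p)^{b-1}, \quad a > 2,\ b > 2, \tag{115}$$

we can explicitly evaluate the integrals above. Here $B(a,b)$ is the Beta function. Taking $a = 4$, $b = 3$, we have

$$\sup_{P}E_P\left(F_\alpha(P_n) - F_\alpha(P)\right)^2 \ge \frac{\left(60\alpha\left(B(\alpha+3,3) - B(\alpha+2,4)\right)\right)^2}{5n + 45}. \tag{116}$$

Taking $C_\alpha = 72\alpha^2\left(B(\alpha+3,3) - B(\alpha+2,4)\right)^2$, we have

$$\sup_{P}E_P\left(F_\alpha(P_n) - F_\alpha(P)\right)^2 \ge \frac{C_\alpha}{n}, \quad \text{for all } n \ge 1. \tag{117}$$

Note that $C_\alpha > 0$ for all $\alpha \ge 3/2$.

B. Lower bounds for estimation of $F_\alpha(P)$ when $1 < \alpha < 3/2$

The following lemma was proved in [69].

Lemma 17. Let $k \ge 4$ be an even number. Suppose that the $k$-th derivative of $f$ satisfies $f^{(k)} \le 0$ in $(0,1)$, and $Q_{k-1}$ is the Taylor polynomial of order $k-1$ to $f$ at some $x_0$ in $(0,1)$. Then for $x \in [0,1]$,

$$f(x) - B_n[f](x) \ge Q_{k-1}(x) - B_n[Q_{k-1}](x). \tag{118}$$

Consider $f_\alpha(x) = -x^\alpha$, $1 < \alpha < 2$, $x \in [0,1]$. Applying Lemma 17 to $f_\alpha$, taking $k = 6$, we have the following result.

Lemma 18. Suppose $f_\alpha(x) = -x^\alpha$, $1 < \alpha < 2$, on $[0,1]$. For all $x \in (0,1)$, we have

f α x B [f α ]x αα xα x x α3α x α5 3α R x 3 R x 4, (119)

where

R x = αα α α 3xα 3 x 4 x5 αxα 4, (120)

R x = αα α α 3α 4 0 x α 4 x x x x. (121)

Note that we have assumed $S = cn$, $c > 0$. If $c \le 1$, we take a uniform distribution on $S$ elements, $P = (1/S, 1/S, \ldots, 1/S)$; otherwise we take the distribution $P = \left(\frac{1-\epsilon}{n}, \ldots, \frac{1-\epsilon}{n}, \frac{\epsilon}{S-n}, \ldots, \frac{\epsilon}{S-n}\right)$, where $\epsilon$ will be taken to be arbitrarily small.

We first analyze the $c \le 1$ case. Applying Lemma 18 and noting that $\sum_{i=1}^{S}\left(f_\alpha(p_i) - B_n[f_\alpha](p_i)\right) = E F_\alpha(P_n) - F_\alpha(P)$, we have

f α /S B [f α ]/S = EF α P F α P αα S S α S α5 3α αα α α 3 4S α 3 3 α 4 αα α α 3α 4 0S α 4 4 o α = αα α c α 3 c α5 3α α α 3α 4 4c α 4 α α 3α 4 0c α 5 o α = αα c α α α5 3αc 4 α α 3α 4c 4 α α 3α 4c3 o α 0 αc α 4 330α85α 90α 3 α 4 0 α o α,

where the first inequality follows from Lemma 18, and in the last step we have taken $c = 1$ in the following expression

α5 3αc α α 3α 4c 4 4 α α 3α 4c3, 0

and considered the fact that it is a monotonically decreasing function with respect to $c$ on $(0,1]$ for any $\alpha \in (1,3/2)$.

For cases when $c > 1$, since we take $P = \left(\frac{1-\epsilon}{n}, \ldots, \frac{1-\epsilon}{n}, \frac{\epsilon}{S-n}, \ldots, \frac{\epsilon}{S-n}\right)$, by a continuity argument, the analysis is exactly the same as that above when we set $c = 1$, as we can take $\epsilon$ as small as possible. One can verify that the function α4 330α85α 90α 3 α 4 /0 is positive on the interval $(1,3/2)$. Defining c α = αc α 4 330α 85α 90α 3 α 4 /0 > 0 when $c \le 1$, and cα = α4 330α 85α 90α 3 α 4 /0 > 0 when $c > 1$, the proof is completed.

C. Lower bounds for estimation of $F_\alpha(P)$ when $0 < \alpha < 1$

Applying Lemma 17 to the function $f_\alpha(x) = x^\alpha$, $\alpha \in (0,1)$, taking $k = 4$, we have the following result:

Lemma 19. For $f_\alpha(x) = x^\alpha$ on $[0,1]$, $\alpha \in (0,1)$, $x \in (0,1]$, we have

f α x B [f α ]x α α x α x x α. (123)

Suppose S. Define the distribution $W = (w_1, w_2, \ldots, w_S) \in \mathcal{M}_S$ such that

$$\forall i \le S-1,\ w_i = \frac{1}{n}; \qquad w_S = 1 - \frac{S-1}{n}. \tag{124}$$

Note that $w_i \ge 1/n$, $\forall\, 1 \le i \le S$. It follows from Lemma 19 that

S α α α F α W E W F α P 6 5 α αs = 6 α. (126)

Thus, we know for all $0 < \alpha < 1$,

$$\sup_{P}E_P\left(F_\alpha(P_n) - F_\alpha(P)\right)^2 \ge \frac{\alpha^2(1-\alpha)^2(S-1)^2}{36\,n^{2\alpha}}. \tag{127}$$

It is shown in [0] that the following minimax lower bound holds for estimation of $F_\alpha(P)$, $1/2 \le \alpha < 1$.

Lemma 20. For / α <, we have

if sup E P ˆF Fα P ˆF [ α S α α 3e α 4 e /4 S α S α α] S α, (128)

where the infimum is taken over all possible estimators.

Since this lower bound holds for all possible estimators, it also holds for the MLE $F_\alpha(P_n)$. Since $\max\{a,b\} \ge \frac{a+b}{2}$, we have the desired lower bound.

D. Lower bounds for estimation of $H(P)$

Braess and Sauer [69] derived the following lower bound for the approximation error of Bernstein polynomials for the function $g(x) = -x\ln x$:

Lemma 21. Define $g(x) = -x\ln x$ on $[0,1]$. For $x \ge \frac{15}{n}$, $x \in [0,1]$, we have

$$g(x) - B_n[g](x) \ge \frac{1-x}{2n} + \frac{1}{20n^2x} - \frac{x}{12n^2}. \tag{129}$$

Applying Lemma 21 to the estimation of $H(P)$, we know that if $\forall\, 1 \le i \le S$, $p_i \ge \frac{15}{n}$,

$$H(P) - E H(P_n) \ge \frac{S-1}{2n} + \frac{1}{20n^2}\sum_{i=1}^{S}\frac{1}{p_i} - \frac{1}{12n^2}. \tag{130}$$

Consider the uniform distribution $P$ with $n \ge 15S$, which guarantees $p_i \ge \frac{15}{n}$. Since

$$\sum_{i=1}^{S}\frac{1}{p_i} \ge S^2, \tag{131}$$

we have

$$\sup_{P}\left(H(P) - E H(P_n)\right) \ge \frac{S-1}{2n} + \frac{S^2}{20n^2} - \frac{1}{12n^2}. \tag{132}$$

Thus, when $n \ge 15S$,

$$\sup_{P}E_P\left(H(P_n) - H(P)\right)^2 \ge \left(\frac{S-1}{2n} + \frac{S^2}{20n^2} - \frac{1}{12n^2}\right)^2. \tag{133}$$

It was shown in [, Prop. ] that the following minimax lower bound holds.

Lemma 22. There exists a universal constant $c > 0$ such that

$$\inf_{\hat H}\sup_{P}E_P\left(\hat H - H(P)\right)^2 \ge c\,\frac{\ln^2 S}{n}, \tag{134}$$

where the infimum is taken over all possible estimators $\hat H$.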
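The bias lower bound (132) for the uniform distribution can be checked against the exact bias of the plug-in entropy estimator. The sketch below assumes the form $\frac{S-1}{2n} + \frac{S^2}{20n^2} - \frac{1}{12n^2}$ for the right-hand side of (132) and the condition $n \ge 15S$; the function names are illustrative and not taken from the paper.

```python
import math


def f(x):
    """Entropy summand f(x) = -x ln x, with f(0) = 0 by continuity."""
    return 0.0 if x == 0.0 else -x * math.log(x)


def uniform_entropy_gap(n, S):
    """Exact H(P) - E H(P_n) for the uniform distribution on S symbols.

    By symmetry, E H(P_n) = S * E f(X/n) with X ~ Binomial(n, 1/S).
    """
    p = 1.0 / S
    expected = sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) * f(k / n)
        for k in range(n + 1)
    )
    return math.log(S) - S * expected


def gap_lower_bound(n, S):
    """Assumed right-hand side of (132); intended for n >= 15 * S."""
    return (S - 1) / (2 * n) + S**2 / (20 * n**2) - 1 / (12 * n**2)


if __name__ == "__main__":
    for S, n in ((2, 30), (4, 60), (10, 150)):
        print(S, n, uniform_entropy_gap(n, S), gap_lower_bound(n, S))
```

In each printed case the exact gap exceeds the bound by a small margin, and the leading term $\frac{S-1}{2n}$ accounts for most of the bias.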

Hence, we have

$$\sup_{P}E_P\left(H(P_n) - H(P)\right)^2 \ge \max\left\{\left(\frac{S-1}{2n} + \frac{S^2}{20n^2} - \frac{1}{12n^2}\right)^2,\ c\,\frac{\ln^2 S}{n}\right\} \tag{135}$$
$$\ge \frac{1}{2}\left(\frac{S-1}{2n} + \frac{S^2}{20n^2} - \frac{1}{12n^2}\right)^2 + \frac{c}{2}\,\frac{\ln^2 S}{n}. \tag{136}$$

Similar arguments can be applied to the Miller–Madow estimator.

E. Lower bounds for entropy estimation using $H(\hat P_B)$

Since $H(\hat P_B)$ is a specific estimator for entropy, the following lemma is proved via considering several specific distributions.

Lemma 23. If max{5S,sa,ea}, EP sup HˆP B HP S a Sa 4Sa l S a 8 S If < Sa, then EP sup HˆP B HP S ls. (138) S If < ea, then EP sup HˆP B HP S ls. (139) es If < 5S, ea, then EP sup HˆP B HP S a Sa 4Sa l /5 a (140)

The corresponding results in Theorem 5 follow from Lemma 23, Lemma 22, and the inequality $\max\{a,b\} \ge \frac{a+b}{2}$.

F. Lower bounds for entropy estimation using $\hat H^{\mathrm{Bayes}}$

We prove Theorem 6 below. Applying Lemma 25, we have

Ĥ Bayes ψsa ax i ψa 4 Sa = ψsa ψa 4 Sae γ l. (143) a

Since ĤBayes Sae is upper bounded by l γ for any a empirical observations, the squared error it incurs in Shannon entropy estimation when the true distribution is the uniform distribution is at least SaS/ l Sae γ (144) if S.

ACKNOWLEDGMENTS

We thank Dany Leviatan, Gancho Tachev, and Radu Paltanea for very helpful discussions regarding the literature on approximation theory using positive linear operators. We thank Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi for communicating to us the independent discovery that it suffices to take samples to consistently estimate $F_\alpha(P)$, when $\alpha > 1$. We thank Maya Gupta for raising the question on the optimality of the Dirichlet prior smoothing techniques applied to entropy estimation. We thank the anonymous reviewers and the associate editor for very helpful comments that significantly improved the presentation of the paper.

APPENDIX A
AUXILIARY LEMMAS

We begin with the definition of the negative association property, which allows us to upper bound the variance by treating each component of the empirical distribution $P_n(i)$ as independent random variables.

Definition. [79, Def..] Random variables $X_1, X_2, \ldots, X_S$ are said to be negatively associated if for any pair of disjoint subsets $A_1, A_2$ of $\{1, 2, \ldots, S\}$, and any component-wise increasing functions $f_1, f_2$,

$$\mathrm{Cov}\left(f_1(X_i, i \in A_1),\ f_2(X_j, j \in A_2)\right) \le 0.$$

To verify whether random variables $X_1, X_2, \ldots, X_S$ are negatively associated or not, the following lemma presents a useful criterion.

Lemma 24. [79, Thm..9] Let $X_1, X_2, \ldots, X_S$ be $S$ independent random variables with log-concave densities. Then the joint conditional distribution of $X_1, X_2, \ldots, X_S$ given $\sum_{i=1}^{S} X_i$ is negatively associated.

In light of the preceding lemma, we can obtain the following corollary.

Corollary 5. For any discrete probability distribution vector $P \in \mathcal{M}_S$, the random variables $X = (X_1, X_2, \ldots, X_S)$ drawn from the multinomial distribution $X \sim \mathrm{multi}(n; P)$ are negatively associated.

Proof. Consider the Poissonized model $Y_i \sim \mathrm{Poi}(np_i)$, $1 \le i \le S$, with all $Y_i$ independent. It is straightforward to verify that each $Y_i$ possesses a log-concave distribution. Then, conditioning on $\sum_{i=1}^{S} Y_i = n$, we know that $(Y_1, Y_2, \ldots, Y_S)\,|\,\sum_{i=1}^{S} Y_i = n \sim \mathrm{multi}(n; P)$, hence Lemma 24 yields the desired result.

The next lemma gives bounds on the digamma function $\psi(z) = \frac{\Gamma'(z)}{\Gamma(z)}$.

Lemma 25. [80, Lemma .7] The digamma function $\psi(z)$ is the only solution of the functional equation $F(x+1) = F(x) + \frac{1}{x}$ that is monotone, strictly concave on $\mathbb{R}_+$, and satisfies $F(1) = -\gamma$, where $\gamma$ is the Euler–Mascheroni constant.
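The characterization of the digamma function in Lemma 25 can be checked numerically. The sketch below approximates $\psi$ by a central difference of `math.lgamma`, since the Python standard library has no digamma function. The helper `digamma` and the step size are assumptions of this sketch rather than anything used in the paper.

```python
import math

# Euler-Mascheroni constant, hard-coded because the standard library does not expose it.
EULER_GAMMA = 0.5772156649015329


def digamma(x, h=1e-5):
    """Central-difference approximation of psi(x) = d/dx ln Gamma(x), for x > h."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)


if __name__ == "__main__":
    print("psi(1) + gamma:", digamma(1.0) + EULER_GAMMA)
    for x in (0.5, 1.0, 2.5, 10.0):
        residual = digamma(x + 1) - digamma(x) - 1 / x
        print(f"x={x}: recurrence residual {residual:.2e}")
```

Both printed residuals are close to zero up to the finite-difference error, consistent with $F(1) = -\gamma$ and the recurrence $F(x+1) = F(x) + 1/x$.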


More information

ENGI Series Page 6-01

ENGI Series Page 6-01 ENGI 3425 6 Series Page 6-01 6. Series Cotets: 6.01 Sequeces; geeral term, limits, covergece 6.02 Series; summatio otatio, covergece, divergece test 6.03 Stadard Series; telescopig series, geometric series,

More information

Application to Random Graphs

Application to Random Graphs A Applicatio to Radom Graphs Brachig processes have a umber of iterestig ad importat applicatios. We shall cosider oe of the most famous of them, the Erdős-Réyi radom graph theory. 1 Defiitio A.1. Let

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5 CS434a/54a: Patter Recogitio Prof. Olga Veksler Lecture 5 Today Itroductio to parameter estimatio Two methods for parameter estimatio Maimum Likelihood Estimatio Bayesia Estimatio Itroducto Bayesia Decisio

More information

lim za n n = z lim a n n.

lim za n n = z lim a n n. Lecture 6 Sequeces ad Series Defiitio 1 By a sequece i a set A, we mea a mappig f : N A. It is customary to deote a sequece f by {s } where, s := f(). A sequece {z } of (complex) umbers is said to be coverget

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2016 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

1 Duality revisited. AM 221: Advanced Optimization Spring 2016 AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R

More information

MAT1026 Calculus II Basic Convergence Tests for Series

MAT1026 Calculus II Basic Convergence Tests for Series MAT026 Calculus II Basic Covergece Tests for Series Egi MERMUT 202.03.08 Dokuz Eylül Uiversity Faculty of Sciece Departmet of Mathematics İzmir/TURKEY Cotets Mootoe Covergece Theorem 2 2 Series of Real

More information

ON BARTLETT CORRECTABILITY OF EMPIRICAL LIKELIHOOD IN GENERALIZED POWER DIVERGENCE FAMILY. Lorenzo Camponovo and Taisuke Otsu.

ON BARTLETT CORRECTABILITY OF EMPIRICAL LIKELIHOOD IN GENERALIZED POWER DIVERGENCE FAMILY. Lorenzo Camponovo and Taisuke Otsu. ON BARTLETT CORRECTABILITY OF EMPIRICAL LIKELIHOOD IN GENERALIZED POWER DIVERGENCE FAMILY By Lorezo Campoovo ad Taisuke Otsu October 011 COWLES FOUNDATION DISCUSSION PAPER NO. 185 COWLES FOUNDATION FOR

More information

Lecture 2. The Lovász Local Lemma

Lecture 2. The Lovász Local Lemma Staford Uiversity Sprig 208 Math 233A: No-costructive methods i combiatorics Istructor: Ja Vodrák Lecture date: Jauary 0, 208 Origial scribe: Apoorva Khare Lecture 2. The Lovász Local Lemma 2. Itroductio

More information

Math 341 Lecture #31 6.5: Power Series

Math 341 Lecture #31 6.5: Power Series Math 341 Lecture #31 6.5: Power Series We ow tur our attetio to a particular kid of series of fuctios, amely, power series, f(x = a x = a 0 + a 1 x + a 2 x 2 + where a R for all N. I terms of a series

More information

CHAPTER 10 INFINITE SEQUENCES AND SERIES

CHAPTER 10 INFINITE SEQUENCES AND SERIES CHAPTER 10 INFINITE SEQUENCES AND SERIES 10.1 Sequeces 10.2 Ifiite Series 10.3 The Itegral Tests 10.4 Compariso Tests 10.5 The Ratio ad Root Tests 10.6 Alteratig Series: Absolute ad Coditioal Covergece

More information

Lecture 11 October 27

Lecture 11 October 27 STATS 300A: Theory of Statistics Fall 205 Lecture October 27 Lecturer: Lester Mackey Scribe: Viswajith Veugopal, Vivek Bagaria, Steve Yadlowsky Warig: These otes may cotai factual ad/or typographic errors..

More information

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We

More information

Output Analysis and Run-Length Control

Output Analysis and Run-Length Control IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%

More information

Regression with an Evaporating Logarithmic Trend

Regression with an Evaporating Logarithmic Trend Regressio with a Evaporatig Logarithmic Tred Peter C. B. Phillips Cowles Foudatio, Yale Uiversity, Uiversity of Aucklad & Uiversity of York ad Yixiao Su Departmet of Ecoomics Yale Uiversity October 5,

More information

Journal of Multivariate Analysis. Superefficient estimation of the marginals by exploiting knowledge on the copula

Journal of Multivariate Analysis. Superefficient estimation of the marginals by exploiting knowledge on the copula Joural of Multivariate Aalysis 102 (2011) 1315 1319 Cotets lists available at ScieceDirect Joural of Multivariate Aalysis joural homepage: www.elsevier.com/locate/jmva Superefficiet estimatio of the margials

More information

Chapter 5. Inequalities. 5.1 The Markov and Chebyshev inequalities

Chapter 5. Inequalities. 5.1 The Markov and Chebyshev inequalities Chapter 5 Iequalities 5.1 The Markov ad Chebyshev iequalities As you have probably see o today s frot page: every perso i the upper teth percetile ears at least 1 times more tha the average salary. I other

More information

1 Introduction to reducing variance in Monte Carlo simulations

1 Introduction to reducing variance in Monte Carlo simulations Copyright c 010 by Karl Sigma 1 Itroductio to reducig variace i Mote Carlo simulatios 11 Review of cofidece itervals for estimatig a mea I statistics, we estimate a ukow mea µ = E(X) of a distributio by

More information

Sequences and Limits

Sequences and Limits Chapter Sequeces ad Limits Let { a } be a sequece of real or complex umbers A ecessary ad sufficiet coditio for the sequece to coverge is that for ay ɛ > 0 there exists a iteger N > 0 such that a p a q

More information

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1 Solutio Sagchul Lee October 7, 017 1 Solutios of Homework 1 Problem 1.1 Let Ω,F,P) be a probability space. Show that if {A : N} F such that A := lim A exists, the PA) = lim PA ). Proof. Usig the cotiuity

More information

Fall 2013 MTH431/531 Real analysis Section Notes

Fall 2013 MTH431/531 Real analysis Section Notes Fall 013 MTH431/531 Real aalysis Sectio 8.1-8. Notes Yi Su 013.11.1 1. Defiitio of uiform covergece. We look at a sequece of fuctios f (x) ad study the coverget property. Notice we have two parameters

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio

More information

January 25, 2017 INTRODUCTION TO MATHEMATICAL STATISTICS

January 25, 2017 INTRODUCTION TO MATHEMATICAL STATISTICS Jauary 25, 207 INTRODUCTION TO MATHEMATICAL STATISTICS Abstract. A basic itroductio to statistics assumig kowledge of probability theory.. Probability I a typical udergraduate problem i probability, we

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

s = and t = with C ij = A i B j F. (i) Note that cs = M and so ca i µ(a i ) I E (cs) = = c a i µ(a i ) = ci E (s). (ii) Note that s + t = M and so

s = and t = with C ij = A i B j F. (i) Note that cs = M and so ca i µ(a i ) I E (cs) = = c a i µ(a i ) = ci E (s). (ii) Note that s + t = M and so 3 From the otes we see that the parts of Theorem 4. that cocer us are: Let s ad t be two simple o-egative F-measurable fuctios o X, F, µ ad E, F F. The i I E cs ci E s for all c R, ii I E s + t I E s +

More information

6. Sufficient, Complete, and Ancillary Statistics

6. Sufficient, Complete, and Ancillary Statistics Sufficiet, Complete ad Acillary Statistics http://www.math.uah.edu/stat/poit/sufficiet.xhtml 1 of 7 7/16/2009 6:13 AM Virtual Laboratories > 7. Poit Estimatio > 1 2 3 4 5 6 6. Sufficiet, Complete, ad Acillary

More information

7 Sequences of real numbers

7 Sequences of real numbers 40 7 Sequeces of real umbers 7. Defiitios ad examples Defiitio 7... A sequece of real umbers is a real fuctio whose domai is the set N of atural umbers. Let s : N R be a sequece. The the values of s are

More information

Probability and Statistics

Probability and Statistics ICME Refresher Course: robability ad Statistics Staford Uiversity robability ad Statistics Luyag Che September 20, 2016 1 Basic robability Theory 11 robability Spaces A probability space is a triple (Ω,

More information

Random Variables, Sampling and Estimation

Random Variables, Sampling and Estimation Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig

More information

Information Theory Tutorial Communication over Channels with memory. Chi Zhang Department of Electrical Engineering University of Notre Dame

Information Theory Tutorial Communication over Channels with memory. Chi Zhang Department of Electrical Engineering University of Notre Dame Iformatio Theory Tutorial Commuicatio over Chaels with memory Chi Zhag Departmet of Electrical Egieerig Uiversity of Notre Dame Abstract A geeral capacity formula C = sup I(; Y ), which is correct for

More information

REAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS

REAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS REAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS 18th Feb, 016 Defiitio (Lipschitz fuctio). A fuctio f : R R is said to be Lipschitz if there exists a positive real umber c such that for ay x, y i the domai

More information

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A. Radom Walks o Discrete ad Cotiuous Circles by Jeffrey S. Rosethal School of Mathematics, Uiversity of Miesota, Mieapolis, MN, U.S.A. 55455 (Appeared i Joural of Applied Probability 30 (1993), 780 789.)

More information

Lecture 3 : Random variables and their distributions

Lecture 3 : Random variables and their distributions Lecture 3 : Radom variables ad their distributios 3.1 Radom variables Let (Ω, F) ad (S, S) be two measurable spaces. A map X : Ω S is measurable or a radom variable (deoted r.v.) if X 1 (A) {ω : X(ω) A}

More information

Introductory statistics

Introductory statistics CM9S: Machie Learig for Bioiformatics Lecture - 03/3/06 Itroductory statistics Lecturer: Sriram Sakararama Scribe: Sriram Sakararama We will provide a overview of statistical iferece focussig o the key

More information

Empirical Process Theory and Oracle Inequalities

Empirical Process Theory and Oracle Inequalities Stat 928: Statistical Learig Theory Lecture: 10 Empirical Process Theory ad Oracle Iequalities Istructor: Sham Kakade 1 Risk vs Risk See Lecture 0 for a discussio o termiology. 2 The Uio Boud / Boferoi

More information

Chapter 8. Euler s Gamma function

Chapter 8. Euler s Gamma function Chapter 8 Euler s Gamma fuctio The Gamma fuctio plays a importat role i the fuctioal equatio for ζ(s) that we will derive i the ext chapter. I the preset chapter we have collected some properties of the

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig

More information

Stat410 Probability and Statistics II (F16)

Stat410 Probability and Statistics II (F16) Some Basic Cocepts of Statistical Iferece (Sec 5.) Suppose we have a rv X that has a pdf/pmf deoted by f(x; θ) or p(x; θ), where θ is called the parameter. I previous lectures, we focus o probability problems

More information

Stat 421-SP2012 Interval Estimation Section

Stat 421-SP2012 Interval Estimation Section Stat 41-SP01 Iterval Estimatio Sectio 11.1-11. We ow uderstad (Chapter 10) how to fid poit estimators of a ukow parameter. o However, a poit estimate does ot provide ay iformatio about the ucertaity (possible

More information

1 Review and Overview

1 Review and Overview DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

LECTURE 14 NOTES. A sequence of α-level tests {ϕ n (x)} is consistent if

LECTURE 14 NOTES. A sequence of α-level tests {ϕ n (x)} is consistent if LECTURE 14 NOTES 1. Asymptotic power of tests. Defiitio 1.1. A sequece of -level tests {ϕ x)} is cosistet if β θ) := E θ [ ϕ x) ] 1 as, for ay θ Θ 1. Just like cosistecy of a sequece of estimators, Defiitio

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

Econ 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara

Econ 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara Poit Estimator Eco 325 Notes o Poit Estimator ad Cofidece Iterval 1 By Hiro Kasahara Parameter, Estimator, ad Estimate The ormal probability desity fuctio is fully characterized by two costats: populatio

More information

4.1 Data processing inequality

4.1 Data processing inequality ECE598: Iformatio-theoretic methods i high-dimesioal statistics Sprig 206 Lecture 4: Total variatio/iequalities betwee f-divergeces Lecturer: Yihog Wu Scribe: Matthew Tsao, Feb 8, 206 [Ed. Mar 22] Recall

More information

Notes 27 : Brownian motion: path properties

Notes 27 : Brownian motion: path properties Notes 27 : Browia motio: path properties Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces:[Dur10, Sectio 8.1], [MP10, Sectio 1.1, 1.2, 1.3]. Recall: DEF 27.1 (Covariace) Let X = (X

More information

This exam contains 19 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam.

This exam contains 19 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam. Probability ad Statistics FS 07 Secod Sessio Exam 09.0.08 Time Limit: 80 Miutes Name: Studet ID: This exam cotais 9 pages (icludig this cover page) ad 0 questios. A Formulae sheet is provided with the

More information

Double Stage Shrinkage Estimator of Two Parameters. Generalized Exponential Distribution

Double Stage Shrinkage Estimator of Two Parameters. Generalized Exponential Distribution Iteratioal Mathematical Forum, Vol., 3, o. 3, 3-53 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/.9/imf.3.335 Double Stage Shrikage Estimator of Two Parameters Geeralized Expoetial Distributio Alaa M.

More information

Chapter 7 Isoperimetric problem

Chapter 7 Isoperimetric problem Chapter 7 Isoperimetric problem Recall that the isoperimetric problem (see the itroductio its coectio with ido s proble) is oe of the most classical problem of a shape optimizatio. It ca be formulated

More information

A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS

A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS J. Japa Statist. Soc. Vol. 41 No. 1 2011 67 73 A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS Yoichi Nishiyama* We cosider k-sample ad chage poit problems for idepedet data i a

More information

11 THE GMM ESTIMATION

11 THE GMM ESTIMATION Cotets THE GMM ESTIMATION 2. Cosistecy ad Asymptotic Normality..................... 3.2 Regularity Coditios ad Idetificatio..................... 4.3 The GMM Iterpretatio of the OLS Estimatio.................

More information