Notes On Nonparametric Density Estimation. James L. Powell Department of Economics University of California, Berkeley

Notes O Noparametric Desity Estimatio James L Powell Departmet of Ecoomics Uiversity of Califoria, Berkeley Uivariate Desity Estimatio via Numerical Derivatives Cosider te problem of estimatig te desity fuctio f(x) of a scalar, cotiuously-distributed iid sequece x i at a particular poit x If te desity f is i a kow parametric family (eg, Gaussia), estimatio of te desity reduces to estimatio of te fiite-dimesioal parameters tat caracterize tat particular desity i te parametric family Witout a parametric assumptio, toug, estimatio of te desity f over all poits i its support would ivolve estimatio of a ifiite umber of parameters, kow i statistics as a oparametric estimatio problem (toug ifiite-parametric estimatio migt be a more accurate title) Sice te desity fuctio f(x) is te derivative of te cumulative distributio fuctio F (x) Pr{x i x}, ad sice te empirical cdf ˆF (x) X {x i x} is te atural oparametric of te cdf, it seems atural to base estimatio of f o te empirical cdf However, wile ˆF is -cosistet ad asymptotically ormal, it would be clearly osesical to estimate f by differetiatig ˆF, sice its derivative is eiter zero or udefied Usig te defiitio of te desity f as te (rigt-) derivative of te cdf, F (x + ) F (x) f(x) lim, 0 we migt estimate te desity f by a correspodig differece ratio of ˆF : ˆf(x) ˆF (x + ) ˆF (x) X {x <x i x + }, were te perturbatio, also kow as a badwidt or widow widt ) is positive but small, depedig upo te sample size (ie, ) To sow MSE cosistecy of ˆf, it suffices to coose te badwidt sequece so tat te mea bias ad variace of ˆf bot ted to zero as te sample size icreases Sice te empirical cdf ˆF is a ubiased estimator of F tatis,e[ ˆF (x)] F (x) te bias of ˆf is evidetly if E[ ˆf(x)] f(x) F (x + ) F (x) f(x) 0 0 as

Te variace of ˆf(x) is! V ( ˆf(x)) X V {x <x i x + } wic will ted to zero if 2 V ({x <x i x + }) F (x + ) F (x) ( (F (x + ) F (x))) f(x) + O( ), as Tus, if te badwidt sequece teds to zero as teds to ifiity, but at a slower rate ta /, te MSE of ˆf will coverge to zero, esurig its (weak) cosistecy To arrow dow te coice of badwidt sequece, we migt wat to coose it to mamize te rate of covergece of te MSE of ˆf to zero To do tis, we eed a explicit expressio for its bias as a fuctio of Assumig F (x+) is smoot eoug to admit a secod-order Taylor s series expasio aroud 0, F (x + ) F (x)+f(x) + f 0 (x) 2 2 + o( 2 ), te MSE of ˆf ca be expressed as MSE( ˆf(x); ) E( ˆf(x)) 2 f(x)i + V ( ˆf(x)) µ f 0 (x) 2 2 + o( 2 )+ f(x) + O( ) Because te squared bias is directly related to, wile te variace is iversely related to, te fastest covergece of te MSE to zero occurs we te squared bias ad variace coverge to zero at te same speed (If tey coverge at differet speeds, te slower speed domiates) Tus te optimal badwidt sequece will satisfy O ( ) 2 µ O, so O µ! /3 will give te fastest rate of covergece of te MSE to zero, µ! MSE( ˆf(x); 2/3 )O Tis rate is slower ta te usual parametric rate of covergece of te MSE (if it ests) to zero, wic is O( ) for, eg, te sample mea 2

Altoug te optimal rate of covergece of to zero does ot deped upo f itself, te optimal level would require kowledge of f(x) Assumig te badwidt sequece is of te form c /3, we ca coose c to miimize te limitig ormalized MSE µ lim 2/3 MSE( ˆf(x); f 0 (x) 2 ) 2 c + f(x) c, wic is miimized i c at c! (f 0 (x)) 2 /3 2f(x) So te optimal badwidt sequece µ 2f(x) (f 0 (x)) 2 tat miimizes te leadig terms i te MSE formula is ifeasible, sice it requires kowledge of te desity f(x) itself ad its first derivative Toug it ca be sow (uder additioal coditios) tat cosistet estimatio of te costat c will ot affect te rate of covergece of ˆf, etc, oparametric estimatio of f 0 (x) caot be based upo ˆf(x) directly (sice its derivative is zero werever it is defied), but would require a similar umerical derivate approac, wic would etail its ow badwidt coice problem A alterative is to stadardize te coice of te costat term c for some kow parametric desity fuctio; for example, if f(x) σ φ( x µ σ /3 ), were φ is te stadard ormal desity, te µ x c 2 /3 φ(x) σ, 2 ad estimatio of te optimal costat would reduce to estimatio of te stadard deviatio σ of te x i distributio Istead of coosig te badwidt for a specific valueofx, we migt wat to coose it to miimize a global MSE criterio over all possible x values, suc as te itegrated mea-squared error criterio IMSE( ˆf; ) MSE( ˆf(x); )dx µ f 0 (x) 2 2 dx + + o(2 )+O( ) Te same calculatios as for te poitwise optimal badwidt yield te optimal IMSE badwidt + to be! /3 + 2 R (f 0 (x)) 2, dx wic still depeds upo te derivative of f (A similar expressio olds for te badwidt miimizig a average MSE criterio, were dx is replaced by f(x)dx i te formula for te IMSE) For te Gaussia 3

desity f(x) x µ σ φ( σ ), te optimal IMSE badwidt would be! /3 + 2 R ( xφ(x)) 2 σ dx 24σ /3, so stadardizig te badwidt coice to be optimal for ormal desities reduces te badwidt coice problem to estimatio of te stadard deviatio σ Multivariate Kerel Desity Estimatio Te umerical derivative estimator of te uivariate desity f(x) above is a special case of a geeral class of oparametric desity estimators called kerel desity estimators Now supposig x i R p, we ca tik of smootig out te empirical cdf ˆF (x) for by replacig it wit a covolutio of ˆF ad te distributio of a idepedet, cotiuously-distributed oise term ε, were is a small positive badwidt as above ad ε as a kow desity fuctio K(ε) (also kow as te kerel fuctio) For a particular realized value of x i, te desity fuctio for X i x i + ε would be f Xi (x) p K µ x by te usual cage-of-variables formula for multivariate desities (Here te absolute value of te determiat of te Jacobia dε/dxi 0 is p ) Tus te kerel desity estimator ˆf(x) is te average of f Xi over te observed values of x i i te sample, ˆf(x) X p K µ x As log as R K(u)du, te desity estimator ˆf(x) will also itegrate to oe over x, as befittig a desity fuctio Te umerical derivative estimator discussed above is a special case wit p ad K(u) { u<0}, te desity fuctio for a Uiform(, 0) radom variable Te MSE calculatios are straigtforward extesios of tose for te umerical derivative estimator Te expectatio of ˆf is " E[ ˆf(x)] X µ # E p K µ E p K µ z p K f(z)dz Makig te cage-of-variables u u(z) (x z) (wit du p dz), E[ ˆf(x)] K(u)f(x u)du, wic clearly teds to f(x) as 0 if f is cotiuous ad bouded above (by domiated covergece), as log as R K(u)du (Te last coditio implies R K(u) du <, a coditio tat will be more 4

relevat later, we te oegativity restrictio K(u) is relaxed) Assumig tat f(x) is smoot eoug for f(x u) to admit a secod-order Taylor s expasio aroud 0, ie, f(x u) f(x) f(x) x 0 (u)+ 2 (u)0 2 f(x) x x 0 (u)+o(2 ) µ µ f(x) f(x) x 0 u + 2 2 2 tr f(x) x x 0 uu0 + o( 2 ), te bias of ˆf ca be expressed as µ E[ ˆf(x)] f(x) f(x) x 0 µ uk(u)du + 2 2 2 tr f(x) x x 0 uu 0 K(u)du + o( 2 ) If te mea of te kerel, R uk(u)du, is ozero (as wit te oe-sided umerical derivative estimator above), te te bias is O(); owever,iftekerelfuctiok(u) is cose to be symmetric about zero, or, more geerally, if uk(u)du 0, (*) te te bias is E[ ˆf(x)] f(x) O( 2 ) Similarly, te variace of ˆf(x) ca be calculated as Var( ˆf(x)) X µ! Var p K µ µ Var p K µ µ 2 E p K µ µ 2 E p K µ z 2 2p K f(z)dz 2 (E[ ˆf(x)]) p [K (u)] 2 f(x u)du 2 (E[ ˆf(x)]) f(x) µ p [K (u)] 2 du + o p (Te secod equality exploits te fact tat x i is iid, ad te fift makes te same cage-of-variables as for te bias formula) So te bias of ˆf(x) is O( 2 ) uder coditio (*), ad its variace is O(( p ) ); te optimal badwidt sequece, wic equates te rate of covergece of te squared bias ad variace to zero, tus satisfies so µ O(( ) 4 )O ( ) p O µ /(p+4), 5

ad te MSE evaluated at is MSE( ˆf(x); )O µ! 4/(p+4) Note tat, as p icreases ie, te umber of compoets i x i, umber of argumets of f(x) icreases te best rate of covergece of te MSE declies, a peomeo refereced by te catc prase, te curse of dimesioality Te Asymptotic Distributio of te Kerel Desity Estimator Te kerel desity estimator ˆf(x) ca be rewritte as a sample average of idepedet, ideticallydistributed radom variables ˆf(x) X z i, were z i ˆf(x) µ x p K Here te secod subscript i z i deotes its depedece o te sample size troug te badwidt term Suc doubly-subscripted radom variables, were te rage of te first subscript is bouded above by te secod, are kow as triagular arrays, ad classical limit teorems must be modified to accout for te cagig distributio of te observatios as te sample size icreases To demostrate asymptotic ormality of ˆf(x), a coveiet cetral limit teorem is Liapuov s Cetral Limit Teorem for Triagular Arrays It statest tat, if te scalar radom variable z i is idepedetly (but ot ecessarily idetically) distributed wit variace Var(z i ) σ 2 i ad r-t absolute cetral momet E[ z i E(z i ) r ] ρ i < for some r>2, ad if ( P ρ i) /r P σ2 i /2 0 as 0 (kow as te Liapuov coditio), te is asymptotically ormal, z X z i z E[z ] p Var(z ) d N (0, ) Applyig tis teorem to ˆf(x), we obtai te variace of te z i terms as σ 2 i Var(z i ) µ µ Var p K f(x) µ p [K (u)] 2 du + o p 6

from our earlier calculatios; settig r 3, we get a upper boud for te tird cetral momet of z i as ρ i E[ z i E(z i ) 3 ] 8E[ z i 3 ] µ 8E " 3 # 8f(x) µ 2p K (u) 3 du + o 2p, were te iequality uses te expasio E[ z i E(z i ) 3 ] E[( z i + E(z i ) ) 3 ] E[ z i 3 ]+3E[ z i 2 ] E(z i ) +3E[ z i ] E(z i ) 2 +(E(z i )) 3 ad te last equality makes te same cage-of-variables calculatios as for te variace of z i Tus, for tis problem te Liapuov coditio is ( P ρ i) /r P /2 i σ2 ³ 8f(x) R K (u) 3 du + o /3 2p 2p ³ f(x) R [K (u)] 2 du + o /2 p if O p /3 2p/3! O O(( p ) /6 ) 0 p, /2 p/2 te same coditio imposed to esure tat Var( ˆf(x)) 0 as Uder tis coditio, te,! ˆf(x) E[ ˆf(x)] q Var( ˆf(x)) d N (0, ), or, subtitutig te expressio for Var( ˆf(x)) O(( p ) ) ad collectig terms, p ( ˆf(x) E[ ˆf(x)]) d N (0,f(x) [K (u)] 2 du) Tis is almost, but ot quite, i te form of te usual expressio for a asymptotic distributio of a estimator; te differece is tat te expectatio of te estimator E[ ˆf(x)] rater ta te true value f(x) is subtracted from ˆf(x) i tis expressio Writig p ( ˆf(x) f(x)) p ( ˆf(x) E[ ˆf(x)]) + p (E[ ˆf(x)] f(x)), te asymptotic ormality of ˆf(x) aroud f(x) will require te secod, bias term to coverge to a costat Isertig te expressio for te bias, p (E[ ˆf(x)] f(x)) µ µ 2 2 p 2 tr f(x) x x 0 uu 0 K(u)du + o( 2 ) O( p+4 ) 7

If te badwidt takes te form c µ /(p+4), so tat it coverges to zero at te optimal rate, te p (E[ ˆf(x)] f(x)) µ c(p+4)/2 2 f(x) tr 2 x x 0 δ(x), uu 0 K(u)du ad p ( ˆf(x) f(x)) d N (δ(x),f(x) [K (u)] 2 du) Here te appromatig ormal distributio for ˆf(x) would be cetered at f(x) +δ(x), ot f(x), ad costructio of te usual cofidece regios or test statistics would be complicated by te fact tat δ(x) depeds upo te ukow secod derivative of f(x) If te badwidt teds to zero faster ta te optimal rate, ie, te o µ /(p+4), p (E[ ˆf(x)] f(x)) 0, ad te bias term vaises from te asymptotic distributio, p ( ˆf(x) f(x)) d N (0,f(x) [K (u)] 2 du) Ofte i practice tis sort of udersmootig wic implies te bias of f(x) is egligible relative to te variace is assumed, ad cofidece itervals of te form " s s # f(x) f(x) 96 ˆf(x) [K (u)] 2 du, f(x)+96 ˆf(x) [K (u)] 2 du are reported, toug it is best to view te claimed 95% asymptotic coverage rate wit some skepticism Note tat if te badwidt teds to zero slower ta te optimal rate, eg, µ γ o, γ > p +4, te te bias of ˆf(x) domiates its stadard deviatio, ad te ormalized differece p ( ˆf(x) f(x)) diverges Rescalig for Multivariate Kerels Derivatives of te Kerel Desity ad Regressio Estimators Higer-Order (Bias-Reducig) Kerels 8