Density Estimation. Chapter 1


Chapter 1

Density Estimation

The estimation of probability density functions (PDFs) and cumulative distribution functions (CDFs) are cornerstones of applied data analysis in the social sciences. Testing for the equality of two distributions (or moments thereof) is perhaps the most basic test in all of applied data analysis. Economists, for instance, devote a great deal of attention to the study of income distributions and how they vary across regions and over time. Though the PDF and CDF are often the objects of direct interest, their estimation also serves as an important building block for other objects being modeled such as a conditional mean (i.e., a "regression function"), which may be directly modeled using nonparametric or semiparametric methods (a conditional mean is a function of a conditional PDF, which is itself a ratio of unconditional PDFs). After mastering the principles underlying the nonparametric estimation of a PDF, the nonparametric estimation of the workhorse of applied data analysis, the conditional mean function considered in Chapter 2, progresses in a fairly straightforward manner. Careful study of the approaches developed in Chapter 1 will be most helpful for understanding material presented in later chapters.

We begin with the estimation of a univariate PDF in Sections 1.1 through 1.3, turn to the estimation of a univariate CDF in Sections 1.4 and 1.5, and then move on to the more general multivariate setting in Sections 1.6 through 1.8. Asymptotic normality, uniform rates of convergence, and bias reduction methods appear in Sections 1.9 through 1.12. Numerous illustrative applications appear in Section 1.13, while theoretical and applied exercises can be found in Section 1.14. We now proceed with a discussion of how to estimate the PDF

f_X(x) of a random variable X. For notational simplicity we drop the subscript X and simply use f(x) to denote the PDF of X. Some of the treatments of the kernel estimation of a PDF discussed in this chapter are drawn from the two excellent monographs by Silverman (1986) and Scott (1992).

1.1 Univariate Density Estimation

To best appreciate why one might consider using nonparametric methods to estimate a PDF, we begin with an illustrative example, the parametric estimation of a PDF.

Example 1.1. Suppose X_1, X_2, ..., X_n represent independent and identically distributed (i.i.d.) draws from a normal distribution with mean µ and variance σ². We wish to estimate the normal PDF f(x). By assumption, f(x) has a known parametric functional form (i.e., univariate normal) given by

f(x) = (2πσ²)^{-1/2} exp[-(x - µ)²/(2σ²)],

where the mean µ = E(X) and variance σ² = E[(X - E(X))²] = var(X) are the only unknown parameters to be estimated. One could estimate µ and σ² by the method of maximum likelihood as follows. Under the i.i.d. assumption, the joint PDF of (X_1, ..., X_n) is simply the product of the univariate PDFs, which may be written as

f(X_1, ..., X_n) = ∏_{i=1}^n (2πσ²)^{-1/2} e^{-(X_i - µ)²/(2σ²)} = (2πσ²)^{-n/2} e^{-(1/(2σ²)) ∑_{i=1}^n (X_i - µ)²}.

Conditional upon the observed sample and taking the logarithm, this gives us the log-likelihood function

L(µ, σ²) ≡ ln f(X_1, ..., X_n; µ, σ²) = -(n/2) ln(2π) - (n/2) ln σ² - (1/(2σ²)) ∑_{i=1}^n (X_i - µ)².

The method of maximum likelihood proceeds by choosing those parameters that make it most likely that we observed the sample at hand given our distributional assumption. Thus, the likelihood function (or a monotonic transformation thereof, e.g., ln) expresses the plausibility of different values of µ and σ² given the observed sample. We then maximize the likelihood function with respect to these two unknown parameters.
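A minimal sketch of Example 1.1 in Python: maximizing the log-likelihood above yields the sample mean and the average squared deviation (with a 1/n divisor) as the estimates of µ and σ², which can then be plugged into the normal PDF. The simulated sample and the seed below are ours, for illustration only, not from the text.

```python
import math
import random

def normal_mle(x):
    """Maximum likelihood estimates for an i.i.d. normal sample:
    mu_hat = sample mean, sigma2_hat = average squared deviation
    (note the 1/n divisor rather than 1/(n-1))."""
    n = len(x)
    mu = sum(x) / n
    sigma2 = sum((xi - mu) ** 2 for xi in x) / n
    return mu, sigma2

def normal_pdf(x, mu, sigma2):
    """Parametric density estimate f_hat(x) with the MLEs plugged in."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

random.seed(42)
sample = [random.gauss(0.0, 1.0) for _ in range(5000)]
mu_hat, sigma2_hat = normal_mle(sample)
```

With 5,000 standard normal draws the estimates land close to the true values µ = 0 and σ² = 1, illustrating the parametric approach the text goes on to critique: the quality of f̂ rests entirely on normality being the right family.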

The necessary first order conditions for a maximization of the log-likelihood function are ∂L(µ, σ²)/∂µ = 0 and ∂L(µ, σ²)/∂σ² = 0. Solving these first order conditions for the two unknown parameters µ and σ² yields

µ̂ = (1/n) ∑_{i=1}^n X_i and σ̂² = (1/n) ∑_{i=1}^n (X_i - µ̂)².

µ̂ and σ̂² above are the maximum likelihood estimators of µ and σ², respectively, and the resulting estimator of f(x) is

f̂(x) = (1/√(2πσ̂²)) exp[-(1/2)((x - µ̂)/σ̂)²].

The Achilles heel of any parametric approach is of course the requirement that, prior to estimation, the analyst must specify the exact parametric functional form for the object being estimated. Upon reflection, the parametric approach is somewhat circular since we initially set out to estimate an unknown density but must first assume that the density is in fact known (up to a handful of unknown parameters, of course). Having based our estimate on the assumption that the density is a member of a known parametric family, we must then naturally confront the possibility that the parametric model is misspecified, i.e., not consistent with the population from which the data was drawn. For instance, by assuming that X is drawn from a normally distributed population in the above example, we in fact impose a number of potentially quite restrictive assumptions: symmetry, unimodality, monotonic decline away from the mode, and so on. If the true density were in fact asymmetric, possessed multiple modes, or was nonmonotonic away from the mode, then the presumption of distributional normality may provide a misleading characterization of the true density and could thereby produce erroneous estimates and lead to unsound inference.

At this juncture many readers will no doubt be pointing out that, having estimated a parametric PDF, one can always test whether the underlying distributional assumption is valid. We are, of course, completely sympathetic toward such arguments. Often, however, the rejection of a distributional assumption fails to provide any clear alternative.
That is, we can reject the assumption of normality, but this rejection leaves us where we started, perhaps having ruled out but one of a large

number of candidate distributions. Against this backdrop, researchers might instead consider nonparametric approaches.

Nonparametric methods circumvent problems arising from the need to specify parametric functional forms prior to estimation. Rather than presume one knows the exact functional form of the object being estimated, one instead presumes that it satisfies some regularity conditions such as smoothness and differentiability. This does not, however, come without cost. By imposing less structure on the functional form of the PDF than do parametric methods, nonparametric methods require more data to achieve the same degree of precision as a correctly specified parametric model. Our primary focus in this text is on a class of estimators known as nonparametric kernel estimators (a kernel function is simply a weighting function), though in Chapters 14 and 15 we provide a treatment of alternative nonparametric methodologies including nearest neighbor and series methods.

Before proceeding to a formal theoretical analysis of nonparametric density estimation methods, we first consider a popular example, estimating the probability of a head on a toss of a coin, which is closely related to the nonparametric estimation of a CDF. This in turn will lead us to the nonparametric estimation of a PDF.

Example 1.2. Suppose we have a coin (perhaps an unfair one) and we want to estimate the probability of flipping the coin and having it land heads up. Let p = P(H) denote the (unknown) population probability of obtaining a head. Taking a relative frequency approach, we would flip the coin n times, count the frequency of heads in n trials, and compute the relative frequency given by

p̂ = (1/n){# of heads}, (1.1)

which provides an estimate of p. The p̂ defined in (1.1) is often referred to as a frequency estimator of p, and it is also the maximum likelihood estimator of p (see Exercise 1.2). The estimator p̂ is, of course, fully nonparametric. Intuitively, one would expect that, if n is large, then p̂ should be close to p.
Indeed, one can easily show that the mean squared error (MSE) of p̂ is given by (see Exercise 1.3)

MSE(p̂) ≝ E[(p̂ - p)²] = p(1 - p)/n,

so MSE(p̂) → 0 as n → ∞, which is termed "p̂ converges to p in mean square error"; see Appendix A for the definitions of various modes of convergence.
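The MSE formula above is easy to check by simulation. In this sketch (the probability p = 0.3, the sample size, and the seed are our own choices for illustration) a Monte Carlo average of (p̂ - p)² over repeated experiments is compared with the theoretical value p(1 - p)/n:

```python
import random

def mse_of_frequency_estimator(p, n, reps, rng):
    """Monte Carlo estimate of E[(p_hat - p)^2] for the frequency
    estimator p_hat = (# heads)/n over `reps` repeated experiments."""
    total = 0.0
    for _ in range(reps):
        heads = sum(1 for _ in range(n) if rng.random() < p)
        total += (heads / n - p) ** 2
    return total / reps

rng = random.Random(0)
p, n = 0.3, 200
mc = mse_of_frequency_estimator(p, n, reps=2000, rng=rng)
theory = p * (1 - p) / n  # the closed form above
```

The two quantities agree up to Monte Carlo noise, and rerunning with larger n shows the MSE shrinking at the 1/n rate, which is the convergence-in-MSE claim.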

We now discuss how to obtain an estimator of the CDF of X, which we denote by F(x). The CDF is defined as F(x) = P[X ≤ x]. With i.i.d. data X_1, ..., X_n (i.e., random draws from the distribution F(·)), one can estimate F(x) by

F_n(x) = (1/n){# of X_i's ≤ x}. (1.2)

Equation (1.2) has a nice intuitive interpretation. Going back to our coin-flip example, if a coin is such that the probability of obtaining a head when we flip it equals F(x) (F(x) is unknown), and if we treat the collection of data X_1, ..., X_n as flipping a coin n times and we say that a head occurs on the ith trial if X_i ≤ x, then P(H) = P(X_i ≤ x) = F(x). The familiar frequency estimator of P(H) is equal to the number of heads divided by the number of trials:

P̂(H) = (# of heads)/n = (1/n){# of X_i's ≤ x} ≡ F_n(x). (1.3)

Therefore, we call (1.2) a frequency estimator of F(x). Just as before when estimating P(H), we expect intuitively that as n gets large, P̂(H) should yield a more accurate estimate of P(H). By the same reasoning, one would expect that as n → ∞, F_n(x) yields a more accurate estimate of F(x). Indeed, one can easily show that F_n(x) → F(x) in MSE, which implies that F_n(x) converges to F(x) in probability and also in distribution as n → ∞. In Appendix A we introduce the concepts of convergence in mean square error, convergence in probability, convergence in distribution, and almost sure convergence. It is well established that F_n(x) indeed converges to F(x) in each of these various senses. These concepts of convergence are necessary as it is easy to show that the ordinary limit of F_n(x) does not exist, i.e., lim_{n→∞} F_n(x) does not exist (see Exercise 1.3, where the definition of an ordinary limit is provided). This example highlights the necessity of introducing convergence modes such as convergence in mean square error and convergence in probability.

Now we take up the question of how to estimate a PDF f(x) without making parametric presumptions about its functional form. From the

definition of f(x) we have¹

f(x) = dF(x)/dx. (1.4)

From (1.2) and (1.4), an obvious estimator of f(x) is²

f̂(x) = [F_n(x + h) - F_n(x - h)]/(2h), (1.5)

where h is a small positive increment. By substituting (1.2) into (1.5), we obtain

f̂(x) = (1/(2nh)){# of X_1, ..., X_n falling in the interval [x - h, x + h]}. (1.6)

If we define a uniform kernel function given by

k(z) = 1/2 if |z| ≤ 1, 0 otherwise, (1.7)

then it is easy to see that f̂(x) given by (1.5) can also be expressed as

f̂(x) = (1/(nh)) ∑_{i=1}^n k((X_i - x)/h). (1.8)

Equation (1.8) is called a uniform kernel estimator because the kernel function k(·) defined in (1.7) corresponds to a uniform PDF. In general, we refer to k(·) as a kernel function and to h as a smoothing parameter (or, alternatively, a bandwidth or window width). Equation (1.8) is sometimes referred to as a naïve kernel estimator. In fact one might use many other possible choices for the kernel function k(·) in this context. For example, one could use a standard normal kernel given by

k(v) = (1/√(2π)) e^{-v²/2}, -∞ < v < ∞. (1.9)

This class of estimators can be found in the first published paper on kernel density estimation by Rosenblatt (1956), while Parzen (1962) established a number of properties associated with this class of estimators

¹We only consider the continuous X case in this chapter. We deal with the discrete X case in Chapters 3 and 4.

²Recall that the definition of the derivative of a function g(x) is given by dg(x)/dx = lim_{h→0} [g(x + h) - g(x)]/h, or, equivalently, dg(x)/dx = lim_{h→0} [g(x + h) - g(x - h)]/(2h).
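The estimator (1.8) can be sketched in a few lines of Python; both the uniform kernel (1.7) and the standard normal kernel (1.9) are shown, and either can be passed to the same estimator:

```python
import math

def uniform_kernel(v):
    # Uniform kernel (1.7): 1/2 on [-1, 1], zero elsewhere.
    return 0.5 if abs(v) <= 1 else 0.0

def gaussian_kernel(v):
    # Standard normal kernel (1.9).
    return math.exp(-0.5 * v * v) / math.sqrt(2 * math.pi)

def kde(x, sample, h, kernel=gaussian_kernel):
    """Rosenblatt-Parzen kernel density estimator (1.8):
    f_hat(x) = (1/(n*h)) * sum_i k((X_i - x)/h)."""
    n = len(sample)
    return sum(kernel((xi - x) / h) for xi in sample) / (n * h)
```

With a single observation at 0 and h = 1, the estimate at x = 0 is simply k(0)/1, which makes the role of the kernel as a local weighting function transparent; with many observations the estimate averages one "bump" per data point.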

and relaxed the nonnegativity assumption in order to obtain estimators which are more efficient. For this reason, this approach is sometimes referred to as Rosenblatt-Parzen kernel density estimation.

We will prove shortly that the kernel estimator f̂(x) defined in (1.8), constructed from any general nonnegative bounded kernel function k(·) that satisfies

(i) ∫ k(v) dv = 1, (ii) k(v) = k(-v), (iii) ∫ v² k(v) dv = κ₂ > 0, (1.10)

is a consistent estimator of f(x). Note that the symmetry condition (ii) implies that ∫ v k(v) dv = 0. By consistency, we mean that f̂(x) → f(x) in probability (convergence in probability is defined in Appendix A). Note that k(·) defined in (1.10) is a (symmetric) PDF. For recent work on kernel methods with asymmetric kernels, see Abadir and Lawford (2004).

To define various modes of convergence, we first introduce the concept of the Euclidean norm ("Euclidean length") of a vector. Given a q × 1 vector x = (x_1, x_2, ..., x_q)' ∈ R^q, we use ‖x‖ to denote the Euclidean length of x, which is defined by ‖x‖ = [x'x]^{1/2} ≡ √(x_1² + x_2² + ⋯ + x_q²). When q = 1 (a scalar), ‖x‖ is simply the absolute value of x.

In the appendix we discuss the notation O(·) ("big O") and o(·) ("small o"). Let a_n be a nonstochastic sequence. We say that a_n = O(n^α) if |a_n| ≤ C n^α for all sufficiently large n, where α and C (> 0) are constants. Similarly, we say that a_n = o(n^α) if a_n/n^α → 0 as n → ∞. We are now ready to prove the MSE consistency of f̂(x).

Theorem 1.1. Let X_1, ..., X_n denote i.i.d. observations having a three-times differentiable PDF f(x), and let f^{(s)}(x) denote the sth order derivative of f(x) (s = 1, 2, 3). Let x be an interior point in the support of X, and let f̂(x) be that defined in (1.8). Assume that the kernel function k(·) is bounded and satisfies (1.10). Also, as n → ∞, h → 0 and nh → ∞. Then

MSE[f̂(x)] = (h⁴/4) κ₂² [f^{(2)}(x)]² + κ f(x)/(nh) + o(h⁴ + (nh)^{-1}) = O(h⁴ + (nh)^{-1}), (1.11)

where κ₂ = ∫ v² k(v) dv and κ = ∫ k²(v) dv.

Proof of Theorem 1.1.

MSE[f̂(x)] ≡ E{[f̂(x) - f(x)]²} = var[f̂(x)] + [E f̂(x) - f(x)]² = var[f̂(x)] + [bias(f̂(x))]².

We will evaluate the bias(f̂(x)) and var(f̂(x)) terms separately. For the bias calculation we will need to use the Taylor expansion formula. For a univariate function g(x) that is m times differentiable, we have

g(x) = g(x₀) + g^{(1)}(x₀)(x - x₀) + (1/2!) g^{(2)}(x₀)(x - x₀)² + ⋯ + (1/(m-1)!) g^{(m-1)}(x₀)(x - x₀)^{m-1} + (1/m!) g^{(m)}(ξ)(x - x₀)^m,

where g^{(s)}(x₀) = d^s g(x)/dx^s evaluated at x = x₀, and ξ lies between x and x₀.

The bias term is given by

bias[f̂(x)] = E{(1/(nh)) ∑_{i=1}^n k((X_i - x)/h)} - f(x)
= (1/h) E[k((X_1 - x)/h)] - f(x) (by identical distribution)
= (1/h) ∫ f(x₁) k((x₁ - x)/h) dx₁ - f(x)
= ∫ f(x + hv) k(v) dv - f(x) (change of variable, (x₁ - x)/h = v)
= ∫ {f(x) + f^{(1)}(x)hv + (1/2) f^{(2)}(x) h²v² + O(h³)} k(v) dv - f(x)
= {f(x) + (h²/2) f^{(2)}(x) ∫ v² k(v) dv + O(h³)} - f(x) (by (1.10))
= (h²/2) κ₂ f^{(2)}(x) + O(h³), (1.12)

where the O(h³) term comes from |(1/3!) h³ ∫ f^{(3)}(x̄) v³ k(v) dv| ≤ C h³ ∫ |v|³ k(v) dv = O(h³), where C is a positive constant, and where x̄ lies between x and x + hv. Note that in the above derivation we assume that f(x) is three-times differentiable. We can weaken this condition to f(x) being twice differentiable, resulting in (O(h³) becomes o(h²); see Exercise 1.5)

bias[f̂(x)] = E[f̂(x)] - f(x) = (h²/2) f^{(2)}(x) ∫ v² k(v) dv + o(h²). (1.13)
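The leading bias term in (1.13) can be checked numerically without any simulation noise: for a fixed h, E[f̂(x)] = (1/h) ∫ k((t - x)/h) f(t) dt can be computed by quadrature and compared with f(x) + (h²/2) κ₂ f^{(2)}(x). The sketch below (our own illustration, not from the text) takes f to be the standard normal density and k the Gaussian kernel, for which κ₂ = 1 and f^{(2)}(0) = -f(0):

```python
import math

def phi(x):
    """Standard normal PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def expected_kde_at(x, h, steps=8000, lim=10.0):
    """E[f_hat(x)] = (1/h) * integral of k((t - x)/h) * f(t) dt,
    computed with a simple trapezoid rule (f = standard normal here,
    k = Gaussian kernel, which equals phi)."""
    dt = 2 * lim / steps
    total = 0.0
    for i in range(steps + 1):
        t = -lim + i * dt
        w = 0.5 if i in (0, steps) else 1.0
        total += w * phi((t - x) / h) * phi(t)
    return total * dt / h

h = 0.2
exact_bias = expected_kde_at(0.0, h) - phi(0.0)
# Leading term from (1.13): (h^2/2) * kappa_2 * f''(0), with
# kappa_2 = 1 for the Gaussian kernel and f''(0) = -phi(0).
approx_bias = 0.5 * h * h * (-phi(0.0))
```

At h = 0.2 the exact bias and the leading-term approximation agree to a few percent, and both are negative: smoothing a peak pulls the estimate down at the mode, exactly as the sign of f^{(2)} in (1.13) predicts.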

Next we consider the variance term. Observe that

var[f̂(x)] = var[(1/(nh)) ∑_{i=1}^n k((X_i - x)/h)]
= (1/(n²h²)) ∑_{i=1}^n var[k((X_i - x)/h)] (by independence)
= (1/(nh²)) var[k((X_1 - x)/h)] (by identical distribution)
= (1/(nh²)) {E[k²((X_1 - x)/h)] - (E[k((X_1 - x)/h)])²}
= (1/(nh²)) {∫ f(x₁) k²((x₁ - x)/h) dx₁ - [∫ f(x₁) k((x₁ - x)/h) dx₁]²}
= (1/(nh)) {∫ f(x + hv) k²(v) dv - h [∫ f(x + hv) k(v) dv]²}
= (1/(nh)) {∫ [f(x) + f^{(1)}(ξ)hv] k²(v) dv - O(h)}
= (1/(nh)) {f(x) ∫ k²(v) dv + O(h)}
= (1/(nh)) {κ f(x) + O(h)}, (1.14)

where κ = ∫ k²(v) dv. Equations (1.12) and (1.14) complete the proof of Theorem 1.1.

Theorem 1.1 implies that (by Theorem A.7 of Appendix A)

f̂(x) - f(x) = O_p(h² + (nh)^{-1/2}) = o_p(1).

By choosing h = c n^{-1/α} for some c > 0 and α > 1, the conditions required for consistent estimation of f(x), h → 0 and nh → ∞,

are clearly satisfied. The overriding question is what values of c and α should be used in practice. As can be seen, for a given sample size n, if h is small, the resulting estimator will have a small bias but a large variance. On the other hand, if h is large, then the resulting estimator will have a small variance but a large bias. To minimize MSE[f̂(x)], one should balance the squared bias and the variance terms. The optimal choice of h (in the sense that MSE[f̂(x)] is minimized) should satisfy dMSE[f̂(x)]/dh = 0. By using (1.11), it is easy to show that the optimal h that minimizes the leading term of MSE[f̂(x)] is given by

h_opt = c(x) n^{-1/5}, (1.15)

where c(x) = {κ f(x)/[κ₂ f^{(2)}(x)]²}^{1/5}. MSE[f̂(x)] is clearly a pointwise property, and by using this as the basis for bandwidth selection we are obtaining a bandwidth that is optimal when estimating a density at a point x. Examining c(x) in (1.15), we can see that a bandwidth which is optimal for estimation at a point x located in the tail of a distribution will differ from that which is optimal for estimation at a point located at, say, the mode.

Suppose that we are interested not in tailoring the bandwidth to the pointwise estimation of f(x) but instead in tailoring the bandwidth globally for all points x, that is, for all x in the support of f(·) (the support of x is defined as the set of points of x for which f(x) > 0, i.e., {x : f(x) > 0}). In this case we can choose h optimally by minimizing the integrated MSE (IMSE) of f̂(x). Using (1.11) we have

IMSE(f̂) ≝ ∫ E[f̂(x) - f(x)]² dx = (h⁴/4) κ₂² ∫ [f^{(2)}(x)]² dx + κ/(nh) + o(h⁴ + (nh)^{-1}). (1.16)

Again letting h_opt denote the optimal smoothing parameter that minimizes the leading terms of (1.16), we use simple calculus to get

h_opt = c₀ n^{-1/5}, (1.17)

where c₀ = κ^{1/5} κ₂^{-2/5} {∫ [f^{(2)}(x)]² dx}^{-1/5} > 0 is a positive constant. Note that if f^{(2)}(x) = 0 for (almost) all x, then c₀ is not well defined. For example, if X is, say, uniformly distributed over its support, then f^{(s)}(x) = 0 for all x and for all s ≥ 1, and (1.17) is not defined in this case. It can be shown that in this case (i.e., when X is uniformly

distributed), h_opt will have a different rate of convergence equal to n^{-1/3}; see the related discussion in Section 1.3.1 and Exercise 1.16.

An interesting extension of the above results can be found in Zinde-Walsh (2005), who examines the asymptotic process for the kernel density estimator by means of generalized functions and generalized random processes and presents novel results for characterizing the behavior of kernel density estimators when the density does not exist, i.e., when the density does not exist as a locally summable function.

1.2 Univariate Bandwidth Selection: Rule-of-Thumb and Plug-In Methods

Equation (1.17) reveals that the optimal smoothing parameter depends on the integrated second derivative of the unknown density through c₀. In practice, one might choose an initial "pilot" value of h to estimate ∫ [f^{(2)}(x)]² dx nonparametrically, and then use this value to obtain h_opt using (1.17). Such approaches are known as plug-in methods for obvious reasons. One popular way of choosing the initial h, suggested by Silverman (1986), is to assume that f(x) belongs to a parametric family of distributions, and then to compute h using (1.17). For example, if f(x) is a normal PDF with variance σ², then ∫ [f^{(2)}(x)]² dx = 3/[8π^{1/2}σ⁵]. If a standard normal kernel is used, then using (1.17) we get the pilot estimate

h_pilot = (4π)^{-1/10} [(3/8)π^{-1/2}]^{-1/5} σ n^{-1/5} ≈ 1.06 σ n^{-1/5}, (1.18)

which is then plugged into ∫ [f̂^{(2)}(x)]² dx, which then may be used to obtain h_opt using (1.17). A clearly undesirable property of the plug-in method is that it is not fully automatic because one needs to choose an initial value of h to estimate ∫ [f^{(2)}(x)]² dx (see Marron, Jones and Sheather (1996) and also Loader (1999) for further discussion).

Often, practitioners will use (1.18) itself for the bandwidth. This is known as the normal reference rule-of-thumb approach since it is the optimal bandwidth for a particular family of distributions, in this case the normal family. Should the underlying distribution be close to a normal distribution, then this will provide good results, and for exploratory purposes it is certainly computationally attractive.
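Formulas (1.17) and (1.18) are easy to sketch in code. Plugging the Gaussian-kernel constants κ = 1/(2√π), κ₂ = 1 and the normal-density roughness ∫[f^{(2)}]² dx = 3/(8√π σ⁵) into (1.17) recovers the 1.06 constant of (1.18), here shown for σ = 1 (the sample in the rule-of-thumb function is whatever data the user supplies):

```python
import math

def h_opt(n, kappa, kappa2, roughness):
    """Optimal global bandwidth (1.17): h = c0 * n^(-1/5), with
    c0 = [kappa / (kappa2^2 * R)]^(1/5) and R = integral of (f'')^2."""
    return (kappa / (kappa2 ** 2 * roughness)) ** 0.2 * n ** (-0.2)

def silverman_bandwidth(sample):
    """Normal reference rule-of-thumb (1.18): 1.06 * sigma * n^(-1/5),
    with sigma replaced by the sample standard deviation."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return 1.06 * sd * n ** (-0.2)

# Gaussian-kernel constants and unit-variance normal roughness:
kappa = 1.0 / (2.0 * math.sqrt(math.pi))   # integral of k(v)^2 dv
kappa2 = 1.0                               # integral of v^2 k(v) dv
R = 3.0 / (8.0 * math.sqrt(math.pi))       # integral of (f'')^2 dx, sigma = 1
```

Here h_opt(n, kappa, kappa2, R) equals (4/3)^{1/5} n^{-1/5} ≈ 1.059 n^{-1/5}, i.e., the constant quoted as 1.06 in (1.18).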
In practice, σ is replaced by the sample standard deviation of {X_i}_{i=1}^n, while Silverman (1986, p. 47) advocates using a more robust measure

of spread which replaces σ with A, an adaptive measure of spread given by

A = min(standard deviation, interquartile range/1.34).

We now turn our attention to a discussion of a number of fully automatic or data-driven methods for selecting h that are tailored to the sample at hand.

1.3 Univariate Bandwidth Selection: Cross-Validation Methods

In both theoretical and practical settings, nonparametric kernel estimation has been established as relatively insensitive to choice of kernel function. However, the same cannot be said for bandwidth selection. Different bandwidths can generate radically differing impressions of the underlying distribution. If kernel methods are used simply for exploratory purposes, then one might undersmooth the density by choosing a small value of h and let the eye do any remaining smoothing. Alternatively, one might choose a range of values for h and plot the resulting estimates. However, for sound analysis and inference, a principle having some known optimality properties must be adopted. One can think of choosing the bandwidth as being analogous to choosing the number of terms in a series approximation; the more terms one includes in the approximation, the more flexible the resulting model becomes, while the smaller the bandwidth of a kernel estimator, the more flexible it becomes. However, increasing flexibility (reducing potential bias) necessarily leads to increased variability (increasing potential variance). Seen in this light, one naturally appreciates how a number of methods discussed below are motivated by the need to balance the squared bias and variance of the resulting estimate.

1.3.1 Least Squares Cross-Validation

Least squares cross-validation is a fully automatic data-driven method of selecting the smoothing parameter h, originally proposed by Rudemo (1982), Stone (1984) and Bowman (1984) (see also Silverman (1986)). This method is based on the principle of selecting a bandwidth that minimizes the integrated squared error of the resulting estimate, that is, it provides an optimal bandwidth tailored to all x in the support of f(x).

The integrated squared difference between f̂ and f is

∫ [f̂(x) - f(x)]² dx = ∫ f̂(x)² dx - 2 ∫ f̂(x) f(x) dx + ∫ f(x)² dx. (1.19)

As the third term on the right-hand side of (1.19) is unrelated to h, choosing h to minimize (1.19) is therefore equivalent to minimizing

∫ f̂(x)² dx - 2 ∫ f̂(x) f(x) dx (1.20)

with respect to h. In the second term, ∫ f̂(x) f(x) dx can be written as E_X[f̂(X)], where E_X(·) denotes expectation with respect to X and not with respect to the random observations {X_j} used for computing f̂(·). Therefore, we may estimate E_X[f̂(X)] by n^{-1} ∑_{i=1}^n f̂_{-i}(X_i) (i.e., replacing E_X by its sample mean), where

f̂_{-i}(X_i) = (1/((n-1)h)) ∑_{j=1, j≠i}^n k((X_i - X_j)/h) (1.21)

is the leave-one-out kernel estimator of f(X_i).³ Finally, we estimate the first term ∫ f̂(x)² dx by

∫ f̂(x)² dx = (1/(n²h²)) ∑_{i=1}^n ∑_{j=1}^n ∫ k((X_i - x)/h) k((X_j - x)/h) dx = (1/(n²h)) ∑_{i=1}^n ∑_{j=1}^n k̄((X_i - X_j)/h), (1.22)

where k̄(v) = ∫ k(u) k(v - u) du is the twofold convolution kernel derived from k(·). If k(v) = exp(-v²/2)/√(2π), a standard normal kernel, then k̄(v) = exp(-v²/4)/√(4π), a normal kernel (i.e., normal PDF) with mean zero and variance two, which follows since two independent N(0, 1) random variables sum to a N(0, 2) random variable.

³Here we emphasize that it is important to use the leave-one-out kernel estimator for computing E_X(·) above. This is because the expectations operator presumes that the X and the X_j's are independent of one another. Without using the leave-one-out estimator, the cross-validation method will break down; see Exercise 1.6 (iii).
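Combining (1.21) and (1.22) gives the least squares cross-validation objective, which the next page denotes CV_f(h). A minimal sketch for the Gaussian kernel (the grid, sample size, and seed below are our own illustrative choices; in practice a numerical search routine would replace the crude grid):

```python
import math
import random

def cv_ls(h, x):
    """Least squares cross-validation objective for a Gaussian kernel k:
    (1/(n^2 h)) sum_ij kbar((Xi-Xj)/h)
      - (2/(n(n-1)h)) sum_{i != j} k((Xi-Xj)/h),
    where kbar, the twofold convolution of k, is the N(0, 2) density."""
    n = len(x)
    def k(v):
        return math.exp(-0.5 * v * v) / math.sqrt(2 * math.pi)
    def kbar(v):
        return math.exp(-0.25 * v * v) / math.sqrt(4 * math.pi)
    term1 = sum(kbar((xi - xj) / h) for xi in x for xj in x) / (n * n * h)
    term2 = 2 * sum(k((x[i] - x[j]) / h)
                    for i in range(n) for j in range(n) if j != i) / (n * (n - 1) * h)
    return term1 - term2

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(100)]
grid = [0.05 * j for j in range(1, 41)]  # candidate h in (0, 2]
h_cv = min(grid, key=lambda h: cv_ls(h, data))
```

For standard normal data with n = 100 the minimizer lands in the neighborhood of the theoretical value 1.06 n^{-1/5} ≈ 0.42, though (as discussed shortly) the cross-validated bandwidth is quite variable from sample to sample.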

Least squares cross-validation therefore chooses h to minimize

CV_f(h) = (1/(n²h)) ∑_{i=1}^n ∑_{j=1}^n k̄((X_i - X_j)/h) - (2/(n(n-1)h)) ∑_{i=1}^n ∑_{j=1, j≠i}^n k((X_i - X_j)/h), (1.23)

which is typically undertaken using numerical search algorithms. It can be shown that the leading term of CV_f(h) is CV_{f0} given by (ignoring a term unrelated to h; see Exercise 1.6)

CV_{f0}(h) = B₁ h⁴ + κ/(nh), (1.24)

where B₁ = (κ₂²/4) ∫ [f^{(2)}(x)]² dx (κ₂ = ∫ v² k(v) dv, κ = ∫ k²(v) dv). Thus, as long as f^{(2)}(x) does not vanish for (almost) all x, we have B₁ > 0. Let h⁰ denote the value of h that minimizes CV_{f0}. Simple calculus shows that h⁰ = c⁰ n^{-1/5}, where

c⁰ = [κ/(4B₁)]^{1/5} = κ^{1/5} κ₂^{-2/5} {∫ [f^{(2)}(x)]² dx}^{-1/5}.

A comparison of h⁰ with h_opt in (1.17) reveals that the two are identical, i.e., h⁰ ≡ h_opt. This arises because h_opt minimizes ∫ E[f̂(x) - f(x)]² dx, while h⁰ minimizes E[CV_f(h)], the leading term of CV_f(h). It can be easily seen that E[CV_f(h)] + ∫ f(x)² dx is an alternative version of ∫ E[f̂(x) - f(x)]² dx; hence, E[CV_f(h)] + ∫ f(x)² dx also estimates ∫ E[f̂(x) - f(x)]² dx. Given that ∫ f(x)² dx is unrelated to h, one would expect that h⁰ and h_opt should be the same.

Let ĥ denote the value of h that minimizes CV_f(h). Given that CV_f(h) = CV_{f0}(h) + (s.o.), where (s.o.) denotes smaller order terms (than CV_{f0}) and terms unrelated to h, it can be shown that ĥ = h⁰ + o_p(h⁰), or, equivalently, that

ĥ/h⁰ → 1 in probability. (1.25)

Intuitively, (1.25) is easy to understand because CV_f(h) = CV_{f0}(h) + (s.o.), thus asymptotically an h that minimizes CV_f(h) should be

close to an h that minimizes CV_{f0}(h); therefore, we expect that ĥ and h⁰ will be close to each other in the sense of (1.25). Härdle, Hall and Marron (1988) showed that (ĥ - h⁰)/h⁰ = O_p(n^{-1/10}), which indeed converges to zero (in probability) but at an extremely slow rate.

We again underscore the need to use the leave-one-out kernel estimator when constructing CV_f as given in (1.23). If instead one were to use the standard kernel estimator, least squares cross-validation will break down, yielding ĥ = 0. Exercise 1.6 shows that if one does not use the leave-one-out kernel estimator when estimating f(X_i), then h = 0 minimizes the objective function, which of course violates the consistency condition that nh → ∞ as n → ∞.

Here we implicitly impose the restriction that f^{(2)}(x) is not a zero function, which rules out the case for which f(x) is a uniform PDF. In fact this condition can be relaxed. Stone (1984) showed that, as long as f(x) is bounded, then the least squares cross-validation method will select h optimally in the sense that

∫ [f̂(x, ĥ) - f(x)]² dx / inf_h ∫ [f̂(x, h) - f(x)]² dx → 1 almost surely, (1.26)

where f̂(x, ĥ) denotes the kernel estimator of f(x) with cross-validation selected ĥ, and f̂(x, h) is the kernel estimator with a generic h. Obviously, the ratio defined in (1.26) should be greater than or equal to one for any h. Therefore, Stone's (1984) result states that, asymptotically, cross-validated smoothing parameter selection is optimal in the sense of minimizing the estimation integrated square error. In Exercise 1.16 we further discuss the intuition underlying why ĥ → 0 even when f(x) is a uniform PDF.

1.3.2 Likelihood Cross-Validation

Likelihood cross-validation is another automatic data-driven method for selecting the smoothing parameter h. This approach yields a density estimate which has an entropy theoretic interpretation, since the estimate will be close to the actual density in a Kullback-Leibler sense. This approach was proposed by Duin (1976).

Likelihood cross-validation chooses h to maximize the (leave-one-out) log likelihood function given by

L = ln L = ∑_{i=1}^n ln f̂_{-i}(X_i),

where f̂_{-i}(X_i) is the leave-one-out kernel estimator of f(X_i) defined in (1.21). The main problem with likelihood cross-validation is that it is severely affected by the tail behavior of f(x) and can lead to inconsistent results for fat tailed distributions when using popular kernel functions (see Hall (1987a, 1987b)). For this reason the likelihood cross-validation method has elicited little interest in the statistical literature. However, the likelihood cross-validation method may work well for a range of standard (i.e., thin tailed) distributions. We consider the performance of likelihood cross-validation in Section 1.3.3, where we compare the impact of different bandwidth selection methods on the resulting density estimate, and in Section 1.13, where we consider empirical applications.

1.3.3 An Illustration of Data-Driven Bandwidth Selection

Figure 1.1 presents kernel estimates constructed from n = 500 observations drawn from a simulated bimodal distribution. The second order Gaussian (normal) kernel was used throughout, and least squares cross-validation was used to select the bandwidth ĥ_lscv for the estimate appearing in the upper left plot of the figure. We also plot the estimate based on the normal reference rule-of-thumb (ĥ_ref = 0.34) along with an undersmoothed estimate (1/5 × ĥ_lscv) and an oversmoothed estimate (5 × ĥ_lscv).⁴

Figure 1.1 reveals that least squares cross-validation appears to yield a reasonable density estimate for this data, while the reference rule-of-thumb is inappropriate as it oversmooths somewhat. Extreme oversmoothing can lead to a unimodal estimate which completely obscures the true bimodal nature of the underlying distribution. Also, undersmoothing leads to too many false modes. See Exercise 1.17 for an empirical application that investigates the effects of under- and oversmoothing on the resulting density estimate.
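The likelihood cross-validation objective of Section 1.3.2 can be sketched directly from (1.21); the small data set below is our own illustration (not the bimodal sample of Figure 1.1):

```python
import math

def log_likelihood_cv(h, x):
    """Leave-one-out log likelihood: sum_i ln f_hat_{-i}(X_i), where
    f_hat_{-i} is the leave-one-out estimator (1.21) with a Gaussian
    kernel. Likelihood cross-validation maximizes this over h."""
    n = len(x)
    def k(v):
        return math.exp(-0.5 * v * v) / math.sqrt(2 * math.pi)
    total = 0.0
    for i in range(n):
        f_loo = sum(k((x[i] - x[j]) / h)
                    for j in range(n) if j != i) / ((n - 1) * h)
        total += math.log(f_loo)
    return total

data = [-1.2, -0.5, 0.0, 0.4, 1.1, 2.0]
```

A moderate bandwidth beats both extremes here: a tiny h leaves each held-out point with almost no density (a large log-likelihood penalty), while a huge h flattens the estimate everywhere. This also hints at the tail-sensitivity problem noted above, since a single isolated observation can dominate the sum.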
1.4 Univariate CDF Estimation

In Section 1.1 we introduced the empirical CDF estimator F_n(x) given in (1.2), while Exercise 1.4 shows that it is a √n-consistent estimator

⁴Likelihood cross-validation yielded a bandwidth of ĥ_mlcv = 0.15, which results in a density estimate virtually identical to that based upon least squares cross-validation for this dataset.

[Figure 1.1 appears here: four panels (Least-Squares CV, Reference, Undersmoothed, Oversmoothed), each plotting f(x) against X.]

Figure 1.1: Univariate kernel estimates of a mixture of normals using least squares cross-validation, the normal reference rule-of-thumb, undersmoothing, and oversmoothing (n = 500). The correct parametric data generating process appears as the solid line, the kernel estimate as the dashed line.

of F(x). However, this empirical CDF F_n(x) is not smooth as it jumps by 1/n at each sample realization point. One can, however, obtain a smoothed estimate of F(x) by integrating f̂(x). Define

F̂(x) = ∫_{-∞}^x f̂(v) dv = (1/n) ∑_{i=1}^n G((x - X_i)/h), (1.27)

where G(x) = ∫_{-∞}^x k(v) dv is a CDF (which follows directly because k(·) is a PDF; see (1.10)). The next theorem provides the MSE of F̂(x).
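Both CDF estimators are one-liners. The sketch below implements the empirical CDF (1.2) and the smoothed estimator (1.27) with a Gaussian kernel, whose integral G is the standard normal CDF (available through `math.erf`); the tiny data set is our own illustration:

```python
import math

def ecdf(x, sample):
    """Empirical CDF (1.2): F_n(x) = (1/n) * #{X_i <= x}."""
    return sum(1 for xi in sample if xi <= x) / len(sample)

def smoothed_cdf(x, sample, h):
    """Smoothed estimator (1.27): F_hat(x) = (1/n) sum_i G((x - X_i)/h),
    with G the standard normal CDF (the integral of the Gaussian kernel)."""
    return sum(0.5 * (1.0 + math.erf((x - xi) / (h * math.sqrt(2.0))))
               for xi in sample) / len(sample)

data = [0.2, -1.0, 0.7, 1.5]
```

The empirical version jumps by 1/n at each observation, while the smoothed version is continuous and strictly increasing, which is precisely the motivation given in the text for integrating f̂.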

Theorem 1.2. Under conditions given in Bowman, Hall and Prvan (1998), in particular, assuming that F(x) is twice continuously differentiable, that k(v) = dG(v)/dv is bounded, symmetric, and compactly supported, that d²F(x)/dx² is Hölder-continuous, and that 0 ≤ h ≤ C n^{-ɛ} for some 0 < ɛ < 1, then as n → ∞,

MSE(F̂) = E[F̂(x) - F(x)]² = c₀(x) n^{-1} - c₁(x) h n^{-1} + c₂(x) h⁴ + o(hn^{-1} + h⁴),

where c₀(x) = F(x)(1 - F(x)), c₁(x) = 2α₀ f(x), α₀ = ∫ v G(v) k(v) dv, f(x) = dF(x)/dx, c₂(x) = [(κ₂/2) F^{(2)}(x)]², κ₂ = ∫ v² k(v) dv, and where F^{(s)}(x) = d^s F(x)/dx^s is the sth derivative of F(x).

Proof. Note that E[F̂(x)] = E[G((x - X_i)/h)]. Then we have

E[G((x - X_i)/h)] = ∫ G((x - z)/h) f(z) dz = h ∫ G(v) f(x - hv) dv = -∫ G(v) dF(x - hv)
= -[G(v) F(x - hv)]_{v=-∞}^{v=∞} + ∫ k(v) F(x - hv) dv
= ∫ k(v) [F(x) - F^{(1)}(x) hv + (1/2) F^{(2)}(x) h²v²] dv + o(h²)
= F(x) + (h²/2) κ₂ F^{(2)}(x) + o(h²), (1.28)

where at the second equality above we used the change of variable v = (x - z)/h. Also note that we did not use a Taylor expansion in ∫ G(v) F(x - hv) dv since ∫ v^m G(v) dv = +∞ for any m ≥ 0. We first used integration by parts to obtain the integral involving k(v), and then used the Taylor expansion since ∫ v^m k(v) dv is usually finite. For example, if k(v) has bounded support or k(v) is a standard normal kernel function, then ∫ |v|^m k(v) dv is finite for any m ≥ 0.

Similarly,

E[G²((x - X_i)/h)] = ∫ G²((x - z)/h) f(z) dz = h ∫ G²(v) f(x - hv) dv = -∫ G²(v) dF(x - hv)
= 2 ∫ G(v) k(v) F(x - hv) dv
= 2 ∫ G(v) k(v) [F(x) - F^{(1)}(x) hv] dv + O(h²)
= F(x) - 2α₀ h f(x) + O(h²),

where α₀ = ∫ v G(v) k(v) dv, and where we have used the fact that

2 ∫ G(v) k(v) dv = ∫ dG²(v) = G²(∞) - G²(-∞) = 1, (1.29)

because G(·) is a (user-specified) CDF kernel function. From (1.28) we have bias[F̂(x)] = (h²/2) κ₂ F^{(2)}(x) + o(h²), and from (1.28) and (1.29) we have

var[F̂(x)] = (1/n) var[G((x - X_i)/h)]
= (1/n) {E[G²((x - X_i)/h)] - (E[G((x - X_i)/h)])²}
= (1/n) F(x)[1 - F(x)] - 2α₀ f(x) h/n + o(h/n).

Hence,

E[F̂(x) - F(x)]² = [bias(F̂(x))]² + var[F̂(x)]
= (1/n) F(x)[1 - F(x)] + (h⁴/4) κ₂² [F^{(2)}(x)]² - 2α₀ f(x) h/n + o(h⁴ + h/n). (1.30)

This completes the proof of Theorem 1.2.
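Balancing the -c₁(x) h/n and c₂(x) h⁴ terms in Theorem 1.2 gives the bandwidth rate derived on the next page. A small sketch (the constants c1 and c2 below are placeholders standing in for C₁ and C₂, which depend on the unknown F):

```python
def cdf_bandwidth(n, c1, c2):
    """h0 = [c1/(4*c2)]^(1/3) * n^(-1/3): setting the derivative of
    -c1*h/n + c2*h^4 with respect to h to zero gives an n^(-1/3)
    rate, faster than the n^(-1/5) rate for PDF estimation."""
    return (c1 / (4.0 * c2)) ** (1.0 / 3.0) * n ** (-1.0 / 3.0)
```

Doubling n three times (n → 8n) halves the bandwidth, the signature of the n^{-1/3} rate.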

From Theorem 1.2 we immediately obtain the following result on the IMSE of F̂:

IMSE(F̂) = ∫ E[F̂(x) - F(x)]² dx = C₀ n^{-1} - C₁ h n^{-1} + C₂ h⁴ + o(hn^{-1} + h⁴), (1.31)

where C_j = ∫ c_j(x) dx (j = 0, 1, 2). Letting h₀ denote the value of h that minimizes the leading term of the IMSE, we obtain h₀ = a₀ n^{-1/3}, where a₀ = [C₁/(4C₂)]^{1/3}; hence the optimal smoothing parameter for estimating a univariate CDF has a faster rate of convergence than the optimal smoothing parameter for estimating a univariate PDF (n^{-1/3} versus n^{-1/5}).

With h ~ n^{-1/3}, we have h² = O(n^{-2/3}) = o(n^{-1/2}). Hence, √n[F̂(x) - F(x)] → N(0, F(x)[1 - F(x)]) in distribution by the Liapunov central limit theorem (CLT); see Theorem A.5 in Appendix A for this and a range of other useful CLTs.

As is the case for nonparametric PDF estimation, nonparametric CDF estimation has widespread potential application though it is not nearly as widely used. For instance, it can be used to test stochastic dominance without imposing parametric assumptions on the underlying CDFs; see, e.g., Barrett and Donald (2003) and Linton, Whang and Maasoumi (2005).

1.5 Univariate CDF Bandwidth Selection: Cross-Validation Methods

Bowman et al. (1998) suggest choosing h for F̂(x) by minimizing the following cross-validation function:

CV_F(h) = (1/n) ∑_{i=1}^n ∫ {1(X_i ≤ x) - F̂_{-i}(x)}² dx, (1.32)

where F̂_{-i}(x) = (n-1)^{-1} ∑_{j≠i} G((x - X_j)/h) is the leave-one-out estimator of F(x). Bowman et al. (1998) show that CV_F = E[CV_F] + (s.o.) and that

(see Exercise 1.9)

E[CV_F(h)] = ∫ F(x)[1 - F(x)] dx + (1/n) ∫ F(x)[1 - F(x)] dx - C₁ h n^{-1} + C₂ h⁴ + o(hn^{-1} + h⁴). (1.33)

We observe that (1.33) has the same leading term as IMSE(F̂) given in (1.31). Thus, asymptotically, selecting h via cross-validation leads to the same asymptotic optimality property for F̂(x) that would arise when using h₀, the optimal deterministic smoothing parameter. If we let ĥ denote the cross-validated smoothing parameter, then it can be shown that ĥ/h₀ → 1 in probability. Note that when using ĥ, the asymptotic distribution of F̂(x, ĥ) is the same as that of F̂(x, h₀) (by using a stochastic equicontinuity argument as outlined in Appendix A), that is,

√n (F̂(x) - F(x)) →d N(0, F(x)(1 - F(x))), (1.34)

where F̂(x) is defined in (1.27) with h replaced by ĥ. Note that no bias term appears in (1.34) since bias(F̂(x)) = O(h₀²) = O(n^{-2/3}) = o(n^{-1/2}), which was not the case for PDF estimation. Here the squared bias term has order smaller than the leading variance term of O(n^{-1}) (i.e., var(F̂(x)) = O(n^{-1})).

We now turn our attention to a generalization of the univariate kernel estimators developed above, namely multivariate kernel estimators. Again, we consider only the continuous case in this chapter; we tackle discrete and mixed continuous and discrete data cases in Chapters 3 and 4.

1.6 Multivariate Density Estimation

Suppose that X_1, ..., X_n constitute an i.i.d. q-vector (X_i ∈ R^q, for some q > 1) having a common PDF f(x) = f(x_1, x_2, ..., x_q). Let X_{is} denote the sth component of X_i (s = 1, ..., q). Using a product kernel function constructed from the product of univariate kernel functions, we estimate the PDF f(x) by

f̂(x) = (1/(n h_1 ⋯ h_q)) ∑_{i=1}^n K((X_i - x)/h), (1.35)

where $K\!\left(\frac{X_i - x}{h}\right) = k\!\left(\frac{X_{i1} - x_1}{h_1}\right) \times \cdots \times k\!\left(\frac{X_{iq} - x_q}{h_q}\right)$, and where $k(\cdot)$ is a univariate kernel function satisfying (1.10). The proof of MSE consistency of $\hat f(x)$ is similar to the univariate case. In particular, one can show that
$$\mathrm{bias}\left(\hat f(x)\right) = \frac{\kappa_2}{2}\sum_{s=1}^{q} h_s^2 f_{ss}(x) + O\!\left(\sum_{s=1}^{q} h_s^3\right), \quad (1.36)$$
where $f_{ss}(x)$ is the second order derivative of $f(x)$ with respect to $x_s$ and $\kappa_2 = \int v^2 k(v)\,dv$, and one can also show that
$$\mathrm{var}\left(\hat f(x)\right) = \frac{1}{n h_1 \cdots h_q}\left[\kappa^q f(x) + O\!\left(\sum_{s=1}^{q} h_s\right)\right] = O\!\left(\frac{1}{n h_1 \cdots h_q}\right), \quad (1.37)$$
where $\kappa = \int k^2(v)\,dv$. The proofs of (1.36) and (1.37), which are similar to the univariate case, are left as an exercise (see Exercise 1.11). Summarizing, we obtain the result
$$\mathrm{MSE}\left(\hat f(x)\right) = \left[\mathrm{bias}\left(\hat f(x)\right)\right]^2 + \mathrm{var}\left(\hat f(x)\right) = O\!\left(\sum_{s=1}^{q} h_s^4 + (n h_1 \cdots h_q)^{-1}\right).$$
Hence, if as $n \to \infty$, $\max_{1 \le s \le q} h_s \to 0$ and $n h_1 \cdots h_q \to \infty$, then we have $\hat f(x) \to f(x)$ in MSE, which implies that $\hat f(x) \to f(x)$ in probability. As we saw in the univariate case, the optimal smoothing parameters $h_s$ should balance the squared bias and variance terms, i.e., $h_s^4 = O\left((n h_1 \cdots h_q)^{-1}\right)$ for all $s$. Thus, we have $h_s = c_s n^{-1/(q+4)}$ for some positive constant $c_s$ ($s = 1, \dots, q$). The cross-validation methods discussed in Section 1.3 can be easily generalized to the multivariate data setting, and we can show that least squares cross-validation can optimally select the $h_s$'s in the sense outlined in Section 1.3 (see Section 1.8 below).

We briefly remark on the independence assumption invoked for the proofs presented above. Our assumption was that the data are independent across the $i$ index. Note that no restrictions were placed on the $s$ index for each component $X_{is}$ ($s = 1, \dots, q$). The product kernel is used simply for convenience, and it certainly does not require that the $X_{is}$'s

are independent across the $s$ index. In other words, the multivariate kernel density estimator (1.35) is capable of capturing general dependence among the different components of $X_i$. Furthermore, we shall relax the independence across observations assumption in Chapter 18, and will see that all of the results developed above carry over to the weakly dependent data setting.

1.7 Multivariate Bandwidth Selection: Rule-of-Thumb and Plug-In Methods

In Section 1.2 we discussed the use of the so-called normal reference rule-of-thumb and plug-in methods in a univariate setting. The generalization of the univariate normal reference rule-of-thumb to a multivariate setting is straightforward. Letting $q$ be the dimension of $X_i$, one can choose $h_s = c_s X_{s,sd}\, n^{-1/(4+q)}$ for $s = 1, \dots, q$, where $X_{s,sd}$ is the sample standard deviation of $\{X_{is}\}_{i=1}^{n}$ and $c_s$ is a positive constant. In practice one still faces the problem of how to choose $c_s$. The choice of $c_s = 1.06$ for all $s = 1, \dots, q$ is computationally attractive; however, this selection treats the different $X_{is}$'s symmetrically. In practice, should the joint PDF change rapidly in one dimension (say in $x_1$) but change slowly in another (say in $x_2$), then one should select a relatively small value of $c_1$ (hence a small $h_1$) and a relatively large value for $c_2$ ($h_2$). Unlike the cross-validation methods that we will discuss shortly, rule-of-thumb methods do not offer this flexibility.

For plug-in methods, on the other hand, the leading (squared) bias and variance terms of $\hat f(x)$ must be estimated, and then $h_1, \dots, h_q$ must be chosen to minimize the leading MSE term of $\hat f(x)$. However, the leading MSE term of $\hat f(x)$ involves the unknown $f(x)$ and its partial derivative functions, and pilot bandwidths must be selected for each variable in order to estimate these unknown functions. How to best select the initial pilot smoothing parameters can be tricky in high-dimensional settings, and the plug-in methods are not widely used in applied settings to the best of our knowledge, nor would we counsel their use other than for exploratory data analysis.
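The rule-of-thumb bandwidths and the product-kernel estimator (1.35) are straightforward to code. The sketch below is our own illustration, not code from the text: the function names are invented, and the Gaussian choice of $k(\cdot)$ is an assumption. It computes $h_s = 1.06\, X_{s,sd}\, n^{-1/(4+q)}$ and evaluates $\hat f(x)$ at a point:

```python
import numpy as np

def rule_of_thumb_bandwidths(X, c=1.06):
    """Normal reference rule-of-thumb: h_s = c * sd(X_s) * n^(-1/(4+q))."""
    n, q = X.shape
    return c * X.std(axis=0, ddof=1) * n ** (-1.0 / (4 + q))

def fhat(x, X, h):
    """Product-kernel density estimator (1.35) with a Gaussian univariate k."""
    n, q = X.shape
    u = (X - x) / h                                    # (n, q) scaled differences
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)   # univariate Gaussian kernel
    return np.prod(k, axis=1).sum() / (n * np.prod(h))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))    # n = 500 draws of a bivariate N(0, I)
h = rule_of_thumb_bandwidths(X)
estimate = fhat(np.zeros(2), X, h)   # true density at 0 is 1/(2*pi), about 0.159
```

Using the single constant $c_s = 1.06$ for every component mirrors the computationally attractive (but inflexible) symmetric choice discussed above.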

1.8 Multivariate Bandwidth Selection: Cross-Validation Methods

Least Squares Cross-Validation

The univariate least squares cross-validation method discussed in Section 1.3 can be readily generalized to the multivariate density estimation setting. Replacing the univariate kernel function in the cross-validation objective function by a multivariate product kernel, the cross-validation objective function now becomes
$$CV_f(h_1, \dots, h_q) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \bar K_h(X_i, X_j) - \frac{2}{n(n-1)}\sum_{i=1}^{n}\sum_{j \ne i} K_h(X_i, X_j), \quad (1.38)$$
where
$$\bar K_h(X_i, X_j) = \prod_{s=1}^{q} h_s^{-1}\,\bar k\!\left(\frac{X_{is} - X_{js}}{h_s}\right), \qquad K_h(X_i, X_j) = \prod_{s=1}^{q} h_s^{-1}\, k\!\left(\frac{X_{is} - X_{js}}{h_s}\right),$$
and $\bar k(v)$ is the twofold convolution kernel based upon $k(\cdot)$ (i.e., $\bar k(v) = \int k(u) k(v - u)\,du$), where $k(\cdot)$ is a univariate kernel function satisfying (1.10). Exercise 1.12 shows that the leading term of $CV_f(h_1, \dots, h_q)$ is given by (ignoring a term unrelated to the $h_s$'s)
$$CV_{f0}(h_1, \dots, h_q) = \int\left[\sum_{s=1}^{q} B_s(x) h_s^2\right]^2 dx + \frac{\kappa^q}{n h_1 \cdots h_q}, \quad (1.39)$$
where $B_s(x) = (\kappa_2/2) f_{ss}(x)$. Defining $a_s$ via $h_s = a_s n^{-1/(q+4)}$ ($s = 1, \dots, q$), we have
$$CV_{f0}(h_1, \dots, h_q) = n^{-4/(q+4)}\,\chi_f(a_1, \dots, a_q), \quad (1.40)$$
where
$$\chi_f(a_1, \dots, a_q) = \int\left[\sum_{s=1}^{q} B_s(x) a_s^2\right]^2 dx + \frac{\kappa^q}{a_1 \cdots a_q}. \quad (1.41)$$
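A direct implementation of (1.38) is a useful check on one's understanding. The sketch below is our own illustration, not code from the text: it assumes Gaussian univariate kernels, for which the twofold convolution $\bar k = k * k$ is simply the $N(0, 2)$ density, and it selects a common bandwidth by grid search:

```python
import numpy as np

def cv_ls(X, h):
    """Least squares CV objective (1.38) with Gaussian product kernels.
    The twofold convolution of the N(0,1) kernel k is the N(0,2) density."""
    n, q = X.shape
    D = (X[:, None, :] - X[None, :, :]) / h                    # (n, n, q) differences
    Kbar = np.prod(np.exp(-0.25 * D ** 2) / np.sqrt(4 * np.pi), axis=2)
    K = np.prod(np.exp(-0.5 * D ** 2) / np.sqrt(2 * np.pi), axis=2)
    term1 = Kbar.sum() / (n ** 2 * np.prod(h))                 # first double sum
    off_diag = K.sum() - np.trace(K)                           # drop the j == i terms
    term2 = 2.0 * off_diag / (n * (n - 1) * np.prod(h))        # leave-one-out sum
    return term1 - term2

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
grid = np.linspace(0.15, 1.5, 28)
scores = [cv_ls(X, np.array([g, g])) for g in grid]
h_cv = grid[int(np.argmin(scores))]    # cross-validated common bandwidth
```

Here both components share one bandwidth for simplicity; letting each $h_s$ vary separately and minimizing over a multidimensional grid (or with a numerical optimizer) recovers the component-specific flexibility discussed in Section 1.7.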

Let the $a_s^0$'s be the values of the $a_s$'s that minimize $\chi_f(a_1, \dots, a_q)$. Under the same conditions used in the univariate case and, in addition, assuming that $f_{ss}(x)$ is not a zero function for all $s$, Li and Zhou (2005) show that each $a_s^0$ is uniquely defined, positive, and finite (see Exercise 1.10). Let $h_1^0, \dots, h_q^0$ denote the values of $h_1, \dots, h_q$ that minimize $CV_{f0}$. Then from (1.40) we know that $h_s^0 = a_s^0 n^{-1/(q+4)} = O\left(n^{-1/(q+4)}\right)$.

Exercise 1.12 shows that $CV_{f0}$ is also the leading term of $E[CV_f]$. Therefore, the nonstochastic smoothing parameters $h_s^0$ can be interpreted as optimal smoothing parameters that minimize the leading term of the IMSE. Let $\hat h_1, \dots, \hat h_q$ denote the values of $h_1, \dots, h_q$ that minimize $CV_f$. Using the fact that $CV_f = CV_{f0} + (s.o.)$, we can show that $\hat h_s = h_s^0 + o_p(h_s^0)$. Thus, we have
$$\frac{\hat h_s - h_s^0}{h_s^0} = \frac{\hat h_s}{h_s^0} - 1 \to 0 \text{ in probability, for } s = 1, \dots, q. \quad (1.42)$$
Therefore, smoothing parameters selected via cross-validation have the same asymptotic optimality properties as the nonstochastic optimal smoothing parameters. Note that if $f_{ss}(x) = 0$ almost everywhere (a.e.) for some $s$, then $B_s = 0$ and the above result does not hold. Stone (1984) shows that the cross-validation method still selects $h_1, \dots, h_q$ optimally in the sense that the integrated estimation square error is minimized; see also Ouyang et al. (2006) for a more detailed discussion of this case.

Likelihood Cross-Validation

Likelihood cross-validation for multivariate models follows directly via (multivariate) maximization of the likelihood function outlined in Section 1.3.2, hence we do not go into further details here. However, we do point out that, though straightforward to implement, it suffers from the same defects outlined for the univariate case in the presence of fat tail distributions (i.e., it has a tendency to oversmooth in such situations).

1.9 Asymptotic Normality of Density Estimators

In this section we show that $\hat f(x)$ has an asymptotic normal distribution. The most popular CLT is the Lindeberg-Levy CLT given in

Theorem A.3 of Appendix A, which states that $n^{1/2}\left[n^{-1}\sum_{i=1}^{n} Z_i\right] \to N(0, \sigma^2)$ in distribution, provided that the $Z_i$ are i.i.d. $(0, \sigma^2)$. Though the Lindeberg-Levy CLT can be used to derive the asymptotic distribution of various semiparametric estimators discussed in Chapters 7, 8, and 9, it cannot be used to derive the asymptotic distribution of $\hat f(x)$. This is because $\hat f(x) = n^{-1}\sum_i Z_{i,n}$, where the summand $Z_{i,n} = K_h(X_i, x)$ depends on $n$ (since $h = h(n)$). We shall make use of the Liapunov CLT given in Theorem A.5 of Appendix A.

Theorem 1.3. Let $X_1, \dots, X_n$ be i.i.d. $q$-vectors whose PDF $f(\cdot)$ has three-times bounded continuous derivatives. Let $x$ be an interior point of the support of $X$. If, as $n \to \infty$, $h_s \to 0$ for all $s = 1, \dots, q$, $n h_1 \cdots h_q \to \infty$, and $(n h_1 \cdots h_q)^{1/2}\sum_{s=1}^{q} h_s^3 \to 0$, then
$$\sqrt{n h_1 \cdots h_q}\left[\hat f(x) - f(x) - \frac{\kappa_2}{2}\sum_{s=1}^{q} h_s^2 f_{ss}(x)\right] \stackrel{d}{\to} N(0, \kappa^q f(x)).$$

Proof. Using (1.36) and (1.37), one can easily show that $\sqrt{n h_1 \cdots h_q}\left[\hat f(x) - f(x) - \frac{\kappa_2}{2}\sum_{s=1}^{q} h_s^2 f_{ss}(x)\right]$ has asymptotic mean zero and asymptotic variance $\kappa^q f(x)$, i.e.,
$$\begin{aligned}
&\sqrt{n h_1 \cdots h_q}\left[\hat f(x) - f(x) - \frac{\kappa_2}{2}\sum_{s=1}^{q} h_s^2 f_{ss}(x)\right] \\
&\quad= \sqrt{n h_1 \cdots h_q}\left[\hat f(x) - E\hat f(x)\right] + \sqrt{n h_1 \cdots h_q}\left[E\hat f(x) - f(x) - \frac{\kappa_2}{2}\sum_{s=1}^{q} h_s^2 f_{ss}(x)\right] \\
&\quad= \sqrt{n h_1 \cdots h_q}\left[\hat f(x) - E\hat f(x)\right] + O\!\left(\sqrt{n h_1 \cdots h_q}\,\sum_{s=1}^{q} h_s^3\right) \quad \text{(by (1.36))} \\
&\quad= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{1}{(h_1 \cdots h_q)^{1/2}}\left[K\!\left(\frac{X_i - x}{h}\right) - E\,K\!\left(\frac{X_i - x}{h}\right)\right] + o(1) \\
&\quad\equiv \frac{1}{\sqrt{n}}\sum_{i=1}^{n} Z_{n,i} + o(1) \stackrel{d}{\to} N(0, \kappa^q f(x)),
\end{aligned}$$

by Liapunov's CLT, provided we can verify that Liapunov's CLT condition (A.1) holds, where
$$Z_{n,i} = (h_1 \cdots h_q)^{-1/2}\left[K\!\left(\frac{X_i - x}{h}\right) - E\,K\!\left(\frac{X_i - x}{h}\right)\right]$$
and
$$\sigma_{n,i}^2 \stackrel{\text{def}}{=} \mathrm{var}(Z_{n,i}) = \kappa^q f(x) + o(1)$$
by (1.37). Pagan and Ullah (1999, p. 40) show that (A.1) holds under the conditions given in Theorem 1.3. The condition that $\int |k(v)|^{2+\delta}\,dv < \infty$ for some $\delta > 0$ used in Pagan and Ullah is implied by our assumption that $k(v)$ is nonnegative and bounded, and that $\int k(v)\,dv = 1$, because $\int k(v)^{2+\delta}\,dv \le C \int k(v)\,dv = C$ is finite, where $C = \sup_v k(v)^{1+\delta}$.

1.10 Uniform Rates of Convergence

Up to now we have demonstrated only the case of pointwise and IMSE consistency (which implies consistency in probability). In this section we generalize pointwise consistency in order to obtain a stronger uniform consistency result. We will prove that nonparametric kernel estimators are uniformly almost surely consistent and derive their uniform almost sure rate of convergence. Almost sure convergence implies convergence in probability; however, the converse is not true, i.e., convergence in probability may not imply convergence almost surely; see Serfling (1980) for specific examples.

We have already established pointwise consistency for an interior point in the support of $X$. However, it turns out that popular kernel functions such as (1.9) may not lead to consistent estimation of $f(x)$ when $x$ is at the boundary of its support, hence we need to exclude the boundary ranges when considering the uniform convergence rate. This highlights an important aspect of kernel estimation in general, and a number of kernel estimators introduced in later sections are motivated by the desire to mitigate such boundary effects.

We first show that when $x$ is at (or near) the boundary of its support, $\hat f(x)$ may not be a consistent estimator of $f(x)$. Consider the case where $X$ is univariate having bounded support. For simplicity we assume that $X \in [0, 1]$. The pointwise consistency result $\hat f(x) - f(x) = o_p(1)$ obtained earlier requires that $x$ lie in the

interior of its support. Exercise 1.13 shows that, for $x$ at the boundary of its support, $\mathrm{MSE}(\hat f(x))$ may not be $o(1)$. Therefore, some modifications may be needed to consistently estimate $f(x)$ for $x$ at the boundary of its support. Typical modifications include the use of boundary kernels or data reflection (see Gasser and Müller (1979), Hall and Wehrly (1991), and Scott (1992)).

By way of example, consider the case where $x$ lies on its lowermost boundary, i.e., $x = 0$, hence $\hat f(0) = (nh)^{-1}\sum_{i=1}^{n} K((X_i - 0)/h)$. Exercise 1.13 shows that for this case, $E[\hat f(0)] = f(0)/2 + O(h)$. Therefore, $\mathrm{bias}[\hat f(0)] = E[\hat f(0)] - f(0) = -f(0)/2 + O(h)$, which will not converge to zero when $f(0) > 0$. In the literature, various boundary kernels are proposed to overcome the boundary (bias) problem. For example, a simple boundary corrected kernel is given by (assuming that $X \in [0, 1]$)
$$\bar k(x, y) = \begin{cases} h^{-1} k\!\left(\frac{y - x}{h}\right) \Big/ \int_{-x/h}^{\infty} k(v)\,dv & \text{if } x \in [0, h) \\[4pt] h^{-1} k\!\left(\frac{y - x}{h}\right) & \text{if } x \in [h, 1 - h] \\[4pt] h^{-1} k\!\left(\frac{y - x}{h}\right) \Big/ \int_{-\infty}^{(1 - x)/h} k(v)\,dv & \text{if } x \in (1 - h, 1], \end{cases} \quad (1.43)$$
where $k(\cdot)$ is a second order kernel satisfying (1.10). Now, we estimate $f(x)$ by
$$\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} \bar k(x, X_i), \quad (1.44)$$
where $\bar k(x, X_i)$ is defined in (1.43). Exercise 1.14 shows that the above boundary corrected kernel successfully overcomes the boundary problem.

We now establish the uniform almost sure convergence rate of $\hat f(x) - f(x)$ for $x \in S$, where $S$ is a bounded set excluding the boundary range of the support of $X$. In the above example, where the support of $x$ is $[0, 1]$, we can choose $S = [\epsilon, 1 - \epsilon]$ for arbitrarily small positive $\epsilon$ ($0 < \epsilon < 1/2$). We assume that $f(x)$ is bounded below by a positive constant on $S$.

Theorem 1.4. Under smoothness conditions on $f(\cdot)$ given in Masry (1996b), and also assuming that $\inf_{x \in S} f(x) \ge \delta > 0$, we have
$$\sup_{x \in S}\left|\hat f(x) - f(x)\right| = O\!\left(\left(\frac{\ln(n)}{n h_1 \cdots h_q}\right)^{1/2} + \sum_{s=1}^{q} h_s^2\right) \text{ almost surely.}$$
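The boundary corrected kernel (1.43) is easy to implement when $k(\cdot)$ is Gaussian, since the normalizing integrals are Gaussian CDF values: $\int_{-x/h}^{\infty} k(v)\,dv = \Phi(x/h)$ and $\int_{-\infty}^{(1-x)/h} k(v)\,dv = \Phi((1-x)/h)$. The sketch below is our own illustration (the Gaussian choice of $k$ is an assumption, and the three branches of (1.43) are merged into a single renormalization by the kernel mass inside $[0, 1]$); it contrasts the uncorrected and corrected estimators at the boundary point $x = 0$ for data uniform on $[0, 1]$:

```python
import numpy as np
from math import erf, sqrt

def Phi(v):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(v / sqrt(2.0)))

def k(v):
    """Second order Gaussian kernel."""
    return np.exp(-0.5 * v ** 2) / np.sqrt(2.0 * np.pi)

def fhat_plain(x, X, h):
    """Uncorrected estimator: biased near the boundaries of [0, 1]."""
    return np.mean(k((X - x) / h)) / h

def fhat_boundary(x, X, h):
    """Boundary corrected estimator (1.44): renormalize by the kernel mass
    falling inside [0, 1], which merges the three branches of (1.43)."""
    mass = Phi((1.0 - x) / h) - Phi(-x / h)
    return fhat_plain(x, X, h) / mass

rng = np.random.default_rng(2)
X = rng.uniform(size=2000)            # f is uniform on [0, 1], so f(0) = 1
h = 0.1
plain = fhat_plain(0.0, X, h)         # expected value is roughly f(0)/2 = 0.5
corrected = fhat_boundary(0.0, X, h)  # roughly f(0) = 1
```

At $x = 0$ only half the Gaussian kernel's mass lies inside the support, so the uncorrected estimate is pulled toward $f(0)/2$ while the corrected estimate recovers $f(0)$.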

A detailed proof of Theorem 1.4 is given in Section 1.12. Since almost sure convergence implies convergence in probability, the uniform rate also holds in probability, i.e., under the same conditions as in Theorem 1.4, we have
$$\sup_{x \in S}\left|\hat f(x) - f(x)\right| = O_p\!\left(\left(\frac{\ln(n)}{n h_1 \cdots h_q}\right)^{1/2} + \sum_{s=1}^{q} h_s^2\right).$$
Using the results of (1.36) and (1.37), we can establish the following uniform MSE rate.

Theorem 1.5. Assuming that $f(x)$ is twice differentiable with bounded second derivatives, then we have
$$\sup_{x \in S} E\left\{\left[\hat f(x) - f(x)\right]^2\right\} = O\!\left(\sum_{s=1}^{q} h_s^4 + (n h_1 \cdots h_q)^{-1}\right).$$

Proof. This follows from (1.36) and (1.37), by noting that $\sup_{x \in S} f(x)$ and $\sup_{x \in S} |f_{ss}(x)|$ are both finite ($s = 1, \dots, q$).

Note that although convergence in MSE implies convergence in probability, one cannot derive the uniform convergence rate in probability from Theorem 1.5. This is because
$$E\left\{\sup_{x \in S}\left[\hat f(x) - f(x)\right]^2\right\} \ne \sup_{x \in S} E\left[\hat f(x) - f(x)\right]^2$$
and
$$P\left[\sup_{x \in S}\left|\hat f(x) - f(x)\right| > \epsilon\right] \ne \sup_{x \in S} P\left[\left|\hat f(x) - f(x)\right| > \epsilon\right].$$
The $\sup$ and the $E(\cdot)$ or $P(\cdot)$ operators do not commute with one another.

Cheng (1997) proposes alternative (local linear) density estimators that achieve automatic boundary corrections and enjoy some typical optimality properties. Cheng also suggests a data-based bandwidth selector (in the spirit of plug-in rules), and demonstrates that the bandwidth selector is very efficient regardless of whether there are nonsmooth boundaries in the support of the density.
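A quick simulation illustrates the uniform convergence in Theorem 1.4: as $n$ grows (with $h \propto n^{-1/5}$), the worst-case error over an interior set $S$ shrinks. This is our own illustrative check, not code from the text; the grid, the Gaussian kernel, and the constants are arbitrary choices:

```python
import numpy as np

def sup_error(n, rng):
    """Sup-norm error of a Gaussian-kernel density estimate of the N(0,1)
    density over the interior set S = [-2, 2]."""
    X = rng.standard_normal(n)
    h = 1.06 * n ** (-0.2)                 # h proportional to n^(-1/5)
    grid = np.linspace(-2.0, 2.0, 81)      # evaluation points in S
    u = (X[None, :] - grid[:, None]) / h
    fh = (np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)).mean(axis=1) / h
    f = np.exp(-0.5 * grid ** 2) / np.sqrt(2 * np.pi)   # true density on S
    return np.max(np.abs(fh - f))

rng = np.random.default_rng(3)
# average over a few replications so the comparison is not noise-driven
err_small = np.mean([sup_error(100, rng) for _ in range(5)])
err_large = np.mean([sup_error(10000, rng) for _ in range(5)])
```

With one hundred times as many observations, the average sup-norm error over $S$ drops markedly, consistent with the $((\ln n)/(nh))^{1/2} + h^2$ rate.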

1.11 Higher Order Kernel Functions

Recall that decreasing the bandwidth lowers the bias of a kernel estimator but increases its variance. Higher order kernel functions are devices used for bias reduction which are also capable of reducing the MSE of the resulting estimator. Many popular kernel functions such as the one defined in (1.10) are called second order kernels. The order of a kernel, $\nu$ ($\nu > 0$), is defined as the order of the first nonzero moment. For example, if $\int u k(u)\,du = 0$ but $\int u^2 k(u)\,du \ne 0$, then $k(\cdot)$ is said to be a second order kernel ($\nu = 2$). A general $\nu$th order kernel ($\nu$ is an integer) must therefore satisfy the following conditions:
$$\text{(i) } \int k(u)\,du = 1, \qquad \text{(ii) } \int u^l k(u)\,du = 0 \ (l = 1, \dots, \nu - 1), \qquad \text{(iii) } \int u^{\nu} k(u)\,du = \kappa_{\nu} \ne 0. \quad (1.45)$$
Obviously, when $\nu = 2$, (1.45) collapses to (1.10). If one replaces the second order kernel in $\hat f(x)$ of (1.35) by a $\nu$th order kernel function, then as was the case when using a second order kernel, under the assumption that $f(x)$ is $\nu$th order differentiable, and assuming that the $h_s$'s all have the same order of magnitude, one can show that
$$\mathrm{bias}\left(\hat f(x)\right) = O\!\left(\sum_{s=1}^{q} h_s^{\nu}\right) \quad (1.46)$$
and
$$\mathrm{var}\left(\hat f(x)\right) = O\!\left((n h_1 \cdots h_q)^{-1}\right) \quad (1.47)$$
(see Exercise 1.15). Hence, we have
$$\mathrm{MSE}\left(\hat f(x)\right) = O\!\left(\sum_{s=1}^{q} h_s^{2\nu} + (n h_1 \cdots h_q)^{-1}\right) \quad (1.48)$$
and
$$\hat f(x) - f(x) = O_p\!\left(\sum_{s=1}^{q} h_s^{\nu} + (n h_1 \cdots h_q)^{-1/2}\right).$$
Thus, by using a $\nu$th order kernel function ($\nu > 2$), one can reduce the order of the bias of $\hat f(x)$ from $O\left(\sum_{s=1}^{q} h_s^2\right)$ to $O\left(\sum_{s=1}^{q} h_s^{\nu}\right)$,

and the optimal value of $h_s$ may once again be obtained by balancing the squared bias and the variance, giving $h_s = O\left(n^{-1/(2\nu + q)}\right)$, while the rate of convergence is now $\hat f(x) - f(x) = O_p\left(n^{-\nu/(2\nu + q)}\right)$. Assuming that $f(x)$ is differentiable up to any finite order, then one can choose $\nu$ to be sufficiently large, and the resulting rate can be made arbitrarily close to $O_p(n^{-1/2})$. Note, however, that for $\nu > 2$, no nonnegative kernel exists that satisfies (1.45). This means that, necessarily, we have to assign negative weights to some range of the data, which implies that one may get negative density estimates, clearly an undesirable side effect. Furthermore, in finite-sample applications nonnegative second order kernels have often been found to yield more stable estimation results than their higher order counterparts. Therefore, higher order kernel functions are mainly used for theoretical purposes; for example, to achieve a $\sqrt{n}$-rate of convergence for some finite dimensional parameter in a semiparametric model, one often has to use high order kernel functions (see Chapter 7 for such an example).

Higher order kernel functions are quite easy to construct. Assuming that $k(u)$ is symmetric around zero (see footnote 5), i.e., $k(u) = k(-u)$, then $\int u^{2m+1} k(u)\,du = 0$ for all positive integers $m$. By way of example, in order to construct a simple fourth order kernel (i.e., $\nu = 4$), one could begin with, say, a second order kernel such as the standard normal kernel, set up a polynomial in its argument, and solve for the roots of the polynomial subject to the desired moment constraints. For example, letting $\phi(u) = (2\pi)^{-1/2}\exp(-u^2/2)$ be the second order Gaussian kernel, we could begin with the polynomial
$$k(u) = (a + b u^2)\,\phi(u), \quad (1.49)$$
where $a$ and $b$ are two constants which must satisfy the requirements of a fourth order kernel. Letting $k(u)$ satisfy (1.45) with $\nu = 4$ ($\int u^l k(u)\,du = 0$ for $l = 1, 3$ because $k(u)$ is an even function), we therefore only require $\int k(u)\,du = 1$ and $\int u^2 k(u)\,du = 0$. From these two restrictions, one can easily obtain the result $a = 3/2$ and $b = -1/2$.
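One can verify numerically that $k(u) = (3/2 - u^2/2)\,\phi(u)$ indeed satisfies the fourth order conditions in (1.45). The check below is our own sketch; it approximates the moments by a Riemann sum. For this kernel, $\kappa_4 = \int u^4 k(u)\,du = (3/2)\cdot 3 - (1/2)\cdot 15 = -3 \ne 0$:

```python
import numpy as np

def k4(u):
    """Fourth order Gaussian kernel (1.49) with a = 3/2, b = -1/2."""
    return (1.5 - 0.5 * u ** 2) * np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

# Riemann-sum approximation of the moment conditions in (1.45)
u = np.linspace(-10.0, 10.0, 400001)
du = u[1] - u[0]
m0 = (k4(u) * du).sum()             # condition (i):   should equal 1
m2 = (u ** 2 * k4(u) * du).sum()    # condition (ii):  should equal 0
m4 = (u ** 4 * k4(u) * du).sum()    # condition (iii): kappa_4 = -3, nonzero
```

The same check applied to the sixth order kernels below confirms that their moments up to order five vanish while the sixth moment does not.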
For readers requiring some higher order kernel functions, we provide a few examples based on the second order Gaussian and Epanechnikov kernels, perhaps the two most popular kernels in applied nonparametric estimation. As noted, the fourth order univariate Gaussian kernel

5. Typically, only symmetric kernel functions are used in practice, though see Abadir and Lawford (2004) for recent work involving optimal asymmetric kernels.

is given by the formula
$$k(u) = \left(\frac{3}{2} - \frac{1}{2}u^2\right)\frac{1}{\sqrt{2\pi}}\exp(-u^2/2),$$
while the sixth order univariate Gaussian kernel is given by
$$k(u) = \left(\frac{15}{8} - \frac{5}{4}u^2 + \frac{1}{8}u^4\right)\frac{1}{\sqrt{2\pi}}\exp(-u^2/2).$$
The second order univariate Epanechnikov kernel is the optimal kernel based on a calculus of variations solution to minimizing the IMSE of the kernel estimator (see Serfling (1980)). The univariate second order Epanechnikov kernel is given by the formula
$$k(u) = \begin{cases} \frac{3}{4\sqrt{5}}\left(1 - \frac{1}{5}u^2\right) & \text{if } u^2 < 5 \\ 0 & \text{otherwise,} \end{cases}$$
the fourth order univariate Epanechnikov kernel by
$$k(u) = \begin{cases} \frac{3}{4\sqrt{5}}\left(\frac{15}{8} - \frac{7}{8}u^2\right)\left(1 - \frac{1}{5}u^2\right) & \text{if } u^2 < 5 \\ 0 & \text{otherwise,} \end{cases}$$
while the sixth order univariate Epanechnikov kernel is given by
$$k(u) = \begin{cases} \frac{3}{4\sqrt{5}}\left(\frac{175}{64} - \frac{105}{32}u^2 + \frac{231}{320}u^4\right)\left(1 - \frac{1}{5}u^2\right) & \text{if } u^2 < 5 \\ 0 & \text{otherwise.} \end{cases}$$

Figure 1.2 plots the second, fourth, and sixth order Epanechnikov kernels defined above. Clearly, for $\nu > 2$, the kernels indeed assign negative weights, which can result in negative density estimates, not a desirable feature. For related work involving exact mean integrated squared error for higher order kernels in the context of univariate kernel density estimation, see Hansen (2005). Also, for related work using iterative methods to estimate transformation-kernel densities, see Yang and Marron (1999) and Yang (2000).

1.12 Proof of Theorem 1.4 (Uniform Almost Sure Convergence)

The proof below is based on the arguments presented in Masry (1996b), who establishes uniform almost sure rates for local polynomial regression with weakly dependent ($\alpha$-mixing) data; see Chapter 18 for further details on weakly dependent processes. Since the bias of the kernel


More information

Expectation and Variance of a random variable

Expectation and Variance of a random variable Chapter 11 Expectatio ad Variace of a radom variable The aim of this lecture is to defie ad itroduce mathematical Expectatio ad variace of a fuctio of discrete & cotiuous radom variables ad the distributio

More information

Stochastic Simulation

Stochastic Simulation Stochastic Simulatio 1 Itroductio Readig Assigmet: Read Chapter 1 of text. We shall itroduce may of the key issues to be discussed i this course via a couple of model problems. Model Problem 1 (Jackso

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5 CS434a/54a: Patter Recogitio Prof. Olga Veksler Lecture 5 Today Itroductio to parameter estimatio Two methods for parameter estimatio Maimum Likelihood Estimatio Bayesia Estimatio Itroducto Bayesia Decisio

More information

Frequentist Inference

Frequentist Inference Frequetist Iferece The topics of the ext three sectios are useful applicatios of the Cetral Limit Theorem. Without kowig aythig about the uderlyig distributio of a sequece of radom variables {X i }, for

More information

Let us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f.

Let us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f. Lecture 5 Let us give oe more example of MLE. Example 3. The uiform distributio U[0, ] o the iterval [0, ] has p.d.f. { 1 f(x =, 0 x, 0, otherwise The likelihood fuctio ϕ( = f(x i = 1 I(X 1,..., X [0,

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

DEGENERACY AND ALL THAT

DEGENERACY AND ALL THAT DEGENERACY AND ALL THAT Te Nature of Termodyamics, Statistical Mecaics ad Classical Mecaics Termodyamics Te study of te equilibrium bulk properties of matter witi te cotext of four laws or facts of experiece

More information

On the convergence, consistence and stability of a standard finite difference scheme

On the convergence, consistence and stability of a standard finite difference scheme AMERICAN JOURNAL OF SCIENTIFIC AND INDUSTRIAL RESEARCH 2, Sciece Huβ, ttp://www.sciub.org/ajsir ISSN: 253-649X, doi:.525/ajsir.2.2.2.74.78 O te covergece, cosistece ad stabilit of a stadard fiite differece

More information

1 Introduction to reducing variance in Monte Carlo simulations

1 Introduction to reducing variance in Monte Carlo simulations Copyright c 010 by Karl Sigma 1 Itroductio to reducig variace i Mote Carlo simulatios 11 Review of cofidece itervals for estimatig a mea I statistics, we estimate a ukow mea µ = E(X) of a distributio by

More information

Tests of Hypotheses Based on a Single Sample (Devore Chapter Eight)

Tests of Hypotheses Based on a Single Sample (Devore Chapter Eight) Tests of Hypotheses Based o a Sigle Sample Devore Chapter Eight MATH-252-01: Probability ad Statistics II Sprig 2018 Cotets 1 Hypothesis Tests illustrated with z-tests 1 1.1 Overview of Hypothesis Testig..........

More information

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss ECE 90 Lecture : Complexity Regularizatio ad the Squared Loss R. Nowak 5/7/009 I the previous lectures we made use of the Cheroff/Hoeffdig bouds for our aalysis of classifier errors. Hoeffdig s iequality

More information

Element sampling: Part 2

Element sampling: Part 2 Chapter 4 Elemet samplig: Part 2 4.1 Itroductio We ow cosider uequal probability samplig desigs which is very popular i practice. I the uequal probability samplig, we ca improve the efficiecy of the resultig

More information

Discrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22

Discrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22 CS 70 Discrete Mathematics for CS Sprig 2007 Luca Trevisa Lecture 22 Aother Importat Distributio The Geometric Distributio Questio: A biased coi with Heads probability p is tossed repeatedly util the first

More information

Chapter 6 Infinite Series

Chapter 6 Infinite Series Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat

More information

CONCENTRATION INEQUALITIES

CONCENTRATION INEQUALITIES CONCENTRATION INEQUALITIES MAXIM RAGINSKY I te previous lecture, te followig result was stated witout proof. If X 1,..., X are idepedet Beroulliθ radom variables represetig te outcomes of a sequece of

More information

A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS

A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS J. Japa Statist. Soc. Vol. 41 No. 1 2011 67 73 A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS Yoichi Nishiyama* We cosider k-sample ad chage poit problems for idepedet data i a

More information

Pattern Classification, Ch4 (Part 1)

Pattern Classification, Ch4 (Part 1) Patter Classificatio All materials i these slides were take from Patter Classificatio (2d ed) by R O Duda, P E Hart ad D G Stork, Joh Wiley & Sos, 2000 with the permissio of the authors ad the publisher

More information

FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING. Lectures

FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING. Lectures FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING Lectures MODULE 5 STATISTICS II. Mea ad stadard error of sample data. Biomial distributio. Normal distributio 4. Samplig 5. Cofidece itervals

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

Econ 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara

Econ 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara Poit Estimator Eco 325 Notes o Poit Estimator ad Cofidece Iterval 1 By Hiro Kasahara Parameter, Estimator, ad Estimate The ormal probability desity fuctio is fully characterized by two costats: populatio

More information

Introductory statistics

Introductory statistics CM9S: Machie Learig for Bioiformatics Lecture - 03/3/06 Itroductory statistics Lecturer: Sriram Sakararama Scribe: Sriram Sakararama We will provide a overview of statistical iferece focussig o the key

More information

Lecture 20: Multivariate convergence and the Central Limit Theorem

Lecture 20: Multivariate convergence and the Central Limit Theorem Lecture 20: Multivariate covergece ad the Cetral Limit Theorem Covergece i distributio for radom vectors Let Z,Z 1,Z 2,... be radom vectors o R k. If the cdf of Z is cotiuous, the we ca defie covergece

More information

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014. Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the

More information

On Exact Finite-Difference Scheme for Numerical Solution of Initial Value Problems in Ordinary Differential Equations.

On Exact Finite-Difference Scheme for Numerical Solution of Initial Value Problems in Ordinary Differential Equations. O Exact Fiite-Differece Sceme for Numerical Solutio of Iitial Value Problems i Ordiar Differetial Equatios. Josua Suda, M.Sc. Departmet of Matematical Scieces, Adamawa State Uiversit, Mubi, Nigeria. E-mail:

More information

Computation Of Asymptotic Distribution For Semiparametric GMM Estimators

Computation Of Asymptotic Distribution For Semiparametric GMM Estimators Computatio Of Asymptotic Distributio For Semiparametric GMM Estimators Hideiko Icimura Departmet of Ecoomics Uiversity College Lodo Cemmap UCL ad IFS April 9, 2004 Abstract A set of su ciet coditios for

More information

Empirical Processes: Glivenko Cantelli Theorems

Empirical Processes: Glivenko Cantelli Theorems Empirical Processes: Gliveko Catelli Theorems Mouliath Baerjee Jue 6, 200 Gliveko Catelli classes of fuctios The reader is referred to Chapter.6 of Weller s Torgo otes, Chapter??? of VDVW ad Chapter 8.3

More information

Elements of Statistical Methods Lots of Data or Large Samples (Ch 8)

Elements of Statistical Methods Lots of Data or Large Samples (Ch 8) Elemets of Statistical Methods Lots of Data or Large Samples (Ch 8) Fritz Scholz Sprig Quarter 2010 February 26, 2010 x ad X We itroduced the sample mea x as the average of the observed sample values x

More information

TR/46 OCTOBER THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION A. TALBOT

TR/46 OCTOBER THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION A. TALBOT TR/46 OCTOBER 974 THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION by A. TALBOT .. Itroductio. A problem i approximatio theory o which I have recetly worked [] required for its solutio a proof that the

More information

Parameter, Statistic and Random Samples

Parameter, Statistic and Random Samples Parameter, Statistic ad Radom Samples A parameter is a umber that describes the populatio. It is a fixed umber, but i practice we do ot kow its value. A statistic is a fuctio of the sample data, i.e.,

More information

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random Part III. Areal Data Aalysis 0. Comparative Tests amog Spatial Regressio Models While the otio of relative likelihood values for differet models is somewhat difficult to iterpret directly (as metioed above),

More information

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.

More information

Large Sample Theory. Convergence. Central Limit Theorems Asymptotic Distribution Delta Method. Convergence in Probability Convergence in Distribution

Large Sample Theory. Convergence. Central Limit Theorems Asymptotic Distribution Delta Method. Convergence in Probability Convergence in Distribution Large Sample Theory Covergece Covergece i Probability Covergece i Distributio Cetral Limit Theorems Asymptotic Distributio Delta Method Covergece i Probability A sequece of radom scalars {z } = (z 1,z,

More information

1 Covariance Estimation

1 Covariance Estimation Eco 75 Lecture 5 Covariace Estimatio ad Optimal Weightig Matrices I this lecture, we cosider estimatio of the asymptotic covariace matrix B B of the extremum estimator b : Covariace Estimatio Lemma 4.

More information

Notes On Nonparametric Density Estimation. James L. Powell Department of Economics University of California, Berkeley

Notes On Nonparametric Density Estimation. James L. Powell Department of Economics University of California, Berkeley Notes O Noarametric Desity Estimatio James L. Powell Deartmet of Ecoomics Uiversity of Califoria, Berkeley Uivariate Desity Estimatio via Numerical Derivatives Cosider te roblem of estimatig te desity

More information

5. Likelihood Ratio Tests

5. Likelihood Ratio Tests 1 of 5 7/29/2009 3:16 PM Virtual Laboratories > 9. Hy pothesis Testig > 1 2 3 4 5 6 7 5. Likelihood Ratio Tests Prelimiaries As usual, our startig poit is a radom experimet with a uderlyig sample space,

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2016 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio

More information

ECE 330:541, Stochastic Signals and Systems Lecture Notes on Limit Theorems from Probability Fall 2002

ECE 330:541, Stochastic Signals and Systems Lecture Notes on Limit Theorems from Probability Fall 2002 ECE 330:541, Stochastic Sigals ad Systems Lecture Notes o Limit Theorems from robability Fall 00 I practice, there are two ways we ca costruct a ew sequece of radom variables from a old sequece of radom

More information

Notes for Lecture 11

Notes for Lecture 11 U.C. Berkeley CS78: Computatioal Complexity Hadout N Professor Luca Trevisa 3/4/008 Notes for Lecture Eigevalues, Expasio, ad Radom Walks As usual by ow, let G = (V, E) be a udirected d-regular graph with

More information

Chapter 10: Power Series

Chapter 10: Power Series Chapter : Power Series 57 Chapter Overview: Power Series The reaso series are part of a Calculus course is that there are fuctios which caot be itegrated. All power series, though, ca be itegrated because

More information

STA Object Data Analysis - A List of Projects. January 18, 2018

STA Object Data Analysis - A List of Projects. January 18, 2018 STA 6557 Jauary 8, 208 Object Data Aalysis - A List of Projects. Schoeberg Mea glaucomatous shape chages of the Optic Nerve Head regio i aimal models 2. Aalysis of VW- Kedall ati-mea shapes with a applicatio

More information

1 Approximating Integrals using Taylor Polynomials

1 Approximating Integrals using Taylor Polynomials Seughee Ye Ma 8: Week 7 Nov Week 7 Summary This week, we will lear how we ca approximate itegrals usig Taylor series ad umerical methods. Topics Page Approximatig Itegrals usig Taylor Polyomials. Defiitios................................................

More information

11 THE GMM ESTIMATION

11 THE GMM ESTIMATION Cotets THE GMM ESTIMATION 2. Cosistecy ad Asymptotic Normality..................... 3.2 Regularity Coditios ad Idetificatio..................... 4.3 The GMM Iterpretatio of the OLS Estimatio.................

More information

Mathematical Statistics - MS

Mathematical Statistics - MS Paper Specific Istructios. The examiatio is of hours duratio. There are a total of 60 questios carryig 00 marks. The etire paper is divided ito three sectios, A, B ad C. All sectios are compulsory. Questios

More information

Chapter 5. Inequalities. 5.1 The Markov and Chebyshev inequalities

Chapter 5. Inequalities. 5.1 The Markov and Chebyshev inequalities Chapter 5 Iequalities 5.1 The Markov ad Chebyshev iequalities As you have probably see o today s frot page: every perso i the upper teth percetile ears at least 1 times more tha the average salary. I other

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

1 Inferential Methods for Correlation and Regression Analysis

1 Inferential Methods for Correlation and Regression Analysis 1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet

More information

Chapter 2 The Monte Carlo Method

Chapter 2 The Monte Carlo Method Chapter 2 The Mote Carlo Method The Mote Carlo Method stads for a broad class of computatioal algorithms that rely o radom sampligs. It is ofte used i physical ad mathematical problems ad is most useful

More information

Introduction to Extreme Value Theory Laurens de Haan, ISM Japan, Erasmus University Rotterdam, NL University of Lisbon, PT

Introduction to Extreme Value Theory Laurens de Haan, ISM Japan, Erasmus University Rotterdam, NL University of Lisbon, PT Itroductio to Extreme Value Theory Laures de Haa, ISM Japa, 202 Itroductio to Extreme Value Theory Laures de Haa Erasmus Uiversity Rotterdam, NL Uiversity of Lisbo, PT Itroductio to Extreme Value Theory

More information

Asymptotic Results for the Linear Regression Model

Asymptotic Results for the Linear Regression Model Asymptotic Results for the Liear Regressio Model C. Fli November 29, 2000 1. Asymptotic Results uder Classical Assumptios The followig results apply to the liear regressio model y = Xβ + ε, where X is

More information