Cross-Validation in Function Estimation


Chong Gu
October 1, 2006

Cross-validation is an intuitive and effective technique for model selection in data analysis. In this discussion, I try to present a few incarnations of the general technique in a few nonparametric function estimation settings. Justifications of the technique in Gaussian regression settings will be discussed, along with possible reasons for the lack of similar justification in other settings. There will be discussions of some subtle conceptual issues which put certain widely adopted concepts and practices under scrutiny.

1 Cross-Validation and Related Techniques

1.1 PRESS and $C_p$

Consider a linear regression model with $P-1$ predictors $X_1, \dots, X_{P-1}$,
\[ Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_{P-1} X_{P-1} + \epsilon, \]
where $\epsilon \sim N(0, \sigma^2)$. It is assumed that $\mu_Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_{P-1} X_{P-1}$, but some of the $\beta_j X_j$'s may contribute very little or not at all. For model selection in this setting (a.k.a. variable selection), two effective techniques are PRESS and $C_p$.

Observing $(Y_i, X_{i,1}, \dots, X_{i,P-1})$, $i = 1, \dots, n$, one may calculate PRESS (Predicted REsidual Sum of Squares) for each of the $(2^{P-1} - 1)$ possible models,
\[ \mathrm{PRESS} = \sum_{i=1}^n (Y_i - \hat{Y}_{i(i)})^2, \]
where $\hat{Y}_{i(i)} = \hat\beta_0^{[i]} + \hat\beta_1^{[i]} X_{i,1} + \cdots + \hat\beta_{P-1}^{[i]} X_{i,P-1}$ and the $\hat\beta_j^{[i]}$'s are LS estimates using the $(n-1)$ observations excluding $(Y_i, X_{i,1}, \dots, X_{i,P-1})$. One can then choose the model with the minimum PRESS score. PRESS is probably the first incarnation of cross-validation, and the objective of model selection is to achieve more precise prediction.

For each of the $(2^{P-1} - 1)$ possible models, one may also calculate the $C_p$ statistic,
\[ C_p = \frac{\mathrm{SSE}}{\mathrm{MSE}(X_1, \dots, X_{P-1})} - (n - 2p), \]
where SSE is for the model under consideration and $p$ is its number of coefficients. One selects a model with a small $C_p$ (so that the MSE $E(\hat\mu - \mu)^2$ is small) that is close to $p$ (so that $E\hat\mu \approx \mu$). The objective of $C_p$-based selection is the estimation precision of $\mu_Y$ but not the identification of the correct model, as the full model is assumed to be correct to start with.
The estimation of negligible $\beta_j$'s diverts resources and thus results in higher variance of the important $\hat\beta_j$'s, and the elimination of the minor terms, while introducing bias in $\hat\mu$, helps to reduce the variance.
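The two scores of Section 1.1 can be sketched in a few lines. The following is a minimal illustration, not taken from the original note: the function name `press_and_cp` and the simulated data are hypothetical, and PRESS is computed with the standard least-squares shortcut $Y_i - \hat{Y}_{i(i)} = (Y_i - \hat{Y}_i)/(1 - h_{ii})$ instead of refitting $n$ times.

```python
import numpy as np
from itertools import combinations


def press_and_cp(X, y):
    """Return {subset: (PRESS, C_p)} for every non-empty subset of the columns of X.

    X holds the P-1 candidate predictors (no intercept column); an intercept
    is added to every model.  Hypothetical helper, for illustration only.
    """
    n, k = X.shape
    ones = np.ones((n, 1))

    # MSE of the full model, the variance estimate in the C_p denominator.
    full = np.hstack([ones, X])
    resid_full = y - full @ np.linalg.lstsq(full, y, rcond=None)[0]
    mse_full = resid_full @ resid_full / (n - k - 1)

    out = {}
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            D = np.hstack([ones, X[:, list(cols)]])
            H = D @ np.linalg.pinv(D)            # hat matrix of this model
            resid = y - H @ y
            # Delete-one residuals without refitting: e_i / (1 - h_ii).
            press = np.sum((resid / (1.0 - np.diag(H))) ** 2)
            p = D.shape[1]                       # number of coefficients
            cp = resid @ resid / mse_full - (n - 2 * p)
            out[cols] = (press, cp)
    return out


# Simulated example: only the first of three predictors matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=50)
table = press_and_cp(X, y)
best_by_press = min(table, key=lambda s: table[s][0])
```

By construction the full model always has $C_p = p$ exactly, which is why a small $C_p$ close to $p$ is only informative for the sub-models.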

1.2 Cross-validation, GCV, and $C_L$

Consider a regression model $Y = \eta(x) + \epsilon$ where $\epsilon \sim N(0, \sigma^2)$ and $\eta(x)$ is smooth. Two of the popular approaches to the nonparametric estimation of $\eta(x)$ are the kernel method and the penalized least squares method. To keep things simple, consider only univariate $x$.

Observing $(Y_i, X_i)$, $i = 1, \dots, n$, one may estimate $\eta(x)$ via
\[ \hat\eta_h(x) = \frac{1}{nh} \sum_{i=1}^n Y_i K\Big(\frac{X_i - x}{h}\Big), \]
where $K(\cdot)$ is a given kernel function. Typically, $K(x)$ is unimodal and symmetric with respect to 0, $\int K(x)\,dx = 1$, and $\int x^j K(x)\,dx = 0$ for $j = 1, \dots, m$. This provides a family of estimates indexed by the bandwidth $h$, with smaller $h$ yielding smaller bias but larger variance and larger $h$ yielding smaller variance but larger bias.

Assuming $\int (\eta^{(m)}(x))^2 dx < \infty$, one may estimate $\eta(x)$ using the minimizer $\eta_\lambda(x)$ of
\[ \frac{1}{n} \sum_{i=1}^n (Y_i - \eta(X_i))^2 + \lambda \int (\eta^{(m)}(x))^2 dx, \tag{1} \]
where the smoothing parameter $\lambda$ plays a role similar to that of $h$ for kernel estimates. With $\lambda = \infty$ one forces a parametric model of the form $\eta(x) = \beta_0 + \beta_1 x + \cdots + \beta_{m-1} x^{m-1}$. With $\lambda = 0+$ one obtains an interpolant with minimum $\int (\eta^{(m)}(x))^2 dx$. The solution is called a smoothing (natural) spline, as $\eta_\lambda$ is a piecewise polynomial of order $2m - 1$.

Evaluating the estimated $\eta(x)$ at the data points $X_i$, one gets the predicted values $\hat{Y}_i$. For the kernel estimates, smoothing spline estimates, and other nonparametric estimates of $\eta(x)$ known as linear smoothers, one has $\hat{Y} = AY$, where $A$ is the smoothing matrix indexed by $h$ or $\lambda$ or the like. In practice, one needs to select $h$ or $\lambda$, for which the methods in the section title are designed; the smoothing matrix $A$ plays an important role here.

Write $\hat{Y}_i = \eta_\lambda(X_i)$ with $\eta_\lambda(x)$ the minimizer of (1), and $\hat{Y}_{i(i)} = \eta_\lambda^{[i]}(X_i)$, where $\eta_\lambda^{[i]}(x)$ minimizes the delete-one version of (1),
\[ \frac{1}{n} \sum_{j \ne i} (Y_j - \eta(X_j))^2 + \lambda \int (\eta^{(m)}(x))^2 dx. \tag{2} \]
It can be shown that $\eta_\lambda^{[i]}(x)$ is the minimizer of (1) with $\eta_\lambda^{[i]}(X_i)$ replacing $Y_i$, thus $a_{i,i}(Y_i - \hat{Y}_{i(i)}) = \hat{Y}_i - \hat{Y}_{i(i)}$, or $Y_i - \hat{Y}_{i(i)} = (Y_i - \hat{Y}_i)/(1 - a_{i,i})$. This leads to the ordinary cross-validation score
\[ V_0(\lambda) = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat{Y}_{i(i)})^2 = \frac{1}{n} \sum_{i=1}^n \frac{(Y_i - \hat{Y}_i)^2}{(1 - a_{i,i})^2}. \tag{3} \]
An invariance argument suggests the replacement of the $a_{i,i}$ by their average value $\mathrm{trace}A/n$, yielding the generalized cross-validation (GCV) of Craven and Wahba (1979),
\[ V(\lambda) = \frac{n^{-1} Y^T (I - A(\lambda))^2 Y}{\{n^{-1}\,\mathrm{trace}(I - A(\lambda))\}^2}. \tag{4} \]
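The delete-one identity behind (3) and the GCV score (4) can be checked numerically. The sketch below is an illustration under a stated assumption: instead of the smoothing spline of (1), it uses a discrete stand-in with a second-difference penalty on the fitted values (so $m = 2$ and $A = (I + n\lambda D^T D)^{-1}$), for which the same delete-one argument applies. The function names are hypothetical.

```python
import numpy as np


def smoother_matrix(n, lam):
    """A(lam) for the discrete stand-in: minimize (1/n)||y - eta||^2 + lam ||D eta||^2."""
    D = np.diff(np.eye(n), n=2, axis=0)      # (n-2) x n second-difference operator
    return np.linalg.inv(np.eye(n) + n * lam * D.T @ D)


def ocv_gcv(y, lam):
    """Ordinary cross-validation (3) via the 1/(1 - a_ii) shortcut, and GCV (4)."""
    n = len(y)
    A = smoother_matrix(n, lam)
    resid = y - A @ y
    ocv = np.mean((resid / (1.0 - np.diag(A))) ** 2)
    gcv = np.mean(resid ** 2) / (np.trace(np.eye(n) - A) / n) ** 2
    return ocv, gcv


def ocv_brute_force(y, lam):
    """Same OCV score, refitting the delete-one problem (2) for every i."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)
    P = n * lam * D.T @ D
    total = 0.0
    for i in range(n):
        E = np.eye(n)
        E[i, i] = 0.0                        # observation i carries no weight
        eta_i = np.linalg.solve(E + P, E @ y)
        total += (y[i] - eta_i[i]) ** 2
    return total / n


rng = np.random.default_rng(0)
y = np.sin(np.linspace(0.0, 3.0, 40)) + 0.3 * rng.normal(size=40)
shortcut, gcv = ocv_gcv(y, 1e-3)
refit = ocv_brute_force(y, 1e-3)             # agrees with `shortcut` up to rounding
```

The shortcut needs one fit per value of $\lambda$, whereas the brute-force version needs $n$ fits, which is the practical reason the identity matters.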

Note that although the derivation of GCV is through (1), it can be, and is, used for all linear smoothers. Closely related to GCV is the $C_L$ score, which assumes a known $\sigma^2$,
\[ U(\lambda) = \frac{1}{n} Y^T (I - A(\lambda))^2 Y + \frac{2}{n} \sigma^2\,\mathrm{trace}A(\lambda). \tag{5} \]

1.3 Optimality of $U(\lambda)$ and $V(\lambda)$

Before presenting the theoretical justification of $C_L$ and GCV, let me try to clarify a few conceptual issues.

In parametric statistics, one has a discrete collection of tentative models and assumes that the correct model is among them, and a model selection method is consistent if it zeroes in on this correct model as $n \to \infty$. The consistency property is not prediction-oriented as in the design of PRESS, nor is it finite-sample MSE-oriented as in the practical use of $C_p$; with $n \to \infty$ for a fixed number of parameters, variance is of no concern and all that matters is the bias. [I am not familiar with this line of literature, so I do not know if there are results when none of the tentative models is correct, or whether one could formulate the problem in such a way that the number of parameters increases with $n$ so that the bias-variance trade-off remains relevant.]

In nonparametric function estimation, the family of tentative estimates forms a trajectory in the function space, and one could not and does not assume that the true function is on the trajectory. Instead, one looks for the estimate that is closest to the true function in some sense. The objective of such model selection is not to identify the correct model (there is none), but rather to locate the best estimate given the observed data. The bias-variance trade-off is the central issue here (though not always explicitly), and as $n \to \infty$, the optimal choice would have smaller and smaller bias.

For Gaussian regression, a natural measure of estimation precision is the MSE on the data points,
\[ L(\lambda) = \frac{1}{n} \sum_{i=1}^n (\hat\eta_\lambda(X_i) - \eta(X_i))^2, \]
and the estimate on the trajectory that minimizes this MSE would be the optimal choice. Note that the trajectory is dependent on the observed data, and so are the MSE loss and its minimizer. The $C_L$ score of (5) is actually an unbiased estimate of relative loss, with "relative" meaning the dropping of terms that do not depend on $\lambda$.
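The optimality of $U(\lambda)$ and $V(\lambda)$ was established by Ker-Chau Li (1986): let $\lambda^*$, $\lambda_u$, and $\lambda_v$ be the minimizers of $L(\lambda)$, $U(\lambda)$, and $V(\lambda)$, respectively, then
\[ \frac{L(\lambda^*)}{L(\lambda_u)} \to_p 1, \qquad \frac{L(\lambda^*)}{L(\lambda_v)} \to_p 1. \tag{6} \]
The key results leading to these are that as $\lambda \to 0$ at certain rates that contain the optimal one,
\[ U(\lambda) - L(\lambda) - n^{-1}\epsilon^T\epsilon = o_p(L(\lambda)), \qquad V(\lambda) - L(\lambda) - n^{-1}\epsilon^T\epsilon = o_p(L(\lambda)). \]
The main condition for these is that $nR(\lambda) \to \infty$, where $R(\lambda) = E[L(\lambda)]$, which states that the parametric $\sqrt{n}$-consistency is not achievable in the setting. Note that $U(\lambda) \asymp V(\lambda) \asymp_p 1$ and $L(\lambda) = o_p(1)$, so these are delicate results. For the record, as $n \to \infty$ and $\lambda \to 0$, $L(\lambda) = O_p(\lambda + n^{-1}\lambda^{-1/2m})$ for the minimizer of (1).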
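The claim that the $C_L$ score (5) is unbiased for the relative loss can be verified directly for a fixed $\lambda$. The derivation below is a worked sketch under two assumptions that hold for the minimizer of (1) but are not spelled out in the text: $A = A(\lambda)$ is symmetric, and $Y = \eta + \epsilon$ with $\eta = (\eta(X_1), \dots, \eta(X_n))^T$ and $E[\epsilon\epsilon^T] = \sigma^2 I$.

```latex
\begin{align*}
n L(\lambda) &= \|AY - \eta\|^2
  = \eta^T (I-A)^2 \eta - 2\,\eta^T (I-A) A\,\epsilon + \epsilon^T A^2 \epsilon, \\
Y^T (I-A)^2 Y &= \eta^T (I-A)^2 \eta + 2\,\eta^T (I-A)^2 \epsilon
  + \epsilon^T (I-A)^2 \epsilon, \\
U(\lambda) - L(\lambda) - n^{-1}\epsilon^T\epsilon
  &= \frac{2}{n}\,\eta^T (I-A)\,\epsilon
   - \frac{2}{n}\bigl(\epsilon^T A\,\epsilon - \sigma^2\,\mathrm{trace}A\bigr).
\end{align*}
```

Both terms on the last line have mean zero, since $E[\epsilon] = 0$ and $E[\epsilon^T A \epsilon] = \sigma^2\,\mathrm{trace}A$, so $E[U(\lambda)] = E[L(\lambda)] + \sigma^2$ for each fixed $\lambda$. The last line also shows which two terms have to be controlled relative to $L(\lambda)$ in the key result for $U(\lambda)$ quoted above.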

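1.4 Cross-validation for density estimation

Consider a density estimation problem with independent samples $X_i \sim f(x)$, $i = 1, \dots, n$. To estimate $f(x)$, one may use the kernel estimate,
\[ f_h(x) = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{X_i - x}{h}\Big), \]
where $K(\cdot)$ is as given earlier. To assess the performance of $f_h(x)$ as an estimate of $f(x)$, one may use the Kullback-Leibler loss $L(h) = E_f[\log\{f(X)/f_h(X)\}]$, which, after dropping the term $E_f[\log f(X)]$ not involving $h$, reduces to the relative KL discrepancy $-E_f[\log f_h(X)]$. Estimating the relative KL by a cross-validated sample mean, one has the KL cross-validation score,
\[ V(h) = -\frac{1}{n} \sum_{i=1}^n \log f_h^{[i]}(X_i), \tag{7} \]
where $f_h^{[i]}(x)$ is based on the $(n-1)$ samples excluding $X_i$. It was shown by Peter Hall (1987) that if the tails of the kernel $K(x)$ are no thinner than the tails of $f(x)$, then
\[ \frac{L(h^*)}{L(h_v)} \to_p 1, \tag{8} \]
where $h^*$ and $h_v$ minimize $L(h)$ and $V(h)$, respectively.

Parallel to (1), one may assume a finite support $\mathcal{X}$ for $f(x)$ and employ the logistic density transform $f(x) = e^{\eta(x)} / \int_{\mathcal{X}} e^{\eta}$, and estimate $\eta(x)$ by the minimizer $\eta_\lambda(x)$ of the penalized likelihood functional,
\[ -\frac{1}{n} \sum_{i=1}^n \Big\{ \eta(X_i) - \log \int_{\mathcal{X}} e^{\eta(x)} \Big\} + \frac{\lambda}{2} \int (\eta^{(m)}(x))^2 dx. \tag{9} \]
The Kullback-Leibler distance is now
\[ L(\lambda) = E_f[\log\{f(X)/f_\lambda(X)\}] = E_f[\eta(X) - \eta_\lambda(X)] + \Big\{ \log \int_{\mathcal{X}} e^{\eta_\lambda(x)} - \log \int_{\mathcal{X}} e^{\eta(x)} \Big\}, \tag{10} \]
with a relative KL discrepancy
\[ \log \int_{\mathcal{X}} e^{\eta_\lambda(x)} - E_f[\eta_\lambda(X)], \]
where the first term can be computed and the second term can be estimated through a cross-validated sample mean, $\frac{1}{n} \sum_{i=1}^n \eta_\lambda^{[i]}(X_i)$, with $\eta_\lambda^{[i]}(x)$ minimizing the delete-one version of (9),
\[ -\frac{1}{n-1} \sum_{j \ne i} \Big\{ \eta(X_j) - \log \int_{\mathcal{X}} e^{\eta(x)} \Big\} + \frac{\lambda}{2} \int (\eta^{(m)}(x))^2 dx. \]
This yields a cross-validation score
\[ V(\lambda) = \log \int_{\mathcal{X}} e^{\eta_\lambda(x)} - \frac{1}{n} \sum_{i=1}^n \eta_\lambda^{[i]}(X_i). \tag{11} \]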

While empirical results strongly suggest optimality similar to that established by Li (1986) and Hall (1987), attempts at the theoretical analysis have not been successful. It can be shown that the symmetrized KL loss,
\[ E_f[\log\{f(X)/f_\lambda(X)\}] - E_{f_\lambda}[\log\{f(X)/f_\lambda(X)\}], \]
which roughly doubles the KL loss $L(\lambda)$ of (10), is of the order $O_p(\lambda + n^{-1}\lambda^{-1/2m})$, so the minimum KL loss is of order $O_p(n^{-2m/(1+2m)}) = o_p(n^{-1/2})$. On the other hand, the estimation of $E_f[g(X)]$ through the sample mean is at best of the order $O_p(n^{-1/2})$. One seems to need more delicate term grouping for any success at a theoretical analysis of $V(\lambda)$.

1.5 Cross-validation for non-Gaussian regression and hazard estimation

For non-Gaussian regression and hazard estimation, scores similar to (11) were derived following similar lines. The empirical performances of these scores suggest optimality properties similar to (6) and (8), but theoretical analysis is lacking.

1.6 Cross-validation for regression with correlated errors

For regression with correlated data, two scenarios have been treated in two dissertations by my former students, Ping Ma and Chun Han, respectively.

The first thesis was by Ping Ma, which concerns mixed-effect/variance-component models of the form
\[ Y = \eta(x) + z^T b + \epsilon, \]
where $b \sim N(0, B)$ and $\epsilon$ i.i.d. $N(0, \sigma^2)$, with the dimension $p$ of $b$ much smaller than $n$ and its variance-covariance matrix partly or entirely unknown; the error variance-covariance matrices are low-rank modifications of $\sigma^2 I$. To estimate $\eta(x)$, one may minimize
\[ \frac{1}{n} \sum_{i=1}^n (Y_i - \eta(X_i) - z_i^T b)^2 + \lambda \int (\eta^{(m)}(x))^2 dx + \frac{1}{n} b^T \Sigma b, \]
where $(\lambda, \Sigma)$ are tuning parameters; $\Sigma$ should reflect the structure of $B^{-1}$. The joint estimation of $(\eta(x), b)$ yields $\hat{Y} = \hat\eta(x) + z^T \hat{b}$, and one still has an expression $\hat{Y} = A(\lambda, \Sigma) Y$. The selection of $(\lambda, \Sigma)$ can be done using $C_L$ and GCV, and we were able to establish the optimality of such practice similar to (6), with the losses
\[ L_1(\lambda, \Sigma) = \frac{1}{n} \sum_{i=1}^n (\hat{Y}_i - \eta(X_i) - z_i^T b)^2 \quad \text{and} \quad L_2(\lambda, \Sigma) = \frac{1}{n} (\hat\eta - \eta)^T P_Z^{\perp} (\hat\eta - \eta), \]
where $\eta^T = (\eta(X_1), \dots, \eta(X_n))$, $P_Z^{\perp} = I - Z(Z^T Z)^{-1} Z^T$, and $Z^T = (z_1, \dots, z_n)$.
Note that we are not concerned with the estimation of $B$ in this exercise, but only use $\Sigma$ (or $B^{-1}$) as tuning parameters for the estimation of $\eta(x)$ or $\eta(x) + z^T b$. For the optimality with respect to $L_1$, one needs $p = O(\ldots)$. For $L_2$, one needs $p$ fixed. Similar practice was applied to non-Gaussian regression with satisfactory empirical performance, but theory is lacking.

The second thesis was by Chun Han, which concerns stationary time series models or mixed-effect models with the dimension of $b$ growing with $n$; the error variance-covariance matrices are no longer low-rank modifications of $\sigma^2 I$. Formally, one has
\[ Y_i = \eta(X_i) + \epsilon_i, \qquad \epsilon \sim N(0, \sigma^2 W^{-1}), \]
where $W$ is known up to a set of parameters $\gamma$. One may estimate $\eta(x)$ via the minimization of
\[ (Y - \eta)^T W (Y - \eta) + \lambda \int (\eta^{(m)}(x))^2 dx, \]
with $(\lambda, \gamma)$ as tuning parameters. For the joint selection of $(\lambda, \gamma)$, we derived the $C_L$-like score
\[ U(\lambda, \gamma) = \frac{1}{n\sigma^2} Y^T W^{1/2} (I - A)^2 W^{1/2} Y - \frac{1}{n} \log|W| + \frac{2}{n}\,\mathrm{trace}A, \]
where $W^{1/2} \hat{Y} = A W^{1/2} Y$, and the GCV-like score
\[ V(\lambda, \gamma) = \log\Big\{ \frac{1}{n} Y^T W^{1/2} (I - A)^2 W^{1/2} Y \Big\} - \frac{1}{n} \log|W| + \frac{2\,\mathrm{trace}A}{n - \mathrm{trace}A}. \]
Remember that $W$ depends on $\gamma$ and $A$ depends on $(\lambda, \gamma)$. Using the Kullback-Leibler loss for the joint estimation of $(\eta_0(x), \gamma_0)$ by $(\eta(x), \gamma)$,
\[ L(\lambda, \gamma) = E_0\Big[ \frac{1}{2\sigma^2} (Y - \eta)^T W (Y - \eta) - \frac{1}{2} \log|W| - \frac{1}{2\sigma^2} (Y - \eta_0)^T W_0 (Y - \eta_0) + \frac{1}{2} \log|W_0| \Big] \]
\[ = \frac{1}{2\sigma^2} (\eta_0 - \eta)^T W (\eta_0 - \eta) + \frac{1}{2}\,\mathrm{tr}(W W_0^{-1} - I) - \frac{1}{2} \log|W W_0^{-1}|, \]
the optimality similar to (6) was established for $U$ and $V$, and the resulting $\hat\gamma$ is consistent. A key condition for the theory is that $W^{-1} > cI$ for some $c > 0$ uniformly over the tentative $\gamma$. The theory applies to the standard stationary and invertible ARMA models with $\gamma$ in a compact set, and to a mixed-effect model with growing $p$.

1.7 References

PRESS is in nearly every textbook on regression analysis and linear models, though I do not know exactly who first proposed it. To read about $C_p$ and $C_L$, check out the classical reference of Mallows.

Mallows, C. L. (1973). Some comments on $C_P$, Technometrics 15.

For the motivation and derivation of GCV along with some attempt at the theoretical justification of $U(\lambda)$ and $V(\lambda)$, check out the seminal work of Craven and Wahba (1979); the risk-based theory does not really justify its practical use, however. The really relevant loss-based justification of $U(\lambda)$ and $V(\lambda)$ was by Li (1986).

Craven, P. and G. Wahba (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation, Numer. Math. 31.

Li, K.-C. (1986). Asymptotic optimality of $C_L$ and generalized cross-validation in the ridge regression with application to spline smoothing, Ann. Statist. 14.

Kernel density estimation and the associated bandwidth selection was once a business by itself. The work of Hall (1987) for the justification of the KL-based cross-validation score (7) was an important contribution, but it was taken by many as negative on the KL distance as performance measure and negative on CV as bandwidth selector, and partly inspired the developments of the so-called plug-in methods for bandwidth selection; I consider this an unfortunate turn of events.

Hall, P. (1987). On Kullback-Leibler loss and density estimation, Ann. Statist. 15.

The implementation and empirical performance of (11) can be found in my joint work with Jingyuan Wang and in Chapter 6 of my book. Discussions concerning non-Gaussian regression and hazard estimation are in Chapters 5 and 7 of my book.

Gu, C. and J. Wang (2003). Penalized likelihood density estimation: Direct cross-validation and scalable approximation, Statist. Sin. 13.

Gu, C. (2002). Smoothing Spline ANOVA Models. New York: Springer-Verlag.

Details concerning cross-validation for regression with correlated errors are to be found in the following articles.

Gu, C. and P. Ma (2005). Optimal smoothing in nonparametric mixed-effect models, Ann. Statist. 33.

Gu, C. and P. Ma (2005). Generalized nonparametric mixed-effect models: Computation and smoothing parameter selection, J. Comput. Graph. Statist. 14.

Han, C. and C. Gu (2006). Optimal smoothing with correlated data, manuscript.

2 Problems (Real and Perceived), Concepts, and Controversies

2.1 A simple simulation setting

We will use a simple simulation setting to illustrate some of the issues to be discussed. The issues are not specific to this particular setting, as similar simulations in other settings demonstrate the same qualitative characteristics.

On $x_i = (i - .5)/n$, $i = 1, \dots, n$, we generate 100 replicates of data from
\[ Y_i = \eta(x_i) + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \]
with $\eta(x) = \sin(2\pi x - \pi)$ and $\sigma^2 = 1$. For $\lambda$ on a fine grid of $\log_{10}\lambda = (-5)(.05)(-1)$, we calculate the minimizers of (1) with $m = 2$ for each of the replicates and evaluate various quantities associated with them. The grid was broad enough to bracket the $\lambda$ of interest for all the 100 replicates.
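A simulation along the lines of Section 2.1 can be sketched as follows. This is an illustration only: the cubic smoothing spline of (1) is replaced here by a discrete stand-in whose penalty $n^3 \|D\eta\|^2$ (with $D$ the second-difference operator) approximates $\int_0^1 (\eta^{(2)}(x))^2 dx$ on the equally spaced design, so the numbers it produces are not those behind the figures of the note. The seed and variable names are arbitrary.

```python
import numpy as np

n, n_rep, sigma = 100, 100, 1.0
x = (np.arange(1, n + 1) - 0.5) / n               # x_i = (i - .5)/n
eta = np.sin(2 * np.pi * x - np.pi)               # true function
log10_lam = np.linspace(-5.0, -1.0, 81)           # the grid (-5)(.05)(-1)

# Discrete stand-in for the roughness penalty (an assumption of this sketch).
D = np.diff(np.eye(n), n=2, axis=0)
R = n ** 3 * D.T @ D
# The design is fixed, so one smoothing matrix per lambda serves all replicates.
smoothers = [np.linalg.inv(np.eye(n) + n * lam * R) for lam in 10.0 ** log10_lam]
traces = [np.trace(A) for A in smoothers]

rng = np.random.default_rng(2006)
loss = np.empty((n_rep, len(log10_lam)))
gcv = np.empty_like(loss)
for r in range(n_rep):
    y = eta + sigma * rng.normal(size=n)
    for k, A in enumerate(smoothers):
        fit = A @ y
        loss[r, k] = np.mean((fit - eta) ** 2)                            # L(lambda)
        gcv[r, k] = np.mean((y - fit) ** 2) / (1 - traces[k] / n) ** 2    # V(lambda) of (4)

rows = np.arange(n_rep)
idx_star = loss.argmin(axis=1)                    # loss-optimal lambda per replicate
idx_v = gcv.argmin(axis=1)                        # GCV choice per replicate
efficacy = loss[rows, idx_star] / loss[rows, idx_v]   # min L / L(lambda_v), at most 1
```

The per-replicate `efficacy` is the quantity compared across methods in this part of the note; `log10_lam[idx_star]` and `log10_lam[idx_v]` give the two sets of selected indices.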
2.2 Undersmoothing and modifications of cross-validation

One problem suffered by cross-validation methods is undersmoothing: in up to 10% of the cases, the methods lead to very small $\lambda$ or $h$ or the like, resulting in undersmoothing or even interpolation. The problem does not seem to go away with larger $n$, at least not for $n$ up to 500.

An alternative to $V(\lambda)$ of (4) is the so-called generalized maximum likelihood (GML) method derived under the empirical Bayes interpretation of smoothing splines, which is simply the restricted maximum likelihood (REML) method for mixed-effect/variance component models. The GML score is given by
\[ M(\lambda) = \frac{n^{-1} Y^T (I - A(\lambda)) Y}{|I - A(\lambda)|_+^{1/(n-m)}}, \]
where $|I - A|_+$ is the product of the $n - m$ positive eigenvalues of $(I - A)$. The GML method never interpolates, but consistently undersmoothes for $\eta(x)$ super smooth.

A simple modification seems to cure the undersmoothing problem for cross-validation. For $V(\lambda)$ of (4), one may use
\[ V_\alpha(\lambda) = \frac{n^{-1} Y^T (I - A(\lambda))^2 Y}{\{n^{-1}\,\mathrm{trace}(I - \alpha A(\lambda))\}^2}, \tag{12} \]
where $\alpha > 1$, and for $V(\lambda)$ of (11), one may use
\[ V_\alpha(\lambda) = \log \int_{\mathcal{X}} e^{\eta_\lambda(x)} - \frac{1}{n} \sum_{i=1}^n \eta_\lambda(X_i) + \frac{\alpha}{n} \sum_{i=1}^n \{\eta_\lambda(X_i) - \eta_\lambda^{[i]}(X_i)\}. \tag{13} \]
Simulation studies suggest that an $\alpha$ in the range … would be the most effective.

In the setting of 2.1, one may calculate the loss $L(\lambda)$ as well as the selection scores $V_\alpha(\lambda)$ and $M(\lambda)$ on the $\lambda$ grid for all the replicates, and identify $\lambda^*$, $\lambda_v$ along with the associated $L(\lambda)$. Figure 1 summarizes some of the empirical results.

Figure 1: Left: Performances of $V_\alpha(\lambda)$ with $\alpha = 1$ (faded) and $\alpha = 1.4$ for $n = 100$. Center: the $\lambda$ minimizing $V_1(\lambda)$ versus that minimizing $M(\lambda)$ (faded) or $V_{1.4}(\lambda)$, for $n = 100$. Right: $\min L(\lambda)/L(\hat\lambda)$ with $\hat\lambda$ minimizing $M(\lambda)$ or $V_\alpha(\lambda)$ at $\alpha = 1, 1.2, 1.4, 1.6, 1.8$, for $n = 100$ (fatter boxes) and $n = 500$ (thinner boxes).

A so-called extended exponential (EE) method was proposed by Kou and Efron (2002) to combine the strengths of $C_p$ and GML, but I am not able to follow the arguments.

2.3 Negative correlation and model indexing

One major criticism of cross-validation in the literature was the famous negative correlation between the optimal and cross-validated bandwidths, as demonstrated in the middle frame of Figure 2. The negative correlation is bothersome only when the index $\lambda$ is meaningful across-replicate, however, which will be analyzed below.
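The scores $V_1(\lambda)$, $V_\alpha(\lambda)$ of (12), and $M(\lambda)$ can be evaluated on a grid from a single eigendecomposition. The sketch below again uses a discrete second-difference stand-in for the smoothing spline with $m = 2$, which is an assumption of this illustration, as are the seed and the choice $\alpha = 1.4$ taken from the caption of Figure 1.

```python
import numpy as np

n, m = 100, 2
x = (np.arange(1, n + 1) - 0.5) / n
rng = np.random.default_rng(7)
y = np.sin(2 * np.pi * x - np.pi) + rng.normal(size=n)

# Discrete stand-in for the roughness penalty (an assumption of this sketch).
D = np.diff(np.eye(n), n=2, axis=0)
evals, Q = np.linalg.eigh(n ** 3 * D.T @ D)       # ascending eigenvalues
evals = np.clip(evals, 0.0, None)
evals[:m] = 0.0        # m-dimensional null space: constant and linear trends
z = Q.T @ y


def scores(lam, alpha=1.4):
    """Return (V_1, V_alpha, M) at lam, using the eigenvalues of I - A(lam)."""
    w = n * lam * evals / (1.0 + n * lam * evals)
    rss = np.sum((w * z) ** 2) / n                # n^{-1} Y'(I - A)^2 Y
    tr_A = np.sum(1.0 - w)
    v1 = rss / (1.0 - tr_A / n) ** 2
    v_alpha = rss / (1.0 - alpha * tr_A / n) ** 2
    # GML: n^{-1} Y'(I - A)Y over the (n - m)-th root of the product of the
    # n - m positive eigenvalues of I - A.
    gml = (np.sum(w * z ** 2) / n) / np.exp(np.sum(np.log(w[m:])) / (n - m))
    return v1, v_alpha, gml


lams = 10.0 ** np.linspace(-5.0, -1.0, 81)
table = np.array([scores(lam) for lam in lams])
lam_v1, lam_v14, lam_gml = lams[table.argmin(axis=0)]
```

Because the factor $\{(1 - \mathrm{trace}A/n)/(1 - \alpha\,\mathrm{trace}A/n)\}^2$ decreases in $\lambda$ when $\alpha > 1$, the minimizer of $V_\alpha$ can never be smaller than that of $V_1$ on the same data; the modification only ever moves the choice toward more smoothing. Whether that helps in a given replicate is an empirical question.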

Figure 2: Left: Performances of $V_1(\lambda)$ for $n = 100$. Center: $\log_{10}(\lambda^*)$ versus $\log_{10}(\lambda_v)$. Right: $\log_{10}(\rho^*)$ versus $\log_{10}(\rho_v)$.

Figure 3: Left and Center: The $\rho$-$\lambda$ mapping and the $(\rho^*, \lambda^*)$ pair of 100 replicates. Right: Efficacy of typical $\rho$ or $\lambda$.

Remember that unlike in the parametric settings where one selects from a discrete set of models, we are choosing from a continuum of tentative estimates $\eta_\lambda(x)$. An important issue is how to align estimates based on different data.

Mathematically, the minimizer of (1) is the solution to a constrained LS problem,
\[ \min \sum_{i=1}^n (Y_i - \eta(X_i))^2, \quad \text{s.t.} \int (\eta^{(m)}(x))^2 dx \le \rho \]
for some $\rho > 0$. There is a one-to-one correspondence between $\lambda$ and $\rho$ given the data. Intuitively, $\rho$ is meaningful across-replicate as it imposes the same constraint on $\eta(x)$ in the estimation process regardless of the data one observes, whereas the same $\lambda$ implies different constraints for different observations. A simple simulation will confirm that it is indeed the case, that $\eta_\lambda(x)$ based on different data should be aligned by the corresponding $\rho$, but we first observe in the right frame of Figure 2 that the negative correlation disappears on the $\rho$ scale.

In the setting of 2.1, one may calculate $\rho = \int_0^1 (\eta^{(2)}(x))^2 dx$ for all estimates $\eta_\lambda(x)$ on the $\lambda$ grid, for all replicates, and for the true function $\eta(x) = \sin(2\pi x - \pi)$; the true function has $\rho = $ …. In the left and middle frames of Figure 3, we demonstrate the $\rho$-$\lambda$ mapping for 100 replicates of samples of size $n = 100$ and identify the 100 optimal $(\rho^*, \lambda^*)$ on the grid; it is reassuring to see that … is in the middle of the $\rho^*$'s. To assess the across-replicate interpretability of $\rho$ and $\lambda$, we pick a typical optimal $\tilde\rho = $ …, and a typical optimal $\tilde\lambda = \mathrm{median}(\lambda^*)$, and calculate their efficacy over the replicates via $L(\rho^*)/L(\tilde\rho)$ and $L(\lambda^*)/L(\tilde\lambda)$, which is shown in the right frame of Figure 3. The tight spread of $\rho$'s confirms the intuition that $\rho$ is the proper model index here.

Besides smoothing splines, there does not seem to exist a $\rho$-index for other smoothing methods. Nevertheless, model indexing remains an important issue, which has subtle implications in the theory and practice of bandwidth selection. Even for smoothing splines, the $\rho$-index is difficult to work with, and the identification of it as the proper model index does not offer any operational help. It does, however, help to identify some popular but questionable concepts and practices.

One of the misleading concepts is the degree-of-freedom in regression, which is defined as $\mathrm{trace}A$: given the $x_i$'s, $\lambda$-$\mathrm{trace}A$ is a one-to-one mapping, so the selection of $\lambda$ through $\mathrm{trace}A$ simply leaves things in the hands of the random noise $\epsilon_i$. In classical parametric statistics, the degree-of-freedom is defined in terms of model dimension, and it is relevant in settings other than regression. The coincidence that the trace of the hat matrix in linear regression matches the model dimension does not automatically qualify the trace as a valid model index. In fact, $\mathrm{trace}A$ is much worse than $\lambda$, as it allows one to compare $A(\lambda)$ with $A(h)$ while they are not comparable.

Resampling is widely used in many phases of statistical analysis. Working with an index such as $\lambda$ that is not meaningful across-replicate, however, one should avoid using resampling for bandwidth selection.

Besides the negative correlation, another criticism against cross-validation is that the cross-validated bandwidth converges very slowly asymptotically. Two aspects of this argument need close scrutiny: i) If the target of convergence is something like a fixed $\lambda$, then it is no worry because the optimal $\lambda^*$ changes with data and $\lambda_v$ may simply be chasing $\lambda^*$.
ii) The bandwidth is not part of the stochastic setting but a tuning parameter in the estimation process, and its meaning is only through $L(\lambda)$, so it is okay for $\lambda_v$ to be far from $\lambda^*$ as long as $L(\lambda^*)/L(\lambda_v) \to_p 1$.

2.4 Loss versus risk

Risk calculation is a basic exercise in the theoretical analysis of statistical procedures, but when done using a model index not interpretable across-replicate, the proper use of it can be tricky.

Imagine that you are studying the MSE performance of LS regression empirically, and you have 3 models $M_1 = \{\mu = E[Y] = \beta_0 + \beta_1 x\}$, $M_2 = \{\mu = \beta_0 + \beta_1 x + \beta_2 x^2\}$, and $M_3 = \{\mu = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\}$. After generating data, you calculate $\hat\mu^{(j)}$, $j = 1, 2, 3$, under $M_j$ and calculate the MSE $\frac{1}{n} \sum_{i=1}^n (\hat\mu_i^{(j)} - \mu_i)^2$, and you average over replicates the MSEs corresponding to the same $\hat\mu^{(j)}$. You would be crazy to take $\hat\mu^{(1)}$ from the first 20 replicates to average with $\hat\mu^{(2)}$ from the next 40 replicates. While to a much lesser extent, the calculation of $R(\lambda) = E[L(\lambda)]$ is like mixing $\hat\mu^{(1)}$'s from some replicates with $\hat\mu^{(2)}$'s from other replicates in the above example.

You may notice that I never call $\eta_\lambda(x)$ an estimator but only an estimate given the data. Many would call the minimizer of $R(\lambda)$ optimal, but since I never observe an average sample, I consider a risk-optimal $\lambda$ meaningless, and do not care about any convergence towards it.

While derivative concepts based on $R(\lambda)$ have no meaning in my books, $R(\lambda)$ serves as an important analytical device. For example, to prove $V(\lambda) - L(\lambda) - n^{-1}\epsilon^T\epsilon = o_p(L(\lambda))$, one proceeds by showing that $V(\lambda) - L(\lambda) - n^{-1}\epsilon^T\epsilon = o_p(R(\lambda))$ and that $L(\lambda) - R(\lambda) = o_p(R(\lambda))$; to establish $L(\lambda) = O_p(\lambda + n^{-1}\lambda^{-1/2m})$, one simply shows that $R(\lambda) = O(\lambda + n^{-1}\lambda^{-1/2m})$.

Risk calculation using the $\rho$-index would be conceptually meaningful, only if it were possible.

Although no one questions the appeal of the loss-optimal bandwidth and most do simulations in terms of it (negative correlation would not be there otherwise), many authors claim that the data do not contain enough information for one to pursue the loss-optimal bandwidth. In view of the results by Li (1986) and Hall (1987), such claims are misleading.

2.5 References

For the derivation of GML and its asymptotic analysis, check Wahba (1985). The GML always picks $\lambda \asymp n^{-2m/(2m+1)}$, but for $\eta(x)$ super smooth the optimal one is $\lambda \asymp n^{-2m/(4m+1)}$.

Wahba, G. (1985). A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem, Ann. Statist. 13.

The modifications of cross-validation in (12) and (13) and the empirical performances can be found in the articles based on the theses of Jingyuan Wang and Young-Ju Kim.

Gu, C. and J. Wang (2003). Penalized likelihood density estimation: Direct cross-validation and scalable approximation, Statist. Sin. 13.

Kim, Y.-J. and C. Gu (2004). Smoothing spline Gaussian regression: More scalable computation via efficient approximation, J. Roy. Statist. Soc. Ser. B 66.

Details of the EE method can be found in the reference below.

Kou, S. and B. Efron (2002). Smoothers and the $C_p$, GML and EE criteria: A geometric approach, J. Amer. Statist. Assoc. 97.

The negative correlation between the cross-validated bandwidth and the optimal bandwidth was observed by many, and was publicized by Scott and Terrell (1987) and Hall and Johnstone (1992), though all took it at face value and tried to fix it, which is unnecessary.

Scott, D. W. and G. R. Terrell (1987). Biased and unbiased cross-validation in density estimation, J. Amer. Statist. Assoc. 82.

Hall, P. and I. Johnstone (1992). Empirical functionals and efficient smoothing parameter selection (with discussion), J. Roy. Statist. Soc. Ser. B 54.

Detailed discussion of model indexing and the ramifications in bandwidth selection can be found in the following reference.

Gu, C. (1998). Model indexing and smoothing parameter selection in nonparametric function estimation (with discussion), Statist. Sin. 8.
