Localized Model Selection for Regression


Yuhong Yang
School of Statistics, University of Minnesota
Church Street S.E., Minneapolis, MN 55455

May 7, 2007

Abstract

Research on model/procedure selection has focused on selecting a single model globally. In many applications, especially for high-dimensional or complex data, however, the relative performance of the candidate procedures typically depends on the location, and the globally best procedure can often be improved when selection of a model is allowed to depend on location. We consider localized model selection methods and derive their theoretical properties.

1 Introduction

In statistical modeling, usually a number of candidate models are considered. Traditional model selection theory and practice focus on the search of a single candidate that is treated as the true or best model. There are good reasons for this: (1) comparing the candidates in terms of global performance is a good starting point; (2) in simple situations, the relative performance in ranking of the candidate models may not depend on x, the vector of predictors; (3) if one of the candidate models is believed to describe the data very well, one does want to find it for optimal prediction and interpretation.

However, for high-dimensional or complex data, global selection may be sub-optimal. Consider two example situations.

1. For univariate regression, if the mean function is infinite-dimensional with respect to the candidate models, there does not seem to be a strong reason to believe that one model works better than the others at all x values. Thus a global ranking and selection may not be the best thing to do. This will be demonstrated later.

2. In many current statistical applications, the number of predictors is very large, or even much larger than the number of observations. In such cases, there can be substantial uncertainty in modeling. For instance, when selecting important genes for explaining a response variable out of 5000 genes based on 100 observations, any variable selection method is exploratory in nature and it seems

clear that one cannot expect the selected model to have really captured the relationship between the response and the predictors. When one considers different types of models, they often perform the best in different regions.

The above situations motivate the consideration of localized model selection. With the ever increasing computing power, selecting procedures locally becomes feasible in implementation.

Given the direction of localized model selection, one may wonder if it is better to take a local estimation approach in the first place and put the effort on building a single good estimator. In our opinion, this does not work in general. First, global considerations are important for avoiding overfitting, and for high-dimensional situations, local estimation without global constraints is often not possible. Second, one has many different ways to do local estimation, and then one is back to the problem of procedure selection.

Now let us set up the problem mathematically. Let (X_i, Y_i), i = 1, ..., n be iid observations with X_i taking values in 𝒳, a measurable set in R^d for some d ≥ 1. Let Y_i = f(X_i) + ε_i, where f is the regression function to be estimated under squared error loss and the error ε_i has mean zero and variance σ². Unless stated otherwise, ε_i is assumed to be independent of X_i. Suppose δ_j, j ∈ J are a finite collection of statistical procedures for estimating the regression function, each producing an estimator of f based on a given sample. We will focus on the case with two procedures, although similar results hold more generally.

The rest of the paper is organized as follows. In Section 2, we give an illustration to motivate localized model selection. In Section 3, we mention three approaches to localized model selection. In Section 4, we provide a result that characterizes performance of a localized cross-validation method. In Section 5, we study preference-region-based localized model selection. Concluding remarks are given in Section 6. Proofs of the theorems are put in an appendix.

2 A motivating illustration

Consider estimating a regression function on [−1, 1] based on 100 observations. The x values are uniformly spaced. The true regression function is

f(x) = 0.5x + (0.8/√(2π)) exp(−200(x + 0.5)²) − (0.8/√(2π)) exp(−200(x − 0.5)²)

and the error is from N(0, 0.3²). A typical realization of the data is given in Figure 1.
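For concreteness, this simulation setup can be written in a few lines of R. The sketch below assumes the reconstructed form of the mean function displayed above, and uses smooth.spline with its default tuning to play the role of the smoothing spline procedure.

# Minimal R sketch of the simulation setup in Section 2 (the constant 200 inside
# the exponentials follows the reconstructed mean function above).
set.seed(1)
n <- 100
x <- seq(-1, 1, length.out = n)                     # uniformly spaced design points
f.true <- function(x) {
  0.5 * x + (0.8 / sqrt(2 * pi)) * exp(-200 * (x + 0.5)^2) -
            (0.8 / sqrt(2 * pi)) * exp(-200 * (x - 0.5)^2)
}
y <- f.true(x) + rnorm(n, sd = 0.3)                 # N(0, 0.3^2) errors

fit.lin <- lm(y ~ x)                                # simple linear model
fit.ss  <- smooth.spline(x, y)                      # smoothing spline, default control parameters

plot(x, y, main = "A typical realization of 100 observations")
lines(x, fitted(fit.lin), lty = 2)                  # linear fit
lines(predict(fit.ss, x), lty = 1)                  # smoothing spline fit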

The linear trend is obvious, but one also sees possible nonlinearities. Naturally, one may consider a simple linear model, thinking that the somewhat unusual pattern in the middle might be caused by a few outliers or that the deviation from linearity may not be serious enough to pursue. Alternatively, one may consider a nonparametric method. We choose a smoothing spline method that is provided in R with the default choices of the control parameters.

[Figure 1: Scatter Plot of a Typical Realization of 100 Observations (y versus x).]

2.1 Advantage of localized selection with a certain form of preference region

We compare a cross-validation (CV) method for global selection between the linear regression and the smoothing spline procedures with a selection method that recognizes that the two estimators may perform well in different regions. For the CV method, we randomly split the data into two equally sized parts, find the linear and the smoothing spline (SS) estimates using the first part and compute the prediction squared error on the second part. We repeat this 50 times and choose the one with smaller median prediction error over the random splittings. Let f̂_G(x) be the resulting estimator.

From the scatter plot, one may suspect that the linear estimate may perform poorly around 0, while the nonparametric estimate may be undersmoothed when x is away from 0. Thus, for the competing non-global selection method, we consider estimators of the form

f̂(x; c) = f̂_L(x) I{|x| ≥ c} + f̂_SS(x) I{|x| < c},

where f̂_L is the estimator of f based on the linear model, and f̂_SS is the SS estimator. We use CV similarly as before to choose c in the range of [0, 1] at a grid of width 0.01. Let f̂_NG(x) be the resulting estimator.
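A rough R sketch of this comparison is given below; the 50 half/half random splits and the median prediction error criterion follow the description above, while the helper name split.err and other details are illustrative.

# Sketch: one random split scores the two global candidates and every threshold c
# for the localized form f(x; c); medians over 50 splits are then compared.
split.err <- function(x, y, c.grid) {
  idx <- sample(length(y), length(y) / 2)
  x1 <- x[idx];  y1 <- y[idx]                       # estimation half
  x2 <- x[-idx]; y2 <- y[-idx]                      # evaluation half
  p.lin <- predict(lm(y1 ~ x1), newdata = data.frame(x1 = x2))
  p.ss  <- predict(smooth.spline(x1, y1), x2)$y
  err.global <- c(lin = mean((y2 - p.lin)^2), ss = mean((y2 - p.ss)^2))
  err.local <- sapply(c.grid, function(c0) {
    pred <- ifelse(abs(x2) >= c0, p.lin, p.ss)      # linear where |x| >= c, SS where |x| < c
    mean((y2 - pred)^2)
  })
  list(global = err.global, local = err.local)
}

c.grid <- seq(0, 1, by = 0.01)
runs <- replicate(50, split.err(x, y, c.grid), simplify = FALSE)
med.global <- apply(sapply(runs, `[[`, "global"), 1, median)   # global CV comparison
med.local  <- apply(sapply(runs, `[[`, "local"),  1, median)
c.hat <- c.grid[which.min(med.local)]                          # CV-selected threshold for f_NG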

Note that when c = 0, the linear estimator is selected for all x; when c = 1, the nonparametric estimator is selected for all x; for c in between, we use SS only when |x| is no bigger than c.

At a given x_0, compute the squared error losses of f̂_G(x_0) and f̂_NG(x_0) respectively. Generate 200 independent data sets from the true model to simulate the risks of f̂_G(x_0) and f̂_NG(x_0). The results are presented in Figure 2 at a number of x_0 values in the range of (−1, 1).

[Figure 2: Risk Ratio of a Global CV vs a Non-Global Selection Method (risk ratio versus x).]

The figure clearly shows that the non-global selection performs much better than the global selection except in a very small neighborhood around zero, demonstrating the potentially great advantage of considering non-global selection among candidate procedures. It is probably fair to say that one may not necessarily choose the form of f̂(x; c) from inspecting the scatter plot. In the next subsection, we use logistic regression to find the preference region automatically.

2.2 Classification for estimating the preference region

We continue with the earlier setting except that the sample size is now 200 so that the contrast is more clearly seen in the next figure. We focus on one typical realization of the data. A scatter plot of the data with the linear and the smoothing spline fits is given in the upper-left panel of Figure 3. We use logistic regression as the classification method to find the region where the linear estimator performs better than the smoothing spline estimator. With a random splitting of the data into two parts of equal size, we fit a straight-line model and the smoothing spline using the first 100 observations, and obtain the binary variable that indicates which method has a smaller prediction error on the second 100 observations.

A typical outcome is shown in the upper-right panel of Figure 3. One may get the impression that SS tends to do better in the middle and less well at the ends. We fit a logistic regression model with three terms: 1, x and x². The estimated probability that the linear model performs better at x is

p̂(x) = 1/(1 + exp(0.379 + 0.05x − 1.97x²)).

Note that p̂(x) > 0.5 corresponds to x < −0.51 or x > 0.49, which is very sensible judging from our knowledge of the true mean function, despite that the upper-right panel of the figure does not seem to be very visually informative for relating the relative performance of the two estimators and x. This yields the following combined estimate of the regression function:

f̂(x) = f̂_L(x) if x < −0.51 or x > 0.49, and f̂(x) = f̂_SS(x) otherwise,

where f̂_L(x) and f̂_SS(x) are the linear and smoothing spline estimates of f based on the full data. This estimate is compared with the true function in the lower-right panel of Figure 3.

From the lower-left panel of Figure 3, we see that the linear estimate is very poor around −0.5 and 0.5, not surprisingly; in contrast, the SS estimate is quite good there, but it pays a price for being flexible in terms of accuracy at other locations. Globally, the SS is much better than the linear estimate. Indeed, the ratio of their integrated squared errors is over 6.3. Thus from the global perspective, the SS estimate is the clear winner. However, with the use of logistic regression, we properly figured out that when x is far away from zero by a certain degree, the linear estimate is better. Although the linear estimate is slightly biased even in the linear parts, its use in the final combined estimate is very helpful, as seen in the figure. Numerically, the integrated squared error of the combined estimate is only 35% of that of the SS estimate. Therefore a globally poor method can still make a good contribution if it is properly combined with a globally much better estimator.
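The classification step just described can be sketched in R as follows. The quadratic logistic model mirrors the three terms 1, x and x² used above; the seed, the helper names and the data-generating call are illustrative, and the fitted coefficients will vary from realization to realization.

# Sketch of the classification-based preference region of Section 2.2 (n = 200).
set.seed(2)
n <- 200
x <- seq(-1, 1, length.out = n)
f.true <- function(x) 0.5 * x + (0.8 / sqrt(2 * pi)) *
  (exp(-200 * (x + 0.5)^2) - exp(-200 * (x - 0.5)^2))
y <- f.true(x) + rnorm(n, sd = 0.3)

idx <- sample(n, n / 2)                             # first half: fit both procedures
fit.lin <- lm(y ~ x, subset = idx)
fit.ss  <- smooth.spline(x[idx], y[idx])

x2 <- x[-idx]; y2 <- y[-idx]                        # second half: which fit predicts better?
e.lin <- abs(y2 - predict(fit.lin, newdata = data.frame(x = x2)))
e.ss  <- abs(y2 - predict(fit.ss, x2)$y)
lin.better <- as.integer(e.lin < e.ss)

cls <- glm(lin.better ~ x2 + I(x2^2), family = binomial)   # logistic regression on 1, x, x^2
p.hat <- function(x.new)
  predict(cls, newdata = data.frame(x2 = x.new), type = "response")

full.lin <- lm(y ~ x)                               # refit on the full data
full.ss  <- smooth.spline(x, y)
f.comb <- function(x.new)                           # combined estimate: linear where p.hat > 0.5
  ifelse(p.hat(x.new) > 0.5,
         predict(full.lin, newdata = data.frame(x = x.new)),
         predict(full.ss, x.new)$y)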

[Figure 3: An Example of Using Logistic Regression for Non-Global Selection. Panels: scatter plot with fitted values; absolute prediction error on test data (Lin. vs SS); the two estimates compared with the true mean; the combined estimate compared with the true mean.]

3 Three approaches to non-global model selection

Consider two procedures δ_1 and δ_2 with risks R(δ_1; x; n) and R(δ_2; x; n) at a given x value based on a sample of size n. Let A = {x : R(δ_1; x; n) ≤ R(δ_2; x; n)} be the set of x at which the first procedure performs no worse than the second one. It is the preference region of δ_1 relative to δ_2. Ideally, for selection, one would use δ_1 on A and δ_2 on A^c. In reality, one may have little prior knowledge on the form of A, and to deal with the issue, one may consider graphical inspections when x is of a low dimension, or consider A from a class of sets with a proper complexity in hope that one member is close to A. One can consider various sets of A of different degrees of locality. We consider three approaches below.

3.1 Neighborhood-based selection

Instead of directly estimating A, at each given x_0, one considers a local neighborhood around x_0 and tries to find the candidate that performs better in the local area. This approach can be very helpful when A cannot be described well by simple sets. It will be studied in detail in the next section.

3.2 Empirical risk minimization

One considers a collection of sets of a certain mathematical form and tries to identify the one with the best performance. Here the collection may depend on the sample size. The size of the collection can be pre-determined or adaptively chosen. This approach will be briefly explored in Section 5.

3.3 Classification-based selection

Sometimes, one is less confident to go with any specific collection of sets as the preference region. This is especially true when the input dimension is high, where neighborhood-based selection may perform poorly due to the curse of dimensionality. In such a case, one can take advantage of classification methods to conduct localized selection as in Section 2.2. In general, one splits the data into two parts. The first part is used to obtain the estimates from the candidate procedures, which then make predictions for the second part. Based on the relative predictive performance in each case, we create a new variable that simply indicates which estimate is the better one. Then we can apply a sensible classification method to relate the performance indication variable to the covariates. If a classifier performs well, it indicates that the candidate procedures' relative performance depends on the covariates, and we can do better than globally selecting one of the procedures.

4 Localized cross-validation selection

Cross-validation (see, e.g., Allen, 1974; Stone, 1974; Geisser, 1975) has been widely used for model selection. In this section, we consider a localized cross-validation method for selecting a candidate procedure locally.

For a given x_0, consider the ball centered at x_0 with radius r_n for some r_n > 0 under the Euclidean distance. We randomly split the data into a training set of size n_1 and a test set of size n_2, and then use the training set to do estimation by each regression procedure. For evaluation, consider only the data points in the test set that are in the given neighborhood of x_0. Let ĵ(x_0) = ĵ_n(x_0) be the procedure that has the smaller average squared prediction error. This process is repeated with a number of random splittings of the observations to avoid the splitting bias. The procedure that wins more frequently over the permutations is the final winner. We call this a localized cross-validation (L-CV) at x_0 with r_n-neighborhood.
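For a one-dimensional covariate, the L-CV just described might be implemented along the following lines; the radius r, the number of random splits and the majority-vote rule are illustrative choices rather than prescriptions of the paper.

# Sketch of localized cross-validation (L-CV) at a point x0: returns 1 (linear)
# or 2 (smoothing spline) according to which procedure wins more splits near x0.
lcv.select <- function(x, y, x0, r, n.splits = 50) {
  wins <- replicate(n.splits, {
    idx  <- sample(length(y), length(y) / 2)        # training half
    test <- setdiff(seq_along(y), idx)
    near <- test[abs(x[test] - x0) <= r]            # test points in B(x0; r)
    if (length(near) == 0) {
      NA
    } else {
      lin <- lm(y ~ x, subset = idx)
      ss  <- smooth.spline(x[idx], y[idx])
      e1 <- mean((y[near] - predict(lin, newdata = data.frame(x = x[near])))^2)
      e2 <- mean((y[near] - predict(ss, x[near])$y)^2)
      as.integer(e1 <= e2)                          # 1 if the linear fit wins this split
    }
  })
  if (mean(wins == 1, na.rm = TRUE) >= 0.5) 1 else 2
}

# e.g. lcv.select(x, y, x0 = 0.5, r = 0.1) versus lcv.select(x, y, x0 = -0.9, r = 0.1)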

For the local selection problem, we are interested in the question of under what conditions the locally better estimator will be selected with high probability. More precisely, assuming that when n is large enough, one procedure has risk at x_0 smaller than that of the other one, we want to select the better one with probability approaching 1. If a selection method has this property, we say it is locally consistent in selection at x_0. For related results on global CV, see, e.g., Li (1987), Burman (1989), Zhang (1993), Shao (1997), Wegkamp (2003), and Yang (2007).

Given x_0 and η > 0, let

L(f̂; x_0; η) = ∫_{x ∈ B(x_0;η)} (f̂(x) − f(x))² P̃(dx)

be a local loss of an estimate f̂ around x_0, where P̃ denotes the distribution of X conditional on that X takes value in the neighborhood B(x_0; η). Let ‖·‖_{s,x_0;η} denote the L_s norm around x_0 with radius η, i.e., ‖g‖_{s,x_0;η} = (∫_{x ∈ B(x_0;η)} |g(x)|^s P̃(dx))^{1/s}.

Definition 1. Procedure δ_1 (or f̂_{1,n}) is asymptotically better than δ_2 (or f̂_{2,n}) under the squared loss at η_n-neighborhood of x_0, denoted δ_1 ≻ δ_2 at (x_0; η_n), if for each non-increasing sequence η'_n with 0 < η'_n ≤ η_n and every 0 < ǫ < 1, there exists a constant c_ǫ > 0 such that when n is large enough,

P(L(f̂_{2,n}; x_0; η'_n) ≥ (1 + c_ǫ) L(f̂_{1,n}; x_0; η'_n)) ≥ 1 − ǫ.   (1)

When one of the procedures is asymptotically better than the other at a neighborhood of x_0, we say δ_1 and δ_2 are ordered at x_0.

One may wonder if it is better to require the condition in (1) for η_n only. The answer is no, because it is possible that L(f̂_{2,n}; x_0; η_n) is smaller than L(f̂_{1,n}; x_0; η_n) with high probability for one sequence η_n, yet the opposite holds for a smaller sequence 0 < η'_n ≤ η_n. This can happen, for example, when δ_1 is asymptotically better than δ_2 globally (i.e., with no restriction on η_n) but worse locally. In general, the space 𝒳 can be decomposed into three regions: those x_0 at which δ_1 is asymptotically better than δ_2, those at which δ_2 is asymptotically better than δ_1, and the rest of x_0 at which δ_1 and δ_2 cannot be compared according to Definition 1. Local selection can be hoped to be successful only for the first two regions.

Definition 2. A procedure δ (or {f̂_n}_{n=1}^∞) is said to converge exactly at rate {a_n} in probability at η_n-neighborhood of x_0 if for each non-increasing sequence η'_n with 0 < η'_n ≤ η_n, L(f̂_n; x_0; η'_n) = O_p(a_n), and for every 0 < ǫ < 1, there exists c_ǫ > 0 such that when n is large enough, P(L(f̂_n; x_0; η'_n) ≥ c_ǫ a_n) ≥ 1 − ǫ.

Definition 3. A selection method is said to be locally consistent in selection at x_0 if δ_1 and δ_2 are ordered at x_0 and the asymptotically better procedure at x_0 is selected with probability going to 1. If the selection method is locally consistent at every x_0 at which the procedures are ordered, it is said to be locally consistent.

In Definition 2, the second condition simply says that the loss does not converge faster than a_n in an appropriate sense. For the following result, we assume that f̂_{1,n} and f̂_{2,n} converge exactly at rates {p_n} and {q_n} in probability at η_n-neighborhood of x_0 respectively, where p_n and q_n are two sequences of non-increasing positive numbers.

Some technical conditions are needed.

Condition 1. There exists a sequence of positive numbers A_n such that for j = 1, 2, ‖f − f̂_{j,n}‖_∞ = O_p(A_n).

Condition 2. There exists η_n > 0 such that δ_1 ≻ δ_2 at (x_0; η_n) or δ_2 ≻ δ_1 at (x_0; η_n). Let j*(x_0) denote the better procedure.

Condition 3. There exists a sequence of positive numbers {M_n} such that for j = 1, 2 and each non-increasing sequence 0 < η'_n ≤ η_n, we have ‖f − f̂_{j,n}‖_{4,x_0;η'_n} / ‖f − f̂_{j,n}‖_{2,x_0;η'_n} = O_p(M_n).

Theorem 1: Under Conditions 1-3, as long as n_1, n_2, and r_n are chosen to satisfy (1) n_2 r_n^d M_{n_1}^{-4} → ∞; (2) n_2 r_n^d max(p_{n_1}², q_{n_1}²)/(1 + A_{n_1}²) → ∞; (3) r_n ≤ η_n, we have that with probability going to 1, the better procedure f̂_{j*(x_0),n_1} will be selected.

From the result, several factors affect the ability of the L-CV to identify the best estimator at a given x_0. The larger η_n, the larger the window for choosing the neighborhood size r_n for L-CV. Intuitively, if η_n is very close to zero, which occurs when the two procedures constantly switch position in terms of local performance, it may not be feasible to identify the locally better one at all. The constants M_n and A_n come into the picture more for a technical reason. For simplicity, for the following discussion, we assume that M_n and A_n are both of order 1. Then the requirements on the choice of data splitting ratio and the neighborhood size become n_2 r_n^d max(p_{n_1}², q_{n_1}²) → ∞ and r_n ≤ η_n.

Consider, for example, η_n = (log n)^{-1}. When at least one of the procedures is nonparametric with max(p_n, q_n) converging more slowly than the parametric rate n^{-1/2}, Yang (2007) showed that for selecting the globally better procedure, it suffices to take n_2 at least of the same order as n_1 (which is not enough for comparing parametric models, as shown in Shao (1993)). With local selection, however, we need to be more careful so as to satisfy (1)-(2): roughly, n_2 max(p_{n_1}², q_{n_1}²) needs to be of larger order than r_n^{-d} ≥ (log n_1)^d. If max(p_n, q_n) is of order n^{-1/3}, then with n_1 = n_2, the condition is simply n^{-1/(3d)} ≤ r_n ≤ (log n_1)^{-1}.

When the number of observations is moderately large, one can use a data-dependent approach to choose r_n. For example, taking r_n to be the same for all x, one may consider another level of data splitting for empirically evaluating the performance of each choice of r_n. Theoretical results are possible, but we will not pursue them in this work.
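One way to carry out the data-dependent choice of r_n mentioned above is sketched below, reusing lcv.select from the previous sketch; a further random split scores the locally selected procedure for each candidate radius. The grid of radii and the number of splits are illustrative.

# Sketch: choose a common neighborhood radius r by an extra level of data splitting,
# scoring the prediction error of the locally selected procedure on held-out points.
choose.r <- function(x, y, r.grid, n.splits = 20) {
  scores <- sapply(r.grid, function(r) {
    mean(replicate(n.splits, {
      idx  <- sample(length(y), length(y) / 2)
      lin  <- lm(y ~ x, subset = idx)
      ss   <- smooth.spline(x[idx], y[idx])
      test <- setdiff(seq_along(y), idx)
      pred <- sapply(test, function(i) {
        j <- lcv.select(x[idx], y[idx], x0 = x[i], r = r)   # local selection on the training half
        if (j == 1) predict(lin, newdata = data.frame(x = x[i]))
        else        predict(ss, x[i])$y
      })
      mean((y[test] - pred)^2)
    }))
  })
  r.grid[which.min(scores)]
}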

5 Preference region selection

Let f̂_1(x) = f̂_{1,n_1}(x) and f̂_2(x) = f̂_{2,n_1}(x) be the estimates by the two regression procedures based on the first part of the data of size n_1. For a measurable set A, define

f̂(x; A) = f̂_1(x) I{x ∈ A} + f̂_2(x) I{x ∉ A}.

Since

E(f̂(x; A) − f(x))² = E(f̂_1(x) − f(x))² I{x ∈ A} + E(f̂_2(x) − f(x))² I{x ∉ A},

it follows that the preference region A* = {x : E(f̂_1(x) − f(x))² ≤ E(f̂_2(x) − f(x))²} is the best choice of A. Obviously, we have

E‖f̂(·; A*) − f‖² ≤ min(E‖f̂_1 − f‖², E‖f̂_2 − f‖²),

i.e., if it is possible to identify A*, the estimator f̂(x; A*) is better or at least no worse than the global winner between the two candidate procedures. The ideal local selection can be arbitrarily better than the global selection in terms of their risk ratio.

We consider a collection of sets of manageable complexity and try to find the best set A within the class. Let 𝒜_n be a class of sets and let A*_n be the set that minimizes E‖f̂(·; A) − f‖² over 𝒜_n. Then E‖f̂(·; A*_n) − f‖² − E‖f̂(·; A*) − f‖² is the approximation error due to the use of a set in 𝒜_n, which may not contain A*. Of course, A*_n is also unknown and needs to be estimated.

Let μ(B Δ A) denote the probability under the distribution of X_i of the symmetric difference of A and B. Assume that 𝒜_n has metric entropy bounded above by H_n(ǫ) under the distance d(A, B) = μ(B Δ A) (for the concept and related results involving metric entropy, see, e.g., Kolmogorov and Tihomirov, 1959; Yatracos, 1985; Birgé, 1986; van de Geer, 1993; Yang and Barron, 1999). Let 𝒜_{n,0} be a discretized collection of sets that serves as an ǫ_n-net for 𝒜_n under d. Let Â_n be the minimizer of the empirical prediction error over 𝒜_{n,0}, i.e.,

Â_n = arg min_{A ∈ 𝒜_{n,0}} (1/n_2) Σ_{i=n_1+1}^{n} (Y_i − f̂(X_i; A))².
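To make the empirical minimization defining Â_n concrete, the following R sketch searches a simple illustrative class 𝒜_{n,0} of single intervals [a, b] on a grid (the region where the first procedure is used); richer classes such as unions of several cubes would be handled analogously.

# Sketch of the empirical preference-region search of Section 5 over a grid of
# candidate intervals [a, b] (an illustrative choice of A_{n,0}).
prefer.region <- function(x, y, grid) {
  n   <- length(y)
  idx <- seq_len(n %/% 2)                           # first part: estimate both procedures
  lin <- lm(y ~ x, subset = idx)
  ss  <- smooth.spline(x[idx], y[idx])
  test <- setdiff(seq_len(n), idx)                  # second part: empirical prediction error
  p1 <- predict(lin, newdata = data.frame(x = x[test]))
  p2 <- predict(ss, x[test])$y
  cand <- subset(expand.grid(a = grid, b = grid), a <= b)
  err <- apply(cand, 1, function(ab) {
    in.A <- x[test] >= ab["a"] & x[test] <= ab["b"] # A = [a, b]: use procedure 1 on A
    mean((y[test] - ifelse(in.A, p1, p2))^2)        # empirical loss of f(.; A)
  })
  cand[which.min(err), ]                            # empirically selected preference region
}
# e.g. prefer.region(x, y, grid = seq(-1, 1, by = 0.1))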

Theorem 2: Assume that the errors are normally distributed with mean zero and variance σ² > 0, and ‖f̂_{j,n} − f‖_∞ ≤ C a.s. for some constant C < ∞ for both procedures. Then the final estimator f̂(·; Â_n) satisfies

E‖f̂(·; Â_n) − f‖² ≤ C_1 (E‖f̂(·; A*_n) − f‖² + ǫ_n + H_n(ǫ_n)/n_2)
 ≤ C_2 (E‖f̂(·; A*) − f‖² + E‖f̂(·; A*_n) − f̂(·; A*)‖² + ǫ_n + H_n(ǫ_n)/n_2),

where the constants C_1 and C_2 depend only on C and σ².

To optimize the risk bound above, with 𝒜_n given, we need to balance H_n(ǫ_n)/n_2 and ǫ_n. The issue becomes more complicated when one needs to choose 𝒜_n. Clearly, a large choice of 𝒜_n reduces the potential approximation error but at the same time increases the estimation error due to searching over a larger class of sets A. One approach to handling the approximation error is to assume that A* is in a given collection of sets ℬ, and then characterize the uniform approximation error of sets in ℬ by sets in 𝒜_n. Define γ_n = sup_{B ∈ ℬ} inf_{A ∈ 𝒜_n} μ(B Δ A). Under proper assumptions on ℬ and 𝒜_n, the rate of convergence of γ_n can be derived and then used for obtaining the convergence rate of the final estimator from the localized selection.

An example. Let 𝒜_k consist of all possible unions of at most k cubes in 𝒳 = [0, 1]^d. Let ℬ be the collection of all the sets that each can be well approximated by a set in 𝒜_k, in the sense that for each B ∈ ℬ, there exists a set A in 𝒜_k such that μ(B Δ A) ≤ ck^{-τ} for some constants c, τ > 0. For simplicity, we assume that X_i has a uniform distribution on [0, 1]^d, although a similar result holds if X_i has a uniformly upper bounded Lebesgue density.

To obtain an ε-net for ℬ, we first choose k = k_ε such that ck^{-τ} ≤ ε/2; k_ε is then of order ε^{-1/τ}. Then for any B ∈ ℬ, there exists a set A in 𝒜_{k_ε} such that μ(B Δ A) ≤ ε/2. Consequently, an (ε/2)-net in 𝒜_{k_ε} will be an ε-net for ℬ. Now if we have a grid on [0, 1]^d of width ε/(2k_ε) for each coordinate, then the collection of all unions of at most k_ε cubes with vertices on the grid forms an (ε/2)-net for 𝒜_{k_ε} and thus also an ε-net for ℬ. The number of possible unions of k_ε such cubes is of order ((2k_ε/ε)^d)^{k_ε} = (ε^{-(1+1/τ)})^{d ε^{-1/τ}} up to constants. It follows that the metric entropy of ℬ is of order O(ε^{-1/τ} log(ε^{-1})).

If we select A over the ǫ_n-net by cross-validation with a proper discretization, by Theorem 2, with a choice of n_1 and n_2 both of order n, the combined estimator satisfies

E‖f̂(·; Â_n) − f‖² = O(E‖f̂(·; A*) − f‖² + ǫ_n + ǫ_n^{-1/τ} log(ǫ_n^{-1})/n).

Balancing the last two terms in the above expression, the optimal choice of ǫ_n is (log n/n)^{τ/(τ+1)}, and consequently the risk bound becomes

E‖f̂(·; Â_n) − f‖² = O(E‖f̂(·; A*) − f‖² + (log n/n)^{τ/(τ+1)}).

Thus, when τ is large, the additional term (log n/n)^{τ/(τ+1)} is close to log n/n, which is typically negligible for nonparametric regression. Then we achieve the performance of the ideal localized selection up to a relatively small discrepancy term.

6 Concluding remarks

Overall performance of a candidate model/procedure has been the dominating measure used for model selection. If one candidate model is thought to be true or optimal, it is then rational to try to identify it. However, in many applications, this practice is sub-optimal because the globally best candidate procedure, even if it is much better than the others, may still have unsatisfactory local behaviors in certain regions, which can be well remedied with help from other candidates that are globally inferior.

We took two directions in localized model selection: a localized cross-validation that selects a model/procedure at each x value based on performance assessment in a neighborhood of x, and a preference region selection from a collection of sets, which tries to estimate the regions where each candidate performs better. For the localized cross-validation, as long as the neighborhood size and the data splitting ratio are chosen properly, the locally better estimator will be selected with high probability. For preference region selection, when the complexity of the collection of the candidate sets for the preference region is properly controlled, the final procedure based on the empirically selected preference region behaves well in risk.

Besides global selection and localized selection of a model, sometimes it is advantageous to consider global combination of the candidates with weights globally determined, or localized combination where weights depend on the x value (see, e.g., Pan, Xiao and Huang, 2006). For high-dimensional or complex data, these alternatives can provide the flexibility needed to further improve performance of the candidate procedures.

7 Appendix

Proof of Theorem 1. Much of the proof follows from Yang (2007), except that localized selection is considered in this work. We first focus on the analysis without multiple data splittings. Without loss of generality, assume that δ_1 is asymptotically better than δ_2 at x_0.

Let I = I_{x_0,r_n} = {i : n_1 + 1 ≤ i ≤ n and X_i ∈ B(x_0; r_n)} denote the observations in the evaluation set with X_i close to x_0, and let ñ be the size of I. Because

LCV(f̂_{j,n_1}) = Σ_{i∈I} (Y_i − f̂_{j,n_1}(X_i))²
 = Σ_{i∈I} (f(X_i) − f̂_{j,n_1}(X_i) + ε_i)²
 = Σ_{i∈I} (f(X_i) − f̂_{j,n_1}(X_i))² + 2 Σ_{i∈I} ε_i (f(X_i) − f̂_{j,n_1}(X_i)) + Σ_{i∈I} ε_i²,

LCV(f̂_{1,n_1}) ≤ LCV(f̂_{2,n_1}) is equivalent to

2 Σ_{i∈I} ε_i (f̂_{2,n_1}(X_i) − f̂_{1,n_1}(X_i)) ≤ Σ_{i∈I} (f(X_i) − f̂_{2,n_1}(X_i))² − Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))².

Conditional on Z¹ = (X_i, Y_i)_{i=1}^{n_1} and X² = (X_{n_1+1}, ..., X_n), assuming Σ_{i∈I} (f(X_i) − f̂_{2,n_1}(X_i))² is larger than Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))², by Chebyshev's inequality, we have

P(LCV(f̂_{1,n_1}) > LCV(f̂_{2,n_1}) | Z¹, X²)
 ≤ min(1, 4σ² Σ_{i∈I} (f̂_{2,n_1}(X_i) − f̂_{1,n_1}(X_i))² / (Σ_{i∈I} (f(X_i) − f̂_{2,n_1}(X_i))² − Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))²)²).

Let Q_n denote the ratio in the upper bound in the above inequality and let S_n be the event that

Σ_{i∈I} (f(X_i) − f̂_{2,n_1}(X_i))² > Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))².

Then because

P(LCV(f̂_{1,n_1}) > LCV(f̂_{2,n_1}))
 = P({LCV(f̂_{1,n_1}) > LCV(f̂_{2,n_1})} ∩ S_n) + P({LCV(f̂_{1,n_1}) > LCV(f̂_{2,n_1})} ∩ S_n^c)
 ≤ E[P(LCV(f̂_{1,n_1}) > LCV(f̂_{2,n_1}) | Z¹, X²) I_{S_n}] + P(S_n^c)
 ≤ E[min(1, Q_n)] + P(S_n^c),

for consistency, it suffices to show P(S_n^c) → 0 and Q_n → 0 in probability. Suppose we can show that for each ǫ > 0, there exists α_ǫ > 0 such that when n is large enough,

P(Σ_{i∈I} (f(X_i) − f̂_{2,n_1}(X_i))² / Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))² ≥ 1 + α_ǫ) ≥ 1 − ǫ.   (2)

Then P(S_n) ≥ 1 − ǫ and thus P(S_n^c) → 0 as n → ∞. Since

Σ_{i∈I} (f̂_{2,n_1}(X_i) − f̂_{1,n_1}(X_i))² ≤ 2 Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))² + 2 Σ_{i∈I} (f(X_i) − f̂_{2,n_1}(X_i))²,

with probability no less than 1 − ǫ, we have

Q_n ≤ 8σ² (Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))² + Σ_{i∈I} (f(X_i) − f̂_{2,n_1}(X_i))²) / (Σ_{i∈I} (f(X_i) − f̂_{2,n_1}(X_i))² − Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))²)²
 ≤ 8σ² (1 + 1/(1 + α_ǫ)) / ((1 − 1/(1 + α_ǫ))² Σ_{i∈I} (f(X_i) − f̂_{2,n_1}(X_i))²).   (3)

From (2) and (3), to show P(S_n^c) → 0 and Q_n → 0 in probability, it suffices to show (2) and

Σ_{i∈I} (f(X_i) − f̂_{2,n_1}(X_i))² → ∞ in probability.   (4)

Suppose a slight relaxation of Condition 1 holds: for every ǫ > 0, there exists A_{n_1,ǫ} such that for j = 1, 2, when n_1 is large enough, P(‖f − f̂_{j,n_1}‖_∞ ≥ A_{n_1,ǫ}) ≤ ǫ. Let H_{n_1} denote the event {max(‖f − f̂_{1,n_1}‖_∞, ‖f − f̂_{2,n_1}‖_∞) ≤ A_{n_1,ǫ}}. Then on H_{n_1}, each W_i = f(X_i) − f̂_{j,n_1}(X_i) with X_i ∈ B(x_0; r_n) is bounded between −A_{n_1,ǫ} and A_{n_1,ǫ}. Conditional on Z¹, on H_{n_1},

Var_{Z¹}(W_{n_1+1}²) ≤ E_{Z¹} W_{n_1+1}⁴ = ‖f − f̂_{j,n_1}‖⁴_{4,x_0;r_n},

where the subscript Z¹ in Var_{Z¹} and E_{Z¹} is used to denote the conditional variance and expectation given Z¹. Thus conditional on Z¹, on H_{n_1}, by Bernstein's inequality (see, e.g., Pollard, 1984, page 193), for each x > 0,

P_{Z¹}(Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))² ≥ ñ ‖f − f̂_{1,n_1}‖²_{2,x_0;r_n} + x) ≤ exp(−(1/2) x² / (ñ ‖f − f̂_{1,n_1}‖⁴_{4,x_0;r_n} + A²_{n_1,ǫ} x/3)).

Taking x = β_n ñ ‖f − f̂_{1,n_1}‖²_{2,x_0;r_n}, the above inequality becomes

P_{Z¹}(Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))² ≥ (1 + β_n) ñ ‖f − f̂_{1,n_1}‖²_{2,x_0;r_n})
 ≤ exp(−(1/2) β_n² ñ ‖f − f̂_{1,n_1}‖⁴_{2,x_0;r_n} / (‖f − f̂_{1,n_1}‖⁴_{4,x_0;r_n} + A²_{n_1,ǫ} β_n ‖f − f̂_{1,n_1}‖²_{2,x_0;r_n}/3)).

Under Condition 2, for every ǫ > 0, there exists α_ǫ > 0 such that when n is large enough,

P(‖f − f̂_{2,n_1}‖²_{2,x_0;r_n} / ‖f − f̂_{1,n_1}‖²_{2,x_0;r_n} ≥ 1 + α_ǫ) ≥ 1 − ǫ.

Take β_n such that

1 + β_n = ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n} / ((1 + α_ǫ/2) ‖f − f̂_{1,n_1}‖²_{2,x_0;r_n}).

Then with probability at least 1 − ǫ, β_n ≥ (α_ǫ/2)/(1 + α_ǫ/2). Let D_n denote this event. Then on D_n, we have

β_n ‖f − f̂_{1,n_1}‖²_{2,x_0;r_n} = ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n}/(1 + α_ǫ/2) − ‖f − f̂_{1,n_1}‖²_{2,x_0;r_n}
 ≥ (α_ǫ/2) ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n} / ((1 + α_ǫ)(1 + α_ǫ/2)),

and

P_{Z¹}(Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))² ≥ ñ ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n}/(1 + α_ǫ/2))
 = P_{Z¹}(Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))² ≥ (1 + β_n) ñ ‖f − f̂_{1,n_1}‖²_{2,x_0;r_n})
 ≤ exp(−(α_ǫ²/8) ñ ‖f − f̂_{2,n_1}‖⁴_{2,x_0;r_n} / ((1 + α_ǫ)²(1 + α_ǫ/2)² ‖f − f̂_{1,n_1}‖⁴_{4,x_0;r_n} + ((1 + α_ǫ)²(1 + α_ǫ/2)/3) A²_{n_1,ǫ} ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n})).

If we have

ñ ‖f − f̂_{2,n_1}‖⁴_{2,x_0;r_n} / ‖f − f̂_{1,n_1}‖⁴_{4,x_0;r_n} → ∞ in probability,   (5)

ñ ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n} / A²_{n_1,ǫ} → ∞ in probability,   (6)

then the upper bound in the last inequality above converges to zero in probability. From these pieces, we can conclude that

P(Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))² ≥ ñ ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n}/(1 + α_ǫ/2)) ≤ 3ǫ + ǫ_{2,n},   (7)

for some ǫ_{2,n} → 0 as n → ∞. Indeed, for every given ǫ > 0, when n is large enough,

P(Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))² ≥ ñ ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n}/(1 + α_ǫ/2))
 ≤ P(H_{n_1}^c) + P(D_n^c) + P({Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))² ≥ ñ ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n}/(1 + α_ǫ/2)} ∩ H_{n_1} ∩ D_n)
 ≤ 3ǫ + E[P_{Z¹}({Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))² ≥ ñ ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n}/(1 + α_ǫ/2)} ∩ H_{n_1} ∩ D_n)]
 ≤ 3ǫ + E[exp(−(α_ǫ²/8) ñ ‖f − f̂_{2,n_1}‖⁴_{2,x_0;r_n} / ((1 + α_ǫ)²(1 + α_ǫ/2)² ‖f − f̂_{1,n_1}‖⁴_{4,x_0;r_n} + ((1 + α_ǫ)²(1 + α_ǫ/2)/3) A²_{n_1,ǫ} ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n}))]
 = 3ǫ + ǫ_{2,n},

where the expectation in the upper bound of the last inequality above (i.e., ǫ_{2,n}) converges to zero due to the convergence in probability to zero of the random variables in the exponential expression and their uniform integrability (since they are bounded above by 1), provided that (5) and (6) hold. The assertion of (7) then follows.

For the other estimator, similarly, for 0 < β_n < 1, we have

P_{Z¹}(Σ_{i∈I} (f(X_i) − f̂_{2,n_1}(X_i))² ≤ (1 − β_n) ñ ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n})
 ≤ exp(−(1/2) β_n² ñ ‖f − f̂_{2,n_1}‖⁴_{2,x_0;r_n} / (‖f − f̂_{2,n_1}‖⁴_{4,x_0;r_n} + A²_{n_1,ǫ} β_n ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n}/3)).

If we have

ñ β_n² ‖f − f̂_{2,n_1}‖⁴_{2,x_0;r_n} / ‖f − f̂_{2,n_1}‖⁴_{4,x_0;r_n} → ∞ in probability,   (8)

ñ β_n ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n} / A²_{n_1,ǫ} → ∞ in probability,   (9)

then following a similar argument used for f̂_{1,n_1}, we have

P(Σ_{i∈I} (f(X_i) − f̂_{2,n_1}(X_i))² ≤ (1 − β_n) ñ ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n}) → 0.   (10)

From this, if ñ ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n} → ∞ in probability and β_n is bounded away from 1, then (4) holds. If in addition we can choose β_n → 0, then for each given ǫ, we have (1 − β_n) ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n} > (1 + α'_ǫ) ‖f − f̂_{2,n_1}‖²_{2,x_0;r_n}/(1 + α_ǫ/2) for some small α'_ǫ > 0 when n_1 is large enough.

Now for every ǫ' > 0, we can find ǫ > 0 such that 3ǫ ≤ ǫ'/3, and there exists an integer n_0 such that when n ≥ n_0, the probability in (10) is upper bounded by ǫ'/3 and ǫ_{2,n} ≤ ǫ'/3. Consequently, when n ≥ n_0,

P(Σ_{i∈I} (f(X_i) − f̂_{2,n_1}(X_i))² / Σ_{i∈I} (f(X_i) − f̂_{1,n_1}(X_i))² ≥ 1 + α'_ǫ) ≥ 1 − ǫ'.

Recall that we needed the conditions (5), (6), (8), and (9) for (2) to hold. Under Condition 3, ñ ‖f − f̂_{2,n_1}‖⁴_{2,x_0;r_n} / ‖f − f̂_{1,n_1}‖⁴_{4,x_0;r_n} is lower bounded in order in probability by ñ ‖f − f̂_{2,n_1}‖⁴_{2,x_0;r_n} / (M_{n_1}⁴ ‖f − f̂_{1,n_1}‖⁴_{2,x_0;r_n}). From all above, since f̂_{1,n_1} and f̂_{2,n_1} converge exactly at rates p_n and q_n respectively under the L_2 loss, we know that for the conclusion of Theorem 1 to hold, it suffices to have the requirements: for every ǫ > 0, for some β_n → 0,

ñ β_n²/M_{n_1}⁴ → ∞,  ñ q_{n_1}⁴/(p_{n_1}⁴ M_{n_1}⁴) → ∞,  ñ β_n q_{n_1}²/A²_{n_1,ǫ} → ∞,  ñ q_{n_1}²/A²_{n_1,ǫ} → ∞,  ñ q_{n_1}² → ∞.

Note that ñ is a random variable that has a binomial distribution with n_2 trials and success probability P(X_i ∈ B(x_0; r_n)). Under the assumption that the density of X is bounded away from zero in a neighborhood of x_0, P(X_i ∈ B(x_0; r_n)) is of order r_n^d. We need to lower bound ñ in probability. For 1 ≤ j ≤ n_2, let D_j be independent Bernoulli random variables with success probability β. Then, applying Bernstein's inequality (see, e.g., Pollard, 1984, p. 193), we have

P(Σ_{j=1}^{n_2} D_j ≤ n_2 β/2) ≤ exp(−3 n_2 β/28).   (11)

For our setting, β is of order r_n^d. Thus ñ is at least of order n_2 r_n^d in probability if n_2 r_n^d → ∞. Consequently, the earlier requirements on the splitting ratio become that for every ǫ > 0, for some β_n → 0,

n_2 r_n^d β_n²/M_{n_1}⁴ → ∞,  n_2 r_n^d q_{n_1}⁴/(p_{n_1}⁴ M_{n_1}⁴) → ∞,  n_2 r_n^d β_n q_{n_1}²/A²_{n_1,ǫ} → ∞,  n_2 r_n^d q_{n_1}²/A²_{n_1,ǫ} → ∞,  n_2 r_n^d q_{n_1}² → ∞.

Under Condition 1, for every ǫ > 0, there exists a constant B_ǫ > 0 such that P(‖f − f̂_{j,n_1}‖_∞ ≥ B_ǫ A_{n_1}) ≤ ǫ when n_1 is large enough. That is, for a given ǫ > 0, we can take A_{n_1,ǫ} = O(A_{n_1}). Therefore, if we have n_2 r_n^d M_{n_1}^{-4} → ∞ and n_2 r_n^d q_{n_1}²/(1 + A_{n_1}²) → ∞, then we can find β_n → 0 such that the above 5 displayed requirements are all satisfied.

Let π denote a permutation of the observations, and let Π denote a set of such permutations. Let LCV^π(f̂_{j,n_1}) be the L-CV criterion value under the permutation π. If LCV^π(f̂_{1,n_1}) ≤ LCV^π(f̂_{2,n_1}), then let τ_π = 1, and otherwise let τ_π = 0. Let W denote the values of (X_1, Y_1), ..., (X_n, Y_n) ignoring the orders. Under the i.i.d. assumption on the observations, obviously, conditional on W, every ordering of these values has exactly the same probability, and thus

P(LCV(f̂_{1,n_1}) ≤ LCV(f̂_{2,n_1})) = E[P(LCV(f̂_{1,n_1}) ≤ LCV(f̂_{2,n_1}) | W)] = E[Σ_{π∈Π} τ_π / |Π|].

From the earlier analysis, we know that P(LCV(f̂_{1,n_1}) ≤ LCV(f̂_{2,n_1})) → 1. Thus E[Σ_{π∈Π} τ_π / |Π|] → 1. Since Σ_{π∈Π} τ_π / |Π| is between 0 and 1, for its expectation to converge to 1, we must have Σ_{π∈Π} τ_π / |Π| → 1 in probability. This completes the proof of Theorem 1.

Proof of Theorem 2: Let Ã_n be the minimizer of E‖f̂(·; A) − f‖² over A ∈ 𝒜_{n,0}. By the results of Wegkamp (2003), Section 2, in particular Theorem 2.1, we have

E‖f̂(·; Â_n) − f‖² ≤ 2 E‖f̂(·; Ã_n) − f‖² + (1/n_2)(26C² log(|𝒜_{n,0}| e^{1/26C²}) + 16σ² log(|𝒜_{n,0}| e^{1/16σ²}))
 ≤ 2 E‖f̂(·; Ã_n) − f‖² + B_1/n_2 + B_2 H_n(ǫ_n)/n_2,

where B_1 and B_2 are constants that depend only on C and σ². Note that

∫ (f̂(x; A) − f̂(x; B))² dμ ≤ 2 ∫ (f̂_1(x) − f(x))² (I{x ∈ A} − I{x ∈ B})² dμ + 2 ∫ (f̂_2(x) − f(x))² (I{x ∉ A} − I{x ∉ B})² dμ
 = 2 ∫_{(B∩A^c) ∪ (B^c∩A)} (f̂_1(x) − f(x))² dμ + 2 ∫_{(B∩A^c) ∪ (B^c∩A)} (f̂_2(x) − f(x))² dμ
 ≤ 4C² μ(B Δ A).

Together with the fact that 𝒜_{n,0} is an ǫ_n-net under d, we have

E‖f̂(·; Ã_n) − f‖² ≤ 2 E‖f̂(·; A*_n) − f‖² + 8C² ǫ_n.

The conclusions then follow. This completes the proof of Theorem 2.

8 Acknowledgments

This research was supported by US National Science Foundation CAREER Grant DMS00933. We thank three referees and the editors for helpful comments on improving the paper.

References

[1] Allen, D.M. (1974) The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16, 125-127.
[2] Birgé, L. (1986) On estimating a density using Hellinger distance and some other strange facts. Probability Theory and Related Fields, 71, 271-291.
[3] Burman, P. (1989) A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76, 503-514.
[4] Geisser, S. (1975) The predictive sample reuse method with applications. Journal of the American Statistical Association, 70, 320-328.
[5] Kolmogorov, A.N. and Tihomirov, V.M. (1959) ǫ-entropy and ǫ-capacity of sets in function spaces. Uspehi Mat. Nauk, 14, 3-86.
[6] Li, K.-C. (1987) Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: Discrete index set. The Annals of Statistics, 15, 958-975.
[7] Pan, W., Xiao, G. and Huang, X. (2006) Using input dependent weights for model combination and model selection with multiple sources of data. Statistica Sinica, 16, 523-540.
[8] Pollard, D. (1984) Convergence of Stochastic Processes. Springer, New York.
[9] Shao, J. (1993) Linear model selection by cross-validation. Journal of the American Statistical Association, 88, 486-494.
[10] Shao, J. (1997) An asymptotic theory for linear model selection (with discussion). Statistica Sinica, 7, 221-264.
[11] Stone, M. (1974) Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Ser. B, 36, 111-147.
[12] van de Geer, S. (1993) Hellinger-consistency of certain nonparametric maximum likelihood estimators. Annals of Statistics, 21, 14-44.
[13] Wegkamp, M.H. (2003) Model selection in nonparametric regression. Annals of Statistics, 31, 252-273.
[14] Yatracos, Y.G. (1985) Rates of convergence of minimum distance estimators and Kolmogorov's entropy. Annals of Statistics, 13, 768-774.
[15] Yang, Y. (2007) Consistency of cross validation for comparing regression procedures. Accepted by Annals of Statistics.
[16] Yang, Y. and Barron, A.R. (1999) Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 27, 1564-1599.
[17] Zhang, P. (1993) Model selection via multifold cross validation. Annals of Statistics, 21, 299-313.