AN ASYMPTOTIC THEORY FOR LINEAR MODEL SELECTION


Statistica Sinica 7(1997)

Jun Shao
University of Wisconsin

Abstract: In the problem of selecting a linear model to approximate the true unknown regression model, some necessary and/or sufficient conditions are established for the asymptotic validity of various model selection procedures such as Akaike's AIC, Mallows' C_p, Shibata's FPE_λ, Schwarz's BIC, the generalized AIC, cross-validation, and generalized cross-validation. It is found that these selection procedures can be classified into three classes according to their asymptotic behavior. Under some fairly weak conditions, the selection procedures in one class are asymptotically valid if there exist fixed-dimension correct models; the selection procedures in another class are asymptotically valid if no fixed-dimension correct model exists. The procedures in the third class are compromises of the procedures in the first two classes. Some empirical results are also presented.

Key words and phrases: AIC, asymptotic loss efficiency, BIC, consistency, C_p, cross-validation, GIC, squared error loss.

1. Introduction

Let y_n = (y_1,...,y_n)' be a vector of n independent responses and X_n = (x_1,...,x_n)' be an n × p_n matrix whose ith row x_i' is the value of a p_n-vector of explanatory variables associated with y_i. For inference purposes, a class of models, indexed by α ∈ A_n, is used to characterize the relation between the mean response μ_n = E(y_n | X_n) and the explanatory variables. If A_n contains more than one model, then we need to select a model from A_n using the given X_n and the data vector y_n. The following are some typical examples.

Example 1. Linear regression. Suppose that p_n = p for all n and μ_n = X_n β with an unknown p-vector β. Write β = (β_1', β_2')' and X_n = (X_{1n}, X_{2n}). It is suspected that the sub-vector β_2 = 0, i.e., X_{2n} is actually not related to μ_n. Then we may propose the following two models:

Model 1: μ_n = X_{1n} β_1;
Model 2: μ_n = X_n β.

In this case, A_n = {1, 2}. It is well known that the least squares fitting under model 1 is more efficient than that under model 2 if β_2 = 0.
More generally, we can consider models

μ_n = X_n(α) β(α),   (1.1)

where α is a subset of {1,...,p} and β(α) (or X_n(α)) contains the components of β (or columns of X_n) that are indexed by the integers in α. In this case A_n consists of some distinct subsets of {1,...,p}. If A_n contains all nonempty subsets of {1,...,p}, then the number of models in A_n is 2^p − 1.

Example 2. One-mean versus k-mean. Suppose that observations are from k groups. Each group has r observations that are identically distributed. Thus, n = kr, where k = k_n and r = r_n are integers. Here we need to select one model from the following two models: (1) the one-mean model, i.e., the k groups have a common mean; (2) the k-mean model, i.e., the k groups have different means. To use the same formula as that in (1.1), we define p_n = k,

        ⎡ 1_r   0    ···   0  ⎤
  X_n = ⎢ 1_r  1_r   ···   0  ⎥
        ⎢  ⋮          ⋱        ⎥
        ⎣ 1_r   0    ···  1_r ⎦

and β = (μ_1, μ_2 − μ_1,...,μ_k − μ_1)', where 1_r denotes the r-vector of ones. Then A_n = {α_1, α_k}, where α_1 = {1} and α_k = {1,...,k}.

Example 3. Linear approximations to a response surface. Suppose that we wish to select the best approximation to the true mean response surface from a class of linear models. Note that the approximation is exact if the response surface is actually linear and is in A_n. The proposed models are μ_n = X_n(α) β_n(α), α ∈ A_n, where X_n(α) is a sub-matrix of X_n and β_n(α) is a sub-vector of a p_n-vector β_n whose components have to be estimated. As a more specific example, we consider the situation where we try to approximate a one-dimensional curve by a polynomial, i.e., μ_n = X_n(α) β_n(α) with the ith row of X_n(α) being (1, t_i, t_i²,...,t_i^{h−1}), i = 1,...,n. In this case A_n = {α_h, h = 1,...,p_n} and α_h = {1,...,h} corresponds to a polynomial of order h used to approximate the true model. The largest possible order of the polynomial may increase as n increases, since the more data we have, the more terms we can afford to use in the polynomial approximation.

We assume in this paper that the models in A_n are linear models and the least squares fitting is used under each proposed model. Each model in A_n is denoted by α, a subset of {1,...,p_n}.
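The nested polynomial class of Example 3 is easy to set up numerically. A minimal sketch (the function names and the use of NumPy are ours, not from the paper):

```python
import numpy as np

def poly_design(t, p):
    """n x p design matrix for Example 3: row i is (1, t_i, t_i^2, ..., t_i^{p-1}).

    The model alpha_h = {1,...,h} then uses the first h columns, i.e. a
    polynomial of order h.
    """
    return np.vander(np.asarray(t, dtype=float), N=p, increasing=True)

def candidate_models(p):
    """The candidate class A_n = {alpha_h, h = 1,...,p}, as 0-based column index sets."""
    return [list(range(h)) for h in range(1, p + 1)]
```

With this layout, selecting a model amounts to choosing how many leading columns of the Vandermonde matrix to keep.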
After observing the vector y_n, our concern is to select a model α from A_n so that the squared error loss

L_n(α) = ||μ_n − μ̂_n(α)||²/n   (1.2)

be as small as possible, where ||·|| is the Euclidean norm and μ̂_n(α) is the least squares estimator (LSE) of μ_n under model α. Note that minimizing L_n(α) is equivalent to minimizing the average prediction error E[n^{-1}||z_n − μ̂_n(α)||² | y_n], where z_n = (z_1,...,z_n)' and z_i is a future observation associated with x_i and is independent of y_i.

A considerable number of selection procedures have been proposed in the literature, e.g., the AIC method (Akaike (1970)); the C_p method (Mallows (1973)); the BIC method (Schwarz (1978), Hannan and Quinn (1979)); the FPE_λ method (Shibata (1984)); the generalized AIC such as the GIC method (Nishii (1984), Rao and Wu (1989)) and its analogues (Pötscher (1989)); the delete-1 cross-validation (CV) method (Allen (1974), Stone (1974)); the generalized CV (GCV) method (Craven and Wahba (1979)); the delete-d CV method (Geisser (1975), Burman (1989), Shao (1993), Zhang (1993)); and the PMDL and PLS methods (Rissanen (1986), Wei (1992)). Some asymptotic results in assessing these selection procedures have been established in some particular situations. Nishii (1984) and Rao and Wu (1989) showed that in Example 1, the BIC and GIC are consistent (definitions of consistency will be given in Section 2), whereas the AIC and C_p are inconsistent. On the other hand, Stone (1979) showed that in some situations (Example 2), the BIC is inconsistent but the AIC and C_p are consistent. In Example 3, Shibata (1981) and Li (1987) showed that the AIC, the C_p, and the delete-1 CV are asymptotically correct in some sense. However, Shao (1993) showed that in Example 1, the delete-1 CV is inconsistent and the delete-d CV is consistent, provided that d/n → 1. These results do not provide a clear picture of the performance of the various selection procedures. Some of these conclusions are obviously contrary to each other, but this is because they are obtained in quite different circumstances.
A crucial factor that almost determines the asymptotic performance of various model selection procedures is whether or not A_n contains some correct models in which the dimensions of the regression parameter vectors do not increase with n. This will be explored in detail in the current paper. The purpose of this paper is to provide an asymptotic theory which shows when the various selection procedures are asymptotically correct (or incorrect) under an asymptotic framework covering all situations described in Examples 1-3. After introducing some notation and definitions in Section 2, we study the asymptotic behavior of the GIC method in Section 3 and other selection procedures cited above in Section 4. Some numerical examples are given in Section 5. Section 6 contains some technical details.

2. Notation and Definitions

Throughout the paper we assume that (X_n'X_n)^{-1} exists and that the minimum and maximum eigenvalues of X_n'X_n are of order n. The matrices X_n, n = 1, 2,..., are considered to be non-random. The results in this paper are also valid in the almost sure sense when the X_n are random, provided that the required conditions involving X_n hold for almost all sequences X_n, n = 1, 2,...

Let A_n be a class of proposed models (subsets of {1,...,p_n}) for selection. The number of models in A_n is finite, but may depend on n. For α ∈ A_n, the proposed model is μ_n = X_n(α) β_n(α), where X_n(α) is an n × p_n(α) submatrix of the n × p_n matrix X_n and β_n(α) is a p_n(α) × 1 sub-vector of an unknown p_n × 1 vector β_n. Without loss of generality, we assume that the largest model ᾱ_n = {1,...,p_n} is always in A_n. The dimension of β_n(α), p_n(α), will be called the dimension of the model α. Under model α, the LSE of μ_n is μ̂_n(α) = H_n(α) y_n, where H_n(α) = X_n(α)[X_n(α)'X_n(α)]^{-1} X_n(α)'.

A proposed model α ∈ A_n is said to be correct if μ_n = X_n(α) β_n(α) is actually true. Note that A_n may not contain a correct model (Example 3); a correct model is not necessarily the best model, since there may be several correct models in A_n (Examples 1 and 2) and there may be an incorrect model having a smaller loss than the best correct model (Example 2). Let A_n^c = {α ∈ A_n : μ_n = X_n(α) β_n(α)} denote all the proposed models that are actually correct models. It is possible that A_n^c is empty or A_n^c = A_n.

Let e_n = y_n − μ_n. It is assumed that the components of e_n are independent and identically distributed with V(e_n | X_n) = σ² I_n, where I_n is the identity matrix of order n. The loss defined in (1.2) is equal to

L_n(α) = Δ_n(α) + e_n'H_n(α)e_n/n,

where Δ_n(α) = ||μ_n − H_n(α)μ_n||²/n. Note that Δ_n(α) = 0 when α ∈ A_n^c. The risk (the expected average squared error) is

R_n(α) = E[L_n(α)] = Δ_n(α) + σ² p_n(α)/n.

Let α̂_n denote the model selected using a given selection procedure and let α_L be a model minimizing L_n(α) over α ∈ A_n. The selection procedure is said to be consistent if

P{α̂_n = α_L} → 1   (2.1)

(all limiting processes are understood to be as n → ∞).
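In this notation, μ̂_n(α) and the loss L_n(α) are straightforward to compute. A minimal sketch (the function names are ours), using a least squares solve rather than forming H_n(α) explicitly:

```python
import numpy as np

def mu_hat(X, y, alpha):
    """LSE of mu under model alpha: mu_hat(alpha) = H(alpha) y, where
    H(alpha) = X(alpha)[X(alpha)'X(alpha)]^{-1}X(alpha)'."""
    Xa = X[:, alpha]
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    return Xa @ beta

def loss(mu, fitted):
    """Average squared error loss L_n(alpha) = ||mu - mu_hat(alpha)||^2 / n."""
    mu = np.asarray(mu, dtype=float)
    return float(np.sum((mu - fitted) ** 2)) / len(mu)
```

As a sanity check, when α is a correct model and y_n = μ_n (no noise), the fit reproduces μ_n and the loss is zero, while an incorrect model leaves a positive loss.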
Note that (2.1) implies

P{L_n(α̂_n) = L_n(α_L)} → 1.   (2.2)

Thus, μ̂_n(α̂_n) is in this sense asymptotically as efficient as the best estimator among μ̂_n(α), α ∈ A_n. (2.1) and (2.2) are equivalent if L_n(α) has a unique minimum for all large n. The consistency defined in (2.1) is in terms of model selection, i.e., we treat α̂_n as an estimator of α_L (it is a well defined estimator if α_L is non-random, e.g., in Example 1). This consistency is not related to the consistency of μ̂_n(α̂_n) as an estimator of μ_n, i.e., L_n(α̂_n) →_p 0. In fact, it may not be worthwhile to discuss the consistency of μ̂_n(α̂_n), since sometimes there is no consistent estimator of μ_n (e.g., min_{α∈A_n} L_n(α) does not tend to 0 in probability) and sometimes there are too many consistent estimators of μ_n (e.g., max_{α∈A_n} L_n(α) →_p 0, in which case μ̂_n(α) is consistent for any α).

In some cases a selection procedure does not have property (2.1), but α̂_n is still close to α_L in the following sense that is weaker than (2.1):

L_n(α̂_n)/L_n(α_L) →_p 1,   (2.3)

where →_p denotes convergence in probability. A selection procedure satisfying (2.3) is said to be asymptotically loss efficient, i.e., α̂_n is asymptotically as efficient as α_L in terms of the loss L_n(α). Since the purpose of model selection is to minimize the loss L_n(α), (2.3) is an essential asymptotic requirement for a selection procedure. Clearly, consistency in the sense of (2.1) implies asymptotic loss efficiency in the sense of (2.3). In some cases (e.g., Examples 1 and 2), consistency is the same as asymptotic loss efficiency. The proof of the following result is given in Section 6.

Proposition 1. Suppose that

p_n/n → 0,   (2.4)

lim inf_n min_{α∈A_n−A_n^c} Δ_n(α) > 0,   (2.5)

and A_n^c is nonempty for sufficiently large n. Then (2.1) is equivalent to (2.3) if either p_n(α_L) →_p ∞ or A_n^c contains exactly one model for sufficiently large n.

The following regularity condition will often be used in establishing asymptotic results:

Σ_{α∈A_n−A_n^c} 1/[n R_n(α)]^l → 0,   (2.6)

where l is some fixed positive integer such that E(y_1 − μ_1)^{4l} < ∞. Note that condition (2.6) is exactly the same as condition (A.3) in Li (1987) when A_n^c is

empty; but Li's condition (A.3) may not hold when A_n^c is not empty. If the number of models in A_n is bounded (Examples 1 and 2), then (2.6) with l = 1 is the same as

n min_{α∈A_n−A_n^c} R_n(α) → ∞,   (2.7)

the condition (A.3′) in Li (1987). When A_n = {α_h, h = 1,...,p_n} with α_h = {1,...,h} (e.g., polynomial approximation in Example 3), Li (1987) showed that condition (2.6) with l = 2 is the same as (2.7). Under an additional assumption that e_n is normal, we may replace (2.6) by Σ_{α∈A_n−A_n^c} δ^{nR_n(α)} → 0 for any 0 < δ < 1, which is Assumption 2 in Shibata (1981).

3. The GIC_{λ_n} Method

Many model selection procedures are identical or equivalent to the procedure which minimizes

Γ_{n,λ_n}(α) = S_n(α)/n + λ_n σ̂_n² p_n(α)/n   (3.1)

over α ∈ A_n, where S_n(α) = ||y_n − μ̂_n(α)||², σ̂_n² is an estimator of σ², and {λ_n} is a sequence of non-random numbers ≥ 2 with λ_n/n → 0. This procedure will be called the GIC_{λ_n} method. If σ̂_n² = S_n(ᾱ_n)/(n − p_n), ᾱ_n = {1,...,p_n}, then the GIC_{λ_n} with λ_n → ∞ is the GIC method in Rao and Wu (1989); the GIC_{λ_n} with λ_n ≡ 2 is the C_p method in Mallows (1973); and the GIC_{λ_n} with λ_n ≡ λ > 2 is the FPE_λ method in Shibata (1984). Since the GIC_{λ_n} is a good representative of the model selection procedures cited in Section 1, we first study its asymptotic behavior. Let the model selected by minimizing Γ_{n,λ_n}(α) be α̂_{n,λ_n}.

Consider first the case of λ_n ≡ 2. Assume that σ̂_n² is a consistent estimator of σ². It is shown in Section 6 that

Γ_{n,2}(α) = ||e_n||²/n − e_n'H_n(α)e_n/n + 2σ̂_n² p_n(α)/n,   α ∈ A_n^c,
Γ_{n,2}(α) = ||e_n||²/n + L_n(α) + o_p(L_n(α)),   α ∈ A_n − A_n^c,   (3.2)

where the equality for the case of α ∈ A_n − A_n^c holds under condition (2.6) and the o_p is uniform in α ∈ A_n − A_n^c. It follows directly from (3.2) that α̂_{n,2} is asymptotically loss efficient in the sense of (2.3) if there is no correct model in A_n, i.e., A_n^c is empty. If A_n^c is not empty but contains exactly one model for each n, say A_n^c = {α_n^c}, then α̂_{n,2} is also asymptotically loss efficient. This can be shown by using (3.2) as follows. If p_n(α_n^c) → ∞, then

2σ̂_n² p_n(α_n^c)/n − e_n'H_n(α_n^c)e_n/n = σ̂_n² p_n(α_n^c)/n + o_p(σ̂_n² p_n(α_n^c)/n) = L_n(α_n^c) + o_p(L_n(α_n^c)),

which, together with (3.2), implies that Γ_{n,2}(α) = ||e_n||²/n + L_n(α) + o_p(L_n(α)) uniformly in α ∈ A_n and, therefore, α̂_{n,2} is asymptotically loss efficient. If p_n(α_n^c) is fixed, then (2.5) holds (Nishii (1984)), which implies that α_L = α_n^c and min_{α∈A_n−A_n^c} [Γ_{n,2}(α) − Γ_{n,2}(α_n^c)] does not tend to 0 in probability and, therefore, P{α̂_{n,2} = α_L} → 1, i.e., α̂_{n,2} is consistent in the sense of (2.1). As the following example indicates, however, α̂_{n,2} may not be an asymptotically loss efficient procedure when A_n^c contains more than one model.

Example 4. Suppose that A_n = A_n^c = {α_1, α_2}, i.e., A_n contains two models and both models are correct. Assume that α_1 ⊂ α_2. Let p_{1n} and p_{2n} be the dimensions of the models α_1 and α_2, respectively. Then p_{1n} < p_{2n} and Q_n = H_n(α_2) − H_n(α_1) is a projection matrix of rank p_{2n} − p_{1n}. Since S_n(α_i) = e_n'e_n − e_n'H_n(α_i)e_n, α̂_{n,2} = α_1 if and only if 2σ̂_n²(p_{2n} − p_{1n}) > e_n'Q_n e_n.

Case 1. p_{1n} → ∞. If p_{2n} − p_{1n} → ∞, then e_n'Q_n e_n/(p_{2n} − p_{1n}) →_p σ² and P{α̂_{n,2} = α_1} → 1, i.e., α̂_{n,2} is consistent. If p_{2n} − p_{1n} ≤ q for a fixed positive integer q, then p_{2n}/p_{1n} → 1, in which case L_n(α_2)/L_n(α_1) →_p 1, i.e., any selection procedure is asymptotically loss efficient.

Case 2. p_{1n} is bounded. If p_{2n} − p_{1n} → ∞, then we still have e_n'Q_n e_n/(p_{2n} − p_{1n}) →_p σ², which implies that α̂_{n,2} is consistent. Assume that p_{2n} − p_{1n} is bounded and that for any fixed integer q and constant c > 2,

lim inf_n inf_{Q∈Q_{n,q}} P(e_n'Q e_n > cσ²q) > 0,   (3.3)

where Q_{n,q} = {all n × n projection matrices of rank q}. Note that condition (3.3) holds if e_n ~ N(0, σ²I_n). From (3.3) and the fact that p_{1n} and p_{2n} − p_{1n} are bounded, the ratio

L_n(α̂_{n,2})/L_n(α_1) = I(α̂_{n,2} = α_1) + [L_n(α_2)/L_n(α_1)] I(α̂_{n,2} = α_2) = 1 + W_n I(α̂_{n,2} = α_2)

does not tend to 1, where W_n = e_n'Q_n e_n / e_n'H_n(α_1)e_n and I(C) is the indicator function of the set C. For example, when e_n ~ N(0, σ²I_n), p_{1n} W_n/(p_{2n} − p_{1n}) is an F-random variable with degrees of freedom p_{2n} − p_{1n} and p_{1n}. Hence α̂_{n,2} is not asymptotically loss efficient.

In Example 4, α̂_{n,2} is asymptotically loss efficient if and only if A_n^c does not contain two models with fixed dimensions. This is actually true in general.
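The criterion (3.1) and the selector α̂_{n,λ_n} translate directly into code. A minimal sketch (function names are ours; sigma2_hat must be supplied, e.g. the replicate-based estimator of Remark 2 or S_n(ᾱ_n)/(n − p_n)):

```python
import numpy as np

def gic_score(X, y, alpha, lam, sigma2_hat):
    """Gamma_{n,lam}(alpha) = S_n(alpha)/n + lam * sigma2_hat * p_n(alpha)/n,
    where S_n(alpha) is the residual sum of squares under model alpha.
    lam = 2 gives Mallows' C_p; lam = log(n) mimics the BIC."""
    n = len(y)
    Xa = X[:, alpha]
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    resid = y - Xa @ beta
    return float(resid @ resid) / n + lam * sigma2_hat * len(alpha) / n

def gic_select(X, y, models, lam, sigma2_hat):
    """alpha_hat_{n,lam}: the candidate model minimizing Gamma_{n,lam}."""
    return min(models, key=lambda a: gic_score(X, y, a, lam, sigma2_hat))
```

With two nested correct models, the residual sums of squares are comparable and the penalty decides, so the selector returns the smaller model; this is the mechanism behind Example 4.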
Let α_n^c be the model in A_n^c with the smallest dimension.

Theorem 1. Suppose that (2.6) holds and that σ̂_n² is consistent for σ².
(i) If A_n^c contains at most one model for all n, then α̂_{n,2} is asymptotically loss efficient in the sense of (2.3). Furthermore, if A_n^c contains a unique model with fixed dimension for all n, then α̂_{n,2} is consistent in the sense of (2.1).
(ii) Assume that A_n^c contains more than one model for sufficiently large n. If

Σ_{α∈A_n^c} 1/[p_n(α)]^m → 0   (3.4)

for some positive integer m such that E(y_1 − μ_1)^{4m} < ∞, then α̂_{n,2} is asymptotically loss efficient. If (3.4) does not hold but

Σ_{α∈A_n^c, α≠α_n^c} 1/[p_n(α) − p_n(α_n^c)]^m → 0   (3.5)

for some positive integer m such that E(y_1 − μ_1)^{4m} < ∞, then α̂_{n,2} is asymptotically loss efficient.
(iii) Assume that A_n^c contains more than one model for sufficiently large n and that (3.3) holds. Then a necessary condition for α̂_{n,2} being asymptotically loss efficient is that

p_n(α_n^c) → ∞ or min_{α∈A_n^c, α≠α_n^c} [p_n(α) − p_n(α_n^c)] → ∞.   (3.6)

(iv) If the number of models in A_n^c is bounded, or if m = 2 and A_n = {α_i, i = 1,...,p_n} with α_i = {1,...,i}, then condition (3.6) is also sufficient for the asymptotic loss efficiency of α̂_{n,2}.

Remark 1. Condition (3.6) means that A_n does not contain two correct models with fixed dimensions.

Remark 2. In Theorem 1, the estimator σ̂_n² is required to be consistent for σ². A popular choice of σ̂_n² is S_n(ᾱ_n)/(n − p_n), the sum of squared residuals (under the largest model in A_n) over its degrees of freedom. This estimator is consistent if A_n^c is not empty, but is not necessarily consistent when A_n^c is empty, i.e., there is no correct model in A_n. We shall further discuss this issue in Section 4. If there are a few replicates at each x_i, then we can compute the within-group sample variance for each i, and the average of the within-group sample variances is always a consistent estimator of σ².

Theorem 1 indicates that asymptotically, the GIC_{λ_n} method with λ_n ≡ 2 can be used to find (1) the best model among incorrect models; (2) the better

model between a correct model and an incorrect model; but it is too crude to be useful in distinguishing correct models with fixed dimensions, i.e., it tends to overfit (select a correct model with an unnecessarily large dimension).

From (3.1), Γ_{n,λ_n}(α) is a sum of two components: S_n(α)/n, which measures the goodness of fit of model α, and λ_n σ̂_n² p_n(α)/n, which is a penalty on the use of models with large dimensions. In view of the fact that the use of λ_n ≡ 2 tends to overfit, it is natural to consider a larger λ_n in (3.1), i.e., to put a heavier penalty on the use of models with large dimensions. The reason why α̂_{n,2} may not be asymptotically loss efficient is that the minimizer of

Γ_{n,2}(α) − ||e_n||²/n = [2σ̂_n² p_n(α) − e_n'H_n(α)e_n]/n,

which is considered as a function of α ∈ A_n^c, may not be the same as the minimizer of L_n(α) = e_n'H_n(α)e_n/n, whose expectation is σ² p_n(α)/n. What will occur if we use a λ_n → ∞? Similar to the expansion (3.2), we have

Γ_{n,λ_n}(α) = ||e_n||²/n + λ_n σ̂_n² p_n(α)/n − e_n'H_n(α)e_n/n,   α ∈ A_n^c,
Γ_{n,λ_n}(α) = ||e_n||²/n + L_n(α) + (λ_n σ̂_n² − 2σ²) p_n(α)/n + o_p(L_n(α)),   α ∈ A_n − A_n^c,   (3.7)

where the equality for the case of α ∈ A_n − A_n^c holds under condition (2.6). If

max_{α∈A_n^c} e_n'H_n(α)e_n/(λ_n σ̂_n² p_n(α)) →_p 0,   (3.8)

then, for α ∈ A_n^c, Γ_{n,λ_n}(α) − ||e_n||²/n is dominated by the term λ_n σ̂_n² p_n(α)/n, which has the same minimizer as L_n(α) = e_n'H_n(α)e_n/n. Hence,

P{α̂_{n,λ_n} ∈ A_n^c but α̂_{n,λ_n} ≠ α_n^c} → 0,   (3.9)

where α̂_{n,λ_n} is the model selected using the GIC_{λ_n} and α_n^c is the model in A_n^c with the smallest dimension. This means that the GIC_{λ_n} method picks the best model in A_n^c as long as (3.8) holds, which is implied by the weak condition

lim sup_n Σ_{α∈A_n^c} 1/[p_n(α)]^m < ∞   (3.10)

for some positive integer m such that E(y_1 − μ_1)^{4m} < ∞. Note that (3.10) holds if the number of models in A_n is bounded (Examples 1 and 2) or if m = 2 and A_n = {α_i, i = 1,...,p_n} with α_i = {1,...,i} (polynomial approximation in Example 3).

For the asymptotic correctness of the GIC_{λ_n} method, the remaining question is whether it can assess the models in A_n − A_n^c. Unfortunately, the GIC_{λ_n} tends to select a model with a small dimension and, therefore, may fail to be asymptotically loss efficient if models with small dimensions have large values of L_n(α). More precisely, if there are α_1 and α_2 in A_n − A_n^c such that

lim_n L_n(α_1)/L_n(α_2) > 1 but lim_n [L_n(α_1) + (λ_n σ̂_n² − 2σ²)p_n(α_1)/n] / [L_n(α_2) + (λ_n σ̂_n² − 2σ²)p_n(α_2)/n] < 1   (3.11)

(which implies lim_n p_n(α_1)/p_n(α_2) < 1), then the GIC_{λ_n} is not asymptotically loss efficient. A necessary condition for α̂_{n,λ_n} to be asymptotically loss efficient is that (3.11) does not hold for any α_1 and α_2. Of course, (3.11) is almost impossible to check. In the following theorem we provide some sufficient conditions for the asymptotic loss efficiency of the GIC_{λ_n}.

Theorem 2. Suppose that (2.6) and (3.10) hold and that σ̂_n² is bounded away from 0 and ∞ in probability.
(i) A sufficient condition for the asymptotic loss efficiency of α̂_{n,λ_n} is that (2.5) holds and λ_n is chosen to satisfy

λ_n → ∞ and λ_n p_n/n → 0.   (3.12)

(ii) If A_n contains at least one correct model with fixed dimension for sufficiently large n, λ_n → ∞ and λ_n/n → 0, then α̂_{n,λ_n} is consistent.

Remark 3. Unlike the case of λ_n ≡ 2, it is not required in Theorem 2 that σ̂_n² be a consistent estimator of σ².

We now apply Theorems 1 and 2 to Examples 1-3.

Example 1. (continued) We use the notation given by (1.1). In this example (2.4), (2.5), (2.6) and (3.10) hold. Note that A_n^c is not empty and consistency in the sense of (2.1) is the same as asymptotic loss efficiency in the sense of (2.3) (Proposition 1). By Theorem 1, α̂_{n,2} is consistent if and only if ᾱ = {1,...,p} is the only correct model. By Theorem 2(ii), α̂_{n,λ_n} is always consistent if λ_n → ∞ and λ_n/n → 0.

Example 2. (continued) Note that n = kr → ∞ means that either k → ∞ or r → ∞. Using Theorems 1 and 2, we now show that α̂_{n,2} is better when k → ∞, whereas α̂_{n,λ_n} with λ_n satisfying (3.12) is better when r → ∞.

It is easy to see that (2.6) and (3.10) hold. Condition (2.5) holds if k is fixed. If k → ∞, then (2.5) is the same as

lim inf_k (1/k) Σ_{j=1}^k (μ_j − k^{-1} Σ_{i=1}^k μ_i)² > 0,

which is a reasonable condition.

Consider first the case where k → ∞ and r is fixed. Since the difference in dimensions of the two models in A_n is k − 1 → ∞, an application of Theorem 1(i)&(iv) shows that α̂_{n,2} is always asymptotically loss efficient. On the other hand, it can be shown that if λ_n → ∞, then P{α̂_{n,λ_n} = α_1} → 1. Hence α̂_{n,λ_n} is asymptotically loss efficient if and only if the one-mean model is correct.

Next, consider the case where r → ∞ and k is fixed. In this case the dimensions of both models are fixed. By Proposition 1, consistency is the same as asymptotic loss efficiency. By Theorem 2, α̂_{n,λ_n} with λ_n → ∞ and λ_n/n → 0 is consistent. By Theorem 1, α̂_{n,2} is consistent if and only if the one-mean model is incorrect.

Finally, consider the case where k → ∞ and r → ∞. Since p_n/n = r^{-1} → 0, consistency is the same as asymptotic loss efficiency (Proposition 1). By Theorems 1 and 2, both α̂_{n,2} and α̂_{n,λ_n} are consistent, but λ_n has to be chosen so that (3.12) holds, i.e., λ_n/r → 0. For example, if we choose λ_n = log n (the GIC_{λ_n} is then equivalent to the BIC in Schwarz (1978)), then α̂_{n,λ_n} is inconsistent if log n/r does not tend to 0. This is exactly what was described in Section 3 of Stone (1979).

Example 3. (continued) In this case p_n → ∞ as n → ∞. Conditions (2.6) and (3.10) are usually satisfied with m = 2. If there exists a correct model in A_n for some n, then there are many correct models in A_n and, by Theorems 1 and 2, α̂_{n,λ_n} is consistent but α̂_{n,2} is not. On the other hand, if there is no correct model in A_n for all n, then α̂_{n,2} is asymptotically loss efficient but α̂_{n,λ_n} may not be, since condition (2.5) may not hold.

In conclusion, the GIC_{λ_n} method with λ_n ≡ 2 is more useful in the case where there is no fixed-dimension correct model, whereas the GIC_{λ_n} method with λ_n → ∞ is more useful in the case where there exist fixed-dimension correct models. To end this section, we discuss briefly the GIC_λ with λ_n ≡ λ, a constant larger than 2.
It is apparent that the GIC_λ with a fixed λ > 2 is a compromise between the GIC_2 and the GIC_{λ_n} with λ_n → ∞. The asymptotic behavior of the GIC_λ, however, is not as good as that of the GIC_2 in the case where no fixed-dimension correct model exists, and not as good as that of the GIC_{λ_n} when there are

fixed-dimension correct models. This can be seen from the proofs of Theorems 1 and 2 in Section 6.

4. Other Selection Methods

In this section we show that some selection methods cited in Section 1 have the same asymptotic behavior (in terms of consistency and asymptotic loss efficiency) as the GIC_{λ_n} under certain conditions. First, consider the GIC_{λ_n} with the following particular choice of σ̂_n²:

σ̂_n² = S_n(ᾱ_n)/(n − p_n),   (4.1)

where ᾱ_n = {1,...,p_n}. If (4.1) is used, then the GIC_2 is the C_p method (Mallows (1973)) and the GIC_{λ_n} is the GIC in Rao and Wu (1989). The estimator in (4.1), however, is not necessarily consistent for σ² if ᾱ_n is an incorrect model. The asymptotic behavior of the C_p (λ_n ≡ 2) is given in the following result.

Theorem 1A. (i) If Δ_n(ᾱ_n) → 0 and p_n/n does not converge to 1, then σ̂_n² in (4.1) is consistent for σ² and, therefore, the assertions (i)-(iv) in Theorem 1 are valid for the C_p.
(ii) If (2.4) holds, then the assertions (i)-(iv) in Theorem 1 are valid for the C_p.

Note that in Theorem 2, we do not need σ̂_n² to be consistent. Hence we have the following result for the case where λ_n → ∞.

Theorem 2A. Assume that (2.6) and (3.10) hold. Then the assertions (i)-(ii) in Theorem 2 are valid for the GIC_{λ_n} with σ̂_n² given by (4.1) and λ_n → ∞.

If we use σ̂_n² = σ̂_n²(α) = S_n(α)/(n − p_n(α)) (an estimate of σ² depending on the model α) in (3.1), then we select a model by minimizing

Γ̃_{n,λ_n}(α) = [S_n(α)/n][1 + λ_n p_n(α)/(n − p_n(α))].

If λ_n p_n/n → 0, this method has the same asymptotic behavior as the method minimizing

log[S_n(α)/n] + λ_n p_n(α)/(n − p_n(α)),

since log(1 + x) ≈ x as x → 0. Minimizing Γ̃_{n,λ_n}(α) is known as the AIC if λ_n ≡ 2 and the BIC if λ_n = log n. Let α̃_{n,λ_n} be the model selected by minimizing Γ̃_{n,λ_n}(α) over α ∈ A_n. We have the following result similar to Theorems 1 and 2.

Theorem 3. Suppose that (2.6) holds.
(i) The assertions (i)-(iv) in Theorem 1 are valid for α̃_{n,2} (the AIC) if either (2.4) holds or

max_{α∈A_n} Δ_n(α) → 0 and p_n/n does not converge to 1.   (4.2)

(ii) Assume that (3.10) holds. Then the assertions (i)-(ii) in Theorem 2 are valid for α̃_{n,λ_n} with λ_n → ∞.

The delete-1 CV method selects a model by minimizing

CV_{n,1}(α) = n^{-1} ||[I_n − H̄_n(α)]^{-1}[y_n − μ̂_n(α)]||²

over α ∈ A_n, where H̄_n(α) is a diagonal matrix whose ith diagonal element is the ith diagonal element of H_n(α). The GCV method replaces H̄_n(α) by [n^{-1} tr H_n(α)]I_n = [n^{-1} p_n(α)]I_n, where tr A is the trace of the matrix A, and hence it selects a model by minimizing

GCV_n(α) = n^{-1} S_n(α)/[1 − n^{-1} p_n(α)]².

From the identity

1/[1 − n^{-1} p_n(α)]² = 1 + 2p_n(α)/(n − p_n(α)) + [p_n(α)/(n − p_n(α))]²,

we know that the GCV and the AIC have the same asymptotic behavior if

max_{α∈A_n} [p_n(α)/(n − p_n(α))]² / [1 + 2p_n(α)/(n − p_n(α))] → 0,   (4.3)

which holds if and only if (2.4) holds.

Theorem 4. Suppose that (2.6) holds.
(i) The assertions (i)-(iv) in Theorem 1 are valid for the GCV if either (2.4) or (4.2) holds.
(ii) Assume that

h_n = max_{i≤n} x_i'(X_n'X_n)^{-1} x_i → 0.   (4.4)

Then the assertions (i)-(iv) in Theorem 1 are valid for the delete-1 CV.

Condition (4.4) is stronger than condition (2.4). When neither (2.4) nor (4.2) holds, the GCV and the delete-1 CV may not be asymptotically loss efficient.

Example 2. (continued) We consider Example 2 in the situation where k is large but r, the number of replications, is fixed. Since p_n/n = r^{-1}, (2.4) does not hold.
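The delete-1 CV score above can be computed from a single fit, since the predicted residual for observation i is the ordinary residual divided by 1 − h_ii. A sketch of both criteria (function names are ours):

```python
import numpy as np

def cv1_score(X, y, alpha):
    """CV_{n,1}(alpha) = n^{-1} sum_i [(y_i - mu_hat_i)/(1 - h_ii)]^2,
    with h_ii the ith diagonal element of H_n(alpha)."""
    n = len(y)
    Xa = X[:, alpha]
    H = Xa @ np.linalg.solve(Xa.T @ Xa, Xa.T)   # hat matrix H_n(alpha)
    resid = y - H @ y
    h = np.diag(H)
    return float(np.sum((resid / (1.0 - h)) ** 2)) / n

def gcv_score(X, y, alpha):
    """GCV_n(alpha) = n^{-1} S_n(alpha) / [1 - p_n(alpha)/n]^2: each
    leverage h_ii is replaced by the average p_n(alpha)/n."""
    n = len(y)
    Xa = X[:, alpha]
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    resid = y - Xa @ beta
    return float(resid @ resid) / n / (1.0 - len(alpha) / n) ** 2
```

A standard check on the implementation is that cv1_score agrees exactly with brute-force leave-one-out refitting.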

Assume that lim_n Δ_n(α_1) = Δ > 0, i.e., (4.2) does not hold. Let y_ij be the jth observation in the ith group, j = 1,...,r, i = 1,...,k, ȳ_i be the ith group mean, ȳ be the overall mean, SS_1 = Σ_{i=1}^k Σ_{j=1}^r (y_ij − ȳ)² and SS_k = Σ_{i=1}^k Σ_{j=1}^r (y_ij − ȳ_i)². The delete-1 CV and the GCV are identical in this case and select the one-mean model if and only if SS_1/(1 − n^{-1})² < SS_k/(1 − r^{-1})². From

L_n(α_1) − L_n(α_k) →_p Δ − σ²/r, SS_1/n →_p σ² + Δ and SS_k/n →_p (r − 1)σ²/r,

the delete-1 CV (or the GCV) is not asymptotically loss efficient if σ²/r < Δ < σ²/(r − 1).

The delete-d CV is an extension of the delete-1 CV. Suppose that we split the n × (1 + p_n) matrix (y_n, X_n) into two distinct sub-matrices: a d × (1 + p_n) matrix (y_{n,s}, X_{n,s}) containing the rows of (y_n, X_n) indexed by the integers in s, a subset of {1,...,n} of size d, and an (n − d) × (1 + p_n) matrix (y_{n,s^c}, X_{n,s^c}) containing the rows of (y_n, X_n) indexed by the integers in s^c, the complement of s. For any α ∈ A_n, we estimate β_n(α) by β̂_{n,s^c}(α), the LSE based on (y_{n,s^c}, X_{n,s^c}) under model α. The model is then assessed by ||y_{n,s} − μ̂_{n,s}(α)||², where μ̂_{n,s}(α) = X_{n,s}(α) β̂_{n,s^c}(α) and X_{n,s}(α) is a d × p_n(α) matrix containing the columns of X_{n,s} indexed by the integers in α. Let S_n be a class of N subsets s. The delete-d CV method selects a model by minimizing

CV_{n,d}(α) = (dN)^{-1} Σ_{s∈S_n} ||y_{n,s} − μ̂_{n,s}(α)||²

over α ∈ A_n. The set S_n can be obtained by using a balanced incomplete block design (Shao (1993)) or by taking a simple random sample from the collection of all possible subsets of {1,...,n} of size d (Burman (1989), Shao (1993)). While the delete-1 CV has the same asymptotic behavior as the C_p (Theorem 4), the delete-d CV has the same asymptotic behavior as the GIC_{λ_n} with

λ_n = n/(n − d) + 1.   (4.5)

If d/n → 0, then λ_n → 2; if d/n → τ ∈ (0, 1), then λ_n → (1 − τ)^{-1} + 1, a fixed constant larger than 2; if d/n → 1, then λ_n → ∞. In view of the discussion (at the end of Section 3) for the GIC_λ with a fixed λ > 2, we consider only the case where d is chosen so that d/n → 1 (λ_n → ∞).

Theorem 5. Suppose that (2.5), (2.6) and (3.10) hold and that

max_{s∈S_n} sup_{||c||=1} | ||X_{n,s^c} c||²/(n − d) − ||X_n c||²/n | → 0.

Then the delete-d CV is asymptotically loss efficient if d is chosen so that

d/n → 1 and p_n/(n − d) → 0.   (4.6)

If, in addition, A_n contains at least one correct model with fixed dimension, then the delete-d CV is consistent.

Remark 4. Condition (4.6) implies condition (2.4) and is similar to condition (3.12) in Theorem 2. In fact, p_n/(n − d) → 0 is a very natural requirement for using the delete-d CV, since n − d is the number of observations used to fit an initial model with as many as p_n parameters.

The PMDL and PLS methods (Rissanen (1986), Wei (1992)) are shown to have the same asymptotic behavior as the BIC method (which is a special case of the GIC) in some situations (Wei (1992)). However, these two methods are intended for the case where e_n is a time series, so that the observations have a natural order. Hence, we do not discuss these methods here.

In conclusion, the methods discussed so far can be classified into the following three classes according to their asymptotic behavior:

Class 1. The GIC_2, the C_p, the AIC, the delete-1 CV, and the GCV.
Class 2. The GIC_{λ_n} with λ_n → ∞ and the delete-d CV with d/n → 1.
Class 3. The GIC_λ with a fixed λ > 2 and the delete-d CV with d/n → τ ∈ (0, 1).

The methods in class 1 are useful in the case where there is no fixed-dimension correct model. With a suitable choice of λ_n or d, the methods in class 2 are useful in the case where there exist fixed-dimension correct models. The methods in class 3 are compromises of the methods in class 1 and the methods in class 2; but their asymptotic performances are not as good as those of the methods in class 1 in the case where no fixed-dimension correct model exists, and not as good as those of the methods in class 2 when there are fixed-dimension correct models.

5. Empirical Results

We study the magnitude of P{α̂_n = α_L} with a fixed n by simulation in two examples. Although some selection methods are shown to have the same asymptotic behavior, their fixed-sample performances (in terms of P{α̂_n = α_L}) may be different.
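The delete-d CV of Section 4, with S_n taken as a simple random collection of splits, can be sketched as follows (function names and the rng argument are ours; the study in Section 5 uses d = 25 and N = 2n = 80 random splits):

```python
import numpy as np

def cv_d_score(X, y, alpha, d, n_splits, rng):
    """CV_{n,d}(alpha) = (d N)^{-1} * sum over splits s of
    ||y_s - X_s(alpha) beta_hat_{s^c}(alpha)||^2, where beta_hat_{s^c}
    is the LSE computed from the n - d retained observations."""
    n = len(y)
    total = 0.0
    for _ in range(n_splits):
        s = rng.choice(n, size=d, replace=False)   # held-out index set s
        mask = np.ones(n, dtype=bool)
        mask[s] = False                            # s^c: construction set
        beta, *_ = np.linalg.lstsq(X[mask][:, alpha], y[mask], rcond=None)
        pred = X[s][:, alpha] @ beta
        total += float(np.sum((y[s] - pred) ** 2))
    return total / (d * n_splits)
```

Note that n − d must be at least p_n(α), since each split refits the model from only n − d observations; this is condition (4.6) in miniature.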
The first example is the linear regression (Example 1) with p = 5; that is,

y_i = β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + β_4 x_i4 + β_5 x_i5 + e_i, i = 1,...,40,

where the e_i are independent and identically distributed as N(0, 1), x_ij is the ith value of the jth explanatory variable x_j, x_i1 ≡ 1, and the values of x_ij, j = 2, 3, 4, 5,

are taken from an example in Gunst and Mason (1980) (also, see Table 1 in Shao (1993)). This study is an extension of that in Shao (1993), which studies the cross-validation methods only.

Table 1. Selection probabilities in the regression problem based on 1000 simulations.

[Table 1 lists, for each true β among (2,0,0,4,0), (2,0,0,4,8), (2,9,0,4,8) and (2,9,6,4,8), the empirical probability with which each of the AIC, C_p, GIC, CV_1, GCV and CV_d selects each candidate model; correct models are marked *, and the optimal correct model ({1,4}, {1,4,5}, {1,2,4,5} and {1,2,3,4,5}, respectively) is marked **. Numerical entries omitted.]

Six selection procedures, the AIC, the C_p, the GIC_{λ_n} with σ̂_n² given by (4.1), the delete-1 CV (denoted by CV_1), the GCV, and the delete-d CV (denoted by CV_d), are applied to select a model from 2^p − 1 = 31 models. The λ_n in the GIC is chosen to be log n = log 40 so that this GIC is almost the same as the BIC. The d in the delete-d CV is chosen to be 25 so that (4.5) approximately holds and the delete-d CV is comparable with the GIC. The S_n in the delete-d CV is obtained by taking a random sample of size 2n = 80 from all possible subsets of {1,...,40} of size 25. For these six selection procedures, the empirical probabilities (based on 1,000 simulations) of selecting each model are reported in Table 1, where each model is denoted by the subset of {1,...,5} that contains the indices of the explanatory variables x_j in the model. Models corresponding to zero empirical probabilities for all the methods in the simulation are omitted.
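A scaled-down version of this experiment can be scripted directly. The sketch below is our own: a random normal design stands in for the Gunst and Mason covariates (which we do not reproduce), the candidate set is reduced to four models, and it estimates P{α̂_n = α_L} for C_p (λ = 2) and a BIC-like GIC (λ = log n):

```python
import numpy as np

def gic_pick(X, y, models, lam, sigma2_hat):
    """Minimize Gamma_{n,lam}(alpha) = S_n(alpha)/n + lam*sigma2_hat*p(alpha)/n."""
    n = len(y)
    def score(a):
        beta, *_ = np.linalg.lstsq(X[:, a], y, rcond=None)
        r = y - X[:, a] @ beta
        return float(r @ r) / n + lam * sigma2_hat * len(a) / n
    return min(models, key=score)

def selection_freq(n_reps=200, seed=0):
    """Empirical probability of selecting the optimal correct model
    under the true beta = (2, 0, 0, 4, 0), for lambda = 2 and log(n)."""
    rng = np.random.default_rng(seed)
    n, beta = 40, np.array([2.0, 0.0, 0.0, 4.0, 0.0])
    optimal = (0, 3)                               # the model {1,4} in the paper's labels
    models = [[0, 3], [0, 1, 3], [0, 3, 4], [0, 1, 2, 3, 4]]
    hits = {2.0: 0, float(np.log(n)): 0}
    for _ in range(n_reps):
        X = rng.standard_normal((n, 5))
        X[:, 0] = 1.0                              # intercept column x_1 = 1
        y = X @ beta + rng.standard_normal(n)
        bfull, *_ = np.linalg.lstsq(X, y, rcond=None)
        s2 = float(np.sum((y - X @ bfull) ** 2)) / (n - 5)   # estimator (4.1)
        for lam in hits:
            if tuple(gic_pick(X, y, models, lam, s2)) == optimal:
                hits[lam] += 1
    return {lam: h / n_reps for lam, h in hits.items()}
```

Running selection_freq illustrates the qualitative pattern of Table 1: the heavier log n penalty selects the optimal correct model more often than λ = 2, which tends to overfit.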

The second example considered is the polynomial approximation to a possibly nonlinear curve (Example 3); that is, we select a model from the following class of models:

y_i = β_0 + β_1 x_i + ··· + β_{h−1} x_i^{h−1} + e_i, h = 1,...,p_n.   (5.1)

In the simulation, n = 40 and p_n = 5. Other settings and the selection procedures considered are the same as those in the first example. The values of x_i are taken to be the same as x_i2 in the first example. We consider situations where one of the models in (5.1) is correct, as well as the case where the true model is y_i = exp(2x_i) + e_i, so that none of the models in (5.1) is correct. The results are reported in Table 2.

Table 2. Selection probabilities in the polynomial approximation problem based on 1000 simulations.

[Table 2 lists, for each true mean E(y_i) among 1, 1 + 2x_i, 1 + 2x_i + 2x_i², 1 + 2x_i + 2x_i² + x_i³/2, 1 + 2x_i + 2x_i² + x_i³/2 + 2x_i⁴/3 and exp(2x_i), the empirical probability with which each of the AIC, C_p, GIC, CV_1, GCV and CV_d selects each order h; correct models are marked *, and the optimal correct model (h = 1, 2, 3, 4, 5, respectively, for the five polynomial means; no model in (5.1) is correct for exp(2x_i)) is marked **. Numerical entries omitted.]

The following is a summary of the results in Tables 1 and 2.

(1) The procedures in class 2 (the GIC and the CV_d) have much better empirical performances than the procedures in class 1 (the AIC, the C_p, the CV_1, and

the GCV) when there are at least two fixed-dimension correct models. The probability P{α̂_n = α_L} may be very low for the methods in class 1 when the dimension of the optimal model is not close to p_n. This confirms the asymptotic results established in Sections 3 and 4.
(2) The performances of two methods in class 2 may be substantially different. For example, the probability of the GIC selecting the optimal model can be quite low in the first example, whereas the CV_d selects the optimal model with probability higher than 0.90 in all cases. On the other hand, the CV_d sometimes selects an incorrect model with a small probability.

6. Proofs

Proof of Proposition 1. We only need to show that (2.3) does not hold, assuming that (2.1) does not hold. If A_c contains exactly one model, then by (2.4), L_n(α_L) →_p 0; but by (2.5), L_n(α̂_n) does not converge to 0 in probability. Hence (2.3) does not hold. Next, assume that A_c contains more than one model but p_n(α_L)/n → 0. Since P{α̂_n ≠ α_L} does not converge to 0, there exists α_1 ∈ A_c such that α_1 ≠ α_L and P{α̂_n = α_1} does not converge to 0. Then

[L_n(α̂_n)/L_n(α_L) − 1] I(α̂_n = α_1) = [L_n(α_1)/L_n(α_L) − 1] I(α̂_n = α_1) = [e_n'H_n(α_1)e_n / e_n'H_n(α_L)e_n − 1] I(α̂_n = α_1),

which does not converge to 0 in probability.

Proof of (3.2). Note that

Γ_{n,2}(α) = ||e_n||²/n + L_n(α) + 2(σ̂_n² − σ²)p_n(α)/n + 2[σ²p_n(α) − e_n'H_n(α)e_n]/n + 2e_n'[I_n − H_n(α)]μ_n/n.

Hence (3.2) follows from

(σ̂_n² − σ²) max_{α∈A_n\A_c} p_n(α)/[nL_n(α)] →_p 0,  (6.1)

max_{α∈A_n\A_c} |σ²p_n(α) − e_n'H_n(α)e_n| / [nL_n(α)] →_p 0,  (6.2)

and

max_{α∈A_n\A_c} |e_n'[I_n − H_n(α)]μ_n| / [nL_n(α)] →_p 0.  (6.3)

Result (6.1) follows from (6.2), e_n'H_n(α)e_n/n ≤ L_n(α), and the fact that σ̂_n² − σ² →_p 0. Results (6.2) and (6.3) can be shown using the same argument as in Li (1987), p. 970, under condition (2.6).

Proof of Theorem 1. The first statement in (i) is proved in Section 3. The second statement in (i) is a consequence of the first statement and Proposition 1. For (ii), it suffices to show that Γ_{n,2}(α) = ||e_n||²/n + L_n(α) + o_p(L_n(α)) uniformly in α ∈ A_c, which follows from either

max_{α∈A_c} |e_n'H_n(α)e_n/p_n(α) − σ²| →_p 0  (6.4)

or

max_{α∈A_c, α≠α_c} |e_n'[H_n(α) − H_n(α_c)]e_n / [p_n(α) − p_n(α_c)] − σ²| →_p 0.  (6.5)

From Theorem 2 of Whittle (1960),

E|e_n'H_n(α)e_n − σ²p_n(α)|^{2m} ≤ c[p_n(α)]^m,  (6.6)

where c is a positive constant. Then for any ε > 0,

P{ max_{α∈A_c} |e_n'H_n(α)e_n/p_n(α) − σ²| > ε } ≤ cε^{−2m} Σ_{α∈A_c} [p_n(α)]^{−m}.

Hence (6.4) is implied by condition (3.4). A similar argument shows that (6.5) is implied by condition (3.5). The result in (iii) can be proved using the same argument as in Example 4. For (iv), it suffices to show that p_n(α_c) → ∞ is the same as (3.4) and that min_{α∈A_c, α≠α_c} [p_n(α) − p_n(α_c)] → ∞ is the same as (3.5), which is apparent if the number of models in A_c is bounded. The proof for the case where m = 2 and A_n = {α_i, i = 1,...,p_n} with α_i = {1,...,i} is the same as that in Li (1987), p. 963.

Proof of Theorem 2. From (6.6) and condition (3.10),

e_n'H_n(α)e_n / [λ_n p_n(α)] = O_p(λ_n^{−1}) uniformly in α ∈ A_c.

Hence (3.9) holds. Since L_n(α) ≥ Δ_n(α), (3.7) and conditions (2.5) and (3.12) imply that Γ_{n,λ_n}(α) = ||e_n||²/n + L_n(α) + o_p(L_n(α)) uniformly in α ∈ A_n\A_c, and, if A_c is not empty, Γ_{n,λ_n}(α_c) − ||e_n||²/n = o_p(L_n(α)) uniformly in α ∈ A_n\A_c. The result in (i) is established. If A_n contains at least one correct model with a fixed dimension, then (2.5) holds and P{α̂_{n,λ_n} = α_c} → 1. The consistency of α̂_{n,λ_n} then follows from the fact that α_L = α_c.
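Condition (3.4) forces the dimensions p_n(α) of the correct models to grow, so that the χ²-type averages e_n'H_n(α)e_n/p_n(α) appearing in (6.4) concentrate at σ². A quick numerical illustration of that concentration (my own sketch, not part of the paper, using nested models in a synthetic Gaussian design):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 2000, 1.0
X = rng.standard_normal((n, 500))
e = np.sqrt(sigma2) * rng.standard_normal(n)

def chi2_ratio(p_alpha):
    """e' H(alpha) e / p(alpha) for the nested model spanned by the first
    p_alpha columns; H(alpha) = Q Q' with Q an orthonormal column basis."""
    Q, _ = np.linalg.qr(X[:, :p_alpha])
    He = Q @ (Q.T @ e)
    return float(e @ He) / p_alpha

# the ratio is an average of p(alpha) squared projections of the noise,
# so its fluctuation around sigma^2 shrinks like sqrt(2 / p(alpha))
deviations = {p: abs(chi2_ratio(p) - sigma2) for p in (5, 50, 500)}
```

For small p_n(α) the ratio is essentially a χ² average over few terms and can be far from σ², which is exactly why (6.4) needs the dimensions of all correct models to diverge.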

Proofs of Theorems 1A, 2A, and 3. First, consider Theorem 1A. Note that

S_n(ᾱ_n)/(n − p_n) = e_n'[I_n − H_n(ᾱ_n)]e_n/(n − p_n) + nΔ_n(ᾱ_n)/(n − p_n) + 2e_n'[I_n − H_n(ᾱ_n)]μ_n/(n − p_n).  (6.7)

If Δ_n(ᾱ_n) → 0 and p_n/n does not converge to 1, then S_n(ᾱ_n)/(n − p_n) is consistent and the result in (i) follows. Suppose now that (2.4) holds. For the result in (ii), it suffices to show that (6.1) still holds for σ̂_n² = S_n(ᾱ_n)/(n − p_n). From (6.7), S_n(ᾱ_n)/(n − p_n) = σ² + o_p(1) + O_p(Δ_n(ᾱ_n)). Hence (6.1) follows from the fact that

Δ_n(ᾱ_n) max_{α∈A_n} p_n(α)/[nL_n(α)] ≤ max_{α∈A_n} p_n(α)/n,

since L_n(α) ≥ Δ_n(ᾱ_n) for every α ∈ A_n. The proofs for Theorem 2A and Theorem 3 are similar.

Proof of Theorem 4. (i) If (2.4) holds, then (4.3) holds and the result follows from Theorem 3(i). Now, assume that (4.2) holds. Then p_n(α_L)/n → 0 and p_n(α̂_n)/n →_p 0, where α̂_n is the model selected by the GCV. If A_c is empty for all n, then

GCV_n(α̂_n) = ||e_n||²/n + L_n(α̂_n) + o_p(L_n(α̂_n)),
GCV_n(α_L) = ||e_n||²/n + L_n(α_L) + o_p(L_n(α_L)),

and

0 ≥ [GCV_n(α̂_n) − GCV_n(α_L)] / L_n(α̂_n) = [L_n(α̂_n) − L_n(α_L)] / L_n(α̂_n) + o_p(1) ≥ o_p(1).

This proves that (2.3) holds. The proof for the case where A_c is nonempty is similar to the proof of Theorem 1.

(ii) Define T_n(α) = [y_n − μ̂_n(α)]'H̃_n(α)[y_n − μ̂_n(α)]. Then

CV_{n,1}(α) = S_n(α)/n + 2T_n(α)/n + O_p(h_n T_n(α)/n).

The result follows if

max_{α∈A_n\A_c} |T_n(α) − σ²p_n(α)| / [nL_n(α)] = o_p(1),  (6.8)

E|T_n(α)/p_n(α) − σ²|^{2m} ≤ c[p_n(α)]^{−m},  α ∈ A_c,  (6.9)

and

E|[T_n(α) − T_n(α_c)] / [p_n(α) − p_n(α_c)] − σ²|^{2m} ≤ c[p_n(α) − p_n(α_c)]^{−m},  α ∈ A_c,  (6.10)

for some c > 0 and positive integer m such that E(y_1 − μ_1)^{4m} < ∞. Let W_n(α) = [I_n − H̃_n(α)]H̃_n(α)[I_n − H̃_n(α)]. When α ∈ A_n\A_c,

T_n(α) = e_n'W_n(α)e_n + 2e_n'W_n(α)μ_n + μ_n'W_n(α)μ_n.  (6.11)

From Theorem 2 of Whittle (1960),

E|e_n'W_n(α)e_n − E[e_n'W_n(α)e_n]|^{2l} ≤ c[tr W_n²(α)]^l ≤ c h_n^l [tr W_n(α)]^l.  (6.12)

Note that

tr W_n(α) = tr{H̃_n(α)[I_n − H̃_n(α)]} = p_n(α) − tr H̃_n²(α) ≤ p_n(α).  (6.13)

By (2.6), (6.12) and (6.13),

P{ max_{α∈A_n\A_c} |e_n'W_n(α)e_n − E[e_n'W_n(α)e_n]| / [nR_n(α)] > ε } ≤ cε^{−2l} Σ_{α∈A_n\A_c} [nR_n(α)]^{−l} → 0.

Then (6.8) follows from (6.11), μ_n'W_n(α)μ_n ≤ h_n nΔ_n(α) ≤ h_n nR_n(α), and the fact that (2.6) implies

max_{α∈A_n\A_c} |L_n(α)/R_n(α) − 1| →_p 0.

Results (6.9) and (6.10) follow from Theorem 2 of Whittle (1960), the identity (6.13), and the fact that tr H̃_n²(α) ≤ h_n p_n(α) and tr[H̃_n²(α) − H̃_n²(α_c)] ≤ 2h_n[p_n(α) − p_n(α_c)] when α ∈ A_c.

Proof of Theorem 5. It follows from the proof in Shao (1993), Appendix, that under the given conditions,

CV_{n,d}(α) = S_n(α)/n + λ_n T_n(α)/n + o_p(λ_n T_n(α)/n) uniformly in α ∈ A_n,

where λ_n is given by (4.5) and T_n(α) is given by (6.11). The result then follows from the given conditions and result (6.9).

Acknowledgements

I would like to thank the referees and an Associate Editor for their constructive comments and suggestions. The research was supported by a National Science Foundation Grant (DMS) and a National Security Agency Grant (MDA).

References

Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math. 22.
Allen, D. M. (1974). The relationship between variable selection and data augmentation and a method for prediction. Technometrics 16.
Burman, P. (1989). A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika 76.
Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math. 31.
Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc. 70.
Gunst, R. F. and Mason, R. L. (1980). Regression Analysis and Its Application. Marcel Dekker, New York.
Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. J. Roy. Statist. Soc. Ser. B 41.
Li, K.-C. (1987). Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: Discrete index set. Ann. Statist. 15.
Mallows, C. L. (1973). Some comments on C_p. Technometrics 15.
Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. Ann. Statist. 12.
Pötscher, B. M. (1989). Model selection under nonstationarity: autoregressive models and stochastic linear regression models. Ann. Statist. 17.
Rao, C. R. and Wu, Y. (1989). A strongly consistent procedure for model selection in a regression problem. Biometrika 76.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist. 14.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6.
Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc. 88.
Shibata, R. (1981). An optimal selection of regression variables. Biometrika 68.
Shibata, R. (1984). Approximate efficiency of a selection procedure for the number of regression variables. Biometrika 71.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B 36.
Stone, M. (1979). Comments on model selection criteria of Akaike and Schwarz. J. Roy. Statist. Soc. Ser. B 41.
Wei, C. Z. (1992). On predictive least squares principles. Ann. Statist. 20.
Whittle, P. (1960). Bounds for the moments of linear and quadratic forms in independent variables. Theory Probab. Appl. 5.
Zhang, P. (1993). Model selection via multifold cross validation. Ann. Statist. 21.

Department of Statistics, University of Wisconsin, 1210 W. Dayton St., Madison, WI 53706, U.S.A.

(Received March 1994; accepted September 1996)

COMMENT

Rudolf Beran
University of California, Berkeley

Professor Shao's welcome asymptotic analysis of standard model selection procedures divides these into three categories: those that perform better when one or more correct models have fixed dimension under the asymptotics; those that do better when no correct model has fixed dimension; and intermediate methods. Adopting the premise that model selection is intended to reduce estimation risk under quadratic loss, my discussion will draw attention to two points:

GIC_λn selection estimators with lim λ_n = ∞ can have arbitrarily high asymptotic risk when the signal-to-noise ratio is large enough.

GIC_λn selection estimators with either λ_n = 2 or lim λ_n = ∞ are not asymptotically minimax unless the signal-to-noise ratio converges to zero. They are dominated, in maximum risk, by a variety of procedures that taper the components of the least squares fit toward zero.

I will develop both points in a signal recovery setting that is formally a special case of Shao's problem. Suppose that X_n = {X_{n,t} : t ∈ T_n} is an observation on a discrete signal ξ_n = {ξ_{n,t} : t ∈ T_n} that is measured with error at the time points T_n = {1,...,n}. The measurement errors are independent and are such that the distribution of each component X_{n,t} is N(ξ_{n,t}, σ²). For any real-valued function f defined on T_n, let ave(f) = n^{−1} Σ_{t∈T_n} f(t). The time-averaged quadratic loss of any estimator ξ̂_n is then

L_n(ξ̂_n, ξ_n) = ave[(ξ̂_n − ξ_n)²]

and the corresponding risk is

R_n(ξ̂_n, ξ_n, σ²) = E L_n(ξ̂_n, ξ_n).

Model selection and related estimators typically have smaller risk when all but a few components of ξ_n are small. With enough prior information, this favorable situation may be approximated by a suitable orthogonal transformation of X_n before estimation. This transformation leaves the Gaussian error distribution unchanged. A model selection or other estimator constructed in the new coordinate system may be transformed back to the original coordinate system without changing its quadratic loss.
Thus, in signal recovery problems, the {X_{n,t}} might be Fourier, or wavelet, or analysis of variance, or orthogonal polynomial coefficients of the observed signal.
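The invariance behind this remark is that an orthogonal change of coordinates alters neither the Gaussian error law nor the time-averaged quadratic loss. A quick check of the loss invariance (illustrative only; the thresholding estimator below is an arbitrary stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
xi = np.where(np.arange(n) < 10, 5.0, 0.0)       # sparse signal
X = xi + rng.standard_normal(n)                   # observation, sigma = 1
est = np.where(np.abs(X) > 2, X, 0.0)             # some estimator of xi

# a random orthogonal matrix via QR factorization
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

loss = np.mean((est - xi) ** 2)
loss_rotated = np.mean((Q @ est - Q @ xi) ** 2)   # Q preserves Euclidean norms
```

Since Q is orthogonal, ||Q(est − ξ)||² = ||est − ξ||², so the two average losses agree up to floating-point error.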

Let u ∈ [0, 1]. We consider nested model selection, in which the candidate estimators have the form ξ̂_n(u) = {ξ̂_{n,t}(u) : t ∈ T_n}, with ξ̂_{n,t}(u) = X_{n,t} whenever t/(n+1) ≤ u and ξ̂_{n,t}(u) = 0 otherwise. The value of u will be chosen by the GIC_λn method in Shao's Section 3. Let σ̂_n² be a consistent estimator of σ² that satisfies

lim_{n→∞} sup_{ave(ξ_n²)/σ² ≤ r} E|σ̂_n² − σ²| = 0  (1)

for every r ∈ [0, ∞). Such variance estimators may be constructed externally using replication or internally by methods such as those described in Rice (1984). The GIC_λn selection criterion is

Γ̂_n(u, λ_n) = γ̂_n(u) + λ_n σ̂_n² n^{−1} [(n+1)u]_I,

where γ̂_n(u) = n^{−1} Σ_{t/(n+1)>u} X_{n,t}² and [·]_I is the integer part function. Let û_n be the smallest value of u ∈ [0, 1] that minimizes Γ̂_n(u, λ_n). Existence of û_n is assured because the criterion function assumes only a finite number of values. The model selection estimator ξ̂_n(û_n) will be denoted by ξ̂_{n,λ_n}.

Proposition 1. In the signal-plus-noise model, with σ̂_n² satisfying (1), the following bounds hold for every r ∈ [0, ∞):

lim_{n→∞} sup_{ave(ξ_n²)/σ² ≤ r} R_n(ξ̂_{n,2}, ξ_n, σ²) = σ² min(r, 1).  (2)

If lim λ_n = ∞, then

lim_{n→∞} sup_{ave(ξ_n²)/σ² ≤ r} R_n(ξ̂_{n,λ_n}, ξ_n, σ²) = σ² r.  (3)

The least squares estimator X_n satisfies

lim_{n→∞} sup_{ave(ξ_n²)/σ² ≤ r} R_n(X_n, ξ_n, σ²) = σ².  (4)

This proposition will be proved at the end of the discussion. Let us consider some implications:

(a) If ξ_n is a voltage signal, then ave(ξ_n²) is the time-averaged power dissipated by this signal in passing through a unit resistance. Consequently, ave(ξ_n²)/σ² is the time-averaged signal-to-noise ratio in our signal recovery problem. The maximum risks in Proposition 1 are computed over subsets of ξ_n values that are generated by bounding the signal-to-noise ratio from above.

(b) For r = 0, the limiting maximum risks in Proposition 1 do not distinguish between the performance of ξ̂_{n,2} and ξ̂_{n,λ_n} with lim λ_n = ∞. Theorems 1 and 2 in Shao's paper indicate that the latter estimators may perform better

in some (but not all) circumstances where the signal-to-noise ratio converges to zero.

(c) As long as the signal-to-noise ratio does not exceed 1, both ξ̂_{n,2} and ξ̂_{n,λ_n} with lim λ_n = ∞ have the same asymptotic maximum risk. Once the signal-to-noise ratio exceeds 1, then ξ̂_{n,λ_n} has greater asymptotic maximum risk than ξ̂_{n,2} or even the least squares estimator X_n.

(d) For all values of r, the asymptotic maximum risk of ξ̂_{n,λ_n} with lim λ_n = ∞ coincides with that of the trivial estimator ξ̂_n = 0. This does not mean that ξ̂_{n,λ_n} is trivial.

(e) For all values of r, the asymptotic maximum risk of ξ̂_{n,2} equals the smaller of the asymptotic maximum risks of X_n and ξ̂_{n,λ_n} with lim λ_n = ∞. This argument strongly promotes the use of ξ̂_{n,2} over these two competitors unless one is confident that the special circumstances of remark (b) hold.

How well do model selection estimators perform within the class of all estimators of ξ_n? An answer that complements Proposition 1 is

Proposition 2. In the signal-plus-noise model, with σ̂_n² satisfying (1), the following equality holds for every r ∈ [0, ∞):

lim_{n→∞} inf_{ξ̂_n} sup_{ave(ξ_n²)/σ² ≤ r} R_n(ξ̂_n, ξ_n, σ²) = σ² r/(r+1).  (5)

This result follows from Pinsker's (1980) general lower bound on risk in signal recovery from Gaussian noise. It may also be derived from ideas in Stein (1956) by considering best orthogonally equivariant estimators in the submodel where ave(ξ_n²)/σ² = r. To be asymptotically minimax, an estimator ξ̂_n must satisfy

lim_{n→∞} sup_{ave(ξ_n²)/σ² ≤ r} R_n(ξ̂_n, ξ_n, σ²) = σ² r/(r+1).

Simplest among asymptotically minimax estimators is the James-Stein (1961) estimator

ξ̂_{n,S} = [1 − σ̂_n²/ave(X_n²)]_+ X_n,

where [·]_+ denotes the positive part function and σ̂_n² is an estimator of σ² that satisfies (1). For every positive, finite r and σ², σ² r/(r+1) < σ² min(r, 1). Hence, for large n, the James-Stein estimator dominates, in maximum risk, any of the three estimators discussed in Proposition 1. Figure 1 reveals the extent of this domination.
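The risk values in Propositions 1 and 2 can be seen numerically. The sketch below is my own illustration under an assumed least-favorable-style setup: a constant signal ξ_t ≡ c (so r = c²/σ²) with σ² treated as known. With r = 4, the heavy penalty λ_n = log n zeroes essentially every coordinate (loss near σ²r), the λ = 2 criterion keeps essentially everything (loss near σ²), and the positive-part James-Stein estimator lands near σ²r/(r+1).

```python
import numpy as np

def gic_nested(X, sigma2, lam):
    """Nested selection: keep the first k coordinates, where k minimizes
    Gamma(k) = (1/n) * sum_{t>k} X_t^2 + lam * sigma2 * k / n."""
    n = len(X)
    total = np.sum(X ** 2)
    tail = np.concatenate(([total], total - np.cumsum(X ** 2)))  # tail[k]
    crit = tail / n + lam * sigma2 * np.arange(n + 1) / n
    k = int(np.argmin(crit))            # first, i.e. smallest, minimizer
    est = np.zeros(n)
    est[:k] = X[:k]
    return est

def james_stein(X, sigma2):
    """Positive-part James-Stein shrinkage: [1 - sigma2/ave(X^2)]_+ X."""
    return max(0.0, 1.0 - sigma2 / np.mean(X ** 2)) * X

rng = np.random.default_rng(0)
n, sigma2 = 4000, 1.0
xi = np.full(n, 2.0)                    # r = ave(xi^2)/sigma2 = 4
X = xi + rng.standard_normal(n)

loss = lambda est: np.mean((est - xi) ** 2)
loss_gic2 = loss(gic_nested(X, sigma2, 2.0))           # near sigma2 = 1
loss_gicbic = loss(gic_nested(X, sigma2, np.log(n)))   # near sigma2*r = 4
loss_js = loss(james_stein(X, sigma2))                 # near r/(r+1) = 0.8
```

This reproduces the ordering in the discussion: the James-Stein risk beats both selection estimators once the signal-to-noise ratio is bounded away from zero.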

Figure 1. The asymptotic maximum risks (in multiples of the variance, plotted against the signal-to-noise ratio) of nested model selection estimators generated by the GIC_2 criterion (solid lines) and by the GIC_λn criterion when lim λ_n = ∞ (broken line). The asymptotic minimax risk, attained by good tapered estimators, is the dashed curve below.

The story only begins here. We can construct asymptotically minimax estimators that dominate the James-Stein estimator over submodels. Let G_n be a given closed convex subset of [0, 1]^{T_n} that contains all constants in [0, 1]. Each function g ∈ G_n defines a candidate modulation estimator gX_n = {g(t)X_{n,t} : t ∈ T_n} for ξ_n. The risk of this candidate estimator under quadratic loss is

R_n(gX_n, ξ_n, σ²) = ave[σ²g² + ξ_n²(1 − g)²].

Here squaring is done componentwise. An estimator of this risk, suggested by Stein's unbiased estimator for risk or by the C_p idea, is

R̂_n(g) = ave[g²σ̂_n² + (1 − g)²(X_n² − σ̂_n²)].

The proposal is to use the modulation estimator ĝ_n X_n, where ĝ_n minimizes R̂_n(g) over all g ∈ G_n. When G_n consists of all constant functions in [0, 1]^{T_n}, the modulation estimator ĝ_n X_n is just the James-Stein estimator described above. To improve on James-Stein, let G_{n,mon} be the set of all nonincreasing functions in [0, 1]^{T_n}. The class of candidate estimators {gX_n : g ∈ G_{n,mon}} now

contains the nested model selection estimators discussed earlier. It contains as well candidate estimators that selectively taper the coordinates of X_n towards zero. Choosing ĝ_{n,mon} to minimize R̂_n(g) over g ∈ G_{n,mon} generalizes precisely the choice of û_n to minimize Γ̂_n(u, 2) over u ∈ [0, 1]. Because the class of candidate modulators G_{n,mon} contains all constant functions on [0, 1]^{T_n}, it turns out that ĝ_{n,mon}X_n is asymptotically minimax, unlike ξ̂_{n,2}:

lim_{n→∞} sup_{ave(ξ_n²)/σ² ≤ r} R_n(ĝ_{n,mon}X_n, ξ_n, σ²) = σ² r/(r+1).

Thus, on the one hand, ĝ_{n,mon}X_n dominates, for every r > 0, the nested model selection estimators treated in Proposition 1. On the other hand, because G_{n,mon} is richer than the class of all constants in [0, 1]^{T_n}, the maximum risk of the estimator ĝ_{n,mon}X_n asymptotically dominates that of the James-Stein estimator over large classes of submodels within ave(ξ_n²)/σ² ≤ r. For further details on these points, on other interesting choices of G_n, and on algorithms for computing ĝ_n, see Beran and Dümbgen (1996).

In short, when quadratic risk is the criterion and the signal-to-noise ratio is not asymptotically zero, data-driven tapering of X_n is superior to model selection for estimating ξ_n. This finding is not entirely surprising, since the components of X_n could be Fourier or wavelet coefficients computed from the original data; and tapering is known to reduce the Gibbs phenomenon that is created by truncating a Fourier series.

Proof of Proposition 1. Fix r and suppose throughout that ave(ξ_n²)/σ² ≤ r for every n. Result (4) is obvious. Let

V_{n,1}(u) = n^{−1/2} Σ_{t/(n+1)>u} [(X_{n,t} − ξ_{n,t})² − σ²],
V_{n,2}(u) = n^{−1/2} Σ_{t/(n+1)>u} ξ_{n,t}(X_{n,t} − ξ_{n,t})

for every 0 ≤ u ≤ 1. Let ||·||_∞ denote the supremum norm on [0, 1]. By Kolmogorov's inequality, there exist finite constants C_1 and C_2 such that

sup_{ave(ξ_n²)/σ² ≤ r} E||V_{n,i}||_∞ ≤ C_i for i = 1, 2 and every n ≥ 1.

First step. Recall the definition of γ̂_n(u) and let ν_n(u) = n^{−1} Σ_{t/(n+1)>u} ξ_{n,t}². Then

γ̂_n(u) = ν_n(u) + σ²(1 − n^{−1}[(n+1)u]_I) + n^{−1/2}V_{n,1}(u) + 2n^{−1/2}V_{n,2}(u).  (6)

Consequently,

||Γ̂_n(·, 2) − ν_n(·) − σ²(1 + n^{−1}[(n+1)·]_I)||_∞ = o_E(1).  (7)
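The monotone modulation estimator ĝ_{n,mon}X_n can be computed with the pool-adjacent-violators algorithm: completing the square in R̂_n(g) coordinatewise shows that minimizing it over nonincreasing g is a weighted antitonic least-squares problem with weights X_{n,t}² and targets (X_{n,t}² − σ̂_n²)/X_{n,t}², followed by clipping to [0, 1]. The sketch below is a direct PAVA implementation of that observation, not the algorithm of Beran and Dümbgen (1996).

```python
import numpy as np

def pava_nonincreasing(b, w):
    """Weighted least-squares projection of b onto nonincreasing sequences
    (pool-adjacent-violators applied to the reversed sequence)."""
    vals, wts, cnts = [], [], []
    for v, ww in zip(b[::-1], w[::-1]):      # solve nondecreasing on reversal
        vals.append(float(v)); wts.append(float(ww)); cnts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            tot = wts[-2] + wts[-1]
            merged = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / tot
            cnt = cnts[-2] + cnts[-1]
            del vals[-1], wts[-1], cnts[-1]
            vals[-1], wts[-1], cnts[-1] = merged, tot, cnt
    out = []
    for v, c in zip(vals, cnts):
        out.extend([v] * c)
    return np.asarray(out[::-1])             # reverse back: nonincreasing

def monotone_modulator(X, sigma2_hat):
    """ghat_mon: nonincreasing g in [0,1]^n minimizing
    Rhat(g) = ave[g^2*sigma2 + (1-g)^2*(X^2 - sigma2)]."""
    w = np.maximum(X ** 2, 1e-12)            # per-coordinate weights
    b = (X ** 2 - sigma2_hat) / w            # per-coordinate targets
    return np.clip(pava_nonincreasing(b, w), 0.0, 1.0)
```

On a signal whose energy is concentrated in the early coordinates, the fitted modulator stays near 1 where the signal is strong and tapers toward 0 afterwards, which is exactly the data-driven tapering advocated in the discussion.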


More information

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT OCTOBER 7, 2016 LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT Geometry of LS We ca thik of y ad the colums of X as members of the -dimesioal Euclidea space R Oe ca

More information

Lecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise)

Lecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise) Lecture 22: Review for Exam 2 Basic Model Assumptios (without Gaussia Noise) We model oe cotiuous respose variable Y, as a liear fuctio of p umerical predictors, plus oise: Y = β 0 + β X +... β p X p +

More information

Chapter 6 Sampling Distributions

Chapter 6 Sampling Distributions Chapter 6 Samplig Distributios 1 I most experimets, we have more tha oe measuremet for ay give variable, each measuremet beig associated with oe radomly selected a member of a populatio. Hece we eed to

More information

Stochastic Matrices in a Finite Field

Stochastic Matrices in a Finite Field Stochastic Matrices i a Fiite Field Abstract: I this project we will explore the properties of stochastic matrices i both the real ad the fiite fields. We first explore what properties 2 2 stochastic matrices

More information

Efficient GMM LECTURE 12 GMM II

Efficient GMM LECTURE 12 GMM II DECEMBER 1 010 LECTURE 1 II Efficiet The estimator depeds o the choice of the weight matrix A. The efficiet estimator is the oe that has the smallest asymptotic variace amog all estimators defied by differet

More information

Notes for Lecture 11

Notes for Lecture 11 U.C. Berkeley CS78: Computatioal Complexity Hadout N Professor Luca Trevisa 3/4/008 Notes for Lecture Eigevalues, Expasio, ad Radom Walks As usual by ow, let G = (V, E) be a udirected d-regular graph with

More information

Algebra of Least Squares

Algebra of Least Squares October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal

More information

THE ASYMPTOTIC COMPLEXITY OF MATRIX REDUCTION OVER FINITE FIELDS

THE ASYMPTOTIC COMPLEXITY OF MATRIX REDUCTION OVER FINITE FIELDS THE ASYMPTOTIC COMPLEXITY OF MATRIX REDUCTION OVER FINITE FIELDS DEMETRES CHRISTOFIDES Abstract. Cosider a ivertible matrix over some field. The Gauss-Jorda elimiatio reduces this matrix to the idetity

More information

Random Variables, Sampling and Estimation

Random Variables, Sampling and Estimation Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

Riesz-Fischer Sequences and Lower Frame Bounds

Riesz-Fischer Sequences and Lower Frame Bounds Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.

More information

Investigating the Significance of a Correlation Coefficient using Jackknife Estimates

Investigating the Significance of a Correlation Coefficient using Jackknife Estimates Iteratioal Joural of Scieces: Basic ad Applied Research (IJSBAR) ISSN 2307-4531 (Prit & Olie) http://gssrr.org/idex.php?joural=jouralofbasicadapplied ---------------------------------------------------------------------------------------------------------------------------

More information

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample. Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized

More information

Preponderantly increasing/decreasing data in regression analysis

Preponderantly increasing/decreasing data in regression analysis Croatia Operatioal Research Review 269 CRORR 7(2016), 269 276 Prepoderatly icreasig/decreasig data i regressio aalysis Darija Marković 1, 1 Departmet of Mathematics, J. J. Strossmayer Uiversity of Osijek,

More information

Mi-Hwa Ko and Tae-Sung Kim

Mi-Hwa Ko and Tae-Sung Kim J. Korea Math. Soc. 42 2005), No. 5, pp. 949 957 ALMOST SURE CONVERGENCE FOR WEIGHTED SUMS OF NEGATIVELY ORTHANT DEPENDENT RANDOM VARIABLES Mi-Hwa Ko ad Tae-Sug Kim Abstract. For weighted sum of a sequece

More information

Empirical Processes: Glivenko Cantelli Theorems

Empirical Processes: Glivenko Cantelli Theorems Empirical Processes: Gliveko Catelli Theorems Mouliath Baerjee Jue 6, 200 Gliveko Catelli classes of fuctios The reader is referred to Chapter.6 of Weller s Torgo otes, Chapter??? of VDVW ad Chapter 8.3

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

Frequentist Inference

Frequentist Inference Frequetist Iferece The topics of the ext three sectios are useful applicatios of the Cetral Limit Theorem. Without kowig aythig about the uderlyig distributio of a sequece of radom variables {X i }, for

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece,, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet as

More information

ECON 3150/4150, Spring term Lecture 3

ECON 3150/4150, Spring term Lecture 3 Itroductio Fidig the best fit by regressio Residuals ad R-sq Regressio ad causality Summary ad ext step ECON 3150/4150, Sprig term 2014. Lecture 3 Ragar Nymoe Uiversity of Oslo 21 Jauary 2014 1 / 30 Itroductio

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract We will itroduce the otio of reproducig kerels ad associated Reproducig Kerel Hilbert Spaces (RKHS). We will cosider couple

More information

Econ 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara

Econ 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara Poit Estimator Eco 325 Notes o Poit Estimator ad Cofidece Iterval 1 By Hiro Kasahara Parameter, Estimator, ad Estimate The ormal probability desity fuctio is fully characterized by two costats: populatio

More information

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ. 2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

Math 155 (Lecture 3)

Math 155 (Lecture 3) Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013 Large Deviatios for i.i.d. Radom Variables Cotet. Cheroff boud usig expoetial momet geeratig fuctios. Properties of a momet

More information

Review Problems 1. ICME and MS&E Refresher Course September 19, 2011 B = C = AB = A = A 2 = A 3... C 2 = C 3 = =

Review Problems 1. ICME and MS&E Refresher Course September 19, 2011 B = C = AB = A = A 2 = A 3... C 2 = C 3 = = Review Problems ICME ad MS&E Refresher Course September 9, 0 Warm-up problems. For the followig matrices A = 0 B = C = AB = 0 fid all powers A,A 3,(which is A times A),... ad B,B 3,... ad C,C 3,... Solutio:

More information

Optimally Sparse SVMs

Optimally Sparse SVMs A. Proof of Lemma 3. We here prove a lower boud o the umber of support vectors to achieve geeralizatio bouds of the form which we cosider. Importatly, this result holds ot oly for liear classifiers, but

More information

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak

More information

Chapter IV Integration Theory

Chapter IV Integration Theory Chapter IV Itegratio Theory Lectures 32-33 1. Costructio of the itegral I this sectio we costruct the abstract itegral. As a matter of termiology, we defie a measure space as beig a triple (, A, µ), where

More information

Sieve Estimators: Consistency and Rates of Convergence

Sieve Estimators: Consistency and Rates of Convergence EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes

More information

Support vector machine revisited

Support vector machine revisited 6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector

More information

Lecture 11 October 27

Lecture 11 October 27 STATS 300A: Theory of Statistics Fall 205 Lecture October 27 Lecturer: Lester Mackey Scribe: Viswajith Veugopal, Vivek Bagaria, Steve Yadlowsky Warig: These otes may cotai factual ad/or typographic errors..

More information

Lecture 8: Convergence of transformations and law of large numbers

Lecture 8: Convergence of transformations and law of large numbers Lecture 8: Covergece of trasformatios ad law of large umbers Trasformatio ad covergece Trasformatio is a importat tool i statistics. If X coverges to X i some sese, we ofte eed to check whether g(x ) coverges

More information

lim za n n = z lim a n n.

lim za n n = z lim a n n. Lecture 6 Sequeces ad Series Defiitio 1 By a sequece i a set A, we mea a mappig f : N A. It is customary to deote a sequece f by {s } where, s := f(). A sequece {z } of (complex) umbers is said to be coverget

More information

Properties and Hypothesis Testing

Properties and Hypothesis Testing Chapter 3 Properties ad Hypothesis Testig 3.1 Types of data The regressio techiques developed i previous chapters ca be applied to three differet kids of data. 1. Cross-sectioal data. 2. Time series data.

More information

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014. Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the

More information

A Proof of Birkhoff s Ergodic Theorem

A Proof of Birkhoff s Ergodic Theorem A Proof of Birkhoff s Ergodic Theorem Joseph Hora September 2, 205 Itroductio I Fall 203, I was learig the basics of ergodic theory, ad I came across this theorem. Oe of my supervisors, Athoy Quas, showed

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

Supplemental Material: Proofs

Supplemental Material: Proofs Proof to Theorem Supplemetal Material: Proofs Proof. Let be the miimal umber of traiig items to esure a uique solutio θ. First cosider the case. It happes if ad oly if θ ad Rak(A) d, which is a special

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

Lecture Notes for Analysis Class

Lecture Notes for Analysis Class Lecture Notes for Aalysis Class Topological Spaces A topology for a set X is a collectio T of subsets of X such that: (a) X ad the empty set are i T (b) Uios of elemets of T are i T (c) Fiite itersectios

More information

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound Lecture 7 Ageda for the lecture Gaussia chael with average power costraits Capacity of additive Gaussia oise chael ad the sphere packig boud 7. Additive Gaussia oise chael Up to this poit, we have bee

More information

Chapter 6 Infinite Series

Chapter 6 Infinite Series Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat

More information

KLMED8004 Medical statistics. Part I, autumn Estimation. We have previously learned: Population and sample. New questions

KLMED8004 Medical statistics. Part I, autumn Estimation. We have previously learned: Population and sample. New questions We have previously leared: KLMED8004 Medical statistics Part I, autum 00 How kow probability distributios (e.g. biomial distributio, ormal distributio) with kow populatio parameters (mea, variace) ca give

More information

LECTURE 14 NOTES. A sequence of α-level tests {ϕ n (x)} is consistent if

LECTURE 14 NOTES. A sequence of α-level tests {ϕ n (x)} is consistent if LECTURE 14 NOTES 1. Asymptotic power of tests. Defiitio 1.1. A sequece of -level tests {ϕ x)} is cosistet if β θ) := E θ [ ϕ x) ] 1 as, for ay θ Θ 1. Just like cosistecy of a sequece of estimators, Defiitio

More information

Discrete-Time Systems, LTI Systems, and Discrete-Time Convolution

Discrete-Time Systems, LTI Systems, and Discrete-Time Convolution EEL5: Discrete-Time Sigals ad Systems. Itroductio I this set of otes, we begi our mathematical treatmet of discrete-time s. As show i Figure, a discrete-time operates or trasforms some iput sequece x [

More information

Statistical Inference Based on Extremum Estimators

Statistical Inference Based on Extremum Estimators T. Rotheberg Fall, 2007 Statistical Iferece Based o Extremum Estimators Itroductio Suppose 0, the true value of a p-dimesioal parameter, is kow to lie i some subset S R p : Ofte we choose to estimate 0

More information

Lecture 23: Minimal sufficiency

Lecture 23: Minimal sufficiency Lecture 23: Miimal sufficiecy Maximal reductio without loss of iformatio There are may sufficiet statistics for a give problem. I fact, X (the whole data set) is sufficiet. If T is a sufficiet statistic

More information