AN ASYMPTOTIC THEORY FOR LINEAR MODEL SELECTION


Statistica Sinica 7(1997)

Jun Shao
University of Wisconsin

Abstract: In the problem of selecting a linear model to approximate the true unknown regression model, some necessary and/or sufficient conditions are established for the asymptotic validity of various model selection procedures such as Akaike's AIC, Mallows' C_p, Shibata's FPE_λ, Schwarz's BIC, the generalized AIC, cross-validation, and generalized cross-validation. It is found that these selection procedures can be classified into three classes according to their asymptotic behavior. Under some fairly weak conditions, the selection procedures in one class are asymptotically valid if there exist fixed-dimension correct models; the selection procedures in another class are asymptotically valid if no fixed-dimension correct model exists. The procedures in the third class are compromises of the procedures in the first two classes. Some empirical results are also presented.

Key words and phrases: AIC, asymptotic loss efficiency, BIC, consistency, C_p, cross-validation, GIC, squared error loss.

1. Introduction

Let y_n = (y_1,...,y_n)' be a vector of n independent responses and X_n = (x_1,...,x_n)' be an n × p_n matrix whose ith row x_i' is the value of a p_n-vector of explanatory variables associated with y_i. For inference purposes, a class of models, indexed by α ∈ A_n, is used to characterize the relation between the mean response μ_n = E(y_n | X_n) and the explanatory variables. If A_n contains more than one model, then we need to select a model from A_n using the given X_n and the data vector y_n. The following are some typical examples.

Example 1. Linear regression. Suppose that p_n = p for all n and μ_n = X_n β with an unknown p-vector β. Write β = (β_1', β_2')' and X_n = (X_{1n}, X_{2n}). It is suspected that the sub-vector β_2 = 0, i.e., X_{2n} is actually not related to μ_n. Then we may propose the following two models:

Model 1: μ_n = X_{1n} β_1;
Model 2: μ_n = X_n β.

In this case, A_n = {1, 2}. It is well known that the least squares fitting under model 1 is more efficient than that under model 2 if β_2 = 0.
More generally, we can consider models

μ_n = X_n(α) β(α),   (1.1)

where α is a subset of {1,...,p} and β(α) (or X_n(α)) contains the components of β (or columns of X_n) that are indexed by the integers in α. In this case A_n consists of some distinct subsets of {1,...,p}. If A_n contains all nonempty subsets of {1,...,p}, then the number of models in A_n is 2^p − 1.

Example 2. One-mean versus k-mean. Suppose that observations are from k groups. Each group has r observations that are identically distributed. Thus, n = kr, where k = k_n and r = r_n are integers. Here we need to select one model from the following two models: (1) the one-mean model, i.e., the k groups have a common mean; (2) the k-mean model, i.e., the k groups have different means. To use the same formula as that in (1.1), we define p_n = k,

        ⎡ 1_r   0    ···   0  ⎤
  X_n = ⎢ 1_r  1_r   ···   0  ⎥
        ⎢  ⋮          ⋱        ⎥
        ⎣ 1_r   0    ···  1_r ⎦

and β = (μ_1, μ_2 − μ_1,...,μ_k − μ_1)', where 1_r denotes the r-vector of ones. Then A_n = {α_1, α_k}, where α_1 = {1} and α_k = {1,...,k}.

Example 3. Linear approximations to a response surface. Suppose that we wish to select the best approximation to the true mean response surface from a class of linear models. Note that the approximation is exact if the response surface is actually linear and is in A_n. The proposed models are μ_n = X_n(α) β_n(α), α ∈ A_n, where X_n(α) is a sub-matrix of X_n and β_n(α) is a sub-vector of a p_n-vector β_n whose components have to be estimated. As a more specific example, we consider the situation where we try to approximate a one-dimensional curve by a polynomial, i.e., μ_n = X_n(α) β_n(α) with the ith row of X_n(α) being (1, t_i, t_i²,...,t_i^{h−1}), i = 1,...,n. In this case A_n = {α_h, h = 1,...,p_n} and α_h = {1,...,h} corresponds to a polynomial of order h used to approximate the true model. The largest possible order of the polynomial may increase as n increases, since the more data we have, the more terms we can afford to use in the polynomial approximation.

We assume in this paper that the models in A_n are linear models and the least squares fitting is used under each proposed model. Each model in A_n is denoted by α, a subset of {1,...,p_n}.
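The nested polynomial class of Example 3 is easy to set up numerically. A minimal sketch (the function names and the use of NumPy are ours, not from the paper):

```python
import numpy as np

def poly_design(t, p):
    """n x p design matrix for Example 3: row i is (1, t_i, t_i^2, ..., t_i^{p-1}).

    The model alpha_h = {1,...,h} then uses the first h columns, i.e. a
    polynomial of order h.
    """
    return np.vander(np.asarray(t, dtype=float), N=p, increasing=True)

def candidate_models(p):
    """The candidate class A_n = {alpha_h, h = 1,...,p}, as 0-based column index sets."""
    return [list(range(h)) for h in range(1, p + 1)]
```

With this layout, selecting a model amounts to choosing how many leading columns of the Vandermonde matrix to keep.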
After observing the vector y_n, our concern is to select a model α from A_n so that the squared error loss

L_n(α) = ||μ_n − μ̂_n(α)||²/n   (1.2)

be as small as possible, where ||·|| is the Euclidean norm and μ̂_n(α) is the least squares estimator (LSE) of μ_n under model α. Note that minimizing L_n(α) is equivalent to minimizing the average prediction error E[n^{-1}||z_n − μ̂_n(α)||² | y_n], where z_n = (z_1,...,z_n)' and z_i is a future observation associated with x_i and is independent of y_i.

A considerable number of selection procedures have been proposed in the literature, e.g., the AIC method (Akaike (1970)); the C_p method (Mallows (1973)); the BIC method (Schwarz (1978), Hannan and Quinn (1979)); the FPE_λ method (Shibata (1984)); the generalized AIC such as the GIC method (Nishii (1984), Rao and Wu (1989)) and its analogues (Pötscher (1989)); the delete-1 cross-validation (CV) method (Allen (1974), Stone (1974)); the generalized CV (GCV) method (Craven and Wahba (1979)); the delete-d CV method (Geisser (1975), Burman (1989), Shao (1993), Zhang (1993)); and the PMDL and PLS methods (Rissanen (1986), Wei (1992)). Some asymptotic results in assessing these selection procedures have been established in some particular situations. Nishii (1984) and Rao and Wu (1989) showed that in Example 1, the BIC and GIC are consistent (definitions of consistency will be given in Section 2), whereas the AIC and C_p are inconsistent. On the other hand, Stone (1979) showed that in some situations (Example 2), the BIC is inconsistent but the AIC and C_p are consistent. In Example 3, Shibata (1981) and Li (1987) showed that the AIC, the C_p, and the delete-1 CV are asymptotically correct in some sense. However, Shao (1993) showed that in Example 1, the delete-1 CV is inconsistent and the delete-d CV is consistent, provided that d/n → 1. These results do not provide a clear picture of the performance of the various selection procedures. Some of these conclusions are obviously contrary to each other, but this is because they are obtained in quite different circumstances.
A crucial factor that almost determines the asymptotic performance of various model selection procedures is whether or not A_n contains some correct models in which the dimensions of the regression parameter vectors do not increase with n. This will be explored in detail in the current paper. The purpose of this paper is to provide an asymptotic theory which shows when the various selection procedures are asymptotically correct (or incorrect) under an asymptotic framework covering all situations described in Examples 1-3. After introducing some notation and definitions in Section 2, we study the asymptotic behavior of the GIC method in Section 3 and other selection procedures cited above in Section 4. Some numerical examples are given in Section 5. Section 6 contains some technical details.

2. Notation and Definitions

Throughout the paper we assume that (X_n'X_n)^{-1} exists and that the minimum and maximum eigenvalues of X_n'X_n are of order n. The matrices X_n, n = 1, 2,..., are considered to be non-random. The results in this paper are also valid in the almost sure sense when the X_n are random, provided that the required conditions involving X_n hold for almost all sequences X_n, n = 1, 2,...

Let A_n be a class of proposed models (subsets of {1,...,p_n}) for selection. The number of models in A_n is finite, but may depend on n. For α ∈ A_n, the proposed model is μ_n = X_n(α) β_n(α), where X_n(α) is an n × p_n(α) submatrix of the n × p_n matrix X_n and β_n(α) is a p_n(α) × 1 sub-vector of an unknown p_n × 1 vector β_n. Without loss of generality, we assume that the largest model ᾱ_n = {1,...,p_n} is always in A_n. The dimension of β_n(α), p_n(α), will be called the dimension of the model α. Under model α, the LSE of μ_n is μ̂_n(α) = H_n(α) y_n, where H_n(α) = X_n(α)[X_n(α)'X_n(α)]^{-1} X_n(α)'.

A proposed model α ∈ A_n is said to be correct if μ_n = X_n(α) β_n(α) is actually true. Note that A_n may not contain a correct model (Example 3); a correct model is not necessarily the best model, since there may be several correct models in A_n (Examples 1 and 2) and there may be an incorrect model having a smaller loss than the best correct model (Example 2). Let A_n^c = {α ∈ A_n : μ_n = X_n(α) β_n(α)} denote all the proposed models that are actually correct models. It is possible that A_n^c is empty or A_n^c = A_n.

Let e_n = y_n − μ_n. It is assumed that the components of e_n are independent and identically distributed with V(e_n | X_n) = σ² I_n, where I_n is the identity matrix of order n. The loss defined in (1.2) is equal to

L_n(α) = Δ_n(α) + e_n'H_n(α)e_n/n,

where Δ_n(α) = ||μ_n − H_n(α)μ_n||²/n. Note that Δ_n(α) = 0 when α ∈ A_n^c. The risk (the expected average squared error) is

R_n(α) = E[L_n(α)] = Δ_n(α) + σ² p_n(α)/n.

Let α̂_n denote the model selected using a given selection procedure and let α_L be a model minimizing L_n(α) over α ∈ A_n. The selection procedure is said to be consistent if

P{α̂_n = α_L} → 1   (2.1)

(all limiting processes are understood to be as n → ∞).
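In this notation, μ̂_n(α) and the loss L_n(α) are straightforward to compute. A minimal sketch (the function names are ours), using a least squares solve rather than forming H_n(α) explicitly:

```python
import numpy as np

def mu_hat(X, y, alpha):
    """LSE of mu under model alpha: mu_hat(alpha) = H(alpha) y, where
    H(alpha) = X(alpha)[X(alpha)'X(alpha)]^{-1}X(alpha)'."""
    Xa = X[:, alpha]
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    return Xa @ beta

def loss(mu, fitted):
    """Average squared error loss L_n(alpha) = ||mu - mu_hat(alpha)||^2 / n."""
    mu = np.asarray(mu, dtype=float)
    return float(np.sum((mu - fitted) ** 2)) / len(mu)
```

As a sanity check, when α is a correct model and y_n = μ_n (no noise), the fit reproduces μ_n and the loss is zero, while an incorrect model leaves a positive loss.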
Note that (2.1) implies

P{L_n(α̂_n) = L_n(α_L)} → 1.   (2.2)

Thus, μ̂_n(α̂_n) is in this sense asymptotically as efficient as the best estimator among μ̂_n(α), α ∈ A_n. (2.1) and (2.2) are equivalent if L_n(α) has a unique minimum for all large n. The consistency defined in (2.1) is in terms of model selection, i.e., we treat α̂_n as an estimator of α_L (it is a well defined estimator if α_L is non-random, e.g., in Example 1). This consistency is not related to the consistency of μ̂_n(α̂_n) as an estimator of μ_n, i.e., L_n(α̂_n) →_p 0. In fact, it may not be worthwhile to discuss the consistency of μ̂_n(α̂_n), since sometimes there is no consistent estimator of μ_n (e.g., min_{α∈A_n} L_n(α) does not tend to 0 in probability) and sometimes there are too many consistent estimators of μ_n (e.g., max_{α∈A_n} L_n(α) →_p 0, in which case μ̂_n(α) is consistent for any α).

In some cases a selection procedure does not have property (2.1), but α̂_n is still close to α_L in the following sense that is weaker than (2.1):

L_n(α̂_n)/L_n(α_L) →_p 1,   (2.3)

where →_p denotes convergence in probability. A selection procedure satisfying (2.3) is said to be asymptotically loss efficient, i.e., α̂_n is asymptotically as efficient as α_L in terms of the loss L_n(α). Since the purpose of model selection is to minimize the loss L_n(α), (2.3) is an essential asymptotic requirement for a selection procedure. Clearly, consistency in the sense of (2.1) implies asymptotic loss efficiency in the sense of (2.3). In some cases (e.g., Examples 1 and 2), consistency is the same as asymptotic loss efficiency. The proof of the following result is given in Section 6.

Proposition 1. Suppose that

p_n/n → 0,   (2.4)

lim inf_n min_{α∈A_n−A_n^c} Δ_n(α) > 0,   (2.5)

and A_n^c is nonempty for sufficiently large n. Then (2.1) is equivalent to (2.3) if either p_n(α_L) →_p ∞ or A_n^c contains exactly one model for sufficiently large n.

The following regularity condition will often be used in establishing asymptotic results:

Σ_{α∈A_n−A_n^c} 1/[n R_n(α)]^l → 0,   (2.6)

where l is some fixed positive integer such that E(y_1 − μ_1)^{4l} < ∞. Note that condition (2.6) is exactly the same as condition (A.3) in Li (1987) when A_n^c is

empty; but Li's condition (A.3) may not hold when A_n^c is not empty. If the number of models in A_n is bounded (Examples 1 and 2), then (2.6) with l = 1 is the same as

n min_{α∈A_n−A_n^c} R_n(α) → ∞,   (2.7)

the condition (A.3′) in Li (1987). When A_n = {α_h, h = 1,...,p_n} with α_h = {1,...,h} (e.g., polynomial approximation in Example 3), Li (1987) showed that condition (2.6) with l = 2 is the same as (2.7). Under an additional assumption that e_n is normal, we may replace (2.6) by Σ_{α∈A_n−A_n^c} δ^{nR_n(α)} → 0 for any 0 < δ < 1, which is Assumption 2 in Shibata (1981).

3. The GIC_{λ_n} Method

Many model selection procedures are identical or equivalent to the procedure which minimizes

Γ_{n,λ_n}(α) = S_n(α)/n + λ_n σ̂_n² p_n(α)/n   (3.1)

over α ∈ A_n, where S_n(α) = ||y_n − μ̂_n(α)||², σ̂_n² is an estimator of σ², and {λ_n} is a sequence of non-random numbers ≥ 2 with λ_n/n → 0. This procedure will be called the GIC_{λ_n} method. If σ̂_n² = S_n(ᾱ_n)/(n − p_n), ᾱ_n = {1,...,p_n}, then the GIC_{λ_n} with λ_n → ∞ is the GIC method in Rao and Wu (1989); the GIC_{λ_n} with λ_n ≡ 2 is the C_p method in Mallows (1973); and the GIC_{λ_n} with λ_n ≡ λ > 2 is the FPE_λ method in Shibata (1984). Since the GIC_{λ_n} is a good representative of the model selection procedures cited in Section 1, we first study its asymptotic behavior. Let the model selected by minimizing Γ_{n,λ_n}(α) be α̂_{n,λ_n}.

Consider first the case of λ_n ≡ 2. Assume that σ̂_n² is a consistent estimator of σ². It is shown in Section 6 that

Γ_{n,2}(α) = ||e_n||²/n − e_n'H_n(α)e_n/n + 2σ̂_n² p_n(α)/n,   α ∈ A_n^c,
Γ_{n,2}(α) = ||e_n||²/n + L_n(α) + o_p(L_n(α)),   α ∈ A_n − A_n^c,   (3.2)

where the equality for the case of α ∈ A_n − A_n^c holds under condition (2.6) and the o_p is uniform in α ∈ A_n − A_n^c. It follows directly from (3.2) that α̂_{n,2} is asymptotically loss efficient in the sense of (2.3) if there is no correct model in A_n, i.e., A_n^c is empty. If A_n^c is not empty but contains exactly one model for each n, say A_n^c = {α_n^c}, then α̂_{n,2} is also asymptotically loss efficient. This can be shown by using (3.2) as follows. If p_n(α_n^c) → ∞, then

2σ̂_n² p_n(α_n^c)/n − e_n'H_n(α_n^c)e_n/n = σ̂_n² p_n(α_n^c)/n + o_p(σ̂_n² p_n(α_n^c)/n) = L_n(α_n^c) + o_p(L_n(α_n^c)),

which, together with (3.2), implies that Γ_{n,2}(α) = ||e_n||²/n + L_n(α) + o_p(L_n(α)) uniformly in α ∈ A_n and, therefore, α̂_{n,2} is asymptotically loss efficient. If p_n(α_n^c) is fixed, then (2.5) holds (Nishii (1984)), which implies that α_L = α_n^c and min_{α∈A_n−A_n^c} [Γ_{n,2}(α) − Γ_{n,2}(α_n^c)] does not tend to 0 in probability and, therefore, P{α̂_{n,2} = α_L} → 1, i.e., α̂_{n,2} is consistent in the sense of (2.1). As the following example indicates, however, α̂_{n,2} may not be an asymptotically loss efficient procedure when A_n^c contains more than one model.

Example 4. Suppose that A_n = A_n^c = {α_1, α_2}, i.e., A_n contains two models and both models are correct. Assume that α_1 ⊂ α_2. Let p_{1n} and p_{2n} be the dimensions of the models α_1 and α_2, respectively. Then p_{1n} < p_{2n} and Q_n = H_n(α_2) − H_n(α_1) is a projection matrix of rank p_{2n} − p_{1n}. Since S_n(α_i) = e_n'e_n − e_n'H_n(α_i)e_n, α̂_{n,2} = α_1 if and only if 2σ̂_n²(p_{2n} − p_{1n}) > e_n'Q_n e_n.

Case 1. p_{1n} → ∞. If p_{2n} − p_{1n} → ∞, then e_n'Q_n e_n/(p_{2n} − p_{1n}) →_p σ² and P{α̂_{n,2} = α_1} → 1, i.e., α̂_{n,2} is consistent. If p_{2n} − p_{1n} ≤ q for a fixed positive integer q, then p_{2n}/p_{1n} → 1, in which case L_n(α_2)/L_n(α_1) →_p 1, i.e., any selection procedure is asymptotically loss efficient.

Case 2. p_{1n} is bounded. If p_{2n} − p_{1n} → ∞, then we still have e_n'Q_n e_n/(p_{2n} − p_{1n}) →_p σ², which implies that α̂_{n,2} is consistent. Assume that p_{2n} − p_{1n} is bounded and that for any fixed integer q and constant c > 2,

lim inf_n inf_{Q∈Q_{n,q}} P(e_n'Q e_n > cσ²q) > 0,   (3.3)

where Q_{n,q} = {all n × n projection matrices of rank q}. Note that condition (3.3) holds if e_n ~ N(0, σ²I_n). From (3.3) and the fact that p_{1n} and p_{2n} − p_{1n} are bounded, the ratio

L_n(α̂_{n,2})/L_n(α_1) = I(α̂_{n,2} = α_1) + [L_n(α_2)/L_n(α_1)] I(α̂_{n,2} = α_2) = 1 + W_n I(α̂_{n,2} = α_2)

does not tend to 1, where W_n = e_n'Q_n e_n / e_n'H_n(α_1)e_n and I(C) is the indicator function of the set C. For example, when e_n ~ N(0, σ²I_n), p_{1n} W_n/(p_{2n} − p_{1n}) is an F-random variable with degrees of freedom p_{2n} − p_{1n} and p_{1n}. Hence α̂_{n,2} is not asymptotically loss efficient.

In Example 4, α̂_{n,2} is asymptotically loss efficient if and only if A_n^c does not contain two models with fixed dimensions. This is actually true in general.
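The criterion (3.1) and the selector α̂_{n,λ_n} translate directly into code. A minimal sketch (function names are ours; sigma2_hat must be supplied, e.g. the replicate-based estimator of Remark 2 or S_n(ᾱ_n)/(n − p_n)):

```python
import numpy as np

def gic_score(X, y, alpha, lam, sigma2_hat):
    """Gamma_{n,lam}(alpha) = S_n(alpha)/n + lam * sigma2_hat * p_n(alpha)/n,
    where S_n(alpha) is the residual sum of squares under model alpha.
    lam = 2 gives Mallows' C_p; lam = log(n) mimics the BIC."""
    n = len(y)
    Xa = X[:, alpha]
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    resid = y - Xa @ beta
    return float(resid @ resid) / n + lam * sigma2_hat * len(alpha) / n

def gic_select(X, y, models, lam, sigma2_hat):
    """alpha_hat_{n,lam}: the candidate model minimizing Gamma_{n,lam}."""
    return min(models, key=lambda a: gic_score(X, y, a, lam, sigma2_hat))
```

With two nested correct models, the residual sums of squares are comparable and the penalty decides, so the selector returns the smaller model; this is the mechanism behind Example 4.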
Let α_n^c be the model in A_n^c with the smallest dimension.

Theorem 1. Suppose that (2.6) holds and that σ̂_n² is consistent for σ².
(i) If A_n^c contains at most one model for all n, then α̂_{n,2} is asymptotically loss efficient in the sense of (2.3). Furthermore, if A_n^c contains a unique model with fixed dimension for all n, then α̂_{n,2} is consistent in the sense of (2.1).
(ii) Assume that A_n^c contains more than one model for sufficiently large n. If

Σ_{α∈A_n^c} 1/[p_n(α)]^m → 0   (3.4)

for some positive integer m such that E(y_1 − μ_1)^{4m} < ∞, then α̂_{n,2} is asymptotically loss efficient. If (3.4) does not hold but

Σ_{α∈A_n^c, α≠α_n^c} 1/[p_n(α) − p_n(α_n^c)]^m → 0   (3.5)

for some positive integer m such that E(y_1 − μ_1)^{4m} < ∞, then α̂_{n,2} is asymptotically loss efficient.
(iii) Assume that A_n^c contains more than one model for sufficiently large n and that (3.3) holds. Then a necessary condition for α̂_{n,2} being asymptotically loss efficient is that

p_n(α_n^c) → ∞ or min_{α∈A_n^c, α≠α_n^c} [p_n(α) − p_n(α_n^c)] → ∞.   (3.6)

(iv) If the number of models in A_n^c is bounded, or if m = 2 and A_n = {α_i, i = 1,...,p_n} with α_i = {1,...,i}, then condition (3.6) is also sufficient for the asymptotic loss efficiency of α̂_{n,2}.

Remark 1. Condition (3.6) means that A_n does not contain two correct models with fixed dimensions.

Remark 2. In Theorem 1, the estimator σ̂_n² is required to be consistent for σ². A popular choice of σ̂_n² is S_n(ᾱ_n)/(n − p_n), the sum of squared residuals (under the largest model in A_n) over its degrees of freedom. This estimator is consistent if A_n^c is not empty, but is not necessarily consistent when A_n^c is empty, i.e., there is no correct model in A_n. We shall further discuss this issue in Section 4. If there are a few replicates at each x_i, then we can compute the within-group sample variance for each i, and the average of the within-group sample variances is always a consistent estimator of σ².

Theorem 1 indicates that asymptotically, the GIC_{λ_n} method with λ_n ≡ 2 can be used to find (1) the best model among incorrect models; (2) the better

model between a correct model and an incorrect model; but it is too crude to be useful in distinguishing correct models with fixed dimensions, i.e., it tends to overfit (select a correct model with an unnecessarily large dimension).

From (3.1), Γ_{n,λ_n}(α) is a sum of two components: S_n(α)/n, which measures the goodness of fit of model α, and λ_n σ̂_n² p_n(α)/n, which is a penalty on the use of models with large dimensions. In view of the fact that the use of λ_n ≡ 2 tends to overfit, it is natural to consider a larger λ_n in (3.1), i.e., to put a heavier penalty on the use of models with large dimensions. The reason why α̂_{n,2} may not be asymptotically loss efficient is that the minimizer of

Γ_{n,2}(α) − ||e_n||²/n = [2σ̂_n² p_n(α) − e_n'H_n(α)e_n]/n,

which is considered as a function of α ∈ A_n^c, may not be the same as the minimizer of L_n(α) = e_n'H_n(α)e_n/n, whose expectation is σ² p_n(α)/n. What will occur if we use a λ_n → ∞? Similar to the expansion (3.2), we have

Γ_{n,λ_n}(α) = ||e_n||²/n + λ_n σ̂_n² p_n(α)/n − e_n'H_n(α)e_n/n,   α ∈ A_n^c,
Γ_{n,λ_n}(α) = ||e_n||²/n + L_n(α) + (λ_n σ̂_n² − 2σ²) p_n(α)/n + o_p(L_n(α)),   α ∈ A_n − A_n^c,   (3.7)

where the equality for the case of α ∈ A_n − A_n^c holds under condition (2.6). If

max_{α∈A_n^c} e_n'H_n(α)e_n/(λ_n σ̂_n² p_n(α)) →_p 0,   (3.8)

then, for α ∈ A_n^c, Γ_{n,λ_n}(α) − ||e_n||²/n is dominated by the term λ_n σ̂_n² p_n(α)/n, which has the same minimizer as L_n(α) = e_n'H_n(α)e_n/n. Hence,

P{α̂_{n,λ_n} ∈ A_n^c but α̂_{n,λ_n} ≠ α_n^c} → 0,   (3.9)

where α̂_{n,λ_n} is the model selected using the GIC_{λ_n} and α_n^c is the model in A_n^c with the smallest dimension. This means that the GIC_{λ_n} method picks the best model in A_n^c as long as (3.8) holds, which is implied by the weak condition

lim sup_n Σ_{α∈A_n^c} 1/[p_n(α)]^m < ∞   (3.10)

for some positive integer m such that E(y_1 − μ_1)^{4m} < ∞. Note that (3.10) holds if the number of models in A_n is bounded (Examples 1 and 2) or if m = 2 and A_n = {α_i, i = 1,...,p_n} with α_i = {1,...,i} (polynomial approximation in Example 3).

For the asymptotic correctness of the GIC_{λ_n} method, the remaining question is whether it can assess the models in A_n − A_n^c. Unfortunately, the GIC_{λ_n} tends to select a model with a small dimension and, therefore, may fail to be asymptotically loss efficient if models with small dimensions have large values of L_n(α). More precisely, if there are α_1 and α_2 in A_n − A_n^c such that

lim_n L_n(α_1)/L_n(α_2) > 1 but lim_n [L_n(α_1) + (λ_n σ̂_n² − 2σ²)p_n(α_1)/n] / [L_n(α_2) + (λ_n σ̂_n² − 2σ²)p_n(α_2)/n] < 1   (3.11)

(which implies lim_n p_n(α_1)/p_n(α_2) < 1), then the GIC_{λ_n} is not asymptotically loss efficient. A necessary condition for α̂_{n,λ_n} to be asymptotically loss efficient is that (3.11) does not hold for any α_1 and α_2. Of course, (3.11) is almost impossible to check. In the following theorem we provide some sufficient conditions for the asymptotic loss efficiency of the GIC_{λ_n}.

Theorem 2. Suppose that (2.6) and (3.10) hold and that σ̂_n² is bounded away from 0 and ∞ in probability.
(i) A sufficient condition for the asymptotic loss efficiency of α̂_{n,λ_n} is that (2.5) holds and λ_n is chosen to satisfy

λ_n → ∞ and λ_n p_n/n → 0.   (3.12)

(ii) If A_n contains at least one correct model with fixed dimension for sufficiently large n, λ_n → ∞ and λ_n/n → 0, then α̂_{n,λ_n} is consistent.

Remark 3. Unlike the case of λ_n ≡ 2, it is not required in Theorem 2 that σ̂_n² be a consistent estimator of σ².

We now apply Theorems 1 and 2 to Examples 1-3.

Example 1. (continued) We use the notation given by (1.1). In this example (2.4), (2.5), (2.6) and (3.10) hold. Note that A_n^c is not empty and consistency in the sense of (2.1) is the same as asymptotic loss efficiency in the sense of (2.3) (Proposition 1). By Theorem 1, α̂_{n,2} is consistent if and only if ᾱ = {1,...,p} is the only correct model. By Theorem 2(ii), α̂_{n,λ_n} is always consistent if λ_n → ∞ and λ_n/n → 0.

Example 2. (continued) Note that n = kr → ∞ means that either k → ∞ or r → ∞. Using Theorems 1 and 2, we now show that α̂_{n,2} is better when k → ∞, whereas α̂_{n,λ_n} with λ_n satisfying (3.12) is better when r → ∞.

It is easy to see that (2.6) and (3.10) hold. Condition (2.5) holds if k is fixed. If k → ∞, then (2.5) is the same as

lim inf_k (1/k) Σ_{j=1}^k (μ_j − k^{-1} Σ_{i=1}^k μ_i)² > 0,

which is a reasonable condition.

Consider first the case where k → ∞ and r is fixed. Since the difference in dimensions of the two models in A_n is k − 1 → ∞, an application of Theorem 1(i)&(iv) shows that α̂_{n,2} is always asymptotically loss efficient. On the other hand, it can be shown that if λ_n → ∞, then P{α̂_{n,λ_n} = α_1} → 1. Hence α̂_{n,λ_n} is asymptotically loss efficient if and only if the one-mean model is correct.

Next, consider the case where r → ∞ and k is fixed. In this case the dimensions of both models are fixed. By Proposition 1, consistency is the same as asymptotic loss efficiency. By Theorem 2, α̂_{n,λ_n} with λ_n → ∞ and λ_n/n → 0 is consistent. By Theorem 1, α̂_{n,2} is consistent if and only if the one-mean model is incorrect.

Finally, consider the case where k → ∞ and r → ∞. Since p_n/n = r^{-1} → 0, consistency is the same as asymptotic loss efficiency (Proposition 1). By Theorems 1 and 2, both α̂_{n,2} and α̂_{n,λ_n} are consistent, but λ_n has to be chosen so that (3.12) holds, i.e., λ_n/r → 0. For example, if we choose λ_n = log n (the GIC_{λ_n} is then equivalent to the BIC in Schwarz (1978)), then α̂_{n,λ_n} is inconsistent if log n/r does not tend to 0. This is exactly what was described in Section 3 of Stone (1979).

Example 3. (continued) In this case p_n → ∞ as n → ∞. Conditions (2.6) and (3.10) are usually satisfied with m = 2. If there exists a correct model in A_n for some n, then there are many correct models in A_n and, by Theorems 1 and 2, α̂_{n,λ_n} is consistent but α̂_{n,2} is not. On the other hand, if there is no correct model in A_n for all n, then α̂_{n,2} is asymptotically loss efficient but α̂_{n,λ_n} may not be, since condition (2.5) may not hold.

In conclusion, the GIC_{λ_n} method with λ_n ≡ 2 is more useful in the case where there is no fixed-dimension correct model, whereas the GIC_{λ_n} method with λ_n → ∞ is more useful in the case where there exist fixed-dimension correct models. To end this section, we discuss briefly the GIC_λ with λ_n ≡ λ, a constant larger than 2.
It is apparent that the GIC_λ with a fixed λ > 2 is a compromise between the GIC_2 and the GIC_{λ_n} with λ_n → ∞. The asymptotic behavior of the GIC_λ, however, is not as good as that of the GIC_2 in the case where no fixed-dimension correct model exists, and not as good as that of the GIC_{λ_n} when there are

fixed-dimension correct models. This can be seen from the proofs of Theorems 1 and 2 in Section 6.

4. Other Selection Methods

In this section we show that some selection methods cited in Section 1 have the same asymptotic behavior (in terms of consistency and asymptotic loss efficiency) as the GIC_{λ_n} under certain conditions. First, consider the GIC_{λ_n} with the following particular choice of σ̂_n²:

σ̂_n² = S_n(ᾱ_n)/(n − p_n),   (4.1)

where ᾱ_n = {1,...,p_n}. If (4.1) is used, then the GIC_2 is the C_p method (Mallows (1973)) and the GIC_{λ_n} is the GIC in Rao and Wu (1989). The estimator in (4.1), however, is not necessarily consistent for σ² if ᾱ_n is an incorrect model. The asymptotic behavior of the C_p (λ_n ≡ 2) is given in the following result.

Theorem 1A. (i) If Δ_n(ᾱ_n) → 0 and p_n/n does not converge to 1, then σ̂_n² in (4.1) is consistent for σ² and, therefore, the assertions (i)-(iv) in Theorem 1 are valid for the C_p.
(ii) If (2.4) holds, then the assertions (i)-(iv) in Theorem 1 are valid for the C_p.

Note that in Theorem 2, we do not need σ̂_n² to be consistent. Hence we have the following result for the case where λ_n → ∞.

Theorem 2A. Assume that (2.6) and (3.10) hold. Then the assertions (i)-(ii) in Theorem 2 are valid for the GIC_{λ_n} with σ̂_n² given by (4.1) and λ_n → ∞.

If we use σ̂_n² = σ̂_n²(α) = S_n(α)/(n − p_n(α)) (an estimate of σ² depending on the model α) in (3.1), then we select a model by minimizing

Γ̃_{n,λ_n}(α) = [S_n(α)/n][1 + λ_n p_n(α)/(n − p_n(α))].

If λ_n p_n/n → 0, this method has the same asymptotic behavior as the method minimizing

log[S_n(α)/n] + λ_n p_n(α)/(n − p_n(α)),

since log(1 + x) ≈ x as x → 0. Minimizing Γ̃_{n,λ_n}(α) is known as the AIC if λ_n ≡ 2 and the BIC if λ_n = log n. Let α̃_{n,λ_n} be the model selected by minimizing Γ̃_{n,λ_n}(α) over α ∈ A_n. We have the following result similar to Theorems 1 and 2.

Theorem 3. Suppose that (2.6) holds.
(i) The assertions (i)-(iv) in Theorem 1 are valid for α̃_{n,2} (the AIC) if either (2.4) holds or

max_{α∈A_n} Δ_n(α) → 0 and p_n/n does not converge to 1.   (4.2)

(ii) Assume that (3.10) holds. Then the assertions (i)-(ii) in Theorem 2 are valid for α̃_{n,λ_n} with λ_n → ∞.

The delete-1 CV method selects a model by minimizing

CV_{n,1}(α) = n^{-1} ||[I_n − H̄_n(α)]^{-1}[y_n − μ̂_n(α)]||²

over α ∈ A_n, where H̄_n(α) is a diagonal matrix whose ith diagonal element is the ith diagonal element of H_n(α). The GCV method replaces H̄_n(α) by [n^{-1} tr H_n(α)]I_n = [n^{-1} p_n(α)]I_n, where tr A is the trace of the matrix A, and hence it selects a model by minimizing

GCV_n(α) = n^{-1} S_n(α)/[1 − n^{-1} p_n(α)]².

From the identity

1/[1 − n^{-1} p_n(α)]² = 1 + 2p_n(α)/(n − p_n(α)) + [p_n(α)/(n − p_n(α))]²,

we know that the GCV and the AIC have the same asymptotic behavior if

max_{α∈A_n} [p_n(α)/(n − p_n(α))]² / [1 + 2p_n(α)/(n − p_n(α))] → 0,   (4.3)

which holds if and only if (2.4) holds.

Theorem 4. Suppose that (2.6) holds.
(i) The assertions (i)-(iv) in Theorem 1 are valid for the GCV if either (2.4) or (4.2) holds.
(ii) Assume that

h_n = max_{i≤n} x_i'(X_n'X_n)^{-1} x_i → 0.   (4.4)

Then the assertions (i)-(iv) in Theorem 1 are valid for the delete-1 CV.

Condition (4.4) is stronger than condition (2.4). When neither (2.4) nor (4.2) holds, the GCV and the delete-1 CV may not be asymptotically loss efficient.

Example 2. (continued) We consider Example 2 in the situation where k is large but r, the number of replications, is fixed. Since p_n/n = r^{-1}, (2.4) does not hold.
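The delete-1 CV score above can be computed from a single fit, since the predicted residual for observation i is the ordinary residual divided by 1 − h_ii. A sketch of both criteria (function names are ours):

```python
import numpy as np

def cv1_score(X, y, alpha):
    """CV_{n,1}(alpha) = n^{-1} sum_i [(y_i - mu_hat_i)/(1 - h_ii)]^2,
    with h_ii the ith diagonal element of H_n(alpha)."""
    n = len(y)
    Xa = X[:, alpha]
    H = Xa @ np.linalg.solve(Xa.T @ Xa, Xa.T)   # hat matrix H_n(alpha)
    resid = y - H @ y
    h = np.diag(H)
    return float(np.sum((resid / (1.0 - h)) ** 2)) / n

def gcv_score(X, y, alpha):
    """GCV_n(alpha) = n^{-1} S_n(alpha) / [1 - p_n(alpha)/n]^2: each
    leverage h_ii is replaced by the average p_n(alpha)/n."""
    n = len(y)
    Xa = X[:, alpha]
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    resid = y - Xa @ beta
    return float(resid @ resid) / n / (1.0 - len(alpha) / n) ** 2
```

A standard check on the implementation is that cv1_score agrees exactly with brute-force leave-one-out refitting.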

Assume that lim_n Δ_n(α_1) = Δ > 0, i.e., (4.2) does not hold. Let y_ij be the jth observation in the ith group, j = 1,...,r, i = 1,...,k, ȳ_i be the ith group mean, ȳ be the overall mean, SS_1 = Σ_{i=1}^k Σ_{j=1}^r (y_ij − ȳ)² and SS_k = Σ_{i=1}^k Σ_{j=1}^r (y_ij − ȳ_i)². The delete-1 CV and the GCV are identical in this case and select the one-mean model if and only if SS_1/(1 − n^{-1})² < SS_k/(1 − r^{-1})². From

L_n(α_1) − L_n(α_k) →_p Δ − σ²/r, SS_1/n →_p σ² + Δ and SS_k/n →_p (r − 1)σ²/r,

the delete-1 CV (or the GCV) is not asymptotically loss efficient if σ²/r < Δ < σ²/(r − 1).

The delete-d CV is an extension of the delete-1 CV. Suppose that we split the n × (1 + p_n) matrix (y_n, X_n) into two distinct sub-matrices: a d × (1 + p_n) matrix (y_{n,s}, X_{n,s}) containing the rows of (y_n, X_n) indexed by the integers in s, a subset of {1,...,n} of size d, and an (n − d) × (1 + p_n) matrix (y_{n,s^c}, X_{n,s^c}) containing the rows of (y_n, X_n) indexed by the integers in s^c, the complement of s. For any α ∈ A_n, we estimate β_n(α) by β̂_{n,s^c}(α), the LSE based on (y_{n,s^c}, X_{n,s^c}) under model α. The model is then assessed by ||y_{n,s} − μ̂_{n,s}(α)||², where μ̂_{n,s}(α) = X_{n,s}(α) β̂_{n,s^c}(α) and X_{n,s}(α) is a d × p_n(α) matrix containing the columns of X_{n,s} indexed by the integers in α. Let S_n be a class of N subsets s. The delete-d CV method selects a model by minimizing

CV_{n,d}(α) = (dN)^{-1} Σ_{s∈S_n} ||y_{n,s} − μ̂_{n,s}(α)||²

over α ∈ A_n. The set S_n can be obtained by using a balanced incomplete block design (Shao (1993)) or by taking a simple random sample from the collection of all possible subsets of {1,...,n} of size d (Burman (1989), Shao (1993)). While the delete-1 CV has the same asymptotic behavior as the C_p (Theorem 4), the delete-d CV has the same asymptotic behavior as the GIC_{λ_n} with

λ_n = n/(n − d) + 1.   (4.5)

If d/n → 0, then λ_n → 2; if d/n → τ ∈ (0, 1), then λ_n → (1 − τ)^{-1} + 1, a fixed constant larger than 2; if d/n → 1, then λ_n → ∞. In view of the discussion (at the end of Section 3) for the GIC_λ with a fixed λ > 2, we consider only the case where d is chosen so that d/n → 1 (λ_n → ∞).

Theorem 5. Suppose that (2.5), (2.6) and (3.10) hold and that

max_{s∈S_n} sup_{||c||=1} | ||X_{n,s^c} c||²/(n − d) − ||X_n c||²/n | → 0.

Then the delete-d CV is asymptotically loss efficient if d is chosen so that

d/n → 1 and p_n/(n − d) → 0.   (4.6)

If, in addition, A_n contains at least one correct model with fixed dimension, then the delete-d CV is consistent.

Remark 4. Condition (4.6) implies condition (2.4) and is similar to condition (3.12) in Theorem 2. In fact, p_n/(n − d) → 0 is a very natural requirement for using the delete-d CV, since n − d is the number of observations used to fit an initial model with as many as p_n parameters.

The PMDL and PLS methods (Rissanen (1986), Wei (1992)) are shown to have the same asymptotic behavior as the BIC method (which is a special case of the GIC) in some situations (Wei (1992)). However, these two methods are intended for the case where e_n is a time series, so that the observations have a natural order. Hence, we do not discuss these methods here.

In conclusion, the methods discussed so far can be classified into the following three classes according to their asymptotic behavior:

Class 1. The GIC_2, the C_p, the AIC, the delete-1 CV, and the GCV.
Class 2. The GIC_{λ_n} with λ_n → ∞ and the delete-d CV with d/n → 1.
Class 3. The GIC_λ with a fixed λ > 2 and the delete-d CV with d/n → τ ∈ (0, 1).

The methods in class 1 are useful in the case where there is no fixed-dimension correct model. With a suitable choice of λ_n or d, the methods in class 2 are useful in the case where there exist fixed-dimension correct models. The methods in class 3 are compromises of the methods in class 1 and the methods in class 2; but their asymptotic performances are not as good as those of the methods in class 1 in the case where no fixed-dimension correct model exists, and not as good as those of the methods in class 2 when there are fixed-dimension correct models.

5. Empirical Results

We study the magnitude of P{α̂_n = α_L} with a fixed n by simulation in two examples. Although some selection methods are shown to have the same asymptotic behavior, their fixed-sample performances (in terms of P{α̂_n = α_L}) may be different.
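The delete-d CV of Section 4, with S_n taken as a simple random collection of splits, can be sketched as follows (function names and the rng argument are ours; the study in Section 5 uses d = 25 and N = 2n = 80 random splits):

```python
import numpy as np

def cv_d_score(X, y, alpha, d, n_splits, rng):
    """CV_{n,d}(alpha) = (d N)^{-1} * sum over splits s of
    ||y_s - X_s(alpha) beta_hat_{s^c}(alpha)||^2, where beta_hat_{s^c}
    is the LSE computed from the n - d retained observations."""
    n = len(y)
    total = 0.0
    for _ in range(n_splits):
        s = rng.choice(n, size=d, replace=False)   # held-out index set s
        mask = np.ones(n, dtype=bool)
        mask[s] = False                            # s^c: construction set
        beta, *_ = np.linalg.lstsq(X[mask][:, alpha], y[mask], rcond=None)
        pred = X[s][:, alpha] @ beta
        total += float(np.sum((y[s] - pred) ** 2))
    return total / (d * n_splits)
```

Note that n − d must be at least p_n(α), since each split refits the model from only n − d observations; this is condition (4.6) in miniature.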
The first example is the linear regression (Example 1) with p = 5; that is,

y_i = β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + β_4 x_i4 + β_5 x_i5 + e_i, i = 1,...,40,

where the e_i are independent and identically distributed as N(0, 1), x_ij is the ith value of the jth explanatory variable x_j, x_i1 ≡ 1, and the values of x_ij, j = 2, 3, 4, 5,

are taken from an example in Gunst and Mason (1980) (also, see Table 1 in Shao (1993)). This study is an extension of that in Shao (1993), which studies the cross-validation methods only.

Table 1. Selection probabilities in the regression problem based on 1000 simulations.

[Table 1 lists, for each true β among (2,0,0,4,0), (2,0,0,4,8), (2,9,0,4,8) and (2,9,6,4,8), the empirical probability with which each of the AIC, C_p, GIC, CV_1, GCV and CV_d selects each candidate model; correct models are marked *, and the optimal correct model ({1,4}, {1,4,5}, {1,2,4,5} and {1,2,3,4,5}, respectively) is marked **. Numerical entries omitted.]

Six selection procedures, the AIC, the C_p, the GIC_{λ_n} with σ̂_n² given by (4.1), the delete-1 CV (denoted by CV_1), the GCV, and the delete-d CV (denoted by CV_d), are applied to select a model from 2^p − 1 = 31 models. The λ_n in the GIC is chosen to be log n = log 40 so that this GIC is almost the same as the BIC. The d in the delete-d CV is chosen to be 25 so that (4.5) approximately holds and the delete-d CV is comparable with the GIC. The S_n in the delete-d CV is obtained by taking a random sample of size 2n = 80 from all possible subsets of {1,...,40} of size 25. For these six selection procedures, the empirical probabilities (based on 1,000 simulations) of selecting each model are reported in Table 1, where each model is denoted by the subset of {1,...,5} that contains the indices of the explanatory variables x_j in the model. Models corresponding to zero empirical probabilities for all the methods in the simulation are omitted.
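A scaled-down version of this experiment can be scripted directly. The sketch below is our own: a random normal design stands in for the Gunst and Mason covariates (which we do not reproduce), the candidate set is reduced to four models, and it estimates P{α̂_n = α_L} for C_p (λ = 2) and a BIC-like GIC (λ = log n):

```python
import numpy as np

def gic_pick(X, y, models, lam, sigma2_hat):
    """Minimize Gamma_{n,lam}(alpha) = S_n(alpha)/n + lam*sigma2_hat*p(alpha)/n."""
    n = len(y)
    def score(a):
        beta, *_ = np.linalg.lstsq(X[:, a], y, rcond=None)
        r = y - X[:, a] @ beta
        return float(r @ r) / n + lam * sigma2_hat * len(a) / n
    return min(models, key=score)

def selection_freq(n_reps=200, seed=0):
    """Empirical probability of selecting the optimal correct model
    under the true beta = (2, 0, 0, 4, 0), for lambda = 2 and log(n)."""
    rng = np.random.default_rng(seed)
    n, beta = 40, np.array([2.0, 0.0, 0.0, 4.0, 0.0])
    optimal = (0, 3)                               # the model {1,4} in the paper's labels
    models = [[0, 3], [0, 1, 3], [0, 3, 4], [0, 1, 2, 3, 4]]
    hits = {2.0: 0, float(np.log(n)): 0}
    for _ in range(n_reps):
        X = rng.standard_normal((n, 5))
        X[:, 0] = 1.0                              # intercept column x_1 = 1
        y = X @ beta + rng.standard_normal(n)
        bfull, *_ = np.linalg.lstsq(X, y, rcond=None)
        s2 = float(np.sum((y - X @ bfull) ** 2)) / (n - 5)   # estimator (4.1)
        for lam in hits:
            if tuple(gic_pick(X, y, models, lam, s2)) == optimal:
                hits[lam] += 1
    return {lam: h / n_reps for lam, h in hits.items()}
```

Running selection_freq illustrates the qualitative pattern of Table 1: the heavier log n penalty selects the optimal correct model more often than λ = 2, which tends to overfit.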

The second example considered is the polynomial approximation to a possibly nonlinear curve (Example 3); that is, we select a model from the following class of models:

y_i = β_0 + β_1 x_i + ··· + β_{h−1} x_i^{h−1} + e_i, h = 1,...,p_n.   (5.1)

In the simulation, n = 40 and p_n = 5. Other settings and the selection procedures considered are the same as those in the first example. The values of x_i are taken to be the same as x_i2 in the first example. We consider situations where one of the models in (5.1) is correct, as well as the case where the true model is y_i = exp(2x_i) + e_i, so that none of the models in (5.1) is correct. The results are reported in Table 2.

Table 2. Selection probabilities in the polynomial approximation problem based on 1000 simulations.

[Table 2 lists, for each true mean E(y_i) among 1, 1 + 2x_i, 1 + 2x_i + 2x_i², 1 + 2x_i + 2x_i² + x_i³/2, 1 + 2x_i + 2x_i² + x_i³/2 + 2x_i⁴/3 and exp(2x_i), the empirical probability with which each of the AIC, C_p, GIC, CV_1, GCV and CV_d selects each order h; correct models are marked *, and the optimal correct model (h = 1, 2, 3, 4, 5, respectively, for the five polynomial means; no model in (5.1) is correct for exp(2x_i)) is marked **. Numerical entries omitted.]

The following is a summary of the results in Tables 1 and 2.

(1) The procedures in class 2 (the GIC and the CV_d) have much better empirical performances than the procedures in class 1 (the AIC, the C_p, the CV_1, and

the GCV) when there are at least two fixed-dimension correct models. The probability P{α̂_n = α_L} may be very low for the methods in class 1 when the dimension of the optimal model is not close to p_n. This confirms the asymptotic results established in Sections 3 and 4.
(2) The performances of two methods in class 2 may be substantially different. For example, the probability of the GIC selecting the optimal model can be quite low in the first example, whereas the CV_d selects the optimal model with probability higher than 0.90 in all cases. On the other hand, the CV_d sometimes selects an incorrect model with a small probability.

6. Proofs

Proof of Proposition 1. We only need to show that (2.3) does not hold, assuming that (2.1) does not hold. If A_c contains exactly one model, then by (2.4), L_n(α_L) →_p 0; but by (2.5), L_n(α̂_n) does not converge to 0 in probability. Hence (2.3) does not hold. Next, assume that A_c contains more than one model but p_n(α_L)/n → 0. Since P{α̂_n ≠ α_L} does not converge to 0, there exists α_1 ∈ A_c such that α_1 ≠ α_L and P{α̂_n = α_1} does not converge to 0. Then

[L_n(α̂_n)/L_n(α_L) − 1] I(α̂_n = α_1) = [L_n(α_1)/L_n(α_L) − 1] I(α̂_n = α_1) = [e_n'H_n(α_1)e_n / e_n'H_n(α_L)e_n − 1] I(α̂_n = α_1),

which does not converge to 0 in probability.

Proof of (3.2). Note that

Γ_{n,2}(α) = ||e_n||²/n + L_n(α) + 2(σ̂_n² − σ²)p_n(α)/n + 2[σ²p_n(α) − e_n'H_n(α)e_n]/n + 2e_n'[I_n − H_n(α)]μ_n/n.

Hence (3.2) follows from

(σ̂_n² − σ²) max_{α∈A_n\A_c} p_n(α)/[nL_n(α)] →_p 0,  (6.1)

max_{α∈A_n\A_c} |σ²p_n(α) − e_n'H_n(α)e_n| / [nL_n(α)] →_p 0,  (6.2)

and

max_{α∈A_n\A_c} |e_n'[I_n − H_n(α)]μ_n| / [nL_n(α)] →_p 0.  (6.3)

Result (6.1) follows from (6.2), e_n'H_n(α)e_n/n ≤ L_n(α), and the fact that σ̂_n² − σ² →_p 0. Results (6.2) and (6.3) can be shown using the same argument as in Li (1987), p. 970, under condition (2.6).

Proof of Theorem 1. The first statement in (i) is proved in Section 3. The second statement in (i) is a consequence of the first statement and Proposition 1. For (ii), it suffices to show that Γ_{n,2}(α) = ||e_n||²/n + L_n(α) + o_p(L_n(α)) uniformly in α ∈ A_c, which follows from either

max_{α∈A_c} |e_n'H_n(α)e_n/p_n(α) − σ²| →_p 0  (6.4)

or

max_{α∈A_c, α≠α_c} |e_n'[H_n(α) − H_n(α_c)]e_n / [p_n(α) − p_n(α_c)] − σ²| →_p 0.  (6.5)

From Theorem 2 of Whittle (1960),

E|e_n'H_n(α)e_n − σ²p_n(α)|^{2m} ≤ c[p_n(α)]^m,  (6.6)

where c is a positive constant. Then for any ε > 0,

P{ max_{α∈A_c} |e_n'H_n(α)e_n/p_n(α) − σ²| > ε } ≤ cε^{−2m} Σ_{α∈A_c} [p_n(α)]^{−m}.

Hence (6.4) is implied by condition (3.4). A similar argument shows that (6.5) is implied by condition (3.5). The result in (iii) can be proved using the same argument as in Example 4. For (iv), it suffices to show that p_n(α_c) → ∞ is the same as (3.4) and that min_{α∈A_c, α≠α_c} [p_n(α) − p_n(α_c)] → ∞ is the same as (3.5), which is apparent if the number of models in A_c is bounded. The proof for the case where m = 2 and A_n = {α_i, i = 1,...,p_n} with α_i = {1,...,i} is the same as that in Li (1987), p. 963.

Proof of Theorem 2. From (6.6) and condition (3.10),

e_n'H_n(α)e_n / [λ_n p_n(α)] = O_p(λ_n^{−1}) uniformly in α ∈ A_c.

Hence (3.9) holds. Since L_n(α) ≥ Δ_n(α), (3.7) and conditions (2.5) and (3.12) imply that Γ_{n,λ_n}(α) = ||e_n||²/n + L_n(α) + o_p(L_n(α)) uniformly in α ∈ A_n\A_c, and, if A_c is not empty, Γ_{n,λ_n}(α_c) − ||e_n||²/n = o_p(L_n(α)) uniformly in α ∈ A_n\A_c. The result in (i) is established. If A_n contains at least one correct model with a fixed dimension, then (2.5) holds and P{α̂_{n,λ_n} = α_c} → 1. The consistency of α̂_{n,λ_n} then follows from the fact that α_L = α_c.
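Condition (3.4) forces the dimensions p_n(α) of the correct models to grow, so that the χ²-type averages e_n'H_n(α)e_n/p_n(α) appearing in (6.4) concentrate at σ². A quick numerical illustration of that concentration (my own sketch, not part of the paper, using nested models in a synthetic Gaussian design):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 2000, 1.0
X = rng.standard_normal((n, 500))
e = np.sqrt(sigma2) * rng.standard_normal(n)

def chi2_ratio(p_alpha):
    """e' H(alpha) e / p(alpha) for the nested model spanned by the first
    p_alpha columns; H(alpha) = Q Q' with Q an orthonormal column basis."""
    Q, _ = np.linalg.qr(X[:, :p_alpha])
    He = Q @ (Q.T @ e)
    return float(e @ He) / p_alpha

# the ratio is an average of p(alpha) squared projections of the noise,
# so its fluctuation around sigma^2 shrinks like sqrt(2 / p(alpha))
deviations = {p: abs(chi2_ratio(p) - sigma2) for p in (5, 50, 500)}
```

For small p_n(α) the ratio is essentially a χ² average over few terms and can be far from σ², which is exactly why (6.4) needs the dimensions of all correct models to diverge.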

Proofs of Theorems 1A, 2A, and 3. First, consider Theorem 1A. Note that

S_n(ᾱ_n)/(n − p_n) = e_n'[I_n − H_n(ᾱ_n)]e_n/(n − p_n) + nΔ_n(ᾱ_n)/(n − p_n) + 2e_n'[I_n − H_n(ᾱ_n)]μ_n/(n − p_n).  (6.7)

If Δ_n(ᾱ_n) → 0 and p_n/n does not converge to 1, then S_n(ᾱ_n)/(n − p_n) is consistent and the result in (i) follows. Suppose now that (2.4) holds. For the result in (ii), it suffices to show that (6.1) still holds for σ̂_n² = S_n(ᾱ_n)/(n − p_n). From (6.7), S_n(ᾱ_n)/(n − p_n) = σ² + o_p(1) + O_p(Δ_n(ᾱ_n)). Hence (6.1) follows from the fact that

Δ_n(ᾱ_n) max_{α∈A_n} p_n(α)/[nL_n(α)] ≤ max_{α∈A_n} p_n(α)/n,

since L_n(α) ≥ Δ_n(ᾱ_n) for every α ∈ A_n. The proofs for Theorem 2A and Theorem 3 are similar.

Proof of Theorem 4. (i) If (2.4) holds, then (4.3) holds and the result follows from Theorem 3(i). Now, assume that (4.2) holds. Then p_n(α_L)/n → 0 and p_n(α̂_n)/n →_p 0, where α̂_n is the model selected by the GCV. If A_c is empty for all n, then

GCV_n(α̂_n) = ||e_n||²/n + L_n(α̂_n) + o_p(L_n(α̂_n)),
GCV_n(α_L) = ||e_n||²/n + L_n(α_L) + o_p(L_n(α_L)),

and

0 ≥ [GCV_n(α̂_n) − GCV_n(α_L)] / L_n(α̂_n) = [L_n(α̂_n) − L_n(α_L)] / L_n(α̂_n) + o_p(1) ≥ o_p(1).

This proves that (2.3) holds. The proof for the case where A_c is nonempty is similar to the proof of Theorem 1.

(ii) Define T_n(α) = [y_n − μ̂_n(α)]'H̃_n(α)[y_n − μ̂_n(α)]. Then

CV_{n,1}(α) = S_n(α)/n + 2T_n(α)/n + O_p(h_n T_n(α)/n).

The result follows if

max_{α∈A_n\A_c} |T_n(α) − σ²p_n(α)| / [nL_n(α)] = o_p(1),  (6.8)

E|T_n(α)/p_n(α) − σ²|^{2m} ≤ c[p_n(α)]^{−m},  α ∈ A_c,  (6.9)

and

E|[T_n(α) − T_n(α_c)] / [p_n(α) − p_n(α_c)] − σ²|^{2m} ≤ c[p_n(α) − p_n(α_c)]^{−m},  α ∈ A_c,  (6.10)

for some c > 0 and positive integer m such that E(y_1 − μ_1)^{4m} < ∞. Let W_n(α) = [I_n − H̃_n(α)]H̃_n(α)[I_n − H̃_n(α)]. When α ∈ A_n\A_c,

T_n(α) = e_n'W_n(α)e_n + 2e_n'W_n(α)μ_n + μ_n'W_n(α)μ_n.  (6.11)

From Theorem 2 of Whittle (1960),

E|e_n'W_n(α)e_n − E[e_n'W_n(α)e_n]|^{2l} ≤ c[tr W_n²(α)]^l ≤ c h_n^l [tr W_n(α)]^l.  (6.12)

Note that

tr W_n(α) = tr{H̃_n(α)[I_n − H̃_n(α)]} = p_n(α) − tr H̃_n²(α) ≤ p_n(α).  (6.13)

By (2.6), (6.12) and (6.13),

P{ max_{α∈A_n\A_c} |e_n'W_n(α)e_n − E[e_n'W_n(α)e_n]| / [nR_n(α)] > ε } ≤ cε^{−2l} Σ_{α∈A_n\A_c} [nR_n(α)]^{−l} → 0.

Then (6.8) follows from (6.11), μ_n'W_n(α)μ_n ≤ h_n nΔ_n(α) ≤ h_n nR_n(α), and the fact that (2.6) implies

max_{α∈A_n\A_c} |L_n(α)/R_n(α) − 1| →_p 0.

Results (6.9) and (6.10) follow from Theorem 2 of Whittle (1960), the identity (6.13), and the fact that tr H̃_n²(α) ≤ h_n p_n(α) and tr[H̃_n²(α) − H̃_n²(α_c)] ≤ 2h_n[p_n(α) − p_n(α_c)] when α ∈ A_c.

Proof of Theorem 5. It follows from the proof in Shao (1993), Appendix, that under the given conditions,

CV_{n,d}(α) = S_n(α)/n + λ_n T_n(α)/n + o_p(λ_n T_n(α)/n) uniformly in α ∈ A_n,

where λ_n is given by (4.5) and T_n(α) is given by (6.11). The result then follows from the given conditions and result (6.9).

Acknowledgements

I would like to thank the referees and an Associate Editor for their constructive comments and suggestions. The research was supported by a National Science Foundation Grant (DMS) and a National Security Agency Grant (MDA).

References

Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math. 22.
Allen, D. M. (1974). The relationship between variable selection and data augmentation and a method for prediction. Technometrics 16.
Burman, P. (1989). A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika 76.
Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math. 31.
Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc. 70.
Gunst, R. F. and Mason, R. L. (1980). Regression Analysis and Its Application. Marcel Dekker, New York.
Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. J. Roy. Statist. Soc. Ser. B 41.
Li, K.-C. (1987). Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: Discrete index set. Ann. Statist. 15.
Mallows, C. L. (1973). Some comments on C_p. Technometrics 15.
Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. Ann. Statist. 12.
Pötscher, B. M. (1989). Model selection under nonstationarity: autoregressive models and stochastic linear regression models. Ann. Statist. 17.
Rao, C. R. and Wu, Y. (1989). A strongly consistent procedure for model selection in a regression problem. Biometrika 76.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist. 14.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6.
Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc. 88.
Shibata, R. (1981). An optimal selection of regression variables. Biometrika 68.
Shibata, R. (1984). Approximate efficiency of a selection procedure for the number of regression variables. Biometrika 71.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B 36.
Stone, M. (1979). Comments on model selection criteria of Akaike and Schwarz. J. Roy. Statist. Soc. Ser. B 41.
Wei, C. Z. (1992). On predictive least squares principles. Ann. Statist. 20.
Whittle, P. (1960). Bounds for the moments of linear and quadratic forms in independent variables. Theory Probab. Appl. 5.
Zhang, P. (1993). Model selection via multifold cross validation. Ann. Statist. 21.

Department of Statistics, University of Wisconsin, 1210 W. Dayton St., Madison, WI 53706, U.S.A.

(Received March 1994; accepted September 1996)

COMMENT

Rudolf Beran
University of California, Berkeley

Professor Shao's welcome asymptotic analysis of standard model selection procedures divides these into three categories: those that perform better when one or more correct models have fixed dimension under the asymptotics; those that do better when no correct model has fixed dimension; and intermediate methods. Adopting the premise that model selection is intended to reduce estimation risk under quadratic loss, my discussion will draw attention to two points:

GIC_λn selection estimators with lim λ_n = ∞ can have arbitrarily high asymptotic risk when the signal-to-noise ratio is large enough.

GIC_λn selection estimators with either λ_n = 2 or lim λ_n = ∞ are not asymptotically minimax unless the signal-to-noise ratio converges to zero. They are dominated, in maximum risk, by a variety of procedures that taper the components of the least squares fit toward zero.

I will develop both points in a signal recovery setting that is formally a special case of Shao's problem. Suppose that X_n = {X_{n,t} : t ∈ T_n} is an observation on a discrete signal ξ_n = {ξ_{n,t} : t ∈ T_n} that is measured with error at the time points T_n = {1,...,n}. The measurement errors are independent and are such that the distribution of each component X_{n,t} is N(ξ_{n,t}, σ²). For any real-valued function f defined on T_n, let ave(f) = n^{−1} Σ_{t∈T_n} f(t). The time-averaged quadratic loss of any estimator ξ̂_n is then

L_n(ξ̂_n, ξ_n) = ave[(ξ̂_n − ξ_n)²]

and the corresponding risk is

R_n(ξ̂_n, ξ_n, σ²) = E L_n(ξ̂_n, ξ_n).

Model selection and related estimators typically have smaller risk when all but a few components of ξ_n are small. With enough prior information, this favorable situation may be approximated by a suitable orthogonal transformation of X_n before estimation. This transformation leaves the Gaussian error distribution unchanged. A model selection or other estimator constructed in the new coordinate system may be transformed back to the original coordinate system without changing its quadratic loss.
Thus, in signal recovery problems, the {X_{n,t}} might be Fourier, or wavelet, or analysis of variance, or orthogonal polynomial coefficients of the observed signal.
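The invariance behind this remark is that an orthogonal change of coordinates alters neither the Gaussian error law nor the time-averaged quadratic loss. A quick check of the loss invariance (illustrative only; the thresholding estimator below is an arbitrary stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
xi = np.where(np.arange(n) < 10, 5.0, 0.0)       # sparse signal
X = xi + rng.standard_normal(n)                   # observation, sigma = 1
est = np.where(np.abs(X) > 2, X, 0.0)             # some estimator of xi

# a random orthogonal matrix via QR factorization
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

loss = np.mean((est - xi) ** 2)
loss_rotated = np.mean((Q @ est - Q @ xi) ** 2)   # Q preserves Euclidean norms
```

Since Q is orthogonal, ||Q(est − ξ)||² = ||est − ξ||², so the two average losses agree up to floating-point error.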

Let u ∈ [0, 1]. We consider nested model selection, in which the candidate estimators have the form ξ̂_n(u) = {ξ̂_{n,t}(u) : t ∈ T_n}, with ξ̂_{n,t}(u) = X_{n,t} whenever t/(n+1) ≤ u and ξ̂_{n,t}(u) = 0 otherwise. The value of u will be chosen by the GIC_λn method in Shao's Section 3. Let σ̂_n² be a consistent estimator of σ² that satisfies

lim_{n→∞} sup_{ave(ξ_n²)/σ² ≤ r} E|σ̂_n² − σ²| = 0  (1)

for every r ∈ [0, ∞). Such variance estimators may be constructed externally using replication or internally by methods such as those described in Rice (1984). The GIC_λn selection criterion is

Γ̂_n(u, λ_n) = γ̂_n(u) + λ_n σ̂_n² n^{−1} [(n+1)u]_I,

where γ̂_n(u) = n^{−1} Σ_{t/(n+1)>u} X_{n,t}² and [·]_I is the integer part function. Let û_n be the smallest value of u ∈ [0, 1] that minimizes Γ̂_n(u, λ_n). Existence of û_n is assured because the criterion function assumes only a finite number of values. The model selection estimator ξ̂_n(û_n) will be denoted by ξ̂_{n,λ_n}.

Proposition 1. In the signal-plus-noise model, with σ̂_n² satisfying (1), the following bounds hold for every r ∈ [0, ∞):

lim_{n→∞} sup_{ave(ξ_n²)/σ² ≤ r} R_n(ξ̂_{n,2}, ξ_n, σ²) = σ² min(r, 1).  (2)

If lim λ_n = ∞, then

lim_{n→∞} sup_{ave(ξ_n²)/σ² ≤ r} R_n(ξ̂_{n,λ_n}, ξ_n, σ²) = σ² r.  (3)

The least squares estimator X_n satisfies

lim_{n→∞} sup_{ave(ξ_n²)/σ² ≤ r} R_n(X_n, ξ_n, σ²) = σ².  (4)

This proposition will be proved at the end of the discussion. Let us consider some implications:

(a) If ξ_n is a voltage signal, then ave(ξ_n²) is the time-averaged power dissipated by this signal in passing through a unit resistance. Consequently, ave(ξ_n²)/σ² is the time-averaged signal-to-noise ratio in our signal recovery problem. The maximum risks in Proposition 1 are computed over subsets of ξ_n values that are generated by bounding the signal-to-noise ratio from above.

(b) For r = 0, the limiting maximum risks in Proposition 1 do not distinguish between the performance of ξ̂_{n,2} and ξ̂_{n,λ_n} with lim λ_n = ∞. Theorems 1 and 2 in Shao's paper indicate that the latter estimators may perform better

in some (but not all) circumstances where the signal-to-noise ratio converges to zero.

(c) As long as the signal-to-noise ratio does not exceed 1, both ξ̂_{n,2} and ξ̂_{n,λ_n} with lim λ_n = ∞ have the same asymptotic maximum risk. Once the signal-to-noise ratio exceeds 1, then ξ̂_{n,λ_n} has greater asymptotic maximum risk than ξ̂_{n,2} or even the least squares estimator X_n.

(d) For all values of r, the asymptotic maximum risk of ξ̂_{n,λ_n} with lim λ_n = ∞ coincides with that of the trivial estimator ξ̂_n = 0. This does not mean that ξ̂_{n,λ_n} is trivial.

(e) For all values of r, the asymptotic maximum risk of ξ̂_{n,2} equals the smaller of the asymptotic maximum risks of X_n and ξ̂_{n,λ_n} with lim λ_n = ∞. This argument strongly promotes the use of ξ̂_{n,2} over these two competitors unless one is confident that the special circumstances of remark (b) hold.

How well do model selection estimators perform within the class of all estimators of ξ_n? An answer that complements Proposition 1 is

Proposition 2. In the signal-plus-noise model, with σ̂_n² satisfying (1), the following equality holds for every r ∈ [0, ∞):

lim_{n→∞} inf_{ξ̂_n} sup_{ave(ξ_n²)/σ² ≤ r} R_n(ξ̂_n, ξ_n, σ²) = σ² r/(r+1).  (5)

This result follows from Pinsker's (1980) general lower bound on risk in signal recovery from Gaussian noise. It may also be derived from ideas in Stein (1956) by considering best orthogonally equivariant estimators in the submodel where ave(ξ_n²)/σ² = r. To be asymptotically minimax, an estimator ξ̂_n must satisfy

lim_{n→∞} sup_{ave(ξ_n²)/σ² ≤ r} R_n(ξ̂_n, ξ_n, σ²) = σ² r/(r+1).

Simplest among asymptotically minimax estimators is the James-Stein (1961) estimator

ξ̂_{n,S} = [1 − σ̂_n²/ave(X_n²)]_+ X_n,

where [·]_+ denotes the positive part function and σ̂_n² is an estimator of σ² that satisfies (1). For every positive, finite r and σ², σ² r/(r+1) < σ² min(r, 1). Hence, for large n, the James-Stein estimator dominates, in maximum risk, any of the three estimators discussed in Proposition 1. Figure 1 reveals the extent of this domination.
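The risk values in Propositions 1 and 2 can be seen numerically. The sketch below is my own illustration under an assumed least-favorable-style setup: a constant signal ξ_t ≡ c (so r = c²/σ²) with σ² treated as known. With r = 4, the heavy penalty λ_n = log n zeroes essentially every coordinate (loss near σ²r), the λ = 2 criterion keeps essentially everything (loss near σ²), and the positive-part James-Stein estimator lands near σ²r/(r+1).

```python
import numpy as np

def gic_nested(X, sigma2, lam):
    """Nested selection: keep the first k coordinates, where k minimizes
    Gamma(k) = (1/n) * sum_{t>k} X_t^2 + lam * sigma2 * k / n."""
    n = len(X)
    total = np.sum(X ** 2)
    tail = np.concatenate(([total], total - np.cumsum(X ** 2)))  # tail[k]
    crit = tail / n + lam * sigma2 * np.arange(n + 1) / n
    k = int(np.argmin(crit))            # first, i.e. smallest, minimizer
    est = np.zeros(n)
    est[:k] = X[:k]
    return est

def james_stein(X, sigma2):
    """Positive-part James-Stein shrinkage: [1 - sigma2/ave(X^2)]_+ X."""
    return max(0.0, 1.0 - sigma2 / np.mean(X ** 2)) * X

rng = np.random.default_rng(0)
n, sigma2 = 4000, 1.0
xi = np.full(n, 2.0)                    # r = ave(xi^2)/sigma2 = 4
X = xi + rng.standard_normal(n)

loss = lambda est: np.mean((est - xi) ** 2)
loss_gic2 = loss(gic_nested(X, sigma2, 2.0))           # near sigma2 = 1
loss_gicbic = loss(gic_nested(X, sigma2, np.log(n)))   # near sigma2*r = 4
loss_js = loss(james_stein(X, sigma2))                 # near r/(r+1) = 0.8
```

This reproduces the ordering in the discussion: the James-Stein risk beats both selection estimators once the signal-to-noise ratio is bounded away from zero.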

Figure 1. The asymptotic maximum risks (in multiples of the variance, plotted against the signal-to-noise ratio) of nested model selection estimators generated by the GIC_2 criterion (solid lines) and by the GIC_λn criterion when lim λ_n = ∞ (broken line). The asymptotic minimax risk, attained by good tapered estimators, is the dashed curve below.

The story only begins here. We can construct asymptotically minimax estimators that dominate the James-Stein estimator over submodels. Let G_n be a given closed convex subset of [0, 1]^{T_n} that contains all constants in [0, 1]. Each function g ∈ G_n defines a candidate modulation estimator gX_n = {g(t)X_{n,t} : t ∈ T_n} for ξ_n. The risk of this candidate estimator under quadratic loss is

R_n(gX_n, ξ_n, σ²) = ave[σ²g² + ξ_n²(1 − g)²].

Here squaring is done componentwise. An estimator of this risk, suggested by Stein's unbiased estimator for risk or by the C_p idea, is

R̂_n(g) = ave[g²σ̂_n² + (1 − g)²(X_n² − σ̂_n²)].

The proposal is to use the modulation estimator ĝ_n X_n, where ĝ_n minimizes R̂_n(g) over all g ∈ G_n. When G_n consists of all constant functions in [0, 1]^{T_n}, the modulation estimator ĝ_n X_n is just the James-Stein estimator described above. To improve on James-Stein, let G_{n,mon} be the set of all nonincreasing functions in [0, 1]^{T_n}. The class of candidate estimators {gX_n : g ∈ G_{n,mon}} now

contains the nested model selection estimators discussed earlier. It contains as well candidate estimators that selectively taper the coordinates of X_n towards zero. Choosing ĝ_{n,mon} to minimize R̂_n(g) over g ∈ G_{n,mon} generalizes precisely the choice of û_n to minimize Γ̂_n(u, 2) over u ∈ [0, 1]. Because the class of candidate modulators G_{n,mon} contains all constant functions on [0, 1]^{T_n}, it turns out that ĝ_{n,mon}X_n is asymptotically minimax, unlike ξ̂_{n,2}:

lim_{n→∞} sup_{ave(ξ_n²)/σ² ≤ r} R_n(ĝ_{n,mon}X_n, ξ_n, σ²) = σ² r/(r+1).

Thus, on the one hand, ĝ_{n,mon}X_n dominates, for every r > 0, the nested model selection estimators treated in Proposition 1. On the other hand, because G_{n,mon} is richer than the class of all constants in [0, 1]^{T_n}, the maximum risk of the estimator ĝ_{n,mon}X_n asymptotically dominates that of the James-Stein estimator over large classes of submodels within ave(ξ_n²)/σ² ≤ r. For further details on these points, on other interesting choices of G_n, and on algorithms for computing ĝ_n, see Beran and Dümbgen (1996).

In short, when quadratic risk is the criterion and the signal-to-noise ratio is not asymptotically zero, data-driven tapering of X_n is superior to model selection for estimating ξ_n. This finding is not entirely surprising, since the components of X_n could be Fourier or wavelet coefficients computed from the original data; and tapering is known to reduce the Gibbs phenomenon that is created by truncating a Fourier series.

Proof of Proposition 1. Fix r and suppose throughout that ave(ξ_n²)/σ² ≤ r for every n. Result (4) is obvious. Let

V_{n,1}(u) = n^{−1/2} Σ_{t/(n+1)>u} [(X_{n,t} − ξ_{n,t})² − σ²],
V_{n,2}(u) = n^{−1/2} Σ_{t/(n+1)>u} ξ_{n,t}(X_{n,t} − ξ_{n,t})

for every 0 ≤ u ≤ 1. Let ||·||_∞ denote the supremum norm on [0, 1]. By Kolmogorov's inequality, there exist finite constants C_1 and C_2 such that

sup_{ave(ξ_n²)/σ² ≤ r} E||V_{n,i}||_∞ ≤ C_i for i = 1, 2 and every n ≥ 1.

First step. Recall the definition of γ̂_n(u) and let ν_n(u) = n^{−1} Σ_{t/(n+1)>u} ξ_{n,t}². Then

γ̂_n(u) = ν_n(u) + σ²(1 − n^{−1}[(n+1)u]_I) + n^{−1/2}V_{n,1}(u) + 2n^{−1/2}V_{n,2}(u).  (6)

Consequently,

||Γ̂_n(·, 2) − ν_n(·) − σ²(1 + n^{−1}[(n+1)·]_I)||_∞ = o_E(1).  (7)
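The monotone modulation estimator ĝ_{n,mon}X_n can be computed with the pool-adjacent-violators algorithm: completing the square in R̂_n(g) coordinatewise shows that minimizing it over nonincreasing g is a weighted antitonic least-squares problem with weights X_{n,t}² and targets (X_{n,t}² − σ̂_n²)/X_{n,t}², followed by clipping to [0, 1]. The sketch below is a direct PAVA implementation of that observation, not the algorithm of Beran and Dümbgen (1996).

```python
import numpy as np

def pava_nonincreasing(b, w):
    """Weighted least-squares projection of b onto nonincreasing sequences
    (pool-adjacent-violators applied to the reversed sequence)."""
    vals, wts, cnts = [], [], []
    for v, ww in zip(b[::-1], w[::-1]):      # solve nondecreasing on reversal
        vals.append(float(v)); wts.append(float(ww)); cnts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            tot = wts[-2] + wts[-1]
            merged = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / tot
            cnt = cnts[-2] + cnts[-1]
            del vals[-1], wts[-1], cnts[-1]
            vals[-1], wts[-1], cnts[-1] = merged, tot, cnt
    out = []
    for v, c in zip(vals, cnts):
        out.extend([v] * c)
    return np.asarray(out[::-1])             # reverse back: nonincreasing

def monotone_modulator(X, sigma2_hat):
    """ghat_mon: nonincreasing g in [0,1]^n minimizing
    Rhat(g) = ave[g^2*sigma2 + (1-g)^2*(X^2 - sigma2)]."""
    w = np.maximum(X ** 2, 1e-12)            # per-coordinate weights
    b = (X ** 2 - sigma2_hat) / w            # per-coordinate targets
    return np.clip(pava_nonincreasing(b, w), 0.0, 1.0)
```

On a signal whose energy is concentrated in the early coordinates, the fitted modulator stays near 1 where the signal is strong and tapers toward 0 afterwards, which is exactly the data-driven tapering advocated in the discussion.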


More information

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT OCTOBER 7, 2016 LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT Geometry of LS We ca thik of y ad the colums of X as members of the -dimesioal Euclidea space R Oe ca

More information

Lecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise)

Lecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise) Lecture 22: Review for Exam 2 Basic Model Assumptios (without Gaussia Noise) We model oe cotiuous respose variable Y, as a liear fuctio of p umerical predictors, plus oise: Y = β 0 + β X +... β p X p +

More information

Chapter 6 Sampling Distributions

Chapter 6 Sampling Distributions Chapter 6 Samplig Distributios 1 I most experimets, we have more tha oe measuremet for ay give variable, each measuremet beig associated with oe radomly selected a member of a populatio. Hece we eed to

More information

Stochastic Matrices in a Finite Field

Stochastic Matrices in a Finite Field Stochastic Matrices i a Fiite Field Abstract: I this project we will explore the properties of stochastic matrices i both the real ad the fiite fields. We first explore what properties 2 2 stochastic matrices

More information

Efficient GMM LECTURE 12 GMM II

Efficient GMM LECTURE 12 GMM II DECEMBER 1 010 LECTURE 1 II Efficiet The estimator depeds o the choice of the weight matrix A. The efficiet estimator is the oe that has the smallest asymptotic variace amog all estimators defied by differet

More information

Notes for Lecture 11

Notes for Lecture 11 U.C. Berkeley CS78: Computatioal Complexity Hadout N Professor Luca Trevisa 3/4/008 Notes for Lecture Eigevalues, Expasio, ad Radom Walks As usual by ow, let G = (V, E) be a udirected d-regular graph with

More information

Algebra of Least Squares

Algebra of Least Squares October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal

More information

THE ASYMPTOTIC COMPLEXITY OF MATRIX REDUCTION OVER FINITE FIELDS

THE ASYMPTOTIC COMPLEXITY OF MATRIX REDUCTION OVER FINITE FIELDS THE ASYMPTOTIC COMPLEXITY OF MATRIX REDUCTION OVER FINITE FIELDS DEMETRES CHRISTOFIDES Abstract. Cosider a ivertible matrix over some field. The Gauss-Jorda elimiatio reduces this matrix to the idetity

More information

Random Variables, Sampling and Estimation

Random Variables, Sampling and Estimation Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

Riesz-Fischer Sequences and Lower Frame Bounds

Riesz-Fischer Sequences and Lower Frame Bounds Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.

More information

Investigating the Significance of a Correlation Coefficient using Jackknife Estimates

Investigating the Significance of a Correlation Coefficient using Jackknife Estimates Iteratioal Joural of Scieces: Basic ad Applied Research (IJSBAR) ISSN 2307-4531 (Prit & Olie) http://gssrr.org/idex.php?joural=jouralofbasicadapplied ---------------------------------------------------------------------------------------------------------------------------

More information

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample. Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized

More information

Preponderantly increasing/decreasing data in regression analysis

Preponderantly increasing/decreasing data in regression analysis Croatia Operatioal Research Review 269 CRORR 7(2016), 269 276 Prepoderatly icreasig/decreasig data i regressio aalysis Darija Marković 1, 1 Departmet of Mathematics, J. J. Strossmayer Uiversity of Osijek,

More information

Mi-Hwa Ko and Tae-Sung Kim

Mi-Hwa Ko and Tae-Sung Kim J. Korea Math. Soc. 42 2005), No. 5, pp. 949 957 ALMOST SURE CONVERGENCE FOR WEIGHTED SUMS OF NEGATIVELY ORTHANT DEPENDENT RANDOM VARIABLES Mi-Hwa Ko ad Tae-Sug Kim Abstract. For weighted sum of a sequece

More information

Empirical Processes: Glivenko Cantelli Theorems

Empirical Processes: Glivenko Cantelli Theorems Empirical Processes: Gliveko Catelli Theorems Mouliath Baerjee Jue 6, 200 Gliveko Catelli classes of fuctios The reader is referred to Chapter.6 of Weller s Torgo otes, Chapter??? of VDVW ad Chapter 8.3

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

Frequentist Inference

Frequentist Inference Frequetist Iferece The topics of the ext three sectios are useful applicatios of the Cetral Limit Theorem. Without kowig aythig about the uderlyig distributio of a sequece of radom variables {X i }, for

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece,, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet as

More information

ECON 3150/4150, Spring term Lecture 3

ECON 3150/4150, Spring term Lecture 3 Itroductio Fidig the best fit by regressio Residuals ad R-sq Regressio ad causality Summary ad ext step ECON 3150/4150, Sprig term 2014. Lecture 3 Ragar Nymoe Uiversity of Oslo 21 Jauary 2014 1 / 30 Itroductio

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract We will itroduce the otio of reproducig kerels ad associated Reproducig Kerel Hilbert Spaces (RKHS). We will cosider couple

More information

Econ 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara

Econ 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara Poit Estimator Eco 325 Notes o Poit Estimator ad Cofidece Iterval 1 By Hiro Kasahara Parameter, Estimator, ad Estimate The ormal probability desity fuctio is fully characterized by two costats: populatio

More information

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ. 2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

Math 155 (Lecture 3)

Math 155 (Lecture 3) Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013 Large Deviatios for i.i.d. Radom Variables Cotet. Cheroff boud usig expoetial momet geeratig fuctios. Properties of a momet

More information

Review Problems 1. ICME and MS&E Refresher Course September 19, 2011 B = C = AB = A = A 2 = A 3... C 2 = C 3 = =

Review Problems 1. ICME and MS&E Refresher Course September 19, 2011 B = C = AB = A = A 2 = A 3... C 2 = C 3 = = Review Problems ICME ad MS&E Refresher Course September 9, 0 Warm-up problems. For the followig matrices A = 0 B = C = AB = 0 fid all powers A,A 3,(which is A times A),... ad B,B 3,... ad C,C 3,... Solutio:

More information

Optimally Sparse SVMs

Optimally Sparse SVMs A. Proof of Lemma 3. We here prove a lower boud o the umber of support vectors to achieve geeralizatio bouds of the form which we cosider. Importatly, this result holds ot oly for liear classifiers, but

More information

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak

More information

Chapter IV Integration Theory

Chapter IV Integration Theory Chapter IV Itegratio Theory Lectures 32-33 1. Costructio of the itegral I this sectio we costruct the abstract itegral. As a matter of termiology, we defie a measure space as beig a triple (, A, µ), where

More information

Sieve Estimators: Consistency and Rates of Convergence

Sieve Estimators: Consistency and Rates of Convergence EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes

More information

Support vector machine revisited

Support vector machine revisited 6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector

More information

Lecture 11 October 27

Lecture 11 October 27 STATS 300A: Theory of Statistics Fall 205 Lecture October 27 Lecturer: Lester Mackey Scribe: Viswajith Veugopal, Vivek Bagaria, Steve Yadlowsky Warig: These otes may cotai factual ad/or typographic errors..

More information

Lecture 8: Convergence of transformations and law of large numbers

Lecture 8: Convergence of transformations and law of large numbers Lecture 8: Covergece of trasformatios ad law of large umbers Trasformatio ad covergece Trasformatio is a importat tool i statistics. If X coverges to X i some sese, we ofte eed to check whether g(x ) coverges

More information

lim za n n = z lim a n n.

lim za n n = z lim a n n. Lecture 6 Sequeces ad Series Defiitio 1 By a sequece i a set A, we mea a mappig f : N A. It is customary to deote a sequece f by {s } where, s := f(). A sequece {z } of (complex) umbers is said to be coverget

More information

Properties and Hypothesis Testing

Properties and Hypothesis Testing Chapter 3 Properties ad Hypothesis Testig 3.1 Types of data The regressio techiques developed i previous chapters ca be applied to three differet kids of data. 1. Cross-sectioal data. 2. Time series data.

More information

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014. Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the

More information

A Proof of Birkhoff s Ergodic Theorem

A Proof of Birkhoff s Ergodic Theorem A Proof of Birkhoff s Ergodic Theorem Joseph Hora September 2, 205 Itroductio I Fall 203, I was learig the basics of ergodic theory, ad I came across this theorem. Oe of my supervisors, Athoy Quas, showed

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

Supplemental Material: Proofs

Supplemental Material: Proofs Proof to Theorem Supplemetal Material: Proofs Proof. Let be the miimal umber of traiig items to esure a uique solutio θ. First cosider the case. It happes if ad oly if θ ad Rak(A) d, which is a special

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

Lecture Notes for Analysis Class

Lecture Notes for Analysis Class Lecture Notes for Aalysis Class Topological Spaces A topology for a set X is a collectio T of subsets of X such that: (a) X ad the empty set are i T (b) Uios of elemets of T are i T (c) Fiite itersectios

More information

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound Lecture 7 Ageda for the lecture Gaussia chael with average power costraits Capacity of additive Gaussia oise chael ad the sphere packig boud 7. Additive Gaussia oise chael Up to this poit, we have bee

More information

Chapter 6 Infinite Series

Chapter 6 Infinite Series Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat

More information

KLMED8004 Medical statistics. Part I, autumn Estimation. We have previously learned: Population and sample. New questions

KLMED8004 Medical statistics. Part I, autumn Estimation. We have previously learned: Population and sample. New questions We have previously leared: KLMED8004 Medical statistics Part I, autum 00 How kow probability distributios (e.g. biomial distributio, ormal distributio) with kow populatio parameters (mea, variace) ca give

More information

LECTURE 14 NOTES. A sequence of α-level tests {ϕ n (x)} is consistent if

LECTURE 14 NOTES. A sequence of α-level tests {ϕ n (x)} is consistent if LECTURE 14 NOTES 1. Asymptotic power of tests. Defiitio 1.1. A sequece of -level tests {ϕ x)} is cosistet if β θ) := E θ [ ϕ x) ] 1 as, for ay θ Θ 1. Just like cosistecy of a sequece of estimators, Defiitio

More information

Discrete-Time Systems, LTI Systems, and Discrete-Time Convolution

Discrete-Time Systems, LTI Systems, and Discrete-Time Convolution EEL5: Discrete-Time Sigals ad Systems. Itroductio I this set of otes, we begi our mathematical treatmet of discrete-time s. As show i Figure, a discrete-time operates or trasforms some iput sequece x [

More information

Statistical Inference Based on Extremum Estimators

Statistical Inference Based on Extremum Estimators T. Rotheberg Fall, 2007 Statistical Iferece Based o Extremum Estimators Itroductio Suppose 0, the true value of a p-dimesioal parameter, is kow to lie i some subset S R p : Ofte we choose to estimate 0

More information

Lecture 23: Minimal sufficiency

Lecture 23: Minimal sufficiency Lecture 23: Miimal sufficiecy Maximal reductio without loss of iformatio There are may sufficiet statistics for a give problem. I fact, X (the whole data set) is sufficiet. If T is a sufficiet statistic

More information