Inference After Variable Selection


1 Inference After Variable Selection. Lasanthi Pelawa Watagoda, Department of Mathematics, SIU Carbondale. June 12, 2017

2 Outline
1 Introduction
2 Inference For Ridge and Lasso
3 Variable Selection
  3.1 Criteria for final model selection
  3.2 R functions for variable selection
  3.3 New OLS Sub Model Theorem
4 Prediction Intervals
  4.1 Shorth PI
  4.2 Two new prediction intervals
5 Bootstrapping Hypothesis Tests
  5.1 Bootstrap a Confidence Region
  5.2 Prediction Region Method
6 Examples and Simulations

3 The Multiple Linear Regression model
Suppose that the response variable Y_i and at least one predictor variable x_{i,j} are quantitative with x_{i,1} ≡ 1. Let x_i^T = (x_{i,1}, ..., x_{i,p}) = (1, u_i^T) and β = (β_1, ..., β_p)^T where β_1 corresponds to the intercept. The multiple linear regression (MLR) model is
Y_i = β_1 + x_{i,2}β_2 + ... + x_{i,p}β_p + e_i = x_i^T β + e_i   (1)
for i = 1, ..., n. This model is also called the full model. Here n is the sample size and the random variable e_i is the ith error.

4 The Multiple Linear Regression model: Matrix Notation
In matrix notation, these n equations become
Y = Xβ + e,   (2)
where
Y is the n × 1 vector of dependent variables,
X is the n × p matrix of predictors,
β is the p × 1 vector of unknown coefficients,
e is the n × 1 vector of unknown errors,
ˆβ is an estimator of β,
Ŷ_i = x_i^T ˆβ is the ith fitted value, and
r_i = Y_i − Ŷ_i is the ith residual.
Ordinary least squares (OLS) is often used for inference if n/p is large.
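As a quick illustration of the notation above, the sketch below fits the full OLS model both with lm() and with the closed form ˆβ = (X^T X)^{-1} X^T Y. The simulated data and variable names are made up for the example and are reused in later sketches.

# simulate a small MLR data set (illustrative only)
set.seed(1)
n <- 100; p <- 4
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # first column is the constant
beta <- c(1, 1, 0, 0)
Y <- drop(X %*% beta + rnorm(n))

# OLS via the normal equations: betahat = (X^T X)^{-1} X^T Y
betahat <- solve(crossprod(X), crossprod(X, Y))
Yhat    <- X %*% betahat        # Yhat_i = x_i^T betahat
res_ols <- Y - Yhat             # r_i = Y_i - Yhat_i

# the same fit with lm() (drop the constant column since lm adds its own intercept)
ols <- lm(Y ~ X[, -1])
cbind(betahat, coef(ols))       # the two estimates agree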

5 The Multiple Linear Regression model: More Notation
The centered response is Z = Y − Ȳ where Ȳ = Ȳ 1.
Let W = (W_{ij}) be the n × (p − 1) matrix of standardized nontrivial predictors. For j = 1, ..., p − 1, let W_{ij} denote the (j + 1)th variable standardized so that Σ_{i=1}^n W_{ij} = 0 and Σ_{i=1}^n W_{ij}² = n. Hence
W_{ij} = (x_{i,j+1} − x̄_{j+1}) / ˆσ_{j+1}   where   ˆσ²_{j+1} = (1/n) Σ_{i=1}^n (x_{i,j+1} − x̄_{j+1})².
The sample correlation matrix of the nontrivial predictors u_i is R_u = W^T W / n.
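A minimal sketch of this standardization, using the simulated X and Y from the previous block: it builds W with the divide-by-n variance so that Σ_i W_{ij}² = n, and checks that W^T W / n matches the sample correlation matrix of the nontrivial predictors.

u <- X[, -1]                                  # nontrivial predictors
n <- nrow(u)
xbar <- colMeans(u)
sig  <- sqrt(colMeans(sweep(u, 2, xbar)^2))   # divide-by-n standard deviations
W <- sweep(sweep(u, 2, xbar), 2, sig, "/")    # standardized predictors
Z <- Y - mean(Y)                              # centered response

Ru <- crossprod(W) / n                        # W^T W / n
max(abs(Ru - cor(u)))                         # essentially zero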

6 The Multiple Linear Regression model: More Notation
Then regression through the origin is used for the model
Z = Wη + e,   (3)
where the vector of fitted values is Ŷ = Ȳ + Ẑ.

7 The Multiple Linear Regression model: Alternative Methods For Estimating β
Method                          Criterion
Forward Selection with OLS      Minimum C_p and EBIC
Principal Component Regression  10-fold cross validation (CV)
Partial Least Squares           10-fold cross validation (CV)
Ridge Regression                10-fold cross validation (CV)
Lasso                           10-fold cross validation (CV)
Relaxed Lasso                   10-fold cross validation (CV)

8 The Multiple Linear Regression model: Alternative Methods For Estimating β
All of these methods produce M models, and the number of models M depends on the method. One of the models is the full model (1) that uses all p − 1 nontrivial predictors. The full model is (approximately) fit with (ordinary) least squares (OLS). Some of the methods use ˆη = 0 and fit the model Y_i = β_1 + e_i.

9 The Multiple Linear Regression model: Alternative Methods For Estimating β
Lasso and ridge regression have a tuning parameter λ. When λ = 0, the full OLS model is used. These methods also use a maximum value λ_M of λ and a grid of M λ values
λ_1 = 0 < λ_2 < ... < λ_{M−1} < λ_M.
For lasso, λ_M is the smallest value of λ such that ˆη_{λ_M} = 0. Hence ˆη_{λ_i} ≠ 0 for i < M. For forward selection, PCR, and PLS, M ≤ p.

10 Minimizing Criterion for Ridge and Lasso
Consider choosing ˆη to minimize the criterion
Q(η) = (1/a) (Z − Wη)^T (Z − Wη) + (λ_{1,n}/a) Σ_{i=1}^{p−1} |η_i|^j   (4)
where λ_{1,n} ≥ 0, a > 0, and j > 0 are known constants. Here j = 2 corresponds to ridge regression, j = 1 corresponds to lasso, and a = 1, 2, n, and 2n are common. The residual sum of squares is RSS(η) = (Z − Wη)^T (Z − Wη), and λ_{1,n} = 0 corresponds to the OLS estimator ˆη_OLS = (W^T W)^{−1} W^T Z.
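The sketch below simply evaluates the criterion Q(η) of (4) for a candidate η, using the W and Z built earlier; the choice a = n and the λ values are arbitrary and only for illustration.

Qcrit <- function(eta, Z, W, lambda, a = nrow(W), j = 1) {
  resid <- Z - W %*% eta
  drop(crossprod(resid)) / a + (lambda / a) * sum(abs(eta)^j)
}

eta_ols <- solve(crossprod(W), crossprod(W, Z))   # lambda = 0 gives the OLS estimator
Qcrit(eta_ols, Z, W, lambda = 0)                  # = RSS(eta_ols)/a
Qcrit(eta_ols, Z, W, lambda = 10, j = 1)          # lasso criterion at eta_ols
Qcrit(eta_ols, Z, W, lambda = 10, j = 2)          # ridge criterion at eta_ols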

11 Limiting Distribution For Ridge Regression
Assume p is fixed. Knight and Fu (2000) prove
i) that ˆη is a consistent estimator of η if λ_{1,n} = o(n),
ii) that ˆη_OLS and ˆη are asymptotically equivalent if λ_{1,n} → ∞ too slowly as n → ∞, and
iii) that ˆη is a √n consistent estimator of η if λ_{1,n} = O(√n).
If λ_{1,n}/√n → τ ≥ 0, then
√n (ˆη_RR − η) →_D N_{p−1}(−τ V η, σ² V)   where   R_u = W^T W / n →_P V^{−1}   (5)
as n → ∞. If τ = 0, then OLS and ridge regression have the same limiting distribution.

12 Limiting Distribution For Ridge Regression
Note that V^{−1} = ρ_u if the u_i are a random sample from a population with a nonsingular population correlation matrix ρ_u. Under (5), if λ_{1,n}/n → 0 then
(W^T W + λ_{1,n} I_{p−1}) / n →_P V^{−1},   and   n (W^T W + λ_{1,n} I_{p−1})^{−1} →_P V.

13 Limiting Distribution For Ridge Regression
The following identity from Gunst and Mason (1980, p. 342) is useful for ridge regression inference:
ˆη_RR = (W^T W + λ_{1,n} I_{p−1})^{−1} W^T Z = (W^T W + λ_{1,n} I_{p−1})^{−1} W^T W ˆη_OLS = A_n ˆη_OLS
      = [I_{p−1} − λ_{1,n} (W^T W + λ_{1,n} I_{p−1})^{−1}] ˆη_OLS = B_n ˆη_OLS
since A_n − B_n = 0. If λ_{1,n}/√n → τ ≥ 0, then
√n (ˆη_RR − η) = √n (ˆη_RR − ˆη_OLS + ˆη_OLS − η)
             = √n (ˆη_OLS − η) − (λ_{1,n}/√n) n (W^T W + λ_{1,n} I_{p−1})^{−1} ˆη_OLS
             →_D N_{p−1}(0, σ² V) − τ V η ~ N_{p−1}(−τ V η, σ² V).
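A quick numerical check of the Gunst and Mason identity with the W, Z, and eta_ols from the earlier sketches (the value of λ_{1,n} is arbitrary): the ridge solution computed directly agrees with B_n ˆη_OLS.

lam <- 5
p1  <- ncol(W)                                   # p - 1
eta_rr <- solve(crossprod(W) + lam * diag(p1), crossprod(W, Z))
Bn <- diag(p1) - lam * solve(crossprod(W) + lam * diag(p1))
max(abs(eta_rr - Bn %*% eta_ols))                # essentially zero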

14 Limiting Distribution For Lasso
Theorem. Let ˆη_L be the lasso estimator. Then
√n (ˆη_L − η) →_D N_{p−1}(−(τ/2) V s, σ² V)
where R_u = W^T W / n →_P V^{−1} as n → ∞. If τ = 0, then OLS and lasso have the same limiting distribution.

15 Limiting Distribution For Lasso: New Proof
Proof. The following identity from Efron and Hastie (2016, p. 308), for example, is useful for inference for the lasso estimator ˆη_L:
−W^T (Z − W ˆη_L) + (λ_{1,n}/2) s_n = 0
where s_{in} ∈ [−1, 1] and s_{in} = sign(ˆη_{i,L}) if ˆη_{i,L} ≠ 0. Here sign(η_i) = 1 if η_i > 0 and sign(η_i) = −1 if η_i < 0. Note that s_n = s_{n,ˆη_L} depends on ˆη_L. Thus
ˆη_L = (W^T W)^{−1} W^T Z − n (W^T W)^{−1} (λ_{1,n}/(2n)) s_n.

16 Limiting Distribution For Lasso: New Proof
Proof (continued). If λ_{1,n}/√n → τ ≥ 0 and s_n →_P s = s_η, then
√n (ˆη_L − η) = √n (ˆη_L − ˆη_OLS + ˆη_OLS − η)
            = √n (ˆη_OLS − η) − (λ_{1,n}/(2√n)) n (W^T W)^{−1} s_n
            →_D N_{p−1}(0, σ² V) − (τ/2) V s ~ N_{p−1}(−(τ/2) V s, σ² V).

17 Variable Selection: Definition
Following Olive and Hawkins (2005), a model for variable selection can be described by
x^T β = x_S^T β_S + x_E^T β_E = x_S^T β_S   (6)
where x = (x_S^T, x_E^T)^T, x_S is a k_S × 1 vector, and x_E is a (p − k_S) × 1 vector. Given that x_S is in the model, β_E = 0 and E denotes the subset of terms that can be eliminated given that the subset S is in the model.

18 M Submodels using Forward Selection
Forward selection forms a sequence of submodels I_1, ..., I_M where I_j uses j predictors including the constant. Let I_1 use x*_1 = x_1 ≡ 1: the model has a constant but no nontrivial predictors. To form I_2, consider all models I with two predictors including x*_1. Compute
Q_2(I) = SSE(I) = RSS(I) = r^T(I) r(I) = Σ_{i=1}^n r_i²(I) = Σ_{i=1}^n (Y_i − Ŷ_i(I))².
Let I_2 minimize Q_2(I) over the p − 1 models I that contain x*_1 and one other predictor.

19 M Submodels using Forward Selection
Denote the predictors in I_2 by x*_1, x*_2. In general, to form I_j consider all models I with j predictors including the variables x*_1, ..., x*_{j−1}. Compute
Q_j(I) = r^T(I) r(I) = Σ_{i=1}^n r_i²(I) = Σ_{i=1}^n (Y_i − Ŷ_i(I))².
Let I_j minimize Q_j(I) over the p − j + 1 models I that contain x*_1, ..., x*_{j−1} and one other predictor not already selected. Denote the predictors in I_j by x*_1, ..., x*_j. Continue in this manner for j = 2, ..., M. Often M = min(⌈n/J⌉, p) for some integer J such as J = 5, 10, or 20.
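A minimal forward selection sketch with the leaps package, matching the table of R functions given later in the talk; the data are the simulated X and Y from the earlier blocks, and nvmax plays the role of M − 1 (the maximum number of nontrivial predictors).

library(leaps)
u <- X[, -1]                                     # nontrivial predictors
M <- min(ceiling(nrow(u) / 5), ncol(X))          # M = min(ceil(n/5), p)
fs <- regsubsets(x = u, y = Y, method = "forward", nvmax = M - 1)
sfs <- summary(fs)
sfs$cp                                           # Cp for I_2, ..., I_M
best <- which.min(sfs$cp)                        # submodel minimizing Cp
sfs$which[best, ]                                # variables in the selected submodel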

20 Criteria for final model selection
When there is a sequence of M submodels, the final submodel I_d needs to be selected. Let x_I and ˆβ_I be a × 1 vectors; hence the candidate model contains a terms, including a constant. Suppose the e_i are independent and identically distributed (iid) with variance V(e_i) = σ². Then there are many criteria used to select the final submodel I_d. A simple method is to take the model that uses d = M = min(⌈n/J⌉, p) variables. If p is fixed, this method will use the full OLS model once ⌈n/J⌉ ≥ p.

21 Criteria for final model selection: Criterion when there is a good estimator for σ²
Let criteria C_S(I) have the form
C_S(I) = SSE(I) + a K_n ˆσ².
These criteria need a good estimator of σ². The criterion C_p(I) = AIC_S(I) uses K_n = 2 while the BIC_S(I) criterion uses K_n = log(n). Typically ˆσ² is the full OLS model MSE = Σ_{i=1}^n r_i² / (n − p) when n/p is large. Then ˆσ² = MSE is a √n consistent estimator of σ² under mild conditions by Su and Cook (2012).

22 Criteria for final model selection: EBIC
The EBIC criterion given in Luo and Chen (2012) may work when n/p is not large. Let 0 ≤ γ ≤ 1 and |I| = a ≤ min(n, p) if ˆβ_I is a × 1. We may use a ≤ min(n/5, p). Then
EBIC(I) = n log(SSE(I)/n) + a log(n) + 2γ log[ C(p, a) ] = BIC(I) + 2γ log[ C(p, a) ],
where C(p, a) is the binomial coefficient "p choose a". This criterion can give good results if p = p_n = O(n^k) and γ > 1 − 1/(2k). Hence we will use γ = 1. The above criteria can be applied to forward selection and relaxed lasso. The C_p criterion can also be applied to lasso.
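A small helper that evaluates EBIC(I) from a submodel's SSE, following the formula above with lchoose() supplying the log binomial coefficient and γ = 1 as in the talk; applying it across the regsubsets fit from the previous sketch is shown for illustration.

ebic <- function(sse, a, n, p, gamma = 1) {
  n * log(sse / n) + a * log(n) + 2 * gamma * lchoose(p, a)
}

# EBIC for the forward selection submodels I_2, ..., I_M from the regsubsets fit
a_vals <- rowSums(sfs$which)        # number of terms in each submodel, including the constant
sapply(seq_along(sfs$rss), function(i) ebic(sfs$rss[i], a_vals[i], nrow(u), ncol(X)))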

23 R functions for variable selection
Method             Criterion     R function   Library
Forward Selection  Minimum C_p   regsubsets   leaps
PCR                10-fold CV    pcr          pls
PLS                10-fold CV    plsr         pls
RR                 10-fold CV    cv.glmnet    glmnet
Lasso (Relaxed)    10-fold CV    cv.glmnet    glmnet
Table: Methods used for Variable Selection
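For concreteness, a sketch of the ridge and lasso fits with cv.glmnet as listed in the table (10-fold CV is the default); the relaxed lasso step shown here simply refits OLS on the predictors that lasso kept, which is one common way to implement it and may differ in detail from the talk's implementation.

library(glmnet)
u <- X[, -1]
ridge <- cv.glmnet(u, Y, alpha = 0, nfolds = 10)   # ridge regression
lasso <- cv.glmnet(u, Y, alpha = 1, nfolds = 10)   # lasso
coef(lasso, s = "lambda.min")                      # lasso coefficients at the CV-chosen lambda

# relaxed lasso (one simple variant): OLS on the predictors lasso kept
cf <- as.matrix(coef(lasso, s = "lambda.min"))
keep <- which(cf[-1, 1] != 0)
rl <- if (length(keep) > 0) lm(Y ~ u[, keep, drop = FALSE]) else lm(Y ~ 1)
coef(rl)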

24 New OLS Sub Model Theorem
Theorem. Suppose the usual linear model Y = Xβ + e holds, with E(Y) = Xβ, E(e) = 0, and Cov(Y) = Cov(e) = σ² I. Partition X and β as
X = [X_I  X_0],   β = (β_I^T, β_0^T)^T,   so   Xβ = X_I β_I + X_0 β_0.
If ˆβ_I = (X_I^T X_I)^{−1} X_I^T Y = A Y, then
E(ˆβ_I) = β_I + (X_I^T X_I)^{−1} X_I^T X_0 β_0   and   Cov(ˆβ_I) = σ² (X_I^T X_I)^{−1}.

25 New OLS Sub Model Theorem: Proof
Proof. Consider an arbitrary submodel I. When the submodel contains the set of predictors S, then ˆβ_I works well and estimates β_I, but if I does not contain S then
E(ˆβ_I) = E( (X_I^T X_I)^{−1} X_I^T Y ) = E(A Y) = A E(Y) = A X β = A (X_I β_I + X_0 β_0)
        = (X_I^T X_I)^{−1} X_I^T X_I β_I + (X_I^T X_I)^{−1} X_I^T X_0 β_0
        = β_I + (X_I^T X_I)^{−1} X_I^T X_0 β_0.
When β_0 = 0, then E(ˆβ_I) = β_I. The above result shows why OLS does not work well if the submodel does not contain enough predictors.

26 New OLS Sub Model Theorem: Proof
Proof (continued). Now consider Cov(ˆβ_I):
Cov(ˆβ_I) = Cov(A Y) = A Cov(Y) A^T = A σ² I A^T = σ² A A^T
          = σ² (X_I^T X_I)^{−1} X_I^T [ (X_I^T X_I)^{−1} X_I^T ]^T
          = σ² (X_I^T X_I)^{−1} X_I^T X_I [ (X_I^T X_I)^T ]^{−1}
          = σ² ( (X_I^T X_I)^T )^{−1} = σ² (X_I^T X_I)^{−1}.
Together with the previous slide, this shows why OLS does not work well if the submodel does not contain enough predictors.

27 Prediction Intervals: Definition
Consider predicting a future test response variable Y_f given a p × 1 vector of predictors x_f and training data (x_1, Y_1), ..., (x_n, Y_n). A large sample 100(1 − δ)% prediction interval (PI) has the form [ˆL_n, Û_n] where P(ˆL_n ≤ Y_f ≤ Û_n) → 1 − δ as the sample size n → ∞.

28 Prediction Intervals: Shorth PI
The shorth(c) estimator is useful for making prediction intervals. Let Z_(1), ..., Z_(n) be the order statistics of Z_1, ..., Z_n. Then let the shortest closed interval containing at least c of the Z_i be
shorth(c) = [Z_(s), Z_(s+c−1)].   (7)
Let
k_n = ⌈n(1 − δ)⌉.   (8)
Frey (2013) showed that for large nδ and iid data, the shorth(k_n) PI has maximum undercoverage ≈ 1.12√(δ/n), and used the shorth(c) estimator as the large sample 100(1 − δ)% PI where
c = min(n, ⌈n[1 − δ + 1.12√(δ/n)]⌉).   (9)
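A sketch of the shorth(c) interval and the Frey PI applied to the OLS residuals from the earlier fit; the function name shorth_int is made up for the example.

shorth_int <- function(z, c) {
  z <- sort(z)
  n <- length(z)
  starts <- 1:(n - c + 1)
  widths <- z[starts + c - 1] - z[starts]
  s <- which.min(widths)                       # shortest closed interval with >= c points
  c(z[s], z[s + c - 1])
}

delta <- 0.05
res <- drop(Y - X %*% betahat)                 # full model residuals
cfrey <- min(length(res), ceiling(length(res) * (1 - delta + 1.12 * sqrt(delta / length(res)))))
shorth_int(res, cfrey)                         # Frey-type shorth PI for a residual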

29 Two new prediction intervals: The First New Prediction Interval
Let
q_n = min(1 − δ + 0.05, 1 − δ + d/n) for δ > 0.1, and q_n = min(1 − δ/2, 1 − δ + 10δd/n) otherwise.   (10)
If 1 − δ < 0.999 and q_n < 1 − δ + 0.001, set q_n = 1 − δ. Let
c = ⌈n q_n⌉,   (11)
and let
b_n = (1 + 15/n) √( (n + 2d)/(n − d) ) if d ≤ 8n/9, and b_n = 5(1 + 15/n) otherwise.   (12)
Compute the shorth(c) of the residuals = [r_(s), r_(s+c−1)] = [ξ̃_{δ1}, ξ̃_{1−δ2}]. Then the first new 100(1 − δ)% large sample PI for Y_f is
[ ˆm(x_f) + b_n ξ̃_{δ1},  ˆm(x_f) + b_n ξ̃_{1−δ2} ].   (13)
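A sketch of this PI built from the formulas above, assuming d is the number of variables (including the constant) used by the selected model and ˆm(x_f) is its fitted value at x_f; the helper name newpi1 and the future predictor vector xf are made up, and shorth_int comes from the previous sketch.

newpi1 <- function(mhat_f, res, d, delta = 0.05) {
  n <- length(res)
  qn <- if (delta > 0.1) min(1 - delta + 0.05, 1 - delta + d / n) else
                         min(1 - delta / 2, 1 - delta + 10 * delta * d / n)
  if (1 - delta < 0.999 && qn < 1 - delta + 0.001) qn <- 1 - delta
  cc <- ceiling(n * qn)
  bn <- if (d <= 8 * n / 9) (1 + 15 / n) * sqrt((n + 2 * d) / (n - d)) else 5 * (1 + 15 / n)
  xi <- shorth_int(res, cc)                    # shorth(c) of the residuals
  c(mhat_f + bn * xi[1], mhat_f + bn * xi[2])
}

xf <- c(1, 0.5, -1, 2)                         # a hypothetical future predictor vector
newpi1(mhat_f = sum(xf * betahat), res = res, d = length(betahat))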

30 Two new prediction intervals: The Second New Prediction Interval (Validation PI)
The second new PI randomly divides the data into two half sets H and V. H has n_H = ⌈n/2⌉ of the cases and V has the remaining n_V = n − n_H cases i_1, ..., i_{n_V}. The estimator ˆm_H(x) is computed using the training data set H. Then the validation residuals v_j = Y_{i_j} − ˆm_H(x_{i_j}) are computed for the j = 1, ..., n_V cases in the validation set V. Find the Frey PI [v_(s), v_(s+c−1)] of the validation residuals. Then the second new 100(1 − δ)% large sample PI for Y_f is
[ ˆm_H(x_f) + v_(s),  ˆm_H(x_f) + v_(s+c−1) ].   (14)
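A sketch of the validation PI using an OLS fit on the half set H; the splitting, fitting, and Frey shorth of the validation residuals follow the description above, with variable names chosen for the example and reusing X, Y, xf, delta, and shorth_int from earlier sketches.

set.seed(2)
nH <- ceiling(nrow(X) / 2)
H  <- sample(nrow(X), nH)                       # indices of the training half set
fitH <- lm(Y[H] ~ X[H, -1])                     # mhat_H computed from H only
v <- Y[-H] - cbind(1, X[-H, -1]) %*% coef(fitH) # validation residuals
nV <- length(v)
cV <- min(nV, ceiling(nV * (1 - delta + 1.12 * sqrt(delta / nV))))
vs <- shorth_int(drop(v), cV)
c(sum(xf * coef(fitH)) + vs[1], sum(xf * coef(fitH)) + vs[2])   # validation PI at x_f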

31 Bootstrapping Hypothesis Tests: Bootstrap a Confidence Region
To bootstrap a confidence region, Mahalanobis distances and prediction regions will be useful. Consider predicting a future test value z_f, given past training data z_1, ..., z_n where the z_i are r × 1 random vectors. A large sample 100(1 − δ)% prediction region is a set A_n such that P(z_f ∈ A_n) → 1 − δ as n → ∞. Let the r × 1 column vector T be a multivariate location estimator, and let the r × r symmetric positive definite matrix C be a dispersion estimator. Then the ith squared sample Mahalanobis distance is
D_i² = D_i²(T, C) = D²_{z_i}(T, C) = (z_i − T)^T C^{−1} (z_i − T)   (15)
for each observation z_i.

32 Bootstrapping Hypothesis Tests: Bootstrap a Confidence Region
Notice that the Euclidean distance of z_i from the estimate of center T is D_i(T, I_r) where I_r is the r × r identity matrix. The classical Mahalanobis distance D_i uses (T, C) = (z̄, S), the sample mean and sample covariance matrix, where
z̄ = (1/n) Σ_{i=1}^n z_i   and   S = (1/(n − 1)) Σ_{i=1}^n (z_i − z̄)(z_i − z̄)^T.   (16)

33 Bootstrapping Hypothesis Tests: Large sample 100(1 − δ)% nonparametric prediction region
Let (T, C) = (z̄, S), and let D_(U_n) be the 100 q_n th sample quantile of the D_i. Then the Olive (2013) large sample 100(1 − δ)% nonparametric prediction region for a future value z_f given iid data z_1, ..., z_n is
{z : D_z²(z̄, S) ≤ D²_(U_n)},   (17)
while the classical large sample 100(1 − δ)% prediction region is
{z : D_z²(z̄, S) ≤ χ²_{r,1−δ}}.   (18)
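A sketch of both prediction regions for bivariate data, taking q_n = 1 − δ for simplicity (the Olive (2013) region uses a slightly corrected quantile); mahalanobis() and qchisq() are base R, and the data and future value are made up.

set.seed(3)
z <- matrix(rnorm(200), ncol = 2)               # iid training data, r = 2
zbar <- colMeans(z); S <- cov(z)
D2 <- mahalanobis(z, zbar, S)                   # squared distances D_i^2
cut_np  <- quantile(D2, 0.95)                   # nonparametric cutoff with q_n = 0.95
cut_chi <- qchisq(0.95, df = 2)                 # classical cutoff

zf <- c(0.3, -1.2)                              # a hypothetical future value
mahalanobis(t(zf), zbar, S) <= c(cut_np, cut_chi)  # is z_f inside each region?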

34 Bootstrapping Hypothesis Tests: Prediction Region Method
The Olive (2017abce) prediction region method obtains a confidence region for µ by applying the nonparametric prediction region to the bootstrap sample T*_1, ..., T*_B. Let T̄* and S*_T be the sample mean and sample covariance matrix of the bootstrap sample. The prediction region method large sample 100(1 − δ)% confidence region for µ is
{w : D_w²(T̄*, S*_T) ≤ D²_(U_B)}   (19)
where D²_(U_B) is computed from D_i² = (T*_i − T̄*)^T [S*_T]^{−1} (T*_i − T̄*) for i = 1, ..., B. Note that the corresponding test for H_0: µ = µ_0 rejects H_0 if (T̄* − µ_0)^T [S*_T]^{−1} (T̄* − µ_0) > D²_(U_B).
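A sketch of the prediction region method test for a simple statistic (the 2 × 1 mean of the bivariate data z from the previous block), again using the uncorrected 1 − δ quantile for simplicity; the bootstrap here resamples cases, and the name prmtest is made up.

prmtest <- function(Tstar, mu0, delta = 0.05) {
  Tbar <- colMeans(Tstar); ST <- cov(Tstar)
  cutoff <- quantile(mahalanobis(Tstar, Tbar, ST), 1 - delta)  # D^2_(U_B)
  stat <- mahalanobis(t(mu0), Tbar, ST)
  c(statistic = stat, cutoff = cutoff, reject = stat > cutoff)
}

B <- 1000
Tstar <- t(replicate(B, colMeans(z[sample(nrow(z), replace = TRUE), ])))  # bootstrap means
prmtest(Tstar, mu0 = c(0, 0))               # test H0: mu = (0, 0)^T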

35 Bootstrapping Hypothesis Tests: Prediction Region Method for Bootstrapping FS
Following Olive (2017bce), we describe the prediction region method for bootstrapping forward selection, where 0s need to be added for omitted variables. Suppose n > 20p. If ˆβ_I is k_I × 1, form ˆβ_{I,0} from ˆβ_I by adding 0s corresponding to the omitted variables. Then ˆβ_{I,0} is a nonlinear estimator of β, and the residual bootstrap method can be applied.

36 Bootstrapping Hypothesis Tests: Prediction Region Method for Bootstrapping FS
Let ˆβ = ˆβ_{I_min,0} be formed from the forward selection model I_min that minimizes the C_p criterion. Instead of computing the least squares estimator from regressing Y*_i on X, perform variable selection on Y*_i and X, fit the model that minimizes the criterion, and add 0s corresponding to the omitted variables, resulting in estimators ˆβ*_1, ..., ˆβ*_B. Then test H_0: µ = Aβ = c using the prediction region method for r > 1 and the shorth if r = 1. Bootstrapping lasso and ridge regression is similar, but lasso already has the 0s and ridge regression usually does not produce 0 slope estimates.

37 Bootstrapping Hypothesis Tests: Residual Bootstrap
For the MLR model with n > 20p, the residual bootstrap makes sense: the residuals from the full model are sampled with replacement, resulting in a bootstrap sample r*_1, ..., r*_n. The values Y*_i = Ŷ_i + r*_i for i = 1, ..., n are collected into a vector Y*_j, which is regressed on X to get ˆβ*_j, for j = 1, ..., B.
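A sketch combining the residual bootstrap with forward selection as described above: each bootstrap response vector is run through regsubsets, the C_p-minimizing submodel is refit, and 0s are added for omitted variables. This is an illustrative loop using the simulated data from earlier, not the talk's actual code.

fs_cp_fit <- function(u, y) {
  s <- summary(regsubsets(x = u, y = y, method = "forward", nvmax = ncol(u)))
  keep <- which(s$which[which.min(s$cp), -1])          # selected nontrivial predictors
  b <- numeric(ncol(u) + 1)                            # beta-hat with 0s for omitted variables
  fit <- lm(y ~ u[, keep, drop = FALSE])
  b[c(1, keep + 1)] <- coef(fit)
  b
}

u <- X[, -1]; full <- lm(Y ~ u); res_full <- resid(full); yhat_full <- fitted(full)
B <- 200
betastar <- t(replicate(B, {
  ystar <- yhat_full + sample(res_full, replace = TRUE)   # residual bootstrap response
  fs_cp_fit(u, ystar)
}))
colMeans(betastar)                                        # bootstrap means of beta-hat*_{Imin,0}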

38 Examples and Simulations: Example for PI
The Hebbler (1847) data was collected from n = 26 districts in Prussia in 1843. We will study the relationship between Y = the number of women married to civilians in the district and the predictors
x_1 = constant,
x_2 = pop = the population of the district in 1843,
x_3 = mmen = the number of married civilian men in the district,
x_4 = mmilmen = the number of married men in the military in the district, and
x_5 = milwmn = the number of women married to husbands in the military in the district.

39 Examples and Simulations: Example for PI
Sometimes the person conducting the survey would not count a spouse if the spouse was not at home. Hence Y is highly correlated with but not equal to x_3. Similarly, x_4 and x_5 are highly correlated but not equal. We expect that Y = x_3 + e is a good model, but n/p = 5.2 is small.

40 Examples and Simulations: Example for PI
Forward selection selected the model with the minimum C_p while the other methods used 10-fold CV. PLS and PCR used the OLS full model. Forward selection used a constant and mmen. Ridge regression used all of the nontrivial predictors. Lasso and relaxed lasso used a constant, mmen, and pop. The resulting 90% PI lengths were compared across the methods.

41 Examples and Simulations: Example: Marry Data Response Plots
[Figure 1: response plots of Y versus Ŷ (yhat) for a) Forward Selection, b) Ridge Regression, c) Lasso, and d) Relaxed Lasso.]

42 Examples and Simulations: Example for PI
Figure 1 shows the response plots for forward selection, ridge regression, lasso, and relaxed lasso. The plots for the PLS = PCR = OLS full model were similar to those of forward selection and relaxed lasso. The plots suggest that the MLR model is appropriate since the plotted points scatter about the identity line. The 90% pointwise prediction bands are also shown, and consist of two lines parallel to the identity line. These bands are very narrow in Figure 1 a) and d).

43 Examples and Simulations: Simulations for New PI
For the model Y = Xβ + e, methods such as forward selection, PCR, PLS, ridge regression, relaxed lasso, and lasso each generate M fitted models I_1, ..., I_M, where M depends on the method, and we considered several methods for selecting the final submodel I_d. Only method i) needs n/p large.
i) Let I_d = I_min be the model that minimizes C_p for forward selection, relaxed lasso, or lasso.
ii) Let I_d = I_min be the model that minimizes EBIC for forward selection or relaxed lasso. For forward selection, we used M = min(⌈n/5⌉, p).
iii) Choose I_d using k-fold cross validation (CV). We used 10-fold CV.

44 Examples and Simulations: Simulations for New PI
The simulation used p = 20, 40, n, and 2n. Let k be the number of nonzero coefficients, including the constant; k = 1, 19, and p − 1 were used. The zero mean errors e_i were iid of five types: i) N(0,1) errors, ii) t_3 errors, iii) EXP(1) − 1 errors, iv) uniform(−1, 1) errors, and v) 0.9 N(0,1) + 0.1 N(0,100) errors. The simulation used 5000 runs, so an observed coverage in (0.94, 0.96) gives no reason to doubt that the PI has the nominal coverage of 0.95. Correlations were set up so that cor(x_i, x_j) = (2ψ + (m − 2)ψ²)/(1 + (m − 1)ψ²) for i ≠ j where x_i and x_j are nontrivial predictors. The simulation used ψ = 0, 1/√p, and 0.9.
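As an illustration of the correlation structure above, the sketch below builds the target pairwise correlation directly as a compound-symmetric matrix and draws the nontrivial predictors from a multivariate normal; the talk's actual construction of the predictors may differ, but the pairwise correlation matches the formula. MASS::mvrnorm is used, and make_x is a made-up name.

library(MASS)
make_x <- function(n, m, psi) {
  rho <- (2 * psi + (m - 2) * psi^2) / (1 + (m - 1) * psi^2)  # target cor(x_i, x_j)
  R <- matrix(rho, m, m); diag(R) <- 1
  mvrnorm(n, mu = rep(0, m), Sigma = R)
}

xsim <- make_x(n = 200, m = 19, psi = 0.9)
round(mean(cor(xsim)[upper.tri(cor(xsim))]), 2)   # average pairwise correlation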

45 Examples and Simulations: Simulation Results for New PI
Table: Large Sample 95% PI Coverages and Lengths, e_i ~ N(0, 1). Rows are indexed by the relation of n to p (n < p, n = p, n ≥ 10p), n, p, ψ, and k; for each of FS, lasso, RL, RR, PLS, and PCR the table reports the coverage (Cov) and average length (Len).

46 Examples and Simulations: Simulation Results for New PI
Table (continued): Large Sample 95% PI Coverages and Lengths, e_i ~ N(0, 1), with the same layout as the previous slide.

47 Examples and Simulations: Simulation Results for New PI
The tables show some simulation results for the first new PI, where forward selection used C_p for n ≥ 10p and EBIC for n < 10p. The other methods minimized 10-fold CV. For methods other than ridge regression, the maximum number of variables used was approximately min(⌈n/5⌉, p). For n ≥ 10p, coverages tended to be near or higher than the nominal value of 0.95. The lengths of the asymptotically optimal 95% PIs are i) 3.92 = 2(1.96), ii) 6.365, iii) 2.996, iv) 1.90 = 2(0.95), and v) 13.49.

48 Examples and Simulations: Simulation Results for New PI
The average PI length was often 1.4 to 1.8 times the asymptotically optimal length for n = 10p and close to the optimal length for n = 100p. An exception was that lasso and ridge regression often had far too long lengths for k ≥ 19 and ψ ≥ 0.5. For n ≤ p, good performance needed stronger regularity conditions. PLS tended to have severe undercoverage with small average length. The PCR length was often too long for ψ = 0.

49 Examples and Simulations: Simulation Results for New PI
When there was k = 1 active predictor, forward selection, lasso, and relaxed lasso often performed well. For k = 19, forward selection performed well, as did lasso and relaxed lasso for ψ = 0. For dense models with k = p − 1 and n/p not large, there was often undercoverage. Let d ≥ 1 be the number of active predictors in the selected model. With N(0, 1) errors, ψ = 0, and d < k, an asymptotic population 95% PI has length 3.92 √(k − d + 1). EBIC occasionally had undercoverage, especially for k = 19 or p − 1, which was usually more severe for ψ = 0.9 or 1/√p.

50 Examples and Simulations: Simulations for Bootstrapping Hypothesis Tests
A small simulation was done, using the same type of data as for the prediction interval simulation, with B = max(1000, n, 20p) and 5000 runs. The regression model used β = (1, 1, 0, 0)^T with n = 100 and p = 4. When ψ = 0, the design matrix X consisted of iid N(0,1) random variables, and the full model least squares confidence intervals for β_i should have length near 2 t_{96,0.975} σ/√n ≈ 2(1.96)σ/10 = 0.392σ when the iid zero mean errors have variance σ².

51 Examples and Simulations: Simulations for Bootstrapping Hypothesis Tests
The simulation computed the Frey shorth(c) interval for each β_i and used the prediction region method to test H_0: β_3 = β_4 = 0. The nominal coverage was 0.95 with δ = 0.05. Observed coverage between 0.94 and 0.96 would suggest coverage is close to the nominal value.

52 Examples and Simulations: Simulation Results for Bootstrapping
The regression models used the residual bootstrap on the full model least squares estimator and on the forward selection estimator ˆβ_{I_min,0}. Results are shown for iid errors e_i ~ N(0, 1).
Table: Bootstrapping OLS Regression and Forward Selection. Rows give the model (reg = full model OLS, vs = forward selection) for ψ = 0, 0.5, and 0.9; columns give the coverage (cov) and length (len) for β_1, ..., β_4 and for the test.

53 Examples and Simulations: Simulation Results for Bootstrapping
The cutoff will often be near √(χ²_{r,0.95}) if the statistic T is asymptotically normal. Note that √(χ²_{2,0.95}) = 2.448 is close to 2.45, the cutoff for the full model regression bootstrap test. The coverages were near 0.95 for the regression bootstrap on the full model. Suppose ψ = 0. Then ˆβ_S has the same limiting distribution for I_min and the full model. Note that the average lengths and coverages were similar for the full model and for forward selection I_min for β_1, β_2, and β_S = (β_1, β_2)^T.

54 Examples and Simulations: Simulation Results for Bootstrapping
Table: Bootstrap LASSO, ψ = 0. Rows are indexed by n and the error type; columns give the confidence interval coverage (cicov) and average length (avelen) for β_1, ..., β_4 and for the test.

55 THANK YOU
