Large Sample Theory For OLS Variable Selection Estimators

Lasanthi C. R. Pelawa Watagoda and David J. Olive
Southern Illinois University
June 18, 2018

Author note: David J. Olive is Professor, Department of Mathematics, Southern Illinois University, Carbondale, IL 62901, USA. Lasanthi C. R. Pelawa Watagoda is Visiting Assistant Professor, Appalachian State University, Boone, NC, USA.

Abstract

This paper gives large sample theory for ordinary least squares variable selection estimators such as forward selection and backward elimination. This theory is useful for comparing these estimators with the elastic net, lasso, and ridge regression when the sample size is large compared to the number of predictors.

KEY WORDS: Elastic Net; Forward Selection; Lasso; Mixture Distributions; Relaxed Lasso; Ridge Regression.

1 INTRODUCTION

In this section we review the large sample theory for some shrinkage estimators and review the variable selection model. The following section will give large sample theory for OLS variable selection estimators. We assume the number of predictors, p, is fixed.

Suppose that the response variable Y_i and at least one predictor variable x_{i,j} are quantitative with x_{i,1} ≡ 1. Let x_i^T = (x_{i,1}, ..., x_{i,p}) = (1, u_i^T) and β = (β_1, ..., β_p)^T where β_1 corresponds to the intercept. Then the multiple linear regression model is

Y_i = β_1 + x_{i,2} β_2 + ... + x_{i,p} β_p + e_i = x_i^T β + e_i    (1)

for i = 1, ..., n. This model is also called the full model. Here n is the sample size, and we assume that the random variables e_i are independent and identically distributed (iid) with variance V(e_i) = σ². In matrix notation, these n equations become

Y = X β + e    (2)

where Y is an n × 1 vector of response variables, X is an n × p matrix of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1 vector of unknown errors.
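As a concrete illustration (not part of the paper), the following minimal numpy sketch simulates data from model (2) with an intercept column and computes the OLS full model estimator of β. The sample size, coefficient values, and error scale are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4
# design matrix with x_{i,1} = 1 and p - 1 quantitative predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, 0.0, -1.0])      # illustrative true coefficients
e = rng.normal(scale=0.5, size=n)           # iid errors with V(e_i) = 0.25
Y = X @ beta + e                            # model (2): Y = X beta + e

# OLS full model estimator, fitted values, and residuals
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
Y_hat = X @ beta_hat
r = Y - Y_hat
```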

The ith fitted value Ŷ_i = x_i^T β̂ and the ith residual r_i = Y_i − Ŷ_i, where β̂ is an estimator of β. Ordinary least squares (OLS) is often used for inference if n/p is large.

It is often convenient to use the centered response Z = Y − Ȳ 1, where Ȳ = (1/n) Σ_{i=1}^n Y_i, and the n × (p − 1) matrix of standardized nontrivial predictors W = (W_ij). For j = 1, ..., p − 1, let W_ij denote the (j + 1)th variable standardized so that Σ_{i=1}^n W_ij = 0 and Σ_{i=1}^n W_ij² = n. Note that the sample correlation matrix of the nontrivial predictors u_i is R_u = W^T W / n. Then regression through the origin is used for the model

Z = W η + e    (3)

where the vector of fitted values Ŷ = Ȳ 1 + Ẑ.

There are many methods for estimating β, including backward elimination and forward selection with OLS, the elastic net due to Zou and Hastie (2005), lasso due to Tibshirani (1996), and ridge regression; see Hoerl and Kennard (1970). We also used the variant of relaxed lasso that applies OLS to a constant and the predictors that had nonzero lasso coefficients, which is the LARS-OLS hybrid estimator of Efron, Hastie, Johnstone, and Tibshirani (2004), also called the relaxed lasso (φ = 0) estimator by Meinshausen (2007). Some large sample theory for these estimators will be summarized below.

These methods produce M models and use a criterion to select the final model (e.g., C_p or 10-fold cross validation (CV)). The number of models M depends on the method. Lasso and ridge regression have a parameter λ, and if λ = 0, then the OLS full model is used. These two methods also use a maximum value λ_M of λ and a grid of M λ values 0 ≤ λ_1 < λ_2 < ... < λ_{M−1} < λ_M. For lasso, λ_M is the smallest value of λ such that η̂_{λ_M} = 0. Hence η̂_{λ_i} ≠ 0 for i < M. See James, Witten, Hastie, and Tibshirani (2013, ch. 6).

Consider choosing η̂ to minimize the criterion

Q(η) = (1/a) (Z − W η)^T (Z − W η) + (λ_{1,n}/a) Σ_{i=1}^{p−1} |η_i|^j    (4)

where λ_{1,n} ≥ 0, a > 0, and j > 0 are known constants. Then j = 2 corresponds to ridge regression, j = 1 corresponds to lasso, and a = 1, 2, n, and 2n are common. The residual sum of squares RSS(η) = (Z − W η)^T (Z − W η), and λ_{1,n} = 0 corresponds to the OLS estimator η̂_OLS = (W^T W)^{−1} W^T Z. For model (4), Knight and Fu (2000) proved that i) η̂ is a consistent estimator of η if λ_{1,n} = o(n), so λ_{1,n}/n → 0 as n → ∞, ii) η̂ is a √n consistent estimator of η if λ_{1,n} = O(√n) (so λ_{1,n}/√n is bounded), and iii) η̂_OLS, lasso, and ridge regression are asymptotically equivalent if λ_{1,n}/√n → 0 as n → ∞.

Assume that the sample correlation matrix

R_u = W^T W / n →P V^{−1}    (5)

where V^{−1} = ρ_u, the population correlation matrix of the nontrivial predictors u_i, if the u_i are a random sample from a population. Under (5), if λ_{1,n}/n → 0, then

(W^T W + λ_{1,n} I_{p−1})/n →P V^{−1}, and n (W^T W + λ_{1,n} I_{p−1})^{−1} →P V.
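The centering and standardization above can be checked numerically. The sketch below (an illustration, not from the paper) builds W so that each column sums to 0 and has sum of squares n, and verifies that R_u = W^T W / n agrees with the sample correlation matrix of the nontrivial predictors; the data are simulated with arbitrary coefficient values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
U = rng.normal(size=(n, p - 1))                  # nontrivial predictors u_i
Y = 1.0 + U @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

Z = Y - Y.mean()                                 # centered response
W = (U - U.mean(axis=0)) / U.std(axis=0)         # standardized predictors
assert np.allclose(W.sum(axis=0), 0.0)           # sum_i W_ij = 0
assert np.allclose((W ** 2).sum(axis=0), n)      # sum_i W_ij^2 = n

R_u = W.T @ W / n                                # sample correlation matrix
assert np.allclose(R_u, np.corrcoef(U, rowvar=False))

# regression through the origin for model (3); fitted values Yhat = Ybar + Zhat
eta_ols = np.linalg.solve(W.T @ W, W.T @ Z)
Y_hat = Y.mean() + W @ eta_ols
```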

Let H = W (W^T W)^{−1} W^T, and assume that max_{i=1,...,n} h_{ii} →P 0 as n → ∞. Then the OLS estimator satisfies

√n (η̂_OLS − η) →D N_{p−1}(0, σ² V).    (6)

The following identity from Gunst and Mason (1980, p. 342) is useful for ridge regression inference:

η̂_R = (W^T W + λ_{1,n} I_{p−1})^{−1} W^T Z
    = (W^T W + λ_{1,n} I_{p−1})^{−1} W^T W (W^T W)^{−1} W^T Z
    = (W^T W + λ_{1,n} I_{p−1})^{−1} W^T W η̂_OLS = A_n η̂_OLS
    = [I_{p−1} − λ_{1,n} (W^T W + λ_{1,n} I_{p−1})^{−1}] η̂_OLS = B_n η̂_OLS
    = η̂_OLS − (λ_{1,n}/n) n (W^T W + λ_{1,n} I_{p−1})^{−1} η̂_OLS

since A_n − B_n = 0.

The following identity from Efron and Hastie (2016, p. 308), for example, is useful for inference for the lasso estimator η̂_L:

−(1/n) W^T (Z − W η̂_L) + (λ_{1,n}/(2n)) s_n = 0, or W^T (Z − W η̂_L) = (λ_{1,n}/2) s_n,

where s_{in} ∈ [−1, 1] and s_{in} = sign(η̂_{i,L}) if η̂_{i,L} ≠ 0. Here sign(η_i) = 1 if η_i > 0 and sign(η_i) = −1 if η_i < 0. Note that s_n = s_{n,η̂_L} depends on η̂_L. Thus

η̂_L = (W^T W)^{−1} W^T Z − (λ_{1,n}/(2n)) n (W^T W)^{−1} s_n = η̂_OLS − (λ_{1,n}/(2n)) n (W^T W)^{−1} s_n.

Following Hastie, Tibshirani, and Wainwright (2015, p. 57), the elastic net estimator η̂_EN minimizes

Q_EN(η) = RSS(η) + λ_1 ‖η‖_2² + λ_2 ‖η‖_1    (7)

where λ_1 = (1 − α) λ_{1,n} and λ_2 = 2 α λ_{1,n} with 0 ≤ α ≤ 1. Following Jia and Yu (2010), by standard Karush-Kuhn-Tucker (KKT) conditions for convex optimality for Equation (7), η̂_EN is optimal if

2 W^T W η̂_EN − 2 W^T Z + 2 λ_1 η̂_EN + λ_2 s_n = 0, or (W^T W + λ_1 I_{p−1}) η̂_EN = W^T Z − (λ_2/2) s_n.

Hence

η̂_EN = η̂_R − n (W^T W + λ_1 I_{p−1})^{−1} (λ_2/(2n)) s_n.    (8)

Thus

η̂_EN = η̂_OLS − (λ_1/n) n (W^T W + λ_1 I_{p−1})^{−1} η̂_OLS − (λ_2/(2n)) n (W^T W + λ_1 I_{p−1})^{−1} s_n
     = η̂_OLS − n (W^T W + λ_1 I_{p−1})^{−1} [ (λ_1/n) η̂_OLS + (λ_2/(2n)) s_n ].
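The Gunst and Mason ridge identity above is easy to verify numerically. The following sketch (illustrative, not from the paper) computes η̂_R directly and via A_n η̂_OLS and B_n η̂_OLS on simulated data; the value of λ_{1,n} is arbitrary and chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(2)
n, q = 100, 3                                    # q = p - 1 nontrivial predictors
U = rng.normal(size=(n, q))
W = (U - U.mean(axis=0)) / U.std(axis=0)         # standardized predictors
Z = W @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)
Z = Z - Z.mean()                                 # centered response

lam = 5.0                                        # lambda_{1,n}, arbitrary
G = W.T @ W + lam * np.eye(q)
eta_ols = np.linalg.solve(W.T @ W, W.T @ Z)
eta_R = np.linalg.solve(G, W.T @ Z)              # ridge estimator

A_n = np.linalg.solve(G, W.T @ W)                # (W'W + lam I)^{-1} W'W
B_n = np.eye(q) - lam * np.linalg.inv(G)         # I - lam (W'W + lam I)^{-1}
assert np.allclose(A_n, B_n)                     # A_n - B_n = 0
assert np.allclose(eta_R, A_n @ eta_ols)
assert np.allclose(eta_R, eta_ols - lam * np.linalg.solve(G, eta_ols))
```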

Note that if λ̂_{1,n}/√n →P τ and α̂ →P ψ, then λ̂_1/√n →P (1 − ψ)τ and λ̂_2/√n →P 2ψτ. Under these conditions,

√n (η̂_EN − η) = √n (η̂_OLS − η) − n (W^T W + λ̂_1 I_{p−1})^{−1} [ (λ̂_1/√n) η̂_OLS + (λ̂_2/(2√n)) s_n ].

The following theorem shows the elastic net, lasso, and ridge regression are asymptotically equivalent to the OLS full model if λ̂_{1,n}/√n →P 0, and will be useful for comparing these estimators and the OLS variable selection estimators. The theorem follows from results in Knight and Fu (2000) and Slawski, zu Castell, and Tutz (2010). Also see Zou and Zhang (2009). Let η̂_A be η̂_EN, η̂_L, or η̂_R. Note that c) follows from b) if ψ = 0, and d) follows from b) (using 2λ̂_{1,n}/√n →P 2τ) if ψ = 1. Recall that we are assuming that p is fixed.

Theorem 1. Assume that the conditions of the OLS theory (6) hold for the model Z = W η + e.
a) If λ̂_{1,n}/√n →P 0, then √n (η̂_A − η) →D N_{p−1}(0, σ² V).
b) If λ̂_{1,n}/√n →P τ ≥ 0, α̂ →P ψ ∈ [0, 1], and s_n →P s = s_η, then √n (η̂_EN − η) →D N_{p−1}(−V [(1 − ψ)τ η + ψ τ s], σ² V).
c) If λ̂_{1,n}/√n →P τ ≥ 0, then √n (η̂_R − η) →D N_{p−1}(−τ V η, σ² V).
d) If λ̂_{1,n}/√n →P τ ≥ 0 and s_n →P s = s_η, then √n (η̂_L − η) →D N_{p−1}(−(τ/2) V s, σ² V).

Next we describe variable selection, and then develop theory in Section 2. Variable selection is the search for a subset of predictor variables that can be deleted with little loss of information if n/p is large. Following Olive and Hawkins (2005), a model for variable selection can be described by

x^T β = x_S^T β_S + x_E^T β_E = x_S^T β_S    (9)

where x = (x_S^T, x_E^T)^T, x_S is an a_S × 1 vector, and x_E is a (p − a_S) × 1 vector. Given that x_S is in the model, β_E = 0 and E denotes the subset of terms that can be eliminated given that the subset S is in the model.

Let x_I be the vector of a terms from a candidate subset indexed by I, and let x_O be the vector of the remaining predictors (out of the candidate submodel). Suppose that S is a subset of I and that model (9) holds. Then

x^T β = x_S^T β_S = x_S^T β_S + x_{I/S}^T β_{(I/S)} + x_O^T 0 = x_I^T β_I    (10)

where x_{I/S} denotes the predictors in I that are not in S. Since this is true regardless of the values of the predictors, β_O = 0 if S ⊆ I.
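A small Monte Carlo sketch (illustrative, not from the paper) can be used to check Theorem 1 c): with λ̂_{1,n} = τ√n, the empirical mean of √n(η̂_R − η) should be close to −τVη and the empirical covariance close to σ²V. The correlation matrix ρ_u, τ, σ, η, and the simulation sizes below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
q, sigma, tau = 3, 1.0, 2.0
eta = np.array([1.0, -1.0, 0.5])
rho_u = np.array([[1.0, 0.5, 0.25],
                  [0.5, 1.0, 0.5],
                  [0.25, 0.5, 1.0]])             # V^{-1} = population correlation
V = np.linalg.inv(rho_u)
L = np.linalg.cholesky(rho_u)

n, reps = 2000, 2000
lam = tau * np.sqrt(n)                           # so lambda_hat_{1,n}/sqrt(n) = tau
vals = np.empty((reps, q))
for r in range(reps):
    U = rng.normal(size=(n, q)) @ L.T            # correlated nontrivial predictors
    W = (U - U.mean(axis=0)) / U.std(axis=0)
    Z = W @ eta + rng.normal(scale=sigma, size=n)
    Z = Z - Z.mean()
    eta_R = np.linalg.solve(W.T @ W + lam * np.eye(q), W.T @ Z)
    vals[r] = np.sqrt(n) * (eta_R - eta)

print("empirical mean:              ", vals.mean(axis=0))
print("theoretical mean -tau*V*eta: ", -tau * V @ eta)
print("empirical covariance:\n", np.cov(vals, rowvar=False))
print("theoretical covariance sigma^2*V:\n", sigma ** 2 * V)
```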

Forward selection forms a sequence of submodels I_1, ..., I_M where I_j uses j predictors including the constant. Let I_1 use x*_1 = x_1 ≡ 1: the model has a constant but no nontrivial predictors. To form I_2, consider all models I with two predictors including x*_1. Compute

Q_2(I) = SSE(I) = RSS(I) = r^T(I) r(I) = Σ_{i=1}^n r_i²(I) = Σ_{i=1}^n (Y_i − Ŷ_i(I))².

Let I_2 minimize Q_2(I) for the p − 1 models I that contain x*_1 and one other predictor. Denote the predictors in I_2 by x*_1, x*_2. In general, to form I_j, consider all models I with j predictors including the variables x*_1, ..., x*_{j−1}. Compute Q_j(I) = r^T(I) r(I) = Σ_{i=1}^n r_i²(I) = Σ_{i=1}^n (Y_i − Ŷ_i(I))². Let I_j minimize Q_j(I) for the p − j + 1 models I that contain x*_1, ..., x*_{j−1} and one other predictor not already selected. Denote the predictors in I_j by x*_1, ..., x*_j. Continue in this manner for j = 2, ..., M = p.

When there is a sequence of M submodels, the final submodel I_d needs to be selected. Let the candidate model I contain a terms, including a constant, and let x_I and β̂_I be a × 1 vectors. There are many criteria used to select the final submodel I_d. For a given data set, the quantities p, n, and σ̂² act as constants, and a criterion below may add a constant or be divided by a positive constant without changing the subset I_min that minimizes the criterion.

Let criteria C_S(I) have the form

C_S(I) = SSE(I) + a K_n σ̂².

These criteria need a good estimator of σ² and n/p large. The criterion C_p(I) = AIC_S(I) uses K_n = 2 while the BIC_S(I) criterion uses K_n = log(n). See Jones (1946) and Mallows (1973) for C_p. Typically σ̂² is the OLS full model MSE = Σ_{i=1}^n r_i² / (n − p) when n/p is large. Then σ̂² = MSE is a √n consistent estimator of σ² under mild conditions by Su and Cook (2012).

The following criteria also need n/p large. AIC is due to Akaike (1973) and BIC to Schwarz (1978):

AIC(I) = n log( SSE(I)/n ) + 2a, and
BIC(I) = n log( SSE(I)/n ) + a log(n).

Let p be fixed and let I_min be the submodel that minimizes the criterion using variable selection with OLS. Following Nishii (1984), the probability that model I_min from C_p or AIC underfits goes to zero as n → ∞. Hence P(S ⊆ I_min) → 1 as n → ∞. This result holds for all subsets regression and variable selection methods, such as forward selection and backward elimination, that produce a sequence of nested models including the full model. The above criteria can be applied to forward selection and relaxed lasso. The C_p criterion can also be applied to lasso. See Efron and Hastie (2016, pp. 221, 231).
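The forward selection algorithm and the C_p-type criterion C_S(I) = SSE(I) + 2 a σ̂² described above can be coded in a few lines. The sketch below (an illustration, not the authors' software) returns the selected columns I_min and the zero-padded coefficient vector; the simulated data at the end are only a usage example with arbitrary coefficients.

```python
import numpy as np

def forward_selection_cp(X, Y):
    """Forward selection for OLS; pick I_min minimizing C_S(I) = SSE(I) + 2*a*sigma2_hat."""
    n, p = X.shape

    def fit(cols):
        b = np.linalg.lstsq(X[:, cols], Y, rcond=None)[0]
        resid = Y - X[:, cols] @ b
        return resid @ resid, b                        # SSE(I), beta_hat_I

    sigma2_hat = fit(list(range(p)))[0] / (n - p)      # full model MSE
    selected = [0]                                     # I_1: constant only
    path = [(list(selected), fit(selected)[0])]
    while len(selected) < p:                           # form I_2, ..., I_p
        rest = [j for j in range(p) if j not in selected]
        best = min(rest, key=lambda j: fit(selected + [j])[0])
        selected.append(best)
        path.append((list(selected), fit(selected)[0]))

    crit = [sse + 2 * len(cols) * sigma2_hat for cols, sse in path]
    cols_min = path[int(np.argmin(crit))][0]
    beta_I = fit(cols_min)[1]
    beta_padded = np.zeros(p)                          # zero padding: beta_hat_{Imin,0}
    beta_padded[cols_min] = beta_I
    return sorted(cols_min), beta_padded

# usage on simulated data (illustrative values)
rng = np.random.default_rng(4)
n, p = 200, 6
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, 0.0, 0.0, -1.0, 0.0])
Y = X @ beta + rng.normal(size=n)
print(forward_selection_cp(X, Y))
```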

Section 2 gives large sample theory for OLS variable selection estimators such as forward selection.

2 Large sample theory for OLS variable selection estimators

Large sample theory for the elastic net, lasso, and ridge regression is simple using the KKT conditions since the optimization problem is convex. The optimization problem for variable selection is not convex, so new tools are needed. One technique is to consider variable selection models where the probability that the model selects the true set S goes to one. See Leeb and Pötscher (2005). A problem is that √n (β̂_{I_min} − β_S) is only defined if β̂_{I_min} has the same dimension as β_S. We will show that large sample theory becomes simple by using zero padding. If β̂_I is a × 1, form the p × 1 vector β̂_{I,0} from β̂_I by adding 0s corresponding to the omitted variables. For example, if p = 4 and β̂_{I_min} = (β̂_1, β̂_3)^T, then β̂_{I_min,0} = (β̂_1, 0, β̂_3, 0)^T. Since fewer than 2^p regression models I contain the true model S, and each such model gives a √n consistent estimator β̂_{I,0} of β, the probability that I_min picks one of these models goes to one as n → ∞. Hence β̂_{I_min,0} is a √n consistent estimator of β under model (9). Olive (2017a: p. 123, 2017b: p. 176) showed that β̂_{I_min,0} is a consistent estimator.

This section will use mixture distributions to find the limiting distribution of √n (β̂_{I_min,0} − β). Mixture distributions are useful for model and variable selection since β̂_{I_min,0} is a mixture distribution of the β̂_{I_j,0}, and the lasso estimator β̂_L is a mixture distribution of the β̂_{L,λ_i} for i = 1, ..., M.

A random vector u has a mixture distribution of random vectors u_j with probabilities π_j if u equals random vector u_j with probability π_j for j = 1, ..., J. Let u and the u_j be p × 1 random vectors. Then the cumulative distribution function (cdf) of u is

F_u(t) = Σ_{j=1}^J π_j F_{u_j}(t)    (11)

where the probabilities π_j satisfy 0 ≤ π_j ≤ 1 and Σ_{j=1}^J π_j = 1, J ≥ 2, and F_{u_j}(t) is the cdf of u_j. Suppose E(h(u)) and the E(h(u_j)) exist. Then

E(h(u)) = Σ_{j=1}^J π_j E[h(u_j)].    (12)

Hence

E(u) = Σ_{j=1}^J π_j E[u_j],    (13)

and

Cov(u) = E(u u^T) − E(u) E(u^T) = E(u u^T) − E(u) [E(u)]^T = Σ_{j=1}^J π_j E[u_j u_j^T] − E(u) [E(u)]^T
       = Σ_{j=1}^J π_j Cov(u_j) + Σ_{j=1}^J π_j E(u_j) [E(u_j)]^T − E(u) [E(u)]^T.    (14)

If E(u_j) = θ for j = 1, ..., J, then E(u) = θ and Cov(u) = Σ_{j=1}^J π_j Cov(u_j). Note that

E(u) [E(u)]^T = Σ_{j=1}^J Σ_{k=1}^J π_j π_k E(u_j) [E(u_k)]^T.    (15)

Now suppose that T_n is equal to the estimator T_{jn} with probability π_{jn} for j = 1, ..., J where Σ_j π_{jn} = 1, π_{jn} → π_j as n → ∞, and u_{jn} = √n (T_{jn} − θ) →D u_j with E(u_j) = 0 and Cov(u_j) = Σ_j. Then T_n has a mixture distribution of the T_{jn} with probabilities π_{jn}, and the cdf of T_n is F_{T_n}(z) = Σ_j π_{jn} F_{T_{jn}}(z) where F_{T_{jn}}(z) is the cdf of T_{jn}. Hence √n (T_n − θ) has a mixture distribution of the √n (T_{jn} − θ), and

√n (T_n − θ) →D u    (16)

where the cdf of u is F_u(z) = Σ_j π_j F_{u_j}(z) and F_{u_j}(z) is the cdf of u_j. Thus u is a mixture distribution of the u_j with probabilities π_j, E(u) = 0, and Cov(u) = Σ_u = Σ_j π_j Σ_j.

Applying the above results with large sample theory for OLS makes large sample theory for OLS variable selection simple. Assume the maximum leverage max_{i=1,...,n} x_{iI}^T (X_I^T X_I)^{−1} x_{iI} → 0 in probability as n → ∞ for each I with S ⊆ I. For the full OLS model, √n (β̂ − β) →D N_p(0, σ² V) where (X^T X)/n →P V^{−1}. See, for example, Olive (2017a, p. 39) and Sen and Singer (1993, p. 280).

For OLS variable selection with C_p, let β̂_{I_j} = (X_{I_j}^T X_{I_j})^{−1} X_{I_j}^T Y = D_j Y, T_n = β̂_{I_min,0}, and T_{jn} = β̂_{I_j,0} = D_{j,0} Y where D_{j,0} adds rows of zeroes to D_j corresponding to the x_i not in I_j. Let T_n = T_{kn} = β̂_{I_k,0} with probabilities π_{kn} where π_{kn} → π_k as n → ∞. Denote the π_k with S ⊆ I_k by π_j. The other π_k = 0 by Nishii (1984). Then

√n (β̂_{I_j} − β_{I_j}) →D N_{a_j}(0, σ² V_j) and u_{jn} = √n (β̂_{I_j,0} − β) →D u_j ∼ N_p(0, σ² V_{j,0})

where n (X_{I_j}^T X_{I_j})^{−1} →P V_j and V_{j,0} adds columns and rows of zeroes corresponding to the x_i not in I_j. Hence Σ_j = σ² V_{j,0} is singular unless I_j corresponds to the full model. Then (16) holds:

√n (β̂_{I_min,0} − β) →D u    (17)

where the cdf of u is F_u(z) = Σ_j π_j F_{u_j}(z). Thus u is a mixture distribution of the u_j with probabilities π_j, E(u) = 0, and Cov(u) = Σ_u = Σ_j π_j σ² V_{j,0}. The values of the π_j depend on the OLS variable selection method with C_p, such as backward elimination, forward selection, all subsets, and, if λ_1 = 0, the variant of relaxed lasso that computes the OLS submodel for the subset corresponding to λ_i for i = 1, ..., M.
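To illustrate the covariance of the limiting mixture distribution in (17), the sketch below (hypothetical values, not from the paper) takes an assumed limit V^{−1} of X^T X/n and assumed selection probabilities π_j for a few submodels containing S, forms each zero-padded V_{j,0}, and computes Σ_u = Σ_j π_j σ² V_{j,0}.

```python
import numpy as np

sigma2 = 1.0
# assumed limit of X^T X / n (first column is the constant); illustrative values
Vinv = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.5, 0.2],
                 [0.0, 0.5, 1.0, 0.3],
                 [0.0, 0.2, 0.3, 1.0]])
# submodels I_j containing S = {0, 1}, with assumed probabilities pi_j
submodels = {(0, 1): 0.6, (0, 1, 2): 0.3, (0, 1, 2, 3): 0.1}

def V_j0(cols, Vinv):
    """V_j is the limit of n (X_I^T X_I)^{-1}; V_{j,0} pads it with zero rows/columns."""
    cols = list(cols)
    Vj = np.linalg.inv(Vinv[np.ix_(cols, cols)])
    out = np.zeros_like(Vinv)
    out[np.ix_(cols, cols)] = Vj
    return out

# Cov(u) = Sigma_u = sum_j pi_j sigma^2 V_{j,0} for the mixture limit in (17)
Sigma_u = sum(pi * sigma2 * V_j0(I, Vinv) for I, pi in submodels.items())
print(Sigma_u)
```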

Let A be a g × p full rank matrix with 1 ≤ g ≤ p. Then

√n (A β̂_{I_min,0} − A β) →D A u

where A u has a mixture distribution of the A u_j ∼ N_g(0, σ² A V_{j,0} A^T) with probabilities π_j.

Two special cases are interesting. First, suppose π_d = 1 so u ≡ u_d ∼ N_p(0, Σ_d). This special case occurs for C_p if a_S = p so S is the full model, and for methods like BIC that choose I_min = S with probability going to one. The second special case occurs if, for each π_j > 0, A u_j ∼ N_g(0, A Σ_j A^T) = N_g(0, A Σ A^T). Then √n (A β̂_{I_min,0} − A β) →D A u ∼ N_g(0, A Σ A^T). This special case occurs for β̂_S if the nontrivial predictors are orthogonal or uncorrelated with zero mean so that X^T X/n → diag(d_1, ..., d_p) as n → ∞ where each d_i > 0. Then β̂_S has the same multivariate normal limiting distribution for I_min and for the OLS full model.

3 Conclusion

Results in Claeskens and Hjort (2008, pp. 101, 102, 232) suggest that the probability that AIC underfits goes to zero for many models. Hence with AIC variable selection, √n (β̂_{I_min,0} − β) →D u for many time series models, generalized linear models, and survival regression models.

Efron and Hastie (2016, p. 4) note that inference is needed to compare and assess methods. OLS variable selection estimators are √n consistent under mild conditions. The elastic net, lasso, and ridge regression are consistent estimators of β if λ_{1,n} = o(n) and √n consistent if λ_{1,n} = O(√n). These three estimators are asymptotically equivalent to the OLS full model if λ_{1,n}/√n → 0 as n → ∞. The OLS variable selection estimators have a limiting distribution that is a mixture distribution of the limiting distributions of the OLS full model and other models I_j such that S ⊆ I_j. Hence the OLS variable selection estimators can give more precise estimators of β than the OLS full model if a_S < p.

Usually λ̂_{1,n} is selected using a criterion such as k-fold CV, AIC, BIC, or GCV. It is not clear whether λ̂_{1,n} = o(n). For n/p large, often the lasso program chooses λ_1 > 0. Adding λ_1 = 0 if n ≥ 5p should improve the elastic net, lasso, and ridge regression estimators. Using a λ_i near n/log(n) may also be useful. For the elastic net and lasso, λ_M/n does not go to zero as n → ∞ since η̂ = 0 is not a consistent estimator. Hence λ_M is likely proportional to n, and using λ_i = i λ_M/M for i = 1, ..., M will not produce a consistent estimator.

In addition to large sample theory, shrinkage estimators can be compared with asymptotically optimal prediction intervals, even if n/p is not large. See Pelawa Watagoda and Olive (2018a). If n/p is large, Olive (2018, 2017a, 2017b) suggests a bootstrap confidence region that simulates well for OLS variable selection estimators and, in limited simulations, for lasso. Pelawa Watagoda and Olive (2018b) give some theory for this application.

Response plots of the fitted values Ŷ versus the response Y are useful for checking linearity of the multiple linear regression model and for detecting outliers. Residual plots should also be made.

REFERENCES

Akaike, H. (1973), Information Theory as an Extension of the Maximum Likelihood Principle, in Proceedings, 2nd International Symposium on Information Theory, eds. Petrov, B.N., and Csaki, F., Akademiai Kiado, Budapest.
Claeskens, G., and Hjort, N.L. (2008), Model Selection and Model Averaging, Cambridge University Press, New York, NY.
Efron, B., and Hastie, T. (2016), Computer Age Statistical Inference, Cambridge University Press, New York, NY.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), Least Angle Regression (with discussion), The Annals of Statistics, 32.
Gunst, R.F., and Mason, R.L. (1980), Regression Analysis and Its Application, Marcel Dekker, New York, NY.
Hastie, T., Tibshirani, R., and Wainwright, M. (2015), Statistical Learning with Sparsity: the Lasso and Generalizations, CRC Press Taylor & Francis, Boca Raton, FL.
Hoerl, A.E., and Kennard, R. (1970), Ridge Regression: Biased Estimation for Nonorthogonal Problems, Technometrics, 12.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013), An Introduction to Statistical Learning With Applications in R, Springer, New York, NY.
Jia, J., and Yu, B. (2010), On Model Selection Consistency of the Elastic Net When p >> n, Statistica Sinica, 20.
Jones, H.L. (1946), Linear Regression Functions with Neglected Variables, Journal of the American Statistical Association, 41.
Knight, K., and Fu, W.J. (2000), Asymptotics for Lasso-Type Estimators, The Annals of Statistics, 28.
Leeb, H., and Pötscher, B.M. (2005), Model Selection and Inference: Facts and Fiction, Econometric Theory, 21.
Mallows, C. (1973), Some Comments on C_p, Technometrics, 15.
Meinshausen, N. (2007), Relaxed Lasso, Computational Statistics & Data Analysis, 52.
Nishii, R. (1984), Asymptotic Properties of Criteria for Selection of Variables in Multiple Regression, The Annals of Statistics, 12.
Olive, D.J. (2017a), Linear Regression, Springer, New York, NY.
Olive, D.J. (2017b), Robust Multivariate Analysis, Springer, New York, NY.
Olive, D.J. (2018), Applications of Hyperellipsoidal Prediction Regions, Statistical Papers, to appear.
Olive, D.J., and Hawkins, D.M. (2005), Variable Selection for 1D Regression Models, Technometrics, 47.
Pelawa Watagoda, L.C.R., and Olive, D.J. (2018a), Comparing Shrinkage Estimators With Asymptotically Optimal Prediction Intervals, unpublished manuscript.

Pelawa Watagoda, L.C.R., and Olive, D.J. (2018b), Bootstrapping Multiple Linear Regression After Variable Selection, unpublished manuscript at (siu.edu/olive/ppboottest.pdf).
Schwarz, G. (1978), Estimating the Dimension of a Model, The Annals of Statistics, 6.
Sen, P.K., and Singer, J.M. (1993), Large Sample Methods in Statistics: an Introduction with Applications, Chapman & Hall, New York.
Slawski, M., zu Castell, W., and Tutz, G. (2010), Feature Selection Guided by Structural Information, The Annals of Applied Statistics, 4.
Su, Z., and Cook, R.D. (2012), Inner Envelopes: Efficient Estimation in Multivariate Linear Regression, Biometrika, 99.
Tibshirani, R. (1996), Regression Shrinkage and Selection Via the Lasso, Journal of the Royal Statistical Society, B, 58.
Zou, H., and Hastie, T. (2005), Regularization and Variable Selection Via the Elastic Net, Journal of the Royal Statistical Society, B, 67.
Zou, H., and Zhang, H.H. (2009), On the Adaptive Elastic-Net with a Diverging Number of Parameters, The Annals of Statistics, 37.
