
PREPRINT, Department of Mathematics, National Taiwan University (國立臺灣大學數學系預印本)

Variable selection in linear regression through adaptive penalty selection

Hung Chen and Chuan-Fa Tang

September 06, 2012

Variable selection in linear regression through adaptive penalty selection

Hung Chen (1) and Chuan-Fa Tang (2)

(1) Department of Mathematics, National Taiwan University, Taipei 11167, Taiwan
(2) Department of Statistics, University of South Carolina, Columbia, SC 29208, USA

Abstract

Model selection procedures often use a fixed penalty, such as Mallows $C_p$, to avoid choosing a model that fits a particular data set extremely well. These procedures are often devised to give an unbiased risk estimate when a particular chosen model is used to predict future responses. As a correction for not including the variability induced by model selection, generalized degrees of freedom was introduced in Ye (1998) as an estimate of the model selection uncertainty that arises when the same data are used for both model selection and the associated parameter estimation. Built upon generalized degrees of freedom, Shen and Ye (2002) proposed a data-adaptive complexity penalty. In this article, we evaluate the validity of such an approach for model selection in linear regression when the set of candidate models is nested and includes the true model. It is found that the performance of such an approach can be even worse than that of Mallows $C_p$ in terms of the probability of correct selection. However, this approach, coupled with a proper choice of the range of the penalty or with the little bootstrap proposed in Breiman (1992), performs better than $C_p$, with an increased probability of correct selection, but still cannot match BIC in achieving model selection consistency.

Keywords: Adaptive penalty selection; Mallows $C_p$; Model selection consistency; Little bootstrap; Generalized degrees of freedom

1 Introduction

In analyzing data from complex scientific experiments or observational studies, one often encounters the problem of choosing one plausible linear regression model from a sequence of nested candidate models. When the exact true model is among the candidate models, BIC (Schwarz, 1978) has been shown to be a consistent model selector in Nishii (1984), while $C_p$ (Mallows, 1973) is shown to be a conservative model selector in Woodroofe (1982), in the sense that it tends to pick a slightly overfitted model. From $C_p$ (AIC) to BIC, the penalty on the dimension of the model changes from $2$ to $\log n$, where $n$ represents the number of observations. Many efforts have been made to understand under which circumstances these criteria allow one to identify the right model asymptotically. Liu and Yang (2011) proposed a model selector based on a data-adaptive choice of penalty between AIC and BIC. When the exact true model is among the candidate models, their selector is a consistent model selector.

Since both the number of candidate predictors and the sample size influence the performance of any model selector, an alternative approach is an adaptive choice of penalty. Shen and Ye (2002) proposed such an approach, in which the restriction on the penalty is lifted and the generalized degrees of freedom (GDF) is used to compensate for possible overfitting due to model selection. To avoid overfitting the available data caused by the collection of candidate models, the little bootstrap has been advocated in Breiman (1992) to decrease possible overfitting due to multi-model inference. Since GDF is often difficult to compute, Shen and Ye (2002) demonstrated that the use of the little bootstrap (i.e., data perturbation) in place of the calculation of GDF can lead to a better regression model in terms of mean prediction error as compared with the use of $C_p$ or BIC.

In this paper, we evaluate whether the adaptive penalty selection procedure proposed in Shen and Ye (2002) leads to a consistent model selector or merely reduces the overfitting of multi-model inference for nested candidate models when the candidate models include the correct model. As shown in this paper, we recommend the use of the little bootstrap (data perturbation) instead of calculating GDF when a fully adaptive penalty procedure is applied to nested candidate models.

The nested candidate models are of the form

$$Y = X_K \beta + \epsilon, \qquad (1)$$

where $Y = (Y_1, \ldots, Y_n)^T$ is the response vector, $E(Y) = \mu_0 = (\mu_1, \ldots, \mu_n)^T$, $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T \sim N_n(0, \sigma^2 I_n)$, $x_j = (x_{1j}, \ldots, x_{nj})^T$ for $j = 1, \ldots, K$, $X_K = (x_1, \ldots, x_K)$ is an $n \times K$ matrix of potential explanatory variables to be used in predicting the response vector linearly, and $\beta = (\beta_1, \ldots, \beta_K)^T$ is a vector of unknown coefficients. Here $I_n$ is the $n \times n$ identity matrix and $\sigma^2$ is known. In this paper, we further assume that $\mu_0 = X_K \beta$ and that some of the coefficients may equal zero.
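As a concrete illustration of setup (1), the following R sketch generates data with the nested candidate structure; the dimensions match the running example used later ($n = 404$, $K - k_0 = 20$, $k_0 = 1$), while the value of the nonzero coefficient is an illustrative assumption.

```r
# Minimal sketch of the nested-regression setup in (1): the K candidate
# predictors are ordered so that only the first k0 carry nonzero coefficients,
# and the error variance sigma^2 is treated as known.  The coefficient value 2
# is illustrative only.
set.seed(1)
n <- 404; K <- 21; k0 <- 1; sigma <- 1
X <- matrix(rnorm(n * K), n, K)
beta <- c(rep(2, k0), rep(0, K - k0))          # true model is M_{k0}
y <- drop(X %*% beta) + rnorm(n, sd = sigma)
```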

Consider an adaptive choice of $\lambda$ obtained by finding the best $\lambda$ over $\lambda \in \Lambda$, where $\Lambda$ denotes the collection of penalties to choose from. It is a two-step procedure. At the first step, for each $\lambda \in \Lambda$, choose the optimal model $\hat M(\lambda)$ by minimizing

$$(Y - \hat\mu_M)^T (Y - \hat\mu_M) + \lambda |M| \sigma^2,$$

where $|M|$ denotes the number of unknown parameters used to prescribe model $M$. Then calculate $g_0(\lambda)$, which is twice the generalized degrees of freedom with penalty $\lambda$. At the second step, choose the optimal $\lambda \in \Lambda$ based on the data as follows:

$$\hat\lambda = \arg\min_{\lambda \in \Lambda} \big\{ RSS(\hat M(\lambda)) + g_0(\lambda)\, \sigma^2 \big\},$$

where $RSS(\hat M(\lambda)) = \|Y - \hat\mu_{\hat M(\lambda)}\|^2$, $\|\cdot\|$ is the usual $L_2$ norm, $\hat\mu_{\hat M(\lambda)} = \sum_{k=1}^{K} \hat\mu_{M_k} 1_{\{\hat M(\lambda) = M_k\}}$, $1_{\{\cdot\}}$ is the usual indicator function, and $\hat\mu_{M_k}$ is the estimator of $\mu$ in model $M_k$. Here $\Lambda = [0, \infty)$ or $(0, \infty)$, as recommended in Shen and Ye (2002) and Shen and Huang (2006). When $g_0(\lambda)$ is difficult to compute, data perturbation is recommended.
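The two-step procedure can be written compactly as follows. This is a minimal R sketch that assumes the vector of residual sums of squares of the nested fits, the known $\sigma^2$, and a routine for $g_0(\lambda)$ are available; all names are illustrative.

```r
# Two-step adaptive penalty selection over nested models.
# rss[k] = RSS(M_k), sigma2 = known error variance, g0 = function returning
# twice the generalized degrees of freedom at penalty lambda, Lambda = grid.
select_model <- function(lambda, rss, sigma2) {
  k <- seq_along(rss)
  which.min(rss + lambda * k * sigma2)            # step 1: \hat M(lambda)
}
adaptive_penalty <- function(Lambda, rss, sigma2, g0) {
  crit <- sapply(Lambda, function(lambda) {       # step 2 objective
    rss[select_model(lambda, rss, sigma2)] + g0(lambda) * sigma2
  })
  lambda_hat <- Lambda[which.min(crit)]
  list(lambda = lambda_hat, size = select_model(lambda_hat, rss, sigma2))
}
```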

Although GDF already takes into account the data-driven selection among the sequence of estimators (i.e., candidate models), it is found in this paper that the model selected by $C_p$ is better than $\hat M(\hat\lambda)$ in terms of the probability of selecting the correct model unless $\lambda$ is restricted to fall between the AIC and BIC penalties. However, the model selector obtained by replacing $g_0(\lambda)$ with the estimate given by the little bootstrap can beat the performance of $C_p$ in terms of the probability of selecting the correct model.

The rest of the paper is organized as follows. In Section 2, we introduce unbiased risk estimation and generalized degrees of freedom in the setting of model selection with nested linear regression models, and address the performance of the adaptive penalty selection procedure in Shen and Ye (2002) with approximated GDF in terms of model selection consistency. Section 3 evaluates the performance of adaptive penalty selection with GDF. Section 4 concludes with a discussion of the merit of the data perturbation method. The Appendix contains technical proofs.

2 Adaptive penalty selection with nested candidate models

In this section, we start with the derivation of the generalized degrees of freedom defined in Ye (1998) through unbiased risk estimation, describe its potential benefit in increasing the probability of selecting the correct model, and point out its pitfalls when the choice of penalty is not properly constrained.

2.1 Unbiased risk estimation under nested linear regression model selection

The available data contain $n$ observations, each with $K$ explanatory predictors, $x_1, \ldots, x_K$, and a response $y$. Denote the matrix formed by the first $k$ explanatory variables by $X_k = (x_1, \ldots, x_k)$ and write $\beta_k = (\beta_1, \ldots, \beta_k)^T$ for $k = 1, \ldots, K$. We then have $K$ candidate models, and the $k$th candidate model $M_k$ is $Y = X_k \beta_k + \epsilon$. The set of candidate models is denoted by $\mathcal{M} = \{M_k : k = 1, \ldots, K\}$, and the true model $M_{k_0}$ belongs to $\mathcal{M}$; namely, $\mu_0 = X_{k_0}\beta_{k_0}$ with $1 \le k_0 \le K$. For ease of presentation, denote by $\mathcal{M}_{k_0} = \{M_k : k_0 \le k \le K\}$ the subcollection which includes the true model and the overfitted models.

For the $k$th candidate model $M_k$ with least squares fit, we have $\hat\mu_{M_k} = P_k Y$, where $P_k$ stands for the orthogonal projection operator onto the space spanned by $X_k$. We evaluate $\hat\mu_{M_k}$ by its prediction accuracy. Denote the future response vector at $X_K$ by $Y^0$ and assume that $E(Y^0) = E(Y)$ and $Var(Y^0) = \sigma^2 I_n$. The unobservable prediction error is defined as $PE(M_k) = \|Y^0 - \hat\mu_{M_k}\|^2$ and the residual sum of squares as $RSS(M_k) = \|Y - \hat\mu_{M_k}\|^2$. It is well known that the residual sum of squares $RSS(M_k)$ is often smaller than $PE(M_k)$; namely, $E[PE(M_k)] > E[RSS(M_k)]$. Mallows (1973) proposed $C_p$ by inflating $RSS(M_k)$ with $2\,tr(P_k)\,\sigma^2$, where $tr(P_k)$ is often called the degrees of freedom. Since both $PE(M_k)$ and $RSS(M_k)$ are of quadratic form, it follows easily that $E[PE(M_k)] = E[RSS(M_k)] + 2k\sigma^2$.

We now derive the generalized degrees of freedom through an unbiased risk estimate over $\mathcal{M}$ when the true model is $M_{k_0}$. For simplicity, we assume that $\sigma^2$ is known from now on. The model selected through the Mallows $C_p$ criterion among $\mathcal{M}$ is denoted by $\hat M(2)$. We can then write the post-model-selection estimator of $\mu_0$ through $C_p$ as $\hat\mu_{\hat M(2)} = \sum_{k=1}^{K} \hat\mu_{M_k} 1_{\{\hat M(2) = M_k\}}$. The prediction risk is defined as

$$RISK(C_p) = E\big[(Y^0 - \hat\mu_{\hat M(2)})^T (Y^0 - \hat\mu_{\hat M(2)})\big].$$

By a standard argument, we have

$$RISK(C_p) = E\big[(\mu_0 - \hat\mu_{\hat M(2)})^T(\mu_0 - \hat\mu_{\hat M(2)})\big] + n\sigma^2 = E[RSS(\hat M(2))] + 2E\big[\epsilon^T(\hat\mu_{\hat M(2)} - \mu_0)\big] = E[RSS(\hat M(2))] + g_0(2)\,\sigma^2,$$

where $g_0(2) = 2E[\epsilon^T(\hat\mu_{\hat M(2)} - \mu_0)]/\sigma^2$ is twice the GDF defined in Ye (1998). In general, the calculation of the GDF associated with $C_p$ can be complicated, but its analytic form can be obtained under the settings of this paper. When $K - k_0 = 20$ and $n = 404$, $g_0(2)$ is found to be around $2k_0 + 5.02$, as given in Lemma 4. In other words, the unbiased risk estimator associated with using $C_p$ over $\mathcal{M}$ should be $RSS(\hat M(2)) + g_0(2)\sigma^2$ instead of $RSS(\hat M(2)) + 2|\hat M(2)|\sigma^2$, where $|\hat M(2)|$ is the total number of unknown parameters of the model chosen by $C_p$. It will be shown in Lemma 4(a) that $g_0(2) > 2k_0$ under some regularity conditions.
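For the nested fits described above, the quantities $RSS(M_k)$ and the $C_p$ choice $\hat M(2)$ can be computed directly; a minimal R sketch, reusing the simulated $X$, $y$, and known $\sigma^2$ from the earlier illustration (function names are illustrative):

```r
# RSS of the nested fits M_1, ..., M_K and the model chosen by Mallows C_p
# (penalty lambda = 2), with sigma^2 treated as known.
nested_rss <- function(X, y) {
  sapply(seq_len(ncol(X)), function(k) {
    sum(lm.fit(X[, 1:k, drop = FALSE], y)$residuals^2)
  })
}
cp_choice <- function(X, y, sigma2) {
  rss <- nested_rss(X, y)
  which.min(rss + 2 * seq_along(rss) * sigma2)   # size of \hat M(2)
}
```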

Since BIC achieves model selection consistency, as shown in Schwarz (1978), its associated unbiased risk estimator is $RSS(\hat M(\log n)) + 2k_0\sigma^2$. This leads to the proposal in Shen and Ye (2002) of employing an adaptive choice of penalty incorporating GDF. We now investigate whether the adaptive penalty selection proposal incorporating GDF in Shen and Ye (2002) is a consistent model selection procedure.

2.2 Adaptive choice of $\lambda$ over $\Lambda = \{0, 2, \log n\}$

As an illustration, we consider how an adaptive choice of penalty, $\lambda \in \Lambda = \{0, 2, \log n\}$, with GDF improves on a fixed choice of penalty. Consider $k_0 = 1$ and $K - k_0 = 20$, in which case selecting an underfitted model from $\mathcal{M}$ is not possible. When $\lambda = 0$, model $M_K$ will always be chosen, since it gives the smallest residual sum of squares. Then $g_0(0) = 2K$, so the penalty $g_0(0)\sigma^2$ coincides with the penalty that Mallows $C_p$ assigns to $M_K$. When $\lambda = \log n$, we have $\hat M(\lambda) = M_{k_0}$ with probability close to 1, and then $g_0(\log n) \approx 2k_0$. When $\lambda = 2$, Woodroofe (1982) shows that the probability of selecting the correct model is slightly larger than 0.7 while, on average, no more than one superfluous explanatory variable is chosen. Then $g_0(2) \approx 2k_0 + 5.02$, where the value 5.02 depends on $k_0$ and $K - k_0$.

It will be shown that the adaptive penalty selection procedure with GDF improves on $C_p$ selection in this setting by increasing the probability of selecting the correct model $M_{k_0}$. This improvement mainly comes from the reduction of the probability of $C_p$ selecting $M_{k_0+1}$. Technical details will be provided later on.

We now give a lemma which will be used repeatedly to give an upper bound on $|\hat M(\lambda)| - k_0$. It can also be used to give an upper bound on the probability of choosing $M_{k_0}$ when the adaptive penalty selection procedure is used.

Lemma 1. Suppose that $\log n \in \Lambda$, $\lambda \le \log n$ for all $\lambda \in \Lambda$, and $\hat M(\log n) = M_{k_0}$.
(a) For all $\lambda \in \Lambda$, $P(\hat M(\hat\lambda) = M_{k_0}) \le P\big(\lambda(|\hat M(\lambda)| - |M_{k_0}|) \le g_0(\lambda) - g_0(\log n)\big)$.
(b) $\hat M(\hat\lambda) \ne M_{k_0}$ if there exists $\lambda$ such that $\lambda(|\hat M(\lambda)| - |M_{k_0}|) > g_0(\lambda) - g_0(\log n)$.

Proof. Recall that $\hat M(\lambda) = \arg\min_{M_k \in \mathcal{M}} \{RSS(M_k) + \lambda |M_k| \sigma^2\}$ for any given $\lambda \in \Lambda$; that is, $\hat M(\lambda)$ is the chosen model for the given $\lambda$. We conclude that $RSS(\hat M(\lambda)) + \lambda |\hat M(\lambda)| \sigma^2 \le RSS(M_{k_0}) + \lambda |M_{k_0}| \sigma^2$. Hence, for all $\lambda$,

$$\lambda\big(|\hat M(\lambda)| - |M_{k_0}|\big)\sigma^2 \le RSS(M_{k_0}) - RSS(\hat M(\lambda)). \qquad (2)$$

When the event $\{\hat\lambda = \log n\}$ occurs, the following must hold:

$$RSS(\hat M(\lambda)) + g_0(\lambda)\sigma^2 \ge RSS(\hat M(\log n)) + g_0(\log n)\sigma^2 \quad \text{for all } \lambda.$$

Hence, $P(\hat\lambda = \log n) \le P\big(RSS(\hat M(\lambda)) + g_0(\lambda)\sigma^2 \ge RSS(\hat M(\log n)) + g_0(\log n)\sigma^2\big)$. We conclude that

$$P(\hat\lambda = \log n) \le P\big(RSS(\hat M(\log n)) - RSS(\hat M(\lambda)) \le (g_0(\lambda) - g_0(\log n))\sigma^2\big) = P\big(\lambda(|\hat M(\lambda)| - |M_{k_0}|)\sigma^2 \le RSS(\hat M(\log n)) - RSS(\hat M(\lambda)) \le (g_0(\lambda) - g_0(\log n))\sigma^2\big) \le P\big(\lambda(|\hat M(\lambda)| - |M_{k_0}|)\sigma^2 \le (g_0(\lambda) - g_0(\log n))\sigma^2\big),$$

where the equality follows from (2) and the assumption that $\hat M(\log n) = M_{k_0}$. This concludes the proof of the lemma.

The next lemma states that $|\hat M(\lambda)|$ is a non-increasing step function of $\lambda$.

Lemma 2. When the set of candidate models $\mathcal{M}$ is a sequence of nested linear regression models, $|\hat M(\lambda)|$ is a non-increasing step function over $\Lambda$. When $|\hat M(\lambda_l)| = k_l$ for some $1 \le k_l \le K$, $|\hat M(\lambda)| \le k_l$ for all $\lambda \ge \lambda_l$.

Proof. Recall that $\hat M(\lambda)$ denotes the selected model for the given penalty $\lambda |M_k| \sigma^2$. It follows from Proposition 5.1 of Breiman (1992) that $RSS(\hat M(\lambda))$ must be at an extreme point of the lower convex envelope of the graph $\{(k, RSS(M_k)) : k = 1, \ldots, K\}$. Hence, this lemma holds.

Adaptive choice of $\lambda$ over $\Lambda = \{0, 2, \log n\}$ depends on the comparison of $RSS(\hat M(0)) + g_0(0)\sigma^2$, $RSS(\hat M(\log n)) + g_0(\log n)\sigma^2$, and $RSS(\hat M(2)) + g_0(2)\sigma^2$. We start with the comparison of $RSS(\hat M(0)) + g_0(0)\sigma^2$ and $RSS(\hat M(\log n)) + g_0(\log n)\sigma^2$. Note that, for $K - k_0 = 20$,

$$P\big(RSS(\hat M(0)) + g_0(0)\sigma^2 > RSS(\hat M(\log n)) + g_0(\log n)\sigma^2\big) = P\big(RSS(\hat M(0)) + g_0(0)\sigma^2 > RSS(\hat M(\log n)) + g_0(\log n)\sigma^2,\ \hat M(\log n) = M_{k_0}\big) + P\big(RSS(\hat M(0)) + g_0(0)\sigma^2 > RSS(\hat M(\log n)) + g_0(\log n)\sigma^2,\ \hat M(\log n) \ne M_{k_0}\big) \ge P\big(Y^T(P_K - P_{k_0})Y < 40\sigma^2,\ \hat M(\log n) = M_{k_0}\big).$$

Since $Y^T(P_K - P_{k_0})Y/\sigma^2$ is a chi-square random variable with 20 degrees of freedom, it follows from the consistency of BIC in model selection that $P(RSS(\hat M(0)) + g_0(0)\sigma^2 > RSS(\hat M(\log n)) + g_0(\log n)\sigma^2)$ is close to 1.

Next, we evaluate the difference between $RSS(\hat M(2)) + g_0(2)\sigma^2$ and $RSS(\hat M(\log n)) + g_0(\log n)\sigma^2$. When $\hat M(\log n) = M_{k_0}$ and $\hat M(2) = M_k$, it follows from Lemma 1 that $\hat M(\hat\lambda) \ne M_{k_0}$ must hold if $k - k_0 \ge 3$. Since $g_0(2) \approx 2k_0 + 5.02$, the adaptive choice of $\lambda$ over $\{0, 2, \log n\}$ might increase the probability of selecting $M_{k_0}$, and this increase must come from the occurrence of the event $\{1 \le |\hat M(2)| - k_0 < 3\}$. In other words, it does improve on $C_p$ in increasing the probability of correct selection. Refer to Lemma 5 for further details.
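The chi-square probability used in the first comparison above can be checked directly; a one-line evaluation in R:

```r
# P(chi^2_20 < 40): the probability governing the comparison of
# RSS(M-hat(0)) + g0(0) sigma^2 with RSS(M-hat(log n)) + g0(log n) sigma^2.
pchisq(40, df = 20)   # approximately 0.995
```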

2.3 Adaptive choice of $\lambda$ over $\Lambda = [2, \log n]$

The choice of $\Lambda$ can be difficult in practice if the user does not know the behavior of fixed-penalty model selection criteria. The natural choice of $\Lambda$ would be $[0, \infty)$, as suggested in Shen and Ye (2002). In this subsection, $\Lambda$ is chosen to be $[2, \log n]$; the case $\Lambda = [0, \log n]$ is deferred to the next subsection. Note that $g_0(\lambda) = 2\sum_{k=1}^{K-k_0} P(\chi^2_{k+2} > k\lambda) + 2k_0$ under some regularity conditions, by Lemma 4(a) in Section 3. When $\Lambda = [2, \log n]$, the adaptive choice of $\lambda$, $\hat\lambda$, is defined as follows:

$$\hat\lambda = \arg\min_{\lambda \in [2, \log n]} \big\{RSS(\hat M(\lambda)) + g_0(\lambda)\sigma^2\big\}.$$

When the event $\{\hat M(2) = M_{k_0}\}$ occurs, it follows from Lemma 2 that $\hat\lambda = \log n$, by the fact that $g_0(\lambda)$ is decreasing. Hence, the adaptive penalty selection procedure selects $M_{k_0}$ whenever Mallows $C_p$ selects $M_{k_0}$. This shows that adaptive penalty selection with $\lambda \in [2, \log n]$ does improve on $C_p$ in the probability of correct selection. Table 1 presents the findings of a simulation experiment; the general discussion is left to Section 3. The simulation results show that the adaptive choice of $\lambda$ over $[2, \log n]$ performs better than $C_p$ but cannot match BIC. Compared with $C_p$, the adaptive choice of $\lambda$ over $[2, \log n]$ increases the probability of correct selection from 71.29% to 74.88%, and this improvement is observed consistently over simulations. Section 3 quantifies the increase in the probability of selecting $M_{k_0}$ for the adaptive choice of $\lambda$ over $[2, \log n]$ relative to the fixed penalty 2, and the result matches well with the above simulation experiment.

2.4 Adaptive choice of $\lambda$ over $\Lambda = [0, \log n]$

Woodroofe (1982) showed that the Mallows $C_p$ selection procedure can be viewed as finding the global maximum of a random walk with negative drift $-1$, formed by the partial sums of $\big[\sqrt{n}(\hat\beta_k - \beta_k)/\sqrt{Var(\sqrt{n}\,\hat\beta_k)}\big]^2 - 2$ over $\{k_0, \ldots, K\}$. When $\lambda = 0$, the procedure selects $M_K$, the largest candidate model. Section 2.2 shows that $g_0(0) = 2K$, which is twice the GDF defined in Shen and Ye (2002). As in the last subsection, we again use a simulation study to evaluate the adaptive penalty selection procedure when $\Lambda = [0, \log n]$. The findings are summarized in Table 2. Clearly, adaptive penalty selection of $\lambda$ over $[0, \log n]$ not only fails to improve over $C_p$ but decreases the probability of selecting $M_{k_0}$ to about 52.6%, as shown in Table 2. This finding is consistent with the simulation results in Atkinson (1980) and Zhang (1992); both papers suggested that the penalty factor in Atkinson's generalization of the Mallows $C_p$ statistic should be chosen between 1.5 and 6. When $\lambda = 1$, the size of the selected model is the one that attains the global maximum of a random walk without drift over $\{k_0, \ldots, K\}$, which leads to an overfitted model with many superfluous explanatory variables. We now employ Lemma 1 to explain why adaptive penalty selection performs worse than $C_p$ when $\Lambda = [0, \log n]$.

[Table 1 near here. For each candidate model $M_k$, the table reports its selection frequency under adaptive selection over $\lambda \in [2, \log n]$ and under the fixed penalty $\lambda = 2$; the numerical entries are omitted here.]

Table 1: The empirical distribution of the chosen model through adaptive model selection when $\lambda \in [2, \log n]$, based on 100,000 simulated runs.

[Table 2 near here. Same layout, comparing adaptive selection over $\lambda \in [0, \log n]$ with the fixed penalty $\lambda = 2$; the numerical entries are omitted here.]

Table 2: The empirical distribution of the chosen model through adaptive model selection when $\lambda \in [0, \log n]$, based on 100,000 simulated runs.

Recall that $g_0(\lambda) = 2E[\epsilon^T(\hat\mu_{\hat M(\lambda)} - \mu_0)]/\sigma^2$, where $\hat\mu_{\hat M(\lambda)} = \sum_{k=1}^{K} \hat\mu_{M_k} 1_{\{\hat M(\lambda) = M_k\}}$. Note that $P_k$ does not depend on $\epsilon$ and hence

$$E\big[\epsilon^T(\hat\mu_{\hat M(\lambda)} - \mu_0)\big] = E\Big[\sum_{k=1}^{K} \epsilon^T P_k \epsilon\, 1_{\{\hat M(\lambda) = M_k\}}\Big] = \sigma^2 g_0(\lambda)/2 > 0.$$

Since $|\hat M(\lambda)| \ge k_0$, $g_0(\lambda)$ must be greater than $2k_0$. Since $\epsilon^T(\hat\mu_{\hat M(\lambda)} - \mu_0)$ is a mixture of $\chi^2$ random variables and cannot be observed, it is estimated by its expected value, as suggested in Mallows (1973). Note that $g_0(\lambda) - 2k_0$ equals 25.35, 21.98, 18.78, 15.98, 13.36, and … as $\lambda$ takes the values 1.0, 1.1, 1.2, 1.3, 1.4, and 1.5. Since the distribution of a $\chi^2_1$ random variable is positively skewed, replacing such a random variable by its mean can be problematic. Moreover, the following must hold when $\hat M(\hat\lambda) = M_{k_0}$:

$$|\hat M(\lambda)| \le \left\lfloor \frac{g_0(\lambda) - 2k_0}{\lambda} \right\rfloor + k_0 \quad \text{for all } \lambda \in \Lambda,$$

where $\lfloor \cdot \rfloor$ is the Gauss integer (floor). When $\hat M(\hat\lambda) = M_{k_0}$, it follows from Lemma 1 that $|\hat M(1.1)| - k_0 \le 19$, $|\hat M(1.2)| - k_0 \le 15$, $|\hat M(1.3)| - k_0 \le 12$, $|\hat M(1.4)| - k_0 \le 9$, and $|\hat M(1.5)| - k_0 \le 7$. This restriction reduces the probability of choosing $M_{k_0}$ and provides an explanation of why adaptive penalty selection is so problematic when $\lambda \in [1, 1.5]$.

3 Performance of Adaptive Penalty Selection with GDF

In this section, we start with assumptions which ensure that BIC is a consistent model selector and that $C_p$ never misses an explanatory variable with nonzero coefficient. Under the setting of this paper, the analytic form of $g_0(\lambda)$ is available, and its numerical values can be obtained easily with any statistical software. To evaluate whether the method of adaptive penalty selection can improve on fixed penalty selection, we combine a simulation study and an analytic approximation for those $(K, k_0)$ with $K - k_0 \ge 20$. When $K - k_0 = 20$, numerical values of $g_0(\lambda)$ are obtained with the R software. We then combine the numerical values of $g_0(\lambda)$ for $K - k_0 = 20$ with the exponential inequality in Teicher (1984) to give a bound on $g_0(\lambda)$ for $K - k_0 > 20$.

3.1 Key assumptions

We now state three assumptions on the class of competing models $\mathcal{M}$. The following assumptions ensure that employing BIC leads to consistent model selection and that we never get an underfitted model when $\lambda \in [0, \log n]$. They mainly prescribe the relation between $n$, the total number of observations, $K$, the total number of explanatory variables, and $k_0$, the total number of nonzero regression coefficients. As a remark, the setting of the simulation studies in Section 2 satisfies these three assumptions.

Assumption B. $X_K$ is of full rank and there exists a constant $c > 0$ such that $\mu_0^T(I_n - P_j)\mu_0/\sigma^2 \ge cn$ for all $j < k_0$, where $\mu_0 = X_{k_0}\beta_{k_0}$ is the mean vector of the true model.

Assumption N1. $cn > 2k_0 \log n$.

Assumption N2. $\log n > 2\log(K - k_0)$.

Assumptions B and N1 quantify the magnitude of the bias when the fitted model leaves out an explanatory variable with nonzero coefficient, while Assumption N2 guards against the inclusion of the zero-coefficient explanatory variables through $\max_{k_0 < j \le K} |\hat\beta_j|$. When these assumptions hold, BIC selects the correct model with probability tending to 1 as $n$ goes to infinity. Moreover, for all $\lambda \in [0, \log n]$, the probability of the event $\{|\hat M(\lambda)| \le k_0 - 1\}$ goes to zero, as shown in the following lemma, whose proof is deferred to the Appendix.

Lemma 3. When $k_0 > 1$ and Assumptions B and N1 hold,

$$P\big(|\hat M(\lambda)| \ge k_0 \ \text{for all } \lambda \in [0, \log n]\big) \to 1$$

as $n$ goes to infinity. When Assumption N2 also holds, $P(\hat M(\log n) = M_{k_0}) \to 1$ as $n$ goes to infinity.

Under the above assumptions, for adaptive penalty selection over $\Lambda = [0, \log n]$ we only need to address the inclusion of superfluous explanatory variables over $\mathcal{M}_{k_0}$. It remains to study

$$\hat M(\lambda) = \arg\min_{M_k \in \mathcal{M}_{k_0}} \big\{RSS(M_k) + \lambda k \sigma^2\big\} \quad \text{and} \quad \hat\lambda = \arg\min_{\lambda \in \Lambda} \big\{RSS(\hat M(\lambda)) + g_0(\lambda)\sigma^2\big\}$$

in order to quantify $|\hat M(\lambda)| - k_0$. We now employ the argument in Woodroofe (1982) to calculate $g_0(\lambda)$, defined as $2E[\epsilon^T(\hat\mu_{\hat M(\lambda)} - \mu_0)]/\sigma^2$, over $\mathcal{M}_{k_0}$; the details are deferred to the Appendix.

Lemma 4. When Assumptions B, N1, and N2 hold, the following hold.
(a) For $\lambda \in [0, \log n]$, $g_0(\lambda) = 2\sum_{j=1}^{K-k_0} P(\chi^2_{j+2} > j\lambda) + 2k_0$, and $g_0(\lambda)$ is a strictly decreasing function of $\lambda$, where $\chi^2_l$ denotes a chi-square random variable with $l$ degrees of freedom.
(b) For $\lambda \in [2, \log n]$, $2\sum_{j=21}^{K-k_0} P(\chi^2_{j+2} > j\lambda) \le 2.61$ when $K - k_0 > 20$.

Figure 1 gives the plot of $(\lambda, g_0(\lambda))$ when $K - k_0 = 20$, in which $g_0(\lambda)$ is approximated with the R software to an accuracy of four decimal places. Lemma 4(b) gives us a general idea of the plot of $(\lambda, g_0(\lambda))$ when $K - k_0 > 20$.
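The analytic form in Lemma 4(a) is straightforward to evaluate numerically; a minimal R sketch for a scalar penalty value, with arguments matching the running example $K - k_0 = 20$ and $k_0 = 1$:

```r
# g0(lambda) from Lemma 4(a): twice the GDF over the nested, overfitted models,
# evaluated by summing chi-square tail probabilities.
g0 <- function(lambda, k0, K) {
  j <- seq_len(K - k0)
  2 * sum(pchisq(j * lambda, df = j + 2, lower.tail = FALSE)) + 2 * k0
}
g0(2, k0 = 1, K = 21)   # about 2 * k0 + 5.02 when K - k0 = 20
```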

[Figure 1 near here.]

Figure 1: $g_0(\lambda)$ when $K - k_0 = 20$.

3.2 Performance of the adaptive penalty selection procedure

To evaluate the adaptive penalty selection procedure, we start with the case $(K - k_0, n) = (20, 404)$ and $\Lambda = [2, \log n]$. Since $\log 404 > 2\log(20)$, Assumption N2 holds. Assume that Assumptions B and N1 hold. It follows from Lemma 3 that the adaptive penalty selection procedure in Shen and Ye (2002) will not miss any explanatory variable with nonzero coefficient. Lemma 4(a) gives $g_0(\lambda)$ for all $0 \le \lambda \le \log n$, which can be obtained by numerical approximation with R. By the simulation results presented in Table 1, $\hat\lambda$ is often smaller than $\log n$, and $\hat M(\hat\lambda)$ cannot be $M_{k_0}$ whenever $|\hat M(2)| > 2 + k_0$. An estimate of a lower bound on the probability of the event $\{\hat M(\hat\lambda) = M_{k_0}\}$ will be given.

When the event $\{\hat M(2) = M_{k_0}\}$ occurs, it follows from Lemma 2 that $\hat\lambda = \log n$, by the fact that $g_0(\lambda)$ is decreasing, and hence $\hat M(\hat\lambda) = M_{k_0}$. Thus, adaptive penalty selection selects $M_{k_0}$ whenever Mallows $C_p$ selects $M_{k_0}$. We now demonstrate how adaptive penalty selection improves over $C_p$ when $C_p$ overfits by one explanatory variable, that is, when the event $\{\hat M(2) = M_{k_0+1}, \hat M(\hat\lambda) = M_{k_0}\}$ occurs.

For $i = k_0 + 1, \ldots, K$, define $V_i = [RSS(M_{i-1}) - RSS(M_i)]/\sigma^2$. Note that the $V_i$ are i.i.d. chi-square random variables with one degree of freedom. When the event $\{\hat M(2) = M_{k_0+1}\}$ occurs, it follows from Lemma 2 that $\hat M(\lambda) = M_{k_0}$ when $\lambda \in [V_{k_0+1}, \log n]$ and $\hat M(\lambda) = M_{k_0+1}$ when $\lambda \in [2, V_{k_0+1})$.

The following lemma gives a lower bound on the probability of the event $\{\hat M(2) = M_{k_0+1}, \hat\lambda = \log n\}$ when $\Lambda = [2, \log n]$. Define $p_{\lambda, k_0, K-k_0}(k)$ to be the probability of the event $\{\hat M(\lambda) = M_k\}$ for a given penalty $\lambda$, where $k = 1, \ldots, K$.

Lemma 5. Suppose Assumptions B, N1, and N2 hold and $\Lambda = [2, \log n]$. A lower bound on the probability of $\{\hat M(2) = M_{k_0+1}, \hat M(\hat\lambda) = M_{k_0}\}$ is

$$p_{2, k_0+1, K-k_0-1}(k_0+1)\; P\Big(2 < V_{k_0+1} \le \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}\Big).$$

Proof. Define $E_1 = \{\hat M(\lambda) = M_{k_0+1} \text{ for all } \lambda \in [2, V_{k_0+1}), \ \hat M(\hat\lambda) = M_{k_0}\}$ and $E_2 = \{\hat M(\lambda) = M_{k_0} \text{ for all } \lambda \in [V_{k_0+1}, \log n], \ \hat M(\hat\lambda) = M_{k_0}\}$. Then

$$E_1 = \{RSS(M_{k_0}) - RSS(M_{k_0+1}) > \lambda\sigma^2, \ RSS(M_{k_0+1}) - RSS(M_{k_0+1+i}) \le i\lambda\sigma^2 \text{ for all } 0 < i \le K - k_0 - 1, \ RSS(M_{k_0}) + 2k_0\sigma^2 \le RSS(\hat M(\lambda)) + g_0(\lambda)\sigma^2, \ \text{for all } \lambda \in [2, V_{k_0+1})\}$$
$$= \{V_{k_0+2} + \cdots + V_{k_0+2+i} \le (i+1)V_{k_0+1} \text{ for all } 0 \le i \le K - k_0 - 2, \ 2 < V_{k_0+1} \le g_0(V_{k_0+1}) - 2k_0\},$$

$$E_2 = \{RSS(M_{k_0}) - RSS(M_{k_0+i}) \le i\lambda\sigma^2 \text{ for all } 0 < i \le K - k_0, \ RSS(M_{k_0}) + 2k_0\sigma^2 \le RSS(\hat M(\lambda)) + g_0(\lambda)\sigma^2, \ \text{for all } \lambda \in [V_{k_0+1}, \log n]\}$$
$$= \{V_{k_0+1} + \cdots + V_{k_0+1+j} \le (j+1)V_{k_0+1} \text{ for all } 0 \le j \le K - k_0 - 1\}.$$

Observe that

$$\{\hat M(2) = M_{k_0+1}, \ \hat M(\hat\lambda) = M_{k_0}\} = E_1 \cap E_2$$
$$= \{V_{k_0+2} + \cdots + V_{k_0+2+i} \le (i+1)V_{k_0+1} \text{ for all } 0 \le i \le K - k_0 - 2, \ 2 < V_{k_0+1} \le g_0(V_{k_0+1}) - 2k_0, \ V_{k_0+1} + \cdots + V_{k_0+1+j} \le (j+1)V_{k_0+1} \text{ for all } 0 \le j \le K - k_0 - 1\}$$
$$\supseteq \{V_{k_0+2} + \cdots + V_{k_0+2+i} \le 2(i+1) \text{ for all } 0 \le i \le K - k_0 - 2, \ 2 < V_{k_0+1} \le g_0(V_{k_0+1}) - 2k_0, \ V_{k_0+2} + \cdots + V_{k_0+1+j} \le jV_{k_0+1} \text{ for all } 1 \le j \le K - k_0 - 1\}$$
$$\supseteq \{V_{k_0+2} + \cdots + V_{k_0+2+i} \le 2(i+1) \text{ for all } 0 \le i \le K - k_0 - 2, \ 2 < V_{k_0+1} \le g_0(V_{k_0+1}) - 2k_0, \ V_{k_0+2} + \cdots + V_{k_0+1+j} \le 2j \text{ for all } 1 \le j \le K - k_0 - 1\}$$
$$= \{V_{k_0+2} + \cdots + V_{k_0+2+i} \le 2(i+1) \text{ for all } 0 \le i \le K - k_0 - 2, \ 2 < V_{k_0+1} \le g_0(V_{k_0+1}) - 2k_0\},$$

where the inclusions use the fact that $V_{k_0+1} > 2$ on the event, and the last equality holds because the third group of constraints coincides with the first.

Then we have

$$P\big(\hat M(2) = M_{k_0+1}, \ \hat M(\hat\lambda) = M_{k_0}\big) \ge P\big(V_{k_0+2} + \cdots + V_{k_0+2+i} \le 2(i+1) \text{ for all } 0 \le i \le K - k_0 - 2\big)\, P\big(2 < V_{k_0+1} \le g_0(V_{k_0+1}) - 2k_0\big). \qquad (3)$$

Note that the first term on the right-hand side of (3) is equal to $p_{2, k_0+1, K-k_0-1}(k_0+1)$. We conclude the proof.

Remark 1. When $K - k_0 = 20$, $p_{2, k_0+1, 19}(k_0+1)$ is approximated by simulation. Since $g_0(\lambda)$ is strictly decreasing, a lower bound on the second term of (3) is obtained as follows:

$$P\big(2 < V_{k_0+1} \le g_0(V_{k_0+1}) - 2k_0\big) \ge P\Big(2 < V_{k_0+1} \le \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}\Big) \approx P\big(2 < V_{k_0+1} \le 2.56\big), \qquad (4)$$

where 2.56 is obtained by simulation as an approximation of $\sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}$ for $K - k_0 = 20$, and $P(2 < V_{k_0+1} \le 2.56)$ is then approximated numerically with R. It follows from (3) and (4) that, for $\Lambda = [2, \log n]$, a numerical lower bound on $P(\hat M(2) = M_{k_0+1}, \hat\lambda = \log n)$ is obtained.

In fact, $\{\hat M(\hat\lambda) = M_{k_0}\} = \bigcup_{k_0 \le k \le K} \{\hat M(2) = M_k, \ \hat M(\hat\lambda) = M_{k_0}\}$. The adaptive penalty selection procedure therefore picks $M_{k_0}$ with probability

$$P\big(\hat M(\hat\lambda) = M_{k_0}\big) = \sum_{k_0 \le k \le K} P\big(\hat M(2) = M_k, \ \hat M(\hat\lambda) = M_{k_0}\big) \ge \sum_{k_0 \le k \le k_0+1} P\big(\hat M(2) = M_k, \ \hat M(\hat\lambda) = M_{k_0}\big) \ge p_{2,1,20}(1) + p_{2,1,19}(1)\, P\Big(2 < \chi^2_1 \le \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}\Big).$$

Since $P(\hat M(\hat\lambda) = M_{k_0}) \ge p_{2, k_0, K-k_0}(k_0)$, the adaptive penalty selector with $\lambda \in [2, \log n]$ improves on $C_p$ in correct selection probability. Moreover, for $K - k_0 = 20$ and $k_0 = 1$, a lower bound on the increase in correct selection probability is around 3.4%; that is, when $\lambda \in [2, \log n]$, the adaptive penalty selector improves on $C_p$ by at least 3.4% in correct selection probability. As suggested by the simulation study, the improvement mainly comes from the event $\{\hat M(2) = M_{k_0+1}, \hat\lambda = \log n\}$.
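The crossing point appearing in Remark 1 (about 2.56 for $K - k_0 = 20$) and the associated $\chi^2_1$ probability can be reproduced numerically from the analytic $g_0$ of Lemma 4(a); a small R sketch, with all names illustrative:

```r
# Crossing point sup{lambda > 0 : g0(lambda) - 2*k0 > lambda} and the
# corresponding probability P(2 < chi^2_1 <= crossing) for K - k0 = 20.
k0 <- 1; K <- 21
g0 <- function(lambda) {
  j <- seq_len(K - k0)
  2 * sum(pchisq(j * lambda, df = j + 2, lower.tail = FALSE)) + 2 * k0
}
cross <- uniroot(function(l) g0(l) - 2 * k0 - l, c(0.1, 10))$root  # about 2.56
pchisq(cross, df = 1) - pchisq(2, df = 1)   # P(2 < chi^2_1 <= crossing)
```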

Furthermore, we provide a crude upper bound, obtained from Lemma 1. When $K - k_0 = 20$, we have, for all $2 \le \lambda \le \log n$,

$$g_0(\lambda) = 2\sum_{j=1}^{20} P(\chi^2_{j+2} > j\lambda) + 2k_0 \le g_0(2), \quad \text{with } g_0(2) - 2k_0 = 5.02 \text{ when } K - k_0 = 20. \qquad (5)$$

As suggested by (5), an upper bound on the probability that adaptive penalty selection picks the correct model $M_{k_0}$ is 0.88, obtained by adding up $P(\hat M(2) = M_k)$ for $k = k_0, \ldots, k_0 + 2$ in Table 2. As revealed in Table 1 of Section 2, the adaptive penalty selection procedure improves on the fixed penalty $C_p$ by increasing the probability of correct selection by around 3.5%. This improvement must come from the cases in which $C_p$ includes no more than 2 superfluous zero-coefficient explanatory variables. According to the above discussion and the results presented in Table 1, we have the following lemma.

Lemma 6. When $K - k_0 = 20$, all three assumptions hold if $n \ge 404$. The adaptive penalty selection procedure proposed in Shen and Ye (2002) with $\lambda \in [2, \log n]$ improves over Mallows $C_p$ in terms of model selection consistency. However, it cannot match BIC in terms of model selection consistency.

We now compare the performance of adaptive penalty selection under $K - k_0 > 20$, where $n > (K - k_0)^2$, with that under $K - k_0 = 20$, where $n = 404$. When $K - k_0 > 20$, it follows from Lemma 4 that

$$g_0(\lambda) = 2\Big[\sum_{j=1}^{20} P(\chi^2_{j+2} > j\lambda) + \sum_{j=21}^{K-k_0} P(\chi^2_{j+2} > j\lambda)\Big] + 2k_0$$

for all $2 \le \lambda \le \log n$, and $2\sum_{j=21}^{K-k_0} P(\chi^2_{j+2} > j\lambda) \le 2.61$. We then have

$$g_0(2) - 2k_0 < 5.02 + 2.61 = 7.63 \quad \text{when } K - k_0 > 20. \qquad (6)$$

As suggested by (6), an upper bound on the probability that adaptive penalty selection picks the correct model $M_{k_0}$ is 0.91, obtained by adding up $P(\hat M(2) = M_k)$ for $k = k_0, \ldots, k_0 + 3$ in Table 2. As revealed in Table 1 of Section 2, adaptive penalty selection improves on the fixed penalty $C_p$ by increasing the probability of correct selection by around 3.5%. This improvement must come from the cases in which $C_p$ includes no more than 3 superfluous zero-coefficient explanatory variables. We conclude with the following theorem.

Theorem 1. When Assumptions B, N1, and N2 hold, the adaptive penalty selection procedure with generalized degrees of freedom over $\lambda \in [2, \log n]$ is still a conservative model selection procedure. In terms of the probability of selecting overparametrized models, it improves on the fixed penalty selection procedure $C_p$ but cannot match the fixed penalty selection BIC.

As a remark, a lower bound on the probability of selecting $M_{k_0}$ for $K - k_0 > 20$ can also be obtained by the same method used in the proof of Lemma 5; one only needs to replace $p_{2,1,20}(1)$ with $p_{2,1,K-k_0}(1)$ and $p_{2,1,19}(1)$ with $p_{2,1,K-k_0-1}(1)$. Since $g_0(\lambda)$ increases with $K - k_0$, $\sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}$ increases accordingly. Hence, there is no need to adjust the probability $P\big(2 < \chi^2_1 \le \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}\big)$ in the lower bound.

Similarly, for $K - k_0 > 20$ a lower bound is given by $p_{2,1,K-k_0}(1) + p_{2,1,K-k_0-1}(1)\, P\big(2 < \chi^2_1 \le \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}\big)$. Then we have the approximated lower bound

$$P(\hat\lambda = \log n) \ge p_{2,1,K-k_0}(1) + p_{2,1,K-k_0-1}(1)\, P\Big(2 < \chi^2_1 \le \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}\Big).$$

This approximated lower bound decreases only slightly when $K - k_0 > 20$, because the difference between $p_{2,1,K-k_0}(1)$ and $p_{2,1,20}(1)$ is small. Indeed, for $j > 0$ and $\lambda \ge 2$,

$$p_{\lambda,1,K-k_0}(1) - p_{\lambda,1,K-k_0+j}(1) = P\big(V_{k_0+1} \le \lambda, \ldots, V_{k_0+1} + \cdots + V_K \le \lambda(K-k_0)\big) - P\big(V_{k_0+1} \le \lambda, \ldots, V_{k_0+1} + \cdots + V_{K+j} \le \lambda(K-k_0+j)\big)$$
$$= P\big(V_{k_0+1} \le \lambda, \ldots, V_{k_0+1} + \cdots + V_K \le \lambda(K-k_0), \ V_{k_0+1} + \cdots + V_{K+m} > \lambda(K-k_0+m) \text{ for some } 1 \le m \le j\big)$$
$$\le P\big(V_{k_0+1} + \cdots + V_{K+m} > \lambda(K-k_0+m) \text{ for some } 1 \le m \le j\big).$$

The last probability is close to 0 by the law of large numbers. For example, when $\lambda \ge 2$ and $K - k_0 = 15$, the bound on $p_{\lambda,1,K-k_0}(1) - p_{\lambda,1,K-k_0+j}(1)$ is numerically negligible when evaluated with R. By the same argument, the difference between $p_{\lambda,1,K-k_0}(l)$ and $p_{\lambda,1,K-k_0+j}(l)$ is small for $l = 1, 2$.

4 Discussion and Conclusion

Our main purpose in this paper is to show that the practice of a fully adaptive choice of penalty calls for caution. In particular, we have demonstrated that a fully adaptive penalty selection procedure through the unbiased risk estimator can be a bad practice for nested candidate models. The adaptive penalty selection procedure does improve the performance of $C_p$ when the range of the penalty is chosen properly; however, such a procedure still cannot achieve model selection consistency as the fixed penalty BIC approach does. A proper selection of the range of the penalty can be achieved by using the little bootstrap or the data perturbation suggested in Shen and Ye (2002). When the data perturbation method with the suggested perturbation variance $0.25\sigma^2$ is employed to estimate GDF over $\Lambda = [0, \log n]$, it outperforms the procedure based on adding the analytic GDF, as reported in Table 3.
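The data perturbation estimate of twice the GDF can be sketched as a small Monte Carlo routine. The following is a schematic rendering only, not the authors' exact implementation: it perturbs $y$ with normal noise of standard deviation $\tau = 0.5\sigma$, reruns the selection at penalty $\lambda$, and averages the rescaled inner products. The function names, the number of replications, and the use of a centered inner product are illustrative assumptions.

```r
# Monte Carlo sketch of the data perturbation (little bootstrap) estimate of
# g0(lambda): perturb y by delta ~ N(0, tau^2 I), rerun the penalized selection
# over the nested models, average delta^T (mu_hat(y + delta) - mu_hat(y)) / tau^2.
perturb_g0 <- function(X, y, lambda, sigma2, tau = 0.5 * sqrt(sigma2), B = 200) {
  n <- length(y); K <- ncol(X)
  fit_mu <- function(yy) {                      # post-selection fit at penalty lambda
    rss <- sapply(1:K, function(k) sum(lm.fit(X[, 1:k, drop = FALSE], yy)$residuals^2))
    k_hat <- which.min(rss + lambda * (1:K) * sigma2)
    drop(X[, 1:k_hat, drop = FALSE] %*% lm.fit(X[, 1:k_hat, drop = FALSE], yy)$coefficients)
  }
  mu_hat <- fit_mu(y)
  est <- replicate(B, {
    delta <- rnorm(n, sd = tau)                 # perturbation with variance tau^2
    sum(delta * (fit_mu(y + delta) - mu_hat)) / tau^2
  })
  2 * mean(est)                                 # Monte Carlo estimate of g0(lambda)
}
```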

[Table 3 near here. For each candidate model $M_k$, the table reports its selection frequency under adaptive selection using the data perturbation estimate $\hat g_{0,\tau}(\lambda)$ and under adaptive selection using the analytic $g_0(\lambda)$; the numerical entries are omitted here.]

Table 3: The empirical distribution of the chosen model through adaptive model selection with $\hat g_{0,\tau}(\lambda)$, $\tau/\sigma = 0.5$, when $\lambda \in [0, \log n]$, based on 10,000 simulated runs.

5 Appendix

5.1 Proof of Lemma 3

It follows easily from Lemma 2 that $P(|\hat M(\log n)| \ge k_0) \le P(|\hat M(\lambda)| \ge k_0 \text{ for all } \lambda \in [0, \log n])$, or, equivalently, $P(|\hat M(\log n)| < k_0) \ge P(|\hat M(\lambda)| < k_0 \text{ for some } \lambda \in [0, \log n])$. Observe that

$$P\big(|\hat M(\lambda)| < k_0 \text{ for some } \lambda \in [0, \log n]\big) \le P\big(|\hat M(\log n)| < k_0\big) \le \sum_{k=1}^{k_0-1} P\big(\hat M(\log n) = M_k\big) \le \sum_{k=1}^{k_0-1} P\big(RSS(M_k) - RSS(M_{k_0}) \le (k_0 - k)\sigma^2 \log n\big) = \sum_{k=1}^{k_0-1} P\big(\mu_0^T(I - P_k)\mu_0 + 2\epsilon^T(I - P_k)\mu_0 + \epsilon^T(I - P_k)\epsilon \le RSS(M_{k_0}) + (k_0 - k)\sigma^2 \log n\big) \le \sum_{k=1}^{k_0-1} P\big(\mu_0^T(I - P_k)\mu_0 + 2\epsilon^T(I - P_k)\mu_0 \le k_0\sigma^2 \log n\big).$$

Since $\mu_0^T(I - P_k)\mu_0 + 2\epsilon^T(I - P_k)\mu_0 \sim N\big(\mu_0^T(I - P_k)\mu_0,\ 4\mu_0^T(I - P_k)\mu_0\,\sigma^2\big)$, it follows that

$$\sum_{k=1}^{k_0-1} P\big(\mu_0^T(I - P_k)\mu_0 + 2\epsilon^T(I - P_k)\mu_0 \le k_0\sigma^2\log n\big) = \sum_{k=1}^{k_0-1} P\left(Z \le \frac{k_0\sigma^2\log n - \mu_0^T(I - P_k)\mu_0}{2\sigma\sqrt{\mu_0^T(I - P_k)\mu_0}}\right) \le \sum_{k=1}^{k_0-1} P\left(Z \le -\frac{1}{4}\,\frac{\sqrt{\mu_0^T(I - P_k)\mu_0}}{\sigma}\right),$$

where $Z$ denotes a standard normal random variable and the inequality uses Assumptions B and N1, under which $k_0\sigma^2\log n < \mu_0^T(I - P_k)\mu_0/2$. By Assumption B,

$$P\big(|\hat M(\lambda)| < k_0 \text{ for some } \lambda \in [0, \log n]\big) \le \sum_{k=1}^{k_0-1} P\left(Z \le -\frac{1}{4}\,\frac{\sqrt{\mu_0^T(I - P_k)\mu_0}}{\sigma}\right) \le k_0\,\Phi\!\left(-\frac{\sqrt{cn}}{4}\right),$$

where $\Phi$ is the cdf of the standard normal. We conclude that $P(\hat M(\log n) = M_{k_0})$ is close to 1 by Assumption N2.

5.2 Proof of Lemma 4

Proof of (a). Define $V_i = [RSS(M_{i-1}) - RSS(M_i)]/\sigma^2$, $S_{k_0}(\lambda) = 0$, and $S_j(\lambda) = \sum_{i=k_0+1}^{j} (V_i - \lambda)$ for $j = k_0+1, \ldots, K$. It can be obtained easily from the results of Spitzer (1956) and Zhang (1992) that

$$E\big[\max_{k_0 \le j \le K} S_j(\lambda)\big] = \sum_{j=k_0+1}^{K} \frac{1}{j - k_0}\, E\big[S_j(\lambda)\, 1_{\{S_j(\lambda) > 0\}}\big] = \sum_{l=1}^{K-k_0} \big[P(\chi^2_{l+2} > l\lambda) - \lambda P(\chi^2_l > l\lambda)\big]$$

and

$$E\big[|\hat M(\lambda)| - k_0\big] = \sum_{j=k_0+1}^{K} P\big(S_j(\lambda) > 0\big) = \sum_{j=1}^{K-k_0} P(\chi^2_j > j\lambda).$$

Then

$$E\big[\epsilon^T P_{\hat M(\lambda)}\,\epsilon\big]/\sigma^2 = E\big[\epsilon^T (P_{\hat M(\lambda)} - P_{k_0})\epsilon\big]/\sigma^2 + E\big[\epsilon^T P_{k_0}\epsilon\big]/\sigma^2 = E\big[\max_{k_0 \le j \le K} S_j(\lambda)\big] + \lambda E\big[|\hat M(\lambda)| - k_0\big] + k_0 = \sum_{j=1}^{K-k_0} \big[P(\chi^2_{j+2} > j\lambda) - \lambda P(\chi^2_j > j\lambda)\big] + \lambda \sum_{j=1}^{K-k_0} P(\chi^2_j > j\lambda) + k_0 = \sum_{j=1}^{K-k_0} P(\chi^2_{j+2} > j\lambda) + k_0.$$

We conclude the proof.

For $\lambda \ge 2$ and $K - k_0 \ge 21$, Lemma 4 gives an estimate of how fast $P(\chi^2_{j+2} > j\lambda)$ goes to zero and an upper bound on $2\sum_{j=21}^{K-k_0} P(\chi^2_{j+2} > j\lambda)$. The derivation relies on Theorem 1 of Teicher (1984), which is stated below for reference.

Theorem 1 (Teicher, 1984). Let $W_i$ be independent random variables with $E[W_i] = 0$, $E[W_i^2] = \sigma_i^2$, and $E|W_i|^k \le k!\, c_2^{k-2}\, \sigma_i^2/2$ for all $k \ge 3$ and some $c_2 > 0$. Let $B_n = \sum_{i=1}^{n} a_{ni} W_i$, where the $a_{ni}$ are arbitrary constants, and set $v_n^2 = \sum_{i=1}^{n} a_{ni}^2 \sigma_i^2$ and $c_n = c_2 \max_{1 \le i \le n} |a_{ni}|$. Then

$$P(B_n > x v_n) \le \exp\left\{-\frac{x^2}{2}\Big(1 + \frac{c_n x}{v_n}\Big)^{-1}\right\}, \qquad x > 0.$$

Proof of (b). For $K - k_0 > 20$ and $\lambda = 2$, we employ Theorem 1 of Teicher (1984) by setting the $W_i$ to be i.i.d. $\chi^2_1 - 1$ for $i = 1, \ldots, k+2$. Then $E[W_i] = 0$ and $E[W_i^2] = \sigma_i^2 = 2$. By Lemma 5 in Teicher (1984), we have $E|W_i|^r \le r!\, 2^{r-2}$ for all $r \ge 3$, so we can set $c_2 = 2$. Choose $a_{ni} = 1$ for all $i$. Then $v_{k+2}^2 = 2(k+2)$, $v_{k+2} = \sqrt{2(k+2)}$, and $c_{k+2} = 2$. Write $B_{k+2} = \sum_{i=1}^{k+2} W_i$ and set $x = [k(\lambda - 1) - 2]/\sqrt{2(k+2)}$. Then, for every natural number $k$ with $k(\lambda - 1) > 2$,

$$P(\chi^2_{k+2} > k\lambda) = P\big(\chi^2_{k+2} - (k+2) > k(\lambda-1) - 2\big) = P\big(B_{k+2} > v_{k+2}\, x\big) \le \exp\left\{-\frac{x^2}{2}\Big(1 + \frac{c_{k+2}\,x}{v_{k+2}}\Big)^{-1}\right\} = \exp\left\{-\frac{[k(\lambda-1)-2]^2}{4(k+2) + 4[k(\lambda-1)-2]}\right\} = \exp\left\{-\frac{[k(\lambda-1)-2]^2}{4k\lambda}\right\}.$$

For $k \ge 21$ and $\lambda \ge 2$, expanding the square gives

$$P(\chi^2_{k+2} > k\lambda) \le \exp\left\{-\frac{k(\lambda-1)^2}{4\lambda} + \frac{\lambda-1}{\lambda}\right\} = \exp\Big\{\frac{\lambda-1}{\lambda}\Big\}\, \exp\left\{-\frac{k(\lambda-1)^2}{4\lambda}\right\},$$

and for $\lambda = 2$ we have

$$2\sum_{k=21}^{K-k_0} P(\chi^2_{k+2} > 2k) \le 2\exp(1/2) \sum_{k=21}^{\infty} \exp(-k/8) = \frac{2\exp(1/2)\exp(-21/8)}{1 - \exp(-1/8)} \le 2.61.$$

Acknowledgment. This research was supported in part by grants from the National Science Council of the Republic of China.

References

[1] Atkinson, A.C. (1980). A note on the generalized information criterion for a choice of a model. Biometrika 67.

[2] Breiman, L. (1992). The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error. Journal of the American Statistical Association 87.

[3] Liu, W. and Yang, Y. (2011). Parametric or nonparametric? A parametricness index for model selection. The Annals of Statistics 39.

[4] Mallows, C.L. (1973). Some comments on C_p. Technometrics 15.

[5] Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. The Annals of Statistics 12.

[6] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6.

[7] Shen, X. and Huang, H.C. (2006). Optimal model assessment, selection, and combination. Journal of the American Statistical Association 101.

[8] Shen, X. and Ye, J. (2002). Adaptive model selection. Journal of the American Statistical Association 97.

[9] Spitzer, F. (1956). A combinatorial lemma and its application to probability theory. Transactions of the American Mathematical Society 82.

[10] Teicher, H. (1984). Exponential bounds for large deviations of sums of unbounded random variables. Sankhya 46.

[11] Woodroofe, M. (1982). On model selection and the arc sine laws. The Annals of Statistics 10.

[12] Ye, J. (1998). On measuring and correcting the effects of data mining and model selection. Journal of the American Statistical Association 93.

[13] Zhang, P. (1992). On the distributional properties of model selection criteria. Journal of the American Statistical Association 87.


More information

Analysis of Greedy Algorithms

Analysis of Greedy Algorithms Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm

More information

Diagnostics can identify two possible areas of failure of assumptions when fitting linear models.

Diagnostics can identify two possible areas of failure of assumptions when fitting linear models. 1 Transformations 1.1 Introduction Diagnostics can identify two possible areas of failure of assumptions when fitting linear models. (i) lack of Normality (ii) heterogeneity of variances It is important

More information

ISQS 5349 Spring 2013 Final Exam

ISQS 5349 Spring 2013 Final Exam ISQS 5349 Spring 2013 Final Exam Name: General Instructions: Closed books, notes, no electronic devices. Points (out of 200) are in parentheses. Put written answers on separate paper; multiple choices

More information

COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d)

COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless

More information

An Unbiased C p Criterion for Multivariate Ridge Regression

An Unbiased C p Criterion for Multivariate Ridge Regression An Unbiased C p Criterion for Multivariate Ridge Regression (Last Modified: March 7, 2008) Hirokazu Yanagihara 1 and Kenichi Satoh 2 1 Department of Mathematics, Graduate School of Science, Hiroshima University

More information

Chapter 3: Regression Methods for Trends

Chapter 3: Regression Methods for Trends Chapter 3: Regression Methods for Trends Time series exhibiting trends over time have a mean function that is some simple function (not necessarily constant) of time. The example random walk graph from

More information

Week 5 Quantitative Analysis of Financial Markets Modeling and Forecasting Trend

Week 5 Quantitative Analysis of Financial Markets Modeling and Forecasting Trend Week 5 Quantitative Analysis of Financial Markets Modeling and Forecasting Trend Christopher Ting http://www.mysmu.edu/faculty/christophert/ Christopher Ting : christopherting@smu.edu.sg : 6828 0364 :

More information

Regression Analysis. y t = β 1 x t1 + β 2 x t2 + β k x tk + ϵ t, t = 1,..., T,

Regression Analysis. y t = β 1 x t1 + β 2 x t2 + β k x tk + ϵ t, t = 1,..., T, Regression Analysis The multiple linear regression model with k explanatory variables assumes that the tth observation of the dependent or endogenous variable y t is described by the linear relationship

More information

12 Statistical Justifications; the Bias-Variance Decomposition

12 Statistical Justifications; the Bias-Variance Decomposition Statistical Justifications; the Bias-Variance Decomposition 65 12 Statistical Justifications; the Bias-Variance Decomposition STATISTICAL JUSTIFICATIONS FOR REGRESSION [So far, I ve talked about regression

More information

Forecast comparison of principal component regression and principal covariate regression

Forecast comparison of principal component regression and principal covariate regression Forecast comparison of principal component regression and principal covariate regression Christiaan Heij, Patrick J.F. Groenen, Dick J. van Dijk Econometric Institute, Erasmus University Rotterdam Econometric

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

ECO Class 6 Nonparametric Econometrics

ECO Class 6 Nonparametric Econometrics ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Monitoring Random Start Forward Searches for Multivariate Data

Monitoring Random Start Forward Searches for Multivariate Data Monitoring Random Start Forward Searches for Multivariate Data Anthony C. Atkinson 1, Marco Riani 2, and Andrea Cerioli 2 1 Department of Statistics, London School of Economics London WC2A 2AE, UK, a.c.atkinson@lse.ac.uk

More information

Terence Tai-Leung Chong. Abstract

Terence Tai-Leung Chong. Abstract Estimation of the Autoregressive Order in the Presence of Measurement Errors Terence Tai-Leung Chong The Chinese University of Hong Kong Yuanxiu Zhang University of British Columbia Venus Liew Universiti

More information

Forward Regression for Ultra-High Dimensional Variable Screening

Forward Regression for Ultra-High Dimensional Variable Screening Forward Regression for Ultra-High Dimensional Variable Screening Hansheng Wang Guanghua School of Management, Peking University This version: April 9, 2009 Abstract Motivated by the seminal theory of Sure

More information

Forecasting Levels of log Variables in Vector Autoregressions

Forecasting Levels of log Variables in Vector Autoregressions September 24, 200 Forecasting Levels of log Variables in Vector Autoregressions Gunnar Bårdsen Department of Economics, Dragvoll, NTNU, N-749 Trondheim, NORWAY email: gunnar.bardsen@svt.ntnu.no Helmut

More information

CONSISTENCY OF CROSS VALIDATION FOR COMPARING REGRESSION PROCEDURES. By Yuhong Yang University of Minnesota

CONSISTENCY OF CROSS VALIDATION FOR COMPARING REGRESSION PROCEDURES. By Yuhong Yang University of Minnesota Submitted to the Annals of Statistics CONSISTENCY OF CROSS VALIDATION FOR COMPARING REGRESSION PROCEDURES By Yuhong Yang University of Minnesota Theoretical developments on cross validation CV) have mainly

More information

First Year Examination Department of Statistics, University of Florida

First Year Examination Department of Statistics, University of Florida First Year Examination Department of Statistics, University of Florida August 20, 2009, 8:00 am - 2:00 noon Instructions:. You have four hours to answer questions in this examination. 2. You must show

More information

Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator

Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator by Emmanuel Flachaire Eurequa, University Paris I Panthéon-Sorbonne December 2001 Abstract Recent results of Cribari-Neto and Zarkos

More information

Introductory Econometrics

Introductory Econometrics Based on the textbook by Wooldridge: : A Modern Approach Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna December 11, 2012 Outline Heteroskedasticity

More information

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:

More information

Lecture 24 May 30, 2018

Lecture 24 May 30, 2018 Stats 3C: Theory of Statistics Spring 28 Lecture 24 May 3, 28 Prof. Emmanuel Candes Scribe: Martin J. Zhang, Jun Yan, Can Wang, and E. Candes Outline Agenda: High-dimensional Statistical Estimation. Lasso

More information

Linear Regression 9/23/17. Simple linear regression. Advertising sales: Variance changes based on # of TVs. Advertising sales: Normal error?

Linear Regression 9/23/17. Simple linear regression. Advertising sales: Variance changes based on # of TVs. Advertising sales: Normal error? Simple linear regression Linear Regression Nicole Beckage y " = β % + β ' x " + ε so y* " = β+ % + β+ ' x " Method to assess and evaluate the correlation between two (continuous) variables. The slope of

More information