
PREPRINT, Department of Mathematics, National Taiwan University (國立臺灣大學數學系預印本)

Variable selection in linear regression through adaptive penalty selection

Hung Chen and Chuan-Fa Tang

September 06, 2012

Variable selection in linear regression through adaptive penalty selection

Hung Chen (1) and Chuan-Fa Tang (2)

(1) Department of Mathematics, National Taiwan University, Taipei 11167, Taiwan
(2) Department of Statistics, University of South Carolina, Columbia, SC 29208, USA

Abstract

Model selection procedures often use a fixed penalty, such as Mallows $C_p$, to avoid choosing a model that fits a particular data set extremely well. These procedures are often devised to give an unbiased risk estimate when a particular chosen model is used to predict future responses. As a correction for not including the variability induced by model selection, generalized degrees of freedom was introduced in Ye (1998) as an estimate of the model selection uncertainty that arises when the same data are used for both model selection and the associated parameter estimation. Built upon generalized degrees of freedom, Shen and Ye (2002) proposed a data-adaptive complexity penalty. In this article, we evaluate the validity of such an approach for model selection in linear regression when the set of candidate models is nested and includes the true model. It is found that the performance of such an approach can be even worse than that of Mallows $C_p$ in terms of the probability of correct selection. However, this approach, coupled with a proper choice of the range of the penalty or with the little bootstrap proposed in Breiman (1992), performs better than $C_p$, with an increased probability of correct selection, but still cannot match BIC in achieving model selection consistency.

Keywords: Adaptive penalty selection; Mallows $C_p$; Model selection consistency; Little bootstrap; Generalized degrees of freedom

1 Introduction

In analyzing data from complex scientific experiments or observational studies, one often encounters the problem of choosing one plausible linear regression model from a sequence of nested candidate models. When the exact true model is among the candidate models, BIC (Schwarz, 1978) has been shown to be a consistent model selector in Nishii (1984), while $C_p$ (Mallows, 1973) is shown to be a conservative model selector in Woodroofe (1982), in the sense that it tends to pick a slightly overfitted model. From $C_p$ (AIC) to BIC, the penalty on the dimension of the model changes from $2$ to $\log n$, where $n$ represents the number of observations. Many efforts have been made to understand under which circumstances these criteria allow one to identify the right model asymptotically. Liu and Yang (2011) proposed a model selector based on a data-adaptive choice of penalty between AIC and BIC. When the exact true model is among the candidate models, their selector is a consistent model selector.

Since both the number of candidate predictors and the sample size influence the performance of any model selector, an alternative approach is an adaptive choice of penalty. Shen and Ye (2002) proposed such an approach, in which the restriction on the penalty is lifted and the generalized degrees of freedom (GDF) is used to compensate for possible overfitting due to model selection. To avoid overfitting the available data caused by the collection of candidate models, the little bootstrap has been advocated in Breiman (1992) to decrease possible overfitting due to multi-model inference. Since GDF is often difficult to compute, Shen and Ye (2002) demonstrated that the use of the little bootstrap (i.e., data perturbation) in place of the calculation of GDF can lead to a better regression model in terms of mean prediction error as compared with the use of $C_p$ or BIC.

In this paper, we evaluate whether the adaptive penalty selection procedure proposed in Shen and Ye (2002) leads to a consistent model selector or merely reduces the overfitting of multi-model inference for nested candidate models when the candidate models include the correct model. As shown in this paper, we recommend the use of the little bootstrap (data perturbation) instead of calculating GDF when a fully adaptive penalty procedure is applied to nested candidate models.

The nested candidate models are of the form

$$Y = X_K \beta + \epsilon, \qquad (1)$$

where $Y = (Y_1, \ldots, Y_n)^T$ is the response vector, $E(Y) = \mu_0 = (\mu_1, \ldots, \mu_n)^T$, $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T \sim N_n(0, \sigma^2 I_n)$, $x_j = (x_{1j}, \ldots, x_{nj})^T$ for $j = 1, \ldots, K$, $X_K = (x_1, \ldots, x_K)$ is an $n \times K$ matrix of potential explanatory variables to be used in predicting the response vector linearly, and $\beta = (\beta_1, \ldots, \beta_K)^T$ is a vector of unknown coefficients. Here $I_n$ is the $n \times n$ identity matrix and $\sigma^2$ is known. In this paper, we further assume that $\mu_0 = X_K \beta$ and that some of the coefficients may equal zero.
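As a concrete illustration of setup (1), the following R sketch generates data with the nested candidate structure; the dimensions match the running example used later ($n = 404$, $K - k_0 = 20$, $k_0 = 1$), while the value of the nonzero coefficient is an illustrative assumption.

```r
# Minimal sketch of the nested-regression setup in (1): the K candidate
# predictors are ordered so that only the first k0 carry nonzero coefficients,
# and the error variance sigma^2 is treated as known.  The coefficient value 2
# is illustrative only.
set.seed(1)
n <- 404; K <- 21; k0 <- 1; sigma <- 1
X <- matrix(rnorm(n * K), n, K)
beta <- c(rep(2, k0), rep(0, K - k0))          # true model is M_{k0}
y <- drop(X %*% beta) + rnorm(n, sd = sigma)
```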

Consider an adaptive choice of $\lambda$ obtained by finding the best $\lambda$ over $\lambda \in \Lambda$, where $\Lambda$ denotes the collection of penalties to choose from. It is a two-step procedure. At the first step, for each $\lambda \in \Lambda$, choose the optimal model $\hat M(\lambda)$ by minimizing

$$(Y - \hat\mu_M)^T (Y - \hat\mu_M) + \lambda |M| \sigma^2,$$

where $|M|$ denotes the number of unknown parameters used to prescribe model $M$. Then calculate $g_0(\lambda)$, which is twice the generalized degrees of freedom with penalty $\lambda$. At the second step, choose the optimal $\lambda \in \Lambda$ based on the data as follows:

$$\hat\lambda = \arg\min_{\lambda \in \Lambda} \big\{ RSS(\hat M(\lambda)) + g_0(\lambda)\, \sigma^2 \big\},$$

where $RSS(\hat M(\lambda)) = \|Y - \hat\mu_{\hat M(\lambda)}\|^2$, $\|\cdot\|$ is the usual $L_2$ norm, $\hat\mu_{\hat M(\lambda)} = \sum_{k=1}^{K} \hat\mu_{M_k} 1_{\{\hat M(\lambda) = M_k\}}$, $1_{\{\cdot\}}$ is the usual indicator function, and $\hat\mu_{M_k}$ is the estimator of $\mu$ in model $M_k$. Here $\Lambda = [0, \infty)$ or $(0, \infty)$, as recommended in Shen and Ye (2002) and Shen and Huang (2006). When $g_0(\lambda)$ is difficult to compute, data perturbation is recommended.
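The two-step procedure can be written compactly as follows. This is a minimal R sketch that assumes the vector of residual sums of squares of the nested fits, the known $\sigma^2$, and a routine for $g_0(\lambda)$ are available; all names are illustrative.

```r
# Two-step adaptive penalty selection over nested models.
# rss[k] = RSS(M_k), sigma2 = known error variance, g0 = function returning
# twice the generalized degrees of freedom at penalty lambda, Lambda = grid.
select_model <- function(lambda, rss, sigma2) {
  k <- seq_along(rss)
  which.min(rss + lambda * k * sigma2)            # step 1: \hat M(lambda)
}
adaptive_penalty <- function(Lambda, rss, sigma2, g0) {
  crit <- sapply(Lambda, function(lambda) {       # step 2 objective
    rss[select_model(lambda, rss, sigma2)] + g0(lambda) * sigma2
  })
  lambda_hat <- Lambda[which.min(crit)]
  list(lambda = lambda_hat, size = select_model(lambda_hat, rss, sigma2))
}
```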

Although GDF already takes into account the data-driven selection among the sequence of estimators (i.e., candidate models), it is found in this paper that the model selected by $C_p$ is better than $\hat M(\hat\lambda)$ in terms of the probability of selecting the correct model unless $\lambda$ is restricted to fall between the AIC and BIC penalties. However, the model selector obtained by replacing $g_0(\lambda)$ with the estimate given by the little bootstrap can beat the performance of $C_p$ in terms of the probability of selecting the correct model.

The rest of the paper is organized as follows. In Section 2, we introduce unbiased risk estimation and generalized degrees of freedom in the setting of model selection with nested linear regression models, and address the performance of the adaptive penalty selection procedure in Shen and Ye (2002) with approximated GDF in terms of model selection consistency. Section 3 evaluates the performance of adaptive penalty selection with GDF. Section 4 concludes with a discussion of the merit of the data perturbation method. The Appendix contains technical proofs.

2 Adaptive penalty selection with nested candidate models

In this section, we start with the derivation of the generalized degrees of freedom defined in Ye (1998) through unbiased risk estimation, describe its potential benefit in increasing the probability of selecting the correct model, and point out its pitfalls when the choice of penalty is not properly constrained.

2.1 Unbiased risk estimation under nested linear regression model selection

The available data contain $n$ observations, each with $K$ explanatory predictors, $x_1, \ldots, x_K$, and a response $y$. Denote the matrix formed by the first $k$ explanatory variables by $X_k = (x_1, \ldots, x_k)$ and write $\beta_k = (\beta_1, \ldots, \beta_k)^T$ for $k = 1, \ldots, K$. We then have $K$ candidate models, and the $k$th candidate model $M_k$ is $Y = X_k \beta_k + \epsilon$. The set of candidate models is denoted by $\mathcal{M} = \{M_k : k = 1, \ldots, K\}$, and the true model $M_{k_0}$ belongs to $\mathcal{M}$; namely, $\mu_0 = X_{k_0}\beta_{k_0}$ with $1 \le k_0 \le K$. For ease of presentation, denote by $\mathcal{M}_{k_0} = \{M_k : k_0 \le k \le K\}$ the subcollection which includes the true model and the overfitted models.

For the $k$th candidate model $M_k$ with least squares fit, we have $\hat\mu_{M_k} = P_k Y$, where $P_k$ stands for the orthogonal projection operator onto the space spanned by $X_k$. We evaluate $\hat\mu_{M_k}$ by its prediction accuracy. Denote the future response vector at $X_K$ by $Y^0$ and assume that $E(Y^0) = E(Y)$ and $Var(Y^0) = \sigma^2 I_n$. The unobservable prediction error is defined as $PE(M_k) = \|Y^0 - \hat\mu_{M_k}\|^2$ and the residual sum of squares as $RSS(M_k) = \|Y - \hat\mu_{M_k}\|^2$. It is well known that the residual sum of squares $RSS(M_k)$ is often smaller than $PE(M_k)$; namely, $E[PE(M_k)] > E[RSS(M_k)]$. Mallows (1973) proposed $C_p$ by inflating $RSS(M_k)$ with $2\,tr(P_k)\,\sigma^2$, where $tr(P_k)$ is often called the degrees of freedom. Since both $PE(M_k)$ and $RSS(M_k)$ are of quadratic form, it follows easily that $E[PE(M_k)] = E[RSS(M_k)] + 2k\sigma^2$.

We now derive the generalized degrees of freedom through an unbiased risk estimate over $\mathcal{M}$ when the true model is $M_{k_0}$. For simplicity, we assume that $\sigma^2$ is known from now on. The model selected through the Mallows $C_p$ criterion among $\mathcal{M}$ is denoted by $\hat M(2)$. We can then write the post-model-selection estimator of $\mu_0$ through $C_p$ as $\hat\mu_{\hat M(2)} = \sum_{k=1}^{K} \hat\mu_{M_k} 1_{\{\hat M(2) = M_k\}}$. The prediction risk is defined as

$$RISK(C_p) = E\big[(Y^0 - \hat\mu_{\hat M(2)})^T (Y^0 - \hat\mu_{\hat M(2)})\big].$$

By a standard argument, we have

$$RISK(C_p) = E\big[(\mu_0 - \hat\mu_{\hat M(2)})^T(\mu_0 - \hat\mu_{\hat M(2)})\big] + n\sigma^2 = E[RSS(\hat M(2))] + 2E\big[\epsilon^T(\hat\mu_{\hat M(2)} - \mu_0)\big] = E[RSS(\hat M(2))] + g_0(2)\,\sigma^2,$$

where $g_0(2) = 2E[\epsilon^T(\hat\mu_{\hat M(2)} - \mu_0)]/\sigma^2$ is twice the GDF defined in Ye (1998). In general, the calculation of the GDF associated with $C_p$ can be complicated, but its analytic form can be obtained under the settings of this paper. When $K - k_0 = 20$ and $n = 404$, $g_0(2)$ is found to be around $2k_0 + 5.02$, as given in Lemma 4. In other words, the unbiased risk estimator associated with using $C_p$ over $\mathcal{M}$ should be $RSS(\hat M(2)) + g_0(2)\sigma^2$ instead of $RSS(\hat M(2)) + 2|\hat M(2)|\sigma^2$, where $|\hat M(2)|$ is the total number of unknown parameters of the model chosen by $C_p$. It will be shown in Lemma 4(a) that $g_0(2) > 2k_0$ under some regularity conditions.
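For the nested fits described above, the quantities $RSS(M_k)$ and the $C_p$ choice $\hat M(2)$ can be computed directly; a minimal R sketch, reusing the simulated $X$, $y$, and known $\sigma^2$ from the earlier illustration (function names are illustrative):

```r
# RSS of the nested fits M_1, ..., M_K and the model chosen by Mallows C_p
# (penalty lambda = 2), with sigma^2 treated as known.
nested_rss <- function(X, y) {
  sapply(seq_len(ncol(X)), function(k) {
    sum(lm.fit(X[, 1:k, drop = FALSE], y)$residuals^2)
  })
}
cp_choice <- function(X, y, sigma2) {
  rss <- nested_rss(X, y)
  which.min(rss + 2 * seq_along(rss) * sigma2)   # size of \hat M(2)
}
```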

Since BIC achieves model selection consistency, as shown in Schwarz (1978), its associated unbiased risk estimator is $RSS(\hat M(\log n)) + 2k_0\sigma^2$. This leads to the proposal in Shen and Ye (2002) of employing an adaptive choice of penalty incorporating GDF. We now investigate whether the adaptive penalty selection proposal incorporating GDF in Shen and Ye (2002) is a consistent model selection procedure.

2.2 Adaptive choice of $\lambda$ over $\Lambda = \{0, 2, \log n\}$

As an illustration, we consider how an adaptive choice of penalty, $\lambda \in \Lambda = \{0, 2, \log n\}$, with GDF improves on a fixed choice of penalty. Consider $k_0 = 1$ and $K - k_0 = 20$, in which case selecting an underfitted model from $\mathcal{M}$ is not possible. When $\lambda = 0$, model $M_K$ will always be chosen, since it gives the smallest residual sum of squares. Then $g_0(0) = 2K$, so the penalty $g_0(0)\sigma^2$ coincides with the penalty that Mallows $C_p$ assigns to $M_K$. When $\lambda = \log n$, we have $\hat M(\lambda) = M_{k_0}$ with probability close to 1, and then $g_0(\log n) \approx 2k_0$. When $\lambda = 2$, Woodroofe (1982) shows that the probability of selecting the correct model is slightly larger than 0.7 while, on average, no more than one superfluous explanatory variable is chosen. Then $g_0(2) \approx 2k_0 + 5.02$, where the value 5.02 depends on $k_0$ and $K - k_0$.

It will be shown that the adaptive penalty selection procedure with GDF improves on $C_p$ selection in this setting by increasing the probability of selecting the correct model $M_{k_0}$. This improvement mainly comes from the reduction of the probability of $C_p$ selecting $M_{k_0+1}$. Technical details will be provided later on.

We now give a lemma which will be used repeatedly to give an upper bound on $|\hat M(\lambda)| - k_0$. It can also be used to give an upper bound on the probability of choosing $M_{k_0}$ when the adaptive penalty selection procedure is used.

Lemma 1. Suppose that $\log n \in \Lambda$, $\lambda \le \log n$ for all $\lambda \in \Lambda$, and $\hat M(\log n) = M_{k_0}$.
(a) For all $\lambda \in \Lambda$, $P(\hat M(\hat\lambda) = M_{k_0}) \le P\big(\lambda(|\hat M(\lambda)| - |M_{k_0}|) \le g_0(\lambda) - g_0(\log n)\big)$.
(b) $\hat M(\hat\lambda) \ne M_{k_0}$ if there exists $\lambda$ such that $\lambda(|\hat M(\lambda)| - |M_{k_0}|) > g_0(\lambda) - g_0(\log n)$.

Proof. Recall that $\hat M(\lambda) = \arg\min_{M_k \in \mathcal{M}} \{RSS(M_k) + \lambda |M_k| \sigma^2\}$ for any given $\lambda \in \Lambda$; that is, $\hat M(\lambda)$ is the chosen model for the given $\lambda$. We conclude that $RSS(\hat M(\lambda)) + \lambda |\hat M(\lambda)| \sigma^2 \le RSS(M_{k_0}) + \lambda |M_{k_0}| \sigma^2$. Hence, for all $\lambda$,

$$\lambda\big(|\hat M(\lambda)| - |M_{k_0}|\big)\sigma^2 \le RSS(M_{k_0}) - RSS(\hat M(\lambda)). \qquad (2)$$

When the event $\{\hat\lambda = \log n\}$ occurs, the following must hold:

$$RSS(\hat M(\lambda)) + g_0(\lambda)\sigma^2 \ge RSS(\hat M(\log n)) + g_0(\log n)\sigma^2 \quad \text{for all } \lambda.$$

Hence, $P(\hat\lambda = \log n) \le P\big(RSS(\hat M(\lambda)) + g_0(\lambda)\sigma^2 \ge RSS(\hat M(\log n)) + g_0(\log n)\sigma^2\big)$. We conclude that

$$P(\hat\lambda = \log n) \le P\big(RSS(\hat M(\log n)) - RSS(\hat M(\lambda)) \le (g_0(\lambda) - g_0(\log n))\sigma^2\big) = P\big(\lambda(|\hat M(\lambda)| - |M_{k_0}|)\sigma^2 \le RSS(\hat M(\log n)) - RSS(\hat M(\lambda)) \le (g_0(\lambda) - g_0(\log n))\sigma^2\big) \le P\big(\lambda(|\hat M(\lambda)| - |M_{k_0}|)\sigma^2 \le (g_0(\lambda) - g_0(\log n))\sigma^2\big),$$

where the equality follows from (2) and the assumption that $\hat M(\log n) = M_{k_0}$. This concludes the proof of the lemma.

The next lemma states that $|\hat M(\lambda)|$ is a non-increasing step function of $\lambda$.

Lemma 2. When the set of candidate models $\mathcal{M}$ is a sequence of nested linear regression models, $|\hat M(\lambda)|$ is a non-increasing step function over $\Lambda$. When $|\hat M(\lambda_l)| = k_l$ for some $1 \le k_l \le K$, $|\hat M(\lambda)| \le k_l$ for all $\lambda \ge \lambda_l$.

Proof. Recall that $\hat M(\lambda)$ denotes the selected model for the given penalty $\lambda |M_k| \sigma^2$. It follows from Proposition 5.1 of Breiman (1992) that $RSS(\hat M(\lambda))$ must be at an extreme point of the lower convex envelope of the graph $\{(k, RSS(M_k)) : k = 1, \ldots, K\}$. Hence, this lemma holds.

Adaptive choice of $\lambda$ over $\Lambda = \{0, 2, \log n\}$ depends on the comparison of $RSS(\hat M(0)) + g_0(0)\sigma^2$, $RSS(\hat M(\log n)) + g_0(\log n)\sigma^2$, and $RSS(\hat M(2)) + g_0(2)\sigma^2$. We start with the comparison of $RSS(\hat M(0)) + g_0(0)\sigma^2$ and $RSS(\hat M(\log n)) + g_0(\log n)\sigma^2$. Note that, for $K - k_0 = 20$,

$$P\big(RSS(\hat M(0)) + g_0(0)\sigma^2 > RSS(\hat M(\log n)) + g_0(\log n)\sigma^2\big) = P\big(RSS(\hat M(0)) + g_0(0)\sigma^2 > RSS(\hat M(\log n)) + g_0(\log n)\sigma^2,\ \hat M(\log n) = M_{k_0}\big) + P\big(RSS(\hat M(0)) + g_0(0)\sigma^2 > RSS(\hat M(\log n)) + g_0(\log n)\sigma^2,\ \hat M(\log n) \ne M_{k_0}\big) \ge P\big(Y^T(P_K - P_{k_0})Y < 40\sigma^2,\ \hat M(\log n) = M_{k_0}\big).$$

Since $Y^T(P_K - P_{k_0})Y/\sigma^2$ is a chi-square random variable with 20 degrees of freedom, it follows from the consistency of BIC in model selection that $P(RSS(\hat M(0)) + g_0(0)\sigma^2 > RSS(\hat M(\log n)) + g_0(\log n)\sigma^2)$ is close to 1.

Next, we evaluate the difference between $RSS(\hat M(2)) + g_0(2)\sigma^2$ and $RSS(\hat M(\log n)) + g_0(\log n)\sigma^2$. When $\hat M(\log n) = M_{k_0}$ and $\hat M(2) = M_k$, it follows from Lemma 1 that $\hat M(\hat\lambda) \ne M_{k_0}$ must hold if $k - k_0 \ge 3$. Since $g_0(2) \approx 2k_0 + 5.02$, the adaptive choice of $\lambda$ over $\{0, 2, \log n\}$ might increase the probability of selecting $M_{k_0}$, and this increase must come from the occurrence of the event $\{1 \le |\hat M(2)| - k_0 < 3\}$. In other words, it does improve on $C_p$ in increasing the probability of correct selection. Refer to Lemma 5 for further details.
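The chi-square probability used in the first comparison above can be checked directly; a one-line evaluation in R:

```r
# P(chi^2_20 < 40): the probability governing the comparison of
# RSS(M-hat(0)) + g0(0) sigma^2 with RSS(M-hat(log n)) + g0(log n) sigma^2.
pchisq(40, df = 20)   # approximately 0.995
```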

2.3 Adaptive choice of $\lambda$ over $\Lambda = [2, \log n]$

The choice of $\Lambda$ can be difficult in practice if the user does not know the behavior of fixed-penalty model selection criteria. The natural choice of $\Lambda$ would be $[0, \infty)$, as suggested in Shen and Ye (2002). In this subsection, $\Lambda$ is chosen to be $[2, \log n]$; the case $\Lambda = [0, \log n]$ is deferred to the next subsection. Note that $g_0(\lambda) = 2\sum_{k=1}^{K-k_0} P(\chi^2_{k+2} > k\lambda) + 2k_0$ under some regularity conditions, by Lemma 4(a) in Section 3. When $\Lambda = [2, \log n]$, the adaptive choice of $\lambda$, $\hat\lambda$, is defined as follows:

$$\hat\lambda = \arg\min_{\lambda \in [2, \log n]} \big\{RSS(\hat M(\lambda)) + g_0(\lambda)\sigma^2\big\}.$$

When the event $\{\hat M(2) = M_{k_0}\}$ occurs, it follows from Lemma 2 that $\hat\lambda = \log n$, by the fact that $g_0(\lambda)$ is decreasing. Hence, the adaptive penalty selection procedure selects $M_{k_0}$ whenever Mallows $C_p$ selects $M_{k_0}$. This shows that adaptive penalty selection with $\lambda \in [2, \log n]$ does improve on $C_p$ in the probability of correct selection. Table 1 presents the findings of a simulation experiment; the general discussion is left to Section 3. The simulation results show that the adaptive choice of $\lambda$ over $[2, \log n]$ performs better than $C_p$ but cannot match BIC. Compared with $C_p$, the adaptive choice of $\lambda$ over $[2, \log n]$ increases the probability of correct selection from 71.29% to 74.88%, and this improvement is observed consistently over simulations. Section 3 quantifies the increase in the probability of selecting $M_{k_0}$ for the adaptive choice of $\lambda$ over $[2, \log n]$ relative to the fixed penalty 2, and the result matches well with the above simulation experiment.

2.4 Adaptive choice of $\lambda$ over $\Lambda = [0, \log n]$

Woodroofe (1982) showed that the Mallows $C_p$ selection procedure can be viewed as finding the global maximum of a random walk with negative drift $-1$, formed by the partial sums of $\big[\sqrt{n}(\hat\beta_k - \beta_k)/\sqrt{Var(\sqrt{n}\,\hat\beta_k)}\big]^2 - 2$ over $\{k_0, \ldots, K\}$. When $\lambda = 0$, the procedure selects $M_K$, the largest candidate model. Section 2.2 shows that $g_0(0) = 2K$, which is twice the GDF defined in Shen and Ye (2002). As in the last subsection, we again use a simulation study to evaluate the adaptive penalty selection procedure when $\Lambda = [0, \log n]$. The findings are summarized in Table 2. Clearly, adaptive penalty selection of $\lambda$ over $[0, \log n]$ not only fails to improve over $C_p$ but decreases the probability of selecting $M_{k_0}$ to about 52.6%, as shown in Table 2. This finding is consistent with the simulation results in Atkinson (1980) and Zhang (1992); both papers suggested that the penalty factor in Atkinson's generalization of the Mallows $C_p$ statistic should be chosen between 1.5 and 6. When $\lambda = 1$, the size of the selected model is the one that attains the global maximum of a random walk without drift over $\{k_0, \ldots, K\}$, which leads to an overfitted model with many superfluous explanatory variables. We now employ Lemma 1 to explain why adaptive penalty selection performs worse than $C_p$ when $\Lambda = [0, \log n]$.

[Table 1 near here. For each candidate model $M_k$, the table reports its selection frequency under adaptive selection over $\lambda \in [2, \log n]$ and under the fixed penalty $\lambda = 2$; the numerical entries are omitted here.]

Table 1: The empirical distribution of the chosen model through adaptive model selection when $\lambda \in [2, \log n]$, based on 100,000 simulated runs.

[Table 2 near here. Same layout, comparing adaptive selection over $\lambda \in [0, \log n]$ with the fixed penalty $\lambda = 2$; the numerical entries are omitted here.]

Table 2: The empirical distribution of the chosen model through adaptive model selection when $\lambda \in [0, \log n]$, based on 100,000 simulated runs.

Recall that $g_0(\lambda) = 2E[\epsilon^T(\hat\mu_{\hat M(\lambda)} - \mu_0)]/\sigma^2$, where $\hat\mu_{\hat M(\lambda)} = \sum_{k=1}^{K} \hat\mu_{M_k} 1_{\{\hat M(\lambda) = M_k\}}$. Note that $P_k$ does not depend on $\epsilon$ and hence

$$E\big[\epsilon^T(\hat\mu_{\hat M(\lambda)} - \mu_0)\big] = E\Big[\sum_{k=1}^{K} \epsilon^T P_k \epsilon\, 1_{\{\hat M(\lambda) = M_k\}}\Big] = \sigma^2 g_0(\lambda)/2 > 0.$$

Since $|\hat M(\lambda)| \ge k_0$, $g_0(\lambda)$ must be greater than $2k_0$. Since $\epsilon^T(\hat\mu_{\hat M(\lambda)} - \mu_0)$ is a mixture of $\chi^2$ random variables and cannot be observed, it is estimated by its expected value, as suggested in Mallows (1973). Note that $g_0(\lambda) - 2k_0$ equals 25.35, 21.98, 18.78, 15.98, 13.36, and … as $\lambda$ takes the values 1.0, 1.1, 1.2, 1.3, 1.4, and 1.5. Since the distribution of a $\chi^2_1$ random variable is positively skewed, replacing such a random variable by its mean can be problematic. Moreover, the following must hold when $\hat M(\hat\lambda) = M_{k_0}$:

$$|\hat M(\lambda)| \le \left\lfloor \frac{g_0(\lambda) - 2k_0}{\lambda} \right\rfloor + k_0 \quad \text{for all } \lambda \in \Lambda,$$

where $\lfloor \cdot \rfloor$ is the Gauss integer (floor). When $\hat M(\hat\lambda) = M_{k_0}$, it follows from Lemma 1 that $|\hat M(1.1)| - k_0 \le 19$, $|\hat M(1.2)| - k_0 \le 15$, $|\hat M(1.3)| - k_0 \le 12$, $|\hat M(1.4)| - k_0 \le 9$, and $|\hat M(1.5)| - k_0 \le 7$. This restriction reduces the probability of choosing $M_{k_0}$ and provides an explanation of why adaptive penalty selection is so problematic when $\lambda \in [1, 1.5]$.

3 Performance of Adaptive Penalty Selection with GDF

In this section, we start with assumptions which ensure that BIC is a consistent model selector and that $C_p$ never misses an explanatory variable with nonzero coefficient. Under the setting of this paper, the analytic form of $g_0(\lambda)$ is available, and its numerical values can be obtained easily with any statistical software. To evaluate whether the method of adaptive penalty selection can improve on fixed penalty selection, we combine a simulation study and an analytic approximation for those $(K, k_0)$ with $K - k_0 \ge 20$. When $K - k_0 = 20$, numerical values of $g_0(\lambda)$ are obtained with the R software. We then combine the numerical values of $g_0(\lambda)$ for $K - k_0 = 20$ with the exponential inequality in Teicher (1984) to give a bound on $g_0(\lambda)$ for $K - k_0 > 20$.

3.1 Key assumptions

We now state three assumptions on the class of competing models $\mathcal{M}$. The following assumptions ensure that employing BIC leads to consistent model selection and that we never get an underfitted model when $\lambda \in [0, \log n]$. They mainly prescribe the relation between $n$, the total number of observations, $K$, the total number of explanatory variables, and $k_0$, the total number of nonzero regression coefficients. As a remark, the setting of the simulation studies in Section 2 satisfies these three assumptions.

Assumption B. $X_K$ is of full rank and there exists a constant $c > 0$ such that $\mu_0^T(I_n - P_j)\mu_0/\sigma^2 \ge cn$ for all $j < k_0$, where $\mu_0 = X_{k_0}\beta_{k_0}$ is the mean vector of the true model.

Assumption N1. $cn > 2k_0 \log n$.

Assumption N2. $\log n > 2\log(K - k_0)$.

Assumptions B and N1 quantify the magnitude of the bias when the fitted model leaves out an explanatory variable with nonzero coefficient, while Assumption N2 guards against the inclusion of the zero-coefficient explanatory variables through $\max_{k_0 < j \le K} |\hat\beta_j|$. When these assumptions hold, BIC selects the correct model with probability tending to 1 as $n$ goes to infinity. Moreover, for all $\lambda \in [0, \log n]$, the probability of the event $\{|\hat M(\lambda)| \le k_0 - 1\}$ goes to zero, as shown in the following lemma, whose proof is deferred to the Appendix.

Lemma 3. When $k_0 > 1$ and Assumptions B and N1 hold,

$$P\big(|\hat M(\lambda)| \ge k_0 \ \text{for all } \lambda \in [0, \log n]\big) \to 1$$

as $n$ goes to infinity. When Assumption N2 also holds, $P(\hat M(\log n) = M_{k_0}) \to 1$ as $n$ goes to infinity.

Under the above assumptions, for adaptive penalty selection over $\Lambda = [0, \log n]$ we only need to address the inclusion of superfluous explanatory variables over $\mathcal{M}_{k_0}$. It remains to study

$$\hat M(\lambda) = \arg\min_{M_k \in \mathcal{M}_{k_0}} \big\{RSS(M_k) + \lambda k \sigma^2\big\} \quad \text{and} \quad \hat\lambda = \arg\min_{\lambda \in \Lambda} \big\{RSS(\hat M(\lambda)) + g_0(\lambda)\sigma^2\big\}$$

in order to quantify $|\hat M(\lambda)| - k_0$. We now employ the argument in Woodroofe (1982) to calculate $g_0(\lambda)$, defined as $2E[\epsilon^T(\hat\mu_{\hat M(\lambda)} - \mu_0)]/\sigma^2$, over $\mathcal{M}_{k_0}$; the details are deferred to the Appendix.

Lemma 4. When Assumptions B, N1, and N2 hold, the following hold.
(a) For $\lambda \in [0, \log n]$, $g_0(\lambda) = 2\sum_{j=1}^{K-k_0} P(\chi^2_{j+2} > j\lambda) + 2k_0$, and $g_0(\lambda)$ is a strictly decreasing function of $\lambda$, where $\chi^2_l$ denotes a chi-square random variable with $l$ degrees of freedom.
(b) For $\lambda \in [2, \log n]$, $2\sum_{j=21}^{K-k_0} P(\chi^2_{j+2} > j\lambda) \le 2.61$ when $K - k_0 > 20$.

Figure 1 gives the plot of $(\lambda, g_0(\lambda))$ when $K - k_0 = 20$, in which $g_0(\lambda)$ is approximated with the R software to an accuracy of four decimal places. Lemma 4(b) gives us a general idea of the plot of $(\lambda, g_0(\lambda))$ when $K - k_0 > 20$.
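The analytic form in Lemma 4(a) is straightforward to evaluate numerically; a minimal R sketch for a scalar penalty value, with arguments matching the running example $K - k_0 = 20$ and $k_0 = 1$:

```r
# g0(lambda) from Lemma 4(a): twice the GDF over the nested, overfitted models,
# evaluated by summing chi-square tail probabilities.
g0 <- function(lambda, k0, K) {
  j <- seq_len(K - k0)
  2 * sum(pchisq(j * lambda, df = j + 2, lower.tail = FALSE)) + 2 * k0
}
g0(2, k0 = 1, K = 21)   # about 2 * k0 + 5.02 when K - k0 = 20
```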

[Figure 1 near here.]

Figure 1: $g_0(\lambda)$ when $K - k_0 = 20$.

3.2 Performance of the adaptive penalty selection procedure

To evaluate the adaptive penalty selection procedure, we start with the case $(K - k_0, n) = (20, 404)$ and $\Lambda = [2, \log n]$. Since $\log 404 > 2\log(20)$, Assumption N2 holds. Assume that Assumptions B and N1 hold. It follows from Lemma 3 that the adaptive penalty selection procedure in Shen and Ye (2002) will not miss any explanatory variable with nonzero coefficient. Lemma 4(a) gives $g_0(\lambda)$ for all $0 \le \lambda \le \log n$, which can be obtained by numerical approximation with R. By the simulation results presented in Table 1, $\hat\lambda$ is often smaller than $\log n$, and $\hat M(\hat\lambda)$ cannot be $M_{k_0}$ whenever $|\hat M(2)| > 2 + k_0$. An estimate of a lower bound on the probability of the event $\{\hat M(\hat\lambda) = M_{k_0}\}$ will be given.

When the event $\{\hat M(2) = M_{k_0}\}$ occurs, it follows from Lemma 2 that $\hat\lambda = \log n$, by the fact that $g_0(\lambda)$ is decreasing, and hence $\hat M(\hat\lambda) = M_{k_0}$. Thus, adaptive penalty selection selects $M_{k_0}$ whenever Mallows $C_p$ selects $M_{k_0}$. We now demonstrate how adaptive penalty selection improves over $C_p$ when $C_p$ overfits by one explanatory variable, that is, when the event $\{\hat M(2) = M_{k_0+1}, \hat M(\hat\lambda) = M_{k_0}\}$ occurs.

For $i = k_0 + 1, \ldots, K$, define $V_i = [RSS(M_{i-1}) - RSS(M_i)]/\sigma^2$. Note that the $V_i$ are i.i.d. chi-square random variables with one degree of freedom. When the event $\{\hat M(2) = M_{k_0+1}\}$ occurs, it follows from Lemma 2 that $\hat M(\lambda) = M_{k_0}$ when $\lambda \in [V_{k_0+1}, \log n]$ and $\hat M(\lambda) = M_{k_0+1}$ when $\lambda \in [2, V_{k_0+1})$.

The following lemma gives a lower bound on the probability of the event $\{\hat M(2) = M_{k_0+1}, \hat\lambda = \log n\}$ when $\Lambda = [2, \log n]$. Define $p_{\lambda, k_0, K-k_0}(k)$ to be the probability of the event $\{\hat M(\lambda) = M_k\}$ for a given penalty $\lambda$, where $k = 1, \ldots, K$.

Lemma 5. Suppose Assumptions B, N1, and N2 hold and $\Lambda = [2, \log n]$. A lower bound on the probability of $\{\hat M(2) = M_{k_0+1}, \hat M(\hat\lambda) = M_{k_0}\}$ is

$$p_{2, k_0+1, K-k_0-1}(k_0+1)\; P\Big(2 < V_{k_0+1} \le \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}\Big).$$

Proof. Define $E_1 = \{\hat M(\lambda) = M_{k_0+1} \text{ for all } \lambda \in [2, V_{k_0+1}), \ \hat M(\hat\lambda) = M_{k_0}\}$ and $E_2 = \{\hat M(\lambda) = M_{k_0} \text{ for all } \lambda \in [V_{k_0+1}, \log n], \ \hat M(\hat\lambda) = M_{k_0}\}$. Then

$$E_1 = \{RSS(M_{k_0}) - RSS(M_{k_0+1}) > \lambda\sigma^2, \ RSS(M_{k_0+1}) - RSS(M_{k_0+1+i}) \le i\lambda\sigma^2 \text{ for all } 0 < i \le K - k_0 - 1, \ RSS(M_{k_0}) + 2k_0\sigma^2 \le RSS(\hat M(\lambda)) + g_0(\lambda)\sigma^2, \ \text{for all } \lambda \in [2, V_{k_0+1})\}$$
$$= \{V_{k_0+2} + \cdots + V_{k_0+2+i} \le (i+1)V_{k_0+1} \text{ for all } 0 \le i \le K - k_0 - 2, \ 2 < V_{k_0+1} \le g_0(V_{k_0+1}) - 2k_0\},$$

$$E_2 = \{RSS(M_{k_0}) - RSS(M_{k_0+i}) \le i\lambda\sigma^2 \text{ for all } 0 < i \le K - k_0, \ RSS(M_{k_0}) + 2k_0\sigma^2 \le RSS(\hat M(\lambda)) + g_0(\lambda)\sigma^2, \ \text{for all } \lambda \in [V_{k_0+1}, \log n]\}$$
$$= \{V_{k_0+1} + \cdots + V_{k_0+1+j} \le (j+1)V_{k_0+1} \text{ for all } 0 \le j \le K - k_0 - 1\}.$$

Observe that

$$\{\hat M(2) = M_{k_0+1}, \ \hat M(\hat\lambda) = M_{k_0}\} = E_1 \cap E_2$$
$$= \{V_{k_0+2} + \cdots + V_{k_0+2+i} \le (i+1)V_{k_0+1} \text{ for all } 0 \le i \le K - k_0 - 2, \ 2 < V_{k_0+1} \le g_0(V_{k_0+1}) - 2k_0, \ V_{k_0+1} + \cdots + V_{k_0+1+j} \le (j+1)V_{k_0+1} \text{ for all } 0 \le j \le K - k_0 - 1\}$$
$$\supseteq \{V_{k_0+2} + \cdots + V_{k_0+2+i} \le 2(i+1) \text{ for all } 0 \le i \le K - k_0 - 2, \ 2 < V_{k_0+1} \le g_0(V_{k_0+1}) - 2k_0, \ V_{k_0+2} + \cdots + V_{k_0+1+j} \le jV_{k_0+1} \text{ for all } 1 \le j \le K - k_0 - 1\}$$
$$\supseteq \{V_{k_0+2} + \cdots + V_{k_0+2+i} \le 2(i+1) \text{ for all } 0 \le i \le K - k_0 - 2, \ 2 < V_{k_0+1} \le g_0(V_{k_0+1}) - 2k_0, \ V_{k_0+2} + \cdots + V_{k_0+1+j} \le 2j \text{ for all } 1 \le j \le K - k_0 - 1\}$$
$$= \{V_{k_0+2} + \cdots + V_{k_0+2+i} \le 2(i+1) \text{ for all } 0 \le i \le K - k_0 - 2, \ 2 < V_{k_0+1} \le g_0(V_{k_0+1}) - 2k_0\},$$

where the inclusions use the fact that $V_{k_0+1} > 2$ on the event, and the last equality holds because the third group of constraints coincides with the first.

Then we have

$$P\big(\hat M(2) = M_{k_0+1}, \ \hat M(\hat\lambda) = M_{k_0}\big) \ge P\big(V_{k_0+2} + \cdots + V_{k_0+2+i} \le 2(i+1) \text{ for all } 0 \le i \le K - k_0 - 2\big)\, P\big(2 < V_{k_0+1} \le g_0(V_{k_0+1}) - 2k_0\big). \qquad (3)$$

Note that the first term on the right-hand side of (3) is equal to $p_{2, k_0+1, K-k_0-1}(k_0+1)$. We conclude the proof.

Remark 1. When $K - k_0 = 20$, $p_{2, k_0+1, 19}(k_0+1)$ is approximated by simulation. Since $g_0(\lambda)$ is strictly decreasing, a lower bound on the second term of (3) is obtained as follows:

$$P\big(2 < V_{k_0+1} \le g_0(V_{k_0+1}) - 2k_0\big) \ge P\Big(2 < V_{k_0+1} \le \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}\Big) \approx P\big(2 < V_{k_0+1} \le 2.56\big), \qquad (4)$$

where 2.56 is obtained by simulation as an approximation of $\sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}$ for $K - k_0 = 20$, and $P(2 < V_{k_0+1} \le 2.56)$ is then approximated numerically with R. It follows from (3) and (4) that, for $\Lambda = [2, \log n]$, a numerical lower bound on $P(\hat M(2) = M_{k_0+1}, \hat\lambda = \log n)$ is obtained.

In fact, $\{\hat M(\hat\lambda) = M_{k_0}\} = \bigcup_{k_0 \le k \le K} \{\hat M(2) = M_k, \ \hat M(\hat\lambda) = M_{k_0}\}$. The adaptive penalty selection procedure therefore picks $M_{k_0}$ with probability

$$P\big(\hat M(\hat\lambda) = M_{k_0}\big) = \sum_{k_0 \le k \le K} P\big(\hat M(2) = M_k, \ \hat M(\hat\lambda) = M_{k_0}\big) \ge \sum_{k_0 \le k \le k_0+1} P\big(\hat M(2) = M_k, \ \hat M(\hat\lambda) = M_{k_0}\big) \ge p_{2,1,20}(1) + p_{2,1,19}(1)\, P\Big(2 < \chi^2_1 \le \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}\Big).$$

Since $P(\hat M(\hat\lambda) = M_{k_0}) \ge p_{2, k_0, K-k_0}(k_0)$, the adaptive penalty selector with $\lambda \in [2, \log n]$ improves on $C_p$ in correct selection probability. Moreover, for $K - k_0 = 20$ and $k_0 = 1$, a lower bound on the increase in correct selection probability is around 3.4%; that is, when $\lambda \in [2, \log n]$, the adaptive penalty selector improves on $C_p$ by at least 3.4% in correct selection probability. As suggested by the simulation study, the improvement mainly comes from the event $\{\hat M(2) = M_{k_0+1}, \hat\lambda = \log n\}$.
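The crossing point appearing in Remark 1 (about 2.56 for $K - k_0 = 20$) and the associated $\chi^2_1$ probability can be reproduced numerically from the analytic $g_0$ of Lemma 4(a); a small R sketch, with all names illustrative:

```r
# Crossing point sup{lambda > 0 : g0(lambda) - 2*k0 > lambda} and the
# corresponding probability P(2 < chi^2_1 <= crossing) for K - k0 = 20.
k0 <- 1; K <- 21
g0 <- function(lambda) {
  j <- seq_len(K - k0)
  2 * sum(pchisq(j * lambda, df = j + 2, lower.tail = FALSE)) + 2 * k0
}
cross <- uniroot(function(l) g0(l) - 2 * k0 - l, c(0.1, 10))$root  # about 2.56
pchisq(cross, df = 1) - pchisq(2, df = 1)   # P(2 < chi^2_1 <= crossing)
```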

Furthermore, we provide a crude upper bound, obtained from Lemma 1. When $K - k_0 = 20$, we have, for all $2 \le \lambda \le \log n$,

$$g_0(\lambda) = 2\sum_{j=1}^{20} P(\chi^2_{j+2} > j\lambda) + 2k_0 \le g_0(2), \quad \text{with } g_0(2) - 2k_0 = 5.02 \text{ when } K - k_0 = 20. \qquad (5)$$

As suggested by (5), an upper bound on the probability that adaptive penalty selection picks the correct model $M_{k_0}$ is 0.88, obtained by adding up $P(\hat M(2) = M_k)$ for $k = k_0, \ldots, k_0 + 2$ in Table 2. As revealed in Table 1 of Section 2, the adaptive penalty selection procedure improves on the fixed penalty $C_p$ by increasing the probability of correct selection by around 3.5%. This improvement must come from the cases in which $C_p$ includes no more than 2 superfluous zero-coefficient explanatory variables. According to the above discussion and the results presented in Table 1, we have the following lemma.

Lemma 6. When $K - k_0 = 20$, all three assumptions hold if $n \ge 404$. The adaptive penalty selection procedure proposed in Shen and Ye (2002) with $\lambda \in [2, \log n]$ improves over Mallows $C_p$ in terms of model selection consistency. However, it cannot match BIC in terms of model selection consistency.

We now compare the performance of adaptive penalty selection under $K - k_0 > 20$, where $n > (K - k_0)^2$, with that under $K - k_0 = 20$, where $n = 404$. When $K - k_0 > 20$, it follows from Lemma 4 that

$$g_0(\lambda) = 2\Big[\sum_{j=1}^{20} P(\chi^2_{j+2} > j\lambda) + \sum_{j=21}^{K-k_0} P(\chi^2_{j+2} > j\lambda)\Big] + 2k_0$$

for all $2 \le \lambda \le \log n$, and $2\sum_{j=21}^{K-k_0} P(\chi^2_{j+2} > j\lambda) \le 2.61$. We then have

$$g_0(2) - 2k_0 < 5.02 + 2.61 = 7.63 \quad \text{when } K - k_0 > 20. \qquad (6)$$

As suggested by (6), an upper bound on the probability that adaptive penalty selection picks the correct model $M_{k_0}$ is 0.91, obtained by adding up $P(\hat M(2) = M_k)$ for $k = k_0, \ldots, k_0 + 3$ in Table 2. As revealed in Table 1 of Section 2, adaptive penalty selection improves on the fixed penalty $C_p$ by increasing the probability of correct selection by around 3.5%. This improvement must come from the cases in which $C_p$ includes no more than 3 superfluous zero-coefficient explanatory variables. We conclude with the following theorem.

Theorem 1. When Assumptions B, N1, and N2 hold, the adaptive penalty selection procedure with generalized degrees of freedom over $\lambda \in [2, \log n]$ is still a conservative model selection procedure. In terms of the probability of selecting overparametrized models, it improves on the fixed penalty selection procedure $C_p$ but cannot match the fixed penalty selection BIC.

As a remark, a lower bound on the probability of selecting $M_{k_0}$ for $K - k_0 > 20$ can also be obtained by the same method used in the proof of Lemma 5; one only needs to replace $p_{2,1,20}(1)$ with $p_{2,1,K-k_0}(1)$ and $p_{2,1,19}(1)$ with $p_{2,1,K-k_0-1}(1)$. Since $g_0(\lambda)$ increases with $K - k_0$, $\sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}$ increases accordingly. Hence, there is no need to adjust the probability $P\big(2 < \chi^2_1 \le \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}\big)$ in the lower bound.

Similarly, for $K - k_0 > 20$ a lower bound is given by $p_{2,1,K-k_0}(1) + p_{2,1,K-k_0-1}(1)\, P\big(2 < \chi^2_1 \le \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}\big)$. Then we have the approximated lower bound

$$P(\hat\lambda = \log n) \ge p_{2,1,K-k_0}(1) + p_{2,1,K-k_0-1}(1)\, P\Big(2 < \chi^2_1 \le \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}\Big).$$

This approximated lower bound decreases only slightly when $K - k_0 > 20$, because the difference between $p_{2,1,K-k_0}(1)$ and $p_{2,1,20}(1)$ is small. Indeed, for $j > 0$ and $\lambda \ge 2$,

$$p_{\lambda,1,K-k_0}(1) - p_{\lambda,1,K-k_0+j}(1) = P\big(V_{k_0+1} \le \lambda, \ldots, V_{k_0+1} + \cdots + V_K \le \lambda(K-k_0)\big) - P\big(V_{k_0+1} \le \lambda, \ldots, V_{k_0+1} + \cdots + V_{K+j} \le \lambda(K-k_0+j)\big)$$
$$= P\big(V_{k_0+1} \le \lambda, \ldots, V_{k_0+1} + \cdots + V_K \le \lambda(K-k_0), \ V_{k_0+1} + \cdots + V_{K+m} > \lambda(K-k_0+m) \text{ for some } 1 \le m \le j\big)$$
$$\le P\big(V_{k_0+1} + \cdots + V_{K+m} > \lambda(K-k_0+m) \text{ for some } 1 \le m \le j\big).$$

The last probability is close to 0 by the law of large numbers. For example, when $\lambda \ge 2$ and $K - k_0 = 15$, the bound on $p_{\lambda,1,K-k_0}(1) - p_{\lambda,1,K-k_0+j}(1)$ is numerically negligible when evaluated with R. By the same argument, the difference between $p_{\lambda,1,K-k_0}(l)$ and $p_{\lambda,1,K-k_0+j}(l)$ is small for $l = 1, 2$.

4 Discussion and Conclusion

Our main purpose in this paper is to show that the practice of a fully adaptive choice of penalty calls for caution. In particular, we have demonstrated that a fully adaptive penalty selection procedure through the unbiased risk estimator can be a bad practice for nested candidate models. The adaptive penalty selection procedure does improve the performance of $C_p$ when the range of the penalty is chosen properly; however, such a procedure still cannot achieve model selection consistency as the fixed penalty BIC approach does. A proper selection of the range of the penalty can be achieved by using the little bootstrap or the data perturbation suggested in Shen and Ye (2002). When the data perturbation method with the suggested perturbation variance $0.25\sigma^2$ is employed to estimate GDF over $\Lambda = [0, \log n]$, it outperforms the procedure based on adding the analytic GDF, as reported in Table 3.
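The data perturbation estimate of twice the GDF can be sketched as a small Monte Carlo routine. The following is a schematic rendering only, not the authors' exact implementation: it perturbs $y$ with normal noise of standard deviation $\tau = 0.5\sigma$, reruns the selection at penalty $\lambda$, and averages the rescaled inner products. The function names, the number of replications, and the use of a centered inner product are illustrative assumptions.

```r
# Monte Carlo sketch of the data perturbation (little bootstrap) estimate of
# g0(lambda): perturb y by delta ~ N(0, tau^2 I), rerun the penalized selection
# over the nested models, average delta^T (mu_hat(y + delta) - mu_hat(y)) / tau^2.
perturb_g0 <- function(X, y, lambda, sigma2, tau = 0.5 * sqrt(sigma2), B = 200) {
  n <- length(y); K <- ncol(X)
  fit_mu <- function(yy) {                      # post-selection fit at penalty lambda
    rss <- sapply(1:K, function(k) sum(lm.fit(X[, 1:k, drop = FALSE], yy)$residuals^2))
    k_hat <- which.min(rss + lambda * (1:K) * sigma2)
    drop(X[, 1:k_hat, drop = FALSE] %*% lm.fit(X[, 1:k_hat, drop = FALSE], yy)$coefficients)
  }
  mu_hat <- fit_mu(y)
  est <- replicate(B, {
    delta <- rnorm(n, sd = tau)                 # perturbation with variance tau^2
    sum(delta * (fit_mu(y + delta) - mu_hat)) / tau^2
  })
  2 * mean(est)                                 # Monte Carlo estimate of g0(lambda)
}
```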

[Table 3 near here. For each candidate model $M_k$, the table reports its selection frequency under adaptive selection using the data perturbation estimate $\hat g_{0,\tau}(\lambda)$ and under adaptive selection using the analytic $g_0(\lambda)$; the numerical entries are omitted here.]

Table 3: The empirical distribution of the chosen model through adaptive model selection with $\hat g_{0,\tau}(\lambda)$, $\tau/\sigma = 0.5$, when $\lambda \in [0, \log n]$, based on 10,000 simulated runs.

5 Appendix

5.1 Proof of Lemma 3

It follows easily from Lemma 2 that $P(|\hat M(\log n)| \ge k_0) \le P(|\hat M(\lambda)| \ge k_0 \text{ for all } \lambda \in [0, \log n])$, or, equivalently, $P(|\hat M(\log n)| < k_0) \ge P(|\hat M(\lambda)| < k_0 \text{ for some } \lambda \in [0, \log n])$. Observe that

$$P\big(|\hat M(\lambda)| < k_0 \text{ for some } \lambda \in [0, \log n]\big) \le P\big(|\hat M(\log n)| < k_0\big) \le \sum_{k=1}^{k_0-1} P\big(\hat M(\log n) = M_k\big) \le \sum_{k=1}^{k_0-1} P\big(RSS(M_k) - RSS(M_{k_0}) \le (k_0 - k)\sigma^2 \log n\big) = \sum_{k=1}^{k_0-1} P\big(\mu_0^T(I - P_k)\mu_0 + 2\epsilon^T(I - P_k)\mu_0 + \epsilon^T(I - P_k)\epsilon \le RSS(M_{k_0}) + (k_0 - k)\sigma^2 \log n\big) \le \sum_{k=1}^{k_0-1} P\big(\mu_0^T(I - P_k)\mu_0 + 2\epsilon^T(I - P_k)\mu_0 \le k_0\sigma^2 \log n\big).$$

Since $\mu_0^T(I - P_k)\mu_0 + 2\epsilon^T(I - P_k)\mu_0 \sim N\big(\mu_0^T(I - P_k)\mu_0,\ 4\mu_0^T(I - P_k)\mu_0\,\sigma^2\big)$, it follows that

$$\sum_{k=1}^{k_0-1} P\big(\mu_0^T(I - P_k)\mu_0 + 2\epsilon^T(I - P_k)\mu_0 \le k_0\sigma^2\log n\big) = \sum_{k=1}^{k_0-1} P\left(Z \le \frac{k_0\sigma^2\log n - \mu_0^T(I - P_k)\mu_0}{2\sigma\sqrt{\mu_0^T(I - P_k)\mu_0}}\right) \le \sum_{k=1}^{k_0-1} P\left(Z \le -\frac{1}{4}\,\frac{\sqrt{\mu_0^T(I - P_k)\mu_0}}{\sigma}\right),$$

where $Z$ denotes a standard normal random variable and the inequality uses Assumptions B and N1, under which $k_0\sigma^2\log n < \mu_0^T(I - P_k)\mu_0/2$. By Assumption B,

$$P\big(|\hat M(\lambda)| < k_0 \text{ for some } \lambda \in [0, \log n]\big) \le \sum_{k=1}^{k_0-1} P\left(Z \le -\frac{1}{4}\,\frac{\sqrt{\mu_0^T(I - P_k)\mu_0}}{\sigma}\right) \le k_0\,\Phi\!\left(-\frac{\sqrt{cn}}{4}\right),$$

where $\Phi$ is the cdf of the standard normal. We conclude that $P(\hat M(\log n) = M_{k_0})$ is close to 1 by Assumption N2.

5.2 Proof of Lemma 4

Proof of (a). Define $V_i = [RSS(M_{i-1}) - RSS(M_i)]/\sigma^2$, $S_{k_0}(\lambda) = 0$, and $S_j(\lambda) = \sum_{i=k_0+1}^{j} (V_i - \lambda)$ for $j = k_0+1, \ldots, K$. It can be obtained easily from the results of Spitzer (1956) and Zhang (1992) that

$$E\big[\max_{k_0 \le j \le K} S_j(\lambda)\big] = \sum_{j=k_0+1}^{K} \frac{1}{j - k_0}\, E\big[S_j(\lambda)\, 1_{\{S_j(\lambda) > 0\}}\big] = \sum_{l=1}^{K-k_0} \big[P(\chi^2_{l+2} > l\lambda) - \lambda P(\chi^2_l > l\lambda)\big]$$

and

$$E\big[|\hat M(\lambda)| - k_0\big] = \sum_{j=k_0+1}^{K} P\big(S_j(\lambda) > 0\big) = \sum_{j=1}^{K-k_0} P(\chi^2_j > j\lambda).$$

Then

$$E\big[\epsilon^T P_{\hat M(\lambda)}\,\epsilon\big]/\sigma^2 = E\big[\epsilon^T (P_{\hat M(\lambda)} - P_{k_0})\epsilon\big]/\sigma^2 + E\big[\epsilon^T P_{k_0}\epsilon\big]/\sigma^2 = E\big[\max_{k_0 \le j \le K} S_j(\lambda)\big] + \lambda E\big[|\hat M(\lambda)| - k_0\big] + k_0 = \sum_{j=1}^{K-k_0} \big[P(\chi^2_{j+2} > j\lambda) - \lambda P(\chi^2_j > j\lambda)\big] + \lambda \sum_{j=1}^{K-k_0} P(\chi^2_j > j\lambda) + k_0 = \sum_{j=1}^{K-k_0} P(\chi^2_{j+2} > j\lambda) + k_0.$$

We conclude the proof.

For $\lambda \ge 2$ and $K - k_0 \ge 21$, Lemma 4 gives an estimate of how fast $P(\chi^2_{j+2} > j\lambda)$ goes to zero and an upper bound on $2\sum_{j=21}^{K-k_0} P(\chi^2_{j+2} > j\lambda)$. The derivation relies on Theorem 1 of Teicher (1984), which is stated below for reference.

Theorem 1 (Teicher, 1984). Let $W_i$ be independent random variables with $E[W_i] = 0$, $E[W_i^2] = \sigma_i^2$, and $E|W_i|^k \le k!\, c_2^{k-2}\, \sigma_i^2/2$ for all $k \ge 3$ and some $c_2 > 0$. Let $B_n = \sum_{i=1}^{n} a_{ni} W_i$, where the $a_{ni}$ are arbitrary constants, and set $v_n^2 = \sum_{i=1}^{n} a_{ni}^2 \sigma_i^2$ and $c_n = c_2 \max_{1 \le i \le n} |a_{ni}|$. Then

$$P(B_n > x v_n) \le \exp\left\{-\frac{x^2}{2}\Big(1 + \frac{c_n x}{v_n}\Big)^{-1}\right\}, \qquad x > 0.$$

Proof of (b). For $K - k_0 > 20$ and $\lambda = 2$, we employ Theorem 1 of Teicher (1984) by setting the $W_i$ to be i.i.d. $\chi^2_1 - 1$ for $i = 1, \ldots, k+2$. Then $E[W_i] = 0$ and $E[W_i^2] = \sigma_i^2 = 2$. By Lemma 5 in Teicher (1984), we have $E|W_i|^r \le r!\, 2^{r-2}$ for all $r \ge 3$, so we can set $c_2 = 2$. Choose $a_{ni} = 1$ for all $i$. Then $v_{k+2}^2 = 2(k+2)$, $v_{k+2} = \sqrt{2(k+2)}$, and $c_{k+2} = 2$. Write $B_{k+2} = \sum_{i=1}^{k+2} W_i$ and set $x = [k(\lambda - 1) - 2]/\sqrt{2(k+2)}$. Then, for every natural number $k$ with $k(\lambda - 1) > 2$,

$$P(\chi^2_{k+2} > k\lambda) = P\big(\chi^2_{k+2} - (k+2) > k(\lambda-1) - 2\big) = P\big(B_{k+2} > v_{k+2}\, x\big) \le \exp\left\{-\frac{x^2}{2}\Big(1 + \frac{c_{k+2}\,x}{v_{k+2}}\Big)^{-1}\right\} = \exp\left\{-\frac{[k(\lambda-1)-2]^2}{4(k+2) + 4[k(\lambda-1)-2]}\right\} = \exp\left\{-\frac{[k(\lambda-1)-2]^2}{4k\lambda}\right\}.$$

For $k \ge 21$ and $\lambda \ge 2$, expanding the square gives

$$P(\chi^2_{k+2} > k\lambda) \le \exp\left\{-\frac{k(\lambda-1)^2}{4\lambda} + \frac{\lambda-1}{\lambda}\right\} = \exp\Big\{\frac{\lambda-1}{\lambda}\Big\}\, \exp\left\{-\frac{k(\lambda-1)^2}{4\lambda}\right\},$$

and for $\lambda = 2$ we have

$$2\sum_{k=21}^{K-k_0} P(\chi^2_{k+2} > 2k) \le 2\exp(1/2) \sum_{k=21}^{\infty} \exp(-k/8) = \frac{2\exp(1/2)\exp(-21/8)}{1 - \exp(-1/8)} \le 2.61.$$

Acknowledgment. This research was supported in part by grants from the National Science Council of the Republic of China.

References

[1] Atkinson, A.C. (1980). A note on the generalized information criterion for a choice of a model. Biometrika 67.

[2] Breiman, L. (1992). The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error. Journal of the American Statistical Association 87.

[3] Liu, W. and Yang, Y. (2011). Parametric or nonparametric? A parametricness index for model selection. The Annals of Statistics 39.

[4] Mallows, C.L. (1973). Some comments on C_p. Technometrics 15.

[5] Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. The Annals of Statistics 12.

[6] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6.

[7] Shen, X. and Huang, H.C. (2006). Optimal model assessment, selection, and combination. Journal of the American Statistical Association 101.

[8] Shen, X. and Ye, J. (2002). Adaptive model selection. Journal of the American Statistical Association 97.

[9] Spitzer, F. (1956). A combinatorial lemma and its application to probability theory. Transactions of the American Mathematical Society 82.

[10] Teicher, H. (1984). Exponential bounds for large deviations of sums of unbounded random variables. Sankhya 46.

[11] Woodroofe, M. (1982). On model selection and the arc sine laws. The Annals of Statistics 10.

[12] Ye, J. (1998). On measuring and correcting the effects of data mining and model selection. Journal of the American Statistical Association 93.

[13] Zhang, P. (1992). On the distributional properties of model selection criteria. Journal of the American Statistical Association 87.


More information

Analysis of Greedy Algorithms

Analysis of Greedy Algorithms Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm

More information

Diagnostics can identify two possible areas of failure of assumptions when fitting linear models.

Diagnostics can identify two possible areas of failure of assumptions when fitting linear models. 1 Transformations 1.1 Introduction Diagnostics can identify two possible areas of failure of assumptions when fitting linear models. (i) lack of Normality (ii) heterogeneity of variances It is important

More information

ISQS 5349 Spring 2013 Final Exam

ISQS 5349 Spring 2013 Final Exam ISQS 5349 Spring 2013 Final Exam Name: General Instructions: Closed books, notes, no electronic devices. Points (out of 200) are in parentheses. Put written answers on separate paper; multiple choices

More information

COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d)

COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless

More information

An Unbiased C p Criterion for Multivariate Ridge Regression

An Unbiased C p Criterion for Multivariate Ridge Regression An Unbiased C p Criterion for Multivariate Ridge Regression (Last Modified: March 7, 2008) Hirokazu Yanagihara 1 and Kenichi Satoh 2 1 Department of Mathematics, Graduate School of Science, Hiroshima University

More information

Chapter 3: Regression Methods for Trends

Chapter 3: Regression Methods for Trends Chapter 3: Regression Methods for Trends Time series exhibiting trends over time have a mean function that is some simple function (not necessarily constant) of time. The example random walk graph from

More information

Week 5 Quantitative Analysis of Financial Markets Modeling and Forecasting Trend

Week 5 Quantitative Analysis of Financial Markets Modeling and Forecasting Trend Week 5 Quantitative Analysis of Financial Markets Modeling and Forecasting Trend Christopher Ting http://www.mysmu.edu/faculty/christophert/ Christopher Ting : christopherting@smu.edu.sg : 6828 0364 :

More information

Regression Analysis. y t = β 1 x t1 + β 2 x t2 + β k x tk + ϵ t, t = 1,..., T,

Regression Analysis. y t = β 1 x t1 + β 2 x t2 + β k x tk + ϵ t, t = 1,..., T, Regression Analysis The multiple linear regression model with k explanatory variables assumes that the tth observation of the dependent or endogenous variable y t is described by the linear relationship

More information

12 Statistical Justifications; the Bias-Variance Decomposition

12 Statistical Justifications; the Bias-Variance Decomposition Statistical Justifications; the Bias-Variance Decomposition 65 12 Statistical Justifications; the Bias-Variance Decomposition STATISTICAL JUSTIFICATIONS FOR REGRESSION [So far, I ve talked about regression

More information

Forecast comparison of principal component regression and principal covariate regression

Forecast comparison of principal component regression and principal covariate regression Forecast comparison of principal component regression and principal covariate regression Christiaan Heij, Patrick J.F. Groenen, Dick J. van Dijk Econometric Institute, Erasmus University Rotterdam Econometric

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

ECO Class 6 Nonparametric Econometrics

ECO Class 6 Nonparametric Econometrics ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Monitoring Random Start Forward Searches for Multivariate Data

Monitoring Random Start Forward Searches for Multivariate Data Monitoring Random Start Forward Searches for Multivariate Data Anthony C. Atkinson 1, Marco Riani 2, and Andrea Cerioli 2 1 Department of Statistics, London School of Economics London WC2A 2AE, UK, a.c.atkinson@lse.ac.uk

More information

Terence Tai-Leung Chong. Abstract

Terence Tai-Leung Chong. Abstract Estimation of the Autoregressive Order in the Presence of Measurement Errors Terence Tai-Leung Chong The Chinese University of Hong Kong Yuanxiu Zhang University of British Columbia Venus Liew Universiti

More information

Forward Regression for Ultra-High Dimensional Variable Screening

Forward Regression for Ultra-High Dimensional Variable Screening Forward Regression for Ultra-High Dimensional Variable Screening Hansheng Wang Guanghua School of Management, Peking University This version: April 9, 2009 Abstract Motivated by the seminal theory of Sure

More information

Forecasting Levels of log Variables in Vector Autoregressions

Forecasting Levels of log Variables in Vector Autoregressions September 24, 200 Forecasting Levels of log Variables in Vector Autoregressions Gunnar Bårdsen Department of Economics, Dragvoll, NTNU, N-749 Trondheim, NORWAY email: gunnar.bardsen@svt.ntnu.no Helmut

More information

CONSISTENCY OF CROSS VALIDATION FOR COMPARING REGRESSION PROCEDURES. By Yuhong Yang University of Minnesota

CONSISTENCY OF CROSS VALIDATION FOR COMPARING REGRESSION PROCEDURES. By Yuhong Yang University of Minnesota Submitted to the Annals of Statistics CONSISTENCY OF CROSS VALIDATION FOR COMPARING REGRESSION PROCEDURES By Yuhong Yang University of Minnesota Theoretical developments on cross validation CV) have mainly

More information

First Year Examination Department of Statistics, University of Florida

First Year Examination Department of Statistics, University of Florida First Year Examination Department of Statistics, University of Florida August 20, 2009, 8:00 am - 2:00 noon Instructions:. You have four hours to answer questions in this examination. 2. You must show

More information

Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator

Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator by Emmanuel Flachaire Eurequa, University Paris I Panthéon-Sorbonne December 2001 Abstract Recent results of Cribari-Neto and Zarkos

More information

Introductory Econometrics

Introductory Econometrics Based on the textbook by Wooldridge: : A Modern Approach Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna December 11, 2012 Outline Heteroskedasticity

More information

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:

More information

Lecture 24 May 30, 2018

Lecture 24 May 30, 2018 Stats 3C: Theory of Statistics Spring 28 Lecture 24 May 3, 28 Prof. Emmanuel Candes Scribe: Martin J. Zhang, Jun Yan, Can Wang, and E. Candes Outline Agenda: High-dimensional Statistical Estimation. Lasso

More information

Linear Regression 9/23/17. Simple linear regression. Advertising sales: Variance changes based on # of TVs. Advertising sales: Normal error?

Linear Regression 9/23/17. Simple linear regression. Advertising sales: Variance changes based on # of TVs. Advertising sales: Normal error? Simple linear regression Linear Regression Nicole Beckage y " = β % + β ' x " + ε so y* " = β+ % + β+ ' x " Method to assess and evaluate the correlation between two (continuous) variables. The slope of

More information