STAT 100C: Linear Models
Arash A. Amini
June 9, 2018
Model selection

Choosing the best model among a collection of models $\{M_1, M_2, \dots, M_N\}$. What is a good model?
1. It fits the data well (model fit): measured, e.g., by the least-squares criterion.
2. It is simple (model simplicity or parsimony): measured by the number of parameters.
These two are in conflict; there is a trade-off between them. We are looking for a criterion that allows us to balance the two.
Model selection in regression

What covariates to include in the model? A model corresponds to a subset of covariates that we include in it. Suppose we have a pool $\{x_1, \dots, x_q\}$ of covariates. Then there are $2^q$ possible models: all possible subsets. For example, if we have $\{x_1, x_2, x_3\}$, then there are $2^3 = 8$ possible models:

$p = 0$:
  $M_0$: $y = \beta_0 + \varepsilon$
$p = 1$:
  $M_1$: $y = \beta_0 + \beta_1 x_1 + \varepsilon$
  $M_2$: $y = \beta_0 + \beta_2 x_2 + \varepsilon$
  $M_3$: $y = \beta_0 + \beta_3 x_3 + \varepsilon$
$p = 2$:
  $M_{12}$: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
  $M_{13}$: $y = \beta_0 + \beta_1 x_1 + \beta_3 x_3 + \varepsilon$
  $M_{23}$: $y = \beta_0 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$
$p = 3$:
  $M_{123}$: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$
We can fit all possible models, and try to pick the best one based on a criterion. $R^2$ is in general not a good model selection criterion, since it increases (at least does not decrease) when we increase the number of parameters:
$$R^2(M_0) \le R^2(M_1) \le R^2(M_{12}) \le R^2(M_{123}).$$
Model selection criteria

Adjusted $R^2$ ($R^2_{\mathrm{adj}}$) or mean-squared error. Let $R_p^2 = 1 - \mathrm{SSE}_p/\mathrm{SST}$, assuming that we have $p$ covariates in the model. $\mathrm{SST} = \sum_i (y_i - \bar y)^2$ does not depend on $p$. The adjusted $R^2$ is
$$R^2_{\mathrm{adj},p} = 1 - \frac{\mathrm{SSE}_p/(n-p-1)}{\mathrm{SST}/(n-1)} = 1 - \frac{s_p^2}{s_y^2}, \qquad s_p^2 = \frac{\mathrm{SSE}_p}{n-p-1} = \mathrm{MSE}_p.$$
Note also that
$$R^2_{\mathrm{adj},p} = 1 - \frac{n-1}{n-p-1}\,(1 - R_p^2).$$
Increasing $p$ does not necessarily increase $R^2_{\mathrm{adj},p}$: $R^2_{\mathrm{adj},p}$ takes the number of parameters into account. Maximizing $R^2_{\mathrm{adj},p}$ is equivalent to minimizing $s_p^2$. A small numerical illustration follows.
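To make this concrete, here is a minimal sketch in Python (not from the slides; the simulated data, in which only $x_1$ is truly active, is an illustrative choice). It fits each nested model by least squares and reports both $R_p^2$ and $R^2_{\mathrm{adj},p}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 50, 3
X_pool = rng.normal(size=(n, q))
# True model uses only x1; x2 and x3 are noise covariates.
y = 1.0 + 2.0 * X_pool[:, 0] + rng.normal(size=n)

def r2_and_adj(X_cols):
    """Fit OLS with an intercept plus the given columns; return (R^2, adjusted R^2)."""
    p = X_cols.shape[1]                       # number of covariates
    X = np.column_stack([np.ones(n), X_cols])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta_hat) ** 2)     # SSE_p
    sst = np.sum((y - y.mean()) ** 2)         # SST, does not depend on p
    r2 = 1 - sse / sst
    r2_adj = 1 - (n - 1) / (n - p - 1) * (1 - r2)
    return r2, r2_adj

for cols in [[0], [0, 1], [0, 1, 2]]:
    print(cols, r2_and_adj(X_pool[:, cols]))
# R^2 never decreases as covariates are added; adjusted R^2 can drop.
```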
Akaike's information criterion (AIC)

An example of a penalized criterion. In general, if $\ell(\theta)$ is the log-likelihood of the model $M$ and $\hat\theta_M$ is the maximum likelihood estimator, the AIC of the model is
$$\mathrm{AIC}(M) = -2\,\ell(\hat\theta_M) + 2\,p(M),$$
where $p(M)$ = number of parameters of the model (penalizes complex models). Pick the model with the smallest AIC. In regression, $\theta = (\beta, \sigma^2)$, and
$$\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{1}{2\sigma^2} S(\beta) - \frac{n}{2}\log \sigma^2,$$
where $S(\beta) = \sum_{i=1}^n \bigl(y_i - (X\beta)_i\bigr)^2$.
The MLE of $\sigma^2$ is $\hat\sigma^2 = S(\hat\beta)/n = \mathrm{SSE}/n$. Thus,
$$\ell(\hat\beta, \hat\sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{1}{2\hat\sigma^2} S(\hat\beta) - \frac{n}{2}\log\hat\sigma^2 = \underbrace{-\frac{n}{2}\log(2\pi) - \frac{n}{2}}_{\text{const.}} - \frac{n}{2}\log\Bigl(\frac{\mathrm{SSE}}{n}\Bigr).$$
Ignoring the constant (as in your book):
$$\mathrm{AIC}_p = n \log\Bigl(\frac{\mathrm{SSE}_p}{n}\Bigr) + 2(p+1).$$
A related criterion is the Bayesian information criterion (BIC).
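A sketch of AIC-based subset selection under the same kind of simulated setup as above (data and names are illustrative, not from the slides). It scores every subset with $\mathrm{AIC}_p = n\log(\mathrm{SSE}_p/n) + 2(p+1)$ and keeps the minimizer.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, q = 50, 3
X_pool = rng.normal(size=(n, q))
y = 1.0 + 2.0 * X_pool[:, 0] + rng.normal(size=n)  # only x1 is active

def aic(cols):
    """AIC_p = n*log(SSE_p/n) + 2*(p+1), with the constant dropped as in the slides."""
    p = len(cols)
    X = np.column_stack([np.ones(n)] + [X_pool[:, j] for j in cols])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta_hat) ** 2)
    return n * np.log(sse / n) + 2 * (p + 1)

# Score all 2^q subsets and pick the one with the smallest AIC.
subsets = [c for r in range(q + 1) for c in combinations(range(q), r)]
best = min(subsets, key=aic)
print(best, aic(best))
```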
Mallows' $C_p$ statistic

True model: $y = X\beta + \varepsilon$, where $X \in \mathbb{R}^{n \times q}$. For the true $\beta$, some of the coefficients are zero. Let $S_* \subseteq [q]$ be the index set of the nonzero coefficients of $\beta$. Then
$$y = X_{S_*} \beta_{S_*} + \varepsilon,$$
where $X_{S_*}$ is the reduction of $X$ to the columns in $S_*$. For example, with $q = 5$ and $S_* = \{1, 2, 4\}$,
$$X\beta = \begin{pmatrix} x_1 & x_2 & x_3 & x_4 & x_5 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ 0 \\ \beta_4 \\ 0 \end{pmatrix} = \underbrace{\begin{pmatrix} x_1 & x_2 & x_4 \end{pmatrix}}_{X_{S_*}} \underbrace{\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_4 \end{pmatrix}}_{\beta_{S_*}}.$$
$S_*$ is called the support of $\beta$: the true set of covariates in the model.
Mallows' $C_p$ statistic

Pick some $S \subseteq [q]$ with $|S| = p + 1$, and fit the model
$$y = X_S \alpha + \varepsilon, \qquad \alpha \in \mathbb{R}^{p+1}. \tag{1}$$
$X_S \in \mathbb{R}^{n \times (p+1)}$ is $X$ restricted to the columns in $S$. Get the mean vector estimate $\hat\mu_S = X_S \hat\alpha$ after fitting (1). A good measure of performance is
$$d(\mu, \hat\mu_S) := \frac{1}{\sigma^2}\, E\|\mu - \hat\mu_S\|^2,$$
a form of rescaled MSE for the parameter $\mu$, where $\mu$ is the true mean: $\mu = X\beta = X_{S_*}\beta_{S_*}$. We should choose the $S$ that minimizes this.
Let $H_S^\perp := I - H_S$. One can show that (try it! a sketch follows below):
$$d(\mu, \hat\mu_S) := \frac{1}{\sigma^2}\, E\|\mu - \hat\mu_S\|^2 = \frac{1}{\sigma^2}\,\mu^T H_S^\perp \mu + (p+1).$$
If the model is adequate, then $\mu \in \operatorname{Im}(X_S)$, so $\mu^T H_S^\perp \mu = 0$, hence $d(\mu, \hat\mu_S) = p+1$. Otherwise $d(\mu, \hat\mu_S) > p+1$. We can do model selection by comparing $d(\mu, \hat\mu_S)$ to $p+1$. However, it depends on the unknown $\mu$. (Note: $\mu \in \operatorname{Im}(X_S)$ is equivalent to $S \supseteq S_*$.)
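For reference, here is one way the identity can be verified (a sketch, assuming $X_S$ has full column rank, so that $H_S = X_S(X_S^T X_S)^{-1} X_S^T$ is a rank-$(p+1)$ orthogonal projection, and using $\hat\mu_S = H_S y$ with $y = \mu + \varepsilon$):

```latex
\begin{align*}
\mu - \hat\mu_S
  &= \mu - H_S(\mu + \varepsilon)
   = H_S^\perp \mu - H_S \varepsilon, \\
E\|\mu - \hat\mu_S\|^2
  &= \|H_S^\perp \mu\|^2 + E\|H_S \varepsilon\|^2
     && \text{(cross term has mean zero)} \\
  &= \mu^T H_S^\perp \mu + \sigma^2 \operatorname{tr}(H_S)
     && \text{($H_S^\perp$ is idempotent)} \\
  &= \mu^T H_S^\perp \mu + \sigma^2 (p+1).
\end{align*}
```

Dividing by $\sigma^2$ gives the stated identity.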
$C_p$ tries to approximate $d(\mu, \hat\mu_S)$. It is given by (if we know $\sigma^2$)
$$C_p = \frac{\mathrm{SSE}_p}{\sigma^2} - n + 2(p + 1).$$
With $e_S = y - \hat\mu_S$,
$$E[\mathrm{SSE}_p] = E\|e_S\|^2 = \mu^T H_S^\perp \mu + (n - p - 1)\sigma^2.$$
This implies that $C_p$ is an unbiased estimator of $d(\mu, \hat\mu_S)$:
$$E[C_p] = \frac{\mu^T H_S^\perp \mu}{\sigma^2} + (n - p - 1) - n + 2(p+1) = \frac{\mu^T H_S^\perp \mu}{\sigma^2} + p + 1 = d(\mu, \hat\mu_S).$$
In practice, we do not know $\sigma^2$, so we replace it with $s^2$ based on the full model (all covariates):
$$C_p = \frac{\mathrm{SSE}_p}{s^2} - n + 2(p + 1).$$
Choose the smallest model for which $C_p \approx p + 1$.
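A small numerical sketch (simulated data; the true support $\{1, 2, 4\}$ and all constants are illustrative choices, not from the slides), showing that models containing the true support tend to have $C_p$ near $p+1$:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, q = 100, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])
beta = np.array([1.0, 2.0, -1.5, 0.0, 0.5, 0.0])  # support S* = {1, 2, 4}
y = X @ beta + rng.normal(size=n)

def sse(cols):
    """SSE for the model with an intercept plus the covariates in cols."""
    Xs = X[:, [0] + list(cols)]
    beta_hat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ beta_hat) ** 2)

# s^2 from the full model estimates sigma^2.
s2 = sse(range(1, q + 1)) / (n - q - 1)

for r in range(q + 1):
    for cols in combinations(range(1, q + 1), r):
        p = len(cols)
        cp = sse(cols) / s2 - n + 2 * (p + 1)
        print(cols, round(cp, 2), "vs p+1 =", p + 1)
# Adequate models (those containing {1, 2, 4}) have C_p close to p+1;
# inadequate ones have C_p well above it.
```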
PRESS statistic

Recall $e_{(i)} = y_i - \hat y_{(i)}$, where $\hat y_{(i)} = x_i^T \hat\beta_{(i)}$. Define:
$$\mathrm{PRESS}_p = \sum_{i=1}^n e_{(i)}^2 = \sum_{i=1}^n \Bigl(\frac{e_i}{1 - h_{ii}}\Bigr)^2.$$
1. Leave one data point out,
2. fit the model,
3. try to predict the deleted data point.
$\mathrm{PRESS}_p$ is an empirical measure of the prediction error of the model (called generalization error in machine learning). We can use $\mathrm{PRESS}_p$ as a model selection criterion; a sketch follows below.
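A minimal sketch (simulated data, illustrative) computing $\mathrm{PRESS}_p$ via the leverage shortcut, and confirming it against explicit leave-one-out refitting:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

# PRESS via the leverage shortcut: no refitting needed.
H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix H = X (X^T X)^{-1} X^T
e = y - H @ y                                # ordinary residuals e_i
press = np.sum((e / (1 - np.diag(H))) ** 2)  # sum of (e_i / (1 - h_ii))^2

# Sanity check: explicit leave-one-out gives the same number.
press_loo = 0.0
for i in range(n):
    mask = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    press_loo += (y[i] - X[i] @ beta_i) ** 2

print(press, press_loo)  # identical up to floating-point error
```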
General principle of model selection, called cross-validation:
1. Hold out part of the data and try to predict it with the fitted model.
2. Choose the model that has the smallest prediction error.
Using $\mathrm{PRESS}_p$ is equivalent to leave-one-out cross-validation. (Also known as jackknifing in this case.)
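The same principle with $k$ held-out folds, as a hedged sketch (the fold count, seed, and simulated data are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
X_pool = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X_pool[:, 0] + rng.normal(size=n)  # only x1 is active

def cv_error(cols, k=5):
    """k-fold cross-validated prediction error for the model using covariates in cols."""
    idx = np.random.default_rng(1).permutation(n)  # fixed folds so models are comparable
    folds = np.array_split(idx, k)
    err = 0.0
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        Xtr = np.column_stack([np.ones(train.size), X_pool[np.ix_(train, cols)]])
        Xte = np.column_stack([np.ones(fold.size), X_pool[np.ix_(fold, cols)]])
        beta_hat, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        err += np.sum((y[fold] - Xte @ beta_hat) ** 2)
    return err

for cols in [[0], [0, 1], [0, 1, 2]]:
    print(cols, cv_error(cols))  # pick the model with the smallest CV error
```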
Automatic methods

Instead of looking at all possible regressions, we can use a greedy approach, usually called stepwise regression. Pick a preset significance level $\alpha$.

Forward selection:
1. Start with no variables in the model.
2. From the pool of available covariates, pick the most significant one, say $x_j$, provided its p-value is below $\alpha$ (alpha-to-enter); add it to the model and remove it from the pool.
3. Repeat until no remaining variable is significant.

Backward selection:
1. Start with the full model,
2. recursively drop the least significant covariate,
3. until none of the remaining variables is insignificant.
To decide whether to keep or drop a variable at each stage, it is common to use a $t$ or $F$ test at level $\alpha = 0.05$ or $0.10$. In forward selection, once a variable enters the model, it cannot leave the model at a later stage. Similarly, in backward selection, once a variable leaves the model, it cannot enter the model at a later stage. Bidirectional (stepwise) selection is a combination of forward and backward selection. There are criticisms of these approaches: they effectively generate hypotheses from the data and then test them on the same data, which is generally not advisable. A forward-selection sketch follows below.
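A sketch of forward selection with a $t$-test entry rule (the simulated data, $\alpha$, and helper names are illustrative; this is one possible implementation, not a canonical one):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, q = 100, 5
X_pool = rng.normal(size=(n, q))
y = 1.0 + 2.0 * X_pool[:, 0] - 1.0 * X_pool[:, 2] + rng.normal(size=n)

def pvalue_of_last(cols):
    """Two-sided t-test p-value for the last covariate in cols (intercept included)."""
    X = np.column_stack([np.ones(n), X_pool[:, cols]])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    df = n - X.shape[1]
    s2 = np.sum((y - X @ beta_hat) ** 2) / df  # residual variance estimate
    se = np.sqrt(s2 * XtX_inv[-1, -1])
    t = beta_hat[-1] / se
    return 2 * stats.t.sf(abs(t), df)

alpha = 0.05
selected, pool = [], list(range(q))
while pool:
    # p-value for each candidate when added to the current model
    pvals = {j: pvalue_of_last(selected + [j]) for j in pool}
    j_best = min(pvals, key=pvals.get)
    if pvals[j_best] >= alpha:   # no candidate meets alpha-to-enter: stop
        break
    selected.append(j_best)      # once entered, a variable never leaves
    pool.remove(j_best)

print("selected covariates:", selected)
```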