LINEAR REGRESSION ANALYSIS
MODULE XIII
Lecture - 38 Variable Selection and Model Building
Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
Evaluation of subset regression models

After selecting subsets of candidate variables for the model, a question arises: how do we judge which subset yields the better regression model? Various criteria have been proposed in the literature to evaluate and compare the subset regression models.

1. Coefficient of determination

The coefficient of determination is the square of the multiple correlation coefficient between the study variable $y$ and the set of explanatory variables $X_1, X_2, \ldots, X_{p-1}$, denoted as $R_p^2$. Note that $X_{i1} = 1$ for all $i = 1, 2, \ldots, n$, which simply indicates the need of an intercept term in the model, without which the coefficient of determination cannot be used. So essentially, there will be a subset of $(p-1)$ explanatory variables and one intercept term in the notation $R_p^2$. The coefficient of determination based on such a $p$-term model is

$$R_p^2 = \frac{SS_{reg}(p)}{SS_T} = 1 - \frac{SS_{res}(p)}{SS_T}$$

where $SS_{reg}(p)$ and $SS_{res}(p)$ are the sums of squares due to regression and residuals, respectively, in a subset model based on $(p-1)$ explanatory variables. Since there are $k$ explanatory variables available and we select only $(p-1)$ of them, there are $\binom{k}{p-1}$ possible choices of subsets. Each such choice produces one subset model. Moreover, the coefficient of determination has a tendency to increase with $p$.
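As a rough sketch of this computation (not part of the lecture notes; the NumPy-based helper names are my own), $R_p^2$ can be evaluated for each of the $\binom{k}{p-1}$ subsets along these lines:

```python
import numpy as np
from itertools import combinations

def r_squared(X_sub, y):
    """R^2_p for a subset model with an intercept term.
    X_sub: (n, p-1) array of chosen explanatory variables, y: (n,) response."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X_sub])      # intercept + chosen subset
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)  # least-squares fit
    ss_res = np.sum((y - Z @ beta) ** 2)          # SS_res(p)
    ss_t = np.sum((y - y.mean()) ** 2)            # SS_T
    return 1.0 - ss_res / ss_t

def all_subset_r2(X, y, size):
    """R^2_p for every C(k, p-1) subset of `size` = p-1 variables."""
    k = X.shape[1]
    return {s: r_squared(X[:, list(s)], y)
            for s in combinations(range(k), size)}
```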
So proceed as follows. Choose an appropriate value of $p$, fit the model, and obtain $R_p^2$. Add one variable, fit the model, and again obtain $R_{p+1}^2$. Obviously $R_{p+1}^2 \ge R_p^2$. If $R_{p+1}^2 - R_p^2$ is small, then stop and choose that value of $p$ for the subset regression. If $R_{p+1}^2 - R_p^2$ is high, then keep on adding variables up to a point where an additional variable does not produce a large change in the value of $R^2$, or the increment in $R^2$ becomes small (a sketch of this procedure follows below).

To find such a value of $p$, create a plot of $R_p^2$ versus $p$.

[Figure: typical plot of $R_p^2$ versus $p$, increasing steeply at first and then flattening]

Choose the value of $p$ corresponding to the point where the knee of the curve is clearly seen. Such a choice of $p$ may not be unique among different analysts. Some experience and judgment on the part of the analyst will be helpful in finding an appropriate and satisfactory value of $p$.
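A minimal sketch of this add-one-variable procedure, assuming a greedy forward search and an arbitrary stopping threshold `tol` (both are my own choices, not prescribed in the lecture):

```python
import numpy as np

def r2(Z, y):
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return 1 - np.sum((y - Z @ beta) ** 2) / np.sum((y - y.mean()) ** 2)

def forward_r2(X, y, tol=0.01):
    """Greedily add the variable with the largest R^2 gain; stop once the
    increment R^2_{p+1} - R^2_p drops below `tol`."""
    n, k = X.shape
    chosen, path = [], []
    Z = np.ones((n, 1))            # start from the intercept-only model
    current = r2(Z, y)             # equals 0 for the intercept-only fit
    for _ in range(k):
        best, j = max((r2(np.column_stack([Z, X[:, jj]]), y), jj)
                      for jj in range(k) if jj not in chosen)
        if best - current < tol:   # small increment in R^2: stop here
            break
        chosen.append(j)
        Z = np.column_stack([Z, X[:, j]])
        current = best
        path.append(current)
    return chosen, path            # `path` can be plotted versus p
```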
To choose a satisfactory value of $p$ analytically, one solution is a test which identifies a model whose $R_p^2$ does not differ significantly from the $R^2$ based on all the explanatory variables. Let

$$R_0^2 = 1 - (1 - R_{k+1}^2)(1 + d_{\alpha,n,k}), \qquad d_{\alpha,n,k} = \frac{k\,F_{\alpha}(k, n-k-1)}{n-k-1},$$

where $R_{k+1}^2$ is the value of $R^2$ based on all $(k+1)$ terms. A subset with $R_p^2 > R_0^2$ is called an $R^2$-adequate($\alpha$) subset.
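A small sketch of this cutoff, assuming the formula above and using `scipy.stats.f.ppf` for the upper-$\alpha$ point of the $F$ distribution (the function name is mine):

```python
from scipy.stats import f

def r2_adequate_cutoff(r2_full, n, k, alpha=0.05):
    """R^2_0; subsets with R^2_p > R^2_0 are R^2-adequate(alpha).
    r2_full is R^2 from the model using all k explanatory variables."""
    d = k * f.ppf(1 - alpha, k, n - k - 1) / (n - k - 1)  # d_{alpha,n,k}
    return 1 - (1 - r2_full) * (1 + d)
```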
2. Adjusted coefficient of determination

The adjusted coefficient of determination has certain advantages over the usual coefficient of determination. The adjusted coefficient of determination based on a $p$-term model is

$$\bar{R}^2(p) = 1 - \left(\frac{n-1}{n-p}\right)(1 - R_p^2).$$

An advantage of $\bar{R}^2(p)$ is that it does not necessarily increase as $p$ increases. If $r$ more explanatory variables are added to a $p$-term model, then $\bar{R}^2(p+r) > \bar{R}^2(p)$ if and only if the partial $F$-statistic for testing the significance of the $r$ additional explanatory variables exceeds 1. So subset selection based on $\bar{R}^2(p)$ can be made along the same lines as with $R_p^2$. In general, the value of $p$ corresponding to the maximum value of $\bar{R}^2(p)$ is chosen for the subset model.

3. Residual mean square

A model is said to have a better fit if its residuals are small. This is reflected in the sum of squares due to residuals, $SS_{res}$. A model with smaller $SS_{res}$ is preferable. Based on this, the residual mean square for a $(p-1)$-variable subset regression model is defined as

$$MS_{res}(p) = \frac{SS_{res}(p)}{n-p}.$$
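Both quantities are one-line computations; a sketch (the helper names are my own):

```python
def adjusted_r2(r2_p, n, p):
    """Adjusted R^2 for a p-term model (p counts the intercept term)."""
    return 1 - (n - 1) / (n - p) * (1 - r2_p)

def ms_res(ss_res_p, n, p):
    """Residual mean square MS_res(p) = SS_res(p)/(n - p)."""
    return ss_res_p / (n - p)
```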
So $MS_{res}(p)$ can be used as a criterion for model selection, like $SS_{res}$. $SS_{res}(p)$ decreases with an increase in $p$. As $p$ increases, $MS_{res}(p)$ initially decreases, then stabilizes, and finally may increase when the reduction in $SS_{res}(p)$ is no longer sufficient to compensate for the loss of one degree of freedom in the factor $(n-p)$.

[Figure: typical plot of $MS_{res}(p)$ versus $p$, decreasing, flattening, then turning upward]

So plot $MS_{res}(p)$ versus $p$ and either
(i) choose $p$ corresponding to the minimum value of $MS_{res}(p)$,
(ii) choose $p$ for which $MS_{res}(p)$ is approximately equal to $MS_{res}$ based on the full model, or
(iii) choose $p$ near the point where the smallest value of $MS_{res}(p)$ turns upward.
Such a minimum value of $MS_{res}(p)$ will produce a $\bar{R}^2(p)$ with maximum value, since

$$\bar{R}^2(p) = 1 - \frac{n-1}{n-p}(1 - R_p^2) = 1 - \frac{n-1}{n-p}\cdot\frac{SS_{res}(p)}{SS_T} = 1 - \frac{SS_{res}(p)/(n-p)}{SS_T/(n-1)} = 1 - \frac{MS_{res}(p)}{SS_T/(n-1)}.$$

Thus the two criteria, viz. minimum $MS_{res}(p)$ and maximum $\bar{R}^2(p)$, are equivalent. A small numerical check of this equivalence is sketched below.
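A quick numerical check of the equivalence on simulated data (the data-generating choices here are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 6
X = rng.normal(size=(n, k))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)   # only two real signals
ss_t = np.sum((y - y.mean()) ** 2)

ms, adj = [], []
for m in range(1, k + 1):                        # nested models with m variables
    Z = np.column_stack([np.ones(n), X[:, :m]])
    p = Z.shape[1]                               # p = m + 1 terms
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    ss_res = np.sum((y - Z @ beta) ** 2)
    ms.append(ss_res / (n - p))                          # MS_res(p)
    adj.append(1 - (n - 1) / (n - p) * ss_res / ss_t)    # adjusted R^2
assert np.argmin(ms) == np.argmax(adj)           # both criteria pick the same p
```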
4. Mallows' $C_p$ statistic

Mallows' $C_p$ criterion is based on the mean squared error of a fitted value. Consider the model $y = X\beta + \varepsilon$ with partitioned $X = (X_p, X_r)$, where $X_p$ is an $n \times p$ matrix and $X_r$ is an $n \times r$ matrix, so that $X$ is $n \times q$ with $q = p + r$ and

$$y = X_p\beta_p + X_r\beta_r + \varepsilon, \qquad E(\varepsilon) = 0, \quad V(\varepsilon) = \sigma^2 I,$$

where $\beta = (\beta_p', \beta_r')'$. Consider the reduced model

$$y = X_p\beta_p + \delta, \qquad E(\delta) = 0, \quad V(\delta) = \sigma^2 I,$$

and predict $y$ based on the subset model as $\hat{y}_p = X_p\hat{\beta}_p$, where $\hat{\beta}_p = (X_p'X_p)^{-1}X_p'y$. The prediction of $y$ can also be seen as the estimation of $E(y) = X\beta$, so the expected weighted squared error loss of $\hat{y}_p$ is given by

$$\Gamma_p = E\big[(X_p\hat{\beta}_p - X\beta)'(X_p\hat{\beta}_p - X\beta)\big].$$

So the subset model can be considered an appropriate model if $\Gamma_p$ is small. Since $\hat{y}_p = H_p y$ with $H_p = X_p(X_p'X_p)^{-1}X_p'$,

$$\Gamma_p = E(y'H_p y) - 2\beta'X'H_p X\beta + \beta'X'X\beta.$$

Now

$$E(y'H_p y) = E\big[(X\beta + \varepsilon)'H_p(X\beta + \varepsilon)\big] = \beta'X'H_p X\beta + 0 + 0 + \sigma^2\,\mathrm{tr}(H_p) = p\sigma^2 + \beta'X'H_p X\beta,$$

using $E(\varepsilon) = 0$ and $\mathrm{tr}(H_p) = p$.
Thus

$$\Gamma_p = p\sigma^2 + \beta'X'H_p X\beta - 2\beta'X'H_p X\beta + \beta'X'X\beta = p\sigma^2 + \beta'X'(I - H_p)X\beta = p\sigma^2 + \beta'X'\bar{H}_p X\beta,$$

where $\bar{H}_p = I - X_p(X_p'X_p)^{-1}X_p'$. Since, using $\mathrm{tr}(\bar{H}_p) = n - p$,

$$E(y'\bar{H}_p y) = E\big[(X\beta + \varepsilon)'\bar{H}_p(X\beta + \varepsilon)\big] = \beta'X'\bar{H}_p X\beta + (n-p)\sigma^2,$$

it follows that

$$\beta'X'\bar{H}_p X\beta = E(y'\bar{H}_p y) - (n-p)\sigma^2,$$

and thus

$$\Gamma_p = (2p - n)\sigma^2 + E(y'\bar{H}_p y).$$

Note that $\Gamma_p$ depends on $\sigma^2$ and $\beta$, which are unknown, so $\Gamma_p$ cannot be used in practice. A solution to this problem is to replace $\sigma^2$ and $E(y'\bar{H}_p y)$ by their respective estimators, which gives

$$\hat{\Gamma}_p = (2p - n)\hat{\sigma}^2 + SS_{res}(p),$$

where $SS_{res}(p) = y'\bar{H}_p y$ is the residual sum of squares based on the subset model.
A rescaled version of $\hat{\Gamma}_p$ is

$$C_p = \frac{\hat{\Gamma}_p}{\hat{\sigma}^2} = (2p - n) + \frac{SS_{res}(p)}{\hat{\sigma}^2},$$

which is Mallows' $C_p$ statistic for the subset model. Usually

$$b = (X'X)^{-1}X'y \quad \text{and} \quad \hat{\sigma}^2 = \frac{(y - Xb)'(y - Xb)}{n - q}$$

are used to estimate $\beta$ and $\sigma^2$, respectively; both are based on the full model $y = X\beta + \varepsilon$.

When different subset models are considered, the models with the smallest $C_p$ are considered better than those with higher $C_p$. So a lower $C_p$ is preferable. If the subset model has negligible bias (in the case of $b$, the bias is zero), then $E[SS_{res}(p)] = (n-p)\sigma^2$ and

$$E[C_p \mid \text{Bias} = 0] = (2p - n) + \frac{(n-p)\sigma^2}{\sigma^2} = p.$$
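A sketch of the $C_p$ computation under these definitions (the helper name and the automatic handling of the intercept column are my own choices):

```python
import numpy as np

def mallows_cp(X_full, X_sub, y):
    """C_p = SS_res(p)/sigma_hat^2 - (n - 2p); sigma^2 is estimated from
    the full model. An intercept column is added to both models."""
    n = len(y)
    def fit(Xm):
        Z = np.column_stack([np.ones(n), Xm])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return np.sum((y - Z @ beta) ** 2), Z.shape[1]
    ss_full, q = fit(X_full)         # q = k + 1 terms in the full model
    sigma2 = ss_full / (n - q)       # unbiased estimator of sigma^2
    ss_p, p = fit(X_sub)             # p terms in the subset model
    return ss_p / sigma2 - (n - 2 * p)
```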
On a plot of $C_p$ versus $p$ for each regression equation, the line $C_p = p$ is a straight line passing through the origin, and the plot looks like the following:

[Figure: $C_p$ plot with the reference line $C_p = p$ and candidate models A, B, C]

Points with small bias will lie near the line, while points with significant bias will lie above the line. For example, point A has little bias, so it is close to the line, whereas points B and C have substantial bias, so they lie above the line. Moreover, point C lies above point A, so the model at A represents a model with lower total error. It may be preferable to accept some bias in the regression equation in order to reduce the average prediction error.

Note that an unbiased estimator of $\sigma^2$ is used in $C_p$, which is based on the assumption that the full model has negligible bias. In case the full model contains non-significant explanatory variables with zero regression coefficients, the same estimator of $\sigma^2$ will overestimate $\sigma^2$, and then $C_p$ will have smaller values. So the working of $C_p$ depends on a good choice of the estimator of $\sigma^2$.
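For illustration, a sketch that draws such a $C_p$ plot with the reference line $C_p = p$, using simulated data and nested subsets (all the numerical choices here are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n, k = 60, 5
X = rng.normal(size=(n, k))
y = 2 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

def ss_res(Xm):
    Z = np.column_stack([np.ones(n), Xm])
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.sum((y - Z @ b) ** 2)

sigma2 = ss_res(X) / (n - k - 1)                 # full-model sigma^2 estimate
ps = list(range(2, k + 2))                       # p = 2, ..., k+1 (nested subsets)
cps = [ss_res(X[:, :p - 1]) / sigma2 - (n - 2 * p) for p in ps]

plt.scatter(ps, cps)                             # one point per subset model
plt.axline((0, 0), slope=1, linestyle="--")      # reference line C_p = p
plt.xlabel("p"); plt.ylabel("$C_p$"); plt.show()
```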