Variable Selection and Model Building

Lee Barnaby Perkins
6 years ago

LINEAR REGRESSION ANALYSIS
MODULE XIII
Lecture 37: Variable Selection and Model Building
Dr. Shalabh
Department of Mathematics and Statistics, Indian Institute of Technology Kanpur

The complete regression analysis depends on the explanatory variables present in the model. It is understood in regression analysis that only correct and important explanatory variables appear in the model. In practice, after ensuring the correct functional form of the model, the analyst usually has a pool of explanatory variables which possibly influence the process or experiment. Generally, not all such candidate variables are used in the regression modeling; a subset of explanatory variables is chosen from this pool. How to determine an appropriate subset of explanatory variables to be used in regression is called the problem of variable selection.

While choosing a subset of explanatory variables, there are two possible options:
1. In order to make the model as realistic as possible, the analyst may include as many explanatory variables as possible.
2. In order to make the model as simple as possible, one may include only a few explanatory variables.

Both approaches have their own consequences. In fact, model building and subset selection have contradictory objectives. When a large number of variables is included in the model, all these factors can influence the prediction of the study variable $y$. On the other hand, when a small number of variables is included, the predictive variance of $\hat{y}$ decreases. Also, collecting observations on a larger number of variables involves more cost, time, labour etc. A compromise between these consequences is struck to select the best regression equation.

The problem of variable selection is addressed assuming that the functional form of the explanatory variable, e.g., $x^2$, $1/x$, $\log x$ etc., is known and that no outliers or influential observations are present in the data. Various statistical tools like residual analysis, identification of influential or high leverage observations, model adequacy etc. are linked to variable selection. In fact, all these problems should be solved simultaneously. Usually, these steps are employed iteratively. In the first step, a strategy for variable selection is chosen and the model is fitted with the selected variables. The fitted model is then checked for the functional form, outliers, influential observations etc. Based on the outcome, the model is re-examined and the selection of variables is reviewed again. Several iterations may be required before the final adequate model is decided.

There can be two types of incorrect model specification:
1. Omission/exclusion of relevant variables.
2. Inclusion of irrelevant variables.

Now we discuss the statistical consequences arising from both situations.

1. Exclusion of relevant variables

In order to keep the model simple, the analyst may delete some of the explanatory variables which may be of importance from the point of view of theoretical considerations. There can be several reasons behind such a decision, e.g., it may be hard to quantify variables like taste, intelligence etc. Sometimes it may be difficult to take correct observations on variables like income etc.

Let there be $k$ candidate explanatory variables, out of which suppose $r$ variables are included and $(k - r)$ variables are to be deleted from the model. So partition $X$ and $\beta$ as

$$X_{n \times k} = \left( X_{1\,(n \times r)} \;\; X_{2\,(n \times (k-r))} \right), \qquad \beta = \begin{pmatrix} \beta_{1\,(r \times 1)} \\ \beta_{2\,((k-r) \times 1)} \end{pmatrix}.$$

The model $y = X\beta + \varepsilon$, $E(\varepsilon) = 0$, $V(\varepsilon) = \sigma^2 I$ can be expressed as

$$y = X_1\beta_1 + X_2\beta_2 + \varepsilon,$$

which is called the full model or true model.

After dropping the $(k - r)$ explanatory variables in $X_2$ from the model, the new model is

$$y = X_1\beta_1 + \delta,$$

which is called the misspecified model or false model.

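Applying OLS to the false model, the OLSE of $\beta_1$ is

$$b_1 = (X_1'X_1)^{-1}X_1'y.$$

The estimation error is obtained as follows:

$$b_1 = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + \varepsilon) = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'\varepsilon,$$

$$b_1 - \beta_1 = \theta + (X_1'X_1)^{-1}X_1'\varepsilon, \quad \text{where } \theta = (X_1'X_1)^{-1}X_1'X_2\beta_2.$$

Thus

$$E(b_1 - \beta_1) = \theta + (X_1'X_1)^{-1}X_1'E(\varepsilon) = \theta,$$

which is a linear function of $\beta_2$, i.e., the coefficients of the excluded variables. So $b_1$ is biased, in general. The bias vanishes if $X_1'X_2 = 0$, i.e., $X_1$ and $X_2$ are orthogonal or uncorrelated.

The mean squared error matrix of $b_1$ is

$$MSE(b_1) = E\left[(b_1 - \beta_1)(b_1 - \beta_1)'\right]$$
$$= E\left[\theta\theta' + \theta\varepsilon'X_1(X_1'X_1)^{-1} + (X_1'X_1)^{-1}X_1'\varepsilon\theta' + (X_1'X_1)^{-1}X_1'\varepsilon\varepsilon'X_1(X_1'X_1)^{-1}\right]$$
$$= \theta\theta' + 0 + 0 + \sigma^2(X_1'X_1)^{-1}X_1'IX_1(X_1'X_1)^{-1}$$
$$= \theta\theta' + \sigma^2(X_1'X_1)^{-1}.$$

So efficiency generally declines. Note that the second term is the conventional form of the MSE.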
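The bias expression can be checked numerically. The following is a minimal sketch, assuming NumPy and an arbitrary simulated design (the dimensions, seed and coefficient values are illustrative and not from the lecture). Setting $\varepsilon = 0$ replaces $y$ by $E(y)$, so the subset OLS fit returns $E(b_1)$ exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, k = 50, 2, 4                      # illustrative sizes (assumed)

X1 = rng.normal(size=(n, r))            # retained variables
# Build X2 correlated with X1 so that X1'X2 != 0
X2 = X1 @ rng.normal(size=(r, k - r)) + rng.normal(size=(n, k - r))
beta1 = np.array([1.0, -2.0])           # illustrative coefficients (assumed)
beta2 = np.array([0.5, 3.0])

# theta = (X1'X1)^{-1} X1'X2 beta2
theta = np.linalg.solve(X1.T @ X1, X1.T @ X2 @ beta2)

# E(y) under the true model; OLS on the false model applied to E(y) gives E(b1)
y_mean = X1 @ beta1 + X2 @ beta2
expected_b1 = np.linalg.solve(X1.T @ X1, X1.T @ y_mean)

bias = expected_b1 - beta1
print("bias :", bias)
print("theta:", theta)
```

The two printed vectors should agree up to floating-point error, which is the statement $E(b_1 - \beta_1) = \theta$.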

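The residual sum of squares gives

$$\hat{\sigma}^2 = s^2 = \frac{SS_{res}}{n - r} = \frac{e'e}{n - r},$$

where $e = y - X_1b_1 = H_1y$ and $H_1 = I - X_1(X_1'X_1)^{-1}X_1'$. Thus

$$H_1y = H_1(X_1\beta_1 + X_2\beta_2 + \varepsilon) = 0 + H_1(X_2\beta_2 + \varepsilon) = H_1(X_2\beta_2 + \varepsilon),$$

$$y'H_1y = (X_1\beta_1 + X_2\beta_2 + \varepsilon)'H_1(X_2\beta_2 + \varepsilon) = \beta_2'X_2'H_1X_2\beta_2 + \beta_2'X_2'H_1\varepsilon + \varepsilon'H_1X_2\beta_2 + \varepsilon'H_1\varepsilon,$$

$$E(s^2) = \frac{1}{n - r}\left[E(\beta_2'X_2'H_1X_2\beta_2) + 0 + 0 + E(\varepsilon'H_1\varepsilon)\right] = \frac{1}{n - r}\left[\beta_2'X_2'H_1X_2\beta_2 + (n - r)\sigma^2\right] = \sigma^2 + \frac{1}{n - r}\beta_2'X_2'H_1X_2\beta_2.$$

Thus $s^2$ is a biased estimator of $\sigma^2$, and $s^2$ provides an overestimate of $\sigma^2$. Note that even if $X_1'X_2 = 0$, $s^2$ still gives an overestimate of $\sigma^2$. So the statistical inferences based on it will be faulty. The t-test and confidence region will be invalid in this case.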
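As a rough numerical check of the overestimation result, the sketch below compares two routes to $E(s^2)$ on a simulated design. It assumes NumPy; the sizes, seed, $\sigma^2$ and coefficients are made up for illustration. It uses the standard identity $E(y'Ay) = \mu'A\mu + \sigma^2\,\mathrm{tr}(A)$ for $E(y) = \mu$, $V(y) = \sigma^2 I$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, k = 40, 2, 3                      # illustrative sizes (assumed)
sigma2 = 1.5                            # assumed error variance

X1 = rng.normal(size=(n, r))
X2 = rng.normal(size=(n, k - r))
beta1 = np.array([1.0, 2.0])
beta2 = np.array([0.8])

# H1 = I - X1 (X1'X1)^{-1} X1'
H1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
mu = X1 @ beta1 + X2 @ beta2            # E(y) under the true model

# Route 1: E(y'H1 y)/(n - r) via the quadratic-form identity
expected_s2 = (mu @ H1 @ mu + sigma2 * np.trace(H1)) / (n - r)

# Route 2: the closed form sigma^2 + beta2'X2'H1 X2 beta2 / (n - r)
formula_s2 = sigma2 + (beta2 @ X2.T @ H1 @ X2 @ beta2) / (n - r)

print(expected_s2, formula_s2, sigma2)
```

Both routes should give the same value, and that value exceeds the assumed $\sigma^2$ because $H_1$ is positive semidefinite.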

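If the response is to be predicted at $x' = (x_1', x_2')$, then using the full model, the predicted value is

$$\hat{y} = x'b = x'(X'X)^{-1}X'y$$

with

$$E(\hat{y}) = x'\beta, \qquad Var(\hat{y}) = \sigma^2\left[1 + x'(X'X)^{-1}x\right].$$

When the subset model is used, the predictor is

$$\hat{y}_1 = x_1'b_1$$

and then

$$E(\hat{y}_1) = x_1'(X_1'X_1)^{-1}X_1'E(y) = x_1'(X_1'X_1)^{-1}X_1'E(X_1\beta_1 + X_2\beta_2 + \varepsilon) = x_1'(X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2)$$
$$= x_1'\beta_1 + x_1'(X_1'X_1)^{-1}X_1'X_2\beta_2 = x_1'\beta_1 + x_1'\theta.$$

Thus $\hat{y}_1$ is a biased predictor of $y$. It is unbiased when $X_1'X_2 = 0$. The MSE of the predictor is

$$MSE(\hat{y}_1) = \sigma^2\left[1 + x_1'(X_1'X_1)^{-1}x_1\right] + (x_1'\theta - x_2'\beta_2)^2.$$

Also,

$$Var(\hat{y}) \geq MSE(\hat{y}_1)$$

provided $V(\hat{\beta}_2) - \beta_2\beta_2'$ is positive semidefinite.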
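The squared term in $MSE(\hat{y}_1)$ can be illustrated with a small computation. This is a sketch under assumed values (NumPy, a simulated design, and a hypothetical prediction point `x1`, `x2`), not part of the original notes. It evaluates the mean of the subset predictor exactly by applying it to $E(y)$ and compares the gap from $x'\beta$ with $x_1'\theta - x_2'\beta_2$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 30, 2                            # illustrative sizes (assumed)
sigma2 = 1.0                            # assumed error variance

X1 = rng.normal(size=(n, r))
X2 = 0.7 * X1[:, :1] + rng.normal(size=(n, 1))   # one omitted, correlated variable
beta1 = np.array([1.0, -0.5])
beta2 = np.array([2.0])

x1 = np.array([1.0, 0.5])               # hypothetical prediction point
x2 = np.array([2.0])

G = np.linalg.inv(X1.T @ X1)            # (X1'X1)^{-1}
theta = G @ X1.T @ X2 @ beta2

# Mean of the subset predictor: x1'(X1'X1)^{-1} X1' E(y)
mean_yhat1 = x1 @ G @ X1.T @ (X1 @ beta1 + X2 @ beta2)
true_mean = x1 @ beta1 + x2 @ beta2     # x'beta under the full model
pred_bias = mean_yhat1 - true_mean

# MSE of the subset predictor from the formula in the text
variance_part = sigma2 * (1 + x1 @ G @ x1)
mse_yhat1 = variance_part + (x1 @ theta - x2 @ beta2) ** 2

print(pred_bias, x1 @ theta - x2 @ beta2, mse_yhat1)
```

The first two printed numbers should coincide, showing where the squared term in the MSE comes from.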

2. Inclusion of irrelevant variables

Sometimes, due to enthusiasm and to make the model more realistic, the analyst may include some explanatory variables that are not very relevant to the model. Such variables may contribute very little to the explanatory power of the model. This tends to reduce the degrees of freedom $(n - k)$, and consequently the validity of the inferences drawn may be questionable. For example, the value of the coefficient of determination will increase, indicating that the model is getting better, which may not really be true.

Let the true model be

$$y = X\beta + \varepsilon, \quad E(\varepsilon) = 0, \quad V(\varepsilon) = \sigma^2 I,$$

which comprises $k$ explanatory variables. Suppose now $r$ additional explanatory variables are added to the model and the resulting model becomes

$$y = X\beta + Z\gamma + \delta,$$

where $Z$ is an $n \times r$ matrix of $n$ observations on each of the $r$ explanatory variables, $\gamma$ is the $r \times 1$ vector of regression coefficients associated with $Z$, and $\delta$ is the disturbance term. This model is termed the false model.

Applying OLS to the false model, we get

$$\begin{pmatrix} b_F \\ c_F \end{pmatrix} = \begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z \end{pmatrix}^{-1} \begin{pmatrix} X'y \\ Z'y \end{pmatrix}, \quad \text{i.e.,} \quad \begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z \end{pmatrix} \begin{pmatrix} b_F \\ c_F \end{pmatrix} = \begin{pmatrix} X'y \\ Z'y \end{pmatrix},$$

$$X'Xb_F + X'Zc_F = X'y \qquad (1)$$
$$Z'Xb_F + Z'Zc_F = Z'y \qquad (2)$$

where $b_F$ and $c_F$ are the OLSEs of $\beta$ and $\gamma$, respectively.

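Premultiplying equation (2) by $X'Z(Z'Z)^{-1}$, we get

$$X'Z(Z'Z)^{-1}Z'Xb_F + X'Zc_F = X'Z(Z'Z)^{-1}Z'y. \qquad (3)$$

Subtracting equation (3) from (1), we get

$$\left[X'X - X'Z(Z'Z)^{-1}Z'X\right]b_F = X'y - X'Z(Z'Z)^{-1}Z'y$$
$$X'\left[I - Z(Z'Z)^{-1}Z'\right]Xb_F = X'\left[I - Z(Z'Z)^{-1}Z'\right]y$$
$$b_F = (X'H_ZX)^{-1}X'H_Zy, \quad \text{where } H_Z = I - Z(Z'Z)^{-1}Z'.$$

The estimation error of $b_F$ is

$$b_F - \beta = (X'H_ZX)^{-1}X'H_Zy - \beta = (X'H_ZX)^{-1}X'H_Z(X\beta + \varepsilon) - \beta = (X'H_ZX)^{-1}X'H_Z\varepsilon.$$

Thus

$$E(b_F - \beta) = (X'H_ZX)^{-1}X'H_ZE(\varepsilon) = 0,$$

so $b_F$ is unbiased even when some irrelevant variables are added to the model.

The covariance matrix is

$$V(b_F) = E\left[(b_F - \beta)(b_F - \beta)'\right] = E\left[(X'H_ZX)^{-1}X'H_Z\varepsilon\varepsilon'H_ZX(X'H_ZX)^{-1}\right]$$
$$= \sigma^2(X'H_ZX)^{-1}X'H_ZIH_ZX(X'H_ZX)^{-1} = \sigma^2(X'H_ZX)^{-1}.$$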
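The partialled-out expression for $b_F$ can be compared with a direct least-squares fit of $y$ on the stacked matrix $(X \;\; Z)$. The sketch below assumes NumPy and a simulated data set in which $Z$ is truly irrelevant; the sizes, seed and coefficients are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, r = 60, 3, 2                      # illustrative sizes (assumed)

X = rng.normal(size=(n, k))
Z = 0.5 * X[:, :r] + rng.normal(size=(n, r))     # irrelevant but correlated with X
beta = np.array([1.0, -1.0, 2.0])
y = X @ beta + rng.normal(size=n)       # true model: gamma = 0

# H_Z = I - Z (Z'Z)^{-1} Z'
H_Z = np.eye(n) - Z @ np.linalg.solve(Z.T @ Z, Z.T)

# b_F = (X'H_Z X)^{-1} X'H_Z y, and c_F recovered from equation (2)
b_F = np.linalg.solve(X.T @ H_Z @ X, X.T @ H_Z @ y)
c_F = np.linalg.solve(Z.T @ Z, Z.T @ (y - X @ b_F))

# Direct OLS on the false model y = X b + Z c + delta
W = np.hstack([X, Z])
coef, *_ = np.linalg.lstsq(W, y, rcond=None)

print(b_F, coef[:k])
print(c_F, coef[k:])
```

The first $k$ entries of the direct fit should match `b_F` and the remaining $r$ entries should match `c_F`, up to floating-point error.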

If OLS is applied to the true model, then

$$b_T = (X'X)^{-1}X'y$$

with

$$E(b_T) = \beta, \qquad V(b_T) = \sigma^2(X'X)^{-1}.$$

To compare $b_F$ and $b_T$, we use the following result.

Result: If $A$ and $B$ are two positive definite matrices, then $A - B$ is at least positive semidefinite if $B^{-1} - A^{-1}$ is also at least positive semidefinite.

Let

$$A = (X'H_ZX)^{-1}, \qquad B = (X'X)^{-1},$$
$$B^{-1} - A^{-1} = X'X - X'H_ZX = X'X - X'X + X'Z(Z'Z)^{-1}Z'X = X'Z(Z'Z)^{-1}Z'X,$$

which is at least a positive semidefinite matrix. This implies that the efficiency declines unless $X'Z = 0$. If $X'Z = 0$, i.e., $X$ and $Z$ are orthogonal, then both estimators are equally efficient.

The residual sum of squares under the false model is

$$SS_{res} = e_F'e_F$$

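where

$$e_F = y - Xb_F - Zc_F,$$
$$b_F = (X'H_ZX)^{-1}X'H_Zy,$$
$$c_F = (Z'Z)^{-1}Z'y - (Z'Z)^{-1}Z'Xb_F = (Z'Z)^{-1}Z'(y - Xb_F) = (Z'Z)^{-1}Z'\left[I - X(X'H_ZX)^{-1}X'H_Z\right]y = (Z'Z)^{-1}Z'H_{ZX}y,$$

with

$$H_Z = I - Z(Z'Z)^{-1}Z', \qquad H_{ZX} = I - X(X'H_ZX)^{-1}X'H_Z, \qquad H_{ZX}^2 = H_{ZX} \text{ (idempotent)}.$$

So

$$e_F = y - X(X'H_ZX)^{-1}X'H_Zy - Z(Z'Z)^{-1}Z'H_{ZX}y$$
$$= \left[I - X(X'H_ZX)^{-1}X'H_Z - Z(Z'Z)^{-1}Z'H_{ZX}\right]y$$
$$= \left[H_{ZX} - (I - H_Z)H_{ZX}\right]y$$
$$= H_ZH_{ZX}y$$
$$= H_{ZX}^{*}y, \quad \text{where } H_{ZX}^{*} = H_ZH_{ZX}.$$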

Thus

$$SS_{res} = e_F'e_F = y'H_{ZX}'H_ZH_ZH_{ZX}y = y'H_ZH_{ZX}y = y'H_{ZX}^{*}y,$$
$$E(SS_{res}) = \sigma^2\,\mathrm{tr}(H_{ZX}^{*}) = \sigma^2(n - k - r),$$
$$E\left[\frac{SS_{res}}{n - k - r}\right] = \sigma^2.$$

So $\dfrac{SS_{res}}{n - k - r}$ is an unbiased estimator of $\sigma^2$.

A comparison of exclusion and inclusion of variables is as follows:

| | Exclusion type | Inclusion type |
|---|---|---|
| Estimation of coefficients | Biased | Unbiased |
| Efficiency | Generally declines | Declines |
| Estimation of disturbance term | Over-estimate | Unbiased |
| Conventional test of hypothesis and confidence region | Invalid and faulty inferences | Valid though erroneous |
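Two entries of the inclusion-type column, declining efficiency and unbiased estimation of $\sigma^2$, can be checked numerically. The following is a minimal sketch assuming NumPy and a simulated design (sizes and seed are illustrative). It verifies that $(X'H_ZX)^{-1} - (X'X)^{-1}$ has no negative eigenvalues, and that $H_{ZX}^{*} = H_ZH_{ZX}$ annihilates $X$ and has trace $n - k - r$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, r = 25, 3, 2                      # illustrative sizes (assumed)

X = rng.normal(size=(n, k))
Z = 0.6 * X[:, :r] + rng.normal(size=(n, r))     # irrelevant but correlated with X

H_Z = np.eye(n) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
M = np.linalg.inv(X.T @ H_Z @ X)        # (X'H_Z X)^{-1}, i.e. V(b_F) / sigma^2
H_ZX = np.eye(n) - X @ M @ X.T @ H_Z
H_star = H_Z @ H_ZX                     # H*_ZX, so that e_F = H*_ZX y

# Efficiency: [V(b_F) - V(b_T)] / sigma^2 should be positive semidefinite
diff = M - np.linalg.inv(X.T @ X)
min_eig = np.linalg.eigvalsh((diff + diff.T) / 2).min()

# Unbiasedness of SS_res / (n - k - r): tr(H*_ZX) = n - k - r and H*_ZX X = 0
trace_H_star = np.trace(H_star)

print(min_eig, trace_H_star, n - k - r)
```

The smallest eigenvalue should be non-negative up to rounding, and the trace should equal $n - k - r$, which is the divisor that makes $SS_{res}/(n - k - r)$ unbiased for $\sigma^2$.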
