Variable Selection and Model Building


LINEAR REGRESSION ANALYSIS, MODULE XIII, Lecture 37: Variable Selection and Model Building. Dr. Shalabh, Department of Mathematics and Statistics, Indian Institute of Technology Kanpur.

The complete regression analysis depends on the explanatory variables present in the model. It is assumed in regression analysis that only the correct and important explanatory variables appear in the model. In practice, after ensuring the correct functional form of the model, the analyst usually has a pool of explanatory variables which possibly influence the process or experiment. Generally, not all of these candidate variables are used in the regression model; rather, a subset of explanatory variables is chosen from this pool. How to determine an appropriate subset of explanatory variables to be used in the regression is called the problem of variable selection.

While choosing a subset of explanatory variables, there are two possible options:
1. In order to make the model as realistic as possible, the analyst may include as many explanatory variables as possible.
2. In order to make the model as simple as possible, one may include only a small number of explanatory variables.

Both approaches have their own consequences. In fact, model building and subset selection have contradicting objectives. When a large number of variables is included in the model, these factors can influence the prediction of the study variable y, but the predictive variance of ŷ increases. On the other hand, when a small number of variables is included, the predictive variance of ŷ decreases. Also, collecting observations on a larger number of variables involves more cost, time, labour, etc. A compromise among these consequences is struck to select the best regression equation.

The problem of variable selection is addressed assuming that the functional form of each explanatory variable, e.g., $x^2$, $\log x$, etc., is known, and that no outliers or influential observations are present in the data. Various statistical tools like residual analysis, identification of influential or high-leverage observations, model adequacy checks, etc. are linked to variable selection. In fact, all these issues should be addressed simultaneously, and usually these steps are employed iteratively. In the first step, a strategy for variable selection is chosen and the model is fitted with the selected variables. The fitted model is then checked for the correct functional form, outliers, influential observations, etc. Based on the outcome, the model is re-examined and the selection of variables is reviewed again. Several iterations may be required before the final adequate model is decided.

There can be two types of incorrect model specification:
1. Omission/exclusion of relevant variables.
2. Inclusion of irrelevant variables.

Now we discuss the statistical consequences arising from both situations.

1. Exclusion of relevant variables

In order to keep the model simple, the analyst may delete some of the explanatory variables which may be important from the point of view of theoretical considerations. There can be several reasons behind such a decision; e.g., it may be hard to quantify variables like taste, intelligence, etc., and sometimes it may be difficult to take correct observations on variables like income.

Let there be $k$ candidate explanatory variables, out of which $r$ variables are retained and $(k - r)$ variables are to be deleted from the model. So partition $X$ and $\beta$ as
$$X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix},$$
where $X_1$ is $n \times r$, $X_2$ is $n \times (k - r)$, $\beta_1$ is $r \times 1$ and $\beta_2$ is $(k - r) \times 1$.

The model
$$y = X\beta + \varepsilon, \qquad E(\varepsilon) = 0, \qquad V(\varepsilon) = \sigma^2 I$$
can be expressed as
$$y = X_1\beta_1 + X_2\beta_2 + \varepsilon,$$
which is called the full model or true model.

After dropping the $(k - r)$ explanatory variables in $X_2$, the new model is
$$y = X_1\beta_1 + \delta,$$
which is called the misspecified model or false model.
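As a quick numerical illustration of the two fits, the following sketch (a hypothetical Python/NumPy simulation; all variable names are illustrative and not from the lecture) generates data from a full model and compares the OLS fit of the full model with the fit of a misspecified model that drops $X_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, k = 200, 2, 3                       # n observations; keep r of k regressors

X1 = rng.normal(size=(n, r))              # retained regressors
X2 = 0.5 * X1[:, [0]] + rng.normal(size=(n, k - r))   # dropped regressor, correlated with X1
beta1 = np.array([1.0, -2.0])
beta2 = np.array([1.5])
sigma = 1.0
y = X1 @ beta1 + X2 @ beta2 + sigma * rng.normal(size=n)

# OLS on the full (true) model y = X1*beta1 + X2*beta2 + eps
X_full = np.hstack([X1, X2])
b_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# OLS on the misspecified (false) model y = X1*beta1 + delta
b1 = np.linalg.lstsq(X1, y, rcond=None)[0]

print("full-model estimates :", b_full)   # close to (1, -2, 1.5)
print("subset estimates b1  :", b1)       # biased for beta1 because X2 is omitted
```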

Applying OLS to the false model, the OLSE of $\beta_1$ is
$$b_1 = (X_1'X_1)^{-1}X_1'y.$$
The estimation error is obtained as follows:
$$b_1 = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + \varepsilon) = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'\varepsilon,$$
so
$$b_1 - \beta_1 = \theta + (X_1'X_1)^{-1}X_1'\varepsilon, \qquad \theta = (X_1'X_1)^{-1}X_1'X_2\beta_2.$$
Thus
$$E(b_1 - \beta_1) = \theta + (X_1'X_1)^{-1}X_1'E(\varepsilon) = \theta,$$
which is a linear function of $\beta_2$, i.e., of the coefficients of the excluded variables. So $b_1$ is biased, in general. The bias vanishes if $X_1'X_2 = 0$, i.e., if $X_1$ and $X_2$ are orthogonal (uncorrelated).

The mean squared error matrix of $b_1$ is
$$\begin{aligned}
MSE(b_1) &= E(b_1 - \beta_1)(b_1 - \beta_1)' \\
&= E\left[\theta\theta' + \theta\varepsilon'X_1(X_1'X_1)^{-1} + (X_1'X_1)^{-1}X_1'\varepsilon\theta' + (X_1'X_1)^{-1}X_1'\varepsilon\varepsilon'X_1(X_1'X_1)^{-1}\right] \\
&= \theta\theta' + 0 + 0 + \sigma^2(X_1'X_1)^{-1}X_1' I X_1(X_1'X_1)^{-1} \\
&= \theta\theta' + \sigma^2(X_1'X_1)^{-1}.
\end{aligned}$$
So the efficiency generally declines. Note that the second term, $\sigma^2(X_1'X_1)^{-1}$, is the conventional form of the MSE (covariance matrix) of the OLSE.
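The bias expression $\theta = (X_1'X_1)^{-1}X_1'X_2\beta_2$ can be checked numerically. The sketch below (a hypothetical Python/NumPy example; names are illustrative) compares the Monte Carlo average of $b_1$ with $\beta_1 + \theta$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_sim = 200, 2000
X1 = rng.normal(size=(n, 2))
X2 = 0.6 * X1[:, [0]] + rng.normal(size=(n, 1))   # omitted regressor, correlated with X1
beta1, beta2, sigma = np.array([1.0, -2.0]), np.array([1.5]), 1.0

# theoretical bias of b1 when X2 is omitted: theta = (X1'X1)^{-1} X1'X2 beta2
theta = np.linalg.solve(X1.T @ X1, X1.T @ X2 @ beta2)

b1_draws = np.empty((n_sim, 2))
for s in range(n_sim):
    y = X1 @ beta1 + X2 @ beta2 + sigma * rng.normal(size=n)
    b1_draws[s] = np.linalg.lstsq(X1, y, rcond=None)[0]

print("Monte Carlo mean of b1:", b1_draws.mean(axis=0))
print("beta1 + theta         :", beta1 + theta)
```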

The residual sum of squares gives
$$s^2 = \frac{SS_{res}}{n - r} = \frac{e'e}{n - r},$$
where
$$e = y - X_1 b_1 = H_1 y, \qquad H_1 = I - X_1(X_1'X_1)^{-1}X_1'.$$
Then
$$H_1 y = H_1(X_1\beta_1 + X_2\beta_2 + \varepsilon) = 0 + H_1(X_2\beta_2 + \varepsilon) = H_1(X_2\beta_2 + \varepsilon)$$
and
$$y'H_1 y = (X_1\beta_1 + X_2\beta_2 + \varepsilon)'H_1(X_2\beta_2 + \varepsilon) = \beta_2'X_2'H_1X_2\beta_2 + \beta_2'X_2'H_1\varepsilon + \varepsilon'H_1X_2\beta_2 + \varepsilon'H_1\varepsilon.$$
Thus
$$\begin{aligned}
E(s^2) &= \frac{1}{n - r}\left[E(\beta_2'X_2'H_1X_2\beta_2) + 0 + 0 + E(\varepsilon'H_1\varepsilon)\right] \\
&= \frac{1}{n - r}\left[\beta_2'X_2'H_1X_2\beta_2 + (n - r)\sigma^2\right] \\
&= \sigma^2 + \frac{\beta_2'X_2'H_1X_2\beta_2}{n - r}.
\end{aligned}$$
Thus $s^2$ is a biased estimator of $\sigma^2$, and $s^2$ provides an overestimate of $\sigma^2$. Note that even if $X_1'X_2 = 0$, $s^2$ still gives an overestimate of $\sigma^2$. So the statistical inferences based on it will be faulty: the t-tests and confidence regions will be invalid in this case.
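The overestimation of $\sigma^2$ can also be checked by simulation. In the hypothetical Python/NumPy sketch below (illustrative names only), the Monte Carlo average of $s^2$ from the subset fit is compared with $\sigma^2 + \beta_2'X_2'H_1X_2\beta_2/(n - r)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, n_sim = 100, 2, 5000
X1 = rng.normal(size=(n, r))
X2 = 0.6 * X1[:, [0]] + rng.normal(size=(n, 1))
beta1, beta2, sigma2 = np.array([1.0, -2.0]), np.array([1.5]), 1.0

H1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)   # residual-maker matrix of X1
bias_term = (beta2 @ X2.T @ H1 @ X2 @ beta2) / (n - r)

s2_draws = np.empty(n_sim)
for s in range(n_sim):
    y = X1 @ beta1 + X2 @ beta2 + np.sqrt(sigma2) * rng.normal(size=n)
    e = H1 @ y                                    # residuals of the subset fit
    s2_draws[s] = e @ e / (n - r)

print("Monte Carlo mean of s^2          :", s2_draws.mean())
print("sigma^2 + b2'X2'H1 X2 b2 / (n-r) :", sigma2 + bias_term)
```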

If the response is to be predicted at $x' = (x_1', x_2')$, then using the full model the predicted value is
$$\hat y = x'b = x'(X'X)^{-1}X'y$$
with
$$E(\hat y) = x'\beta, \qquad Var(\hat y) = \sigma^2\left[1 + x'(X'X)^{-1}x\right].$$
When the subset model is used, the predictor is
$$\hat y_1 = x_1' b_1,$$
and then
$$\begin{aligned}
E(\hat y_1) &= x_1'(X_1'X_1)^{-1}X_1'E(y) \\
&= x_1'(X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2) \\
&= x_1'\beta_1 + x_1'(X_1'X_1)^{-1}X_1'X_2\beta_2 \\
&= x_1'\beta_1 + x_1'\theta.
\end{aligned}$$
Thus $\hat y_1$ is a biased predictor of $y$. It is unbiased when $X_1'X_2 = 0$. The MSE of the predictor is
$$MSE(\hat y_1) = \sigma^2\left[1 + x_1'(X_1'X_1)^{-1}x_1\right] + \left(x_1'\theta - x_2'\beta_2\right)^2.$$
Also, $Var(\hat y) \ge MSE(\hat y_1)$ provided $V(\hat\beta_2) - \beta_2\beta_2'$ is positive semidefinite, where $\hat\beta_2$ is the OLSE of $\beta_2$ under the full model.
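The bias of the subset predictor at a new point can be illustrated numerically. The following hypothetical Python/NumPy sketch (illustrative names) compares the Monte Carlo mean of $\hat y_1 = x_1'b_1$ with $x_1'\beta_1 + x_1'\theta$ and with the true mean response.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_sim = 150, 3000
X1 = rng.normal(size=(n, 2))
X2 = 0.7 * X1[:, [1]] + rng.normal(size=(n, 1))
beta1, beta2, sigma = np.array([1.0, -2.0]), np.array([1.5]), 1.0

x1_new = np.array([0.5, 1.0])           # new point, partitioned as (x1, x2)
x2_new = np.array([0.3])

theta = np.linalg.solve(X1.T @ X1, X1.T @ X2 @ beta2)

preds = np.empty(n_sim)
for s in range(n_sim):
    y = X1 @ beta1 + X2 @ beta2 + sigma * rng.normal(size=n)
    b1 = np.linalg.lstsq(X1, y, rcond=None)[0]
    preds[s] = x1_new @ b1              # subset-model prediction

print("Monte Carlo mean of y_hat_1 :", preds.mean())
print("x1'beta1 + x1'theta         :", x1_new @ beta1 + x1_new @ theta)
print("true mean response x'beta   :", x1_new @ beta1 + x2_new @ beta2)
```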

2. Inclusion of irrelevant variables

Sometimes, out of enthusiasm and to make the model more realistic, the analyst may include some explanatory variables that are not very relevant to the model. Such variables may contribute very little to the explanatory power of the model. This tends to reduce the degrees of freedom $(n - k)$, and consequently the validity of the inferences drawn may be questionable. For example, the value of the coefficient of determination will increase, indicating that the model is getting better, which may not really be true.

Let the true model be
$$y = X\beta + \varepsilon, \qquad E(\varepsilon) = 0, \qquad V(\varepsilon) = \sigma^2 I,$$
which comprises $k$ explanatory variables. Suppose now that $r$ additional explanatory variables are added to the model, so that the resulting model becomes
$$y = X\beta + Z\gamma + \delta,$$
where $Z$ is an $n \times r$ matrix of $n$ observations on each of the $r$ additional explanatory variables, $\gamma$ is the $r \times 1$ vector of regression coefficients associated with $Z$, and $\delta$ is the disturbance term. This model is termed the false model.

Applying OLS to the false model, we get
$$\begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z \end{bmatrix} \begin{bmatrix} b \\ c \end{bmatrix} = \begin{bmatrix} X'y \\ Z'y \end{bmatrix},$$
i.e.,
$$X'Xb + X'Zc = X'y \qquad (1)$$
$$Z'Xb + Z'Zc = Z'y \qquad (2)$$
where $b$ and $c$ are the OLSEs of $\beta$ and $\gamma$, respectively.
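As a sanity check on the partitioned normal equations, the following hypothetical Python/NumPy sketch (illustrative names, not from the lecture) verifies that solving (1) and (2) jointly reproduces ordinary OLS on the augmented regressor matrix $[X\; Z]$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, r = 120, 3, 2
X = rng.normal(size=(n, k))              # relevant regressors
Z = rng.normal(size=(n, r))              # irrelevant regressors (gamma = 0 in truth)
beta, sigma = np.array([1.0, -2.0, 0.5]), 1.0
y = X @ beta + sigma * rng.normal(size=n)

# Stacked normal equations: [[X'X, X'Z], [Z'X, Z'Z]] [b; c] = [X'y; Z'y]
A = np.block([[X.T @ X, X.T @ Z], [Z.T @ X, Z.T @ Z]])
rhs = np.concatenate([X.T @ y, Z.T @ y])
bc = np.linalg.solve(A, rhs)

# Direct OLS on the augmented matrix [X Z]
bc_direct = np.linalg.lstsq(np.hstack([X, Z]), y, rcond=None)[0]

print("normal-equation solution:", bc)
print("direct OLS on [X Z]     :", bc_direct)   # the two should agree
```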

Premultiplying equation (2) by $X'Z(Z'Z)^{-1}$, we get
$$X'Z(Z'Z)^{-1}Z'Xb + X'Zc = X'Z(Z'Z)^{-1}Z'y. \qquad (3)$$
Subtracting equation (3) from (1), we get
$$\left[X'X - X'Z(Z'Z)^{-1}Z'X\right]b = X'y - X'Z(Z'Z)^{-1}Z'y$$
$$X'\left[I - Z(Z'Z)^{-1}Z'\right]Xb = X'\left[I - Z(Z'Z)^{-1}Z'\right]y$$
$$b = (X'H_Z X)^{-1}X'H_Z y,$$
where $H_Z = I - Z(Z'Z)^{-1}Z'$.

The estimation error of $b$ is
$$\begin{aligned}
b - \beta &= (X'H_Z X)^{-1}X'H_Z y - \beta \\
&= (X'H_Z X)^{-1}X'H_Z(X\beta + \varepsilon) - \beta \\
&= (X'H_Z X)^{-1}X'H_Z\varepsilon.
\end{aligned}$$
Thus
$$E(b - \beta) = (X'H_Z X)^{-1}X'H_Z E(\varepsilon) = 0,$$
so $b$ is unbiased even when some irrelevant variables are added to the model.

The covariance matrix is
$$\begin{aligned}
V(b) &= E(b - \beta)(b - \beta)' \\
&= E\left[(X'H_Z X)^{-1}X'H_Z\varepsilon\varepsilon'H_Z X(X'H_Z X)^{-1}\right] \\
&= \sigma^2(X'H_Z X)^{-1}X'H_Z I H_Z X(X'H_Z X)^{-1} \\
&= \sigma^2(X'H_Z X)^{-1}.
\end{aligned}$$
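The closed form $b = (X'H_Z X)^{-1}X'H_Z y$ can be verified against the coefficients on $X$ from the augmented regression, and its unbiasedness checked by Monte Carlo. The following is a hypothetical Python/NumPy sketch (illustrative names).

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, r, n_sim = 120, 3, 2, 2000
X = rng.normal(size=(n, k))
Z = rng.normal(size=(n, r))               # irrelevant regressors
beta, sigma = np.array([1.0, -2.0, 0.5]), 1.0

HZ = np.eye(n) - Z @ np.linalg.solve(Z.T @ Z, Z.T)   # H_Z = I - Z(Z'Z)^{-1}Z'

b_draws = np.empty((n_sim, k))
for s in range(n_sim):
    y = X @ beta + sigma * rng.normal(size=n)
    b_draws[s] = np.linalg.solve(X.T @ HZ @ X, X.T @ HZ @ y)   # b = (X'HZ X)^{-1} X'HZ y

# One-draw check: b matches the X-coefficients of OLS on [X Z]
y = X @ beta + sigma * rng.normal(size=n)
b_closed = np.linalg.solve(X.T @ HZ @ X, X.T @ HZ @ y)
b_aug = np.linalg.lstsq(np.hstack([X, Z]), y, rcond=None)[0][:k]

print("closed-form b       :", b_closed)
print("augmented OLS on X  :", b_aug)
print("Monte Carlo mean b  :", b_draws.mean(axis=0), " (close to beta)")
```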

If OLS is applied to the true model, then
$$b_T = (X'X)^{-1}X'y$$
with
$$E(b_T) = \beta, \qquad V(b_T) = \sigma^2(X'X)^{-1}.$$
To compare $b$ and $b_T$ we use the following result.

Result: If $A$ and $B$ are two positive definite matrices, then $A - B$ is at least positive semidefinite if $B^{-1} - A^{-1}$ is also at least positive semidefinite.

Let
$$A = (X'H_Z X)^{-1}, \qquad B = (X'X)^{-1},$$
then
$$B^{-1} - A^{-1} = X'X - X'H_Z X = X'X - X'X + X'Z(Z'Z)^{-1}Z'X = X'Z(Z'Z)^{-1}Z'X,$$
which is at least positive semidefinite. This implies that the efficiency of $b$ declines, i.e., $V(b) - V(b_T)$ is at least positive semidefinite, unless $X'Z = 0$. If $X'Z = 0$, i.e., $X$ and $Z$ are orthogonal, then both estimators are equally efficient.

The residual sum of squares under the false model is
$$SS_{res} = e'e,$$
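The efficiency loss can be made concrete by comparing the two covariance matrices directly. In the following hypothetical Python/NumPy sketch (illustrative names), $V(b) - V(b_T) = \sigma^2\left[(X'H_Z X)^{-1} - (X'X)^{-1}\right]$ is computed and its eigenvalues are checked to be nonnegative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, r, sigma2 = 120, 3, 2, 1.0
X = rng.normal(size=(n, k))
# Make Z correlated with X so that X'Z != 0 and efficiency genuinely declines
Z = X[:, :r] * 0.8 + rng.normal(size=(n, r))

HZ = np.eye(n) - Z @ np.linalg.solve(Z.T @ Z, Z.T)

V_b  = sigma2 * np.linalg.inv(X.T @ HZ @ X)   # covariance under the false model
V_bT = sigma2 * np.linalg.inv(X.T @ X)        # covariance under the true model

diff = V_b - V_bT
print("eigenvalues of V(b) - V(b_T):", np.linalg.eigvalsh(diff))   # all >= 0 (up to rounding)
```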

where
$$\begin{aligned}
e &= y - Xb - Zc, \\
b &= (X'H_Z X)^{-1}X'H_Z y, \\
c &= (Z'Z)^{-1}Z'(y - Xb) = (Z'Z)^{-1}Z'\left[I - X(X'H_Z X)^{-1}X'H_Z\right]y,
\end{aligned}$$
so that
$$y - Xb = \left[I - X(X'H_Z X)^{-1}X'H_Z\right]y.$$
Therefore
$$\begin{aligned}
e &= y - Xb - Zc \\
&= \left[I - Z(Z'Z)^{-1}Z'\right]\left[I - X(X'H_Z X)^{-1}X'H_Z\right]y \\
&= H_Z\left[I - X(X'H_Z X)^{-1}X'H_Z\right]y \\
&= \left[I - H_Z X(X'H_Z X)^{-1}X'H_Z\right]H_Z y \\
&= H_{H_Z X} H_Z y \\
&= H^* y,
\end{aligned}$$
where $H_{H_Z X} = I - H_Z X(X'H_Z X)^{-1}X'H_Z$ is symmetric and idempotent (it is the residual-maker matrix of $H_Z X$) and $H^* = H_{H_Z X} H_Z = H_Z - H_Z X(X'H_Z X)^{-1}X'H_Z$, which is also symmetric and idempotent.

Thus
$$SS_{res} = e'e = y'H^{*\prime}H^* y = y'H^* y,$$
and, since $H^* X = 0$,
$$E(SS_{res}) = E(\varepsilon'H^*\varepsilon) = \sigma^2\,\mathrm{tr}(H^*) = \sigma^2(n - k - r),$$
so that
$$E\!\left(\frac{SS_{res}}{n - k - r}\right) = \sigma^2.$$
So $SS_{res}/(n - k - r)$ is an unbiased estimator of $\sigma^2$.

A comparison of exclusion and inclusion of variables is as follows:

                                                  Exclusion type                  Inclusion type
Estimation of coefficients                        Biased                          Unbiased
Efficiency                                        Generally declines              Declines
Estimation of the disturbance term variance       Overestimate                    Unbiased
Conventional tests of hypothesis and              Invalid and faulty inferences   Valid, though erroneous
confidence regions
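Finally, the degrees-of-freedom result $\mathrm{tr}(H^*) = n - k - r$ and the unbiasedness of $SS_{res}/(n - k - r)$ can be checked numerically. The following is a hypothetical Python/NumPy sketch (illustrative names, not from the lecture).

```python
import numpy as np

rng = np.random.default_rng(7)
n, k, r, sigma2, n_sim = 100, 3, 2, 1.0, 4000
X = rng.normal(size=(n, k))
Z = rng.normal(size=(n, r))
beta = np.array([1.0, -2.0, 0.5])

HZ = np.eye(n) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
H_star = HZ - HZ @ X @ np.linalg.solve(X.T @ HZ @ X, X.T @ HZ)

print("tr(H*) =", np.trace(H_star), " vs n - k - r =", n - k - r)

ss_draws = np.empty(n_sim)
for s in range(n_sim):
    y = X @ beta + np.sqrt(sigma2) * rng.normal(size=n)
    e = H_star @ y                      # residuals of the false (augmented) model
    ss_draws[s] = e @ e / (n - k - r)

print("Monte Carlo mean of SS_res/(n-k-r):", ss_draws.mean(), " vs sigma^2 =", sigma2)
```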