Motivation for multiple regression


1. Simple regression puts all factors other than X in u and treats them as unobserved. Effectively, the simple regression does not account for other factors.

2. The slope coefficient β_1 in simple regression has a causal interpretation only if the independent variable X is exogenous, i.e., cov(X, u) = 0. This is an assumption that we need to check.

3. The assumption cov(X, u) = 0 can be too strong to hold in reality. If any variable in u is correlated with X, the independent variable becomes endogenous, and the slope coefficient no longer has a causal interpretation (in this case β_1 just measures association).

4. We can show that

   \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\sum_i (x_i - \bar{x}) y_i}{\sum_i (x_i - \bar{x})^2}   (1)

   = \frac{\sum_i (x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i)}{\sum_i (x_i - \bar{x})^2} = \beta_1 + \frac{\sum_i (x_i - \bar{x}) u_i}{\sum_i (x_i - \bar{x})^2}   (2)

5. The last result is important. It implies that

   \hat{\beta}_1 \to \beta_1 + \frac{cov(X, u)}{\sigma_X^2} \quad (\text{as } n \to \infty)   (3)

   \hat{\beta}_1 \to \begin{cases} \beta_1, & \text{if } cov(X, u) = 0 \\ \beta_1 + bias, & \text{if } cov(X, u) \neq 0, \quad bias = \frac{cov(X, u)}{\sigma_X^2} \end{cases}   (4)

   In words, the estimated slope coefficient converges to the true value (so is consistent) if X is exogenous. When X is endogenous, \hat{\beta}_1 is a biased estimate. The bias is given by cov(X, u)/σ_X².

6. One extreme case is β_1 = 0, so X has no causal effect on Y at all. But because X is correlated with some variables in the error term, the regression can still produce a statistically significant \hat{\beta}_1. In other words, the regression indicates a spurious causality that does not exist.

7. For example, we may regress salary on education and put ability in the error term. Because ability and education are correlated, the result of the simple regression may be biased. What is captured by the simple regression may be the effect of ability on salary, not the effect of education on salary.

8. A multiple regression can explicitly account for many, if not all, other factors by taking them out of the error term. Consider the simplest multiple regression

   Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + e,   (5)

   where e is the error term, for which we assume

   E(e | X_1, X_2) = 0.   (6)

9. This multiple regression becomes the simple regression if we define the error term in the simple regression as u \equiv \beta_2 X_2 + e. If we run the simple regression of Y on X_1, we can show

   \hat{\beta}_1 \to \beta_1 + bias, \quad bias = \frac{cov(X_1, u)}{\sigma_{X_1}^2} = \beta_2 \frac{cov(X_1, X_2)}{\sigma_{X_1}^2}   (7)

   If β_2 > 0 and cov(X_1, X_2) > 0, then bias > 0 and \hat{\beta}_1 > \beta_1, so the simple regression overestimates the effect of X_1 on Y. See Table 3.2 of the textbook for the other possibilities.

10. We call X_2 the omitted variable if (i) it has a causal effect on Y (so β_2 ≠ 0), (ii) it is correlated with the key regressor (so cov(X_1, X_2) ≠ 0), and (iii) it is excluded from the regression (being put in the error term). The bias caused by an omitted variable is called omitted variable bias.

11. If we use non-experimental data and the goal is to prove causality, omitted variable bias is the top issue we need to address. We have to ask whether there is any variable in the error term that is an omitted variable.

12. Consider another example of omitted variable bias. Earlier we ran the simple regression of house price on the number of bathrooms X_1. In that simple regression the error term contains the size of the house X_2, which is the omitted variable: X_2 affects the house price and is correlated with X_1. Therefore the slope of the simple regression is a biased estimate of the true causal effect of X_1 on Y.
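To make the omitted variable bias in (7) concrete, here is a minimal simulation sketch in Stata. The data and the variable names x1, x2, y are hypothetical (not the house data); the true coefficients are set to β_1 = 2 and β_2 = 3, and x1 is built to be positively correlated with x2, so the short regression should overestimate β_1.

   * sketch: simulated data (hypothetical), not the house data
   clear
   set obs 1000
   set seed 12345
   gen x2 = rnormal()                    // the would-be omitted variable
   gen x1 = 0.5*x2 + rnormal()           // key regressor, positively correlated with x2
   gen y  = 1 + 2*x1 + 3*x2 + rnormal()  // true model: beta1 = 2, beta2 = 3
   reg y x1        // short regression: slope biased upward, about beta1 + beta2*cov(x1,x2)/var(x1)
   reg y x1 x2     // long regression: slope on x1 close to the true value 2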

Estimating multiple regression

1. Consider a multiple regression given by

   Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + u   (1)

   Note there are k + 1 regressors, one of which is the constant (the intercept term).

2. The unknown parameters are β_j, j = 0, 1, ..., k, and σ² = var(u).

3. The key assumption needed for a causal interpretation in multiple regression is

   E(u | X_1, X_2, \ldots, X_k) = E(u) = 0   (2)

   Assumption (2) is more likely to hold in reality than in the simple regression because u now contains fewer variables. That means the multiple regression is more suitable than the simple regression for proving causality.

4. Notice that Assumption (2) does NOT require cov(X_i, X_j) = 0. It is OK for the regressors in the multiple regression to be correlated with each other. Actually that is the whole point of multiple regression, which explicitly controls for X_2, ..., X_k, and any one of them can be correlated with the key regressor X_1. Assumption (2) only requires that the regressors be uncorrelated with the error term.

5. The OLS estimators for the coefficients, denoted by \hat{\beta}_j, j = 0, 1, ..., k, are obtained by solving k + 1 equations, which are the first order conditions (FOC) for minimizing the residual sum of squares:

   \sum_i \hat{u}_i = 0   (FOC 1)
   \sum_i x_{1i} \hat{u}_i = 0   (FOC 2)
   \ldots
   \sum_i x_{ki} \hat{u}_i = 0   (FOC k+1)   (3)

   The general formula for \hat{\beta}_j is complicated; matrix algebra (not required) is needed to obtain a simpler formula.

6. However, there is a simple formula for \hat{\beta}_1 if we follow a two-step procedure.

   Theorem 1 (Frisch-Waugh Theorem) Let \hat{r} be the residual of the auxiliary regression of X_1 on X_2, ..., X_k. Then the OLS estimator for β_1 in multiple regression (1) is

   \hat{\beta}_1 = \frac{\sum_i \hat{r}_i y_i}{\sum_i \hat{r}_i^2}   (4)

7. The Frisch-Waugh Theorem indicates that we can obtain \hat{\beta}_1 in two steps:

   (a) Step 1: regress X_1 on X_2, ..., X_k and keep the residual \hat{r}
   (b) Step 2: regress Y on \hat{r} without an intercept term

8. The residual \hat{r} measures the part of X_1 that cannot be explained by X_2, ..., X_k. Put differently, \hat{r} captures the part of X_1 left after the effect of the other factors has been netted out. This is why multiple regression is better than simple regression for proving causality.

9. Proof of the Frisch-Waugh Theorem: Because \hat{r} is a residual, it satisfies the FOCs of the auxiliary regression, and x_{1i} equals its fitted value (a linear combination of the constant and x_{2i}, ..., x_{ki}) plus \hat{r}_i. That is,

   \sum_i \hat{r}_i = 0, \quad \sum_i x_{2i} \hat{r}_i = 0, \quad \ldots, \quad \sum_i x_{ki} \hat{r}_i = 0, \quad \sum_i \hat{r}_i x_{1i} = \sum_i \hat{r}_i^2   (5)

   Also, because \hat{u}_i is orthogonal to every regressor in (1) and \hat{r}_i is a linear combination of those regressors,

   \sum_i \hat{r}_i \hat{u}_i = 0   (6)

   where \hat{u}_i is the residual for (1). The above equations imply that

   \sum_i \hat{r}_i y_i = \sum_i \hat{r}_i (\hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \ldots + \hat{\beta}_k x_{ki} + \hat{u}_i) = \hat{\beta}_1 \sum_i \hat{r}_i^2   (7)

   which gives (4) after dividing both sides by \sum_i \hat{r}_i^2.

10. The OLS estimate and the true coefficient are related via

   \hat{\beta}_1 = \beta_1 + \frac{\sum_i \hat{r}_i u_i}{\sum_i \hat{r}_i^2}   (8)

   from which we can prove the statistical properties of \hat{\beta}_1:

   (a) E(\hat{\beta}_1 | X) = \beta_1, so \hat{\beta}_1 is unbiased.

   (b) The (conditional) variance of \hat{\beta}_1 (assuming homoskedasticity) is

   var(\hat{\beta}_1 | X) = \frac{\sigma^2}{\sum_i \hat{r}_i^2} = \frac{\sigma^2}{SST_{X_1} (1 - R_{X_1}^2)}   (9)

   where SST_{X_1} = \sum_i (x_{1i} - \bar{x}_1)^2 measures the total variation in X_1, and R_{X_1}^2 denotes the R-squared of the auxiliary regression of X_1 on X_2, ..., X_k. Everything else equal, the variance is big (and the OLS estimate is imprecise) if X_1 is highly correlated with the other independent variables (R_{X_1}^2 is big). The phenomenon of high correlation among regressors is called multicollinearity. The consequence of multicollinearity is an insignificant estimate. Intuitively, when regressors are highly correlated, the regression cannot tell them apart, so the estimate is imprecise.

11. Now we face a trade-off. The chance of multicollinearity is zero when we run a simple regression, but the simple regression has a high chance of suffering from omitted variable bias. Multiple regression has a higher chance of multicollinearity, but a lower chance of omitted variable bias. Econometrics puts more weight on omitted variable bias than on multicollinearity.

12. After obtaining the coefficient estimates, we can compute the residual

   \hat{u}_i = Y_i - \hat{Y}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{1i} - \ldots - \hat{\beta}_k X_{ki}   (10)

   Then the variance of the error term is estimated as

   \hat{\sigma}^2 = \frac{\sum_i \hat{u}_i^2}{n - k - 1}   (11)

   The square root of \hat{\sigma}^2 is called the standard error of the regression (SER).

13. Because a multiple regression has multiple independent variables, we can test a hypothesis that involves several coefficients. The test is called the F test, and it is computed as

   F = \frac{(RSS_r - RSS_u)/q}{RSS_u/(n - k_u - 1)}   (12)

   where RSS_r is the RSS of the restricted regression that imposes the null hypothesis, RSS_u is the RSS of the unrestricted regression, q is the number of restrictions, and k_u is the number of regressors in the unrestricted regression.

   (a) Under the null hypothesis, the F statistic follows an F distribution with (q, n - k_u - 1) degrees of freedom. The t test is a special case of the F test. The null hypothesis is rejected if the p-value is less than 0.05.

   (b) The intuition is that the null hypothesis is false (so can be rejected) if imposing it significantly changes the RSS.

   (c) For example, consider an unrestricted multiple regression

   Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u

   The null hypothesis is

   H_0: \beta_1 = \beta_2

   By imposing the restriction in the null hypothesis, we get the restricted regression

   Y = \beta_0 + \beta_1 (X_1 + X_2) + u

   so the restricted regression uses X_1 + X_2 as a regressor. For this example q = 1. The F test can be used whenever the null hypothesis involves several coefficients.
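To see how large R_{X_1}^2 in the variance formula (9) is in practice, here is a minimal sketch using the house data and variable names (rprice, baths, area) from the example below. It reads off R_{X_1}^2 from the auxiliary regression and also uses Stata's estat vif postestimation command, whose variance inflation factor for each regressor equals 1/(1 - R² of that regressor's auxiliary regression).

   * sketch: gauge multicollinearity behind the variance formula (9)
   qui reg rprice baths area   // the multiple regression
   estat vif                   // VIF for each regressor = 1/(1 - auxiliary R-squared)
   qui reg baths area          // auxiliary regression of baths on area
   dis "R-squared of auxiliary regression: " e(r2)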

Example: Multiple Regression

1. We still use the house data.

2. First we run the simple regression of rprice on baths. This simple regression puts the variable area (which measures the size of the house) into the error term. Because the number of bathrooms and house size must be correlated, baths is endogenous in the simple regression. As a result, the estimated coefficient of baths, 29582.67, has NO causal interpretation (it is a biased estimate of the true causal effect). This number just measures the linear association or correlation between baths and rprice. We can only conclude that having one more bathroom is associated with an increase of 29582.67 in real house price. The OLS fitted line 14510.8 + 29582.67 baths, however, is the best linear predictor for Y (if we only use baths as predictor), no matter whether baths is exogenous or not.

3. Next we run the multiple regression of rprice on baths and area. The Stata command is reg rprice baths area. Now area is out of the error term, but other factors may still be there. That means it is still unlikely that we can get the causal effect by running this multiple regression with just two regressors. The estimated coefficient of baths now becomes 18602.52, smaller than 29582.67. We can conclude that having one more bathroom, while holding house size fixed, is associated with an increase of 18602.52 in real house price. Put differently, if we have two houses of the same size, but one house has one more bathroom than the other, then the rprice of the former is higher than that of the latter by 18602.52. Another way to interpret 18602.52 is that it measures the association between baths and rprice after the effect of area has been netted out.

4. So one benefit of multiple regression is that it can explicitly control for other factors and therefore lower the chance of omitted variable bias. Comparing the simple and multiple regressions, it is safe to say the simple regression (in this example) overestimates the effect of baths on rprice. The simple regression coefficient 29582.67 may capture not just the effect of baths but also that of area. In short, there is omitted variable bias in the simple regression. The omitted variable is area. The omitted variable bias is positive since baths and area are positively correlated and we expect area to have a positive effect on rprice; see Table 3.2 in the textbook for detail.

5. Another benefit of multiple regression is a bigger R-squared. We see 0.5570 > 0.4737. The multiple regression fits the data better simply because it uses more regressors (more information).

6. Multiple regression has a cost. Note that in the multiple regression the standard error of the baths coefficient is 2142.533, higher than 1745.859 in the simple regression. This finding implies that the estimate in the multiple regression is less precise than in the simple regression. It is the correlation between baths and area that causes the variance to rise; see the formula

   var(\hat{\beta}_1 | X) = \frac{\sigma^2}{\sum_i \hat{r}_i^2} = \frac{\sigma^2}{SST_{X_1} (1 - R_{X_1}^2)}

   For this problem we are lucky because baths is still significant after area is included. In practice, it is not uncommon that the key regressor becomes insignificant after other regressors are added.

7. Next we apply the Frisch-Waugh Theorem to show how to obtain 18602.52. The Stata commands are

   * step 1: auxiliary regression
   qui reg baths area
   predict rhat, re
   * step 2: regress y onto rhat
   reg rprice rhat

   So first we (quietly) regress baths (X_1) onto area (X_2) and save the residual rhat (\hat{r}) using the command predict with the option re. In step 2 we regress rprice (Y) onto rhat. Because the residuals rhat sum to zero, including the intercept in step 2 does not change the slope estimate. We get the same estimate as that reported by the command reg rprice baths area, so the Frisch-Waugh Theorem is verified.

8. We also report the F test for the hypothesis that baths and area (house size) have the same effect on rprice. The null hypothesis is H_0: \beta_1 = \beta_2. You can use the command test baths = area. Or you can construct the F test manually by running the unrestricted and restricted regressions. See the do file for details.
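For completeness, the same coefficient can also be computed directly from formula (4), \hat{\beta}_1 = \sum_i \hat{r}_i y_i / \sum_i \hat{r}_i^2. This is a minimal sketch that assumes rhat has already been generated as in step 1 above; the helper names num, den, sum_num, and sum_den are hypothetical, introduced only for this illustration.

   * sketch: compute the baths coefficient directly from the Frisch-Waugh formula (4)
   * (assumes: qui reg baths area, then predict rhat, re)
   gen num = rhat*rprice        // rhat_i * y_i
   gen den = rhat^2             // rhat_i squared
   qui summarize num
   sca sum_num = r(sum)
   qui summarize den
   sca sum_den = r(sum)
   dis "Frisch-Waugh estimate of the baths coefficient: " sum_num/sum_den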


Do File

* Do file for multiple regression (chapter 3 and 4)
clear
capture log close
*************************************
cd "I:\311"
log using 311log.txt, text replace
use 311_house.dta, clear

* simple regression
reg rprice baths

* multiple regression
reg rprice baths area
* save rss of the unrestricted regression
sca rssu = e(rss)

* example of F test, H0: beta1 = beta2
test baths = area

* restricted regression
gen x = baths + area
qui reg rprice x
sca rssr = e(rss)

* F test and p value (n = 321, ku = 2, so df = 321 - 2 - 1 = 318)
sca f = ((rssr-rssu)/1)/(rssu/(321-2-1))
sca pvalue = Ftail(1, 318, f)
dis "f test is " f
dis "pvalue is " pvalue

* verify Frisch-Waugh Theorem
* step 1: auxiliary regression
qui reg baths area
predict rhat, re
* step 2: regress y onto rhat
reg rprice rhat

*****************************************
log close