An explanation of Two Stage Least Squares


Introduction to Econometrics

Introduction

When an explanatory variable is endogenous, we know that the OLS estimator will be inconsistent. In addition, the OLS estimator will be biased. Two Stage Least Squares (2SLS) is a general solution to the problem of inconsistent estimators. When we have one endogenous variable and one excluded exogenous variable, the model is exactly identified, which allows us to use the excluded variable directly as an instrument. If more than one instrument is available, we use the best linear combination of all the exogenous variables, including the instruments, to construct a variable that we can use in place of the endogenous variable.

The following demonstrates the Two Stage Least Squares method. The next two sections both explain the methodology of 2SLS, but in two separate ways and with slightly different notation, because the presentation of 2SLS differs across textbooks and online resources.

The Two Stage Least Squares (General Notation)

Whenever we have more instruments than endogenous variables we say that the model is over identified. Let $z_{i1}, \dots, z_{iM}$ be instruments with $E[z_{im}\varepsilon_i] = 0$ for $m = 1, \dots, M$. This means that each instrument is exogenous. There are many combinations of the instruments that could be used, but 2SLS is the most efficient IV estimator (under homoskedastic errors).

The vector of all exogenous variables, including the constant, is

$$z_i = [1, x_{i1}, \dots, x_{i(k-1)}, z_{i1}, \dots, z_{iM}]$$

The linear combination of $z_i$ most correlated with the endogenous variable $x_{ik}$ is given by the linear projection of $x_{ik}$ on $z_i$. This is known as the reduced form model.

$$x_{ik} = \pi_0 + \pi_1 x_{i1} + \dots + \pi_{k-1} x_{i(k-1)} + \pi_k z_{i1} + \dots + \pi_{k+M-1} z_{iM} + v_i \qquad \text{(i)}$$

where $v_i$ has zero mean and is uncorrelated with each of the right hand side variables:

$$E(v_i) = 0, \quad Cov(v_i, x_{i1}) = 0, \quad \dots, \quad Cov(v_i, z_{iM}) = 0$$

By definition, any linear combination of $z_i$ is uncorrelated with $\varepsilon_i$. Dropping $v_i$ from equation (i) therefore leaves

$$x_{ik}^{*} = \pi_0 + \pi_1 x_{i1} + \dots + \pi_{k-1} x_{i(k-1)} + \pi_k z_{i1} + \dots + \pi_{k+M-1} z_{iM}$$

$x_{ik}^{*}$ is often interpreted as the part of $x_{ik}$ that is uncorrelated with $\varepsilon_i$. As long as we assume that there are no exact linear dependencies among the exogenous variables, we can consistently estimate the parameters in equation (i). We have to estimate $x_{ik}^{*}$ because the $\pi$ coefficients are unknown, so it is not observed and cannot be used directly.

Stage 1: Obtain the fitted values $\hat{x}_{ik}$ from the reduced form regression

$$\hat{x}_{ik} = \hat{\pi}_0 + \hat{\pi}_1 x_{i1} + \dots + \hat{\pi}_{k-1} x_{i(k-1)} + \hat{\pi}_k z_{i1} + \dots + \hat{\pi}_{k+M-1} z_{iM}$$

Stage 2: Run the OLS regression

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_{k-1} x_{i(k-1)} + \beta_k \hat{x}_{ik} + \varepsilon_i$$

With $\hat{x}_i = [1, x_{i1}, \dots, x_{i(k-1)}, \hat{x}_{ik}]$ we can define the 2SLS estimator as

$$\hat{\beta}_{IV(2SLS)} = \Big(\sum_i \hat{x}_i' \hat{x}_i\Big)^{-1} \sum_i \hat{x}_i' y_i = (\hat{X}'\hat{X})^{-1}\hat{X}'y$$

The 2SLS estimator turns out to be an OLS estimator. Note that

$$\hat{X} = Z(Z'Z)^{-1}Z'X = P_Z X$$

where this projection matrix is idempotent and symmetric, which means that $P_Z P_Z = P_Z$. Therefore we have $\hat{X}'X = X'P_Z X = (P_Z X)'P_Z X$. From this it is clear that $\hat{X}'X = \hat{X}'\hat{X}$. Because $\hat{X}$ is a linear combination of the instruments, the 2SLS estimator can also be written as

$$\hat{\beta}_{IV(2SLS)} = \Big(\sum_i \hat{x}_i' x_i\Big)^{-1} \sum_i \hat{x}_i' y_i = (\hat{X}'X)^{-1}\hat{X}'y$$

This is the standard IV formula with $\hat{X}$ used as the instrument matrix for $X$, and it gives the same result as regressing $y$ on $\hat{X}$. It is important to note that $\hat{\beta}_{IV(2SLS)}$ and $\hat{\beta}_{IV}$ are identical when there is only one instrument for the endogenous variable $x_{ik}$.

Two Stage Least Squares (Alternative Notation)

The Structural Model

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_{k-1} x_{i(k-1)} + \beta_k x_{ik} + \varepsilon_i \qquad (1)$$

This is our original linear model, where the endogenous variable is denoted $x_{ik}$.

Reduced Form Model

$$x_{ik} = \pi_0 + \pi_1 x_{i1} + \dots + \pi_{k-1} x_{i(k-1)} + \delta_1 z_{i1} + \dots + \delta_m z_{im} + v_i \qquad (2)$$

In this model the $z_i$'s represent the exogenous instrumental variables [1]. The reduced form coefficients are written $\pi$ and $\delta$ to keep them distinct from the structural coefficients $\beta$ in (1). The best IV for $x_{ik}$ is the linear combination of the exogenous variables, which we call $x_{ik}^{*}$. By definition, any linear combination of $z_i$ is uncorrelated with $\varepsilon_i$, and we have

$$E(v_i) = 0, \quad Cov(v_i, x_{i1}) = 0, \quad \dots, \quad Cov(v_i, z_{im}) = 0$$

Dropping $v_i$ from equation (2) therefore gives

$$x_{ik}^{*} = \pi_0 + \pi_1 x_{i1} + \dots + \pi_{k-1} x_{i(k-1)} + \delta_1 z_{i1} + \dots + \delta_m z_{im}$$

Aside: Although correct, it is not enough to say that $z_{i1}$ is correlated with $x_{ik}$. It is more precise to say that $z_{i1}$ is partially correlated with $x_{ik}$, since there are other variables in equation (2).

[1] For a basic understanding of the role of instrumental variables you should consult the PDF "Instrumental Variables".
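The two stages described above can be sketched on simulated data. The following is a minimal illustration, not the method used for any result in these notes: the data-generating process, the coefficient values (0.8, 0.5, 2.0), and the helper names `ols`, `two_stage`, `cov` and `simulate` are all assumptions made for the example. It uses one endogenous regressor, two instruments, and an unobserved factor that enters both the regressor and the error.

```python
import random

def ols(X, y):
    """OLS coefficients: solve the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

def two_stage(Z, x, y):
    """Manual 2SLS. Rows of Z are [1, instruments...]; returns [const, slope]."""
    pi = ols(Z, x)                                        # Stage 1: x on Z
    x_hat = [sum(p * z for p, z in zip(pi, row)) for row in Z]
    return ols([[1.0, xh] for xh in x_hat], y)            # Stage 2: y on x_hat

def cov(a, b):
    """Sample covariance (divisor n; it cancels in the IV ratio)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

def simulate(n=20000, beta=1.0, seed=0):
    """Hypothetical data: x is endogenous because 'a' is also in the error."""
    rng = random.Random(seed)
    z1s, z2s, xs, ys = [], [], [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        a = rng.gauss(0, 1)                       # unobserved common factor
        x = 0.8 * z1 + 0.5 * z2 + a + rng.gauss(0, 1)
        y = 2.0 + beta * x + a + rng.gauss(0, 1)
        z1s.append(z1); z2s.append(z2); xs.append(x); ys.append(y)
    b_ols = ols([[1.0, x] for x in xs], ys)[1]
    b_2sls = two_stage([[1.0, u, v] for u, v in zip(z1s, z2s)], xs, ys)[1]
    # Exactly identified case: 2SLS with one instrument vs. the simple IV ratio.
    b_one = two_stage([[1.0, u] for u in z1s], xs, ys)[1]
    b_iv = cov(z1s, ys) / cov(z1s, xs)
    return b_ols, b_2sls, b_one, b_iv

b_ols, b_2sls, b_one, b_iv = simulate()
print("OLS slope:", round(b_ols, 3))
print("2SLS slope (two instruments):", round(b_2sls, 3))
print("2SLS vs simple IV, one instrument:", round(b_one, 6), round(b_iv, 6))
```

Under this assumed design the OLS slope should settle near 1 + 1/2.89 (about 1.35) rather than the true value 1, while the 2SLS slope should be close to 1. With a single instrument, the manual two-stage slope and the ratio Cov(z, y)/Cov(z, x) agree up to floating point error. Note that standard errors from a manual second stage are not valid, which is one reason to use a dedicated 2SLS routine in practice.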

First Stage: This stage is essentially just running the reduced form regression, as shown in (3). What we are actually doing is regressing the endogenous problem variable on all the exogenous variables. This includes all the $x$'s from the original structural equation that are exogenous, plus all the instruments $z$, which are assumed to be exogenous because they come from outside the model.

1st Stage regression (Reduced Form Regression)

$$\hat{x}_{ik} = \hat{\pi}_0 + \hat{\pi}_1 x_{i1} + \dots + \hat{\pi}_{k-1} x_{i(k-1)} + \hat{\delta}_1 z_{i1} + \dots + \hat{\delta}_m z_{im} \qquad (3)$$

Second Stage: Now we have an estimate $\hat{x}_{ik}$ of the endogenous variable, which is the best linear combination of the exogenous variables. This is an exogenous variable that we can use as a replacement for $x_{ik}$, the endogenous variable from the structural model (1). This is why Model (4) below uses $\hat{x}_{ik}$ in the equation. 2SLS first purges $x_{ik}$ of its correlation with $\varepsilon_i$ before doing the OLS regression. To see this, write $x_{ik} = \hat{x}_{ik} + \hat{v}_i$ and substitute into (1); the composite error is then $\varepsilon_i + \beta_k \hat{v}_i$:

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_{k-1} x_{i(k-1)} + \beta_k \hat{x}_{ik} + \varepsilon_i + \beta_k \hat{v}_i$$

where $\varepsilon_i + \beta_k \hat{v}_i$ has zero mean and is uncorrelated with all the right hand side variables.

2SLS Regression

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_{k-1} x_{i(k-1)} + \beta_k \hat{x}_{ik} + \varepsilon_i \qquad (4)$$

Here the error term in (4) should be read as the composite error $\varepsilon_i + \beta_k \hat{v}_i$ from the line above. Therefore we have

$$\hat{\beta}_{IV(2SLS)} = \Big(\sum_i \hat{x}_i' \hat{x}_i\Big)^{-1} \sum_i \hat{x}_i' y_i$$

In this final equation we only have exogenous variables. Two stage least squares has therefore removed the inconsistency that comes from having an endogenous explanatory variable. It is important to notice that although two stage least squares is consistent, it is generally biased in finite samples.

Example 1

Suppose we are interested in estimating wages using model (5) below. The explanatory variables are the factors we think affect a person's wages. However, we know from the OLS assumptions that these variables cannot be correlated with the error term. The error term contains everything that is unobserved, including variables like ability.
(5) is the model we wish to estimate, where $error_i$ represents the population error, whilst in (6) $u_i$ is the error term of the model we can actually estimate from the sample we observe.

$$\log(wage)_i = \beta_0 + \beta_1 exper_i + \beta_2 exper_i^2 + \beta_3 age_i + \beta_4 region_i + \beta_5 educ_i + error_i \qquad (5)$$

$$\log(wage)_i = \beta_0 + \beta_1 exper_i + \beta_2 exper_i^2 + \beta_3 age_i + \beta_4 region_i + \beta_5 educ_i + u_i \qquad (6)$$

By inspecting the structure of the error terms we can see that ability is an unobserved variable that is related to the education variable. For instance, those with more ability are expected to have, on average, more years of education than those with less ability. We will assume that ability is not correlated with the experience variable, although there are examples where this could be the case.

$$u_i = ability_i + error_i, \qquad Cov(educ_i, u_i) \neq 0$$

More specifically, we could split the last statement into the following two statements to show explicitly where the endogeneity comes from:

$$Cov(educ_i, ability_i) \neq 0 \quad \text{and} \quad Cov(educ_i, error_i) = 0$$

If education is correlated with $u_i$ through the ability variable, we will have a biased and inconsistent estimator not only for $\beta_5$ but, in general, for all the coefficients. Therefore we need to use an instrument, or a set of instruments, to solve the endogeneity problem.

At this stage we need to find at least one instrument for the one endogenous variable. When we have more than one excluded exogenous variable, we use two stage least squares to create the best linear combination of instruments and so remove the endogeneity of education. We could use the variables mother's education and father's education. If the two instruments satisfy the requirements of being uncorrelated with the error but correlated with education, then we can use them. It is worth noting that there are methods for testing for endogeneity and for testing over identifying restrictions, but we shall not look at them here.

Stage 1

$$educ_i = \hat{\pi}_0 + \hat{\pi}_1 exper_i + \hat{\pi}_2 exper_i^2 + \hat{\pi}_3 age_i + \hat{\pi}_4 region_i + \hat{\delta}_1 motheduc_i + \hat{\delta}_2 fatheduc_i + \hat{e}_i$$

The fitted values $\widehat{educ}_i$ from this regression are an estimate of education constructed from the exogenous variables in the structural equation plus the outside exogenous variables. The exogenous $x$'s are acting as instruments for themselves.

Stage 2

$$\log(wage)_i = \beta_0 + \beta_1 exper_i + \beta_2 exper_i^2 + \beta_3 age_i + \beta_4 region_i + \beta_5 \widehat{educ}_i + u_i$$

Example 2

The following regression attempts to model income based on cigarettes consumed, education and the age of a person.
$$\log(income) = \beta_0 + \beta_1 cigs + \beta_2 educ + \beta_3 age + \beta_4 age^2 + \varepsilon_1$$

(i) How do you interpret the coefficient $\beta_1$?

. regress lincome cigs educ age agesq

          Source |         SS     df          MS        F(4, 802)     =   39.61
    -------------+--------------------------------      Prob > F      =  0.0000
           Model |  67.5412888      4  16.8853222        R-squared     =  0.1650
        Residual |  341.854549    802  .426252555        Adj R-squared =  0.1608
    -------------+--------------------------------      Root MSE      =  .65288
           Total |  409.395838    806  .507935283

         lincome |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            cigs |   .0017306   .0017137     1.01   0.313    -.0016333    .0050945
            educ |   .0603606   .0078983     7.64   0.000     .0448567    .0758645
             age |   .0576908   .0076436     7.55   0.000      .042687    .0726946
           agesq |  -.0006306   .0000834    -7.56   0.000    -.0007943   -.0004669
           _cons |   7.795444   .1704271    45.74   0.000     7.460908    8.129979

Because the dependent variable is in logs, one more cigarette smoked per day is associated with an increase in income of approximately 0.17% (100 × 0.0017306), although the coefficient is not statistically significant (p = 0.313). cigs is the endogenous variable because it is itself determined by income, amongst other factors. We would probably expect consumption of more cigarettes to decrease income for health reasons, but that is not the case here. We expect that this result is caused by the endogeneity.

To reflect the fact that cigarette consumption might be jointly determined with income, a demand for cigarettes equation is

$$cigs = \gamma_0 + \gamma_1 \log(income) + \gamma_2 educ + \gamma_3 age + \gamma_4 age^2 + \gamma_5 \log(cigpric) + \gamma_6 restaurn + \varepsilon_2$$

where cigpric is the price of a pack of cigarettes (in cents), and restaurn is a binary variable equal to one if the person lives in a state with restaurant smoking restrictions.

(ii) Assuming these are exogenous to the individual, what signs would you expect for $\gamma_5$ and $\gamma_6$?

We would expect $\gamma_5$ to have a negative sign, as the higher the price of cigarettes the less you will consume. However, as smoking is addictive, demand would most likely be inelastic, so we don't expect a large change. We would expect $\gamma_6$ to be negative as well: if restaurants do not allow smoking we would expect less consumption, although I doubt the magnitude would be too large. The regression below confirms the expected signs, although the coefficient on lcigpric is not statistically significant.

. regress cigs lincome educ age agesq lcigpric restaurn

          Source |         SS     df          MS        F(6, 800)     =    7.42
    -------------+--------------------------------      Prob > F      =  0.0000
           Model |  8003.02506      6  1333.83751        R-squared     =  0.0527
        Residual |  143750.658    800  179.688322        Adj R-squared =  0.0456
    -------------+--------------------------------      Root MSE      =  13.405
           Total |  151753.683    806  188.280003

            cigs |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         lincome |   .8802682   .7277832     1.21   0.227     -.548322    2.308858
            educ |  -.5014982   .1670772    -3.00   0.003    -.8294597   -.1735368
             age |   .7706936   .1601223     4.81   0.000      .456384    1.085003
           agesq |  -.0090228    .001743    -5.18   0.000    -.0124443   -.0056013
        lcigpric |  -.7508586   5.773343    -0.13   0.897    -12.08355    10.58183
        restaurn |  -2.825085   1.111794    -2.54   0.011    -5.007462   -.6427078
           _cons |  -3.639841   24.07866    -0.15   0.880    -50.90466    43.62497
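Returning to the interpretation of the cigs coefficient in the income equation of part (i): in a log-level model, a coefficient b on x means that a one-unit change in x is associated with roughly a 100·b percent change in y, and exactly 100·(exp(b) − 1) percent. The short sketch below applies both versions to the reported OLS coefficient; the function names are hypothetical and chosen only for this illustration.

```python
import math

def approx_pct_effect(b):
    """Approximate % change in y per unit of x in a log-level model: 100 * b."""
    return 100.0 * b

def exact_pct_effect(b):
    """Exact % change in y per unit of x in a log-level model: 100 * (exp(b) - 1)."""
    return 100.0 * (math.exp(b) - 1.0)

b_cigs_ols = 0.0017306  # coefficient on cigs from the OLS output for lincome

print("approximate effect (%):", round(approx_pct_effect(b_cigs_ols), 4))
print("exact effect (%):", round(exact_pct_effect(b_cigs_ols), 4))
```

For a coefficient this small the two versions agree to about three decimal places, so one extra cigarette per day corresponds to roughly a 0.17% difference in income in the OLS fit. The approximation becomes less accurate as the coefficient grows in absolute value.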

(iii) Under what assumption is the income equation identified?

For identification there needs to be an equal number of equations and endogenous variables. There also needs to be at least one excluded exogenous variable for every endogenous variable. In the cigarette consumption equation we have log(cigpric) and restaurn, which we assume are both exogenous and which do not appear in the income equation. We can use these to identify the income equation. We have 2 possible IVs for the endogenous variable cigs, so we are over identified and will use the first stage linear combination of the two, as long as they are strong enough instruments in terms of their correlation with cigs. This means we should always look at the reduced form first stage regression. SEM can also suffer from weak instruments, thus vigilance is warranted.

(iv) Estimate the income equation by OLS and discuss the estimate of $\beta_1$.

. regress lincome cigs educ age agesq

          Source |         SS     df          MS        F(4, 802)     =   39.61
    -------------+--------------------------------      Prob > F      =  0.0000
           Model |  67.5412888      4  16.8853222        R-squared     =  0.1650
        Residual |  341.854549    802  .426252555        Adj R-squared =  0.1608
    -------------+--------------------------------      Root MSE      =  .65288
           Total |  409.395838    806  .507935283

         lincome |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            cigs |   .0017306   .0017137     1.01   0.313    -.0016333    .0050945
            educ |   .0603606   .0078983     7.64   0.000     .0448567    .0758645
             age |   .0576908   .0076436     7.55   0.000      .042687    .0726946
           agesq |  -.0006306   .0000834    -7.56   0.000    -.0007943   -.0004669
           _cons |   7.795444   .1704271    45.74   0.000     7.460908    8.129979

$\beta_1$ is the coefficient on cigarettes. It shows that one more cigarette smoked per day is associated with an increase in income of approximately 0.17%, and the estimate is not statistically significant (p = 0.313). We would expect a negative sign: as the number of cigarettes increases we would expect to see a decline in health, and in income due to fewer hours worked. This is not the case here because of the endogeneity problem: cigs is jointly determined with income and depends on factors that are not in the income equation, which makes the OLS estimator biased and inconsistent.

(v) Estimate the reduced form for cigs. Are log(cigpric) and restaurn significant?

. regress cigs lincome educ age agesq lcigpric restaurn

          Source |         SS     df          MS        F(6, 800)     =    7.42
    -------------+--------------------------------      Prob > F      =  0.0000
           Model |  8003.02506      6  1333.83751        R-squared     =  0.0527
        Residual |  143750.658    800  179.688322        Adj R-squared =  0.0456
    -------------+--------------------------------      Root MSE      =  13.405
           Total |  151753.683    806  188.280003

            cigs |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         lincome |   .8802682   .7277832     1.21   0.227     -.548322    2.308858
            educ |  -.5014982   .1670772    -3.00   0.003    -.8294597   -.1735368
             age |   .7706936   .1601223     4.81   0.000      .456384    1.085003
           agesq |  -.0090228    .001743    -5.18   0.000    -.0124443   -.0056013
        lcigpric |  -.7508586   5.773343    -0.13   0.897    -12.08355    10.58183
        restaurn |  -2.825085   1.111794    -2.54   0.011    -5.007462   -.6427078
           _cons |  -3.639841   24.07866    -0.15   0.880    -50.90466    43.62497

Note that this regression includes lincome as a regressor, so strictly speaking it is the structural demand equation rather than the reduced form; the reduced form for cigs would regress cigs on the exogenous variables only (educ, age, agesq, lcigpric and restaurn).

log(cigpric) is not significant at any conventional level, with a p-value of 0.897, which suggests it is not a strong instrument for cigs. This intuitively makes sense: because cigarettes are highly addictive, we don't expect price to have a large effect on consumption. restaurn is significant, with a t-statistic of -2.54 (p = 0.011), so it appears to be a reasonable instrument to use for cigs in the income equation. I would not expect the banning of cigarettes in public restaurants to have such a large effect on cigarette consumption, however the results suggest otherwise.

(vi) Now, estimate the income equation by 2SLS. Discuss how the estimate of $\beta_1$ compares with the OLS estimate.

. ivregress 2sls lincome educ age agesq (cigs = lcigpric restaurn)

    Instrumental variables (2SLS) regression          Number of obs =     807
                                                      Wald chi2(4)  =   89.80
                                                      Prob > chi2   =  0.0000
                                                      R-squared     =       .
                                                      Root MSE      =  .87723

         lincome |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            cigs |  -.0421257   .0261371    -1.61   0.107    -.0933535     .009102
            educ |   .0396746   .0162305     2.44   0.015     .0078633    .0714859
             age |   .0938182   .0237794     3.95   0.000     .0472115    .1404249
           agesq |  -.0010508   .0002735    -3.84   0.000    -.0015868   -.0005148
           _cons |   7.780893   .2291541    33.95   0.000      7.33176    8.230027

    Instrumented:  cigs
    Instruments:   educ age agesq lcigpric restaurn

$\hat{\beta}_{1(2SLS)} = -0.0421$, which is now negative, compared with the original OLS estimate $\hat{\beta}_1 = 0.0017$. Its p-value (0.107) is lower than that of the OLS estimate (0.313), although it is still not significant at the 10% level, and its standard error is much larger. The negative sign is the expected relationship between cigarette consumption and income, and it suggests that the OLS estimate is biased upwards.

Returning to Example 1: the Stage 2 equation there is the original equation we want to estimate, with education replaced by its fitted value, which uses the best linear combination of the exogenous variables. That regression no longer has an endogeneity problem because $\widehat{educ}_i$ is an exogenous term and is unrelated to the error.
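For the cigarette example, the z statistic, p-value and confidence interval reported for cigs in the 2SLS output can be reproduced from the coefficient and standard error alone. The sketch below assumes a standard normal reference distribution, which is consistent with the z and P>|z| columns in that output; the function name `z_summary` is hypothetical.

```python
import math

def z_summary(coef, se, crit=1.959964):
    """Return (z, two-sided normal p-value, 95% confidence interval)."""
    z = coef / se
    # Two-sided p-value under N(0, 1): P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    ci = (coef - crit * se, coef + crit * se)
    return z, p, ci

# Coefficient and standard error on cigs from the 2SLS output.
z, p, (lo, hi) = z_summary(-0.0421257, 0.0261371)
print("z =", round(z, 2))
print("p =", round(p, 3))
print("95% CI =", (round(lo, 7), round(hi, 7)))
```

The interval contains zero, which is the same information as the p-value being above 0.05: the sign of the 2SLS estimate matches expectations, but the estimate is imprecise. The loss of precision relative to OLS is typical of IV estimation, particularly when the instruments are only weakly related to the endogenous variable.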