An explanation of Two Stage Least Squares

Introduction to Econometrics

Introduction

When an explanatory variable is endogenous, the OLS estimator is inconsistent; it is also biased. Two Stage Least Squares (2SLS) is a general solution to the problem of inconsistent estimators. When we have one endogenous variable and one excluded exogenous variable, the model is exactly identified and we can use the excluded variable directly as the instrument. If more than one instrument is available, we use the best linear combination of all the exogenous variables, including the instruments, to construct a variable that we can use in place of the endogenous variable.

The following demonstrates the Two Stage Least Squares method. The following two sections both explain the methodology of 2SLS. I present it in two separate ways, with slightly different notation, because the presentation differs across textbooks and online resources.

The Two Stage Least Squares (General Notation)

Whenever we have more than one instrument for a single endogenous variable we say that the model is over identified. Denote the endogenous regressor by $x_{iK}$ and let $z_{i1}, \dots, z_{iM}$ be instruments with $E[z_{im}\varepsilon_i] = 0$ for every $m = 1, \dots, M$. This means that each instrument is exogenous. Many linear combinations of the instruments could be used, but 2SLS is the most efficient IV estimator of this kind (under homoskedastic errors).

The vector of all exogenous variables is $z_i = [1, x_{i1}, \dots, x_{i,K-1}, z_{i1}, \dots, z_{iM}]$. The linear combination of $z_i$ most correlated with $x_{iK}$ is given by the linear projection of $x_{iK}$ on $z_i$. This is known as the reduced form model:

$$x_{iK} = \pi_0 + \pi_1 x_{i1} + \dots + \pi_{K-1} x_{i,K-1} + \pi_K z_{i1} + \dots + \pi_{K+M-1} z_{iM} + v_i \qquad \text{(i)}$$

where $v_i$ has zero mean and is uncorrelated with each of the right hand side variables: $E(v_i) = 0$, $\mathrm{Cov}(v_i, x_{i1}) = 0$, ..., $\mathrm{Cov}(v_i, z_{iM}) = 0$. By definition any linear combination of $z_i$ is uncorrelated with $\varepsilon_i$, so dropping $v_i$ from the equation leaves

$$x_{iK}^{*} = \pi_0 + \pi_1 x_{i1} + \dots + \pi_{K-1} x_{i,K-1} + \pi_K z_{i1} + \dots + \pi_{K+M-1} z_{iM}$$

$x_{iK}^{*}$ is often interpreted as the part of $x_{iK}$ that is uncorrelated with $\varepsilon_i$. As long as we assume that there are no exact linear dependencies among the exogenous variables, we can consistently estimate the parameters in equation (i). Because $x_{iK}^{*}$ depends on the unknown $\pi$'s it is not observed, so using it directly is not feasible and we have to estimate it.

Stage 1: Obtain the fitted values $\hat{x}_{iK}$ from the reduced form regression

$$\hat{x}_{iK} = \hat{\pi}_0 + \hat{\pi}_1 x_{i1} + \dots + \hat{\pi}_{K-1} x_{i,K-1} + \hat{\pi}_K z_{i1} + \dots + \hat{\pi}_{K+M-1} z_{iM}$$

Stage 2: Run the OLS regression

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_{K-1} x_{i,K-1} + \beta_K \hat{x}_{iK} + \varepsilon_i$$

With $\hat{x}_i = [1, x_{i1}, \dots, x_{i,K-1}, \hat{x}_{iK}]$ we can define the 2SLS estimator as

$$\hat{\beta}_{IV(2SLS)} = \Big(\sum_i \hat{x}_i' x_i\Big)^{-1} \sum_i \hat{x}_i' y_i = (\hat{X}'X)^{-1}\hat{X}'y$$

The 2SLS estimator turns out to be an OLS estimator. Note that $\hat{X} = Z(Z'Z)^{-1}Z'X = P_Z X$, where the projection matrix $P_Z$ is idempotent and symmetric, which means that $P_Z P_Z = P_Z$. Therefore we have $\hat{X}'X = X'P_Z X = (P_Z X)'P_Z X$, and from this it is clear that $\hat{X}'X = \hat{X}'\hat{X}$. The 2SLS estimator, which uses a linear combination of the instruments, can therefore be written as

$$\hat{\beta}_{IV(2SLS)} = \Big(\sum_i \hat{x}_i' \hat{x}_i\Big)^{-1} \sum_i \hat{x}_i' y_i = (\hat{X}'\hat{X})^{-1}\hat{X}'y$$

This is the standard OLS formula for a regression of $y$ on $\hat{X}$. It is important to note that $\hat{\beta}_{IV(2SLS)}$ and $\hat{\beta}_{IV}$ are identical when there is only one instrument for the endogenous variable.

Two Stage Least Squares (Alternative Notation)

The Structural Model

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_{k-1} x_{i(k-1)} + \beta_k x_{ik} + \varepsilon_i \qquad (1)$$

This is our original linear model, where the endogenous variable is denoted $x_{ik}$.

Reduced Form Model

$$x_{ik} = \beta_0 + \beta_1 x_{i1} + \dots + \beta_{k-1} x_{i(k-1)} + \delta_1 z_{i1} + \dots + \delta_m z_{im} + v_i \qquad (2)$$

In this model the $z_i$'s represent the exogenous instrumental variables. [1] (The $\beta$'s in (2) are reduced form coefficients; they are not the same as the structural coefficients in (1).) The best IV for $x_{ik}$ is the linear combination of the exogenous variables, which we call $x_{ik}^{*}$. By definition any linear combination of $z_i$ is uncorrelated with $\varepsilon_i$, and we have $E(v_i) = 0$, $\mathrm{Cov}(v_i, x_{i1}) = 0$, ..., $\mathrm{Cov}(v_i, z_{im}) = 0$, so $v_i$ drops out of the equation:

$$x_{ik}^{*} = \beta_0 + \beta_1 x_{i1} + \dots + \beta_{k-1} x_{i(k-1)} + \delta_1 z_{i1} + \dots + \delta_m z_{im}$$

Aside: Although correct, it is not enough to say that $z_{i1}$ is correlated with $x_{ik}$. It is more precise to say that $z_{i1}$ is partially correlated with $x_{ik}$, since there are other variables in equation (2); that is, its coefficient $\delta_1$ in (2) must be non-zero.

[1] For a basic understanding of the role of instrumental variables you should consult the PDF Instrumental Variables.
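The distinction drawn in the aside between correlation and partial correlation can be illustrated with a small simulation. The following is only a minimal sketch on made-up data: the names `x1`, `z`, `x_endog` and the data-generating process are assumptions chosen for illustration and do not come from the text. Here `z` is related to the endogenous variable only through the included exogenous regressor `x1`, so its simple regression slope is clearly non-zero while its partial slope, once `x1` has been partialled out of both variables, is close to zero.

```python
import random

def slope(x, y):
    """Slope coefficient from a simple OLS regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals from a simple OLS regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

# Hypothetical data-generating process (an assumption for illustration only)
random.seed(0)
n = 5000
x1 = [random.gauss(0, 1) for _ in range(n)]        # included exogenous regressor
z = [a + random.gauss(0, 1) for a in x1]           # candidate instrument, driven by x1
x_endog = [a + random.gauss(0, 1) for a in x1]     # depends on x1 but not on z itself

# Simple slope: regress x_endog on z alone
marginal_slope = slope(z, x_endog)

# Partial slope: remove x1 from both variables first, then regress the residuals
partial_slope = slope(residuals(x1, z), residuals(x1, x_endog))

print(f"simple slope of x_endog on z:  {marginal_slope:.3f}")
print(f"partial slope (x1 held fixed): {partial_slope:.3f}")
```

A variable like this `z` would be of no use as an instrument even though it is correlated with the endogenous variable, because it carries no information beyond what `x1` already provides.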

First Stage: This stage is essentially just running the reduced form regression, as shown in (3). What we are actually doing is regressing the endogenous (problem) variable on all the exogenous variables. These include all the x's from the original structural equation that are exogenous, plus all the instruments z, which are assumed to be exogenous because they come from outside the model.

1st Stage regression (Reduced Form Regression)

$$\hat{x}_{ik} = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_{k-1} x_{i(k-1)} + \hat{\delta}_1 z_{i1} + \dots + \hat{\delta}_m z_{im} \qquad (3)$$

Second Stage: Now we have $\hat{x}_{ik}$, an estimate of $x_{ik}^{*}$, the best linear combination of the exogenous variables. It is an exogenous variable that we can use as a replacement for $x_{ik}$, the endogenous variable from the structural model (1). This is why Model (4) below uses $\hat{x}_{ik}$ in the equation. 2SLS first purges $x_{ik}$ of its correlation with $\varepsilon_i$ before doing the OLS regression. To see this, write $x_{ik} = x_{ik}^{*} + v_i$ and substitute into (1); the composite error is then $\varepsilon_i + \beta_k v_i$:

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_{k-1} x_{i(k-1)} + \beta_k x_{ik}^{*} + (\varepsilon_i + \beta_k v_i)$$

where $\varepsilon_i + \beta_k v_i$ has zero mean and is uncorrelated with all the right hand side variables.

2SLS Regression

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_{k-1} x_{i(k-1)} + \beta_k \hat{x}_{ik} + \varepsilon_i^{*} \qquad (4)$$

where $\varepsilon_i^{*}$ denotes the composite error. Therefore we have

$$\hat{\beta}_{IV(2SLS)} = \Big(\sum_i \hat{x}_i' \hat{x}_i\Big)^{-1} \sum_i \hat{x}_i' y_i, \qquad \hat{x}_i = [1, x_{i1}, \dots, x_{i(k-1)}, \hat{x}_{ik}]$$

In this final equation we only have exogenous variables. Two stage least squares has therefore removed the inconsistency that comes from having an endogenous explanatory variable. It is important to notice that although the two stage least squares estimator is consistent, it is in general not unbiased in finite samples.

Example 1

Suppose we are interested in estimating wages using model (5). The explanatory variables are what we think affects a person's wage. However, we know from the OLS assumptions that these variables cannot be correlated with the error term. The error term contains everything that is unobserved, including variables like ability.
(5) is the model we wish to estimate, where $error_i$ represents the population error, whilst in (6) $u_i$ is the residual error from the sample we observe.

$$\log(wage)_i = \beta_0 + \beta_1 exper_i + \beta_2 exper_i^2 + \beta_3 age_i + \beta_4 region_i + \beta_5 educ_i + error_i \qquad (5)$$

$$\log(wage)_i = \beta_0 + \beta_1 exper_i + \beta_2 exper_i^2 + \beta_3 age_i + \beta_4 region_i + \beta_5 educ_i + u_i \qquad (6)$$

By inspecting the structure of the error term we notice that ability is an unobserved variable that is related to the education variable. For instance, those with more ability are expected to have, on average, more years of education than those with less ability. We will assume that ability is not correlated with the experience variable, although there are examples where this could be the case.

$$u_i = ability_i + error_i, \qquad \mathrm{Cov}(educ_i, u_i) \neq 0$$

More specifically, we can split the last condition into the following two to show explicitly where the endogeneity comes from:

$$\mathrm{Cov}(educ_i, ability_i) \neq 0 \quad \text{and} \quad \mathrm{Cov}(educ_i, error_i) = 0$$

If education is correlated with $u_i$ through the ability variable, OLS gives a biased and inconsistent estimator not only of $\beta_5$ but, in general, of the other coefficients as well. Therefore we need an instrument, or a set of instruments, to solve the endogeneity problem.

At this stage we need to find at least one instrument for the one endogenous variable. When we have more than one excluded exogenous variable, we use two stage least squares to create the best linear combination of instruments and so remove the endogeneity of education. We could use the variables mother's education and father's education. If the two instruments satisfy the requirements of being uncorrelated with the error but (partially) correlated with education, then we can use them. It is worth noting that there are methods for testing for endogeneity and for testing over identifying restrictions; however, we shall not look at them here.

Stage 1

$$educ_i = \beta_0 + \beta_1 exper_i + \beta_2 exper_i^2 + \beta_3 age_i + \beta_4 region_i + \delta_1 motheduc_i + \delta_2 fatheduc_i + e_i$$

The fitted values $\widehat{educ}_i$ from this regression give an estimate of education that is constructed from the exogenous variables in the structural equation plus the outside exogenous variables. (The coefficients here are reduced form coefficients, not the structural $\beta$'s of (6).) The exogenous variables from the structural equation act as instruments for themselves.

Stage 2

$$\log(wage)_i = \beta_0 + \beta_1 exper_i + \beta_2 exper_i^2 + \beta_3 age_i + \beta_4 region_i + \beta_5 \widehat{educ}_i + u_i$$

Example 2

The following regression attempts to model income based on cigarettes consumed, education and the age of a person.
$$\log(income) = \beta_0 + \beta_1 cigs + \beta_2 educ + \beta_3 age + \beta_4 age^2 + \varepsilon_1$$

(i) How do you interpret the coefficient $\beta_1$?

    . regress lincome cigs educ age agesq

          Source |         SS     df          MS        F(4, 802)     =   39.61
    -------------+----------------------------------    Prob > F      =  0.0000
           Model |  67.5412888      4   16.8853222      R-squared     =  0.1650
        Residual |  341.854549    802   .426252555      Adj R-squared =  0.1608
    -------------+----------------------------------    Root MSE      =  .65288
           Total |  409.395838    806   .507935283

         lincome |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    -------------+-----------------------------------------------------------------
            cigs |   .0017306    .0017137     1.01   0.313    -.0016333    .0050945
            educ |   .0603606    .0078983     7.64   0.000     .0448567    .0758645
             age |   .0576908    .0076436     7.55   0.000      .042687    .0726946
           agesq |  -.0006306    .0000834    -7.56   0.000    -.0007943   -.0004669
           _cons |   7.795444    .1704271    45.74   0.000     7.460908    8.129979

Because the dependent variable is in logs, one more cigarette smoked per day is associated with an increase in income of approximately 0.17% (100 × 0.00173), although the estimate is not statistically significant (p = 0.313). cigs is the endogenous variable because it is itself determined by income, amongst other factors. We would probably expect higher cigarette consumption to decrease income for health reasons, but that is not the case here. We suspect that this result is caused by the endogeneity.

To reflect the fact that cigarette consumption might be jointly determined with income, a demand for cigarettes equation is

$$cigs = \gamma_0 + \gamma_1 \log(income) + \gamma_2 educ + \gamma_3 age + \gamma_4 age^2 + \gamma_5 \log(cigpric) + \gamma_6 restaurn + \varepsilon_2$$

where cigpric is the price of a pack of cigarettes (in cents), and restaurn is a binary variable equal to one if the person lives in a state with restaurant smoking restrictions.

(ii) Assuming these are exogenous to the individual, what signs would you expect for $\gamma_5$ and $\gamma_6$?

We would expect $\gamma_5$ to be negative: the higher the price of cigarettes, the fewer you will consume. However, as smoking is addictive, demand is most likely inelastic, so we do not expect a large change. We would also expect $\gamma_6$ to be negative: if restaurants do not allow smoking we would expect less consumption, although I doubt the magnitude would be large. The regression below shows the expected signs, although the coefficient on lcigpric is not statistically significant.

    . regress cigs lincome educ age agesq lcigpric restaurn

          Source |         SS     df          MS        F(6, 800)     =    7.42
    -------------+----------------------------------    Prob > F      =  0.0000
           Model |  8003.02506      6   1333.83751      R-squared     =  0.0527
        Residual |  143750.658    800   179.688322      Adj R-squared =  0.0456
    -------------+----------------------------------    Root MSE      =  13.405
           Total |  151753.683    806   188.280003

            cigs |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    -------------+-----------------------------------------------------------------
         lincome |   .8802682    .7277832     1.21   0.227     -.548322    2.308858
            educ |  -.5014982    .1670772    -3.00   0.003    -.8294597   -.1735368
             age |   .7706936    .1601223     4.81   0.000      .456384    1.085003
           agesq |  -.0090228     .001743    -5.18   0.000    -.0124443   -.0056013
        lcigpric |  -.7508586    5.773343    -0.13   0.897    -12.08355    10.58183
        restaurn |  -2.825085    1.111794    -2.54   0.011    -5.007462   -.6427078
           _cons |  -3.639841    24.07866    -0.15   0.880    -50.90466    43.62497
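The t and confidence interval columns in the output above follow directly from each coefficient and its standard error. As a rough check, the short sketch below recomputes them for lcigpric and restaurn from the reported numbers. The function names are hypothetical, and the critical value 1.963 is an assumed approximation to the 97.5th percentile of a t distribution with 800 degrees of freedom; it is not taken from the output.

```python
def t_stat(coef, se):
    """t statistic for the null hypothesis that the coefficient is zero."""
    return coef / se

def conf_interval(coef, se, crit=1.963):
    """Two-sided 95% interval; crit is an assumed approximate t critical value."""
    half_width = crit * se
    return coef - half_width, coef + half_width

# Coefficient and standard error pairs copied from the cigs regression output
reported = {
    "lcigpric": (-0.7508586, 5.773343),
    "restaurn": (-2.825085, 1.111794),
}

for name, (coef, se) in reported.items():
    lower, upper = conf_interval(coef, se)
    print(f"{name}: t = {t_stat(coef, se):.2f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

An interval that contains zero, as for lcigpric, corresponds to a coefficient that is not significant at the 5% level; the interval for restaurn lies entirely below zero.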

(iii) Under what assumption is the income equation identified?

For identification there needs to be an equal number of equations and endogenous variables, and there needs to be at least one excluded exogenous variable for every endogenous variable. In the cigarette consumption equation we have log(cigpric) and restaurn, which we assume are both exogenous and which do not appear in the income equation. We can use these to identify the income equation, provided at least one of them actually affects cigarette consumption. We have two possible IVs for the endogenous variable cigs, so we are over identified and will use the first stage OLS combination of the two, as long as they are strong enough instruments in terms of their partial correlation with cigs. This means we should always look at the reduced form (first stage) regression. Simultaneous equations models (SEM) can also suffer from weak instruments, so vigilance is warranted.

(iv) Estimate the income equation by OLS and discuss the estimate of $\beta_1$.

    . regress lincome cigs educ age agesq

          Source |         SS     df          MS        F(4, 802)     =   39.61
    -------------+----------------------------------    Prob > F      =  0.0000
           Model |  67.5412888      4   16.8853222      R-squared     =  0.1650
        Residual |  341.854549    802   .426252555      Adj R-squared =  0.1608
    -------------+----------------------------------    Root MSE      =  .65288
           Total |  409.395838    806   .507935283

         lincome |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    -------------+-----------------------------------------------------------------
            cigs |   .0017306    .0017137     1.01   0.313    -.0016333    .0050945
            educ |   .0603606    .0078983     7.64   0.000     .0448567    .0758645
             age |   .0576908    .0076436     7.55   0.000      .042687    .0726946
           agesq |  -.0006306    .0000834    -7.56   0.000    -.0007943   -.0004669
           _cons |   7.795444    .1704271    45.74   0.000     7.460908    8.129979

$\beta_1$ is the coefficient on cigarettes. The estimate implies that one more cigarette smoked per day is associated with an increase in income of approximately 0.17% (100 × 0.00173), and it is not statistically significant (p = 0.313). We would expect a negative sign: as the number of cigarettes increases we would expect a decline in health, and in income due to fewer hours worked. A likely reason this is not the case here is the endogeneity problem: cigs is correlated with the error term of the income equation, which makes the OLS estimator biased and inconsistent.

(v) Estimate the reduced form for cigs. Are log(cigpric) and restaurn significant?

    . regress cigs lincome educ age agesq lcigpric restaurn

          Source |         SS     df          MS        F(6, 800)     =    7.42
    -------------+----------------------------------    Prob > F      =  0.0000
           Model |  8003.02506      6   1333.83751      R-squared     =  0.0527
        Residual |  143750.658    800   179.688322      Adj R-squared =  0.0456
    -------------+----------------------------------    Root MSE      =  13.405
           Total |  151753.683    806   188.280003

            cigs |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    -------------+-----------------------------------------------------------------
         lincome |   .8802682    .7277832     1.21   0.227     -.548322    2.308858
            educ |  -.5014982    .1670772    -3.00   0.003    -.8294597   -.1735368
             age |   .7706936    .1601223     4.81   0.000      .456384    1.085003
           agesq |  -.0090228     .001743    -5.18   0.000    -.0124443   -.0056013
        lcigpric |  -.7508586    5.773343    -0.13   0.897    -12.08355    10.58183
        restaurn |  -2.825085    1.111794    -2.54   0.011    -5.007462   -.6427078
           _cons |  -3.639841    24.07866    -0.15   0.880    -50.90466    43.62497

(Note that the regression reported here includes lincome. Strictly speaking, the reduced form for cigs regresses cigs on the exogenous variables only: educ, age, agesq, lcigpric and restaurn.)

log(cigpric) is not significant at any conventional level, with a p-value of 0.897, which suggests it is not a strong instrument for cigs. This makes intuitive sense: because cigarettes are highly addictive, we do not expect price to have a large effect on consumption. restaurn is significant at the 5% level, with a t-statistic of -2.54 (p = 0.011), so it looks like a reasonable instrument to use for cigs in the income equation. I would not have expected restaurant smoking restrictions to have such a large effect on cigarette consumption; however, the results suggest otherwise.

(vi) Now, estimate the income equation by 2SLS. Discuss how the estimate of $\beta_1$ compares with the OLS estimate.

    . ivregress 2sls lincome educ age agesq (cigs = lcigpric restaurn)

    Instrumental variables (2SLS) regression          Number of obs =     807
                                                      Wald chi2(4)  =   89.80
                                                      Prob > chi2   =  0.0000
                                                      R-squared     =       .
                                                      Root MSE      =  .87723

         lincome |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
    -------------+-----------------------------------------------------------------
            cigs |  -.0421257    .0261371    -1.61   0.107    -.0933535     .009102
            educ |   .0396746    .0162305     2.44   0.015     .0078633    .0714859
             age |   .0938182    .0237794     3.95   0.000     .0472115    .1404249
           agesq |  -.0010508    .0002735    -3.84   0.000    -.0015868   -.0005148
           _cons |   7.780893    .2291541    33.95   0.000      7.33176    8.230027

    Instrumented:  cigs
    Instruments:   educ age agesq lcigpric restaurn

$\hat{\beta}_{1(2SLS)} = -0.0421$, which is now negative and has a lower p-value (0.107) than the original OLS estimate $\hat{\beta}_1 = 0.0017$ (p = 0.313), although it is still not significant at the 10% level. The negative sign is the expected relationship between cigarette consumption and income: one more cigarette per day is associated with roughly 4.2% lower income. This suggests that the OLS estimate is biased upwards.

Note on Stage 2 of Example 1: that regression is the original equation we want to estimate, with education replaced by its first stage estimate, which uses the best linear combination of the exogenous variables. The regression no longer has an endogeneity problem because $\widehat{educ}_i$ is an exogenous term and unrelated to the error.
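The two-stage procedure used in both examples can also be carried out by hand. The following is a minimal sketch on simulated data, using only the Python standard library: the variable names, the coefficients and the helper functions `solve` and `ols` are assumptions made for illustration and are not the data or code behind the Stata output above. An unobserved confounder `a` (playing the role of ability) makes `x` endogenous, and `z1` and `z2` are two valid instruments.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """OLS coefficients via the normal equations; each row already contains the constant."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data-generating process with a true slope of 1.0
random.seed(1)
n = 4000
z1 = [random.gauss(0, 1) for _ in range(n)]
z2 = [random.gauss(0, 1) for _ in range(n)]
a = [random.gauss(0, 1) for _ in range(n)]   # unobserved, ends up in the error term
x = [0.8 * p + 0.6 * q + c + random.gauss(0, 1) for p, q, c in zip(z1, z2, a)]
y = [2.0 + 1.0 * xi + c + random.gauss(0, 1) for xi, c in zip(x, a)]

# Naive OLS of y on x: inconsistent because x is correlated with the error
X = [[1.0, xi] for xi in x]
b_ols = ols(X, y)

# Stage 1: regress x on all exogenous variables and form the fitted values
Z = [[1.0, p, q] for p, q in zip(z1, z2)]
pi_hat = ols(Z, x)
x_hat = [sum(c * v for c, v in zip(pi_hat, row)) for row in Z]

# Stage 2: regress y on the fitted values
X_hat = [[1.0, xh] for xh in x_hat]
b_2sls = ols(X_hat, y)

# Closed form (X_hat'X)^{-1} X_hat'y, which should agree with the two-step result
xhx = [[sum(rh[i] * r[j] for rh, r in zip(X_hat, X)) for j in range(2)] for i in range(2)]
xhy = [sum(rh[i] * yi for rh, yi in zip(X_hat, y)) for i in range(2)]
b_closed = solve(xhx, xhy)

print(f"OLS slope:  {b_ols[1]:.3f}")
print(f"2SLS slope: {b_2sls[1]:.3f} (closed form: {b_closed[1]:.3f})")
```

In this setup the OLS slope should sit well above the true value of 1.0, while the 2SLS slope should be close to it. Note that the standard errors from a manually run second stage are not valid, because they ignore the fact that the fitted values were estimated; dedicated routines such as ivregress report corrected standard errors.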