30C00200 Econometrics
8) Instrumental variables
Timo Kuosmanen, Professor, Ph.D.
http://nomepre.net/index.php/timokuosmanen
Today's topics
- Theory of IV regression
- Overidentification
- Two-stage least squares (2SLS)
- Testing for endogeneity: weak instruments, Hausman test
Examples of instrumental variables
In the case of measurement error, the instrument could be another measurement (or proxy) for the unobserved x.
Example: twins study of returns to education
- x is the years of schooling self-reported by the respondent
- z is the respondent's years of schooling as reported by the respondent's twin brother or sister
In time series and panel data models, past values of x observed in previous periods are frequently used as instruments.
IV estimator
Assume the regression model is: y = β1 + β2 x + ε
However, the exogeneity assumption Cov(ε, x) = 0 is violated. Examples: measurement error in x, an omitted variable in ε.
Assume we have an instrument z that is
- highly correlated with the endogenous x: Cov(x, z) >> 0
- uncorrelated with the disturbance ε: Cov(ε, z) = 0
IV estimator
Recall the OLS estimator of the slope β2:

$$b_2^{OLS} = \frac{\widehat{\operatorname{Cov}}(x,y)}{\widehat{\operatorname{Var}}(x)} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$

The instrumental variable (IV) estimator:

$$b_2^{IV} = \frac{\widehat{\operatorname{Cov}}(z,y)}{\widehat{\operatorname{Cov}}(z,x)} = \frac{\sum_{i=1}^{n}(z_i-\bar{z})(y_i-\bar{y})}{\sum_{i=1}^{n}(z_i-\bar{z})(x_i-\bar{x})}$$
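The two covariance-ratio formulas can be checked numerically. Below is a minimal numpy sketch using simulated data; the data-generating process, variable names, and seed are illustrative assumptions, not from the slides. The regressor x is made endogenous on purpose, so OLS is biased while IV recovers the true slope.

```python
import numpy as np

# Hypothetical simulated data: y = 1 + 2x + eps, where x is endogenous
# (Cov(x, eps) != 0) and z is a valid instrument (Cov(z, eps) = 0).
rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)           # instrument
u = rng.normal(size=n)           # common shock creating endogeneity
x = z + u                        # Cov(x, z) = 1: strong instrument
eps = u + rng.normal(size=n)     # Cov(x, eps) = 1, Cov(z, eps) = 0
y = 1 + 2 * x + eps

xc, yc, zc = x - x.mean(), y - y.mean(), z - z.mean()
b_ols = (xc @ yc) / (xc @ xc)    # Est.Cov(x,y) / Est.Var(x)
b_iv = (zc @ yc) / (zc @ xc)     # Est.Cov(z,y) / Est.Cov(z,x)
print(b_ols, b_iv)               # OLS is biased toward 2.5; IV is near 2
```

In this setup the OLS probability limit is 2 + Cov(x, ε)/Var(x) = 2.5, so the bias is visible even in large samples, while the IV estimate converges to the true β2 = 2.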
IV estimator
The instrumental variable (IV) estimator can be rewritten as:

$$b_2^{IV} = \frac{\widehat{\operatorname{Cov}}(z,y)}{\widehat{\operatorname{Cov}}(z,x)} = \frac{\widehat{\operatorname{Cov}}(z,\ \beta_1+\beta_2 x+\varepsilon)}{\widehat{\operatorname{Cov}}(z,x)} = \frac{\beta_2\,\widehat{\operatorname{Cov}}(z,x) + \widehat{\operatorname{Cov}}(z,\varepsilon)}{\widehat{\operatorname{Cov}}(z,x)} = \beta_2 + \frac{\widehat{\operatorname{Cov}}(z,\varepsilon)}{\widehat{\operatorname{Cov}}(z,x)}$$

Since we assumed Cov(z, ε) = 0, the expected value is $E(b_2^{IV}) = \beta_2$. The IV estimator is consistent.
Variance of IV estimator
Variance of the IV estimator:

$$\operatorname{Var}(b_2^{IV}) = \frac{\operatorname{Var}(\varepsilon)}{(n-1)\operatorname{Var}(x)\, r_{zx}^2}$$

The precision of the IV estimator improves if
- the variance of the disturbance ε decreases
- the sample size n increases
- the variance of the regressor x increases
- the correlation (r_zx) of the regressor x and the instrument z increases
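The role of the instrument-regressor correlation is easy to see by plugging numbers into the variance formula. A small sketch (the function name and input values are illustrative assumptions):

```python
def iv_slope_variance(var_eps, n, var_x, r_zx):
    """Variance of the IV slope estimator, Var(eps) / ((n-1) Var(x) r_zx^2)."""
    return var_eps / ((n - 1) * var_x * r_zx ** 2)

# Halving the instrument-regressor correlation quadruples the variance,
# because r_zx enters the denominator squared:
v_strong = iv_slope_variance(var_eps=1.0, n=101, var_x=1.0, r_zx=1.0)
v_weak = iv_slope_variance(var_eps=1.0, n=101, var_x=1.0, r_zx=0.5)
print(v_weak / v_strong)  # → 4.0
```

This quadratic penalty is exactly why weak instruments (small r_zx) make IV estimates imprecise.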
OLS and IV as GMM estimators
The OLS residuals have the property

$$\sum_{i=1}^{n} x_i e_i^{OLS} = 0$$

Thus Est.Cov(x, e) = 0. This is the sample counterpart to the assumed population orthogonality condition Cov(x, ε) = 0.
Note: we can derive the OLS estimator directly from the sample orthogonality condition. Assume centered data where the sample averages of x and y equal zero, and assume the constant term is zero. Then

$$\sum_{i=1}^{n} x_i e_i^{OLS} = \sum_{i=1}^{n} x_i (y_i - b^{OLS} x_i) = \sum_{i=1}^{n} x_i y_i - b^{OLS}\sum_{i=1}^{n} x_i^2 = 0$$

$$\Rightarrow\; b^{OLS} = \widehat{\operatorname{Cov}}(x,y)\,/\,\widehat{\operatorname{Var}}(x)$$
OLS and IV as GMM estimators
Analogously, the IV estimator is based on the population orthogonality condition Cov(z, ε) = 0. We can derive the IV estimator from the sample orthogonality condition

$$\sum_{i=1}^{n} z_i e_i^{IV} = 0$$

$$\sum_{i=1}^{n} z_i (y_i - b^{IV} x_i) = \sum_{i=1}^{n} z_i y_i - b^{IV} \sum_{i=1}^{n} z_i x_i = 0 \;\Rightarrow\; b^{IV} = \widehat{\operatorname{Cov}}(z,y)\,/\,\widehat{\operatorname{Cov}}(z,x)$$

Both OLS and IV can be seen as special cases of the generalized method of moments (GMM).
IV regression in Stata
Two-stage least squares can be implemented in Stata using the command ivreg instead of the usual reg.
Syntax: . ivreg y x3 x4 (x2 = z1 z2)
Here x2 is the endogenous regressor and z1, z2 are its instruments; the exogenous regressors x3 and x4 are included among the instruments automatically.
In matrix form:
OLS: b = (X'X)^{-1} X'y
IV: b = (Z'X)^{-1} Z'y
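The two matrix formulas can be applied directly with numpy. A minimal sketch for a just-identified model (Z has the same number of columns as X); the simulated data and seed are illustrative assumptions:

```python
import numpy as np

# Hypothetical data: y = 1 + 2x + eps with x endogenous and z a valid
# instrument, as in the earlier covariance-based example.
rng = np.random.default_rng(1)
n = 50_000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = z + u
eps = u + rng.normal(size=n)
y = 1 + 2 * x + eps

X = np.column_stack([np.ones(n), x])   # regressors incl. constant
Z = np.column_stack([np.ones(n), z])   # instruments incl. constant

b_ols = np.linalg.solve(X.T @ X, X.T @ y)   # b = (X'X)^-1 X'y
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)    # b = (Z'X)^-1 Z'y
print(b_ols[1], b_iv[1])  # OLS slope biased upward, IV slope near 2
```

With the constant included in both X and Z, the matrix-form IV slope is numerically identical to the covariance-ratio formula from the earlier slide.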
Over-identification
Thus far, we assumed that there exists a single instrumental variable z that is highly correlated with x but uncorrelated with ε.
Examples of instrumental variables:
- alternative proxy variables
- past values x_{t-1}
If one useful instrument is available, then there are potentially more than just one. If the past value x_{t-1} is a good instrument for x_t, then x_{t-2}, x_{t-3}, ... are likely useful instruments as well.
Choosing just one of the many instruments would be an inefficient use of the available information.
Solution: the two-stage least squares (2SLS) method.
Two-stage least squares (2SLS)
Assume we have one endogenous regressor x in the model y = β1 + β2 x + ε.
Assume we have (L-1) instruments z2, z3, ..., zL for x.
Two-stage estimation procedure:
1) Regress by OLS: x = κ1 + κ2 z2 + κ3 z3 + ... + κL zL + v
   Save the fitted values: x* = k1 + k2 z2 + k3 z3 + ... + kL zL
2) Use the fitted values x* to estimate the original regression equation: y = β1 + β2 x* + ε
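The two stages above can be sketched in numpy for an over-identified case with two instruments. All variable names, the data-generating process, and the seed are illustrative assumptions:

```python
import numpy as np

# Hypothetical data: one endogenous regressor x, two instruments z2, z3.
rng = np.random.default_rng(2)
n = 20_000
z2, z3 = rng.normal(size=n), rng.normal(size=n)
u = rng.normal(size=n)
x = z2 + 0.5 * z3 + u            # first-stage relation plus noise
eps = u + rng.normal(size=n)     # endogeneity: Cov(x, eps) != 0
y = 1 + 2 * x + eps

Z = np.column_stack([np.ones(n), z2, z3])
# Stage 1: regress x on the instruments, save the fitted values x*
k = np.linalg.lstsq(Z, x, rcond=None)[0]
x_star = Z @ k
# Stage 2: regress y on a constant and x*
Xs = np.column_stack([np.ones(n), x_star])
b_2sls = np.linalg.lstsq(Xs, y, rcond=None)[0]
print(b_2sls[1])  # close to the true slope 2
```

Stage 1 purges x of the part correlated with ε, so stage 2 uses only the exogenous variation in x that the instruments explain.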
Two-stage least squares (2SLS)
Practical notes:
- If we have more than one endogenous regressor x, stage 1 can be done separately for each variable.
- Different endogenous regressors can be instrumented with different z variables.
- All exogenous regressors x are usually included among the instruments z.
- If OLS is used in the stepwise estimation, the standard errors of the stage-2 regression need to be adjusted. Stata does this automatically when ivreg is used.
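The last note is worth seeing in numbers: the stage-2 OLS standard errors are wrong because they are built from residuals computed with the fitted x*, whereas correct 2SLS residuals use the original x. A sketch with simulated data (DGP and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = z + u
eps = u + rng.normal(size=n)
y = 1 + 2 * x + eps

Z = np.column_stack([np.ones(n), z])
x_star = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # stage-1 fitted values
Xs = np.column_stack([np.ones(n), x_star])
b = np.linalg.lstsq(Xs, y, rcond=None)[0]           # stage-2 coefficients

# Naive stage-2 residual variance (wrong): residuals use the fitted x*
s2_naive = np.sum((y - Xs @ b) ** 2) / (n - 2)
# Correct 2SLS residual variance: residuals use the ORIGINAL x
X = np.column_stack([np.ones(n), x])
s2_correct = np.sum((y - X @ b) ** 2) / (n - 2)
print(s2_naive, s2_correct)  # the two differ clearly
```

Here the correct residual variance is close to Var(ε), while the naive one is inflated by the stage-1 fitting error; this is exactly the discrepancy that Stata's ivreg corrects automatically.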
Example: production function of electricity distribution networks
Assume a Cobb-Douglas production function: ln y_i = β0 + β1 L_i + β2 K_i + ε_i
Output y: ln Energy (GWh)
Inputs: L = ln OPEX, K = ln Krepl
Instrument for K: ln Knuse
OPEX = operational expenditure (incl. wages)
Krepl = capital stock (replacement value)
Knuse = capital stock (net use value)
Sample of 160 observations from the years 2011 and 2012.
CD function, direct OLS estimation

. regress lnenergy lnopex lnkrepl

      Source |       SS       df       MS              Number of obs =     160
-------------+------------------------------           F(  2,   157) =  911.83
       Model |  261.157887     2  130.578943           Prob > F      =  0.0000
    Residual |   22.483226   157  .143205261           R-squared     =  0.9207
-------------+------------------------------           Adj R-squared =  0.9197
       Total |  283.641113   159  1.78390637           Root MSE      =  .37842

    lnenergy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      lnopex |   .4460534   .1190976     3.75   0.000      .210813    .6812938
     lnkrepl |    .591481   .1152412     5.13   0.000     .3638579    .8191042
       _cons |  -4.696363   .4388221   -10.70   0.000    -5.563119   -3.829606
Two-stage least squares (2SLS)
Capital stock is hard to measure. Suppose our proxy for the capital stock K contains measurement error. In that case, the OLS estimator of the output elasticity of K is biased towards zero.
We have two alternative proxy measures of K: Krepl and Knuse.
Two-stage least squares:
Stage 1: Regress ln Krepl on ln Knuse and ln OPEX. Record the predicted ln Krepl (ln PrKrepl).
Stage 2: Regress ln Energy on ln OPEX and ln PrKrepl to estimate the production function of interest.
2SLS regression

. reg3 (lnkrepl = lnopex lnknuse) (lnenergy = lnopex lnkrepl), exog(lnopex) 2sls

Two-stage least-squares regression
----------------------------------------------------------------------
    Equation          Obs  Parms        RMSE    "R-sq"     F-Stat    P
----------------------------------------------------------------------
    lnkrepl           160      2    .1486906    0.9862    5624.20  0.0000
    lnenergy          160      2    .3923997    0.9148     858.94  0.0000
----------------------------------------------------------------------

             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnkrepl      |
      lnopex |   .2911237   .0407567     7.14   0.000     .2109328    .3713145
     lnknuse |   .6873963   .0377983    18.19   0.000     .6130263    .7617664
       _cons |   1.755602   .1187836    14.78   0.000      1.52189    1.989314
-------------+----------------------------------------------------------------
lnenergy     |
      lnopex |   .0456137   .1489348     0.31   0.760    -.2474226      .33865
     lnkrepl |   .9875145   .1451144     6.81   0.000     .7019949    1.273034
       _cons |  -6.052248   .5352627   -11.31   0.000    -7.105403   -4.999093
-------------------------------------------------------------------------------
Endogenous variables:  lnkrepl lnenergy
Exogenous variables:   lnopex lnknuse
IV (2SLS) regression

. ivregress 2sls lnenergy lnopex (lnkrepl = lnknuse lnopex)

Instrumental variables (2SLS) regression          Number of obs =     160
                                                  Wald chi2(2)  = 1750.71
                                                  Prob > chi2   =  0.0000
                                                  R-squared     =  0.9148
                                                  Root MSE      =   .3887

    lnenergy |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lnkrepl |   .9875145   .1437475     6.87   0.000     .7057744    1.269254
      lnopex |   .0456137   .1475319     0.31   0.757    -.2435436     .334771
       _cons |  -6.052248   .5302208   -11.41   0.000    -7.091462   -5.013034

Instrumented:  lnkrepl
Instruments:   lnopex lnknuse
IV (GMM) regression

. ivregress gmm lnenergy lnopex (lnkrepl = lnknuse)

Instrumental variables (GMM) regression           Number of obs =     160
                                                  Wald chi2(2)  = 1320.63
                                                  Prob > chi2   =  0.0000
                                                  R-squared     =  0.9148
GMM weight matrix: Robust                         Root MSE      =   .3887

             |              Robust
    lnenergy |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lnkrepl |   .9875145   .1540702     6.41   0.000     .6855425    1.289486
      lnopex |   .0456137   .1595625     0.29   0.775    -.2671231    .3583505
       _cons |  -6.052248   .5687178   -10.64   0.000    -7.166915   -4.937582

Instrumented:  lnkrepl
Instruments:   lnopex lnknuse
Testing for weak instruments
The F-test of the joint significance of the instruments in the stage-1 regression serves as a useful diagnostic test for weak instruments.
To avoid the problems caused by weak instruments (imprecise coefficient estimates), the coefficients of the stage-1 regression should be jointly significant: F-stat > F_crit. A common rule of thumb is a first-stage F statistic above 10.
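The first-stage F statistic is just the standard joint-significance F-test applied to the stage-1 regression. A numpy sketch with simulated data (DGP, coefficients, and seed are illustrative assumptions):

```python
import numpy as np

# First-stage F statistic for the joint significance of two instruments
# z2, z3 explaining one endogenous regressor x.
rng = np.random.default_rng(4)
n = 1_000
z2, z3 = rng.normal(size=n), rng.normal(size=n)
x = 0.8 * z2 + 0.5 * z3 + rng.normal(size=n)

Z = np.column_stack([np.ones(n), z2, z3])
b = np.linalg.lstsq(Z, x, rcond=None)[0]
rss = np.sum((x - Z @ b) ** 2)          # unrestricted RSS (with instruments)
tss = np.sum((x - x.mean()) ** 2)       # restricted RSS (intercept only)
q = 2                                   # number of instruments tested
F = ((tss - rss) / q) / (rss / (n - 3))
print(F)  # far above the rule-of-thumb threshold of 10: strong instruments
```

If this statistic were below the critical value (or below 10 by the rule of thumb), the IV estimates based on these instruments would be imprecise and potentially badly biased.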
Hausman test
Also referred to as the Durbin-Wu-Hausman test.
Rationale: it is not always clear whether endogeneity is a problem or not.
If the exogeneity assumption Cov(x, ε) = 0 holds, then
- the OLS estimator is unbiased and efficient
- the IV estimator is also consistent, but less efficient (OLS preferred)
However, if the exogeneity assumption Cov(x, ε) = 0 fails, then
- the OLS estimator is biased and inconsistent
- the IV estimator remains consistent (IV preferred)
Hausman test
H0: Cov(x, ε) = 0; OLS preferred
H1: Cov(x, ε) ≠ 0; IV preferred
Procedure:
- Estimate both the OLS and IV regressions.
- Compare the estimated coefficients b_OLS and b_IV and their standard errors.
- If H0 is true, the difference b_IV - b_OLS should be small (nonzero only due to the inefficiency of the IV estimator).
- If H0 is true, the Hausman statistic follows a chi-squared distribution with degrees of freedom equal to the number of endogenous regressors instrumented in the IV model.
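For a single instrumented slope, the Hausman statistic reduces to a squared coefficient difference scaled by the difference of the estimated variances. A sketch under a deliberately endogenous DGP (all data and names are illustrative assumptions, and the variance formulas assume homoskedasticity):

```python
import numpy as np

# Hausman statistic for the slope: H = (b_IV - b_OLS)^2 / (Var(b_IV) - Var(b_OLS))
rng = np.random.default_rng(5)
n = 20_000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = z + u                        # endogenous regressor
eps = u + rng.normal(size=n)
y = 1 + 2 * x + eps

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

# Estimated coefficient covariance matrices (homoskedastic formulas)
s2_ols = np.sum((y - X @ b_ols) ** 2) / (n - 2)
V_ols = s2_ols * np.linalg.inv(X.T @ X)
s2_iv = np.sum((y - X @ b_iv) ** 2) / (n - 2)
ZXinv = np.linalg.inv(Z.T @ X)
V_iv = s2_iv * ZXinv @ (Z.T @ Z) @ ZXinv.T

H = (b_iv[1] - b_ols[1]) ** 2 / (V_iv[1, 1] - V_ols[1, 1])
print(H)  # large value: reject H0 of exogeneity (chi2(1) critical value 3.84)
```

Since x is endogenous by construction here, b_IV and b_OLS diverge and the statistic is far above the chi-squared critical value, so the test correctly rejects exogeneity.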
Hausman test in Stata
Stata computes the Hausman test automatically:
- Run the IV and OLS regressions.
- Save the results with the command estimates store name
  Example: estimates store CostIV and estimates store CostOLS
- The Hausman test is conducted with the command hausman
  Example: hausman CostIV CostOLS, constant
Hausman test in Stata. hausman IV OLS Coefficients (b) (B) (b-b) sqrt(diag(v_b-v_b)) IV OLS Difference S.E. lnkrepl.9875145.591481.3960335.0859234 lnopex.0456137.4460534 -.4004397.0870714 b = consistent under Ho and Ha; obtained from ivregress B = inconsistent under Ha, efficient under Ho; obtained from regress Test: Ho: difference in coefficients not systematic chi2(2) = (b-b)'[(v_b-v_b)^(-1)](b-b) = 21.24 Prob>chi2 = 0.0000
Next time Mon 5 Oct Topic: Autocorrelation