30C00200 Econometrics 7) Endogeneity Timo Kuosmanen Professor, Ph.D. http://nomepre.net/index.php/timokuosmanen
Today's topics
- Common types of endogeneity
  - Simultaneity
  - Omitted variables
  - Measurement errors
- Introduction to instrumental variables
Assumptions <-> properties

Property                     Required assumptions
Finite sample properties
  Unbiasedness               Exogeneity
  Efficiency                 Exogeneity, No autocorrelation, Homoscedasticity
Asymptotic properties
  Consistency                Exogeneity, No autocorrelation
  Asymptotic normality       Exogeneity, No autocorrelation, Homoscedasticity
OLS estimator of slope β_1

$b_1 = \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x})\,\varepsilon_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \beta_1 + \frac{\mathrm{Est.Cov}(x, \varepsilon)}{\mathrm{Est.Var}(x)}$

Interpretation: the OLS estimator b_1 is equal to the true parameter β_1 plus an error term.
- The greater the sample covariance of x and ε, the greater the error in b_1
- The greater the sample variance of x, the smaller the error in b_1
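The decomposition above is an exact in-sample identity, which can be checked numerically. The following is a minimal sketch on simulated data; the parameter values, sample size, and helper names (`mean`, `cov`) are assumptions chosen for illustration, not part of the lecture.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    # Sample covariance; cov(a, a) is the sample variance.
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

rng = random.Random(0)
n = 200
beta0, beta1 = 1.0, 2.0  # assumed true parameters (illustrative)

x = [rng.gauss(0, 1) for _ in range(n)]
eps = [rng.gauss(0, 1) for _ in range(n)]
y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, eps)]

b1 = cov(x, y) / cov(x, x)                       # OLS slope estimate
decomposition = beta1 + cov(x, eps) / cov(x, x)  # true slope + error term

print(b1, decomposition)  # the two numbers agree up to rounding error
```

In real data ε is unobserved, so the error term cannot be computed; the simulation only makes the identity visible.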
Classic examples of endogeneity
Exogeneity: Cov(x, ε) = 0
MLR.4 Zero Conditional Mean: E[ε | x] = 0
Examples of violations (endogeneity):
- Simultaneity bias
  - Example 1: supply and demand functions are not identifiable from the observed data on prices and quantities.
  - Example 2: a firm takes its productivity (ε) into account when choosing its inputs x.
- Omitted variable bias
  - An omitted explanatory factor, absorbed into ε, is correlated with x.
- Measurement errors in regressors x
  - Both the proxy variable and the disturbance contain the measurement error, and are hence correlated.
Simultaneity in demand equation
Suppose we want to estimate the demand function D(p), given time series of observed market prices p and demanded quantities q.
Problem: in market equilibrium, price is determined endogenously such that supply equals demand. From the observed equilibrium points alone, we cannot distinguish shifts of the demand function from shifts of the supply function.
[Figure: demand curves D1, D2 and supply curves S1, S2 in the (q, p) plane, with the fitted regression line.]
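The identification problem can be illustrated with a small simulation. This is a sketch under assumed linear demand and supply curves with independent shocks of equal variance; all coefficients and names are hypothetical. Regressing q on p recovers neither the demand slope (-b_d) nor the supply slope (+b_s); with equal shock variances the fitted slope tends to (b_s - b_d) / 2.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

rng = random.Random(1)
n = 5000
a_d, b_d = 10.0, 1.0  # assumed demand: q = a_d - b_d * p + u
a_s, b_s = 2.0, 0.5   # assumed supply: q = a_s + b_s * p + w

prices, quantities = [], []
for _ in range(n):
    u = rng.gauss(0, 1)  # demand shock
    w = rng.gauss(0, 1)  # supply shock
    p = (a_d - a_s + u - w) / (b_d + b_s)  # price that clears the market
    q = a_s + b_s * p + w                  # traded quantity at that price
    prices.append(p)
    quantities.append(q)

# Slope of the regression of observed quantity on observed price
slope = cov(prices, quantities) / cov(prices, prices)
print(slope)  # a mixture of the two slopes, not an estimate of either
```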
Simultaneity in production function
Consider the production model by Cobb and Douglas (1928)
$\ln y_i = \beta_0 + \beta_1 \ln x_{1i} + \beta_2 \ln x_{2i} + \varepsilon_i$
where y_i is the output of firm i and x_{1i}, x_{2i} are the inputs (e.g., labor, capital) of firm i.
In this model, exp(ε_i) can be interpreted as the productivity of firm i.
Marschak and Andrews (1944) argue that rational firm managers can learn the productivity level of their firm over time, and adjust their inputs x accordingly.
If input choices depend on ε_i, then the exogeneity assumption is violated.
This is a common problem in observational data where rational decision makers choose the x-variables endogenously.
Omitted variable bias
Suppose the underlying data generating process is governed by the equation
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$
However, suppose we ignore x_{2i} and estimate the equation
$y_i = \beta_0 + \beta_1 x_{1i} + u_i$, where $u_i = \beta_2 x_{2i} + \varepsilon_i$.
Applying the rules of covariance, we have
$b_1 = \beta_1 + \frac{\mathrm{Est.Cov}(x_1, u)}{\mathrm{Est.Var}(x_1)} = \beta_1 + \beta_2 \frac{\mathrm{Est.Cov}(x_1, x_2)}{\mathrm{Est.Var}(x_1)} + \frac{\mathrm{Est.Cov}(x_1, \varepsilon)}{\mathrm{Est.Var}(x_1)}$
Omitted variable bias
$b_1 = \beta_1 + \beta_2 \frac{\mathrm{Est.Cov}(x_1, x_2)}{\mathrm{Est.Var}(x_1)} + \frac{\mathrm{Est.Cov}(x_1, \varepsilon)}{\mathrm{Est.Var}(x_1)}$
We see that omitting x_2 does not cause bias if x_1 and x_2 are uncorrelated (or if β_2 = 0).
The bias is equal to
$E(b_1) - \beta_1 = \beta_2 \frac{E[\mathrm{Est.Cov}(x_1, x_2)]}{E[\mathrm{Est.Var}(x_1)]} = \beta_2 \frac{\mathrm{Cov}(x_1, x_2)}{\mathrm{Var}(x_1)}$
Note that the direction of the bias depends on the sign of β_2 and that of the correlation of x_1 and x_2.
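The bias formula can be checked on simulated data. A minimal sketch, assuming illustrative parameter values and a design in which Cov(x_1, x_2) / Var(x_1) = 0.5, so that the short regression slope should be close to β_1 + 0.5 β_2:

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

rng = random.Random(2)
n = 5000
beta0, beta1, beta2 = 1.0, 2.0, 3.0  # assumed true parameters (illustrative)

x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + rng.gauss(0, 1) for a in x1]  # x2 is correlated with x1
eps = [rng.gauss(0, 1) for _ in range(n)]
y = [beta0 + beta1 * a + beta2 * b + e for a, b, e in zip(x1, x2, eps)]

# Short regression: y on x1 only, x2 omitted
b1_short = cov(x1, y) / cov(x1, x1)

# Right-hand side of the decomposition on the slide
predicted = (beta1
             + beta2 * cov(x1, x2) / cov(x1, x1)
             + cov(x1, eps) / cov(x1, x1))

print(b1_short, predicted)           # identical up to rounding error
print(beta1 + beta2 * 0.5)           # population value the estimate approaches
```

Flipping the sign of either `beta2` or the 0.5 coefficient in `x2` flips the direction of the bias.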
Measurement errors - examples
- Self-reported data (survey responses) are subjective, and may be biased
  - Example: condition of apartment (poor, satisfactory, good)
- Human errors in data processing
  - Example: typing errors
- In many situations, we need to use imperfect proxy variables to capture effects that are difficult or impossible to measure
  - Example: IQ test score as proxy for ability
Model of measurement errors
True but unobserved variable: y_i
Measurement error represented by random variable v_i, with $E[v] = 0$ and $\mathrm{Var}[v] = \sigma_v^2$
Observed proxy: $q_i = y_i + v_i$
Effect of measurement errors in y
Assume the correct model is: $y_i = \beta_1 + \beta_2 x_i + \varepsilon_i$
Suppose we use proxy q as the dependent variable instead of y.
Noting that $y_i = q_i - v_i$, we have
$q_i - v_i = \beta_1 + \beta_2 x_i + \varepsilon_i$
Using proxy q_i as the dependent variable, the model becomes
$q_i = \beta_1 + \beta_2 x_i + (\varepsilon_i + v_i)$
Effect of measurement errors in y
Using proxy q as the dependent variable, the model is
$q_i = \beta_1 + \beta_2 x_i + (\varepsilon_i + v_i)$
- Interpretation: measurement error in the dependent variable is just an additional source of error attributed to the disturbance ε
  - Recall: this was one reason for introducing the disturbance term ε in the first place
- Assuming x and v are statistically independent (uncorrelated), measurement errors in y do not affect the statistical properties of the OLS estimator
  - OLS remains unbiased, efficient, and consistent
- Note: errors v increase the variance of the disturbance term. The larger Var(v), the larger the standard errors of the OLS coefficients.
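A small Monte Carlo experiment illustrates both points: the slope estimate stays centered on the true value when q replaces y, but its sampling spread grows. This is a sketch with assumed parameter values, an assumed measurement error standard deviation of 2, and hypothetical helper names.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

def slope(x, y):
    return cov(x, y) / cov(x, x)

def sd(v):
    return cov(v, v) ** 0.5

rng = random.Random(3)
beta1, beta2 = 1.0, 2.0  # intercept and slope, in the slide's notation (assumed values)
n, reps = 100, 500

slopes_y, slopes_q = [], []
for _ in range(reps):
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta1 + beta2 * xi + rng.gauss(0, 1) for xi in x]  # true dependent variable
    q = [yi + rng.gauss(0, 2) for yi in y]                  # proxy with error v
    slopes_y.append(slope(x, y))
    slopes_q.append(slope(x, q))

print(mean(slopes_y), sd(slopes_y))  # centered near beta2
print(mean(slopes_q), sd(slopes_q))  # still centered near beta2, larger spread
```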
Effect of measurement errors in x
Assume next that the dependent variable y is free of error. However, the explanatory variable x is subject to error:
True regressor: S_i
Random measurement error: v_i
Imperfect proxy: $x_i = S_i + v_i$
Effect of measurement errors in x
Assume the true model is: $y_i = \beta_0 + \beta_1 S_i + \varepsilon_i$
Suppose we use a proxy x as the explanatory variable instead of S.
Assume $x_i = S_i + v_i$, where v is random measurement error.
Noting that $S_i = x_i - v_i$, we have
$y_i = \beta_0 + \beta_1 (x_i - v_i) + \varepsilon_i = \beta_0 + \beta_1 x_i + (\varepsilon_i - \beta_1 v_i)$
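Effect of measurement errors in x
OLS estimator for slope β_1:
$b_1 = \frac{\mathrm{Est.Cov}(x, y)}{\mathrm{Est.Var}(x)}$
Inserting x = S + v and y = β_0 + β_1 S + ε, and ignoring the sample covariances between S, v, and ε (which are zero in expectation), we find that
$b_1 = \frac{\mathrm{Est.Cov}(S + v, y)}{\mathrm{Est.Var}(S + v)} \approx \beta_1 \frac{\mathrm{Est.Var}(S)}{\mathrm{Est.Var}(S) + \mathrm{Est.Var}(v)} = \beta_1 - \beta_1 \frac{\mathrm{Est.Var}(v)}{\mathrm{Est.Var}(S) + \mathrm{Est.Var}(v)}$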
Effect of measurement errors in x
Expected value:
$E(b_1) = \beta_1 - \beta_1 \frac{\mathrm{Var}(v)}{\mathrm{Var}(S) + \mathrm{Var}(v)}$
- Since variances are non-negative, the ratio Var(v) / (Var(S) + Var(v)) lies between 0 and 1, so the measurement error v makes the OLS estimator biased towards zero
  - For positive β_1, the OLS estimator is downward biased
  - For negative β_1, the OLS estimator is upward biased
- The bias does not disappear as n increases: the OLS estimator is statistically inconsistent.
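The attenuation towards zero can be seen in a simulation with a large sample. A minimal sketch, assuming Var(S) = Var(v) = 1 and β_1 = 2, so that the formula above gives a limiting slope of 2 - 2 * 1 / (1 + 1) = 1; all values and names are illustrative.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

rng = random.Random(4)
n = 20000
beta0, beta1 = 1.0, 2.0  # assumed true parameters (illustrative)

S = [rng.gauss(0, 1) for _ in range(n)]         # true regressor, Var(S) = 1
v = [rng.gauss(0, 1) for _ in range(n)]         # measurement error, Var(v) = 1
x = [si + vi for si, vi in zip(S, v)]           # observed proxy
y = [beta0 + beta1 * si + rng.gauss(0, 1) for si in S]

b1_true_regressor = cov(S, y) / cov(S, S)  # infeasible: uses the unobserved S
b1_proxy = cov(x, y) / cov(x, x)           # feasible: uses the noisy proxy x

attenuated = beta1 - beta1 * 1.0 / (1.0 + 1.0)  # value implied by the formula
print(b1_true_regressor, b1_proxy, attenuated)
```

Increasing `n` tightens `b1_proxy` around the attenuated value rather than around `beta1`, which is what inconsistency means here.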
Measurement errors as a source of endogeneity problem
Measurement error in the regressor is another example of a violation of the exogeneity assumption Cov(x, ε) = 0:
$y_i = \beta_0 + \beta_1 x_i + (\varepsilon_i - \beta_1 v_i) = \beta_0 + \beta_1 (S_i + v_i) + (\varepsilon_i - \beta_1 v_i)$
- The measurement error v is included in both the regressor x and the disturbance term, so the exogeneity assumption fails
- The OLS estimator is biased and inconsistent
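The correlation between the proxy and the composite disturbance can be made concrete. If S, v, and ε are mutually uncorrelated, then Cov(x, ε - β_1 v) = -β_1 Var(v). The sketch below checks this on simulated data; the values β_1 = 2 and Var(v) = 0.25 are assumptions for illustration, giving a covariance near -0.5.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

rng = random.Random(5)
n = 20000
beta1 = 2.0  # assumed true slope (illustrative)

S = [rng.gauss(0, 1) for _ in range(n)]      # true regressor
v = [rng.gauss(0, 0.5) for _ in range(n)]    # measurement error, Var(v) = 0.25
eps = [rng.gauss(0, 1) for _ in range(n)]    # original disturbance

x = [si + vi for si, vi in zip(S, v)]                   # observed proxy
composite = [ei - beta1 * vi for ei, vi in zip(eps, v)]  # disturbance when x replaces S

cov_x_disturbance = cov(x, composite)
print(cov_x_disturbance, -beta1 * 0.25)  # sample value vs. -beta1 * Var(v)
```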
Conclusions: Effects of measurement errors
- Measurement errors in the dependent variable y can be harmlessly attributed to the disturbance term
  - Still, precise data are preferred, as errors in y increase the variance and the standard errors of the OLS estimator
- Measurement errors in the explanatory variables x are more problematic:
  - The OLS estimator is biased toward zero
  - The OLS estimator is inconsistent
Standard econometric solution to endogeneity: instrumental variables
Suppose regressor x correlates with the disturbance term ε.
We can try to find an instrumental variable z_i that is
- Highly correlated with x_i: Cov(x_i, z_i) >> 0
- Uncorrelated with the disturbance ε_i: Cov(ε_i, z_i) = 0
Example: hedonic model of housing market Good condition of the apartment may influence the price, but it is difficult to measure the condition objectively. The data provides a self-reported evaluation (poor, satisfactory, good), which may be biased.
Regression Statistics
Multiple R            0,869856
R Square              0,75665
Adjusted R Square     0,752007
Standard Error        53672,36
Observations          642

ANOVA
              df    SS          MS          F           Significance F
Regression    12    5,63E+12    4,69E+11    162,9793    6,3E-184
Residual      629   1,81E+12    2,88E+09
Total         641   7,45E+12

                    Coefficients   Standard Error   t Stat      P-value     Lower 95%   Upper 95%
Intercept           57230,46       16729,81         3,420867    0,000665    24377,41    90083,5
Size m2             4246,643       227,7986         18,64209    4,42E-62    3799,305    4693,981
Bedrooms            -35271,4       5328,539         -6,61934    7,74E-11    -45735,3    -24807,5
Age                 -2732,05       146,6195         -18,6336    4,9E-62     -3019,97    -2444,12
Elevator            5714,083       5077,285         1,125421    0,26084     -4256,4     15684,56
1st floor           -8235,33       5811,223         -1,41714    0,156936    -19647,1    3176,413
Top floor           9726,951       5449,142         1,785043    0,074736    -973,762    20427,66
Loc. Espoonlahti    -15838,1       11255,1          -1,40719    0,159864    -37940,2    6263,998
Loc. Kivenlahti     -18496,3       8562,531         -2,16014    0,031139    -35310,9    -1681,65
Loc. Leppävaara     -4865,67       6511,45          -0,74725    0,455193    -17652,5    7921,141
Loc. Olari          354,1232       8057,83          0,043948    0,96496     -15469,4    16177,63
Loc. Tapiola        129275,4       7124,512         18,14515    1,71E-59    115284,7    143266,1
Condition 1,2,3     16726,68       4626,302         3,615561    0,000324    7641,81     25811,54
Instrumental variable approach
Step 1: Find additional instruments (e.g., energy class, sauna) to explain the stated condition.
$\text{Condition}_i = \beta_1 + \beta_2\,\text{EnergyClass}_i + \beta_3\,\text{Sauna}_i + \beta_4 x_i + \dots + \varepsilon_i$
Apply the estimated model to form a prediction for condition:
$\text{PredictedCondition}_i = b_1 + b_2\,\text{EnergyClass}_i + b_3\,\text{Sauna}_i + b_4 x_i + \dots$
Regression Statistics
Multiple R            0,443
R Square              0,196
Adjusted R Square     0,179
Standard Error        0,462
Observations          642,000

ANOVA
              df         SS         MS       F         Significance F
Regression    13,000     32,705     2,516    11,768    0,000
Residual      628,000    134,257    0,214
Total         641,000    166,961

                    Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept           2,923          0,089            32,850    0,000     2,748       3,098
Sauna               0,031          0,053            0,572     0,567     -0,074      0,135
Energy class A-C    -0,096         0,086            -1,109    0,268     -0,265      0,074
Size m2             -0,001         0,002            -0,724    0,469     -0,005      0,002
Bedrooms            0,037          0,046            0,792     0,429     -0,054      0,128
Age                 -0,011         0,001            -7,656    0,000     -0,014      -0,008
Elevator            0,091          0,044            2,080     0,038     0,005       0,176
1st floor           -0,022         0,050            -0,433    0,665     -0,120      0,077
Top floor           0,017          0,047            0,362     0,717     -0,075      0,109
Loc. Espoonlahti    0,047          0,097            0,483     0,629     -0,144      0,238
Loc. Kivenlahti     -0,009         0,074            -0,123    0,903     -0,155      0,136
Loc. Leppävaara     -0,020         0,057            -0,349    0,727     -0,132      0,092
Loc. Olari          0,084          0,069            1,210     0,227     -0,052      0,220
Loc. Tapiola        0,047          0,062            0,766     0,444     -0,074      0,168
RESIDUAL OUTPUT
Observation   Predicted Condition 1,2,3   Residuals
1             2,4790                      -0,4790
2             2,4316                      0,5684
3             2,4806                      -0,4806
4             2,4431                      -0,4431
5             2,4439                      0,5561
6             2,4949                      -1,4949
7             2,5245                      0,4755
8             2,5736                      -0,5736
9             2,3231                      -0,3231
10            2,4944                      0,5056
11            2,5331                      -0,5331
12            2,5037                      -0,5037
13            2,3949                      0,6051
14            2,6470                      -0,6470
15            2,4500                      -0,4500
Instrumental variable approach
Step 2: Use the predicted condition in the original regression model instead of the stated condition
$\text{Price}_i = \beta_1 + \beta_2 x_{2i} + \dots + \beta_K\,\text{PredictedCondition}_i + \varepsilon_i$
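The two steps can be sketched on simulated data. This is a simplified illustration, not the housing regression from the slides: it assumes a single endogenous regressor x, a single instrument z, no additional control variables, and made-up parameter values. An unobserved factor `u` enters both x and the disturbance, so plain OLS is biased, while the two-step estimate recovers the true slope.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

def ols(x, y):
    # Simple regression of y on x; returns (intercept, slope).
    b = cov(x, y) / cov(x, x)
    return mean(y) - b * mean(x), b

rng = random.Random(6)
n = 20000
beta0, beta1 = 1.0, 2.0  # assumed true parameters (illustrative)

z = [rng.gauss(0, 1) for _ in range(n)]  # instrument: affects x, unrelated to eps
u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved factor causing endogeneity
x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
eps = [2.0 * ui + rng.gauss(0, 1) for ui in u]  # disturbance correlated with x
y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, eps)]

# Plain OLS of y on x: biased because Cov(x, eps) != 0
_, b_ols = ols(x, y)

# Step 1: regress x on the instrument and form the prediction
a1, g1 = ols(z, x)
x_hat = [a1 + g1 * zi for zi in z]

# Step 2: regress y on the predicted x instead of x itself
_, b_iv = ols(x_hat, y)

print(b_ols, b_iv)  # b_ols is pulled away from beta1; b_iv is close to it
```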
Regression Statistics
Multiple R            0,867774
R Square              0,753032
Adjusted R Square     0,748321
Standard Error        54069,82
Observations          642

ANOVA
              df    SS          MS          F           Significance F
Regression    12    5,61E+12    4,67E+11    159,8243    6,3E-182
Residual      629   1,84E+12    2,92E+09
Total         641   7,45E+12

                              Coefficients   Standard Error   t Stat      P-value     Lower 95%   Upper 95%
Intercept                     626423,5       271822,8         2,304529    0,021518    92633,54    1160213
Size m2                       3926,654       275,5462         14,25044    3,9E-40     3385,552    4467,756
Bedrooms                      -26620,9       6768,741         -3,93292    9,33E-05    -39913      -13328,9
Age                           -4960,2        1072,248         -4,62598    4,53E-06    -7065,82    -2854,58
Elevator                      23590,85       9938,087         2,373782    0,017906    4075,007    43106,69
1st floor                     -12667,3       6223,733         -2,03533    0,042236    -24889,2    -445,532
Top floor                     12876,1        5691,009         2,262534    0,024005    1700,427    24051,78
Loc. Espoonlahti              -7596,31       11999,68         -0,63304    0,526936    -31160,6    15967,98
Loc. Kivenlahti               -21160         8718,879         -2,42692    0,015507    -38281,7    -4038,39
Loc. Leppävaara               -7501,13       6678,863         -1,12312    0,261817    -20616,7    5614,435
Loc. Olari                    16238,13       11100,13         1,462877    0,144001    -5559,68    38035,93
Loc. Tapiola                  137428,2       8161,734         16,83811    8,59E-53    121400,7    153455,7
Predicted Condition 1,2,3     -177624        92752,38         -1,91504    0,055941    -359766     4517,712
Next time Wed 30 Sept Topic: Instrumental variables