Lecture 14. More on using dummy variables (dealing with seasonality)


1 Lecture 14. More on using dummy variables (dealing with seasonality). More things to worry about: measurement error in variables, which can lead to bias in OLS (endogeneity).

2 We have seen that dummy variables are useful when we are interested in measuring average differences between discrete groups, and for policy evaluation (interaction terms lead to the difference-in-differences estimator). Now we see how dummy variables can be used to deal with seasonality in data.


4 Using Dummy Variables to Capture Seasonality in Data

Can also use dummy variables to pick out and control for seasonal variation in data. The idea is to include a set of dummy variables for each quarter (or month, or day), which will then net out the average change in a variable resulting from any seasonal fluctuations:

Y_t = b0 + b1Q1 + b2Q2 + b3Q3 + b4X + u_t

where the quarterly dummy Q1 = 1 if the observation belongs to the 1st quarter of the year (Jan-Mar), = 0 otherwise. Hence the coefficient on Q1 gives the level of Y in the 1st quarter of the year relative to the constant (the Q4 level of Y), averaged over all Q1 observations in the data set.

Series net of seasonal effects are said to be seasonally adjusted.

10 It may also be useful to model an economic series as a combination of a seasonal and a trend component:

Y_t = b0 + b1Q1 + b2Q2 + b3Q3 + b4Trend + u_t

where Trend = 1 in year 1, = 2 in year 2, ..., = T in year T.

Since dY_t/dTrend = b4, and the coefficient measures the unit change in Y for a unit change in the trend variable (the units of measurement in this case being years), the trend term in the model above measures the annual change in the Y variable net of any seasonal influences.
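As a sketch of how the quarterly dummies and trend can be constructed in Stata (using the variable names quart, year and time that appear in the accidents data below; Y is a placeholder for the series being modelled):

. gen Q1 = (quart == 1) /* = 1 for Jan-Mar observations, 0 otherwise */
. gen Q2 = (quart == 2)
. gen Q3 = (quart == 3) /* Q4 is left out as the default category */
. sort time
. gen Trend = year - year[1] + 1 /* = 1 in the first year, 2 in the second, ..., T in year T */
. reg Y Q1 Q2 Q3 Trend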


18 In 2000 the UK Department of Transport announced that: "By 2010 we want to achieve, compared with the average for the baseline years: a 40% reduction in the number of people killed or seriously injured in road accidents; a 50% reduction in the number of children killed or seriously injured; and a 10% reduction in the slight casualty rate, expressed as the number of people slightly injured per 100 million vehicle kilometres." Did they reach this target?

19 The data set accidents.dta (on the course web site) contains quarterly information on the number of road accidents in the UK from 1983 to 2006.

. twoway (line acc time, xline(2000))

[Graph: total road accidents, DoT quarterly data, plotted against time]

The graph shows that road accidents vary more within than between years. We can see the seasonal influence from a regression of the number of accidents on 3 dummy variables (one for each quarter, minus the default category, which is the 4th quarter).

20 . list acc year quart time Q1 Q2 Q3 Q4, clean

[Listing omitted]

A regression of road accident numbers on the quarterly dummies (Q4 = winter is the default, given by the constant term, which is the average number of accidents in the 4th quarter) shows accidents are significantly less likely to happen outside the fourth quarter (October-December). On average there are 14,539 fewer accidents in the first quarter of the year than in the last.

. reg acc Q1 Q2 Q3

[Regression output omitted]

Saving the residual values after netting out the influence of the seasons is the basis for the production of seasonally adjusted data (a better guide to the underlying trend), used in many official government statistics. Can get a sense of how this works with the following commands after a regression:

. predict rhat, resid /* saves the residuals in a new variable with the name rhat */
. twoway (line rhat time, xline(2000))

21 [Graph: residuals plotted against time]

Can see the seasonality is reduced and the trend is much clearer. The graph of the residuals is much smoother than the original series, as it should be, since much of the seasonality has been taken out by the dummy variables. The graph also shows that, once seasonality is accounted for, there is little evidence of a change in the number of road accidents over time until the year 2000.

To model both seasonal and trend components of an economic series, simply include both seasonal dummies and a time trend in the regression model:

Y_t = b0 + b1Q1 + b2Q2 + b3Q3 + b4TREND + u_t

. reg logacc Q1 Q2 Q3 year

[Regression output omitted]

22 [Coefficient table omitted]

Can see that there is a downward trend in road accidents (of around 400 a year over the whole sample period) net of any seasonality. Could also use dummy variable interactions to test whether this trend is stronger after 2000. How?

Can also use seasonal dummy variables to check whether an apparent association between variables is in fact caused by seasonality in the data.

. reg acc du

[Regression output omitted]

The regression suggests a negative association between the change in the unemployment rate and the level of accidents (a 1 percentage point rise in the unemployment rate leads to a fall in the number of accidents of 4,104, if this regression is to be believed). Might this be in part because seasonal movements in both data series are influencing the results? (The unemployment rate also varies seasonally, and is typically higher in q1 of each year.)

. reg acc du q2-q4

[Regression output omitted]

23 [Coefficient table omitted]

Can see that if we add quarterly seasonal dummy variables, the apparent effect of unemployment disappears.

24 Measurement Error

Often a data set will contain imperfect measures of the data we would ideally like.

Aggregate data (GDP, Consumption, Investment): only best guesses of their theoretical counterparts, and frequently revised by government statisticians (so earlier estimates must have been subject to error).

Survey data (income, health, age): individuals often lie, forget or round to the nearest large number (£102 a week, or £100?). For example, in payperiod.dta:

. hist grossam if pyperiod==1 & grossam<1000 & grossam>0 & grsp==1, bin(100) xline( )

Proxy data (ability, intelligence, permanent income): difficult to agree on a definition, let alone measure.

29 Measurement Error in the Dependent Variable

True: y* = b0 + b1x + u (1)
Observe: y = y* + e (2)

ie the dependent variable is measured with error e, and e is a random residual term just like u, so E(e) = 0.

Sub. (2) into (1), using y* = y - e:

y - e = b0 + b1x + u

Take the error term on the left to the other side:

y = b0 + b1x + u + e
y = b0 + b1x + v where v = u + e (3)

37 It is OK to estimate

y = b0 + b1x + v where v = u + e (3)

by OLS, since:

E(u) = E(e) = 0 (just random residuals, so the mean is zero)

Cov(X,u) = 0 (no correlation between the original X variable and the original error term)

and also Cov(X,e) = 0 (nothing to suggest the X variable is correlated with the measurement error in the dependent variable).

So OLS estimates are unbiased in this case, but standard errors are larger than they would be in the absence of measurement error, with the associated problems of inference (Type II error).

44 True model: y = b0 + b1x + u, for which

Var(β̂1) = σ²u / (N·Var(X)) (A)

Estimated model: y = b0 + b1x + v, for which

Var(β̃1) = σ²v / (N·Var(X))

But v = u + e, and so σ²v = σ²u + σ²e (using the rules on covariances).

[Since var(v) = var(u+e) = var(u) + var(e) + 2cov(e,u), and if we assume the things that cause measurement error in y are unrelated to the residual u, then cov(e,u) = 0, so var(v) = var(u) + var(e).]

Hence

Var(β̃1) = (σ²u + σ²e) / (N·Var(X)) (B) > Var(β̂1) = σ²u / (N·Var(X))

So the residual variance in the presence of measurement error in the dependent variable now also contains an additional contribution from the error in the y variable, σ²e. Standard errors are therefore larger in models where there is measurement error in the Y variable, and the bigger the measurement error, the larger the standard errors, the lower the t (and F) values, and the greater the risk of Type II error (failing to reject a false null).
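A small simulation sketch in Stata illustrates the point (all names and parameter values here are invented for illustration): the slope estimate stays centred on the true value when noise is added to y, but its standard error grows.

. clear
. set obs 500
. set seed 12345
. gen x = rnormal()
. gen ystar = 1 + 2*x + rnormal() /* true model: b0 = 1, b1 = 2 */
. gen yobs = ystar + 3*rnormal() /* observed y = true y + measurement error e */
. reg ystar x /* unbiased, smaller standard errors */
. reg yobs x /* still centred on 2, but larger standard errors */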

54 Measurement Error in an Explanatory Variable

True: y = b0 + b1X* + u (1)
Observe: X = X* + w (2)

ie the right hand side variable is measured with error (w).

Sub. (2) into (1), ie use the fact that X* = X - w:

y = b0 + b1(X - w) + u
y = b0 + b1X - b1w + u
y = b0 + b1X + v (3)

where now v = u - b1w (so the residual term again consists of 2 components).

Hence (3) is the basis for OLS estimation. Does this matter?

64 In the (2 variable) model, we know that OLS implies

β̂1 = Cov(X,y) / Var(X) = Cov(X, b0 + b1X + u) / Var(X) = b1 + Cov(X,u) / Var(X) (4)

(just sub. in for y and cancel terms)

Since by assumption Cov(X,u) = 0, it follows from (4) that E(β̂1) = b1 and OLS is unbiased.

But in the presence of measurement error we estimate

y = b0 + b1X + v, not y = b0 + b1X* + u

and now

Cov(X,v) = Cov(X* + w, -b1w + u) (sub. in for X and v using (2) & (3))

74 Expanding terms using the rules on covariances:

Cov(X* + w, -b1w + u) = Cov(X*,u) + Cov(X*,-b1w) + Cov(w,u) + Cov(w,-b1w)

u and w are independent errors (caused by different factors), so there is no reason to expect them to be correlated with each other or with the value of X* (this means any error in X should not depend on the level of X), so

Cov(w,u) = Cov(X*,u) = Cov(X*,-b1w) = 0

This leaves

Cov(X,v) = Cov(w,-b1w) = -b1·Cov(w,w) = -b1·Var(w)

Hence Cov(X,v) ≠ 0

85 In other words, there is now a correlation between the X variable and the error term in (3). So if we estimate (3) by OLS, this violates the main assumption needed to get an unbiased estimate: we need the covariance between the explanatory variable and the residual to be zero (see earlier notes), and this is not the case when an explanatory variable is measured with error.

(4) becomes

β̂1 = Cov(X,y) / Var(X) = b1 + Cov(X,v) / Var(X) = b1 - b1·Var(w) / Var(X)

so E(β̂1) ≠ b1, and OLS gives biased estimates in the presence of measurement error in the explanatory variable.

88 Not only that: one can show that OLS estimates are always biased toward zero (attenuation bias):

if b1 > 0 then β̂1_OLS < b1
if b1 < 0 then β̂1_OLS > b1

ie closer to zero in both cases.
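A matching simulation sketch for error in X (again with invented names and values) shows the attenuation: with Var(w) = Var(X*), the OLS slope tends to b1·Var(X*)/(Var(X*)+Var(w)) = 1, half the true value of 2.

. clear
. set obs 500
. set seed 54321
. gen xstar = rnormal()
. gen xobs = xstar + rnormal() /* observed x = true x + measurement error w */
. gen y = 1 + 2*xstar + rnormal() /* true slope b1 = 2 */
. reg y xstar /* slope close to 2 */
. reg y xobs /* slope biased toward zero, roughly 1 here */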

91 The problem is that Cov(X,v) ≠ 0, ie the residual and the right hand side variable are correlated. In such situations the X variable is said to be endogenous.

92 Example of the Consequences of Measurement Error in Data

This example uses artificially generated measurement error to illustrate the basic issues.

. list /* list observations in data set */

[Listing omitted: variables y_, y_observ, x_, x_observ]

The data set measerr.dta gives the (unobserved) true values of y and x and their observed counterparts (the 1st 5 observations on x underestimate the true value by 20 and the last 5 overestimate it by 20).

First look at the regression of true y on true x:

. reg y_ x_

[Regression output omitted]

Now look at the consequence of measurement error in the dependent variable:

. reg y_obs x_

[Regression output omitted]

Consequence: coefficients virtually identical (unbiased), but standard errors larger, and hence t values smaller and confidence intervals wider.

Measurement error in the explanatory variable:

. reg y_ x_obs

[Regression output omitted]

Consequence: both coefficients biased, and the slope coefficient is biased toward zero (0.45 compared with 0.60, ie we underestimate the effect by 25%). The intercept is biased upward (compare 50.1 with 25.0).

The problem is that Cov(X,u) ≠ 0, ie the residual and the right hand side variable are correlated. In such situations the X variable is said to be endogenous.

94 Solution? Get better data.

If that is not possible, do something to get round the problem: replace the variable causing the correlation with the residual with one that is not so correlated, but that at the same time is still related to the original variable.

Any variable that has these 2 properties is called an Instrumental Variable.

98 Philip Wright (credited with the first use of instrumental variable estimation, in a 1928 study of supply and demand).

99 More formally, an instrument Z for the variable of concern X satisfies:

1) Cov(X,Z) ≠ 0 (correlated with the problem variable)

2) Cov(Z,u) = 0 (but uncorrelated with the residual, so it does not suffer from measurement error and is also not correlated with any unobservable factors influencing the dependent variable)

103 Instrumental variable (IV) estimation proceeds as follows.

Given a model

y = b0 + b1X + u (1)

multiply (1) by the instrument Z:

Zy = Zb0 + b1ZX + Zu

It follows that

Cov(Z,y) = Cov(Z, b0 + b1X + u) = Cov(Z,b0) + b1·Cov(Z,X) + Cov(Z,u)

Since Cov(Z,b0) = 0 (using the rules on the covariance of a constant) and Cov(Z,u) = 0 (if the assumption above about the properties of instruments is correct), then

Cov(Z,y) = 0 + b1·Cov(Z,X) + 0

112 Solving Cov(Z,y) = 0 + b1·Cov(Z,X) + 0 for b1 gives the formula used to calculate the instrumental variable estimator:

b1_IV = Cov(Z,y) / Cov(Z,X)

(compare with b1_OLS = Cov(X,y) / Var(X))

In the presence of measurement error (or endogeneity in general) the IV estimate is unbiased in large samples (but may be biased in small samples); technically, the IV estimator is said to be consistent, while the OLS estimator is inconsistent IN THE PRESENCE OF ENDOGENEITY, which makes IV a useful estimation technique to employ.
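As a sketch of how this formula can be checked in Stata with one endogenous variable and one instrument (y, x and z are placeholder names; ivregress is the modern command, while older Stata versions, as used later in these notes, have ivreg):

. quietly corr z y, cov
. scalar czy = r(cov_12) /* Cov(Z,y) */
. quietly corr z x, cov
. scalar czx = r(cov_12) /* Cov(Z,X) */
. display "b1_IV by hand = " czy/czx
. ivregress 2sls y (x = z) /* should reproduce the same slope estimate */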

117 However, one can show that (in the 2 variable case) the variance of the IV estimator is given by

Var(β̂1_IV) = [s² / (N·Var(X))] · (1/r²XZ)

where r²XZ is the square of the correlation coefficient between the endogenous variable and the instrument

(compared with OLS: Var(β̂1_OLS) = s² / (N·Var(X)))

Since 0 < r²XZ < 1, IV estimation is less precise (less efficient) than OLS estimation. May sometimes want to trade off bias against efficiency.

120 Where to find an instrument?

121 Lecture 15. Potential solution to endogeneity: instrumental variable estimation. Tests for endogeneity. Other sources of endogeneity. Problems with weak instruments.

122 The problem with measurement error in X variables is that it makes Cov(X,u) ≠ 0, ie the residual and the right hand side variable are correlated. In such situations the X variable is said to be endogenous, and OLS will be biased toward zero (and inconsistent) in this case.

123 Solution? Get better data. If that is not possible, do something to get round the problem: replace the variable causing the correlation with the residual with one that is not so correlated, but that at the same time is still related to the original variable. Any variable that has these 2 properties is called an Instrumental Variable. More formally, an instrument Z for the variable of concern X satisfies: 1) Cov(X,Z) ≠ 0 (correlated with the problem variable); 2) Cov(Z,u) = 0 (but uncorrelated with the residual, so it does not suffer from measurement error and is also not correlated with any unobservable factors influencing the dependent variable).

124 So b1_IV = Cov(Z,y) / Cov(Z,X) (compare with b1_OLS = Cov(X,y) / Var(X)). In the presence of measurement error (or endogeneity in general) the IV estimate is unbiased in large samples (but may be biased in small samples); technically, the IV estimator is consistent while the OLS estimator is inconsistent, which makes IV a useful estimation technique to employ.

125 However, one can show that (in the 2 variable case) the variance of the IV estimator is given by

Var(β̂1_IV) = [s² / (N·Var(X))] · (1/r²XZ)

where r²XZ is the square of the correlation coefficient between the endogenous variable and the instrument

(compared with OLS: Var(β̂1_OLS) = s² / (N·Var(X)))

So IV estimation is less precise (less efficient) than OLS estimation, since r²XZ < 1; but the greater the correlation between X and Z (r²XZ must be above zero to satisfy the first requirement of an instrument), the smaller is Var(β̂1_IV) (and hence the lower the standard errors and the higher the t values).

126 So why not ensure that the correlation between X and the instrument Z is as high as possible?

If X and Z are perfectly correlated, then Z must also be correlated with u and so suffers the same problems as X: the initial problem is not solved.

Conversely, if the correlation between the endogenous variable and the instrument is small, there are also problems. Since we can always write the IV estimator as

b1_IV = Cov(Z,y) / Cov(Z,X)

sub. in for y = b0 + b1X + u:

b1_IV = Cov(Z, b0 + b1X + u) / Cov(Z,X) = [Cov(Z,b0) + b1·Cov(Z,X) + Cov(Z,u)] / Cov(Z,X)

133 so

b1_IV = b1 + Cov(Z,u) / Cov(Z,X)

So if Cov(Z,X) is small, then the IV estimate can be a long way from the true value b1 (any small correlation between Z and u is divided by a small number).

So: always check the extent of the correlation between X and Z before any IV estimation (see later, and the sketch below).

In large samples you can have as many instruments as you like, though finding good ones is a different matter. In large samples more is better; in small samples a minimum number of instruments is better (bias in small samples increases with the number of instruments).

Where to find good instruments? Difficult: the appropriate instrument will vary depending on the issue under study.
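One quick diagnostic is the first-stage regression of the endogenous variable on the instrument. A sketch in Stata, with x and z as placeholder names (the F > 10 threshold is a common rule of thumb, not something derived in these notes):

. corr x z /* simple correlation between endogenous variable and instrument */
. reg x z /* first-stage regression */
. test z /* first-stage F statistic: a common rule of thumb treats F < 10 as a weak instrument */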

141 In the case of measurement error, we could use the rank of X as an instrument (ie order the variable X by size and use the number of the order rather than the actual value). The rank is clearly correlated with the original value, but because it is a rank it should not be affected by the measurement error. (Though this assumes that the measurement error is not so large as to affect the ordering of the X variable.)

. egen rankx=rank(x_obs) /* stata command to create the ranking of x_observ */
. list x_obs rankx

[Listing omitted: rankx ranks from the smallest observed x to the largest]

Now do instrumental variable estimates using rankx as the instrument for x_obs:

. ivreg y_t (x_ob=rankx)

Instrumental variables (2SLS) regression

[Regression output omitted]

Can see both estimated coefficients are a little closer to their true values than the estimates from the regression with measurement error (but not by much): in this case the rank of X is not a very good instrument. Note that the standard error in the instrumented regression is larger than the standard error in the regression of y_ on x_observ, as expected with IV estimation.

145 Testing for Endogeneity

It is good practice to compare OLS and IV estimates. If the estimates are very different, this may be a sign that things are amiss.

Using the idea that IV estimation will always be (asymptotically) unbiased, whereas OLS will only be unbiased if Cov(X,u) = 0, we can do the following:

Wu-Hausman Test for Endogeneity

1. Given y = b0 + b1X + u (A), regress the endogenous variable X on the instrument(s) Z:

X = d0 + d1Z + v (B)

and save the residuals v̂.

149 2. Include this residual as an extra term in the original model: ie given y = b0 + b1X + u, estimate

y = b0 + b1X + b2·v̂ + e

and test whether b2 = 0 (using a t test). If b2 = 0, conclude there is no correlation between X and u, ie X can be treated as exogenous.
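A sketch of the two steps in Stata (y, x and z are placeholder names):

. reg x z /* step 1: regress the endogenous variable on the instrument(s) */
. predict vhat, resid /* save the residuals v-hat */
. reg y x vhat /* step 2: add the residual to the original model */
. test vhat /* test of b2 = 0: rejection is evidence that X is endogenous */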


More information

Fixed and Random Effects Models: Vartanian, SW 683

Fixed and Random Effects Models: Vartanian, SW 683 : Vartanian, SW 683 Fixed and random effects models See: http://teaching.sociology.ul.ie/dcw/confront/node45.html When you have repeated observations per individual this is a problem and an advantage:

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

Heteroskedasticity. (In practice this means the spread of observations around any given value of X will not now be constant)

Heteroskedasticity. (In practice this means the spread of observations around any given value of X will not now be constant) Heteroskedasticity Occurs when the Gauss Markov assumption that the residual variance is constant across all observations in the data set so that E(u 2 i /X i ) σ 2 i (In practice this means the spread

More information

Testing methodology. It often the case that we try to determine the form of the model on the basis of data

Testing methodology. It often the case that we try to determine the form of the model on the basis of data Testing methodology It often the case that we try to determine the form of the model on the basis of data The simplest case: we try to determine the set of explanatory variables in the model Testing for

More information

ECON Introductory Econometrics. Lecture 16: Instrumental variables

ECON Introductory Econometrics. Lecture 16: Instrumental variables ECON4150 - Introductory Econometrics Lecture 16: Instrumental variables Monique de Haan (moniqued@econ.uio.no) Stock and Watson Chapter 12 Lecture outline 2 OLS assumptions and when they are violated Instrumental

More information

Introduction to Econometrics

Introduction to Econometrics Introduction to Econometrics STAT-S-301 Panel Data (2016/2017) Lecturer: Yves Dominicy Teaching Assistant: Elise Petit 1 Regression with Panel Data A panel dataset contains observations on multiple entities

More information

Lecture 4: Multivariate Regression, Part 2

Lecture 4: Multivariate Regression, Part 2 Lecture 4: Multivariate Regression, Part 2 Gauss-Markov Assumptions 1) Linear in Parameters: Y X X X i 0 1 1 2 2 k k 2) Random Sampling: we have a random sample from the population that follows the above

More information

Outline. Nature of the Problem. Nature of the Problem. Basic Econometrics in Transportation. Autocorrelation

Outline. Nature of the Problem. Nature of the Problem. Basic Econometrics in Transportation. Autocorrelation 1/30 Outline Basic Econometrics in Transportation Autocorrelation Amir Samimi What is the nature of autocorrelation? What are the theoretical and practical consequences of autocorrelation? Since the assumption

More information

Please discuss each of the 3 problems on a separate sheet of paper, not just on a separate page!

Please discuss each of the 3 problems on a separate sheet of paper, not just on a separate page! Econometrics - Exam May 11, 2011 1 Exam Please discuss each of the 3 problems on a separate sheet of paper, not just on a separate page! Problem 1: (15 points) A researcher has data for the year 2000 from

More information

Heteroskedasticity. Occurs when the Gauss Markov assumption that the residual variance is constant across all observations in the data set

Heteroskedasticity. Occurs when the Gauss Markov assumption that the residual variance is constant across all observations in the data set Heteroskedasticity Occurs when the Gauss Markov assumption that the residual variance is constant across all observations in the data set Heteroskedasticity Occurs when the Gauss Markov assumption that

More information

Nonrecursive Models Highlights Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised April 6, 2015

Nonrecursive Models Highlights Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised April 6, 2015 Nonrecursive Models Highlights Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised April 6, 2015 This lecture borrows heavily from Duncan s Introduction to Structural

More information

Contest Quiz 3. Question Sheet. In this quiz we will review concepts of linear regression covered in lecture 2.

Contest Quiz 3. Question Sheet. In this quiz we will review concepts of linear regression covered in lecture 2. Updated: November 17, 2011 Lecturer: Thilo Klein Contact: tk375@cam.ac.uk Contest Quiz 3 Question Sheet In this quiz we will review concepts of linear regression covered in lecture 2. NOTE: Please round

More information

Introduction to Econometrics. Regression with Panel Data

Introduction to Econometrics. Regression with Panel Data Introduction to Econometrics The statistical analysis of economic (and related) data STATS301 Regression with Panel Data Titulaire: Christopher Bruffaerts Assistant: Lorenzo Ricci 1 Regression with Panel

More information

Econometrics Homework 1

Econometrics Homework 1 Econometrics Homework Due Date: March, 24. by This problem set includes questions for Lecture -4 covered before midterm exam. Question Let z be a random column vector of size 3 : z = @ (a) Write out z

More information

The multiple regression model; Indicator variables as regressors

The multiple regression model; Indicator variables as regressors The multiple regression model; Indicator variables as regressors Ragnar Nymoen University of Oslo 28 February 2013 1 / 21 This lecture (#12): Based on the econometric model specification from Lecture 9

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 5 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 44 Outline of Lecture 5 Now that we know the sampling distribution

More information

ECON3150/4150 Spring 2016

ECON3150/4150 Spring 2016 ECON3150/4150 Spring 2016 Lecture 4 - The linear regression model Siv-Elisabeth Skjelbred University of Oslo Last updated: January 26, 2016 1 / 49 Overview These lecture slides covers: The linear regression

More information

Lecture 24: Partial correlation, multiple regression, and correlation

Lecture 24: Partial correlation, multiple regression, and correlation Lecture 24: Partial correlation, multiple regression, and correlation Ernesto F. L. Amaral November 21, 2017 Advanced Methods of Social Research (SOCI 420) Source: Healey, Joseph F. 2015. Statistics: A

More information

The Simple Linear Regression Model

The Simple Linear Regression Model The Simple Linear Regression Model Lesson 3 Ryan Safner 1 1 Department of Economics Hood College ECON 480 - Econometrics Fall 2017 Ryan Safner (Hood College) ECON 480 - Lesson 3 Fall 2017 1 / 77 Bivariate

More information

Econometrics II Censoring & Truncation. May 5, 2011

Econometrics II Censoring & Truncation. May 5, 2011 Econometrics II Censoring & Truncation Måns Söderbom May 5, 2011 1 Censored and Truncated Models Recall that a corner solution is an actual economic outcome, e.g. zero expenditure on health by a household

More information

Empirical Application of Panel Data Regression

Empirical Application of Panel Data Regression Empirical Application of Panel Data Regression 1. We use Fatality data, and we are interested in whether rising beer tax rate can help lower traffic death. So the dependent variable is traffic death, while

More information

Lecture 4: Multivariate Regression, Part 2

Lecture 4: Multivariate Regression, Part 2 Lecture 4: Multivariate Regression, Part 2 Gauss-Markov Assumptions 1) Linear in Parameters: Y X X X i 0 1 1 2 2 k k 2) Random Sampling: we have a random sample from the population that follows the above

More information

Lab 11 - Heteroskedasticity

Lab 11 - Heteroskedasticity Lab 11 - Heteroskedasticity Spring 2017 Contents 1 Introduction 2 2 Heteroskedasticity 2 3 Addressing heteroskedasticity in Stata 3 4 Testing for heteroskedasticity 4 5 A simple example 5 1 1 Introduction

More information

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals (SW Chapter 5) Outline. The standard error of ˆ. Hypothesis tests concerning β 3. Confidence intervals for β 4. Regression

More information

Lecture 8: Instrumental Variables Estimation

Lecture 8: Instrumental Variables Estimation Lecture Notes on Advanced Econometrics Lecture 8: Instrumental Variables Estimation Endogenous Variables Consider a population model: y α y + β + β x + β x +... + β x + u i i i i k ik i Takashi Yamano

More information

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Estimation - Theory Department of Economics University of Gothenburg December 4, 2014 1/28 Why IV estimation? So far, in OLS, we assumed independence.

More information

Week 3: Simple Linear Regression

Week 3: Simple Linear Regression Week 3: Simple Linear Regression Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ALL RIGHTS RESERVED 1 Outline

More information

Statistical Modelling in Stata 5: Linear Models

Statistical Modelling in Stata 5: Linear Models Statistical Modelling in Stata 5: Linear Models Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 07/11/2017 Structure This Week What is a linear model? How good is my model? Does

More information

Introduction to Econometrics

Introduction to Econometrics Introduction to Econometrics STAT-S-301 Introduction to Time Series Regression and Forecasting (2016/2017) Lecturer: Yves Dominicy Teaching Assistant: Elise Petit 1 Introduction to Time Series Regression

More information

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012 Problem Set #6: OLS Economics 835: Econometrics Fall 202 A preliminary result Suppose we have a random sample of size n on the scalar random variables (x, y) with finite means, variances, and covariance.

More information

Statistical Inference with Regression Analysis

Statistical Inference with Regression Analysis Introductory Applied Econometrics EEP/IAS 118 Spring 2015 Steven Buck Lecture #13 Statistical Inference with Regression Analysis Next we turn to calculating confidence intervals and hypothesis testing

More information

Econometrics Review questions for exam

Econometrics Review questions for exam Econometrics Review questions for exam Nathaniel Higgins nhiggins@jhu.edu, 1. Suppose you have a model: y = β 0 x 1 + u You propose the model above and then estimate the model using OLS to obtain: ŷ =

More information

Lecture 19. Common problem in cross section estimation heteroskedasticity

Lecture 19. Common problem in cross section estimation heteroskedasticity Lecture 19 Learning to worry about and deal with stationarity Common problem in cross section estimation heteroskedasticity What is it Why does it matter What to do about it Stationarity Ultimately whether

More information

. regress lchnimp lchempi lgas lrtwex befile6 affile6 afdec6 t

. regress lchnimp lchempi lgas lrtwex befile6 affile6 afdec6 t BOSTON COLLEGE Department of Economics EC 228 Econometrics, Prof. Baum, Ms. Yu, Fall 2003 Problem Set 7 Solutions Problem sets should be your own work. You may work together with classmates, but if you

More information

Outline. 11. Time Series Analysis. Basic Regression. Differences between Time Series and Cross Section

Outline. 11. Time Series Analysis. Basic Regression. Differences between Time Series and Cross Section Outline I. The Nature of Time Series Data 11. Time Series Analysis II. Examples of Time Series Models IV. Functional Form, Dummy Variables, and Index Basic Regression Numbers Read Wooldridge (2013), Chapter

More information

Lecture (chapter 13): Association between variables measured at the interval-ratio level

Lecture (chapter 13): Association between variables measured at the interval-ratio level Lecture (chapter 13): Association between variables measured at the interval-ratio level Ernesto F. L. Amaral April 9 11, 2018 Advanced Methods of Social Research (SOCI 420) Source: Healey, Joseph F. 2015.

More information

Instrumental Variables, Simultaneous and Systems of Equations

Instrumental Variables, Simultaneous and Systems of Equations Chapter 6 Instrumental Variables, Simultaneous and Systems of Equations 61 Instrumental variables In the linear regression model y i = x iβ + ε i (61) we have been assuming that bf x i and ε i are uncorrelated

More information

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data July 2012 Bangkok, Thailand Cosimo Beverelli (World Trade Organization) 1 Content a) Classical regression model b)

More information

Ordinary Least Squares Regression Explained: Vartanian

Ordinary Least Squares Regression Explained: Vartanian Ordinary Least Squares Regression Eplained: Vartanian When to Use Ordinary Least Squares Regression Analysis A. Variable types. When you have an interval/ratio scale dependent variable.. When your independent

More information

LECTURE 10. Introduction to Econometrics. Multicollinearity & Heteroskedasticity

LECTURE 10. Introduction to Econometrics. Multicollinearity & Heteroskedasticity LECTURE 10 Introduction to Econometrics Multicollinearity & Heteroskedasticity November 22, 2016 1 / 23 ON PREVIOUS LECTURES We discussed the specification of a regression equation Specification consists

More information

Lecture 8: Functional Form

Lecture 8: Functional Form Lecture 8: Functional Form What we know now OLS - fitting a straight line y = b 0 + b 1 X through the data using the principle of choosing the straight line that minimises the sum of squared residuals

More information

ECON Introductory Econometrics. Lecture 17: Experiments

ECON Introductory Econometrics. Lecture 17: Experiments ECON4150 - Introductory Econometrics Lecture 17: Experiments Monique de Haan (moniqued@econ.uio.no) Stock and Watson Chapter 13 Lecture outline 2 Why study experiments? The potential outcome framework.

More information

5.2. a. Unobserved factors that tend to make an individual healthier also tend

5.2. a. Unobserved factors that tend to make an individual healthier also tend SOLUTIONS TO CHAPTER 5 PROBLEMS ^ ^ ^ ^ 5.1. Define x _ (z,y ) and x _ v, and let B _ (B,r ) be OLS estimator 1 1 1 1 ^ ^ ^ ^ from (5.5), where B = (D,a ). Using the hint, B can also be obtained by 1 1

More information

Lecture 3: Multivariate Regression

Lecture 3: Multivariate Regression Lecture 3: Multivariate Regression Rates, cont. Two weeks ago, we modeled state homicide rates as being dependent on one variable: poverty. In reality, we know that state homicide rates depend on numerous

More information

Econometrics. 7) Endogeneity

Econometrics. 7) Endogeneity 30C00200 Econometrics 7) Endogeneity Timo Kuosmanen Professor, Ph.D. http://nomepre.net/index.php/timokuosmanen Today s topics Common types of endogeneity Simultaneity Omitted variables Measurement errors

More information

Final Exam. Question 1 (20 points) 2 (25 points) 3 (30 points) 4 (25 points) 5 (10 points) 6 (40 points) Total (150 points) Bonus question (10)

Final Exam. Question 1 (20 points) 2 (25 points) 3 (30 points) 4 (25 points) 5 (10 points) 6 (40 points) Total (150 points) Bonus question (10) Name Economics 170 Spring 2004 Honor pledge: I have neither given nor received aid on this exam including the preparation of my one page formula list and the preparation of the Stata assignment for the

More information

REED TUTORIALS (Pty) LTD ECS3706 EXAM PACK

REED TUTORIALS (Pty) LTD ECS3706 EXAM PACK REED TUTORIALS (Pty) LTD ECS3706 EXAM PACK 1 ECONOMETRICS STUDY PACK MAY/JUNE 2016 Question 1 (a) (i) Describing economic reality (ii) Testing hypothesis about economic theory (iii) Forecasting future

More information

Econometrics I KS. Module 1: Bivariate Linear Regression. Alexander Ahammer. This version: March 12, 2018

Econometrics I KS. Module 1: Bivariate Linear Regression. Alexander Ahammer. This version: March 12, 2018 Econometrics I KS Module 1: Bivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: March 12, 2018 Alexander Ahammer (JKU) Module 1: Bivariate

More information

ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics

ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 1. Introduction 2. The Sharp RD Design 3.

More information

Problem Set #3-Key. wage Coef. Std. Err. t P> t [95% Conf. Interval]

Problem Set #3-Key. wage Coef. Std. Err. t P> t [95% Conf. Interval] Problem Set #3-Key Sonoma State University Economics 317- Introduction to Econometrics Dr. Cuellar 1. Use the data set Wage1.dta to answer the following questions. a. For the regression model Wage i =

More information

Functional Form. So far considered models written in linear form. Y = b 0 + b 1 X + u (1) Implies a straight line relationship between y and X

Functional Form. So far considered models written in linear form. Y = b 0 + b 1 X + u (1) Implies a straight line relationship between y and X Functional Form So far considered models written in linear form Y = b 0 + b 1 X + u (1) Implies a straight line relationship between y and X Functional Form So far considered models written in linear form

More information

General Linear Model (Chapter 4)

General Linear Model (Chapter 4) General Linear Model (Chapter 4) Outcome variable is considered continuous Simple linear regression Scatterplots OLS is BLUE under basic assumptions MSE estimates residual variance testing regression coefficients

More information

Graduate Econometrics Lecture 4: Heteroskedasticity

Graduate Econometrics Lecture 4: Heteroskedasticity Graduate Econometrics Lecture 4: Heteroskedasticity Department of Economics University of Gothenburg November 30, 2014 1/43 and Autocorrelation Consequences for OLS Estimator Begin from the linear model

More information

Basic econometrics. Tutorial 3. Dipl.Kfm. Johannes Metzler

Basic econometrics. Tutorial 3. Dipl.Kfm. Johannes Metzler Basic econometrics Tutorial 3 Dipl.Kfm. Introduction Some of you were asking about material to revise/prepare econometrics fundamentals. First of all, be aware that I will not be too technical, only as

More information

Chapter 6: Linear Regression With Multiple Regressors

Chapter 6: Linear Regression With Multiple Regressors Chapter 6: Linear Regression With Multiple Regressors 1-1 Outline 1. Omitted variable bias 2. Causality and regression analysis 3. Multiple regression and OLS 4. Measures of fit 5. Sampling distribution

More information

(a) Briefly discuss the advantage of using panel data in this situation rather than pure crosssections

(a) Briefly discuss the advantage of using panel data in this situation rather than pure crosssections Answer Key Fixed Effect and First Difference Models 1. See discussion in class.. David Neumark and William Wascher published a study in 199 of the effect of minimum wages on teenage employment using a

More information

Nonlinear Regression Functions

Nonlinear Regression Functions Nonlinear Regression Functions (SW Chapter 8) Outline 1. Nonlinear regression functions general comments 2. Nonlinear functions of one variable 3. Nonlinear functions of two variables: interactions 4.

More information