Econometrics Problem Set 7
WISE, Xiamen University
Spring 2016-17

Conceptual Questions

1. (SW 8.2) Suppose that a researcher collects data on houses that have sold in a particular neighborhood over the past year and obtains the regression results in Table 1.

Dependent variable: ln(Price)

Regressor      (1)          (2)       (3)       (4)       (5)
Size           0.00042
               (0.000038)
ln(Size)                    0.69      0.68      0.57      0.69
                            (0.054)   (0.087)   (2.03)    (0.055)
[ln(Size)]^2                                    0.0078
                                                (0.14)
Bedrooms                              0.0036
                                      (0.037)
Pool           0.082        0.071     0.071     0.071     0.071
               (0.032)      (0.034)   (0.034)   (0.035)   (0.035)
View           0.037        0.027     0.026     0.027     0.027
               (0.029)      (0.028)   (0.026)   (0.029)   (0.030)
Pool x View                                               0.0022
                                                          (0.10)
Condition      0.13         0.12      0.12      0.12      0.12
               (0.045)      (0.035)   (0.035)   (0.036)   (0.035)
Intercept      10.97        6.60      6.63      7.02      6.60
               (0.069)      (0.39)    (0.53)    (7.50)    (0.40)

Summary statistics
SER            0.102        0.098     0.099     0.099     0.099
R^2            0.72         0.74      0.73      0.73      0.73

Table 1: Variable definitions: Price = sale price; Size = house size (in square feet); Bedrooms = number of bedrooms; Pool = binary variable (1 if house has a swimming pool, 0 otherwise); View = binary variable (1 if house has a nice view, 0 otherwise); Condition = binary variable (1 if realtor reports house is in excellent condition, 0 otherwise).

(a) Using the results in column (1), what is the expected change in price of building a 500-square-foot addition to a house? Construct a 95% confidence interval for the percentage change in price.
Price is expected to increase by 21% ($100\% \times 500 \times 0.00042$). The 95% confidence interval for the percentage change in price is $100\% \times 500 \times [0.00042 - 1.96 \times 0.000038,\ 0.00042 + 1.96 \times 0.000038]$, that is, [17.276%, 24.724%].

(b) Comparing columns (1) and (2), is it better to use Size or ln(Size) to explain house prices?

Because the regressions in columns (1) and (2) have the same dependent variable, $R^2$ can be used to compare the fit of these two regressions. It is better to use ln(Size) to explain house prices since its $R^2$ is higher (0.74 versus 0.72).

(c) Using column (2), what is the estimated effect of pool on price? (Make sure you get the units right.) Construct a 95% confidence interval for this effect.

The estimated coefficient on Pool is 0.071. Because the dependent variable is ln(Price), adding a pool is associated with a 7.1% increase in price. The 95% confidence interval is $100\% \times [0.071 - 1.96 \times 0.034,\ 0.071 + 1.96 \times 0.034]$, that is, [0.436%, 13.764%].

(d) The regression in column (3) adds the number of bedrooms to the regression. How large is the estimated effect of an additional bedroom? Is the effect statistically significant? Why do you think the estimated effect is so small? (Hint: What other variables are being held constant?)

The estimated effect of an additional bedroom is 0.36%. The t-statistic for the coefficient on Bedrooms is 0.0036/0.037 = 0.097297 < 1.96, so the effect is not statistically significant at the 5% level. The effect is small because this coefficient measures the effect of an additional bedroom holding the size of the house constant.

(e) Is the quadratic term $[\ln(Size)]^2$ important?

The t-statistic for the quadratic term $[\ln(Size)]^2$ in column (4) is 0.0078/0.14 = 0.055714 < 1.96, so the quadratic term is not statistically significant and does not appear to be important.

(f) Use the regression in column (5) to compute the expected percentage change in price when a pool is added to a house without a view. Repeat the exercise for a house with a view. Is there a large difference?
Is the difference statistically significant?

Without a view: $\%\Delta P = 100\% \times 0.071 \times 1 = 7.1\%$.
With a view: $\%\Delta P = 100\% \times (0.071 \times 1 + 0.0022 \times 1) = 7.32\%$.
The difference in the expected percentage change in price is 0.22 percentage points. The difference is not statistically significant at the 5% significance level: the t-statistic for the interaction term Pool x View in column (5) is 0.0022/0.10 = 0.022 < 1.96.

2. Consider the following graph of income as a function of age. This functional form assumes that income is a continuous function of age and that older individuals earn more than younger individuals, but that the slope might change at some distinct milestones, for example, at age 18, when the typical individual graduates from high school, and at age 22, when he or she graduates from college.

[Figure: Income (vertical axis) against Age (horizontal axis); a continuous, piecewise-linear function with changes in slope at ages 18 and 22.]

(a) Construct a regression model that could be used to estimate this function.

At a general level, this can be described as three different regression functions: one for individuals aged 18 or younger, one for individuals older than 18 and up to 22, and one for individuals older than 22. A basic regression model with these features is

$Income_i = \beta_0 + \beta_1 Age_i + \beta_2 D_i + \beta_3 D_i Age_i + \beta_4 E_i + \beta_5 E_i Age_i + u_i,$

where $D_i = 1$ if $18 < Age_i \le 22$ (0 otherwise) and $E_i = 1$ if $Age_i > 22$ (0 otherwise). However, since the function must also be continuous at the thresholds, two requirements follow:

$\beta_0 + 18\beta_1 = \beta_0 + 18\beta_1 + \beta_2 + 18\beta_3$
$\beta_0 + 22\beta_1 + \beta_2 + 22\beta_3 = \beta_0 + 22\beta_1 + \beta_4 + 22\beta_5$

These imply that

$\beta_2 = -18\beta_3$
$\beta_4 = 4\beta_3 - 22\beta_5$

Imposing these constraints on our regression model implies that

$Income_i = \beta_0 + \beta_1 Age_i - 18\beta_3 D_i + \beta_3 D_i Age_i + (4\beta_3 - 22\beta_5) E_i + \beta_5 E_i Age_i + u_i$
$\phantom{Income_i} = \beta_0 + \beta_1 Age_i + \beta_3 \left( D_i (Age_i - 18) + 4 E_i \right) + \beta_5 E_i (Age_i - 22) + u_i$
$\phantom{Income_i} = \beta_0 + \beta_1 Age_i + \beta_3 W_i + \beta_5 X_i + u_i,$

where $W_i = D_i (Age_i - 18) + 4 E_i$ and $X_i = E_i (Age_i - 22)$.
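The constructed regressors $W_i$ and $X_i$ can be tried out on simulated data. The following is only a sketch under stated assumptions: the data, the true slopes (1, 3 and 5), and the helper name `make_regressors` are hypothetical and are not part of the problem set.

```r
# Hypothetical simulated data: ages 14-30 with kinks at 18 and 22
set.seed(1)

# Build the regressors W and X defined in part (a)
make_regressors <- function(age) {
  D <- as.numeric(age > 18 & age <= 22)   # 1 if 18 < Age <= 22
  E <- as.numeric(age > 22)               # 1 if Age > 22
  data.frame(age = age,
             W = D * (age - 18) + 4 * E,
             X = E * (age - 22))
}

dat <- make_regressors(runif(500, 14, 30))
# Assumed true parameters: beta0 = 10, beta1 = 1, beta3 = 2, beta5 = 4
dat$income <- 10 + 1 * dat$age + 2 * dat$W + 4 * dat$X + rnorm(500, sd = 0.5)

fit <- lm(income ~ age + W + X, data = dat)
b <- coef(fit)

# Implied slope in each age segment
slopes <- c(under18  = b[["age"]],
            from18to22 = b[["age"]] + b[["W"]],
            over22   = b[["age"]] + b[["X"]])
print(round(slopes, 2))

# Continuity check: predictions just below and just above each threshold
p <- predict(fit, make_regressors(c(18, 18.0001, 22, 22.0001)))
print(p)
```

The fitted function has no jump at 18 or 22 because the continuity constraints are built into `W` and `X` rather than estimated.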
(b) Based on your regression model from (a), construct a hypothesis test that tests whether the slope of the function is constant.

Based on the model $Income_i = \beta_0 + \beta_1 Age_i + \beta_3 W_i + \beta_5 X_i + u_i$, where $W_i = D_i (Age_i - 18) + 4 E_i$ and $X_i = E_i (Age_i - 22)$ as defined in (a), the slope is $\beta_1$ up to age 18, $\beta_1 + \beta_3$ between 18 and 22, and $\beta_1 + \beta_5$ after 22. We should therefore test $H_0: \beta_3 = \beta_5 = 0$ against the alternative that at least one of them is nonzero.

3. (SW 8.6) After reading the textbook's analysis of test scores (refer to Table 8.3 of the textbook), a researcher considers the following hypotheses:

(a) She suspects that the percentage of students eligible for a subsidized lunch (ESL) has a nonlinear effect on test scores. In particular, she conjectures that increases in this variable from 10% to 20% have little effect on test scores, but that changes from 50% to 60% have a much larger effect.

i. Describe a nonlinear specification that can be used to model this form of nonlinearity.

One possible approach is to construct a dummy variable $LunchMed_i$, which is equal to one in districts where $20 < ESL \le 50$ and zero otherwise, and another dummy variable $LunchHigh_i$, which is equal to one in districts where $ESL > 50$ and zero otherwise. Add these dummy variables to the base model.

ii. How would you test whether the researcher's conjecture was better than a linear specification of the relationship between TestScore and ESL?

Based on the answer to (i), we would test the joint hypothesis that $\beta_{LunchMed} = \beta_{LunchHigh} = 0$ against the alternative that at least one is not equal to zero.

(b) She suspects that the effect of income on test scores is different in districts with small classes than in districts with large classes.

i. Describe a nonlinear specification that can be used to model this form of nonlinearity.

One possible approach is to construct an interaction variable $SI_i = STR_i \times \ln(Income_i)$ and add this variable to the base model.

ii. How would you test whether the researcher's conjecture was better than a linear specification of the relationship between income and test scores?
Based on the answer to (i), we would test the hypothesis that $\beta_{SI} = 0$ against the alternative that it is not equal to zero.

4. (SW 8.8) X is a continuous variable that takes on values between 5 and 100. Sketch the
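following regression functions (with values of X between 5 and 100 on the horizontal axis and values of $\hat{Y}$ on the vertical axis):

(a) $\hat{Y} = 2 + 3\ln(X)$.

[Figure: plot of $\hat{Y}$ (vertical axis, roughly 8 to 16) against X (horizontal axis, up to 100).]

(b) $\hat{Y} = 2 - 3\ln(X)$.

[Figure: plot of $\hat{Y}$ (vertical axis, roughly -12 to -4) against X (horizontal axis, up to 100).]

(c) $\hat{Y} = 1 + 125X - 0.01X^2$.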
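[Figure: plot of $\hat{Y}$ (vertical axis, roughly 2000 to 12000) against X (horizontal axis, up to 100).]

5. (SW 8.9) Consider the following regression model, $Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + u_i$. Explain how to calculate the confidence interval of $\Delta\hat{Y}$ from a change in X from 5 to 6.

$\Delta\hat{Y} = \hat\beta_1 (6 - 5) + \hat\beta_2 (6^2 - 5^2) = \hat\beta_1 + 11\hat\beta_2.$

Using Approach #2 of Section 7.3, transform the regression function:

$Y_i = \beta_0 + (\beta_1 + 11\beta_2) X_i - 11\beta_2 X_i + \beta_2 X_i^2 + u_i = \gamma_0 + \gamma_1 X_i + \gamma_2 (X_i^2 - 11 X_i) + u_i,$

where $\gamma_0 = \beta_0$, $\gamma_1 = \beta_1 + 11\beta_2$ and $\gamma_2 = \beta_2$. The confidence interval is $\hat\gamma_1 \pm z_{\alpha/2}\, SE(\hat\gamma_1)$.

Alternatively, you can directly test the hypothesis $H_0: \beta_1 + 11\beta_2 = 0$ against the alternative $H_1: \beta_1 + 11\beta_2 \ne 0$. The F-statistic from this test is equal to $\left( \Delta\hat{Y} / SE(\Delta\hat{Y}) \right)^2$, so $SE(\Delta\hat{Y}) = |\Delta\hat{Y}| / \sqrt{F}$ and the confidence interval is $\Delta\hat{Y} \pm z_{\alpha/2}\, |\Delta\hat{Y}| / \sqrt{F}$.

6. (SW 8.10) Consider the regression model $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_{1i} \times X_{2i}) + u_i$. Use Key Concept 8.1 in the textbook to show:

(a) $\Delta Y / \Delta X_1 = \beta_1 + \beta_3 X_2$ (effect of change in $X_1$ holding $X_2$ constant).

Since $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \times X_2) + u$, changing $X_1$ by $\Delta X_1$ while holding $X_2$ constant gives

$Y + \Delta Y = \beta_0 + \beta_1 (X_1 + \Delta X_1) + \beta_2 X_2 + \beta_3 (X_1 + \Delta X_1) \times X_2 + u.$

Subtracting the first equation from the second,

$\Delta Y = \beta_1 \Delta X_1 + \beta_3 X_2 \Delta X_1.$

Thus, $\Delta Y / \Delta X_1 = \beta_1 + \beta_3 X_2$.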
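The result in 6(a) can be checked numerically. The sketch below uses hypothetical coefficient values and evaluation points chosen purely for illustration; none of these numbers come from the problem set.

```r
# Hypothetical coefficients, for illustration only
beta <- c(b0 = 1, b1 = 2, b2 = -0.5, b3 = 0.3)

# Regression function with an interaction term (error term omitted)
f <- function(x1, x2) {
  beta[["b0"]] + beta[["b1"]] * x1 + beta[["b2"]] * x2 + beta[["b3"]] * x1 * x2
}

x1 <- 4; x2 <- 7; dx1 <- 0.5   # hypothetical starting point and change in X1

# Effect of changing X1 by dx1, holding X2 constant
numeric_effect <- (f(x1 + dx1, x2) - f(x1, x2)) / dx1

# Effect implied by the formula beta1 + beta3 * X2
formula_effect <- beta[["b1"]] + beta[["b3"]] * x2

print(c(numeric_effect, formula_effect))
```

Because the model is linear in $X_1$ for a fixed $X_2$, the two numbers agree for any size of `dx1`, not only for small changes.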
(b) $\Delta Y / \Delta X_2 = \beta_2 + \beta_3 X_1$ (effect of change in $X_2$ holding $X_1$ constant).

Since $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \times X_2) + u$, changing $X_2$ by $\Delta X_2$ while holding $X_1$ constant gives

$Y + \Delta Y = \beta_0 + \beta_1 X_1 + \beta_2 (X_2 + \Delta X_2) + \beta_3 X_1 \times (X_2 + \Delta X_2) + u.$

Subtracting the first equation from the second,

$\Delta Y = \beta_2 \Delta X_2 + \beta_3 X_1 \Delta X_2.$

Thus, $\Delta Y / \Delta X_2 = \beta_2 + \beta_3 X_1$.

(c) If $X_1$ changes by $\Delta X_1$ and $X_2$ changes by $\Delta X_2$, then $\Delta Y = (\beta_1 + \beta_3 X_2) \Delta X_1 + (\beta_2 + \beta_3 X_1) \Delta X_2 + \beta_3 \Delta X_1 \Delta X_2$.

Since $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \times X_2) + u$, we can obtain

$Y + \Delta Y = \beta_0 + \beta_1 (X_1 + \Delta X_1) + \beta_2 (X_2 + \Delta X_2) + \beta_3 (X_1 + \Delta X_1) \times (X_2 + \Delta X_2) + u,$

$\Delta Y = \beta_1 \Delta X_1 + \beta_2 \Delta X_2 + \beta_3 (X_2 \Delta X_1 + X_1 \Delta X_2 + \Delta X_1 \Delta X_2).$

Thus, $\Delta Y = (\beta_1 + \beta_3 X_2) \Delta X_1 + (\beta_2 + \beta_3 X_1) \Delta X_2 + \beta_3 \Delta X_1 \Delta X_2$.

Empirical Questions

For these empirical exercises, the required datasets and a detailed description of them can be found at www.wise.xmu.edu.cn/course/gecon/written.html.

7. (SW E8.3) The data set used in this empirical exercise (CollegeDistance) contains data from a random sample of high school seniors interviewed in 1980 and re-interviewed in 1986. In this exercise you will use these data to investigate the relationship between the number of completed years of education for young adults and the distance from each student's high school to the nearest four-year college. (Proximity to college lowers the cost of education, so that students who live closer to a four-year college should, on average, complete more years of higher education.)

The R code required for each question is listed within its respective solution. The code listed here initialises the software.

    # read the data
    CD <- read.csv("D:/R/CollegeDistance.csv")
    # add dist^2 to the dataset
    CD$dist2 <- CD$dist^2
    # attaching allows you to directly access variable names
    attach(CD)
    # load package AER (provides coeftest, vcovHC and linearHypothesis
    # through the packages it loads)
    library(AER)
Regressor             (a)       (b)       (c)       (f)       (h)
Dependent variable    ED        ln(ED)    ED        ED        ED

Dist                  -0.037    -0.003    -0.081    -0.081    -0.110
                      (0.012)   (0.001)   (0.025)   (0.025)   (0.028)
Dist^2                                    0.005     0.005     0.006
                                          (0.002)   (0.002)   (0.002)
Female                0.143     0.010     0.143     0.141     0.141
                      (0.050)   (0.004)   (0.050)   (0.050)   (0.050)
Bytest                0.093     0.007     0.093     0.093     0.093
                      (0.003)   (0.000)   (0.003)   (0.003)   (0.003)
Tuition               -0.191    -0.014    -0.193    -0.194    -0.210
                      (0.099)   (0.007)   (0.099)   (0.099)   (0.099)
Black                 0.351     0.026     0.334     0.331     0.333
                      (0.067)   (0.005)   (0.068)   (0.068)   (0.068)
Hispanic              0.362     0.026     0.333     0.330     0.323
                      (0.076)   (0.005)   (0.078)   (0.078)   (0.078)
Incomehi              0.372     0.027     0.369     0.362     0.217
                      (0.062)   (0.004)   (0.062)   (0.062)   (0.090)
Incomehi x Dist                                               0.124
                                                              (0.062)
Incomehi x Dist^2                                             -0.009
                                                              (0.006)
Intercept             8.921     2.266     9.012     9.002     9.042
                      (0.243)   (0.017)   (0.250)   (0.250)   (0.251)
Ownhome               0.139     0.010     0.143     0.141     0.144
                      (0.065)   (0.005)   (0.065)   (0.065)   (0.065)
DadColl               0.571     0.041     0.561     0.654     0.663
                      (0.076)   (0.005)   (0.077)   (0.087)   (0.087)
MomColl               0.378     0.027     0.378     0.569     0.567
                      (0.083)   (0.006)   (0.084)   (0.122)   (0.122)
DadColl x MomColl                                   -0.366    -0.356
                                                    (0.164)   (0.164)
Cue80                 0.029     0.002     0.026     0.026     0.026
                      (0.010)   (0.001)   (0.010)   (0.010)   (0.010)
Stwmfg80              -0.043    -0.003    -0.043    -0.042    -0.042
                      (0.020)   (0.001)   (0.020)   (0.020)   (0.020)

R^2                   0.281     0.283     0.282     0.283     0.283
SER                   1.538     0.109     1.537     1.536     1.536

Regression results for question E8.3. Robust standard errors in parentheses. N = 3796. Significance thresholds: p < .10; p < .05; p < .01; p < .001.

(a) Run a regression of ED on Dist, Female, Bytest, Tuition, Black, Hispanic, Incomehi, Ownhome, DadColl, MomColl, Cue80, and Stwmfg80. If Dist increases from 2 to 3
(that is, from 20 miles to 30 miles), how are years of education expected to change? If Dist increases from 6 to 7 (that is, from 60 to 70 miles), how are years of education expected to change?

    # run the regression
    ma <- lm(yrsed ~ dist + female + bytest + tuition + black + hispanic
             + incomehi + ownhome + dadcoll + momcoll + cue80 + stwmfg80)
    # show the result with heteroskedasticity-robust standard errors
    coeftest(ma, vcov = vcovHC(ma, "HC1"))

The regression result for this question is shown in column (a) of the above table. If Dist increases from 2 to 3 (that is, from 20 miles to 30 miles) or from 6 to 7 (that is, from 60 to 70 miles), years of education are expected to decrease by 0.037 years. These values are the same because the regression is a linear function relating ED to Dist.

(b) Run a regression of ln(ED) on Dist, Female, Bytest, Tuition, Black, Hispanic, Incomehi, Ownhome, DadColl, MomColl, Cue80, and Stwmfg80. If Dist increases from 2 to 3, how are years of education expected to change? If Dist increases from 6 to 7, how are years of education expected to change?

    # run the regression
    mb <- lm(log(yrsed) ~ dist + female + bytest + tuition + black + hispanic
             + incomehi + ownhome + dadcoll + momcoll + cue80 + stwmfg80)
    # show the result with heteroskedasticity-robust standard errors
    coeftest(mb, vcov = vcovHC(mb, "HC1"))

The regression result for this question is shown in column (b) of the above table. If Dist increases from 2 to 3 or from 6 to 7, ED is expected to decrease by 0.26%. These values are the same because the regression is a linear function relating ln(ED) to Dist.

(c) Run a regression of ED on Dist, Dist^2, Female, Bytest, Tuition, Black, Hispanic, Incomehi, Ownhome, DadColl, MomColl, Cue80, and Stwmfg80. If Dist increases from 2 to 3, how are years of education expected to change? If Dist increases from 6 to 7, how are years of education expected to change?

    # run the regression
    mc <- lm(yrsed ~ dist + dist2 + female + bytest + tuition + black + hispanic
             + incomehi + ownhome + dadcoll + momcoll + cue80 + stwmfg80)
    # show the result with heteroskedasticity-robust standard errors
    coeftest(mc, vcov = vcovHC(mc, "HC1"))
    # dist changes from 2 to 3
    mc$coef[2]*3 + mc$coef[3]*3^2 - (mc$coef[2]*2 + mc$coef[3]*2^2)
    # dist changes from 6 to 7
    mc$coef[2]*7 + mc$coef[3]*7^2 - (mc$coef[2]*6 + mc$coef[3]*6^2)

The regression result is shown in column (c) of the above table. If Dist increases from 2 to 3, ED is expected to decrease by 0.058 years; if Dist increases from 6 to 7, ED is expected to decrease by 0.021 years.

(d) Do you prefer regression (c) to (a)? Explain.

From a theoretical viewpoint, one might expect the effect of Dist on ED to diminish with distance: the marginal negative effect of an additional mile should be smaller when the distance is very large, because students who live far away might commute by car or school bus, which makes ED less responsive to distance. From an empirical viewpoint, the regression in (c) adds the variable Dist^2 to regression (a). The coefficient on Dist^2 is statistically significant (t = 2.26), which suggests that the addition of Dist^2 is important. Thus, (c) is preferred to (a).

(e) Consider a Hispanic female with Tuition = $950, Bytest = 58, Incomehi = 0, Ownhome = 0, DadColl = 1, MomColl = 1, Cue80 = 7.1, and Stwmfg80 = $10.06.

i. Plot the regression relation between Dist and ED from (a) and (c) for Dist in the range from 0 to 10 (from 0 to 100 miles). Describe the similarities and differences between the estimated regression functions. Would your answer change if you plotted the regression for a white male with the same characteristics?
    # create a sequence of dist from 0 to 10
    dseq <- seq(0, 10, 0.5)
    # create a data frame consistent with CD
    nd <- data.frame(dist = dseq, dist2 = dseq^2, female = 1, hispanic = 1,
                     black = 0, bytest = 58, dadcoll = 1, momcoll = 1, ownhome = 0,
                     cue80 = 7.1, stwmfg80 = 10.06, tuition = 0.95, incomehi = 0)
    # plot the estimated regression function from (a)
    plot(nd$dist, predict(ma, nd), type = "l", col = "blue",
         xlab = "Distance (Tens of Miles)", ylab = "Years of Education")
    # add the estimated regression function from (c)
    lines(nd$dist, predict(mc, nd), col = "red")
    # add a legend
    legend("topright", c("model a", "model c"),
           col = c("blue", "red"), lwd = 1, inset = .01)
[Figure: estimated regression functions for Model a and Model c; Years of Education (vertical axis, about 15.0 to 15.3) against Distance (Tens of Miles) (horizontal axis, 0 to 10).]

The quadratic regression in (c) is steeper for small values of Dist than for larger values. The quadratic function is essentially flat when Dist = 10. The only change in the regression functions for a white male is that the intercept would shift. The functions would have the same slopes.

ii. How does the regression function (c) behave for Dist > 10? How many observations are there with Dist > 10?

    # number of observations with dist > 10
    sum(dist > 10)

The regression function becomes positively sloped for Dist > 10. There are only 44 of the 3796 observations with Dist > 10. This is approximately 1% of the sample. Thus, this part of the regression function is very imprecisely estimated.

(f) Add the interaction term DadColl x MomColl to the regression in (c). What does the coefficient on the interaction term measure?
    # run the regression
    mf <- lm(yrsed ~ dist + dist2 + female + bytest + tuition + black + hispanic
             + incomehi + ownhome + dadcoll + momcoll + dadcoll*momcoll + cue80 + stwmfg80)
    # show the result with heteroskedasticity-robust standard errors
    coeftest(mf, vcov = vcovHC(mf, "HC1"))

The regression result is shown in column (f) of the above table. The estimated coefficient on the interaction term is -0.366. It measures the extra effect on years of education, above and beyond the separate MomColl and DadColl effects, when both the mother and the father attended college. This effect is significant at the 5% significance level but not at the 1% significance level.

(g) Mary, Jane, Alexis and Bonnie have the same values of Dist, Female, Bytest, Tuition, Black, Hispanic, Incomehi, Ownhome, Cue80, and Stwmfg80. Neither of Mary's parents attended college. Jane's father attended college, but her mother did not. Alexis's mother attended college, but her father did not. Both of Bonnie's parents attended college. Using the regression from (f):

i. What does the regression predict for the difference between Jane's and Mary's years of education?

According to the regression, Jane is predicted to complete 0.654 more years of education than Mary.

ii. What does the regression predict for the difference between Alexis's and Mary's years of education?

According to the regression, Alexis is predicted to complete 0.569 more years of education than Mary.

iii. What does the regression predict for the difference between Bonnie's and Mary's years of education?

According to the regression, Bonnie is predicted to complete 0.654 + 0.569 - 0.366 = 0.857 more years of education than Mary.

(h) Is there any evidence that the effect of Dist on ED depends on the family's income?

    # run the regression with Incomehi interacted with dist and dist2
    mh <- lm(yrsed ~ dist + dist2 + female + bytest + tuition + black + hispanic
             + incomehi + dist*incomehi + dist2*incomehi + ownhome + dadcoll + momcoll
             + dadcoll*momcoll + cue80 + stwmfg80)
    # show the result with heteroskedasticity-robust standard errors
    coeftest(mh, vcov = vcovHC(mh, "HC1"))
    # create two data frames consistent with CD
    ndlow <- data.frame(dist = dseq, dist2 = dseq^2, female = 1, hispanic = 1, black = 0,
                        bytest = 58, dadcoll = 1, momcoll = 1, ownhome = 0, cue80 = 7.1,
                        stwmfg80 = 10.06, tuition = 0.95, incomehi = 0)
    ndhi <- data.frame(dist = dseq, dist2 = dseq^2, female = 1, hispanic = 1, black = 0,
                       bytest = 58, dadcoll = 1, momcoll = 1, ownhome = 0, cue80 = 7.1,
                       stwmfg80 = 10.06, tuition = 0.95, incomehi = 1)
    # plot the estimated regression function for Incomehi = 1
    plot(ndhi$dist, predict(mh, ndhi), type = "l", col = "blue", ylim = c(14.8, 15.7),
         xlab = "Distance (Tens of Miles)", ylab = "Years of Education")
    # add the estimated regression function for Incomehi = 0
    lines(ndlow$dist, predict(mh, ndlow), type = "l", col = "red")
    # add a legend
    legend("topright", c("incomehi=1", "incomehi=0"),
           col = c("blue", "red"), lwd = 1, inset = .01)
    # joint test on the interaction terms
    linearHypothesis(mh, c("dist:incomehi=0", "dist2:incomehi=0"),
                     vcov = vcovHC(mh, type = "HC1"))

The regression result is shown in column (h) of the above table. Regression (h) adds the interaction of Incomehi and the distance regressors, Dist and Dist^2. The implied coefficients on Dist and Dist^2 are:

Students who are not high income (Incomehi = 0):
$ED = -0.110\,Dist + 0.0065\,Dist^2 + \text{other factors}.$

High-income students (Incomehi = 1):
$ED = (-0.110 + 0.124)\,Dist + (0.0065 - 0.0087)\,Dist^2 + \text{other factors} = 0.014\,Dist - 0.0022\,Dist^2 + \text{other factors}$ (computed from the rounded coefficients shown).

The two estimated regression functions are plotted below for someone with the characteristics given in (e), but with Incomehi = 1 and with Incomehi = 0. When Incomehi = 1, the regression function is essentially flat, suggesting very little effect of Dist on ED. The F-statistic testing that the coefficients on the interaction terms Incomehi x Dist and Incomehi x Dist^2 are both equal to zero has a p-value of 0.092. Thus, the interaction effects are significant at the 10% but not the 5% significance level.
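The implied coefficients for the two income groups can be reproduced by hand from the four estimates quoted in the text. The sketch below hard-codes those rounded values rather than reading them from the fitted model `mh`, and the helper name `dist_effect` is hypothetical.

```r
# Rounded estimates from column (h) as quoted in the text
b_dist      <- -0.110    # Dist
b_dist2     <-  0.0065   # Dist^2
b_inc_dist  <-  0.124    # Incomehi x Dist
b_inc_dist2 <- -0.0087   # Incomehi x Dist^2

# Contribution of distance to predicted ED for a given income group
dist_effect <- function(dist, incomehi) {
  (b_dist + b_inc_dist * incomehi) * dist +
    (b_dist2 + b_inc_dist2 * incomehi) * dist^2
}

# Predicted change in ED when Dist goes from 2 to 3 (20 to 30 miles)
change_low  <- dist_effect(3, incomehi = 0) - dist_effect(2, incomehi = 0)
change_high <- dist_effect(3, incomehi = 1) - dist_effect(2, incomehi = 1)
print(c(not_high_income = change_low, high_income = change_high))
```

With these rounded inputs the change is about -0.078 years for students who are not high income and about 0.003 years for high-income students, which matches the description of an essentially flat function when Incomehi = 1.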
[Figure: estimated regression functions from (h) for Incomehi = 1 and Incomehi = 0; Years of Education (vertical axis, about 14.8 to 15.6) against Distance (Tens of Miles) (horizontal axis, 0 to 10).]

(i) After running all of these regressions (and any others that you want to run), summarize the effect of Dist on years of education.

The regression functions shown in (f) and (h) show the nonlinear effect of distance on years of education. The effect is statistically significant. In (f), changing Dist from 20 miles to 30 miles changes years of completed education by $-0.081 \times (3 - 2) + 0.0047 \times (3^2 - 2^2) = -0.0575$ years, on average, that is, a reduction of about 0.058 years. The regression in (h) shows a slightly negative effect for students who are not high income, but essentially no effect for high-income students.
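The arithmetic in the summary can be checked with a few lines of R. This is a sketch that uses the rounded coefficients -0.081 and 0.0047 quoted above instead of the fitted model object, so the 6-to-7 result can differ in the third decimal from the value reported in (c), which comes from unrounded estimates. The helper name `delta_ed` is hypothetical.

```r
# Rounded coefficients on Dist and Dist^2 as quoted in the summary
b_dist  <- -0.081
b_dist2 <-  0.0047

# Predicted change in ED when Dist moves from `from` to `to`,
# holding all other regressors fixed
delta_ed <- function(from, to) {
  b_dist * (to - from) + b_dist2 * (to^2 - from^2)
}

change_2_to_3 <- delta_ed(2, 3)   # -0.081 + 0.0047 * 5
change_6_to_7 <- delta_ed(6, 7)   # -0.081 + 0.0047 * 13
print(c(change_2_to_3, change_6_to_7))
```

The second change is smaller in magnitude than the first, which is the diminishing marginal effect of distance discussed in (d).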