Econometrics Problem Set 6
WISE, Xiamen University
Spring 2016-17

Conceptual Questions

1. This question refers to the estimated regressions shown in Table 1, computed using data for 1998 from the CPS. The data set consists of information on 4000 full-time full-year workers. The highest educational achievement for each worker was either a high school diploma or a bachelor's degree. The workers' ages ranged from 25 to 34 years. The data set also contained information on the region of the country where the person lived, marital status, and number of children. For the purposes of these exercises let

AHE = average hourly earnings (in 1998 dollars)
College = binary variable (1 if college, 0 if high school)
Female = binary variable (1 if female, 0 if male)
Age = age (in years)
Northeast = binary variable (1 if Region = Northeast, 0 otherwise)
Midwest = binary variable (1 if Region = Midwest, 0 otherwise)
South = binary variable (1 if Region = South, 0 otherwise)
West = binary variable (1 if Region = West, 0 otherwise)

(a) (SW 7.1) Add * (5%) and ** (1%) to Table 1 to indicate statistical significance of the coefficients.

Solution: All coefficients (including the intercept) should have **, with three exceptions in column (3). The coefficients on Northeast ($X_4$) and Midwest ($X_5$) should have * only, because their t-statistics, $0.69/0.30 = 2.30$ and $0.60/0.28 = 2.14$, lie between the 5% critical value 1.96 and the 1% critical value 2.58. The coefficient on South ($X_6$) should have nothing, because its t-statistic is $-0.27/0.26 = -1.04$.

(b) (SW 7.2) Using the regression results in column (1):

i. Is the college-high school earnings difference estimated from this regression statistically significant at the 5% level? Construct a 95% confidence interval of the difference.

Solution: The t-statistic is $5.46/0.21 = 26.0 > 1.96$, so the coefficient is statistically significant at the 5% level. The 95% confidence interval for the college-high school earnings difference is $5.46 \pm 1.96 \times 0.21 = [5.05, 5.87]$.

ii. Is the male-female earnings difference estimated from this regression statistically significant at the 5% level? Construct a 95% confidence interval for the difference.
Solution: The t-statistic is $-2.64/0.20 = -13.2$, and $|-13.2| > 1.96$, so the coefficient is statistically significant at the 5% level. The 95% confidence interval for the male-female earnings difference is $-2.64 \pm 1.96 \times 0.20 = [-3.03, -2.25]$.

(c) (SW 7.3) Using the regression results in column (2):

i. Is age an important determinant of earnings? Use an appropriate statistical test and/or confidence interval to explain your answer.

Solution: Yes. Using a t-test on the Age coefficient in column (2), the t-statistic is $0.29/0.04 = 7.25$, with a p-value of $4.2 \times 10^{-13}$, so the coefficient on age is statistically significant at the 1% level (and therefore also at the 5% level).

ii. Sally is a 29-year-old female college graduate. Betsy is a 34-year-old female college graduate. Construct a 95% confidence interval for the expected difference between their earnings.

Solution: The two women differ only in age, with $\Delta Age = 34 - 29 = 5$. The 95% confidence interval for the expected difference between their earnings is
$$\Delta Age \times [0.29 \pm 1.96 \times 0.04] = 5 \times [0.2116, 0.3684] = [\$1.06, \$1.84].$$

(d) (SW 7.4) Using the regression results in column (3):

i. Do there appear to be important regional differences? Use an appropriate hypothesis test to explain your answer.

Solution: The F-statistic testing that the coefficients on the regional regressors are all zero is 6.10. The 1% critical value (from the $F_{3,\infty}$ distribution) is 3.78. Because $6.10 > 3.78$, the regional effects are significant at the 1% level.

ii. Juanita is a 28-year-old female college graduate from the South. Molly is a 28-year-old female college graduate from the West. Jennifer is a 28-year-old female college graduate from the Midwest.

α) Construct a 95% confidence interval for the difference in expected earnings between Juanita and Molly.

Solution: West is the omitted region, so the two women differ only in the South indicator and the expected difference is $\beta_6$. The 95% confidence interval for the difference in expected earnings between Juanita and Molly is
$$(X_{6,\text{Juanita}} - X_{6,\text{Molly}}) \times [\hat\beta_6 \pm z_{0.025}\, SE(\hat\beta_6)] = -0.27 \pm 1.96 \times 0.26 = [-0.78, 0.24].$$

β) Explain how you would construct a 95% confidence interval for the difference in expected earnings between Juanita and Jennifer.
Solution: The expected difference between Juanita and Jennifer is
$$(X_{5,\text{Juan}} - X_{5,\text{Jenn}})\,\beta_5 + (X_{6,\text{Juan}} - X_{6,\text{Jenn}})\,\beta_6 = -\beta_5 + \beta_6.$$
A 95% confidence interval could be constructed using the general methods discussed in Section 7.3. In this case, an easy way to do this is to omit Midwest from the regression and replace it with $X_5 = West$. In this new regression the coefficient on South measures the difference in wages between the South and the Midwest, and a 95% confidence interval can be computed directly.

Dependent variable: average hourly earnings (AHE).

Regressor                                    (1)       (2)       (3)
College ($X_1$)                             5.46      5.48      5.44
                                           (0.21)    (0.21)    (0.21)
Female ($X_2$)                             -2.64     -2.62     -2.62
                                           (0.20)    (0.20)    (0.20)
Age ($X_3$)                                           0.29      0.29
                                                     (0.04)    (0.04)
Northeast ($X_4$)                                               0.69
                                                               (0.30)
Midwest ($X_5$)                                                 0.60
                                                               (0.28)
South ($X_6$)                                                  -0.27
                                                               (0.26)
Intercept                                  12.69      4.40      3.75
                                           (0.14)    (1.05)    (1.06)

Summary Statistics
F-statistic for regional effects = 0                            6.10
SER                                         6.27      6.22      6.21
$R^2$                                       0.176     0.190     0.194
$\bar{R}^2$
n                                           4000      4000      4000

Table 1: Results of Regressions of Average Hourly Earnings on Gender and Education Binary Variables and Other Characteristics Using 1998 Data from the Current Population Survey

(e) (SW 7.5) The regression shown in column (2) was estimated again, this time using data from 1992 (4000 observations selected at random from the March 1993 CPS, converted into 1998 dollars using the consumer price index). The results are
$$\widehat{AHE} = \underset{(0.98)}{0.77} + \underset{(0.20)}{5.29}\, College - \underset{(0.18)}{2.59}\, Female + \underset{(0.03)}{0.40}\, Age, \quad SER = 5.85, \quad R^2 = 0.21.$$
Comparing this regression to the regression for 1998 shown in column (2), was there a statistically significant change in the coefficient on College?

Solution: The t-statistic for the difference in the college coefficients is
$$t = \frac{\hat\beta_{College,1998} - \hat\beta_{College,1992}}{SE(\hat\beta_{College,1998} - \hat\beta_{College,1992})}.$$
Because $\hat\beta_{College,1998}$ and $\hat\beta_{College,1992}$ are computed from independent samples, they are independent, which means that $cov(\hat\beta_{College,1998}, \hat\beta_{College,1992}) = 0$. Thus,
$$var(\hat\beta_{College,1998} - \hat\beta_{College,1992}) = var(\hat\beta_{College,1998}) + var(\hat\beta_{College,1992}).$$
This implies that $SE(\hat\beta_{College,1998} - \hat\beta_{College,1992}) = (0.21^2 + 0.20^2)^{1/2}$. Thus,
$$t^{act} = \frac{5.48 - 5.29}{(0.21^2 + 0.20^2)^{1/2}} = 0.6552.$$
There is no significant change, since the calculated t-statistic is less than 1.96, the 5% critical value.

2. (SW 7.7) Data were collected from a random sample of 220 home sales from a community in 2003. Let P denote the selling price (in $1000), BDR denote the number of bedrooms, Bath denote the number of bathrooms, Hsize denote the size of the house (in square feet), Lsize denote the lot size (in square feet), Age denote the age of the house (in years), and Pr denote a binary variable that is equal to 1 if the condition of the house is reported as poor. An estimated regression yields
$$\hat{P} = \underset{(23.9)}{119.2} + \underset{(2.61)}{0.485}\, BDR + \underset{(8.94)}{23.4}\, Bath + \underset{(0.011)}{0.156}\, Hsize + \underset{(0.00048)}{0.002}\, Lsize + \underset{(0.311)}{0.090}\, Age - \underset{(10.5)}{48.8}\, Pr,$$
$$SER = 41.5, \quad R^2 = 0.72.$$

(a) Is the coefficient on BDR statistically significantly different from zero?

Solution: For BDR, the t-statistic is $0.485/2.61 = 0.1858 < 1.96$. Therefore, the coefficient on BDR is not statistically significantly different from zero.

(b) Typically five-bedroom houses sell for much more than two-bedroom houses. Is this consistent with your answer to (a) and with the regression more generally?

Solution: The coefficient on BDR measures only the partial effect of the number of bedrooms, holding Hsize (house size) constant. Yet a typical five-bedroom house is much larger than a typical two-bedroom house. Therefore the result in (a) says little about the conventional wisdom.

(c) A homeowner purchases 2000 square feet from an adjacent lot. Construct a 99% confidence interval for the change in the value of her house.
Solution: The 99% confidence interval for the effect of the additional lot size on price is $2000 \times [0.002 \pm z_{0.005} \times 0.00048]$, that is, $[1.53, 6.47]$ in thousands of dollars (with $z_{0.005} \approx 2.5758$).

(d) Lot size is measured in square feet. Do you think that another scale might be more appropriate? Why or why not?
Solution: The scale of the variables should be chosen to make the regression results easy to read and to interpret. If the lot size were measured in thousands of square feet, the estimated coefficient would be 2 instead of 0.002.

(e) The F-statistic for omitting BDR and Age from the regression is F = 0.08. Are the coefficients on BDR and Age jointly statistically different from zero at the 10% level?

Solution: The 10% critical value from the $F_{2,\infty}$ distribution is 2.30. Because $0.08 < 2.30$, the coefficients are not jointly significant at the 10% level.

3. (SW 7.9) Consider the regression model $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$. Use the "transform the regression" approach discussed in class to transform the regression so that you can use a t-statistic to test

(a) $\beta_1 = \beta_2$;

Solution: Adding and subtracting $\beta_2 X_{1i}$ on the right-hand side of the equation gives
$$Y_i = \beta_0 + \beta_1 X_{1i} - \beta_2 X_{1i} + \beta_2 X_{1i} + \beta_2 X_{2i} + u_i = \beta_0 + (\beta_1 - \beta_2) X_{1i} + \beta_2 (X_{1i} + X_{2i}) + u_i.$$
Thus you can estimate
$$Y_i = \beta_0 + \gamma X_{1i} + \beta_2 (X_{1i} + X_{2i}) + u_i,$$
where $\gamma = \beta_1 - \beta_2$, and test whether $\gamma = 0$.

(b) $\beta_1 + a\beta_2 = 0$, where $a$ is a constant;

Solution: Adding and subtracting $a\beta_2 X_{1i}$ on the right-hand side of the equation gives
$$Y_i = \beta_0 + \beta_1 X_{1i} + a\beta_2 X_{1i} - a\beta_2 X_{1i} + \beta_2 X_{2i} + u_i = \beta_0 + (\beta_1 + a\beta_2) X_{1i} + \beta_2 (X_{2i} - a X_{1i}) + u_i.$$
Thus you can estimate
$$Y_i = \beta_0 + \gamma X_{1i} + \beta_2 (X_{2i} - a X_{1i}) + u_i,$$
where $\gamma = \beta_1 + a\beta_2$, and test whether $\gamma = 0$.

(c) $\beta_1 + \beta_2 = 1$; (Hint: You can redefine the dependent variable in the regression.)
Solution: Adding and subtracting $\beta_2 X_{1i}$ on the right-hand side of the equation, and subtracting $X_{1i}$ from both sides of the equation, gives
$$Y_i - X_{1i} = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i} - X_{1i} - \beta_2 X_{1i} + \beta_2 X_{2i} + u_i = \beta_0 + (\beta_1 + \beta_2 - 1) X_{1i} + \beta_2 (X_{2i} - X_{1i}) + u_i.$$
Thus you can estimate
$$Y_i - X_{1i} = \beta_0 + \gamma X_{1i} + \beta_2 (X_{2i} - X_{1i}) + u_i$$
and test whether $\gamma = 0$. Alternatively, you can ignore the hint and estimate
$$Y_i = \beta_0 + \gamma X_{1i} + \beta_2 (X_{2i} - X_{1i}) + u_i$$
and test whether $\gamma = 1$.

(d) $\beta_1 + \beta_2 = a$, where $a$ is a constant.

Solution: Adding and subtracting $\beta_2 X_{1i}$ on the right-hand side of the equation, and subtracting $a X_{1i}$ from both sides of the equation, gives
$$Y_i - a X_{1i} = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i} - a X_{1i} - \beta_2 X_{1i} + \beta_2 X_{2i} + u_i = \beta_0 + (\beta_1 + \beta_2 - a) X_{1i} + \beta_2 (X_{2i} - X_{1i}) + u_i.$$
Thus you can estimate
$$Y_i - a X_{1i} = \beta_0 + \gamma X_{1i} + \beta_2 (X_{2i} - X_{1i}) + u_i$$
and test whether $\gamma = 0$. Alternatively, you can estimate
$$Y_i = \beta_0 + \gamma X_{1i} + \beta_2 (X_{2i} - X_{1i}) + u_i$$
and test whether $\gamma = a$.

4. (SW 7.10) Show that the following two formulas for the homoskedasticity-only F-statistic are equivalent:
$$F = \frac{(SSR_{restricted} - SSR_{unrestricted})/q}{SSR_{unrestricted}/(n - k_{unrestricted} - 1)}$$
and
$$F = \frac{(R^2_{unrestricted} - R^2_{restricted})/q}{(1 - R^2_{unrestricted})/(n - k_{unrestricted} - 1)}.$$

5. (SW 7.11) A school district undertakes an experiment to estimate the effect of class size on test scores in second-grade classes. The district assigns 50% of its previous year's first-graders to small second-grade classes (18 students per classroom) and 50% to regular-size classes (21 students per classroom). Students new to the district are handled differently: 20% are randomly assigned to small classes and 80% to regular-size classes. At the end of the second-grade school year, each student is given a standardized exam. Let $Y_i$ denote the exam score for the $i$th student, $X_{1i}$ denote a binary variable that equals 1 if the student is assigned to a small class, and $X_{2i}$ denote a binary variable that equals 1 if the student is newly enrolled. Let $\beta_1$ denote the causal effect on test scores of reducing class size from regular to small.

(a) Consider the regression $Y_i = \beta_0 + \beta_1 X_{1i} + u_i$. Do you think that $E(u_i \mid X_{1i}) = 0$? Is the OLS estimator of $\beta_1$ unbiased and consistent? Explain.

Solution: Treatment (assignment to small classes) was not randomly assigned in the population (the continuing and newly-enrolled students) because the proportion treated differs between continuing and newly-enrolled students. Thus, the treatment indicator $X_1$ is correlated with $X_2$. If newly-enrolled students perform systematically differently on standardized tests than continuing students (perhaps because of adjustment to a new school), then this becomes part of the error term $u$ in the regression. This leads to correlation between $X_1$ and $u$, so that $E(u \mid X_1) \neq 0$. Because $E(u \mid X_1) \neq 0$, $\hat\beta_1$ is biased and inconsistent.

Statistically, if the true model is $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$ with $\beta_2 \neq 0$, then we have the necessary conditions for omitted variable bias: the variables $X_{1i}$ and $X_{2i}$ are correlated, and $X_{2i}$ partly explains $Y_i$.

(b) Consider the regression $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$. Do you think that $E(u_i \mid X_{1i}, X_{2i})$ depends on $X_1$? Is the OLS estimator of $\beta_1$ unbiased and consistent? Explain. Do you think that $E(u_i \mid X_{1i}, X_{2i})$ depends on $X_2$? Is the OLS estimator of $\beta_2$ unbiased and consistent? Explain.
Solution: Because treatment was randomly assigned conditional on enrollment status (continuing or newly-enrolled), $E(u \mid X_1, X_2)$ will not depend on $X_1$. This means that the assumption of conditional mean independence is satisfied, and $\hat\beta_1$ is unbiased and consistent. However, because $X_2$ was not randomly assigned (newly-enrolled students may, on average, have attributes other than being newly enrolled that affect test scores), $E(u \mid X_1, X_2)$ may depend on $X_2$, so that $\hat\beta_2$ may be biased and inconsistent.

Statistically, assume that $E(u_i \mid X_{1i}, X_{2i}) = E(u_i \mid X_{2i})$, which may not be equal to zero. Intuitively, once you control for student type (continuing or newly-enrolled), class size is randomly assigned. For convenience, assume further that $E(u_i \mid X_{2i}) = \gamma_0 + \gamma_2 X_{2i}$.
Define $v_i = u_i - E(u_i \mid X_{1i}, X_{2i})$. (Note that $E(v_i \mid X_{1i}, X_{2i}) = 0$.) Thus,
$$\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \\
    &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + E(u_i \mid X_{1i}, X_{2i}) + v_i \\
    &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + E(u_i \mid X_{2i}) + v_i \\
    &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \gamma_0 + \gamma_2 X_{2i} + v_i \\
    &= (\beta_0 + \gamma_0) + \beta_1 X_{1i} + (\beta_2 + \gamma_2) X_{2i} + v_i.
\end{aligned}$$
Thus, the conditional mean of the error term $v_i$ in this model is zero, that is, least squares assumption 1 holds. The ordinary-least-squares estimate $\hat\beta_1$ is an unbiased estimate of $\beta_1$. The ordinary-least-squares estimate $\hat\beta_2$, however, is an unbiased estimate of $\beta_2 + \gamma_2$, not $\beta_2$.

6. The Bonferroni test of the joint hypothesis $\beta_1 = \beta_{1,0}$ and $\beta_2 = \beta_{2,0}$ based on the critical value $c > 0$ uses the following rule: do not reject if $|t_1| \le c$ and $|t_2| \le c$; otherwise, reject, where $t_1$ and $t_2$ are the t-statistics that test the restrictions on $\beta_1$ and $\beta_2$ respectively. For a significance level of 5% and two restrictions, the Bonferroni critical value $c$ equals 2.241. For the following questions use a large-sample approximation for your test statistics.

(a) Using the above critical value, what is the probability of rejecting the null when the null is true

i. when $\rho_{\hat\beta_1, \hat\beta_2} = 0$?

Solution: The probability that the Bonferroni test does not reject the null hypothesis when the hypothesis is true is $\Pr(|t_1| < 2.2414, |t_2| < 2.2414)$. Asymptotically, the estimates $\hat\beta_1$ and $\hat\beta_2$ are jointly normally distributed, and for such random variables zero correlation implies independence. Hence,
$$\Pr(|t_1| < 2.2414, |t_2| < 2.2414) = \Pr(|t_1| < 2.2414) \times \Pr(|t_2| < 2.2414).$$
For a standard normal random variable $z$, $\Pr(|z| < 2.2414) = 0.975$. Hence, the probability of rejecting the null when the null is true equals $1 - 0.975^2 = 0.0494$.

ii. when $\rho_{\hat\beta_1, \hat\beta_2} = 0.5$? (Hint: If $t_1$ and $t_2$ are two jointly normally distributed random variables with correlation equal to 0.5, then $\Pr(|t_1| < 2.2414, |t_2| < 2.2414) = 0.9535$.)
Solution: Using the hint directly, the probability of rejecting the null when the null is true is $1 - 0.9535 = 0.0465$.

iii. when $\rho_{\hat\beta_1, \hat\beta_2} = 1$?
Solution: When the correlation between $\hat\beta_1$ and $\hat\beta_2$ equals one, $\Pr(|t_1| < 2.241, |t_2| < 2.241) = \Pr(|t_1| < 2.241) = 0.975$. Hence, the probability of rejecting the null when the null is true is $1 - 0.975 = 0.025$.

(b) Comment on the size and power of the Bonferroni test as the correlation between $\hat\beta_1$ and $\hat\beta_2$ increases.

Solution: As the correlation between $\hat\beta_1$ and $\hat\beta_2$ increases, the size of the test decreases: the probability of rejecting the null when the null is true falls from 0.0494 (correlation 0) to 0.0465 (correlation 0.5) to 0.025 (correlation 1). As a result, the power of the test also falls: the probability of rejecting the null when the null is false decreases.

Empirical Questions

For these empirical exercises, the required datasets and a detailed description of them can be found at www.wise.xmu.edu.cn/course/gecon/written.html.

7. (SW E7.3) The data set used in this empirical exercise (CollegeDistance) contains data from a random sample of high school seniors interviewed in 1980 and re-interviewed in 1986. In this exercise you will use these data to investigate the relationship between the number of completed years of education for young adults and the distance from each student's high school to the nearest four-year college. (Proximity to college lowers the cost of education, so that students who live closer to a four-year college should, on average, complete more years of higher education.)

Solution: The R code required for each question is listed within its respective solution. The code listed here initialises the software.

    # read the data
    CD <- read.csv("D:/R/CollegeDistance.csv")
    # attaching allows you to directly access variable names
    attach(CD)
    # add AER library for required functions
    library("AER")

The table summarises the regressions used to answer the questions.
Dependent variable: years of completed education (ED).

Regressor                                     (1)         (2)         (3)
Dist                                       -0.0734     -0.0308     -0.0326
                                           (0.0134)    (0.0116)    (0.0126)
Bytest                                                  0.0924      0.0931
                                                       (0.0030)    (0.0030)
Female                                                  0.1434      0.1439
                                                       (0.0503)    (0.0503)
Black                                                   0.3538      0.3384
                                                       (0.0675)    (0.0689)
Hispanic                                                0.4024      0.3492
                                                       (0.0737)    (0.0774)
Incomehi                                                0.3666      0.3741
                                                       (0.0622)    (0.0623)
Ownhome                                                 0.1456      0.1433
                                                       (0.0648)    (0.0652)
Dadcoll                                                 0.5699      0.5740
                                                       (0.0763)    (0.0764)
Momcoll                                                 0.3792      0.3787
                                                       (0.0836)    (0.0835)
Cue80                                                   0.0244      0.0283
                                                       (0.0093)    (0.0095)
Stwmfg80                                               -0.0502     -0.0426
                                                       (0.0196)    (0.0199)
Urban                                                               0.0652
                                                                   (0.0634)
Tuition                                                            -0.1848 .
                                                                   (0.0988)
Intercept                                  13.9559      8.8614      8.8935
                                           (0.0378)    (0.2411)    (0.2437)

Summary Statistics
F-statistic for Black and Hispanic                     22.155
F-statistic for Urban and Tuition                                   2.4253
SER                                         1.807       1.538       1.538
$R^2$                                       0.0075      0.2829      0.2838
$\bar{R}^2$                                 0.0072      0.2809      0.2814
N                                           3796        3796        3796

Model Specifications among Three Regressions of Years of Completed Education on Distance to the Nearest College and Other Independent Variables. Heteroscedasticity-Robust Standard Errors in Parentheses under Coefficients. Significance Level (Using Two-Sided Test): ***, 0.1%; **, 1%; *, 5%; ., 10%.

(a) An education advocacy group argues that, on average, a person's educational attainment would increase by approximately 0.15 year if distance to the nearest college is decreased by 20 miles. Run a regression of years of completed education (ED) on distance to the nearest college (Dist). Is the advocacy group's claim consistent with the estimated regression? Explain.

Solution:

    # model estimation
    model_1 <- lm(formula = yrsed ~ dist, data = CD)
    # heteroskedasticity-robust standard errors
    coeftest(model_1, vcov. = vcovHC(model_1, type = "HC1"))
    # summary of model estimation
    summary(model_1)

The education advocacy group's claim is that the coefficient on Dist is $-0.075$. (The variable is measured in tens of miles, so a 20-mile decrease is a change of $-2$ in Dist, and $0.15/(-2) = -0.075$.) The 95% confidence interval for $\beta_{Dist}$ from column (1) in the above table is $(-0.0734 - 1.96 \times 0.0134,\; -0.0734 + 1.96 \times 0.0134)$, or $(-0.099664, -0.047136)$, which includes the group's claim of $-0.075$. Thus, the advocacy group's claim is consistent with the estimated regression.

(b) Other factors also affect how much college a person completes. Does controlling for these other factors change the estimated effect of distance on college years completed? To answer this question, construct a table like Table 7.1 in the textbook. Include a simple specification [constructed in (a)], a base specification (that includes a set of important control variables), and several modifications of the base specification. Discuss how the estimated effect of Dist on ED changes across specifications.
Solution:

    # model estimation (base specification)
    m2 <- lm(formula = yrsed ~ dist + bytest + female + black + hispanic +
               incomehi + ownhome + dadcoll + momcoll + cue80 + stwmfg80,
             data = CD)
    # heteroskedasticity-robust standard errors
    coeftest(m2, vcov. = vcovHC(m2, type = "HC1"))
    # summary of model estimation
    summary(m2)

    # model estimation (base specification plus urban and tuition)
    m3 <- lm(formula = yrsed ~ dist + bytest + female + black + hispanic +
               incomehi + ownhome + dadcoll + momcoll + cue80 + stwmfg80 +
               urban + tuition,
             data = CD)
    # heteroskedasticity-robust standard errors
    coeftest(m3, vcov. = vcovHC(m3, type = "HC1"))
    # summary of model estimation
    summary(m3)
    # heteroskedasticity-robust variance-covariance matrix
    vcov_m3 <- vcovHC(m3, type = "HC1")
    # joint significance test
    linearHypothesis(m3, c("urban=0", "tuition=0"), vcov. = vcov_m3)

The simple specification is shown in column (1) in the above table; it includes only Dist, the distance to the nearest college. From the empirical questions in the previous problem set, we know that apart from Dist, some additional regressors controlling for characteristics of the student, the student's family, and the local labour market (Bytest, Female, Black, Hispanic, Incomehi, Ownhome, Dadcoll, Momcoll, Cue80, and Stwmfg80) are significant, too. Column (2) in the above table shows the base specification controlling for these factors. In column (2), the coefficient on Dist is $-0.0308$, less than half the magnitude of the estimate of $-0.0734$ from the simple specification in column (1). $R^2$ and SER also change substantially (from 0.0075 to 0.2829 and from 1.807 to 1.538).

Column (3) shows another specification that adds two further factors, Urban and Tuition, which are not jointly significant at the 5% level (F-statistic = 2.4253, p-value = 0.08859). The coefficient on Dist in column (3), $-0.0326$, changes little from the result in column (2). $R^2$ and SER change very little between the last two columns, too.

From the base specification in column (2), the 95% confidence interval for $\beta_{Dist}$ is $(-0.0308 - 1.96 \times 0.0116,\; -0.0308 + 1.96 \times 0.0116)$, or $(-0.053536, -0.008064)$, which does not include the group's claim of $-0.075$. Similar results are obtained from the regression in column (3), where the interval is $(-0.0326 - 1.96 \times 0.0126,\; -0.0326 + 1.96 \times 0.0126)$, or $(-0.057296, -0.007904)$.

(c) It has been argued that, controlling for other factors, blacks and Hispanics complete more college than whites. Is this result consistent with the regressions that you constructed in part (b)?
Solution:

    # heteroskedasticity-robust variance-covariance matrix
    vcov_m2 <- vcovHC(m2, type = "HC1")
    # joint significance test
    linearHypothesis(m2, c("black=0", "hispanic=0"), vcov. = vcov_m2)

Yes. The base specification in column (2) shows that the estimated coefficients $\hat\beta_{Black}$ and $\hat\beta_{Hispanic}$ are positive, large, and both individually and jointly statistically significant (the F-statistic is 22.155 and the p-value is nearly zero).

(d) Graph a 95% joint confidence interval for the coefficients on blacks and Hispanics.
Solution:

    # ellipse plot: centre at the two coefficient estimates, shape given by
    # their robust covariance block, radius from the chi-squared(2) quantile
    car::ellipse(m2$coef[5:6], vcov_m2[5:6, 5:6], sqrt(qchisq(0.95, 2)),
                 add = FALSE, xlab = "Black", ylab = "Hispanic")
    mtext("Coefficients of Black and Hispanic",
          side = 3, line = 3, font = 2, outer = FALSE)
    mtext("95% Joint Confidence Interval",
          side = 3, line = 1.5, font = 3, outer = FALSE)

[Figure: "Coefficients of Black and Hispanic: 95% Joint Confidence Interval". Confidence ellipse with the coefficient on Black on the horizontal axis and the coefficient on Hispanic on the vertical axis.]
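
The radius sqrt(qchisq(0.95, 2)) passed to car::ellipse in the solution above comes from the large-sample result that the quadratic form $(\hat\beta - \beta)' \hat{V}^{-1} (\hat\beta - \beta)$ is approximately chi-squared with 2 degrees of freedom. The following base-R sketch traces the same kind of boundary by hand so that this connection is visible. It is only an illustration under stated assumptions, not the code behind the figure: the point estimates and standard errors are those reported in column (2) of the table, but the correlation rho between the two estimators is a hypothetical value, because the covariance is not reported in this document.

```r
# Point estimates and robust SEs for Black and Hispanic, column (2) of the table
b  <- c(black = 0.3538, hispanic = 0.4024)
se <- c(0.0675, 0.0737)

# ASSUMPTION: rho is a hypothetical correlation chosen for illustration only.
# With the fitted model available, use the robust covariance block instead.
rho <- 0.3
V <- diag(se) %*% matrix(c(1, rho, rho, 1), 2, 2) %*% diag(se)

# Large-sample 95% region: (b - beta)' V^{-1} (b - beta) <= qchisq(0.95, 2)
crit <- qchisq(0.95, df = 2)

# Map the unit circle through a matrix square root of V, then scale and shift
theta  <- seq(0, 2 * pi, length.out = 200)
circle <- rbind(cos(theta), sin(theta))          # 2 x 200 points on the unit circle
L <- t(chol(V))                                  # lower triangular, L %*% t(L) equals V
boundary <- t(b + sqrt(crit) * L %*% circle)     # 200 x 2 matrix of boundary points

# Check: the quadratic form equals the critical value at every boundary point
d <- sweep(boundary, 2, b)
q <- rowSums((d %*% solve(V)) * d)
stopifnot(all(abs(q - crit) < 1e-8))

plot(boundary, type = "l", xlab = "Black", ylab = "Hispanic")
points(b[1], b[2], pch = 19)                     # centre of the ellipse
```

A larger assumed correlation tilts and narrows the ellipse, which is why the joint region cannot be read off from the two individual confidence intervals alone.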