Econometrics Problem Set 3

Conceptual Questions

1. This question refers to the estimated regressions in Table 1, computed using data for 1988 from the U.S. Current Population Survey. The data set consists of information on 4000 full-time, full-year workers. The highest educational achievement for each worker was either a high school diploma or a bachelor's degree. The workers' ages ranged from 25 to 34 years. The data set also contained information on the region of the country where the person lived, marital status, and number of children. For the purposes of these exercises let

   AHE = average hourly earnings (in 1998 dollars)
   College = binary variable (1 if college, 0 if high school)
   Female = binary variable (1 if female, 0 if male)
   Age = age (in years)
   Northeast = binary variable (1 if Region = Northeast, 0 otherwise)
   Midwest = binary variable (1 if Region = Midwest, 0 otherwise)
   South = binary variable (1 if Region = South, 0 otherwise)
   West = binary variable (1 if Region = West, 0 otherwise)

   (a) Compute $\bar{R}^2$ for each of the regressions.
   (b) Using the regression results in column (1):
       i. Do workers with college degrees earn more, on average, than workers with only high school degrees? How much more?
       ii. Do men earn more than women on average? How much more?
   (c) Using the regression results in column (2):
       i. Is age an important determinant of earnings? Explain.
       ii. Sally is a 29-year-old female college graduate. Betsy is a 34-year-old female college graduate. Predict Sally's and Betsy's earnings.
   (d) Using the regression results in column (3):
       i. Do there appear to be important regional differences?
       ii. Why is the regressor West omitted from the regression? What would happen if it were included?
       iii. Juanita is a 28-year-old female college graduate from the South. Jennifer is a 28-year-old female college graduate from the Midwest. Calculate the expected difference in earnings between Juanita and Jennifer.

Table 1: Results of Regressions of Average Hourly Earnings on Gender and Education Binary Variables and Other Characteristics Using 1988 Data from the Current Population Survey

Dependent variable: average hourly earnings (AHE).

Regressor            (1)       (2)       (3)
College (X1)         5.46      5.48      5.44
Female (X2)         -2.64     -2.62     -2.62
Age (X3)                       0.29      0.29
Northeast (X4)                           0.69
Midwest (X5)                             0.60
South (X6)                              -0.27
Intercept           12.69      4.40      3.75

Summary Statistics
SER                  6.27      6.22      6.21
R^2                  0.176     0.190     0.194
adjusted R^2
n                    4000      4000      4000

2. (SW 6.10) $(Y_i, X_{1,i}, X_{2,i})$ satisfy the four least squares assumptions of the multiple regression model; in addition, $\mathrm{var}(u_i \mid X_{1,i}, X_{2,i}) = 4$ and $\mathrm{var}(X_{1,i}) = 6$. A random sample of size $n = 400$ is drawn from the population.
   (a) Assume that $X_1$ and $X_2$ are uncorrelated. Compute the variance of $\hat{\beta}_1$. [Hint: The variance of $\hat{\beta}_1$ is
       $\sigma^2_{\hat{\beta}_1} = \frac{1}{n} \left[ \frac{1}{1 - \rho^2_{X_1,X_2}} \right] \frac{\sigma^2_u}{\sigma^2_{X_1}}$.]
   (b) Assume that $\mathrm{corr}(X_1, X_2) = 0.5$. Compute the variance of $\hat{\beta}_1$.
   (c) Comment on the following statements: "When $X_1$ and $X_2$ are correlated, the variance of $\hat{\beta}_1$ is larger than it would be if $X_1$ and $X_2$ were uncorrelated. Thus, if you are interested in $\beta_1$, it is best to leave $X_2$ out of the regression if it is correlated with $X_1$."

3. (SW 6.11) Consider the regression model $Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$ for $i = 1, \ldots, n$. (Notice that there is no constant term in the regression.)
   (a) Specify the least squares function that is minimized by OLS.
   (b) Compute the partial derivatives of the objective function with respect to $b_1$ and $b_2$.
   (c) Suppose that $\sum_{i=1}^{n} X_{1i} X_{2i} = 0$. Show that $\hat{\beta}_1 = \sum_{i=1}^{n} X_{1i} Y_i \big/ \sum_{i=1}^{n} X_{1i}^2$.
   (d) Suppose that $\sum_{i=1}^{n} X_{1i} X_{2i} \neq 0$. Derive an expression for $\hat{\beta}_1$ as a function of the data $(Y_i, X_{1i}, X_{2i})$, $i = 1, \ldots, n$.
   (e) Suppose that the model includes an intercept: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$. Show that the least squares estimators satisfy $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}_1 - \hat{\beta}_2 \bar{X}_2$.

4. (SW 7.7) Data were collected from a random sample of 220 home sales from a community in 2003. Let P denote the selling price (in $1000), BDR denote the number of bedrooms, Bath denote the number of bathrooms, Hsize denote the size of the house (in square feet), Lsize denote the lot size (in square feet), Age denote the age of the house (in years), and Pr denote a binary variable that is equal to 1 if the condition of the house is reported as poor. An estimated regression yields (standard errors in parentheses)

   $\hat{P}$ = 119.2  + 0.485 BDR + 23.4 Bath + 0.156 Hsize + 0.002 Lsize + 0.090 Age - 48.8 Pr
               (23.9)   (2.61)      (8.94)      (0.011)       (0.00048)     (0.311)     (10.5)

   SER = 41.5, $R^2$ = 0.72.

   (a) Is the coefficient on BDR statistically significantly different from zero?
   (b) Typically five-bedroom houses sell for much more than two-bedroom houses. Is this consistent with your answer to (a) and with the regression more generally?
   (c) A homeowner purchases 2000 square feet from an adjacent lot. Construct a 99% confidence interval for the change in the value of her house.
   (d) Lot size is measured in square feet. Do you think that another scale might be more appropriate? Why or why not?
   (e) The F-statistic for omitting BDR and Age from the regression is F = 0.08. Are the coefficients on BDR and Age statistically different from zero at the 10% level?
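The mechanics behind Question 4 (a t-statistic and a confidence interval built from a coefficient and its standard error) can be sketched in a few lines of Python. This is only an illustration, and it makes two assumptions: the sample is treated as large, so standard normal critical values are used, and the helper names `t_statistic`, `two_sided_p_value`, and `confidence_interval` are made up for this sketch. It is applied to the Bath coefficient, which none of the parts above ask about.

```python
from statistics import NormalDist


def t_statistic(coef, se, null_value=0.0):
    """t-statistic for the null hypothesis beta = null_value."""
    return (coef - null_value) / se


def two_sided_p_value(t):
    """Two-sided p-value using the standard normal (large-sample assumption)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def confidence_interval(coef, se, level=0.95):
    """Large-sample interval coef +/- z * se, with z the normal critical value."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return coef - z * se, coef + z * se


# Bath coefficient and standard error taken from the regression in Question 4.
t_bath = t_statistic(23.4, 8.94)
p_bath = two_sided_p_value(t_bath)
ci_bath = confidence_interval(23.4, 8.94, level=0.95)
print(f"t = {t_bath:.2f}, p = {p_bath:.4f}, "
      f"95% CI = ({ci_bath[0]:.2f}, {ci_bath[1]:.2f})")
```

Changing `level` to 0.99 gives the wider interval needed for a 99% confidence statement; a change of several units in a regressor scales both the coefficient and its standard error by the size of that change.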
5. A study was conducted to determine whether certain features could be used to explain variability in the price of furnaces. For a sample of 19 furnaces the following regression was estimated (standard errors in parentheses):

   $\hat{Y}$ = -68.23 + 0.0023 $X_1$ + 19.73 $X_2$ + 7.65 $X_3$
                        (0.005)        (8.99)        (3.082)

   SER = 41.5, $R^2$ = 0.72,

   where
   Y = Price, in dollars,
   $X_1$ = Rating of furnace, in BTU per hour,
   $X_2$ = Energy efficiency ratio,
   $X_3$ = Number of settings.

   The standard errors reported here assume homoskedasticity of the error term.
   (a) What assumptions are required to be able to use this regression analysis for statistical inference?
   (b) Under the required assumptions for statistical inference, find a 95% confidence interval for the expected increase in price resulting from an additional setting when the values of the rating and the energy efficiency ratio remain fixed.
   (c) Under the required assumptions for statistical inference, test the null hypothesis that, all else being equal, the energy efficiency ratio of furnaces does not affect their price against the alternative that the higher the energy efficiency ratio, the higher the price.
   (d) Under the required assumptions for statistical inference, test the null hypothesis that, taken together, the three independent variables do not linearly influence the price of the furnaces.

6. (SW 7.9) Consider the regression model $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$. Use the "transform the regression" approach discussed in class to transform the regression so that you can use a t-statistic to test
   (a) $\beta_1 = \beta_2$;
   (b) $\beta_1 + a\beta_2 = 0$, where $a$ is a constant;
   (c) $\beta_1 + \beta_2 = 1$ (Hint: You must redefine the dependent variable in the regression.);
   (d) $\beta_1 + \beta_2 = a$, where $a$ is a constant.
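As a reminder of the idea behind Question 6, here is a sketch of the transformation for the simplest restriction, $\beta_1 = \beta_2$, obtained by adding and subtracting $\beta_2 X_{1i}$. The symbols $\gamma_1$ and $W_i$ are notation introduced for this sketch only; they do not appear in the problem statement.

```latex
\begin{align*}
Y_i &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \\
    &= \beta_0 + (\beta_1 - \beta_2) X_{1i} + \beta_2 (X_{1i} + X_{2i}) + u_i \\
    &= \beta_0 + \gamma_1 X_{1i} + \beta_2 W_i + u_i,
    \qquad \gamma_1 = \beta_1 - \beta_2, \quad W_i = X_{1i} + X_{2i}.
\end{align*}
```

Under this rewriting, the restriction on two coefficients becomes $\gamma_1 = 0$, a restriction on a single coefficient, so the usual t-statistic on $X_{1i}$ in a regression of $Y_i$ on $X_{1i}$ and $W_i$ can be used. The remaining parts follow the same pattern with a different regrouping of terms.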
Empirical Questions

For these empirical exercises, the required datasets and a detailed description of them can be found at www.xmueconometrics.weebly.com.

7. (SW E7.3) The data set used in this empirical exercise (CollegeDistance) contains data from a random sample of high school seniors interviewed in 1980 and re-interviewed in 1986. In this exercise you will use these data to investigate the relationship between the number of completed years of education for young adults and the distance from each student's high school to the nearest four-year college. (Proximity to college lowers the cost of education, so that students who live closer to a four-year college should, on average, complete more years of higher education.)
   (a) An education advocacy group argues that, on average, a person's educational attainment would increase by approximately 0.15 year if distance to the nearest college is decreased by 20 miles. Run a regression of years of completed education (ED) on distance to the nearest college (Dist). Is the advocacy group's claim consistent with the estimated regression? Explain.
   (b) Other factors also affect how much college a person completes. Does controlling for these other factors change the estimated effect of distance on college years completed? To answer this question, construct a table like Table 7.1 in the textbook. Include a simple specification [constructed in (a)], a base specification (that includes a set of important control variables), and several modifications of the base specification. Discuss how the estimated effect of Dist on ED changes across specifications.
   (c) It has been argued that, controlling for other factors, blacks and Hispanics complete more college than whites. Is this result consistent with the regressions that you constructed in part (b)?
   (d) Graph a 95% joint confidence interval for the coefficients on blacks and Hispanics.
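For part (a) of Question 7, the arithmetic of a single-regressor OLS fit and the implied effect of a change in distance can be sketched as follows. This is a minimal illustration: the function names `simple_ols` and `implied_change` are made up here, and the numbers are hypothetical toy values, NOT the CollegeDistance data. Check the units of Dist in the data description before converting "20 miles" into a change in the regressor.

```python
def simple_ols(x, y):
    """Return (intercept, slope) from an OLS regression of y on x with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope = sample covariance of (x, y) divided by sample variance of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def implied_change(slope, delta_x):
    """Predicted change in y when x changes by delta_x."""
    return slope * delta_x


# Hypothetical toy numbers for illustration only (distance assumed to be in miles).
dist = [0.0, 10.0, 20.0, 30.0, 40.0]
ed = [14.0, 13.9, 13.8, 13.7, 13.6]

b0, b1 = simple_ols(dist, ed)
change = implied_change(b1, -20.0)  # effect of moving 20 units closer
print(f"intercept = {b0:.3f}, slope = {b1:.4f}, implied change = {change:.3f}")
```

With the real data, the same comparison needs a standard error for the slope so that the advocacy group's figure can be checked against a confidence interval and not just the point estimate.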