ECON 497 Midterm Spring 2009

ECON 497: Economic Research and Forecasting
Name:
Spring 2009
Bellas
Midterm

You have three hours and twenty minutes to complete this exam. Answer all questions and explain your answers. Fifty points total, points per part indicated in parentheses.

1. Omitted variables are a fun part of any regression model. Imagine that a regression is done estimating the time it takes a student at my son's elementary school to run 100 meters. The model estimated is:

T_i = 28 - 2M_i - 1.5G_i

where T_i is the time (in seconds) that it took student i to run 100 meters, M_i is a male dummy variable and G_i is the grade level (0 through 6) of the student. One excluded variable is the student's height. Assuming that height is relevant to a person's speed in the 100 meter dash, how would the exclusion of height bias the estimated coefficient on G_i? Explain. (2)

Height is probably positively correlated with grade and likely has a negative impact on time, so the bias term would be negative. The exclusion of height would negatively bias the estimated coefficient on G.

2. Explain briefly what endogeneity is and offer a simple example. (2)

Endogeneity occurs when there is a causal link between an explanatory variable and either other explanatory variables or the dependent variable, so that the value of the explanatory variable in question is not truly independent of the other variables in the equation. One example is that the acceleration time of a car depends on its weight and horsepower, but horsepower might depend on the weight of the car, so horsepower is endogenous.
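The sign argument in Question 1 can be checked numerically. The sketch below is only an illustration under invented assumptions: the "true" coefficients (-1.0 on grade, -0.1 on height), the height rule (110 cm plus 6 cm per grade) and the data are all hypothetical, and the male dummy is left out to keep the example short.

```python
# Hypothetical illustration of omitted variable bias (every number here is invented).

def ols_slope(x, y):
    """Slope from a one-variable OLS regression of y on x (with an intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    return s_xy / s_xx

# Two students per grade (0 through 6): one 5 cm taller than typical, one 5 cm shorter,
# so height rises with grade but the deviations are uncorrelated with grade.
grades = [g for g in range(7) for _ in range(2)]
heights = [110 + 6 * g + d for g in range(7) for d in (5, -5)]

# Assumed "true" model: time falls with grade and with height.
TRUE_GRADE_EFFECT = -1.0
TRUE_HEIGHT_EFFECT = -0.1
times = [30 + TRUE_GRADE_EFFECT * g + TRUE_HEIGHT_EFFECT * h
         for g, h in zip(grades, heights)]

# Regress time on grade alone, leaving height out.
slope_without_height = ols_slope(grades, times)
bias = slope_without_height - TRUE_GRADE_EFFECT

print(f"coefficient on grade with height omitted: {slope_without_height:.2f}")
print(f"bias relative to the assumed true effect: {bias:.2f}")
```

With height positively related to grade and negatively related to time, the coefficient on grade absorbs part of the height effect and comes out more negative than the assumed true value.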
3. Linear regression involves estimating a linear relationship between one or more independent or explanatory variables and a dependent variable. Imagine that such a relationship has been estimated between a person's income in thousands of dollars (I_i), their age (A_i), a dummy variable indicating whether they are male (M_i), a dummy variable indicating whether they have a college degree (C_i) and a male-age interactive term, equal to the product of the male dummy and their age (MA_i). The estimated equation is:

I_i = -2.0 + 0.5A_i + 5.0M_i + 12.0C_i - 0.2MA_i

A. Calculate the predicted income for a 30 year old woman with no college degree. (1)

I-hat = -2.0 + 0.5*30 = 13, or $13,000.

B. Calculate the predicted income for a 30 year old woman with a college degree. (1)

I-hat = -2.0 + 0.5*30 + 12.0*1 = 25, or $25,000.

C. What is the interpretation of the coefficient of 5.0 on the male dummy? (1)

Other things being the same, on average a male would earn $5,000 more than a female. Strictly, because of the male-age interaction term, this is the gap at age zero; at age A the predicted gap is 5.0 - 0.2A thousand dollars.

D. On one set of axes, draw a basic graph of income versus age for a woman who has a college degree and for a man who has a college degree. I will grade this based on the relative positions of the vertical intercepts and the relative slopes of the two lines. (2)
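The arithmetic in Question 3 can be reproduced with a short sketch. The coefficients come straight from the estimated equation; the function names are invented for this example. The last function reads off the intercept and slope of the income-age line for each group, which is what part D asks to be drawn.

```python
def predicted_income(age, male, college):
    """Predicted income in thousands of dollars from the estimated equation in Question 3."""
    return -2.0 + 0.5 * age + 5.0 * male + 12.0 * college - 0.2 * (male * age)

def income_age_line(male, college):
    """Vertical intercept and slope of predicted income against age for one group."""
    intercept = -2.0 + 5.0 * male + 12.0 * college
    slope = 0.5 - 0.2 * male
    return intercept, slope

# Parts A and B: a 30 year old woman without and with a college degree.
print(predicted_income(age=30, male=0, college=0))
print(predicted_income(age=30, male=0, college=1))

# Part D: college-educated woman versus college-educated man.
print("woman with degree:", income_age_line(male=0, college=1))
print("man with degree:  ", income_age_line(male=1, college=1))
```

The man's line starts higher (the 5.0 on the male dummy) but is flatter (the -0.2 on the interaction term), so the two lines eventually cross.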
5. Consider the following output from a regression done in SPSS.

Regression of acceleration time (S) on a manual transmission dummy (T), drag coefficient (E) and horsepower (H).

ANOVA(b)

Model 1       Sum of Squares    df    Mean Square    F         Sig.
Regression    178.917            3    59.639         26.889    .000(a)
Residual       75.411           34     2.218
Total         254.328           37

a. Predictors: (Constant), H, T, E
b. Dependent Variable: S

Coefficients(a)

              Unstandardized Coefficients    Standardized Coefficients
Model 1       B          Std. Error          Beta                         t         Sig.
(Constant)    10.322     1.396                                            7.396     .000
T             -.963       .548               -.169                        -1.758    .088
E             8.061      3.638                .215                        2.216     .034
H             -.018       .002               -.772                        -8.179    .000

a. Dependent Variable: S

A. What is the null hypothesis of the F test? (1)

That all of the slope coefficients are jointly zero. Put somewhat differently, the null hypothesis is that the model is worthless or, to use a technical term, crap.

B. Briefly discuss the meaning/interpretation of the Sig. value for the F test. (1)

The very small (<0.001) value of the Sig. value or p-value for the F-test suggests that the null hypothesis should be rejected, meaning that at least one slope coefficient is not zero or, alternatively, that the model is of some value.

C. What would you tell someone who asked whether, according to this model, the type of transmission (T) a car has affects its acceleration time? Please be careful and complete in your answer. (3)

While the estimated coefficient on T is not significant at the 5% level, it is significant at the 10% level and suggests that the type of transmission that a car has does impact its acceleration time even if, by some standards, this result is not statistically significant.
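As a cross-check on the SPSS output in Question 5, the F statistic and the t statistics can be rebuilt from the other columns. This is a sketch using the rounded values printed in the tables, so small differences from the reported figures are expected (most visibly for H, whose coefficient and standard error are shown to only three decimals).

```python
# Values copied from the SPSS tables in Question 5.
ss_regression, df_regression = 178.917, 3
ss_residual, df_residual = 75.411, 34

# F is the ratio of the two mean squares.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual

# R-squared is the explained share of the total sum of squares.
r_squared = ss_regression / (ss_regression + ss_residual)

# Each t statistic is the coefficient divided by its standard error.
coefficients = {"T": (-0.963, 0.548), "E": (8.061, 3.638), "H": (-0.018, 0.002)}
t_stats = {name: b / se for name, (b, se) in coefficients.items()}

print(f"F = {f_stat:.3f}, R-squared = {r_squared:.3f}")
for name, t in t_stats.items():
    print(f"t for {name} = {t:.3f}")
```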
6. Consider the following diagram:

A. Clearly indicate in the diagram the linear regression residuals e_1, e_2 and e_3. (1)

B. Fill in the blank: e_1 + e_2 + e_3 = 0. (1)

7. Explain briefly why the R^2 value from a regression based on two data points will be 1.000. (2)

Because a line drawn using two data points will pass exactly through each of those points, leaving no residual, meaning that RSS = 0 so that ESS = TSS and R^2 = 1.

8. Write out the relationship between total sum of squares (TSS), explained sum of squares (ESS) and residual sum of squares (RSS). (1)

TSS = ESS + RSS
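Questions 6 through 8 can be illustrated together with a small sketch. The data points are invented for the example; the code fits a line by OLS, then reports the residuals and the three sums of squares, first for three points and then for two.

```python
def fit_line(x, y):
    """OLS intercept and slope for a regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope

def decompose(x, y):
    """Return TSS, ESS, RSS and the list of residuals for the fitted line."""
    intercept, slope = fit_line(x, y)
    y_bar = sum(y) / len(y)
    fitted = [intercept + slope * a for a in x]
    residuals = [b - f for b, f in zip(y, fitted)]
    tss = sum((b - y_bar) ** 2 for b in y)
    ess = sum((f - y_bar) ** 2 for f in fitted)
    rss = sum(e ** 2 for e in residuals)
    return tss, ess, rss, residuals

# Three hypothetical points: residuals sum to zero and TSS = ESS + RSS.
tss3, ess3, rss3, res3 = decompose([1, 2, 3], [2, 3, 7])
print("residuals:", res3, "sum:", sum(res3))
print("TSS, ESS + RSS:", tss3, ess3 + rss3)

# Two hypothetical points: the line passes through both, so RSS = 0 and R^2 = 1.
tss2, ess2, rss2, res2 = decompose([1, 3], [2, 7])
print("R^2 with two points:", ess2 / tss2)
```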
9. A regression can suffer from several different violations of the classical assumptions. Among these are:

Heteroskedasticity
Omitted Variables
Serial Correlation
Multicollinearity
Endogeneity

For each of the items presented below, tell me which of these violations it addresses and, based on what you see, whether this is likely a problem or not. Please explain briefly. Two points each.

A.

This is probably heteroskedasticity because the variation of the error term seems to depend on the value of X. It might also suggest that X^2 is an omitted variable. It could also be serial correlation if X is time. This does seem to be a problem.

B. VIF = 3.836

The VIF is a test for multicollinearity, but its small value (<5) suggests that it is not a problem here.
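For item B, it may help to see what a VIF of 3.836 implies. The sketch below uses the standard definition VIF = 1 / (1 - R_j^2), where R_j^2 comes from an auxiliary regression of one explanatory variable on all of the others; the function names are invented for this example.

```python
def vif_from_r_squared(aux_r_squared):
    """Variance inflation factor implied by the auxiliary regression's R-squared."""
    return 1.0 / (1.0 - aux_r_squared)

def r_squared_from_vif(vif):
    """Invert the VIF formula to recover the auxiliary R-squared."""
    return 1.0 - 1.0 / vif

# The VIF reported in item B.
implied_r_squared = r_squared_from_vif(3.836)
print(f"auxiliary R-squared implied by VIF = 3.836: {implied_r_squared:.3f}")

# The rule-of-thumb thresholds of 5 and 10 expressed as auxiliary R-squared values.
for threshold in (5, 10):
    print(f"VIF = {threshold} corresponds to R-squared = {r_squared_from_vif(threshold):.2f}")
```

A VIF just under 4 therefore means that roughly three quarters of the variation in that variable is explained by the other explanatory variables, which is below the usual rule-of-thumb cutoffs.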
C. A negative and significant coefficient from a Park test.

The Park test is used to detect heteroskedasticity and the significant, albeit negative, result here suggests that it is a problem. The variance of the error term is greater when the value of the explanatory factor is smaller.

D. An unbelievably large estimated coefficient on an explanatory variable.

This suggests some omitted variable for which the included variable's estimated coefficient is trying to compensate.

E. A high R^2 from your regression, but no estimated coefficients that are significantly different from zero.

This is one of the classic signs of multicollinearity, and to the extent that collinearity is ever a problem, it seems to be a problem here.
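To see where a negative Park test coefficient comes from in item C, here is a minimal sketch of the auxiliary regression behind the test: the log of the squared residuals regressed on the log of the suspect variable. The residuals are invented and constructed to shrink as X grows, and only the slope is computed; a real test would also need its standard error and t statistic.

```python
import math

def ols_slope(x, y):
    """Slope from a one-variable OLS regression of y on x (with an intercept)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    return s_xy / s_xx

def park_slope(x, residuals):
    """Slope of ln(e^2) on ln(X), the coefficient examined in a Park test."""
    log_x = [math.log(v) for v in x]
    log_e2 = [math.log(e * e) for e in residuals]
    return ols_slope(log_x, log_e2)

# Hypothetical residuals whose magnitude is proportional to 1/X, with alternating signs.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
residuals = [(-1) ** i * 3.0 / v for i, v in enumerate(x)]

slope = park_slope(x, residuals)
print(f"Park test slope: {slope:.2f}")
```

A negative slope says the squared residuals fall as X rises, which matches the statement that the error variance is greater when the explanatory factor is smaller.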
10. At the end of this exam, you will find Excel regression output from a regression of house price on various explanatory variables. Use these regression results to answer the following questions.

A. According to Studenmund's four criteria, should the variable Age be included in the model? Explain. (2)

In theory, the age of a house should matter for its price. The estimated coefficient on AGE is significant. It is also positive, which may be the expected result or not, depending on how you view older houses. The adjusted R^2 doesn't really change when AGE is added, so this is a bit of a toss up. Excluding AGE seems to greatly bias BATHS, suggesting that AGE should be included. Overall, it should be included. This is largely because of the theoretical reasons, but also because of the bias on BATHS.

B. With which other explanatory variable is Age most highly correlated? Explain how you know this. (2)

It is most highly correlated with BATHS and you can tell this because of the huge bias in the estimated coefficient on BATHS when AGE is omitted.

C. Is the correlation between Age and this other variable positive or negative? Explain how you know. (2)

The correlation is negative. The impact of BATHS on price should be positive, but the estimated coefficient on BATHS is much smaller when AGE is excluded, so the bias is negative. Since the estimated coefficient on AGE is positive, a negative bias means the correlation must be negative.
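The reasoning in parts B and C of Question 10 can be put into numbers. For OLS, the coefficient on an included variable when another variable is dropped equals its coefficient in the full model plus the dropped variable's coefficient times delta, where delta is the coefficient on the included variable in an auxiliary regression of the dropped variable on all the included ones. The sketch below backs out that implied delta for BATHS from the two Excel outputs at the end of the exam; because the printed coefficients are rounded, the result is approximate.

```python
# Coefficients copied from the Excel output at the end of the exam.
beta_baths_full = 46263.97        # BATHS, model that includes AGE
beta_baths_restricted = 473.62    # BATHS, model that omits AGE
beta_age_full = 1573.48           # AGE, model that includes AGE

# Bias on BATHS from omitting AGE.
bias_on_baths = beta_baths_restricted - beta_baths_full

# Implied slope of AGE on BATHS, holding the other included variables fixed.
implied_delta = bias_on_baths / beta_age_full

print(f"bias on BATHS when AGE is omitted: {bias_on_baths:.2f}")
print(f"implied auxiliary coefficient of AGE on BATHS: {implied_delta:.2f}")
```

The implied delta is negative: holding the other variables fixed, houses with more bathrooms tend to be newer, which is the negative relationship described in part C.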
11. Even after all these years of teaching this subject, I still get some sick pleasure out of watching people worry about multicollinearity.

A. What three options are available for detection of multicollinearity? (2)

Scatterplots showing relationships between explanatory variables.
Correlation coefficients between explanatory variables.
High R^2 and few or no significant estimated coefficients.
High VIF numbers, generally greater than 5 or 10.

B. In one word, what should you do to address this problem in your regression? (1)

Nothing.

12. Consider the simplest possible regression model:

Y_i = β_0 + β_1 X_i + ε_i

From which of the following violations of the underlying assumptions of OLS could this regression not possibly suffer? Explain why not. (2)

Endogeneity
Serial Correlation: Might not be a problem as this doesn't seem to be a time series regression.
Heteroskedasticity
Multicollinearity: This can't be a problem because there is only one explanatory variable.
Omitted variable bias
13. Here is some totally fake regression output. Calculate the correct values for the blanks. If you can't calculate a value, make your best guess and justify it.

ANOVA

Model         Sum of Squares    df     Mean Square    F         Sig.
Regression    12200               2    579            97.685    BLANK B
Residual        800              98    928
Total         BLANK A           100

Coefficients

Model         B          Std. Error    Standardized Coefficient Beta    t          Sig.
(Constant)    10.50      3.50                                           BLANK C    0.002
X1             3.00      0.02          0.385                            150.00     BLANK D
X2            BLANK E    2.00          0.477                            6.00       0.000

A. (1) 12200 + 800 = 13000

B. (1) Given the high F stat this is probably 0.000.

C. (1) 10.50/3.50 = 3.000.

D. (1) Given the very large t stat this is probably 0.000.

E. (1) E/2.00 = 6.00 -> E = 12.00.

F. Calculate the R^2 for this regression. (1) 12200/13000 ≈ 0.938
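The blanks in Question 13 that can be calculated follow from three identities: TSS = ESS + RSS, t = B / Std. Error, and R^2 = ESS / TSS. A short sketch using the numbers from the fake output (variable names are invented for the example):

```python
# Numbers copied from the fake output in Question 13.
ss_regression = 12200
ss_residual = 800

# Blank A: the total sum of squares.
ss_total = ss_regression + ss_residual

# Blank C: t statistic for the constant.
t_constant = 10.50 / 3.50

# Blank E: coefficient on X2, recovered from its t statistic and standard error.
b_x2 = 6.00 * 2.00

# Part F: R-squared.
r_squared = ss_regression / ss_total

print("Blank A:", ss_total)
print("Blank C:", t_constant)
print("Blank E:", b_x2)
print(f"R-squared: {r_squared:.3f}")
```

Blanks B and D are p-values and cannot be recovered from these identities alone, which is why the answers treat them as informed guesses.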
14. A colleague estimates the following regression model:

Y_i = β_0 + β_1 X_1i + β_2 X_2i + β_3 X_3i + ε_i

She gets a low R^2 and isn't too happy. She then estimates the following, slightly modified version of the model:

ln Y_i = β_0 + β_1 X_1i + β_2 X_2i + β_3 X_3i + ε_i

She gets a much higher R^2 and is very excited. She claims that the higher R^2 for the second model strongly supports the idea that this is the correct model to use for the data. What should you tell her? (2)

You can't compare the two R-squared figures because the dependent variable has been transformed in a non-linear way.
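One common workaround for the problem in Question 14, offered here as a sketch rather than as part of the exam answer, is to convert the fitted values from the ln Y model back to the original units and compute an R^2 against Y itself, sometimes called a quasi-R^2. The data and fitted values below are invented for illustration.

```python
import math

def r_squared(actual, fitted):
    """1 - RSS/TSS, with both sums measured in the units of `actual`."""
    mean = sum(actual) / len(actual)
    rss = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    tss = sum((a - mean) ** 2 for a in actual)
    return 1.0 - rss / tss

def quasi_r_squared(y, fitted_log_y):
    """R-squared in units of Y, using exponentiated fitted values from a ln Y model."""
    return r_squared(y, [math.exp(v) for v in fitted_log_y])

# Hypothetical observations and hypothetical fitted values from a ln Y regression.
y = [10.0, 20.0, 45.0, 80.0]
fitted_log_y = [2.4, 3.0, 3.7, 4.4]

# R-squared as the log model reports it (variation in ln Y).
log_scale = r_squared([math.log(v) for v in y], fitted_log_y)
# R-squared measured against Y, comparable with the linear model's figure.
original_scale = quasi_r_squared(y, fitted_log_y)

print(f"R-squared on the ln Y scale: {log_scale:.4f}")
print(f"quasi-R-squared on the Y scale: {original_scale:.4f}")
```

The two figures differ because each is a share of a different total sum of squares, which is the reason the raw R^2 values from the two models cannot be compared directly.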
15. What is the most important problem with the following regression of house prices on various house characteristics, as seen on the second homework assignment? (2)

Model Summary

Model    R          R Square    Adjusted R Square    Std. Error of the Estimate
1        .506(a)    .257        .190                 3.785884968781858E4

a. Predictors: (Constant), AGE, SQFT, NEIGH, BATH

Coefficients(a)

              Unstandardized Coefficients    Standardized Coefficients
Model 1       B            Std. Error        Beta                         t        Sig.
(Constant)    35330.915    23840.136                                      1.482    .145
SQFT              5.172       10.662          .086                         .485    .630
BATH          27204.392    11654.488          .418                        2.334    .024
NEIGH          1992.057     6808.152          .038                         .293    .771
AGE            -187.893      208.897         -.117                        -.899    .373

a. Dependent Variable: PRICE

NEIGH is a qualitative variable and should be recoded as a series of dummy variables. Look for this question again on the final exam.
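The recoding called for in Question 15 can be sketched as follows. The neighborhood codes are hypothetical (the exam does not list the actual values of NEIGH), and the function name is invented for the example. One category is left out as the base so that the dummies are not perfectly collinear with the constant.

```python
def make_dummies(values, prefix, base):
    """Turn a qualitative variable into 0/1 dummy columns, omitting the base category."""
    categories = sorted(set(values) - {base})
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }

# Hypothetical neighborhood codes for five houses.
neigh = [1, 3, 2, 1, 3]

# Neighborhood 1 is the omitted base category.
dummies = make_dummies(neigh, prefix="NEIGH", base=1)
for name, column in dummies.items():
    print(name, column)
```

Each dummy's coefficient would then measure the price difference between that neighborhood and the base neighborhood, rather than forcing a single slope on an arbitrary numeric code.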
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.46
R Square             0.21
Adjusted R Square    0.20
Standard Error       285627.86
Observations         653.00

             Coefficients    Standard Error    t Stat    P-value
Intercept    -102126.32      67119.17          -1.52     0.13
SQFTTOTL         169.96         22.35           7.60     0.00
SQFTLOT            0.16          0.17           0.92     0.36
STORIES        36901.19      27087.47           1.36     0.17
BATHS          46263.97      30138.08           1.54     0.13
BEDS          -34681.51      16811.06          -2.06     0.04
AGE             1573.48        591.21           2.66     0.01

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.45
R Square             0.20
Adjusted R Square    0.20
Standard Error       286967.50
Observations         653.00

             Coefficients    Standard Error    t Stat    P-value
Intercept      17440.21      50100.59           0.35     0.73
SQFTTOTL         177.13         22.30           7.94     0.00
SQFTLOT            0.16          0.17           0.91     0.36
STORIES        27994.00      27006.00           1.04     0.30
BATHS            473.62      24860.54           0.02     0.98
BEDS          -28696.38      16738.11          -1.71     0.09