Question 1 [17 points]: (ch 11)

A study analyzed the probability that Major League Baseball (MLB) players "survive" for another season, that is, play one more season. The researchers studied a model of the following form:

$\Pr(Y = 1 \mid Seasons, Perf, Avgperf) = \Phi(\beta_0 + \beta_1 Seasons + \beta_2 Perf + \beta_3 Avgperf)$

The dependent variable is a binary variable that takes on a value of one if the player played one more season (a minimum of 50 at bats or 25 innings pitched), and zero otherwise. Seasons is the total number of seasons played, measured in years; Perf is the performance of the player this year; and Avgperf is the average performance of the player over their career. The researchers had a sample of 4,728 hitters and 3,803 pitchers for the years 1901-1999. All explanatory variables are standardized (sample mean of 0, variance of 1). Probit estimation yielded the results shown in the table:

    Regression                  (1) Hitters       (2) Pitchers
    Regression model            probit            probit
    constant                     2.010 (0.030)     1.625 (0.031)
    number of seasons played    -0.058 (0.004)    -0.031 (0.005)
    performance                  0.794 (0.025)     0.677 (0.026)
    average performance          0.022 (0.033)     0.100 (0.036)

(a) (6p) Interpret the two probit equations and calculate survival probabilities for hitters and pitchers at the sample mean. Provide an explanation for why these are so high.
(b) (6p) Calculate the change in the survival probability for a player who has a very bad year by performing two standard deviations below the average (assume also that this player has been in the majors for many years so that his average performance is negligibly affected). How does this change the survival probability when compared to the answer in (a)?
(c) (5p) Since the results for hitters and pitchers seem similar, the researcher could consider combining the two samples. With a combined sample, how could you test the hypothesis that the coefficients for the explanatory variables are the same for hitters and pitchers? Explain in some detail.

Answer:
(a) In both equations the survival probability falls with the number of seasons played and rises with current and average performance. Because all variables are standardized, their sample means are zero, so the probit index at the sample mean equals the constant. This results in a survival probability of $\Phi(2.010) \approx 0.978$ for hitters and $\Phi(1.625) \approx 0.948$ for pitchers. These results are so high because there is a high probability, in general, for a player to return the following season.
(b) Since the variables are standardized, a year two standard deviations below average implies a change of -2 in the performance variable. The index falls to $2.010 - 2(0.794) = 0.422$ for hitters and $1.625 - 2(0.677) = 0.271$ for pitchers. The survival probability is therefore lowered to $\Phi(0.422) \approx 0.664$ for hitters and to $\Phi(0.271) \approx 0.607$ for pitchers, a drop of roughly 0.31 and 0.34, respectively, compared to (a).
(c) After combining the samples for hitters and pitchers, you would allow for a different intercept and different slopes by introducing a binary variable for pitchers, with hitters as the default. This binary variable would be introduced by itself and interacted with each of the above variables, thereby allowing all coefficients to differ. You could then conduct an F-test of the joint hypothesis that all coefficients involving the binary variable are zero. If the hypothesis cannot be rejected, then there is no evidence of a difference between the coefficients for hitters and pitchers.
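The probit probabilities in parts (a) and (b) can be checked numerically. The following is a minimal sketch, not part of the original exam: the helper names `norm_cdf` and `survival_prob` are hypothetical, and the standard normal CDF is computed from the error function so that no external library is needed.

```python
import math

def norm_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Probit coefficients from the table:
# (constant, seasons played, performance, average performance).
COEFS = {
    "hitters": (2.010, -0.058, 0.794, 0.022),
    "pitchers": (1.625, -0.031, 0.677, 0.100),
}

def survival_prob(group: str, seasons: float = 0.0,
                  perf: float = 0.0, avgperf: float = 0.0) -> float:
    """Predicted survival probability; regressors are standardized, so 0 is the sample mean."""
    b0, b1, b2, b3 = COEFS[group]
    return norm_cdf(b0 + b1 * seasons + b2 * perf + b3 * avgperf)

for group in COEFS:
    at_mean = survival_prob(group)              # all regressors at their mean of zero
    bad_year = survival_prob(group, perf=-2.0)  # performance two s.d. below average
    print(f"{group}: at mean {at_mean:.3f}, bad year {bad_year:.3f}, "
          f"change {bad_year - at_mean:+.3f}")
```

Setting `perf=-2.0` while leaving `avgperf` at zero mirrors the assumption in part (b) that a long career leaves average performance essentially unchanged.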

Question 2 [21 points]: (ch 10)

Consider the following panel data regression with a single explanatory variable:

$Y_{it} = \beta_0 + \beta_1 x_{it} + u_{it}$

In each of the examples below, you will be including entity and time fixed effects.

(a) (3 p) Consider the effect of beer taxes on the fatality rate using annual data from 1982-1988, and nine U.S. regions (New England, Pacific, Mid-Atlantic, South, etc.). How many total coefficients do you need to estimate?
(b) (4 p) Certain regions (e.g. New England) that tend to have higher beer taxes also tend to have consistently higher quality hospitals. Does this pose a threat to your analysis?
(c) (3 p) Consider the effect of the minimum wage on teenage employment using annual data from 1963-2000 for five Canadian regions (Atlantic Provinces, Quebec, Ontario, Prairies, British Columbia). How many total coefficients do you need to estimate?
(d) (4 p) Nationwide recessions impact both teenage employment and the minimum wage across the country. Does this pose a threat to your analysis?
(e) (3 p) Consider the effect of savings rates on per capita income using data for three decades (1960-1969, 1970-1979, 1980-1989; one observation per decade) and 104 countries. How many total coefficients do you need to estimate?
(f) (4 p) A number of countries industrialized at different times between 1960-1989, a process which can impact both the savings rate and per capita income. Does this pose a threat to your analysis?

Answer:
(a) 16 coefficients (6 time fixed effects, 8 entity fixed effects, intercept, slope).
(b) No, entity fixed effects account for omitted variables that are constant over time within an entity.
(c) 43 coefficients (37 time fixed effects, 4 entity fixed effects, intercept, slope).
(d) No, time fixed effects account for omitted variables that change over time but are common to all entities.
(e) 107 coefficients (2 time fixed effects, 103 entity fixed effects, intercept, slope).
(f) Yes, industrialization is an omitted variable that varies across both time and entities, so neither set of fixed effects removes it.
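The coefficient counts in parts (a), (c), and (e) follow one rule: with an intercept in the regression, one entity dummy and one time dummy must be omitted to avoid perfect multicollinearity. A small sketch of that arithmetic, where the function name `count_coefficients` is hypothetical:

```python
def count_coefficients(n_entities: int, n_periods: int) -> dict:
    """Count parameters in a regression with an intercept, one slope,
    and entity and time fixed effects (one dummy of each kind omitted)."""
    entity_fe = n_entities - 1
    time_fe = n_periods - 1
    return {
        "entity_fe": entity_fe,
        "time_fe": time_fe,
        "total": entity_fe + time_fe + 2,  # + intercept + slope
    }

# (number of entities, number of time periods) for each example.
examples = {
    "beer tax, 9 regions, 1982-1988": (9, 1988 - 1982 + 1),
    "minimum wage, 5 regions, 1963-2000": (5, 2000 - 1963 + 1),
    "savings rate, 104 countries, 3 decades": (104, 3),
}

for label, (n, t) in examples.items():
    print(label, count_coefficients(n, t))
```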

Question 3 [15 points]: (IV regression) (Ch 12)

Consider a supply model for edible chicken, which the U.S. Department of Agriculture calls broilers. Data for this question are adapted from the data provided by Epple and McCallum (2006) [1]. The data are annual, 1950-2001. The supply equation is:

$\ln(QPROD_t) = \beta_1 + \beta_2 \ln(P_t) + \beta_3 \ln(PF_t) + \beta_4 TIME_t + \beta_5 \ln(QPROD_{t-1}) + e_t$

where $QPROD$ is aggregate production of young chickens, $P$ is the real price index of fresh chicken, $PF$ is the real price index of broiler feed, and $TIME$ is a time trend, which is included to capture any technical progress in production. Some potential external instrumental variables are $\ln(Y_t)$, where $Y$ is real per capita income; $\ln(PB_t)$, where $PB$ is the real price of beef; $POPGRO$, the percent population growth from year t-1 to year t; $\ln(P_{t-1})$, the lagged log of the real price of chicken; and $\ln(EXPTS_t)$, the log of exports of chicken.

[1] Simultaneous Equation Econometrics: The Missing Example, Economic Inquiry, 44(2), 374-384

The estimated supply equation for chicken can be written from the following output:

Regression 1:

    . reg lnqprod lnp lnpf TIME lnqprod_1

    Source   |         SS    df          MS        Number of obs =      40
    ---------+--------------------------------     F( 4, 35)     = 3102.49
    Model    | 11.9815945     4  2.99539863        Prob > F      =  0.0000
    Residual |  .03379186    35  .000965482        R-squared     =  0.9972
    ---------+--------------------------------     Adj R-squared =  0.9969
    Total    | 12.0153864    39  .308086831        Root MSE      =  .03107

    lnqprod   |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    ----------+----------------------------------------------------------------
    lnp       |  .0091099    .0679409    0.13    0.894    -.1288175   .1470373
    lnpf      | -.0901945    .0426459   -2.11    0.042    -.1767703  -.0036186
    TIME      |  .0111706    .0051486    2.17    0.037     .0007183   .0216229
    lnqprod_1 |  .7326902    .1066347    6.87    0.000     .5162103     .94917
    _cons     |  2.109681    .7991519    2.64    0.012      .487316   3.732045

Regression 2:

    . ivreg lnqprod (lnp=lnpb lny POPGRO lnexpts) lnpf TIME lnqprod_1

    Instrumental variables (2SLS) regression

    Source   |         SS    df          MS        Number of obs =      40
    ---------+--------------------------------     F( 4, 35)     = 1619.82
    Model    | 11.9506133     4  2.98765333        Prob > F      =  0.0000
    Residual | .064773079    35  .001850659        R-squared     =  0.9946
    ---------+--------------------------------     Adj R-squared =  0.9940
    Total    | 12.0153864    39  .308086831        Root MSE      =  .04302

    lnqprod   |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    ----------+----------------------------------------------------------------
    lnp       |   .393975    .1749342    2.25    0.031     .0388398   .7491103
    lnpf      | -.1909911    .0705566   -2.71    0.010    -.3342286  -.0477535
    TIME      |  .0242389    .0087117    2.78    0.009     .0065532   .0419247
    lnqprod_1 |  .5489031    .1635754    3.36    0.002     .2168274   .8809789
    _cons     |  3.298617    1.196567    2.76    0.009     .8694559   5.727778

    Instrumented: lnp
    Instruments:  lnpf TIME lnqprod_1 lnpb lny POPGRO lnexpts

Regression 3:

    . reg lnp lnpb lny POPGRO lnexpts lnpf TIME lnqprod_1

    Source   |         SS    df          MS        Number of obs =      40
    ---------+--------------------------------     F( 7, 32)     =   49.65
    Model    | 1.61496433     7  .230709191        Prob > F      =  0.0000
    Residual |  .14868612    32  .004646441        R-squared     =  0.9157
    ---------+--------------------------------     Adj R-squared =  0.8973
    Total    | 1.76365045    39  .045221807        Root MSE      =  .06816

    lnp       |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    ----------+----------------------------------------------------------------
    lnpb      |  .1159974    .2186138    0.53    0.599    -.3293044   .5612991
    lny       |  1.471961    .6529929    2.25    0.031     .1418577   2.802064
    POPGRO    |  .0697965    .0908676    0.77    0.448    -.1152949   .2548878
    lnexpts   |  2.438689    .6971098    3.50    0.001     1.018723   3.858655
    lnpf      |   .154805    .1068706    1.45    0.157    -.0628833   .3724932
    TIME      | -.0735312    .0230427   -3.19    0.003    -.1204676  -.0265948
    lnqprod_1 | -.0086269    .2911554   -0.03    0.977     -.601691   .5844372
    _cons     | -11.95739    6.311461   -1.89    0.067    -24.81341   .8986362

Regression 4:

    . reg lnqprod lnp lnpf TIME lnqprod_1

    Source   |         SS    df          MS        Number of obs =      40
    ---------+--------------------------------     F( 4, 35)     = 3102.49
    Model    | 11.9815945     4  2.99539863        Prob > F      =  0.0000
    Residual |  .03379186    35  .000965482        R-squared     =  0.9972
    ---------+--------------------------------     Adj R-squared =  0.9969
    Total    | 12.0153864    39  .308086831        Root MSE      =  .03107

    lnqprod   |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    ----------+----------------------------------------------------------------
    lnp       |  .0091099    .0679409    0.13    0.894    -.1288175   .1470373
    lnpf      | -.0901945    .0426459   -2.11    0.042    -.1767703  -.0036186
    TIME      |  .0111706    .0051486    2.17    0.037     .0007183   .0216229
    lnqprod_1 |  .7326902    .1066347    6.87    0.000     .5162103     .94917
    _cons     |  2.109681    .7991519    2.64    0.012      .487316   3.732045

    . predict e, residuals
    (1 missing values generated)

Regression 5:

    . reg e lnpb lny POPGRO lnexpts lnpf TIME lnqprod_1

    Source   |         SS    df          MS        Number of obs =      40
    ---------+--------------------------------     F( 7, 32)     =    2.19
    Model    | .010946966     7  .001563852        Prob > F      =  0.0618
    Residual | .022844894    32  .000713903        R-squared     =  0.3240
    ---------+--------------------------------     Adj R-squared =  0.1761
    Total    |  .03379186    39  .000866458        Root MSE      =  .02672

    e         |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    ----------+----------------------------------------------------------------
    lnpb      |  .1180813    .0856913    1.38    0.178    -.0564662   .2926289
    lny       |  .2378684    .2559575    0.93    0.360       -.2835   .7592367
    POPGRO    | -.0123288    .0356179   -0.35    0.732    -.0848802   .0602225
    lnexpts   |  .9702997    .2732502    3.55    0.001     .4137072   1.526892
    lnpf      | -.0522353    .0418907   -1.25    0.221    -.1375639   .0330932
    TIME      | -.0045154    .0090322   -0.50    0.621    -.0229133   .0138826
    lnqprod_1 | -.2648651    .1141259   -2.32    0.027     -.497332  -.0323983
    _cons     |  .1666471    2.473941    0.07    0.947    -4.872605   5.205899

    . test lnpb lny POPGRO lnexpts

     ( 1)  lnpb = 0
     ( 2)  lny = 0
     ( 3)  POPGRO = 0
     ( 4)  lnexpts = 0

           F( 4, 32) =    3.83
            Prob > F =  0.0118

(a) (4p) Compare the results in regressions 1 and 2. Explain the reasons for using instrumental variables in regression 2.

Answer:

(b) (5p) What are the requirements for valid instruments? Explain with mathematical conditions.

Answer:

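Relevance: $corr(Z_i, X_i) \neq 0$
Exogeneity: $corr(Z_i, u_i) = 0$

(c) (6p) Do these instruments satisfy the requirements? You must use the necessary regression results for your answer. Please specify the regression number you use while answering each part of this question.

(1) Relevance: Using regression 3, the squared t-statistic is greater than 10 only for lnexpts ($3.50^2 = 12.25$), so that is the only relevant IV.
(2) Exogeneity: Using regression 5, $J = mF = 4 \times 3.83 = 15.32$, which exceeds the 5% critical value of the $\chi^2_3$ distribution (7.81). Hence, reject $H_0$: all instruments are exogenous. Therefore, at least one of the IVs is not exogenous.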
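The two instrument checks in part (c) can be sketched numerically from the reported output. This is an illustration under stated assumptions, not the exam's official key: it assumes the overidentifying restrictions statistic is formed as J = mF (the homoskedasticity-only form in Stock and Watson), with m = 4 instruments and k = 1 endogenous regressor, and it takes the 5% critical value for a chi-squared distribution with 3 degrees of freedom (7.81) from a standard table. All variable names are hypothetical.

```python
# t-statistics on the four external instruments in the first-stage regression (regression 3).
first_stage_t = {"lnpb": 0.53, "lny": 2.25, "POPGRO": 0.77, "lnexpts": 3.50}

# Rule of thumb applied one instrument at a time: t^2 (the single-restriction F) above 10.
relevant = [name for name, t in first_stage_t.items() if t ** 2 > 10]
print("instruments passing the t^2 > 10 rule:", relevant)

# Overidentifying restrictions test based on regression 5.
m = 4              # number of instruments
k = 1              # number of endogenous regressors (lnp)
F_STAT = 3.83      # F-statistic from "test lnpb lny POPGRO lnexpts"
CHI2_3_CRIT_5PCT = 7.81  # 5% critical value, chi-squared with m - k = 3 d.f. (table value)

J = m * F_STAT
reject_exogeneity = J > CHI2_3_CRIT_5PCT
print(f"J = {J:.2f}, d.f. = {m - k}, reject exogeneity of all instruments: {reject_exogeneity}")
```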

Question 4 [15 points]: (Ch 15)

There is some economic research that suggests that oil prices play a central role in causing recessions in developed countries. In particular, this research suggests that it is specifically increases in oil prices that matter. As a result, economists often look only at the percentage point difference between oil prices at date t and the maximum value over the previous year. However, you notice that energy prices can fluctuate quite dramatically in both directions and believe that geographic areas also benefit substantially from oil price decreases. As a result, you decide to consider the effect of real oil prices ($P^{oil}/CPI$) on GDP growth ($Y_t$). You estimate the following distributed lag model using annual data (numbers in parentheses are HAC standard errors):

$\widehat{Y}_t = \underset{(0.27)}{3.39} - \underset{(0.010)}{0.009}\,(P^{oil}/CPI)_t - \underset{(0.011)}{0.028}\,(P^{oil}/CPI)_{t-1}$

t = 1960-2008, $R^2 = 0.15$, SER = 1.88

(a) (5p) What is the impact effect of a 25 percentage point increase in real oil prices?
(b) (5p) What is the predicted cumulative change in GDP growth over two years of this effect?
(c) (5p) The HAC F-statistic is 4.07. Can you reject the null hypothesis that oil price changes have no effect on real GDP growth? What is the critical value you considered? Is there any reason why you should be cautious using an F-test in this case, given the sample period?

Answer:
(a) GDP growth would decrease by almost a quarter of a percentage point ($-0.009 \times 25 = -0.225$).
(b) The predicted cumulative decline in growth would be almost one percentage point: $(-0.009 - 0.028) \times 25 = -0.925$.
(c) The critical value is $F_{2,\infty} = 3.00$ at the 5% significance level. Since 4.07 > 3.00, you can reject the null hypothesis that oil prices have no effect on real GDP growth. However, since the sample period involves only 50 or so observations, it is not clear that the test statistic is actually F-distributed: the $F_{2,\infty}$ critical value is a large-sample approximation.
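The impact and cumulative effects in parts (a) and (b) are simple functions of the two estimated coefficients. A short sketch of the arithmetic, with hypothetical names for the constants and the helper function:

```python
# Estimated distributed lag coefficients on real oil prices.
BETA_CURRENT = -0.009  # coefficient on (Poil/CPI) at date t: the impact effect per point
BETA_LAG1 = -0.028     # coefficient on (Poil/CPI) at date t-1

def dynamic_effects(shock: float) -> tuple:
    """Return (impact effect, two-year cumulative effect) of an oil price increase."""
    impact = BETA_CURRENT * shock
    cumulative = (BETA_CURRENT + BETA_LAG1) * shock  # sum of the dynamic multipliers
    return impact, cumulative

impact, cumulative = dynamic_effects(25.0)
print(f"impact effect: {impact:.3f} percentage points")
print(f"cumulative effect over two years: {cumulative:.3f} percentage points")

# Joint test that both coefficients are zero, using the large-sample F(2, infinity) value.
HAC_F = 4.07
F_2_INF_CRIT_5PCT = 3.00
reject_null = HAC_F > F_2_INF_CRIT_5PCT
print("reject the null of no effect at 5%:", reject_null)
```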

Question 5 [20 points]: (Ch 14)

The following STATA output reports a VAR(2) (vector autoregression) model of the change in inflation (cinf) and the unemployment rate (unem).

    . var unem cinf

    Vector autoregression

    Sample: 1951-2012                      No. of obs =        62
    Log likelihood =  -201.564             AIC        =  6.824644
    FPE            =  3.156906             HQIC       =  6.959349
    Det(Sigma_ml)  =  2.284871             SBIC       =  7.167731

    Equation    Parms     RMSE      R-sq       chi2      P>chi2
    ------------------------------------------------------------
    unem            5   1.00228    0.6589    119.7914    0.0000
    cinf            5   1.72495    0.3091    27.73971    0.0000
    ------------------------------------------------------------

              |     Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
    ----------+----------------------------------------------------------------
    unem      |
      unem    |
          L1. |  1.061241    .1303681    8.14    0.000     .8057245   1.316758
          L2. | -.2874012     .133048   -2.16    0.031    -.5481705   -.026632
      cinf    |
          L1. |  .0976014    .0668152    1.46    0.144    -.0333539   .2285567
          L2. |  .0623594    .0572543    1.09    0.276     -.049857   .1745758
      _cons   |  1.345204     .513183    2.62    0.009     .3393835   2.351024
    ----------+----------------------------------------------------------------
    cinf      |
      unem    |
          L1. | -.4678597    .2243671   -2.09    0.037     -.907611  -.0281084
          L2. |  .2932862    .2289793    1.28    0.200     -.155505   .7420773
      cinf    |
          L1. | -.0527481    .1149907   -0.46    0.646    -.2781258   .1726296
          L2. |  -.430232    .0985363   -4.37    0.000    -.6233595  -.2371044
      _cons   |   1.00306     .883202    1.14    0.256    -.7279845   2.734104

Table 1

    Year    Unem    Inflation
    2008     5.8       3.8
    2009     9.3      -0.3
    2010     9.6       1.6
    2011     8.9       3.1
    2012     8.1       2.1

(a) (4p) Given the actual realizations of unemployment and inflation in Table 1, forecast unemployment for 2013. Show your work.
(b) (4p) Given the actual realizations of unemployment and inflation in Table 1, forecast inflation for 2013. Show your work.
(c) (4p) The following is the joint test result for the second lags of the unemployment rate and the inflation rate. According to this test, would a VAR(1) model be a better forecasting model than a VAR(2) model? Explain why.

    . test L2.cinf L2.unem

     ( 1)  [unem]l2.cinf = 0
     ( 2)  [cinf]l2.cinf = 0
     ( 3)  [unem]l2.unem = 0
     ( 4)  [cinf]l2.unem = 0

           chi2( 4) =   30.26
        Prob > chi2 =  0.0000

(d) (4p) Why might a researcher use change in inflation as opposed to inflation in this model? Explain.
(e) (4p) Should one use change in unemployment instead of unemployment? Explain.
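The one-step-ahead forecasts asked for in parts (a) and (b) plug the two most recent observations into each VAR equation. The sketch below shows the mechanics only and is not an official answer key. It assumes that cinf is the first difference of the Inflation column in Table 1, and that the inflation forecast is obtained by adding the forecast change to the 2012 level; the names `UNEM_EQ`, `CINF_EQ`, and `forecast` are hypothetical.

```python
# Coefficients from the VAR output, ordered as
# (unem L1, unem L2, cinf L1, cinf L2, constant).
UNEM_EQ = (1.061241, -0.2874012, 0.0976014, 0.0623594, 1.345204)
CINF_EQ = (-0.4678597, 0.2932862, -0.0527481, -0.430232, 1.00306)

# Realizations from Table 1.
unem = {2010: 9.6, 2011: 8.9, 2012: 8.1}
infl = {2010: 1.6, 2011: 3.1, 2012: 2.1}

# Assumption: cinf_t = inflation_t - inflation_{t-1}.
cinf = {year: infl[year] - infl[year - 1] for year in (2011, 2012)}

def forecast(eq: tuple, year: int) -> float:
    """One-step-ahead VAR(2) forecast for `year` from the two preceding years."""
    u1, u2, c1, c2, const = eq
    return (const
            + u1 * unem[year - 1] + u2 * unem[year - 2]
            + c1 * cinf[year - 1] + c2 * cinf[year - 2])

unem_2013 = forecast(UNEM_EQ, 2013)
cinf_2013 = forecast(CINF_EQ, 2013)
infl_2013 = infl[2012] + cinf_2013  # convert the forecast change back to a level

print(f"forecast unemployment 2013: {unem_2013:.2f}")
print(f"forecast change in inflation 2013: {cinf_2013:.2f}")
print(f"forecast inflation 2013: {infl_2013:.2f}")
```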

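Question 6 [12 points]: (Derivation question)

Consider the panel data model:

$Y_{it} = \beta_1 X_{it} + \alpha_i + u_{it}$

where $u_{it}$ are i.i.d. and independent of the $X$s with mean zero and variance $\sigma_u^2$.

(a) (3 p) Define $\tilde{X}_{it}$ and $\tilde{Y}_{it}$, the entity-demeaned values of X and Y.
(b) (3 p) Rewrite the model in terms of these demeaned variables.
(c) (3 p) Derive algebraically the fixed-effects estimator of $\beta_1$. The fixed-effects estimator minimizes the sum of squared residuals of the model you wrote in part (b).
(d) (3 p) Show that, if $\alpha_i$ is a random variable that is independent of X and u, the estimator is unbiased for $\beta_1$. Explain your answer.

Answer:
(a) $\tilde{X}_{it} = X_{it} - \bar{X}_i$ and $\tilde{Y}_{it} = Y_{it} - \bar{Y}_i$, where $\bar{X}_i = \frac{1}{T}\sum_{t=1}^{T} X_{it}$ and $\bar{Y}_i = \frac{1}{T}\sum_{t=1}^{T} Y_{it}$.

(b) Averaging the model over time for each entity gives

$\bar{Y}_i = \beta_1 \bar{X}_i + \alpha_i + \bar{u}_i$

Subtracting the last equation from the first we get

$Y_{it} - \bar{Y}_i = \beta_1 (X_{it} - \bar{X}_i) + (u_{it} - \bar{u}_i)$

or we can also write it as

$\tilde{Y}_{it} = \beta_1 \tilde{X}_{it} + \tilde{u}_{it}$

(c) The fixed-effects estimator of $\beta_1$ is the OLS estimator of the above regression. It minimizes $\sum_{i=1}^{n}\sum_{t=1}^{T} (\tilde{Y}_{it} - b\,\tilde{X}_{it})^2$, with first-order condition

$-2 \sum_{i=1}^{n}\sum_{t=1}^{T} \tilde{X}_{it} (\tilde{Y}_{it} - \hat{\beta}_1 \tilde{X}_{it}) = 0$

Hence,

$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} \tilde{X}_{it} \tilde{Y}_{it}}{\sum_{i=1}^{n}\sum_{t=1}^{T} \tilde{X}_{it}^2}$

(d) Using $\tilde{Y}_{it} = \beta_1 \tilde{X}_{it} + \tilde{u}_{it}$, we can write

$\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} \tilde{X}_{it} \tilde{u}_{it}}{\sum_{i=1}^{n}\sum_{t=1}^{T} \tilde{X}_{it}^2}$

Since $\alpha_i$ is independent of the $X$s and $u$s (and is in any case eliminated by the demeaning), and the $u_{it}$ have mean zero and are independent of the $X$s, using the law of iterated expectations we can show that

$E[\hat{\beta}_1] = \beta_1 + E\left[ \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} \tilde{X}_{it}\, E(\tilde{u}_{it} \mid X)}{\sum_{i=1}^{n}\sum_{t=1}^{T} \tilde{X}_{it}^2} \right] = \beta_1$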
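The within (entity-demeaned) estimator and its unbiasedness can be illustrated with a small simulation. This is a sketch under assumed settings that are not in the exam: a model of the form Y = beta1 * X + alpha + u with a true slope of 1.5, normally distributed X, u, and entity effects, and hypothetical function names. It uses only the standard library.

```python
import random

def within_estimator(panel):
    """Fixed-effects (within) estimator of the slope.
    `panel` is a list of entities, each a list of (x, y) pairs over time."""
    num = 0.0
    den = 0.0
    for obs in panel:
        x_bar = sum(x for x, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            num += (x - x_bar) * (y - y_bar)  # demeaned X times demeaned Y
            den += (x - x_bar) ** 2           # demeaned X squared
    return num / den

def simulate_panel(n, T, beta1, rng):
    """Draw one panel in which the entity effect is independent of X and u."""
    panel = []
    for _ in range(n):
        alpha = rng.gauss(0.0, 2.0)  # entity effect (assumed distribution)
        obs = []
        for _ in range(T):
            x = rng.gauss(0.0, 1.0)
            u = rng.gauss(0.0, 1.0)
            obs.append((x, beta1 * x + alpha + u))
        panel.append(obs)
    return panel

rng = random.Random(0)
TRUE_BETA1 = 1.5
estimates = [within_estimator(simulate_panel(50, 5, TRUE_BETA1, rng))
             for _ in range(200)]
mean_estimate = sum(estimates) / len(estimates)
print(f"average estimate over 200 simulated panels: {mean_estimate:.3f}")
```

Averaging over many simulated panels is the Monte Carlo counterpart of the expectation in part (d): the average should sit close to the true slope regardless of the spread of the entity effects.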

Bonus Question [2 points]:

The two conditions for instrument validity are $corr(Z_i, X_i) \neq 0$ and $corr(Z_i, u_i) = 0$. The reason for the inconsistency of OLS is that $corr(X_i, u_i) \neq 0$. If X and Z are correlated, and X and u are also correlated, how is it possible that Z and u are not correlated? Explain.

Answer: The major idea is that the variation in X has two parts: one that is uncorrelated with u and a second that is correlated with u. The trick is to isolate the uncorrelated part of X. For the instrument to be valid, $corr(Z_i, u_i) = 0$ and $corr(Z_i, X_i) \neq 0$ must hold. TSLS then generates predicted values of X in the first stage by using a linear combination of the instruments. As long as $corr(Z_i, X_i) \neq 0$ and $corr(Z_i, u_i) = 0$, the part of X that is uncorrelated with the error term is extracted through the prediction. In the second stage, this captured exogenous variation in X is then used to estimate the effect of X on Y.
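A small simulation can make the idea concrete: X is built from an exogenous part driven by Z and an endogenous part that shares the error u. The data-generating process below is an assumption chosen for illustration (true slope of 2, standard normal draws), and with one instrument and one regressor the TSLS estimator reduces to cov(Z, Y) / cov(Z, X).

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)

rng = random.Random(1)
TRUE_BETA = 2.0
n = 5000

z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # instrument: independent of u
u = [rng.gauss(0.0, 1.0) for _ in range(n)]  # regression error
v = [rng.gauss(0.0, 1.0) for _ in range(n)]  # other exogenous variation in X

# X = (part moved by Z) + (part correlated with u) + noise.
x = [zi + ui + vi for zi, ui, vi in zip(z, u, v)]
y = [TRUE_BETA * xi + ui for xi, ui in zip(x, u)]

ols = cov(x, y) / cov(x, x)  # inconsistent because corr(X, u) != 0
iv = cov(z, y) / cov(z, x)   # uses only the variation in X that comes from Z

print(f"OLS estimate: {ols:.3f}")
print(f"IV estimate:  {iv:.3f}")
```

In this design Z is correlated with X only through the first component, so Z and u stay uncorrelated even though X and u are not.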

Selected Tables from Stock and Watson, Introduction to Econometrics