EC312: Advanced Econometrics
Problem Set 3 Solutions in Stata
Nicola Limodio
www.nicolalimodio.com
N.Limodio1@lse.ac.uk

The data set AIRQ contains observations for 30 standard metropolitan statistical areas (SMSAs) in California for 1972 on the following variables:

airq  indicator for air quality (lower is better)
vala  value added of companies (in thousands of US$)
rain  amount of rain (inches)
coas  dummy variable, 1 for SMSAs at the coast, 0 for others
dens  population density (per square mile)
medi  average income per head (US$)

a) Estimate a linear regression model that explains airq from the other variables using ordinary least squares. Interpret the coefficient estimates.

Simply run

    reg airq vala rain coas dens medi

and give the usual interpretation (look at the significance, sign and magnitude of each coefficient).

b) Test the null hypothesis that average income does not affect the air quality. Test the joint hypothesis that none of the variables has an effect upon air quality.

The first hypothesis can be tested by looking at the t-statistic of medi. The second is a test of joint significance (an F test) and can be performed by typing

    test vala rain coas dens medi

or by reading the F statistic reported in the top-right corner of the regression output.

c) Test whether the variance of the error terms is different for coastal and non-coastal areas, using the Goldfeld-Quandt test. In view of the outcome of the test, comment upon the validity of the test from b). How would one correct the test of b) in the presence of heteroskedasticity?
From this question you can see that the source of heteroskedasticity that worries the researcher is the variable coas. For this reason the Goldfeld-Quandt (GQ) test is the most appropriate. Recall that GQ is the appropriate test when the dispersion of Y varies systematically with X (figure not reproduced).

Note, however, that the problem set is challenging on one point. In the usual GQ test we order the sample with respect to the variable we believe to be responsible for the heteroskedasticity, split the sample into 3 equal parts (in terms of number of observations), and compare the RSS from the regressions on the first and third parts. Here a dummy variable is responsible for the heteroskedasticity, so we run the regression separately for each value of the dummy (no 3-sample partition) and need to apply an adjustment for the degrees of freedom.

We can execute it as follows. Within each subsample coas is constant, so Stata omits it from the regression automatically.

    reg airq vala rain coas dens medi if coas==1

Now let's save the RSS of this regression (_result(4) is the residual sum of squares of the last regression; in current versions of Stata the same number is stored in e(rss)):

    scalar RSS1 = _result(4)
    scalar list RSS1

and analogously let's run the same regression for areas far from the coast:
    reg airq vala rain coas dens medi if coas==0
    scalar RSS2 = _result(4)
    scalar list RSS2

Now compute the ratio of the residual sums of squares, R = RSS1/RSS2:

    scalar R = RSS1/RSS2
    scalar list R

In the standard GQ test with two subsamples of equal size, under the null hypothesis of homoskedasticity this ratio is distributed as F((n-c-2k)/2, (n-c-2k)/2), where n is the sample size, c is the number of dropped observations, and k is the number of coefficients estimated in each subsample regression. Here n=30 and c=0, but the coastal and non-coastal groups are not of equal size, so each RSS must first be divided by its own degrees of freedom, (n1-k) and (n2-k). The test statistic is therefore

    [RSS1/(n1-k)] / [RSS2/(n2-k)] = R * (n2-k)/(n1-k)

which under the null is distributed as F(n1-k, n2-k). The two degrees of freedom are the residual degrees of freedom that Stata reports for the two subsample regressions. In this case the adjustment factor is 4/16, hence

    scalar test = R*(4/16)
    scalar list test

Find the critical value from the relevant F table and apply the usual rejection rule: reject homoskedasticity if the statistic exceeds the critical value.

If homoskedasticity is rejected, the conventional OLS standard errors are no longer valid, and neither are the t and F tests in b).

One way to account for this type of heteroskedasticity is to account for the clustering of the variance within groups, hence run

    reg airq vala rain coas dens medi, cluster(coas)

or, more appropriately given the small number of clusters, we can just use robust standard errors

    reg airq vala rain coas dens medi, robust

More on this will follow in the next classes.

d) Perform a Breusch-Pagan test for heteroskedasticity related to all explanatory variables.
As before, run the OLS regression and save the sample size:

    reg airq vala rain coas dens medi
    scalar N = _result(1)

Get the sum of the squared residuals (element (1,1) of the cross-product matrix E):

    predict error, resid
    matrix accum E = error
    matrix list E

Now generate a disturbance correction factor, the sum of the squared residuals divided by the sample size:

    scalar sigmahat = el(E,1,1)/N
    scalar list N sigmahat

Regress the adjusted squared errors (the original squared errors divided by the correction factor) on the explanatory variables supposed to influence the heteroskedasticity. Following the question, we assume that all variables do. Hence:

    gen adjerr2 = (error^2)/sigmahat
    regress adjerr2 vala rain coas dens medi

This auxiliary regression gives you a model (explained) sum of squares, ESS:

    scalar ESS = _result(2)
    scalar list ESS

(In current versions of Stata, e(N) and e(mss) hold the same values as _result(1) and _result(2).)

Under the null hypothesis of homoskedasticity, ESS/2 is asymptotically distributed as a Chi-squared with k-1 degrees of freedom, where k is the number of coefficients in the auxiliary regression, including the constant. Here k=6. Hence, comparing (1/2) ESS with the 5% critical value of a Chi-squared with 5 degrees of freedom (11.07), you can apply the usual rejection rule for the test.
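The Breusch-Pagan steps in d) can also be written with the e() results that current versions of Stata store after regress. The following is a minimal sketch, assuming the AIRQ data are in memory; the names uhat_bp, adj_bp, sigma2_bp, BP and df_bp are illustrative and do not appear in the original solution. The final estat hettest call is Stata's built-in version of the test and is included only as a cross-check.

```stata
* Sketch only: assumes the AIRQ data set is already loaded
* (i) original regression and residuals
quietly regress airq vala rain coas dens medi
predict double uhat_bp, residuals

* (ii) correction factor: RSS divided by the sample size
scalar sigma2_bp = e(rss)/e(N)

* (iii) auxiliary regression of the scaled squared residuals
generate double adj_bp = uhat_bp^2/sigma2_bp
quietly regress adj_bp vala rain coas dens medi

* (iv) test statistic, degrees of freedom (number of slope coefficients)
scalar BP    = e(mss)/2
scalar df_bp = e(df_m)
display "BP statistic      = " BP
display "degrees of freedom = " df_bp
display "5% critical value = " invchi2tail(df_bp, 0.05)
display "p-value           = " chi2tail(df_bp, BP)

* Built-in cross-check: same test with all right-hand-side variables
quietly regress airq vala rain coas dens medi
estat hettest, rhs
```

Reading the degrees of freedom from e(df_m) rather than typing them in avoids miscounting the constant, and the built-in command should report the same statistic if the manual steps are right.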
e) Perform a White test for heteroskedasticity. Comment upon the appropriateness of the White test in light of the number of observations and the degrees of freedom of the test.

Here the strategy is as follows:

(i) Run the OLS regression (as done above; the results are omitted).
(ii) Get the residuals.
(iii) Generate the squared residuals.
(iv) Generate new explanatory variables: the squares of the explanatory variables and the cross-products of the explanatory variables. Remember that you don't need to square a dummy variable, since its square is the dummy itself.
(v) Regress the squared residuals on a constant, the original explanatory variables, and the set of auxiliary explanatory variables (squares and cross-products) you've just created.
(vi) Get the sample size (N) and the R-squared (R2), and construct the test statistic N*R2.
(vii) Under the null hypothesis the errors are homoskedastic, and N*R2 is asymptotically distributed as a Chi-squared with k-1 degrees of freedom (where k is the number of coefficients in the auxiliary regression, including the constant). Apply the usual rejection rule.

Because we have only 30 observations, we come close to running short of degrees of freedom: the auxiliary regression contains the 5 original variables, 4 squares and 10 cross-products, that is 19 regressors plus a constant. So the White test is not the most appropriate here.

f), g) and h) are just a repetition of the previous tests with a different functional form, and their application is left as an exercise.
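For e), the commands behind steps (i) to (vii) are not shown in the solution. The following is a minimal sketch of one way to carry them out, assuming the AIRQ data are in memory and the code is run from a do-file. All new names (u_w, u2_w, the _sq and cross-product variables, W, df_w) are illustrative assumptions, not the author's. The estat imtest call at the end is Stata's built-in White test and is included only as a cross-check.

```stata
* Sketch only: assumes the AIRQ data set is already loaded
* (i)-(iii) regression, residuals, squared residuals
quietly regress airq vala rain coas dens medi
predict double u_w, residuals
generate double u2_w = u_w^2

* (iv) squares of the non-dummy regressors
local aux
foreach v in vala rain dens medi {
    generate double `v'_sq = `v'^2
    local aux `aux' `v'_sq
}

* (iv) cross-products of every distinct pair of regressors
local xs vala rain coas dens medi
local rest `xs'
foreach a of local xs {
    * drop the current variable so each pair is generated once
    local rest : list rest - a
    foreach b of local rest {
        generate double `a'_`b' = `a'*`b'
        local aux `aux' `a'_`b'
    }
}

* (v) auxiliary regression
quietly regress u2_w `xs' `aux'

* (vi)-(vii) N*R2 statistic and its chi-squared reference values
scalar W    = e(N)*e(r2)
scalar df_w = e(df_m)
display "White statistic    = " W
display "degrees of freedom = " df_w
display "5% critical value  = " invchi2tail(df_w, 0.05)
display "p-value            = " chi2tail(df_w, W)

* Built-in cross-check
quietly regress airq vala rain coas dens medi
estat imtest, white
```

Taking the degrees of freedom from e(df_m) means that any auxiliary regressor Stata drops for collinearity is not counted, which matters with only 30 observations.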