MRA: Further Issues

:Effects of Data Scaling

We've already looked at the effects of data scaling on the OLS statistics $\hat\beta$, $\hat\sigma^2$, and $R^2$. What about test statistics?

1. Scaling the explanatory variables. Suppose we replace the explanatory variables $X$ by $X^*$, where $X^* = XD$ and $D$ is an invertible matrix (a change of units is the special case where $D = \mathrm{diag}(c_1, c_2, \ldots, c_K)$). So instead of the model $y = X\beta + u$ we consider the model $y = X^*\beta^* + u$, where $\beta^* = D^{-1}\beta$.
The general linear hypothesis $H_0: R\beta = r$ is replaced with $H_0: R^*\beta^* = r$, where $R^* = RD$. The test statistic for $H_0: R\beta = r$ (with $q$ the number of restrictions) is
$$ \frac{(R\hat\beta - r)'\bigl[R(X'X)^{-1}R'\bigr]^{-1}(R\hat\beta - r)}{q\,\hat\sigma^2}. $$
The test statistic for $H_0: R^*\beta^* = r$ is
$$ \frac{(R^*\hat\beta^* - r)'\bigl[R^*(X^{*\prime}X^*)^{-1}R^{*\prime}\bigr]^{-1}(R^*\hat\beta^* - r)}{q\,\hat\sigma^{*2}} = \frac{(R\hat\beta - r)'\bigl[R(X'X)^{-1}R'\bigr]^{-1}(R\hat\beta - r)}{q\,\hat\sigma^2}, $$
given that $R^*\hat\beta^* = RDD^{-1}\hat\beta = R\hat\beta$, and $\hat\sigma^{*2} = \hat\sigma^2$ (changes in the basis representing $\mathrm{Sp}(X)$ don't change the sum of squared residuals). But
$$ (X^{*\prime}X^*)^{-1} = (D'X'XD)^{-1} = D^{-1}(X'X)^{-1}D'^{-1} $$
(with repeated application of the property $(BA)^{-1} = A^{-1}B^{-1}$, provided both inverses exist), so that $R^*(X^{*\prime}X^*)^{-1}R^{*\prime} = RDD^{-1}(X'X)^{-1}D'^{-1}D'R' = R(X'X)^{-1}R'$. We conclude that tests of the general linear hypothesis are invariant to the basis chosen for $\mathrm{Sp}(X)$. As the $t$ test is a special case, $t$ statistics are invariant to the choice of units.
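The invariance result above is easy to verify numerically. The following sketch (my own numpy illustration, not part of the notes; the data-generating process, seed, and the helper `f_stat` are assumptions) computes the test statistic before and after a change of units $X \mapsto XD$, $R \mapsto RD$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

def f_stat(X, y, R, r):
    """F statistic for H0: R beta = r in the model y = X beta + u."""
    n, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (n - K)          # sigma^2 hat
    q = R.shape[0]                        # number of restrictions
    d = R @ b - r
    return d @ np.linalg.inv(R @ XtX_inv @ R.T) @ d / (q * s2)

R = np.array([[0.0, 1.0, -1.0]])          # H0: beta_1 = beta_2
r = np.array([0.0])
D = np.diag([1.0, 100.0, 0.01])           # a change of units (invertible D)

F = f_stat(X, y, R, r)
F_star = f_stat(X @ D, y, R @ D, r)       # transformed regressors and hypothesis
print(np.isclose(F, F_star))              # True: the test is invariant
```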
2. Scaling the dependent variable. Suppose we replace the dependent variable $y$ with $y_v = cy$. So our new model is $y_v = X\beta_v + u_v$, where $\beta_v = c\beta$. The restrictions under test become $R\beta_v = r_v$, where $r_v = cr$. The test statistic is
$$ \frac{(R\hat\beta_v - r_v)'\bigl[R(X'X)^{-1}R'\bigr]^{-1}(R\hat\beta_v - r_v)}{q\,\hat\sigma_v^2}. $$
But we know $\hat\beta_v = c\hat\beta$ and $\hat\sigma_v^2 = c^2\hat\sigma^2$, so the factors of $c^2$ cancel between numerator and denominator. We conclude that tests of the general linear hypothesis are invariant to the units chosen for the dependent variable.
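A minimal numerical check of this claim (my own sketch, with an assumed data-generating process and seed): rescaling $y$ by $c$ multiplies the coefficients by $c$ but leaves the $t$ statistics untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, -1.0]) + rng.normal(size=n)

def ols(X, y):
    """Return OLS coefficients and their t statistics."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(s2 * np.diag(XtX_inv))
    return b, b / se

c = 1000.0                      # rescale y, e.g. dollars -> thousands of dollars
b, t = ols(X, y)
b_v, t_v = ols(X, c * y)
print(np.allclose(b_v, c * b))  # True: coefficients scale by c
print(np.allclose(t_v, t))      # True: t statistics are unchanged
```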
:Change of basis for the restrictions

Suppose we have the restrictions $\beta_1 = \beta_2$ and $\beta_2 = 0$. This is equivalent to saying $\beta_1 = 2\beta_2$ and $\beta_2 = 0$. But these two ways of expressing the same restrictions generate two different values for the $R$ matrix and $r$ vector. Fortunately, if we replace $H_0: R\beta = r$ with $H_0: R_v\beta = r_v$, where $R_v = BR$ and $r_v = Br$ with $B$ invertible, we don't change the value of the test statistic.

Exercise: Prove the result in the previous bullet.

Another exercise: Find the $B$ matrix for the example I've given in the first bullet above.
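The invariance to $B$ can be checked numerically without spoiling either exercise. In this sketch (my own illustration; the model, seed, and the arbitrary invertible $B$ are assumptions, and this $B$ is not the one asked for in the exercise) the test statistic is identical under $R$ and $BR$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, 1.0]) + rng.normal(size=n)

def f_stat(X, y, R, r):
    """F statistic for H0: R beta = r."""
    n, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (n - K)
    d = R @ b - r
    return d @ np.linalg.inv(R @ XtX_inv @ R.T) @ d / (R.shape[0] * s2)

R = np.array([[0.0, 1.0, -1.0],     # beta_1 = beta_2
              [0.0, 0.0, 1.0]])     # beta_2 = 0
r = np.zeros(2)
B = np.array([[2.0, 1.0],           # an arbitrary invertible B
              [0.5, 1.0]])

print(np.isclose(f_stat(X, y, R, r), f_stat(X, y, B @ R, B @ r)))  # True
```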
:Beta (standardized) coefficients

In some applications where units are difficult to interpret, researchers divide the dependent and independent variables by their standard deviations. In this case, each coefficient tells us by how many standard deviations the dependent variable responds to a one-standard-deviation increase in the corresponding explanatory variable. Even in cases where the units are easy to interpret, it is sometimes useful to report standardized coefficients to give a sense of the importance of "typical" movements in an explanatory variable.
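A short sketch of the computation (my own numpy illustration; the simulated data and the helper `slopes` are assumptions). Running OLS on the rescaled variables is equivalent to multiplying each slope by $sd(x_j)/sd(y)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(0.0, 5.0, n)        # a regressor with large units
x2 = rng.normal(0.0, 0.1, n)        # a regressor with small units
y = 1.0 + 0.2 * x1 + 8.0 * x2 + rng.normal(size=n)

def slopes(y, X):
    """OLS slope coefficients (intercept added internally)."""
    Z = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(Z, y, rcond=None)[0][1:]

b = slopes(y, np.column_stack([x1, x2]))
# divide every variable by its sample standard deviation
sy, s1, s2 = y.std(ddof=1), x1.std(ddof=1), x2.std(ddof=1)
b_std = slopes(y / sy, np.column_stack([x1 / s1, x2 / s2]))
# equivalently, beta_std_j = b_j * sd(x_j) / sd(y)
print(np.allclose(b_std, b * np.array([s1, s2]) / sy))  # True
```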
:Using logs

Suppose we estimate the model
$$ \ln y = \beta_0 + \beta_1 \ln x_1 + \beta_2 x_2 + u. $$
The parameter $\beta_1$ measures an elasticity,
$$ \beta_1 = \frac{\partial E(\ln y)}{\partial \ln x_1} \approx \frac{\%\Delta \text{ in predicted } y}{\%\Delta \text{ in } x_1}, $$
so $\beta_1$ is "dimensionless" or "unit free". If we change units, $y^* = c_0 y$ and $x_1^* = c_1 x_1$, the model becomes
$$ \ln y^* = \beta_0^* + \beta_1^* \ln x_1^* + \beta_2^* x_2 + u^*, \qquad \ln y^* = \ln c_0 + \beta_0 + \beta_1 (\ln x_1^* - \ln c_1) + \beta_2 x_2 + u. $$
Therefore
$$ \beta_1^* = \beta_1, \qquad \beta_2^* = \beta_2, \qquad u^* = u, \qquad \beta_0^* = \beta_0 + \ln c_0 - \beta_1 \ln c_1. $$
Using logs for the dependent variable may lead to disturbances that appear more likely to be i.i.d. draws from a normal density (fewer outliers, less heteroskedasticity). That's the idea behind the Box-Cox and similar transformations (see the paper by Wooldridge on the home page). But to use it, we must have strictly positive values for the dependent variable. Also, in a regression setting, using logs versus levels changes what we want to explain. This seems innocuous in wage regressions, but not if the dependent variable is, say, (gross) returns.
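The unit-change algebra above can be verified numerically. In this sketch (my own illustration; the data-generating process, seed, and rescaling constants are assumptions) only the intercept moves, and it moves by exactly $\ln c_0 - \beta_1 \ln c_1$:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.uniform(1.0, 10.0, n)
x2 = rng.normal(size=n)
lny = 0.3 + 0.7 * np.log(x1) + 0.2 * x2 + 0.1 * rng.normal(size=n)

def coefs(dep, lnx1, x2):
    """OLS coefficients of dep on a constant, lnx1, and x2."""
    Z = np.column_stack([np.ones(len(dep)), lnx1, x2])
    return np.linalg.lstsq(Z, dep, rcond=None)[0]

c0, c1 = 100.0, 2.54                        # rescale y and x1
b = coefs(lny, np.log(x1), x2)
b_star = coefs(lny + np.log(c0), np.log(c1 * x1), x2)

print(np.allclose(b_star[1:], b[1:]))       # True: slopes are unchanged
print(np.isclose(b_star[0], b[0] + np.log(c0) - b[1] * np.log(c1)))  # True
```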
Using a Taylor series expansion,
$$ \ln y \approx \ln E(y|X) + \frac{1}{E(y|X)}\bigl(y - E(y|X)\bigr) - 0.5\,\frac{1}{E(y|X)^2}\bigl(y - E(y|X)\bigr)^2. $$
So
$$ E(\ln y|X) \approx \ln E(y|X) - 0.5\,\frac{\sigma^2(y|X)}{E(y|X)^2}. $$
Even if returns are unpredictable, so $E(y|X) = \mu$ (a constant), running a regression on log-returns could generate statistically significant coefficients if the standard deviation of returns is predictable.
Whether or not logarithms should be used for the independent variables is a much more straightforward matter. We can treat it as a problem of hypothesis testing. For example, we could run the regression
$$ \ln y = \beta_0 + \beta_1 \ln x_1 + \beta_2 x_2 + \beta_3 x_1 + u $$
and test $\beta_3 = 0$ to decide whether the log specification is sufficient, or $\beta_1 = 0$ to see whether the linear specification is sufficient. If we don't reject either null, then the data don't care and it's a matter of taste which specification we use. If we reject one null, but not the other, then the data tell us which to choose. If we reject both nulls, then neither the linear nor the logarithmic specification is sufficient to capture the response of $\ln y$ to $x_1$.
:Parameter heterogeneity

A very important consideration in applied work is that responses can differ across the observations. A conceptually simple case is where this variation only depends on the regressors, i.e.
$$ y_i = x_i \beta_i + u_i = x_i \beta(x_i) + u_i. $$
For example, consider the special case
$$ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2(x_i)\, x_{2i} + u_i, \qquad \beta_2(x_i) = \beta_2 + \beta_3 x_{1i} + \beta_4 x_{2i}. $$
Substituting out for $\beta_2(x_i)$, the model becomes
$$ y_i = \beta_0 + \beta_1 x_{1i} + (\beta_2 + \beta_3 x_{1i} + \beta_4 x_{2i})\, x_{2i} + u_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i} x_{2i} + \beta_4 x_{2i}^2 + u_i. $$
So
$$ \frac{\partial E(y_i|x_i)}{\partial x_{1i}} = \beta_1 + \beta_3 x_{2i}, \qquad \frac{\partial E(y_i|x_i)}{\partial x_{2i}} = \beta_2 + \beta_3 x_{1i} + 2\beta_4 x_{2i}. $$
If an explanatory variable is not continuous (the number of rooms in a house, say), then it makes sense to work with $\Delta E(y_i|x_i)$ (the textbook uses $\Delta \hat y_i$) to understand the effect of changes in the explanatory variables. This creates a small but important difference in interpretation.
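The marginal-effect formulas above can be sketched in code (my own illustration; the parameter values, seed, and helper names `dE_dx1`/`dE_dx2` are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
beta = np.array([1.0, 0.5, -0.3, 0.4, 0.2])   # beta_0 .. beta_4
X = np.column_stack([np.ones(n), x1, x2, x1 * x2, x2**2])
y = X @ beta + 0.1 * rng.normal(size=n)
b = np.linalg.lstsq(X, y, rcond=None)[0]

def dE_dx1(b, x2):
    """dE(y|x)/dx1 = beta_1 + beta_3 * x2."""
    return b[1] + b[3] * x2

def dE_dx2(b, x1, x2):
    """dE(y|x)/dx2 = beta_2 + beta_3 * x1 + 2 * beta_4 * x2."""
    return b[2] + b[3] * x1 + 2.0 * b[4] * x2

# marginal effects evaluated at x1 = x2 = 0: close to 0.5 and -0.3 by construction
print(dE_dx1(b, 0.0), dE_dx2(b, 0.0, 0.0))
```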
:Goodness of fit and selection of regressors

In what follows, assume the model contains an intercept.

A high $R^2$ doesn't mean that we have a good model (trending data often have a high $R^2$); a low $R^2$ doesn't mean that we have a bad model (the market efficiency hypothesis predicts a zero $R^2$ if we try to forecast returns).

$R^2$ cannot fall when we add a regressor: $R^2 = 1 - SSR/SST$ and, by definition, $SSR$ can never increase if we add a regressor.

A (relatively dumb) alternative to $R^2$ is sometimes used (especially in finance) that does penalize for adding regressors. It's called the adjusted $R^2$ or the "R-bar squared":
$$ \bar R^2 = 1 - \frac{SSR/(n-K)}{SST/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-K}. $$
If we add a regressor, $x_{K+1}$, to the model, then $\bar R^2$ increases iff the $t$-statistic for $H_0: \beta_{K+1} = 0$ exceeds 1 in absolute value. If we add a set of regressors, $X_{K+1}$, to the model, then $\bar R^2$ increases iff the $F$-statistic for $H_0: \beta_{K+1} = 0$ exceeds 1. Notice that we can also write
$$ \bar R^2 = 1 - \frac{\hat\sigma^2}{y'Ay/(n-1)}, $$
where $y'Ay = SST$ does not depend on the choice of regressors. Changing regressors affects only $\hat\sigma^2$, so $\bar R^2$ increases iff $\hat\sigma^2$ decreases. It is better to report $R^2$ and $\hat\sigma^2$ than $R^2$ and $\bar R^2$.
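The "$\bar R^2$ rises iff $|t| > 1$" equivalence is exact, and easy to confirm numerically. A minimal sketch (my own illustration; the simulated data and the helper `fit` are assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)
z = rng.normal(size=n)                      # candidate extra regressor (pure noise)

def fit(X, y):
    """Return adjusted R^2 and the t statistic on the last regressor."""
    n, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    ssr = ((y - X @ b)**2).sum()
    sst = ((y - y.mean())**2).sum()
    r2bar = 1.0 - (ssr / (n - K)) / (sst / (n - 1))
    t_last = b[-1] / np.sqrt(ssr / (n - K) * XtX_inv[-1, -1])
    return r2bar, t_last

r2bar_small, _ = fit(X, y)
r2bar_big, t_z = fit(np.column_stack([X, z]), y)
# adjusted R^2 rises exactly when |t| on the added regressor exceeds 1
print((r2bar_big > r2bar_small) == (abs(t_z) > 1.0))  # True
```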
We saw that if a model is "false" then $E(\hat\sigma^2) \geq \sigma^2$. This led to a (very old) suggestion that we should use $\bar R^2$ to choose between various specifications. This is NOT A GOOD IDEA. It makes no sense if the dependent variable changes across specifications. If the models are nested, we can use standard hypothesis tests. And if the models are non-nested, then we should use an information criterion (Akaike, or better Schwarz/BIC, or Hannan-Quinn), especially if we have more than two models to compare.
Loose ends:

1. Controlling for too many factors. In the attempt to remove bias, you may make a coefficient estimate something very different from the effect of interest (e.g. regressing fatalities on the beer tax and beer consumption, or wages on gender and industry dummies). See Wooldridge.

2. Adding regressors to reduce error variance. Even if the coefficient estimates are unbiased (random treatments), we can benefit from adding regressors if they reduce the variance of the error term (see Wooldridge).
:Prediction

Suppose we wish to predict an out-of-sample observation, $y^0$, using our regression estimates. For our sample, we have the model $y = X\beta + u$. Assume that $y^0$ comes from the model $y^0 = x^0\beta + u^0$, where
$$ \begin{pmatrix} u \\ u^0 \end{pmatrix} \sim N(0,\ \sigma^2 I_{n+1}). $$
The obvious predictor is just $\hat y^0 = x^0\hat\beta$, with $\hat\beta = (X'X)^{-1}X'y$. But $\hat y^0$ is just a linear combination of $\hat\beta$. Therefore
$$ x^0\hat\beta \sim N\bigl(x^0\beta,\ \sigma^2\, x^0 (X'X)^{-1} x^{0\prime}\bigr). $$
This result allows us to form a confidence interval for $x^0\beta$ using the $t$-distribution:
$$ \frac{x^0\hat\beta - x^0\beta}{\sqrt{\hat\sigma^2\, x^0 (X'X)^{-1} x^{0\prime}}} \sim t_{n-K}. $$
It is easy to generalize to the case where we want to predict several out-of-sample observations simultaneously. Then $x^0$ is a matrix and $y^0$ is a vector, but nothing else changes, except that we would use the $F$-distribution for a confidence ellipsoid.
A prediction interval for $y^0$ combines parameter uncertainty (coming from $\hat\beta$) with intrinsic uncertainty coming from the disturbance $u^0$. The prediction error is defined by
$$ e^0 = y^0 - \hat y^0 = x^0(\beta - \hat\beta) + u^0. $$
But both pieces are normal and independent, therefore
$$ e^0 \sim N\bigl(0,\ \sigma^2\bigl(1 + x^0 (X'X)^{-1} x^{0\prime}\bigr)\bigr). $$
Proceeding as above, we get an interval estimate for $y^0$ that is a mix of a confidence interval and a prediction interval. WARNING: Asymptotic theory gives a justification for using the multivariate normal to approximate the distribution of $\hat\beta$, but to construct the prediction interval above we have to take seriously the small-sample distributional assumption for $u^0$.
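The two intervals can be sketched as follows (my own numpy illustration; the data, seed, evaluation point `x0`, and the hard-coded critical value $t_{28,0.975} \approx 2.048$ are assumptions). The prediction interval differs from the confidence interval only through the extra "1" in the variance:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
resid = y - X @ b
s2 = resid @ resid / (n - X.shape[1])       # sigma^2 hat, df = n - K = 28

x0 = np.array([1.0, 0.5])                   # out-of-sample regressor values
yhat0 = x0 @ b
h = x0 @ XtX_inv @ x0                       # x0 (X'X)^{-1} x0'
tcrit = 2.048                               # t_28 97.5% critical value

ci = (yhat0 - tcrit * np.sqrt(s2 * h), yhat0 + tcrit * np.sqrt(s2 * h))
pi = (yhat0 - tcrit * np.sqrt(s2 * (1 + h)), yhat0 + tcrit * np.sqrt(s2 * (1 + h)))
print("95% CI for x0*beta:", ci)
print("95% PI for y0:     ", pi)            # always wider than the CI
```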
:Residual analysis

Which observations have the largest and smallest residuals $\hat u_i$? Looking at these residuals may suggest left-out variables. But a better approach is the "leave one out" regression residuals discussed in lecture "Multiple Regression 1". To see why looking at $\hat u_i$ can be very misleading, consider the following example. Suppose the data on $(y, x)$ are $(2, -2)$, $(1, -1)$, $(0, 0)$, $(-1, 1)$, $(13, 2)$. The first four observations lie on the straight line $y = -x$. The fitted regression line is $\hat y = 3 + 2x$, and the OLS residuals are $3, 0, -3, -6, 6$. It looks like observations 4 and 5 are both a bit strange, but it's really only observation 5 that is out of line. Examples can be constructed where the "leave one out" outlier isn't the largest OLS residual.
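This five-point example is easy to replicate (a quick numpy check of the numbers in the text, not part of the original notes):

```python
import numpy as np

# the five (y, x) pairs from the example
y = np.array([2.0, 1.0, 0.0, -1.0, 13.0])
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

X = np.column_stack([np.ones(5), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
print(b)       # [3. 2.]  -> fitted line yhat = 3 + 2x
print(resid)   # [ 3.  0. -3. -6.  6.]
```

Note that observation 4's residual ($-6$) is as large as observation 5's in magnitude, even though observation 4 lies exactly on the line through the first four points: the single outlier drags the fitted line toward itself.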
Sometimes residuals are used to measure "value-added" after controlling for the quality of inputs. For example:

- Frontier production or cost functions
- School quality rankings (C.D. Howe; David Johnson)
- Law school rankings (see Wooldridge)
- Searching for "alpha" (expected returns in excess of compensation for risk)
:Predicting y when ln y is the dependent variable

When we regress
$$ \ln y = X\beta + u, $$
the coefficients give $\partial E(\ln y|X)/\partial X$. But what if we are interested in $\partial E(y|X)/\partial X$?

Case 1. $u \sim N(0, \sigma^2)$. We can show that $\ln y_i \sim N(\mu_i, \sigma^2)$ implies $E(y_i) = \exp(\mu_i + \sigma^2/2)$. Therefore
$$ E(y_i|x_i) = e^{x_i\beta}\, e^{\sigma^2/2}, \qquad \frac{\partial E(y_i|x_i)}{\partial x_i} = \beta\, e^{x_i\beta}\, e^{\sigma^2/2}. $$
Replacing the unknown parameters by the OLS estimators gives us a consistent estimate of the response.
Case 2. $E(\exp u_i \mid x_i) = \alpha_0$ (a constant). Then $E(y_i|x_i) = \alpha_0 \exp(x_i\beta)$, and
$$ \frac{\partial E(y_i|x_i)}{\partial x_i} = \alpha_0\, \beta\, e^{x_i\beta}. $$
We can estimate $\alpha_0$:

1. by estimating the regression model through the origin $y_i = \alpha_0 m_i + \varepsilon_i$, where $m_i = \exp(x_i\hat\beta)$ and $\hat\beta$ is the OLS estimator; or

2. by using the smearing estimate $\hat\alpha_0 = \frac{1}{n}\sum_i \exp(\hat u_i)$.

Remark: If $x_i$ contains variables that aren't continuous, then we should look at $\Delta E(y_i|x_i)$ (see Wooldridge, Ex 7.5).
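Both estimators of $\alpha_0$ can be sketched in a few lines (my own simulation; the lognormal design, seed, and variable names are assumptions). Here $u \sim N(0, 0.25)$, so the target is $E[\exp u] = \exp(0.125) \approx 1.13$:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 2000
x = rng.normal(size=n)
u = rng.normal(0.0, 0.5, n)
lny = 1.0 + 0.8 * x + u                   # log model with E[exp(u)] = exp(0.125)
y = np.exp(lny)

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, lny, rcond=None)[0]
uhat = lny - X @ b
m = np.exp(X @ b)                         # m_i = exp(x_i beta-hat)

alpha_reg = (m @ y) / (m @ m)             # 1. regression of y on m through the origin
alpha_smear = np.exp(uhat).mean()         # 2. smearing estimate
print(alpha_reg, alpha_smear)             # both estimate E[exp(u)] = exp(0.125)
```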
:Choosing levels or logs for the dependent variable

Case 1. $u \sim N$ (Box-Cox). Replace $y$ with $y^* = y/y_g$, where $y_g$ is the geometric mean of $y$, i.e. $\ln y_g = \sum_i \ln y_i / n$. Run the two regressions
$$ y^* = X\beta + u, \qquad \ln y^* = X\gamma + w, $$
and choose the model that has the smallest value for $\hat\sigma^2$.
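A sketch of this comparison (my own illustration; the data are simulated from a log-linear model, and the seed and design are assumptions, so in this particular setup the log specification should produce the smaller $\hat\sigma^2$):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 500
x = rng.normal(size=n)
y = np.exp(0.2 + 0.5 * x + 0.3 * rng.normal(size=n))   # here the log model is true

yg = np.exp(np.log(y).mean())            # geometric mean of y
ys = y / yg                              # rescaled dependent variable
X = np.column_stack([np.ones(n), x])

def sigma2(dep, X):
    """Residual variance sigma^2 hat from OLS of dep on X."""
    b = np.linalg.lstsq(X, dep, rcond=None)[0]
    resid = dep - X @ b
    return resid @ resid / (X.shape[0] - X.shape[1])

s2_levels = sigma2(ys, X)
s2_logs = sigma2(np.log(ys), X)
print(s2_levels, s2_logs)                # pick the specification with the smaller value
```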
Case 2. $E(\exp u_i \mid x_i) = \alpha_0$ (a constant). Regress $y = X\beta + u$ and store the $R^2$. Then compute the fitted vector $\hat y_i = \hat\alpha_0 \exp(x_i\hat\beta)$ (where $\hat\alpha_0$ denotes either of the two estimators described in the section above) and calculate the squared correlation $r^2_{y\hat y}$. If $R^2 \geq r^2_{y\hat y}$, choose the level specification. Else, choose the log.
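This comparison can be sketched as follows (my own simulation, using the smearing estimate for $\hat\alpha_0$; the log-linear design and seed are assumptions, chosen so the log specification should win here):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 1000
x = rng.normal(size=n)
lny = 0.5 + 1.0 * x + 0.4 * rng.normal(size=n)   # the log model is true here
y = np.exp(lny)
X = np.column_stack([np.ones(n), x])

# levels regression and its R^2
b_lev = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b_lev
R2_levels = 1.0 - (e @ e) / ((y - y.mean())**2).sum()

# log regression, smearing correction, squared correlation with y
b_log = np.linalg.lstsq(X, lny, rcond=None)[0]
alpha = np.exp(lny - X @ b_log).mean()           # smearing estimate of alpha_0
yhat = alpha * np.exp(X @ b_log)
r2_log = np.corrcoef(y, yhat)[0, 1] ** 2

print(R2_levels, r2_log)   # compare: the larger one wins
```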