Model Specification and Data Problems. Part VIII

Part VIII Model Specification and Data Problems As of Oct 24, 2017

1 Model Specification and Data Problems RESET test Non-nested alternatives Outliers

A functional form misspecification generally means that the model does not account for some important nonlinearities. Recall that omitting an important variable is also a model misspecification. Generally, functional form misspecification causes bias in the remaining parameter estimators.

Example 1 Suppose that the correct specification of the wage equation is

$$\log(\mathit{wage}) = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + \beta_3\,\mathit{exper}^2 + u. \tag{1}$$

Then the return for an extra year of experience is

$$\frac{\partial \log(\mathit{wage})}{\partial\,\mathit{exper}} = \beta_2 + 2\beta_3\,\mathit{exper}. \tag{2}$$

If the second-order term is dropped from (1), use of the resulting biased estimate of $\beta_2$ can be misleading.

RESET test

Ramsey (1969)² proposed a general functional form misspecification test, the Regression Specification Error Test (RESET), which has proven to be useful. Estimate

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u, \tag{3}$$

get the fitted values $\hat y$, and test in the augmented model

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \delta_1 \hat y^2 + \delta_2 \hat y^3 + e \tag{4}$$

the null hypothesis

$$H_0: \delta_1 = \delta_2 = 0 \tag{5}$$

with the $F$-test with numerator $df_1 = 2$ and denominator $df_2 = n - k - 3$.

² Ramsey, J.B. (1969). Tests for specification errors in classical linear least-squares regression analysis, Journal of the Royal Statistical Society, Series B, 31, 350-371.

Example 2 Consider the house price data (Exercise 3.1) and estimate

$$\mathit{price} = \beta_0 + \beta_1\,\mathit{lotsize} + \beta_2\,\mathit{sqrft} + \beta_3\,\mathit{bdrms} + u. \tag{6}$$

Estimation results are:

    Dependent Variable: PRICE
    Method: Least Squares
    Sample: 1 88
    Included observations: 88
    ==========================================================
    Variable     Coefficient   Std. Error   t-statistic   Prob.
    ----------------------------------------------------------
    C             -21.77031     29.47504     -0.738601    0.4622
    LOTSIZE         0.002068     0.000642     3.220096    0.0018
    SQRFT           0.122778     0.013237     9.275093    0.0000
    BDRMS          13.85252      9.010145     1.537436    0.1279
    ==========================================================
    R-squared            0.672362   Mean dependent var      293.5460
    Adjusted R-squared   0.660661   S.D. dependent var      102.7134
    S.E. of regression   59.83348   Akaike info criterion   11.06540
    Sum squared resid    300723.8   Schwarz criterion       11.17800
    Log likelihood      -482.8775   F-statistic             57.46023
    Durbin-Watson stat   2.109796   Prob(F-statistic)       0.000000
    ==========================================================

Next, estimate (6) augmented with $\widehat{\mathit{price}}^2$ and $\widehat{\mathit{price}}^3$ as in (4). The $F$-statistic for the null hypothesis (5) becomes $F = 4.67$ with 2 and 82 degrees of freedom. The p-value is 0.012, so we reject the null hypothesis at the 5% level. Thus, there is some evidence of non-linearity.

Estimate next

$$\log(\mathit{price}) = \beta_0 + \beta_1 \log(\mathit{lotsize}) + \beta_2 \log(\mathit{sqrft}) + \beta_3\,\mathit{bdrms} + u. \tag{7}$$

Estimation results:

    Dependent Variable: LOG(PRICE)
    Method: Least Squares
    Date: 10/19/06 Time: 00:01
    Sample: 1 88
    Included observations: 88
    ============================================================
    Variable       Coefficient   Std. Error   t-statistic   Prob.
    ============================================================
    C               -1.297042     0.651284     -1.991517    0.0497
    LOG(LOTSIZE)     0.167967     0.038281      4.387714    0.0000
    LOG(SQRFT)       0.700232     0.092865      7.540306    0.0000
    BDRMS            0.036958     0.027531      1.342415    0.1831
    ============================================================
    R-squared            0.642965   Mean dependent var       5.633180
    Adjusted R-squared   0.630214   S.D. dependent var       0.303573
    S.E. of regression   0.184603   Akaike info criterion   -0.496833
    Sum squared resid    2.862563   Schwarz criterion       -0.384227
    Log likelihood       25.86066   F-statistic              50.42374
    Durbin-Watson stat   2.088996   Prob(F-statistic)        0.000000
    ============================================================

Applying the RESET test, the F -statistic for the null hypothesis (5) is now F = 2.56 with p-value 0.084, which implies that the hypothesis is not rejected at the 5% level. Thus overall, on the basis of the RESET test the log-log model (7) is preferred.
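The RESET steps (3) to (5) can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data are simulated (not the house price data), and the variable names `x1`, `x2`, `fit0`, `fit1` are hypothetical.

```r
# Simulated data (an assumption for illustration; not the house price data).
set.seed(1)
n  <- 200
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- 1 + 0.5 * x1 + 0.3 * x2^2 + rnorm(n)   # true model is nonlinear in x2

# Step 1: estimate the linear model (3) and save the fitted values.
fit0 <- lm(y ~ x1 + x2)
yhat <- fitted(fit0)

# Step 2: estimate the augmented model (4) with yhat^2 and yhat^3.
fit1 <- lm(y ~ x1 + x2 + I(yhat^2) + I(yhat^3))

# Step 3: F-test of H0: delta1 = delta2 = 0 with df = (2, n - k - 3), k = 2.
reset   <- anova(fit0, fit1)
F_stat  <- reset$F[2]
p_value <- reset$`Pr(>F)`[2]
df_den  <- reset$Res.Df[2]      # denominator degrees of freedom
print(reset)
```

If the `lmtest` package is installed, `lmtest::resettest(fit0, power = 2:3, type = "fitted")` should give the same test without the manual steps.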

Non-nested alternatives

For example, suppose the model choices are

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u \tag{8}$$

and

$$y = \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2) + u. \tag{9}$$

Because the models are non-nested, the usual $F$-test does not apply. A common approach is to estimate a combined model

$$y = \gamma_0 + \gamma_1 x_1 + \gamma_2 x_2 + \gamma_3 \log(x_1) + \gamma_4 \log(x_2) + u. \tag{10}$$

$H_0: \gamma_3 = \gamma_4 = 0$ is a hypothesis for (8) and $H_0: \gamma_1 = \gamma_2 = 0$ is a hypothesis for (9). The usual $F$-test applies again here.

Davidson and MacKinnon (1981)³ procedure: For example, to test (8), estimate first

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \theta_1 \hat y + v, \tag{11}$$

where $\hat y$ is the fitted value of (9). A significant $t$-value of the $\theta_1$-estimate is a rejection of (8). Similarly, if $\hat y$ denotes the fitted values of (8), the test of (9) is the $t$-statistic of the $\theta_1$-estimate from

$$y = \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2) + \theta_1 \hat y + v. \tag{12}$$

³ Davidson, R. and J.G. MacKinnon (1981). Several tests for model specification in the presence of alternative hypotheses, Econometrica 49, 781-793.

Remark 8.1: A clear winner need not emerge. Both models may be rejected or neither may be rejected. In the latter case the adjusted R-squared can be used to select the better fitting one. If both models are rejected, more work is needed.⁴

⁴ For more complicated cases, see Wooldridge, J.M. (1994). A simple specification test for the predictive ability of transformation models, Review of Economics and Statistics 76, 59-65.
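Both approaches to non-nested alternatives can be sketched in base R. The sketch below assumes simulated data in which the logarithmic model (9) is the true one; all object names (`fit_lin`, `fit_log`, `fit_all`, and so on) are hypothetical.

```r
# Simulated data (an assumption): the log model (9) generates y.
set.seed(2)
n  <- 300
x1 <- runif(n, 1, 20)
x2 <- runif(n, 1, 20)
y  <- 2 + 1.5 * log(x1) + 0.8 * log(x2) + rnorm(n, sd = 0.3)

fit_lin <- lm(y ~ x1 + x2)             # model (8)
fit_log <- lm(y ~ log(x1) + log(x2))   # model (9)

# Combined model (10) and the two F-tests.
fit_all <- lm(y ~ x1 + x2 + log(x1) + log(x2))
p_F_lin <- anova(fit_lin, fit_all)$`Pr(>F)`[2]  # H0: gamma3 = gamma4 = 0, i.e. (8)
p_F_log <- anova(fit_log, fit_all)$`Pr(>F)`[2]  # H0: gamma1 = gamma2 = 0, i.e. (9)

# Davidson-MacKinnon: add the rival model's fitted values, as in (11) and (12).
yhat_log <- fitted(fit_log)
yhat_lin <- fitted(fit_lin)
j_lin <- summary(lm(y ~ x1 + x2 + yhat_log))$coefficients["yhat_log", ]
j_log <- summary(lm(y ~ log(x1) + log(x2) + yhat_lin))$coefficients["yhat_lin", ]

print(c(F_test_of_8 = p_F_lin, F_test_of_9 = p_F_log))
print(rbind(test_of_8 = j_lin, test_of_9 = j_log))
```

A small p-value in the row for (8) is evidence against the linear model. As Remark 8.1 notes, the outcome for the other model has to be checked separately, since both or neither may be rejected.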


As discussed earlier, an important source of bias in OLS is omitted variables that are correlated with the included explanatory variables. Often the reason for omission is that these variables are unobservable. A way to mitigate the problem is to collect data on proxy variables. Consider the following regression

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u, \tag{13}$$

where $x_2$ is an unobservable variable (e.g. human ability).

Suppose that the primary interest is to estimate $\beta_1$, so that $x_2$ is a control variable. However, as we know, the simple regression $y = \beta_0 + \beta_1 x_1 + v$ results in a biased and inconsistent OLS estimator of $\beta_1$ such that

$$\operatorname{plim} \hat\beta_1 = \beta_1 + \gamma_1 \beta_2,$$

where $\gamma_1$ is the slope coefficient of the regression $x_2 = \gamma_0 + \gamma_1 x_1 + \text{error}$.

Suppose that we have a good proxy $x_2^*$ for $x_2$ such that

(i) $E[x_2 \mid x_2^*, x_1] = E[x_2 \mid x_2^*]$, i.e., given the proxy $x_2^*$, $x_1$ does not help in predicting the unobserved variable $x_2$;

(ii) $E[u \mid x_2^*] = 0$ for the error term in regression (13).

These imply that in the regression $x_2 = \delta_0 + \delta_1 x_2^* + \theta x_1 + e$ we have $\theta = 0$, so that only the proxy $x_2^*$ is related to the unobserved variable $x_2$, and that the proxy $x_2^*$ is not correlated with the error term of the true regression in equation (13).

With this kind of a good proxy $x_2^*$ in place of the unobserved $x_2$, instead of (13) the model to be estimated becomes

$$y = \alpha_0 + \beta_1 x_1 + \alpha_2 x_2^* + w. \tag{14}$$

Now OLS is an unbiased and consistent estimator of $\beta_1$, the parameter we are primarily interested in. The OLS estimators of $\alpha_0$ and $\alpha_2$ are also unbiased and consistent for these parameters, but $\alpha_0 = \beta_0 + \beta_2 \delta_0$ and $\alpha_2 = \delta_1 \beta_2$ differ from $\beta_0$ and $\beta_2$.
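A small simulation can illustrate the proxy argument. The sketch below assumes a data-generating process in which the proxy conditions hold by construction (the unobserved variable depends on `x1` only through the proxy); all parameter values and names are hypothetical.

```r
# Simulated data (an assumption): the proxy conditions hold by construction.
set.seed(3)
n     <- 5000
proxy <- rnorm(n)                               # observed proxy
x1    <- 0.6 * proxy + rnorm(n)                 # regressor of interest
x2    <- 1 + 0.8 * proxy + rnorm(n, sd = 0.5)   # unobserved: delta0 = 1, delta1 = 0.8, theta = 0
y     <- 2 + 1.0 * x1 + 1.5 * x2 + rnorm(n)     # beta0 = 2, beta1 = 1, beta2 = 1.5

# Omitting x2: the slope on x1 converges to beta1 + gamma1 * beta2, not beta1.
b_short <- coef(lm(y ~ x1))

# Using the proxy as in (14): the slope on x1 estimates beta1,
# while the other two coefficients estimate
# alpha0 = beta0 + beta2 * delta0 = 3.5 and alpha2 = delta1 * beta2 = 1.2.
b_proxy <- coef(lm(y ~ x1 + proxy))

print(rbind(short = c(b_short, proxy = NA), with_proxy = b_proxy))
```

In this setup the slope on `x1` in the short regression is pushed well above 1, while the regression with the proxy recovers a value close to 1. The coefficient on the proxy is close to 1.2 rather than to the structural value 1.5, in line with the remark on $\alpha_2$.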

Example 3 Consider the return to education in wages (monthly) for men (wage2 data set).

    lm(formula = log(wage) ~ educ + exper + tenure + married + south +
        urban + black, data = wdf)

    Residuals:
         Min       1Q   Median       3Q      Max
    -1.98069 -0.21996  0.00707  0.24288  1.22822

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  5.395497   0.113225  47.653  < 2e-16 ***
    educ         0.065431   0.006250  10.468  < 2e-16 ***
    exper        0.014043   0.003185   4.409 1.16e-05 ***
    tenure       0.011747   0.002453   4.789 1.95e-06 ***
    married      0.199417   0.039050   5.107 3.98e-07 ***
    south       -0.090904   0.026249  -3.463 0.000558 ***
    urban        0.183912   0.026958   6.822 1.62e-11 ***
    black       -0.188350   0.037667  -5.000 6.84e-07 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.3655 on 927 degrees of freedom
    Multiple R-squared: 0.2526, Adjusted R-squared: 0.2469
    F-statistic: 44.75 on 7 and 927 DF, p-value: < 2.2e-16

The estimated return to education is 6.5%. However, if the omitted ability is positively correlated with educ, the estimate is too high. Adding IQ as a proxy for ability to the equation reduces the estimate to 5.4%, which is consistent with omitted variable bias.

    lm(formula = log(wage) ~ educ + exper + tenure + married + south +
        urban + black + iq, data = wdf)

    Residuals:
         Min       1Q   Median       3Q      Max
    -2.01203 -0.22244  0.01017  0.22951  1.27478

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  5.1764391  0.1280006  40.441  < 2e-16 ***
    educ         0.0544106  0.0069285   7.853 1.12e-14 ***
    exper        0.0141459  0.0031651   4.469 8.82e-06 ***
    tenure       0.0113951  0.0024394   4.671 3.44e-06 ***
    married      0.1997644  0.0388025   5.148 3.21e-07 ***
    south       -0.0801695  0.0262529  -3.054 0.002325 **
    urban        0.1819463  0.0267929   6.791 1.99e-11 ***
    black       -0.1431253  0.0394925  -3.624 0.000306 ***
    iq           0.0035591  0.0009918   3.589 0.000350 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.3632 on 926 degrees of freedom
    Multiple R-squared: 0.2628, Adjusted R-squared: 0.2564
    F-statistic: 41.27 on 8 and 926 DF, p-value: < 2.2e-16

Test whether the interaction of ability and education affects wages.

    # iq:educ adds the interaction term (the product of iq and educ)
    lm(formula = log(wage) ~ educ + exper + tenure + married + south +
        urban + black + iq + iq:educ, data = wdf)

    Residuals:
         Min       1Q   Median       3Q      Max
    -2.00733 -0.21715  0.01177  0.23456  1.27305

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  5.6482478  0.5462963  10.339  < 2e-16 ***
    educ         0.0184560  0.0410608   0.449 0.653192
    exper        0.0139072  0.0031768   4.378 1.34e-05 ***
    tenure       0.0113929  0.0024397   4.670 3.46e-06 ***
    married      0.2008658  0.0388267   5.173 2.82e-07 ***
    south       -0.0802354  0.0262560  -3.056 0.002308 **
    urban        0.1835758  0.0268586   6.835 1.49e-11 ***
    black       -0.1466989  0.0397013  -3.695 0.000233 ***
    iq          -0.0009418  0.0051625  -0.182 0.855290
    educ:iq      0.0003399  0.0003826   0.888 0.374564
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.3632 on 925 degrees of freedom
    Multiple R-squared: 0.2634, Adjusted R-squared: 0.2563
    F-statistic: 36.76 on 9 and 925 DF, p-value: < 2.2e-16

Adding iq × educ is not only insignificant, but it also renders educ and iq insignificant! This is due to the high correlation of the interaction term with its components:

    > with(wdf, cor(cbind(educ, iq, educ*iq)))
                 educ        iq   educ*iq
    educ    1.0000000 0.5156970 0.8880035
    iq      0.5156970 1.0000000 0.8453237
    educ*iq 0.8880035 0.8453237 1.0000000

The implied collinearity can be materially reduced by defining the interaction term in terms of demeaned variables:

    > with(wdf, cor(cbind(educ, iq, (educ - mean(educ))*(iq - mean(iq)))))
                           educ         iq (e-m(e))*(i-m(i))
    educ              1.0000000  0.5156970         0.1864668
    iq                0.5156970  1.0000000        -0.0133327
    (e-m(e))*(i-m(i)) 0.1864668 -0.0133327         1.0000000
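The effect of demeaning can be reproduced on simulated data. The sketch below does not use the wage2 data: the variables `iq`, `educ` and `y` are generated under assumed, hypothetical parameter values. It shows that centering lowers the correlation between the interaction term and its components while leaving the interaction coefficient itself unchanged.

```r
# Simulated data (an assumption; not the wage2 data set).
set.seed(4)
n    <- 935
iq   <- rnorm(n, mean = 100, sd = 15)
educ <- 13 + 0.08 * (iq - 100) + rnorm(n, sd = 2)
y    <- 5 + 0.05 * educ + 0.004 * iq + rnorm(n, sd = 0.36)

raw_int <- educ * iq                                  # plain product
cen_int <- (educ - mean(educ)) * (iq - mean(iq))      # product of demeaned variables

cor_raw <- cor(cbind(educ, iq, raw_int))
cor_cen <- cor(cbind(educ, iq, cen_int))
print(round(cor_raw, 3))
print(round(cor_cen, 3))

# The two fits span the same regressor space, so the interaction
# coefficient is identical; only the educ and iq coefficients change.
fit_raw <- lm(y ~ educ + iq + raw_int)
fit_cen <- lm(y ~ educ + iq + cen_int)
print(cbind(raw = coef(fit_raw), centered = coef(fit_cen)))
```

The same pattern appears in the wage regressions of this example: the interaction estimate and its standard error are the same with and without demeaning, while the coefficients of educ and iq regain their precision.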

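An interaction term built from the demeaned components also leads to a meaningful interpretation of the implied model. We can write

$$\log(\mathit{wage}) = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{iq} + \beta_{12}\,(\widetilde{\mathit{educ}} \cdot \widetilde{\mathit{iq}}) + \text{other factors}$$

as

$$\log(\mathit{wage}) = \tilde\beta_0 + \beta_1\,\widetilde{\mathit{educ}} + \beta_2\,\widetilde{\mathit{iq}} + \beta_{12}\,(\widetilde{\mathit{educ}} \cdot \widetilde{\mathit{iq}}) + \text{other factors},$$

where $\widetilde{\mathit{educ}} = \mathit{educ} - \overline{\mathit{educ}}$ and $\widetilde{\mathit{iq}} = \mathit{iq} - \overline{\mathit{iq}}$ are demeaned educ and iq, and $\tilde\beta_0 = \beta_0 + \beta_1\,\overline{\mathit{educ}} + \beta_2\,\overline{\mathit{iq}}$. We can further write

$$\log(\mathit{wage}) = \tilde\beta_0 + (\beta_1 + \beta_{12}\,\widetilde{\mathit{iq}})\,\widetilde{\mathit{educ}} + \beta_2\,\widetilde{\mathit{iq}} + \text{other factors}.$$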

The slope coefficient $\beta_1 + \beta_{12}\,\widetilde{\mathit{iq}}$ of educ implies that the return to education depends on the level of ability (measured by IQ). At the mean IQ, $\widetilde{\mathit{iq}} = 0$, so that $\beta_1$ indicates the return to education for a person with average ability, and $\beta_{12}$ indicates the rate, per IQ point, by which the return to education changes when ability (measured in terms of IQ) deviates from the average. Assuming $\beta_{12} > 0$, above-average ability implies a higher return to education and below-average ability a lower return.

Estimating the model, however, indicates that $\hat\beta_{12} = 0.00034$ with p-value 0.37 is not at all statistically significant, which implies that there is no evidence that variability in IQ as such affects the return to education.

    lm(formula = log(wage) ~ educ + exper + tenure + married + south +
        urban + black + iq + I((iq - mean(iq)) * (educ - mean(educ))),
        data = wdf)

    Residuals:
         Min       1Q   Median       3Q      Max
    -2.00733 -0.21715  0.01177  0.23456  1.27305

    Coefficients:
                                              Estimate Std. Error t value Pr(>|t|)
    (Intercept)                              5.1846286  0.1283466  40.396  < 2e-16
    educ                                     0.0528786  0.0071406   7.405 2.94e-13
    exper                                    0.0139072  0.0031768   4.378 1.34e-05
    tenure                                   0.0113929  0.0024397   4.670 3.46e-06
    married                                  0.2008658  0.0388267   5.173 2.82e-07
    south                                   -0.0802354  0.0262560  -3.056 0.002308
    urban                                    0.1835758  0.0268586   6.835 1.49e-11
    black                                   -0.1466989  0.0397013  -3.695 0.000233
    iq                                       0.0036357  0.0009957   3.652 0.000275
    I((iq - mean(iq)) * (educ - mean(educ))) 0.0003399  0.0003826   0.888 0.374564

    Residual standard error: 0.3632 on 925 degrees of freedom
    Multiple R-squared: 0.2634, Adjusted R-squared: 0.2563
    F-statistic: 36.76 on 9 and 925 DF, p-value: < 2.2e-16
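As a purely illustrative calculation, the point estimates above can be plugged into the slope $\beta_1 + \beta_{12}\,\widetilde{\mathit{iq}}$. The deviation of 15 IQ points is a hypothetical choice, and since $\hat\beta_{12}$ is statistically insignificant the resulting differences should not be read as evidence of an effect:

$$\hat\beta_1 + \hat\beta_{12} \cdot 15 = 0.0528786 + 0.0003399 \cdot 15 = 0.0528786 + 0.0050985 \approx 0.058,$$

$$\hat\beta_1 - \hat\beta_{12} \cdot 15 = 0.0528786 - 0.0050985 \approx 0.048.$$

That is, the fitted return to education would be about 5.8% at 15 IQ points above the mean and about 4.8% at 15 points below, against 5.3% at the mean.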

Outliers

Particularly in small data sets, OLS estimates may be influenced by one or several observations (see figure). Generally such observations are called outliers or influential observations. Loosely, an observation is an outlier if dropping it changes the estimation results materially. In detecting outliers, a usual practice is to investigate standardized (or studentized) residuals. If an outlier is an obvious mistake in recording the data, it can be corrected. It is also common practice to eliminate such observations. Data transformations, like taking logarithms, often narrow the range of the data and hence may alleviate outlier problems, too.
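Screening with studentized residuals can be sketched in base R as follows. The data are simulated with one deliberately contaminated observation, and the cutoff of 3 is a rule-of-thumb assumption rather than a fixed standard.

```r
# Simulated small sample (an assumption) with one contaminated observation.
set.seed(5)
n <- 30
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n)
y[n] <- y[n] + 15             # hypothetical recording error in the last observation

fit  <- lm(y ~ x)
r_st <- rstudent(fit)         # (externally) studentized residuals
flag <- which(abs(r_st) > 3)  # rule-of-thumb cutoff, an assumption
print(flag)

# Re-estimate without the flagged observations and compare the coefficients.
keep     <- setdiff(seq_len(n), flag)
fit_drop <- lm(y ~ x, subset = keep)
print(rbind(all = coef(fit), dropped = coef(fit_drop)))
```

Comparing the two rows of coefficients shows directly how much the flagged observation moves the estimates, which is the informal definition of an outlier given above. `rstandard(fit)` gives the standardized residuals instead.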