
Part VIII: Model Specification and Data Problems (as of Oct 24, 2017)

1 Model Specification and Data Problems
  - RESET test
  - Non-nested alternatives
  - Outliers

Functional form misspecification generally means that the model does not account for some important nonlinearity. Recall that omitting an important variable is also a form of model misspecification. Functional form misspecification generally causes bias in the remaining parameter estimators.

Example 1 Suppose that the correct specification of the wage equation is

log(wage) = β0 + β1 educ + β2 exper + β3 exper² + u.  (1)

Then the return to an extra year of experience is

∂log(wage)/∂exper = β2 + 2 β3 exper.  (2)

If the second-order term is dropped from (1), use of the resulting biased estimate of β2 can be misleading.

RESET test

Ramsey (1969)² proposed a general functional form misspecification test, the Regression Specification Error Test (RESET), which has proven useful. Estimate

y = β0 + β1 x1 + … + βk xk + u,  (3)

obtain the fitted values ŷ, and test in the augmented model

y = β0 + β1 x1 + … + βk xk + δ1 ŷ² + δ2 ŷ³ + e  (4)

the null hypothesis

H0: δ1 = δ2 = 0  (5)

with the F-test, with numerator df1 = 2 and denominator df2 = n − k − 3.

² Ramsey, J.B. (1969). Tests for specification errors in classical linear least-squares regression analysis, Journal of the Royal Statistical Society, Series B, 31, 350–371.
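The mechanics of (3)–(5) are easy to sketch. The following is a minimal illustration in Python/NumPy rather than the R/EViews used elsewhere in these notes, on hypothetical simulated data; `reset_F` is my own helper, not a library function. It fits the restricted model, augments it with ŷ² and ŷ³, and forms the F-statistic from the two sums of squared residuals.

```python
import numpy as np

def reset_F(X, y):
    """RESET F-statistic for H0: delta1 = delta2 = 0 in (4).

    X must include the intercept column; returns (F, df2).
    """
    n, p = X.shape                               # p = k + 1 columns incl. intercept
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    ssr_r = np.sum((y - yhat) ** 2)              # restricted SSR from (3)
    Xa = np.column_stack([X, yhat**2, yhat**3])  # augmented model (4)
    resid_a = y - Xa @ np.linalg.lstsq(Xa, y, rcond=None)[0]
    ssr_u = np.sum(resid_a ** 2)                 # unrestricted SSR
    df2 = n - p - 2                              # = n - k - 3
    F = ((ssr_r - ssr_u) / 2) / (ssr_u / df2)
    return F, df2

# Hypothetical data: y is quadratic in x but a linear model is fitted,
# so RESET should flag the misspecification with a large F.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 1 + 2 * x + 1.5 * x**2 + rng.normal(size=200)
X = np.column_stack([np.ones(200), x])
F, df2 = reset_F(X, y)    # F is large; df2 = 200 - 1 - 3 = 196
```

If the quadratic term is included in X instead, the same function returns a small F, and the null (5) is not rejected.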

Example 2 Consider the house price data (Exercise 3.1) and estimate

price = β0 + β1 lotsize + β2 sqrft + β3 bdrms + u.  (6)

Estimation results:

Dependent Variable: PRICE
Method: Least Squares
Sample: 1 88
Included observations: 88

Variable   Coefficient  Std. Error  t-Statistic  Prob.
C           -21.77031    29.47504    -0.738601   0.4622
LOTSIZE       0.002068    0.000642     3.220096  0.0018
SQRFT         0.122778    0.013237     9.275093  0.0000
BDRMS        13.85252     9.010145     1.537436  0.1279

R-squared           0.672362   Mean dependent var     293.5460
Adjusted R-squared  0.660661   S.D. dependent var     102.7134
S.E. of regression  59.83348   Akaike info criterion  11.06540
Sum squared resid   300723.8   Schwarz criterion      11.17800
Log likelihood     -482.8775   F-statistic            57.46023
Durbin-Watson stat  2.109796   Prob(F-statistic)      0.000000

Next estimate (6) augmented with the squared and cubed fitted values, ŷ² and ŷ³, as in (4). The F-statistic for the null hypothesis (5) becomes F = 4.67 with 2 and 82 degrees of freedom. The p-value is 0.012, so the null hypothesis is rejected at the 5% level. Thus there is some evidence of nonlinearity.

Estimate next

log(price) = β0 + β1 log(lotsize) + β2 log(sqrft) + β3 bdrms + u.  (7)

Estimation results:

Dependent Variable: LOG(PRICE)
Method: Least Squares
Sample: 1 88
Included observations: 88

Variable      Coefficient  Std. Error  t-Statistic  Prob.
C              -1.297042    0.651284   -1.991517    0.0497
LOG(LOTSIZE)    0.167967    0.038281    4.387714    0.0000
LOG(SQRFT)      0.700232    0.092865    7.540306    0.0000
BDRMS           0.036958    0.027531    1.342415    0.1831

R-squared           0.642965   Mean dependent var      5.633180
Adjusted R-squared  0.630214   S.D. dependent var      0.303573
S.E. of regression  0.184603   Akaike info criterion  -0.496833
Sum squared resid   2.862563   Schwarz criterion      -0.384227
Log likelihood      25.86066   F-statistic            50.42374
Durbin-Watson stat  2.088996   Prob(F-statistic)       0.000000

Applying the RESET test, the F-statistic for the null hypothesis (5) is now F = 2.56 with p-value 0.084, so the hypothesis is not rejected at the 5% level. Thus, on the basis of the RESET test, the log-log model (7) is preferred.

Non-nested alternatives

Suppose, for example, that the model choices are

y = β0 + β1 x1 + β2 x2 + u  (8)

and

y = β0 + β1 log(x1) + β2 log(x2) + u.  (9)

Because the models are non-nested, the usual F-test does not apply. A common approach is to estimate a combined model

y = γ0 + γ1 x1 + γ2 x2 + γ3 log(x1) + γ4 log(x2) + u.  (10)

H0: γ3 = γ4 = 0 is a hypothesis in favor of (8), and H0: γ1 = γ2 = 0 is a hypothesis in favor of (9). The usual F-test applies again here.

Davidson and MacKinnon (1981)³ procedure: to test (8), first estimate

y = β0 + β1 x1 + β2 x2 + θ1 ŷ + v,  (11)

where ŷ is the fitted value from (9). A significant t-statistic on the θ1-estimate is a rejection of (8). Similarly, if ŷ now denotes the fitted values from (8), the test of (9) is the t-statistic on the θ1-estimate from

y = β0 + β1 log(x1) + β2 log(x2) + θ1 ŷ + v.  (12)

³ Davidson, R. and J.G. MacKinnon (1981). Several tests for model specification in the presence of alternative hypotheses, Econometrica 49, 781–793.
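Regression (11) can be sketched in a few lines. The example below is in Python/NumPy on invented data where the log-log model (9) is the true DGP, so the fitted values from (9) should enter significantly and the linear model (8) should be rejected; `ols` is my own helper, not a library routine.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and conventional standard errors."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    n, p = X.shape
    s2 = resid @ resid / (n - p)                       # error variance estimate
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return b, se

rng = np.random.default_rng(7)
n = 500
x1 = rng.uniform(1, 10, n)
x2 = rng.uniform(1, 10, n)
# Hypothetical DGP: the log-log rival (9) is the truth.
y = 1 + 2 * np.log(x1) + 3 * np.log(x2) + rng.normal(size=n)

X_lin = np.column_stack([np.ones(n), x1, x2])                  # model (8)
X_log = np.column_stack([np.ones(n), np.log(x1), np.log(x2)])  # model (9)

yhat_log = X_log @ np.linalg.lstsq(X_log, y, rcond=None)[0]    # fitted values of (9)
b, se = ols(np.column_stack([X_lin, yhat_log]), y)             # regression (11)
t_theta1 = b[-1] / se[-1]   # significant t => reject the linear model (8)
```

Swapping the roles of the two models gives the test of (9) as in (12).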

Remark 8.1: A clear winner need not emerge. Both models may be rejected, or neither may be rejected. In the latter case the adjusted R-squared can be used to select the better-fitting one. If both models are rejected, more work is needed.⁴

⁴ For more complicated cases, see Wooldridge, J.M. (1994). A simple specification test for the predictive ability of transformation models, Review of Economics and Statistics 76, 59–65.

Proxy variables for unobserved explanatory variables

As discussed earlier, an important source of bias in OLS is omitted variables that are correlated with the included explanatory variables. Often the reason for omission is that these variables are unobservable. One way to mitigate the problem is to collect data on proxy variables. Consider the regression

y = β0 + β1 x1 + β2 x2* + u,  (13)

where x2* is an unobservable variable (e.g., human ability).

Suppose that the primary interest is to estimate β1, so that x2* is a control variable. However, as we know, the simple regression y = β0 + β1 x1 + v yields a biased and inconsistent OLS estimator of β1, with plim β̂1 = β1 + γ1 β2, where γ1 is the slope coefficient in the regression x2* = γ0 + γ1 x1 + error.

Suppose that we have a good proxy x2 for x2* such that

E[x2* | x2, x1] = E[x2* | x2],

i.e., given the proxy x2, x1 does not help in predicting the unobserved variable x2*, and

E[u | x2] = 0

for the error term in regression (13). The first condition implies that in the regression

x2* = δ0 + δ1 x2 + θ x1 + e

we have θ = 0, so that only the proxy x2 is related to the unobserved variable x2*; the second implies that the proxy x2 is not correlated with the error term of the true regression (13).

With this kind of a good proxy, instead of (13) the model to be estimated becomes

y = α0 + β1 x1 + α1 x2 + w.  (14)

Now OLS is an unbiased and consistent estimator of β1, the parameter we are primarily interested in. (The OLS estimators of α0 and α1 are also unbiased and consistent for these parameters, but α0 = β0 + β2 δ0 and α1 = δ1 β2 differ from β0 and β2.)
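A small simulation illustrates the logic of (13)–(14). All coefficients here are invented, and the sketch is in Python/NumPy rather than the R used elsewhere in these notes: x2* plays the role of ability, x2 its proxy. The short regression omitting x2* is biased for β1; including the proxy recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
b0, b1, b2 = 1.0, 0.06, 0.5           # hypothetical true parameters of (13)

x2 = rng.normal(size=n)               # observable proxy (think: standardized IQ)
e  = rng.normal(size=n)               # the part of ability the proxy misses
x1 = 0.5 * x2 + rng.normal(size=n)    # x1 correlated with ability via the proxy only
x2_star = 1.0 + 0.8 * x2 + e          # unobserved ability: x2* = d0 + d1*x2 + e
y  = b0 + b1 * x1 + b2 * x2_star + rng.normal(size=n)

def slope_of_x1(cols, y):
    """OLS slope on the first regressor after the intercept."""
    X = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

b1_short = slope_of_x1([x1], y)       # x2* omitted: biased upward, approx b1 + b2*gamma1
b1_proxy = slope_of_x1([x1, x2], y)   # proxy included: consistent for b1
```

With these numbers the short regression converges to roughly b1 + 0.5 × 0.32 = 0.22 instead of 0.06, while the proxy regression is close to 0.06.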

Example 3 Consider the return to education in monthly wages for men (wage2 data set).

lm(formula = log(wage) ~ educ + exper + tenure + married + south +
    urban + black, data = wdf)

Residuals:
     Min       1Q   Median       3Q      Max
-1.98069 -0.21996  0.00707  0.24288  1.22822

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.395497   0.113225  47.653  < 2e-16 ***
educ         0.065431   0.006250  10.468  < 2e-16 ***
exper        0.014043   0.003185   4.409 1.16e-05 ***
tenure       0.011747   0.002453   4.789 1.95e-06 ***
married      0.199417   0.039050   5.107 3.98e-07 ***
south       -0.090904   0.026249  -3.463 0.000558 ***
urban        0.183912   0.026958   6.822 1.62e-11 ***
black       -0.188350   0.037667  -5.000 6.84e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3655 on 927 degrees of freedom
Multiple R-squared: 0.2526, Adjusted R-squared: 0.2469
F-statistic: 44.75 on 7 and 927 DF, p-value: < 2.2e-16

The estimated return to education is 6.5%. However, if the omitted ability is positively correlated with educ, the estimate is too high. Adding IQ as a proxy for ability reduces the estimate to 5.4%, which is consistent with upward omitted-variable bias.

lm(formula = log(wage) ~ educ + exper + tenure + married + south +
    urban + black + iq, data = wdf)

Residuals:
     Min       1Q   Median       3Q      Max
-2.01203 -0.22244  0.01017  0.22951  1.27478

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1764391  0.1280006  40.441  < 2e-16 ***
educ        0.0544106  0.0069285   7.853 1.12e-14 ***
exper       0.0141459  0.0031651   4.469 8.82e-06 ***
tenure      0.0113951  0.0024394   4.671 3.44e-06 ***
married     0.1997644  0.0388025   5.148 3.21e-07 ***
south      -0.0801695  0.0262529  -3.054 0.002325 **
urban       0.1819463  0.0267929   6.791 1.99e-11 ***
black      -0.1431253  0.0394925  -3.624 0.000306 ***
iq          0.0035591  0.0009918   3.589 0.000350 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3632 on 926 degrees of freedom
Multiple R-squared: 0.2628, Adjusted R-squared: 0.2564
F-statistic: 41.27 on 8 and 926 DF, p-value: < 2.2e-16

Test whether the interaction of ability and education affects wages.

lm(formula = log(wage) ~ educ + exper + tenure + married + south +
    urban + black + iq + iq:educ, data = wdf)  # iq:educ introduces the interaction iq*educ

Residuals:
     Min       1Q   Median       3Q      Max
-2.00733 -0.21715  0.01177  0.23456  1.27305

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.6482478  0.5462963  10.339  < 2e-16 ***
educ        0.0184560  0.0410608   0.449 0.653192
exper       0.0139072  0.0031768   4.378 1.34e-05 ***
tenure      0.0113929  0.0024397   4.670 3.46e-06 ***
married     0.2008658  0.0388267   5.173 2.82e-07 ***
south      -0.0802354  0.0262560  -3.056 0.002308 **
urban       0.1835758  0.0268586   6.835 1.49e-11 ***
black      -0.1466989  0.0397013  -3.695 0.000233 ***
iq         -0.0009418  0.0051625  -0.182 0.855290
educ:iq     0.0003399  0.0003826   0.888 0.374564
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3632 on 925 degrees of freedom
Multiple R-squared: 0.2634, Adjusted R-squared: 0.2563
F-statistic: 36.76 on 9 and 925 DF, p-value: < 2.2e-16

Adding iq × educ is not only insignificant itself, it also renders educ and iq insignificant! This is due to the high correlation of the interaction term with its components:

> with(wdf, cor(cbind(educ, iq, educ * iq)))
             educ        iq   educ*iq
educ    1.0000000 0.5156970 0.8880035
iq      0.5156970 1.0000000 0.8453237
educ*iq 0.8880035 0.8453237 1.0000000

The implied collinearity can be materially reduced by defining the interaction term in terms of demeaned variables:

> with(wdf, cor(cbind(educ, iq, (educ - mean(educ)) * (iq - mean(iq)))))
                      educ         iq  demeaned educ*iq
educ             1.0000000  0.5156970  0.1864668
iq               0.5156970  1.0000000 -0.0133327
demeaned educ*iq 0.1864668 -0.0133327  1.0000000
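The collinearity point can be reproduced on simulated data. The sketch below uses Python/NumPy with hypothetical educ/iq-like variables (means and spreads invented to resemble the wage2 data): the raw interaction is almost collinear with its components because of their large means, while the demeaned interaction is nearly uncorrelated with them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
# Hypothetical variables: iq around 100, educ positively related to iq
iq   = rng.normal(100, 15, n)
educ = 8 + 0.05 * iq + rng.normal(0, 2, n)

raw      = educ * iq                                   # raw interaction
demeaned = (educ - educ.mean()) * (iq - iq.mean())     # demeaned interaction

r_raw = np.corrcoef(educ, raw)[0, 1]       # near 1: severe collinearity
r_dm  = np.corrcoef(educ, demeaned)[0, 1]  # near 0
```

Because the means are large relative to the spreads, educ*iq is nearly a linear combination of educ and iq; demeaning removes exactly that linear part.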

An interaction term of the demeaned components also leads to a meaningful interpretation of the implied model. We can write

log(wage) = β0 + β1 educ + β2 iq + β12 (ẽduc × ĩq) + other factors
          = β̃0 + β1 ẽduc + β2 ĩq + β12 (ẽduc × ĩq) + other factors,

where ẽduc = educ − mean(educ) and ĩq = iq − mean(iq) are the demeaned educ and iq, and β̃0 = β0 + β1 mean(educ) + β2 mean(iq). We can further write

log(wage) = β̃0 + (β1 + β12 ĩq) ẽduc + β2 ĩq + other factors.

The slope coefficient β1 + β12 ĩq on educ implies that the return to education depends on the level of ability (measured by IQ). At the mean IQ, ĩq = 0, so β1 indicates the return to education for a person of average ability, and β12 indicates the rate, per IQ point, at which the return to education changes as ability deviates from the average. Assuming β12 > 0, above-average ability implies a higher return to education and below-average ability a lower one.

Estimating the model, however, gives β̂12 = .00034 with p-value .37, far from statistically significant, so there is no evidence that variability in IQ as such affects the return to education.

lm(formula = log(wage) ~ educ + exper + tenure + married + south +
    urban + black + iq + I((iq - mean(iq)) * (educ - mean(educ))), data = wdf)

Residuals:
     Min       1Q   Median       3Q      Max
-2.00733 -0.21715  0.01177  0.23456  1.27305

Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                              5.1846286  0.1283466  40.396  < 2e-16
educ                                     0.0528786  0.0071406   7.405 2.94e-13
exper                                    0.0139072  0.0031768   4.378 1.34e-05
tenure                                   0.0113929  0.0024397   4.670 3.46e-06
married                                  0.2008658  0.0388267   5.173 2.82e-07
south                                   -0.0802354  0.0262560  -3.056 0.002308
urban                                    0.1835758  0.0268586   6.835 1.49e-11
black                                   -0.1466989  0.0397013  -3.695 0.000233
iq                                       0.0036357  0.0009957   3.652 0.000275
I((iq - mean(iq)) * (educ - mean(educ))) 0.0003399  0.0003826   0.888 0.374564

Residual standard error: 0.3632 on 925 degrees of freedom
Multiple R-squared: 0.2634, Adjusted R-squared: 0.2563
F-statistic: 36.76 on 9 and 925 DF, p-value: < 2.2e-16

Outliers

Particularly in small data sets, OLS estimates may be strongly influenced by one or a few observations (see figure). Such observations are called outliers or influential observations. Loosely, an observation is an outlier if dropping it changes the estimation results materially. A common practice for detecting outliers is to examine standardized (or "studentized") residuals. If an outlier is an obvious mistake in recording the data, it can be corrected; another common practice is to eliminate such observations. Data transformations, like taking logarithms, often narrow the range of the data and hence may also alleviate outlier problems.
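A minimal sketch of this screening, in Python/NumPy on simulated data with one deliberately corrupted observation (the cutoff of 2 is only a common rule of thumb, not a formal test): standardized residuals are the OLS residuals scaled by their estimated standard deviations, computed from the diagonal of the hat matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(scale=0.5, size=n)
y[10] += 8.0                                  # inject a gross recording error

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
resid = y - H @ y                             # OLS residuals
p = X.shape[1]
s = np.sqrt(resid @ resid / (n - p))          # residual standard error

# standardized (internally studentized) residuals
std_resid = resid / (s * np.sqrt(1 - np.diag(H)))
outliers = np.flatnonzero(np.abs(std_resid) > 2)   # rule-of-thumb screen
```

Here observation 10 stands out clearly; in practice one would then check whether it is a recording error before deciding to correct or drop it.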