Inference with Heteroskedasticity

Note on required packages: The following code requires the packages sandwich and lmtest to estimate regression error variances that may change with the explanatory variables. If you have not done so on your machine, download the packages sandwich and lmtest. This should only need to be done once on your computer.

install.packages("sandwich")
install.packages("lmtest")

You can load the libraries sandwich and lmtest with the following calls to library():

library("sandwich")
library("lmtest")

Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

1. Introduction

Homoskedasticity is the property that the variance of the error term of a regression (estimated by the variance of the residuals in the sample) is the same across different values of the explanatory variables, or the same across time in time series models. Heteroskedasticity is the property that the variance of the error term changes predictably with one or more of the explanatory variables.

Ordinary least squares (OLS) estimation of a model with heteroskedastic errors still produces unbiased and consistent estimators for the coefficients and the predicted values. The problem is that the OLS estimates of the variances of these estimators are incorrect. The purpose of this tutorial is to demonstrate how to compute heteroskedasticity-consistent variances, and how to use these estimates to conduct hypothesis tests and estimate confidence intervals.

2. Example: Factors affecting monthly earnings

Let us examine a data set that explores the relationship between total monthly earnings (MonthlyEarnings) and a number of factors that may influence monthly earnings, including each person's IQ (IQ), a measure of knowledge of the workplace environment (Knowledge), years of education (YearsEdu), years of experience (YearsExperience), and years at the current job (Tenure).

The code below downloads a CSV file that includes 1980 data on the above variables for 935 individuals and assigns it to a data set that we name wages. The next call to lm() estimates a multiple regression predicting monthly earnings based on the five explanatory variables given above, and includes some interaction terms with IQ.

wages <- read.csv("http://murraylax.org/datasets/wage2.csv")

lmwages <- lm(MonthlyEarnings ~ IQ + Knowledge + YearsEdu +
              YearsExperience + Tenure +
              IQ:YearsEdu + IQ:YearsExperience,
              data = wages)
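
Before correcting anything, it is worth checking visually whether the error variance appears to change with the predicted values; the next section refers back to the squared residuals this produces. The following is a minimal sketch using base R graphics (the object name ressq is our own):

# Compute the squared OLS residuals and plot them against the
# fitted values; a systematic pattern (e.g., spread fanning out
# as fitted values grow) suggests heteroskedasticity.
ressq <- residuals(lmwages)^2
plot(fitted(lmwages), ressq,
     xlab = "Predicted monthly earnings",
     ylab = "Squared residual")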

3. Heteroskedasticity robust standard errors

White's correction for heteroskedasticity is a method for keeping the OLS estimates of the coefficients and predicted values (which are unbiased), while fixing the estimates of the variances of the coefficients (which are biased). As its estimate of the possibly changing variance, it uses the squared residuals from the OLS regression, which we computed and graphed above.

In R, we can compute the White heteroskedasticity-consistent variance/covariance matrix for the coefficients with a call to vcovHC() (which stands for Variance / Covariance Heteroskedasticity Consistent):

vv <- vcovHC(lmwages, type = "HC1")

The first parameter in the call above is our original output from the call to lm(). The second parameter, type = "HC1", tells the function to use the White estimate of the variance/covariance matrix, which uses the squared residuals from the OLS regression as its estimate of the changing variance.

4. Hypothesis testing

Now that we have an estimate of the heteroskedasticity-robust variance/covariance matrix, we can use it to properly compute the standard errors, t-statistics, and p-values for the coefficients:

coeftest(lmwages, vcov = vv)

t test of coefficients:

                     Estimate  Std. Error t value  Pr(>|t|)
(Intercept)        190.594304  683.642404  0.2788   0.78047
IQ                  -3.211755    6.680955 -0.4807   0.63082
Knowledge            8.438092    2.026726  4.1634 3.428e-05 ***
YearsEdu            -5.383972   45.770462 -0.1176   0.90639
YearsExperience      8.386347   22.846782  0.3671   0.71365
Tenure               6.236246    2.518999  2.4757   0.01348 *
IQ:YearsEdu          0.496504    0.437963  1.1337   0.25723
IQ:YearsExperience   0.030878    0.224251  0.1377   0.89051
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The first parameter to coeftest() is the result of our original call to lm(), and the second parameter is the updated variance/covariance matrix to use.
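
To see what coeftest() is doing under the hood, we can reproduce its standard errors and t-statistics directly from vv. This is a minimal sketch under the same t-distribution assumptions coeftest() uses for lm objects; the object names robust_se, tstats, and pvals are our own:

# Robust standard errors are the square roots of the diagonal of vv;
# each t-statistic divides a coefficient by its robust standard error.
robust_se <- sqrt(diag(vv))
tstats <- coef(lmwages) / robust_se
# Two-sided p-values from the t distribution with the residual
# degrees of freedom of the regression
pvals <- 2 * pt(abs(tstats), df = lmwages$df.residual, lower.tail = FALSE)
cbind(robust_se, tstats, pvals)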

Let us compare the result to the OLS regression output:

summary(lmwages)

Call:
lm(formula = MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience +
    Tenure + IQ:YearsEdu + IQ:YearsExperience, data = wages)

Residuals:
    Min      1Q  Median      3Q     Max
-836.54 -244.14  -44.04  180.97 2262.80

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        190.59430  712.52015   0.267   0.7891
IQ                  -3.21176    6.86803  -0.468   0.6402
Knowledge            8.43809    1.83370   4.602 4.77e-06 ***
YearsEdu            -5.38397   44.50816  -0.121   0.9037
YearsExperience      8.38635   21.37629   0.392   0.6949
Tenure               6.23625    2.45712   2.538   0.0113 *
IQ:YearsEdu          0.49650    0.41535   1.195   0.2322
IQ:YearsExperience   0.03088    0.21136   0.146   0.8839
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 365.5 on 927 degrees of freedom
Multiple R-squared:  0.1892,    Adjusted R-squared:  0.1831
F-statistic:  30.9 on 7 and 927 DF,  p-value: < 2.2e-16

The coefficients are exactly the same, but the estimates of the standard errors, the t-statistics, and the p-values are slightly different. One reason the standard errors and p-values are only slightly different is that there may actually be very little heteroskedasticity: the differences in the variance of the error term may be small even for large differences in the explanatory variables. The White heteroskedasticity-robust standard errors are valid under either homoskedasticity or heteroskedasticity, so it is always safe to use these estimates if you are not sure whether the homoskedasticity assumption holds.
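
If we want a formal check of whether heteroskedasticity is present, rather than an eyeball judgment, the lmtest package provides the Breusch-Pagan test via bptest(). A minimal sketch; the null hypothesis is homoskedasticity:

# Breusch-Pagan test: based on an auxiliary regression of the squared
# residuals on the explanatory variables. A small p-value is evidence
# against homoskedasticity.
bptest(lmwages)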

5. Confidence Intervals

The function confint() computes confidence intervals for the coefficients, but it assumes homoskedasticity. Here is an example:

confint(lmwages, level = 0.95)

                           2.5 %       97.5 %
(Intercept)        -1207.7452702 1588.9338778
IQ                   -16.6904391   10.2669289
Knowledge              4.8394156   12.0367687
YearsEdu             -92.7324169   81.9644733
YearsExperience      -33.5651905   50.3378852
Tenure                 1.4140877   11.0584039
IQ:YearsEdu           -0.3186292    1.3116367
IQ:YearsExperience    -0.3839230    0.4456795

Unfortunately, base R includes no similar method to compute confidence intervals for coefficients with heteroskedasticity-robust standard errors, so we have to compute these manually. Let us first familiarize ourselves with how the confidence intervals are computed under ordinary least squares and homoskedasticity. Confidence intervals have the general form:

estimate ± t_{α/2, df} × (standard error of estimate)

where t_{α/2, df} is the t-statistic on the right side of the t distribution for a (1 − α)·100% confidence interval. For example, for a 95% confidence interval, we want 95% of the distribution in the middle, and so we want 2.5% left in each tail. The value t_{0.025, df} is the value of t that has 2.5% of the distribution left in the right tail, for a given number of degrees of freedom. We can compute this value of t for our regression as follows:

alpha <- 0.05  # for a 95% confidence interval
tval <- qt(p = 1 - alpha/2, df = lmwages$df.residual)

The function qt(p, df) delivers the t-statistic that has probability p to its left. Since we want the right side of the distribution, and 2.5% of the distribution in the right tail is the same as 1 - 0.025 = 0.975 = 97.5% of the distribution to the left, we pass the parameter p = 1 - alpha/2.

To estimate the confidence interval with heteroskedasticity-robust standard errors, we can use this t-statistic and the heteroskedasticity-robust variances that we computed in the call to vcovHC(). The diagonal elements of vv are the variances of the sampling distributions of the coefficients (the off-diagonal elements are the covariances). The following line of code computes the standard errors by picking off the diagonal elements of vv and then taking the square root (the standard error is equal to the square root of the variance):

stdberr <- sqrt( diag(vv) )

The upper limit of the confidence interval is computed as

coefupper <- lmwages$coefficients + tval * stdberr

and the lower limit of the confidence interval is given by

coeflower <- lmwages$coefficients - tval * stdberr

If we want a pretty display of the lower and upper confidence limits, we can call cbind() (which binds columns):

cbind(coeflower, coefupper)

                       coeflower    coefupper
(Intercept)        -1151.0719330 1532.2605407
IQ                   -16.3233055    9.8997953
Knowledge              4.4605895   12.4155948
YearsEdu             -95.2097091   84.4417656
YearsExperience      -36.4510643   53.2237590
Tenure                 1.2926447   11.1798469
IQ:YearsEdu           -0.3630097    1.3560172
IQ:YearsExperience    -0.4092196    0.4709761
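
As a cross-check on the manual computation above: recent versions of lmtest also provide a coefci() helper that mirrors confint() but accepts a robust variance/covariance matrix directly. Whether it is available depends on your installed lmtest version, so treat this as an optional shortcut rather than the method of this tutorial:

# coefci() works like confint() but lets us supply the robust
# variance/covariance matrix; the result should match the coeflower
# and coefupper columns computed manually above.
coefci(lmwages, level = 0.95, vcov. = vv)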

6. Joint Test for Multiple Restrictions

Suppose we want to test whether or not IQ has any influence on monthly earnings. The variable IQ appears multiple times: once on its own and twice in interaction terms. The null and alternative hypotheses for this test are given by:

H_0: β_IQ = 0, β_IQ:YearsEdu = 0, β_IQ:YearsExperience = 0
H_A: At least one of these coefficients is not equal to zero

The standard F-test for multiple restrictions, which compares the sum of squared residuals between an unrestricted and a (nested) restricted model (often called the Wald test), assumes homoskedasticity, and so it is not appropriate if heteroskedasticity is present. There is a heteroskedasticity-robust version of the test, which can be estimated with the waldtest() function. The default output of waldtest() is the same as that of anova(), but it has the additional power that we can specify the correct variance/covariance matrix.

First we estimate the restricted model, which does not include IQ in any of its terms:

lmwages_res <- lm(MonthlyEarnings ~ Knowledge + YearsEdu +
                  YearsExperience + Tenure,
                  data = wages)

Then we call waldtest(), passing both the unrestricted and restricted regression outputs, and we include our estimate of the robust variance/covariance matrix for the unrestricted model:

waldtest(lmwages, lmwages_res, vcov = vv)

Wald test

Model 1: MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience +
    Tenure + IQ:YearsEdu + IQ:YearsExperience
Model 2: MonthlyEarnings ~ Knowledge + YearsEdu + YearsExperience + Tenure

  Res.Df Df      F   Pr(>F)
1    927
2    930 -3 5.3398 0.001194 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is approximately 0.001, which is below 0.05. We reject the null hypothesis and conclude that there is sufficient statistical evidence that IQ does help explain monthly earnings.
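
For comparison, here is the classical homoskedasticity-based version of the same joint test, using anova() on the nested models. As noted above, this matches what waldtest() reports when no robust variance/covariance matrix is supplied:

# Classical F-test for the same three restrictions, assuming
# homoskedasticity; compare its p-value with the robust Wald test.
anova(lmwages_res, lmwages)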