Quantitative Methods I: Regression diagnostics

Quantitative Methods I: Regression diagnostics. University College Dublin, 10 December 2014

Outline
1. Assumptions and errors
2. Leverage, outliers and influence
3. Multicollinearity
4. Heteroscedasticity

Assumptions: specification
- Linear in parameters (i.e. f(Xβ) = Xβ and E[y] = Xβ)
- No extraneous variables in X
- No omitted independent variables
- Parameters to be estimated are constant
- Number of parameters is less than the number of cases, k < n

Assumptions: errors
- Errors have an expected value of zero, E[ε | X] = 0
- Errors are normally distributed, ε ~ N(0, σ²)
- Errors have a constant variance, Var(ε | X) = σ² < ∞
- Errors are not autocorrelated, Cov(ε_i, ε_j | X) = 0 for all i ≠ j
- Errors and X are uncorrelated, Cov(X, ε) = 0

Assumptions: regressors
- X varies
- X is of full column rank (note: requires k < n)
- No measurement error in X
- No endogenous variables in X

Assumptions for unbiasedness
If the population regression model is linear in its parameters; the sample is a random sample from the population; there is no perfect collinearity, r(X) = k; and the expected value of the error term is zero conditional on the explanatory variables, E(ε | X) = 0 and Cov(ε, X) = 0, then the OLS estimators of β are unbiased. (Glynn, 2007)

Non-constant error variance
Consequences:
- σ̂² is a biased estimator of σ²
- the probability of a Type I error will not be α
- the least squares estimator is no longer the best linear unbiased estimator; the severity depends on the level of heteroscedasticity
- but: β̂ is still an unbiased estimator of β
(King, 2007)

Non-normal errors
Consequences:
- the sampling distribution of β̂ is not normal
- test statistics will not have t- and F-distributions
- the probability of a Type I error will not be α
- but the estimates are still consistent: as n increases, the above problems disappear
(King, 2007)

Exercise: film reviews
Open the films.dta data set. Create a new variable highrating, which is 1 for films rated 3 or higher, 0 otherwise. Using matrix formulas,
1. regress desclength on a constant
2. regress desclength on castsize
3. regress desclength on castsize, highrating, length
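
A minimal R sketch of the matrix-formula approach (the variable names desclength, castsize, highrating and length come from the exercise; the raw rating variable, here called rating, and the use of read_dta() from the haven package are assumptions):

library(haven)                                        # read_dta() reads Stata .dta files

films <- read_dta("films.dta")
films$highrating <- ifelse(films$rating >= 3, 1, 0)   # assumed rating variable; 1 if rated 3 or higher

ols <- function(X, y) solve(t(X) %*% X) %*% t(X) %*% y   # beta-hat = (X'X)^-1 X'y

y  <- films$desclength
b1 <- ols(matrix(1, nrow(films), 1), y)                                  # 1. constant only
b2 <- ols(cbind(1, films$castsize), y)                                   # 2. constant + castsize
b3 <- ols(cbind(1, films$castsize, films$highrating, films$length), y)   # 3. full model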

Exercise: film reviews
Based on the last regression:
1. Which observation has the largest residual?
2. Compute the mean and median of the residuals
3. Compute the correlation between the residuals and the fitted values
4. Compute the correlation between the residuals and length
5. All other predictors held constant, what would be the difference in predicted description length between high and low rated movies?
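
A sketch of these checks, continuing from the matrix-formula code above (b3 is the coefficient vector from the third regression):

X3   <- cbind(1, films$castsize, films$highrating, films$length)
yhat <- X3 %*% b3                           # fitted values
e    <- y - yhat                            # residuals

which.max(abs(e))                           # 1. observation with the largest residual
c(mean = mean(e), median = median(e))       # 2. the mean is zero by construction (constant included)
cor(e, yhat)                                # 3. zero: residuals are orthogonal to the fitted values
cor(e, films$length)                        # 4. zero: residuals are orthogonal to each regressor
b3[3]                                       # 5. coefficient on highrating: the predicted difference,
                                            #    other predictors held constant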

Non-linearity
If there is non-linearity in the variables, but not in the parameters, there is no problem. E.g. y_i = β₀ + β₁ x_i + β₂ x_i² + ε_i can be estimated with OLS.
If there are other non-linearities, sometimes the equation can be transformed. E.g. y_i = β₀ x_i^β₁ ε_i becomes log(y_i) = log(β₀) + β₁ log(x_i) + log(ε_i), which is linear in the transformed variables: y_i* = β₀* + β₁ x_i* + ε_i*.

Functional forms for additional non-linear transformations
- log-linear: as with the previous example
- semi-log has two forms:
  y_i = β₀ + β₁ log(x_i), where β₁ is the change in y due to a % change in x
  log(y_i) = β₀ + β₁ x_i, where β₁ is the % change in y due to a change in x
- inverse or reciprocal: y_i = β₀ + β₁ (1/x_i)
- polynomial: y_i = β₀ + β₁ x_i + β₂ x_i²
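
Each of these forms can still be estimated with lm() after transforming the variables; a small sketch on simulated data (x and y are made up for illustration):

set.seed(1)
x <- runif(100, 1, 10)
y <- exp(0.5 + 0.3 * log(x) + rnorm(100, sd = 0.1))   # simulated positive outcome

m_loglog  <- lm(log(y) ~ log(x))    # log-linear: the slope is an elasticity
m_linlog  <- lm(y ~ log(x))         # semi-log: change in y due to a % change in x
m_loglin  <- lm(log(y) ~ x)         # semi-log: % change in y due to a change in x
m_inverse <- lm(y ~ I(1 / x))       # inverse / reciprocal
m_poly    <- lm(y ~ x + I(x^2))     # polynomial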

Outline: 2. Leverage, outliers and influence

Leverage
A high leverage point i is one where x_i is far from the mean of X. These points can be identified using the so-called hat matrix, the matrix that puts a hat on y: ŷ = Xβ̂ = X(X′X)⁻¹X′y = Hy, the diagonal of which is a measure of leverage. (King, 2007)
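
In R the diagonal of H is available directly as hatvalues(); a sketch that also forms the hat matrix explicitly, using the built-in cars data purely for illustration:

m <- lm(dist ~ speed, data = cars)

h <- hatvalues(m)                                  # leverage: the diagonal of the hat matrix
X <- model.matrix(m)
H <- X %*% solve(t(X) %*% X) %*% t(X)              # H = X (X'X)^-1 X'
all.equal(h, diag(H), check.attributes = FALSE)    # TRUE: the two agree

which(h > 2 * mean(h))                             # a common rough flag for high leverage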

Outliers
An outlier is a point that lies far from the regression line, i.e. one whose residual is large. To account for potential differences in the sampling variances of the residuals, we calculate externally studentized residuals (or studentized deleted residuals), where a large absolute value indicates an outlier. A test could be based on the fact that in a model without outliers, these should follow a t(n − k) distribution. (Kutner et al., 2005, 390-398)
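
Externally studentized residuals are returned by rstudent() in R; a sketch (m as in the leverage example above) that flags large values and computes a Bonferroni-adjusted p-value for the most extreme one:

t_i <- rstudent(m)                     # externally studentized (deleted) residuals
which(abs(t_i) > 2)                    # rough flag: |t| > 2 (or 3) is a common cutoff

n <- nobs(m); p <- length(coef(m))
# Bonferroni-adjusted p-value for the largest |t_i|; studentized deleted residuals
# follow a t distribution with n - p - 1 degrees of freedom (Kutner et al., 2005)
min(1, 2 * n * pt(max(abs(t_i)), df = n - p - 1, lower.tail = FALSE))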

Influence
An influential point is one which has a strong impact on the estimation of β̂. An influential point is one which has high leverage and is also an outlier. We typically look at Cook's Distance to assess the level of influence of each observation. (King, 2007)

A point with high leverage is located far from the other points. A high leverage point that strongly influences the regression line is called an influential point.

Outlier, low leverage, low influence [scatterplot figure]

High leverage, low influence [scatterplot figure]

High leverage, high influence [scatterplot figure]

Cook's Distance

D_i = Σ_{j=1}^{n} (ŷ_j − ŷ_{j(−i)})² / (k s²)
    = (e_i / (s √(1 − h_i)))² · h_i / (k (1 − h_i))
    = (t_i² / k) · var(ŷ_i) / var(e_i)
    = (β̂^OLS_{(−i)} − β̂^OLS)′ X′X (β̂^OLS_{(−i)} − β̂^OLS) / (k s²),

which can be compared against an F(k, n − k) distribution. The F-test here refers to whether β̂^OLS would be significantly different if observation i were to be removed (H₀: β = β_{(−i)}). (Cook, 1979, 168)

Cook's Distance

D_i = (t_i² / k) · var(ŷ_i) / var(e_i)

t_i² is a measure of the degree to which the ith observation can be considered an outlier from the assumed model. The ratio var(ŷ_i)/var(e_i) measures the relative sensitivity of the estimate β̂^OLS to potential outlying values at each data point. (Cook, 1977, 16)
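
In R, cooks.distance() computes D_i directly; a sketch (again with m from the leverage example) that also reproduces D_i from the internally studentized residual and the leverage, matching the second expression above:

D <- cooks.distance(m)

r <- rstandard(m)                          # e_i / (s * sqrt(1 - h_i))
h <- hatvalues(m)
k <- length(coef(m))
all.equal(D, r^2 * h / (k * (1 - h)))      # TRUE: matches the formula above

sort(D, decreasing = TRUE)[1:3]            # the most influential observations
which(D > 4 / nobs(m))                     # one common rule of thumb for flagging large D_i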

What to do with outliers? Options:
1. Ignore the problem
2. Investigate why the data are outliers: what makes them unusual?
3. Consider respecifying the model, either by transforming a variable or by including an additional variable (but beware of overfitting)
4. Consider a variant of robust regression that downweights outliers (a sketch follows below)
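
One such variant (my choice for illustration, not prescribed by the slides) is M-estimation with rlm() from the MASS package, which iteratively downweights observations with large residuals:

library(MASS)

m_ols <- lm(dist ~ speed, data = cars)
m_rob <- rlm(dist ~ speed, data = cars)            # Huber M-estimation by default

cbind(OLS = coef(m_ols), robust = coef(m_rob))     # compare the two sets of coefficients
head(sort(m_rob$w))                                # final weights; values below 1 were downweighted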

Diagnosing problems in R
A very easy set of diagnostic plots can be produced by plotting an lm object, using plot.lm(). This produces, in order:
1. residuals against fitted values
2. normal Q-Q plot
3. scale-location plot of √|standardized residuals| against fitted values
4. Cook's distances versus row labels
5. residuals against leverages
6. Cook's distances against leverage/(1 − leverage)
Note that by default, plot.lm() only gives you 1, 2, 3 and 5.
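
For example (the which argument selects plots by the numbering above; asking for 1:6 draws all six):

m <- lm(dist ~ speed, data = cars)

plot(m)                                  # default: plots 1, 2, 3 and 5

par(mfrow = c(2, 3))
plot(m, which = 1:6)                     # all six diagnostic plots on one page
par(mfrow = c(1, 1))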

Exercise
Open the uswages.dta data set and regress log(wage) on educ, exper and race. Check for leverage, outliers, influential points and non-linearities.
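
A possible starting point for the exercise (assuming uswages.dta contains the variables named above; the cutoffs used for flagging are rough rules of thumb, not part of the slides):

library(haven)

uswages <- read_dta("uswages.dta")
m <- lm(log(wage) ~ educ + exper + race, data = uswages)

which(hatvalues(m) > 2 * mean(hatvalues(m)))    # high leverage points
which(abs(rstudent(m)) > 3)                     # candidate outliers
which(cooks.distance(m) > 4 / nobs(m))          # candidate influential points
par(mfrow = c(2, 2)); plot(m)                   # default diagnostic plots
plot(uswages$educ, residuals(m))                # curvature here would suggest non-linearity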

Outline: 3. Multicollinearity

Collinearity
When some variables are linear combinations of others, we have exact (or perfect) collinearity, and there is no unique least squares estimate of β. When X variables are highly correlated, we have multicollinearity.
Detecting multicollinearity:
- look at the correlation matrix of the predictors for pairwise correlations
- regress each independent variable on all other independent variables to produce R²_j, and look for high values (close to 1.0)
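
Both checks are easy to script; a sketch on simulated predictors, with x1 and x2 deliberately made nearly collinear:

set.seed(2)
X <- data.frame(x1 = rnorm(100), x3 = rnorm(100))
X$x2 <- X$x1 + rnorm(100, sd = 0.1)        # x2 is almost a copy of x1

round(cor(X), 2)                           # pairwise correlations among the predictors

# auxiliary regressions: R^2_j from regressing each predictor on all the others
r2 <- sapply(names(X), function(v)
  summary(lm(reformulate(setdiff(names(X), v), response = v), data = X))$r.squared)
r2                                         # values close to 1.0 indicate multicollinearity
1 / (1 - r2)                               # the corresponding variance inflation factors (see below)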

The extent to which multicollinearity is a problem is debatable. The issue is comparable to that of sample size: if n is too small, we have difficulty picking up effects even if they really exist; the same holds for variables that are highly multicollinear, making it difficult to separate their effects on y.

However, there are some problems with high multicollinearity:
- small changes in the data can lead to large changes in estimates
- high standard errors but joint significance
- coefficients may have the wrong sign or implausible magnitudes
(Greene, 2002, 57)

Variance of β̂^OLS

var(β̂_k^OLS) = σ² / [(1 − R_k²) Σ_{i=1}^{n} (x_ik − x̄_k)²]

- σ²: all else equal, the better the fit, the lower the variance
- (1 − R_k²): all else equal, the lower the R² from regressing the kth independent variable on all other independent variables, the lower the variance

(Greene, 2002, 57)

Variance Inflation Factor

var(β̂_k^OLS) = σ² / [(1 − R_k²) Σ_{i=1}^{n} (x_ik − x̄_k)²]

VIF_k = 1 / (1 − R_k²), thus VIF_k shows the increase in var(β̂_k^OLS) due to the variable being collinear with other independent variables.

library(faraway)
vif(lm(...))

Multicollinearity: solutions
- Check for coding or logical mistakes (esp. in cases of perfect multicollinearity)
- Increase n
- Remove one of the collinear variables (apparently not adding much)
- Combine multiple variables into indices or underlying dimensions
- Formalise the relationship

Exercise
Using the demdev.dta data and the model polity2_i = β₀ + β₁ cwar_i + β₃ laggdppc_i + β₄ propdem_i + β₅ energy2_i + ε_i, check whether there are any multicollinearity problems.
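
A hedged sketch of one way to run the check (assuming demdev.dta contains columns with exactly these names):

library(haven)
library(faraway)

demdev <- read_dta("demdev.dta")
m <- lm(polity2 ~ cwar + laggdppc + propdem + energy2, data = demdev)

vars <- c("cwar", "laggdppc", "propdem", "energy2")
round(cor(demdev[vars], use = "pairwise.complete.obs"), 2)   # pairwise correlations
vif(m)                                                       # VIFs, as on the previous slide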

Outline: 4. Heteroscedasticity

Homoscedasticity [figure]

Heteroscedasticity
Regression disturbances whose variances are not constant across observations are heteroscedastic. Under heteroscedasticity, the OLS estimators remain unbiased and consistent, but are no longer BLUE or asymptotically efficient. (Thomas, 1985, 94)

Causes of heteroscedasticity
- More variation for larger sizes (e.g. profits of firms vary more for larger firms)
- More variation across different groups in the sample
- Learning effects in time series
- Variation in data collection quality (e.g. historical data)
- Turbulence after shocks in time series (e.g. financial markets)
- Omitted variable
- Wrong functional form
- Aggregation with varying sizes of populations
- etc.

Since OLS is no longer BLUE or asymptotically efficient: other linear unbiased estimators exist which have smaller sampling variances; other consistent estimators exist which collapse more quickly to the true values as n increases; and we can no longer trust hypothesis tests, because var(β̂^OLS) is biased:
- cov(x_i², σ_i²) > 0: var(β̂^OLS) is underestimated
- cov(x_i², σ_i²) = 0: no bias in var(β̂^OLS)
- cov(x_i², σ_i²) < 0: var(β̂^OLS) is overestimated (inefficient)
(Thomas, 1985, 94-95; Judge et al., 1985, 422)

Residual plots: heteroscedasticity
To detect heteroscedasticity (unequal variances), it is useful to plot:
- residuals against fitted values
- residuals against the dependent variable
- residuals against the independent variable(s)
Usually, the first one is sufficient to detect heteroscedasticity, and can simply be produced by:

m <- lm(y ~ x)
plot(m)
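
A sketch that produces all three plots for a deliberately heteroscedastic simulated example (the widening spread is usually clearest against the fitted values):

set.seed(3)
x <- runif(200, 0, 8)
y <- 1 + 3 * x + rnorm(200, sd = 0.5 + 0.5 * x)    # error variance grows with x
m <- lm(y ~ x)

par(mfrow = c(1, 3))
plot(fitted(m), residuals(m))                      # residuals against fitted values
plot(y, residuals(m))                              # residuals against the dependent variable
plot(x, residuals(m))                              # residuals against the independent variable
par(mfrow = c(1, 1))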

Residual plots: heteroscedasticity [figures: y against x; residuals(m) against fitted(m); residuals(m) against y; residuals(m) against x]

Residual plots: homoscedasticity [figures: y against x; residuals(m) against fitted(m); residuals(m) against y; residuals(m) against x]

Residual plots: heteroscedasticity [figures, second example: y against x; residuals(m) against fitted(m); residuals(m) against y; residuals(m) against x]

References
Cook, R. Dennis. 1977. "Detection of influential observation in linear regression." Technometrics: 15-18.
Cook, R. Dennis. 1979. "Influential observations in linear regression." Journal of the American Statistical Association 74(365): 169-174.
Glynn, Adam. 2007. "GOV 2000: Quantitative Methodology for Political Science I." Lecture slides, Harvard University.
Greene, William H. 2002. Econometric Analysis. London: Prentice Hall.
Judge, George G., William E. Griffiths, R. Carter Hill, Helmut Lütkepohl and Tsoung-Chao Lee. 1985. The Theory and Practice of Econometrics. New York: John Wiley and Sons.
King, Gary. 2007. "GOV 2000: Quantitative Methodology for Political Science I." Lecture slides, Harvard University.
Kutner, Michael H., Christopher J. Nachtsheim, John Neter and William Li. 2005. Applied Linear Statistical Models. 5th ed. McGraw-Hill.
Thomas, R. Leighton. 1985. Introductory Econometrics: Theory and Applications. Harlow, Essex: Longman.