Quantitative Methods I: Regression
University College Dublin
10 December 2014
Outline
1. Assumptions and errors
2. Outliers, leverage, and influence
3. Multicollinearity
4. Heteroscedasticity
Assumptions: specification
- Linear in parameters (i.e. f(Xβ) = Xβ and E[y] = Xβ)
- No extraneous variables in X
- No omitted independent variables
- Parameters to be estimated are constant
- Number of parameters is less than the number of cases, k < n
Assumptions: errors
- Errors have an expected value of zero, E[ε | X] = 0
- Errors are normally distributed, ε ~ N(0, σ²)
- Errors have a constant variance, Var(ε | X) = σ² < ∞
- Errors are not autocorrelated, Cov(ε_i, ε_j | X) = 0 for i ≠ j
- Errors and X are uncorrelated, Cov(X, ε) = 0
Assumptions: regressors
- X varies
- X is of full column rank (note: requires k < n)
- No measurement error in X
- No endogenous variables in X
Assumptions for unbiasedness
If the population regression model is linear in its parameters; the sample is a random sample from the population; there is no perfect collinearity, rank(X) = k; and the expected value of the error term is zero conditional on the explanatory variables, E(ε | X) = 0 and Cov(ε, X) = 0, then the OLS estimators of β are unbiased. (Glynn, 2007)
Non-constant error variance
Consequences:
- σ̂² is a biased estimator of σ²
- the probability of a Type I error will not be α
- the least squares estimator is no longer the best linear unbiased estimator; the severity depends on the level of heteroscedasticity
but:
- β̂ is still an unbiased estimator of β
(King, 2007)
Non-normal errors
Consequences:
- the sampling distribution of β̂ is not normal
- test statistics will not have t- and F-distributions
- the probability of a Type I error will not be α
but the estimates are still consistent: as n increases, the above problems disappear. (King, 2007)
Exercise: film reviews
Open the films.dta data set. Create a new variable highrating, which is 1 for films rated 3 or higher, 0 otherwise. Using matrix formulas,
1. regress desclength on a constant
2. regress desclength on castsize
3. regress desclength on castsize, highrating, length
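A minimal sketch of the matrix route, assuming films.dta can be read with the foreign package and that the raw rating variable is called rating (a hypothetical name; adjust to the actual column):

    library(foreign)                       # read.dta() for Stata files

    films <- read.dta("films.dta")
    films$highrating <- ifelse(films$rating >= 3, 1, 0)   # rating: assumed name

    # OLS by matrix formula: beta-hat = (X'X)^{-1} X'y
    ols <- function(X, y) solve(t(X) %*% X) %*% t(X) %*% y

    y <- films$desclength
    ols(matrix(1, nrow = length(y)), y)                    # 1. constant only (= mean of y)
    ols(cbind(1, films$castsize), y)                       # 2. constant + castsize
    X <- cbind(1, films$castsize, films$highrating, films$length)
    ols(X, y)                                              # 3. full model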
Exercise: film reviews
Based on the last regression:
1. Which observation has the largest residual?
2. Compute the mean and median of the residuals
3. Compute the correlation between the residuals and the fitted values
4. Compute the correlation between the residuals and length
5. All other predictors held constant, what would be the difference in predicted description length between high and low rated movies?
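Continuing the sketch above, everything follows from the residual vector e = y − Xβ̂:

    b <- ols(X, y)
    e <- y - X %*% b                 # residuals
    which.max(abs(e))                # 1. observation with the largest residual
    mean(e); median(e)               # 2. mean is ~0 by construction
    cor(e, X %*% b)                  # 3. ~0: residuals are orthogonal to fitted values
    cor(e, films$length)             # 4. ~0: residuals are orthogonal to included regressors
    b[3]                             # 5. coefficient on highrating = difference in
                                     #    predicted desclength, other predictors fixed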
Non-linearity
If there is non-linearity in the variables, but not in the parameters, there is no problem. E.g.
y_i = β₀ + β₁ x_i + β₂ x_i² + ε_i
can be estimated with OLS.
If there are other non-linearities, sometimes the equation can be transformed. E.g.
y_i = β₀ x_i^β₁ ε_i
log(y_i) = log(β₀) + β₁ log(x_i) + log(ε_i)
y_i* = β₀* + β₁ x_i* + ε_i*
Functional forms for additional non-linear transformations
- log-linear: as with the previous example
- semi-log, which has two forms:
  y_i = β₀ + β₁ log(x_i), where β₁ is the Δy due to a %Δx
  log(y_i) = β₀ + β₁ x_i, where β₁ is the %Δy due to a Δx
- inverse or reciprocal: y_i = β₀ + β₁ (1/x_i)
- polynomial: y_i = β₀ + β₁ x_i + β₂ x_i²
(each form is fitted in the sketch below)
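All of these remain linear in the parameters, so lm() fits them directly; a sketch with hypothetical variables y and x:

    lm(y ~ log(x))             # semi-log: change in y per % change in x
    lm(log(y) ~ x)             # semi-log: % change in y per unit change in x
    lm(log(y) ~ log(x))        # log-linear
    lm(y ~ I(1/x))             # inverse/reciprocal
    lm(y ~ x + I(x^2))         # polynomial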
Outline
1. Assumptions and errors
2. Outliers, leverage, and influence
3. Multicollinearity
4. Heteroscedasticity
Leverage
A high leverage point i is one where x_i is far from the mean of X. These points can be identified using the so-called hat matrix, the matrix that puts a hat on y:
ŷ = Xβ̂ = X(X′X)⁻¹X′y = Hy,
the diagonal of which is a measure of leverage. (King, 2007)
Outliers
An outlier is a point that lies far from the regression line, i.e. one with a large residual.
To account for the differences in the sampling variances of the residuals, we calculate externally studentized residuals (or studentized deleted residuals), where a large absolute value indicates an outlier. A test could be based on the fact that in a model without outliers, these should follow a t(n − k) distribution. (Kutner et al., 2005, 390–398)
Influence
An influential point is one which has a strong impact on the estimation of β̂: one which has high leverage and is also an outlier. We typically look at Cook's Distance to assess the level of influence of each observation. (King, 2007)
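All three quantities have standard extractors in R; a sketch on a built-in dataset, using common rules of thumb for flagging cases (the cutoffs are conventions, not part of the slides):

    m <- lm(dist ~ speed, data = cars)   # any fitted lm object

    lev  <- hatvalues(m)                 # leverage: diagonal of the hat matrix
    stud <- rstudent(m)                  # externally studentized residuals
    cook <- cooks.distance(m)            # Cook's Distance

    which(lev > 2 * mean(lev))           # high leverage
    which(abs(stud) > 2)                 # potential outliers
    which(cook > 4 / length(cook))       # influential points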
A point with high leverage is located far from the other points. A high leverage point that strongly influences the regression line is called an influential point.
Outlier, low leverage, low influence
[scatterplot of y against x]

High leverage, low influence
[scatterplot of y against x]

High leverage, high influence
[scatterplot of y against x]
Cook's Distance
D_i = Σ_{j=1}^{n} (ŷ_j − ŷ_{j(−i)})² / (k s²)
    = (e_i / (s √(1 − h_i)))² · h_i / (k (1 − h_i))
    = (t_i² / k) · var(ŷ_i) / var(e_i)
    = (β̂^OLS_{(−i)} − β̂^OLS)′ X′X (β̂^OLS_{(−i)} − β̂^OLS) / (k s²)
    ~ F(k, n − k)
The F-test here refers to whether β̂^OLS would be significantly different if observation i were to be removed (H₀: β = β_{(−i)}). (Cook, 1979, 168)
Cook's Distance
D_i = (t_i² / k) · var(ŷ_i) / var(e_i)
t_i² is a measure of the degree to which the ith observation can be considered as an outlier from the assumed model. The ratios var(ŷ_i)/var(e_i) measure the relative sensitivity of the estimate, β̂^OLS, to potential outlying values at each data point. (Cook, 1977, 16)
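The algebraic equivalence can be checked numerically; a sketch using the internally studentized residuals t_i = e_i/(s√(1 − h_i)):

    m <- lm(dist ~ speed, data = cars)
    k <- length(coef(m))                 # number of parameters
    h <- hatvalues(m)
    r <- rstandard(m)                    # e_i / (s * sqrt(1 - h_i))

    D <- r^2 * h / (k * (1 - h))         # D_i = (t_i^2 / k) * h_i / (1 - h_i)
    all.equal(D, cooks.distance(m))      # TRUE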
What to do with outliers?
Options:
1. Ignore the problem
2. Investigate why the data are outliers: what makes them unusual?
3. Consider respecifying the model, either by transforming a variable or by including an additional variable (but beware of overfitting)
4. Consider a variant of robust regression that downweights outliers
Diagnosing problems in R
A very easy set of diagnostic plots can be accessed by plotting an lm object, using plot.lm(). This produces, in order:
1. residuals against fitted values
2. Normal Q-Q plot
3. scale-location plot of √|e_i| against fitted values
4. Cook's distances versus row labels
5. residuals against leverages
6. Cook's distances against leverage/(1 − leverage)
Note that by default, plot.lm() only gives you 1, 2, 3 and 5 (see below for how to request all six).
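To get all six panels:

    m <- lm(dist ~ speed, data = cars)   # any fitted lm object
    par(mfrow = c(2, 3))                 # lay the six panels out in a grid
    plot(m, which = 1:6)                 # default is which = c(1, 2, 3, 5)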
Exercise
Open the uswages.dta data set and regress log(wage) on educ, exper and race. Check for leverage, outliers, influential points and non-linearities.
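A starting point, assuming the variable names above and the foreign package for reading the file:

    library(foreign)
    uswages <- read.dta("uswages.dta")

    m <- lm(log(wage) ~ educ + exper + race, data = uswages)
    par(mfrow = c(2, 3))
    plot(m, which = 1:6)                 # leverage, outliers, influence
    plot(m$model$exper, residuals(m))    # a pattern here suggests non-linearity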
Outline
1. Assumptions and errors
2. Outliers, leverage, and influence
3. Multicollinearity
4. Heteroscedasticity
Collinearity
When some variables are linear combinations of others then we have exact (or perfect) collinearity, and there is no unique least squares estimate of β. When X variables are highly correlated, we have multicollinearity.
Detecting multicollinearity (both checks are sketched below):
- look at the correlation matrix of the predictors for pairwise correlations
- regress each independent variable on all other independent variables to produce R²_j, and look for high values (close to 1.0)
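Both checks in R, with a hypothetical data frame df holding predictors x1, x2 and x3:

    cor(df[, c("x1", "x2", "x3")])                      # pairwise correlations

    # auxiliary regression: R^2_j from regressing x1 on the remaining predictors
    summary(lm(x1 ~ x2 + x3, data = df))$r.squared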
The extent to which multicollinearity is a problem is debatable. The issue is comparable to that of sample size: if n is too small, we have difficulty picking up effects even if they really exist; the same holds for variables that are highly multicollinear, making it difficult to separate their effects on y.
However, there are some problems with high multicollinearity:
- small changes in the data can lead to large changes in estimates
- high standard errors on individual coefficients, but joint significance
- coefficients may have the wrong sign or implausible magnitudes
(Greene, 2002, 57)
Variance of β̂^OLS
var(β̂_k^OLS) = σ² / [(1 − R²_k) Σ_{i=1}^{n} (x_ik − x̄_k)²]
- σ²: all else equal, the better the fit, the lower the variance
- (1 − R²_k): all else equal, the lower the R² from regressing the kth independent variable on all other independent variables, the lower the variance
(Greene, 2002, 57)
Variance Inflation Factor
var(β̂_k^OLS) = σ² / [(1 − R²_k) Σ_{i=1}^{n} (x_ik − x̄_k)²]
VIF_k = 1 / (1 − R²_k),
thus VIF_k shows the increase in var(β̂_k^OLS) due to the variable being collinear with other independent variables.

    library(faraway)
    vif(lm(...))
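A sketch of both routes to the VIF, reusing the hypothetical df from above with an outcome y:

    library(faraway)

    m <- lm(y ~ x1 + x2 + x3, data = df)
    vif(m)                                  # one VIF per predictor

    # manual check for x1: VIF = 1 / (1 - R^2_j)
    r2 <- summary(lm(x1 ~ x2 + x3, data = df))$r.squared
    1 / (1 - r2)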
Multicollinearity: solutions
- Check for coding or logical mistakes (esp. in cases of perfect multicollinearity)
- Increase n
- Remove one of the collinear variables (apparently not adding much)
- Combine multiple variables in indices or underlying dimensions
- Formalise the relationship
Exercise
Using the demdev.dta data and the model
polity2_i = β₀ + β₁ cwar_i + β₂ laggdppc_i + β₃ propdem_i + β₄ energy2_i + ε_i,
check whether there are any multicollinearity problems.
Outline
1. Assumptions and errors
2. Outliers, leverage, and influence
3. Multicollinearity
4. Heteroscedasticity
Homoscedasticity
[figure]
Regression disturbances whose variances are not constant across observations are heteroscedastic. Under heteroscedasticity, the OLS estimators remain unbiased and consistent, but are no longer BLUE or asymptotically efficient. (Thomas, 1985, 94)
Causes of heteroscedasticity
- More variation for larger sizes (e.g. profits vary more across larger firms)
- More variation across different groups in the sample
- Learning effects in time series
- Variation in data collection quality (e.g. historical data)
- Turbulence after shocks in time series (e.g. financial markets)
- Omitted variables
- Wrong functional form
- Aggregation with varying sizes of populations
- etc.
Since OLS is no longer BLUE or asymptotically efficient, other linear unbiased estimators exist which have smaller sampling variances; other consistent estimators exist which collapse more quickly to the true values as n increases; and we can no longer trust hypothesis tests, because var(β̂^OLS) is biased:
- cov(x_i², σ_i²) > 0: var(β̂^OLS) is underestimated
- cov(x_i², σ_i²) = 0: no bias in var(β̂^OLS)
- cov(x_i², σ_i²) < 0: var(β̂^OLS) is overestimated (inefficient)
(Thomas, 1985, 94–95; Judge et al., 1985, 422)
Residual plots: heteroscedasticity
To detect heteroscedasticity (unequal variances), it is useful to plot:
- residuals against fitted values
- residuals against the dependent variable
- residuals against the independent variable(s)
Usually, the first one is sufficient to detect heteroscedasticity, and can simply be obtained by:

    m <- lm(y ~ x)
    plot(m)
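A quick simulation (hypothetical data) showing the fan shape that heteroscedastic errors produce in that first plot:

    set.seed(1)
    x <- runif(200, 0, 10)
    y <- 1 + 2 * x + rnorm(200, sd = 0.5 * x)   # error variance grows with x

    m <- lm(y ~ x)
    plot(fitted(m), residuals(m))               # spread widens with fitted values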
Residual plots: heteroscedasticity
[four scatterplots: y against x; residuals(m) against fitted(m); residuals(m) against y; residuals(m) against x]

Residual plots: homoscedasticity
[four scatterplots: y against x; residuals(m) against fitted(m); residuals(m) against y; residuals(m) against x]

Residual plots: heteroscedasticity
[four scatterplots: y against x; residuals(m) against fitted(m); residuals(m) against y; residuals(m) against x]
References

Cook, R. Dennis. 1977. Detection of influential observation in linear regression. Technometrics 19(1):15–18.
Cook, R. Dennis. 1979. Influential observations in linear regression. Journal of the American Statistical Association 74(365):169–174.
Glynn, Adam. 2007. GOV 2000: Quantitative Methodology for Political Science I. Lecture slides, Harvard University.
Greene, William H. 2002. Econometric Analysis. London: Prentice Hall.
Judge, George G., William E. Griffiths, R. Carter Hill, Helmut Lütkepohl and Tsoung-Chao Lee. 1985. The Theory and Practice of Econometrics. New York: John Wiley and Sons.
King, Gary. 2007. GOV 2000: Quantitative Methodology for Political Science I. Lecture slides, Harvard University.
Kutner, Michael H., Christopher J. Nachtsheim, John Neter and William Li. 2005. Applied Linear Statistical Models. 5th ed. McGraw-Hill.
Thomas, R. Leighton. 1985. Introductory Econometrics: Theory and Applications. Harlow, Essex: Longman.