Regression diagnostics. Leiden University, Leiden, 30 April 2018.
Outline
1. Error assumptions: Introduction, Variance, Normality
2. Residual vs error, Outliers, Influential observations
Introduction: errors and residuals. Assumption on the errors: $\varepsilon_i \sim N(0, \sigma^2)$, $i = 1, \ldots, n$. How to check? Examine the residuals $\hat\varepsilon_i$. If the error assumption holds, the $\hat\varepsilon_i$ will look like a sample generated from a normal distribution.
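As a warm-up sketch (our own illustration, not from the slides): simulate from a correctly specified linear model and check that the residuals indeed resemble an i.i.d. normal sample.
> set.seed(1)
> x <- runif(100)
> y <- 1 + 2*x + rnorm(100, sd=0.5)   # model assumptions hold by construction
> fit <- lm(y~x)
> plot(fitted(fit), residuals(fit), xlab="fitted", ylab="residuals")
> abline(h=0)                         # residuals scatter evenly around zero
> qqnorm(residuals(fit)); qqline(residuals(fit))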
Variance: mean zero and constant variance. Diagnostic plot: fitted values $\hat Y_i$ versus residuals $\hat\varepsilon_i$. Illustration: savings data on 50 countries from 1960 to 1970. Linear regression; covariates: per capita disposable income, percentage of population under 15, etc.
Variance: R code.
> library(faraway)
> data(savings)
> g<-lm(sr~pop15+pop75+dpi+ddpi,savings)
> plot(fitted(g),residuals(g),xlab="fitted",
+ ylab="residuals")
> abline(h=0)
Variance: plot. No significant evidence against constant variance. [Figure: residuals versus fitted values for the savings fit.]
Variance: constant variance, examples. [Figure: four panels of plot(1:50, rnorm(50)); the spread is constant across the index.]
Variance: constant variance, strong violation. [Figure: four panels of plot(1:50, (1:50) * rnorm(50)); the spread grows linearly with the index.]
Variance: constant variance, milder violation. [Figure: four panels of plot(1:50, sqrt(1:50) * rnorm(50)); the spread grows like the square root of the index.]
Variance: nonlinearity. [Figure: four panels of plot(1:50, cos((1:50) * pi/25) + rnorm(50)); a systematic cosine trend in the mean, with constant spread.]
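These artificial patterns can be reproduced with the sketch below (the expressions are taken from the panel labels above; the seed and layout are our own choices):
> set.seed(1)
> par(mfrow=c(2,2))
> plot(1:50, rnorm(50))                    # constant variance
> plot(1:50, (1:50)*rnorm(50))             # strong violation
> plot(1:50, sqrt(1:50)*rnorm(50))         # milder violation
> plot(1:50, cos((1:50)*pi/25)+rnorm(50))  # nonlinearity
> par(mfrow=c(1,1))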
Variance: predictors versus residuals. Another diagnostic tool: plot the predictors $X_{ij}$ versus the residuals $\hat\varepsilon_i$.
> plot(savings$pop15,residuals(g),
+ xlab="population under 15",
+ ylab="residuals")
Variance: plot. [Figure: residuals versus population under 15; the countries split into two groups.]
Variance: variance test. Two groups can be identified in the plot. Test the null hypothesis that the ratio of the two group variances is equal to 1. Only the p-value is displayed on this slide.
> var.test(residuals(g)[savings$pop15>35],
+ residuals(g)[savings$pop15<35])$p.value
[1] 0.01357595
Variance: dealing with nonconstant variance. Transforming the responses $Y_i$ through a function $h$ into $h(Y_i)$ is a possible way to deal with nonconstant variance. Two choices that often work: $h(y) = \log y$ and $h(y) = \sqrt{y}$. General method: the Box-Cox transformation. Works well, but not always. Upon transforming the response, what do the parameters mean?
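A minimal Box-Cox sketch (our own addition; boxcox comes from the standard MASS package, and the response must be positive):
> library(MASS)
> bc <- boxcox(g, lambda=seq(-2,2,0.1))  # profile log-likelihood over lambda
> bc$x[which.max(bc$y)]                  # lambda maximising the profile likelihood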
Variance: Galapagos tortoise example I. [Figure.]
Variance: Galapagos tortoise example II.
> data(gala)
> gg<-lm(Species~Area+Elevation+Scruz+Nearest
+ +Adjacent,gala)
> plot(fitted(gg),residuals(gg),xlab="fitted",
+ ylab="residuals")
[Figure: residuals versus fitted values; the spread of the residuals increases with the fitted values.]
Variance: fixing the problem.
> gs<-lm(sqrt(Species)~Area+Elevation+Scruz+Nearest
+ +Adjacent,gala)
> plot(fitted(gs),residuals(gs),xlab="fitted",
+ ylab="residuals")
[Figure: residuals versus fitted values after the square-root transformation; the spread is now roughly constant.]
Normality: checking normality. Assume the constant variance assumption is fine; what about normality? Check with a QQ-plot, a histogram and normality tests based on the residuals.
Normality: savings data example, QQ-plot.
> qqnorm(residuals(g))
> qqline(residuals(g))
[Figure: normal QQ-plot of the residuals; the points follow the reference line closely.]
Normality: savings data example, histogram. Usual warning: the histogram is sensitive to bin width and placement.
> hist(residuals(g))
[Figure: histogram of residuals(g).]
Normality: savings data example, Shapiro-Wilk test.
> shapiro.test(residuals(g))

Shapiro-Wilk normality test

data: residuals(g)
W = 0.987, p-value = 0.8524

No evidence against normality found. Usual warning: the test can be unreliable for small sample sizes, while for large sample sizes even mild deviations from normality will be detected; but is the deviation then large enough to matter? Use only in conjunction with a QQ-plot.
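A sketch of the large-sample warning (our own illustration): a mildly heavy-tailed sample whose QQ-plot looks nearly straight will typically still be rejected at n = 5000.
> set.seed(2)
> x <- rt(5000, df=10)      # mildly heavy-tailed, close to normal in practice
> shapiro.test(x)$p.value   # typically very small despite the mild deviation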
Residual vs error: leverage. Errors ($\varepsilon_i$) and residuals ($\hat\varepsilon_i$) are not the same. Recall that $H = X(X^TX)^{-1}X^T$ and therefore
$$\hat\varepsilon = Y - \hat Y = (I-H)Y = (I-H)X\beta + (I-H)\varepsilon = (I-H)\varepsilon,$$
since $(I-H)X = 0$. Moreover, $\mathrm{Var}(\hat\varepsilon) = \mathrm{Var}[(I-H)\varepsilon] = (I-H)\sigma^2$ (assuming independent noise with variance $\sigma^2$).
Residual vs error: leverage. The $h_i = H_{ii}$ are called leverages. Variance of the residuals: $\mathrm{Var}[\hat\varepsilon_i] = \sigma^2(1-h_i)$. If $h_i$ is large, $\mathrm{Var}[\hat\varepsilon_i]$ is small and the fitted line is forced to stay close to $Y_i$. Large values of $h_i$ are due to extreme values in $X$. One has $\sum_i h_i = p$, so on average $h_i$ is $p/n$, and a rule of thumb is to examine leverages larger than $2p/n$. A high-leverage point is unusual in the predictor space and has the potential to influence the LS fit.
Residual vs error: savings data example. The code below computes the leverages for the savings data (only part of the output is displayed).
> ginf<-lm.influence(g)
> ginf$hat[1:3]
Australia Austria Belgium
0.06771343 0.12038393 0.08748248
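A sketch of the $2p/n$ rule of thumb applied to the same fit (our own addition):
> h <- ginf$hat
> p <- length(coef(g)); n <- nrow(savings)
> h[h > 2*p/n]   # countries with unusually high leverage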
Residual vs error Leverages and residuals
Residual vs error, leverages: visualisation. Leverages can be visualised through a half-normal plot. Unlike with the QQ-plot we are not looking for a straight-line relationship, but for unusually large values.
> countries<-row.names(savings)
> halfnorm(lm.influence(g)$hat,labs=countries,
+ ylab="leverages")
Residual vs error: half-normal plot. [Figure: half-normal plot of the leverages; Libya and the United States stand out.]
Residual vs error, aside: studentised residuals. Since $\mathrm{Var}[\hat\varepsilon_i] = \sigma^2(1-h_i)$, instead of the raw residuals we can use studentised residuals for diagnostics:
$$r_i = \frac{\hat\varepsilon_i}{\hat\sigma\sqrt{1-h_i}}.$$
Studentisation corrects only for nonconstant variance among the residuals (assuming the errors have constant variance); for nonconstant variance among the errors studentisation does not help. Using studentised residuals does not lead to much different conclusions unless there are observations with unusually high leverage.
Residual vs error, studentised residuals: illustration.
> stud<-rstandard(g)
> qqnorm(stud)
> qqline(stud)
Residual vs error: plot. [Figure: normal QQ-plot of the studentised residuals.]
Outliers: plot. [Figure: illustration of an outlier.]
Outliers: outlier. An outlier is a point that does not fit the current model. Outliers may badly affect the fit, so finding them is important. Statistic:
$$T_i = r_i \left( \frac{n-p-1}{n-p-r_i^2} \right)^{1/2}.$$
If the model assumptions are correct, $T_i \sim t_{n-p-1}$, and this can be used to construct a hypothesis test that the $i$th data point is an outlier. Even though we explicitly test only one or two unusual cases, implicitly we are testing all of them, and hence we need to adjust the level $\alpha$. Recall the Bonferroni method: test each case at level $\alpha/n$.
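The $T_i$ are what rstudent computes in R; a quick sketch (our own check) of the identity linking them to the studentised residuals from rstandard:
> r <- rstandard(g); n <- 50; p <- 5
> all.equal(unname(rstudent(g)),
+ unname(r*sqrt((n-p-1)/(n-p-r^2))))  # should return TRUE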
Outliers: savings data example.
> jack<-rstudent(g)
> jack[which.max(abs(jack))]
Zambia
2.853558
> qt(0.025/(50),44)
[1] -3.525801
Since 2.85 < 3.53, Zambia is not declared an outlier at the Bonferroni-corrected 5% level.
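Equivalently, one can Bonferroni-adjust the two-sided p-values of all cases at once (our own sketch):
> pvals <- 2*pt(-abs(jack), df=44)                 # df = n - p - 1 = 50 - 5 - 1
> sort(p.adjust(pvals, method="bonferroni"))[1:3]  # smallest adjusted p-values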
Outliers: remarks. Several outliers next to each other may mask one another. If you transform your model, outliers in the original model will not necessarily be outliers in the transformed model, and vice versa. Individual outliers are typically not a big problem in large datasets; clusters of outliers are. Do not remove outliers mechanically: use subject-matter knowledge (astronomical knowledge in the example below) to understand what is going on and why. Always report removal of outliers in your papers.
Outliers: astronomical example. The data are the log surface temperature versus the log light intensity of 47 stars in the star cluster CYG OB1 (in the direction of Cygnus).
Outliers: data plot.
> data(star)
> plot(star$temp,star$light,xlab="log(temperature)",
+ ylab="log(light intensity)")
[Figure: log light intensity against log temperature; the giant stars sit at low temperature but high light intensity, away from the main sequence.]
Outliers: least squares fit.
> ga<-lm(light~temp,star)
> plot(star$temp,star$light,xlab="log(temperature)",
+ ylab="log(light intensity)")
> abline(ga)
[Figure: the LS line is dragged towards the giants and slopes downward.]
Outliers: giants excluded.
> gaa<-lm(light~temp,star,subset=(temp>3.6))
> plot(star$temp,star$light,xlab="log(temperature)",
+ ylab="log(light intensity)")
> abline(gaa)
[Figure: with the giants excluded, the fitted line follows the main sequence with a positive slope.]
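A one-line comparison of the two slopes (our own sketch; with the giants included the LS slope even turns negative, without them it is positive):
> coef(ga)["temp"]; coef(gaa)["temp"]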
Influential observations: Cook statistic. An influential observation is one whose removal from the dataset causes a large change in the fit. An influential observation may or may not be an outlier, and may or may not have large leverage, but typically it is at least one of these. Cook statistic:
$$D_i = \frac{r_i^2}{p} \cdot \frac{h_i}{1-h_i}.$$
A half-normal plot can be used to identify influential observations.
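The same quantity is available as cooks.distance in R; a quick sketch (our own check) against the formula above:
> r <- rstandard(g); h <- lm.influence(g)$hat; p <- 5
> all.equal(unname(cooks.distance(g)),
+ unname(r^2/p * h/(1-h)))  # should return TRUE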
Influential observations: savings data example.
> cook<-cooks.distance(g)
> halfnorm(cook,3,labs=countries,ylab="cooks distances")
[Figure: half-normal plot of the Cook's distances; Zambia, Japan and Libya are the three labelled points, with Libya the most extreme.]
Influential observations: Libya included.
Influential observations: Libya excluded. We notice in particular that the ddpi parameter estimate changed by about 50%. Libya seems to be influential, and this is in accord with what the Cook statistic told us.
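A sketch of the refit without Libya (our own reconstruction; countries was defined earlier as the row names of savings):
> gl <- lm(sr~pop15+pop75+dpi+ddpi, savings,
+ subset=(countries!="Libya"))
> summary(gl)$coef   # compare the ddpi estimate with coef(g)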
Influential observations: summary. After fitting a model, always perform diagnostics. Try to fix problems; don't be shy about refitting the model. There is more to diagnostics than we could cover here.