Regression diagnostics


Regression diagnostics. Botond Szabo, Leiden University, 30 April 2018.

Outline. 1. Error assumptions: introduction, variance, normality. 2. Residual vs error; outliers; influential observations.

Introduction: errors and residuals. Assumption on the errors: ε_i ~ N(0, σ²), i = 1, ..., n. How to check? Examine the residuals ε̂_i. If the error assumption holds, the ε̂_i will look like a sample generated from the normal distribution.

Variance: mean zero and constant variance. Diagnostic plot: fitted values Ŷ_i versus residuals ε̂_i. Illustration: savings data on 50 countries from 1960 to 1970. Linear regression; covariates: per capita disposable income, percentage of population under 15, etc.

Variance: R code.

> library(faraway)
> data(savings)
> g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
> plot(fitted(g), residuals(g), xlab="fitted",
+      ylab="residuals")
> abline(h=0)

Variance: plot. [Residuals versus fitted values for the savings model.] No significant evidence against constant variance.

Variance: constant variance, examples. [Four panels of rnorm(50) plotted against the index 1:50; the spread stays constant across the index.]

Variance: constant variance, strong violation. [Four panels of (1:50) * rnorm(50) plotted against the index 1:50; the spread grows linearly with the index.]

Variance: constant variance, milder violation. [Four panels of sqrt(1:50) * rnorm(50) plotted against the index 1:50; the spread grows like the square root of the index.]

Variance: nonlinearity. [Four panels of cos((1:50) * pi/25) + rnorm(50) plotted against the index 1:50; a systematic cosine pattern is visible under the noise.]
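The four kinds of simulated patterns above are easy to reproduce. A minimal sketch (the seed, panel titles and layout are my own choices, not from the slides):

# Simulated residual patterns: constant variance, two violations, nonlinearity.
set.seed(1)
par(mfrow = c(2, 2))
plot(1:50, rnorm(50), main = "Constant variance")
plot(1:50, (1:50) * rnorm(50), main = "Strong violation")
plot(1:50, sqrt(1:50) * rnorm(50), main = "Milder violation")
plot(1:50, cos((1:50) * pi/25) + rnorm(50), main = "Nonlinearity")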

Variance: predictors versus residuals. Another diagnostic tool: plot the predictors X_ij against the residuals ε̂_i.

> plot(savings$pop15, residuals(g),
+      xlab="Population under 15",
+      ylab="Residuals")

Variance: plot. [Residuals versus population under 15.]

Variance: variance test. Two groups can be identified in the plot. Test the null hypothesis that the ratio of the two group variances is equal to 1. Only the p-value is displayed on this slide.

> var.test(residuals(g)[savings$pop15>35],
+          residuals(g)[savings$pop15<35])$p.value
[1] 0.01357595

Variance: dealing with nonconstant variance. Transforming the responses Y_i through a function h into h(Y_i) is a possible way to deal with nonconstant variance. Two choices that often work: h(y) = log y and h(y) = √y. General method: the Box-Cox transformation. Works well, but not always. And upon transforming the response, what do the parameters mean?
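As an illustration of the general method: the boxcox() function in the MASS package plots the profile log-likelihood for the Box-Cox parameter λ. Applying it to the savings model g is my own example, not from the slides:

library(MASS)  # provides boxcox()
# Profile log-likelihood for the Box-Cox parameter lambda, with a 95% interval;
# lambda = 1 means no transformation, 0 corresponds to log, 0.5 to sqrt.
bc <- boxcox(g, lambda = seq(-2, 2, by = 0.1))
bc$x[which.max(bc$y)]  # lambda maximising the profile log-likelihood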

Variance: Galapagos tortoise example I.

Variance: Galapagos tortoise example II.

> data(gala)
> gg <- lm(Species ~ Area + Elevation + Scruz + Nearest
+          + Adjacent, gala)
> plot(fitted(gg), residuals(gg), xlab="fitted",
+      ylab="residuals")

[Plot of residuals versus fitted values for the gala model.]

Variance: fixing the problem.

> gs <- lm(sqrt(Species) ~ Area + Elevation + Scruz + Nearest
+          + Adjacent, gala)
> plot(fitted(gs), residuals(gs), xlab="fitted",
+      ylab="residuals")

[Plot of residuals versus fitted values after the square-root transformation.]

Normality: checking normality. Assume the constant variance assumption is fine. What about normality? QQ-plot, histogram and normality tests based on the residuals.

Normality: savings data example, QQ-plot.

> qqnorm(residuals(g))
> qqline(residuals(g))

[Normal QQ-plot of the residuals.]

Normality: savings data example, histogram. Usual warning: a histogram is sensitive to bin width and placement.

> hist(residuals(g))

[Histogram of residuals(g).]

Normality: savings data example, Shapiro-Wilk test.

> shapiro.test(residuals(g))
        Shapiro-Wilk normality test
data:  residuals(g)
W = 0.987, p-value = 0.8524

No evidence against normality found. Usual warning: the test can be unreliable for small sample sizes, while for large sample sizes even mild deviations from normality will be detected; but is the effect then large enough to matter? Use only in conjunction with a QQ-plot.

Residual vs error: leverage. Errors (ε_i) and residuals (ε̂_i) are not the same. Recall that H = X(XᵀX)⁻¹Xᵀ, and therefore

ε̂ = Y − Ŷ = (I − H)Y = (I − H)Xβ + (I − H)ε = (I − H)ε,

since (I − H)X = 0. Hence V(ε̂) = V[(I − H)ε] = (I − H)σ² (assuming independent noise with variance σ²).

Residual vs error: leverage. The h_i = H_ii are called leverages. Variance of the residuals: V[ε̂_i] = σ²(1 − h_i). If h_i is large, V[ε̂_i] is small and the fitted line is forced to stay close to Y_i. Large values of h_i are due to extreme values in X. One has Σ_i h_i = p, so on average h_i is p/n, and a rule of thumb is to look at leverages larger than 2p/n. A high-leverage point is unusual in the predictor space and has the potential to influence the LS fit.

Residual vs error: savings data example. The code below computes the leverages for the savings data example (only part of the output is displayed).

> ginf <- lm.influence(g)
> ginf$hat[1:3]
 Australia    Austria    Belgium
0.06771343 0.12038393 0.08748248
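Base R's hatvalues() returns the same leverages; a small sketch of the 2p/n rule of thumb from the previous slide (my own illustration, not from the slides):

h <- hatvalues(g)        # identical to lm.influence(g)$hat
p <- length(coef(g))     # number of parameters: 5, including the intercept
n <- nrow(savings)       # 50 countries
sum(h)                   # equals p
names(h)[h > 2 * p / n]  # countries flagged by the rule of thumb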

Residual vs error: leverages and residuals.

Residual vs error: leverages, visualisation. Leverages can be visualised through a half-normal plot. Unlike with the QQ-plot, we are not looking for a straight-line relationship, but for unusual quantities.

> countries <- row.names(savings)
> halfnorm(lm.influence(g)$hat, labs=countries,
+          ylab="leverages")

Residual vs error: half-normal plot. [Half-normal plot of the leverages; Libya and the United States stand out.]

Residual vs error: aside, studentised residuals. V[ε̂_i] = σ²(1 − h_i), so instead of the raw residuals we can use the studentised residuals for diagnostics:

r_i = ε̂_i / (σ̂ √(1 − h_i)).

Studentisation corrects only for nonconstant variance among the residuals (assuming that the errors have constant variance); for nonconstant variance among the errors studentisation does not help. Using studentised residuals does not lead to much different conclusions, unless there is unusually high leverage.
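In R these studentised residuals are computed by rstandard(); a minimal check against the formula above (my own illustration):

sigma.hat <- summary(g)$sigma                             # estimate of sigma
r <- residuals(g) / (sigma.hat * sqrt(1 - hatvalues(g)))  # formula by hand
all.equal(r, rstandard(g))                                # TRUE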

Residual vs error: studentised residuals, illustration.

> stud <- rstandard(g)
> qqnorm(stud)
> qqline(stud)

Residual vs error: plot. [Normal QQ-plot of the studentised residuals.]

Outliers: plot illustrating an outlier.

Outliers: outlier. An outlier is a point that does not fit the current model. Outliers may badly affect the fit, so finding them is important. Statistic:

T_i = r_i ( (n − p − 1) / (n − p − r_i²) )^{1/2}.

If the model assumptions are correct, T_i ~ t_{n−p−1}, and this can be used to construct a hypothesis test that the i-th data point is an outlier. Even though we explicitly test only one or two unusual cases, implicitly we are testing all of them, and hence need to adjust the level α. Recall the Bonferroni method: test each case at level α/n.

Outliers: savings data example.

> jack <- rstudent(g)
> jack[which.max(abs(jack))]
  Zambia
2.853558
> qt(0.025/50, 44)
[1] -3.525801
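A minimal sketch of the full Bonferroni check over all 50 cases (my own illustration; the slide tests only the largest case):

n <- nrow(savings); p <- length(coef(g))
crit <- qt(1 - 0.05 / (2 * n), n - p - 1)  # two-sided Bonferroni cutoff, about 3.53
which(abs(rstudent(g)) > crit)             # empty: not even Zambia is declared
                                           # an outlier at level 0.05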

Outliers: remarks. Several outliers next to each other might mask each other. If you transform your model, outliers in the original model will not necessarily be outliers in the transformed model, and vice versa. Individual outliers are typically not a big problem in large datasets; clusters of outliers are. Do not remove outliers mechanically: use subject-matter knowledge (here, astronomy) to understand what is going on and why. Always report the removal of outliers in your papers.

Outliers: astronomical example. The data are the log surface temperature versus the log light intensity of 47 stars in the star cluster CYG OB1 (in the direction of Cygnus).

Outliers: data plot.

> data(star)
> plot(star$temp, star$light, xlab="log(temperature)",
+      ylab="log(light intensity)")

[Scatterplot of log light intensity against log temperature for the 47 stars.]

Outliers: least squares fit.

> ga <- lm(light ~ temp, star)
> plot(star$temp, star$light, xlab="log(temperature)",
+      ylab="log(light intensity)")
> abline(ga)

[The scatterplot with the least squares line added.]

Outliers: giants excluded.

> gaa <- lm(light ~ temp, star, subset=(temp>3.6))
> plot(star$temp, star$light, xlab="log(temperature)",
+      ylab="log(light intensity)")
> abline(gaa)

[The scatterplot with the line refitted using only the stars with temp > 3.6.]

Influential observations: Cook statistic. An influential observation is one whose removal from the dataset causes a large change in the fit. An influential observation may or may not be an outlier, and may or may not have large leverage, but typically it is at least one of these. Cook statistic:

D_i = (r_i² / p) · h_i / (1 − h_i).

A half-normal plot can be used to identify influential observations.

Influential observations: savings data example.

> cook <- cooks.distance(g)
> halfnorm(cook, 3, labs=countries, ylab="Cook's distances")

[Half-normal plot of the Cook's distances; Zambia, Japan and Libya stand out.]
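A minimal check that cooks.distance() agrees with the formula for D_i on the previous slide (my own illustration):

h <- hatvalues(g)
r <- rstandard(g)        # studentised residuals
p <- length(coef(g))
all.equal(cooks.distance(g), r^2 / p * h / (1 - h))  # TRUE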

Influential observations: Libya included. [Regression summary with Libya in the dataset.]

Influential observations: Libya excluded. We notice in particular that the ddpi parameter estimate changed by about 50%. Libya seems to be influential, and this is in accord with what the Cook statistics told us.
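A sketch of the refit behind these two slides (my own reconstruction; the slides display the full regression summaries):

# Refit without Libya and compare the coefficients; ddpi changes by about 50%.
gl <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings,
         subset = (row.names(savings) != "Libya"))
cbind(with_Libya = coef(g), without_Libya = coef(gl))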

Influential observations: summary. After fitting a model, always perform diagnostics. Try to fix the problems, and don't be shy about refitting the model. There is more to diagnostics than could be covered here.