Graphical Diagnosis (Mostly QQ and Leverage Plots). Paul E. Johnson, Department of Political Science


Graphical Diagnosis (Mostly QQ and Leverage Plots). Paul E. Johnson, Department of Political Science and Center for Research Methods and Data Analysis, University of Kansas.

Outline
1 Introduction
2 Start Simple: Scatterplot
3 Standard Diagnostic Plots
4 Decipher Each Type Of Plot: Top Left: Residual Plot; Top Right: Q-Q Plot; Bottom Left: Scale-Location Plot; Bottom Right: Leverage, Cook's Distance
5 Repeat with More Predictors: Multiple Regression Fitted; Termplot; Diagnostic Plots
6 Stress Test These Diagnostics: Bad Nonlinearity; Add Some Outliers; Heteroskedasticity
7 Practice Problems

Introduction

Did You Fit the Right Model?
You assumed:
1 Linearity: y_i = β0 + β1 x1_i + β2 x2_i + e_i
2 Assumptions about the error term e_i. Either e_i is Normal(0, σ²_e), or all of:
  1 Unbiased errors: E[e_i] = 0
  2 Homoskedasticity: E[e_i²] = σ²_e
  3 No autocorrelation between error terms: E[e_i e_j] = 0 for all i ≠ j
  4 No correlation between the x's and the error term: E[x_ji e_i] = 0 for all variables j and cases i
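
These assumptions are easiest to appreciate in a simulation where we control the truth. A minimal R sketch; the seed, sample size, and coefficients here are made up for illustration:

```r
# Simulate data that satisfies every assumption, then fit by OLS.
# All specific numbers (seed, N, coefficients) are hypothetical.
set.seed(42)
N  <- 1000
x1 <- runif(N)
x2 <- runif(N)
e  <- rnorm(N, mean = 0, sd = 2)   # Normal(0, sigma_e^2), homoskedastic
y  <- 1 + 2 * x1 - 3 * x2 + e
mod <- lm(y ~ x1 + x2)
coef(mod)                          # estimates land near (1, 2, -3)
cor(resid(mod), x1)                # OLS residuals are orthogonal to the x's
```

Note the last line is essentially zero by construction, whether or not the model is right; the interesting question is what the diagnostic plots show when the assumptions fail.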

The Rest of This Course Is Diagnostics and Remedies
Decide how inappropriate the results from the linear model and OLS might be. If inappropriate, there are 3 major options:
1 Choose a different formula for y_i or e_i (or both): a nonlinear model for y_i, or Weighted Least Squares (WLS) or Generalized Least Squares (GLS)
2 Keep the same formula but estimate in a different way: robust regression
3 Keep the same estimates but apply a post hoc correction (e.g., robust heteroskedasticity-consistent standard errors for the parameter estimates)

Start Simple: Scatterplot

Scatterplot: Two Numeric Variables
If there is only one predictor, the best diagnostic might be the simple scatterplot. Look for linearity and homoskedasticity. This is R's cars data set, one commonly used for illustrations. [Figure: Stopping Distance of Cars in the 1920s; Speed (mph) vs. Stopping Distance (feet).]

Superimpose a Regression Line
[Figure: the cars scatterplot with the OLS line superimposed.] A straight line may be OK.

Superimpose a Loess Line
[Figure: the cars scatterplot with both the OLS line and a loess curve.] Loess: locally weighted regression. It fits a regression model individually for each point in the data, putting less weight on far-away observations. The predicted values form a connect-the-dots line, smoothed graphically to look pleasant.
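
A figure like the one on the slide can be reproduced with base R's built-in cars data; a minimal sketch:

```r
# Scatterplot of the cars data with OLS and loess lines superimposed
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping Distance (feet)",
     main = "Stopping Distance of Cars in the 1920s")
abline(lm(dist ~ speed, data = cars))              # OLS straight line
lo  <- loess(dist ~ speed, data = cars)            # local regression fit
ord <- order(cars$speed)
lines(cars$speed[ord], fitted(lo)[ord], lty = 2)   # smoothed loess curve
```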

Plot Residuals Against X
[Figure: residuals for stopping distance vs. Speed (mph), with a loess fit to the residuals.] Evaluate subjectively (!) or (?). This gives hints about how the model might be redesigned to fit the data more accurately.
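
The residual-versus-X plot is a few lines in base R; a sketch with the cars data:

```r
# Residuals against the predictor, with a lowess smooth of the residuals
mod <- lm(dist ~ speed, data = cars)
plot(cars$speed, resid(mod),
     xlab = "Speed (mph)", ylab = "Stopping Residuals",
     main = "Residuals for Stopping Distance")
abline(h = 0, lty = 3)                        # reference line at zero
lines(lowess(cars$speed, resid(mod)))         # lowess fit to residuals
```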

Standard Diagnostic Plots

The R Diagnostic Plot Series
In R, one can fit a model and then ask for the standard diagnostic plots:

mod1 <- lm(output ~ x1 + x2 + x3 + x4 + x5, data = dat)
plot(mod1)

plot is a generic function; it does lots of different things, depending on what you give it.

[Figure: the 2x2 diagnostic plot matrix for the cars regression, lm(dist ~ speed): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours; cases 23, 35, 39, and 49 are flagged.]

Decipher Each Type Of Plot

Top Left: Residual Plot
Residuals with Loess Line
[Figure: Residuals vs Fitted for lm(dist ~ speed); cases 23, 35, and 49 are flagged.]

Top Right: Q-Q Plot
QQ Plot
[Figure: Normal Q-Q plot of standardized residuals for lm(dist ~ speed); cases 23, 35, and 49 are flagged.]

QQ Plot
Standardized (or "studentized") residuals. Standardization is intended to put the residuals onto the scale of their true variance (at a particular value of x_i). Proceed with the assumption that these residuals are drawn from N(0, 1). [Figure: Normal Q-Q plot for lm(dist ~ speed); cases 23, 35, and 49 are flagged.]

QQ: Compare Residuals Against the Normal(0,1) CDF
Recall that the Normal CDF tells us how likely each value is theoretically supposed to be. The QQ plot matches the theoretical distribution against the observed distribution. If all points are exactly on the line, then the observed distribution matches N(0,1). If points deviate above or below the line, we suspect the error is not normal. [Figure: Normal Q-Q plot for lm(dist ~ speed); cases 23, 35, and 49 are flagged.]
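
The Q-Q comparison can be rebuilt by hand from the standardized residuals; a sketch for the cars regression:

```r
# Observed vs theoretical N(0,1) quantiles of the standardized residuals
mod <- lm(dist ~ speed, data = cars)
r <- rstandard(mod)      # standardized residuals
qqnorm(r)                # plots sorted residuals against N(0,1) quantiles
qqline(r)                # reference line; large gaps suggest non-normality
sort(r)                  # the most extreme values sit in the far tails
```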

Bottom Left: Scale-Location Plot
Scale-Location
[Figure: Scale-Location plot for lm(dist ~ speed); cases 23, 35, and 49 are flagged.] This should be a homogeneous cloud, one that is not taller on one side than the other.

Bottom Right: Leverage, Cook's Distance
Review One At a Time
[Figure: Residuals vs Leverage plot for lm(dist ~ speed) with Cook's distance contours; cases 23, 39, and 49 are flagged.]

Leverage: Outlier Diagnostics
Leverage: a case-by-case estimate of a case's potential for influence on the predicted values (not just its own predicted value, but the predictions for other cases). Recall the predicted values are ŷ = {ŷ_1, ŷ_2, ..., ŷ_N}. It can be shown that each predicted value can be calculated as a linear combination of the observations y_i, like so:

ŷ_j = h_j1 y_1 + h_j2 y_2 + h_j3 y_3 + h_j4 y_4 + ... + h_j(N−1) y_(N−1) + h_jN y_N

(The prediction for the j-th case is a weighted sum of all observed y_i.) The coefficients h_ji come from a thing called the hat matrix. Intuition: in a perfect world, no observation exerts a huge influence on the predictions; the h_ji weights are all roughly the same (and will average out to 1/N).
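
The hat-matrix weights are available directly: hatvalues() returns the diagonal of the hat matrix. A quick sketch with the cars regression:

```r
# Leverage (hat) values for the cars regression
mod <- lm(dist ~ speed, data = cars)
h <- hatvalues(mod)    # diagonal of the hat matrix, one value per case
sum(h)                 # equals the number of coefficients (2 here)
mean(h)                # average leverage is p/N = 2/50
which.max(h)           # the case with the greatest potential for influence
```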

Cook's Distance
Cook's Distance: I interpret this as a weighted summary of one case's impact on the slope coefficient estimates. Fit the model with all of the cases, getting the regression slopes in a vector β̂ = (β̂0, β̂1, ..., β̂k). Exclude a row (case) j, re-estimate the regression, and calculate the leave-j-out estimates β̂(−j) = (β̂0(−j), β̂1(−j), ..., β̂k(−j)). Square the differences between β̂ and β̂(−j) and add them up, inserting a weighting formula that Cook proposed. The end result is interpreted as the change in the predicted values for all cases caused by case j. We will study this more later when we do regression diagnostics.
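
The leave-j-out description can be checked against R's cooks.distance(); a sketch, using case 49 (one of the cases flagged on the slides) from the cars regression:

```r
# Cook's distance by hand for one case, compared with cooks.distance()
mod <- lm(dist ~ speed, data = cars)
p  <- length(coef(mod))                       # number of coefficients
s2 <- summary(mod)$sigma^2                    # estimated error variance
mod49 <- lm(dist ~ speed, data = cars[-49, ]) # re-fit, leaving case 49 out
yhat_all  <- predict(mod,   newdata = cars)   # predictions with all cases
yhat_drop <- predict(mod49, newdata = cars)   # leave-49-out predictions
D49 <- sum((yhat_all - yhat_drop)^2) / (p * s2)
all.equal(unname(cooks.distance(mod)[49]), D49)   # the two should agree
```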

Repeat with More Predictors

Occupational Prestige Data from John Fox's car Package

library(car)
Prestige$income <- Prestige$income / 10
presmod1 <- lm(prestige ~ income + education + women + type, data = Prestige)

My Professionally Acceptable Regression Table

M1: Estimate (S.E.)
(Intercept)  -0.814    (5.331)
income        0.104*** (0.026)
education     3.662*** (0.646)
women         0.006    (0.030)
typeprof      5.905    (3.938)
typewc       -2.917    (2.665)
N = 98, RMSE = 7.132, R² = 0.835, adj. R² = 0.826
*p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001

Termplot
I'd usually run termplot. termplot is the multiple regression equivalent of the scatterplot:

termplot(presmod1, se = TRUE, partial.resid = TRUE)

termplot(presmod1, se = TRUE, partial.resid = TRUE)
Education Termplot: [Figure: partial for education vs. education.]

termplot(presmod1, se = TRUE, partial.resid = TRUE)
Income Termplot: [Figure: partial for income vs. income.]

termplot(presmod1, se = TRUE, partial.resid = TRUE)
Women (percentage of members in field) Termplot: [Figure: partial for women vs. women.]

termplot(presmod1, se = TRUE, partial.resid = TRUE)
type Termplot: [Figure: partial for type across the levels bc, prof, wc.]

Plot Diagnostics
[Figure: the 2x2 diagnostic plot matrix for lm(prestige ~ income + education + women + type); flagged cases include medical.technicians, electronic.workers, service.station.attendant, and general.managers.]

Diagnostic Plot 1
[Figure: Residuals vs Fitted for lm(prestige ~ income + education + women + type); medical.technicians, electronic.workers, and service.station.attendant are flagged.]

Diagnostic Plot 2
[Figure: Normal Q-Q plot for lm(prestige ~ income + education + women + type); medical.technicians, electronic.workers, and service.station.attendant are flagged.]

Diagnostic Plot 3
[Figure: Scale-Location plot for lm(prestige ~ income + education + women + type); medical.technicians, electronic.workers, and service.station.attendant are flagged.]

Diagnostic Plot 4
[Figure: Residuals vs Leverage plot for lm(prestige ~ income + education + women + type), with Cook's distance contours; general.managers, medical.technicians, and service.station.attendant are flagged.]

Stress Test These Diagnostics

Experience Required To Interpret Plots
Usually (for me), a plot will reveal only a really bad, obvious problem; most plots look not grossly wrong. The only way I can think of to get some practice is to make up data with known flaws and then study the diagnostic plots. So I worked out a couple of experiments to illustrate the visual effect of nonlinearity and outliers.

Demo with a Manufactured Quadratic Relationship
The true model is a parabola (quadratic equation): y_i = 3 + 13.4 x_i − 0.15 x_i² + e_i, and the error term is drawn from N(µ_e = 0, σ²_e = 22²). [Figure: fake y vs. fake x with the OLS linear fit and the true nonlinear curve superimposed.]
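
The quadratic experiment is easy to rebuild; the seed and sample size below are assumptions, since the slide's exact draws are unknown:

```r
# Manufactured quadratic data, fit with a (deliberately wrong) straight line
set.seed(1234)                      # hypothetical seed
N <- 200
x <- runif(N, 0, 100)
e <- rnorm(N, mean = 0, sd = 22)    # N(0, 22^2)
y <- 3 + 13.4 * x - 0.15 * x^2 + e  # the true parabola
mod <- lm(y ~ x)                    # misspecified linear fit
plot(mod, which = 1)                # Residuals vs Fitted shows the arch
```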

The Manufactured Quadratic Data
[Figure: the 2x2 diagnostic plot matrix for lm(y ~ x) on the manufactured quadratic data; the Residuals vs Fitted panel bends sharply, and cases 1, 2, and 24 are flagged.]

Demo with Manufactured Outliers
The true model is a straight line: y_i = 3 + 1.4 x_i + e_i, and the error term is e_i ~ N(µ_e = 0, σ²_e = 22²). 10 randomly drawn cases have magnified error: e_i = 4 e_i. The 10 bad cases are: 11 43 50 63 86 116 141 160 187 198. [Figure: fake y vs. fake x with the OLS linear fit and the true line.]
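
That experiment can be sketched like this; the seed is an assumption, so the bad cases drawn will differ from the slide's list:

```r
# Straight-line data with 10 randomly chosen outliers (errors magnified x4)
set.seed(5678)                  # hypothetical seed
N <- 200
x <- runif(N, 0, 100)
e <- rnorm(N, mean = 0, sd = 22)
bad <- sample(N, 10)            # 10 randomly drawn cases
e[bad] <- 4 * e[bad]            # magnify their errors by a factor of 4
y <- 3 + 1.4 * x + e
mod <- lm(y ~ x)
plot(mod, which = 5)            # Residuals vs Leverage: do the outliers show?
```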

The Regression Table (just for the record)

M1: Estimate (S.E.)
(Intercept)  1.295    (4.198)
x            1.429*** (0.070)
N = 200, RMSE = 29.216, R² = 0.677
*p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001

R Plot for lm With the Manufactured Outlier Data (10 errors magnified by a factor of 4)
[Figure: the 2x2 diagnostic plot matrix for lm(y ~ x); cases 50, 141, and 187 are flagged, but nothing crosses the Cook's distance contour in the Residuals vs Leverage panel.]

Concern: Leverage Plot Doesn't Notice Effect of Outliers
I expected points would be in the extreme bad Cook's distance area. I'm going to torture this until it works (should I say breaks?). [Figure: Residuals vs Leverage with the Cook's distance contour at 0.5; cases 11, 141, and 187 are labeled but none crosses the contour.]

First Retry: Magnify the Outliers x10
The true model is a straight line: y_i = 3 + 1.4 x_i + e_i, with e_i ~ N(µ_e = 0, σ²_e = 22²). 10 randomly drawn cases have magnified error: e_i = 10 e_i. The 10 bad cases are: 11 43 50 63 86 116 141 160 187 198. [Figure: fake y vs. fake x with the OLS linear fit and the true line.]

The Regression Table (just for the record)

M1: Estimate (S.E.)
(Intercept)  -0.052   (7.812)
x             1.441*** (0.131)
N = 200, RMSE = 54.369, R² = 0.380
*p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001

Manufactured Outliers Magnified x10: Still Nothing in the Leverage Plot
[Figure: the 2x2 diagnostic plot matrix for lm(y ~ x); cases 50, 141, and 187 are flagged, and still no point crosses the Cook's distance contours.]

Cluster and Magnify the Outliers x10
The true model is a straight line: y_i = 3 + 1.4 x_i + e_i, with e_i ~ N(µ_e = 0, σ²_e = 22²). The 10 rightmost positive errors are magnified: e_i = 10 e_i. The bad cases are: 181 183 185 186 188 189 193 194 195 196 199. [Figure: fake y vs. fake x with the OLS linear fit and the true line.]

The Regression Table (just for the record)

M1: Estimate (S.E.)
(Intercept)  -11.238   (6.675)
x              1.828*** (0.112)
N = 200, RMSE = 46.453, R² = 0.575
*p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001

R Plot: 10 Rightmost Positive Magnified Errors, Clustered on the Right. Still Nothing in the Leverage Plot!
[Figure: the 2x2 diagnostic plot matrix for lm(y ~ x); cases 195, 196, and 199 are flagged, and none crosses the Cook's distance contour.]

Am Becoming Angry
I had believed the leverage plot would isolate the outliers after they were magnified (it did not). I had believed the leverage plot would isolate the outliers after they were clustered (it did not). Drop back to the simplest test: create just one outlier.

Insert One Outlier on the High Right Side
The true model is a straight line: y_i = 3 + 1.4 x_i + e_i, with e_i ~ N(µ_e = 0, σ²_e = 22²). The single rightmost positive error is magnified: e_i = 10 e_i. The bad case is: 199. [Figure: fake y vs. fake x with the OLS linear fit and the true line.]

The Regression Table (just for the record)

M1: Estimate (S.E.)
(Intercept)  0.098    (3.896)
x            1.479*** (0.065)
N = 200, RMSE = 27.116, R² = 0.722
*p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001

Just One Bad Outlier (1 super-magnified positive error on the right)
[Figure: the 2x2 diagnostic plot matrix for lm(y ~ x); case 199 stands far out in every panel and reaches the Cook's distance contours in the Residuals vs Leverage panel; cases 37, 105, and 190 are also labeled.]

Leverage Plot Finally Spots an Outlier
If there's just one extreme case, the procedure spots it. Conclusion: mechanical application of the spot-one-outlier method is not powerful with multiple outliers. It appears outliers can hide in plain sight if there are enough of them. [Figure: Residuals vs Leverage with case 199 flagged at the Cook's distance contours.]

Error Term Variance Proportional to x_i²
The true model is y_i = 3 + 1.4 x_i + e_i, and the error term is drawn from N(µ_e = 0, σ²_e = 0.5 x_i²). [Figure: fake y vs. fake x with the OLS linear fit and the true linear equation; the spread fans out as x grows.]
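
A sketch of this heteroskedastic setup (the seed and sample size are assumptions):

```r
# Error standard deviation grows with x: Var(e_i) = 0.5 * x_i^2
set.seed(321)                                # hypothetical seed
N <- 200
x <- runif(N, 0, 100)
e <- rnorm(N, mean = 0, sd = sqrt(0.5) * x)  # sd proportional to x
y <- 3 + 1.4 * x + e
mod <- lm(y ~ x)
plot(x, resid(mod))                          # fan shape: wider spread at large x
```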

The Regression Table (just for the record)

M1: Estimate (S.E.)
(Intercept)  654.256*   (324.255)
x            -23.201*** (5.444)
N = 200, RMSE = 2284.379, R² = 0.084
*p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001

We will work on heteroskedasticity later. It causes high variance in the estimates of β̂1, and std.err.(β̂1) is lower than it should be.

Loess / Residual Plot for the Manufactured Data
[Figure: residuals vs. fake x with a loess fit to the residuals.] The dispersion is greater on the right when plotting against x; we will see it flip-flop when plotting against the predicted value.

The Manufactured Heteroskedastic Data
[Figure: the 2x2 diagnostic plot matrix for lm(y ~ x) on the manufactured heteroskedastic data; cases 185, 189, and 197 are flagged, and the Scale-Location plot rises from left to right.]

Practice Problems

Problems
1 Fit a regression with several predictors. This is more interesting if some predictors are factors (categorical variables, in R's terminology) and some are continuous numeric variables. Then run the command termplot(mod1) on your model, which I assume is called mod1. Note that R will wait for you to click before it shows each graph.
2 On the same fitted model as in the previous example, run the command plot(mod1).
3 For the R Summer Course, I made several presentations about the plotting features of R. If you didn't know about them yet, this might be a good time to take a quick survey. In my guides folder, they are under Rcourse: plot-1, plot-2, plot-3d, regression-plots-1.

Problems...
For the regression plots, I suppose you will be mainly interested in plot-1 (scatterplots and histograms) and regression-plots-1. (I made a pretty vigorous effort to do 3-D plotting in plot-3d, but in many ways it is just too hard for R beginners.)
4 Here's a trick question for you. Consider this display of a q-q plot. [Imagine a q-q plot that shows several points that are way far off the straight line.] Does this mean the regression results are wrong? Please explain. (This is a final-exam sort of question, one that should make you connect theory and practice. One trick here is that I've used the word "wrong" and leave you to decide what wrong means. Did I mean biased? Inconsistent? Another trick is that you can do almost all of the usual regression exercises without assuming that e_i is normal. So think about how a regression can still be unbiased and consistent if the error is not normal.)

Problems...
5 Here is one way to find out which cases are outliers. R has a function called identify, and I've never gotten very good at it. But maybe you are better. The idea is that you can create a scatterplot and then click on certain points to identify them. Run ?identify to read more. Here's a working example.

x <- c(1, 2, 3, 4, 5)
y <- c(5, 4, 3, 4, 5)
nam <- c("Bill", "Charles", "Jane", "Jill", "Jaime")
dat <- data.frame(x, y, nam)
rm(x, y, nam)
plot(y ~ x, data = dat)
with(dat, identify(x, y, nam))

Problems...
As soon as you hit return on the last line, the R session will seem to freeze: it is waiting for you to left-click on points in the graph. You left-click on a point to insert nam next to it, and when you are finished, you right-click to release control from the identify function. This is one way to spot outliers. This one frustrates me so much that I made a WorkingExample for it. It is in my collection as plot-identify points-1.R. (Recall, you can get there either through pj.freefaulty.org/r or via the Rcourse notes; you end up at the same place.) If you don't soup up your computers, I bet you will have more fun with it than I do. My video driver is constantly out of whack, so a click does not do what I expect.