Regression diagnostics


Kerby Shedden
Department of Statistics, University of Michigan
November 5, 2018

Motivation

When working with a linear model with design matrix X, the conventional linear model is based on the following conditions:

    E[Y|X] ∈ col(X)   and   var[Y|X] = σ²I.

Least squares point estimates depend on the first condition approximately holding. Least squares inferences depend on both of the above conditions approximately holding. Inferences for small sample sizes may also depend on the distribution of Y − E[Y|X] being approximately multivariate Gaussian, but for moderate or large sample sizes this condition is not critical.

Regression diagnostics for linear models are approaches for assessing how well a particular data set fits these two conditions.

Residuals

Linear models can be expressed in two equivalent ways:

1. Focus only on moments: E[Y|X] ∈ col(X) and var[Y|X] = σ²I.
2. Use a generative model, in this case an additive error model of the form y = Xβ + ε, where ε is random with E[ε|X] = 0 and cov[ε|X] = σ²I.

Since the residuals can be viewed as predictions of the errors, it turns out that regression model diagnostics can often be developed using the residuals. Recall that the residuals can be expressed as

    r = (I − P)y,

where P is the projection onto col(X).

Residuals

The residuals have two key mathematical properties, regardless of the correctness of the model specification:

1. The residuals sum to zero, since (I − P)1 = 0 and hence 1′r = 1′(I − P)y = 0.
2. The residuals and fitted values are orthogonal (they have zero sample covariance):

       ĉov(r, ŷ) ∝ (r − r̄1)′ŷ = r′ŷ = y′(I − P)Py = 0.

These properties hold as long as an intercept is included in the model (so P1 = 1, where 1 is a vector of 1's).
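
Below is a minimal numpy sketch, not from the slides, that verifies both properties numerically on simulated data (all variable names are hypothetical), assuming the design matrix includes an intercept column.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p covariates
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(size=n)

# Projection matrix P = X (X'X)^{-1} X'
P = X @ np.linalg.solve(X.T @ X, X.T)
yhat = P @ y
r = y - yhat

print(r.sum())   # ~0: the residuals sum to zero
print(r @ yhat)  # ~0: the residuals are orthogonal to the fitted values
```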

Residuals

If the basic linear model conditions hold, these two properties have population counterparts:

1. The expected value of each residual is zero: E[r|X] = (I − P)E[Y|X] = 0 ∈ R^n.
2. The population covariance between any residual and any fitted value is zero:

       cov(r, ŷ|X) = E[rŷ′|X] = (I − P)cov(Y|X)P = σ²(I − P)P = 0 ∈ R^{n×n}.

Residuals

If the model is correctly specified, there is a simple formula for the variances and covariances of the residuals:

    cov(r|X) = (I − P)E[yy′](I − P) = (I − P)(Xββ′X′ + σ²I)(I − P) = σ²(I − P).

If the model is correctly specified, the standardized residuals

    (y_i − ŷ_i)/σ̂

and the Studentized residuals

    (y_i − ŷ_i)/(σ̂(1 − P_ii)^{1/2})

approximately have mean zero and variance one.

External standardization of residuals

Let σ̂²_(i) be the estimate of σ² obtained by fitting the regression model with the i-th case omitted. It turns out that we can calculate this value without actually refitting the model:

    σ̂²_(i) = [(n − p − 1)σ̂² − r_i²/(1 − P_ii)] / (n − p − 2),

where r_i is the residual for the model fit to all of the data.

The externally standardized residuals are

    (y_i − ŷ_i)/σ̂_(i),

and the externally Studentized residuals are

    (y_i − ŷ_i)/(σ̂_(i)(1 − P_ii)^{1/2}).
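
The following numpy sketch (simulated data, hypothetical names) checks the leave-one-out variance identity: the deleted estimate σ̂²_(i) computed from the full fit agrees with the estimate obtained by actually refitting the model without case i.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)
r = y - P @ y
sigma2_hat = r @ r / (n - p - 1)

i = 7  # case to delete
shortcut = ((n - p - 1) * sigma2_hat - r[i] ** 2 / (1 - P[i, i])) / (n - p - 2)

# Brute-force refit without case i
mask = np.arange(n) != i
coef = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
resid = y[mask] - X[mask] @ coef
refit = resid @ resid / (n - 1 - p - 1)

print(shortcut, refit)  # the two values agree up to floating point error
```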

Outliers and masking

In some settings, residuals can be used to identify outliers. However, in a small data set, a large outlier will increase the value of σ̂, and hence may mask itself.

Externally Studentized residuals solve the problem of a single large outlier masking itself. But masking may still occur if multiple large outliers are present.

Outliers and masking

If multiple large outliers may be present, we may use alternative estimates of the scale parameter σ:

Interquartile range (IQR): the difference between the 75th percentile and the 25th percentile of the distribution or data. The IQR of the standard normal distribution is approximately 1.35, so IQR/1.35 can be used to estimate σ.

Median absolute deviation (MAD): the median of the absolute deviations from the median of the distribution or data, i.e. median(|Z − median(Z)|). The MAD of the standard normal distribution is approximately 0.675, so MAD/0.675 can be used to estimate σ.

These alternative estimates of σ can be used in place of the usual σ̂ for standardizing or Studentizing residuals.
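
A brief numpy illustration (simulated Gaussian data with a known scale; a sketch, not from the slides) of the two robust scale estimates described above: both IQR/1.35 and MAD/0.675 recover σ and are insensitive to a handful of large outliers.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(loc=0.0, scale=2.0, size=100_000)  # true sigma = 2
z[:10] = 50.0                                     # a few gross outliers

q75, q25 = np.percentile(z, [75, 25])
sigma_iqr = (q75 - q25) / 1.35

mad = np.median(np.abs(z - np.median(z)))
sigma_mad = mad / 0.675

print(sigma_iqr, sigma_mad)  # both close to 2 despite the outliers
```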

Leverage

Leverage is a measure of how strongly the data for case i determine the fitted value ŷ_i. Since ŷ = Py, and ŷ_i = Σ_j P_ij y_j, it is natural to define the leverage for case i as P_ii, where P is the projection matrix onto col(X).

This is related to the fact that the variance of the i-th residual is σ²(1 − P_ii). Since the residuals have mean zero, when P_ii is close to 1 the residual will likely be close to zero. This means that the fitted line will usually pass close to (x_i, y_i) if it is a high-leverage point.

Leverage

These are the coefficients P_ij plotted against x_j (for a specific value of i) in a simple linear regression, where

    ŷ_k = Σ_i [(S + n(x_i − x̄)(x_k − x̄)) / (nS)] y_i,   with S = Σ_j (x_j − x̄)².

[Figure: hat matrix coefficients P_ij plotted against x_j.]

Leverage

If we use basis functions, the coefficients in each row of P are much more local.

[Figure: P_ij plotted against x_j for a regression using basis functions.]

Leverage

What is a big leverage? The average leverage is trace(P)/n = (p + 1)/n. If the leverage for a particular case is two or more times greater than the average leverage, it may be considered to have high leverage.

In simple linear regression, it is easy to show that

    var(y_i − α̂ − β̂x_i) = (n − 1)σ²/n − σ²(x_i − x̄)²/Σ_j (x_j − x̄)².

This implies that when p = 1,

    P_ii = 1/n + (x_i − x̄)²/Σ_j (x_j − x̄)².
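
A quick numpy check (hypothetical simulated data) of the simple-regression leverage formula: the diagonal of the projection matrix equals 1/n + (x_i − x̄)²/Σ_j (x_j − x̄)², and the leverages average to (p + 1)/n = 2/n.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = rng.uniform(size=n)
X = np.column_stack([np.ones(n), x])  # intercept + one covariate (p = 1)

P = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(P)

formula = 1.0 / n + (x - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum()

print(np.allclose(leverage, formula))  # True
print(leverage.mean(), 2.0 / n)        # both equal (p + 1)/n
```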

Leverage

Leverage values in a simple linear regression:

[Figure: the data (Y against X) and the corresponding leverage values P_ii plotted against X.]

Leverage

Leverage values in a linear regression with two independent variables:

[Figure: leverage values plotted over the (X1, X2) plane.]

Leverage

In general,

    P_ii = x_i′(X′X)^{-1}x_i = x_i′(X′X/n)^{-1}x_i/n,

where x_i is the i-th row of X (including the intercept).

Let x̃_i be row i of X without the intercept, let μ_X be the sample mean of the x̃_i, and let Σ_X be the sample covariance matrix of the x̃_i (scaled by n rather than n − 1). It is a fact that

    x_i′(X′X/n)^{-1}x_i = (x̃_i − μ_X)′Σ_X^{-1}(x̃_i − μ_X) + 1,

and therefore

    P_ii = ((x̃_i − μ_X)′Σ_X^{-1}(x̃_i − μ_X) + 1)/n.

Note that this implies that P_ii ≥ 1/n.

Leverage

The expression (x̃_i − μ_X)′Σ_X^{-1}(x̃_i − μ_X) is the squared Mahalanobis distance between x̃_i and μ_X. Thus there is a direct relationship between the Mahalanobis distance of a point relative to the center of the covariate set and its leverage.
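
The identity relating leverage to the squared Mahalanobis distance can be verified directly; the sketch below uses simulated covariates and hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 3
Z = rng.normal(size=(n, p))            # covariates without the intercept
X = np.column_stack([np.ones(n), Z])

leverage = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

mu = Z.mean(axis=0)
Sigma = (Z - mu).T @ (Z - mu) / n      # sample covariance scaled by n
d2 = np.einsum('ij,jk,ik->i', Z - mu, np.linalg.inv(Sigma), Z - mu)  # squared Mahalanobis distances

print(np.allclose(leverage, (d2 + 1) / n))  # True
```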

Influence

Influence measures the degree to which deletion of a case changes the fitted model. We will see that this is different from leverage: a high-leverage point has the potential to be influential, but is not always influential.

The deleted slope for case i is the fitted slope vector obtained upon deleting case i. The following identity allows the deleted slopes to be calculated efficiently:

    β̂_(i) = β̂ − [r_i/(1 − P_ii)] (X′X)^{-1}x_i,

where r_i is the i-th residual and x_i is row i of the design matrix.
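
The deleted-coefficient identity can be confirmed against a brute-force refit; below is a numpy sketch on simulated data (names hypothetical).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # leverages P_ii
r = y - X @ beta_hat

i = 11
beta_loo = beta_hat - r[i] / (1 - h[i]) * (XtX_inv @ X[i])

mask = np.arange(n) != i
beta_refit = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]

print(np.allclose(beta_loo, beta_refit))  # True
```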

Influence

The vector of deleted fitted values is

    ŷ_(i) = Xβ̂_(i) = ŷ − [r_i/(1 − P_ii)] X(X′X)^{-1}x_i.

Influence can be measured by Cook's distance:

    D_i = (ŷ − ŷ_(i))′(ŷ − ŷ_(i)) / ((p + 1)σ̂²)
        = [r_i²/((1 − P_ii)²(p + 1)σ̂²)] x_i′(X′X)^{-1}x_i
        = P_ii (r_i^s)² / ((1 − P_ii)(p + 1)),

where r_i is the residual and r_i^s is the Studentized residual.

Influence

Cook's distance approximately captures the average squared change in fitted values due to deleting case i, in error variance units. Cook's distance is large only if both the leverage P_ii and the Studentized residual for the i-th case are large.

As a general rule, D_i values from 1/2 to 1 are high, and values greater than 1 are considered to be very high.
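
A numpy sketch (simulated data, hypothetical names) computing Cook's distance from the closed-form expression above and checking one case against the definition based on the deleted fit.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, -0.5, 0.8]) + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(P)
yhat = P @ y
r = y - yhat
sigma2_hat = r @ r / (n - p - 1)

r_stud = r / np.sqrt(sigma2_hat * (1 - h))
cooks = h * r_stud ** 2 / ((1 - h) * (p + 1))  # closed-form Cook's distances

# Check case i against (yhat - yhat_(i))'(yhat - yhat_(i)) / ((p + 1) * sigma2_hat)
i = 3
mask = np.arange(n) != i
beta_loo = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
diff = yhat - X @ beta_loo
print(np.allclose(cooks[i], diff @ diff / ((p + 1) * sigma2_hat)))  # True
```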

Influence

Cook's distances in a simple linear regression:

[Figure: Cook's distance plotted against X.]

Influence

Cook's distances in a linear regression with two variables:

[Figure: Cook's distances plotted over the (X1, X2) plane.]

Regression graphics

Quite a few graphical techniques have been proposed to aid in visualizing regression relationships. We will discuss the following plots:

1. Scatterplots of Y against individual X variables.
2. Scatterplots of X variables against each other.
3. Residuals versus fitted values plot.
4. Added variable plots.
5. Partial residual plots.
6. Residual quantile plots.

Scatterplots of Y against individual X variables

E[Y|X] = X1 − X2 + X3, var[Y|X] = 1, var(Xj) = 1, cor(Xj, Xk) = 0.3

[Figure: scatterplots of Y against X1, X2, X3, and X1 − X2 + X3.]

Scatterplots of X variables against each other

E[Y|X] = X1 − X2 + X3, var[Y|X] = 1, var(Xj) = 1, cor(Xj, Xk) = 0.3

[Figure: pairwise scatterplots of X1, X2, and X3.]

Residuals against fitted values plot

E[Y|X] = X1 − X2 + X3, var[Y|X] = 1, var(Xj) = 1, cor(Xj, Xk) = 0.3

[Figure: residuals plotted against fitted values.]

Residuals against fitted values plots

Heteroscedastic errors: E[Y|X] = X1 + X3, var[Y|X] = 2 + X1² + X3², var(Xj) = 1, cor(Xj, Xk) = 0.3

[Figure: residuals plotted against fitted values.]

Residuals against fitted values plots

Nonlinear mean structure: E[Y|X] = X1², var[Y|X] = 1, var(Xj) = 1, cor(Xj, Xk) = 0.3

[Figure: residuals plotted against fitted values.]

Added variable plots

Suppose P_{−j} is the projection onto the span of all covariates except X_j, and define

    Ŷ_{−j} = P_{−j}Y,    X̂_j = P_{−j}X_j.

The added variable plot is a scatterplot of Y − Ŷ_{−j} against X_j − X̂_j.

The squared correlation coefficient of the points in the added variable plot is the partial R² for variable j.

Added variable plots are also called partial regression plots.
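
The coordinates of an added variable plot, and the partial R², can be computed directly; the numpy sketch below uses simulated data and hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
Z = rng.multivariate_normal(np.zeros(3), 0.3 + 0.7 * np.eye(3), size=n)
y = Z @ np.array([1.0, -1.0, 1.0]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), Z])
j = 2                                          # column of X to examine
others = [k for k in range(X.shape[1]) if k != j]
X0 = X[:, others]

P0 = X0 @ np.linalg.solve(X0.T @ X0, X0.T)     # projection onto the other covariates
y_part = y - P0 @ y                            # Y minus its projection
x_part = X[:, j] - P0 @ X[:, j]                # X_j minus its projection

partial_r2 = np.corrcoef(y_part, x_part)[0, 1] ** 2
print(partial_r2)
# A scatterplot of y_part against x_part is the added variable plot for X_j.
```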

Added variable plots

E[Y|X] = X1 − X2 + X3, var[Y|X] = 1, var(Xj) = 1, cor(Xj, Xk) = 0.3

[Figure: added variable plots for X1, X2, and X3.]

Partial residual plot

Suppose we fit the model

    Ŷ_i = β̂′X_i = β̂_0 + β̂_1 X_i1 + ··· + β̂_p X_ip.

The partial residual plot for covariate j is a plot of β̂_j X_ij + R_i against X_ij, where R_i is the residual.

The partial residual plot attempts to show how covariate j is related to Y, controlling for the effects of all other covariates.
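
A short numpy sketch (simulated data with a nonlinear effect; names hypothetical) computing partial residuals for one covariate; plotting them against that covariate gives the partial residual plot.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
Z = rng.normal(size=(n, 3))
y = Z[:, 0] ** 2 - Z[:, 1] + Z[:, 2] + rng.normal(size=n)  # nonlinear in the first covariate

X = np.column_stack([np.ones(n), Z])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

j = 1                                          # column of X holding the first covariate
partial_resid = beta_hat[j] * X[:, j] + resid
# A plot of partial_resid against X[:, j] should reveal the quadratic relationship
# between Y and this covariate, adjusting for the other covariates.
```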

Partial residual plot

E[Y|X] = X1², var[Y|X] = 1, var(Xj) = 1, cor(Xj, Xk) = 0.3

[Figure: partial residual plots for X1, X2, and X3.]

Residual quantile plots

E[Y|X] = X1², var[Y|X] = 1, var(Xj) = 1, cor(Xj, Xk) = 0.3, t-distributed errors

[Figure: standardized residual quantiles plotted against standard normal quantiles.]

Transformations

As noted above, the linear model imposes two main constraints on the population under study: the conditional mean function should be linear, and the conditional variance function should be constant.

If it appears that E[Y|X = x] is not linear in x, or that var[Y|X = x] is not constant in x, it may be possible to continuously transform either y or x so that the linear model becomes more consistent with the data.

Variance stabilizing transformations

Many populations encountered in practice exhibit a mean/variance relationship, in which E[Y_i] and var[Y_i] are related.

Suppose that var[Y_i] = g(E[Y_i])σ², and let f(·) be a transformation to be applied to the Y_i. The goal is to find a transformation such that the variances of the transformed responses are constant.

Using a Taylor expansion,

    f(Y_i) ≈ f(E[Y_i]) + f′(E[Y_i])(Y_i − E[Y_i]).

Variance stabilizing transformations

Therefore

    var[f(Y_i)] ≈ f′(E[Y_i])² var[Y_i] = f′(E[Y_i])² g(E[Y_i])σ².

The goal is to find f such that f′ = 1/√g.

Example: Suppose g(z) = z^λ. This includes the Poisson regression case λ = 1, where the variance is proportional to the mean, and the case λ = 2, where the standard deviation is proportional to the mean.

When λ = 1, f solves f′(z) = 1/√z, so f is the square root function. When λ = 2, f solves f′(z) = 1/z, so f is the logarithm function.
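
A small simulation (Poisson counts; a sketch rather than an example from the slides) illustrating the λ = 1 case: on the raw scale the variance grows with the mean, while after a square-root transform the variance is roughly constant (about 1/4) across groups.

```python
import numpy as np

rng = np.random.default_rng(9)
means = np.array([2.0, 5.0, 10.0, 20.0, 50.0])
samples = [rng.poisson(m, size=20_000) for m in means]

raw_var = [s.var() for s in samples]            # grows roughly in proportion to the mean
sqrt_var = [np.sqrt(s).var() for s in samples]  # roughly constant, near 0.25

print(np.round(raw_var, 2))
print(np.round(sqrt_var, 3))
```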

Log/log regression

Suppose we fit a simple linear regression of the form

    E[log(Y)|log(X)] = α + β log(X),

so that

    E[log(Y)|log(X) = x + 1] − E[log(Y)|log(X) = x] = β.

Using the crude approximation log E[Y|X] ≈ E[log(Y)|X], we conclude that E[Y|X] is approximately scaled by a factor of e^β when X is scaled by a factor of e.

Thus in a log/log model, we may say that an f% change in X is approximately associated with an fβ% change in the expected response.

Maximum likelihood estimation of a data transformation

The Box-Cox family of transformations is

    y → (y^λ − 1)/λ,

which makes sense only when all Y_i are positive.

The Box-Cox family includes the identity (λ = 1), all power transformations such as the square root (λ = 1/2) and the reciprocal (λ = −1), and the logarithm in the limiting case λ → 0.

Maximum likelihood estimation of a data transformation

Suppose we assume that for some value of λ, the transformed data follow a linear model with Gaussian errors. We can then set out to estimate λ.

The joint log-likelihood of the transformed data is

    −(n/2)log(2π) − n log σ − (1/(2σ²)) Σ_i (Y_i^(λ) − X_i′β)².

Next we transform this back to a likelihood in terms of Y_i = g_λ^{-1}(Y_i^(λ)). This joint log-likelihood is

    −(n/2)log(2π) − n log σ − (1/(2σ²)) Σ_i (g_λ(Y_i) − X_i′β)² + Σ_i log J_i,

where the Jacobian term is

    log J_i = log g_λ′(Y_i) = (λ − 1) log Y_i.

Maximum likelihood estimation of a data transformation

The joint log-likelihood for the Y_i is

    −(n/2)log(2π) − n log σ − (1/(2σ²)) Σ_i (g_λ(Y_i) − X_i′β)² + (λ − 1) Σ_i log Y_i.

This likelihood is maximized with respect to λ, β, and σ to identify the MLE.

Maximum likelihood estimation of a data transformation

To do the maximization, let Y^(λ) ≡ g_λ(Y) denote the transformed observed responses, and let Ŷ^(λ) denote the fitted values from regressing Y^(λ) on X. Since σ does not appear in the Jacobian,

    σ̂²_λ ≡ n^{-1} ‖Y^(λ) − Ŷ^(λ)‖²

will be the maximizing value of σ². Therefore the MLE of β and λ will maximize

    −n log σ̂_λ + (λ − 1) Σ_i log Y_i.
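
One way to carry out this maximization is to profile over λ on a grid: for each λ, transform the responses, fit by least squares, and evaluate −n log σ̂_λ + (λ − 1)Σ log Y_i. The numpy sketch below does this on simulated data (the grid and data are hypothetical) where the true transformation is the logarithm.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 300
x = rng.uniform(1, 3, size=n)
X = np.column_stack([np.ones(n), x])
y = np.exp(0.5 + 1.0 * x + 0.2 * rng.normal(size=n))  # log(y) follows a linear model

def boxcox(y, lam):
    return np.log(y) if abs(lam) < 1e-8 else (y ** lam - 1) / lam

def profile_loglik(lam):
    z = boxcox(y, lam)
    zhat = X @ np.linalg.lstsq(X, z, rcond=None)[0]
    sigma_hat = np.sqrt(np.mean((z - zhat) ** 2))
    return -n * np.log(sigma_hat) + (lam - 1) * np.log(y).sum()

grid = np.linspace(-1, 2, 61)
ll = np.array([profile_loglik(lam) for lam in grid])
print(grid[ll.argmax()])  # close to 0, pointing to the log transform
```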

Collinearity Diagnostics

Collinearity inflates the sampling variances of covariate effect estimates. To understand the effect of collinearity on var[β̂_j|X], reorder the columns and partition the design matrix X as

    X = (X_j, X_0) = (X̂_j + X̃_j, X_0),

where X_0 is the n × p matrix consisting of all columns in X except X_j, X̂_j is the projection of X_j onto col(X_0), and X̃_j = X_j − X̂_j.

Therefore

    H ≡ X′X = [ X_j′X_j   X_j′X_0 ]
              [ X_0′X_j   X_0′X_0 ],

and var(β̂_j) = σ²[H^{-1}]_{11}, so we want a simple expression for [H^{-1}]_{11}.

Collinearity Diagnostics

A symmetric block matrix can be inverted using

    [ A   B ]^{-1}  =  [ S^{-1}             −S^{-1}BC^{-1}                 ]
    [ B′  C ]          [ −C^{-1}B′S^{-1}    C^{-1} + C^{-1}B′S^{-1}BC^{-1} ],

where S = A − BC^{-1}B′.

Therefore

    [H^{-1}]_{1,1} = (X_j′X_j − X_j′P_0X_j)^{-1},

where P_0 = X_0(X_0′X_0)^{-1}X_0′ is the projection matrix onto col(X_0).

Collinearity Diagnostics

Since P_0X_j = X̂_j, and since X̃_j is orthogonal to col(X_0), we can write

    [H^{-1}]_{1,1} = (X_j′X_j − X̂_j′X̂_j)^{-1}.

Since X̃_j′X̂_j = 0, it follows that

    ‖X_j‖² = ‖X̂_j + X̃_j‖² = ‖X̂_j‖² + ‖X̃_j‖²,

so

    [H^{-1}]_{1,1} = 1/‖X̃_j‖².

This makes sense, since smaller values of ‖X̃_j‖² correspond to greater collinearity.

Collinearity Diagnostics

Let R²_jx be the coefficient of determination (multiple R²) for the regression of X_j on the other covariates:

    R²_jx = 1 − ‖X_j − X̂_j‖²/‖X_j‖² = 1 − ‖X̃_j‖²/‖X_j‖².

Combining the two equations yields

    [H^{-1}]_{11} = (1/‖X_j‖²) · 1/(1 − R²_jx).

Collinearity Diagnostics

The two factors in the expression

    [H^{-1}]_{11} = (1/‖X_j‖²) · 1/(1 − R²_jx)

reflect two different sources of the variance of β̂_j:

1/‖X_j‖² = 1/((n − 1)var(X_j)) (for centered X_j) reflects the scaling of X_j.

The variance inflation factor (VIF) 1/(1 − R²_jx) is scale-free. It is always greater than or equal to 1, and is equal to 1 only if X_j is orthogonal to the other covariates. Large values of the VIF indicate that parameter estimation is strongly affected by collinearity.
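
A numpy sketch (simulated collinear covariates, hypothetical names) computing the variance inflation factor for each covariate as 1/(1 − R²_jx), where R²_jx comes from regressing X_j on the remaining covariates (with an intercept).

```python
import numpy as np

rng = np.random.default_rng(11)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)  # strongly collinear with x1
x3 = rng.normal(size=n)
Z = np.column_stack([x1, x2, x3])

def vif(Z, j):
    others = np.column_stack([np.ones(len(Z)), np.delete(Z, j, axis=1)])
    fitted = others @ np.linalg.lstsq(others, Z[:, j], rcond=None)[0]
    resid = Z[:, j] - fitted
    r2 = 1 - resid @ resid / ((Z[:, j] - Z[:, j].mean()) ** 2).sum()
    return 1.0 / (1.0 - r2)

print([round(vif(Z, j), 2) for j in range(Z.shape[1])])  # large VIFs for x1 and x2
```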