
Leverage

Some cases have high leverage: the potential to greatly affect the fit. These cases are outliers in the space of the predictors. Often the residuals for these cases are not large, either because the response is in line with the other values or because the high leverage has pulled the fitted model toward the observed response.

Summary

1. The leverage exerted by the i-th case is $h_{ii}$, the i-th diagonal element of the hat matrix.
2. Properties:
   - $0 \le h_{ii} \le 1$.
   - If there is an intercept term in the regression model, $h_{ii} \ge 1/n$.
   - If there are r observations with the same x, the leverage for each of those observations is at most 1/r, so groups of potentially influential cases are masked.
   - A general guideline is to flag cases where $h_{ii} > 2p/n$, where p is the number of columns of X, equal to k + 1 in a multiple regression with k predictors and an intercept.
3. To get leverage values in R, use the command hat(model.matrix(output)), where output is the result of a call to lm.
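A minimal sketch of these commands on simulated data (the data and the object names output, x1, x2 are illustrative only, not from the original notes):

set.seed(1)
n <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2*x1 - x2 + rnorm(n)
output <- lm(y ~ x1 + x2)

h <- hat(model.matrix(output))    # leverage values h_ii
p <- ncol(model.matrix(output))   # p = k + 1 = 3 columns of X here
which(h > 2*p/n)                  # cases flagged by the 2p/n guideline

The built-in function hatvalues(output) returns the same values.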

The fitted value at case i is

$\hat{y}_i = (Hy)_i = \sum_{j=1}^{n} h_{ij} y_j = h_{ii} y_i + \sum_{j \ne i} h_{ij} y_j,$

a linear combination of all the responses. Ideally all cases contribute, with those at and closest to $x_i$ dominating. For influential cases $h_{ii}$ approaches 1 and $h_{ij}$ approaches 0 for $j \ne i$. One can inspect the leverage values $h_{ii} = x_i^T (X^T X)^{-1} x_i$ to identify those which are large. In R, use the command hat(model.matrix(output)) to get the leverage values, where output is the result of a call to lm.
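Continuing the simulated sketch above, one can form H explicitly and check these identities numerically:

X <- model.matrix(output)
H <- X %*% solve(t(X) %*% X) %*% t(X)                      # hat matrix
all.equal(as.vector(H %*% y), as.vector(fitted(output)))   # TRUE: yhat = H y
all.equal(unname(diag(H)), hat(X))                         # TRUE: diagonal of H = leverages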

Often it is difficult to find a case with high leverage by examining each predictor separately or in pairs using bivariate plots: the case may not be extreme in any particular predictor, yet still be far from the centroid of the predictors. Recall that $\mathrm{Var}(\hat{y}_i) = h_{ii}\sigma^2$ and $\mathrm{Var}(e_i) = (1 - h_{ii})\sigma^2$, so cases with high leverage have large estimation variance and small residual variance. In simple linear regression

$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2},$

so the minimum is 1/n at $\bar{x}$ and the maximum occurs at the $x_i$ furthest from $\bar{x}$. More generally, $h_{ii}$ measures the distance of the predictors from their centroid. The sum of the $h_{ii}$ is $\mathrm{tr}(H) = k + 1 = p$, so their average is $\bar{h} = (k + 1)/n = p/n$.
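As a quick check of the simple-regression formula, again on made-up data:

set.seed(2)
x <- runif(20)
y1 <- 3 + 2*x + rnorm(20)
h.slr <- hat(model.matrix(lm(y1 ~ x)))
h.formula <- 1/20 + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(h.slr, h.formula)       # TRUE
sum(h.slr)                        # 2 = p (intercept plus one predictor)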

$0 \le h_{ii} \le 1$

Because $H = HH$ and $H = H^T$,

$h_{ii} = \sum_{j=1}^{n} h_{ij} h_{ji} = h_{ii}^2 + \sum_{j \ne i} h_{ij}^2, \qquad (1)$

so $h_{ii}(1 - h_{ii}) \ge 0$ and hence $0 \le h_{ii} \le 1$. Some statistical packages flag cases where $h_{ii} > 2(k + 1)/n = 2p/n$.
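Numerically, with the H computed in the earlier sketch:

max(abs(H %*% H - H))    # essentially 0: H is idempotent
max(abs(t(H) - H))       # essentially 0: H is symmetric
range(diag(H))           # all leverages lie in [0, 1]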

When an intercept $\beta_0$ is included in a multiple regression model, $h_{ii} \ge 1/n$.

In the notes about adding variables to a regression we partitioned the X matrix into $X_1$ and $X_2$, and saw that $H = H_1 + H_{2 \cdot 1}$. Let $X_1$ be the vector of 1's, so that $H_1 = J/n$ and $H_{2 \cdot 1} = \tilde{X}(\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T$, where $\tilde{X} = (I - J/n)X_2$ contains the deviations of the predictors from their means. The i-th diagonal entry of H is then

$h_{ii} = \frac{1}{n} + [\tilde{X}(\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T]_{ii}.$

The second term is nonnegative, so $h_{ii} \ge 1/n$.
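A numerical check of this decomposition, continuing the simulated sketch (Xc is just an illustrative name for the centered predictor columns):

Xc <- scale(X[, -1], center = TRUE, scale = FALSE)             # centered predictors, intercept column dropped
h.decomp <- 1/n + diag(Xc %*% solve(t(Xc) %*% Xc) %*% t(Xc))
all.equal(unname(h.decomp), hat(X))                            # TRUE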

Maximum leverage with replicate x

The second term is of the form

$\sum_{j} \sum_{k} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k) \, C_{jk},$

where $C_{jk}$ is the jk-th entry of $(\tilde{X}^T \tilde{X})^{-1}$, and so measures the distance of the predictors in the i-th case from the centroid $\bar{x} = (\bar{x}_1, \ldots, \bar{x}_k)^T$ in the k-dimensional space of predictors. When two cases i and k have the same predictors ($x_i = x_k$), or equivalently when there are two y values at the same x,

$h_{ik} = x_i^T (X^T X)^{-1} x_k = h_{ii}.$

From equation (1),

$h_{ii} = 2h_{ii}^2 + \sum_{j \ne i,k} h_{ij}^2,$

so $h_{ii}(1 - 2h_{ii}) \ge 0$ and $0 \le h_{ii} \le 1/2$. So the maximum leverage value is halved when there are two cases with the same values for the predictors. More generally, if r cases have the same predictors (or equivalently, when there are r replicate values of y at x), the maximum possible leverage value for these cases is 1/r. Groups of potentially influential cases are masked, and cannot be detected by examining the $h_{ii}$.
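A small illustration of this cap (simulated data; the duplicated design point at x = 5 is deliberately far from the rest):

set.seed(3)
x.rep <- c(5, 5, rnorm(8))                         # cases 1 and 2 share the same extreme x
y.rep <- rnorm(10)
h.rep <- hat(model.matrix(lm(y.rep ~ x.rep)))
h.rep[1:2]                                         # equal, and each is below 1/2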

How to identify hidden extrapolation

Our book points out the danger of hidden extrapolation when predicting (Section 3.8). It notes that any $x_0$ with $x_0^T (X^T X)^{-1} x_0 > h_{\max}$, where $h_{\max}$ is the largest leverage value in the dataset, implies extrapolation beyond the cases in the dataset.
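A sketch of this check in R, continuing the simulated example above (the new point x0 is hypothetical):

h.max <- max(hat(X))
x0 <- c(1, 2.5, -1.8)                           # intercept, then hypothetical values of x1 and x2
h00 <- drop(t(x0) %*% solve(t(X) %*% X) %*% x0)
h00 > h.max                                     # TRUE indicates hidden extrapolation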

Example: strength of wood beams

Data on the strength of wood beams were given by Hoaglin and Welsch (The American Statistician, 1978, vol. 32, pp. 17-22). The response is Strength and the predictors are Specific Gravity and Moisture Content. The data are:

Beam Number   Specific Gravity   Moisture Content   Strength
     1              0.499             11.1            11.14
     2              0.558              8.9            12.74
     3              0.604              8.8            13.13
     4              0.441              8.9            11.51
     5              0.550              8.8            12.38
     6              0.528              9.9            12.60
     7              0.418             10.7            11.13
     8              0.480             10.5            11.70
     9              0.406             10.5            11.02
    10              0.467             10.7            11.41
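One way to enter these data in R, using the variable names that appear in the output below (the original notes do not show this step):

wood.sg  <- c(0.499, 0.558, 0.604, 0.441, 0.550, 0.528, 0.418, 0.480, 0.406, 0.467)
wood.mc  <- c(11.1, 8.9, 8.8, 8.9, 8.8, 9.9, 10.7, 10.5, 10.5, 10.7)
wood.str <- c(11.14, 12.74, 13.13, 11.51, 12.38, 12.60, 11.13, 11.70, 11.02, 11.41)
woodlm.out <- lm(wood.str ~ wood.sg + wood.mc)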

The correlation matrix of the data is

                 SG         MC   STRENGTH
SG        1.0000000 -0.6077351  0.9131352
MC       -0.6077351  1.0000001 -0.7592328
STRENGTH  0.9131352 -0.7592328  1.0000000

[Figure: scatterplot matrix of SG, MC, and STRENGTH]
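The scatterplot matrix and correlations can be reproduced with something like:

wood <- data.frame(SG = wood.sg, MC = wood.mc, STRENGTH = wood.str)
round(cor(wood), 7)
pairs(wood)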

There is a positive association between Strength and SG and a negative association with MC. There is also a negative association between SG and MC, with one value (in the lower left corner) quite different from the others. We fit the linear model

Strength = $\beta_0 + \beta_1$ SG + $\beta_2$ MC + $\epsilon$,

which gives the following output.

> summary(woodlm.out)

Call:
lm(formula = wood.str ~ wood.sg + wood.mc)

Residuals:
     Min       1Q   Median      3Q     Max
 -0.4442  -0.1278  0.05365  0.1052  0.4499

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) 10.3015     1.8965  5.4319   0.0010
wood.sg      8.4947     1.7850  4.7589   0.0021
wood.mc     -0.2663     0.1237 -2.1522   0.0684

Residual standard error: 0.2754 on 7 degrees of freedom
Multiple R-Squared: 0.9
F-statistic: 31.5 on 2 and 7 degrees of freedom

The leverage values for the linear model in the two predictors are:

> hat(model.matrix(woodlm.out))
 [1] 0.4178935 0.2418666 0.4172806 0.6043904 0.2521824 0.14
 [7] 0.2616385 0.1540321 0.3155106 0.1873364

Case 4 has the largest leverage; this is the case with low SG and low MC. Note that the leverage values sum to 3, and that $2\bar{h} = 2(3)/10 = 0.6$, so case 4 would be flagged as high leverage by some packages.
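The sum and the flagging rule can be checked directly:

h.wood <- hat(model.matrix(woodlm.out))
sum(h.wood)                   # 3 = p, the number of columns of X
which(h.wood > 2 * 3 / 10)    # case 4 exceeds the 2p/n = 0.6 guideline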

The leverage values are plotted in the (SG, MC) space below. One can see how the values increase as you move toward the extremes of the data.

[Figure: leverage values in predictor space, each case labeled with its $h_{ii}$ and plotted against SG and MC]
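One way such a plot might be produced (a sketch, not the original code):

plot(wood.sg, wood.mc, type = "n", xlab = "SG", ylab = "MC",
     main = "Leverage values in predictor space")
text(wood.sg, wood.mc, labels = round(h.wood, 2))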

The residuals from this model are shown below.

[Figure: residuals from the linear model plotted against SG and against MC]
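These residual plots might be drawn with, for example:

par(mfrow = c(1, 2))
plot(wood.sg, residuals(woodlm.out), xlab = "SG", ylab = "residuals")
plot(wood.mc, residuals(woodlm.out), xlab = "MC", ylab = "residuals")
par(mfrow = c(1, 1))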

This is only a small data set, but one possible extension of the model is to add a quadratic term in MC.

> wood.mc2 = wood.mc^2
> woodlm2.out = lm(wood.str ~ wood.sg + wood.mc + wood.mc2)
> summary(woodlm2.out)

Call:
lm(formula = wood.str ~ wood.sg + wood.mc + wood.mc2)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -42.59511    8.49356  -5.015 0.002416 **
wood.sg       9.68175    0.72851  13.290 1.12e-05 ***
wood.mc      10.42822    1.71125   6.094 0.000889 ***
wood.mc2     -0.54221    0.08672  -6.252 0.000776 ***
---

Residual standard error: 0.1085 on 6 degrees of freedom
Multiple R-Squared: 0.9867, Adjusted R-squared: 0.98
F-statistic: 148.3 on 3 and 6 DF, p-value: 5.13e-06

The fit has improved (s has been reduced from 0.2754 to 0.1085, and $R^2$ has increased from 0.90 to 0.99) and the MC$^2$ term is highly significant. The leverage values change with the model, since X now has one more column.

> hat(model.matrix(woodlm2.out))
 [1] 0.7657191 0.2418690 0.4241376 0.6469168 0.2836093 0.6163116 0.2662545
 [8] 0.2304277 0.3371019 0.1876526

The first case now has the largest leverage value. With an extra predictor, however, $2\bar{h} = 2(4)/10 = 0.8$, so none of these values meets the threshold.
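The comparison with the new threshold, as a quick check:

h.wood2 <- hat(model.matrix(woodlm2.out))
max(h.wood2)       # about 0.766, for case 1
2 * 4 / 10         # the 2p/n guideline is now 0.8, so no case is flagged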

[Figure: leverage values for the quadratic model in predictor space, each case labeled with its $h_{ii}$ and plotted against SG and MC]