STAT5044: Regression and Anova

Inyoung Kim

Outline

1. How to check assumptions

Assumptions

- Linearity: scatter plot, residual plot
- Randomness: runs test; Durbin-Watson test when the data can be arranged in time order
- Constant variance: scatter plot, residual plot (absolute-residual plot); Brown-Forsythe test, Breusch-Pagan test
- Normality of errors: box plot, histogram, normal probability plot; Shapiro-Wilk, Kolmogorov-Smirnov, and Anderson-Darling tests

Remark: the normal probability plot provides no information if the assumptions of linearity and/or constant variance are violated.

Influential point

An influential point combines a large absolute residual with high leverage (h_ii).

Leverage: the ith diagonal value of the hat matrix H,

H = [ h_11  h_12  ...  h_1n
      h_21  h_22  ...  h_2n
       ...
      h_n1  h_n2  ...  h_nn ]

High leverage: large h_ii.

Residual

Three types:

- Ordinary: r_i = y_i − ŷ_i, where E(r_i) = 0 and var(r_i) = (1 − h_ii)σ²
- Standardized: r_i / (σ̂ √(1 − h_ii))
- Studentized deleted (or jackknife): r_i / (σ̂_(i) √(1 − h_ii)) ~ t_(n−p−1)

where σ̂²_(i) = (Σ_{j≠i} r²_{j(i)}) / (n − p − 1), p is the number of parameters, h_ii is the leverage (the ith diagonal value of the hat matrix), and r_{j(i)} = y_j − ŷ_{j(i)} = y_j − (β̂_{0(i)} + β̂_{1(i)} x_j).

Properties of residuals

- They sum to zero: Σ r_i = 0
- They are not independent
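A quick numerical check in R (a minimal sketch; the x and y data are borrowed from the runs-test example later in these slides):

x <- 0:9
y <- c(98, 135, 162, 178, 221, 232, 283, 300, 374, 395)
lmfit <- lm(y ~ x)
sum(residuals(lmfit))        # ~0: residuals sum to zero (up to rounding)
sum(residuals(lmfit) * x)    # ~0: residuals are also orthogonal to x; these
                             # linear constraints make them dependent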

Residual

Jackknife residual:

r_{i(i)} = y_i − ŷ_{i(i)} ~ N(0, σ²/(1 − h_ii))

where the subindex (i) indicates an estimate computed without point i: the residual for y_i is computed using the regression fitted without y_i, then scaled.

Studentized residual: r_{i(i)} / √(var̂(r_{i(i)}))

r_{i(i)} = y_i − ŷ_{i(i)} = y_i − [β̂_{0(i)} + β̂_{1(i)} x_i]

Studentized residual

r_{i(i)} / √(var̂(r_{i(i)})) = r_i / (σ̂_(i) √(1 − h_ii))   by Facts 1 and 2

Fact 1: r_{i(i)} = r_i / (1 − h_ii)

Fact 2: Σ_{j≠i} r²_{j(i)} = (n − p)σ̂² − r_i²/(1 − h_ii), so

σ̂²_(i) = [(n − p)σ̂² − r_i²/(1 − h_ii)] / (n − p − 1)

Residual

Using Fact 1, r_{i(i)} = r_i/(1 − h_ii), we have

Var(r_{i(i)}) = Var(r_i)/(1 − h_ii)² = σ²/(1 − h_ii)

But σ² is unknown, so we use σ̂²_(i):

r_{i(i)} = Y_i − Ŷ_{i(i)} = r_i/(1 − h_ii)

Residual

Studentized residual:

[r_i/(1 − h_ii)] / √(σ̂²_(i)/(1 − h_ii)) = r_i / √(σ̂²_(i)(1 − h_ii))

where σ̂²_(i) = Σ_{j≠i} r²_{j(i)} / (n − p − 1) and Σ_{j≠i} r²_{j(i)} = (n − p)σ̂² − r_i²/(1 − h_ii).

NOTE: flag a case as having a large residual if its studentized deleted residual exceeds 3 in absolute value. An expression for the distribution of the standardized residuals was obtained by Weisberg (1985).

Studentized residual

r_{i(i)} / √(var̂(r_{i(i)})) = r_i / (σ̂_(i) √(1 − h_ii)) ~ t_{n−p−1}

You don't need to know how to prove this in our class! (beyond our class scope)

Comparison with standardized residual

Standardized residual:

(r_i − 0)/√(var(r_i)) = r_i/√(σ²(1 − h_ii)) ≈ r_i/√(σ̂²(1 − h_ii))

- If there are outliers with large absolute residuals, σ̂² may not be a good estimate of σ² (the outliers inflate it)
- Residuals are not independent and have different variances
- The distribution of the standardized residual is not a t distribution
- People usually ignore these problems
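Both versions are built into R; a minimal sketch (continuing with lmfit from the sketch above):

h <- hatvalues(lmfit)            # leverages h_ii
s <- summary(lmfit)$sigma        # sigma-hat from the full fit
r <- residuals(lmfit)
r / (s * sqrt(1 - h))            # standardized residuals; matches rstandard(lmfit)
rstudent(lmfit)                  # studentized deleted residuals, ~ t_{n-p-1}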

Residual plots in R

> lmfit <- lm(y ~ x)
> plot(fitted(lmfit), residuals(lmfit), xlab = "Fitted", ylab = "Residuals")
> abline(h = 0)
> plot(fitted(lmfit), abs(residuals(lmfit)), xlab = "Fitted", ylab = "|Residuals|")

[Figure: residual plot and absolute-residual plot against fitted values]

Leverage

H = X(X^t X)^{−1} X^t

Let x_i^t = (1, x_i), X = (x_1, ..., x_n)^t (rows x_i^t), and A = (X^t X)^{−1}. Then

H (n × n) = X A X^t

and the (i,j)th element of H is x_i^t A x_j.

NOTE: for simple linear regression,

A = (X^t X)^{−1} = [ 1/n + x̄²/S_xx   −x̄/S_xx
                     −x̄/S_xx           1/S_xx ]

Leverage

The (i,j)th element of H is (1, x_i)(X^t X)^{−1}(1, x_j)^t; h_ii is the leverage of the ith point:

h_ii = 1/n + (x_i − x̄)²/S_xx   (check!)

High-leverage point: h_ii is large, i.e., (x_i − x̄)² is large.

1/n ≤ h_ii ≤ 1

Idea: if the design is regular and n is large (n → ∞),

h_ii = 1/n + (x_i − x̄)²/Σ_j(x_j − x̄)² = O(1/n) → 0

Why is leverage in this range?

Because H is symmetric and idempotent, h_ii = Σ_j h²_ji ≥ 0 and

Σ_{j≠i} h²_ji = h_ii − h²_ii

Hence 0 ≤ h_ii(1 − h_ii). Since h_ii ≥ 0, this forces 1 − h_ii ≥ 0, so h_ii ≤ 1.

We also know that h_ii ≥ 1/n because h_ii = 1/n + (x_i − x̄)²/S_xx ≥ 1/n.
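These facts are easy to confirm numerically; a sketch (continuing with lmfit from above):

n <- length(x)
Sxx <- sum((x - mean(x))^2)
cbind(hatvalues(lmfit),                # leverages from R
      1/n + (x - mean(x))^2 / Sxx)     # the SLR formula: identical columns
range(hatvalues(lmfit))                # all values lie in [1/n, 1]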

Cook's distance

Measure influence by comparing the fitted values with and without each observation:

ŷ_(i) = (ŷ_{1(i)}, ŷ_{2(i)}, ..., ŷ_{n(i)})^t

where the subindex (i) indicates the fitted values obtained using all observations except the ith observation.

The ith Cook's distance:

D_i = {ŷ − ŷ_(i)}^t {ŷ − ŷ_(i)} / (p σ̂²)

where ŷ = Xβ̂ and ŷ_(i) = Xβ̂_(i).

Cook's distance

D_i = {β̂ − β̂_(i)}^t X^t X {β̂ − β̂_(i)} / (p σ̂²), compared against the F_{p,n−p} distribution.

Identify the points which have relatively large Cook's distance using Fact 3:

Fact 3: β̂ − β̂_(i) = [r_i/(1 − h_ii)] (X^t X)^{−1} x_i

so that

D_i = [r_i/(1 − h_ii)]² x_i^t (X^t X)^{−1}(X^t X)(X^t X)^{−1} x_i / (p σ̂²)

Cook's distance

D_i depends on two factors:

D_i = [r_i/(1 − h_ii)]² x_i^t (X^t X)^{−1}(X^t X)(X^t X)^{−1} x_i / (p σ̂²) = (r_i²/(p σ̂²)) · h_ii/(1 − h_ii)²

since x_i^t (X^t X)^{−1} x_i = h_ii.

- The size of the residual r_i
- The leverage value h_ii

The larger either |r_i| or h_ii is, the larger D_i. The ith case can be influential (1) by having a large residual and only a moderate leverage value h_ii, or (2) by having a large leverage value h_ii with only a moderately sized residual, or (3) by having both a large residual and a large leverage value.
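The simplified form can be checked against R's built-in function; a sketch (continuing with lmfit from above):

r <- residuals(lmfit)
h <- hatvalues(lmfit)
p <- length(coef(lmfit))             # p = 2 parameters in SLR
s2 <- summary(lmfit)$sigma^2
r^2 * h / (p * s2 * (1 - h)^2)       # identical to cooks.distance(lmfit)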

Cook's distance in R

library(stats)      # for cooks.distance
library(faraway)    # for halfnorm
lmfit <- lm(y ~ x)
cook <- cooks.distance(lmfit)
par(mfcol = c(1, 2))
halfnorm(cook, 3, ylab = "Cook's dist")
boxplot(cook)

[Figure: half-normal plot of Cook's distances with the three largest cases (2, 4, 7) labeled, and a boxplot of the Cook's distances]

Randomness: runs test and Durbin-Watson test

Runs test:

- Order the residuals (in time or x order)
- Count the number of runs r and the numbers of positive and negative residuals, say n_1 and n_2
- If n_1 ≤ 20 and n_2 ≤ 20, reject the hypothesis of randomness if r < r_L or r > r_U, where r_L and r_U are the lower and upper critical values given in Table A30 (handout)
- For large sample sizes, reject the hypothesis of randomness if |z| > z_{α/2}, where

z = (r − μ ± 0.5)/σ   (±0.5 is a continuity correction)

with μ = 1 + 2n_1 n_2/(n_1 + n_2) and σ² = 2n_1 n_2 (2n_1 n_2 − n_1 − n_2) / [(n_1 + n_2)²(n_1 + n_2 − 1)].

Example of randomness: runs test

> x <- c(0:9)
> y <- c(98, 135, 162, 178, 221, 232, 283, 300, 374, 395)
> lmfit <- lm(y ~ x)
> residuals(lmfit)
          1           2           3           4           5           6           7
  6.4363636  10.9393939   5.4424242 -11.0545455  -0.5515152 -22.0484848  -3.5454545
          8           9          10
-19.0424242  22.4606061  10.9636364

Example of randomness: runs test

How to do the runs test?

Runs: (+ + +) (− − − − −) (+ +)
number of runs r = 3; number of positives n_1 = 5; number of negatives n_2 = 5

Using Table A30, r_L = 2 and r_U = 10. Reject the hypothesis of randomness if r < r_L or r > r_U; here 2 ≤ 3 ≤ 10, so we do not reject.
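The run count can also be computed directly in R; a minimal sketch (assuming no residual is exactly zero):

res <- residuals(lmfit)
signs <- sign(res)                  # the +/- pattern of the residuals
r <- sum(diff(signs) != 0) + 1      # number of runs (here 3)
n1 <- sum(signs > 0)                # number of positives (here 5)
n2 <- sum(signs < 0)                # number of negatives (here 5)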

Runs test in R

library(lawstat)
lmfit <- lm(y ~ x)
runs.test(residuals(lmfit))

> runs.test(residuals(lmfit))

Runs Test - Two sided

data: residuals(lmfit)
Standardized Runs Statistic = -0.6708, p-value = 0.5023

Randomness: Durbin-Watson test

Durbin-Watson test: tests whether the error terms ε_i are independent (H_0: ρ = 0). The test statistic is

D = Σ_{t=2}^n (r_t − r_{t−1})² / Σ_{t=1}^n r_t²

where r_t = Y_t − Ŷ_t.

- If D > d_U, conclude H_0
- If D < d_L, conclude H_a
- If d_L < D < d_U, the test is inconclusive

d_L and d_U are selected based on the level of the test, the number of X variables (p − 1), and the sample size n.
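The statistic itself is one line of R; a sketch (continuing with lmfit from the example):

res <- residuals(lmfit)                 # residuals in time/x order
D <- sum(diff(res)^2) / sum(res^2)      # D = sum (r_t - r_{t-1})^2 / sum r_t^2
D                                       # about 1.875 for the example data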

DW test in R

> library(lmtest)
> lmfit <- lm(y ~ x)
> dwtest(lmfit)

Durbin-Watson test

data: lmfit
DW = 1.875, p-value = 0.4968
alternative hypothesis: true autocorrelation is greater than 0

Constant variance: Brown-Forsythe and Breusch-Pagan tests

Brown-Forsythe (Levene) test:

- r_i1, r_i2: the ith residual in group 1 and group 2
- n_1, n_2: the sample size of each group
- r̃_1, r̃_2: the median residual of each group
- d_i1 = |r_i1 − r̃_1|, d_i2 = |r_i2 − r̃_2|

The two-sample t test statistic is

t_BF = (d̄_1 − d̄_2) / (s √(1/n_1 + 1/n_2)),  where s² = [Σ_i (d_i1 − d̄_1)² + Σ_i (d_i2 − d̄_2)²] / (n − 2)

Breusch-Pagan test: to test H_0: γ_1 = 0 in

log_e σ_i² = γ_0 + γ_1 X_i

The test statistic is

X²_BP = (SSR*/2) / (SSE/n)²

where SSR* is the regression sum of squares when regressing r² on X and SSE is the error sum of squares when regressing Y on X.
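The BP statistic can be computed from this definition directly; a sketch (this is the original version, not the studentized one that lmtest::bptest reports, so the two numbers need not agree exactly):

res <- residuals(lmfit)
n <- length(res)
sse <- sum(res^2)                            # SSE from regressing Y on X
aux <- lm(res^2 ~ x)                         # auxiliary regression of r^2 on X
ssr <- sum((fitted(aux) - mean(res^2))^2)    # SSR* of the auxiliary regression
(ssr / 2) / (sse / n)^2                      # X^2_BP, compare with chi-square, df = 1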

BF test in R

# The best way to split into two groups is so that one group has low values
# and the other has high values of X
g1 <- c(6.4363636, 10.9393939, 5.4424242, -11.0545455, -0.5515152)
g2 <- c(-22.0484848, -3.5454545, -19.0424242, 22.4606061, 10.9636364)
d1 <- abs(g1 - median(g1))
d2 <- abs(g2 - median(g2))
t.test(d1, d2)

Welch Two Sample t-test

data: d1 and d2
t = -1.7688, df = 7.11, p-value = 0.1196
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -21.24279   3.02946
sample estimates:
mean of x mean of y
 5.796364 14.903030

BF test in R

library(lawstat)
lmfit <- lm(y ~ x)
levene.test(residuals(lmfit), group)

> levene.test(residuals(lmfit), group = c(rep(1, 5), rep(0, 5)))

Classical Levene's test based on the absolute deviations from the mean

data: residuals(lmfit)
Test Statistic = 0.0708, p-value = 0.7969

BP test in R

library(lmtest)
lmfit <- lm(y ~ x)
bptest(lmfit)

> bptest(lmfit)

studentized Breusch-Pagan test

data: lmfit
BP = 3.0628, df = 1, p-value = 0.0801

Test of normality

Shapiro-Wilk test: H_0: the sample y_1, ..., y_n comes from a normally distributed population. The test statistic is

W = (Σ_i a_i y_(i))² / Σ_{i=1}^n (y_(i) − ȳ)²

where y_(i) is the ith order statistic and the constants a_i are given by

(a_1, ..., a_n) = m^t V^{−1} / (m^t V^{−1} V^{−1} m)^{1/2}

with m = (m_1, ..., m_n)^t, where m_i is the expected value of the ith order statistic of iid random variables from the standard normal distribution and V is the covariance matrix of those order statistics.

If W is too small, reject the null hypothesis.

Shapiro-Wilk in R

library(stats)
lmfit <- lm(y ~ x)
shapiro.test(residuals(lmfit))

> shapiro.test(residuals(lmfit))

Shapiro-Wilk normality test

data: residuals(lmfit)
W = 0.9073, p-value = 0.2632

Test of normality

Kolmogorov-Smirnov: the empirical distribution function F_n for n iid observations Y_i is defined as

F_n(y) = (1/n) Σ_{i=1}^n I(Y_i ≤ y)

where I(·) is the indicator function. The Kolmogorov-Smirnov statistic is

D_n = sup_y |F_n(y) − F(y)|

If D_n is big, reject the null.

Correlation test: the idea is to compute the correlation between the expected normal quantiles and the observed order statistics.

Anderson-Darling test: a distance (empirical distribution) test; use it with small sample sizes, n ≤ 25.
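A KS sketch in R (note that estimating the normal parameters from the data invalidates the usual KS p-value; nortest::lillie.test applies the Lilliefors correction):

res <- residuals(lmfit)
ks.test(res, "pnorm", mean = mean(res), sd = sd(res))   # naive KS against a fitted normal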

Anderson-Darling test in R

library(nortest)
ad.test(residuals(lmfit))

> ad.test(residuals(lmfit))

Anderson-Darling normality test

data: residuals(lmfit)
A = 0.4495, p-value = 0.2168

PP plot and QQ plot

Plots for comparing two probability distributions. There are two basic types: the probability-probability plot and the quantile-quantile plot.

- A plot of points whose coordinates are the cumulative probabilities {p_x(q), p_y(q)} for different values of q is a probability-probability plot.
- A plot of points whose coordinates are the quantiles {q_x(p), q_y(p)} for different values of p is a quantile-quantile plot.

The latter is the more frequently used of the two, and it is used to investigate the assumption that a set of data comes from a normal distribution: for example, plot the ordered sample values y_(1), ..., y_(n) against the quantiles of a standard normal distribution, Φ^{−1}(p_i), where

p_i = (i − 1/2)/n   and   Φ(x) = ∫_{−∞}^x (1/√(2π)) e^{−u²/2} du

This is usually known as a normal probability plot.
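The construction is easy to reproduce from the definition; a sketch (continuing with lmfit from the example):

res <- residuals(lmfit)
n <- length(res)
p <- (1:n - 0.5) / n                   # p_i = (i - 1/2)/n
plot(qnorm(p), sort(res),              # theoretical quantiles vs ordered residuals
     xlab = "Standard normal quantiles", ylab = "Ordered residuals")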

Normal QQ plot in R

library(faraway)
qqnorm(residuals(lmfit), ylab = "Residuals")
qqline(residuals(lmfit))

[Figure: normal Q-Q plot of the residuals with reference line, and a histogram of residuals(lmfit)]

Lack of fit test

Idea: if you have multiple observations of y at some x values, you can use them to test for lack of fit.

Basis: if the fit is good, the fitted line should pass through the mean of the y's at each x. If the fit is bad, the fitted values will differ from those means.

Linear lack of fit test

This test assumes variance homogeneity.

Goal: check the linearity of the conditional mean of Y given X.

Requirement: one has to have replicates in X.

Data:

x_1: y_11, y_12, ..., y_1n_1
x_2: y_21, y_22, ..., y_2n_2
...
x_k: y_k1, y_k2, ..., y_kn_k

Some of the n_1, n_2, ..., n_k have to be > 1.

Linear lack of fit test

Model: y_ij = β_0 + β_1 x_i + ε_ij, i = 1, ..., k, j = 1, ..., n_i, where ε_ij ~ [0, σ²]

Model: y_ij = β_0 + β_1 x_i + σ ε_ij, i = 1, ..., k, j = 1, ..., n_i, where ε_ij ~ [0, 1]

These are the same model.

Linear lack of fit test

Model: y_ij = β_0 + β_1 x_i + σ ε_ij, i = 1, ..., k, j = 1, ..., n_i, where ε_ij ~ [0, 1]

Total number of observations: n_1 + n_2 + ... + n_k = n

Remark 1: the errors are assumed independent and normally distributed with constant variance.

Linear lack of fit test

y = (y_11, ..., y_1n_1, y_21, ..., y_2n_2, ..., y_k1, ..., y_kn_k)^t =

[ 1_{n_1}   x_1 1_{n_1}
  1_{n_2}   x_2 1_{n_2}
   ...
  1_{n_k}   x_k 1_{n_k} ] (β_0, β_1)^t + ε

ANOVA table for lack of fit test

ANOVA: model SS = Σ(Ŷ_ij − Ȳ)², residual SS = Σ(Y_ij − Ŷ_ij)², total SS = Σ(Y_ij − Ŷ_ij)² + Σ(Ŷ_ij − Ȳ)²

Decompose the residual: Y_ij − Ŷ_ij = (Y_ij − Ȳ_i) + (Ȳ_i − Ŷ_ij), so SSE = SSPE + SSLOF, where

- SSPE (sum of squares for pure error) = Σ(Y_ij − Ȳ_i)²
- SSLOF (sum of squares for lack of fit) = Σ(Ȳ_i − Ŷ_ij)²

H_0: the linear model fits the data well; H_1: the linear model does not fit the data.

If SSLOF is large, there is lack of fit. The test statistic is

F = [Σ(Ȳ_i − Ŷ_ij)²/df_1] / [Σ(Y_ij − Ȳ_i)²/df_2] = (SSLOF/df_1) / (SSPE/df_2) ~ F_{df_1, df_2}

Reject H_0 if F > F(1 − α; df_1, df_2) for a level-α test.

Degrees of freedom in ANOVA

Find df_1 and df_2. Think of the example of two populations, where we used the pooled sample variance:

s_p² = [Σ_j (Y_1j − ȳ_1)² + Σ_j (Y_2j − ȳ_2)²] / (n_1 − 1 + n_2 − 1) = [Σ_j (Y_1j − ȳ_1)² + Σ_j (Y_2j − ȳ_2)²] / (n_1 + n_2 − 2)

Now we have k groups:

s_p² = Σ_ij (y_ij − ȳ_i)² / (n_1 − 1 + n_2 − 1 + ... + n_k − 1) = SSPE / (n − k)

df_2 = n − k,  df_1 = df(residual) − df_2 = (n − 2) − (n − k) = k − 2

ANOVA

Source         SS                 df
Regression     Σ(Ŷ_ij − Ȳ)²      1
Residual       Σ(Y_ij − Ŷ_ij)²   n − 2
  Lack of fit  Σ(Ȳ_i − Ŷ_ij)²    k − 2
  Pure error   Σ(Y_ij − Ȳ_i)²    n − k

F_LOF = [SSLOF/(k − 2)] / [SSPE/(n − k)] ~ F_{k−2, n−k}
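In R, the lack-of-fit F test can be obtained by comparing the linear fit against the saturated cell-means model; a sketch (assuming x contains replicated values, which the earlier example data do not):

reduced <- lm(y ~ x)            # linear model: residual df = n - 2
full <- lm(y ~ factor(x))       # one mean per distinct x: residual df = n - k
anova(reduced, full)            # F = [SSLOF/(k-2)] / [SSPE/(n-k)]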

SSLOF and SSPE

SSLOF = y^t A_1 y = y^t (−H + J̃) y
SSPE = y^t A_2 y = y^t (I − J̃) y

where

J̃ = [ (1/n_1) J_{n_1}        0          ...        0
           0          (1/n_2) J_{n_2}   ...        0
          ...
           0                 0          ...  (1/n_k) J_{n_k} ]

and J_{n_i} is the n_i × n_i matrix of ones.
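A quick numerical check of the SSPE quadratic form; a sketch using a small hypothetical design (k = 3 x-values with replicates):

x <- rep(c(1, 2, 3), times = c(2, 3, 2))     # hypothetical replicated design
set.seed(1)
y <- 2 + 3 * x + rnorm(7)
Jt <- matrix(0, 7, 7)
for (v in unique(x)) {                       # build the (1/n_i) J_{n_i} blocks
  idx <- which(x == v)
  Jt[idx, idx] <- 1 / length(idx)
}
drop(t(y) %*% (diag(7) - Jt) %*% y)          # y^t (I - J~) y
sum((y - ave(y, x))^2)                       # same value: SSPE = sum (y_ij - ybar_i)^2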

Remedial actions

- Change the model if there appears to be nonlinearity but homogeneity of variance
- Transform if there is heterogeneity of variance and nonlinearity
- Consider weighted least squares if there is only heterogeneity of variance (see the sketch below)
- Delete outliers
- Fit a robust model (loess, etc.)
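For the weighted least squares remedy, a minimal sketch (the data and the weight model var(ε_i) ∝ x_i² are hypothetical, for illustration only):

set.seed(1)
x <- 1:20
y <- 2 + 3 * x + rnorm(20, sd = 0.5 * x)    # hypothetical heteroscedastic data
wfit <- lm(y ~ x, weights = 1 / x^2)        # w_i proportional to 1/var(eps_i)
summary(wfit)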