STAT5044: Regression and ANOVA
Inyoung Kim
Outline
1 How to check assumptions
Assumptions
- Linearity: scatter plot, residual plot
- Randomness: runs test; Durbin-Watson test when the data can be arranged in time order
- Constant variance: scatter plot, residual plot (absolute-residual plot); Brown-Forsythe test, Breusch-Pagan test
- Normality of errors: box plot, histogram, normal probability plot; Shapiro-Wilk, Kolmogorov-Smirnov, and Anderson-Darling tests
Remark: the normal probability plot provides no information if the assumption of linearity and/or constant variance is violated.
Influential point
- Combination of a large absolute residual and high leverage ($h_{ii}$)
- Leverage: diagonal value of the hat matrix $H$,
$$H = \begin{pmatrix} h_{11} & h_{12} & \cdots & h_{1n} \\ h_{21} & h_{22} & \cdots & h_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ h_{n1} & h_{n2} & \cdots & h_{nn} \end{pmatrix}$$
- High leverage: large $h_{ii}$
Residual
Three types:
- Ordinary: $r_i = y_i - \hat{y}_i$, where $E(r_i) = 0$ and $\mathrm{var}(r_i) = (1-h_{ii})\sigma^2$
- Standardized: $\dfrac{r_i}{\hat{\sigma}\sqrt{1-h_{ii}}}$
- Studentized (or jackknife): $\dfrac{r_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}} \sim t_{n-p-1}$, where $\hat{\sigma}^2_{(i)} = \big(\sum_j r^2_{j(i)}\big)/(n-p-1)$, $p$ is the number of parameters, $h_{ii}$ is the leverage (the $i$th diagonal value of the hat matrix), and $r_{j(i)} = y_j - \hat{y}_{j(i)} = y_j - (\hat{\beta}_{0(i)} + \hat{\beta}_{1(i)} x_j)$
Properties of residuals
- Sum to zero: $\sum_i r_i = 0$
- Are not independent
Residual
Jackknife residual: the residual for $y_i$ computed from the regression fitted without $y_i$, then scaled:
$$r_{i(i)} = y_i - \hat{y}_{i(i)} = y_i - [\hat{\beta}_{0(i)} + \hat{\beta}_{1(i)} x_i] \sim N\!\left(0,\ \frac{\sigma^2}{1-h_{ii}}\right)$$
where the subindex $(i)$ indicates an estimate computed without point $i$.
Studentized residual: $\dfrac{r_{i(i)}}{\sqrt{\widehat{\mathrm{var}}(r_{i(i)})}}$
Studentized residual
$$\frac{r_{i(i)}}{\sqrt{\widehat{\mathrm{var}}(r_{i(i)})}} = \frac{r_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}} \quad \text{by Facts 1 and 2}$$
Fact 1: $r_{i(i)} = \dfrac{r_i}{1-h_{ii}}$
Fact 2: $\hat{\sigma}^2_{(i)} = \dfrac{(n-p)\hat{\sigma}^2 - \dfrac{r_i^2}{1-h_{ii}}}{n-p-1}$
Residual
Using Fact 1, $r_{i(i)} = \dfrac{r_i}{1-h_{ii}}$, we have
$$\mathrm{Var}(r_{i(i)}) = \frac{\mathrm{Var}(r_i)}{(1-h_{ii})^2} = \frac{\sigma^2}{1-h_{ii}}$$
But $\sigma^2$ is unknown, so we use $\hat{\sigma}^2_{(i)}$. In summary, $r_{i(i)} = Y_i - \hat{Y}_{i(i)} = \dfrac{r_i}{1-h_{ii}}$.
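Fact 1 can be checked numerically. A minimal sketch (assuming the x and y of the running example used later in these slides): delete each point, refit, and compare the leave-one-out residuals with $r_i/(1-h_{ii})$.

d <- data.frame(x = x, y = y)
lmfit <- lm(y ~ x, data = d)
r <- residuals(lmfit); h <- hatvalues(lmfit)
r_del <- sapply(seq_len(nrow(d)), function(i) {
  fit_i <- lm(y ~ x, data = d[-i, ])          # fit without point i
  d$y[i] - predict(fit_i, newdata = d[i, ])   # r_{i(i)} = y_i - yhat_{i(i)}
})
all.equal(unname(r_del), unname(r / (1 - h))) # TRUE: Fact 1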
Residual
Studentized residual:
$$\frac{r_i/(1-h_{ii})}{\sqrt{\hat{\sigma}^2_{(i)}/(1-h_{ii})}} = \frac{r_i}{\sqrt{\hat{\sigma}^2_{(i)}(1-h_{ii})}}$$
where
$$\hat{\sigma}^2_{(i)} = \frac{\sum_j r^2_{j(i)}}{n-p-1}, \qquad \sum_j r^2_{j(i)} = (n-p)\hat{\sigma}^2 - \frac{r_i^2}{1-h_{ii}}$$
NOTE: flag observation $j$ as having a large residual if $|r_{j(i)}| > 3$.
An expression for the distribution of the standardized residuals was obtained (Weisberg, 1985).
Studentized residual
$$\frac{r_{i(i)}}{\sqrt{\widehat{\mathrm{var}}(r_{i(i)})}} = \frac{r_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}} \sim t_{n-p-1}$$
You don't need to know how to prove this in our class (it is beyond our scope).
Comparison with the standardized residual
Standardized residual:
$$\frac{r_i - 0}{\sqrt{\mathrm{var}(r_i)}} = \frac{r_i - 0}{\sqrt{\sigma^2(1-h_{ii})}} \approx \frac{r_i}{\sqrt{(1-h_{ii})\hat{\sigma}^2}}$$
- If there are outliers with large absolute residuals, $\hat{\sigma}^2$ may not be a good estimate
- Residuals are not independent and have different variances
- The distribution of the standardized residual is not a t distribution
- People usually ignore these problems
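Both scalings are available directly in base R. A minimal sketch (assuming a fitted lmfit as in the slides that follow) reproducing rstandard() and rstudent() from the formulas above:

r <- residuals(lmfit)      # ordinary residuals r_i
h <- hatvalues(lmfit)      # leverages h_ii
n <- length(r)
p <- length(coef(lmfit))   # number of parameters (p = 2 here)
s <- summary(lmfit)$sigma  # sigma-hat
r / (s * sqrt(1 - h))                               # standardized: matches rstandard(lmfit)
s2i <- ((n - p) * s^2 - r^2/(1 - h)) / (n - p - 1)  # Fact 2
r / sqrt(s2i * (1 - h))                             # studentized: matches rstudent(lmfit)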
Residual plots in R
> lmfit <- lm(y ~ x)
> plot(fitted(lmfit), residuals(lmfit), xlab="Fitted", ylab="Residuals")
> abline(h=0)
> plot(fitted(lmfit), abs(residuals(lmfit)), xlab="Fitted", ylab="|Residuals|")
Residual plots
[Figure: two panels, residuals vs. fitted values and |residuals| vs. fitted values]
Leverage
$$H = X(X^tX)^{-1}X^t$$
Let $x_i^t = (1 \;\; x_i)$, $X = \begin{pmatrix} x_1^t \\ \vdots \\ x_n^t \end{pmatrix}$, $A = (X^tX)^{-1}$. Then
$$H_{n\times n} = XAX^t = \begin{pmatrix} x_1^t \\ \vdots \\ x_n^t \end{pmatrix} A \,(x_1 \;\; x_2 \;\; \cdots \;\; x_n)$$
The $(i,j)$th element of $H$ is $x_i^t A x_j$.
NOTE:
$$A = (X^tX)^{-1} = \begin{pmatrix} \dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}} & -\dfrac{\bar{x}}{S_{xx}} \\ -\dfrac{\bar{x}}{S_{xx}} & \dfrac{1}{S_{xx}} \end{pmatrix}$$
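A minimal sketch (assuming the x of the running example) that builds H explicitly and checks the closed form of A for simple linear regression:

X <- cbind(1, x)                     # rows are x_i^t = (1, x_i)
A <- solve(t(X) %*% X)               # A = (X^t X)^{-1}
H <- X %*% A %*% t(X)                # hat matrix; H[i, j] = x_i^t A x_j
n <- length(x); Sxx <- sum((x - mean(x))^2)
A_closed <- matrix(c(1/n + mean(x)^2/Sxx, -mean(x)/Sxx,
                     -mean(x)/Sxx,         1/Sxx), 2, 2)
all.equal(A, A_closed, check.attributes = FALSE)  # TRUE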
Leverage
The $(i,j)$th element of $H$ is $(1 \;\; x_i)(X^tX)^{-1}\begin{pmatrix}1 \\ x_j\end{pmatrix}$; $h_{ii}$ is the leverage of observation $i$:
$$h_{ii} = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{S_{xx}} \quad \text{(Check!)}$$
High-leverage point: $h_{ii}$ is large, that is, $(x_i-\bar{x})^2$ is large.
$$\frac{1}{n} \le h_{ii} \le 1$$
Idea: if the design is regular and $n$ is large ($n \to \infty$),
$$h_{ii} = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum_j (x_j-\bar{x})^2} = O\!\left(\frac{1}{n}\right) \to 0$$
Why is leverage in this range?
Since $H$ is symmetric and idempotent, $h_{ii} = \sum_j h_{ji}^2 \ge 0$, so
$$\sum_{j\ne i} h_{ji}^2 = h_{ii} - h_{ii}^2$$
Hence $0 \le h_{ii}(1-h_{ii})$. Since $h_{ii} \ge 0$ and $1-h_{ii} \ge 0$, it follows that $h_{ii} \le 1$.
We also know that $h_{ii} \ge \frac{1}{n}$ because $h_{ii} = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{S_{xx}} \ge \frac{1}{n}$.
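A minimal sketch checking the leverage formula and its range on the running example:

lmfit <- lm(y ~ x)
h <- hatvalues(lmfit)                 # diagonal of the hat matrix
n <- length(x); Sxx <- sum((x - mean(x))^2)
all.equal(unname(h), 1/n + (x - mean(x))^2/Sxx)  # TRUE for simple linear regression
all(h >= 1/n & h <= 1)                           # TRUE: 1/n <= h_ii <= 1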
Cook's distance
Measures influence using the differences $\hat{y}_j - \hat{y}_{j(i)}$ between fitted values computed with and without observation $i$:
$$\hat{y}_{(i)} = \begin{pmatrix} \hat{y}_{1(i)} \\ \hat{y}_{2(i)} \\ \vdots \\ \hat{y}_{n(i)} \end{pmatrix}$$
where the subindex $(i)$ indicates that the fitted values are obtained using all observations except the $i$th.
The $i$th Cook's distance:
$$D_i = \frac{\{\hat{y} - \hat{y}_{(i)}\}^t\{\hat{y} - \hat{y}_{(i)}\}}{p\,\hat{\sigma}^2}$$
where $\hat{y} = X\hat{\beta}$ and $\hat{y}_{(i)} = X\hat{\beta}_{(i)}$.
Cook's distance
$$D_i = \{\hat{\beta} - \hat{\beta}_{(i)}\}^t X^tX \{\hat{\beta} - \hat{\beta}_{(i)}\}/(p\,\hat{\sigma}^2) \sim F_{p,n-p}$$
Identify the points that have a relatively large Cook's distance using Fact 3:
Fact 3: $\hat{\beta} - \hat{\beta}_{(i)} = \dfrac{r_i}{1-h_{ii}}(X^tX)^{-1}x_i$
so
$$D_i = \left(\frac{r_i}{1-h_{ii}}\right)^2 \frac{x_i^t(X^tX)^{-1}(X^tX)(X^tX)^{-1}x_i}{p\,\hat{\sigma}^2} = \left(\frac{r_i}{1-h_{ii}}\right)^2 \frac{h_{ii}}{p\,\hat{\sigma}^2}$$
since $x_i^t(X^tX)^{-1}x_i = h_{ii}$.
Cook's distance
$D_i$ depends on two factors:
$$D_i = \left(\frac{r_i}{1-h_{ii}}\right)^2 \frac{h_{ii}}{p\,\hat{\sigma}^2}$$
- the size of the residual $r_i$
- the leverage value $h_{ii}$
The larger either $|r_i|$ or $h_{ii}$ is, the larger $D_i$. The $i$th case can be influential (1) by having a large residual and only a moderate leverage value $h_{ii}$, (2) by having a large leverage value $h_{ii}$ with only a moderately sized residual, or (3) by having both a large residual and a large leverage value.
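The closed form can be verified against R's built-in cooks.distance(); a minimal sketch:

r <- residuals(lmfit); h <- hatvalues(lmfit)
p <- length(coef(lmfit)); s2 <- summary(lmfit)$sigma^2
D <- (r / (1 - h))^2 * h / (p * s2)                  # the formula above
all.equal(unname(D), unname(cooks.distance(lmfit)))  # TRUE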
Cook's distance in R
library(stats)    # <-- for cooks.distance
library(faraway)  # <-- for halfnorm
lmfit <- lm(y ~ x)
cook <- cooks.distance(lmfit)
par(mfcol=c(1,2))
halfnorm(cook, 3, ylab="Cook's dist")
boxplot(cook)
Cook's distance in R
[Figure: half-normal plot of the Cook's distances, with cases 2, 4, and 7 labeled, and a boxplot of the distances]
Randomness: runs test and Durbin-Watson test
Runs test:
1 Order the residuals (in time or x order).
2 Count the number of runs $r$ and the numbers of positive and negative residuals, say $n_1$ and $n_2$.
3 If $n_1 \le 20$ and $n_2 \le 20$, reject the hypothesis of randomness if $r < r_L$ or $r > r_U$, where $r_L$ and $r_U$ are the lower and upper critical values given in Table A30 (handout).
4 For large sample sizes, reject the hypothesis of randomness if $|z| > z_{\alpha/2}$, where
$$z = \frac{r - \mu - 0.5}{\sigma}, \qquad \mu = 1 + \frac{2n_1n_2}{n_1+n_2}, \qquad \sigma^2 = \frac{2n_1n_2(2n_1n_2-n_1-n_2)}{(n_1+n_2)^2(n_1+n_2-1)}$$
(the 0.5 is a continuity correction).
Example of randomness: runs test
> x <- c(0:9)
> y <- c(9.8, 13.5, 16.2, 17.8, 22.1, 23.2, 28.3, 30.0, 37.4, 39.5)
> lmfit <- lm(y ~ x)
> residuals(lmfit)
          1           2           3           4           5           6           7
 0.64363636  1.09393939  0.54424242 -1.10545455 -0.05515152 -2.20484848 -0.35454545
          8           9          10
-1.90424242  2.24606061  1.09636364
Example of randomness: runs test
How to do the runs test?
Runs: (+ + +) (- - - - -) (+ +)
the number of runs r = 3; the number of positives n_1 = 5; the number of negatives n_2 = 5
Using Table A30: r_L = 2 and r_U = 10
If r < r_L or r > r_U, reject the hypothesis of randomness. Here r = 3 lies between 2 and 10, so we do not reject.
Runs test in R
library(lawstat)
lmfit <- lm(y ~ x)
runs.test(residuals(lmfit))

> runs.test(residuals(lmfit))

	Runs Test - Two sided

data:  residuals(lmfit)
Standardized Runs Statistic = -0.6708, p-value = 0.5023
Randomness: Durbin-Watson test
Durbin-Watson test: tests whether the error terms $\varepsilon_i$ are independent ($H_0: \rho = 0$)
Test statistic:
$$D = \frac{\sum_{t=2}^n (r_t - r_{t-1})^2}{\sum_{t=1}^n r_t^2}, \qquad r_t = Y_t - \hat{Y}_t$$
- If $D > d_U$, conclude $H_0$
- If $D < d_L$, conclude $H_a$
- If $d_L \le D \le d_U$, the test is inconclusive
$d_L$ and $d_U$ are selected based on the level of the test, the number of $X$ variables ($p-1$), and the sample size $n$.
DW test in R
> library(lmtest)
> lmfit <- lm(y ~ x)
> dwtest(lmfit)

	Durbin-Watson test

data:  lmfit
DW = 1.875, p-value = 0.4968
alternative hypothesis: true autocorrelation is greater than 0
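dwtest computes exactly the statistic defined on the previous slide; a minimal sketch reproducing it by hand:

r <- residuals(lmfit)
D <- sum(diff(r)^2) / sum(r^2)  # sum_{t=2}^n (r_t - r_{t-1})^2 / sum_t r_t^2
D                               # the same DW statistic that dwtest reports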
Constant variance: Brown-Forsythe and Breusch-Pagan tests
Brown-Forsythe (Levene) test:
- $r_{i1}, r_{i2}$: the $i$th residual for group 1 and group 2
- $n_1, n_2$: the sample size of each group
- $\tilde{r}_1, \tilde{r}_2$: the median of each group
- $d_{i1} = |r_{i1} - \tilde{r}_1|$, $d_{i2} = |r_{i2} - \tilde{r}_2|$
The two-sample t test statistic is
$$t_{BF} = \frac{\bar{d}_1 - \bar{d}_2}{s\sqrt{1/n_1 + 1/n_2}}, \qquad s^2 = \frac{\sum_i (d_{i1}-\bar{d}_1)^2 + \sum_i (d_{i2}-\bar{d}_2)^2}{n-2}$$
Breusch-Pagan test: tests $H_0: \gamma_1 = 0$ in
$$\log_e \sigma_i^2 = \gamma_0 + \gamma_1 X_i$$
Test statistic:
$$X^2_{BP} = \frac{SSR^*/2}{(SSE/n)^2}$$
where $SSR^*$ is the regression sum of squares when regressing $r^2$ on $X$ and $SSE$ is the error sum of squares when regressing $Y$ on $X$.
BF test in R
# the best way to split into two groups is so that one group has low values of X and the other has high values
g1 <- c(0.64363636, 1.09393939, 0.54424242, -1.10545455, -0.05515152)
g2 <- c(-2.20484848, -0.35454545, -1.90424242, 2.24606061, 1.09636364)
d1 <- abs(g1 - median(g1))
d2 <- abs(g2 - median(g2))
t.test(d1, d2)

	Welch Two Sample t-test

data:  d1 and d2
t = -1.7688, df = 7.11, p-value = 0.1196
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.124279  0.302946
sample estimates:
mean of x mean of y
0.5796364 1.4903030
BF test in R
library(lawstat)
lmfit <- lm(y ~ x)
levene.test(residuals(lmfit), group)

> levene.test(residuals(lmfit), group = c(rep(1,5), rep(0,5)))

	Classical Levene's test based on the absolute deviations from the mean

data:  residuals(lmfit)
Test Statistic = 0.0708, p-value = 0.7969
BP test in R
library(lmtest)
lmfit <- lm(y ~ x)
bptest(lmfit)

> bptest(lmfit)

	studentized Breusch-Pagan test

data:  lmfit
BP = 3.0628, df = 1, p-value = 0.0801
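For comparison, the classical (non-studentized) BP statistic defined two slides back can be computed from its definition; a minimal sketch (note that bptest defaults to the studentized version, so the two numbers need not agree):

r2 <- residuals(lmfit)^2
aux <- lm(r2 ~ x)                            # auxiliary regression of r^2 on x
SSR_star <- sum((fitted(aux) - mean(r2))^2)  # SSR* from regressing r^2 on X
SSE <- sum(residuals(lmfit)^2)               # SSE from regressing y on x
n <- length(residuals(lmfit))
X2_BP <- (SSR_star/2) / (SSE/n)^2            # compare with chi-square, df = 1
1 - pchisq(X2_BP, df = 1)                    # p-value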
Test of normality
Shapiro-Wilk test: $H_0$: the sample $y_1,\dots,y_n$ came from a normally distributed population
Test statistic:
$$W = \frac{\left(\sum_i a_i y_{(i)}\right)^2}{\sum_{i=1}^n (y_{(i)} - \bar{y})^2}$$
where $y_{(i)}$ is the $i$th order statistic and the constants $a_i$ are given by
$$(a_1,\dots,a_n) = \frac{m^t V^{-1}}{(m^t V^{-1} V^{-1} m)^{1/2}}$$
with $m = (m_1,\dots,m_n)^t$, where $m_i$ is the expected value of the $i$th order statistic of iid random variables from the standard normal distribution and $V$ is the covariance matrix of those order statistics.
If $W$ is too small, reject the null hypothesis.
Shapiro-Wilk in R
library(stats)
lmfit <- lm(y ~ x)
shapiro.test(residuals(lmfit))

> shapiro.test(residuals(lmfit))

	Shapiro-Wilk normality test

data:  residuals(lmfit)
W = 0.9073, p-value = 0.2632
Test of normality
Kolmogorov-Smirnov test: the empirical distribution function $F_n$ for $n$ iid observations $Y_i$ is defined as
$$F_n(y) = \frac{1}{n}\sum_{i=1}^n I(Y_i \le y)$$
where $I(\cdot)$ is the indicator function. The Kolmogorov-Smirnov statistic is
$$D_n = \sup_y |F_n(y) - F(y)|$$
If $D_n$ is big, reject the null.
Correlation test: idea is to compute the correlation between the expected normal quantiles and the observed order statistics.
Anderson-Darling test: a distance (empirical distribution) test; use with small sample sizes ($n \le 25$).
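The KS test against a fitted normal is available in base R; a minimal sketch (one caveat: using a mean and sd estimated from the same data makes the test conservative, the Lilliefors issue):

r <- residuals(lmfit)
ks.test(r, "pnorm", mean = mean(r), sd = sd(r))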
Anderson-Darling test in R
library(nortest)
ad.test(residuals(lmfit))

> ad.test(residuals(lmfit))

	Anderson-Darling normality test

data:  residuals(lmfit)
A = 0.4495, p-value = 0.2168
PP plot and QQ plot
Plots for comparing two probability distributions. There are two basic types: the probability-probability plot and the quantile-quantile plot.
- A plot of points whose coordinates are the cumulative probabilities $\{p_x(q), p_y(q)\}$ for different values of $q$ is a probability-probability plot.
- A plot of points whose coordinates are the quantiles $\{q_x(p), q_y(p)\}$ for different values of $p$ is a quantile-quantile plot.
The latter is the more frequently used of the two types; it is used to investigate the assumption that a set of data is from a normal distribution. For example, plot the ordered sample values $y_{(1)},\dots,y_{(n)}$ against the quantiles of a standard normal distribution, $\Phi^{-1}(p_i)$, where
$$p_i = \frac{i - \frac{1}{2}}{n}, \qquad \Phi(x) = \int_{-\infty}^x \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}u^2}\,du$$
This is usually known as a normal probability plot.
Normal QQ plot in R
library(faraway)
qqnorm(residuals(lmfit), ylab="Residuals")
qqline(residuals(lmfit))
Normal QQ plot in R
[Figure: normal Q-Q plot of the residuals with a reference line, and a histogram of residuals(lmfit)]
Lack of fit test
Idea: if you have multiple observations of $y$ at some $x$ values, you can use them to test for lack of fit.
Basis: if the fit is good, the fitted line should go through the mean of the $y$'s at each $x$; if the fit is bad, the fitted values will differ from the means.
Linear lack of fit test
This test assumes variance homogeneity.
Goal: check the linearity of the conditional mean of $Y$ given $X$.
Requirement: one has to have replicates in $X$.
Data:
$$\begin{array}{cccc}
x_1 & x_2 & \cdots & x_k \\
\hline
y_{11} & y_{21} & \cdots & y_{k1} \\
y_{12} & y_{22} & \cdots & y_{k2} \\
\vdots & \vdots & & \vdots \\
y_{1n_1} & y_{2n_2} & \cdots & y_{kn_k}
\end{array}$$
Some of the $n_1, n_2, \dots, n_k$ have to be $> 1$.
Linear lack of fit test
Model: $y_{ij} = \beta_0 + \beta_1 x_i + \varepsilon_{ij}$, $i = 1,\dots,k$, $j = 1,\dots,n_i$, where $\varepsilon_{ij} \sim [0, \sigma^2]$
Model: $y_{ij} = \beta_0 + \beta_1 x_i + \sigma\varepsilon_{ij}$, $i = 1,\dots,k$, $j = 1,\dots,n_i$, where $\varepsilon_{ij} \sim [0, 1]$
These are the same model.
Linear lack of fit test
Model: $y_{ij} = \beta_0 + \beta_1 x_i + \sigma\varepsilon_{ij}$, $i = 1,\dots,k$, $j = 1,\dots,n_i$, where $\varepsilon_{ij} \sim [0, 1]$
How many observations in total? $n_1 + n_2 + \cdots + n_k = n$
Remark 1: errors are assumed independent and normally distributed with a constant variance.
Linear lack of fit test
$$y = \begin{pmatrix} y_{11} \\ \vdots \\ y_{1n_1} \\ y_{21} \\ \vdots \\ y_{2n_2} \\ \vdots \\ y_{k1} \\ \vdots \\ y_{kn_k} \end{pmatrix}
= \begin{pmatrix} 1_{n_1} & x_1 1_{n_1} \\ 1_{n_2} & x_2 1_{n_2} \\ \vdots & \vdots \\ 1_{n_k} & x_k 1_{n_k} \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \varepsilon$$
ANOVA table for lack of fit test
$$\begin{array}{ll}
\text{Model} & \sum(\hat{Y}_{ij} - \bar{Y})^2 \\
\text{Residual} & \sum(Y_{ij} - \hat{Y}_{ij})^2 \\
\text{Total} & \sum(Y_{ij} - \hat{Y}_{ij})^2 + \sum(\hat{Y}_{ij} - \bar{Y})^2
\end{array}$$
SSE = SSPE + SSLOF, since $Y_{ij} - \hat{Y}_{ij} = (Y_{ij} - \bar{Y}_i) + (\bar{Y}_i - \hat{Y}_{ij})$
- SSPE: sum of squared pure errors $= \sum(Y_{ij} - \bar{Y}_i)^2$
- SSLOF: sum of squares for lack of fit $= \sum(\hat{Y}_{ij} - \bar{Y}_i)^2$
$H_0$: the linear model fits the data well; $H_1$: the linear model does not fit.
If SSLOF is large, there is a lack of fit:
$$F = \frac{\sum(\hat{Y}_{ij} - \bar{Y}_i)^2/df_1}{\sum(Y_{ij} - \bar{Y}_i)^2/df_2} = \frac{SSLOF/df_1}{SSPE/df_2} \sim F_{df_1, df_2}$$
Reject $H_0$ if $F > F_{df_1, df_2}(1-\alpha)$ for an $\alpha$-level test.
Degrees of freedom in ANOVA
Find $df_1$ and $df_2$. Think about the example of two populations (we used the pooled sample variance):
$$S_p^2 = \frac{\sum_j(Y_{1j} - \bar{y}_1)^2 + \sum_j(Y_{2j} - \bar{y}_2)^2}{n_1 - 1 + n_2 - 1} = \frac{\sum_j(Y_{1j} - \bar{y}_1)^2 + \sum_j(Y_{2j} - \bar{y}_2)^2}{n_1 + n_2 - 2}$$
Now we have $k$ groups:
$$S_p^2 = \frac{\sum_{ij}(y_{ij} - \bar{y}_i)^2}{(n_1 - 1) + (n_2 - 1) + \cdots + (n_k - 1)} = \frac{SSPE}{n - k}$$
so $df_2 = n - k$ and $df_1 = df(\text{residual}) - df_2 = (n-2) - (n-k) = k - 2$.
ANOVA
$$\begin{array}{lll}
\text{Source} & \text{SS} & \text{df} \\
\hline
\text{Regression} & \sum(\hat{Y}_{ij} - \bar{Y})^2 & 1 \\
\text{Residual} & \sum(Y_{ij} - \hat{Y}_{ij})^2 & n-2 \\
\quad\text{LoF} & \sum(\hat{Y}_{ij} - \bar{Y}_i)^2 & k-2 \\
\quad\text{PE} & \sum(Y_{ij} - \bar{Y}_i)^2 & n-k
\end{array}$$
$$F_{LOF} = \frac{SSLOF/(k-2)}{SSPE/(n-k)} \sim F_{k-2,\, n-k}$$
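In R, this lack-of-fit F test can be carried out by comparing the linear fit (reduced model) with the one-mean-per-x model (full model); a minimal sketch, assuming data with replicate observations at some x values:

reduced <- lm(y ~ x)
full    <- lm(y ~ factor(x))  # fits the group mean at each distinct x
anova(reduced, full)          # F = [SSLOF/(k-2)] / [SSPE/(n-k)]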
SSLOF and SSPE
$$SSLOF = y^t A_1 y = y^t(\tilde{J} - H)y, \qquad SSPE = y^t A_2 y = y^t(I - \tilde{J})y$$
where
$$\tilde{J} = \begin{pmatrix} \frac{1}{n_1}J_{n_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{n_2}J_{n_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{n_k}J_{n_k} \end{pmatrix}$$
and $J_{n_i}$ is the $n_i \times n_i$ matrix of ones.
Remedial actions
- Change the model if it appears there is nonlinearity but homogeneity of variance
- Transform if there is heterogeneity of variance and nonlinearity
- Consider weighted least squares if there is only heterogeneity of variance
- Delete outliers
- Fit a robust model (loess, etc.)
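A minimal sketch of two of these remedies (the log transform and the weight recipe below are illustrative choices, not prescriptions for any particular data set):

fit_log <- lm(log(y) ~ x)                 # variance-stabilizing transform
sd_fit  <- lm(abs(residuals(lmfit)) ~ x)  # estimate the SD as a function of x
w       <- 1 / fitted(sd_fit)^2           # weights proportional to 1/variance
fit_wls <- lm(y ~ x, weights = w)         # weighted least squares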