STAT5044: Regression and Anova
Inyoung Kim
Outline
1 How to check assumptions
Assumptions
- Linearity: scatter plot, residual plot
- Randomness: runs test; Durbin-Watson test when the data can be arranged in time order
- Constant variance: scatter plot, residual plot (absolute-residual plot); Brown-Forsythe test, Breusch-Pagan test
- Normality of errors: box plot, histogram, normal probability plot; Shapiro-Wilk test, Kolmogorov-Smirnov test, Anderson-Darling test
Remark: the normal probability plot provides no information if the assumption of linearity and/or constant variance is violated.
Influential point
An influential point combines a large absolute residual with high leverage (h_ii).
Leverage: the diagonal value of the hat matrix H,
H = [ h_11 h_12 ... h_1n
      h_21 h_22 ... h_2n
      ...
      h_n1 h_n2 ... h_nn ]
High leverage means large h_ii.
Residual
Three types:
- Ordinary: r_i = y_i − ŷ_i, where E(r_i) = 0 and var(r_i) = (1 − h_ii)σ²
- Standardized: r_i / (σ̂ √(1 − h_ii))
- Studentized (or jackknife): r_i / (σ̂_(i) √(1 − h_ii)) ~ t_{n−p−1}
where σ̂²_(i) = (Σ_j r²_j(i)) / (n − p − 1), p is the number of parameters, h_ii is the leverage (the ith diagonal value of the hat matrix), and
r_j(i) = y_j − ŷ_j(i) = y_j − (β̂_0(i) + β̂_1(i) x_j).
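The three residual types above can be computed directly from the hat matrix. Below is a minimal numpy sketch (Python is used here for illustration, while the slides use R; the data are made up, and p counts the parameters). The last two lines check Fact 1 from the next slides, r_0(0) = r_0/(1 − h_00), by actually refitting without the first point.

```python
import numpy as np

# toy data (made up for illustration)
x = np.arange(10, dtype=float)
y = np.array([98., 135., 162., 178., 221., 232., 283., 300., 374., 395.])
n, p = len(y), 2                       # p = number of parameters (intercept + slope)

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
h = np.diag(H)                         # leverages h_ii
r = y - H @ y                          # ordinary residuals r_i = y_i - yhat_i

sigma2 = r @ r / (n - p)               # sigma^2-hat (MSE)
standardized = r / np.sqrt(sigma2 * (1 - h))

# deleted variance estimate via Fact 2: sum_j r_j(i)^2 = (n-p)*sigma2 - r_i^2/(1-h_ii)
sigma2_del = ((n - p) * sigma2 - r**2 / (1 - h)) / (n - p - 1)
studentized = r / np.sqrt(sigma2_del * (1 - h))

# sanity check of Fact 1 for i = 0: refit without the first point
beta_0 = np.linalg.lstsq(X[1:], y[1:], rcond=None)[0]
jackknife_0 = y[0] - X[0] @ beta_0     # r_0(0) = y_0 - yhat_0(0)
```

This avoids n leave-one-out refits: everything comes from one fit plus the leverages.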
Properties of residuals
- They sum to zero: Σ_i r_i = 0
- They are not independent
Residual (jackknife)
r_i(i) = y_i − ŷ_i(i) ~ N(0, σ²/(1 − h_ii))
where the subindex (i) indicates an estimate computed without point i: the residual for y_i is computed using the regression fit without y_i, then scaled.
Studentized residual: r_i(i) / √(var̂(r_i(i))), where
r_i(i) = y_i − ŷ_i(i) = y_i − [β̂_0(i) + β̂_1(i) x_i].
Studentized residual
r_i(i) / √(var̂(r_i(i))) = r_i / (σ̂_(i) √(1 − h_ii)), by Facts 1 and 2.
Fact 1: r_i(i) = r_i / (1 − h_ii)
Fact 2: Σ_j r²_j(i) = (n − p) σ̂² − r²_i / (1 − h_ii), so
σ̂²_(i) = [(n − p) σ̂² − r²_i / (1 − h_ii)] / (n − p − 1).
Residual
Using Fact 1, r_i(i) = r_i / (1 − h_ii), we have
Var(r_i(i)) = Var(r_i) / (1 − h_ii)² = σ² / (1 − h_ii).
But σ² is unknown, so we use σ̂²_(i).
Residual
Studentized residual:
[r_i / (1 − h_ii)] / √(σ̂²_(i) / (1 − h_ii)) = r_i / √(σ̂²_(i) (1 − h_ii))
where σ̂²_(i) = Σ_j r²_j(i) / (n − p − 1) and Σ_j r²_j(i) = (n − p) σ̂² − r²_i / (1 − h_ii).
NOTE: flag a large residual if |r_j(i)| > 3 after studentizing.
An expression for the distribution of the standardized residuals was obtained (Weisberg, 1985).
Studentized residual
r_i(i) / √(var̂(r_i(i))) = r_i / (σ̂_(i) √(1 − h_ii)) ~ t_{n−p−1}
You don't need to know how to prove this in our class (it is beyond our scope).
Comparison with the standardized residual
Standardized residual: (r_i − 0) / √(var(r_i)) = (r_i − 0) / √(σ² (1 − h_ii)) ≈ r_i / √(σ̂² (1 − h_ii))
If there are outliers with large absolute residuals, then σ̂² may not be a good estimate.
Residuals are not independent and have different variances, so the distribution of the standardized residual is not a t distribution. People usually ignore these problems.
Residual plots in R
> lmfit <- lm(y ~ x)
> plot(fitted(lmfit), residuals(lmfit), xlab="Fitted", ylab="Residuals")
> abline(h=0)
> plot(fitted(lmfit), abs(residuals(lmfit)), xlab="Fitted", ylab="|Residuals|")
Residual plots
[figure: residuals (and absolute residuals) plotted against fitted values]
Leverage
H = X (XᵗX)⁻¹ Xᵗ
Let xᵢᵗ = (1, x_i), let X have rows x₁ᵗ, ..., xₙᵗ, and let A = (XᵗX)⁻¹. Then
H (n×n) = X A Xᵗ, so the (i,j)th element of H is xᵢᵗ A xⱼ.
NOTE: for simple linear regression,
A = (XᵗX)⁻¹ = [ 1/n + x̄²/S_xx   −x̄/S_xx
                −x̄/S_xx          1/S_xx ]
Leverage
The (i,j)th element of H is (1, x_i) (XᵗX)⁻¹ (1, x_j)ᵗ; h_ii is the leverage of the ith point:
h_ii = 1/n + (x_i − x̄)² / S_xx   (check!)
High-leverage point: h_ii is large, that is, (x_i − x̄)² is large.
1/n ≤ h_ii ≤ 1
Idea: if the design is regular and n is large (n → ∞),
h_ii = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)² = O(1/n) → 0.
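The closed-form leverage above can be checked numerically against the diagonal of the hat matrix. A short numpy sketch (Python for illustration; the x values are made up, with one point deliberately far from the mean):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 10.])   # made-up predictor; 10 lies far from the mean
n = len(x)
X = np.column_stack([np.ones(n), x])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages from the hat matrix

# closed form for simple linear regression: h_ii = 1/n + (x_i - xbar)^2 / S_xx
Sxx = np.sum((x - x.mean())**2)
h_formula = 1/n + (x - x.mean())**2 / Sxx
```

As a by-product, the leverages sum to the number of parameters (trace(H) = p = 2 here), which is why a common rule of thumb flags h_ii above 2p/n.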
Why is leverage in this range?
Since H is symmetric and idempotent, h_ii = Σ_j h²_ji, so
Σ_{j≠i} h²_ji = h_ii − h²_ii ≥ 0.
Hence 0 ≤ h_ii (1 − h_ii). Since h_ii ≥ 0, we get 1 − h_ii ≥ 0, that is, h_ii ≤ 1.
We also know that h_ii ≥ 1/n because h_ii = 1/n + (x_i − x̄)²/S_xx ≥ 1/n.
Cook's distance
Measure how influential point i is by comparing the fitted values ŷ_j with the fitted values ŷ_j(i) obtained without observation i:
ŷ_(i) = (ŷ_1(i), ŷ_2(i), ..., ŷ_n(i))ᵗ
where the subindex (i) indicates that the fitted values are obtained using all observations except the ith observation.
The ith Cook's distance is
D_i = {ŷ − ŷ_(i)}ᵗ {ŷ − ŷ_(i)} / (p σ̂²)
where ŷ = X β̂ and ŷ_(i) = X β̂_(i).
Cook's distance
D_i = {β̂ − β̂_(i)}ᵗ XᵗX {β̂ − β̂_(i)} / (p σ̂²), compared against the F_{p,n−p} distribution.
Identify the points that have relatively large Cook's distance using Fact 3:
β̂ − β̂_(i) = [r_i / (1 − h_ii)] (XᵗX)⁻¹ x_i
so that
D_i = (r_i / (1 − h_ii))² xᵢᵗ (XᵗX)⁻¹ (XᵗX) (XᵗX)⁻¹ x_i / (p σ̂²).
Cook's distance
D_i = (r_i / (1 − h_ii))² xᵢᵗ (XᵗX)⁻¹ (XᵗX) (XᵗX)⁻¹ x_i / (p σ̂²) depends on two factors:
- the size of the residual r_i
- the leverage value h_ii
The larger either r_i or h_ii is, the larger D_i. The ith case can be influential (1) by having a large residual and only a moderate leverage value h_ii, (2) by having a large leverage value h_ii with only a moderately sized residual, or (3) by having both a large residual and a large leverage value.
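The Fact 3 form of D_i needs no refitting, since xᵢᵗ(XᵗX)⁻¹xᵢ = h_ii. The numpy sketch below (Python for illustration; made-up data with one high-leverage, large-residual point) computes D_i that way and checks it against the definition {ŷ − ŷ_(i)}ᵗ{ŷ − ŷ_(i)}/(pσ̂²) for one point by actually refitting:

```python
import numpy as np

# made-up data; the last point has high leverage and a large residual
x = np.array([0., 1., 2., 3., 4., 5., 6., 7., 8., 20.])
y = np.array([1., 3., 2., 5., 4., 6., 7., 6., 9., 30.])
n, p = len(y), 2
X = np.column_stack([np.ones(n), x])
A = np.linalg.inv(X.T @ X)
h = np.diag(X @ A @ X.T)
r = y - X @ A @ X.T @ y
sigma2 = r @ r / (n - p)

# Fact 3 form: D_i = (r_i/(1-h_ii))^2 * x_i'(X'X)^{-1}x_i / (p*sigma2)
D = (r / (1 - h))**2 * h / (p * sigma2)   # note x_i'(X'X)^{-1}x_i = h_ii

# definition form for the last point: compare yhat with yhat_(i) from a refit
i = n - 1
beta_i = np.linalg.lstsq(np.delete(X, i, 0), np.delete(y, i), rcond=None)[0]
diff = X @ A @ X.T @ y - X @ beta_i       # yhat - yhat_(i)
D_def = diff @ diff / (p * sigma2)
```

The agreement of the two forms is exact, not approximate, which is what makes Cook's distance cheap to compute for every observation at once.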
Cook's distance in R
library(stats)    # for cooks.distance
library(faraway)  # for halfnorm
lmfit <- lm(y ~ x)
cook <- cooks.distance(lmfit)
par(mfcol=c(1,2))
halfnorm(cook, 3, ylab="Cook's dist")
boxplot(cook)
Cook's distance in R
[figure: half-normal plot of Cook's distances and boxplot]
Randomness: runs test and Durbin-Watson test
Runs test:
- Order the residuals (in time or x order).
- Count the number of runs (r) and the numbers of positive and negative residuals, say n_1 and n_2.
- If n_1 ≤ 20 and n_2 ≤ 20, reject the hypothesis of randomness if r < r_L or if r > r_U, where r_L and r_U are the lower and upper critical values given in Table A30 (handout).
- For large sample sizes, reject the hypothesis of randomness if |z| > z_{α/2}, where
z = (r − μ ± 0.5)/σ   (0.5 is a continuity correction, applied toward μ)
with μ = 1 + 2n_1n_2/(n_1 + n_2) and σ² = 2n_1n_2(2n_1n_2 − n_1 − n_2) / [(n_1 + n_2)²(n_1 + n_2 − 1)].
Example of randomness: runs test
> x <- c(0:9)
> y <- c(98, 135, 162, 178, 221, 232, 283, 300, 374, 395)
> lmfit <- lm(y ~ x)
> residuals(lmfit)
Example of randomness: runs test
How to do the runs test?
Runs: (+ + +) (− − − − −) (+ +)
the number of runs = 3, the number of positives = 5, the number of negatives = 5
Using Table A30, r_L = 2 and r_U = 10. If r < r_L or r > r_U, reject the hypothesis of randomness.
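The counts in this example, and the large-sample z statistic from the previous slide, can be sketched in a few lines (Python for illustration; the sign pattern is the one shown above, and the normal approximation is really meant for n_1, n_2 > 20):

```python
from math import sqrt

# sign pattern of the residuals from the example above
signs = [+1, +1, +1, -1, -1, -1, -1, -1, +1, +1]
runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a * b < 0)  # count sign changes
n1 = signs.count(+1)
n2 = signs.count(-1)

# large-sample approximation (meant for n1, n2 > 20; shown here on small data)
mu = 1 + 2 * n1 * n2 / (n1 + n2)
sigma2 = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / ((n1 + n2)**2 * (n1 + n2 - 1))
z = (runs - mu) / sqrt(sigma2)
```

Here z is negative because 3 runs is fewer than the μ = 6 expected under randomness, consistent with the table-based rejection at r_L = 2 being borderline.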
Runs test in R
library(lawstat)
lmfit <- lm(y ~ x)
runs.test(residuals(lmfit))

> runs.test(residuals(lmfit))
Runs Test - Two sided
data: residuals(lmfit)
Standardized Runs Statistic = , p-value =
Randomness: Durbin-Watson test
The Durbin-Watson test checks whether the error terms ε_i are independent (H_0: ρ = 0). The test statistic is
D = Σ_{t=2}^{n} (r_t − r_{t−1})² / Σ_{t=1}^{n} r_t²
where r_t = Y_t − Ŷ_t.
If D > d_U, conclude H_0. If D < d_L, conclude H_a. If d_L < D < d_U, the test is inconclusive.
d_L and d_U are selected based on the level of the test, the number of X variables (p − 1), and the sample size n.
DW test in R
> library(lmtest)
> lmfit <- lm(y ~ x)
> dwtest(lmfit)

Durbin-Watson test
data: lmfit
DW = 1.875, p-value =
alternative hypothesis: true autocorrelation is greater than 0
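The D statistic itself is just a ratio of sums of squares of the residuals kept in time order. A numpy sketch on the earlier example data (Python for illustration; dwtest in R additionally supplies the p-value, which this sketch does not):

```python
import numpy as np

# the example data again, fit by least squares; residuals kept in time order
x = np.arange(10, dtype=float)
y = np.array([98., 135., 162., 178., 221., 232., 283., 300., 374., 395.])
X = np.column_stack([np.ones(10), x])
r = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# D = sum_{t=2}^n (r_t - r_{t-1})^2 / sum_{t=1}^n r_t^2
D = float(np.sum(np.diff(r)**2) / np.sum(r**2))
```

D always lies in [0, 4], with values near 2 indicating no autocorrelation (roughly D ≈ 2(1 − ρ̂)).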
Constant variance: Brown-Forsythe and Breusch-Pagan tests
Brown-Forsythe (Levene-type) test:
r_i1, r_i2: the ith residual in group 1 and group 2
n_1, n_2: the sample size of each group
r̃_1, r̃_2: the median residual of each group
d_i1 = |r_i1 − r̃_1|, d_i2 = |r_i2 − r̃_2|
The two-sample t test statistic is
t_BF = (d̄_1 − d̄_2) / (s √(1/n_1 + 1/n_2))
where s² = [Σ_i (d_i1 − d̄_1)² + Σ_i (d_i2 − d̄_2)²] / (n − 2).
Breusch-Pagan test: to test H_0: γ_1 = 0 in
log_e σ_i² = γ_0 + γ_1 X_i
The test statistic is
X²_BP = (SSR*/2) / (SSE/n)²
where SSR* is the regression sum of squares when regressing r² on X and SSE is the error sum of squares when regressing Y on X.
BF tests in R
# best way to split into two groups: one with low values and the other with large values of X
g1 <- c( , , , , )
g2 <- c( , , , , )
d1 <- abs(g1 - median(g1))
d2 <- abs(g2 - median(g2))
t.test(d1, d2)

Welch Two Sample t-test
data: d1 and d2
t = , df = 7.11, p-value =
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean of x  mean of y
BF tests in R
library(lawstat)
lmfit <- lm(y ~ x)
levene.test(residuals(lmfit), group)

> levene.test(residuals(lmfit), group=c(rep(1,5), rep(0,5)))
Classical Levene's test based on the absolute deviations from the mean
data: residuals(lmfit)
Test Statistic = 0.0708, p-value =
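The t_BF statistic from the formula two slides back can be computed directly from the two groups of residuals. A numpy sketch (Python for illustration; the residual values are made up, with group 2 deliberately more spread out):

```python
import numpy as np

# made-up residuals: group 1 from low x, group 2 (more spread) from high x
g1 = np.array([2.1, -1.3, 0.4, -0.8, 1.2])
g2 = np.array([5.6, -4.2, 3.8, -6.1, 4.9])
d1 = np.abs(g1 - np.median(g1))   # absolute deviations from the group medians
d2 = np.abs(g2 - np.median(g2))
n1, n2 = len(d1), len(d2)

# pooled variance and the Brown-Forsythe two-sample t statistic from the slide
s2 = (np.sum((d1 - d1.mean())**2) + np.sum((d2 - d2.mean())**2)) / (n1 + n2 - 2)
t_bf = float((d1.mean() - d2.mean()) / np.sqrt(s2 * (1/n1 + 1/n2)))
```

A large |t_BF| (compared with t_{n−2}) indicates that the residual spread differs between the low-x and high-x groups, that is, nonconstant variance.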
BP test in R
library(lmtest)
lmfit <- lm(y ~ x)
bptest(lmfit)

> bptest(lmfit)
studentized Breusch-Pagan test
data: lmfit
BP = 3.0628, df = 1, p-value =
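The X²_BP statistic is built from two regressions: Y on X (giving SSE) and r² on X (giving SSR*). A numpy sketch of the original (non-studentized) form from the slide, on simulated heteroskedastic data (Python for illustration; R's bptest reports the studentized variant by default, so the numbers will differ):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(0, 10, n))
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0 + 0.3 * x)   # error s.d. grows with x

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X @ beta
sse = float(np.sum(r**2))                     # SSE from regressing y on x

# regress r^2 on x; SSR* is the regression sum of squares of that fit
g = np.linalg.lstsq(X, r**2, rcond=None)[0]
ssr_star = float(np.sum((X @ g - np.mean(r**2))**2))

bp = (ssr_star / 2) / (sse / n)**2            # X^2_BP, compared to chi-square(1)
```

Under H_0 (constant variance) X²_BP is approximately chi-square with 1 degree of freedom here, since only one γ coefficient is tested.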
Test of normality
Shapiro-Wilk test: H_0: the sample y_1, ..., y_n came from a normally distributed population. The test statistic is
W = (Σ_i a_i y_(i))² / Σ_{i=1}^{n} (y_(i) − ȳ)²
where y_(i) is the ith order statistic and the constants a_i are given by
(a_1, ..., a_n) = mᵗ V⁻¹ / (mᵗ V⁻¹ V⁻¹ m)^{1/2}
where m = (m_1, ..., m_n)ᵗ, m_i is the expected value of the ith order statistic of iid random variables from the standard normal distribution, and V is the covariance matrix of those order statistics.
If W is too small, reject the null hypothesis.
Shapiro-Wilk in R
library(stats)
lmfit <- lm(y ~ x)
shapiro.test(residuals(lmfit))

> shapiro.test(residuals(lmfit))
Shapiro-Wilk normality test
data: residuals(lmfit)
W = 0.9073, p-value =
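The same test is available in Python via scipy, which is convenient for checking that W behaves as described, near 1 for data actually drawn from a normal distribution (a sketch; the simulated sample stands in for regression residuals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(size=50)          # stand-in for regression residuals
W, pval = stats.shapiro(sample)       # small W => evidence against normality
```

scipy computes the a_i coefficients internally, so only the sample is needed.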
Test of normality
Kolmogorov-Smirnov test: the empirical distribution function F_n for n iid observations Y_i is defined as
F_n(y) = (1/n) Σ_{i=1}^{n} I(Y_i ≤ y)
where I(·) is the indicator function. The Kolmogorov-Smirnov statistic is
D_n = sup_y |F_n(y) − F(y)|
If D_n is big, reject the null.
Correlation test: the idea is to compute the correlation between the expected normal quantiles and the observed order statistics.
Anderson-Darling test: a distance (empirical distribution) test; use with small sample sizes n.
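Because F_n is a step function, the supremum D_n is attained at the jump points, so it can be computed from the sorted sample alone. A numpy/scipy sketch (Python for illustration) that also cross-checks against scipy's kstest:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
ysamp = np.sort(rng.normal(size=100))
n = len(ysamp)

# D_n = sup_y |F_n(y) - F(y)|; the sup is attained at the jump points of F_n
F = stats.norm.cdf(ysamp)
Dn = max(np.max(np.arange(1, n + 1) / n - F),   # F_n just after each jump
         np.max(F - np.arange(0, n) / n))       # F_n just before each jump

D_ref = stats.kstest(ysamp, 'norm').statistic   # same statistic from scipy
```

The two one-sided maxima are needed because the largest gap can occur on either side of each step.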
Anderson-Darling test in R
library(nortest)
ad.test(residuals(lmfit))

> ad.test(residuals(lmfit))
Anderson-Darling normality test
data: residuals(lmfit)
A = 0.4495, p-value =
PP plot and QQ plot
Plots for comparing two probability distributions. There are two basic types, the probability-probability plot and the quantile-quantile plot.
A plot of points whose coordinates are the cumulative probabilities {p_x(q), p_y(q)} for different values of q is a probability-probability plot, while a plot of the points whose coordinates are the quantiles {q_x(p), q_y(p)} for different values of p is a quantile-quantile plot.
The latter is the more frequently used of the two types, and one common use is to investigate the assumption that a set of data is from a normal distribution. For example, plot the ordered sample values y_(1), ..., y_(n) against the quantiles of a standard normal distribution, Φ⁻¹(p_i), where
p_i = (i − 1/2)/n and Φ(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−μ²/2} dμ.
This is usually known as a normal probability plot.
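The coordinates of a normal probability plot can be built with nothing beyond the standard library, using the p_i = (i − 1/2)/n positions above (a sketch in Python; the sample values are made up):

```python
from statistics import NormalDist

# made-up sample; pair its order statistics with standard normal quantiles
ysorted = sorted([2.3, -0.5, 1.1, 0.2, -1.7, 0.9, -0.1, 1.8])
n = len(ysorted)
p = [(i - 0.5) / n for i in range(1, n + 1)]       # p_i = (i - 1/2)/n
theo_q = [NormalDist().inv_cdf(pi) for pi in p]    # Phi^{-1}(p_i)
pairs = list(zip(theo_q, ysorted))                 # points of the normal probability plot
```

If the sample is approximately normal, these points fall close to a straight line; departures in the tails show up as curvature at the ends of the plot.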
Normal QQ plot in R
library(faraway)
qqnorm(residuals(lmfit), ylab="Residuals")
qqline(residuals(lmfit))
Normal QQ plot in R
[figure: normal Q-Q plot of the residuals and histogram of residuals(lmfit)]
Lack of fit test
Idea: if you have multiple observations of y at the same x values, you can use these to test for lack of fit.
Basis: if the fit is good, the fitted line should go through the mean of the y's at each x; if the fit is bad, the fitted values should differ from the means.
Linear lack of fit test
This test assumes variance homogeneity.
Goal: check the linearity of the conditional mean of Y given X.
Requirement: one has to have replicates in X.
Data:
x_1: y_11, y_12, ..., y_1n_1
x_2: y_21, y_22, ..., y_2n_2
...
x_k: y_k1, y_k2, ..., y_kn_k
Some of the n_1, n_2, ..., n_k have to be > 1.
Linear lack of fit test
Model: y_ij = β_0 + β_1 x_i + ε_ij, i = 1, ..., k, j = 1, 2, ..., n_i, where ε_ij ~ [0, σ²]
Model: y_ij = β_0 + β_1 x_i + σ ε_ij, i = 1, ..., k, j = 1, 2, ..., n_i, where ε_ij ~ [0, 1]
These are the same model.
Linear lack of fit test
Model: y_ij = β_0 + β_1 x_i + σ ε_ij, i = 1, ..., k, j = 1, 2, ..., n_i, where ε_ij ~ [0, 1]
How many replicates in total? n_1 + n_2 + ... + n_k = n.
Remark 1: assume independent, normally distributed errors with a constant variance.
Linear lack of fit test
In matrix form,
y = (y_11, ..., y_1n_1, y_21, ..., y_2n_2, ..., y_k1, ..., y_kn_k)ᵗ = X (β_0, β_1)ᵗ + ε
where X stacks the blocks (1_{n_i}, x_i 1_{n_i}) for i = 1, ..., k, and 1_{n_i} denotes a column vector of n_i ones.
ANOVA table for lack of fit test
ANOVA decomposition: model SS = ΣΣ (Ŷ_ij − Ȳ)², residual SS = ΣΣ (Y_ij − Ŷ_ij)², total SS = ΣΣ (Y_ij − Ŷ_ij)² + ΣΣ (Ŷ_ij − Ȳ)².
Decompose the residual sum of squares, SSE = SSPE + SSLOF, using
Y_ij − Ŷ_ij = (Y_ij − Ȳ_i) + (Ȳ_i − Ŷ_ij)
SSPE: sum of squared pure errors = ΣΣ (Y_ij − Ȳ_i)²
SSLOF: sum of squares for lack of fit = ΣΣ (Ŷ_ij − Ȳ_i)²
H_0: the linear model fits the data well. H_1: the linear model does not fit the data.
If SSLOF is large, there is a lack of fit.
F = [ΣΣ (Ŷ_ij − Ȳ_i)² / df_1] / [ΣΣ (Y_ij − Ȳ_i)² / df_2] = (SSLOF/df_1) / (SSPE/df_2) ~ F_{df_1, df_2}
Reject H_0 if F > F_{df_1, df_2}(1 − α) for a level-α test.
Degrees of freedom in ANOVA
Find df_1 and df_2. Think of the example of two populations, where we used the pooled sample variance
s_p² = [Σ_j (Y_1j − ȳ_1)² + Σ_j (Y_2j − ȳ_2)²] / (n_1 + n_2 − 2).
Now we have k groups:
s_p² = ΣΣ (y_ij − ȳ_i)² / (n_1 + n_2 + ... + n_k − k) = SSPE / (n − k)
df_2 = n − k, df_1 = df(residual) − df_2 = (n − 2) − (n − k) = k − 2.
ANOVA table
Source       SS                   df
Regression   ΣΣ (Ŷ_ij − Ȳ)²       1
Residual     ΣΣ (Y_ij − Ŷ_ij)²    n − 2
  LoF        ΣΣ (Ŷ_ij − Ȳ_i)²     k − 2
  PE         ΣΣ (Y_ij − Ȳ_i)²     n − k
F_LOF = [SSLOF / (k − 2)] / [SSPE / (n − k)] ~ F_{k−2, n−k}
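The whole table can be reproduced from replicated data in a few lines. A numpy sketch (Python for illustration; the replicated data are made up), which also confirms the decomposition SSE = SSPE + SSLOF:

```python
import numpy as np

# made-up replicated data: k = 4 distinct x values with repeats
x = np.array([1., 1., 1., 2., 2., 3., 3., 4., 4., 4.])
y = np.array([2.1, 1.9, 2.3, 3.8, 4.1, 6.5, 6.3, 8.2, 7.9, 8.1])
n, k = len(y), len(np.unique(x))

X = np.column_stack([np.ones(n), x])
yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]      # fitted line
ybar_i = np.array([y[x == xi].mean() for xi in x])    # group mean for each obs

sspe = float(np.sum((y - ybar_i)**2))      # pure error
sslof = float(np.sum((yhat - ybar_i)**2))  # lack of fit
F = (sslof / (k - 2)) / (sspe / (n - k))   # compare with F_{k-2, n-k}
```

The decomposition is exact because, within a group, ŷ is constant and the deviations from the group mean sum to zero, so the cross term vanishes.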
SSLOF and SSPE
In matrix form,
SSLOF = yᵗ A_1 y = yᵗ (J* − H) y
SSPE = yᵗ A_2 y = yᵗ (I − J*) y
where J* = blockdiag((1/n_1) J_{n_1×n_1}, ..., (1/n_k) J_{n_k×n_k}) and J_{n×n} denotes the n×n matrix of ones.
Remedial actions
- Change the model if it appears there is nonlinearity but homogeneity of variance
- Transform if there is heterogeneity of variance and nonlinearity
- Consider weighted least squares if there is only heterogeneity of variance
- Delete outliers
- Fit a robust model (loess, etc.)
More informationAny of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure.
STATGRAPHICS Rev. 9/13/213 Calibration Models Summary... 1 Data Input... 3 Analysis Summary... 5 Analysis Options... 7 Plot of Fitted Model... 9 Predicted Values... 1 Confidence Intervals... 11 Observed
More informationHandout 4: Simple Linear Regression
Handout 4: Simple Linear Regression By: Brandon Berman The following problem comes from Kokoska s Introductory Statistics: A Problem-Solving Approach. The data can be read in to R using the following code:
More informationSolutions to Final STAT 421, Fall 2008
Solutions to Final STAT 421, Fall 2008 Fritz Scholz 1. (8) Two treatments A and B were randomly assigned to 8 subjects (4 subjects to each treatment) with the following responses: 0, 1, 3, 6 and 5, 7,
More informationEconometrics of Panel Data
Econometrics of Panel Data Jakub Mućk Meeting # 4 Jakub Mućk Econometrics of Panel Data Meeting # 4 1 / 30 Outline 1 Two-way Error Component Model Fixed effects model Random effects model 2 Non-spherical
More informationSTA2601. Tutorial letter 203/2/2017. Applied Statistics II. Semester 2. Department of Statistics STA2601/203/2/2017. Solutions to Assignment 03
STA60/03//07 Tutorial letter 03//07 Applied Statistics II STA60 Semester Department of Statistics Solutions to Assignment 03 Define tomorrow. university of south africa QUESTION (a) (i) The normal quantile
More informationUNIVERSITY OF MASSACHUSETTS. Department of Mathematics and Statistics. Basic Exam - Applied Statistics. Tuesday, January 17, 2017
UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics Tuesday, January 17, 2017 Work all problems 60 points are needed to pass at the Masters Level and 75
More informationBusiness Statistics. Chapter 14 Introduction to Linear Regression and Correlation Analysis QMIS 220. Dr. Mohammad Zainal
Department of Quantitative Methods & Information Systems Business Statistics Chapter 14 Introduction to Linear Regression and Correlation Analysis QMIS 220 Dr. Mohammad Zainal Chapter Goals After completing
More informationLecture 10 Multiple Linear Regression
Lecture 10 Multiple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: 6.1-6.5 10-1 Topic Overview Multiple Linear Regression Model 10-2 Data for Multiple Regression Y i is the response variable
More informationCorrelation and Simple Linear Regression
Correlation and Simple Linear Regression Sasivimol Rattanasiri, Ph.D Section for Clinical Epidemiology and Biostatistics Ramathibodi Hospital, Mahidol University E-mail: sasivimol.rat@mahidol.ac.th 1 Outline
More informationAnswer Keys to Homework#10
Answer Keys to Homework#10 Problem 1 Use either restricted or unrestricted mixed models. Problem 2 (a) First, the respective means for the 8 level combinations are listed in the following table A B C Mean
More information5. Linear Regression
5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4
More informationBasic Business Statistics 6 th Edition
Basic Business Statistics 6 th Edition Chapter 12 Simple Linear Regression Learning Objectives In this chapter, you learn: How to use regression analysis to predict the value of a dependent variable based
More informationIntroduction to Linear Regression Rebecca C. Steorts September 15, 2015
Introduction to Linear Regression Rebecca C. Steorts September 15, 2015 Today (Re-)Introduction to linear models and the model space What is linear regression Basic properties of linear regression Using
More informationSTATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002
Time allowed: 3 HOURS. STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 This is an open book exam: all course notes and the text are allowed, and you are expected to use your own calculator.
More informationAssignment 9 Answer Keys
Assignment 9 Answer Keys Problem 1 (a) First, the respective means for the 8 level combinations are listed in the following table A B C Mean 26.00 + 34.67 + 39.67 + + 49.33 + 42.33 + + 37.67 + + 54.67
More informationR 2 and F -Tests and ANOVA
R 2 and F -Tests and ANOVA December 6, 2018 1 Partition of Sums of Squares The distance from any point y i in a collection of data, to the mean of the data ȳ, is the deviation, written as y i ȳ. Definition.
More informationLecture 9 SLR in Matrix Form
Lecture 9 SLR in Matrix Form STAT 51 Spring 011 Background Reading KNNL: Chapter 5 9-1 Topic Overview Matrix Equations for SLR Don t focus so much on the matrix arithmetic as on the form of the equations.
More informationApplied Regression Analysis
Applied Regression Analysis Chapter 3 Multiple Linear Regression Hongcheng Li April, 6, 2013 Recall simple linear regression 1 Recall simple linear regression 2 Parameter Estimation 3 Interpretations of
More informationThe ε ij (i.e. the errors or residuals) are normally distributed. This assumption has the least influence on the F test.
Lecture 11 Topic 8: Data Transformations Assumptions of the Analysis of Variance 1. Independence of errors The ε ij (i.e. the errors or residuals) are statistically independent from one another. Failure
More informationModule 6: Model Diagnostics
St@tmaster 02429/MIXED LINEAR MODELS PREPARED BY THE STATISTICS GROUPS AT IMM, DTU AND KU-LIFE Module 6: Model Diagnostics 6.1 Introduction............................... 1 6.2 Linear model diagnostics........................
More informationSTA 4210 Practise set 2a
STA 410 Practise set a For all significance tests, use = 0.05 significance level. S.1. A multiple linear regression model is fit, relating household weekly food expenditures (Y, in $100s) to weekly income
More informationSimple Linear Regression
Simple Linear Regression Reading: Hoff Chapter 9 November 4, 2009 Problem Data: Observe pairs (Y i,x i ),i = 1,... n Response or dependent variable Y Predictor or independent variable X GOALS: Exploring
More information1) Answer the following questions as true (T) or false (F) by circling the appropriate letter.
1) Answer the following questions as true (T) or false (F) by circling the appropriate letter. T F T F T F a) Variance estimates should always be positive, but covariance estimates can be either positive
More informationMATH 644: Regression Analysis Methods
MATH 644: Regression Analysis Methods FINAL EXAM Fall, 2012 INSTRUCTIONS TO STUDENTS: 1. This test contains SIX questions. It comprises ELEVEN printed pages. 2. Answer ALL questions for a total of 100
More informationSTAT 571A Advanced Statistical Regression Analysis. Chapter 3 NOTES Diagnostics and Remedial Measures
STAT 571A Advanced Statistical Regression Analysis Chapter 3 NOTES Diagnostics and Remedial Measures 2015 University of Arizona Statistics GIDP. All rights reserved, except where previous rights exist.
More informationWeighted Least Squares
Weighted Least Squares The standard linear model assumes that Var(ε i ) = σ 2 for i = 1,..., n. As we have seen, however, there are instances where Var(Y X = x i ) = Var(ε i ) = σ2 w i. Here w 1,..., w
More information