22s:152 Applied Linear Regression
Chapter 8: ANOVA

NOTE: We will meet in the lab on Monday October 10.

One-way ANOVA

Focuses on testing for differences among group means.

Take random samples from each of m populations.
- n_i is the sample size in the ith population, for i = 1, ..., m.
- y_ij is the jth observation in the ith population.

There are a couple of commonly used models for a one-way ANOVA with m groups.

The cell means model:

  Y_ij = µ_i + ε_ij   with ε_ij iid N(0, σ²)
  i = 1, 2, ..., m    j = 1, 2, ..., n_i

So E[Y_1j] = µ_1, and all observations from group 1 have the same mean, µ_1. The mean of group i is µ_i.

The mean parameters to be estimated are µ_1, µ_2, ..., µ_m. There is one noise parameter to estimate: σ².

Estimators:

  µ̂_i = Ȳ_i = (Σ_{j=1}^{n_i} Y_ij) / n_i

The estimator µ̂_i for a group is just the sample group mean.

σ² is estimated using a pooled estimate because constant variance is assumed:

  σ̂² = s²_P = [ (n_1 − 1)s²_1 + (n_2 − 1)s²_2 + ... + (n_m − 1)s²_m ] / (N − m)

where s²_i is the sample variance in the ith group and N = n_1 + ... + n_m is the total sample size.

Pooled estimate of σ: s_P = √(s²_P)

(A numerical sketch of these estimators appears below.)

Now, a different way to parameterize the same situation...

The effects model:

  Y_ij = µ + α_i + ε_ij   with ε_ij iid N(0, σ²)
  i = 1, 2, ..., m        j = 1, 2, ..., n_i

So E[Y_1j] = µ + α_1, and all observations from group 1 have the same mean, µ + α_1.

In this model, there are m groups (m estimated means), and we're using m + 1 parameters to define the mean structure. This is an over-parameterization. Different sets of parameter values (µ, α_1, ..., α_m) can give the same fitted values (i.e. can give the same estimated group means).
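Before turning to an example of this over-parameterization, here is a quick numerical check of the cell means estimators above. This is a minimal sketch with made-up data (the group labels, sample sizes, and parameter values are hypothetical, not from any dataset in these notes):

  # Sketch: verify the cell means estimators on simulated data
  set.seed(1)
  g  <- factor(rep(c("A", "B", "C"), times = c(10, 12, 8)))   # m = 3 groups
  mu <- c(5, 7, 9)[as.integer(g)]                             # true group means
  y  <- rnorm(length(g), mean = mu, sd = 2)

  # mu-hat_i: the sample group means
  tapply(y, g, mean)

  # pooled variance: s^2_P = sum((n_i - 1) s^2_i) / (N - m)
  ni   <- tapply(y, g, length)
  s2   <- tapply(y, g, var)
  s2.P <- sum((ni - 1) * s2) / (sum(ni) - nlevels(g))

  # s_P should equal the residual standard error from the regression fit,
  # since RSS/(n - m) in the one-way ANOVA is exactly the pooled variance
  sqrt(s2.P)
  sigma(lm(y ~ g))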
As an example of the over-parameterization, suppose m = 3, and Ȳ_1 = 10, Ȳ_2 = 20, and Ȳ_3 = 30.

In the over-parameterized effects model, Ŷ_ij = µ̂ + α̂_i for i = 1, 2, 3. Many different combinations of (µ̂, α̂_1, α̂_2, α̂_3) estimates will give me these same estimated group means of (10, 20, 30), for example...

   µ̂    α̂_1   α̂_2   α̂_3      Ŷ_1j   Ŷ_2j   Ŷ_3j
    0    10    20    30        10     20     30
  -10    20    30    40        10     20     30
   20   -10     0    10        10     20     30

This means we have to use a constraint or restriction to make the parameters in the model identifiable (uniquely determined).

The effects model: Y_ij = µ + α_i + ε_ij

The α_m = 0 constraint: set the last group parameter to zero. (Essentially, delete the parameter for the last category.)

Under this constraint, group m is seen as the baseline group...

  α_m = 0, so E[Y_mj] = µ + α_m = µ

- µ represents the mean of the mth group under this constraint.
- α_i is the distance of group i from group m. (The α_i's give distance from the baseline group.)

This may or may not be a useful interpretation for your situation.

Dummy Regressor Coding for the α_m = 0 constraint with m = 3:

  Category   D_1   D_2
  group 1     1     0
  group 2     0     1
  group 3     0     0

This is the coding we've been using so far with our dummy regressors (we'll call this Baseline Coding or Indicator Coding).

Regression Model: Y_i = µ + α_1 D_1i + α_2 D_2i + ε_i

Model by group...
  Group 1: Y_i = µ + α_1 + ε_i
  Group 2: Y_i = µ + α_2 + ε_i
  Group 3: Y_i = µ + ε_i

The effects model: Y_ij = µ + α_i + ε_ij

There is another often-used constraint that produces easily interpretable parameters...

The sum-to-zero constraint: Σ_{i=1}^m α_i = 0, which means

  α_m = −(α_1 + α_2 + ... + α_{m−1})

Only m − 1 dummy variables are needed.

µ is seen as the grand mean, or the average of the population means (nice interpretation).

- If you have balanced data: µ̂ = Ȳ, the overall mean of the sample.
- If you have unbalanced data: µ̂ = (Σ_{i=1}^m Ȳ_i) / m, the mean of the sample means (see the sketch below).
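A quick numerical illustration of the unbalanced case, as a sketch with made-up data. Here contr.sum is R's built-in sum-to-zero coding, which corresponds to the deviation regressors described below:

  # Sketch: with sum-to-zero coding and UNBALANCED data, the intercept is
  # the mean of the group means, not the overall mean (hypothetical data)
  set.seed(2)
  g <- factor(rep(c("A", "B", "C"), times = c(5, 10, 20)))   # unbalanced
  y <- rnorm(length(g), mean = c(2, 4, 9)[as.integer(g)])

  fit <- lm(y ~ g, contrasts = list(g = "contr.sum"))
  coef(fit)[1]               # intercept mu-hat
  mean(tapply(y, g, mean))   # mean of the sample group means: matches
  mean(y)                    # overall sample mean: does not match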
- α_i represents the distance of group i from the grand mean. Thus, α_i is the effect of being in group i (tells us if the mean of group i is up or down from the grand mean).
- There is no baseline group in this interpretation.

Model by group...
  Group 1: Y_i = µ + α_1 + ε_i
  Group 2: Y_i = µ + α_2 + ε_i
  Group 3: Y_i = µ − (α_1 + α_2) + ε_i

Dummy Regressor Coding for the sum-to-zero constraint with m = 3:

  Category   D_1   D_2
  group 1     1     0
  group 2     0     1
  group 3    -1    -1

You still only need 2 dummy variables, as α_3 = −(α_1 + α_2)... that's the restriction we've imposed.

These (1, 0, −1) dummy regressors are called deviation regressors, because the interpretation gives values as distances (or deviations) from the grand mean.

Regression Model (looks the same as indicator coding):

  Y_i = µ + α_1 D_1i + α_2 D_2i + ε_i

Example: Deviation regressors - back to the Pet and Stress data

We'll now use a different dummy regressor coding of the same situation, and we'll use the deviation regressors for the dummy variables.

  Category   D_1   D_2
  Control     1     0
  Friend      0     1
  Pet        -1    -1

> pets=read.csv("pets.csv")
> attach(pets)
> names(pets)
[1] "group" "rate"
> levels(group)
[1] "C" "F" "P"

Create the deviation regressors...

> n=nrow(pets)
> dummy.1=rep(0,n)
> dummy.1[group=="C"]= 1
> dummy.1[group=="P"]= -1
> dummy.2=rep(0,n)
> dummy.2[group=="F"]= 1
> dummy.2[group=="P"]= -1
> data.frame(group,dummy.1,dummy.2)
  group dummy.1 dummy.2
1     P      -1      -1
2     F       0       1
3     P      -1      -1
4     C       1       0
5     C       1       0
...

Regression Model: Y_i = µ + α_1 D_1i + α_2 D_2i + ε_i

> lm.out=lm(rate ~ dummy.1 + dummy.2)
> lm.out$coefficients
(Intercept)     dummy.1     dummy.2
82.44408889  0.07997778  8.88104444
         µ̂         α̂_1         α̂_2
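As an aside, the same fit can be obtained without building the dummy variables by hand, using R's contrasts mechanism. A sketch (this assumes, as levels(group) showed above, that the levels are ordered C, F, P, so that contr.sum reproduces exactly the deviation coding used here):

  # Sketch: contr.sum generates the (1,0), (0,1), (-1,-1) deviation coding
  # automatically; the design matrix is the same as with dummy.1, dummy.2,
  # so the estimates should match lm.out$coefficients above
  # (R names the coefficients group1 and group2 instead of dummy.1, dummy.2)
  lm.sum <- lm(rate ~ group, contrasts = list(group = "contr.sum"))
  lm.sum$coefficients   # should be 82.44409, 0.07998, 8.88104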
Since this is balanced data, the overall mean is µ̂:

> mean(rate)
[1] 82.44409

All three group means, Ȳ_1, Ȳ_2, Ȳ_3:

> tapply(rate,group,mean)
       C        F        P
82.52407 91.32513 73.48307

Control treatment group: µ̂ + α̂_1
> lm.out$coefficients[1]+lm.out$coefficients[2]
[1] 82.52407

Friend treatment group: µ̂ + α̂_2
> lm.out$coefficients[1]+lm.out$coefficients[3]
[1] 91.32513

Pet treatment group: µ̂ − (α̂_1 + α̂_2)
> lm.out$coefficients[1]-(lm.out$coefficients[2]+lm.out$coefficients[3])
[1] 73.48307

You can use the summary statement to get an overall F-test:

> lm.out=lm(rate ~ dummy.1 + dummy.2)
> summary(lm.out)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 82.44409    1.37269  60.060  < 2e-16 ***
dummy.1      0.07998    1.94128   0.041    0.967
dummy.2      8.88104    1.94128   4.575 4.18e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.208 on 42 degrees of freedom
Multiple R-Squared: 0.4014,     Adjusted R-squared: 0.3729
F-statistic: 14.08 on 2 and 42 DF,  p-value: 2.092e-05

The F-statistic is 14.08 and the p-value is 0.00002. We reject the null and conclude that there is statistically significant evidence that at least one of the group means is different from the others.

This F-statistic and p-value are EXACTLY the same as when we fit the model using the α_3 = 0 constraint in the part 1 notes (on p.16).

Hypothesis testing in one-way ANOVA

Cell means model:
  H_0: µ_1 = µ_2 = ... = µ_m
  H_A: at least 1 µ_i different

Effects model:
  H_0: α_1 = α_2 = ... = α_m = 0
  H_A: at least 1 α_i ≠ 0

Both hypotheses are testing the same thing... whether or not all the group means are equal.

ANOVA table and overall F-test

When we represent group by dummy regressors, R sees each dummy variable as a separate covariate (notice how it performs a test for each dummy regressor in the summary). In the pet example, we can test the significance of group by lumping the two covariates together and doing a partial F-test (or an overall F-test in this case, because they were the only predictors in the model), as in the sketch below.
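Here is a minimal sketch of that "lump the covariates together" test written as an explicit comparison of nested models; it should reproduce the overall F-statistic of 14.08 reported above:

  # Sketch: partial F-test for the two dummy regressors as a group, done
  # by comparing the intercept-only model to the full model
  fit.null <- lm(rate ~ 1)                    # reduced model: no group effect
  fit.full <- lm(rate ~ dummy.1 + dummy.2)    # full model: both regressors
  anova(fit.null, fit.full)                   # F = 14.08 on 2 and 42 df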
What about the ANOVA table and the sums of squares?

  RegSS = Σ_{i=1}^n (Ŷ_i − Ȳ)²
  RSS   = Σ_{i=1}^n (Y_i − Ŷ_i)²
  TSS   = Σ_{i=1}^n (Y_i − Ȳ)²

> RegSS=sum((lm.out$fitted.values-mean(rate))^2)
> RegSS
[1] 2387.689
> RSS=sum((rate-lm.out$fitted.values)^2)
> RSS
[1] 3561.299

  Source        Sum of Squares   df   Mean Square          F
  Regression    2387.689          2   1193.844      14.07954
  Residuals     3561.299         42     84.79285
  Total         5948.988         44

When R sees group as a factor (categorical variable), and it's the ONLY predictor, we can get the RegSS from the anova statement.

> lm.out=lm(rate~group)
> anova(lm.out)
Analysis of Variance Table

Response: rate
          Df Sum Sq Mean Sq F value    Pr(>F)
group      2 2387.7  1193.8  14.079 2.092e-05 ***
Residuals 42 3561.3    84.8
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Classical ANOVA sums of squares

Notation for sums of squares in a one-way ANOVA:

  RegSS = SS_group = Σ_{i=1}^m n_i (Ȳ_i − Ȳ)²

where Ȳ_i is the group i mean and Ȳ is the overall sample mean. (Note that Ŷ_ij = Ȳ_i.)

Residual sum of squares:

  RSS = Σ_{i=1}^m Σ_{j=1}^{n_i} (Y_ij − Ȳ_i)²

ANOVA table (with n = Σ n_i the total number of observations):

  Source      Sum of Squares           df     Mean Square             F
  SS_group    Σ_i n_i (Ȳ_i − Ȳ)²       m−1    RegSS/(m−1) = RegMS     RegMS/MSE
  Residuals   Σ_i Σ_j (Y_ij − Ȳ_i)²    n−m    RSS/(n−m) = MSE
  Total       Σ_i Σ_j (Y_ij − Ȳ)²      n−1

Assessing the assumptions of one-way ANOVA

Normal distribution of the response variable in each population (or group):
- histograms, boxplots for sample data from each population (done separately)
- normal qq plot for sample data from each population (done separately)
- normal qq plot of all residuals from the fitted model

Same standard deviation (or variance) in all populations:
- can use Levene's test for homogeneity of variance (but it assumes normality of observations)
- rule of thumb: if the largest sample standard deviation isn't more than twice as large as the smallest sample standard deviation, the assumption is probably met close enough for ANOVA to be OK
- if the n_i's are equal (balanced design), the ANOVA is less sensitive to the violation of equal variance

If one or both assumptions are violated, try a transformation. If only normality is violated, try a non-parametric procedure such as the Kruskal-Wallis test. A sketch of these diagnostic checks follows.
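A sketch of these checks for the pets model. Note that leveneTest comes from the car package, which is not used elsewhere in these notes; treat it as one option among several:

  # Sketch: assumption checks for the pets fit (lm.out from above; either
  # the factor fit or the dummy-regressor fit gives the same residuals)
  boxplot(rate ~ group)                    # distribution shape by group
  qqnorm(residuals(lm.out))                # normality of all residuals
  qqline(residuals(lm.out))

  # rule of thumb: largest group sd shouldn't exceed twice the smallest
  tapply(rate, group, sd)

  # Levene's test for homogeneity of variance
  # install.packages("car")   # if not already installed
  car::leveneTest(rate ~ group)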
Earlier, we mentioned that 1-way ANOVA... focuses on testing for differences among group means.

Can you get at the differences between means using either effects model coding method (i.e. either constraint)?

The answer is yes. The interpretation of the parameters depends on the constraint used, but the important results are still the same (p-values, Ŷ_ij values, etc.). Because hypothesis tests are built on parameter interpretation, the hypothesis test used to answer a given question does depend on the constraint used. (A numerical check follows the two cases below.)

Case 1: Baseline coding (α_m = 0)
- µ represents the baseline group.
- α_1 is the distance of group 1 from the baseline group.
- α_2 is the distance of group 2 from the baseline group.
- α_2 − α_1 is the distance between groups 1 & 2.

Case 2: Sum-to-zero coding (Σ α_i = 0)
- µ represents the overall or grand mean.
- α_1 is the distance of group 1 from the overall mean.
- α_2 is the distance of group 2 from the overall mean.
- α_2 − α_1 is the distance between groups 1 & 2.
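For instance, with the deviation-coded pets fit from earlier (group 1 = Control, group 2 = Friend), a sketch of the group-1-to-group-2 distance:

  # Sketch: alpha-hat_2 - alpha-hat_1 recovers the difference between the
  # Friend and Control sample means, regardless of the coding used
  lm.dev <- lm(rate ~ dummy.1 + dummy.2)        # deviation-coded fit
  b <- coef(lm.dev)
  b["dummy.2"] - b["dummy.1"]                   # 8.88104 - 0.07998 = 8.80107
  diff(tapply(rate, group, mean)[c("C", "F")])  # 91.32513 - 82.52407, same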