STAT 705 Chapter 16: One-way ANOVA Timothy Hanson Department of Statistics, University of South Carolina Stat 705: Data Analysis II 1 / 21
What is ANOVA? Analysis of variance (ANOVA) models are regression models with qualitative predictors, called factors or treatments Factors have different levels For example, the factor education may have the levels high school, undergraduate, graduate The factor gender has two levels female, male We may have several factors as predictors, eg race and gender may be used to predict annual salary in $ There are two types of factors: Classification (investigator cannot control) Experimental (investigator can control) 2 / 21
ANOVA A control treatment (or control factor level) is sometimes used to measure effects of (new or experimental) treatments under investigation, relative to the status quo eg ibuprofin, aspirin, and placebo We have 3 factor levels Without placebo, we do not know how iboprofin or aspirin does relative to no pain killer, only relative to each other Uses of ANOVA models: find best/worst treatment, measure effectiveness of new treatment, compare treatments Often interested in determining whether there is a difference in treatments Read Sections 161 168 in the text 3 / 21
163 Cell means model Have r different treatments or factor levels At each level i, have n i observations from group i Total number of observations is n T = n 1 + n 2 + + n r { i = 1,, r factor level Response is Y ij where j = 1,, n i obs within factor level } Example: Two factors: MS, PhD Y ij is age in years Spring of 2014 we observe Y 11 = 28, Y 12 = 24, Y 13 = 24, Y 14 = 22, Y 15 = 26, Y 16 = 23, Y 21 = 29, Y 22 = 23, Y 23 = 26, Y 24 = 25, Y 25 = 22, Y 26 = 23, Y 27 = 38, Y 28 = 33, Y 29 = 30, Y 2,10 = 27 4 / 21
One-way ANOVA model Y ij = µ i + ɛ ij, ɛ ij iid N(0, σ 2 ) Can rewrite as Y ij ind N(µ i, σ 2 ) Data are normal, data are independent, variance constant across groups µ i is allowed to be different for each group µ 1,, µ r are the r population means of the response A picture helps Questions: what is E{Y ij }? What is σ 2 {Y ij }? 5 / 21
Matrix formulation (pp 683 684, 710 712) For r = 3 we have Y 11 Y 12 Y 1n1 Y 21 Y 22 Y 2n2 Y 31 Y 32 Y 3n3 = 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1 µ 1 µ 2 µ 3 + ɛ 11 ɛ 12 ɛ 1n1 ɛ 21 ɛ 22 ɛ 2n2 ɛ 31 ɛ 32 ɛ 3n3 or Y = Xβ + ɛ 6 / 21
164 Fitting the model For r = 3, let Q(µ 1, µ 2, µ 3 ) = 3 ni i=1 j=1 (Y ij µ i ) 2 Need to minumize this over all possible (µ 1, µ 2, µ 3 ) to find least-squares (LS) solution Can easily show that Q(µ 1, µ 2, µ 3 ) has minimum at ˆβ = ˆµ 1 ˆµ 2 ˆµ 3 = Ȳ 1 Ȳ 2 Ȳ 3 where Ȳ i = 1 n i ni j=1 Y ij is the sample mean from the ith group (pp 687 688) These ˆβ are also maximum likelihood estimates 7 / 21
Matrix formula of least-squares estimators (r = 3) X X = 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 1 1 (X X) 1 = n 1 1 0 0 0 n 1 2 0 0 0 n 1 3 1 0 0 1 0 0 0 1 0 = 0 1 0 0 0 1 0 0 1, X Y = Y 1 Y 2, Y 3 ˆβ = (X X) 1 X Y = Ȳ 1 Ȳ 2 Ȳ 3 n 1 0 0 0 n 2 0 0 0 n 3, 8 / 21
Residuals As in regression (STAT 704), e ij = Y ij Ŷ ij = Y ij ˆµ i = Y ij Ȳ i As usual, Ŷ ij is the estimated mean response under the model Note that n i j=1 e ij = 0 [check this!] In matrix terms e = Y Xˆβ = Y Ŷ 9 / 21
Kenton Food Company Example r = 4 box designs for a new breakfast cereal 20 stores w/ roughly equal sales volumes picked to participate; n i = 5 is planned for each A fire occurred at one store that had design 3, so ended up with n T = 19 instead of 20, and n 1 = n 2 = n 4 = 5 and n 3 = 4 10 / 21
Kenton foods example data kenton; input sales design @@; datalines; 11 1 17 1 16 1 14 1 15 1 12 2 10 2 15 2 19 2 11 2 23 3 20 3 18 3 17 3 27 4 33 4 22 4 26 4 28 4 ; proc sgscatter; plot sales*design; run; proc glm plots=all; * zero/one dummy variables, but recover cell means via lsmeans; class design; model sales=design; lsmeans design; run; 11 / 21
165 ANOVA table (pp 690 698) Define the following n i Y i = Y ij = i group sum, j=1 Ȳ i = 1 n i n i Y ij = ith group mean j=1 Y = r n i Y ij = i=1 j=1 r Y i = sum all obs i=1 Ȳ = 1 n T r n i i=1 j=1 Y ij = 1 n T r Y i = mean all obs i=1 12 / 21
Sums of squares for treatments, error, and total SSTO = SSTR = = r n i (Y ij Ȳ ) 2 = variability in Y ij s i=1 j=1 n i r (Ŷ ij Ȳ ) 2 = i=1 j=1 n i r (Ȳ i Ȳ ) 2 = r n i (ˆµ ij Ȳ ) 2 i=1 j=1 i=1 j=1 i=1 r n i (Ȳ i Ȳ ) 2 = variability explained by ANOVA model r n i r n i SSE = (Y ij Ŷ ij ) 2 = i=1 j=1 i=1 j=1 = variability NOT explained by ANOVA model e 2 i 13 / 21
Comments As before in regression, SSTO }{{} = } SSTR {{} + }{{} SSE total treatment effects leftover randomness SSE=0 Y ij = Y ik for all j k SSTR=0 Ȳi = Ȳ for i = 1,, r 14 / 21
ANOVA table (p 694) Source SS df MS SSTR r ni i=1 j=1 (Ȳ i Ȳ ) 2 r 1 SSTR/(r 1) SSE r ni i=1 j=1 (Y ij Ȳi ) 2 n T r SSE/(n T r) SSTO r ni i=1 j=1 (Y ij Ȳ ) 2 n T 1 15 / 21
Degrees of freedom SSTO has n T 1 df because there are n T Y ij Ȳ terms in the sum, but they add up to zero (1 constraint) SSE has n T r df because there are n T Y ij Ȳi terms in the sum, but there are r constraints of the form ni j=1 (Y ij Ȳ i ) = 0 SSTR has r 1 df because there are r terms n i (Ȳ i Ȳ ) in the sum, but they sum to zero (1 constraint) 16 / 21
Estimated mean squares E{MSE} = σ 2, MSE is unbiased estimate of σ 2 r E{MSTR} = σ 2 i=1 + n i(µ i µ ) 2, r 1 where µ = r n i µ i i=1 n T is weighted average of µ 1,, µ r (pp 696 698) If µ i = µ j for all i, j {1,, r} then E{MSTR} = σ 2, otherwise E{MSTR} > σ 2 Hence, if any group means are different then E{MSTR} E{MSE} > 1 17 / 21
166 F test of H 0 : µ 1 = = µ r Fact: If µ 1 = = µ r then F = MSTR MSE F (r 1, n T r) To perform α-level test of H 0 : µ 1 = = µ r vs H a : some µ i µ j for i j, Accept if F F (1 α, r 1, n T r) or p-value α Reject if F > F (1 α, r 1, n T r) or p-value < α p-value = P{F (r 1, n T 1) F } Example: Kenton Foods 18 / 21
Comments If r = 2 then F = (t ) 2 where t is t-statistic from 2-sample pooled-variance t-test The F-test may be obtained from the general nested linear hypotheses approach (big model / little model) Here the full model is Y ij = µ i + ɛ ij and the reduced is Y ij = µ + ɛ ij F = [ ] SSE(R) SSE(F ) dfe R dfe F SSE(F ) dfe F = MSTR MSE 19 / 21
167 Alternative formulations SAS will fit the cell means model (discussed so far) with a noint option in model statement; however, the F-test will not be correct Your textbook discusses an alternative parameterization that is not easy to get out of the SAS procedures we will use By default, SAS fits the model where α r = 0 Y ij = µ + α i + ɛ ij, E{Y rj } = µ; µ is the cell-mean for the rth level For i < r, E{Y ij } = µ + α i ; α i is i s offset to group r s mean µ Note that SAS s default corresponds to a regression model where categorical predictors are modeled using the usual zero-one dummy variables In class, let s find the design X for SAS s model for r = 3 and n 1 = n 2 = n 3 = 2 20 / 21
SAS s baseline & offset model Even though SAS parameterizes the model differently, with the r th level as baseline, the ANOVA table and F-test is the same as the cell means model Also ˆµ = Ȳr and ˆα i = Ȳi Ȳr are the OLS and MLE estimators These are reported in SAS Use, eg model sales=design / solution; The cell means ˆµ i are obtained in SAS by adding lsmeans to glm or glimmix 21 / 21