Comparing Several Means: ANOVA. Group Means and Grand Mean

STAT 511 ANOVA and Regression

Slide 1. Comparing Several Means: ANOVA

To compare the means of several, say $I$, groups (populations), one often uses an analysis of variance model, or ANOVA.

For the $I$ populations, we use $\mu_1, \mu_2, \dots, \mu_I$ and $\sigma_1, \sigma_2, \dots, \sigma_I$ to denote their respective means and standard deviations. Similarly, the sample mean, sample standard deviation, and sample size of the $i$th population are denoted by $\bar{x}_i$, $s_i$, and $J_i$.

Of most interest are the comparisons between the $\mu_i$'s.

Example. Blue Lake snap beans were grown in 12 open-top chambers, which are subject to 4 treatments, 3 each, with O3 and SO2 present/absent. The total yield was measured for each chamber.

                    Sulfur Dioxide
  Ozone       Absent               Present
  Absent      1.52  1.85  1.39     1.49  1.55  1.21
  Present     1.15  1.30  1.57     0.65  0.76  0.69

Slide 2. Group Means and Grand Mean

The individual sample means are

$$\bar{x}_i = \frac{1}{J_i} \sum_{j=1}^{J_i} x_{ij},$$

where $x_{ij}$ is the $j$th observation in the $i$th group. The grand mean is

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{I} \sum_{j=1}^{J_i} x_{ij},$$

where $n = \sum_{i=1}^{I} J_i$ is the total number of observations in the $I$ groups.

For $J_i$'s large, by the CLT, $\bar{X}_i \sim N(\mu_i, \sigma_i^2 / J_i)$ approximately, and the $s_i^2$ are reliable estimates of the $\sigma_i^2$. For $J_i$'s small, one assumes normality and $\sigma_1^2 = \cdots = \sigma_I^2 = \sigma^2$.

For the bean growth data,

  trt   J_i   $\sum_j x_{ij}$   $\bar{x}_i$
  1     3     4.76              1.5867
  2     3     4.25              1.4167
  3     3     4.02              1.3400
  4     3     2.10              0.7000

The grand total of $n = 12$ observations is $\sum_i \sum_j x_{ij} = 15.13$, so the grand mean is $\bar{x} = 15.13 / 12 = 1.2608$.

The $J_i$'s here are all equal, so $\bar{x}$ is the mean of the $\bar{x}_i$'s. This would not be the case for unequal $J_i$'s.
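The group means and grand mean for the bean data can be reproduced with a few lines of Python. This is only a sketch: the names `yields`, `group_means`, and `grand_mean` are made up for illustration, and the assignment of the twelve yields to treatments 1 to 4 is an assumption, chosen so that the treatment totals match 4.76, 4.25, 4.02, and 2.10.

```python
# Bean yields by treatment. The grouping is an assumption, chosen so that
# the treatment totals match the slide: 4.76, 4.25, 4.02, 2.10.
yields = {
    1: [1.52, 1.85, 1.39],  # O3 absent,  SO2 absent
    2: [1.49, 1.55, 1.21],  # O3 absent,  SO2 present
    3: [1.15, 1.30, 1.57],  # O3 present, SO2 absent
    4: [0.65, 0.76, 0.69],  # O3 present, SO2 present
}


def group_means(groups):
    """Sample mean of each group: xbar_i = (1 / J_i) * sum_j x_ij."""
    return {trt: sum(obs) / len(obs) for trt, obs in groups.items()}


def grand_mean(groups):
    """Grand mean: total of all observations divided by n = sum of the J_i."""
    n = sum(len(obs) for obs in groups.values())
    total = sum(sum(obs) for obs in groups.values())
    return total / n


means = group_means(yields)
xbar = grand_mean(yields)

# One row per treatment: trt, J_i, group total, group mean.
for trt, m in means.items():
    print(trt, len(yields[trt]), round(sum(yields[trt]), 2), round(m, 4))
print("grand mean:", round(xbar, 4))
```

Because the group sizes are equal here, averaging the four values in `means` gives the same number as `xbar`; with unequal sizes the two would differ.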

Slide 3. Variation Within Groups

Under the assumption $\sigma_1^2 = \cdots = \sigma_I^2 = \sigma^2$, one would like to estimate the common variance $\sigma^2$ using all available information. Such information is contained in the sum of squared errors

$$SSE = \sum_{i=1}^{I} \sum_{j=1}^{J_i} (x_{ij} - \bar{x}_i)^2 = \sum_{i=1}^{I} (J_i - 1) s_i^2.$$

The pooled variance estimate is given by

$$s_p^2 = MSE = \frac{SSE}{n - I},$$

where $n - I = \sum_{i=1}^{I} (J_i - 1)$.

For the bean growth data,

  trt   $\sum_j (x_{ij} - \bar{x}_i)^2$   $s_i^2$
  1     .112467                           .056233
  2     .065867                           .032933
  3     .090600                           .045300
  4     .006200                           .003100

SSE is $\sum_i \sum_j (x_{ij} - \bar{x}_i)^2 = .275133$, and MSE is $s_p^2 = .275133 / (12 - 4) = .034392$.

For $J_i$'s all equal, $s_p^2 = \sum_i s_i^2 / I$. In general, $s_p^2$ is a weighted mean of the $s_i^2$ with weights $(J_i - 1)$.

Slide 4. Variation Between Groups

If one ignores the grouping, then the sample variance of the $n$ observations is $s^2 = SST / (n - 1)$, where $SST = \sum_i \sum_j (x_{ij} - \bar{x})^2$.

To measure the variability between groups, one calculates the sum of squares for treatments

$$SSTr = \sum_{i=1}^{I} \sum_{j=1}^{J_i} (\bar{x}_i - \bar{x})^2 = \sum_{i=1}^{I} J_i (\bar{x}_i - \bar{x})^2.$$

It can be shown that

$$\sum_i \sum_j (x_{ij} - \bar{x})^2 = \sum_i \sum_j (x_{ij} - \bar{x}_i)^2 + \sum_i \sum_j (\bar{x}_i - \bar{x})^2,$$

that is, $SST = SSE + SSTr$.

For $I = 2$, it can be shown that

$$SSTr = \frac{(\bar{x}_1 - \bar{x}_2)^2}{\frac{1}{J_1} + \frac{1}{J_2}}.$$

For the bean growth data, SSTr is given by $\sum_i 3 (\bar{x}_i - \bar{x})^2 = 1.353758$, and SST is given by $\sum_i \sum_j (x_{ij} - \bar{x})^2 = 1.628892$. It is easy to verify that SST = SSTr + SSE.
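The decomposition SST = SSTr + SSE is easy to check numerically. The sketch below computes the three sums of squares directly from their definitions. The helper names are hypothetical, and the grouping of the bean yields is an assumption that reproduces the treatment totals 4.76, 4.25, 4.02, and 2.10.

```python
# Bean yields, one inner list per treatment (assumed grouping).
beans = [
    [1.52, 1.85, 1.39],
    [1.49, 1.55, 1.21],
    [1.15, 1.30, 1.57],
    [0.65, 0.76, 0.69],
]


def mean(xs):
    return sum(xs) / len(xs)


def sums_of_squares(groups):
    """Return (SSE, SSTr, SST) computed from the definitions."""
    pooled = [x for g in groups for x in g]
    grand = mean(pooled)
    # Within groups: squared deviations from each group's own mean.
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # Between groups: J_i times the squared deviation of the group mean.
    sstr = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Total: squared deviations from the grand mean, ignoring the grouping.
    sst = sum((x - grand) ** 2 for x in pooled)
    return sse, sstr, sst


sse, sstr, sst = sums_of_squares(beans)
n, num_groups = sum(len(g) for g in beans), len(beans)
mse = sse / (n - num_groups)  # pooled variance estimate s_p^2

print(round(sse, 6), round(sstr, 6), round(sst, 6), round(mse, 6))
```

Only two of the three sums need to be computed in practice; the third follows from the identity.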

Slide 5. ANOVA Table, F-Test

Associated with SSE and SST are degrees of freedom $n - I$ and $n - 1$. Similarly, SSTr has df $I - 1$. Note that $n - 1 = (n - I) + (I - 1)$.

Dividing an SS by the corresponding df, one gets a mean square (MS). An ANOVA table summarizes all the information.

  Src     SS     df       MS
  Trt     SSTr   I - 1    MSTr = SSTr / (I - 1)
  Error   SSE    n - I    MSE = SSE / (n - I)
  Total   SST    n - 1

MSE is an unbiased estimate of $\sigma^2$. For $\mu_i$'s all equal, MSTr is also an unbiased estimate of $\sigma^2$. When the $\mu_i$'s are not all equal, MSTr tends to be larger.

To test the hypotheses

$$H_0: \mu_1 = \cdots = \mu_I \quad \text{vs.} \quad H_a: \text{otherwise},$$

calculate $f = MSTr / MSE$ and reject $H_0$ when $f > F_{\alpha, \nu_1, \nu_2}$, where $\nu_1 = I - 1$ and $\nu_2 = n - I$.

Slide 6. F-Distribution

Let $Y_i \sim N(0, 1)$, $i = 1, \dots, m$, and $Z_j \sim N(0, 1)$, $j = 1, \dots, n$, be independent. The distribution of

$$\frac{\sum_{i=1}^{m} Y_i^2 / m}{\sum_{j=1}^{n} Z_j^2 / n}$$

is called an F-distribution with degrees of freedom $\nu_n = m$ (numerator) and $\nu_d = n$ (denominator).

[Figure: density curves of F(3,8) and F(8,3).]

For the bean growth data, the ANOVA table is given by

  Src     SS       df   MS
  Trt     1.3538    3   .4513
  Error   0.2751    8   .0344
  Total   1.6289   11

It is easy to calculate $f = .4513 / .0344 = 13.12$, which is larger than $F_{.05,3,8} = 4.07$, so we reject $H_0$ at the 5% significance level.

To obtain $F_{.05,3,8}$ in R, use qf(.95,3,8).
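As an illustration, the ANOVA table and F statistic for the bean data can be assembled as follows. This is a sketch in plain Python rather than R. The function name `anova_table` is made up, the grouping of the yields is an assumption that matches the treatment totals, and the critical value 4.07 is copied from the notes rather than computed.

```python
# Bean yields, one inner list per treatment (assumed grouping).
beans = [
    [1.52, 1.85, 1.39],
    [1.49, 1.55, 1.21],
    [1.15, 1.30, 1.57],
    [0.65, 0.76, 0.69],
]

F_CRIT_05_3_8 = 4.07  # F_{.05,3,8} as quoted in the notes (qf(.95,3,8) in R)


def anova_table(groups):
    """One-way ANOVA rows as (SS, df, MS), plus the statistic f = MSTr / MSE."""
    n = sum(len(g) for g in groups)
    k = len(groups)  # number of groups, written I in the notes
    grand = sum(sum(g) for g in groups) / n
    sstr = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    sse = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    mstr, mse = sstr / (k - 1), sse / (n - k)
    return {
        "Trt": (sstr, k - 1, mstr),
        "Error": (sse, n - k, mse),
        "Total": (sstr + sse, n - 1, None),
        "f": mstr / mse,
    }


table = anova_table(beans)
for name in ("Trt", "Error", "Total"):
    ss, df, ms = table[name]
    print(name, round(ss, 4), df, "" if ms is None else round(ms, 4))

reject_h0 = table["f"] > F_CRIT_05_3_8
print("f =", round(table["f"], 2), "reject H0:", reject_h0)
```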

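Slide 7. F- and t-tests, Computing Formulas

For $I = 2$, one has

$$f = \frac{MSTr}{MSE} = \frac{(\bar{x}_1 - \bar{x}_2)^2}{s_p^2 \left( \frac{1}{J_1} + \frac{1}{J_2} \right)}.$$

Reject $H_0$ when $f > F_{\alpha, 1, n-2}$. Compare this with the t-test for $H_0: \mu_1 = \mu_2$ versus $H_a: \mu_1 \neq \mu_2$,

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{J_1} + \frac{1}{J_2}}},$$

with a rejection region $|t| > t_{\alpha/2, n-2}$. We notice that $f = t^2$. Actually, one also has $F_{\alpha, 1, \nu} = t^2_{\alpha/2, \nu}$, so the F-test is equivalent to the t-test we learned earlier.

Since SST = SSTr + SSE, one only needs to calculate two of the three terms.

$$SST = \sum_i \sum_j (x_{ij} - \bar{x})^2 = \sum_i \sum_j x_{ij}^2 - \frac{\left( \sum_i \sum_j x_{ij} \right)^2}{n}$$

$$SSTr = \sum_i \sum_j (\bar{x}_i - \bar{x})^2 = \sum_i \frac{\left( \sum_j x_{ij} \right)^2}{J_i} - \frac{\left( \sum_i \sum_j x_{ij} \right)^2}{n}$$

$$SSE = \sum_i \sum_j (x_{ij} - \bar{x}_i)^2 = \sum_i \sum_j x_{ij}^2 - \sum_i \frac{\left( \sum_j x_{ij} \right)^2}{J_i}$$

Slide 8. Computing ANOVA: Example

Consider the following data.

  Sample                  1     2     3
  Observations           12     8     6
                         10     5     2
                                3     4
                                4
  $J_i$                   2     4     3
  $\sum_j x_{ij}$        22    20    12
  $\sum_j x_{ij}^2$     244   114    56
  $\bar{x}_i$            11     5     4

Here $n = 9$, $\sum_i \sum_j x_{ij} = 54$, and $\bar{x} = 6$.

Using the computing formulas,

$$SSE = 244 + 114 + 56 - \left( \frac{22^2}{2} + \frac{20^2}{4} + \frac{12^2}{3} \right) = 24,$$

$$SSTr = \left( \frac{22^2}{2} + \frac{20^2}{4} + \frac{12^2}{3} \right) - \frac{54^2}{9} = 66.$$

Since $f = \frac{66/2}{24/6} = 8.25$ and $F_{.05,2,6} = 5.14$, we reject $H_0: \mu_1 = \mu_2 = \mu_3$ at the 5% significance level.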
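The computing formulas translate almost line for line into code. The sketch below applies them to the three-sample example, then checks $f = t^2$ on the first two samples only. The function names are hypothetical, and the sample values are those recovered from the column totals 22, 20, 12 and sums of squares 244, 114, 56.

```python
import math

samples = [[12, 10], [8, 5, 3, 4], [6, 2, 4]]


def shortcut_anova(groups):
    """Return (SSE, SSTr, f) using the raw-sum computing formulas."""
    n = sum(len(g) for g in groups)
    raw_ss = sum(x * x for g in groups for x in g)          # sum_i sum_j x_ij^2
    group_term = sum(sum(g) ** 2 / len(g) for g in groups)  # sum_i (sum_j x_ij)^2 / J_i
    correction = sum(sum(g) for g in groups) ** 2 / n       # (grand total)^2 / n
    sse = raw_ss - group_term
    sstr = group_term - correction
    f = (sstr / (len(groups) - 1)) / (sse / (n - len(groups)))
    return sse, sstr, f


def pooled_t(a, b):
    """Two-sample pooled t statistic for H0: mu_1 = mu_2."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    sp2 = ss / (len(a) + len(b) - 2)
    return (mean_a - mean_b) / math.sqrt(sp2 * (1 / len(a) + 1 / len(b)))


sse, sstr, f = shortcut_anova(samples)
print(sse, sstr, f)  # 24.0 66.0 8.25

# With only two groups, the F statistic is the square of the pooled t.
t = pooled_t(samples[0], samples[1])
_, _, f_two = shortcut_anova(samples[:2])
print(round(t ** 2, 6), round(f_two, 6))
```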

Slide 9. Parameter Estimation and Testing

The inferences concerning means are derived from the fact that $\bar{X}_i \sim N(\mu_i, \sigma^2 / J_i)$.

A $(1 - \alpha)100\%$ CI for $\mu_i$ is

$$\bar{x}_i \pm t_{\alpha/2, \nu} \sqrt{\frac{s_p^2}{J_i}},$$

where $\nu = n - I$. A $(1 - \alpha)100\%$ CI for $\mu_1 - \mu_2$ is

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2, \nu} \sqrt{s_p^2 \left( \frac{1}{J_1} + \frac{1}{J_2} \right)}.$$

Tests for hypotheses concerning these parameters can be similarly constructed.

For the bean growth data, $\bar{x}_1 = 1.5867$, $\bar{x}_2 = 1.4167$, $s_p^2 = .0344 = .1855^2$, $J_1 = J_2 = 3$, and $\nu = 8$.

A 95% CI for $\mu_1$ is $1.5867 \pm 2.306 \sqrt{.0344 / 3}$, or $(1.340, 1.834)$, where $t_{.025,8} = 2.306$.

A 95% CI for $\mu_1 - \mu_2$ is $.17 \pm 2.306 (.1855) \sqrt{2/3}$, or $(-.179, .519)$. The interval contains 0, so one would accept $H_0: \mu_1 = \mu_2$ at the 5% level.

Slide 10. Estimating and Testing Contrasts

A linear combination of means $\theta = c_1 \mu_1 + \cdots + c_I \mu_I$ is to be estimated by $\hat{\theta} = c_1 \bar{x}_1 + \cdots + c_I \bar{x}_I$, with a standard error

$$\hat{\sigma}_{\hat{\theta}} = s_p \sqrt{\frac{c_1^2}{J_1} + \cdots + \frac{c_I^2}{J_I}}.$$

When $c_1, \dots, c_I$ add to zero, $\sum_i c_i = 0$, such a $\theta$ is called a contrast. For example, $\mu_1 - \mu_2$ is a contrast. In applications, contrasts are often of the most interest.

For the bean growth data, a contrast of interest is $\theta = (\mu_1 - \mu_2) - (\mu_3 - \mu_4)$. $\theta = 0$ implies no interaction between O3 and SO2.

The estimate is given by $\hat{\theta} = \bar{x}_1 - \bar{x}_2 - \bar{x}_3 + \bar{x}_4 = -.47$, with a standard error $\hat{\sigma}_{\hat{\theta}} = .1855 \sqrt{4/3} = .2142$.

A 95% CI for $\theta$ is $-.47 \pm 2.306 (.2142)$, or $(-.964, .024)$. The interval contains 0, so one would not reject $\theta = 0$ at the 5% level.
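A single helper covers all three intervals above, since $\mu_1$, $\mu_1 - \mu_2$, and the interaction contrast are all linear combinations of the means. The following is a sketch under stated assumptions: `linear_combination_ci` is a made-up name, the bean yields are grouped so that the treatment totals match the notes, and the critical value 2.306 is taken from the notes instead of being computed.

```python
import math

# Bean yields, one inner list per treatment (assumed grouping).
beans = [
    [1.52, 1.85, 1.39],
    [1.49, 1.55, 1.21],
    [1.15, 1.30, 1.57],
    [0.65, 0.76, 0.69],
]

T_025_8 = 2.306  # t_{.025,8} as quoted in the notes


def pooled_sd(groups):
    """Pooled standard deviation s_p = sqrt(SSE / (n - I))."""
    n = sum(len(g) for g in groups)
    sse = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return math.sqrt(sse / (n - len(groups)))


def linear_combination_ci(coefs, groups, t_crit):
    """Estimate, standard error, and CI for theta = sum_i c_i * mu_i."""
    s_p = pooled_sd(groups)
    est = sum(c * (sum(g) / len(g)) for c, g in zip(coefs, groups))
    se = s_p * math.sqrt(sum(c * c / len(g) for c, g in zip(coefs, groups)))
    return est, se, (est - t_crit * se, est + t_crit * se)


mu1 = linear_combination_ci([1, 0, 0, 0], beans, T_025_8)
diff = linear_combination_ci([1, -1, 0, 0], beans, T_025_8)
interaction = linear_combination_ci([1, -1, -1, 1], beans, T_025_8)

for label, (est, se, (lo, hi)) in [("mu1", mu1), ("mu1 - mu2", diff),
                                   ("interaction", interaction)]:
    print(label, round(est, 4), round(se, 4), (round(lo, 3), round(hi, 3)))
```

The endpoints agree with the hand calculations to about three decimals; small differences come from the notes rounding $s_p$ to .1855 before multiplying.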

Slide 11. Relations Between Variables

Functional relations: $y = f(x)$, deterministic, such as (i) $A = \pi r^2$ for the area $A$ and radius $r$ of a circle; or (ii) $y = \frac{5}{9}(x - 32)$ for thermometer readings $x$ °F and $y$ °C.

Statistical relations: Variables tend to vary together, but there is no deterministic coupling. Among examples are (i) ages of married couples; and (ii) lengths and weights of snakes.

[Figures: area against radius (a functional relation); weight (gm) against length (cm) (a statistical relation).]

Slide 12. Simple Linear Regression

When studying the heights of father-son pairs, Galton found in the late 19th century that, for fathers taller than average, the average height of their sons is between their height and the average. Ditto for fathers shorter than average.

A simple linear regression is of the form

$$Y = \beta_0 + \beta_1 x + \epsilon,$$

where
  $Y$: response, or dependent variable;
  $x$: predictor, or independent variable;
  $\epsilon$: noise, or random error.

$Y$ varies randomly given $x$. The distribution of $Y$ varies systematically with $x$ through the regression function $\mu_{Y|x} = \beta_0 + \beta_1 x$.

The model has a systematic part $\beta_0 + \beta_1 x$ and a random part $\epsilon$. A causal structure is usually implied.

Slide 13. Model Assumptions in SLR

Data come in as pairs $(x_i, y_i)$, and the model is written as

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i.$$

It is usually assumed that $\epsilon_i \sim N(0, \sigma^2)$.

In practice, one observes pairs $(x_i, y_i)$ and estimates the model parameters $\beta_0$, $\beta_1$, and $\sigma^2$.

$\mu_{Y|x} = \beta_0 + \beta_1 x$ is a strong assumption. The normality assumption can sometimes be weakened to $\mu_{\epsilon_i} = 0$ and $\sigma^2_{\epsilon_i} = \sigma^2$.

Example. Consider $Y = 12 + 8x + \epsilon$, where $\epsilon \sim N(0, 9)$. Since $Y | x = 1 \sim N(20, 9)$, one has

$$P(Y < 17 \mid x = 1) = P\left( Z < \frac{17 - 20}{3} \right) = .1587.$$

Slide 14. Example: Length and Weight of Snakes

Nine adult females of the snake Vipera berus were caught and measured. The lengths (cm) and weights (gm) are listed below.

  Length   Weight
  60       136
  69       198
  66       194
  64       140
  54        93
  67       172
  59       116
  65       174
  63       145

[Figure: scatter plot of weight (gm) against length (cm).]
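The probability in the $Y = 12 + 8x + \epsilon$ example can be checked with the standard library's `statistics.NormalDist`. This is a small sketch: the function name `prob_below` is made up for illustration.

```python
from statistics import NormalDist

BETA0, BETA1 = 12, 8
SIGMA = 3  # standard deviation of the error, since sigma^2 = 9


def prob_below(y, x):
    """P(Y < y | x) when Y | x ~ N(beta0 + beta1 * x, sigma^2)."""
    return NormalDist(mu=BETA0 + BETA1 * x, sigma=SIGMA).cdf(y)


p = prob_below(17, 1)
print(round(p, 4))  # 0.1587
```

Changing `x` shifts the conditional mean but leaves the spread unchanged, which is exactly what the constant-variance assumption says.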

Slide 15. Least Squares Estimates of $\beta_0$, $\beta_1$

Minimizing, with respect to $\beta_0$ and $\beta_1$,

$$Q = \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2,$$

one obtains the least squares (LS) estimates of $(\beta_0, \beta_1)$,

$$b_1 = \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \hat{\beta}_0 = \bar{y} - b_1 \bar{x},$$

where

$$S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y}), \qquad S_{xx} = \sum_i (x_i - \bar{x})^2.$$

For the lengths and weights of female snakes, the LS estimate of the regression function is $\hat{Y} = -301 + 7.19 X$.

[Figure: weight (gm) against length (cm) with two candidate lines: $Y = -301 + 7.19X$ ($Q = 1093.7$) and $Y = -227 + 6X$ ($Q = 1347$).]

Slide 16. Fitted Values and Residuals

The mean response $\mu_{Y|x}$ at $x$ is (unbiasedly) estimated by the fitted regression function

$$\hat{\mu}_{Y|x} = \hat{Y} = b_0 + b_1 x.$$

At the data points, one has the fitted values (y-hat) and the residuals

$$\hat{y}_i = b_0 + b_1 x_i, \qquad e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i).$$

The fitted values and residuals satisfy

$$\sum_{i=1}^{n} \hat{y}_i = \sum_{i=1}^{n} y_i, \qquad \sum_{i=1}^{n} e_i = \sum_{i=1}^{n} x_i e_i = 0.$$

For the lengths and weights of female snakes,

  x     y     $\hat{y}$    e
  60    136   130.4       5.6
  69    198   195.2       2.8
  66    194   173.6      20.4
  64    140   159.2     -19.2
  54     93    87.3       5.7
  67    172   180.8      -8.8
  59    116   123.2      -7.2
  65    174   166.4       7.6
  63    145   152.0      -7.0
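The least squares fit for the snake data, together with the criterion $Q$ for the two lines mentioned in the figure, can be sketched as follows. The helper names `least_squares` and `q_criterion` are hypothetical.

```python
lengths = [60, 69, 66, 64, 54, 67, 59, 65, 63]          # cm
weights = [136, 198, 194, 140, 93, 172, 116, 174, 145]  # gm


def least_squares(x, y):
    """Return (b0, b1) minimizing Q = sum (y_i - (b0 + b1 * x_i))^2."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def q_criterion(b0, b1, x, y):
    """Sum of squared vertical distances from the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = least_squares(lengths, weights)
fitted = [b0 + b1 * xi for xi in lengths]
residuals = [yi - fi for yi, fi in zip(weights, fitted)]

q_ls = q_criterion(b0, b1, lengths, weights)
q_alt = q_criterion(-227, 6, lengths, weights)  # the competing line in the figure

print(round(b0, 1), round(b1, 2), round(q_ls, 1), q_alt)
for xi, yi, fi, ei in zip(lengths, weights, fitted, residuals):
    print(xi, yi, round(fi, 1), round(ei, 1))
```

Any other line, such as $Y = -227 + 6X$, gives a larger $Q$ than the least squares line, and the residuals from the least squares line sum to zero up to rounding error.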

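Slide 17. Estimation of $\sigma^2$

Consider a model $Y_i = \mu + \epsilon_i$, where $\mu_{\epsilon_i} = 0$ and $\sigma^2_{\epsilon_i} = \sigma^2$. The estimate $\hat{y}_i = \hat{\mu} = \bar{y}$ actually minimizes $Q = \sum_{i=1}^{n} (y_i - \mu)^2$. An unbiased estimate of $\sigma^2$ is

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 1} = \frac{\sum_{i=1}^{n} e_i^2}{n - 1},$$

where $\hat{y}_i$ contains one parameter.

In simple linear regression, $\hat{y}_i = b_0 + b_1 x_i$ contains two parameters. To estimate $\sigma^2$, calculate the residual sum of squares

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2$$

and use

$$s^2 = \frac{SSE}{n - 2} = \frac{\sum_i (y_i - \hat{y}_i)^2}{n - 2}.$$

Unbiasedness: $\mu_{s^2} = \sigma^2$.

To calculate $s^2$, use

$$SSE = S_{yy} - \frac{S_{xy}^2}{S_{xx}},$$

where $S_{yy} = \sum_i (y_i - \bar{y})^2$.

Slide 18. Details of Calculation

We use the lengths and weights of snakes to illustrate. Note that

$$S_{xy} = \sum x_i y_i - \frac{\left( \sum x_i \right) \left( \sum y_i \right)}{n}, \qquad S_{xx} = \sum x_i^2 - \frac{\left( \sum x_i \right)^2}{n}.$$

First, summarize the data.

$$\sum x_i = 567, \quad \sum x_i^2 = 35893, \quad \sum y_i = 1368, \quad \sum y_i^2 = 217926, \quad \sum x_i y_i = 87421.$$

Then calculate

$$\bar{x} = 567 / 9 = 63, \qquad \bar{y} = 1368 / 9 = 152,$$

$$S_{xx} = 35893 - 567^2 / 9 = 172,$$

$$S_{yy} = 217926 - 1368^2 / 9 = 9990,$$

$$S_{xy} = 87421 - 567(1368) / 9 = 1237.$$

Now we have

$$b_1 = 1237 / 172 = 7.19, \qquad b_0 = 152 - 7.19(63) = -301.$$

SSE is given by $9990 - 1237^2 / 172 = 1093.7$, so $\sigma^2$ is estimated by $s^2 = 1093.7 / (9 - 2) = 156.24$.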
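The calculation on Slide 18 works entirely from five raw sums, which is how one would do it by hand or on a calculator. A sketch of the same steps, with made-up variable names:

```python
lengths = [60, 69, 66, 64, 54, 67, 59, 65, 63]          # cm
weights = [136, 198, 194, 140, 93, 172, 116, 174, 145]  # gm

n = len(lengths)

# Step 1: summarize the data with five raw sums.
sum_x = sum(lengths)
sum_x2 = sum(x * x for x in lengths)
sum_y = sum(weights)
sum_y2 = sum(y * y for y in weights)
sum_xy = sum(x * y for x, y in zip(lengths, weights))

# Step 2: corrected sums of squares and cross products.
sxx = sum_x2 - sum_x ** 2 / n
syy = sum_y2 - sum_y ** 2 / n
sxy = sum_xy - sum_x * sum_y / n

# Step 3: least squares estimates, SSE, and the estimate of sigma^2.
b1 = sxy / sxx
b0 = sum_y / n - b1 * sum_x / n
sse = syy - sxy ** 2 / sxx
s2 = sse / (n - 2)  # two parameters estimated, hence n - 2

print(sum_x, sum_x2, sum_y, sum_y2, sum_xy)
print(sxx, syy, sxy)  # 172.0 9990.0 1237.0
print(round(b1, 2), round(b0), round(sse, 1), round(s2, 2))
```

The unrounded slope is used for `b0` here, so the intercept comes out near $-301.1$ rather than the $-301$ obtained from the rounded 7.19.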

Slide 19. Inferences Concerning $\beta_1$

Assume $\epsilon_i \sim N(0, \sigma^2)$. Then

$$b_1 \sim N(\beta_1, \sigma^2_{b_1}),$$

where $\sigma^2_{b_1} = \sigma^2 / S_{xx}$ is to be estimated by $s^2_{b_1} = s^2 / S_{xx}$. The inferences are based on

$$\frac{b_1 - \beta_1}{s_{b_1}} \sim t_{n-2}.$$

For example, a $(1 - \alpha)100\%$ CI for $\beta_1$ is given by $b_1 \pm t_{\alpha/2, n-2} \, s_{b_1}$.

Lengths and weights of snakes. We have $b_1 = 7.19$ and

$$s_{b_1} = \sqrt{\frac{156.24}{172}} = .953.$$

A 95% CI for $\beta_1$ is given by $7.19 \pm 2.365 (.953)$, where $t_{.025,7} = 2.365$. To test the hypotheses

$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0,$$

we calculate $t = \frac{7.19 - 0}{.953} = 7.545$ and reject $H_0$ even at the 1% level, as $t > 3.499 = t_{.005,7}$.

Slide 20. Analysis of Variance

Decompose the deviation of $y_i$ from $\bar{y}$,

$$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i),$$

where $(\hat{y}_i - \bar{y})$ is systematic and $(y_i - \hat{y}_i)$ is random. It can be shown that

$$\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2,$$

that is, SST = SSR + SSE, with degrees of freedom $(n - 1) = 1 + (n - 2)$.

The ANOVA table summarizes related information.

  Source   SS    df      MS                    f
  Model    SSR   1       MSR = SSR / 1         MSR / MSE
  Resid    SSE   n - 2   s^2 = SSE / (n - 2)
  Total    SST   n - 1

For the lengths and weights of female snakes,

  Source   SS       df   MS        F
  Model    8896.3    1   8896.3    56.94
  Resid    1093.7    7   156.24
  Total    9990.0    8
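The slope inference for the snake data can be reproduced as below. This is a sketch: `slope_inference` is a hypothetical helper, and the critical values 2.365 and 3.499 are copied from the notes rather than computed.

```python
import math

lengths = [60, 69, 66, 64, 54, 67, 59, 65, 63]          # cm
weights = [136, 198, 194, 140, 93, 172, 116, 174, 145]  # gm

T_025_7 = 2.365  # t_{.025,7}, from the notes
T_005_7 = 3.499  # t_{.005,7}, from the notes


def slope_inference(x, y):
    """Return (b1, s_b1, s2) for the simple linear regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    s2 = (syy - sxy ** 2 / sxx) / (n - 2)  # SSE / (n - 2)
    return b1, math.sqrt(s2 / sxx), s2


b1, s_b1, s2 = slope_inference(lengths, weights)
ci = (b1 - T_025_7 * s_b1, b1 + T_025_7 * s_b1)  # 95% CI for beta_1
t_stat = (b1 - 0) / s_b1                         # test of H0: beta_1 = 0

print(round(s_b1, 3), (round(ci[0], 2), round(ci[1], 2)), round(t_stat, 2))
print("reject at 1% level:", abs(t_stat) > T_005_7)
```

The interval lies entirely above zero, which agrees with rejecting $H_0: \beta_1 = 0$.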

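Slide 21. F-Test for $\beta_1 = 0$

It can be shown that

$$\mu_{MSR} = \sigma^2 + \beta_1^2 S_{xx}, \qquad \mu_{MSE} = \sigma^2.$$

When $\beta_1 = 0$, one has

$$f = \frac{MSR}{MSE} \sim F_{1, n-2}.$$

These lead to the F-test for $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$, which rejects $H_0$ when $f > F_{\alpha, 1, n-2}$.

The F- and t-tests are equivalent:

$$\frac{MSR}{MSE} = f = t^2 = \left( \frac{b_1}{s_{b_1}} \right)^2, \qquad F_{\alpha, 1, n-2} = t^2_{\alpha/2, n-2}.$$

For the lengths and weights of female snakes, since $f = 8896.3 / 156.24 = 56.94 > F_{.01,1,7} = 12.246$, we reject $H_0: \beta_1 = 0$ at the 1% level. This is equivalent to the t-test on Slide 19. Note that

$$f = 56.94 = 7.55^2 = t^2, \qquad F_{.01,1,7} = 12.25 = 3.5^2 = t^2_{.005,7}.$$

Slide 22. Inferences Concerning $\beta_0$

Assume $\epsilon_i \sim N(0, \sigma^2)$. Then

$$b_0 \sim N(\beta_0, \sigma^2_{b_0}),$$

where

$$\sigma^2_{b_0} = \sigma^2 \left\{ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right\}$$

is to be estimated by

$$s^2_{b_0} = s^2 \left\{ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right\}.$$

The inferences are based on

$$\frac{b_0 - \beta_0}{s_{b_0}} \sim t_{n-2}.$$

For $|\bar{x}|$ large, $\beta_0$ is hard to estimate or to interpret. For the lengths and weights of snakes, $\beta_0$ has no meaning.

Consider $Y = 15 + 5X + \epsilon$, where $\epsilon \sim N(0, 4)$. Given $x_i = 8(.1)10$, that is, from 8 to 10 in steps of .1, simulate $Y_i$ and estimate the regression function.

[Figure: simulated data and fitted regression line, plotted over $0 \le x \le 10$.]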
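The equivalence $f = t^2$ can be confirmed numerically on the snake data. The sketch below builds the regression ANOVA quantities and the t statistic from the same three sums. The function name is made up, and the critical value 12.246 is the one quoted in the notes.

```python
import math

lengths = [60, 69, 66, 64, 54, 67, 59, 65, 63]          # cm
weights = [136, 198, 194, 140, 93, 172, 116, 174, 145]  # gm

F_01_1_7 = 12.246  # F_{.01,1,7}, from the notes


def regression_anova(x, y):
    """Return (SSR, SSE, f, t) for the simple linear regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ssr = sxy ** 2 / sxx             # model sum of squares, 1 df
    sse = syy - ssr                  # residual sum of squares, n - 2 df
    msr, mse = ssr / 1, sse / (n - 2)
    f = msr / mse
    t = (sxy / sxx) / math.sqrt(mse / sxx)  # b1 / s_b1
    return ssr, sse, f, t


ssr, sse, f, t = regression_anova(lengths, weights)
print(round(ssr, 1), round(sse, 1), round(f, 2), round(t ** 2, 2))
print("reject H0 at 1% level:", f > F_01_1_7)
```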

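Slide 23. Inferences Concerning $\mu_{Y|x} = \beta_0 + \beta_1 x$

Assume $\epsilon_i \sim N(0, \sigma^2)$. Then

$$\hat{Y} \sim N(\beta_0 + \beta_1 x, \sigma^2_{\hat{Y}}),$$

where $\hat{Y} = b_0 + b_1 x$, and

$$\sigma^2_{\hat{Y}} = \sigma^2 \left\{ \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right\}$$

is to be estimated by

$$s^2_{\hat{Y}} = s^2 \left\{ \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right\}.$$

The inferences are based on

$$\frac{\hat{Y} - (\beta_0 + \beta_1 x)}{s_{\hat{Y}}} \sim t_{n-2}.$$

For $|x - \bar{x}|$ large, $\beta_0 + \beta_1 x$ is hard to estimate.

The lengths and weights of female snakes. We are to estimate the average weight of snakes of length 60 cm.

$$\hat{Y} = -301 + 7.19(60) = 130.4,$$

$$s^2_{\hat{Y}} = 156.24 \left\{ \frac{1}{9} + \frac{(60 - 63)^2}{172} \right\} = 25.535 = 5.053^2,$$

so a 95% CI for $\beta_0 + \beta_1 \cdot 60$ is $130.4 \pm 2.365 (5.053)$, or $(118.45, 142.35)$.

Slide 24. Prediction of New Observation

To predict a new response at $x$, $Y = \beta_0 + \beta_1 x + \epsilon$, one has to allow for the variability of $\epsilon$.

With $\beta_0$, $\beta_1$, and $\sigma^2$ known, the prediction interval

$$(\beta_0 + \beta_1 x) \pm z_{\alpha/2} \, \sigma$$

covers $Y$ with probability $1 - \alpha$.

With $\beta_0 + \beta_1 x$ estimated by $\hat{Y} = b_0 + b_1 x$, we use

$$\hat{Y} \pm t_{\alpha/2, n-2} \sqrt{s^2 + s^2_{\hat{Y}}},$$

where the variances of $\hat{Y}$ and $\epsilon$ are estimated by $s^2_{\hat{Y}}$ and $s^2$.

The lengths and weights of female snakes. We are to predict the weight of a snake of length 60 cm.

$$\hat{Y} = 130.4, \qquad s^2 = 156.24, \qquad s^2_{\hat{Y}} = 25.535,$$

so a 95% PI for $Y$ at $X = 60$ is $130.4 \pm 2.365 \sqrt{156.24 + 25.535}$, or $(98.51, 162.29)$. This is wider than the CI for $\beta_0 + \beta_1 \cdot 60$.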
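The confidence interval for the mean response and the prediction interval for a new observation differ only in the extra $s^2$ under the square root. The sketch below computes both at a length of 60 cm. The helper name `intervals_at` is hypothetical, and the critical value 2.365 is taken from the notes.

```python
import math

lengths = [60, 69, 66, 64, 54, 67, 59, 65, 63]          # cm
weights = [136, 198, 194, 140, 93, 172, 116, 174, 145]  # gm

T_025_7 = 2.365  # t_{.025,7}, from the notes


def intervals_at(x0, x, y, t_crit):
    """Return (y_hat, CI for the mean response, PI for a new response) at x0."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    s2 = (syy - sxy ** 2 / sxx) / (n - 2)
    y_hat = b0 + b1 * x0
    var_fit = s2 * (1 / n + (x0 - xbar) ** 2 / sxx)  # estimated variance of Y-hat
    half_ci = t_crit * math.sqrt(var_fit)
    half_pi = t_crit * math.sqrt(s2 + var_fit)       # adds the variance of epsilon
    return (y_hat,
            (y_hat - half_ci, y_hat + half_ci),
            (y_hat - half_pi, y_hat + half_pi))


y_hat, ci, pi = intervals_at(60, lengths, weights, T_025_7)
print(round(y_hat, 1))
print("CI:", tuple(round(v, 2) for v in ci))
print("PI:", tuple(round(v, 2) for v in pi))
```

The endpoints differ from the hand calculation in the second decimal because the code carries unrounded values of $b_0$ and $b_1$, while the notes round $\hat{Y}$ to 130.4 first.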

Slide 25. $R^2$, Correlation

The coefficient of determination, or $R^2$,

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST},$$

measures the proportion of the variation in $Y$ explained by the model.

The coefficient of correlation,

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}},$$

measures the linear association between $X$ and $Y$.

$$0 \le R^2 \le 1, \qquad -1 \le r \le 1, \qquad R^2 = r^2.$$

Lengths and weights of snakes.

$$R^2 = \frac{8896.3}{9990} = .891, \qquad r = \frac{1237}{\sqrt{172 (9990)}} = .944.$$
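Both quantities follow from $S_{xx}$, $S_{yy}$, and $S_{xy}$, so the identity $R^2 = r^2$ is easy to verify on the snake data. A minimal sketch, with a made-up function name:

```python
import math

lengths = [60, 69, 66, 64, 54, 67, 59, 65, 63]          # cm
weights = [136, 198, 194, 140, 93, 172, 116, 174, 145]  # gm


def correlation_and_r2(x, y):
    """Return (r, R^2) for the simple linear regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    sse = syy - sxy ** 2 / sxx
    r2 = 1 - sse / syy  # same as SSR / SST, since SST = S_yy
    return r, r2


r, r2 = correlation_and_r2(lengths, weights)
print(round(r, 3), round(r2, 3), round(r ** 2, 3))
```

The sign of $r$ matches the sign of the slope $b_1$, which $R^2$ alone does not reveal.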