Stat 4 ANOVA: Analysis of Variance /6/04

Comparing Two variances: F distribution
Typical Data Sets
One way analysis of variance: example
Notation for one way ANOVA

Comparing Two variances: F distribution

We saw that the two sample tests had different statistics depending on whether we could say that the variances of both groups were equal (but unknown), $\sigma_1^2 = \sigma_2^2 = \sigma^2$, or different, $\sigma_1^2 \neq \sigma_2^2$. We suppose that both populations are Normally distributed with respective variances $\sigma_1^2$ and $\sigma_2^2$. In order to test the null hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ we need a new test statistic for the comparison of variances. Variances are on the squared scale, whereas means are in the same units as the original variables.

F-test for comparing two variances from Normally distributed random variables: to test $H_0: \sigma_1^2 = \sigma_2^2$, calculate

    F statistic $= s_1^2 / s_2^2$    ("F for Fisher")

Under the null hypothesis, the distribution of F is called the F distribution with (num, den) as the parameters, where num is the number of degrees of freedom for the numerator and den is the number of degrees of freedom for the denominator. As the usual computations involve two sided tests, we always use $H_A: \sigma_1^2 \neq \sigma_2^2$. We take the larger of the two variances, put it in the numerator, and test whether our observed $F_{obs}$ is larger than $F_{\alpha/2,\,num,\,den}$, the upper $\alpha/2$ critical value (that is, the $1 - \alpha/2$ quantile) of the F distribution with (num, den) degrees of freedom.

If $H_0$ is true, F has the $F_{num,den}$ distribution; if $H_0$ is false, F tends to be larger, so reject $H_0$ if F is sufficiently large.

    P-value of F test $= P(F_{num,den} > F_{obs})$

This upper-tail probability is compared with $\alpha/2$; doubling it gives the two-sided P-value.

Example: Take the energy expenditure example. Ratio of variances: $F_{obs} = 1.275$.
$P(F_{8,12} > 1.275) = 0.34 > 0.025$, so we do not reject the equality of variances.
$F_{0.975}$ = qf(0.975, 8, 12) = 3.51

    > qf(0.025,12,8)
    [1] 0.284756
    > pf(1.2748,8,12)
    [1] 0.6601336
    > 1-pf(1.2748,8,12)
    [1] 0.3398664
    > 1/1.2748
    [1] 0.7844368
    > pf(0.78444,12,8)
    [1] 0.3398687

[Figure: F density, df(seq(0, 5, 0.05), 12, 8) plotted against seq(0, 5, 0.05).]

- two parameters, $F_{\alpha[\nu_1,\nu_2]}$: $\nu_1$ = numerator d.o.f., $\nu_2$ = denominator d.o.f.
For exact P-values, use software; in R:

    df(x, df1, df2, log = FALSE)
    pf(q, df1, df2, ncp=0, lower.tail = TRUE, log.p = FALSE)
    qf(p, df1, df2, lower.tail = TRUE, log.p = FALSE)
    rf(n, df1, df2)

    > var.test(lean,obese)

            F test to compare two variances

    data:  lean and obese
    F = 0.7844, num df = 12, denom df = 8, p-value = 0.6797
    alternative hypothesis: true ratio of variances is not equal to 1
    95 percent confidence interval:
     0.1867876 2.754799
    sample estimates:
    ratio of variances
              0.784446

Typical Data Sets

Coagulation - Diet:

    Diet A   62 60 63 59
    Diet B   63 67 71 64 65 66
    Diet C   68 66 71 67 68 68
    Diet D   56 62 60 61 63 64 63 59

    boxplot(coag~diet,data=coag.df)
    summary(coag.df)
          coag        diet
     Min.   :56.00   A:4
     1st Qu.:61.75   B:6
     Median :63.50   C:6
     Mean   :64.00   D:8
     3rd Qu.:67.00
     Max.   :71.00

[Figure: boxplots of coag by diet (A, B, C, D).]

The data consist of blood coagulation times for 24 animals fed one of 4 different diets. In the following I write the data in a table and decompose the table into a sum of several tables. The 4 rows of the table correspond to Diets A, B, C and D.

Comparing several (more than 2) different samples

Reminder: to compare two samples from populations with the same variance:

1. Compute the means for both samples: $\bar{x}_1$ and $\bar{x}_2$.
2. The within sample sum of squares $\sum (x - \bar{x})^2$ is found for both samples.
3. The pooled estimate of variance $s_p^2$ is obtained by adding the sums of squares of deviations and dividing by the total degrees of freedom.
4. The standard error of the mean difference $\bar{x}_1 - \bar{x}_2$ is computed as $s_p \sqrt{1/n_1 + 1/n_2}$.
5. Test the null hypothesis $\mu_1 = \mu_2$ by computing the test statistic
   $\dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}$,
   which should follow a $t_{n_1+n_2-2}$ distribution under the null hypothesis.

A special decomposition according to the different factor levels

Later in the course we will use a vector notation and then want to think of stacking up these 24 values into a single column vector, but the tables save space.

    Diet   data                       = grand mean + treatment effect + residuals
    A      62 60 63 59                = 64 + (-3) + ( 1, -1,  2, -2)
    B      63 67 71 64 65 66          = 64 + (+2) + (-3,  1,  5, -2, -1,  0)
    C      68 66 71 67 68 68          = 64 + (+4) + ( 0, -2,  3, -1,  0,  0)
    D      56 62 60 61 63 64 63 59    = 64 + (-3) + (-5,  1, -1,  0,  2,  3,  2, -2)

Equivalently, data = diet mean + residuals, with diet means 61, 66, 68 and 61 for Diets A, B, C and D.
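The table decomposition can be reproduced numerically. The following is a minimal sketch in R, assuming the 24 coagulation times are entered by hand in diet order; the variable names (`grand`, `group`, `effect`, `residual`, `decomp`) are illustrative choices, not names used in the notes.

```r
# Coagulation times by diet, entered in the order A, B, C, D.
coag <- c(62, 60, 63, 59,
          63, 67, 71, 64, 65, 66,
          68, 66, 71, 67, 68, 68,
          56, 62, 60, 61, 63, 64, 63, 59)
diet <- factor(rep(c("A", "B", "C", "D"), times = c(4, 6, 6, 8)))

grand    <- mean(coag)       # grand mean, 64
group    <- ave(coag, diet)  # each value replaced by the mean of its diet
effect   <- group - grand    # treatment effect: -3, +2, +4, -3
residual <- coag - group     # what is left within each diet

# One row per animal: data = grand mean + treatment effect + residual.
decomp <- data.frame(diet, coag, grand, effect, residual)
print(decomp)

# The three pieces add back up to the data.
all.equal(coag, grand + effect + residual)
```

`ave()` with its default `FUN = mean` returns a vector as long as the data, which is convenient here because every piece of the decomposition is itself a table of 24 entries.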
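A, T and R are perpendicular 24-vectors (Pythagoras):

    $Y = (y_{ij})$                          data
    $A = (\bar{y}_{..})$                    average
    $T = (\bar{y}_{i.} - \bar{y}_{..})$     treatment
    $R = (y_{ij} - \bar{y}_{i.})$           residual

    $Y = A + T + R$
    d.o.f.: $n = 1 + (a - 1) + (n - a)$

    $\sum_{i,j} y_{ij}^2 = \sum_{i,j} \bar{y}_{..}^2 + \sum_{i,j} (\bar{y}_{i.} - \bar{y}_{..})^2 + \sum_{i,j} (y_{ij} - \bar{y}_{i.})^2$    (1)

    $SS = SS_{ave} + SS_{among} + SS_{within}$

    $SS_{total} = SS - SS_{ave} = \sum_{i,j} (y_{ij} - \bar{y}_{..})^2 = SS_{among} + SS_{within}$

Model Checking

Model I assumptions:

    $y_{ij} = \mu_i + \epsilon_{ij}$,  $i = 1, \ldots, a$,  $j = 1, \ldots, n_i$,  $\epsilon_{ij}$ indep. $N(0, \sigma^2)$

Look at the residuals $e_{ij} = y_{ij} - \bar{y}_{i.}$, usually via plots, e.g.

1. Check normality via normal quantile plots.
2. Check (visually) constancy of variance ($\sigma_i^2 \equiv \sigma^2$): plot residuals versus fitted values, $e_{ij}$ (y-axis) vs $\bar{y}_{i.}$ (x-axis), and look for evidence that the spread of $e_{ij}$ depends on $\bar{y}_{i.}$.
3. Time sequence: check the independence assumption for observation $y_{ij}$ taken at time $t_{ij}$ by plotting $e_{ij}$ (y-axis) vs $t_{ij}$.

Remark: Alternate form of Model I:

    $y_{ij} = \mu + \alpha_i + \epsilon_{ij}$

with identifiability constraint

    $\sum_{i=1}^{a} n_i \alpha_i = 0$    (2)

Planned comparisons: single pairs of means, or contrasts specified in advance.

Difference of Means, e.g. $\mu_i - \mu_j$: like a two-sample test, but we are in the ANOVA model and hence use the pooled variance estimate $s^2$ for the common variance $\sigma^2$.

    $100(1 - \alpha)\%$ CI:  $\bar{y}_{i.} - \bar{y}_{j.} \pm t_{\alpha[\nu]} \, SE_{\bar{y}_{i.} - \bar{y}_{j.}}$,  $\nu = n - a$

    $SE_{\bar{y}_{i.} - \bar{y}_{j.}} = s \sqrt{1/n_i + 1/n_j}$

ANOVA - Part I - One way //03

    $y_{ij} = \mu_i + \epsilon_{ij}$,  $i = 1, \ldots, a$ (treatments / groups),  $j = 1, \ldots, n_i$ (observations within groups)

$\mu_i$ fixed group means, $\epsilon_{ij} \sim N(0, \sigma^2)$, $n = \sum_{i=1}^{a} n_i = n_1 + n_2 + \cdots + n_a$.

(Unknown parameters of the model: $\mu_1, \ldots, \mu_a, \sigma^2$.)

Null hypothesis of interest here: $H_0: \mu_1 = \mu_2 = \cdots = \mu_a$ vs $H_1$: not all equal.

Notes: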
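1. A common error variance in all groups is assumed.
2. $a = 2$ reduces to the two sample problem ($\sigma_1^2 = \sigma_2^2$).

Variation within treatments

    i-th group:  $y_{i1}, y_{i2}, \ldots, y_{in_i}$
    ave:         $\bar{y}_{i.}$
    var:         $s_i^2$
    SS:          $(n_i - 1) s_i^2 = \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2$
    dof:         $n_i - 1$

$s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2$ is an unbiased estimate of $\sigma^2$ for the i-th group. But we have $i = 1, \ldots, a$ independent estimates of the common error variance $\sigma^2$.

Pooled estimate of $\sigma^2$ = weighted average (by d.o.f.) of the estimates:

    $s^2 = \dfrac{\sum_i (n_i - 1) s_i^2}{\sum_i (n_i - 1)} = \dfrac{SS_{within}}{n - a} = MS_{within}$    ("mean squares within groups")

    $SS_{within} = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2$ = sum of squares within treatments
    $n - a = \sum_{i=1}^{a} (n_i - 1)$ = d.o.f. within treatments

Variation among treatments

Compare the group sample means $\bar{y}_{i.}$ to the overall sample mean $\bar{y}_{..} = \frac{1}{n} \sum_i n_i \bar{y}_{i.}$.

Motivation: Suppose that in fact the $\mu_i$ and the $n_i$ are all the same: $\mu_i \equiv \mu$ and $n_i \equiv m$ (a common group size, written $m$ here to keep it distinct from the total $n$), so that $\bar{y}_{i.} = \frac{1}{m} \sum_{j} y_{ij}$. Then $\bar{y}_{i.} \sim (\mu, \sigma^2/m)$. We look at this new sample of $a$ observed $\bar{y}_{i.}$'s and compute their estimated variance. Then we would have another estimate of $\sigma^2$, separate from the pooled estimate described above:

    $\hat{\sigma}^2 = m \cdot \dfrac{1}{a - 1} \sum_{i=1}^{a} (\bar{y}_{i.} - \bar{y}_{..})^2 = \dfrac{1}{a - 1} \sum_{i=1}^{a} m (\bar{y}_{i.} - \bar{y}_{..})^2$

This suggests defining

    $SS_{among} = \sum_{i=1}^{a} n_i (\bar{y}_{i.} - \bar{y}_{..})^2$ = sum of squares among treatments
    $MS_{among} = \dfrac{SS_{among}}{a - 1}$ = mean square among treatments
    $a - 1$ = d.o.f. among treatments

Thus if $H_0: \mu_1 = \cdots = \mu_a$ is true, we have two estimates of variability: $MS_{among}$ ($a - 1$ dof) and $MS_{within}$ ($n - a$ dof). If $H_0$ is false, due to the variation among the $\mu_i$, we expect $F = MS_{among} / MS_{within}$ to be larger than 1.

Total variation and an identity

    $SS_{total} = \sum_{i,j} (y_{ij} - \bar{y}_{..})^2 = \sum_{i,j} (y_{ij} - \bar{y}_{i.})^2 + \sum_{i,j} (\bar{y}_{i.} - \bar{y}_{..})^2 = SS_{within} + SS_{among}$

- a decomposition of variation about the grand mean $\bar{y}_{..}$ into a component of variation about the individual group means and a component between the sample means.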
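The within and among formulas can be checked by hand on the coagulation data. This is a sketch, assuming the data are entered as two plain vectors; names such as `n_i`, `ybar_i`, `s2_i` and `f_obs` are illustrative and simply mirror the symbols used in the formulas.

```r
coag <- c(62, 60, 63, 59,
          63, 67, 71, 64, 65, 66,
          68, 66, 71, 67, 68, 68,
          56, 62, 60, 61, 63, 64, 63, 59)
diet <- factor(rep(c("A", "B", "C", "D"), times = c(4, 6, 6, 8)))

n      <- length(coag)               # total sample size, 24
a      <- nlevels(diet)              # number of treatments, 4
n_i    <- tapply(coag, diet, length) # group sizes
ybar_i <- tapply(coag, diet, mean)   # group means
s2_i   <- tapply(coag, diet, var)    # group variances

# Pooled variance: d.o.f.-weighted average of the group variances.
ss_within <- sum((n_i - 1) * s2_i)   # 112
ms_within <- ss_within / (n - a)     # 5.6

# Variation of the group means about the grand mean.
ss_among <- sum(n_i * (ybar_i - mean(coag))^2)  # 228
ms_among <- ss_among / (a - 1)                  # 76

# The identity SS_total = SS_within + SS_among.
ss_total <- sum((coag - mean(coag))^2)          # 340
all.equal(ss_total, ss_within + ss_among)

# F ratio and its upper-tail P-value.
f_obs   <- ms_among / ms_within
p_value <- pf(f_obs, a - 1, n - a, lower.tail = FALSE)
c(F = f_obs, p = p_value)
```

Computing `ss_within` from the group variances rather than from the raw residuals makes the "weighted average by d.o.f." reading of the pooled estimate explicit.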
This identity leads to an analysis of variance table, illustrated on the blood data.

    Source of Variation          SS                   d.o.f.        MS                          F
    Among (Between) treatments   SS_among  = 228      a - 1 = 3     MS_among  = 228/3  = 76     MS_among / MS_within = 76/5.6 = 13.6
    Within treatments            SS_within = 112      n - a = 20    MS_within = 112/20 = 5.6
    Total                        SS_total  = 340      n - 1 = 23

ANOVA F-test: to test $H_0: \mu_1 = \mu_2 = \cdots = \mu_a$, calculate

    F statistic $= MS_{among} / MS_{within}$    ("F for Fisher")

If $H_0$ is true, F has the $F_{a-1,\,n-a}$ distribution. If $H_0$ is false, F tends to be larger, so reject $H_0$ if F is sufficiently large.

    P-value of F test $= P(F_{a-1,\,n-a} > F_{obs})$

- two parameters, $F_{\alpha[\nu_1,\nu_2]}$: $\nu_1$ = numerator d.o.f., $\nu_2$ = denominator d.o.f.

For exact P-values, use software; in R:

    > pf(13.6,3,20)
    [1] 0.9999541
    > pf(13.6,3,20,lower.tail=F)
    [1] 4.594599e-05
    > qf(0.999,3,20)
    [1] 8.09838

Geometric Picture of Variance Decomposition

    coag.aov <- lm(coag~diet,data=coag.df)
    anova(coag.aov)
    Analysis of Variance Table

    Response: coag
              Df Sum Sq Mean Sq F value    Pr(>F)
    diet       3  228.0    76.0  13.571 4.658e-05 ***
    Residuals 20  112.0     5.6
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The data consist of blood coagulation times for 24 animals fed one of 4 different diets. In the following I write the data in a table and decompose the table into a sum of several tables. The 4 rows of the table correspond to Diets A, B, C and D. We could use a vector notation and think of stacking up these 24 values into a single column vector.

    Diet   data                       = grand mean + treatment effect + residuals
    A      62 60 63 59                = 64 + (-3) + ( 1, -1,  2, -2)
    B      63 67 71 64 65 66          = 64 + (+2) + (-3,  1,  5, -2, -1,  0)
    C      68 66 71 67 68 68          = 64 + (+4) + ( 0, -2,  3, -1,  0,  0)
    D      56 62 60 61 63 64 63 59    = 64 + (-3) + (-5,  1, -1,  0,  2,  3,  2, -2)

The squared length of the left hand side is the uncorrected total sum of squares. The first term on the right hand side gives the grand mean. Its squared length is sometimes put in ANOVA tables as the Sum of Squares due to the Grand Mean, but it is usually subtracted from the total to produce the Total Sum of Squares that we usually put at the bottom of the table, often called the Corrected (or Adjusted) Total Sum of Squares.
In this case the corrected sum of squares is the squared length of the table of deviations from the grand mean, which is 340.

    > sum(coag^2)-24*64^2
    [1] 340

The second term on the right hand side of the equation has squared length 228 (which is the Treatment Sum of Squares).

    > sum((pred-64)^2)
    [1] 228

This is the squared length of the vector of individual sample means minus the grand mean. The last vector of the decomposition is called the residual vector and has squared length 112.

    > sum(res^2)
    [1] 112

(The objects pred and res are not defined in the session shown; presumably they hold the fitted diet means and the residuals.)
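The session above uses `coag`, `pred` and `res` without showing how they were created. One plausible reconstruction, offered as an assumption rather than as the original code, takes `pred` to be the fitted values and `res` the residuals of the one-way fit:

```r
# Rebuild the data frame used in the notes (layout assumed: one row per animal).
coag.df <- data.frame(
  coag = c(62, 60, 63, 59,
           63, 67, 71, 64, 65, 66,
           68, 66, 71, 67, 68, 68,
           56, 62, 60, 61, 63, 64, 63, 59),
  diet = factor(rep(c("A", "B", "C", "D"), times = c(4, 6, 6, 8)))
)

coag.aov <- lm(coag ~ diet, data = coag.df)

coag <- coag.df$coag        # the raw response as a plain vector
pred <- fitted(coag.aov)    # assumed meaning of `pred`: fitted diet means
res  <- residuals(coag.aov) # assumed meaning of `res`: y minus its diet mean

sum(coag^2) - 24 * 64^2     # corrected total sum of squares, 340
sum((pred - 64)^2)          # treatment sum of squares, 228
sum(res^2)                  # residual sum of squares, 112
```

In a one-way layout the fitted value for each animal is just the mean of its diet, so this agrees with computing the diet means directly.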
The shortcut used for the corrected total follows from

    $\sum_i (y_i - \bar{y})^2 = \sum_i (y_i - \bar{y})(y_i - \bar{y})$
    $= \sum_i (y_i^2 - 2 y_i \bar{y} + \bar{y}^2)$
    $= \sum_i y_i^2 - 2\bar{y} \sum_i y_i + n\bar{y}^2$
    $= \sum_i y_i^2 - 2n\bar{y}^2 + n\bar{y}^2$
    $= \sum_i y_i^2 - n\bar{y}^2$

Corresponding to the decomposition of the total squared length of the data vector is a decomposition of its dimension, 24, into the dimensions of subspaces. For instance, the grand mean is always a multiple of the single vector all of whose entries are 1; this describes a one dimensional space. The second vector, of deviations from the grand mean, lies in the three dimensional subspace of tables which are constant in each row and have a total equal to 0. Similarly, the vector of residuals lies in a 20 dimensional subspace: the set of all tables whose rows each sum to 0. This decomposition of dimensions is the decomposition of degrees of freedom. So 24 = 1 + 3 + 20, and the degrees of freedom for treatment and error are 3 and 20 respectively. The vector whose squared length is the Corrected Total Sum of Squares lies in the 23 dimensional subspace of vectors whose entries sum to 0; this produces the 23 total degrees of freedom in the usual ANOVA table.

[Figure: geometric picture of the decomposition Y = A + T + R, showing the vectors A, A + T and Y.]

A, T and R are perpendicular 24-vectors (Pythagoras):

    $Y = (y_{ij})$                          data
    $A = (\bar{y}_{..})$                    average
    $T = (\bar{y}_{i.} - \bar{y}_{..})$     treatment
    $R = (y_{ij} - \bar{y}_{i.})$           residual

    $Y = A + T + R$
    d.o.f.: $n = 1 + (a - 1) + (n - a)$

    $\sum_{i,j} y_{ij}^2 = \sum_{i,j} \bar{y}_{..}^2 + \sum_{i,j} (\bar{y}_{i.} - \bar{y}_{..})^2 + \sum_{i,j} (y_{ij} - \bar{y}_{i.})^2$    (3)

    $SS = SS_{ave} + SS_{among} + SS_{within}$

    $SS_{total} = SS - SS_{ave} = \sum_{i,j} (y_{ij} - \bar{y}_{..})^2 = SS_{among} + SS_{within}$
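The perpendicularity of A, T and R and the count of dimensions can both be verified numerically. The sketch below assumes the same hand-entered data; the vector is named `Tr` rather than `T` because `T` is an alias for `TRUE` in R, and reading the dimensions off the rank of the design matrix is one possible check, not something the notes themselves do.

```r
coag <- c(62, 60, 63, 59,
          63, 67, 71, 64, 65, 66,
          68, 66, 71, 67, 68, 68,
          56, 62, 60, 61, 63, 64, 63, 59)
diet <- factor(rep(c("A", "B", "C", "D"), times = c(4, 6, 6, 8)))

Y  <- coag
A  <- rep(mean(Y), length(Y))  # grand mean repeated 24 times
Tr <- ave(Y, diet) - A         # diet mean minus grand mean
R  <- Y - ave(Y, diet)         # residual

# Pairwise inner products: all zero, so the three vectors are perpendicular.
inner <- c(A.T = sum(A * Tr), A.R = sum(A * R), T.R = sum(Tr * R))
print(inner)

# Pythagoras: squared lengths add up, 98644 = 98304 + 228 + 112.
lengths2 <- c(Y = sum(Y^2), A = sum(A^2), T = sum(Tr^2), R = sum(R^2))
print(lengths2)

# Shortcut formula: sum (y - ybar)^2 = sum y^2 - n * ybar^2.
all.equal(sum((Y - mean(Y))^2), sum(Y^2) - length(Y) * mean(Y)^2)

# Dimensions: the space spanned by the grand mean and the diet indicators
# has rank 4 = 1 + 3, leaving 24 - 4 = 20 dimensions for the residuals.
model_rank <- qr(model.matrix(~ diet))$rank
c(mean = 1, treatment = model_rank - 1, residual = length(Y) - model_rank)
```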