22s:152 Applied Linear Regression. Take random samples from each of m populations.


Edward Nicholson
5 years ago

22s:152 Applied Linear Regression
Chapter 8: ANOVA

NOTE: We will meet in the lab on Monday, October 10.

One-way ANOVA

Focuses on testing for differences among group means.

Take random samples from each of $m$ populations.

$n_i$ is the sample size in the $i$th population, for $i = 1, \ldots, m$.

$y_{ij}$ is the $j$th observation in the $i$th population.

There are a couple of commonly used models for a one-way ANOVA with $m$ groups.

The cell means model:

$$Y_{ij} = \mu_i + \epsilon_{ij}, \qquad \epsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2), \qquad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, n_i$$

So $E[Y_{1j}] = \mu_1$, and all observations from group 1 have the same mean, $\mu_1$. The mean of group $i$ is $\mu_i$.

The mean parameters to be estimated are $\mu_1, \mu_2, \ldots, \mu_m$.

There is one noise parameter to estimate: $\sigma^2$.

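Estimators:

$$\hat\mu_i = \bar{Y}_i = \frac{\sum_{j=1}^{n_i} Y_{ij}}{n_i}$$

The estimate $\hat\mu_i$ for a group is just the sample group mean.

$\sigma^2$ is estimated using a pooled estimate because constant variance is assumed:

$$\hat\sigma^2 = s_P^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_m - 1)s_m^2}{N - m}$$

where $s_i^2$ is the sample variance in the $i$th group and $N = n_1 + \cdots + n_m$ is the total sample size.

Pooled estimate of $\sigma$: $s_P = \sqrt{s_P^2}$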
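As a minimal sketch of these estimators, the R code below computes the group means and the pooled variance by hand and compares them with a cell means fit. The data are simulated for illustration only (the group sizes, true means, and names such as `n_i` and `s2_P` are assumptions, not course data).

```r
# Hypothetical simulated data: m = 3 groups with unequal sample sizes.
set.seed(1)
n_i   <- c(8, 10, 12)
group <- factor(rep(c("g1", "g2", "g3"), times = n_i))
y     <- rnorm(sum(n_i), mean = rep(c(10, 20, 30), times = n_i), sd = 2)

# Cell means estimates: each mu_i is estimated by its sample group mean.
mu_hat <- as.numeric(tapply(y, group, mean))

# Pooled variance: weight each group's sample variance by (n_i - 1).
s2_i <- as.numeric(tapply(y, group, var))
N    <- length(y)
m    <- nlevels(group)
s2_P <- sum((n_i - 1) * s2_i) / (N - m)
s_P  <- sqrt(s2_P)

# Fitting the cell means model directly (no intercept) reproduces both.
fit <- lm(y ~ group - 1)
coef(fit)            # the m group means
summary(fit)$sigma   # the pooled standard deviation s_P
```

Dropping the intercept with `- 1` is what makes `lm` report one mean per group, which is the cell means parameterization.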

Now, a different way to parameterize the same situation...

The effects model:

$$Y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \qquad \epsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2), \qquad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, n_i$$

So $E[Y_{1j}] = \mu + \alpha_1$, and all observations from group 1 have the same mean, $\mu + \alpha_1$.

In this model, there are $m$ groups ($m$ estimated means), and we're using $m + 1$ parameters to define the mean structure. This is an over-parameterization.

Different sets of parameter values $(\mu, \alpha_1, \ldots, \alpha_m)$ can give the same fitted values (i.e., can give the same estimated group means).

For example, suppose $m = 3$, and $\bar{Y}_1 = 10$, $\bar{Y}_2 = 20$, and $\bar{Y}_3 = 30$.

In the over-parameterized effects model,

$$\hat{Y}_{ij} = \hat\mu + \hat\alpha_i \quad \text{for } i = 1, 2, 3$$

many different combinations of estimates $(\hat\mu, \hat\alpha_1, \hat\alpha_2, \hat\alpha_3)$ give these same estimated group means $(\hat{Y}_{1j}, \hat{Y}_{2j}, \hat{Y}_{3j}) = (10, 20, 30)$.

This means we have to use a constraint or restriction to make the parameters in the model identifiable (uniquely determined).
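To make the non-identifiability concrete, here is a small R sketch with three parameter sets that all reproduce the group means (10, 20, 30). The particular combinations are illustrative choices made for this sketch, not values taken from the original slide.

```r
# Three illustrative parameter sets (mu, alpha_1, alpha_2, alpha_3).
params <- rbind(
  c(mu =  0, a1 =  10, a2 =  20, a3 = 30),
  c(mu = 20, a1 = -10, a2 =   0, a3 = 10),  # happens to satisfy sum(alpha) = 0
  c(mu = 30, a1 = -20, a2 = -10, a3 =  0)   # happens to satisfy alpha_3 = 0
)

# Fitted group means are mu + alpha_i; each row gives the same three means.
fitted_means <- params[, "mu"] + params[, c("a1", "a2", "a3")]
colnames(fitted_means) <- c("group1", "group2", "group3")
fitted_means
```

The second and third rows are exactly the solutions picked out by the two constraints discussed next, which is why a constraint is enough to pin the parameters down.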

The effects model: $Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$

The $\alpha_m = 0$ constraint:

Set the last group parameter to zero. (Essentially, delete the parameter for the last category.)

Under this constraint, group $m$ is seen as the baseline group: $\alpha_m = 0$, so $E[Y_{mj}] = \mu + \alpha_m = \mu$.

$\mu$ represents the mean of the $m$th group under this constraint.

$\alpha_i$ is the distance of group $i$ from group $m$. (The $\alpha_i$'s give the distance from the baseline group.)

This may or may not be a useful interpretation for your situation.

Dummy regressor coding for the $\alpha_m = 0$ constraint with $m = 3$:

Category   D1   D2
group 1     1    0
group 2     0    1
group 3     0    0

This is the coding we've been using so far with our dummy regressors (we'll call this Baseline Coding or Indicator Coding).

Regression model: $Y_i = \mu + \alpha_1 D_{1i} + \alpha_2 D_{2i} + \epsilon_i$

Model by group...

Group 1: $Y_i = \mu + \alpha_1 + \epsilon_i$
Group 2: $Y_i = \mu + \alpha_2 + \epsilon_i$
Group 3: $Y_i = \mu + \epsilon_i$
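A short R sketch of baseline coding with the last group as baseline, under the assumption of simulated data (the data and object names below are illustrative). Note that R's default treatment contrasts use the first level as baseline, so `base = 3` is set explicitly here.

```r
# Hypothetical simulated data: 3 groups of 5 observations each.
set.seed(2)
group <- factor(rep(c("g1", "g2", "g3"), each = 5))
y     <- rnorm(15, mean = rep(c(10, 20, 30), each = 5), sd = 2)

# Indicator coding with the LAST level as baseline (alpha_3 = 0).
contrasts(group) <- contr.treatment(3, base = 3)
contrasts(group)   # rows are (1, 0), (0, 1), (0, 0)

fit_base    <- lm(y ~ group)
group_means <- as.numeric(tapply(y, group, mean))

# Intercept = mean of group 3; the other coefficients are distances from group 3.
coef(fit_base)
group_means[1:2] - group_means[3]
```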

The effects model: $Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$

There is another often-used constraint that produces easily interpretable parameters...

The sum-to-zero constraint:

$$\sum_{i=1}^{m} \alpha_i = 0 \quad\Longrightarrow\quad \alpha_m = -(\alpha_1 + \alpha_2 + \cdots + \alpha_{m-1})$$

so only $m - 1$ dummy variables are needed.

$\mu$ is seen as the grand mean, or the average of the population means (a nice interpretation).

If you have balanced data: $\hat\mu = \bar{Y}$, the overall mean of the sample.

If you have unbalanced data: $\hat\mu = \frac{\sum_{i=1}^{m} \bar{Y}_i}{m}$, the mean of the sample means.

$\alpha_i$ represents the distance of group $i$ from the grand mean.

Thus, $\alpha_i$ is the effect of being in group $i$ (it tells us whether the mean of group $i$ is up or down from the grand mean).

Dummy regressor coding for the sum-to-zero constraint with $m = 3$:

Category   D1   D2
group 1     1    0
group 2     0    1
group 3    -1   -1

Regression model (looks the same as indicator coding): $Y_i = \mu + \alpha_1 D_{1i} + \alpha_2 D_{2i} + \epsilon_i$

Model by group...

Group 1: $Y_i = \mu + \alpha_1 + \epsilon_i$
Group 2: $Y_i = \mu + \alpha_2 + \epsilon_i$
Group 3: $Y_i = \mu - (\alpha_1 + \alpha_2) + \epsilon_i$

There is no baseline group in this interpretation.

You still only need 2 dummy variables, as $\alpha_3 = -(\alpha_1 + \alpha_2)$... that's the restriction we've imposed.

These (1, 0, -1) dummy regressors are called deviation regressors, because the interpretation gives values as distances (or deviations) from the grand mean.
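The sum-to-zero coding is available in R as `contr.sum`. The sketch below, on simulated unbalanced data (an assumption made so that the two candidate values of $\hat\mu$ differ), checks that the intercept is the mean of the sample group means rather than the overall sample mean.

```r
# Hypothetical simulated data, deliberately unbalanced.
set.seed(3)
n_i   <- c(4, 6, 10)
group <- factor(rep(c("g1", "g2", "g3"), times = n_i))
y     <- rnorm(sum(n_i), mean = rep(c(10, 20, 30), times = n_i), sd = 2)

contr.sum(3)   # rows are (1, 0), (0, 1), (-1, -1): the deviation regressors

fit_dev     <- lm(y ~ group, contrasts = list(group = contr.sum))
group_means <- as.numeric(tapply(y, group, mean))

mu_hat    <- coef(fit_dev)[["(Intercept)"]]
alpha_hat <- unname(coef(fit_dev)[-1])
alpha_hat <- c(alpha_hat, -sum(alpha_hat))   # alpha_3 = -(alpha_1 + alpha_2)

mu_hat               # equals mean(group_means), not mean(y)
mu_hat + alpha_hat   # recovers the three group means
```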

Example: Deviation regressors - Back to the Pet and Stress data

We'll now use a different dummy regressor coding of the same situation: the deviation regressors.

Category   D1   D2
Control     1    0
Friend      0    1
Pet        -1   -1

> # stringsAsFactors=TRUE is needed in R >= 4.0 so that group is read as a factor
> pets=read.csv("pets.csv", stringsAsFactors=TRUE)
> attach(pets)
> names(pets)
[1] "group" "rate"
> levels(group)
[1] "C" "F" "P"

Create the deviation regressors...

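> n=nrow(pets)
> # group labels are upper case ("C", "F", "P"), so the comparisons must be too
> dummy.1=rep(0,n)
> dummy.1[group=="C"]= 1
> dummy.1[group=="P"]= -1
> dummy.2=rep(0,n)
> dummy.2[group=="F"]= 1
> dummy.2[group=="P"]= -1
> data.frame(group,dummy.1,dummy.2)
  group dummy.1 dummy.2
1     P      -1      -1
2     F       0       1
3     P      -1      -1
4     C       1       0
5     C       1       0
...

Regression model: $Y_i = \mu + \alpha_1 D_{1i} + \alpha_2 D_{2i} + \epsilon_i$

> lm.out=lm(rate ~ dummy.1 + dummy.2)
> lm.out$coefficients
(Intercept)     dummy.1     dummy.2

The three coefficients are $\hat\mu$, $\hat\alpha_1$, and $\hat\alpha_2$, in that order.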

Since this is balanced data, $\hat\mu$ is the overall mean:
> mean(rate)

All three group means, $\bar{Y}_1, \bar{Y}_2, \bar{Y}_3$:
> tapply(rate,group,mean)
C F P

Control treatment group: $\mu + \alpha_1$
> lm.out$coefficients[1]+lm.out$coefficients[2]

Friend treatment group: $\mu + \alpha_2$
> lm.out$coefficients[1]+lm.out$coefficients[3]

Pet treatment group: $\mu - (\alpha_1 + \alpha_2)$
> lm.out$coefficients[1]-(lm.out$coefficients[2]+lm.out$coefficients[3])

You can use the summary statement to get an overall F-test:

> lm.out=lm(rate ~ dummy.1 + dummy.2)
> summary(lm.out)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              < 2e-16 ***
dummy.1
dummy.2                                     e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 42 degrees of freedom
Multiple R-Squared:  , Adjusted R-squared:
F-statistic:  on 2 and 42 DF,  p-value: 2.092e-05

The overall F-test, on 2 and 42 degrees of freedom, has a p-value of 2.092e-05.

We reject the null and conclude that there is statistically significant evidence that at least one of the group means is different from the others.

This F-statistic and p-value are EXACTLY the same as when we fit the model using the $\alpha_3 = 0$ constraint in the part 1 notes (on p.16).
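The claim that the overall F-test does not depend on the constraint can be checked directly. The following is a sketch on simulated balanced data (not the pets data; all names are illustrative) that fits the same one-way model under both codings and compares the F-statistics and fitted values.

```r
# Hypothetical simulated data: 3 balanced groups of 15, so 42 residual df.
set.seed(4)
group <- factor(rep(c("g1", "g2", "g3"), each = 15))
y     <- rnorm(45, mean = rep(c(80, 90, 75), each = 15), sd = 9)

# Same model, two constraints.
fit_base <- lm(y ~ group,
               contrasts = list(group = contr.treatment(3, base = 3)))  # alpha_3 = 0
fit_dev  <- lm(y ~ group,
               contrasts = list(group = contr.sum))                     # sum(alpha) = 0

F_base <- summary(fit_base)$fstatistic   # named vector: value, numdf, dendf
F_dev  <- summary(fit_dev)$fstatistic

# The overall p-value comes from the upper tail of the F distribution.
p_base <- pf(F_base[["value"]], F_base[["numdf"]], F_base[["dendf"]], lower.tail = FALSE)
p_dev  <- pf(F_dev[["value"]],  F_dev[["numdf"]],  F_dev[["dendf"]],  lower.tail = FALSE)

rbind(baseline = F_base, deviation = F_dev)
```

Both codings span the same column space, so the fitted values, residuals, and hence the overall F-test agree; only the individual coefficients and their t-tests change.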

Hypothesis testing in one-way ANOVA

Cell means model:
$H_0: \mu_1 = \mu_2 = \cdots = \mu_m$
$H_A:$ at least one $\mu_i$ is different

Effects model:
$H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_m = 0$
$H_A:$ at least one $\alpha_i \neq 0$

Both hypotheses are testing the same thing... whether or not all the group means are equal.

ANOVA table and overall F-test

When we represent group by dummy regressors, R sees each dummy variable as a separate covariate (notice how it performs a test for each dummy regressor in the summary).

In the pet example, we can test the significance of group by lumping the two covariates together and doing a partial F-test (or an overall F-test in this case, because they were the only predictors in the model).

What about the ANOVA table and the sums of squares?

$$\text{RegSS} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \qquad \text{RSS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2, \qquad \text{TSS} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

> RegSS=sum((lm.out$fitted.values-mean(rate))^2)
> RegSS
> RSS=sum((rate-lm.out$fitted.values)^2)
> RSS

Source       Sum of Squares   df   Mean Square   F
Regression                     2
Residuals                     42
Total                         44

When R sees group as a factor (categorical variable), and it's the ONLY predictor, we can get the RegSS from the anova statement.

> lm.out=lm(rate~group)
> anova(lm.out)
Analysis of Variance Table

Response: rate
          Df Sum Sq Mean Sq F value    Pr(>F)
group      2                         2.092e-05 ***
Residuals 42
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Classical ANOVA sums of squares

Notation for sums of squares in a 1-way ANOVA:

$$\text{RegSS} = SS_{group} = \sum_{i=1}^{m} n_i (\bar{Y}_i - \bar{Y})^2$$

where $\bar{Y}_i$ is the group $i$ mean and $\bar{Y}$ is the overall sample mean. (Note that $\hat{Y}_{ij} = \bar{Y}_i$.)

Residual sum of squares:

$$\text{RSS} = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2$$

Source      Sum of Squares                                              df      Mean Square             F
Group       $\sum_{i=1}^{m} n_i (\bar{Y}_i - \bar{Y})^2$                $m-1$   RegSS/$(m-1)$ = RegMS   RegMS/MSE
Residuals   $\sum_{i=1}^{m} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2$    $n-m$   RSS/$(n-m)$ = MSE
Total       $\sum_{i=1}^{m} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y})^2$      $n-1$
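The classical formulas can be verified numerically against R's `anova` table. This is a sketch on simulated data (the data and the names `SS_group`, `F_stat`, and so on are assumptions for illustration).

```r
# Hypothetical simulated data with unequal group sizes.
set.seed(5)
n_i   <- c(5, 7, 9)
group <- factor(rep(c("g1", "g2", "g3"), times = n_i))
y     <- rnorm(sum(n_i), mean = rep(c(10, 20, 30), times = n_i), sd = 2)

ybar_i <- as.numeric(tapply(y, group, mean))   # group means
ybar   <- mean(y)                              # overall mean
m      <- nlevels(group)
n      <- length(y)

# Classical sums of squares.
SS_group <- sum(n_i * (ybar_i - ybar)^2)
RSS      <- sum((y - ybar_i[as.integer(group)])^2)   # each obs minus its group mean
TSS      <- sum((y - ybar)^2)

# Mean squares and the F-statistic.
RegMS  <- SS_group / (m - 1)
MSE    <- RSS / (n - m)
F_stat <- RegMS / MSE

# Compare with R's ANOVA table.
aov_tab <- anova(lm(y ~ group))
aov_tab
```

The decomposition TSS = RegSS + RSS holds exactly, which is why the Total row needs no separate computation.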

Assessing the assumptions of one-way ANOVA

Normal distribution of response variable in each population (or group)
  - histograms, boxplots for sample data from each population (done separately)
  - normal qq plot for sample data from each population (done separately)
  - normal qq plot of all residuals from the fitted model

Same standard deviation (or variance) in all populations
  - can use Levene's test for homogeneity of variance (but it assumes normality of observations)
  - rule of thumb: if the largest sample standard deviation isn't more than twice as large as the smallest sample standard deviation, the assumption is probably met closely enough for ANOVA to be OK

  - if the $n_i$'s are equal (balanced design), the ANOVA is less sensitive to the violation of equal variance

If one or both assumptions are violated, try a transformation.

If only normality is violated, try a non-parametric procedure such as the Kruskal-Wallis test.
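A base-R sketch of these checks on simulated data follows. Everything here is an assumption for illustration: the data are made up, and the Levene-type test is coded by hand as a one-way ANOVA on absolute deviations from the group means rather than taken from an add-on package.

```r
# Hypothetical simulated data: 3 balanced groups of 12.
set.seed(6)
group <- factor(rep(c("g1", "g2", "g3"), each = 12))
y     <- rnorm(36, mean = rep(c(10, 20, 30), each = 12), sd = 2)

# Rule of thumb: largest sample SD no more than twice the smallest.
sd_i     <- tapply(y, group, sd)
sd_ratio <- max(sd_i) / min(sd_i)
sd_ratio <= 2

# Levene-type test by hand: ANOVA on |Y_ij - group mean|.
abs_dev  <- abs(y - ave(y, group))        # ave() returns each obs's group mean
levene   <- anova(lm(abs_dev ~ group))
levene_p <- levene[["Pr(>F)"]][1]

# Normal QQ coordinates for all residuals (set plot.it = TRUE to draw the plot).
fit <- lm(y ~ group)
qq  <- qqnorm(resid(fit), plot.it = FALSE)

# Non-parametric alternative if only normality is in doubt.
kw <- kruskal.test(y ~ group)
kw$p.value
```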

Earlier, we mentioned that 1-way ANOVA... Focuses on testing for differences among group means.

Can you get at the differences between means using either effects model coding method (i.e., either constraint)?

Case 1: Baseline coding ($\alpha_m = 0$)
$\mu$ represents the baseline group.
$\alpha_1$ is the distance of group 1 from the baseline group.
$\alpha_2$ is the distance of group 2 from the baseline group.
$\alpha_2 - \alpha_1$ is the distance between groups 1 and 2.

Case 2: Sum-to-zero coding ($\sum \alpha_i = 0$)
$\mu$ represents the overall or grand mean.
$\alpha_1$ is the distance of group 1 from the overall mean.
$\alpha_2$ is the distance of group 2 from the overall mean.
$\alpha_2 - \alpha_1$ is the distance between groups 1 and 2.

The answer is yes.

The interpretation of the parameters depends on the constraint used, but the important results are still the same (p-values, $\hat{Y}_{ij}$ values, etc.).

Because hypothesis tests are built on parameter interpretation, the hypothesis test used to answer a given question does depend on the constraint used.
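As a sketch of this point, the R code below fits both codings to simulated data (illustrative only, with made-up object names) and shows that $\hat\alpha_2 - \hat\alpha_1$ equals the difference between the group 2 and group 1 sample means in both cases, even though the individual coefficients differ.

```r
# Hypothetical simulated data, unbalanced so the two intercepts clearly differ.
set.seed(7)
n_i   <- c(6, 8, 10)
group <- factor(rep(c("g1", "g2", "g3"), times = n_i))
y     <- rnorm(sum(n_i), mean = rep(c(10, 20, 30), times = n_i), sd = 2)
gm    <- as.numeric(tapply(y, group, mean))

fit_base <- lm(y ~ group,
               contrasts = list(group = contr.treatment(3, base = 3)))  # alpha_3 = 0
fit_dev  <- lm(y ~ group,
               contrasts = list(group = contr.sum))                     # sum(alpha) = 0

# Coefficient 2 is alpha_1 and coefficient 3 is alpha_2 in both fits.
diff_base <- unname(coef(fit_base)[3] - coef(fit_base)[2])
diff_dev  <- unname(coef(fit_dev)[3]  - coef(fit_dev)[2])

c(baseline = diff_base, deviation = diff_dev, sample = gm[2] - gm[1])

# The intercepts, by contrast, mean different things under each constraint.
c(baseline = unname(coef(fit_base)[1]), deviation = unname(coef(fit_dev)[1]))
```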
