Chapter 15: ACOVA and Interactions

Analysis of covariance (ACOVA) incorporates one or more regression variables into an analysis of variance. As such, we can think of it as analogous to the two-way ANOVA of Chapter 14, except that instead of having two different factor variables as predictors, we have one factor variable and one continuous variable. The regression variables are referred to as covariates (relative to the dependent variable), hence the name analysis of covariance. Covariates are also known as supplementary or concomitant observations. Cox (1958, chapter 4) gives a particularly nice discussion of the experimental design ideas behind analysis of covariance and illustrates various useful plotting techniques; see also Figure 15.4. In 1957 and 1982, Biometrics devoted entire issues to the analysis of covariance. We begin our discussion with an example that involves a two-group one-way analysis of variance and one covariate. Section 15.3 looks at an example where the covariate can also be viewed as a factor variable, and Section 15.4 uses ACOVA for lack-of-fit testing.

15.1 One covariate example

Fisher (1947) gives data on the body weights (in kilograms) and heart weights (in grams) for domestic cats of both sexes that were given digitalis. A subset of the data is presented in Table 15.1. Our primary interest is to determine whether females' heart weights differ from males' heart weights when both have received digitalis. As a first step, we might fit a one-way ANOVA model with sex groups,

    y_ij = µ_i + ε_ij                              (15.1.1)
         = µ + α_i + ε_ij,

where the y_ij's are the heart weights, i = 1, 2, and j = 1, ..., 24. This model yields the analysis of variance given in Table 15.2.

Table 15.1: Body weights (kg) and heart weights (g) of domestic cats. [Paired Body/Heart columns for females and for males; the numerical entries were not recovered in transcription.]
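As a computational aside (not part of the original text), a minimal Python sketch of fitting the one-way ANOVA model (15.1.1) with statsmodels follows. The data frame cats is a fictitious stand-in for Fisher's cat data, whose values were not recovered here.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Fictitious stand-in for Fisher's cat data (Table 15.1 values not recovered).
cats = pd.DataFrame({
    'sex':   ['F', 'F', 'F', 'M', 'M', 'M'],
    'body':  [2.1, 2.3, 2.4, 2.9, 3.1, 3.4],     # body weight, kg
    'heart': [8.2, 9.0, 9.1, 11.1, 11.8, 12.9],  # heart weight, g
})

# Model (15.1.1): y_ij = mu_i + eps_ij, a one-way ANOVA in sex.
m1 = smf.ols('heart ~ C(sex)', data=cats).fit()
print(anova_lm(m1))   # analogue of Table 15.2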

Table 15.2: One-way analysis of variance on heart weights: Model (15.1.1). [Rows: Sex, Error, Total; columns: Source, df, SS, MS, F, P. Numerical entries not recovered.]

Note the overwhelming effect due to sexes. We now develop a model for both sex and weight that is analogous to the additive model (14.1.3).

Additive regression effects

Fisher provided both heart weights and body weights, so we can ask a more complex question: Is there a sex difference in the heart weights over and above the fact that male cats are naturally larger than female cats? To examine this we add a regression term to model (15.1.1) and fit the traditional analysis of covariance model,

    y_ij = µ_i + γ z_ij + ε_ij                     (15.1.2)
         = µ + α_i + γ z_ij + ε_ij.

Here the z's are the body weights and γ is a slope parameter associated with body weights. For this example the mean model is

    m(sex, z) = µ_1 + γz,  if sex = female,
                µ_2 + γz,  if sex = male.

Model (15.1.2) is a special case of the general additive effects model (9.9.2). It is an extension of the simple linear regression between the y's and the z's in which we allow a different intercept µ_i for each sex but the same slope. In many ways, it is analogous to the two-way additive effects model (14.1.3). In model (15.1.2) the effect of sex on heart weight is the same for any fixed body weight, i.e.,

    (µ_1 + γz) − (µ_2 + γz) = µ_1 − µ_2.

Thus we can talk about µ_1 − µ_2 being the sex effect regardless of body weight. The means for females and males are parallel lines with common slope γ and with µ_1 − µ_2 the distance between the lines.

An analysis of variance table for model (15.1.2) is given as Table 15.3. The interpretation of this table is different from the ANOVA tables examined earlier. For example, the sums of squares for body weights, sex, and error do not add up to the sum of squares total. The sums of squares in Table 15.3 are referred to as adjusted sums of squares (Adj. SS) because the body weight sum of squares is adjusted for sexes and the sex sum of squares is adjusted for body weights.

Table 15.3: Analysis of variance for heart weights: Model (15.1.2). [Rows: Body weights, Sex, Error, Total; columns: Source, df, Adj. SS, MS, F, P. Numerical entries not recovered.]

The error line in Table 15.3 is simply the error from fitting model (15.1.2). The body weights line comes from comparing model (15.1.2) with the reduced model (15.1.1). Note that the only difference between models (15.1.1) and (15.1.2) is that (15.1.1) does not involve the regression on body weights, so by testing the two models we are testing whether there is a significant effect due to the regression on body weights.
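Continuing the illustrative Python sketch above (again, not from the original text), the ACOVA model (15.1.2) and its adjusted sums of squares can be obtained as follows; typ=2 requests sums of squares in which each term is adjusted for the other, matching the Adj. SS interpretation of Table 15.3.

# Model (15.1.2): parallel regression lines with a common body-weight slope.
m2 = smf.ols('heart ~ C(sex) + body', data=cats).fit()
print(anova_lm(m2, typ=2))   # adjusted (Type II) sums of squares, as in Table 15.3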

The standard way of comparing a full and a reduced model is by comparing their error terms. Model (15.1.2) has one more parameter, γ, than model (15.1.1), so there is one more degree of freedom for error in model (15.1.1) than in model (15.1.2), hence one degree of freedom for body weights. The adjusted sum of squares for body weights is the difference between the sum of squares error in model (15.1.1) and the sum of squares error in model (15.1.2). Given the sum of squares and the mean square, the F statistic for body weights is constructed in the usual way and is reported in Table 15.3. We see a major effect due to the regression on body weights.

The Sex line in Table 15.3 provides a test of whether there are differences in sexes after adjusting for the regression on body weights. This comes from comparing model (15.1.2) to a similar model in which sex differences have been eliminated. In model (15.1.2), the sex differences are incorporated as µ_1 and µ_2 in the first version and as α_1 and α_2 in the second version. To eliminate sex differences in model (15.1.2), we simply eliminate the distinctions between the µ's (the α's). Such a model can be written as

    y_ij = µ + γ z_ij + ε_ij.                      (15.1.3)

The analysis of covariance model without treatment effects is just a simple linear regression of heart weight on body weight. We have reduced the two sex parameters to one overall parameter, so the difference in degrees of freedom between model (15.1.3) and model (15.1.2) is 1. The difference in the sums of squares error between model (15.1.3) and model (15.1.2) is the adjusted sum of squares for sex reported in Table 15.3. We see that the evidence for a sex effect over and above the effect due to the regression on body weights is not great.

While ANOVA table Error terms are always the same for equivalent models, the table of coefficients depends on the particular parameterization of the model. I prefer the ACOVA model parameterization

    y_ij = µ_i + γ z_ij + ε_ij.

Some computer programs insist on using the equivalent model

    y_ij = µ + α_i + γ z_ij + ε_ij,                (15.1.4)

which is overparameterized. To get estimates of the parameters in model (15.1.4), one must impose side conditions on them. My choice would be to make µ = 0 and get a model equivalent to the first one. Other common choices of side conditions are: (a) α_1 = 0, (b) α_2 = 0, and (c) α_1 + α_2 = 0. Some programs are flexible enough to let you specify the side conditions yourself. Minitab, for example, uses side conditions (c) and reports a table of coefficients with rows for the Constant, Sex, and Body Wt, giving each estimate, its standard error, a t statistic, and a P value. [The numerical entries of this table were not recovered in transcription.]

Relative to model (15.1.4), the reported parameter estimates are ˆµ = 2.755 together with ˆα_1, ˆα_2 = −ˆα_1, and ˆγ (values not recovered), so the estimated regression line for females is

    E(y) = (2.755 + ˆα_1) + ˆγ z

and for males

    E(y) = (2.755 + ˆα_2) + ˆγ z;

e.g., the predicted values are ŷ_1j = (2.755 + ˆα_1) + ˆγ z_1j for females and ŷ_2j = (2.755 + ˆα_2) + ˆγ z_2j for males.
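As another illustrative sketch (not the book's computation), both adjusted tests of Table 15.3 can be reproduced as full-versus-reduced model comparisons, continuing the objects defined above.

# Reduced models: (15.1.1) drops the covariate, (15.1.3) drops the sex effects.
m3 = smf.ols('heart ~ body', data=cats).fit()   # model (15.1.3)
print(m2.compare_f_test(m1))   # adjusted test for body weights: (F, P, df)
print(m2.compare_f_test(m3))   # adjusted test for sex: (F, P, df)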

Note that the t statistic for sex is the square root of the F statistic for sex in Table 15.3, and the P values are identical. Similarly, the tests for body weights are equivalent. Again, we find clear evidence for the effect of body weights after fitting sexes.

A 95% confidence interval for γ has end points ˆγ ± 2.014(0.5759), which yields the interval (1.6, 4.0). We are 95% confident that, for data comparable to the data in this study, an increase in body weight of one kilogram corresponds to a mean increase in heart weight of between 1.6 g and 4.0 g. (An increase in body weight corresponds to an increase in heart weight; philosophically, we have no reason to believe that intervening to increase body weight by one kilogram would cause an increase in heart weight.)

In model (15.1.2), comparing treatments by comparing the treatment means ȳ_i is inappropriate because of the complicating effect of the covariate. Adjusted means are often used to compare treatments. The adjusted mean for group i is

    ȳ_i − ˆγ(z̄_i − z̄),

and these were reported alongside the raw means for body weights and heart weights in a table with rows Female, Male, and Combined and columns Sex, N, Body, Heart, and Adj. Heart. [The numerical entries were not recovered in transcription.] Note that, under side conditions (c), the difference in adjusted means is ˆα_1 − ˆα_2 = 2ˆα_1. We have seen previously that there is little evidence of a differential effect on heart weights due to sexes after adjusting for body weights. Nonetheless, what evidence exists in the adjusted means suggests that, even after adjusting for body weights, a typical heart weight for males is larger than a typical heart weight for females.

Figures 15.1 through 15.3 contain residual plots for model (15.1.2). The plot of residuals versus predicted values looks exceptionally good. The plot of residuals versus sexes shows slightly less variability for females than for males, but the difference is probably not enough to worry about. The normal plot of the residuals is fine, with W above the appropriate percentile.

[Figure 15.1: Standardized residuals versus fitted values, cat data.]
[Figure 15.2: Standardized residuals versus sex, cat data.]
[Figure 15.3: Normal plot of the standardized residuals, cat data; the W statistic value was not recovered.]

The models that we have fitted form a hierarchy similar to that discussed in Chapter 14. The ACOVA model is larger than both the one-way ANOVA and simple linear regression models, which are not comparable to one another, and both of those are larger than the intercept-only model:

                          ACOVA (15.1.2)
                         /              \
       One-Way ANOVA (15.1.1)      Simple Linear Regression (15.1.3)
                         \              /
                     Intercept Only (14.1.6)

Such a hierarchy leads to two sequential ANOVA tables, displayed in Table 15.4. All of the results in Table 15.3 appear in Table 15.4.

Table 15.4: Analyses of variance for heart weights. [Two sequential (Seq SS) tables: one with rows Body weights, Sex, Error, Total, the other with rows Sex, Body weights, Error, Total; numerical entries not recovered.]
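Continuing the illustrative Python sketch (not the book's output), the adjusted means just described can be computed directly from the fitted ACOVA model m2.

# Adjusted means: ybar_i - gamma_hat * (zbar_i - zbar), from model (15.1.2).
gamma_hat = m2.params['body']
zbar = cats['body'].mean()
means = cats.groupby('sex')[['heart', 'body']].mean()
adj_heart = means['heart'] - gamma_hat * (means['body'] - zbar)
print(adj_heart)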

Interaction models

With these data, there is little reason to assume that, when regressing heart weight on body weight, the linear relationships are the same for females and males. Model (15.1.2) allows different intercepts for these regressions but uses the same slope γ. We should test the assumption of a common slope by fitting the more general model that allows different slopes for females and males, i.e.,

    y_ij = µ_i + γ_i z_ij + ε_ij                   (15.1.5)
         = µ + α_i + γ_i z_ij + ε_ij.

In model (15.1.5) the γ's depend on i, and thus the slopes are allowed to differ between the sexes. While model (15.1.5) may look complicated, it consists of nothing more than fitting a simple linear regression to each group: one to the female data and a separate simple linear regression to the male data.

The means model is

    m(sex, z) = µ_1 + γ_1 z,  if sex = female,
                µ_2 + γ_2 z,  if sex = male.

Figure 15.4 contains some examples of how model (15.1.2) and model (15.1.5) might look when plotted. In model (15.1.2) the lines are always parallel; in model (15.1.5) they can have several appearances.

[Figure 15.4: Patterns of interaction (effect modification) between a continuous predictor x_1 and a binary predictor x_2. Four panels of group mean lines plotted against x_1: one labeled "No Interaction" (parallel lines) and three labeled "Interaction" (nonparallel lines).]

The sum of squares error for model (15.1.5) can be found directly, but it also comes from adding the error sums of squares for the separate female and male simple linear regressions. The female simple linear regression has an error sum of squares on 22 degrees of freedom, and the male regression likewise has an error sum of squares on 22 degrees of freedom; adding them gives the error sum of squares for model (15.1.5) on 22 + 22 = 44 degrees of freedom. [The individual sums of squares were not recovered in transcription.] The mean squared error for model (15.1.5) is MSE(5) = 1.638.

Using results from Table 15.3, the test of model (15.1.5) against the reduced model (15.1.2) has

    F = { [SSE(15.1.2) − SSE(15.1.5)] / (45 − 44) } / 1.638 = 0.126.

The F statistic is very small; there is no evidence that we need to fit different slopes for the two sexes. Fitting model (15.1.5) gives us no reason to question our analysis of model (15.1.2).

The interaction model is easily incorporated into our previous hierarchy of models:

                       Interaction (15.1.5)
                               |
                          ACOVA (15.1.2)
                         /              \
       One-Way ANOVA (15.1.1)      Simple Linear Regression (15.1.3)
                         \              /
                     Intercept Only (14.1.6)

The hierarchy leads to the two ANOVA tables given in Table 15.5. We could also report C_p statistics for all five models relative to the interaction model (15.1.5).
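As an illustrative sketch of this test (not the book's computation), the interaction model and the common-slope comparison look like this in the continuing Python example.

# Model (15.1.5): separate slopes; expands to C(sex) + body + C(sex):body.
m5 = smf.ols('heart ~ C(sex) * body', data=cats).fit()
print(m5.compare_f_test(m2))   # H0: common slope, i.e., model (15.1.2)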

Table 15.5: Analyses of variance for heart weights. [Two sequential (Seq SS) tables: one with rows Body weights, Sex, Sex*Body Wt, Error, Total, the other with rows Sex, Body weights, Sex*Body Wt, Error, Total; numerical entries not recovered.]

The table of coefficients depends on the particular parameterization of a model. I prefer the interaction model parameterization

    y_ij = µ_i + γ_i z_ij + ε_ij,

in which all of the parameters are uniquely defined. Some computer programs insist on using the equivalent model

    y_ij = µ + α_i + β z_ij + γ_i z_ij + ε_ij,     (15.1.6)

which is overparameterized. To get estimates of the parameters, one must impose side conditions on them. My choice would be to make µ = 0 = β and get a model equivalent to the first one. Other common choices of side conditions are: (a) α_1 = 0 = γ_1, (b) α_2 = 0 = γ_2, and (c) α_1 + α_2 = 0 = γ_1 + γ_2. Some programs are flexible enough to let you specify the model yourself. Minitab, for example, uses side conditions (c) and reports a table of coefficients with rows for the Constant, Sex, Body Wt, and Body Wt*Sex. [The numerical entries of this table were not recovered in transcription.]

Relative to model (15.1.6), the reported parameter estimates are ˆµ = 2.789, ˆα_1 = 0.142, ˆα_2 = −0.142, with ˆβ, ˆγ_1, and ˆγ_2 not recovered, so the estimated regression line for females is

    E(y) = (2.789 + 0.142) + (ˆβ + ˆγ_1)z = 2.931 + (ˆβ + ˆγ_1)z

and for males

    E(y) = (2.789 − 0.142) + (ˆβ + ˆγ_2)z = 2.647 + (ˆβ + ˆγ_2)z;

i.e., the fitted values are ŷ_1j = 2.931 + (ˆβ + ˆγ_1)z_1j for females and ŷ_2j = 2.647 + (ˆβ + ˆγ_2)z_2j for males.
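In the continuing Python sketch (illustrative only, not the book's software), the sum-to-zero side conditions of choice (c) can be requested with patsy's Sum contrast coding.

# Side conditions (c): sum-to-zero coding for the sex effects and the slopes.
m6 = smf.ols('heart ~ C(sex, Sum) * body', data=cats).fit()
print(m6.params)   # analogues of the Constant, Sex, Body Wt, Body Wt*Sex rows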

Multiple covariates

In our cat example we had one covariate, but it would be very easy to extend model (15.1.2) to include more covariates. For example, with three covariates x_1, x_2, x_3, the ACOVA model becomes

    y_ij = µ_i + γ_1 x_ij1 + γ_2 x_ij2 + γ_3 x_ij3 + ε_ij.

We could even apply this idea to the cat example by considering a polynomial model. Incorporating into model (15.1.2) a cubic polynomial for the one predictor z gives

    y_ij = µ_i + γ_1 z_ij + γ_2 z_ij² + γ_3 z_ij³ + ε_ij.

The key point is that ACOVA models are additive effects models because none of the γ parameters depend on sex (i). If we have three covariates x_1, x_2, x_3, an ACOVA model has

    y_ij = µ_i + h(x_ij1, x_ij2, x_ij3) + ε_ij

for some function h(·). In this case µ_1 − µ_2 is the differential effect for the two groups regardless of the covariate values. One possible interaction model allows completely different regression functions for each group,

    y_ij = µ_i + γ_i1 x_ij1 + γ_i2 x_ij2 + γ_i3 x_ij3 + ε_ij.

Here we allow the slope parameters to depend on i. For the cat example we might consider separate cubic polynomials for each sex, i.e.,

    y_ij = µ_i + γ_i1 z_ij + γ_i2 z_ij² + γ_i3 z_ij³ + ε_ij.

Minitab commands

The following Minitab commands were used to generate the analysis of these data. The means given by the ancova subcommand means are the adjusted treatment means.

MTB > names c1 body c2 heart c3 sex
MTB > note Fit model (15.1.1).
MTB > oneway c2 c3
MTB > note Fit model (15.1.2).
MTB > ancova c2 = c3;
SUBC> covar c1;
SUBC> resid c10;
SUBC> fits c11;
SUBC> means c3.
MTB > plot c10 c11
MTB > plot c10 c3
MTB > note Split the data into females and males and
MTB > note perform two regressions to fit model (15.1.5).
MTB > copy c1 c2 to c11 c12;
SUBC> use c3=1.
MTB > regress c12 on 1 c11
MTB > copy c1 c2 to c21 c22;
SUBC> use c3=2.
MTB > regress c22 on 1 c21
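A rough modern analogue of the last part of that session (an illustrative sketch, not from the text): fit model (15.1.5) as a separate simple linear regression within each sex, continuing the cats frame from the earlier sketches.

# Separate simple linear regressions by sex; summing the two SSEs gives
# the error sum of squares for the interaction model (15.1.5).
for sex, grp in cats.groupby('sex'):
    fit = smf.ols('heart ~ body', data=grp).fit()
    print(sex, dict(fit.params), 'SSE:', fit.ssr, 'dfE:', fit.df_resid)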

15.2 Regression modeling

Consider again the ACOVA model (15.1.2) based on the factor variable sex (i) and the measurement variable body weight (z). To make life more interesting, let's consider a third sex category, say, herm (for hermaphrodite). If we create indicator variables for each of our three categories, say, x_1, x_2, x_3, we can rewrite both the one-way ANOVA model (15.1.1) and model (15.1.2) as linear models. (The SLR model (15.1.3) is already in linear model form.) The first form for the means of model (15.1.1) becomes a no-intercept multiple regression model,

    m(x_1, x_2, x_3) = µ_1 x_1 + µ_2 x_2 + µ_3 x_3,              (15.2.1)

which takes the value µ_1 for female, µ_2 for male, and µ_3 for herm. The second form for the means is the overparameterized model

    m(x_1, x_2, x_3) = µ + α_1 x_1 + α_2 x_2 + α_3 x_3,          (15.2.2)

which takes the value (µ + α_1) for female, (µ + α_2) for male, and (µ + α_3) for herm. The first form for the means of model (15.1.2) is the parallel lines regression model

    m(x_1, x_2, x_3, z) = µ_1 x_1 + µ_2 x_2 + µ_3 x_3 + γz,      (15.2.3)

giving µ_1 + γz for female, µ_2 + γz for male, and µ_3 + γz for herm, and the second form is the overparameterized parallel lines model

    m(x_1, x_2, x_3, z) = µ + α_1 x_1 + α_2 x_2 + α_3 x_3 + γz,  (15.2.4)

giving (µ + α_1) + γz for female, (µ + α_2) + γz for male, and (µ + α_3) + γz for herm. Similarly, we could have parallel polynomials. For quadratics that would be

    m(x_1, x_2, x_3, z) = µ_1 x_1 + µ_2 x_2 + µ_3 x_3 + γ_1 z + γ_2 z²,

giving µ_i + γ_1 z + γ_2 z² for group i, wherein only the intercepts are different. The interaction model (15.1.5) gives separate lines for each group and can be written as

    m(x_1, x_2, x_3, z) = µ_1 x_1 + µ_2 x_2 + µ_3 x_3 + γ_1 z x_1 + γ_2 z x_2 + γ_3 z x_3,

giving µ_1 + γ_1 z for female, µ_2 + γ_2 z for male, and µ_3 + γ_3 z for herm, and the second form is the overparameterized model

    m(x_1, x_2, x_3, z) = µ + α_1 x_1 + α_2 x_2 + α_3 x_3 + βz + γ_1 z x_1 + γ_2 z x_2 + γ_3 z x_3,

giving (µ + α_1) + (β + γ_1)z for female, (µ + α_2) + (β + γ_2)z for male, and (µ + α_3) + (β + γ_3)z for herm. Every sex category has a completely separate line with different slopes and intercepts.
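A small illustrative sketch (not from the text) of building the indicator variables x_1, x_2, x_3 in Python; pd.get_dummies is one way to do it.

import pandas as pd

sexes = pd.Series(['female', 'male', 'herm', 'female', 'male'])
X = pd.get_dummies(sexes).astype(int)  # one 0/1 indicator column per category
X.insert(0, 'intercept', 1)            # the constant column used by (15.2.2)
print(X)
# Regressing y on the three indicators without the intercept fits (15.2.1);
# keeping the intercept and dropping one indicator makes that category the
# baseline, as in (15.2.5).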

Interaction parabolas would be completely separate parabolas for each group,

    m(x_1, x_2, x_3, z) = µ_1 x_1 + µ_2 x_2 + µ_3 x_3 + γ_11 z x_1 + γ_21 z² x_1
                          + γ_12 z x_2 + γ_22 z² x_2 + γ_13 z x_3 + γ_23 z² x_3,

giving µ_1 + γ_11 z + γ_21 z² for female, µ_2 + γ_12 z + γ_22 z² for male, and µ_3 + γ_13 z + γ_23 z² for herm.

Using overparameterized models

As discussed in Chapter 12, model (15.2.2) can be made into a regression model by dropping any one of the predictor variables, say x_1,

    m(x_1, x_2, x_3) = µ + α_2 x_2 + α_3 x_3,                    (15.2.5)

which takes the value µ for female, (µ + α_2) for male, and (µ + α_3) for herm. Using an intercept and indicators x_2 and x_3 for male and herm makes female the baseline category. Similarly, if we fit the ACOVA model (15.2.4) but drop out x_1, we get parallel lines,

    m(x_1, x_2, x_3, z) = µ + α_2 x_2 + α_3 x_3 + γz,            (15.2.6)

giving µ + γz for female, (µ + α_2) + γz for male, and (µ + α_3) + γz for herm.

If, in the one-way ANOVA, we thought that males and females had the same mean, we could drop both x_1 and x_2 from model (15.2.2) to get

    m(x_1, x_2, x_3) = µ + α_3 x_3,

which is µ for female or male and µ + α_3 for herm. If we thought that males and herms had the same mean, since neither male nor herm is the baseline, we could replace x_2 and x_3 with a new variable x = x_2 + x_3 that indicates membership in either group, to get

    m(x_1, x_2, x_3) = µ + α x,

which is µ for female and µ + α for male or herm. We could equally well fit the model

    m(x_1, x_2, x_3) = µ_1 x_1 + µ_3 x,

which is µ_1 for female and µ_3 for male or herm.

In these cases, the analysis of covariance model (15.2.4) behaves similarly. For example, without both x_1 and x_2, model (15.2.4) becomes

    m(x_1, x_2, x_3, z) = µ + α_3 x_3 + γz,                      (15.2.7)

which involves only two parallel lines: one, µ + γz, that applies to both females and males, and another, (µ + α_3) + γz, for herms.
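Continuing the indicator sketch above (illustrative only), imposing µ_male = µ_herm amounts to replacing x_2 and x_3 by their sum.

# Replace the male and herm indicators with x = x2 + x3, which indicates
# membership in either group, as in the text.
X2 = X.drop(columns=['intercept']).copy()
X2['male_or_herm'] = X2['male'] + X2['herm']
X2 = X2.drop(columns=['male', 'herm'])
print(X2)   # columns female and male_or_herm: the design for m = mu_1*x1 + mu_3*x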

Dropping both x_1 and x_2 from model (15.2.2) gives very different results than dropping the intercept and x_2 from model (15.2.2). That statement may seem obvious, but since dropping x_1 alone does not actually affect how the model fits the data, it might be tempting to think that further dropping x_2 would have the same effect after dropping x_1 as dropping x_2 has in model (15.2.1). We have already examined dropping both x_1 and x_2 from model (15.2.2); now consider dropping both the intercept and x_2 from model (15.2.2), i.e., dropping x_2 from model (15.2.1). The model becomes

    m(x) = µ_1 x_1 + µ_3 x_3,

which is µ_1 for female, 0 for male, and µ_3 for herm. This occurs because all of the predictor variables in the model take the value 0 for male. If we incorporate the covariate z into this model we get

    m(x) = µ_1 x_1 + µ_3 x_3 + γz,

which gives three parallel lines (µ_1 + γz for female, 0 + γz for male, and µ_3 + γz for herm), but with the male intercept forced to 0.

15.3 ACOVA and two-way ANOVA

The material in Section 15.1 is sufficiently complex to warrant another example. This time we use a covariate that also defines a grouping variable and explore the relationships between fitting an ACOVA and fitting a two-way ANOVA.

EXAMPLE: Hopper data.
The data in Table 15.6 were provided by Schneider and Pruett (1994). They were interested in whether the measurement system for the weight of railroad hopper cars was under control. A standard hopper car weighing about 266,000 pounds was used to obtain the first three weighings of the day on each of 20 days. The process was to move the car onto the scales, weigh the car, move the car off, move the car on, weigh the car, move it off, move it on, and weigh it a third time. The tabled values are the weight of the car minus 260,000.

Table 15.6: Multiple weighings of a hopper car. [Columns: Day, First, Second, Third, arranged in two blocks; numerical entries not recovered.]

As we did with the cat data, the first thing we might do is treat the three repeat observations as replications and do a one-way ANOVA on the days,

    y_ij = µ_i + ε_ij,    i = 1, ..., 20,  j = 1, 2, 3.

Summary statistics are given in Table 15.7 and the ANOVA table follows.

Table 15.7: Summary statistics for hopper data. [Columns: Day, N, Mean, StDev, in two blocks; numerical entries not recovered.]

Analysis of Variance [one-way ANOVA on days; rows: Day, Error, Total; columns: Source, df, SS, MS, F, P; numerical entries not recovered.]

Obviously, there are differences in days.

Additive effects

The three repeat observations on the hopper could be subject to trends, so treat the three observations as measurements at times 1, 2, 3. Time now serves as a covariate z. With three distinct covariate values, we could fit a parabola,

    y_ij = µ_i + γ_1 z_ij + γ_2 z_ij² + ε_ij,    i = 1, ..., 20,  j = 1, 2, 3.

The software I used actually fits

    y_ij = µ + α_i + γ_1 z_ij + γ_2 z_ij² + ε_ij,    i = 1, ..., 20,  j = 1, 2, 3,

with the additional constraint that α_1 + ··· + α_20 = 0, so that ˆα_20 = −(ˆα_1 + ··· + ˆα_19). The output then presents only ˆα_1, ..., ˆα_19.
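An illustrative Python sketch of this fit (the weights are fictitious, since Table 15.6 was not recovered; the structure of 20 days with three timed weighings follows the text):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
hopper = pd.DataFrame({
    'day':  np.repeat(np.arange(1, 21), 3),   # 20 days
    'time': np.tile([1, 2, 3], 20),           # 3 weighings per day
})
hopper['weight'] = 6000 + rng.normal(0, 40, size=60)  # fictitious weights

# Day effects plus linear and quadratic time terms: the quadratic ACOVA model.
quad = smf.ols('weight ~ C(day) + time + I(time**2)', data=hopper).fit()
print(anova_lm(quad))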

Table of Coefficients [rows: Constant, z, z², and the day effects; columns: Predictor, estimate, standard error, t, P; numerical entries not recovered except as noted below.]

The table of coefficients is ugly, especially because there are so many days, but the main point is that the z² term is not significant (P = 0.145). The corresponding ANOVA table is a little strange; the only really important thing is that it gives the Error line. There is also some interest in the fact that the F statistic reported for z² is the square of the t statistic, with identical P values.

Analysis of Variance [rows: z, Day, z², Error, Total; columns: Source, df, SS, MS, F, P; numerical entries not recovered.]

Similar to Section 12.5, instead of fitting a maximal polynomial (we only have three times, so we can fit at most a quadratic in time), we could alternatively treat z as a factor variable and do a two-way ANOVA as in Chapter 14, i.e., fit

    y_ij = µ + α_i + η_j + ε_ij,    i = 1, ..., 20,  j = 1, 2, 3.

The quadratic ACOVA model is equivalent to this two-way ANOVA model, so the two-way ANOVA model should have an equivalent ANOVA table.

Analysis of Variance [rows: Day, Time, Error, Total; columns: Source, df, SS, MS, F, P; numerical entries not recovered.]
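Continuing the sketch, the equivalence is easy to verify numerically: a quadratic saturates the three time points, so the two models have identical Error lines.

# Time as a factor: the two-way ANOVA of Chapter 14.
twoway = smf.ols('weight ~ C(day) + C(time)', data=hopper).fit()
print(quad.ssr, twoway.ssr)   # identical error sums of squares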

This has the same Error line as the quadratic ACOVA model. With a nonsignificant z² term in the quadratic model, it makes sense to check whether we need the linear term in z. The model is

    y_ij = µ_i + γ_1 z_ij + ε_ij,    i = 1, ..., 20,  j = 1, 2, 3,

or

    y_ij = µ + α_i + γ_1 z_ij + ε_ij,    i = 1, ..., 20,  j = 1, 2, 3,

subject to the constraint that α_1 + ··· + α_20 = 0.

Table of Coefficients [rows: Constant, Time, and the day effects; numerical entries not recovered.]

We find no evidence that we need the linear term (P = 0.655). For completeness, an ANOVA table is

Analysis of Variance [rows: z, Day, Error, Total; columns: Source, df, SS, MS, F, P; numerical entries not recovered.]

It might be tempting to worry about interaction in this model. Resist the temptation! First, there are not enough observations for us to fit a full interaction model and still estimate σ². If we fit separate quadratics for each day, we would have 60 mean parameters and 60 observations, so zero degrees of freedom for error. Exactly the same thing would happen if we fit a standard interaction model from Chapter 14. But more importantly, it just makes sense to think of interaction as error for these data. What does it mean for there to be a time trend in these data? Surely we have no interest in time trends that go up one day and down another day without any rhyme or reason. For a time trend to be meaningful, it needs to be something that we can spot on a consistent basis. It has to be something strong enough that we can see it over and above the natural day-to-day variation of the weighing process.

Well, the natural day-to-day variation of the weighing process is precisely the Day by Time interaction, so the interaction is precisely what we want to be using as our error term. In the model

    y_ij = µ + α_i + γ_1 z_ij + γ_2 z_ij² + ε_ij,

changes that are inconsistent across days and times, that is, terms that depend on both i and j, are what we want to use as error. (An exception to this claim is if, say, we noticed that time trends go up one day, down the next, then up again, etc. That is a form of interaction that we could be interested in, but its existence requires additional structure for the Days because it involves modeling effects for alternate days.)

15.4 Near replicate lack-of-fit tests

In Section 8.5 we discussed Fisher's lack-of-fit test. Fisher's test is based on there being duplicate cases among the predictor variables. Often there are few or none of these. Near replicate lack-of-fit tests were designed to ameliorate that problem by clustering together cases that are nearly replicates of one another. With the Hooker data, Fisher's lack-of-fit test suffers from few degrees of freedom for pure error. Table 15.8 contains a list of near replicates, obtained by grouping together cases whose temperatures were within 0.5 degrees F of one another.

Table 15.8: Hooker data. [Columns: Case, Temperature, Pressure, Near Rep., in two blocks; numerical entries not recovered.]

We then construct an F test by fitting three models. First, reindex the observations y_i, i = 1, ..., 31, as y_jk with j = 1, ..., 19 identifying the near replicate groups and k = 1, ..., N_j identifying observations within the near replicate group. Thus the simple linear regression model y_i = β_0 + β_1 x_i + ε_i can be rewritten as

    y_jk = β_0 + β_1 x_jk + ε_jk.

The first of the three models in question is the simple linear regression performed on the near replicate cluster means x̄_j,

    y_jk = β_0 + β_1 x̄_j + ε_jk.                   (15.4.1)

This is sometimes called the artificial means model because it is a regression on the near replicate cluster means x̄_j, but the clusters are artificially constructed. The second model is a one-way analysis of variance model with groups defined by the near replicate clusters,

    y_jk = µ_j + ε_jk.                              (15.4.2)

As a regression model, define the predictor variables δ_hj for h = 1, ..., 19, which equal 1 if h = j and 0 otherwise. Then model (15.4.2) can be rewritten as a multiple regression model through the origin,

    y_jk = µ_1 δ_1j + µ_2 δ_2j + ··· + µ_19 δ_19,j + ε_jk.

The last model is called an analysis of covariance model because it incorporates the original predictor (covariate) x_jk into the analysis of variance model (15.4.2). The model is

    y_jk = µ_j + β_1 x_jk + ε_jk,                   (15.4.3)

which can alternatively be written as a regression,

    y_jk = µ_1 δ_1j + µ_2 δ_2j + ··· + µ_19 δ_19,j + β_1 x_jk + ε_jk.

Fitting these three models gives the following:

Analysis of variance: artificial means model (15.4.1) [rows: Regression, Error, Total; numerical entries not recovered, except that the Error line has 29 degrees of freedom.]

Analysis of variance on near replicate groups (15.4.2) [rows: Near Reps, Error, Total; the Error line has 12 degrees of freedom.]

Analysis of covariance (15.4.3) [rows: x, Near Reps, Error, Total; the Error line has 11 degrees of freedom and mean squared error 0.027.]

The lack-of-fit test uses the difference in the sums of squares error for the first two models in the numerator of the test and the mean squared error for the analysis of covariance model in the denominator of the test. The lack-of-fit test statistic is

    F = { [SSE(15.4.1) − SSE(15.4.2)] / (29 − 12) } / 0.027 = 7.4.

This can be compared to an F(17, 11) distribution, which yields a P value of 0.001. This procedure is known as Shillington's test, cf. Christensen (2011).
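An illustrative Python sketch of Shillington's test follows (the data below are fictitious stand-ins for the Hooker data, constructed so that the three error degrees of freedom, 29, 12, and 11, match the text):

import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

# 19 near-replicate clusters covering 31 cases: 12 pairs plus 7 singletons.
rng = np.random.default_rng(1)
sizes = [2] * 12 + [1] * 7
grp = np.repeat(np.arange(19), sizes)
centers = np.linspace(180, 212, 19)                  # cluster temperatures
x = centers[grp] + rng.uniform(-0.25, 0.25, 31)      # cases within 0.5 F
y = -64 + 0.44 * x + rng.normal(0, 0.3, 31)          # fictitious pressures
df = pd.DataFrame({'x': x, 'y': y, 'grp': grp})
df['xbar'] = df.groupby('grp')['x'].transform('mean')

m_art = smf.ols('y ~ xbar', data=df).fit()           # (15.4.1): 29 dfE
m_grp = smf.ols('y ~ C(grp) - 1', data=df).fit()     # (15.4.2): 12 dfE
m_acv = smf.ols('y ~ C(grp) + x', data=df).fit()     # (15.4.3): 11 dfE

num_df = m_art.df_resid - m_grp.df_resid             # 29 - 12 = 17
F = (m_art.ssr - m_grp.ssr) / num_df / m_acv.mse_resid
print(F, stats.f.sf(F, num_df, m_acv.df_resid))      # compare to F(17, 11)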

15.5 Exercises

EXERCISE. Table 15.9 contains data from Sulzberger (1953) and Williams (1959) on y, the maximum compressive strength parallel to the grain of wood from ten hoop pine trees. The data also include the temperature of the evaluation and a covariate z, the moisture content of the wood. Analyze the data. Examine (tabled) polynomial contrasts in the temperatures.

Table 15.9: Compressive strength of hoop pine trees (y) with moisture contents (z). [Columns: Tree, then a (z, y) pair for each of the temperatures −20°C, 0°C, 20°C, 40°C, 60°C; numerical entries not recovered.]

EXERCISE. Smith, Gnanadesikan, and Hughes (1962) gave data on urine characteristics of young men. The men were divided into four categories based on obesity. The data contain a covariate z that measures specific gravity. The dependent variable is y_1; it measures pigment creatinine. These variables are included in Table 15.10. Perform an analysis of covariance on y_1. How do the conclusions about obesity effects change between the ACOVA and the results of the ANOVA that ignores the covariate?

Table 15.10: Excretory characteristics. [Columns z, y_1, y_2 for each of Groups I through IV; numerical entries not recovered.]

EXERCISE. Smith, Gnanadesikan, and Hughes (1962) also give data on the variable y_2, which measures chloride in the urine of young men. These data are also reported in Table 15.10. As in the previous problem, the men were divided into four categories based on obesity. Perform an analysis of covariance on y_2, again using specific gravity as the covariate z.

Compare the results of the ACOVA to the results of the ANOVA that ignores the covariate.

EXERCISE. Test the need for a power transformation in each of the following problems from the previous chapter. Use all three constructed variables on each data set and compare results. Parts (a) through (g) each refer to an exercise from Chapter 14. [The specific exercise numbers were not recovered in transcription.]

EXERCISE. Consider the analysis of covariance for a completely randomized design with one covariate. Find the form for a 99% prediction interval for an observation, say, from the first treatment group with a given covariate value z.

EXERCISE. Assuming that in model (15.3.1) Cov(ȳ_i, ˆγ) = 0, show that

    Var( Σ_{i=1}^{a} λ_i (ȳ_i − z̄_i ˆγ) ) = σ² [ (Σ_{i=1}^{a} λ_i²)/b + (Σ_{i=1}^{a} λ_i z̄_i)² / SSE_zz ].


O2. The following printout concerns a best subsets regression. Questions follow. STAT-UB.0103 Exam 01.APIL.11 OVAL Version Solutions O1. Frank Tanner is the lab manager at BioVigor, a firm that runs studies for agricultural food supplements. He has been asked to design a protocol for

More information

R 2 and F -Tests and ANOVA

R 2 and F -Tests and ANOVA R 2 and F -Tests and ANOVA December 6, 2018 1 Partition of Sums of Squares The distance from any point y i in a collection of data, to the mean of the data ȳ, is the deviation, written as y i ȳ. Definition.

More information

Correlation and Regression

Correlation and Regression Correlation and Regression Dr. Bob Gee Dean Scott Bonney Professor William G. Journigan American Meridian University 1 Learning Objectives Upon successful completion of this module, the student should

More information

Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression

Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression BSTT523: Kutner et al., Chapter 1 1 Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression Introduction: Functional relation between

More information

Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras

Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras Lecture - 39 Regression Analysis Hello and welcome to the course on Biostatistics

More information

STA441: Spring Multiple Regression. This slide show is a free open source document. See the last slide for copyright information.

STA441: Spring Multiple Regression. This slide show is a free open source document. See the last slide for copyright information. STA441: Spring 2018 Multiple Regression This slide show is a free open source document. See the last slide for copyright information. 1 Least Squares Plane 2 Statistical MODEL There are p-1 explanatory

More information

STAT 213 Interactions in Two-Way ANOVA

STAT 213 Interactions in Two-Way ANOVA STAT 213 Interactions in Two-Way ANOVA Colin Reimer Dawson Oberlin College 14 April 2016 Outline Last Time: Two-Way ANOVA Interaction Terms Reading Quiz (Multiple Choice) If there is no interaction present,

More information

6. Multiple regression - PROC GLM

6. Multiple regression - PROC GLM Use of SAS - November 2016 6. Multiple regression - PROC GLM Karl Bang Christensen Department of Biostatistics, University of Copenhagen. http://biostat.ku.dk/~kach/sas2016/ kach@biostat.ku.dk, tel: 35327491

More information

SMA 6304 / MIT / MIT Manufacturing Systems. Lecture 10: Data and Regression Analysis. Lecturer: Prof. Duane S. Boning

SMA 6304 / MIT / MIT Manufacturing Systems. Lecture 10: Data and Regression Analysis. Lecturer: Prof. Duane S. Boning SMA 6304 / MIT 2.853 / MIT 2.854 Manufacturing Systems Lecture 10: Data and Regression Analysis Lecturer: Prof. Duane S. Boning 1 Agenda 1. Comparison of Treatments (One Variable) Analysis of Variance

More information

Contents. TAMS38 - Lecture 10 Response surface. Lecturer: Jolanta Pielaszkiewicz. Response surface 3. Response surface, cont. 4

Contents. TAMS38 - Lecture 10 Response surface. Lecturer: Jolanta Pielaszkiewicz. Response surface 3. Response surface, cont. 4 Contents TAMS38 - Lecture 10 Response surface Lecturer: Jolanta Pielaszkiewicz Matematisk statistik - Matematiska institutionen Linköpings universitet Look beneath the surface; let not the several quality

More information

Department of Mathematics & Statistics STAT 2593 Final Examination 17 April, 2000

Department of Mathematics & Statistics STAT 2593 Final Examination 17 April, 2000 Department of Mathematics & Statistics STAT 2593 Final Examination 17 April, 2000 TIME: 3 hours. Total marks: 80. (Marks are indicated in margin.) Remember that estimate means to give an interval estimate.

More information

Interpreting the coefficients

Interpreting the coefficients Lecture Week 5 Multiple Linear Regression Interpreting the coefficients Uses of Multiple Regression Predict for specified new x-vars Predict in time. Focus on one parameter Use regression to adjust variation

More information

Multiple Regression. Inference for Multiple Regression and A Case Study. IPS Chapters 11.1 and W.H. Freeman and Company

Multiple Regression. Inference for Multiple Regression and A Case Study. IPS Chapters 11.1 and W.H. Freeman and Company Multiple Regression Inference for Multiple Regression and A Case Study IPS Chapters 11.1 and 11.2 2009 W.H. Freeman and Company Objectives (IPS Chapters 11.1 and 11.2) Multiple regression Data for multiple

More information

PubH 7405: REGRESSION ANALYSIS. MLR: INFERENCES, Part I

PubH 7405: REGRESSION ANALYSIS. MLR: INFERENCES, Part I PubH 7405: REGRESSION ANALYSIS MLR: INFERENCES, Part I TESTING HYPOTHESES Once we have fitted a multiple linear regression model and obtained estimates for the various parameters of interest, we want to

More information

INFERENCE FOR REGRESSION

INFERENCE FOR REGRESSION CHAPTER 3 INFERENCE FOR REGRESSION OVERVIEW In Chapter 5 of the textbook, we first encountered regression. The assumptions that describe the regression model we use in this chapter are the following. We

More information