Linear Model Specification in R

How to deal with overparameterisation?

Paul Janssen (1), Luc Duchateau (2)
(1) Center for Statistics, Hasselt University, Belgium
(2) Faculty of Veterinary Medicine, Ghent University, Belgium

P. Janssen & L. Duchateau (UH & UG)

1. The data set

1.1. The data set of exam preparation

    DoseP    DoseK   Yield
    low      low     21
    low      low     23
    medium   low     19
    medium   low     24
    high     low     24
    high     low     21
    low      high    29
    low      high    31
    medium   high    35
    medium   high    36
    high     high    41
    high     high    40

1.2. The extended data set

    DosePnum   DoseP    DoseK   Yield
    1          low      low     21
    1          low      low     23
    2          medium   low     19
    2          medium   low     24
    3          high     low     24
    3          high     low     21
    1          low      high    29
    1          low      high    31
    2          medium   high    35
    2          medium   high    36
    3          high     high    41
    3          high     high    40
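For readers without access to tomatopk.txt, the extended data set can be rebuilt directly in R. The sketch below assumes the column names shown in the table above and lets R order the factor levels alphabetically (its default); it is an illustration, not the authors' original import code.

```r
# Rebuild the extended data set by hand (values copied from the table above).
tomatopk <- data.frame(
  DosePnum = rep(c(1, 1, 2, 2, 3, 3), times = 2),
  DoseP    = factor(rep(c("low", "low", "medium", "medium", "high", "high"),
                        times = 2)),
  DoseK    = factor(rep(c("low", "high"), each = 6)),
  Yield    = c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)
)

str(tomatopk)                          # 12 observations of 4 variables
levels(tomatopk$DoseP)                 # alphabetical: "high" "low" "medium"
table(tomatopk$DoseP, tomatopk$DoseK)  # balanced: 2 observations per cell
```

The alphabetical level order matters later on: it determines which level R treats as the first one when contrasts are built.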

2. The linear regression model

2.1. Model specification

The model is given by

$$Y_i = \beta_0 + \beta_1\,\mathrm{dosepnum}_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$$

There is no overparameterisation in this model: both the intercept $\beta_0$ and the slope $\beta_1$ have a clear meaning.

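2.2. Model matrix

$$
\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \\ Y_7 \\ Y_8 \\ Y_9 \\ Y_{10} \\ Y_{11} \\ Y_{12} \end{bmatrix}
=
\begin{bmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 2 \\ 1 & 2 \\ 1 & 3 \\ 1 & 3 \\ 1 & 1 \\ 1 & 1 \\ 1 & 2 \\ 1 & 2 \\ 1 & 3 \\ 1 & 3 \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}
$$

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim MVN(0, \sigma^2 I_{12})$$

X is full rank.

Think of a data structure for which the linear regression model would become overparameterised.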

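2.3. Overparameterised linear regression model

Assume we have used only one dose, for instance the medium dose (dosepnum = 2 for every observation):

$$
\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \\ Y_7 \\ Y_8 \\ Y_9 \\ Y_{10} \\ Y_{11} \\ Y_{12} \end{bmatrix}
=
\begin{bmatrix} 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}
$$

X is no longer full rank: the second column is twice the first, so $\beta_0$ and $\beta_1$ cannot be estimated separately.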
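Whether a model matrix is of full rank can be checked numerically. The sketch below is an illustration rather than part of the original slides: it rebuilds the two design matrices by hand (the names `X_full`, `X_const` and `dose_const` are made up for this example) and compares their rank, obtained with base R's `qr()`, with their number of columns.

```r
# Dose scores and yields in the order of the extended data set.
dosepnum <- rep(c(1, 1, 2, 2, 3, 3), times = 2)
Yield    <- c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)

# Three distinct doses: intercept and slope columns are linearly independent.
X_full    <- model.matrix(~ dosepnum)
rank_full <- qr(X_full)$rank           # equals ncol(X_full)

# Hypothetical experiment with the medium dose only: the dose column is
# 2 times the intercept column, so one column is redundant.
dose_const <- rep(2, 12)
X_const    <- cbind(intercept = 1, dose = dose_const)
rank_const <- qr(X_const)$rank         # smaller than ncol(X_const)

# lm() signals the redundancy by reporting NA for the aliased slope.
fit_const <- lm(Yield ~ dose_const)
coef(fit_const)
```

A rank smaller than the number of columns is exactly what "overparameterised" means in terms of the model matrix.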

2.4. Model specification in R

    # set the working directory and read the data (first row holds the column names)
    setwd("c:/users/lduchate/docs/oc/onderwijs/adekus/part3basicprinciples")
    tomatopk <- read.table('tomatopk.txt', header = TRUE)
    # simple linear regression of yield on the numeric P dose
    linres.tomatopk <- lm(Yield ~ DosePnum, data = tomatopk)
    summary(linres.tomatopk)

    Call:
    lm(formula = Yield ~ DosePnum, data = tomatopk)

    Residuals:
         Min       1Q   Median       3Q      Max
    -10.4167  -5.5417   0.0833   6.5833   9.5833

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   23.167      5.946   3.896  0.00298 **
    DosePnum       2.750      2.753   0.999  0.34134
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 7.786 on 10 degrees of freedom
    Multiple R-squared:  0.09075,  Adjusted R-squared:  -0.000175
    F-statistic: 0.9981 on 1 and 10 DF,  p-value: 0.3413

3. One-Way Analysis of Variance

We consider 3 different models:
- The cell means model (not overparameterised)
- The factor effects model (overparameterised), fitted under
  - the treatment restriction
  - the sum restriction

3.1. The cell means model

Consider the effect of P dose as a categorical variable.

The model is given by

$$Y_{ij} = \mu_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$$

with $\mu_1$, $\mu_2$ and $\mu_3$ the population mean yield at the low, medium and high P dose, respectively.

There is no overparameterisation in this model: the population means have a clear meaning.

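3.1.1. Model matrix

$$
\begin{bmatrix} Y_{11} \\ Y_{12} \\ Y_{21} \\ Y_{22} \\ Y_{31} \\ Y_{32} \\ Y_{13} \\ Y_{14} \\ Y_{23} \\ Y_{24} \\ Y_{33} \\ Y_{34} \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix}
+
\begin{bmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{21} \\ \varepsilon_{22} \\ \varepsilon_{31} \\ \varepsilon_{32} \\ \varepsilon_{13} \\ \varepsilon_{14} \\ \varepsilon_{23} \\ \varepsilon_{24} \\ \varepsilon_{33} \\ \varepsilon_{34} \end{bmatrix}
$$

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim MVN(0, \sigma^2 I_{12})$$

X is full rank.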

3.1.2. Model specification in R

    # "- 1" removes the intercept, so each coefficient is a cell mean
    onewaycellm.tomatopk <- lm(Yield ~ DoseP - 1, data = tomatopk)
    summary(onewaycellm.tomatopk)

    Call:
    lm(formula = Yield ~ DoseP - 1, data = tomatopk)

    Residuals:
        Min      1Q  Median      3Q     Max
    -10.500  -5.625   0.000   6.750   9.500

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    DosePhigh     31.500      4.103   7.678 3.07e-05 ***
    DosePlow      26.000      4.103   6.337 0.000135 ***
    DosePmedium   28.500      4.103   6.946 6.71e-05 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 8.206 on 9 degrees of freedom
    Multiple R-squared:  0.9424,  Adjusted R-squared:  0.9233
    F-statistic: 49.12 on 3 and 9 DF,  p-value: 6.651e-06

The parameter estimates are simply the sample means of the three dose levels:

    tapply(tomatopk$Yield, tomatopk$DoseP, mean)
      high    low medium
      31.5   26.0   28.5
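The link between the cell means model matrix and the sample means can be made explicit by solving the normal equations by hand. The sketch below is a minimal illustration that rebuilds the data inline (so it does not need tomatopk.txt); `beta_hat` and `group_means` are names chosen for this example only.

```r
DoseP <- factor(rep(c("low", "low", "medium", "medium", "high", "high"),
                    times = 2))
Yield <- c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)

# Indicator coding: one column per dose level, no intercept column.
X <- model.matrix(~ DoseP - 1)

# Least-squares solution of the normal equations (X'X) b = X'y.
beta_hat <- solve(crossprod(X), crossprod(X, Yield))

# Compare with the sample means per level (order: high, low, medium).
group_means <- tapply(Yield, DoseP, mean)
cbind(beta_hat = drop(beta_hat), group_means)
```

Because X'X is diagonal here (it holds the number of observations per level), each estimate reduces to the sum of the yields in a level divided by the level's sample size.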

3.2. The factor effects model

Consider the effect of P dose as a categorical variable.

The model is given by

$$Y_{ij} = \mu + \delta_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$$

with $\mu$ a parameter common to all P dose levels, and $\delta_1$, $\delta_2$ and $\delta_3$ the effect of the low, medium and high P dose, respectively, on the population mean yield.

There is overparameterisation in this model: four parameters describe three population means, so the individual parameters do not have a clear meaning without an additional restriction.

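3.2.1. Model matrix

$$
\begin{bmatrix} Y_{11} \\ Y_{12} \\ Y_{21} \\ Y_{22} \\ Y_{31} \\ Y_{32} \\ Y_{13} \\ Y_{14} \\ Y_{23} \\ Y_{24} \\ Y_{33} \\ Y_{34} \end{bmatrix}
=
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \mu \\ \delta_1 \\ \delta_2 \\ \delta_3 \end{bmatrix}
+
\begin{bmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{21} \\ \varepsilon_{22} \\ \varepsilon_{31} \\ \varepsilon_{32} \\ \varepsilon_{13} \\ \varepsilon_{14} \\ \varepsilon_{23} \\ \varepsilon_{24} \\ \varepsilon_{33} \\ \varepsilon_{34} \end{bmatrix}
$$

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim MVN(0, \sigma^2 I_{12})$$

X is not full rank: the first column is the sum of the other three.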
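The consequence of the rank deficiency can be seen numerically: different parameter vectors reproduce exactly the same cell means. The sketch below is an illustration under the assumption that the columns are ordered as on the slide (µ, low, medium, high); the two parameter vectors `b1` and `b2` are arbitrary examples, not estimates from the slides.

```r
# Levels set explicitly so that the columns follow the slide: low, medium, high.
DoseP <- factor(rep(c("low", "low", "medium", "medium", "high", "high"),
                    times = 2),
                levels = c("low", "medium", "high"))

# Columns: mu, delta_1 (low), delta_2 (medium), delta_3 (high).
X <- cbind(mu = 1, model.matrix(~ DoseP - 1))
n_par  <- ncol(X)       # number of parameters
rank_X <- qr(X)$rank    # number of linearly independent columns

# Two different parameter vectors: shifting mu by 10 and every delta by -10 ...
b1 <- c(mu = 0,  low = 26, medium = 28.5, high = 31.5)
b2 <- c(mu = 10, low = 16, medium = 18.5, high = 21.5)

# ... leaves the fitted means X b unchanged.
same_fit <- isTRUE(all.equal(drop(X %*% b1), drop(X %*% b2)))
c(n_par = n_par, rank_X = rank_X, same_fit = same_fit)
```

A restriction such as the treatment or the sum restriction picks one parameter vector out of this family of equivalent solutions.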

3.2.2. Model specification in R: treatment restriction

    # show the contrast coding currently in use
    options("contrasts")
    $contrasts
    [1] "contr.treatment" "contr.treatment"

    onewaytrt.tomatopk <- lm(Yield ~ DoseP, data = tomatopk)
    summary(onewaytrt.tomatopk)

    Call:
    lm(formula = Yield ~ DoseP, data = tomatopk)

    Residuals:
        Min      1Q  Median      3Q     Max
    -10.500  -5.625   0.000   6.750   9.500

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   31.500      4.103   7.678 3.07e-05 ***
    DosePlow      -5.500      5.802  -0.948    0.368
    DosePmedium   -3.000      5.802  -0.517    0.618
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 8.206 on 9 degrees of freedom
    Multiple R-squared:  0.091,  Adjusted R-squared:  -0.111
    F-statistic: 0.4505 on 2 and 9 DF,  p-value: 0.6509

The estimated parameters of the model are given by

    (Intercept)    DosePlow DosePmedium
           31.5        -5.5        -3.0

In this factor effects model with treatment restriction, the high dose (the first level in alphabetical order) is the reference level, so we have

$$\mu_H = \mu = \text{intercept} = 31.5$$
$$\mu_L = \text{intercept} + \text{DosePlow} = 31.5 - 5.5 = 26$$
$$\mu_M = \text{intercept} + \text{DosePmedium} = 31.5 - 3 = 28.5$$
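The arithmetic above is the contrast matrix at work. As a minimal sketch (data rebuilt inline, and the contrast passed through the `contrasts` argument of `lm()` rather than through the global option), the population means can be recovered by multiplying the treatment contrast matrix with the estimated effects:

```r
DoseP <- factor(rep(c("low", "low", "medium", "medium", "high", "high"),
                    times = 2))
Yield <- c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)

fit <- lm(Yield ~ DoseP, contrasts = list(DoseP = "contr.treatment"))
b   <- coef(fit)   # (Intercept), DosePlow, DosePmedium

# Treatment contrast matrix: one row per level, the reference row is all zero.
C <- contr.treatment(levels(DoseP))
C

# Mean of each level = intercept + (row of C) times the estimated effects.
means <- unname(b[1]) + drop(C %*% b[-1])
means
```

The all-zero row of `C` belongs to the reference level, which is why its mean equals the intercept. A different reference level could be chosen with, for instance, `relevel(DoseP, ref = "low")`.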

3.2.3. Model specification in R: sum restriction

    # switch to sum-to-zero contrasts for all factors
    options(contrasts = c("contr.sum", "contr.sum"))
    onewaysum.tomatopk <- lm(Yield ~ DoseP, data = tomatopk)
    summary(onewaysum.tomatopk)

    Call:
    lm(formula = Yield ~ DoseP, data = tomatopk)

    Residuals:
        Min      1Q  Median      3Q     Max
    -10.500  -5.625   0.000   6.750   9.500

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   28.667      2.369  12.102 7.16e-07 ***
    DoseP1         2.833      3.350   0.846    0.420
    DoseP2        -2.667      3.350  -0.796    0.447
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 8.206 on 9 degrees of freedom
    Multiple R-squared:  0.091,  Adjusted R-squared:  -0.111
    F-statistic: 0.4505 on 2 and 9 DF,  p-value: 0.6509

The estimated parameters of the model are given by

    (Intercept)      DoseP1      DoseP2
      28.666667    2.833333   -2.666667

In this overparameterised model we use the Σ-restriction, i.e.,

$$\delta_1 + \delta_2 + \delta_3 = 0$$

and

$$\mu = 28.6666667, \qquad \delta_1 = 2.8333333, \qquad \delta_2 = -2.6666667$$

$$\mu_H = \mu + \delta_1, \qquad \mu_L = \mu + \delta_2, \qquad \mu_M = \mu + \delta_3$$

$$\delta_3 = -(2.8333333 - 2.6666667) = -0.1667$$

$$\mu_H = \mu + \delta_1 = 28.6666667 + 2.8333333 = 31.5$$
$$\mu_L = \mu + \delta_2 = 28.6666667 - 2.6666667 = 26$$
$$\mu_M = \mu + \delta_3 = 28.6666667 - (2.8333333 - 2.6666667) = 28.5$$
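The missing third effect follows mechanically from the sum contrast matrix. The sketch below is an illustration with the data rebuilt inline; it sets the contrast per model through the `contrasts` argument of `lm()`, an alternative to changing `options()` globally that is used here only to keep the example self-contained.

```r
DoseP <- factor(rep(c("low", "low", "medium", "medium", "high", "high"),
                    times = 2))
Yield <- c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)

fit <- lm(Yield ~ DoseP, contrasts = list(DoseP = "contr.sum"))
b   <- coef(fit)   # (Intercept), DoseP1 (high), DoseP2 (low)

# Sum contrast matrix: the last level (medium) gets -1 in every column,
# which encodes delta_3 = -(delta_1 + delta_2).
C <- contr.sum(levels(DoseP))
C

delta <- drop(C %*% b[-1])       # all three effects, in level order
means <- unname(b[1]) + delta    # population means per level
round(delta, 4)
means
```

Under this restriction the intercept is the unweighted average of the three level means, which in this balanced data set coincides with the overall sample mean.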

4. Two-Way Analysis of Variance

4.1.1. The cell means model specification

Consider the effect of P and K dose as categorical variables.

The model is given by

$$Y_{ijk} = \mu_{ij} + \varepsilon_{ijk}, \qquad \varepsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2)$$

with $\mu_{ij}$ the population mean yield at the $i$th P dose and the $j$th K dose.

There is no overparameterisation in this model: the population means have a clear meaning.

We cannot, however, disentangle the effects of the K and the P dose.

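4.1.2. Matrix notation: cell means model

$$
\begin{bmatrix} Y_{111} \\ Y_{112} \\ Y_{211} \\ Y_{212} \\ Y_{311} \\ Y_{312} \\ Y_{121} \\ Y_{122} \\ Y_{221} \\ Y_{222} \\ Y_{321} \\ Y_{322} \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} \mu_{11} \\ \mu_{21} \\ \mu_{31} \\ \mu_{12} \\ \mu_{22} \\ \mu_{32} \end{bmatrix}
+
\begin{bmatrix} \varepsilon_{111} \\ \varepsilon_{112} \\ \vdots \\ \varepsilon_{321} \\ \varepsilon_{322} \end{bmatrix}
$$

$$Y = X\beta + \varepsilon$$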

We first have a look at the sample means of the treatment combinations.

    # mean yield for each combination of P dose and K dose
    yieldpk.average <- aggregate(tomatopk$Yield,
                                 list(DoseP = tomatopk$DoseP, DoseK = tomatopk$DoseK),
                                 mean)
    yieldpk.average
       DoseP DoseK    x
    1   high  high 40.5
    2    low  high 30.0
    3 medium  high 35.5
    4   high   low 22.5
    5    low   low 22.0
    6 medium   low 21.5

Next we can look at the interaction plot.

    interaction.plot(tomatopk$DoseP, tomatopk$DoseK, tomatopk$Yield,
                     trace.label = "DoseK", xlab = "DoseP", ylab = "Mean Yield")

[Figure: interaction plot of mean yield (y-axis "Mean Yield") against DoseP (x-axis: high, low, medium), with one trace per DoseK level (high, low).]

The plot shows that yield is higher with the higher K dose, and that the P dose has a substantial effect only at the high K dose. The two factors therefore interact.

4.2.1. The factor effects model specification

We decompose the cell means $\mu_{ij}$, leading to

$$Y_{ijk} = \mu + \pi_i + \kappa_j + (\pi\kappa)_{ij} + \varepsilon_{ijk}$$

- $\mu$ is the overall mean
- $\pi_i = \mu_{i.} - \mu$ is the effect of P dose $i$, $i = 1, 2, 3$
- $\kappa_j = \mu_{.j} - \mu$ is the effect of K dose $j$, $j = 1, 2$
- $(\pi\kappa)_{ij} = \mu_{ij} - (\mu + \pi_i + \kappa_j)$ is the interaction effect for P dose $i$ and K dose $j$
- $\varepsilon_{ijk}$ is the random error term

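4.2.2. Matrix notation: factor effects model

$$
\begin{bmatrix} Y_{111} \\ Y_{112} \\ Y_{211} \\ Y_{212} \\ Y_{311} \\ Y_{312} \\ Y_{121} \\ Y_{122} \\ Y_{221} \\ Y_{222} \\ Y_{321} \\ Y_{322} \end{bmatrix}
=
\left[\begin{array}{cccccccccccc}
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}\right]
\begin{bmatrix} \mu \\ \pi_1 \\ \pi_2 \\ \pi_3 \\ \kappa_1 \\ \kappa_2 \\ (\pi\kappa)_{11} \\ (\pi\kappa)_{21} \\ (\pi\kappa)_{31} \\ (\pi\kappa)_{12} \\ (\pi\kappa)_{22} \\ (\pi\kappa)_{32} \end{bmatrix}
+
\begin{bmatrix} \varepsilon_{111} \\ \varepsilon_{112} \\ \vdots \\ \varepsilon_{321} \\ \varepsilon_{322} \end{bmatrix}
$$

$$Y = X\beta + \varepsilon$$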
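The degree of overparameterisation of this model matrix can be quantified by comparing its number of columns with its rank. The sketch below is an illustration only: it assembles the full indicator coding by binding together no-intercept model matrices, with factor levels set explicitly so that the columns follow the order used in the parameter vector above.

```r
DoseP <- factor(rep(c("low", "low", "medium", "medium", "high", "high"),
                    times = 2),
                levels = c("low", "medium", "high"))
DoseK <- factor(rep(c("low", "high"), each = 6), levels = c("low", "high"))

# Columns: mu | pi_1..pi_3 | kappa_1, kappa_2 | (pi kappa)_11 .. (pi kappa)_32
X <- cbind(mu = 1,
           model.matrix(~ DoseP - 1),
           model.matrix(~ DoseK - 1),
           model.matrix(~ DoseP:DoseK - 1))

dim_X  <- dim(X)        # 12 observations, 12 parameters
rank_X <- qr(X)$rank    # only as many independent columns as there are cells
dim_X
rank_X
```

The rank equals the number of treatment combinations (3 x 2 = 6), so six restrictions are needed to make the twelve parameters estimable.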

We revisit the example: main effects

                                  DoseP
    DoseK     Low              Medium           High
    Low       22               21.5             22.5             µ_.L = 22      (κ_1 = -6.667)
    High      30               35.5             40.5             µ_.H = 35.333  (κ_2 = 6.667)
              µ_L. = 26        µ_M. = 28.5      µ_H. = 31.5      µ_.. = µ = 28.667
              (π_1 = -2.667)   (π_2 = -0.167)   (π_3 = 2.833)

Interaction effects: $(\pi\kappa)_{ij} = \mu_{ij} - \mu - \pi_i - \kappa_j$

                       DoseP
    DoseK     Low       Medium    High
    Low        2.667    -0.333    -2.333
    High      -2.667     0.333     2.333

Example

$$\mu_{LL} = 22 = \mu + \pi_1 + \kappa_1 + (\pi\kappa)_{11} = 28.667 - 2.667 - 6.667 + 2.667$$
$$\mu_{HH} = 40.5 = \mu + \pi_3 + \kappa_2 + (\pi\kappa)_{32} = 28.667 + 2.833 + 6.667 + 2.333$$
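The main and interaction effects in the two tables can be reproduced from the cell means alone. The sketch below is a minimal illustration with the data rebuilt inline; the object names (`cell`, `pi_eff`, `kappa_eff`, `inter`) are chosen for this example, and the factor levels are set explicitly to match the table layout.

```r
tomatopk <- data.frame(
  DoseP = factor(rep(c("low", "low", "medium", "medium", "high", "high"),
                     times = 2),
                 levels = c("low", "medium", "high")),
  DoseK = factor(rep(c("low", "high"), each = 6), levels = c("low", "high")),
  Yield = c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)
)

# 2 x 3 table of cell means: rows = K dose, columns = P dose.
cell <- with(tomatopk, tapply(Yield, list(DoseK = DoseK, DoseP = DoseP), mean))

mu        <- mean(cell)             # overall mean (balanced design)
pi_eff    <- colMeans(cell) - mu    # P dose main effects
kappa_eff <- rowMeans(cell) - mu    # K dose main effects

# Interaction: what is left of each cell mean after the additive part.
inter <- cell - mu - outer(kappa_eff, pi_eff, "+")

round(pi_eff, 3)
round(kappa_eff, 3)
round(inter, 3)
```

Every row and every column of `inter` sums to zero, which mirrors the sum restriction used for this decomposition.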

4.2.3. Model specification in R: sum restriction

    options(contrasts = c("contr.sum", "contr.sum"))
    # main effects of P and K dose plus their interaction
    twowaysum.tomatopk <- lm(Yield ~ DoseP * DoseK, data = tomatopk)
    summary(twowaysum.tomatopk)

    Call:
    lm(formula = Yield ~ DoseP * DoseK, data = tomatopk)

    Residuals:
     Min   1Q Median   3Q  Max
    -2.5 -1.0    0.0  1.0  2.5

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)    28.6667     0.5528  51.860 3.45e-09 ***
    DoseP1          2.8333     0.7817   3.624   0.0110 *
    DoseP2         -2.6667     0.7817  -3.411   0.0143 *
    DoseK1          6.6667     0.5528  12.060 1.97e-05 ***
    DoseP1:DoseK1   2.3333     0.7817   2.985   0.0245 *
    DoseP2:DoseK1  -2.6667     0.7817  -3.411   0.0143 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.915 on 6 degrees of freedom
    Multiple R-squared:  0.967,  Adjusted R-squared:  0.9395
    F-statistic: 35.16 on 5 and 6 DF,  p-value: 0.0002271

Interpretation of parameter estimates

- DoseP: DoseP1 (High) and DoseP2 (Low)
- DoseK: DoseK1 (High)

From the output we see DoseP1 = 2.833, DoseP2 = -2.667, DoseK1 = 6.667, DoseP1:DoseK1 = 2.333, DoseP2:DoseK1 = -2.667.

Due to the sum restriction we have DoseP3 = -(2.833 - 2.667) = -0.167, DoseK2 = -6.667, DoseP3:DoseK1 = -(2.333 - 2.667) = 0.333, DoseP1:DoseK2 = -2.333, DoseP2:DoseK2 = 2.667, DoseP3:DoseK2 = -0.333.

    µ_LL = intercept + DoseP2 + DoseK2 + DoseP2:DoseK2 = 28.667 - 2.667 - 6.667 + 2.667 = 22
    µ_HH = intercept + DoseP1 + DoseK1 + DoseP1:DoseK1 = 28.667 + 2.833 + 6.667 + 2.333 = 40.5
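The same bookkeeping can be left to R. The sketch below is an illustration with the data rebuilt inline and the sum contrasts passed through the `contrasts` argument of `lm()`; it first reproduces the two cell means by hand from the printed coefficients and then obtains all six with `predict()` on a grid of factor combinations (`grid` is a name chosen for this example).

```r
tomatopk <- data.frame(
  DoseP = factor(rep(c("low", "low", "medium", "medium", "high", "high"),
                     times = 2)),
  DoseK = factor(rep(c("low", "high"), each = 6)),
  Yield = c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)
)

fit <- lm(Yield ~ DoseP * DoseK, data = tomatopk,
          contrasts = list(DoseP = "contr.sum", DoseK = "contr.sum"))
b <- coef(fit)

# By hand: DoseK2 = -DoseK1 and DoseP2:DoseK2 = -DoseP2:DoseK1.
mu_LL <- unname(b["(Intercept)"] + b["DoseP2"] - b["DoseK1"] - b["DoseP2:DoseK1"])
mu_HH <- unname(b["(Intercept)"] + b["DoseP1"] + b["DoseK1"] + b["DoseP1:DoseK1"])

# All six cell means at once.
grid <- expand.grid(DoseP = levels(tomatopk$DoseP),
                    DoseK = levels(tomatopk$DoseK))
grid$mean <- as.numeric(predict(fit, newdata = grid))
grid
```

Because the model with interaction has one parameter per cell after the restriction, the predicted values are exactly the sample cell means.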

    # Type I sums of squares
    anova(twowaysum.tomatopk)

    Analysis of Variance Table

    Response: Yield
                Df Sum Sq Mean Sq  F value    Pr(>F)
    DoseP        2  60.67   30.33   8.2727   0.01885 *
    DoseK        1 533.33  533.33 145.4545 1.972e-05 ***
    DoseP:DoseK  2  50.67   25.33   6.9091   0.02775 *
    Residuals    6  22.00    3.67
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    # Type I sums of squares are not influenced by the order of the variables
    # here, because the design is balanced (2 observations in every cell)
    twowaysum.tomatopkalt <- lm(Yield ~ DoseK * DoseP, data = tomatopk)
    anova(twowaysum.tomatopkalt)

    Analysis of Variance Table

    Response: Yield
                Df Sum Sq Mean Sq  F value    Pr(>F)
    DoseK        1 533.33  533.33 145.4545 1.972e-05 ***
    DoseP        2  60.67   30.33   8.2727   0.01885 *
    DoseK:DoseP  2  50.67   25.33   6.9091   0.02775 *
    Residuals    6  22.00    3.67
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
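The order-invariance of the Type I sums of squares depends on the balance of the design. The sketch below is an illustration that goes slightly beyond the slides: it rebuilds the data inline, confirms the result for the balanced data, and then drops one observation (an arbitrary choice made for this example) to show that the sequential sum of squares for DoseP then does depend on the order of the terms.

```r
tomatopk <- data.frame(
  DoseP = factor(rep(c("low", "low", "medium", "medium", "high", "high"),
                     times = 2)),
  DoseK = factor(rep(c("low", "high"), each = 6)),
  Yield = c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)
)

# Sequential (Type I) sum of squares for DoseP under a given term order.
ss_dosep <- function(formula, data) {
  anova(lm(formula, data = data))["DoseP", "Sum Sq"]
}

# Balanced data: DoseP entered first or second gives the same value.
bal_first  <- ss_dosep(Yield ~ DoseP * DoseK, tomatopk)
bal_second <- ss_dosep(Yield ~ DoseK * DoseP, tomatopk)

# Hypothetical unbalanced data: remove the first observation.
unb <- tomatopk[-1, ]
unb_first  <- ss_dosep(Yield ~ DoseP * DoseK, unb)
unb_second <- ss_dosep(Yield ~ DoseK * DoseP, unb)

round(c(bal_first = bal_first, bal_second = bal_second,
        unb_first = unb_first, unb_second = unb_second), 2)
```

In the unbalanced case the first value ignores DoseK while the second is adjusted for it, so the two no longer agree.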

4.2.4. Model specification in R: treatment restriction

    # switch back to treatment contrasts (first level is the reference)
    options(contrasts = c("contr.treatment", "contr.treatment"))
    twowaytrt.tomatopk <- lm(Yield ~ DoseP * DoseK, data = tomatopk)
    summary(twowaytrt.tomatopk)

    Call:
    lm(formula = Yield ~ DoseP * DoseK, data = tomatopk)

    Residuals:
     Min   1Q Median   3Q  Max
    -2.5 -1.0    0.0  1.0  2.5

    Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
    (Intercept)            40.500      1.354  29.911 9.26e-08 ***
    DosePlow              -10.500      1.915  -5.483  0.00154 **
    DosePmedium            -5.000      1.915  -2.611  0.04006 *
    DoseKlow              -18.000      1.915  -9.400 8.23e-05 ***
    DosePlow:DoseKlow      10.000      2.708   3.693  0.01018 *
    DosePmedium:DoseKlow    4.000      2.708   1.477  0.19012
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.915 on 6 degrees of freedom
    Multiple R-squared:  0.967,  Adjusted R-squared:  0.9395
    F-statistic: 35.16 on 5 and 6 DF,  p-value: 0.0002271
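The treatment-restricted coefficients describe the same six cell means as the sum-restricted ones; only the parameterisation differs. As a minimal sketch (data rebuilt inline, contrasts set per model through the `contrasts` argument of `lm()`), two cell means are recovered by hand and the fitted values of both parameterisations are compared:

```r
tomatopk <- data.frame(
  DoseP = factor(rep(c("low", "low", "medium", "medium", "high", "high"),
                     times = 2)),
  DoseK = factor(rep(c("low", "high"), each = 6)),
  Yield = c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)
)

fit_trt <- lm(Yield ~ DoseP * DoseK, data = tomatopk,
              contrasts = list(DoseP = "contr.treatment",
                               DoseK = "contr.treatment"))
fit_sum <- lm(Yield ~ DoseP * DoseK, data = tomatopk,
              contrasts = list(DoseP = "contr.sum", DoseK = "contr.sum"))

b <- coef(fit_trt)

# Reference cell (high P, high K) is the intercept.
mu_HH <- unname(b["(Intercept)"])

# Low P, low K: add both main effects and their interaction.
mu_LL <- unname(b["(Intercept)"] + b["DosePlow"] + b["DoseKlow"] +
                b["DosePlow:DoseKlow"])

# Same fitted values and residual variance under both restrictions.
same_fit <- isTRUE(all.equal(fitted(fit_trt), fitted(fit_sum)))
c(mu_HH = mu_HH, mu_LL = mu_LL, same_fit = same_fit)
```

This is why the residual standard error, R-squared and F-statistic are identical in the two outputs, while the individual coefficients and their tests differ.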