Linear Model Specification in R


How to deal with overparameterisation?

Paul Janssen (1), Luc Duchateau (2)
(1) Center for Statistics, Hasselt University, Belgium
(2) Faculty of Veterinary Medicine, Ghent University, Belgium

P. Janssen & L. Duchateau (UH & UG), Linear Model Specification in R

1. The data set

1.1. The data set of exam preparation

    DoseP    DoseK   Yield
    low      low     21
    low      low     23
    medium   low     19
    medium   low     24
    high     low     24
    high     low     21
    low      high    29
    low      high    31
    medium   high    35
    medium   high    36
    high     high    41
    high     high    40

1.2. The extended data set

    DosePnum   DoseP    DoseK   Yield
    1          low      low     21
    1          low      low     23
    2          medium   low     19
    2          medium   low     24
    3          high     low     24
    3          high     low     21
    1          low      high    29
    1          low      high    31
    2          medium   high    35
    2          medium   high    36
    3          high     high    41
    3          high     high    40
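For readers without access to tomatopk.txt, the table above can be entered by hand. This is a minimal sketch, not the authors' code: storing DoseP and DoseK as factors with R's default alphabetical level order (high, low, medium) is an assumption, chosen because it matches the coefficient names in the R output later in these slides.

```r
# Enter the extended data set by hand (a stand-in for reading tomatopk.txt).
tomatopk <- data.frame(
  DosePnum = rep(c(1, 1, 2, 2, 3, 3), times = 2),
  DoseP    = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 2)),
  DoseK    = factor(rep(c("low", "high"), each = 6)),
  Yield    = c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)
)

# Default factor levels are alphabetical, not low < medium < high.
levels(tomatopk$DoseP)

# DosePnum is a numeric recoding of DoseP (low = 1, medium = 2, high = 3).
table(tomatopk$DoseP, tomatopk$DosePnum)
```

The alphabetical level order explains why "high" acts as the first (reference) level in the models that follow.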

2. The linear regression model

2.1. Model specification

The model is given by

$$Y_i = \beta_0 + \beta_1\,\mathrm{DosePnum}_i + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$$

There is no overparameterisation in this model: both the intercept $\beta_0$ and the slope $\beta_1$ have a clear meaning.

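2.2. Model matrix

$$
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \\ Y_7 \\ Y_8 \\ Y_9 \\ Y_{10} \\ Y_{11} \\ Y_{12} \end{pmatrix}
=
\begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 2 \\ 1 & 2 \\ 1 & 3 \\ 1 & 3 \\ 1 & 1 \\ 1 & 1 \\ 1 & 2 \\ 1 & 2 \\ 1 & 3 \\ 1 & 3 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{pmatrix}
$$

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim MVN(0, \sigma^2 I_{12})$$

X is full rank.

Think of a data structure for which the linear regression model would become overparameterised.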

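2.3. Overparameterised linear regression model

Assume we have used only one dose, for instance the medium dose:

$$
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \\ Y_7 \\ Y_8 \\ Y_9 \\ Y_{10} \\ Y_{11} \\ Y_{12} \end{pmatrix}
=
\begin{pmatrix} 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{pmatrix}
$$

X is no longer full rank: the second column is twice the first, so $\beta_0$ and $\beta_1$ cannot be estimated separately.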
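The rank argument can be checked numerically. The sketch below builds both design matrices by hand and compares their ranks; the object names are illustrative, and the response used for the one-dose fit is only a placeholder (the observed yields reused), since no such experiment appears in the slides.

```r
# Design matrix of section 2.2: intercept column plus DosePnum.
dose   <- rep(c(1, 1, 2, 2, 3, 3), times = 2)
X_full <- cbind(intercept = 1, DosePnum = dose)

# Hypothetical design of section 2.3: every plot receives the medium dose.
X_one <- cbind(intercept = 1, DosePnum = rep(2, 12))

rank_full <- qr(X_full)$rank  # number of linearly independent columns
rank_one  <- qr(X_one)$rank

# lm() signals the non-estimable slope by returning NA for it.
yield   <- c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)  # placeholder response
fit_one <- lm(yield ~ DosePnum,
              data = data.frame(yield = yield, DosePnum = rep(2, 12)))
coef(fit_one)
```

With a single dose, only the intercept (the overall mean) can be estimated; the slope is reported as NA.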

2.4. Model specification in R

    setwd("c:/users/lduchate/docs/oc/onderwijs/adekus/part3basicprinciples")
    tomatopk <- read.table("tomatopk.txt", header = TRUE)
    linres.tomatopk <- lm(Yield ~ DosePnum, data = tomatopk)
    summary(linres.tomatopk)

    Call:
    lm(formula = Yield ~ DosePnum, data = tomatopk)

    [summary output; numeric entries not recoverable]
    Coefficients: (Intercept) **, DosePnum
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error on 10 degrees of freedom
    F-statistic on 1 and 10 DF

3. One-Way Analysis of Variance

We consider 3 different models:
- The cell means model (not overparameterised)
- The factor effects model (overparameterised)
  - with treatment restriction
  - with sum restriction

3.1. The cell means model

Consider the effect of P dose as a categorical variable. The model is given by

$$Y_{ij} = \mu_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \overset{\text{iid}}{\sim} N(0, \sigma^2)$$

with $\mu_1$, $\mu_2$ and $\mu_3$ the population mean yield at the low, medium and high P dose, respectively.

There is no overparameterisation in this model: the population means have a clear meaning.

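Model matrix

$$
\begin{pmatrix} Y_{11} \\ Y_{12} \\ Y_{21} \\ Y_{22} \\ Y_{31} \\ Y_{32} \\ Y_{13} \\ Y_{14} \\ Y_{23} \\ Y_{24} \\ Y_{33} \\ Y_{34} \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{21} \\ \varepsilon_{22} \\ \varepsilon_{31} \\ \varepsilon_{32} \\ \varepsilon_{13} \\ \varepsilon_{14} \\ \varepsilon_{23} \\ \varepsilon_{24} \\ \varepsilon_{33} \\ \varepsilon_{34} \end{pmatrix}
$$

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim MVN(0, \sigma^2 I_{12})$$

X is full rank.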

Model specification in R

    onewaycellm.tomatopk <- lm(Yield ~ DoseP - 1, data = tomatopk)
    summary(onewaycellm.tomatopk)

    Call:
    lm(formula = Yield ~ DoseP - 1, data = tomatopk)

    [summary output; numeric entries not recoverable]
    Coefficients: DosePhigh ***, DosePlow ***, DosePmedium ***
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error on 9 degrees of freedom
    F-statistic on 3 and 9 DF, p-value: 6.651e-06

The parameters simply correspond to the sample means:

    tapply(tomatopk$Yield, tomatopk$DoseP, mean)
      high    low medium
      31.5   26.0   28.5

3.2. The factor effects model

Consider the effect of P dose as a categorical variable. The model is given by

$$Y_{ij} = \mu + \delta_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \overset{\text{iid}}{\sim} N(0, \sigma^2)$$

with $\mu$ a parameter common to all P dose levels, and $\delta_1$, $\delta_2$ and $\delta_3$ the effect of the low, medium and high P dose, respectively, on the population mean yield.

There is overparameterisation in this model: four parameters describe three population means, so $\mu$ and the $\delta_i$ do not have a clear meaning without an additional restriction.

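Model matrix

$$
\begin{pmatrix} Y_{11} \\ Y_{12} \\ Y_{21} \\ Y_{22} \\ Y_{31} \\ Y_{32} \\ Y_{13} \\ Y_{14} \\ Y_{23} \\ Y_{24} \\ Y_{33} \\ Y_{34} \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \mu \\ \delta_1 \\ \delta_2 \\ \delta_3 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{21} \\ \varepsilon_{22} \\ \varepsilon_{31} \\ \varepsilon_{32} \\ \varepsilon_{13} \\ \varepsilon_{14} \\ \varepsilon_{23} \\ \varepsilon_{24} \\ \varepsilon_{33} \\ \varepsilon_{34} \end{pmatrix}
$$

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim MVN(0, \sigma^2 I_{12})$$

X is not full rank: the last three columns add up to the first.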
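The rank deficiency of the factor effects design can be verified directly. The following sketch assembles the overparameterised matrix by hand; the object names are illustrative, and the level order low, medium, high is set explicitly to follow the indexing of this section rather than R's alphabetical default.

```r
# P dose in data order, with levels ordered as in this section (1 = low, ...).
DoseP <- factor(rep(rep(c("low", "medium", "high"), each = 2), times = 2),
                levels = c("low", "medium", "high"))

# One indicator column per dose level (the cell means design matrix).
Z <- model.matrix(~ DoseP - 1)

# Factor effects design: a column of ones for mu plus the three indicators.
X <- cbind(mu = 1, Z)

n_par  <- ncol(X)     # number of parameters: mu, delta_1, delta_2, delta_3
rank_X <- qr(X)$rank  # number of linearly independent columns

# The indicator columns add up to the column of ones.
all(rowSums(Z) == X[, "mu"])
```

Because the rank is one less than the number of columns, exactly one restriction is needed to make the parameters estimable, which is what the treatment and sum restrictions provide.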

3.2.2 Model specification in R: treatment restriction

    options("contrasts")
    $contrasts
    [1] "contr.treatment" "contr.treatment"

    onewaytrt.tomatopk <- lm(Yield ~ DoseP, data = tomatopk)
    summary(onewaytrt.tomatopk)

    Call:
    lm(formula = Yield ~ DoseP, data = tomatopk)

    [summary output; numeric entries not recoverable]
    Coefficients: (Intercept) ***, DosePlow, DosePmedium
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error on 9 degrees of freedom
    Multiple R-squared: 0.091
    F-statistic on 2 and 9 DF

The estimated parameters of the model are given by

    (Intercept)    DosePlow   DosePmedium
           31.5        -5.5          -3.0

In this factor effects model with treatment restriction we have

    µ_H = µ = intercept = 31.5
    µ_L = intercept + DosePlow = 31.5 - 5.5 = 26
    µ_M = intercept + DosePmedium = 31.5 - 3 = 28.5
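The same bookkeeping can be done in R. This sketch assumes the data are entered by hand with alphabetically ordered factor levels, and it sets the treatment restriction through the `contrasts` argument of `lm()` instead of the global `options()` used in the slides, so that the example does not depend on session settings.

```r
# P dose and yield from the data set of section 1.1.
tomatopk <- data.frame(
  DoseP = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 2)),
  Yield = c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)
)

# Treatment restriction set locally; the first level ("high") is the reference.
fit_trt <- lm(Yield ~ DoseP, data = tomatopk,
              contrasts = list(DoseP = "contr.treatment"))
b <- coef(fit_trt)

# The intercept is the mean of the reference level; the others are offsets.
means_trt <- c(high   = b[["(Intercept)"]],
               low    = b[["(Intercept)"]] + b[["DosePlow"]],
               medium = b[["(Intercept)"]] + b[["DosePmedium"]])
means_trt
```

The reconstructed values agree with the sample means per P dose, so the treatment restriction changes the parameterisation but not the fitted means.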

3.2.3 Model specification in R: sum restriction

    options(contrasts = c("contr.sum", "contr.sum"))
    onewaysum.tomatopk <- lm(Yield ~ DoseP, data = tomatopk)
    summary(onewaysum.tomatopk)

    Call:
    lm(formula = Yield ~ DoseP, data = tomatopk)

    [summary output; numeric entries not recoverable]
    Coefficients: (Intercept) ***, DoseP1, DoseP2
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error on 9 degrees of freedom
    Multiple R-squared: 0.091
    F-statistic on 2 and 9 DF

The estimated parameters of the model are given by

    (Intercept)    DoseP1    DoseP2
         28.667     2.833    -2.667

In this overparameterised model we use the Σ-restriction, i.e.,

    δ1 + δ2 + δ3 = 0

and, with the levels indexed in R's alphabetical order (1 = high, 2 = low, 3 = medium),

    µ_H = µ + δ1
    µ_L = µ + δ2
    µ_M = µ + δ3

so that

    µ = 28.667, δ1 = 2.833, δ2 = -2.667
    δ3 = -(2.833 - 2.667) = -0.167

    µ_H = µ + δ1 = 28.667 + 2.833 = 31.5
    µ_L = µ + δ2 = 28.667 - 2.667 = 26
    µ_M = µ + δ3 = 28.667 - 0.167 = 28.5
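The effect of the level that R does not print can be recovered from the restriction itself. This is a sketch under the same assumptions as before (hand-entered data, alphabetical factor levels); it sets the sum restriction locally through the `contrasts` argument of `lm()` rather than through `options()`.

```r
# P dose and yield from the data set of section 1.1.
tomatopk <- data.frame(
  DoseP = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 2)),
  Yield = c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)
)

# Sum restriction set locally for DoseP.
fit_sum <- lm(Yield ~ DoseP, data = tomatopk,
              contrasts = list(DoseP = "contr.sum"))
b <- coef(fit_sum)  # (Intercept), DoseP1 (high), DoseP2 (low)

# The third effect (medium) follows from delta1 + delta2 + delta3 = 0.
mu    <- b[["(Intercept)"]]
delta <- c(high   = b[["DoseP1"]],
           low    = b[["DoseP2"]],
           medium = -(b[["DoseP1"]] + b[["DoseP2"]]))

# Population mean estimates: mu + delta_i.
means_sum <- mu + delta
means_sum
```

Under this restriction the intercept is the unweighted average of the three level means, and each effect is a deviation from that average.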

4. Two-Way Analysis of Variance

The cell means model specification

Consider the effect of P and K dose as categorical variables. The model is given by

$$Y_{ijk} = \mu_{ij} + \varepsilon_{ijk}, \qquad \varepsilon_{ijk} \overset{\text{iid}}{\sim} N(0, \sigma^2)$$

with $\mu_{ij}$ the population mean yield at the $i$-th P dose and the $j$-th K dose.

There is no overparameterisation in this model: the population means have a clear meaning. We cannot, however, disentangle the effects of the K and the P dose.

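Matrix Notation: Cell means model

$$Y = X\beta + \varepsilon$$

with

$$Y = (Y_{111}, Y_{112}, Y_{211}, Y_{212}, Y_{311}, Y_{312}, Y_{121}, Y_{122}, Y_{221}, Y_{222}, Y_{321}, Y_{322})^\top$$

$$\beta = (\mu_{11}, \mu_{21}, \mu_{31}, \mu_{12}, \mu_{22}, \mu_{32})^\top$$

$$\varepsilon = (\varepsilon_{111}, \varepsilon_{112}, \ldots, \varepsilon_{321}, \varepsilon_{322})^\top$$

$$
X = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$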

We first have a look at the sample means of the treatment combinations

    yieldpk.average <- aggregate(tomatopk$Yield,
                                 list(DoseP = tomatopk$DoseP, DoseK = tomatopk$DoseK),
                                 mean)
    yieldpk.average
       DoseP DoseK    x
    1   high  high 40.5
    2    low  high 30.0
    3 medium  high 35.5
    4   high   low 22.5
    5    low   low 22.0
    6 medium   low 21.5

Next we can look at the interaction plot

    interaction.plot(tomatopk$DoseP, tomatopk$DoseK, tomatopk$Yield,
                     trace.label = "DoseK", xlab = "DoseP", ylab = "Mean Yield")

[Figure: interaction plot of mean yield (y-axis "Mean Yield") against DoseP (x-axis: high, low, medium), with one trace per DoseK level (high, low).]

The plot shows that yield is higher at the higher K dose, and that the P dose has a substantial effect only at the high K dose. The two factors therefore interact!

The factor effects model specification

We decompose the cell means $\mu_{ij}$, leading to

$$Y_{ijk} = \mu + \pi_i + \kappa_j + (\pi\kappa)_{ij} + \varepsilon_{ijk}$$

- $\mu$ is the overall mean
- $\pi_i = \mu_{i.} - \mu$ is the effect of P dose $i$, $i = 1, 2, 3$
- $\kappa_j = \mu_{.j} - \mu$ is the effect of K dose $j$, $j = 1, 2$
- $(\pi\kappa)_{ij} = \mu_{ij} - (\mu + \pi_i + \kappa_j)$ is the interaction effect for P dose $i$ and K dose $j$
- $\varepsilon_{ijk}$ is the random error term
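The decomposition above can be computed from the sample cell means. This is a minimal sketch with illustrative object names; the factor levels are ordered low, medium, high (and low, high) on purpose, so that the indices match $i = 1, 2, 3$ and $j = 1, 2$ as used here, and the shortcut `mean(cell)` for the overall mean relies on the design being balanced.

```r
# Data of section 1.1, with levels in the order used for the indices i and j.
tomatopk <- data.frame(
  DoseP = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 2),
                 levels = c("low", "medium", "high")),
  DoseK = factor(rep(c("low", "high"), each = 6), levels = c("low", "high")),
  Yield = c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)
)

# 3 x 2 matrix of sample cell means (rows: P dose, columns: K dose).
cell <- tapply(tomatopk$Yield,
               list(DoseP = tomatopk$DoseP, DoseK = tomatopk$DoseK), mean)

mu      <- mean(cell)            # overall mean (balanced design)
pi_i    <- rowMeans(cell) - mu   # P dose effects
kappa_j <- colMeans(cell) - mu   # K dose effects

# Interaction effects: what is left after removing mu, pi_i and kappa_j.
pk_ij <- cell - mu - outer(pi_i, kappa_j, "+")

round(pi_i, 3)
round(kappa_j, 3)
round(pk_ij, 3)
```

By construction the main effects sum to zero over their index, and the interaction effects sum to zero over each row and each column.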

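Matrix Notation: Factor effects model

$$Y = X\beta + \varepsilon$$

with

$$Y = (Y_{111}, Y_{112}, Y_{211}, Y_{212}, Y_{311}, Y_{312}, Y_{121}, Y_{122}, Y_{221}, Y_{222}, Y_{321}, Y_{322})^\top$$

$$\beta = (\mu, \pi_1, \pi_2, \pi_3, \kappa_1, \kappa_2, (\pi\kappa)_{11}, (\pi\kappa)_{21}, (\pi\kappa)_{31}, (\pi\kappa)_{12}, (\pi\kappa)_{22}, (\pi\kappa)_{32})^\top$$

$$\varepsilon = (\varepsilon_{111}, \varepsilon_{112}, \ldots, \varepsilon_{321}, \varepsilon_{322})^\top$$

$$
X = \begin{pmatrix}
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$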

We revisit the example: Main effects

                          DoseP
    DoseK     Low           Medium        High
    Low       22            21.5          22.5          µ.L = 22       (κ1 = -6.667)
    High      30            35.5          40.5          µ.H = 35.333   (κ2 = 6.667)
              µL. = 26      µM. = 28.5    µH. = 31.5    µ.. = µ = 28.667
              (π1 = -2.667) (π2 = -0.167) (π3 = 2.833)

Interaction effects: $(\pi\kappa)_{ij} = \mu_{ij} - \mu - \pi_i - \kappa_j$

                       DoseP
    DoseK     Low       Medium    High
    Low        2.667    -0.333    -2.333
    High      -2.667     0.333     2.333

Example

    µ_LL = 22   = µ + π1 + κ1 + (πκ)11 = 28.667 - 2.667 - 6.667 + 2.667
    µ_HH = 40.5 = µ + π3 + κ2 + (πκ)32 = 28.667 + 2.833 + 6.667 + 2.333

4.2.3 Model specification in R: sum restriction

    options(contrasts = c("contr.sum", "contr.sum"))
    twowaysum.tomatopk <- lm(Yield ~ DoseP * DoseK, data = tomatopk)
    summary(twowaysum.tomatopk)

    Call:
    lm(formula = Yield ~ DoseP * DoseK, data = tomatopk)

    [summary output; numeric entries not recoverable]
    Coefficients: (Intercept) ***, DoseP1 *, DoseP2 *, DoseK1 ***, DoseP1:DoseK1 *, DoseP2:DoseK1 *
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error on 6 degrees of freedom
    Multiple R-squared: 0.967
    F-statistic on 5 and 6 DF

Interpretation of parameter estimates

DoseP: DoseP1 (High) and DoseP2 (Low)
DoseK: DoseK1 (High)

From the output we see

    DoseP1 = 2.833, DoseP2 = -2.667, DoseK1 = 6.667,
    DoseP1:DoseK1 = 2.333, DoseP2:DoseK1 = -2.667

Due to the sum restriction we have

    DoseP3 = -(2.833 - 2.667) = -0.167, DoseK2 = -6.667,
    DoseP3:DoseK1 = -(2.333 - 2.667) = 0.333,
    DoseP1:DoseK2 = -2.333, DoseP2:DoseK2 = 2.667, DoseP3:DoseK2 = -0.333

so that

    µ_LL = intercept + DoseP2 + DoseK2 + DoseP2:DoseK2 = 28.667 - 2.667 - 6.667 + 2.667 = 22
    µ_HH = intercept + DoseP1 + DoseK1 + DoseP1:DoseK1 = 28.667 + 2.833 + 6.667 + 2.333 = 40.5
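The reconstruction of the two cell means can be reproduced from the fitted coefficients. The sketch assumes hand-entered data with alphabetical factor levels (so DoseP1 = high, DoseP2 = low, DoseK1 = high) and sets the sum restriction through the `contrasts` argument of `lm()` rather than through `options()`.

```r
# Data of section 1.1 with R's default (alphabetical) factor levels.
tomatopk <- data.frame(
  DoseP = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 2)),
  DoseK = factor(rep(c("low", "high"), each = 6)),
  Yield = c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)
)

fit <- lm(Yield ~ DoseP * DoseK, data = tomatopk,
          contrasts = list(DoseP = "contr.sum", DoseK = "contr.sum"))
b <- coef(fit)

# Effects not printed by R follow from the sum restriction:
# DoseK2 = -DoseK1 and DoseP2:DoseK2 = -DoseP2:DoseK1.
mu_HH <- b[["(Intercept)"]] + b[["DoseP1"]] + b[["DoseK1"]] + b[["DoseP1:DoseK1"]]
mu_LL <- b[["(Intercept)"]] + b[["DoseP2"]] - b[["DoseK1"]] - b[["DoseP2:DoseK1"]]

c(mu_LL = mu_LL, mu_HH = mu_HH)
```

The same values can be obtained with `predict()` on the fitted model; writing the sums out by hand only makes the role of each coefficient visible.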

    # Type I sums of squares
    anova(twowaysum.tomatopk)

    Analysis of Variance Table
    Response: Yield

    [anova table (Df, Sum Sq, Mean Sq, F value, Pr(>F)); numeric entries not recoverable]
    Rows: DoseP *, DoseK ***, DoseP:DoseK *, Residuals
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    # Type I sums of squares are not influenced by the sequence of the
    # variables here, because the design is balanced
    twowaysum.tomatopkalt <- lm(Yield ~ DoseK * DoseP, data = tomatopk)
    anova(twowaysum.tomatopkalt)

    Analysis of Variance Table
    Response: Yield

    [anova table (Df, Sum Sq, Mean Sq, F value, Pr(>F)); numeric entries not recoverable]
    Rows: DoseK ***, DoseP *, DoseK:DoseP *, Residuals
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
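The order invariance shown here holds because every P by K combination has the same number of observations. The sketch below compares the Type I sum of squares for DoseP under both term orders, first for the balanced data and then for a hypothetical unbalanced version in which the first observation is dropped; the helper `ss_type1()` and the unbalanced data set are illustrative and not part of the slides.

```r
# Data of section 1.1.
tomatopk <- data.frame(
  DoseP = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 2)),
  DoseK = factor(rep(c("low", "high"), each = 6)),
  Yield = c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)
)

# Hypothetical helper: Type I (sequential) sum of squares of one term.
ss_type1 <- function(formula, data, term) {
  anova(lm(formula, data = data))[term, "Sum Sq"]
}

# Balanced design: DoseP entered first versus second.
ssP_first  <- ss_type1(Yield ~ DoseP * DoseK, tomatopk, "DoseP")
ssP_second <- ss_type1(Yield ~ DoseK * DoseP, tomatopk, "DoseP")

# Hypothetical unbalanced design: drop the first observation.
unbal <- tomatopk[-1, ]
ssP_first_u  <- ss_type1(Yield ~ DoseP * DoseK, unbal, "DoseP")
ssP_second_u <- ss_type1(Yield ~ DoseK * DoseP, unbal, "DoseP")

c(balanced_first = ssP_first, balanced_second = ssP_second,
  unbalanced_first = ssP_first_u, unbalanced_second = ssP_second_u)
```

In the balanced case the two values coincide; once a single observation is removed they no longer do, so with unbalanced data the order of the terms matters for Type I sums of squares.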

Model specification in R: treatment restriction

    options(contrasts = c("contr.treatment", "contr.treatment"))
    twowaytrt.tomatopk <- lm(Yield ~ DoseP * DoseK, data = tomatopk)
    summary(twowaytrt.tomatopk)

    Call:
    lm(formula = Yield ~ DoseP * DoseK, data = tomatopk)

    [summary output; numeric entries not recoverable]
    Coefficients: (Intercept) ***, DosePlow **, DosePmedium *, DoseKlow ***, DosePlow:DoseKlow *, DosePmedium:DoseKlow
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error on 6 degrees of freedom
    Multiple R-squared: 0.967
    F-statistic on 5 and 6 DF
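The slides do not spell out the interpretation of these coefficients, so the following sketch relates them to the sample cell means. It assumes hand-entered data with alphabetical factor levels, which makes the (high P, high K) combination the reference cell; the object names are illustrative.

```r
# Data of section 1.1 with R's default (alphabetical) factor levels.
tomatopk <- data.frame(
  DoseP = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 2)),
  DoseK = factor(rep(c("low", "high"), each = 6)),
  Yield = c(21, 23, 19, 24, 24, 21, 29, 31, 35, 36, 41, 40)
)

fit <- lm(Yield ~ DoseP * DoseK, data = tomatopk,
          contrasts = list(DoseP = "contr.treatment", DoseK = "contr.treatment"))
b <- coef(fit)

# Sample cell means, rows = P dose, columns = K dose.
cell <- tapply(tomatopk$Yield, list(tomatopk$DoseP, tomatopk$DoseK), mean)

# Intercept: mean of the reference cell (high P, high K).
ref <- cell["high", "high"]

# Main-effect coefficients: differences from the reference cell.
d_Plow <- cell["low", "high"] - ref
d_Klow <- cell["high", "low"] - ref

# Interaction coefficient: a difference of differences.
d_PlowKlow <- (cell["low", "low"] - cell["low", "high"]) -
              (cell["high", "low"] - cell["high", "high"])

rbind(from_lm    = b[c("(Intercept)", "DosePlow", "DoseKlow", "DosePlow:DoseKlow")],
      from_means = c(ref, d_Plow, d_Klow, d_PlowKlow))
```

Under the treatment restriction the "main effect" coefficients are therefore simple effects at the reference level of the other factor, not averages over its levels as under the sum restriction.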
