Friday 10/10/2014
Binomial Logistic Regression with glm()
> plot(x,y)
> abline(reg=lm(y~x))
Binomial Logistic Regression

numsessions  relapse
      -1.74  No relapse
       1.15  No relapse
       1.87  No relapse
        .62  No relapse
       -.47  Relapse
        .88  No relapse
       -.99  Relapse
        .81  No relapse
        .44  No relapse
      -1.35  Relapse
        .52  No relapse
        .14  Relapse
       -.49  Relapse
        .60  No relapse
       -.03  No relapse
       -.43  Relapse
       -.94  Relapse
       -.06  Relapse
       -.84  Relapse
      -1.81  Relapse
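If you want to run the example yourself, here is a minimal sketch reconstructing this dataset as a data frame (the name relapse.data is my own; with the alphabetical factor ordering, "Relapse" is modeled as the event, so glm(relapse~numsessions, family="binomial", data=relapse.data) should reproduce the fit below):

> relapse.data <- data.frame(
+   numsessions = c(-1.74, 1.15, 1.87, .62, -.47, .88, -.99, .81, .44, -1.35,
+                   .52, .14, -.49, .60, -.03, -.43, -.94, -.06, -.84, -1.81),
+   relapse = factor(c("No relapse","No relapse","No relapse","No relapse",
+                      "Relapse","No relapse","Relapse","No relapse",
+                      "No relapse","Relapse","No relapse","Relapse",
+                      "Relapse","No relapse","No relapse","Relapse",
+                      "Relapse","Relapse","Relapse","Relapse")))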
The Logistic Function

p = e^x / (1 + e^x)

(Plot: the S-shaped logistic curve of x, bounded below by 0 and above by 1.)
The Exponential Function

e^x, where e = 2.718282
log(e^x) = x
e^log(x) = x
e^x * e^y = e^(x+y)
e^0 = 1

> curve(exp(x),-5,+5)
> rbind(-3:3,round(exp(-3:3),2))
      [,1]  [,2]  [,3] [,4] [,5] [,6]  [,7]
[1,] -3.00 -2.00 -1.00    0 1.00 2.00  3.00
[2,]  0.05  0.14  0.37    1 2.72 7.39 20.09
The Logistic Function

p = e^ŷ / (1 + e^ŷ)
p(1 + e^ŷ) = e^ŷ
p + p*e^ŷ = e^ŷ
p = e^ŷ - p*e^ŷ = e^ŷ(1 - p)
p / (1 - p) = e^ŷ
log(p / (1 - p)) = ŷ

logit = log(p/(1-p))

(Plot: the logistic curve of x, bounded by 0 and 1.)
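A quick numerical check that the logistic function undoes the logit; R's qlogis() and plogis() compute these same two functions directly:

> p <- 0.8
> log(p/(1-p))                      # the logit
[1] 1.386294
> exp(1.386294)/(1+exp(1.386294))   # the logistic function recovers p
[1] 0.8
> plogis(qlogis(0.8))               # same round trip with the built-ins
[1] 0.8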
> glm(relapse~numsessions,family="binomial")

Call: glm(formula = relapse ~ numsessions, family = "binomial")

Coefficients:
(Intercept)  numsessions
    -0.1866      -2.0925

Degrees of Freedom: 19 Total (i.e. Null); 18 Residual
Null Deviance: 27.73
Residual Deviance: 17.64   AIC: 21.64

log(p(relapse) / (1 - p(relapse))) = -.186 - 2.092 * Numsessions

At numsess = 0 (i.e. at the mean):
log(p(relapse)/(1 - p(relapse))) = -.186, so p(relapse)/(1 - p(relapse)) = e^(-.186) = .83

At numsess = +1:
p(relapse)/(1 - p(relapse)) = e^(-.186 - 2.092) = .10

At numsess = -1:
p(relapse)/(1 - p(relapse)) = e^(-.186 + 2.092) = 6.73
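The three odds values can be checked directly in R:

> round(exp(-0.1866), 2)            # odds of relapse at numsess = 0
[1] 0.83
> round(exp(-0.1866 - 2.0925), 2)   # odds at numsess = +1
[1] 0.1
> round(exp(-0.1866 + 2.0925), 2)   # odds at numsess = -1
[1] 6.73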
p = e^ŷ / (1 + e^ŷ)

p(relapse) = e^(-.1866 - 2.0925*Numsessions) / (1 + e^(-.1866 - 2.0925*Numsessions))

At mean numsessions (0): p(relapse) = 45%
At high numsessions (+1): p(relapse) = 9%
At low numsessions (-1): p(relapse) = 87%

a <- -.1866; b <- -2.0925
numsessions <- seq(-3,3,by=.01)
p_relapse <- exp(a+b*numsessions)/(1+exp(a+b*numsessions))
plot(numsessions,p_relapse,cex=.5,col="blue")
(Plot: logistic curves with slopes b = -20, -5, -1, and +5.)
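One way such a panel could be produced, reusing the intercept from the fit above (the slope values are taken from the figure labels):

> a <- -.1866
> curve(exp(a-20*x)/(1+exp(a-20*x)), -3, 3, col="blue", ylab="p_relapse")
> for (b in c(-5, -1, 5)) curve(exp(a+b*x)/(1+exp(a+b*x)), add=TRUE)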
Multiple Regression
Multiple Regression

ŷ = a + bx

ŷ = b0 + b1*x1 + b2*x2 + b3*x3 + ...

ŷ = Σi bi*xi + b0
Columns: ID, GPA (College), SAT (%), Recs, GPA (High)

 id coll_gpa    sat recs hs_gpa
  1     3.14  75.40    3   3.46
  2     3.49  74.84    4   3.03
  3     3.39  96.52    3   3.21
  4     3.36  77.88    4   2.80
  5     2.76  66.04    3   1.68
  6     2.79  80.67    2   2.77
  7     3.37  78.81    4   2.28
  8     4.15  96.50    4   3.15
  9     3.43  89.16    3   3.68
 10     3.37  71.15    4   3.46
 11     3.36  80.70    2   3.01
 12     3.04  83.44    1   3.38
 13     4.07  83.85    5   4.15
 14     3.15  72.82    5   3.07
 15     3.32  71.91    4   3.47
 16     2.79  63.72    3   2.00
 17     2.89  57.98    2   2.09
 18     3.91 105.20    3   3.54
 19     3.79 104.60    5   3.88
 20     3.73  77.38    3   3.54

> round(cor(data0),2)
           ID coll_gpa  sat recs hs_gpa
ID       1.00     0.22 0.07 0.05   0.21
coll_gpa 0.22     1.00 0.69 0.52   0.69
sat      0.07     0.69 1.00 0.17   0.61
recs     0.05     0.52 0.17 1.00   0.31
hs_gpa   0.21     0.69 0.61 0.31   1.00
> summary(lm(coll_gpa~sat,data=data0))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.57321    0.44521   3.534 0.002373 **
sat          0.02228    0.00547   4.072 0.000715 ***
---
Residual standard error: 0.3043 on 18 degrees of freedom
Multiple R-squared: 0.4795, Adjusted R-squared: 0.4506
F-statistic: 16.58 on 1 and 18 DF, p-value: 0.0007148

> summary(lm(coll_gpa~sat+recs,data=data0))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.230924   0.396858   3.102 0.006481 **
sat         0.020041   0.004711   4.254 0.000535 ***
recs        0.155883   0.055176   2.825 0.011669 *
---
Residual standard error: 0.2583 on 17 degrees of freedom
Multiple R-squared: 0.6458, Adjusted R-squared: 0.6042
F-statistic: 15.5 on 2 and 17 DF, p-value: 0.0001473

> summary(lm(coll_gpa~sat+hs_gpa,data=data0))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.415248   0.409492   3.456  0.00302 **
sat         0.013800   0.006254   2.207  0.04138 *
hs_gpa      0.272462   0.122609   2.222  0.04013 *
---
Residual standard error: 0.2756 on 17 degrees of freedom
Multiple R-squared: 0.5967, Adjusted R-squared: 0.5492
F-statistic: 12.58 on 2 and 17 DF, p-value: 0.0004445
(Scatterplots: sat vs. hs_gpa; model2$residuals vs. hs_gpa; coll_gpa vs. model2$residuals, i.e. SAT controlling for hs_gpa.)
> round(cor(hs_gpa,model2$residuals),3)
[1] 0
> summary(lm(coll_gpa~model2$resid))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.3650     0.0887  37.938   <2e-16 ***
model2$resid   0.0138     0.0090   1.533    0.143
---
Residual standard error: 0.3967 on 18 degrees of freedom
Multiple R-squared: 0.1155, Adjusted R-squared: 0.06638
F-statistic: 2.351 on 1 and 18 DF, p-value: 0.1426

Standardized Coefficients

beta_i = b_i * (s_i / s_0)

See lm.beta() in package QuantPsyc.

Residual Variance

MS_residual = MS_error = Σ(Y - Ŷ)^2 / (N - p - 1)

Note that this is the square of the residual standard error above, i.e. of the standard error of the estimate (s_Y.X).
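As a check on the formula, the standardized coefficients can be computed by hand (the object name fit is mine; compare with lm.beta()):

> fit <- lm(coll_gpa ~ sat + hs_gpa, data=data0)
> # beta_i = b_i * s_i / s_0, multiplying each slope by its predictor's SD
> # and dividing by the SD of the criterion
> coef(fit)[-1] * sapply(data0[c("sat","hs_gpa")], sd) / sd(data0$coll_gpa)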
> summary(lm(coll_gpa~hs_gpa+sat+recs,data=data0))

Call:
lm(formula = coll_gpa ~ hs_gpa + sat + recs, data = data0)

Residuals:
     Min       1Q   Median       3Q      Max
-0.33996 -0.14539 -0.04915  0.15624  0.45895

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.171032   0.375079   3.122  0.00657 **
hs_gpa      0.200249   0.112204   1.785  0.09328 .
sat         0.014177   0.005519   2.569  0.02060 *
recs        0.130288   0.053883   2.418  0.02790 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2431 on 16 degrees of freedom
Multiple R-squared: 0.7046, Adjusted R-squared: 0.6492
F-statistic: 12.72 on 3 and 16 DF, p-value: 0.0001661

> confint(lm(coll_gpa~hs_gpa+sat+recs,data=data0))
                   2.5 %     97.5 %
(Intercept)  0.375899813 1.96616432
hs_gpa      -0.037613795 0.43811176
sat          0.002477472 0.02587656
recs         0.016061439 0.24451436
R = r_YŶ

This is the multiple correlation coefficient.

> cor(coll_gpa,lm(coll_gpa~sat+hs_gpa)$fitted)
[1] 0.7724583
> cor(coll_gpa,lm(coll_gpa~sat+hs_gpa)$fitted)^2
[1] 0.5966918

> summary(lm(coll_gpa~sat+hs_gpa,data=data0))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.415248   0.409492   3.456  0.00302 **
sat         0.013800   0.006254   2.207  0.04138 *
hs_gpa      0.272462   0.122609   2.222  0.04013 *
---
Residual standard error: 0.2756 on 17 degrees of freedom
Multiple R-squared: 0.5967, Adjusted R-squared: 0.5492
F-statistic: 12.58 on 2 and 17 DF, p-value: 0.0004445
R = r_YŶ

adjR^2 = 1 - (1 - R^2)(N - 1)/(N - p - 1)

F = (R^2 / p) / ((1 - R^2)/(N - p - 1)), with (p, N-p-1) degrees of freedom.

F(f-r, N-f-1) = [(SSR_f - SSR_r)/(f - r)] / [SSE_f/(N - f - 1)]
              = (N - f - 1)(R²_f - R²_r) / [(f - r)(1 - R²_f)]
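Plugging the printed values for the coll_gpa ~ sat + hs_gpa model into these formulas as a check (small discrepancies come from starting with the rounded R²):

> R2 <- 0.5967; N <- 20; p <- 2
> round(1 - (1-R2)*(N-1)/(N-p-1), 4)    # adjusted R-squared
[1] 0.5493
> round((R2/p) / ((1-R2)/(N-p-1)), 2)   # overall F on (2, 17) df
[1] 12.58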
F(f-r, N-f-1) = [(SSR_f - SSR_r)/(f - r)] / [SSE_f/(N - f - 1)]
              = (N - f - 1)(R²_f - R²_r) / [(f - r)(1 - R²_f)]

> length(coef(model2))-1->f
> length(coef(model3))-1->r
> length(data0$coll_gpa)->N
> summary(model2)[8][[1]]->R2f
> summary(model3)[8][[1]]->R2r
>
> (N-f-1)*(R2f-R2r)/((f-r)*(1-R2f))
[1] 3.185082

> anova(model2,model3)
Analysis of Variance Table

Model 1: coll_gpa ~ hs_gpa + sat + recs
Model 2: coll_gpa ~ sat + recs
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)
1     16 0.94582
2     17 1.13410 -1  -0.18828 3.1851 0.09328 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Partial Correlation

r_y1.2 = (r_y1 - r_12 * r_y2) / sqrt((1 - r²_12)(1 - r²_y2))

> round(cor(data1),2)
          icecream drownings heat
icecream      1.00      0.46 0.71
drownings     0.46      1.00 0.58
heat          0.71      0.58 1.00

> bm.partial<-function(x,y,z) {round((cor(x,y)-cor(x,z)*cor(y,z))/sqrt((1-cor(x,z)^2)*(1-cor(y,z)^2)),2)}
> ls()
[1] "bm.partial" "data1"
> bm.partial(data1$icecream,data1$drownings,data1$heat)
[1] 0.08

# Now I am repeating it with the formula from the psych package
> library(psych)
> partial.r(data1,1:2,3)
          icecream drownings
icecream      1.00      0.08
drownings     0.08      1.00

# Note that we obtain the same result by correlating residuals:
> cor(lm(icecream~heat,data=data1)$residuals,lm(drownings~heat,data=data1)$residuals)
[1] 0.0813568
Semi-Partial (Part) Correlation

r_0(1.2) = (r_01 - r_02 * r_12) / sqrt(1 - r²_12)

> round(cor(data2),2)
              racetime practicetime practicetrack
racetime          1.00         0.11          0.06
practicetime      0.11         1.00         -0.91
practicetrack     0.06        -0.91          1.00

> bm.semipartial<-function(x,y,z) {round((cor(x,y)-cor(x,z)*cor(y,z))/sqrt((1-cor(y,z)^2)),2)}
> bm.semipartial(data2$racetime,data2$practicetime,data2$practicetrack)
[1] 0.39

# Note that you get a very similar result by correlating a residual with racetime.
# But in contrast to the partial correlation, only one of the two terms is a residual here.
> cor(data2$racetime,lm(practicetime~practicetrack,data=data2)$residuals)
[1] 0.3969366
Breaking Down the SS

(Venn diagram: the variance of X0 overlapping with predictors X1 and X2; regions a through g mark the unique and shared portions.)
R² can be high while none of the predictors are significant!

> summary(lm(Y2~X3+X4,data=data3))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.4557     0.7796  -1.867   0.0741 .
X3            1.1314     1.6791   0.674   0.5069
X4            0.0366     1.6864   0.022   0.9829
---
Residual standard error: 2.31 on 24 degrees of freedom
Multiple R-squared: 0.7945, Adjusted R-squared: 0.7773
F-statistic: 46.39 on 2 and 24 DF, p-value: 5.68e-09

> cor(data3[,7:9])
          X3        X4        Y2
X3 1.0000000 0.9973899 0.8913316
X4 0.9973899 1.0000000 0.8891501
Y2 0.8913316 0.8891501 1.0000000

(Venn diagram: Y2 with predictors X3 and X4, which overlap almost completely with each other.)
> summary(lm(Y~X1,data=data3))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -2.058     23.520  -0.087   0.9310
X1            14.530      6.237   2.330   0.0282 *
---
Residual standard error: 61.48 on 25 degrees of freedom
Multiple R-squared: 0.1784, Adjusted R-squared: 0.1455
F-statistic: 5.428 on 1 and 25 DF, p-value: 0.02819

> summary(lm(Y~X2,data=data3))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.4663    19.9323  -0.625  0.53736
X2            1.1639     0.3382   3.442  0.00204 **
---
Residual standard error: 55.87 on 25 degrees of freedom
Multiple R-squared: 0.3215, Adjusted R-squared: 0.2944
F-statistic: 11.85 on 1 and 25 DF, p-value: 0.002042

Notice how the coefficient for X2 goes up even as it gets less significant:

> summary(lm(Y~X1+X2,data=data3))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.1999    22.2745  -0.548   0.5890
X1           -0.2571     8.7545  -0.029   0.9768
X2            1.1755     0.5224   2.250   0.0339 *
---
Residual standard error: 57.02 on 24 degrees of freedom
Multiple R-squared: 0.3215, Adjusted R-squared: 0.265
F-statistic: 5.687 on 2 and 24 DF, p-value: 0.009513
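Connecting this back to the Venn diagram: a predictor's unique slice of Y is the gain in R² when it is added last. A minimal sketch for X1's unique contribution, which is essentially zero in these data (both R² values above round to 0.3215):

> r2.both <- summary(lm(Y~X1+X2,data=data3))$r.squared
> r2.x2   <- summary(lm(Y~X2,data=data3))$r.squared
> r2.both - r2.x2   # unique contribution of X1 (its squared semipartial)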
(Scatterplots: X2 vs. X1, and X2.1 (the residuals of X2 on X1) vs. X1.)
> lm(X1~X2,data=data3)$residuals->data3$X1.2

        X1    X2      Y   X2.1   X1.2
X1   1.000 0.751  0.422  0.000  0.661
X2   0.751 1.000  0.567  0.661  0.000
Y    0.422 0.567  1.000  0.378 -0.005
X2.1 0.000 0.661  0.378  1.000 -0.751
X1.2 0.661 0.000 -0.005 -0.751  1.000
> summary(lm(Y~X1.2,data=data3))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  45.2999    13.0536   3.470   0.0019 **
X1.2         -0.2571    10.4135  -0.025   0.9805
---
Residual standard error: 67.83 on 25 degrees of freedom
Multiple R-squared: 2.438e-05, Adjusted R-squared: -0.03997
F-statistic: 0.0006094 on 1 and 25 DF, p-value: 0.9805

> summary(lm(Y~X2.1,data=data3))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  45.2999    12.0834   3.749 0.000941 ***
X2.1          1.1755     0.5752   2.044 0.051662 .
---
Residual standard error: 62.79 on 25 degrees of freedom
Multiple R-squared: 0.1431, Adjusted R-squared: 0.1089
F-statistic: 4.176 on 1 and 25 DF, p-value: 0.05166

> summary(lm(Y~X1+X2.1,data=data3))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.0579    21.8137  -0.094   0.9256
X1           14.5303     5.7842   2.512   0.0191 *
X2.1          1.1755     0.5224   2.250   0.0339 *
---
Residual standard error: 57.02 on 24 degrees of freedom
Multiple R-squared: 0.3215, Adjusted R-squared: 0.265
F-statistic: 5.687 on 2 and 24 DF, p-value: 0.009513
> summary(lm(Y~X1.2+X2.1,data=data3))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  45.2999    10.9740   4.128 0.000381 ***
X1.2         33.2847    13.2500   2.512 0.019134 *
X2.1          2.6663     0.7906   3.372 0.002523 **
---
Residual standard error: 57.02 on 24 degrees of freedom
Multiple R-squared: 0.3215, Adjusted R-squared: 0.265
F-statistic: 5.687 on 2 and 24 DF, p-value: 0.009513

(Venn diagram: 68% of Y's variance is unexplained; X2 contributes about 14% uniquely; X1 and X2 share about 18%.)

Summary of all the models:

Predictors       X1     X1.2    X2     X2.1   Intercept   R²
X1              14.5*                          -2.1       .18
X2                              1.2**         -12.5       .32
X1 and X2        -.3            1.2*          -12.2       .32
X1.2                     -.3                   45.3**     .00
X2.1                                   1.2     45.3**     .14
X1 and X2.1     14.5*                  1.2*    -2.1       .32
X2 and X1.2              -.3    1.2**         -12.5       .32
X1.2 and X2.1           33.3*          2.7**   45.3**     .32
Monday 10/13/2014
> names(model1)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values"
 [6] "assign"        "qr"            "df.residual"   "xlevels"       "call"
[11] "terms"         "model"

> model1$fitted
       1        2        3        4        5        6        7        8        9
3.530163 3.342030 3.420783 3.241401 2.751383 3.228276 3.013893 3.394532 3.626416
      10       11       12       13       14       15       16       17       18
3.530163 3.333280 3.495161 3.832049 3.359531 3.534538 2.891388 2.930764 3.565164
      19       20
3.713920 3.565164

> model1$residuals
           1            2            3            4            5            6
-0.390162625  0.147969637 -0.030783403  0.118598521  0.008617436 -0.438275972
           7            8            9           10           11           12
 0.356107303  0.755467610 -0.196416341 -0.160162625  0.026719974 -0.455161274
          13           14           15           16           17           18
 0.237950722 -0.209531039 -0.214537794 -0.101387968 -0.040764488  0.344836024
          19           20
 0.076080282  0.164836024

> model1$df
[1] 18
WARNING: anova(lm()) partitions the sum of squares sequentially, so the order of the predictors matters!

> summary(lm(y~x1))
(Intercept)   2.4203     0.6679   3.624 0.000462 ***
x1            0.3651     0.1628   2.243 0.027156 *

> summary(lm(y~x2))
(Intercept)   3.1670     0.4361   7.262 9.18e-11 ***
x2            0.4992     0.2605   1.916   0.0582 .

> summary(lm(y~x1+x2))
(Intercept)   2.3509     0.6700   3.509 0.000684 ***
x1            0.2842     0.1781   1.596 0.113830
x2            0.3145     0.2832   1.111 0.269497

> summary(lm(y~x2+x1))
(Intercept)   2.3509     0.6700   3.509 0.000684 ***
x2            0.3145     0.2832   1.111 0.269497
x1            0.2842     0.1781   1.596 0.113830
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.142 on 97 degrees of freedom
Multiple R-squared: 0.06077, Adjusted R-squared: 0.0414
F-statistic: 3.138 on 2 and 97 DF, p-value: 0.0478

In lm() the order does not matter.
> summary(lm(y~x2+x1))
(Intercept)   2.3509     0.6700   3.509 0.000684 ***
x2            0.3145     0.2832   1.111 0.269497
x1            0.2842     0.1781   1.596 0.113830
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.142 on 97 degrees of freedom
Multiple R-squared: 0.06077, Adjusted R-squared: 0.0414
F-statistic: 3.138 on 2 and 97 DF, p-value: 0.0478

Now the order matters!

> anova(lm(y~x1+x2))
          Df Sum Sq Mean Sq F value Pr(>F)
x1         1  49.78  49.775  5.0426 0.0270 *
x2         1  12.17  12.175  1.2334 0.2695
Residuals 97 957.48   9.871

> anova(lm(y~x2+x1))
          Df Sum Sq Mean Sq F value  Pr(>F)
x2         1  36.82  36.819   3.730 0.05636 .
x1         1  25.13  25.131   2.546 0.11383
Residuals 97 957.48   9.871
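If you want order-independent tests of each term (each one tested as if entered last), one standard option, not shown in the original notes, is drop1():

> drop1(lm(y~x1+x2), test="F")   # these F-tests match lm()'s t-tests (t^2 = F)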
predict()

(Plot: Y against X with the fitted regression line, both axes running 0 to 10.)

> model1$call
lm(formula = coll_gpa ~ hs_gpa)
> predict(model1,list(hs_gpa=3.4))
       1
3.503912

> model2$call
lm(formula = coll_gpa ~ hs_gpa + sat + recs, data = data0)
> predict(model2,list(hs_gpa=c(3.4,2.9),sat=c(60,90),recs=c(4,5)))
       1        2
3.223651 3.679125

> predict(model2,list(hs_gpa=c(3.4,2.9),sat=c(60,90),recs=c(4,5)),interval="confidence")
       fit      lwr      upr
1 3.223651 2.908809 3.538494
2 3.679125 3.406056 3.952194

> predict(model2,list(hs_gpa=c(3.4,2.9),sat=c(60,90),recs=c(4,5)),interval="prediction")
       fit      lwr      upr
1 3.223651 2.619679 3.827623
2 3.679125 3.095839 4.262411

For a discussion of prediction vs. confidence intervals see:
http://en.wikipedia.org/wiki/Prediction_interval
abline()

> with(data0,plot(hs_gpa,coll_gpa))
> abline(model1)
scatterplot3d()

http://cran.r-project.org/web/packages/scatterplot3d/index.html
(The easiest thing to do is to install it within R, or download the .zip here.)

library(scatterplot3d)
with(data0,scatterplot3d(sat,recs,coll_gpa))
scaaerplot3d() with(data0,scatterplot3d(sat,recs,coll_gpa,pch=16,color="red"))
scaaerplot3d() with(data0,scatterplot3d(sat,recs,coll_gpa,pch=16,color="red",type="h"))
scatterplot3d()$plane3d()

> model3$call
lm(formula = coll_gpa ~ sat + recs, data = data0)
> with(data0,scatterplot3d(sat,recs,coll_gpa,pch=16,color="red",type="h"))->my3d
> names(my3d)
[1] "xyz.convert" "points3d"    "plane3d"     "box3d"
> my3d$plane3d(model3)
Dummy Coding
Imagine a study with 50 participants split unevenly into 3 groups (X) and measured on a DV Y.

> str(d)
'data.frame': 50 obs. of 2 variables:
 $ x: num 1 1 1 1 1 1 1 1 1 1 ...
 $ y: num 5.94 5.43 1.09 4.69 6.58 5.98 6.97 8.18 5.12 5.62 ...

> summary(d)
       x              y
 Min.   :1.00   Min.   : 1.020
 1st Qu.:1.00   1st Qu.: 5.325
 Median :2.00   Median : 6.510
 Mean   :2.16   Mean   : 6.409
 3rd Qu.:3.00   3rd Qu.: 7.965
 Max.   :3.00   Max.   :10.540

> table(d$x)
 1  2  3
16 10 24

> round(tapply(d$y,d$x,mean),2)
   1    2    3
5.29 7.12 6.85
In this first pass we treat X as if it were a continuous variable.

> summary(lm(y~x,data=d))

Call:
lm(formula = y ~ x, data = d)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.8190     0.7551   6.382 6.52e-08 ***
x             0.7362     0.3237   2.274   0.0275 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.014 on 48 degrees of freedom
Multiple R-squared: 0.09726, Adjusted R-squared: 0.07845
F-statistic: 5.172 on 1 and 48 DF, p-value: 0.02747
> as.factor(d$x)->d$x
> summary(lm(y~x,data=d))

Call:
lm(formula = y ~ x, data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-5.8342 -0.9192  0.2855  1.1108  3.4160

ŷ = b0 + b1*x2 + b2*x3, where x2 and x3 each take the values 0 or 1.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.2950     0.4974  10.645  4.1e-14 ***
x2            1.8290     0.8020   2.280   0.0272 *
x3            1.5592     0.6421   2.428   0.0191 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.1378, Adjusted R-squared: 0.1011
F-statistic: 3.754 on 2 and 47 DF, p-value: 0.03071

> tapply(d$y,d$x,mean)->means
> means[2]-means[1]
    2
1.829
> means[3]-means[1]
       3
1.559167

Now that X is a factor, lm() gives two dummy codes corresponding to the differences in means from Group 1.
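You can also inspect the design matrix that lm() builds from the factor; the first rows (all from Group 1) get 0s on both dummy columns:

> head(model.matrix(~x, data=d))
  (Intercept) x2 x3
1           1  0  0
2           1  0  0
3           1  0  0
4           1  0  0
5           1  0  0
6           1  0  0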
> factor(sample(c("Control","Before","After"),50,replace=T))->d$z
> str(d)
'data.frame': 50 obs. of 3 variables:
 $ x: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ y: num 5.94 5.43 1.09 4.69 6.58 5.98 6.97 8.18 5.12 5.62 ...
 $ z: Factor w/ 3 levels "After","Before",..: 3 3 1 3 2 2 1 3 2 1 ...

> summary(lm(y~z,data=d))

Call:
lm(formula = y ~ z, data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-5.5071 -1.1922  0.2187  1.5417  3.9429

As is illustrated here, the reference group is the one earliest in the alphabet*, which can be arbitrary.
[* on values, not labels; here Level 1 is "After"]

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.5971     0.4654  14.176   <2e-16 ***
zBefore      -0.1784     0.7077  -0.252    0.802
zControl     -0.5033     0.7526  -0.669    0.507
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.133 on 47 degrees of freedom
Multiple R-squared: 0.009436, Adjusted R-squared: -0.03272
F-statistic: 0.2239 on 2 and 47 DF, p-value: 0.8003
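If the alphabetical default is not what you want, relevel() moves a chosen level into the reference position (shown here making Control the baseline):

> d$z <- relevel(d$z, ref="Control")
> levels(d$z)
[1] "Control" "After"   "Before"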
> contrasts(d$x)
  2 3
1 0 0
2 1 0
3 0 1

I want to show you that the contrasts used by R are the same thing as entering your own dummy coding.

> d[d$x==1,4]<-0
> d[d$x==2,4]<-1
> d[d$x==3,4]<-0
> d[d$x==1,5]<-0
> d[d$x==2,5]<-0
> d[d$x==3,5]<-1
> names(d)[4:5]<-c("myx2","myx3")   # name the new columns

ŷ = b0 + b1*x2 + b2*x3, where x2 and x3 each take the values 0 or 1.

> str(d)
'data.frame': 50 obs. of 5 variables:
 $ x   : Factor w/ 3 levels "1","2","3": 1 1 ...
 $ y   : num 5.94 5.43 1.09 4.69 6.58 ...
 $ z   : Factor w/ 3 levels "After","Before",..
 $ myx2: num 0 0 0 0 0 0 0 0 0 0 ...
 $ myx3: num 0 0 0 0 0 0 0 0 0 0 ...

> d
   x     y       z myx2 myx3
1  1  5.94 Control    0    0
2  1  5.43 Control    0    0
3  1  1.09   After    0    0
4  1  4.69 Control    0    0
5  1  6.58  Before    0    0
6  1  5.98  Before    0    0
7  1  6.97   After    0    0
8  1  8.18 Control    0    0
9  1  5.12  Before    0    0
10 1  5.62   After    0    0
11 1  4.78   After    0    0
12 1  8.60   After    0    0
13 1  3.63  Before    0    0
14 1  6.17   After    0    0
15 1  3.32  Before    0    0
16 1  2.62   After    0    0
17 2  7.37 Control    1    0
18 2  5.29   After    1    0
19 2  8.02   After    1    0
20 2  6.61   After    1    0
21 2  8.00  Before    1    0
22 2 10.54   After    1    0
23 2  7.10   After    1    0
24 2  6.30  Before    1    0
25 2  4.15   After    1    0
26 2  7.86 Control    1    0
27 3  7.77  Before    0    1
28 3  8.82   After    0    1
(...)
These two numerical dummy codes give the same result as X as a factor.

> summary(lm(y~myx2+myx3,data=d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.2950     0.4974  10.645  4.1e-14 ***
myx2          1.8290     0.8020   2.280   0.0272 *
myx3          1.5592     0.6421   2.428   0.0191 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.1378, Adjusted R-squared: 0.1011
F-statistic: 3.754 on 2 and 47 DF, p-value: 0.03071

> summary(lm(y~x,data=d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.2950     0.4974  10.645  4.1e-14 ***
x2            1.8290     0.8020   2.280   0.0272 *
x3            1.5592     0.6421   2.428   0.0191 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.1378, Adjusted R-squared: 0.1011
F-statistic: 3.754 on 2 and 47 DF, p-value: 0.03071

ŷ = b0 + b1*x2 + b2*x3, where x2 and x3 each take the values 0 or 1.
R lets you remove the intercept; all 3 means are then tested against zero, using the residual s.e.

> summary(lm(y~x-1,data=d))

Call:
lm(formula = y ~ x - 1, data = d)

Coefficients:
   Estimate Std. Error t value Pr(>|t|)
x1   5.2950     0.4974   10.64  4.1e-14 ***
x2   7.1240     0.6292   11.32  5.0e-15 ***
x3   6.8542     0.4061   16.88  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.918, Adjusted R-squared: 0.9128
F-statistic: 175.4 on 3 and 47 DF, p-value: < 2.2e-16

> means
       1        2        3
5.295000 7.124000 6.854167
> 1.99/sqrt(tapply(d$y,d$x,length))
        1         2         3
0.4975000 0.6292933 0.4062070
On the other hand, the model with only "1" has only an intercept; in other words, the grand mean.

> summary(lm(y~1,data=d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4092     0.2968    21.6   <2e-16 ***

Residual standard error: 2.098 on 49 degrees of freedom

> mean(d$y)
[1] 6.4092
> sd(d$y)
[1] 2.09849
> sd(d$y)/sqrt(49)
[1] 0.2997843
> sd(d$y)/sqrt(50)
[1] 0.2967713
R uses the contrasts() command to specify how categorical variables should be handled. Traditionally this transformation of categorical variables with k values (k>2) into k-1 numerical variables is called dummy coding, of which there are 3 major types:

1 Dummy coding
2 Effect coding
3 Contrast coding
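For reference, R also ships generator functions for two of these schemes (note that contr.sum() places the -1 row on the last group, whereas the effect coding used below places it on Group 1):

> contr.treatment(3)   # dummy coding, the default
  2 3
1 0 0
2 1 0
3 0 1
> contr.sum(3)         # effect coding
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1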
Let me first remind you quickly how we can make a matrix from vectors using rbind() or cbind().

> c('A1','A2','A3')->A
> c('B1','B2','B3')->B

> rbind(A,B)  #Bind as rows
  [,1] [,2] [,3]
A "A1" "A2" "A3"
B "B1" "B2" "B3"

> cbind(A,B)  #Bind as columns
     A    B
[1,] "A1" "B1"
[2,] "A2" "B2"
[3,] "A3" "B3"
Dummy coding can be simply adjusted by inputting a new matrix of codes into contrasts(). Here this is the default matrix.

> #Dummy coding (default)
> contrasts(d$x)
  2 3
1 0 0
2 1 0
3 0 1

> summary(lm(y~x,data=d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.2950     0.4974  10.645  4.1e-14 ***
x2            1.8290     0.8020   2.280   0.0272 *
x3            1.5592     0.6421   2.428   0.0191 *

Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.1378, Adjusted R-squared: 0.1011
F-statistic: 3.754 on 2 and 47 DF, p-value: 0.03071

> means[2]-means[1]
1.829
> means[3]-means[1]
1.559167
Even if you stick to simple dummy coding, you can change which group is the reference group.

> #Dummy coding with a different reference group
> contrasts(d$x)<-cbind(c(1,0,0),c(0,0,1))
> contrasts(d$x)
  [,1] [,2]
1    1    0
2    0    0
3    0    1

> summary(lm(y~x,data=d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.1240     0.6292   11.32    5e-15 ***
x1           -1.8290     0.8020   -2.28   0.0272 *
x2           -0.2698     0.7489   -0.36   0.7202
---
Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.1378, Adjusted R-squared: 0.1011
F-statistic: 3.754 on 2 and 47 DF, p-value: 0.03071

> means[1]-means[2]
-1.829
> means[3]-means[2]
-0.2698333
> cbind(c(1,0,0),c(0,1,0),c(0,0,1))->C
> contrasts(d$x)<-C
> contrasts(d$x)
  [,1] [,2]
1    1    0
2    0    1
3    0    0

Notice what happens when you try to put in more than (k-1) dummy codes: only the first k-1 columns are kept.
> #Dummy coding (default)
> contrasts(d$x)
  2 3
1 0 0
2 1 0
3 0 1

> #Effect coding
> contrasts(d$x)<-cbind(c(-1,1,0),c(-1,0,1))
> contrasts(d$x)
  [,1] [,2]
1   -1   -1
2    1    0
3    0    1
Effect coding tests departures from the unweighted grand mean.

> #Effect coding
> contrasts(d$x)<-cbind(c(-1,1,0),c(-1,0,1))
> summary(lm(y~x,data=d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4244     0.2997  21.438   <2e-16 ***
x1            0.6996     0.4709   1.486    0.144
x2            0.4298     0.3805   1.129    0.264
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.1378, Adjusted R-squared: 0.1011
F-statistic: 3.754 on 2 and 47 DF, p-value: 0.03071

> mean(d$y)
[1] 6.4092
> mean(means)
[1] 6.424389
> means[2]-mean(means)
0.6996111
> means[3]-mean(means)
0.4297778
Contrast coding is best to capture planned contrasts, a priori predictions you have made about the pattern of your means.

(Figure: two contrast patterns over three group means, with weights -2, +1, +1 and 0, -1, +1.)
Rules for contrast weights

Contrast1 = a1.1*x1 + a1.2*x2 + ... = Σi a1.i*xi
Contrast2 = a2.1*x1 + a2.2*x2 + ... = Σi a2.i*xi

1 Weights sum to zero: Σi aj.i = 0 (summing i from 1 to k)
2 Orthogonal contrasts: Σi a1.i*a2.i = 0
3 With k groups there are (k-1) orthogonal contrasts
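A quick check of rules 1 and 2 for the weights used below:

> a1 <- c(-2,1,1); a2 <- c(0,-1,1)
> sum(a1); sum(a2)   # rule 1: each contrast's weights sum to zero
[1] 0
[1] 0
> sum(a1*a2)         # rule 2: the two contrasts are orthogonal
[1] 0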
> #Dummy coding (default)
> contrasts(d$x)
  2 3
1 0 0
2 1 0
3 0 1

> #Effect coding
> contrasts(d$x)<-cbind(c(-1,1,0),c(-1,0,1))
> contrasts(d$x)
  [,1] [,2]
1   -1   -1
2    1    0
3    0    1

> #Contrast coding
> contrasts(d$x)<-cbind(c(-2,1,1),c(0,-1,1))
> contrasts(d$x)
  [,1] [,2]
1   -2    0
2    1   -1
3    1    1
Contrast coding tests more surgical a priori predictions.

> #Contrast coding
> contrasts(d$x)<-cbind(c(-2,1,1),c(0,-1,1))
> summary(lm(y~x,data=d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4244     0.2997  21.438   <2e-16 ***
x1            0.5647     0.2075   2.721   0.0091 **
x2           -0.1349     0.3744  -0.360   0.7202
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.1378, Adjusted R-squared: 0.1011
F-statistic: 3.754 on 2 and 47 DF, p-value: 0.03071

> mean(means)
[1] 6.424389
> (-2)^2+(+1)^2+(+1)^2
[1] 6
> (-2*means[1]+means[2]+means[3])/6
0.5646944
> (-means[2]+means[3])/2
-0.1349167
Contrast coding tests more surgical a priori predictions, and can be more complicated:

             Control  Threat 1  Threat 2  Self-Affirmation
Contrast 1        -3        +1        +1        +1
Contrast 2         0        -1        -1        +2
Contrast 3         0        -1        +1         0
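Checking that this 4-group set satisfies the rules; crossprod() gives all pairwise sums of products, so the zero off-diagonals confirm orthogonality:

> w <- cbind(c(-3,1,1,1), c(0,-1,-1,2), c(0,-1,1,0))  # one column per contrast
> colSums(w)    # rule 1: each contrast sums to zero
[1] 0 0 0
> crossprod(w)  # rule 2: off-diagonal zeros mean mutual orthogonality
     [,1] [,2] [,3]
[1,]   12    0    0
[2,]    0    6    0
[3,]    0    0    2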