Binomial Logistic Regression with glm()


Friday 10/10/2014

Binomial Logistic Regression with glm()

> plot(x,y)
> abline(reg=lm(y~x))

Binomial Logistic Regression

numsessions  relapse
      -1.74  No relapse
       1.15  No relapse
       1.87  No relapse
        .62  No relapse
       -.47  Relapse
        .88  No relapse
       -.99  Relapse
        .81  No relapse
        .44  No relapse
      -1.35  Relapse
        .52  No relapse
        .14  Relapse
       -.49  Relapse
        .60  No relapse
       -.03  No relapse
       -.43  Relapse
       -.94  Relapse
       -.06  Relapse
       -.84  Relapse
      -1.81  Relapse
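For readers who want to reproduce the glm() call shown below, here is a minimal sketch that rebuilds this small data set in R. The variable names numsessions and relapse follow the slide; the 0/1 coding of relapse (1 = Relapse, 0 = No relapse) is an assumption.

# Minimal sketch: rebuild the data set (assumption: relapse coded 1 = Relapse, 0 = No relapse)
numsessions <- c(-1.74, 1.15, 1.87, .62, -.47, .88, -.99, .81, .44, -1.35,
                 .52, .14, -.49, .60, -.03, -.43, -.94, -.06, -.84, -1.81)
relapse <- c(0, 0, 0, 0, 1, 0, 1, 0, 0, 1,
             0, 1, 1, 0, 0, 1, 1, 1, 1, 1)
fit <- glm(relapse ~ numsessions, family = "binomial")
summary(fit)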

The Logistic Function

p = e^x / (1 + e^x)

(Plot: the logistic curve, rising from 0 to 1 as x increases.)

The Exponential Function e^x

e = 2.718282
log(e^x) = x
e^log(x) = x
e^x * e^y = e^(x+y)
e^0 = 1

> curve(exp(x),-5,+5)
> rbind(-3:3,round(exp(-3:3),2))
      [,1]  [,2]  [,3] [,4] [,5] [,6]  [,7]
[1,] -3.00 -2.00 -1.00    0 1.00 2.00  3.00
[2,]  0.05  0.14  0.37    1 2.72 7.39 20.09

The Logistic Function

p = e^ŷ / (1 + e^ŷ)

Solving for ŷ:

p (1 + e^ŷ) = e^ŷ
p + p e^ŷ = e^ŷ
p = e^ŷ (1 - p)
p / (1 - p) = e^ŷ
log( p / (1 - p) ) = ŷ

logit = log( p / (1 - p) )

(Plot: the logistic curve p = e^x / (1 + e^x), bounded between 0 and 1.)
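As a side note (not on the slide), base R already provides this pair of functions: plogis() is the logistic function and qlogis() is the logit, so the algebra above can be checked numerically:

# plogis(x) = exp(x)/(1+exp(x));  qlogis(p) = log(p/(1-p))
x <- seq(-3, 3, by = 0.5)
p <- plogis(x)                       # logistic transform
all.equal(qlogis(p), x)              # the logit undoes it -> TRUE
all.equal(log(p/(1-p)), qlogis(p))   # same as the hand formula -> TRUE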

> glm(relapse~numsessions,family="binomial")

Call:  glm(formula = relapse ~ numsessions, family = "binomial")

Coefficients:
(Intercept)  numsessions
    -0.1866      -2.0925

Degrees of Freedom: 19 Total (i.e. Null);  18 Residual
Null Deviance:     27.73
Residual Deviance: 17.64    AIC: 21.64

log( p(relapse) / (1 - p(relapse)) ) = -0.186 - 2.092 * numsessions

At numsessions = 0 (i.e. at the mean):
  log( p(relapse) / (1 - p(relapse)) ) = -0.186, so p(relapse) / (1 - p(relapse)) = e^(-0.186) = 0.83

At numsessions = +1:
  p(relapse) / (1 - p(relapse)) = e^(-0.186 - 2.092) = 0.10

At numsessions = -1:
  p(relapse) / (1 - p(relapse)) = e^(-0.186 + 2.092) = 6.73

p = e^ŷ / (1 + e^ŷ)

p(relapse) = e^(-0.1866 - 2.0925*numsessions) / (1 + e^(-0.1866 - 2.0925*numsessions))

At mean numsessions (0):  p(relapse) = 45%
At high numsessions (+1): p(relapse) = 9%
At low numsessions (-1):  p(relapse) = 87%

a <- -.1866; b <- -2.0925
numsessions <- seq(-3,3,by=.01)
p_relapse <- exp(a+b*numsessions)/(1+exp(a+b*numsessions))
plot(numsessions,p_relapse,cex=.5,col="blue")

(Plots: logistic curves for different slopes, b = -20, b = -5, b = -1, b = +5.)
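A minimal sketch (not from the slide) that draws curves like the ones pictured, one per slope value, with the intercept fixed at 0 for illustration:

# Logistic curves for several slopes b
b_values <- c(-20, -5, -1, +5)
curve(plogis(b_values[1] * x), -3, 3, ylab = "p", col = 1)
for (i in 2:length(b_values)) {
  curve(plogis(b_values[i] * x), -3, 3, add = TRUE, col = i)
}
legend("topright", legend = paste("b =", b_values), col = 1:4, lty = 1)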

Multiple Regression

Multiple Regression

Simple regression:   ŷ = a + bx
Multiple regression: ŷ = b0 + b1*x1 + b2*x2 + b3*x3 + ...

In general: ŷ = Σ bi*xi + b0

(Variables: id = ID; coll_gpa = college GPA; sat = SAT percentile; recs = recommendations; hs_gpa = high-school GPA)

id  coll_gpa     sat  recs  hs_gpa
 1      3.14   75.40     3    3.46
 2      3.49   74.84     4    3.03
 3      3.39   96.52     3    3.21
 4      3.36   77.88     4    2.80
 5      2.76   66.04     3    1.68
 6      2.79   80.67     2    2.77
 7      3.37   78.81     4    2.28
 8      4.15   96.50     4    3.15
 9      3.43   89.16     3    3.68
10      3.37   71.15     4    3.46
11      3.36   80.70     2    3.01
12      3.04   83.44     1    3.38
13      4.07   83.85     5    4.15
14      3.15   72.82     5    3.07
15      3.32   71.91     4    3.47
16      2.79   63.72     3    2.00
17      2.89   57.98     2    2.09
18      3.91  105.20     3    3.54
19      3.79  104.60     5    3.88
20      3.73   77.38     3    3.54

> round(cor(data0),2)
           ID coll_gpa  sat recs hs_gpa
ID       1.00     0.22 0.07 0.05   0.21
coll_gpa 0.22     1.00 0.69 0.52   0.69
sat      0.07     0.69 1.00 0.17   0.61
recs     0.05     0.52 0.17 1.00   0.31
hs_gpa   0.21     0.69 0.61 0.31   1.00

> summary(lm(coll_gpa~sat,data=data0))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.57321    0.44521   3.534 0.002373 **
sat          0.02228    0.00547   4.072 0.000715 ***
---
Residual standard error: 0.3043 on 18 degrees of freedom
Multiple R-squared: 0.4795,  Adjusted R-squared: 0.4506
F-statistic: 16.58 on 1 and 18 DF,  p-value: 0.0007148

> summary(lm(coll_gpa~sat+recs,data=data0))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.230924   0.396858   3.102 0.006481 **
sat         0.020041   0.004711   4.254 0.000535 ***
recs        0.155883   0.055176   2.825 0.011669 *
---
Residual standard error: 0.2583 on 17 degrees of freedom
Multiple R-squared: 0.6458,  Adjusted R-squared: 0.6042
F-statistic: 15.5 on 2 and 17 DF,  p-value: 0.0001473

> summary(lm(coll_gpa~sat+hs_gpa,data=data0))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.415248   0.409492   3.456  0.00302 **
sat         0.013800   0.006254   2.207  0.04138 *
hs_gpa      0.272462   0.122609   2.222  0.04013 *
---
Residual standard error: 0.2756 on 17 degrees of freedom
Multiple R-squared: 0.5967,  Adjusted R-squared: 0.5492
F-statistic: 12.58 on 2 and 17 DF,  p-value: 0.0004445

(Plots: sat against hs_gpa; model2$residuals against hs_gpa; coll_gpa against model2$residuals, i.e. SAT controlling for hs_gpa.)

> round(cor(hs_gpa,model2$residuals),3)
[1] 0
> summary(lm(coll_gpa~model2$resid))
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.3650     0.0887  37.938   <2e-16 ***
model2$resid   0.0138     0.0090   1.533    0.143
---
Residual standard error: 0.3967 on 18 degrees of freedom
Multiple R-squared: 0.1155,  Adjusted R-squared: 0.06638
F-statistic: 2.351 on 1 and 18 DF,  p-value: 0.1426

Standardized Coefficients

βi = bi * si / s0   (si = SD of predictor i, s0 = SD of Y)

See lm.beta() in package QuantPsyc.

Residual Variance

MS_Residual = MS_Error = Σ(Y - Ŷ)² / (N - p - 1)

Note that this is the square of the residual standard error above (also called the standard error of the estimate, s_Y.X).
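A minimal sketch (an alternative to lm.beta(); variable names assume the data0 example above) that obtains the standardized coefficients directly from the formula, and again by fitting the model on z-scored variables:

# Standardized coefficients: beta_i = b_i * sd(x_i) / sd(y)
m <- lm(coll_gpa ~ sat + hs_gpa, data = data0)
b <- coef(m)[-1]                                    # raw slopes
sds <- sapply(data0[c("sat", "hs_gpa")], sd)        # predictor SDs
beta <- b * sds / sd(data0$coll_gpa)
beta
# Equivalently, fit the model on z-scored variables:
coef(lm(scale(coll_gpa) ~ scale(sat) + scale(hs_gpa), data = data0))[-1]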

> summary(lm(coll_gpa~hs_gpa+sat+recs,data=data0))

Call:
lm(formula = coll_gpa ~ hs_gpa + sat + recs, data = data0)

Residuals:
     Min       1Q   Median       3Q      Max
-0.33996 -0.14539 -0.04915  0.15624  0.45895

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.171032   0.375079   3.122  0.00657 **
hs_gpa      0.200249   0.112204   1.785  0.09328 .
sat         0.014177   0.005519   2.569  0.02060 *
recs        0.130288   0.053883   2.418  0.02790 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2431 on 16 degrees of freedom
Multiple R-squared: 0.7046,  Adjusted R-squared: 0.6492
F-statistic: 12.72 on 3 and 16 DF,  p-value: 0.0001661

> confint(lm(coll_gpa~hs_gpa+sat+recs,data=data0))
                   2.5 %     97.5 %
(Intercept)  0.375899813 1.96616432
hs_gpa      -0.037613795 0.43811176
sat          0.002477472 0.02587656
recs         0.016061439 0.24451436

R = r_YŶ   This is the multiple correlation coefficient.

> cor(coll_gpa,lm(coll_gpa~sat+hs_gpa)$fitted)
[1] 0.7724583
> cor(coll_gpa,lm(coll_gpa~sat+hs_gpa)$fitted)^2
[1] 0.5966918

> summary(lm(coll_gpa~sat+hs_gpa,data=data0))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.415248   0.409492   3.456  0.00302 **
sat         0.013800   0.006254   2.207  0.04138 *
hs_gpa      0.272462   0.122609   2.222  0.04013 *
---
Residual standard error: 0.2756 on 17 degrees of freedom
Multiple R-squared: 0.5967,  Adjusted R-squared: 0.5492
F-statistic: 12.58 on 2 and 17 DF,  p-value: 0.0004445

R = r_YŶ

adj R² = 1 - (1 - R²)(N - 1) / (N - p - 1)

F = R²(N - p - 1) / [p(1 - R²)],  with (p, N - p - 1) degrees of freedom.

For comparing a full model (f predictors) to a reduced model (r predictors):

F(f - r, N - f - 1) = [(SSR_f - SSR_r)/(f - r)] / [SSE_f/(N - f - 1)]
                    = (N - f - 1)(R²_f - R²_r) / [(f - r)(1 - R²_f)]
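A minimal sketch (assuming the data0 example above) that recomputes adjusted R² and the overall F from R² by hand and compares them with what summary() reports:

m <- lm(coll_gpa ~ sat + hs_gpa, data = data0)
s <- summary(m)
N <- nrow(data0); p <- 2
R2 <- s$r.squared
1 - (1 - R2) * (N - 1) / (N - p - 1)   # should match s$adj.r.squared
R2 * (N - p - 1) / (p * (1 - R2))      # should match s$fstatistic[1]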

F(f - r, N - f - 1) = [(SSR_f - SSR_r)/(f - r)] / [SSE_f/(N - f - 1)]
                    = (N - f - 1)(R²_f - R²_r) / [(f - r)(1 - R²_f)]

> length(coef(model2))-1->f
> length(coef(model3))-1->r
> length(data0$coll_gpa)->N
> summary(model2)[8][[1]]->R2f
> summary(model3)[8][[1]]->R2r
>
> (N-f-1)*(R2f-R2r)/((f-r)*(1-R2f))
[1] 3.185082

> anova(model2,model3)
Analysis of Variance Table

Model 1: coll_gpa ~ hs_gpa + sat + recs
Model 2: coll_gpa ~ sat + recs
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)
1     16 0.94582
2     17 1.13410 -1  -0.18828 3.1851 0.09328 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Partial Correlation

r_y1.2 = (r_y1 - r_12 * r_y2) / sqrt( (1 - r²_12)(1 - r²_y2) )

> round(cor(data1),2)
          icecream drownings heat
icecream      1.00      0.46 0.71
drownings     0.46      1.00 0.58
heat          0.71      0.58 1.00

> bm.partial<-function(x,y,z) {round((cor(x,y)-cor(x,z)*cor(y,z))/sqrt((1-cor(x,z)^2)*(1-cor(y,z)^2)),2)}
> ls()
[1] "bm.partial" "data1"
> bm.partial(data1$icecream,data1$drownings,data1$heat)
[1] 0.08

# Now I am repeating it with the formula from the psych package
> library(psych)
> partial.r(data1,1:2,3)
          icecream drownings
icecream      1.00      0.08
drownings     0.08      1.00

# Note that we obtain the same result by correlating residuals:
> cor(lm(icecream~heat,data=data1)$residuals,lm(drownings~heat,data=data1)$residuals)
[1] 0.0813568

Semi-Partial (Part) Correlation

r_0(1.2) = (r_01 - r_02 * r_12) / sqrt(1 - r²_12)

> round(cor(data2),2)
              racetime practicetime practicetrack
racetime          1.00         0.11          0.06
practicetime      0.11         1.00         -0.91
practicetrack     0.06        -0.91          1.00

> bm.semipartial<-function(x,y,z) {round((cor(x,y)-cor(x,z)*cor(y,z))/sqrt((1-cor(y,z)^2)),2)}
> bm.semipartial(racetime,practicetime,practicetrack)
[1] 0.39

# Note that you get a very similar result by correlating a residual with racetime.
# But in contrast to the partial correlation, only one of the two terms is a residual here.
> cor(data2$racetime,lm(practicetime~practicetrack,data=data2)$residuals)
[1] 0.3969366

Breaking Down the SS

(Venn diagram: overlapping circles for X0, X1, and X2, with the variance regions labelled a through g.)

R² can be high while none of the predictors are significant!

> summary(lm(Y2~X3+X4,data=data3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.4557     0.7796  -1.867   0.0741 .
X3            1.1314     1.6791   0.674   0.5069
X4            0.0366     1.6864   0.022   0.9829
---
Residual standard error: 2.31 on 24 degrees of freedom
Multiple R-squared: 0.7945,  Adjusted R-squared: 0.7773
F-statistic: 46.39 on 2 and 24 DF,  p-value: 5.68e-09

> cor(data3[,7:9])
          X3        X4        Y2
X3 1.0000000 0.9973899 0.8913316
X4 0.9973899 1.0000000 0.8891501
Y2 0.8913316 0.8891501 1.0000000

(Venn diagram: Y2 overlapping the almost completely shared regions of X3 and X4, labelled a through d.)

> summary(lm(Y~X1,data=data3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -2.058     23.520  -0.087   0.9310
X1            14.530      6.237   2.330   0.0282 *
---
Residual standard error: 61.48 on 25 degrees of freedom
Multiple R-squared: 0.1784,  Adjusted R-squared: 0.1455
F-statistic: 5.428 on 1 and 25 DF,  p-value: 0.02819

> summary(lm(Y~X2,data=data3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.4663    19.9323  -0.625  0.53736
X2            1.1639     0.3382   3.442  0.00204 **
---
Residual standard error: 55.87 on 25 degrees of freedom
Multiple R-squared: 0.3215,  Adjusted R-squared: 0.2944
F-statistic: 11.85 on 1 and 25 DF,  p-value: 0.002042

> summary(lm(Y~X1+X2,data=data3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.1999    22.2745  -0.548   0.5890
X1           -0.2571     8.7545  -0.029   0.9768
X2            1.1755     0.5224   2.250   0.0339 *
---
Residual standard error: 57.02 on 24 degrees of freedom
Multiple R-squared: 0.3215,  Adjusted R-squared: 0.265
F-statistic: 5.687 on 2 and 24 DF,  p-value: 0.009513

Notice how the coefficient for X2 goes up even as it gets less significant.

(Plots: X2 against X1, and X2.1 (the residuals of X2 regressed on X1) against X1.)

> lm(X1~X2,data=data3)$residuals->data3$X1.2

        X1    X2      Y   X2.1   X1.2
X1   1.000 0.751  0.422  0.000  0.661
X2   0.751 1.000  0.567  0.661  0.000
Y    0.422 0.567  1.000  0.378 -0.005
X2.1 0.000 0.661  0.378  1.000 -0.751
X1.2 0.661 0.000 -0.005 -0.751  1.000

> summary(lm(Y~X1.2,data=data3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  45.2999    13.0536   3.470   0.0019 **
X1.2         -0.2571    10.4135  -0.025   0.9805
---
Residual standard error: 67.83 on 25 degrees of freedom
Multiple R-squared: 2.438e-05,  Adjusted R-squared: -0.03997
F-statistic: 0.0006094 on 1 and 25 DF,  p-value: 0.9805

> summary(lm(Y~X2.1,data=data3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  45.2999    12.0834   3.749 0.000941 ***
X2.1          1.1755     0.5752   2.044 0.051662 .
---
Residual standard error: 62.79 on 25 degrees of freedom
Multiple R-squared: 0.1431,  Adjusted R-squared: 0.1089
F-statistic: 4.176 on 1 and 25 DF,  p-value: 0.05166

> summary(lm(Y~X1+X2.1,data=data3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.0579    21.8137  -0.094   0.9256
X1           14.5303     5.7842   2.512   0.0191 *
X2.1          1.1755     0.5224   2.250   0.0339 *
---
Residual standard error: 57.02 on 24 degrees of freedom
Multiple R-squared: 0.3215,  Adjusted R-squared: 0.265
F-statistic: 5.687 on 2 and 24 DF,  p-value: 0.009513

> summary(lm(Y~X1.2+X2.1,data=data3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  45.2999    10.9740   4.128 0.000381 ***
X1.2         33.2847    13.2500   2.512 0.019134 *
X2.1          2.6663     0.7906   3.372 0.002523 **
---
Residual standard error: 57.02 on 24 degrees of freedom
Multiple R-squared: 0.3215,  Adjusted R-squared: 0.265
F-statistic: 5.687 on 2 and 24 DF,  p-value: 0.009513

(Venn diagram: Y, X1, and X2, with regions labelled 68%, 14%, and 18%.)

Predictors        X1     X1.2   X2     X2.1   Intercept   R²
X1                14.5*                        -2.1        .18
X2                              1.2**          -12.5       .32
X1 and X2         -.3          1.2*            -12.2       .32
X1.2                     -.3                    45.3**     .00
X2.1                                   1.2      45.3**     .14
X1 and X2.1       14.5*                1.2*     -2.1        .32
X2 and X1.2              -.3   1.2**            -12.5       .32
X1.2 and X2.1            33.3*         2.7**    45.3**      .32

Monday 10/13/2014

> names(model1)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values"
 [6] "assign"        "qr"            "df.residual"   "xlevels"       "call"
[11] "terms"         "model"

> model1$fitted
       1        2        3        4        5        6        7        8        9
3.530163 3.342030 3.420783 3.241401 2.751383 3.228276 3.013893 3.394532 3.626416
      10       11       12       13       14       15       16       17       18
3.530163 3.333280 3.495161 3.832049 3.359531 3.534538 2.891388 2.930764 3.565164
      19       20
3.713920 3.565164

> model1$residuals
           1            2            3            4            5            6
-0.390162625  0.147969637 -0.030783403  0.118598521  0.008617436 -0.438275972
           7            8            9           10           11           12
 0.356107303  0.755467610 -0.196416341 -0.160162625  0.026719974 -0.455161274
          13           14           15           16           17           18
 0.237950722 -0.209531039 -0.214537794 -0.101387968 -0.040764488  0.344836024
          19           20
 0.076080282  0.164836024

> model1$df
[1] 18

WARNING: anova(lm()) partitions the sum of squares sequentially, so the order of the predictors matters!

> summary(lm(y~x1))
(Intercept)   2.4203     0.6679   3.624 0.000462 ***
x1            0.3651     0.1628   2.243 0.027156 *

> summary(lm(y~x2))
(Intercept)   3.1670     0.4361   7.262 9.18e-11 ***
x2            0.4992     0.2605   1.916   0.0582 .

> summary(lm(y~x1+x2))
(Intercept)   2.3509     0.6700   3.509 0.000684 ***
x1            0.2842     0.1781   1.596 0.113830
x2            0.3145     0.2832   1.111 0.269497

> summary(lm(y~x2+x1))
(Intercept)   2.3509     0.6700   3.509 0.000684 ***
x2            0.3145     0.2832   1.111 0.269497
x1            0.2842     0.1781   1.596 0.113830
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.142 on 97 degrees of freedom
Multiple R-squared: 0.06077,  Adjusted R-squared: 0.0414
F-statistic: 3.138 on 2 and 97 DF,  p-value: 0.0478

In lm() the order does not matter.

> summary(lm(y~x2+x1))
(Intercept)   2.3509     0.6700   3.509 0.000684 ***
x2            0.3145     0.2832   1.111 0.269497
x1            0.2842     0.1781   1.596 0.113830
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.142 on 97 degrees of freedom
Multiple R-squared: 0.06077,  Adjusted R-squared: 0.0414
F-statistic: 3.138 on 2 and 97 DF,  p-value: 0.0478

Now the order matters!

> anova(lm(y~x1+x2))
          Df Sum Sq Mean Sq F value Pr(>F)
x1         1  49.78  49.775  5.0426 0.0270 *
x2         1  12.17  12.175  1.2334 0.2695
Residuals 97 957.48   9.871

> anova(lm(y~x2+x1))
          Df Sum Sq Mean Sq F value  Pr(>F)
x2         1  36.82  36.819   3.730 0.05636 .
x1         1  25.13  25.131   2.546 0.11383
Residuals 97 957.48   9.871
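One way around the order dependence (a side note, not on the slide) is a Type II test, which compares each predictor against the model containing all the others; drop1() in base R does this, and car::Anova() gives an equivalent table:

# Type II sums of squares do not depend on the order of the predictors
m <- lm(y ~ x1 + x2)
drop1(m, test = "F")        # each term dropped from the full model in turn
# library(car); Anova(m)    # equivalent Type II table from the car package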

predict()

(Plot: a fitted regression line, Y against X.)

> model1$call
lm(formula = coll_gpa ~ hs_gpa)
> predict(model1,list(hs_gpa=3.4))
       1
3.503912

> model2$call
lm(formula = coll_gpa ~ hs_gpa + sat + recs, data = data0)
> predict(model2,list(hs_gpa=c(3.4,2.9),sat=c(60,90),recs=c(4,5)))
       1        2
3.223651 3.679125

> predict(model2,list(hs_gpa=c(3.4,2.9),sat=c(60,90),recs=c(4,5)),interval="confidence")
       fit      lwr      upr
1 3.223651 2.908809 3.538494
2 3.679125 3.406056 3.952194

> predict(model2,list(hs_gpa=c(3.4,2.9),sat=c(60,90),recs=c(4,5)),interval="prediction")
       fit      lwr      upr
1 3.223651 2.619679 3.827623
2 3.679125 3.095839 4.262411

For a discussion of prediction vs. confidence intervals see:
http://en.wikipedia.org/wiki/Prediction_interval

abline()

> with(data0,plot(hs_gpa,coll_gpa))
> abline(model1)

scatterplot3d()

http://cran.r-project.org/web/packages/scatterplot3d/index.html
(The easiest thing to do is to install it within R, or download the .zip there.)

> library(scatterplot3d)
> with(data0,scatterplot3d(sat,recs,coll_gpa))

scatterplot3d()

> with(data0,scatterplot3d(sat,recs,coll_gpa,pch=16,color="red"))

scatterplot3d()

> with(data0,scatterplot3d(sat,recs,coll_gpa,pch=16,color="red",type="h"))

scatterplot3d()$plane3d()

> model3$call
lm(formula = coll_gpa ~ sat + recs, data = data0)
> with(data0,scatterplot3d(sat,recs,coll_gpa,pch=16,color="red",type="h"))->my3d
> names(my3d)
[1] "xyz.convert" "points3d"    "plane3d"     "box3d"
> my3d$plane3d(model3)

Dummy Coding

Imagine a study with 50 participants split unevenly into 3 groups (X) and measured on a DV Y.

> str(d)
'data.frame':   50 obs. of  2 variables:
 $ x: num  1 1 1 1 1 1 1 1 1 1 ...
 $ y: num  5.94 5.43 1.09 4.69 6.58 5.98 6.97 8.18 5.12 5.62 ...

> summary(d)
       x              y
 Min.   :1.00   Min.   : 1.020
 1st Qu.:1.00   1st Qu.: 5.325
 Median :2.00   Median : 6.510
 Mean   :2.16   Mean   : 6.409
 3rd Qu.:3.00   3rd Qu.: 7.965
 Max.   :3.00   Max.   :10.540

> table(d$x)
 1  2  3
16 10 24

> round(tapply(d$y,d$x,mean),2)
   1    2    3
5.29 7.12 6.85
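The data frame d itself is not reproduced on the slide. A minimal sketch that simulates a data set with the same structure (group sizes and rough group means are taken from the output above; the within-group SD is an assumption) so that the following models can be tried:

# Simulate a comparable data set: 3 unevenly sized groups, one numeric DV
set.seed(1)
d <- data.frame(x = rep(1:3, times = c(16, 10, 24)))
d$y <- rnorm(50, mean = c(5.3, 7.1, 6.9)[d$x], sd = 2)  # assumed within-group SD
str(d)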

In this first pass we treat X as if it were a continuous variable.

> summary(lm(y~x,data=d))

Call:
lm(formula = y ~ x, data = d)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.8190     0.7551   6.382 6.52e-08 ***
x             0.7362     0.3237   2.274   0.0275 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.014 on 48 degrees of freedom
Multiple R-squared: 0.09726,  Adjusted R-squared: 0.07845
F-statistic: 5.172 on 1 and 48 DF,  p-value: 0.02747

> as.factor(d$x)->d$x
> summary(lm(y~x,data=d))

Call:
lm(formula = y ~ x, data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-5.8342 -0.9192  0.2855  1.1108  3.4160

ŷ = b0 + b1*x2 + b2*x3,  where x2 and x3 each take the values 0 or 1.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.2950     0.4974  10.645  4.1e-14 ***
x2            1.8290     0.8020   2.280   0.0272 *
x3            1.5592     0.6421   2.428   0.0191 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.1378,  Adjusted R-squared: 0.1011
F-statistic: 3.754 on 2 and 47 DF,  p-value: 0.03071

> tapply(d$y,d$x,mean)->means
> means[2]-means[1]
    2
1.829
> means[3]-means[1]
       3
1.559167

Now that X is a factor, lm() gives two dummy codes corresponding to the differences in means from Group 1.

> factor(sample(c("Control","Before","After"),50,replace=T))->d$z
> str(d)
'data.frame':   50 obs. of  3 variables:
 $ x: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ y: num  5.94 5.43 1.09 4.69 6.58 5.98 6.97 8.18 5.12 5.62 ...
 $ z: Factor w/ 3 levels "After","Before",..: 3 3 1 3 2 2 1 3 2 1 ...

> summary(lm(y~z,data=d))

Call:
lm(formula = y ~ z, data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-5.5071 -1.1922  0.2187  1.5417  3.9429

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.5971     0.4654  14.176   <2e-16 ***
zBefore      -0.1784     0.7077  -0.252    0.802
zControl     -0.5033     0.7526  -0.669    0.507
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.133 on 47 degrees of freedom
Multiple R-squared: 0.009436,  Adjusted R-squared: -0.03272
F-statistic: 0.2239 on 2 and 47 DF,  p-value: 0.8003

As illustrated here, the reference group is the one earliest in the alphabet*, which can be arbitrary.
[* based on the values, not the labels; here the reference level is "After"]

> contrasts(d$x)
  2 3
1 0 0
2 1 0
3 0 1

> d[d$x==1,4]<-0
> d[d$x==2,4]<-1
> d[d$x==3,4]<-0
> d[d$x==1,5]<-0
> d[d$x==2,5]<-0
> d[d$x==3,5]<-1

ŷ = b0 + b1*x2 + b2*x3,  where x2 and x3 each take the values 0 or 1.

> str(d)
'data.frame':   50 obs. of  5 variables:
 $ x   : Factor w/ 3 levels "1","2","3": 1 1 ...
 $ y   : num  5.94 5.43 1.09 4.69 6.58 ...
 $ z   : Factor w/ 3 levels "After","Before",..
 $ myx2: num  0 0 0 0 0 0 0 0 0 0 ...
 $ myx3: num  0 0 0 0 0 0 0 0 0 0 ...

I want to show you that the contrasts used by R are the same thing as if you had entered your own dummy coding.

> d
   x     y       z myx2 myx3
1  1  5.94 Control    0    0
2  1  5.43 Control    0    0
3  1  1.09   After    0    0
4  1  4.69 Control    0    0
5  1  6.58  Before    0    0
6  1  5.98  Before    0    0
7  1  6.97   After    0    0
8  1  8.18 Control    0    0
9  1  5.12  Before    0    0
10 1  5.62   After    0    0
11 1  4.78   After    0    0
12 1  8.60   After    0    0
13 1  3.63  Before    0    0
14 1  6.17   After    0    0
15 1  3.32  Before    0    0
16 1  2.62   After    0    0
17 2  7.37 Control    1    0
18 2  5.29   After    1    0
19 2  8.02   After    1    0
20 2  6.61   After    1    0
21 2  8.00  Before    1    0
22 2 10.54   After    1    0
23 2  7.10   After    1    0
24 2  6.30  Before    1    0
25 2  4.15   After    1    0
26 2  7.86 Control    1    0
27 3  7.77  Before    0    1
28 3  8.82   After    0    1
(...)

> summary(lm(y~myx2+myx3,data=d))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.2950     0.4974  10.645  4.1e-14 ***
myx2          1.8290     0.8020   2.280   0.0272 *
myx3          1.5592     0.6421   2.428   0.0191 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.1378,  Adjusted R-squared: 0.1011
F-statistic: 3.754 on 2 and 47 DF,  p-value: 0.03071

> summary(lm(y~x,data=d))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.2950     0.4974  10.645  4.1e-14 ***
x2            1.8290     0.8020   2.280   0.0272 *
x3            1.5592     0.6421   2.428   0.0191 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.1378,  Adjusted R-squared: 0.1011
F-statistic: 3.754 on 2 and 47 DF,  p-value: 0.03071

These two numerical dummy codes give the same result as X as a factor.

ŷ = b0 + b1*x2 + b2*x3,  where x2 and x3 each take the values 0 or 1.
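A quick way to see the codes R actually builds (not shown on the slide) is model.matrix(), which expands the factor into the design matrix used by lm():

# The design matrix lm() uses: an intercept column plus the k-1 dummy columns
head(model.matrix(~ x, data = d))
# compare with the hand-built columns
head(d[, c("myx2", "myx3")])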

> summary(lm(y~x-1,data=d))

Call:
lm(formula = y ~ x - 1, data = d)

Coefficients:
   Estimate Std. Error t value Pr(>|t|)
x1   5.2950     0.4974   10.64  4.1e-14 ***
x2   7.1240     0.6292   11.32  5.0e-15 ***
x3   6.8542     0.4061   16.88  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.918,  Adjusted R-squared: 0.9128
F-statistic: 175.4 on 3 and 47 DF,  p-value: < 2.2e-16

> means
       1        2        3
5.295000 7.124000 6.854167
> 1.99/sqrt(tapply(d$y,d$x,length))
        1         2         3
0.4975000 0.6292933 0.4062070

R lets you remove the intercept; all 3 means are now tested against zero, using the residual s.e.

> summary(lm(y~1,data=d))

On the other hand, the model with only a 1 on the right-hand side has only an intercept, in other words the grand mean.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4092     0.2968    21.6   <2e-16 ***

Residual standard error: 2.098 on 49 degrees of freedom

> mean(d$y)
[1] 6.4092
> sd(d$y)
[1] 2.09849
> sd(d$y)/sqrt(49)
[1] 0.2997843
> sd(d$y)/sqrt(50)
[1] 0.2967713

R uses the contrasts() command to specify how categorical variables should be handled. Traditionally, this transformation of a categorical variable with k values (k > 2) into k-1 numerical variables is called dummy coding, of which there are 3 major types:

1 Dummy coding
2 Effect coding
3 Contrast coding
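As a side note (not from the slide), R ships with ready-made coding matrices corresponding to these schemes: contr.treatment() is dummy coding, contr.sum() is effect coding, and contr.helmert() is one built-in set of orthogonal contrasts.

contr.treatment(3)  # dummy coding: each level vs. the first (reference) level
contr.sum(3)        # effect coding: each level vs. the unweighted grand mean
contr.helmert(3)    # one built-in set of orthogonal contrasts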

Let me first remind you quickly how we can make a matrix from vectors using rbind() or cbind().

> c('A1','A2','A3')->a
> c('B1','B2','B3')->b

> rbind(a,b)   # bind as rows
  [,1] [,2] [,3]
a "A1" "A2" "A3"
b "B1" "B2" "B3"

> cbind(a,b)   # bind as columns
     a    b
[1,] "A1" "B1"
[2,] "A2" "B2"
[3,] "A3" "B3"

Dummy coding can be adjusted simply by feeding a new matrix of codes into contrasts(). Here is the default matrix:

> #Dummy coding (default)
> contrasts(d$x)
  2 3
1 0 0
2 1 0
3 0 1

> summary(lm(y~x,data=d))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.2950     0.4974  10.645  4.1e-14 ***
x2            1.8290     0.8020   2.280   0.0272 *
x3            1.5592     0.6421   2.428   0.0191 *

Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.1378,  Adjusted R-squared: 0.1011
F-statistic: 3.754 on 2 and 47 DF,  p-value: 0.03071

> means[2]-means[1]
1.829
> means[3]-means[1]
1.559167

Even if you stick to simple dummy coding, you can change which group is the reference group.

> contrasts(d$x)<-cbind(c(1,0,0),c(0,0,1))
> contrasts(d$x)
  [,1] [,2]
1    1    0
2    0    0
3    0    1

> summary(lm(y~x,data=d))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.1240     0.6292   11.32    5e-15 ***
x1           -1.8290     0.8020   -2.28   0.0272 *
x2           -0.2698     0.7489   -0.36   0.7202
---
Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.1378,  Adjusted R-squared: 0.1011
F-statistic: 3.754 on 2 and 47 DF,  p-value: 0.03071

> means[1]-means[2]
-1.829
> means[3]-means[2]
-0.2698333

Notice what happens when you try to put in more than (k-1) dummy codes.

> cbind(c(1,0,0),c(0,1,0),c(0,0,1))->c
> contrasts(d$x)<-c
> contrasts(d$x)
  [,1] [,2]
1    1    0
2    0    1
3    0    0

> #Dummy coding (default)
> contrasts(d$x)
  2 3
1 0 0
2 1 0
3 0 1

> #Effect coding
> contrasts(d$x)<-cbind(c(-1,1,0),c(-1,0,1))
> contrasts(d$x)
  [,1] [,2]
1   -1   -1
2    1    0
3    0    1

> #Effect coding
> contrasts(d$x)<-cbind(c(-1,1,0),c(-1,0,1))
> summary(lm(y~x,data=d))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4244     0.2997  21.438   <2e-16 ***
x1            0.6996     0.4709   1.486    0.144
x2            0.4298     0.3805   1.129    0.264
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.1378,  Adjusted R-squared: 0.1011
F-statistic: 3.754 on 2 and 47 DF,  p-value: 0.03071

> mean(d$y)
[1] 6.4092
> mean(means)
[1] 6.424389
> means[2]-mean(means)
0.6996111
> means[3]-mean(means)
0.4297778

Effect coding tests departures from the unweighted grand mean.

Contrast coding is best suited to capture planned contrasts: a priori predictions you have made about the pattern of your means.

Example weights for 3 groups:
  Contrast 1:  -2  +1  +1
  Contrast 2:   0  -1  +1

Rules for contrast weights

Contrast 1 = a_1.1*x1 + a_1.2*x2 + ... = Σ a_1.i * xi
Contrast 2 = a_2.1*x1 + a_2.2*x2 + ... = Σ a_2.i * xi

1  Weights sum to zero:    Σ (i = 1..k) a_j.i = 0
2  Orthogonal contrasts:   Σ (i = 1..k) a_1.i * a_2.i = 0
3  With k groups there are (k-1) orthogonal contrasts.
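A minimal sketch checking these rules for the weights used below (-2, +1, +1 and 0, -1, +1):

c1 <- c(-2, 1, 1)
c2 <- c(0, -1, 1)
sum(c1); sum(c2)   # rule 1: each set of weights sums to zero
sum(c1 * c2)       # rule 2: the two contrasts are orthogonal (sum of products = 0)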

> #Dummy coding (default)
> contrasts(d$x)
  2 3
1 0 0
2 1 0
3 0 1

> #Effect coding
> contrasts(d$x)<-cbind(c(-1,1,0),c(-1,0,1))
> contrasts(d$x)
  [,1] [,2]
1   -1   -1
2    1    0
3    0    1

> #Contrast coding
> contrasts(d$x)<-cbind(c(-2,1,1),c(0,-1,1))
> contrasts(d$x)
  [,1] [,2]
1   -2    0
2    1   -1
3    1    1

> #Contrast coding
> contrasts(d$x)<-cbind(c(-2,1,1),c(0,-1,1))
> summary(lm(y~x,data=d))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4244     0.2997  21.438   <2e-16 ***
x1            0.5647     0.2075   2.721   0.0091 **
x2           -0.1349     0.3744  -0.360   0.7202
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.99 on 47 degrees of freedom
Multiple R-squared: 0.1378,  Adjusted R-squared: 0.1011
F-statistic: 3.754 on 2 and 47 DF,  p-value: 0.03071

> mean(means)
[1] 6.424389
> (-2)^2+(+1)^2+(+1)^2
[1] 6
> (-2*means[1]+means[2]+means[3])/6
0.5646944
> (-means[2]+means[3])/2
-0.1349167

Contrast coding tests more surgical a priori predictions.

Contrast coding tests more surgical a priori predictions, and can get more complicated. For example, with four groups:

              Control  Threat 1  Threat 2  Self-Affirmation
Contrast 1       -3       +1        +1           +1
Contrast 2        0       -1        -1           +2
Contrast 3        0       -1        +1            0