
Part II – Oneway Anova, Simple Linear Regression and ANCOVA with R

Gilles Lamothe

February 21, 2017

Contents

1 Anova with one factor
  1.1 The data
  1.2 A visual description
  1.3 The aov function
  1.4 Pairwise comparisons with Tukey's method
  1.5 Comparison to a control – Dunnett's test
  1.6 Recode a vector with car
  1.7 Test for trend
  1.8 Fitting a polynomial response
  1.9 Optimizing a function
2 Simple Linear Regression
  2.1 Test for lack-of-fit
3 Applying our new skills
4 General Linear Model – ANCOVA
  4.1 Regression by groups
  4.2 A general linear model
  4.3 Test for lack-of-fit
  4.4 Comparing the fitness status groups
  4.5 Comparing the effect of age according to fitness status
5 Applying our new skills

Intro to R Workshop - Psychology Statistics Club

1 Anova with one factor

1.1 The data

Consider the data in the file Fish.txt. (Source: Design and Analysis of Experiments, by Dean and Voss.)

Description: The response variable in this experiment is hemoglobin in the blood of brown trout (in grams per 100 ml). The trout were randomly separated into four troughs. The food given to the troughs contained 0, 5, 10 and 15 grams of sulfamerazine per 100 lbs of fish (coded 1, 2, 3, 4). Measurements were taken on 10 randomly chosen fish from each trough after 35 days.

We import the data and display the names of the columns.

> fish<-read.table("Fish.txt",header=TRUE,sep="\t")
> names(fish)
[1] "hemoglobin" "group"
> sapply(fish,is.factor)
hemoglobin      group
     FALSE      FALSE

We will start with an ANOVA. The explanatory variable must be a factor (i.e. a categorical variable). It identifies the groups.

> fish$group<-factor(fish$group)
> levels(fish$group)
[1] "1" "2" "3" "4"

1.2 A visual description

We can use comparative boxplots to describe the amount of hemoglobin in the blood according to the amount of sulfamerazine in the food.

with(fish,boxplot(hemoglobin~group,
  ylab="hemoglobin",xlab="group"))
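A small numerical summary can complement the boxplots. This is a minimal sketch, not from the handout, assuming the fish data frame created above; it uses only base R.

```r
# Mean hemoglobin and group size for each level of the factor group
# (sketch; assumes the fish data frame imported from Fish.txt above)
with(fish, tapply(hemoglobin, group, mean))
with(fish, tapply(hemoglobin, group, length))
```

The group sizes should all be 10, matching the description of the design.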

Does the amount of sulfamerazine in the food appear to have an effect on the amount of hemoglobin in the blood?

1.3 The aov function

We will use the aov function to fit an ANOVA model with one factor. Then, with the summary function, we obtain a summary of the fit.

> mod<-aov(hemoglobin~group,fish)
> summary(mod)
            Df Sum Sq Mean Sq F value  Pr(>F)
group        3  26.50   8.833   5.732 0.00259 **
Residuals   36  55.48   1.541

Remarks:

The group means are significantly different.

An ANOVA model is a linear model, so we could have used the lm function to fit the model. Note that R^2 = 32.32%.

> model<-lm(hemoglobin~group,fish)
> summary(model)

Call:
lm(formula = hemoglobin ~ group, data = fish)

Residuals:
   Min     1Q Median     3Q    Max
-2.910 -0.915  0.000  0.730  2.390

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.2000     0.3926  18.341  < 2e-16 ***
group2        2.1100     0.5552   3.801 0.000537 ***
group3        1.8300     0.5552   3.296 0.002209 **
group4        1.4900     0.5552   2.684 0.010928 *

Residual standard error: 1.241 on 36 degrees of freedom
Multiple R-squared: 0.3232, Adjusted R-squared: 0.2668
F-statistic: 5.732 on 3 and 36 DF, p-value: 0.002593

We can get the ANOVA table for the fitted linear model. This is the test for the equality of means.

> anova(model)
Analysis of Variance Table

Response: hemoglobin
          Df Sum Sq Mean Sq F value   Pr(>F)
group      3 26.499  8.8329  5.7316 0.002593 **
Residuals 36 55.479  1.5411

1.4 Pairwise comparisons with Tukey's method

We can use an aov object in the function TukeyHSD.

> TukeyHSD(mod)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = hemoglobin ~ group, data = fish)

$group
     diff         lwr       upr     p adj
2-1  2.11  0.61479395 3.6052061 0.0028918
3-1  1.83  0.33479395 3.3252061 0.0113389
4-1  1.49 -0.00520605 2.9852061 0.0510850
3-2 -0.28 -1.77520605 1.2152061 0.9575246
4-2 -0.62 -2.11520605 0.8752061 0.6817134
4-3 -0.34 -1.83520605 1.1552061 0.9274651

We can use the glht function from the multcomp package to test these multiple hypotheses.

> library(multcomp)
> summary(glht(model, linfct = mcp(group = "Tukey")))

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Fit: lm(formula = hemoglobin ~ group, data = fish)

Linear Hypotheses:
           Estimate Std. Error t value Pr(>|t|)
2 - 1 == 0   2.1100     0.5552   3.801  0.00275 **
3 - 1 == 0   1.8300     0.5552   3.296  0.01120 *
4 - 1 == 0   1.4900     0.5552   2.684  0.05091 .
3 - 2 == 0  -0.2800     0.5552  -0.504  0.95753
4 - 2 == 0  -0.6200     0.5552  -1.117  0.68171
4 - 3 == 0  -0.3400     0.5552  -0.612  0.92747
(Adjusted p values reported -- single-step method)

1.5 Comparison to a control – Dunnett's test

At times, we are not necessarily interested in all pairwise comparisons. We might only be interested in comparing the control group to the other groups. To do so, we use Dunnett's test.

> summary(glht(model, linfct = mcp(group = "Dunnett")))

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Dunnett Contrasts

Fit: lm(formula = hemoglobin ~ group, data = fish)

Linear Hypotheses:
           Estimate Std. Error t value Pr(>|t|)
2 - 1 == 0   2.1100     0.5552   3.801  0.00159 **
3 - 1 == 0   1.8300     0.5552   3.296  0.00613 **
4 - 1 == 0   1.4900     0.5552   2.684  0.02895 *
(Adjusted p values reported -- single-step method)

Remark: R assumes that the first level of the factor corresponds to the control group. Keep this in mind when using Dunnett's test.

1.6 Recode a vector with car

When the explanatory variable x is quantitative, we might want to describe the mean response as a function of x. Let us import the data in the file Fish.txt again.

> fish<-read.table("Fish.txt",header=TRUE,sep="\t")
> names(fish)
[1] "hemoglobin" "group"
> unique(fish$group)
[1] 1 2 3 4

The groups 1, 2, 3, 4 are associated with the quantity x of sulfamerazine, in grams per 100 lbs of fish, which is 0, 5, 10 and 15, respectively. We can recode a vector with the function recode from the car package.

# Install the car package
install.packages("car")
# Load the car package
library(car)

We recode the values and assign them to the vector sulfamerazine. Then, we add the new vector to the dataframe fish.

> sulfamerazine<-recode(fish$group,"1=0;2=5;3=10;4=15")
> fish<-data.frame(fish,sulfamerazine)
> unique(fish$sulfamerazine)
[1]  0  5 10 15

1.7 Test for trend

We can do a test for trend with orthogonal polynomials in x of degree 0, 1, 2, ..., r-1, where r is the number of levels of x. Our explanatory variable has r = 4 levels, so we will use polynomials of degree at most three.

> model<-aov(hemoglobin~poly(sulfamerazine,3),fish)
> summary.lm(model)

Call:
aov(formula = hemoglobin ~ poly(sulfamerazine, 3), data = fish)

Residuals:
   Min     1Q Median     3Q    Max
-2.910 -0.915  0.000  0.730  2.390

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)               8.5575     0.1963  43.598  < 2e-16 ***
poly(sulfamerazine, 3)1   2.9628     1.2414   2.387  0.02238 *
poly(sulfamerazine, 3)2  -3.8738     1.2414  -3.120  0.00355 **

poly(sulfamerazine, 3)3   1.6476     1.2414   1.327  0.19281

Residual standard error: 1.241 on 36 degrees of freedom
Multiple R-squared: 0.3232, Adjusted R-squared: 0.2668
F-statistic: 5.732 on 3 and 36 DF, p-value: 0.002593

Remark: The cubic trend is not significant. However, the quadratic trend is significant. We need a polynomial of degree 2 to describe the mean hemoglobin as a function of the amount of sulfamerazine.

1.8 Fitting a polynomial response

We want to find an estimated regression model of the form

  y^ = b0 + b1 x + b2 x^2,

where y^ is the estimated amount of hemoglobin and x is the amount of sulfamerazine in the food. We want to add x and x^2 to the model. We use the syntax I(x^2) to add x^2 to the model.

> model<-lm(hemoglobin~sulfamerazine+I(sulfamerazine^2),fish)
> model

Call:
lm(formula = hemoglobin ~ sulfamerazine + I(sulfamerazine^2), data = fish)

Coefficients:
       (Intercept)       sulfamerazine  I(sulfamerazine^2)
            7.3165              0.4513             -0.0245

Our estimated model is

  y^ = 7.3165 + 0.4513 x - 0.0245 x^2.

We will produce a scatter plot with an overlay of the estimated regression model.

### produce a scatter plot
with(fish,plot(sulfamerazine,hemoglobin,
  xlab="sulfamerazine (in grams per 100 lbs of fish)",
  ylab="hemoglobin in the blood",cex.lab=1.25,cex.axis=1.25))

The above command produces the scatter plot of hemoglobin against sulfamerazine.

To overlay the estimated model on the plot, we will need to evaluate the model at many points. Since x goes from 0 to 15, we will produce a sequence of many values from 0 to 15. Then, we create a dataframe that contains all of these values and name the vector sulfamerazine. It should have the same name as the explanatory variable in the model.

x=seq(0,15,by=0.1)
newdata<-data.frame(sulfamerazine=x)

We now use the function predict to evaluate the estimated model at all these values.

p<-predict(model,newdata)
lines(x,p)
text(10,7,expression(hat(y)==7.3165+0.4513*x-0.0245*x^2),cex=1.2)

Remarks:

The function lines adds points to the plot that are joined by line segments.

The function text adds text to the plot.

The function expression allows us to display a mathematical formula.

In the resulting graph, it appears that the optimal amount of sulfamerazine is somewhere between 5 and 10.

1.9 Optimizing a function

R has a function called optimize for optimizing univariate functions. To maximize the mean amount of hemoglobin in the blood, the amount of sulfamerazine should be x = 9.21. This quantity corresponds to an estimated amount of hemoglobin in the blood of y^ = 9.39.
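Since the fitted response is a downward-opening quadratic, this maximum can also be verified in closed form by setting the derivative to zero (a standard calculus check, not part of the handout):

```latex
\hat{y}'(x) = b_1 + 2 b_2 x = 0
\quad\Rightarrow\quad
x^{*} = -\frac{b_1}{2 b_2} = -\frac{0.4513}{2(-0.0245)} \approx 9.21,
\qquad
\hat{y}(x^{*}) = 7.3165 + 0.4513(9.21) - 0.0245(9.21)^2 \approx 9.39 .
```

This agrees with the numerical answer from optimize below.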

> quadf<-function(x) 7.3165+0.4513*x-0.0245*x^2
> optimize(quadf, lower = 0, upper = 15, maximum = TRUE)
$maximum
[1] 9.210204

$objective
[1] 9.394783

2 Simple Linear Regression

Let us reconsider the fish data, where we describe the hemoglobin in the blood as a function of the amount of sulfamerazine in the food. Suppose that someone decides to use a regression approach instead of an ANOVA approach. They might try to fit a simple linear regression and produce a scatter plot with an overlay of the estimated line.

mod<-lm(hemoglobin~sulfamerazine,fish)
with(fish,plot(sulfamerazine,hemoglobin))
abline(mod)

Comments:

The fit is not bad. However, we might be concerned with the lack-of-fit of our model. It does not appear to capture the curvature in the trend for small values of x.

When there is replication for at least one of the values of x, we can do a test for lack-of-fit.

2.1 Test for lack-of-fit

To test for lack-of-fit, we start by allowing the mean response to vary with the levels of x, but without any particular pattern. That is, if there are r levels, then there can be r different means. This is an ANOVA model. Our null hypothesis is that the mean response can be expressed as a linear function of x. This is our simple linear model with x as a predictor. The ANOVA model and the simple linear model are nested, so we can use the anova function to compare the models.

> # fit the full model
> mod<-lm(hemoglobin~factor(sulfamerazine),fish)
> # fit the reduced model
> mod0<-lm(hemoglobin~sulfamerazine,fish)
> # compare the models
> anova(mod0,mod)
Analysis of Variance Table

Model 1: hemoglobin ~ sulfamerazine
Model 2: hemoglobin ~ factor(sulfamerazine)
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     38 73.200
2     36 55.479  2    17.721 5.7494 0.00681 **

We have significant evidence for lack-of-fit of the simple linear regression model. We could try to improve the fit by approximating the response function with a quadratic function of x. Let us test for lack-of-fit of the quadratic model. The evidence for lack-of-fit of the quadratic model is not significant.

> # fit the full model
> mod<-lm(hemoglobin~factor(sulfamerazine),fish)
> # fit the reduced model
> mod0<-lm(hemoglobin~sulfamerazine+I(sulfamerazine^2),fish)
> # compare the models
> anova(mod0,mod)
Analysis of Variance Table

Model 1: hemoglobin ~ sulfamerazine + I(sulfamerazine^2)
Model 2: hemoglobin ~ factor(sulfamerazine)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     37 58.193
2     36 55.479  1    2.7144 1.7614 0.1928
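The F statistic reported by anova is the general linear (extra sum of squares) test comparing the reduced model (R) to the full model (F). Spelled out for the first comparison above (the formula is standard; the numbers come from the table):

```latex
F = \frac{(\mathrm{RSS}_R - \mathrm{RSS}_F)/(df_R - df_F)}{\mathrm{RSS}_F/df_F}
  = \frac{(73.200 - 55.479)/(38 - 36)}{55.479/36}
  = \frac{8.8605}{1.5411} \approx 5.7494 .
```

Under the null hypothesis (the reduced model is adequate), this statistic follows an F distribution with (2, 36) degrees of freedom.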

3 Applying our new skills

Exercises to try in class.

1. Consider a company that produces items made from glass. The data are in the file glassworks.txt. (Source: Applied Linear Models, by Kutner et al.) A large company is studying the effects of the length of special training for new employees on the quality of work. Employees are randomly assigned to have either 6, 8, 10, or 12 hours of training. After the special training, each employee is given the same amount of material. The response variable is the number of acceptable pieces.

(a) Are the effects of the duration of the special training on the quality of the work significant?

(b) Do a trend analysis to describe the quality of the work as a function of the time in training.

2. Consider the data in the file Productivity.txt. (Source: Applied Linear Models, by Kutner et al.) Description: An economist compiled data on productivity improvements for a sample of firms producing computing equipment. The productivity improvement is measured on a scale from 0 to 100. The firms were classified according to the level of their average expenditures for R&D in the last three years (low, medium, high).

(a) Test whether or not the mean productivity improvement differs according to the level of R&D expenditures.

4 General Linear Model – ANCOVA

Rehabilitation: Consider the data in the file Rehabilitation.txt. Describe the association between the physical fitness prior to surgery of persons undergoing corrective knee surgery and the time required in physical therapy until successful rehabilitation.

The response: The number of days required for successful completion of physical therapy.

One explanatory factor: Status of fitness prior to surgery. It is categorical with 3 levels: below average (=1), average (=2), above average (=3).

Observational design: Cross-sectional. A random selection of 24 male subjects ranging in age from 18 to 30 years that had undergone similar corrective knee surgery in the last year. This is an ANOVA.

Remarks:

We can extend an ANOVA by including one or more numerical explanatory variables. This is called an analysis of covariance (ANCOVA). The purposes of ANCOVA:

– Reduce the error variance, so that the ANOVA becomes more powerful.
– Control for confounding.

An ANCOVA model is a special case of a general linear model (i.e. a linear model with categorical and/or numerical predictors). We will use the lm function to build the model.

Learning objectives:

Learn to create overlaid scatter plots.
Learn to use the lm function with more than one predictor.
Learn to test for lack-of-fit of linearity.
Multiple testing concerning regression coefficients (i.e. compare many slopes or compare many intercepts).
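In symbols, the additive (parallel-slopes) ANCOVA model fitted later in this section with lm(time ~ age + status) can be sketched as follows; the notation is mine, with the same dummy coding R uses (group 1 as baseline):

```latex
\mathrm{time}_{ij} \;=\; \beta_0 \;+\; \beta_1\,\mathrm{age}_{ij}
 \;+\; \beta_2\,\mathbf{1}[\mathrm{status}_{ij}=2]
 \;+\; \beta_3\,\mathbf{1}[\mathrm{status}_{ij}=3]
 \;+\; \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim N(0,\sigma^2).
```

The three fitness groups share the slope beta_1 for age and differ only through their intercepts, which is why comparing the groups reduces to comparing beta_2 and beta_3.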

4.1 Regression by groups

We display the names of the variables and display the levels of the categorical variable.

> dataR<-read.table("Rehabilitation.txt",header=TRUE,sep="\t")
> names(dataR)
[1] "time"   "status" "age"
> sapply(dataR,is.factor)
  time status    age
 FALSE  FALSE  FALSE
> dataR$status<-factor(dataR$status)
> levels(dataR$status)
[1] "1" "2" "3"

We will create subsets of our dataframe according to the levels of the categorical variable.

## We use a logical argument to determine the rows to keep.
## Nothing after the comma means we keep all columns.
data1=dataR[dataR$status=="1",]
data2=dataR[dataR$status=="2",]
data3=dataR[dataR$status=="3",]

For each group, we will find the least squares line for the time for rehabilitation as a function of age. To properly overlay the three scatter plots, we will need the range of all ages and of all values of the response.

ylim=range(dataR$time)
xlim=range(dataR$age)

We are now ready to produce the three plots, using plot to create the first plot and points to overlay the other plots.

ylab="time to rehabilitation (in days)"
xlab="age (in years)"
plot(data1$age,data1$time,ylab=ylab,xlab=xlab,ylim=ylim,xlim=xlim)
abline(lm(data1$time~data1$age),lty=1)
points(data2$age,data2$time,ylim=ylim,xlim=xlim,pch=2)
abline(lm(data2$time~data2$age),lty=2)
points(data3$age,data3$time,ylim=ylim,xlim=xlim,pch=3)
abline(lm(data3$time~data3$age),lty=3)
legend("bottomright",
  c("Below Average","Average","Above Average"),
  lty=c(1,2,3), pch=c(1,2,3))

Remarks:

The slopes appear to be similar. Here are the coefficients for the three models.

> lm(time~age,data=data1)$coefficients
(Intercept)         age
   6.703220    1.195104
> lm(time~age,data=data2)$coefficients
(Intercept)         age
   6.090041    1.144939
> lm(time~age,data=data3)$coefficients
(Intercept)         age
 -0.6326823   1.1368930

If we can assume that the slopes are the same, then to describe the effects of the fitness status it suffices to compare the intercepts. To formally compare the slopes, we can build a linear model that allows for different slopes according to fitness status and compare it to a reduced model where the slopes are assumed to be equal. (This is a general linear test.)

4.2 A general linear model

We fit a linear model: age and status are the predictors. The symbol * indicates that we want a model with interactions. This means that we are allowing a different coefficient for age according to the levels of status.

> modelf=lm(time~age*status,data=dataR)
> anova(modelf)
Analysis of Variance Table

Response: time
           Df Sum Sq Mean Sq   F value    Pr(>F)
age         1 835.75  835.75 2530.9102 < 2.2e-16 ***
status      2 246.08  123.04  372.6086 2.257e-15 ***
age:status  2   0.22    0.11    0.3359    0.7191
Residuals  18   5.94    0.33
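Written out (notation mine, matching R's dummy coding with group 1 as baseline), the interaction model gives each status group its own intercept and its own slope for age:

```latex
\mathrm{time}_{ij} \;=\; \beta_0 + \beta_1\,\mathrm{age}_{ij}
 + \beta_2\,\mathbf{1}[\mathrm{status}=2] + \beta_3\,\mathbf{1}[\mathrm{status}=3]
 + \beta_4\,\mathrm{age}_{ij}\,\mathbf{1}[\mathrm{status}=2]
 + \beta_5\,\mathrm{age}_{ij}\,\mathbf{1}[\mathrm{status}=3]
 + \varepsilon_{ij}.
```

The slope for age is beta_1 in group 1, beta_1 + beta_4 in group 2, and beta_1 + beta_5 in group 3; equal slopes corresponds to beta_4 = beta_5 = 0, which is exactly what the age:status line of the ANOVA table tests.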

Remarks:

There are no significant interactions between age and status. These results suggest that the slope of the regression of the response on age is similar for the three groups.

We will fit the reduced model and compare it to the above full model. The symbol + indicates that we want an additive model. This means only one coefficient for age (i.e. it does not depend on the fitness status).

> modelr=lm(time~age+status,data=dataR)
> anova(modelr,modelf)
Analysis of Variance Table

Model 1: time ~ age + status
Model 2: time ~ age * status
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     20 6.1657
2     18 5.9439  2   0.22184 0.3359 0.7191

Remark: This is a general linear test for the equality of the slopes. The p-value is large, so the evidence against the equality of the slopes is not significant. Note that the p-value is exactly the same as for the test for interaction. In fact, the two tests are testing exactly the same thing.

4.3 Test for lack-of-fit

Is it reasonable to assume that the associations between time for rehabilitation and age are linear? If not, then the ANCOVA will be biased. If we have repeated values of the variable age, we can test for linearity. We simply make age a factor. This means that we make no assumption about the form of the association between the response and age.

> modelfull=lm(time~factor(age)*status,data=dataR)
> anova(modelf,modelfull)
Analysis of Variance Table

Model 1: time ~ age * status
Model 2: time ~ factor(age) * status
  Res.Df    RSS Df Sum of Sq F Pr(>F)
1     18 5.9439
2      0 0.0000 18    5.9439

Remarks:

We did not get a p-value. This means that there are no repeated values of age, so we cannot estimate the pure error variance. If we round the ages to the nearest integer, we might get repeated values.

> modelfull=lm(time~factor(round(age,0))*status,data=dataR)
> anova(modelf,modelfull)
Analysis of Variance Table

Model 1: time ~ age * status
Model 2: time ~ factor(round(age, 0)) * status
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     18 5.9439
2      6 1.1667 12    4.7772 2.0474 0.1949

There is no significant difference between the two models. Hence it is reasonable to assume linearity.

If rounding does not work, we can try to approximate deviations from linearity with a quadratic model. We fit the larger model by using a quadratic function of age.

> modelfull=lm(time~(age+I(age^2))*status,data=dataR)
> anova(modelf,modelfull)
Analysis of Variance Table

Model 1: time ~ age * status
Model 2: time ~ (age + I(age^2)) * status
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     18 5.9439
2     15 4.8200  3     1.124 1.1659 0.3555

There are no significant differences between the two nested models. So it is reasonable to use the simpler model, i.e. to assume that the associations between the response and age are linear.

4.4 Comparing the fitness status groups

We established that the slopes of the three groups are the same. This means that it suffices to compare the intercepts for a comparison of the three fitness status groups. We will use the additive model. The p-value for the significance of status is very small. There are significant differences between the three groups.

> modelr=lm(time~age+status,data=dataR)
> anova(modelr)
Analysis of Variance Table

Response: time
          Df Sum Sq Mean Sq F value  Pr(>F)

age        1 835.75  835.75 2710.95 < 2.2e-16 ***
status     2 246.08  123.04  399.11 < 2.2e-16 ***
Residuals 20   6.17    0.31

Warning: We will change the order of the predictors. Look at the F value for status. Do you notice something strange? It has changed a lot. R uses sequential sums of squares (i.e. of Type I). This means that when testing for the significance of a predictor, we are only controlling for the effects of the previous covariables. If we put age after status, then the test for status does not take age into account. To not have to worry about the order, we can use Type II sums of squares, via the Anova function in the car package.

> anova(lm(time~status+age,data=dataR))
Analysis of Variance Table

Response: time
          Df Sum Sq Mean Sq F value  Pr(>F)
status     2 672.00  336.00  1089.9 < 2.2e-16 ***
age        1 409.83  409.83  1329.4 < 2.2e-16 ***
Residuals 20   6.17    0.31

> library(car)
> Anova(lm(time~status+age,data=dataR))
Anova Table (Type II tests)

Response: time
          Sum Sq Df F value  Pr(>F)
status    246.08  2  399.11 < 2.2e-16 ***
age       409.83  1 1329.39 < 2.2e-16 ***
Residuals   6.17 20

We established that there are significant differences between the three fitness status groups. Where are those differences? We will need the multcomp package. Once it is installed, we load it with library(multcomp).

We need to interpret the regression coefficients β0, β1, β2, β3.

> modelr$coefficients
(Intercept)         age     status2     status3
   7.431688    1.167286   -1.847379   -8.722893

β1 is the slope for age. Notice that status1 is missing. This means that the intercept β0 is for group 1. β2 is the effect for group 2 in comparison to group 1. β3 is the effect for group 3 in comparison to group 1.

We want to test:

β2 = 0, i.e. groups 1 and 2 are equal;
β3 = 0, i.e. groups 1 and 3 are equal;
β2 = β3, i.e. groups 2 and 3 are equal.

We can use the glht function from the multcomp package to test these multiple hypotheses.

> summary(glht(modelr, linfct = mcp(status = "Tukey")))

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Fit: lm(formula = time ~ age + status, data = dataR)

Linear Hypotheses:
           Estimate Std. Error t value Pr(>|t|)
2 - 1 == 0  -1.8474     0.2869  -6.438   <1e-05 ***
3 - 1 == 0  -8.7229     0.3330 -26.198   <1e-05 ***
3 - 2 == 0  -6.8755     0.2884 -23.842   <1e-05 ***
(Adjusted p values reported -- single-step method)

4.5 Comparing the effect of age according to fitness status

We concluded above that the effects of age on the time for rehabilitation were equal across the fitness groups. What would we do if they were not equal? We would do some pairwise comparisons of the slopes. Let us consider the full model with interactions.

> modelf=lm(time~age*status,data=dataR)
> modelf$coefficients
(Intercept)         age     status2     status3 age:status2 age:status3
 6.70321989  1.19510377 -0.61317875 -7.33590221 -0.05016525 -0.05821075

We would like to compare the coefficients for age: age, age:status2 and age:status3. We will denote them β1, β4 and β5.

β1 is the slope for age within group 1.
β4 is the comparison (i.e. deviation) of the slope for age within group 2 compared to group 1.
β5 is the comparison (i.e. deviation) of the slope for age within group 3 compared to group 1.

We want to test:

β4 = 0, i.e. the slopes for groups 1 and 2 are equal;
β5 = 0, i.e. the slopes for groups 1 and 3 are equal;
β4 - β5 = 0, i.e. the slopes for groups 2 and 3 are equal.

We can use the glht function from the multcomp package to test these multiple hypotheses.

library(multcomp)
K <- rbind(
  c(0, 0, 0, 0, 1,  0),
  c(0, 0, 0, 0, 0,  1),
  c(0, 0, 0, 0, 1, -1))
rownames(K) <- c("2-1", "3-1", "2-3")
summary(glht(modelf, linfct = K))

Remarks:

The model has six coefficients: β0, β1, ..., β5. We are building a matrix for the test: the columns are the coefficients and the rows are the comparisons.

The row (0, 0, 0, 0, 1, -1) means β4 - β5.

We obtain the following output.

Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = time ~ age * status, data = dataR)

Linear Hypotheses:
          Estimate Std. Error t value Pr(>|t|)
2-1 == 0 -0.050165   0.079799  -0.629    0.806
3-1 == 0 -0.058211   0.081856  -0.711    0.759
2-3 == 0  0.008045   0.092427   0.087    0.996
(Adjusted p values reported -- single-step method)

We observe that there are no significant differences in the pairwise comparisons. This should not be surprising, since we had concluded that the slopes were equal.

5 Applying our new skills

Exercises to try in class.

1. Consider the data in the file Productivity.txt. (Source: Applied Linear Models, by Kutner et al.) Description: An economist compiled data on productivity improvements for a sample of firms producing computing equipment. The productivity improvement is measured on a scale from 0 to 100. The firms were classified according to the level of their average expenditures for R&D in the last three years (low, medium, high). The economist also has information on annual productivity improvement in the prior year and wishes to use this information as a covariable in the model.

(a) Test whether the three regression lines that describe the productivity improvements as a linear function of the productivity improvements in the prior year have the same slope.

(b) Perform a formal test of whether the three regression functions are linear.

(c) Make pairwise comparisons of the firms according to their average expenditures for R&D in the last three years with an ANCOVA.