Introductory Statistics with R: Linear models for continuous response (Chapters 6, 7, and 11)


Statistical Packages STAT 1301 / 2300, Fall 2014
Sungkyu Jung, Department of Statistics, University of Pittsburgh
E-mail: sungkyu@pitt.edu
http://www.stat.pitt.edu/sungkyu/stat1301/

1. Simple linear regression
2. Multiple Regression
3. Analysis of Variance

Section 1: Simple linear regression

Fitting simple linear regression

The lm (Linear Model) function. A model for the thuesen data:

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)$,

short.velocity_i = intercept + slope × blood.glucose_i + error_i.

    library(ISwR)
    plot(thuesen$blood.glucose, thuesen$short.velocity)
    lm(short.velocity ~ blood.glucose, data = thuesen)

    > attach(thuesen)
    > lm(short.velocity ~ blood.glucose)

    Call:
    lm(formula = short.velocity ~ blood.glucose)

    Coefficients:
      (Intercept)  blood.glucose
          1.09781        0.02196

Saving and extracting the result of analysis

Model object. The result of lm is a model object. Whereas other statistical systems focus on generating printed output that can be controlled by setting options, R instead returns the result of a model fit encapsulated in an object. The model object is a kind of list, and the desired quantities can be obtained from it using extractor functions.

    fit <- lm(short.velocity ~ blood.glucose)
    summary(fit)
    str(fit)
    fit$coefficients
    fit$fitted.values
    plot(short.velocity ~ blood.glucose)
    abline(fit)
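As a minimal sketch of the extractor idea, the lines below pull a few quantities out of a fitted model. To keep the sketch runnable without the ISwR package, it uses R's built-in cars data (dist versus speed) as a stand-in for thuesen; that substitution is an assumption made for illustration, not part of the slides.

```r
# Stand-in data: built-in 'cars' (speed, dist) instead of thuesen.
fit <- lm(dist ~ speed, data = cars)

coef(fit)            # extractor; same values as fit$coefficients
head(fitted(fit))    # extractor; same values as fit$fitted.values

s <- summary(fit)    # the summary is itself an object with components
s$coefficients       # matrix: Estimate, Std. Error, t value, Pr(>|t|)
s$r.squared          # coefficient of determination
s$sigma              # residual standard error

names(fit)           # components stored in the list-like model object
```

Extractor functions such as coef() are usually preferable to $ access, since they keep working for other model classes whose internal layout differs.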

Model Diagnostics

All further analysis is based on the model object (fit in this example).

Fitted values $\hat{y}_i$:

    fitted(fit)
    fit$fitted.values

Residuals $r_i = y_i - \hat{y}_i$:

    resid(fit)
    fit$residuals

Exercise: Produce a scatterplot of short.velocity versus blood.glucose with the fitted line and residual line segments.
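The definition of the residuals can be checked directly, and the same quantities give the endpoints of residual line segments. This is a sketch on the built-in cars data (an assumed stand-in, so the code runs without ISwR), not a solution for thuesen: that data set contains a missing short.velocity value, so there the fitted values are shorter than the data and the rows must be matched up first.

```r
# Stand-in data: built-in 'cars' has no missing values.
fit <- lm(dist ~ speed, data = cars)

# Residuals by hand: observed minus fitted.
r <- cars$dist - fitted(fit)
all.equal(unname(r), unname(resid(fit)))

# Scatterplot, fitted line, and one vertical segment per observation
# running from the fitted value to the observed value.
plot(dist ~ speed, data = cars)
abline(fit)
segments(cars$speed, fitted(fit), cars$speed, cars$dist)
```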

Model Diagnostics

Leverage, Cook's distance and studentized residuals:

    leverages <- hatvalues(fit)
    cooks.distance(fit)
    rstudent(fit)

Residual vs fitted values plot:

    plot(fitted(fit), rstudent(fit))

Checking Normality:

    qqnorm(resid(fit))
    qqline(resid(fit))

Just want everything?

    plot(fit)
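A short sketch of how these diagnostics might be used numerically. It again assumes the built-in cars data as a stand-in; the cutoff of twice the average leverage is a common rule of thumb, not something prescribed by the slides.

```r
fit <- lm(dist ~ speed, data = cars)   # stand-in data, not thuesen

h <- hatvalues(fit)        # leverages
p <- length(coef(fit))     # number of estimated coefficients
n <- nrow(cars)

sum(h)                     # leverages always sum to p

# Rule of thumb (an assumption here): flag points above 2 * p / n.
high_leverage <- which(h > 2 * p / n)
high_leverage

# Observation with the largest Cook's distance.
d <- cooks.distance(fit)
which.max(d)
```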

Prediction and confidence intervals

The predict function. Obtain the fitted values with upper and lower bounds of confidence intervals (interval = "confidence") or prediction intervals (interval = "prediction"). The arguments can be abbreviated:

    predict(fit, int = "c")
    predict(fit, int = "p")

Prediction and confidence intervals

Prediction for new data points. New data points (here newpts) are arranged as a data frame containing suitable x values (here blood.glucose).

    newpts <- data.frame(blood.glucose = 1:30)
    pp <- predict(fit, int = "p", newdata = newpts)
    pc <- predict(fit, int = "c", newdata = newpts)
    plot(blood.glucose, short.velocity,
         ylim = range(short.velocity, pp, na.rm = TRUE),
         xlim = range(blood.glucose, newpts$blood.glucose, na.rm = TRUE))
    matlines(newpts$blood.glucose, pp, lty = c(1, 3, 3), col = "black")
    matlines(newpts$blood.glucose, pc, lty = c(1, 3, 3), col = "blue")

The range function returns a vector of length 2 containing the minimum and maximum values of all of its arguments. The matlines function plots the columns of one matrix against the columns of another.
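The difference between the two interval types can be seen numerically: both are centred on the same fitted value, but the prediction interval is wider because it also accounts for the error of a new observation. A minimal sketch, assuming the built-in cars data as a stand-in and three arbitrary new speed values:

```r
fit <- lm(dist ~ speed, data = cars)          # stand-in data
newpts <- data.frame(speed = c(5, 15, 25))    # arbitrary new x values

pc <- predict(fit, newdata = newpts, interval = "confidence")
pp <- predict(fit, newdata = newpts, interval = "prediction")

# Each result is a matrix with columns fit, lwr, upr.
widths <- cbind(newpts,
                ci_width = pc[, "upr"] - pc[, "lwr"],
                pi_width = pp[, "upr"] - pp[, "lwr"])
widths
```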

Section 2: Multiple Regression

Multiple Regression

Scatterplot matrix:

    data(cystfibr)
    par(mex = 0.5)
    pairs(cystfibr, gap = 0, cex.labels = 0.9)

Fitting a multiple regression model with the lm function:

    lm(pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + tlc, data = cystfibr)
    a <- lm(pemax ~ ., data = cystfibr)
    summary(a)

Here, the model formula pemax ~ . stands for pemax explained by all other variables in the data frame.

Model formulas

Suppose we have either continuous (numeric) or categorical (factor) variables as follows.
Y: response variable
X, Z, W: predictor variables

Basic formulas:
Y ~ X: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$.
Y ~ X + W: $Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_i + \epsilon_i$.

Examples:

    lm(pemax ~ age, data = cystfibr)
    lm(pemax ~ age + sex, data = cystfibr)

Importantly, the + symbol in a model formula does not have its usual arithmetic meaning: it adds a term to the model rather than adding the values of the variables.

Model formulas

In addition, the dot symbol (.) can only be used when the data frame is specified, and means all variables except the response.

Examples:

    Y ~ X + Z + W + X:Z + X:W + Z:W    # main effects and all two-way interactions
    Y ~ X * Z * W - X:Z:W              # the same model: full crossing minus the three-way term
    Y ~ (X + Z + W)^2                  # the same model again
    Y ~ X + Z - 1                      # no intercept
    Y ~ . - X, data = adataframe       # used within lm(); all variables except X
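One way to see what a formula expands to, without fitting anything, is to inspect its terms object. The sketch below uses only the symbolic names Y, X, Z, W from the slide (no data are needed) and a small helper, labels_of, which is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: the term labels a formula expands to.
labels_of <- function(f) attr(terms(f), "term.labels")

f1 <- Y ~ X + Z + W + X:Z + X:W + Z:W
f2 <- Y ~ X * Z * W - X:Z:W
f3 <- Y ~ (X + Z + W)^2

labels_of(f1)
labels_of(f2)
labels_of(f3)

# The intercept is recorded separately: 1 if present, 0 if removed.
attr(terms(Y ~ X + Z), "intercept")
attr(terms(Y ~ X + Z - 1), "intercept")
```

All three formulas should list the same three main effects and three two-way interactions.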

Model formulas

Exercises:
Fit a quadratic regression to the thuesen data.
Fit a quartic regression to the thuesen data without the intercept.
Merge weight and height to create BMI = wt/ht^2 and fit a linear regression model pemax ~ BMI to the cystfibr data.
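A hint for these exercises: because ^ has a special meaning inside a formula, arithmetic powers have to be protected with I() or generated with poly(). The sketch below shows the pattern on the built-in cars data (an assumed stand-in, so the exercises themselves are left open).

```r
# Two equivalent ways to write a quadratic regression.
fit_I    <- lm(dist ~ speed + I(speed^2), data = cars)
fit_poly <- lm(dist ~ poly(speed, 2, raw = TRUE), data = cars)

coef(fit_I)
coef(fit_poly)

# Without I(), speed^2 is read as formula crossing and collapses to speed,
# so this is still a straight-line fit with two coefficients.
fit_wrong <- lm(dist ~ speed + speed^2, data = cars)
coef(fit_wrong)
```

The same I() wrapper applies to derived predictors such as a ratio of two columns, although creating the new variable in the data frame first is often clearer.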

Model Comparison

AIC and BIC for each model:

    fit1 <- lm(pemax ~ rv + bmp, data = cystfibr)
    AIC(fit1)
    extractAIC(fit1)
    BIC(fit1)

Comparing two models:
H0: Model 1, pemax ~ rv + bmp, is true (the weight coefficient = 0).
H1: Model 2, pemax ~ rv + bmp + weight, is true (the weight coefficient ≠ 0).

To request an F-test for the significance of the additional term in the more complex model, use the anova function.

    fit1 <- lm(pemax ~ rv + bmp, data = cystfibr)
    fit2 <- lm(pemax ~ rv + bmp + weight, data = cystfibr)
    anova(fit1, fit2)
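When the two nested models differ by a single coefficient, the F-test from anova is equivalent to the t-test for that coefficient in the larger model (F equals t squared). A minimal sketch, assuming the built-in mtcars data in place of cystfibr; it also shows that AIC and extractAIC differ only by a constant for a given data set, so they rank models identically.

```r
fit1 <- lm(mpg ~ wt, data = mtcars)        # smaller model
fit2 <- lm(mpg ~ wt + hp, data = mtcars)   # adds one predictor

cmp <- anova(fit1, fit2)
F_stat <- cmp$F[2]
t_stat <- summary(fit2)$coefficients["hp", "t value"]
c(F = F_stat, t_squared = t_stat^2)

# AIC() and extractAIC() use different additive constants,
# but differences between models agree.
AIC(fit1) - AIC(fit2)
extractAIC(fit1)[2] - extractAIC(fit2)[2]
```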

Model Selection

add1 function: try adding one more predictor to the current model.

    > fit1 <- lm(pemax ~ age, data = cystfibr)
    > add1(fit1, pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + tlc, data = cystfibr, test = "F")

Unfortunately, using pemax ~ . in the second argument of add1 does not provide all variables in the data frame; it only provides the variables already in the current model.

drop1 function: try removing one predictor from the current model.

    > fit1 <- lm(pemax ~ ., data = cystfibr)
    > drop1(fit1, test = "F")

Exercise: do forward and backward variable selection by i) F-tests and ii) AIC values.
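The tables returned by add1 and drop1 are data frames, so a single forward or backward step can be read off programmatically. A sketch under the assumption that the built-in mtcars data and an arbitrary handful of candidate predictors stand in for cystfibr:

```r
# One forward step: which single predictor helps most when added?
fit_small <- lm(mpg ~ wt, data = mtcars)
a <- add1(fit_small, scope = ~ wt + hp + qsec + drat, test = "F")
a
rownames(a)[which.min(a$AIC)]   # candidate with the lowest AIC ("<none>" = stay)

# One backward step: which single predictor is least useful?
fit_big <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
d <- drop1(fit_big, test = "F")
d
rownames(d)[which.min(d$AIC)]
```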

Model Selection

Automated variable selection by the step function:

    step(model.object, scope, direction = "backward", test = "F", ...)

model.object: the model object representing the initial model in the stepwise search.
scope: a model formula for the full model (as used in add1).
Options for direction include "both", "backward", "forward".
The optional argument test = "F" is passed to add1 and drop1.

Example:

    fit.null <- lm(pemax ~ 1, data = cystfibr)
    step(fit.null, pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + tlc, data = cystfibr, direction = "both")
    fit.full <- lm(pemax ~ ., data = cystfibr)
    step(fit.full, direction = "backward", test = "F")
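The value returned by step is the selected model object, so it can be saved and inspected like any other fit. A minimal sketch, assuming the built-in mtcars data with an arbitrary starting model; trace = 0 only suppresses the step-by-step printout.

```r
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Backward elimination by AIC; the result is an lm object.
sel <- step(full, direction = "backward", trace = 0)

formula(sel)                 # the model that was kept
extractAIC(full)[2]          # AIC (step's scale) of the starting model
extractAIC(sel)[2]           # AIC of the selected model, never larger
```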

Section 3: Analysis of Variance

One-way and two-way ANOVA Models

red cell folate data

    > summary(red.cell.folate)
         folate            ventilation
     Min.   :206.0   N2O+O2,24h:8
     1st Qu.:249.5   N2O+O2,op :9
     Median :274.0   O2,24h    :5
     Mean   :283.2
     3rd Qu.:305.5
     Max.   :392.0

juul data

    > summary(juul[, 3:5])
          sex             igf1           tanner
     Min.   :1.000   Min.   : 25.0   Min.   :1.00
     1st Qu.:1.000   1st Qu.:202.2   1st Qu.:1.00
     Median :2.000   Median :313.5   Median :2.00
     Mean   :1.534   Mean   :340.2   Mean   :2.64
     3rd Qu.:2.000   3rd Qu.:462.8   3rd Qu.:5.00
     Max.   :2.000   Max.   :915.0   Max.   :5.00
     NA's   :5       NA's   :321     NA's   :240

Exercise: Treat the variables sex and tanner as categorical variables.

ANOVA model fit and ANOVA table

aov for balanced data

    > table(red.cell.folate$ventilation)

    N2O+O2,24h  N2O+O2,op     O2,24h
             8          9          5

    > aov(folate ~ ventilation, data = red.cell.folate)
    Call:
       aov(formula = folate ~ ventilation, data = red.cell.folate)

    Terms:
                    ventilation Residuals
    Sum of Squares     15515.77  39716.10
    Deg. of Freedom           2        19

    Residual standard error: 45.72003
    Estimated effects may be unbalanced

    > summary(aov(folate ~ ventilation, data = red.cell.folate))
                Df Sum Sq Mean Sq F value Pr(>F)
    ventilation  2  15516    7758   3.711 0.0436 *
    Residuals   19  39716    2090
    ---

ANOVA model fit and ANOVA table

An ANOVA model is a linear regression model with multiple (binary) predictors.

lm to fit, anova for table

    > lm(folate ~ ventilation, data = red.cell.folate)

    Call:
    lm(formula = folate ~ ventilation, data = red.cell.folate)

    Coefficients:
             (Intercept)  ventilationN2O+O2,op     ventilationO2,24h
                  316.62                -60.18                -38.62

    > anova(lm(folate ~ ventilation, data = red.cell.folate))
    Analysis of Variance Table

    Response: folate
                Df Sum Sq Mean Sq F value  Pr(>F)
    ventilation  2  15516  7757.9  3.7113 0.04359 *
    Residuals   19  39716  2090.3
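The "binary predictors" are the dummy (indicator) columns that lm builds from the factor. With R's default treatment contrasts, the intercept is the mean of the first level and each other coefficient is a difference from that mean. A sketch on the built-in PlantGrowth data (an assumed stand-in for red.cell.folate, with groups ctrl, trt1, trt2):

```r
fit <- lm(weight ~ group, data = PlantGrowth)

# The design matrix: an intercept column plus one 0/1 column per non-reference level.
head(model.matrix(fit))

# Group means versus regression coefficients.
means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
means
coef(fit)

# Intercept = mean of the reference level; others = differences from it.
by_hand <- c(means["ctrl"],
             means["trt1"] - means["ctrl"],
             means["trt2"] - means["ctrl"])
by_hand
```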

Pairwise comparisons

The p-value in the ANOVA table is used for testing $H_0: \alpha_1 = \cdots = \alpha_I = 0$. When $H_0$ is rejected (i.e., there is enough evidence for the alternative that $\alpha_i \neq \alpha_j$ for at least one pair $(i, j)$), one often performs pairwise comparisons for all pairs of levels.

pairwise.t.test function

    > attach(red.cell.folate)
    > pairwise.t.test(folate, ventilation)

            Pairwise comparisons using t tests with pooled SD

    data:  folate and ventilation

              N2O+O2,24h N2O+O2,op
    N2O+O2,op 0.042      -
    O2,24h    0.310      0.408

    P value adjustment method: holm

Other methods for p-value adjustment can be used:

    pairwise.t.test(folate, ventilation, p.adj = "bonferroni")
    # try "none", "holm", "fdr" for the option p.adj
    TukeyHSD(aov(folate ~ ventilation, data = red.cell.folate))
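The adjustment step is separate from the t tests themselves: the unadjusted p-values can be obtained with p.adj = "none" and then corrected with p.adjust. A sketch on the built-in PlantGrowth data (an assumed stand-in for red.cell.folate), checking that this reproduces the default Holm output:

```r
x <- PlantGrowth$weight
g <- PlantGrowth$group

raw  <- pairwise.t.test(x, g, p.adjust.method = "none")$p.value
holm <- pairwise.t.test(x, g, p.adjust.method = "holm")$p.value

# The p-value matrix is lower triangular; the upper entries are NA.
idx <- lower.tri(raw, diag = TRUE)

adjusted_by_hand <- p.adjust(raw[idx], method = "holm")
adjusted_by_hand
holm[idx]
```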

Miscellaneous functions for one-way ANOVA

bartlett.test for homogeneity of variances. The traditional one-way ANOVA requires an assumption of equal variances for all groups. To check whether this assumption is reasonable:

    bartlett.test(folate ~ ventilation)

oneway.test for unequal variances. There is, however, an alternative procedure that does not require that assumption. It is due to Welch and is similar to the unequal-variances t test.

    oneway.test(folate ~ ventilation)

Kruskal-Wallis test. A nonparametric counterpart of a one-way analysis of variance:

    kruskal.test(folate ~ ventilation)
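Each of these functions returns an object whose statistic and p-value can be extracted for comparison. The sketch below uses the built-in PlantGrowth data as an assumed stand-in, and also shows that oneway.test with var.equal = TRUE gives the classical ANOVA F statistic, while the default is the Welch version.

```r
classic <- anova(lm(weight ~ group, data = PlantGrowth))
pooled  <- oneway.test(weight ~ group, data = PlantGrowth, var.equal = TRUE)
welch   <- oneway.test(weight ~ group, data = PlantGrowth)  # default: unequal variances

c(classic = classic$`F value`[1],
  pooled  = unname(pooled$statistic),
  welch   = unname(welch$statistic))

# p-values of the variance check and the nonparametric alternative.
bt <- bartlett.test(weight ~ group, data = PlantGrowth)
kw <- kruskal.test(weight ~ group, data = PlantGrowth)
c(bartlett = bt$p.value, kruskal = kw$p.value)
```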

Graphical investigation

Diagnostics:

    plot(model.object)

Graphical representation for comparing levels:

    xbar <- tapply(folate, ventilation, mean)
    s <- tapply(folate, ventilation, sd)
    n <- tapply(folate, ventilation, length)
    sem <- s / sqrt(n)
    stripchart(folate ~ ventilation, method = "jitter", jitter = 0.05, pch = 16, vertical = TRUE)
    arrows(1:3, xbar + sem, 1:3, xbar - sem, angle = 90, code = 3, length = 0.1)
    lines(1:3, xbar, pch = 4, type = "b", cex = 2)
    boxplot(folate ~ ventilation)

Two-way ANOVA models

juul data: Preparation

    detach(juul)
    juul$sex <- factor(juul$sex, labels = c("M", "F"))
    juul$menarche <- factor(juul$menarche, labels = c("No", "Yes"))
    juul$tanner <- factor(juul$tanner, labels = c("I", "II", "III", "IV", "V"))
    attach(juul)

Main effect model:

    TF <- complete.cases(juul[, 2:5])
    juul2 <- subset(juul, TF)
    a <- lm(igf1 ~ menarche + tanner, data = juul2)
    anova(a)

Full model with interaction:

    a <- lm(igf1 ~ menarche + tanner + menarche:tanner, data = juul2)
    a2 <- lm(igf1 ~ menarche * tanner, data = juul2)   # same model as a
    anova(a)
    table(menarche, tanner)

Two-way ANOVA models

ToothGrowth data. The response is the length of odontoblasts (teeth) in each of 10 guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods (orange juice or ascorbic acid).

    TG2 <- transform(ToothGrowth, dose = factor(dose))
    attach(TG2)
    table(supp, dose)
    summary(aov(len ~ supp * dose))

Two-way interaction plot:

    # interaction.plot(x.factor, trace.factor, response)
    interaction.plot(supp, dose, len)
    interaction.plot(dose, supp, len)
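The interaction line of the two-way table can also be obtained as a comparison of two nested models, the additive model and the full model, using the anova function. A minimal sketch on the same ToothGrowth data (built into R), written with data = rather than attach; the object names additive and full are chosen here for illustration.

```r
TG2 <- transform(ToothGrowth, dose = factor(dose))
table(TG2$supp, TG2$dose)          # 10 observations per cell

additive <- lm(len ~ supp + dose, data = TG2)   # main effects only
full     <- lm(len ~ supp * dose, data = TG2)   # adds supp:dose

cmp <- anova(additive, full)       # F-test for the interaction
tab <- anova(full)                 # sequential two-way ANOVA table

cmp
tab["supp:dose", ]
```

The F statistic in the second row of cmp should match the supp:dose row of tab, since the interaction is the last term entered in the full model.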