Lab 3: A Quick Introduction to Multiple Linear Regression
Psychology 310

Instructions. Work through the lab, saving the output as you go. You will be submitting your assignment as an R Markdown document.

Preamble. Today's assignment involves looking at multiple linear regression as an analysis technique.

1 The Multiple Linear Regression Model

Multiple linear regression is a generalization of simple bivariate linear regression: instead of having just one predictor, we have two or more. For example, suppose we have 3 predictors, X₁, X₂, and X₃. The prediction equation becomes

    Ŷ = β₀ + β₁X₁ + β₂X₂ + β₃X₃                    (1)

The standard multiple regression model tested by commercial statistical programs does not actually assume anything about the distribution of the X variables; they are treated as fixed, already-given observations. Strictly speaking, this model is not correct for the many studies in which the X variables and the Y scores are gathered simultaneously and all are subject to random variation. Fortunately, it often doesn't matter too much in practice.

The fixed-regressor model does assume that observations on Y and X follow the rule

    yᵢ = ŷᵢ + εᵢ                                    (2)

where the ε values have a normal distribution with a mean of zero and a variance that is constant across values of the predictor variables. Under these assumptions, we estimate the regression coefficients using a multivariate equivalent of the least squares equations we learned in our module on bivariate linear regression; in matrix form, the estimate is β̂ = (X′X)⁻¹X′y. Fitting a multiple linear regression model in R is really simple.

2 An Example: Predicting Birth Weight

As an illustration, we shall examine some data on the birth weight of babies. Download the babies.data.txt file, read it into R, and attach it with the following commands.

> babies.data <- read.table("http://www.statpower.net/data/babies.data.txt",
+     header = TRUE)
> attach(babies.data)
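Before going further, it helps to check which variables the data frame contains. Here is an optional sketch using base R's str and head functions:

> str(babies.data)    # variable names and storage types
> head(babies.data)   # first few rows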

It is reasonable to expect many, if not all, of these variables to have an impact on birth weight. Length of the gestation period is obviously a key factor, but physical size is inherited, so we would expect the mom's height and weight to be predictors as well. Smoking has also been identified as a potential cause of reduced birth weight.

After loading the variables, let's take a quick look at the correlations. We could display all the intercorrelations with the command cor(babies.data), but a more efficient command displays only the correlations between the criterion, birth.weight, and the predictors.

> cor(babies.data, birth.weight)
                   [,1]
birth.weight    1.00000
gestation       0.40754
not.first.born -0.04391
mom.age         0.02698
mom.height      0.20370
mom.weight      0.15592
mom.smokes     -0.24680

As you can see from the correlations, the gestation variable is the strongest predictor. We'll want to spot any predictable nonlinear relationships as well, so let's do a scatterplot matrix. The scatterplotMatrix function in the car library is especially good, because it includes a linear fit and a nonparametric regression line called a loess fit in each little scatterplot. (You will need to have the car library installed and loaded to execute this command.) Here are the commands.

> library(car)
> scatterplotMatrix(~birth.weight + mom.height + mom.weight + mom.age +
+     gestation + not.first.born + mom.smokes)
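If the car package is not installed, install.packages("car") will fetch it. As a plainer fallback, base R's pairs function draws a scatterplot matrix without the fitted lines; a minimal sketch:

> # Base-R alternative: no linear or loess fits are overlaid
> pairs(babies.data)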

[Figure: scatterplot matrix of birth.weight, mom.height, mom.weight, mom.age, gestation, not.first.born, and mom.smokes, with linear and loess fits in each panel.]

We begin by fitting two simple models, one with only the intercept term, and one with only one predictor, gestation.

> fit0 <- lm(birth.weight ~ 1)
> fit1 <- lm(birth.weight ~ gestation)

Note that as long as you have at least one predictor, you do not need to write the intercept term, which is represented by a 1, explicitly; R includes the intercept by default. The model with no predictors and only an intercept is frequently referred to as the null model in discussions of regression. Let's examine the fit of our first non-null model.

> summary(fit1)

Call:
lm(formula = birth.weight ~ gestation)

Residuals:
   Min     1Q Median     3Q    Max
-49.35 -11.07   0.22  10.10  57.70

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.7541     8.5369   -1.26     0.21
gestation     0.4666     0.0305   15.28   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.7 on 1172 degrees of freedom
Multiple R-squared:  0.166,  Adjusted R-squared:  0.165
F-statistic: 233 on 1 and 1172 DF,  p-value: <2e-16

The squared multiple correlation of 0.166 is highly significant. However, the scatterplot of birth.weight versus gestation gives us pause, because two observations appear to be outliers. You can see this clearly by loading the alr3 library and using its residual.plots function, or simply by plotting the two variables.

> plot(gestation, birth.weight)
> abline(fit1, lty = 2, col = "red")

[Figure: scatterplot of birth.weight against gestation, with the fitted line from fit1 drawn as a dashed red line.]

Next, use the identify function to identify the outliers.

> identify(gestation, birth.weight)

Click on the points and you'll quickly identify them as points 239 and 820. Frequently, outliers are also high-influence observations. An observation has high influence if it exerts a strong effect on the values of the estimated β coefficients. This is measured by taking a standardized summary measure comparing β̂, the vector of estimated beta weights using all the data, with β̂(i), the vector of estimates based on all the data except the ith observation. You can verify that point 239 is definitely an outlier and a high-influence observation by plotting the Cook's distance measure against the case number, as follows. (The cooks.distance function is part of base R's stats package, so no extra library is needed.)

> case.numbers <- 1:length(gestation)
> influence <- cooks.distance(fit1)
> plot(case.numbers, influence)

[Figure: Cook's distance (influence) plotted against case.numbers; one case stands out far above the rest.]
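One caution: identify() needs an interactive graphics window, so it will not work when you knit your R Markdown document. Here is a non-interactive sketch that flags the same kind of extreme cases; the 4/n cutoff used below is a common rule of thumb, not part of the lab itself.

> # Cases with the largest Cook's distances
> head(sort(influence, decreasing = TRUE))
> # Cases exceeding the conventional 4/n cutoff
> which(influence > 4 / length(influence))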

Let's remove these two cases and continue. We use a command that tells R to exclude the 239th and 820th rows of the data. Then we detach the current file and attach the trimmed data.

> trimmed.data <- babies.data[c(-239, -820), ]
> detach(babies.data)
> attach(trimmed.data)
> fit0 <- lm(birth.weight ~ 1)
> fit1 <- lm(birth.weight ~ gestation)
> summary(fit1)

Call:
lm(formula = birth.weight ~ gestation)

Residuals:
   Min     1Q Median     3Q    Max
 -49.3  -10.9    0.2   10.1   53.7

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -22.2026     8.8885    -2.5    0.013 *
gestation     0.5073     0.0318    16.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.6 on 1170 degrees of freedom
Multiple R-squared:  0.179,  Adjusted R-squared:  0.178
F-statistic: 255 on 1 and 1170 DF,  p-value: <2e-16

Notice how the squared multiple R has improved a bit, and the slope of the regression line has increased. Our next step is to add a predictor to the regression equation. Let's add mom's smoking as a predictor, and store the fit.

> fit2 <- lm(birth.weight ~ gestation + mom.smokes)
> summary(fit2)

Call:
lm(formula = birth.weight ~ gestation + mom.smokes)

Residuals:
   Min     1Q Median     3Q    Max
-50.55 -10.86  -0.18  10.01  50.49

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -13.6485     8.6930   -1.57     0.12
gestation     0.4881     0.0309   15.77   <2e-16 ***
mom.smokes   -8.1717     0.9692   -8.43   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.2 on 1169 degrees of freedom
Multiple R-squared:  0.226,  Adjusted R-squared:  0.225
F-statistic: 171 on 2 and 1169 DF,  p-value: <2e-16

We can look for outliers in this multiple regression situation by plotting predicted scores against residual scores. Adding a horizontal line at zero helps interpretation of the plot.

> plot(predict(fit2), residuals(fit2))
> abline(0, 0, lty = 2, col = "red")

[Figure: residuals(fit2) plotted against predict(fit2), with a dashed horizontal reference line at zero.]
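To see the smoking effect on the scale of the data, here is a sketch using predict with new data; it assumes mom.smokes is coded 0 = non-smoker and 1 = smoker, as the 0/1 values in the scatterplot matrix suggest, and new.moms is just an arbitrary name.

> # Predicted birth weight at the average gestation length,
> # for a non-smoking (0) and a smoking (1) mother
> new.moms <- data.frame(gestation = mean(gestation), mom.smokes = c(0, 1))
> predict(fit2, newdata = new.moms)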

This regression suggests that babies gain roughly half an ounce a day (the gestation coefficient is 0.49) within the range of the data, and that smoking reduces birth weight by about 8.2 ounces, roughly half a pound. Note that the R² value increased, indicating that smoking accounts for almost 5% more variance (0.226 versus 0.179) beyond what is predicted by gestation period. We can formally compare these nested fits with the anova command.

> anova(fit0, fit1, fit2)

Analysis of Variance Table

Model 1: birth.weight ~ 1
Model 2: birth.weight ~ gestation
Model 3: birth.weight ~ gestation + mom.smokes
  Res.Df    RSS Df Sum of Sq     F Pr(>F)
1   1171 393956
2   1170 323499  1     70457 270.1 <2e-16 ***
3   1169 304953  1     18546  71.1 <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The two F tests for these models are both highly significant, indicating that each model fits significantly better than its predecessor.
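As a check on where these F statistics come from: in anova(), each F is the drop in residual sum of squares divided by the residual mean square of the largest model in the sequence (fit2 here). You can verify this from the printed table:

> # F for adding gestation: change in RSS over fit2's residual mean square
> (393956 - 323499) / (304953 / 1169)   # about 270.1
> # F for adding mom.smokes
> (323499 - 304953) / (304953 / 1169)   # about 71.1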