Chapter 3 - Linear Regression


Lab Solution

1 Problem 9

First we will read the "Auto" data. Note that most datasets referred to in the text are in ISLR, the R package the authors developed, so we just need to load the package and use the data() function to bring the dataset into R's memory.

library(ISLR)
data("Auto")

Each of the 392 rows of this dataset corresponds to a make/model of car, and the columns represent various characteristics of that make/model.

names(Auto)

[1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"
[6] "acceleration" "year"         "origin"       "name"

dim(Auto)

[1] 392   9

head(Auto)

  mpg cylinders displacement horsepower weight acceleration year origin
1  18         8          307        130   3504         12.0   70      1
2  15         8          350        165   3693         11.5   70      1
3  18         8          318        150   3436         11.0   70      1
4  16         8          304        150   3433         12.0   70      1
5  17         8          302        140   3449         10.5   70      1
6  15         8          429        198   4341         10.0   70      1
                       name
1 chevrolet chevelle malibu
2         buick skylark 320
3        plymouth satellite
4             amc rebel sst
5               ford torino
6          ford galaxie 500

Eventually, we will try to predict the fuel efficiency of a car given its various characteristics.

(a). Produce a scatterplot matrix which includes all of the variables in the data set.

A scatterplot matrix can be constructed with the pairs() function.

pairs(Auto)

[Figure: scatterplot matrix of mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin, and name.]
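Because name is qualitative, its panels in the matrix are not very informative. As an optional sketch (not required by the exercise), the matrix can be restricted to the numeric columns, recalling that name is the 9th column:

# optional: scatterplot matrix of the numeric columns only, dropping "name"
pairs(Auto[, -9])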

Note some of the trends in the data. For example, as the number of cylinders increases, it appears that fuel efficiency decreases. There are numerous curvilinear relationships with mpg as well.

(b). Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

As the problem mentions, we need to remove the name column because it is not numeric. Recall the output of names(Auto) above, and note that "name" is the 9th entry in that output, so cor(Auto[,-9]) would do the job for us. However, it would be inefficient to look up this number each time, so instead we can use the which() function to find the column to remove by name. Additionally, the subset() function can be used to drop columns by name. So

cor(Auto[,-9])
cor(Auto[,-which(names(Auto) %in% c("name"))])
cor(subset(Auto, select=-name))

will all produce

                    mpg  cylinders displacement horsepower     weight acceleration
mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442    0.4233285
cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273   -0.5046834
displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944   -0.5438005
horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377   -0.6891955
weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000   -0.4168392
acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392    1.0000000
year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199    0.2903161
origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054    0.2127458
                   year     origin
mpg           0.5805410  0.5652088
cylinders    -0.3456474 -0.5689316
displacement -0.3698552 -0.6145351
horsepower   -0.4163615 -0.4551715
weight       -0.3091199 -0.5850054
acceleration  0.2903161  0.2127458
year          1.0000000  0.1815277
origin        0.1815277  1.0000000

Note that the signs and magnitudes more or less match the directions and strengths of the relationships we observe in the scatterplot matrix, though because of the non-linear nature of some relationships these results may be dubious!
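Since several of these relationships look curvilinear rather than linear, a rank-based correlation can serve as a cross-check; Spearman's correlation only assumes a monotone relationship. A minimal sketch, not part of the original solution:

# Spearman (rank) correlations are less sensitive to curvature than Pearson's r
round(cor(Auto[, -9], method = "spearman"), 2)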

(c). Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.

We can tell the lm() function to use all other variables in a dataset by putting a "." after the ~ in the formula specification. To remove a variable, we can subtract it after the ".". For example, in this case:

# name is omitted from the first, fully written-out formula
my_lm = lm(mpg ~ cylinders+displacement+horsepower+weight+acceleration+year+origin, data=Auto)
my_lm = lm(mpg ~ . - name, data=Auto)

Both give us the same result.

summary(my_lm)

Call:
lm(formula = mpg ~ . - name, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max
-9.5903 -2.1565 -0.1169  1.8690 13.0604

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
cylinders     -0.493376   0.323282  -1.526  0.12780
displacement   0.019896   0.007515   2.647  0.00844 **
horsepower    -0.016951   0.013787  -1.230  0.21963
weight        -0.006474   0.000652  -9.929  < 2e-16 ***
acceleration   0.080576   0.098845   0.815  0.41548
year           0.750773   0.050973  14.729  < 2e-16 ***
origin         1.426141   0.278136   5.127 4.67e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared:  0.8215,  Adjusted R-squared:  0.8182
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Comment on the output. For instance:

i. Is there a relationship between the predictors and the response?

The global F test has a very small p-value, indicating there is some relationship between this set of predictors and the response.

ii. Which predictors appear to have a statistically significant relationship to the response?

Examination of the p-values in the coefficient table suggests that displacement, weight, year, and origin have a statistically significant relationship with mpg, while cylinders, horsepower, and acceleration do not.

iii. What does the coefficient for the year variable suggest?

The regression coefficient for year is 0.750773, which suggests that, holding the other predictors fixed, each one-year increase in a car's model year is associated with an increase of about 0.75 mpg; i.e., newer cars are more fuel efficient.
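As a quick complement to reading p-values off the coefficient table, confidence intervals for the coefficients tell the same story: intervals that exclude zero correspond to the predictors flagged as significant above. A minimal sketch, not part of the original solution:

# 95% confidence intervals for the regression coefficients;
# intervals excluding 0 match the significant predictors above
confint(my_lm, level = 0.95)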

(d). Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

In R, plot() is what is called a "generic" function. This means that you can pass it many different types of objects and it will do different things depending on what is passed. In this problem, my_lm is an lm object, so when we call plot(my_lm), R will immediately call the method plot.lm(), which is the plotting function for lm objects.

par(mfrow=c(2,2))  # creates a 2 x 2 panel display
plot(my_lm)

[Figure: the four standard lm diagnostic panels, Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. Points 323, 326, and 327 are flagged in the residual and Q-Q panels; points 327, 394, and 14 are highlighted in the Residuals vs Leverage panel.]

From the plots, it seems:

The residual-vs-fitted plot has a strong curve, indicating issues with the fit (the relationship is not purely linear).
The assumption of normality does not seem to be violated.
Observation 14 has high leverage.
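The curvature in the residual plot suggests trying a transformation of the response or of the predictors (part (f) explores predictor transformations). As one hypothetical remedy, not part of the original solution, re-fitting with log(mpg) as the response often straightens out this kind of pattern; the model name my_lm_log is illustrative:

# hypothetical follow-up: model log(mpg) and re-check the diagnostics
my_lm_log = lm(log(mpg) ~ . - name, data = Auto)
par(mfrow = c(2, 2))
plot(my_lm_log)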

(e). Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

Note that when specifying formulas, the * operator gives the multiplicative term as well as the first-order terms, and the : operator gives just the multiplicative term. This means that

lm(y ~ a*b)
lm(y ~ a + b + a:b)

would give the same result.

There are many choices for possible interaction effects. Typically, one needs to have knowledge of the problem in order to choose meaningful combinations of predictor variables. For example, in this problem, it seems logical that weight and acceleration may have some kind of joint impact on mpg, hence we can modify the model in part (c) to add this in.

my_lm_e = lm(mpg ~ . - name + weight:acceleration, data=Auto)
summary(my_lm_e)

Call:
lm(formula = mpg ~ . - name + weight:acceleration, data = Auto)

Residuals:
   Min     1Q Median     3Q    Max
-8.247 -2.048 -0.045  1.619 12.193

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)         -4.364e+01  5.811e+00  -7.511 4.18e-13 ***
cylinders           -2.141e-01  3.078e-01  -0.696 0.487117
displacement         3.138e-03  7.495e-03   0.419 0.675622
horsepower          -4.141e-02  1.348e-02  -3.071 0.002287 **
weight               4.027e-03  1.636e-03   2.462 0.014268 *
acceleration         1.629e+00  2.422e-01   6.726 6.36e-11 ***
year                 7.821e-01  4.833e-02  16.184  < 2e-16 ***
origin               1.033e+00  2.686e-01   3.846 0.000141 ***
weight:acceleration -5.826e-04  8.408e-05  -6.928 1.81e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.141 on 383 degrees of freedom
Multiple R-squared:  0.8414,  Adjusted R-squared:  0.838
F-statistic: 253.9 on 8 and 383 DF,  p-value: < 2.2e-16

Note that our interaction effect is significant and the adjusted R-squared value increased, indicating a better fit. There are many different possibilities here, so your answers will likely vary.
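If you would rather screen many interactions at once instead of hand-picking one, R's formula syntax can expand all pairwise interactions among the predictors. This is only an exploratory sketch (not part of the original solution, and the model name my_lm_all2way is illustrative); with this many terms, individual p-values should be read cautiously:

# (. - name)^2 expands to all main effects plus all two-way interactions
my_lm_all2way = lm(mpg ~ (. - name)^2, data = Auto)
summary(my_lm_all2way)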

(f). Try a few different transformations of the variables, such as log(X), √X, X², etc. Comment on your findings.

my_lm_f = lm(mpg ~ . - name + log(weight) + sqrt(horsepower) + I(displacement^2) + I(cylinders^2), data=Auto)
summary(my_lm_f)

Call:
lm(formula = mpg ~ . - name + log(weight) + sqrt(horsepower) +
    I(displacement^2) + I(cylinders^2), data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max
-9.2216 -1.4972 -0.1142  1.4184 11.9541

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.228e+02  4.381e+01   2.803  0.00532 **
cylinders          3.360e-01  1.451e+00   0.232  0.81697
displacement      -3.765e-02  2.175e-02  -1.731  0.08421 .
horsepower         2.197e-01  6.684e-02   3.287  0.00111 **
weight             1.181e-03  2.074e-03   0.569  0.56949
acceleration      -2.036e-01  1.004e-01  -2.028  0.04323 *
year               7.654e-01  4.526e-02  16.911  < 2e-16 ***
origin             5.497e-01  2.679e-01   2.052  0.04088 *
log(weight)       -1.493e+01  6.714e+00  -2.223  0.02678 *
sqrt(horsepower)  -5.998e+00  1.493e+00  -4.018 7.06e-05 ***
I(displacement^2)  6.788e-05  3.773e-05   1.799  0.07279 .
I(cylinders^2)    -1.067e-02  1.164e-01  -0.092  0.92702
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.904 on 380 degrees of freedom
Multiple R-squared:  0.8654,  Adjusted R-squared:  0.8615
F-statistic: 222.2 on 11 and 380 DF,  p-value: < 2.2e-16

The log(weight) and sqrt(horsepower) terms are significant and the adjusted R-squared rises to 0.8615, so these transformations appear to capture some of the curvature noted in part (d). Again, there are many possibilities here, so your findings may vary.

2 Problem 13

(a). Using the rnorm() function, create a vector, x, containing 100 observations drawn from a N(0,1) distribution. This represents a feature, X.

set.seed(1)
x = rnorm(100)

(b). Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a N(0, 0.25) distribution, i.e. a normal distribution with mean zero and variance 0.25.

eps = rnorm(100, 0, sqrt(0.25))

(c). Using x and eps, generate a vector y according to the model

Y = -1 + 0.5X + ε

What is the length of the vector y? What are the values of β0 and β1 in this linear model?

y = -1 + 0.5*x + eps

The vector y has length 100, and the values of the intercept β0 and slope β1 are -1 and 0.5, respectively.

(d). Create a scatterplot displaying the relationship between x and y. Comment on what you observe.

plot(x, y)

[Figure: scatterplot of y against x.]

The points follow an approximately linear trend with slope near 0.5 and intercept near -1, with moderate scatter around that line coming from eps.

(e). Fit a least squares linear model to predict y using x. Comment on the model obtained. How do the estimates β̂0 and β̂1 compare to β0 and β1?

my_lm = lm(y ~ x)
summary(my_lm)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-0.93842 -0.30688 -0.06975  0.26970  1.17309

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.01885    0.04849 -21.010  < 2e-16 ***
x            0.49947    0.05386   9.273 4.58e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4814 on 98 degrees of freedom
Multiple R-squared:  0.4674,  Adjusted R-squared:  0.4619
F-statistic: 85.99 on 1 and 98 DF,  p-value: 4.583e-15

The estimates β̂0 = -1.019 and β̂1 = 0.499 are very close to the true values β0 = -1 and β1 = 0.5, and both are highly significant.

(f). Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.

plot(x, y)
abline(my_lm, lwd=3, col=2)
abline(-1, 0.5, lwd=3, col=3)
legend("bottomright", legend = c("model fit", "pop. regression"), col=2:3, lwd=3)

[Figure: scatterplot of y against x with the fitted least squares line (red) and the population regression line (green), which nearly coincide. Legend in the bottom right.]

(g). Now fit a polynomial regression model that predicts y using x and x². Is there evidence that the quadratic term improves the model fit? Explain your answer.

lm_fit_sq = lm(y ~ x + I(x^2))
summary(lm_fit_sq)

Call:
lm(formula = y ~ x + I(x^2))

Residuals:
     Min       1Q   Median       3Q      Max
-0.98252 -0.31270 -0.06441  0.29014  1.13500

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.97164    0.05883 -16.517  < 2e-16 ***
x            0.50858    0.05399   9.420  2.4e-15 ***
I(x^2)      -0.05946    0.04238  -1.403    0.164
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.479 on 97 degrees of freedom
Multiple R-squared:  0.4779,  Adjusted R-squared:  0.4672
F-statistic: 44.4 on 2 and 97 DF,  p-value: 2.038e-14

The coefficient on I(x^2) has a p-value of 0.164, so there is no evidence that the quadratic term improves the model fit; this is expected, since the data were generated from a purely linear model.
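An equivalent way to judge the quadratic term, sketched below and not part of the original solution, is a partial F test comparing the nested linear and quadratic fits; with a single added term its p-value matches the t test above.

# partial F test: does adding I(x^2) significantly reduce the residual sum of squares?
anova(my_lm, lm_fit_sq)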

(h). Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. The model (3.39) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term in (b). Describe your results.

set.seed(1)
x = rnorm(100)
eps = rnorm(100, 0, 0.125)
y = -1 + 0.5*x + eps
lm.fit = lm(y~x)
summary(lm.fit)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-0.23461 -0.07672 -0.01744  0.06742  0.29327

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.00471    0.01212  -82.87   <2e-16 ***
x            0.49987    0.01347   37.12   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1203 on 98 degrees of freedom
Multiple R-squared:  0.9336,  Adjusted R-squared:  0.9329
F-statistic:  1378 on 1 and 98 DF,  p-value: < 2.2e-16

plot(x, y)
abline(lm.fit, lwd=3, col=2)
abline(-1, 0.5, lwd=3, col=3)
legend("bottomright", legend = c("model fit", "pop. regression"), col=2:3, lwd=3)

[Figure: scatterplot of the lower-noise data with the fitted (red) and population (green) regression lines, which are nearly indistinguishable. Legend in the bottom right.]

With less noise, the coefficient estimates are again very close to the true values, but their standard errors shrink and the R-squared rises to about 0.93; the fitted line lies almost exactly on top of the population line.

(i). Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data. The model (3.39) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term in (b). Describe your results.

set.seed(1)
x = rnorm(100)
eps = rnorm(100, 0, .75)
y = -1 + 0.5*x + eps
lm.fit = lm(y~x)
summary(lm.fit)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-1.4076 -0.4603 -0.1046  0.4046  1.7596

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.02827    0.07274 -14.136  < 2e-16 ***
x            0.49920    0.08080   6.179 1.48e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7221 on 98 degrees of freedom
Multiple R-squared:  0.2803,  Adjusted R-squared:  0.273
F-statistic: 38.18 on 1 and 98 DF,  p-value: 1.479e-08

With more noise, the estimates are still close to the true values, but their standard errors grow and the R-squared falls to about 0.28; the points scatter much more widely around the fitted line.

plot(x, y)
abline(lm.fit, lwd=3, col=2)
abline(-1, 0.5, lwd=3, col=3)
legend("bottomright", legend = c("model fit", "pop. regression"), col=2:3, lwd=3)

[Figure: scatterplot of the noisier data with the fitted (red) and population (green) regression lines. Legend in the bottom right.]
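Finally, a hypothetical wrap-up that goes slightly beyond the parts reproduced above: comparing confidence intervals for the coefficients across the three noise levels makes the pattern explicit, with the intervals widening as the error variance grows while still covering the true values -1 and 0.5 here. The helper ci_for_sd() and the noise vector z are illustrative names, not from the original lab.

# compare 95% confidence intervals for the coefficients at three noise levels
set.seed(1)
x <- rnorm(100)
z <- rnorm(100)                     # standard normal noise, rescaled below

ci_for_sd <- function(sd_eps) {
  y <- -1 + 0.5 * x + sd_eps * z    # same linear model, different error sd
  confint(lm(y ~ x))                # wider intervals for larger sd_eps
}

ci_for_sd(0.5)    # original noise level (variance 0.25)
ci_for_sd(0.125)  # less noise, as in (h)
ci_for_sd(0.75)   # more noise, as in (i)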