Analytics 512: Homework # 2 Tim Ahn February 9, 2016


Chapter 3

Problem 1 (# 3)

Suppose we have a data set with five predictors, $X_1$ = GPA, $X_2$ = IQ, $X_3$ = Gender (1 for Female and 0 for Male), $X_4$ = Interaction between GPA and IQ, and $X_5$ = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get $\hat\beta_0 = 50$, $\hat\beta_1 = 20$, $\hat\beta_2 = 0.07$, $\hat\beta_3 = 35$, $\hat\beta_4 = 0.01$, $\hat\beta_5 = -10$.

(a) Which answer is correct, and why?

iii. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough.

The least squares regression line is

$$\hat{Y} = 50 + 20\,\mathrm{GPA} + 0.07\,\mathrm{IQ} + 35\,\mathrm{Gender} + 0.01\,(\mathrm{GPA} \times \mathrm{IQ}) - 10\,(\mathrm{GPA} \times \mathrm{Gender})$$

$$= 50 + 20\,\mathrm{GPA} + 0.07\,\mathrm{IQ} + 0.01\,(\mathrm{GPA} \times \mathrm{IQ}) + (35 - 10\,\mathrm{GPA})\,\mathrm{Gender}$$

With Male = 0 as our baseline, the female-minus-male difference is $35 - 10\,\mathrm{GPA}$, which is negative when GPA > 3.5. Males with a GPA higher than 3.5 therefore earn more on average than females.

(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.

$$\hat{Y} = 50 + 20(4) + 0.07(110) + 35(1) + 0.01(4 \times 110) - 10(4 \times 1) = 137.1$$

The predicted salary would be $137,100.

(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

False - A small coefficient for the interaction term does not by itself imply little evidence of an interaction effect. Evidence is judged by the p-value of the term (the coefficient relative to its standard error), not by the raw size of the coefficient. The size of the coefficient also depends on the scale of the variables: the product GPA × IQ takes large values, so a coefficient of 0.01 can still contribute several thousand dollars, comparable to the GPA and IQ main effects.

Problem 2 (# 4)

I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon$.

(a) Suppose that the true relationship between X and Y is linear, i.e. $Y = \beta_0 + \beta_1 X + \epsilon$. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

We would expect the training RSS for the cubic regression to be lower, or at least no higher. The linear model is a special case of the cubic model (with $\beta_2 = \beta_3 = 0$), so least squares can always match the linear fit on the training data, and the extra terms let it follow some of the noise as well.

(b) Answer (a) using test rather than training RSS.

We would expect the test RSS of the linear regression to be lower. The cubic model is fit to the training data, and its extra flexibility follows noise that is specific to that sample rather than the true population regression line. Since the truth is already linear, this adds variance without reducing bias. Because of the irreducible error inherent in any particular test set, the difference may be small and is not guaranteed for every sample.

(c) Suppose that the true relationship between X and Y is not linear, but we don't know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

We would expect the training RSS for the cubic regression to be lower than that of the linear regression. Adding terms to the least squares fit always lets us fit the training data at least as closely, and here the extra terms can also capture some of the real non-linearity.

(d) Answer (c) using test rather than training RSS.

There is not enough information to determine which model will result in a lower test RSS. We know that the true relationship is not linear, but it could still be closer to linear than it is to cubic. If it is close to linear, the extra variance of the cubic fit may outweigh its reduction in bias; if it is far from linear, the cubic fit is likely to win.

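The expectation in parts (a) and (b) can be probed with a small simulation. The sketch below is not part of the original solution: the data-generating model ($Y = 1 + 2X + \epsilon$), the seed, and the helper `rss()` are assumptions chosen purely for illustration.

```r
# Assumed setup: the true relationship is linear, with standard normal noise.
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 100
train <- data.frame(x = rnorm(n))
train$y <- 1 + 2 * train$x + rnorm(n)
test <- data.frame(x = rnorm(n))
test$y <- 1 + 2 * test$x + rnorm(n)

# Linear and cubic least squares fits on the training data only
fit_lin <- lm(y ~ x, data = train)
fit_cub <- lm(y ~ x + I(x^2) + I(x^3), data = train)

# Hypothetical helper: residual sum of squares of a fit on a given data set
rss <- function(fit, data) sum((data$y - predict(fit, newdata = data))^2)

train_rss <- c(linear = rss(fit_lin, train), cubic = rss(fit_cub, train))
test_rss  <- c(linear = rss(fit_lin, test),  cubic = rss(fit_cub, test))
print(train_rss)
print(test_rss)
```

On the training data the cubic RSS can never exceed the linear RSS because the models are nested. On test data either model can come out ahead for a given draw, which is why the answer to (b) is stated as an expectation rather than a guarantee.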
Problem 3 (# 8)

This question involves the use of simple linear regression on the Auto data set.

(a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output.

    library(ISLR)
    autolm = lm(mpg ~ horsepower, data = Auto)
    summary(autolm)

    Call:
    lm(formula = mpg ~ horsepower, data = Auto)

    Residuals:
    Min 1Q Median 3Q Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                              <2e-16 ***
    horsepower                               <2e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 390 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 1 and 390 DF, p-value: < 2.2e-16

i. Is there a relationship between the predictor and the response?

The t-statistic for the predictor (horsepower) is large in absolute value, giving a very low p-value. This indicates that there is a relationship between the predictor and the response.

ii. How strong is the relationship between the predictor and the response?

The $R^2$ value of 0.6059 is relatively high: horsepower explains about 61% of the variance in mpg, so it is a fairly strong predictor.

iii. Is the relationship between the predictor and the response positive or negative?

The coefficient of the predictor is negative, which implies a negative relationship: mpg decreases as horsepower increases.

iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?

The prediction is $\hat{y} = \hat\beta_0 + \hat\beta_1 x$ evaluated at $x = 98$.

    # Predicted mpg associated with horsepower of 98
    predict(autolm, data.frame(horsepower = 98))

    # 95% confidence interval
    predict(autolm, data.frame(horsepower = 98), interval = "confidence")
    fit lwr upr

    # 95% prediction interval
    predict(autolm, data.frame(horsepower = 98), interval = "prediction")
    fit lwr upr

(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.

    plot(Auto$horsepower, Auto$mpg)
    abline(autolm, lwd = 2, col = 2)

[Figure: scatterplot of Auto$mpg against Auto$horsepower with the least squares regression line]

(c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

    par(mfrow = c(2, 2))
    plot(autolm)

[Figure: diagnostic plots for autolm. Panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance)]

The Residuals vs Fitted plot shows a U-shape, with mostly positive residuals at the low and high fitted values. This indicates that the linear fit is biased at both ends, i.e. the relationship is non-linear. The Q-Q plot has a slight bend, which could indicate that the residuals are not normally distributed.

Problem 4 (# 9)

This question involves the use of multiple linear regression on the Auto data set.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

    pairs(Auto)

[Figure: scatterplot matrix of mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin, name]

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

    cor(Auto[1:8])

[Output: correlation matrix with rows and columns mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin]

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.

    autolm2 <- lm(mpg ~ . - name, data = Auto)
    summary(autolm2)

    Call:
    lm(formula = mpg ~ . - name, data = Auto)

    Residuals:
    Min 1Q Median 3Q Max

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                        ***
    cylinders
    displacement                                       **
    horsepower
    weight                                    < 2e-16 ***
    acceleration
    year                                      < 2e-16 ***
    origin                                       e-07 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 384 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 7 and 384 DF, p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?

The F-statistic is very high, with a p-value below 2.2e-16, which indicates that at least one of the predictors is related to the response.

ii. Which predictors appear to have a statistically significant relationship to the response?

The low p-values for displacement, weight, year, and origin indicate a statistically significant relationship to mpg.

iii. What does the coefficient for the year variable suggest?

Holding the other predictors fixed, each additional model year is associated with an increase in fuel efficiency of approximately 0.75 mpg.

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

    par(mfrow = c(2, 2))
    plot(autolm2)

[Figure: diagnostic plots for autolm2. Panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance)]

The U-shape in the Residuals vs Fitted plot indicates a poor fit, with bias at the upper and lower fitted values. The plot also shows a handful of possible outliers with residuals above 10. The Residuals vs Leverage plot has a high leverage point at observation 14.

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

    autolm3 <- lm(mpg ~ . * ., data = Auto[, 1:8])
    summary(autolm3)

    Call:
    lm(formula = mpg ~ . * ., data = Auto[, 1:8])

    Residuals:
    Min 1Q Median 3Q Max

    Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
    (Intercept)
    cylinders
    displacement                                                   *
    horsepower
    weight
    acceleration                                                   **
    year
    origin                                                         **
    cylinders:displacement
    cylinders:horsepower
    cylinders:weight
    cylinders:acceleration
    cylinders:year
    cylinders:origin
    displacement:horsepower
    displacement:weight
    displacement:acceleration
    displacement:year                                              *
    displacement:origin
    horsepower:weight
    horsepower:acceleration
    horsepower:year
    horsepower:origin
    weight:acceleration
    weight:year
    weight:origin
    acceleration:year                                              *
    acceleration:origin                                            **
    year:origin
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 363 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 28 and 363 DF, p-value: < 2.2e-16

The interactions between displacement and year, acceleration and year, and acceleration and origin all have low p-values that indicate statistical significance.
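For reference, the two interaction operators differ as follows: `a * b` expands to `a + b + a:b`, while `a:b` adds only the product term. With 7 predictors, `. * .` therefore produces 7 main effects plus 21 pairwise interactions, which agrees with the 28 model degrees of freedom reported above. A minimal sketch on simulated data (the data frame `d` and its columns `a`, `b`, `y` are hypothetical, not taken from Auto):

```r
# Hypothetical data with a genuine a-by-b interaction
set.seed(1)  # arbitrary seed, only for reproducibility
d <- data.frame(a = rnorm(40), b = rnorm(40))
d$y <- 1 + d$a - d$b + 0.5 * d$a * d$b + rnorm(40)

fit_star  <- lm(y ~ a * b, data = d)        # main effects plus interaction
fit_colon <- lm(y ~ a + b + a:b, data = d)  # the same model, written out
fit_only  <- lm(y ~ a:b, data = d)          # interaction term without main effects

names(coef(fit_star))
names(coef(fit_only))
choose(7, 2) + 7  # number of non-intercept terms generated by . * . with 7 predictors
```

Fitting only `a:b` without the main effects is rarely what is wanted, which is one reason `*` is the more common choice.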

(f) Try a few different transformations of the variables, such as $\log(X)$, $\sqrt{X}$, $X^2$. Comment on your findings.

    autolmlog <- lm(mpg ~ log(horsepower) + log(weight) + log(acceleration), data = Auto)
    summary(autolmlog)

    Call:
    lm(formula = mpg ~ log(horsepower) + log(weight) + log(acceleration), data = Auto)

    Residuals:
    Min 1Q Median 3Q Max

    Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                    < 2e-16 ***
    log(horsepower)                                   e-09 ***
    log(weight)                                       e-11 ***
    log(acceleration)                                       **
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 388 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 3 and 388 DF, p-value: < 2.2e-16

    autolmx2 <- lm(mpg ~ (horsepower)^2 + (weight)^2 + (acceleration)^2, data = Auto)
    summary(autolmx2)

    Call:
    lm(formula = mpg ~ (horsepower)^2 + (weight)^2 + (acceleration)^2, data = Auto)

    Residuals:
    Min 1Q Median 3Q Max

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)                               < 2e-16 ***
    horsepower                                         **
    weight                                    < 2e-16 ***
    acceleration
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 388 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 3 and 388 DF, p-value: < 2.2e-16
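One caveat on the squared model above: inside an R formula, `^` is a term-crossing operator rather than arithmetic, so `(horsepower)^2` reduces to plain `horsepower`. This is why the coefficient table lists untransformed names. Wrapping the expression in `I()` gives a genuine square. A minimal sketch on simulated data (the data frame `d` and its columns are hypothetical, not taken from Auto):

```r
# Hypothetical data with a quadratic relationship between x and y
set.seed(1)  # arbitrary seed, only for reproducibility
d <- data.frame(x = runif(50, 1, 10))
d$y <- 3 + 0.5 * d$x^2 + rnorm(50)

fit_plain  <- lm(y ~ x, data = d)
fit_caret  <- lm(y ~ (x)^2, data = d)   # formula '^' is crossing: same model as y ~ x
fit_square <- lm(y ~ I(x^2), data = d)  # I() makes '^' ordinary arithmetic

names(coef(fit_caret))   # coefficient is labelled with the untransformed name
names(coef(fit_square))  # coefficient is labelled with the squared term
```

Applied to the Auto data, the analogous call would use terms such as `I(horsepower^2)`; its output would differ from the table shown above.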

    autolmsqrt <- lm(mpg ~ sqrt(horsepower) + sqrt(weight) + sqrt(acceleration), data = Auto)
    summary(autolmsqrt)

    Call:
    lm(formula = mpg ~ sqrt(horsepower) + sqrt(weight) + sqrt(acceleration), data = Auto)

    Residuals:
    Min 1Q Median 3Q Max

    Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                     < 2e-16 ***
    sqrt(horsepower)                                   e-06 ***
    sqrt(weight)                                       e-14 ***
    sqrt(acceleration)
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 388 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 3 and 388 DF, p-value: < 2.2e-16

Applying the log function to each of the variables resulted in the highest $R^2$ value and F-statistic. It also provided the lowest individual p-values for horsepower and acceleration, while the model written with squared terms gave the lowest p-value for weight. Note, however, that inside an R formula (weight)^2 is read as plain weight, as the untransformed coefficient names in that output show, so that fit is really the untransformed model; a true squared term requires I(weight^2).

Problem 5 (# 12)

This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate $\hat\beta$ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

In order for the coefficient estimates to be the same in both circumstances, we need

$$\hat\beta = \left(\sum_{i=1}^{n} x_i y_i\right) \Big/ \left(\sum_{i'=1}^{n} x_{i'}^2\right) = \left(\sum_{i=1}^{n} x_i y_i\right) \Big/ \left(\sum_{i'=1}^{n} y_{i'}^2\right),$$

which holds when

$$\sum_{i'=1}^{n} x_{i'}^2 = \sum_{i'=1}^{n} y_{i'}^2.$$

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

    x <- rnorm(100)
    y <- x^2
    coefficients(lm(x ~ y))
    coefficients(lm(y ~ x))

    (Intercept) y

    (Intercept) x

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

    x <- rnorm(100)
    y <- x
    coefficients(lm(x ~ y))
    coefficients(lm(y ~ x))

    (Intercept) y
                e+00

    (Intercept) x
                e+00
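Problem 5 concerns regression without an intercept, whereas the calls above include one. The following sketch, on assumed simulated data, drops the intercept with `+ 0` and compares the fitted coefficients with formula (3.38). It also illustrates the condition from part (a): permuting `x` yields a `y` with the same sum of squares, so the two coefficients agree even though `y` is not equal to `x`. The variable names and the data-generating choices are illustrative assumptions, not the author's.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Closed-form no-intercept estimates, following (3.38)
beta_yx <- sum(x * y) / sum(x^2)  # Y onto X
beta_xy <- sum(x * y) / sum(y^2)  # X onto Y

# The same estimates from lm(), with the intercept removed via '+ 0'
fit_yx <- lm(y ~ x + 0)
fit_xy <- lm(x ~ y + 0)
c(coef(fit_yx), formula = beta_yx)
c(coef(fit_xy), formula = beta_xy)

# A permutation of x has the same sum of squares as x,
# so both no-intercept regressions give the same coefficient.
y2 <- sample(x)
coef_y2_on_x <- unname(coef(lm(y2 ~ x + 0)))
coef_x_on_y2 <- unname(coef(lm(x ~ y2 + 0)))
c(coef_y2_on_x, coef_x_on_y2)
```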
