Handout 4: Simple Linear Regression


By: Brandon Berman

The following problem comes from Kokoska's Introductory Statistics: A Problem-Solving Approach. The data can be read into R using the following code:

> msm = read.csv("http://www.ics.uci.edu/~zhaoxia/teaching/stat120c/data/msmdata.csv")

1 Background and Maximum Likelihood Estimates

The European Food Safety Authority recently issued a scientific opinion on the public health risks related to mechanically separated meat (MSM). The analysis suggested that calcium could be used to distinguish between MSM and non-MSM products. A random sample of MSM poultry was obtained, and for each sample the deboner head pressure (in psi) and the amount of calcium (in ppm) were measured. The data are given in the following table.

    Pressure (in psi)    Calcium (in ppm)
           51                  573
           95                  654
          104                  581
          143                  709
           77                  560
          109                  629
          102                  623
           72                  560
          120                  598
          112                  577
           76                  600
          143                  666
           93                  616
           87                  514
           70                  586
           49                  584
          142                  634
          132                  632

Based on this data set we want to build a model that can predict how much calcium ($Y_i$) a sample will contain from the pressure ($x_i$) the deboner used on the poultry. Our model will be

    $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$, $i = 1, 2, \ldots, 18$.

A consequence of the definition above is that $Y_i \overset{\text{indep}}{\sim} N(\beta_0 + \beta_1 x_i, \sigma^2)$ for all $i = 1, 2, \ldots, n$.

In order to fit the model we first must find the maximum likelihood estimates of $\beta_0$, $\beta_1$, and $\sigma^2$. (Note that $n = 18$, but we are going to pretend for the time being that we do not know that fact.) The likelihood and log-likelihood are:

    $L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2} \right]$

    $\ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{n} \frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}$
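As an added illustration (not part of the original handout), this log-likelihood translates almost line-for-line into R. The column names Pressure and Calcium are taken from the lm() output later in the handout:

> # Added sketch: the log-likelihood above as an R function. The trial
> # parameter values in the example call below are arbitrary.
> loglik = function(beta0, beta1, sigma2, x, y) {
+     n = length(y)
+     -n/2*log(2*pi*sigma2) - sum( (y - beta0 - beta1*x)^2 )/(2*sigma2)
+ }
> # e.g. loglik(500, 1, 1200, msm$Pressure, msm$Calcium)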

Now we take the derivative of the log-likelihood with respect to each of $\beta_0$, $\beta_1$, and $\sigma^2$:

    $\frac{\partial \ell}{\partial \beta_0} = \sum_{i=1}^{n} \frac{y_i - \beta_0 - \beta_1 x_i}{\sigma^2}$

    $\frac{\partial \ell}{\partial \beta_1} = \sum_{i=1}^{n} \frac{x_i (y_i - \beta_0 - \beta_1 x_i)}{\sigma^2}$

    $\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \sum_{i=1}^{n} \frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2(\sigma^2)^2}$

Set the three equations above equal to zero and solve simultaneously for $\beta_0$, $\beta_1$, and $\sigma^2$; the solutions are the maximum likelihood estimates.
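As a hint for that derivation (an added sketch, not in the original handout), setting the first two derivatives to zero and multiplying through by $\sigma^2$ gives the normal equations:

    $\sum_{i=1}^{n} y_i = n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i, \qquad \sum_{i=1}^{n} x_i y_i = \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2$

Dividing the first equation by $n$ shows the fitted line passes through $(\bar{x}, \bar{y})$, so $\beta_0 = \bar{y} - \beta_1 \bar{x}$; substituting this into the second equation and rearranging yields the slope formula, and plugging both back into the third derivative gives $\hat{\sigma}^2$.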

Prove to yourself that the following are the maximum likelihood estimates:

    $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

    $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

    $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$

To find the maximum likelihood estimates of $\beta_0$ and $\beta_1$ in R, we can use the following code:

> n = dim(msm)[1]
> x = msm$Pressure
> y = msm$Calcium
> beta1 = sum( ( x - mean(x) ) * ( y - mean(y) ) )/sum( ( x - mean(x) )^2 )
> beta0 = mean(y) - beta1 * mean(x)
> beta0
[1] 505.2149
> beta1
[1] 1.014143

To interpret $\hat{\beta}_0$'s value of 505.215, we would say that the expected calcium, given the machine is set to a pressure of 0 psi, is 505 ppm. Note that interpretations of $\hat{\beta}_0$ are often nonsensical, as in this case: when the machine is set to 0 psi it cannot separate the meat at all. What is usually of scientific interest is the interpretation of $\hat{\beta}_1$. In this example, one way to interpret $\hat{\beta}_1$ is to say that for a 1 psi increase in pressure, the calcium concentration is expected to increase by 1.01 ppm.

Typically we do not use the maximum likelihood estimator of $\sigma^2$ because it is biased (prove this fact to yourself). Instead, we use the unbiased estimate we sometimes refer to as MSE:

    $\text{MSE} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2} = \frac{\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{n - 2}$

To find the value of MSE using R, we can use the following code:

> yhat = beta0 + beta1*x
> MSE = sum( (y - yhat)^2 )/(n-2)
> MSE
[1] 1221.953
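As a numerical sanity check (an added aside, not in the original handout), we can maximize the log-likelihood directly with optim() and compare against the closed-form answers above:

> # The variance is parameterized on the log scale so it stays positive
> # during the search.
> negloglik = function(theta) {
+     sigma2 = exp(theta[3])
+     length(y)/2*log(2*pi*sigma2) + sum( (y - theta[1] - theta[2]*x)^2 )/(2*sigma2)
+ }
> fit = optim(c(500, 1, log(1000)), negloglik)
> # fit$par[1] and fit$par[2] should agree closely with beta0 and beta1;
> # exp(fit$par[3]) is the biased MLE of sigma^2 (i.e. SSE/n), not the MSE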

2 Hypothesis Testing: $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$

According to the assumptions we made,

    $\hat{\beta}_1 \sim N\left( \beta_1, \; \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)$

If we wanted to test the null hypothesis $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$, then our test statistic might be:

    $\text{test statistic} = \frac{\hat{\beta}_1 - 0}{\sqrt{\sigma^2 / \sum_{i=1}^{n} (x_i - \bar{x})^2}}$

However, there is a problem with the test statistic above: we do not know the value of $\sigma^2$, so we substitute the unbiased estimate we previously found:

    $\text{test statistic} = \frac{\hat{\beta}_1 - 0}{\sqrt{\text{MSE} / \sum_{i=1}^{n} (x_i - \bar{x})^2}}$

Then, when the null hypothesis is true,

    $\frac{\hat{\beta}_1 - 0}{\sqrt{\text{MSE} / \sum_{i=1}^{n} (x_i - \bar{x})^2}} \sim t_{(n-2)}$

To carry out the equivalent test in R, we could use the following code:

> test.stat = beta1/sqrt( MSE / sum( ( x - mean(x) )^2 ) )
> test.stat
[1] 3.569207
> alpha = 0.05
> # Rejection region approach
> cutoff = qt( c(alpha/2, 1-alpha/2), df = n - 2 )
> cutoff
[1] -2.119905  2.119905
> (test.stat <= cutoff[1]) | (test.stat >= cutoff[2])
[1] TRUE
>
> # p-value approach
> p.value = 2*pt( test.stat, df = n - 2, lower.tail = F)
> p.value
[1] 0.002560482
> p.value <= alpha
[1] TRUE
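One small caution (an added note, not in the original handout): doubling the upper-tail probability gives the two-sided p-value only because the test statistic here is positive. Wrapping the statistic in abs() works for either sign:

> # sign-safe version; returns the same value here since test.stat > 0
> 2*pt( abs(test.stat), df = n - 2, lower.tail = FALSE )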

From the hypothesis test above we can now draw a conclusion. If we use the rejection region approach, we reject $H_0$ whenever the test statistic falls into either $(-\infty, -2.12]$ or $[2.12, \infty)$. Since our test statistic is 3.57, we reject $H_0$ and conclude significance. Of course, no conclusion is complete without referencing the context of the problem, so here we would conclude that calcium concentration in MSM poultry is linearly associated with the pressure of the separation machine.

If we use the p-value approach, we compare our p-value against the pre-selected significance level of $\alpha = 0.05$. Since the p-value of 0.0026 is less than 0.05, we reject the null hypothesis and conclude there is a significant relationship. As before, to complete the conclusion we need to state which relationship is significant: the linear relationship between calcium concentration in MSM poultry and the pressure of the separation machine.

3 Confidence Interval for the Estimated Mean and for a New Observation at a Given Point

There are a few other things we might be interested in examining with our model. Suppose we want a confidence interval for the estimated mean response at a given point $x = x_h$. From class we know that such a confidence interval has the following formula:

    $\hat{Y}_h \pm t_{(n-2);\,1-\alpha/2} \sqrt{\text{MSE} \left( \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)}$

Suppose we are interested in a 95% confidence interval for the mean calcium at 100 psi. To do this in R we could use the following code:

> y_100 = beta0 + beta1*100
> y_100
[1] 606.6292
> y_100 + c(-1,1)*qt(1-0.05/2, df = n-2)*sqrt( MSE*(1/n +
+     (100-mean(x))^2/sum( (x-mean(x))^2 ) ) )
[1] 589.1457 624.1127

Some students find using vectors in R challenging, so an alternative is to produce the lower and upper endpoints of the interval separately, like so:

> lower = y_100 - qt(1-0.05/2, df = n-2)*sqrt( MSE*(1/n +
+     (100-mean(x))^2/sum( (x-mean(x))^2 ) ) )
> upper = y_100 + qt(1-0.05/2, df = n-2)*sqrt( MSE*(1/n +
+     (100-mean(x))^2/sum( (x-mean(x))^2 ) ) )
> lower
[1] 589.1457
> upper
[1] 624.1127

Notice that both ways produce identical results. The 95% confidence interval for the mean calcium when the pressure is 100 psi is (589.1, 624.1). We interpret this confidence interval by saying, "We are 95% confident that the mean amount of calcium at 100 psi is between 589.1 and 624.1 ppm."

In R, there is a built-in function that will produce the same results; it needs the fitted lm object (which also appears in Section 4):

> mod = lm(Calcium ~ Pressure, data = msm)
> predict(mod, newdata = list(Pressure = 100), level = 0.95, interval = "confidence")
     fit      lwr      upr
606.6292 589.1457 624.1127

Now suppose there is a new observation at 100 psi. To calculate a prediction interval for that new observation we use the formula:

    $\hat{Y}_h \pm t_{(n-2);\,1-\alpha/2} \sqrt{\text{MSE} \left( 1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)}$

To calculate a 95% prediction interval for the calcium of a new observation at 100 psi we could use the following R code:

> y_100 + c(-1,1)*qt(1-0.05/2, df = n-2)*sqrt( MSE*(1/n + 1 +
+     (100-mean(x))^2/sum( (x-mean(x))^2 ) ) )
[1] 530.4903 682.7681

or,

> lower = y_100 - qt(1-0.05/2, df = n-2)*sqrt( MSE*(1/n + 1 +
+     (100-mean(x))^2/sum( (x-mean(x))^2 ) ) )
> upper = y_100 + qt(1-0.05/2, df = n-2)*sqrt( MSE*(1/n + 1 +
+     (100-mean(x))^2/sum( (x-mean(x))^2 ) ) )

> lower
[1] 530.4903
> upper
[1] 682.7681

The 95% prediction interval we just solved for has the following interpretation: "We are 95% confident that a new observation taken at a pressure of 100 psi will have a calcium amount between 530.5 and 682.8 ppm." In R the same built-in function can be used to find the prediction interval:

> predict(mod, newdata = list(Pressure = 100), level = 0.95, interval = "prediction")
     fit      lwr      upr
606.6292 530.4903 682.7681

4 Sums of Squares

Finally, we can find the sum of squares due to regression, the sum of squares due to error, and the total sum of squares. Recall the formulas:

    $\text{SSE (sometimes called RSS)} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

    $\text{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

    $\text{SSTO} = \sum_{i=1}^{n} (y_i - \bar{y})^2$

In R, we can find these values easily using the following code:

> yhat = beta0 + beta1*x
> SSE = sum( (y - yhat)^2 )   # called RSS, residual sum of squares
> SSE
[1] 19551.25
> SSReg = sum( (yhat - mean(y))^2 )   # SS Regression
> SSReg
[1] 15566.75
> SSTO = sum( (y - mean(y))^2 )
> SSTO
[1] 35118
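These three quantities always satisfy the decomposition $\text{SSTO} = \text{SSR} + \text{SSE}$, which we can verify directly here (an added check, not in the original handout):

> SSE + SSReg   # equals SSTO, as the decomposition guarantees
[1] 35118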

> Rsquared = SSReg/SSTO
> Rsquared
[1] 0.4432698

Of course, there is an easy way to do all of these tasks in R without calculating everything by hand:

> mod = lm(Calcium ~ Pressure, data = msm)
> summary(mod)

Call:
lm(formula = Calcium ~ Pressure, data = msm)

Residuals:
    Min      1Q  Median      3Q     Max
 -79.44  -22.04   11.52   16.37   58.76

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 505.2149    29.2356  17.281 8.99e-12 ***
Pressure      1.0141     0.2841   3.569  0.00256 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 34.96 on 16 degrees of freedom
Multiple R-squared:  0.4433,    Adjusted R-squared:  0.4085
F-statistic: 12.74 on 1 and 16 DF,  p-value: 0.00256

5 Checking Model Assumptions

Whichever way we fit the model, we need to check that the assumptions made for linear regression are reasonable. One assumption we made is that the variance is constant for all observations. To check it we can plot the standardized residuals against the fitted values:

> stdresid = scale(y - yhat)   # standardized residuals, as used in the QQ plot below
> plot(x = yhat, y = stdresid, xlab = "Predicted Calcium (in ppm)", ylab =
+     "Standardized Residuals", main = "Standardized Residuals\nvs. Fitted")
> abline(h = 0, lwd = 1, lty = 2, col = "grey")

The plot that results from the code above is Figure 1.

[Figure 1: Standardized residuals vs. fitted values. The cloud of points should be centered around zero and its spread should remain fairly constant.]

The next assumption we can check is that the errors are normally distributed. To do so we can create a QQ plot of the residuals:

> qqnorm( scale(y - yhat) )
> qqline( scale(y - yhat), lty = 2, lwd = 1, col = "grey" )

The plot the code generates is in Figure 2. Finally, one of the plots often included in a regression analysis is a scatter plot with the regression line added (see Figure 3). The following code generates that:

> plot(x = x, y = y, xlab = "Pressure (in psi)", ylab = "Calcium (in ppm)",
+     main = "Scatterplot of data with\nregression line added")
> curve(beta0 + beta1*x, from = min(x), to = max(x), add = TRUE, lwd = 2, lty = 1)

[Figure 2: Normal QQ plot of the standardized residuals. The points should follow the reference line if the data are normally distributed; here they stay fairly close to it.]

[Figure 3: Scatter plot of calcium (in ppm) vs. pressure (in psi) with the fitted regression line added.]
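As a closing aside (not part of the original handout), R can produce similar diagnostic plots directly from the fitted lm object:

> plot(mod, which = 1)   # residuals vs. fitted values, analogous to Figure 1
> plot(mod, which = 2)   # normal Q-Q plot of standardized residuals, analogous to Figure 2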