Handout 4: Simple Linear Regression


By: Brandon Berman

The following problem comes from Kokoska's Introductory Statistics: A Problem-Solving Approach. The data can be read into R using the following code:

> msm = read.csv("http://www.ics.uci.edu/~zhaoxia/teaching/stat120c/data/msmdata.csv")

1 Background and Maximum Likelihood Estimates

The European Food Safety Authority recently issued a scientific opinion on the public health risks related to mechanically separated meat (MSM). The analysis suggested that calcium could be used to distinguish between MSM and non-MSM products. A random sample of MSM poultry was obtained, and for each sample the deboner head pressure (in psi) and the amount of calcium (in ppm) were measured. The data are given in the following table.

    Pressure (in psi)    Calcium (in ppm)
           51                  573
           95                  654
          104                  581
          143                  709
           77                  560
          109                  629
          102                  623
           72                  560
          120                  598
          112                  577
           76                  600
          143                  666
           93                  616
           87                  514
           70                  586
           49                  584
          142                  634
          132                  632

Based on this data set we want to build a model that can predict how much calcium ($Y_i$) a sample will contain from the pressure ($x_i$) the deboner used on the poultry. Our model will be

    $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$, $i = 1, 2, \ldots, 18$.

A consequence of the definition above is that $Y_i \overset{\text{indep}}{\sim} N(\beta_0 + \beta_1 x_i, \sigma^2)$ for all $i = 1, 2, \ldots, n$.

In order to fit the model we first must find the maximum likelihood estimates of $\beta_0$, $\beta_1$, and $\sigma^2$. (Note that $n = 18$, but we are going to pretend for the time being that we do not know that fact.) The likelihood and log-likelihood are:

    $L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2} \right]$

    $\ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{n} \frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}$
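As an added illustration (not part of the original handout), this log-likelihood translates almost line-for-line into R. The column names Pressure and Calcium are taken from the lm() output later in the handout:

> # Added sketch: the log-likelihood above as an R function. The trial
> # parameter values in the example call below are arbitrary.
> loglik = function(beta0, beta1, sigma2, x, y) {
+     n = length(y)
+     -n/2*log(2*pi*sigma2) - sum( (y - beta0 - beta1*x)^2 )/(2*sigma2)
+ }
> # e.g. loglik(500, 1, 1200, msm$Pressure, msm$Calcium)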

Now we take the derivative of the log-likelihood with respect to each of $\beta_0$, $\beta_1$, and $\sigma^2$:

    $\frac{\partial \ell}{\partial \beta_0} = \sum_{i=1}^{n} \frac{y_i - \beta_0 - \beta_1 x_i}{\sigma^2}$

    $\frac{\partial \ell}{\partial \beta_1} = \sum_{i=1}^{n} \frac{x_i (y_i - \beta_0 - \beta_1 x_i)}{\sigma^2}$

    $\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \sum_{i=1}^{n} \frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2(\sigma^2)^2}$

Set the three equations above equal to zero and solve simultaneously for $\beta_0$, $\beta_1$, and $\sigma^2$; the solutions are the maximum likelihood estimates.
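As a hint for that derivation (an added sketch, not in the original handout), setting the first two derivatives to zero and multiplying through by $\sigma^2$ gives the normal equations:

    $\sum_{i=1}^{n} y_i = n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i, \qquad \sum_{i=1}^{n} x_i y_i = \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2$

Dividing the first equation by $n$ shows the fitted line passes through $(\bar{x}, \bar{y})$, so $\beta_0 = \bar{y} - \beta_1 \bar{x}$; substituting this into the second equation and rearranging yields the slope formula, and plugging both back into the third derivative gives $\hat{\sigma}^2$.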

Prove to yourself that the following are the maximum likelihood estimates:

    $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

    $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

    $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$

To find the maximum likelihood estimates of $\beta_0$ and $\beta_1$ in R, we can use the following code:

> n = dim(msm)[1]
> x = msm$Pressure
> y = msm$Calcium
> beta1 = sum( ( x - mean(x) ) * ( y - mean(y) ) )/sum( ( x - mean(x) )^2 )
> beta0 = mean(y) - beta1 * mean(x)
> beta0
[1] 505.2149
> beta1
[1] 1.014143

To interpret $\hat{\beta}_0$'s value of 505.215, we would say that the expected calcium, given the machine is set to a pressure of 0 psi, is 505 ppm. Note that interpretations of $\hat{\beta}_0$ are often nonsensical, as in this case: when the machine is set to 0 psi it cannot separate the meat at all. What is usually of scientific interest is the interpretation of $\hat{\beta}_1$. In this example, one way to interpret $\hat{\beta}_1$ is to say that for a 1 psi increase in pressure, the calcium concentration is expected to increase by 1.01 ppm.

Typically we do not use the maximum likelihood estimator of $\sigma^2$ because it is biased (prove this fact to yourself). Instead, we use the unbiased estimate we sometimes refer to as MSE:

    $\text{MSE} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2} = \frac{\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{n - 2}$

To find the value of MSE using R, we can use the following code:

> yhat = beta0 + beta1*x
> MSE = sum( (y - yhat)^2 )/(n-2)
> MSE
[1] 1221.953
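As a numerical sanity check (an added aside, not in the original handout), we can maximize the log-likelihood directly with optim() and compare against the closed-form answers above:

> # The variance is parameterized on the log scale so it stays positive
> # during the search.
> negloglik = function(theta) {
+     sigma2 = exp(theta[3])
+     length(y)/2*log(2*pi*sigma2) + sum( (y - theta[1] - theta[2]*x)^2 )/(2*sigma2)
+ }
> fit = optim(c(500, 1, log(1000)), negloglik)
> # fit$par[1] and fit$par[2] should agree closely with beta0 and beta1;
> # exp(fit$par[3]) is the biased MLE of sigma^2 (i.e. SSE/n), not the MSE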

2 Hypothesis Testing: $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$

According to the assumptions we made,

    $\hat{\beta}_1 \sim N\left( \beta_1, \; \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)$

If we wanted to test the null hypothesis $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$, then our test statistic might be:

    $\text{test statistic} = \frac{\hat{\beta}_1 - 0}{\sqrt{\sigma^2 / \sum_{i=1}^{n} (x_i - \bar{x})^2}}$

However, there is a problem with the test statistic above: we do not know the value of $\sigma^2$, so we substitute the unbiased estimate we previously found:

    $\text{test statistic} = \frac{\hat{\beta}_1 - 0}{\sqrt{\text{MSE} / \sum_{i=1}^{n} (x_i - \bar{x})^2}}$

Then, when the null hypothesis is true,

    $\frac{\hat{\beta}_1 - 0}{\sqrt{\text{MSE} / \sum_{i=1}^{n} (x_i - \bar{x})^2}} \sim t_{(n-2)}$

To carry out the equivalent test in R, we could use the following code:

> test.stat = beta1/sqrt( MSE / sum( ( x - mean(x) )^2 ) )
> test.stat
[1] 3.569207
> alpha = 0.05
> # Rejection region approach
> cutoff = qt( c(alpha/2, 1-alpha/2), df = n - 2 )
> cutoff
[1] -2.119905  2.119905
> (test.stat <= cutoff[1]) | (test.stat >= cutoff[2])
[1] TRUE
>
> # p-value approach
> p.value = 2*pt( test.stat, df = n - 2, lower.tail = F)
> p.value
[1] 0.002560482
> p.value <= alpha
[1] TRUE
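One small caution (an added note, not in the original handout): doubling the upper-tail probability gives the two-sided p-value only because the test statistic here is positive. Wrapping the statistic in abs() works for either sign:

> # sign-safe version; returns the same value here since test.stat > 0
> 2*pt( abs(test.stat), df = n - 2, lower.tail = FALSE )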

From the hypothesis test above we can now draw a conclusion. If we use the rejection region approach, we reject $H_0$ whenever the test statistic falls into either $(-\infty, -2.12]$ or $[2.12, \infty)$. Since our test statistic is 3.57, we reject $H_0$ and conclude significance. Of course, no conclusion is complete without referencing the context of the problem, so here we would conclude that calcium concentration in MSM poultry is linearly associated with the pressure of the separation machine.

If we use the p-value approach, we compare our p-value against the pre-selected significance level of $\alpha = 0.05$. Since the p-value of 0.0026 is less than 0.05, we reject the null hypothesis and conclude there is a significant relationship. As before, to complete the conclusion we need to state which relationship is significant: the linear relationship between calcium concentration in MSM poultry and the pressure of the separation machine.

3 Confidence Interval for the Estimated Mean and for a New Observation at a Given Point

There are a few other things we might be interested in examining with our model. Suppose we want a confidence interval for the estimated mean response at a given point $x = x_h$. From class we know that such a confidence interval has the following formula:

    $\hat{Y}_h \pm t_{(n-2);\,1-\alpha/2} \sqrt{\text{MSE} \left( \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)}$

Suppose we are interested in a 95% confidence interval for the mean calcium at 100 psi. To do this in R we could use the following code:

> y_100 = beta0 + beta1*100
> y_100
[1] 606.6292
> y_100 + c(-1,1)*qt(1-0.05/2, df = n-2)*sqrt( MSE*(1/n +
+     (100-mean(x))^2/sum( (x-mean(x))^2 ) ) )
[1] 589.1457 624.1127

Some students find using vectors in R challenging, so an alternative is to produce the lower and upper endpoints of the interval separately, like so:

> lower = y_100 - qt(1-0.05/2, df = n-2)*sqrt( MSE*(1/n +
+     (100-mean(x))^2/sum( (x-mean(x))^2 ) ) )
> upper = y_100 + qt(1-0.05/2, df = n-2)*sqrt( MSE*(1/n +
+     (100-mean(x))^2/sum( (x-mean(x))^2 ) ) )
> lower
[1] 589.1457
> upper
[1] 624.1127

Notice that both ways produce identical results. The 95% confidence interval for the mean calcium when the pressure is 100 psi is (589.1, 624.1). We interpret this confidence interval by saying, "We are 95% confident that the mean amount of calcium at 100 psi is between 589.1 and 624.1 ppm."

In R, there is a built-in function that will produce the same results; it needs the fitted lm object (which also appears in Section 4):

> mod = lm(Calcium ~ Pressure, data = msm)
> predict(mod, newdata = list(Pressure = 100), level = 0.95, interval = "confidence")
     fit      lwr      upr
606.6292 589.1457 624.1127

Now suppose there is a new observation at 100 psi. To calculate a prediction interval for that new observation we use the formula:

    $\hat{Y}_h \pm t_{(n-2);\,1-\alpha/2} \sqrt{\text{MSE} \left( 1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)}$

To calculate a 95% prediction interval for the calcium of a new observation at 100 psi we could use the following R code:

> y_100 + c(-1,1)*qt(1-0.05/2, df = n-2)*sqrt( MSE*(1/n + 1 +
+     (100-mean(x))^2/sum( (x-mean(x))^2 ) ) )
[1] 530.4903 682.7681

or,

> lower = y_100 - qt(1-0.05/2, df = n-2)*sqrt( MSE*(1/n + 1 +
+     (100-mean(x))^2/sum( (x-mean(x))^2 ) ) )
> upper = y_100 + qt(1-0.05/2, df = n-2)*sqrt( MSE*(1/n + 1 +
+     (100-mean(x))^2/sum( (x-mean(x))^2 ) ) )

> lower
[1] 530.4903
> upper
[1] 682.7681

The 95% prediction interval we just solved for has the following interpretation: "We are 95% confident that a new observation taken at a pressure of 100 psi will have a calcium amount between 530.5 and 682.8 ppm." In R the same built-in function can be used to find the prediction interval:

> predict(mod, newdata = list(Pressure = 100), level = 0.95, interval = "prediction")
     fit      lwr      upr
606.6292 530.4903 682.7681

4 Sums of Squares

Finally, we can find the sum of squares due to regression, the sum of squares due to error, and the total sum of squares. Recall the formulas:

    $\text{SSE (sometimes called RSS)} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

    $\text{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

    $\text{SSTO} = \sum_{i=1}^{n} (y_i - \bar{y})^2$

In R, we can find these values easily using the following code:

> yhat = beta0 + beta1*x
> SSE = sum( (y - yhat)^2 )   # called RSS, residual sum of squares
> SSE
[1] 19551.25
> SSReg = sum( (yhat - mean(y))^2 )   # SS Regression
> SSReg
[1] 15566.75
> SSTO = sum( (y - mean(y))^2 )
> SSTO
[1] 35118
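These three quantities always satisfy the decomposition $\text{SSTO} = \text{SSR} + \text{SSE}$, which we can verify directly here (an added check, not in the original handout):

> SSE + SSReg   # equals SSTO, as the decomposition guarantees
[1] 35118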

> Rsquared = SSReg/SSTO
> Rsquared
[1] 0.4432698

Of course, there is an easy way to do all of these tasks in R without calculating everything by hand:

> mod = lm(Calcium ~ Pressure, data = msm)
> summary(mod)

Call:
lm(formula = Calcium ~ Pressure, data = msm)

Residuals:
    Min      1Q  Median      3Q     Max
 -79.44  -22.04   11.52   16.37   58.76

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 505.2149    29.2356  17.281 8.99e-12 ***
Pressure      1.0141     0.2841   3.569  0.00256 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 34.96 on 16 degrees of freedom
Multiple R-squared:  0.4433,    Adjusted R-squared:  0.4085
F-statistic: 12.74 on 1 and 16 DF,  p-value: 0.00256

5 Checking Model Assumptions

Whichever way we fit the model, we need to check that the assumptions made for linear regression are reasonable. One assumption we made is that the variance is constant for all observations. To check it we can plot the standardized residuals against the fitted values:

> stdresid = scale(y - yhat)   # standardized residuals, as used in the QQ plot below
> plot(x = yhat, y = stdresid, xlab = "Predicted Calcium (in ppm)", ylab =
+     "Standardized Residuals", main = "Standardized Residuals\nvs. Fitted")
> abline(h = 0, lwd = 1, lty = 2, col = "grey")

The plot that results from the code above is Figure 1.

[Figure 1: Standardized residuals vs. fitted values. The cloud of points should be centered around zero and its spread should remain fairly constant.]

The next assumption we can check is that the errors are normally distributed. To do so we can create a QQ plot of the residuals:

> qqnorm( scale(y - yhat) )
> qqline( scale(y - yhat), lty = 2, lwd = 1, col = "grey" )

The plot the code generates is in Figure 2. Finally, one of the plots often included in a regression analysis is a scatter plot with the regression line added (see Figure 3). The following code generates that:

> plot(x = x, y = y, xlab = "Pressure (in psi)", ylab = "Calcium (in ppm)",
+     main = "Scatterplot of data with\nregression line added")
> curve(beta0 + beta1*x, from = min(x), to = max(x), add = TRUE, lwd = 2, lty = 1)

[Figure 2: Normal QQ plot of the standardized residuals. The points should follow the reference line if the data are normally distributed; here they stay fairly close to it.]

[Figure 3: Scatter plot of calcium (in ppm) vs. pressure (in psi) with the fitted regression line added.]
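As a closing aside (not part of the original handout), R can produce similar diagnostic plots directly from the fitted lm object:

> plot(mod, which = 1)   # residuals vs. fitted values, analogous to Figure 1
> plot(mod, which = 2)   # normal Q-Q plot of standardized residuals, analogous to Figure 2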