Simple Linear Regression Example

These commands reproduce what we did in class. You should enter these in R and see what they do. Start by typing

> set.seed(42)

to reset the random number generator so you will get the same results we had in class. (Remember that you don't enter the > or + that R uses as a prompt at the beginning of each line.) Now pick 50 x's between 1 and 25:

> x <- sample( 25, 50, replace = TRUE )

We can make an approximately linear function of x by entering

> y <- 4 * x + 17 + 25 * rnorm(x)

This adds a random component to a line with slope 4 and y-intercept 17; the random part is normally distributed with mean 0 and standard deviation 25. Observe the data.

> plot( x, y, las = 1 )

There is a general linear trend, but lots of scatter, too. Find the center of the data, i.e., x̄ and ȳ, and add them to the graph.

> xbar <- mean( x ) ; ybar <- mean( y ) ; data.frame( xbar, ybar )
> abline( v = xbar, lty = 3 ) ; axis( 3, at = xbar )
> abline( h = ybar, lty = 3 ) ; axis( 4, at = ybar )

Now we use least squares to fit a line to the data. We can draw that on our graph, and we can compare it to the true regression line.

> output <- lm( y ~ x )
> abline( output )                                  # sample regression line
> abline( 17, 4, lty = 2, col = "red", lwd = 2 )    # true (population) regression line

It looks like a pretty good fit, but remember that the line we get depends on the points we started with, and they are random. Suppose we started with the same true relation between x and y, that is, with y = 4x + 17 plus a random component which is normally distributed with mean 0 and standard deviation 25, and repeated the process of finding a line based on a sample of 50 points. Every time we do that, we have a different batch of points, so we get a different line, even though all the lines we get are supposed to estimate the same true line, namely y = 4x + 17. We can use R to do this. Define a function to draw a sample of 50 points and compute the least-squares line.
> do.it.again <- function(){
+   y <- 4 * x + 17 + 25 * rnorm(x)
+   more.output <- lm( y ~ x )
+   abline( more.output, col = "gray" )
+ }

Now try it a few times to see how it works. Do lots more:

> for( i in 1:200 ){ do.it.again() }
Show the true line again.

> abline( 17, 4, lty = 2, lwd = 3, col = "red" )    # true line

It should look like this:

[Figure: scatterplot of y versus x with 200 gray sample regression lines and the dashed red true line. Caption: Regression lines]
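The whole repeated-sampling demo can be collected into a single script. This is only a sketch restating the commands from this section; the seed, sample size, and true line y = 4x + 17 are the ones used in class.

```r
# Repeated-sampling demo in one script (sketch; same seed and true
# line y = 4x + 17 as in class).
set.seed(42)
x <- sample( 25, 50, replace = TRUE )    # 50 x's between 1 and 25
y <- 4 * x + 17 + 25 * rnorm(x)          # true line plus N(0, 25) noise

plot( x, y, las = 1 )
abline( lm( y ~ x ) )                    # sample regression line

for( i in 1:200 ){                       # 200 fresh samples, same x's
  y.new <- 4 * x + 17 + 25 * rnorm(x)
  abline( lm( y.new ~ x ), col = "gray" )
}

abline( 17, 4, lty = 2, lwd = 3, col = "red" )    # true line on top
```

Each gray line estimates the same true line; the spread of the gray lines is a picture of the sampling variability of the least-squares estimates.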
More examples

Here are R commands to do what is shown in some of the worked-out examples in the text. These commands may also be useful for doing some of the homework. These examples use the meat data from one of the case studies.

> time <- c( 1, 1, 2, 2, 4, 4, 6, 6, 8, 8 )
> ph <- c( 7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36 )

The first thing to do is to look at the data, and the second is to try fitting a regression model.

> plot( time, ph, las = 1 )    # scatterplot of ph versus time
> abline( lm( ph ~ time ) )

[Figure: scatterplot of ph versus time with the fitted line. Caption: Line does not follow curvature of data]

There is evidence that the model is inadequate; perhaps a transformation would help. Try logarithm of time.

> log.time <- log( time )
> meat.data <- data.frame( time, log.time, ph )
> meat <- lm( ph ~ log.time, data = meat.data )
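A residual plot makes the lack of fit easier to see than the scatterplot alone. The sketch below is not from the text; it simply refits both models from the meat data above and plots their residuals.

```r
# Residual check for the raw-time fit versus the log-time fit
# (sketch, using the same meat data as above).
time <- c( 1, 1, 2, 2, 4, 4, 6, 6, 8, 8 )
ph   <- c( 7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36 )

raw.fit <- lm( ph ~ time )
plot( time, residuals( raw.fit ), las = 1 )    # curved pattern: lack of fit
abline( h = 0, lty = 3 )

log.fit <- lm( ph ~ log(time) )
plot( log(time), residuals( log.fit ), las = 1 )    # no systematic pattern
abline( h = 0, lty = 3 )
```

A curved band of residuals around zero in the first plot, and a patternless one in the second, is the visual evidence that the log transformation helps.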
We'll use these transformed data.

> summary( meat )

Call:
lm(formula = ph ~ log.time, data = meat.data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.11466 -0.05889  0.02086  0.03611  0.11658

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.98363    0.04853  143.90 6.08e-15 ***
log.time    -0.72566    0.03443  -21.08 2.70e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08226 on 8 degrees of freedom
Multiple R-Squared: 0.9823,     Adjusted R-squared: 0.9801
F-statistic: 444.3 on 1 and 8 DF,  p-value: 2.695e-08

[Figure: scatterplot of ph versus log.time with the fitted line. Caption: Line fits transformed data much better]

From this output we see that our estimated standard deviation is σ̂ = 0.08226 and our estimated slope coefficient is -0.72566, with standard error 0.03443. So

    ph = 6.9836 - 0.7257 log t

for t between 1 and 8 hours.
Point Estimates and Standard Errors (Display 7.10)

We can use the line to estimate the value of ph for any time between 1 and 8 hours, whether or not a specific time was one we had data for. Even though we had two observations with time 4 hours, we still use the line to estimate the mean ph for steers at time 4 hours, just as we would for times (such as 5 hours) where we did not have any observations.

The point estimate is just the y-coordinate for a given value of time t. For example, we estimate that when t = 4 hours,

    ph = 6.9836 - 0.7257 log 4 = 6.9836 - 0.7257(1.386) = 5.98

but we'd like some idea of how reliable this is. We need to compute a standard error, and there are several ways to do that. One way involves the formula

    SE[μ̂{Y|X₀}] = σ̂ sqrt( 1/n + (X₀ - X̄)² / ((n - 1) s²_X) )

for the standard error at a specified X value (X₀ = log t = log 4 = 1.386 in this example). This approach is shown in the text as Display 7.10 on page 187.

The text also describes a computer centering trick to avoid having to do all the calculations shown in Display 7.10. Here's how that works in R. We create an artificial variable, in this case by subtracting log 4 from log(time).

> log.time.star <- log.time - log(4)

Then fit a model using this instead of the original explanatory variable.

> summary( lm( ph ~ log.time.star ) )

Call:
lm(formula = ph ~ log.time.star)

Residuals:
     Min       1Q   Median       3Q      Max
-0.11466 -0.05889  0.02086  0.03611  0.11658

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    5.97765    0.02688  222.42  < 2e-16 ***
log.time.star -0.72566    0.03443  -21.08 2.70e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08226 on 8 degrees of freedom
Multiple R-Squared: 0.9823,     Adjusted R-squared: 0.9801
F-statistic: 444.3 on 1 and 8 DF,  p-value: 2.695e-08

The only parts we want from this are the estimated intercept 5.97765 and its standard error 0.02688; they are the point estimate we already had (shown as 5.98 in Display 7.10) and its standard error (shown as 0.0269 in Display 7.10).
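As a check, we can also compute the Display 7.10 formula directly and compare it with the centering trick. The sketch below recomputes SE[μ̂{Y|X₀}] from the meat data at X₀ = log 4; the names se.formula and se.predict are just illustrative.

```r
# Direct computation of SE[mu-hat{Y|X0}] at X0 = log 4 (sketch).
time <- c( 1, 1, 2, 2, 4, 4, 6, 6, 8, 8 )
ph   <- c( 7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36 )
log.time <- log( time )
meat <- lm( ph ~ log.time )

x0        <- log(4)
n         <- length( log.time )
sigma.hat <- summary( meat )$sigma    # residual standard error, 0.08226

se.formula <- sigma.hat *
  sqrt( 1/n + ( x0 - mean(log.time) )^2 / ( (n - 1) * var(log.time) ) )
se.formula                            # about 0.0269, as in Display 7.10

# R's predict() computes the same quantity:
se.predict <- predict( meat, data.frame( log.time = x0 ), se.fit = TRUE )$se.fit
```

The formula value, the centered-model standard error, and predict()'s se.fit should all agree, which is a useful sanity check on the arithmetic.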
Confidence Intervals (Display 7.10)

We can use the point estimate and its associated standard error to form a confidence interval for the mean ph of all steers measured at time 4 hours. The calculations are shown in the bottom of Display 7.10, and we can add the interval to our graph.

[Figure: scatterplot of ph versus log.time with the fitted line and the 95% confidence interval marked at log.time = log 4. Caption: 95% CI for mean ph at 4 hours after slaughter]

Remember that this is an estimate for the true mean value of all steers. What if we wanted to predict the ph for a single steer? The point estimate would be the same 5.98, but our uncertainty would be different. Even if we knew the exact true regression line, there would still be sampling variability about that line. That's what σ describes, after all. But we have only our estimated line, and the confidence interval we've found describes only the variation between the true line and its estimates such as our line.
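The bottom of Display 7.10 is ordinary t-interval arithmetic, and we can reproduce it in R. This sketch assumes the centered-model numbers from the previous section: estimate 5.97765, standard error 0.02688, and 8 degrees of freedom.

```r
# 95% confidence interval for mean ph at t = 4 hours (sketch,
# using the centered-model estimate and standard error from above).
est <- 5.97765                  # point estimate (centered intercept)
se  <- 0.02688                  # its standard error
t.crit <- qt( 1 - .05/2, 8 )    # t critical value, 2.306004

c( lower = est - t.crit * se, upper = est + t.crit * se )
# roughly 5.92 to 6.04, matching Display 7.10
```

The same pattern (estimate plus or minus t critical value times standard error) is used again below for the prediction interval, with a larger standard error.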
Prediction Intervals (Display 7.12)

We can form a different interval that allows for additional variability. As before, there are several ways to do this. One way uses the formula (from page 190) for the standard error of prediction:

    SE[Pred{Y|X₀}] = sqrt( σ̂² + SE[μ̂{Y|X₀}]² )

We can use the centering method to get SE[μ̂{Y|X₀}], and that computer output also gives σ̂, so this is really not too hard. For our example, we had

> summary( lm( ph ~ log.time.star ) )    # same centering as before

Call:
lm(formula = ph ~ log.time.star)

Residuals:
     Min       1Q   Median       3Q      Max
-0.11466 -0.05889  0.02086  0.03611  0.11658

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    5.97765    0.02688  222.42  < 2e-16 ***
log.time.star -0.72566    0.03443  -21.08 2.70e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08226 on 8 degrees of freedom
Multiple R-Squared: 0.9823,     Adjusted R-squared: 0.9801
F-statistic: 444.3 on 1 and 8 DF,  p-value: 2.695e-08

From this output we get

    SE[μ̂{Y|X₀}] = 0.02688        σ̂ = 0.08226

We combine these to get the SE for prediction:

> sqrt( 0.02688^2 + 0.08226^2 )    # SE for predicted value
[1] 0.0865404

This is shown as 0.0865 in Display 7.12. The rest of that display shows how to form a 95% prediction interval, and we can use R to do that, too.

> qt( 1 - .05/2, 8 )    # t critical value
[1] 2.306004
> 5.97765 - 0.0865404 * 2.306004    # lower limit
[1] 5.778087
> 5.97765 + 0.0865404 * 2.306004    # upper limit
[1] 6.177213
We can add this interval to our graph.

[Figure: scatterplot of ph versus log.time showing both the 95% prediction interval and the narrower confidence interval at log.time = log 4. Caption: 95% prediction interval for ph at 4 hours after slaughter]

This shows both the prediction interval and the confidence interval. We can think of the confidence interval as reflecting our uncertainty about the location of the line itself, while the prediction interval incorporates the additional variability of points scattered about that line.

R can do all this at once. The preceding material is useful no matter what computer software you have. However, many packages, including R, have built-in routines for these tasks:

> predict( meat, data.frame( log.time = log(4) ), interval = "confidence" )
       fit      lwr      upr
[1,] 5.977651 5.915677 6.039625

Rounding these values, we have a point estimate of 5.98 and a confidence interval from 5.92 to 6.04.

> predict( meat, data.frame( log.time = log(4) ), interval = "prediction" )
       fit      lwr      upr
[1,] 5.977651 5.778092 6.177209

Here we still have the same point estimate of 5.98, but our prediction interval is from 5.78 to 6.18.
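predict() also accepts a whole grid of new values, which is how plots with confidence and prediction bands are usually drawn. The sketch below extends the commands above; it is not from the text, and the grid size, line types, and colors are arbitrary choices.

```r
# Pointwise 95% confidence and prediction bands on the log.time
# scale (sketch, refitting the meat model from above).
time <- c( 1, 1, 2, 2, 4, 4, 6, 6, 8, 8 )
ph   <- c( 7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36 )
meat.data <- data.frame( time, log.time = log(time), ph )
meat <- lm( ph ~ log.time, data = meat.data )

grid <- data.frame( log.time = seq( 0, log(8), length.out = 50 ) )
conf.band <- predict( meat, grid, interval = "confidence" )
pred.band <- predict( meat, grid, interval = "prediction" )

plot( meat.data$log.time, ph, las = 1, xlab = "log.time", ylab = "ph" )
abline( meat )                                                    # fitted line
matlines( grid$log.time, conf.band[ , c("lwr","upr") ], lty = 3 ) # CI band
matlines( grid$log.time, pred.band[ , c("lwr","upr") ], lty = 2 ) # PI band
```

Notice that the prediction band sits outside the confidence band everywhere, and that both bands are narrowest near the mean of log.time, exactly as the formula for SE[μ̂{Y|X₀}] predicts.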