STAT 3022 Spring 2007

Duane Horn

Simple Linear Regression Example

These commands reproduce what we did in class. You should enter these in R and see what they do. Start by typing

> set.seed(42)

to reset the random number generator so you will get the same results we had in class. (Remember that you don't enter the > or + that R uses as a prompt at the beginning of each line.)

Now pick 50 x values between 1 and 25:

> x <- sample( 25, 50, replace = TRUE )

We can make an approximately linear function of x by entering

> y <- 4 * x + 17 + 25 * rnorm(x)

This adds a random component to a line with slope 4 and y-intercept 17; the random part is normally distributed with mean 0 and standard deviation 25. (Because x is a vector of length 50, rnorm(x) returns 50 values.)

Observe the data.

> plot( x, y, las = 1 )

There is a general linear trend, but lots of scatter, too. Find the center of the data, i.e., the means x̄ and ȳ, and add them to the graph.

> xbar <- mean( x ) ; ybar <- mean( y ) ; data.frame( xbar, ybar )
> abline( v = xbar, lty = 3 ) ; axis( 3, at = xbar )
> abline( h = ybar, lty = 3 ) ; axis( 4, at = ybar )

Now we use least squares to fit a line to the data. We can draw that on our graph, and we can compare it to the true regression line.

> output <- lm( y ~ x )
> abline( output ) # sample regression line
> abline( 17, 4, lty = 2, col = "red", lwd = 2 ) # true (population) regression line

It looks like a pretty good fit, but remember that the line we get depends on the points we started with, and they are random. Suppose we started with the same true relation between x and y, that is, with y = 4x + 17 plus a random component which is normally distributed with mean 0 and standard deviation 25, and repeated the process of finding a line based on a sample of 50 points. Every time we do that, we have a different batch of points, so we get a different line, even though all the lines we get are supposed to estimate the same true line, namely y = 4x + 17.

We can use R to do this. Define a function to draw a sample of 50 points and compute the least-squares line.
> do.it.again <- function(){
+   y <- 4 * x + 17 + 25 * rnorm(x)
+   more.output <- lm( y ~ x )
+   abline( more.output, col = "gray" )
+ }

Now try it a few times to see how it works, then do many more:

> for( i in 1:200 ){ do.it.again() }

Show the true line again.

> abline( 17, 4, lty = 2, lwd = 3, col = "red" ) # true line

It should look like this:

[Figure: "Regression lines", y versus x]
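One way to make the spread of the gray lines concrete is to keep the fitted coefficients from each repetition instead of only drawing the lines. The following is a minimal sketch under the same setup as above (slope 4, intercept 17, error standard deviation 25); the names `one.fit` and `fits` are our own and are not part of the class example.

```r
set.seed(42)
x <- sample(25, 50, replace = TRUE)   # keep the same x values for every repetition

# Hypothetical helper: simulate new y values and return the fitted coefficients
one.fit <- function() {
  y <- 4 * x + 17 + 25 * rnorm(length(x))
  coef(lm(y ~ x))
}

# 200 repetitions; each row holds one fitted (Intercept) and slope
fits <- t(replicate(200, one.fit()))

colMeans(fits)       # averages of the estimates; should be near 17 and 4
apply(fits, 2, sd)   # how much the estimates vary from sample to sample
```

The column means should sit close to the true intercept and slope, while the column standard deviations describe how far any single fitted line can wander from the true one.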

More examples

Here are R commands to do what is shown in some of the worked-out examples in the text. These commands may also be useful for doing some of the homework. These examples use the meat data from one of the case studies.

> time <- c( 1, 1, 2, 2, 4, 4, 6, 6, 8, 8 )
> ph <- c( 7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36 )

The first thing to do is to look at the data, and the second is to try fitting a regression model.

> plot( time, ph, las = 1 ) # scatterplot of ph versus time
> abline( lm( ph ~ time ) )

[Figure: ph versus time; the line does not follow the curvature of the data]

There is evidence that the model is inadequate; perhaps a transformation would help. Try the logarithm of time.

> log.time <- log( time )
> meat.data <- data.frame( time, log.time, ph )
> meat <- lm( ph ~ log.time, data = meat.data )
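As a numerical companion to the plots, the two candidate models can be compared directly. This is only a sketch: the object names `fit.raw` and `fit.log` are ours, and R-squared is used here as a rough summary rather than a formal test of which model is adequate.

```r
time <- c(1, 1, 2, 2, 4, 4, 6, 6, 8, 8)
ph   <- c(7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36)

fit.raw <- lm(ph ~ time)        # straight line in time
fit.log <- lm(ph ~ log(time))   # straight line in log(time)

# Share of variation explained, and residual standard deviation, for each model
c(raw = summary(fit.raw)$r.squared, log = summary(fit.log)$r.squared)
c(raw = summary(fit.raw)$sigma,     log = summary(fit.log)$sigma)

# Residuals side by side; a run of same-signed residuals hints at curvature
round(cbind(time, raw = resid(fit.raw), log = resid(fit.log)), 3)
```

The model on the log scale should show the larger R-squared and the smaller residual standard deviation.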

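[Figure: ph versus log.time; the line fits the transformed data much better]

We'll use these transformed data.

> summary( meat )

Call:
lm(formula = ph ~ log.time, data = meat.data)

Residuals:
 Min  1Q  Median  3Q  Max
 ...  ...  ...    ...  ...

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)      ...        ...     ...  ...e-15 ***
log.time         ...        ...     ...  ...e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: ... on 8 degrees of freedom
Multiple R-Squared: ...,  Adjusted R-squared: ...
F-statistic: ... on 1 and 8 DF,  p-value: 2.695e-08

From this output we read our estimated standard deviation σ̂ (the residual standard error), our estimated slope coefficient (the Estimate for log.time), and its standard error (the Std. Error in the same row). So the fitted line is

$$\hat{\mu}\{\text{pH} \mid t\} = \hat{\beta}_0 + \hat{\beta}_1 \log t$$

for t between 1 and 8 hours, with $\hat{\beta}_0$ and $\hat{\beta}_1$ taken from the Estimate column.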
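The quantities read off the printout can also be pulled directly from the fitted object, which avoids retyping rounded numbers. This is a small sketch using standard extractor functions; the name `sigma.hat` is our own.

```r
time <- c(1, 1, 2, 2, 4, 4, 6, 6, 8, 8)
ph   <- c(7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36)
meat.data <- data.frame(time, log.time = log(time), ph)
meat <- lm(ph ~ log.time, data = meat.data)

coef(meat)                         # estimated intercept and slope
coef(summary(meat))                # estimates, standard errors, t values, p-values
sigma.hat <- summary(meat)$sigma   # residual standard error (estimate of sigma)
sigma.hat
df.residual(meat)                  # residual degrees of freedom
```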

Point Estimates and Standard Errors (Display 7.10)

We can use the line to estimate the value of pH for any time between 1 and 8 hours, whether or not a specific time was one we had data for. Even though we had two observations with time 4 hours, we still use the line to estimate the mean pH for steers at time 4 hours, just as we would for times (such as 5 hours) where we did not have any observations.

The point estimate is just the y-coordinate of the line for a given value of time t. For example, we estimate that when t = 4 hours,

$$\hat{\mu}\{\text{pH} \mid \log 4\} = \hat{\beta}_0 + \hat{\beta}_1 \log 4 = \hat{\beta}_0 + \hat{\beta}_1 (1.386) = 5.98$$

but we'd like some idea of how reliable this is. We need to compute a standard error, and there are several ways to do that. One way involves the formula

$$\mathrm{SE}[\hat{\mu}\{Y \mid X_0\}] = \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)\, s_X^2}}$$

for the standard error at a specified X value ($X_0 = \log t = \log 4 = 1.386$ in this example). This approach is shown in the text as Display 7.10 on page 187.

The text also describes a computer centering trick to avoid having to do all the calculations shown in Display 7.10. Here's how that works in R. We create an artificial variable, in this case by subtracting log 4 from log(time).

> log.time.star <- log.time - log(4)

Then fit a model using this instead of the original explanatory variable.

> summary( lm( ph ~ log.time.star ) )

Call:
lm(formula = ph ~ log.time.star)

Residuals:
 Min  1Q  Median  3Q  Max
 ...  ...  ...    ...  ...

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)        ...        ...     ...  < 2e-16 ***
log.time.star      ...        ...     ...  ...e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: ... on 8 degrees of freedom
Multiple R-Squared: ...,  Adjusted R-squared: ...
F-statistic: ... on 1 and 8 DF,  p-value: 2.695e-08

The only parts we want from this are the estimated intercept and its standard error; they are the point estimate we already had (shown as 5.98 in Display 7.10) and its standard error (also shown in Display 7.10). The trick works because the intercept of the centered model is the fitted mean at log.time.star = 0, that is, at time = 4 hours.
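To see that the formula and the centering trick agree, both can be computed side by side. This is a sketch under the same data as above; `x0`, `se.formula`, and `se.centered` are names we introduce for illustration.

```r
time <- c(1, 1, 2, 2, 4, 4, 6, 6, 8, 8)
ph   <- c(7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36)
log.time <- log(time)
meat <- lm(ph ~ log.time)

n         <- length(ph)
x0        <- log(4)                 # X value at which we want the standard error
sigma.hat <- summary(meat)$sigma    # residual standard error

# Direct use of the standard error formula for the estimated mean at x0
se.formula <- sigma.hat *
  sqrt(1 / n + (x0 - mean(log.time))^2 / ((n - 1) * var(log.time)))

# Centering trick: the intercept's standard error in the shifted model
log.time.star <- log.time - x0
se.centered <- coef(summary(lm(ph ~ log.time.star)))["(Intercept)", "Std. Error"]

c(formula = se.formula, centered = se.centered)
all.equal(se.formula, se.centered)  # the two routes should give the same value
```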

Confidence Intervals (Display 7.10)

We can use the point estimate and its associated standard error to form a confidence interval for the mean pH of all steers measured at time 4 hours. The calculations are shown in the bottom of Display 7.10, and we can add this to our graph.

[Figure: ph versus log.time, showing the 95% CI for mean ph at 4 hours after slaughter]

Remember that this is an estimate for the true mean value of all steers. What if we wanted to predict the pH for a single steer? The point estimate would be the same 5.98, but our uncertainty would be different. Even if we knew the exact true regression line, there would still be sampling variability about that line. That's what σ describes, after all. But we have only our estimated line, and the confidence interval we've found describes only the variation between the true line and its estimates such as our line.
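The confidence interval calculation can be sketched in R from the centered fit: point estimate plus or minus a t critical value times the standard error. The names `fit.star`, `est`, `se`, and `ci` are ours; `confint()` is used only as a cross-check on the hand calculation.

```r
time <- c(1, 1, 2, 2, 4, 4, 6, 6, 8, 8)
ph   <- c(7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36)
log.time.star <- log(time) - log(4)   # centered at 4 hours
fit.star <- lm(ph ~ log.time.star)

tab <- coef(summary(fit.star))
est <- tab["(Intercept)", "Estimate"]     # estimated mean pH at 4 hours
se  <- tab["(Intercept)", "Std. Error"]   # its standard error

t.crit <- qt(1 - 0.05 / 2, df.residual(fit.star))  # 97.5th percentile of t, 8 df
ci <- est + c(-1, 1) * t.crit * se                 # 95% CI for the mean
round(ci, 2)

# Cross-check against R's built-in interval for the intercept
confint(fit.star)["(Intercept)", ]
```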

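Prediction Intervals (Display 7.12)

We can form a different interval that allows for additional variability. As before, there are several ways to do this. One way uses the formula (from page 190) for the standard error of prediction:

$$\mathrm{SE}[\mathrm{Pred}\{Y \mid X_0\}] = \sqrt{\hat{\sigma}^2 + \mathrm{SE}[\hat{\mu}\{Y \mid X_0\}]^2}$$

We can use the centering method to get $\mathrm{SE}[\hat{\mu}\{Y \mid X_0\}]$, and that computer output also gives $\hat{\sigma}$, so this is really not too hard. For our example, we had

> summary( lm( ph ~ log.time.star ) ) # same centering as before

Call:
lm(formula = ph ~ log.time.star)

Residuals:
 Min  1Q  Median  3Q  Max
 ...  ...  ...    ...  ...

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)        ...        ...     ...  < 2e-16 ***
log.time.star      ...        ...     ...  ...e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: ... on 8 degrees of freedom
Multiple R-Squared: ...,  Adjusted R-squared: ...
F-statistic: ... on 1 and 8 DF,  p-value: 2.695e-08

From this output we get $\mathrm{SE}[\hat{\mu}\{Y \mid X_0\}]$ (the Std. Error of the intercept) and $\hat{\sigma}$ (the residual standard error). We combine these to get the standard error for prediction. Pulling the two values from the fitted object avoids retyping them:

> centered <- summary( lm( ph ~ log.time.star ) )
> se.mean <- coef( centered )[ "(Intercept)", "Std. Error" ]
> sigma.hat <- centered$sigma
> se.pred <- sqrt( sigma.hat^2 + se.mean^2 ) # SE for predicted value

This is the standard error of prediction shown in Display 7.12. The rest of that display shows how to form a 95% prediction interval, and we can use R to do that, too.

> t.crit <- qt( 1 - 0.05/2, 8 ) # t critical value
> fit <- coef( centered )[ "(Intercept)", "Estimate" ]
> fit - t.crit * se.pred # lower limit
> fit + t.crit * se.pred # upper limit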

We can add this interval to our graph.

[Figure: ph versus log.time, showing the 95% prediction interval for ph at 4 hours after slaughter]

This shows both the prediction interval and the confidence interval. We can think of the confidence interval as reflecting our uncertainty about the location of the line itself, while the prediction interval incorporates the additional variability of points scattered about that line.

R can do all this at once. The preceding material is useful no matter what computer software you have. However, many packages, including R, have built-in routines for these tasks:

> predict( meat, data.frame( log.time = log(4) ), interval = "confidence" )
     fit lwr upr
[1,] ... ... ...

Rounding these values, we have a point estimate of 5.98 and a confidence interval from 5.92 to 6.04.

> predict( meat, data.frame( log.time = log(4) ), interval = "prediction" )
     fit lwr upr
[1,] ... ... ...

Here we still have the same point estimate of 5.98, but our prediction interval is from 5.78 to 6.18.
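Because `predict()` accepts a data frame with several rows, both kinds of interval can be obtained for a range of times in one call. This is a sketch; the times chosen and the names `new.times`, `conf.int`, and `pred.int` are our own, and all times stay within the 1 to 8 hour range of the data.

```r
time <- c(1, 1, 2, 2, 4, 4, 6, 6, 8, 8)
ph   <- c(7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36)
meat.data <- data.frame(time, log.time = log(time), ph)
meat <- lm(ph ~ log.time, data = meat.data)

hours     <- c(1, 2, 4, 5, 8)                   # illustrative times, in hours
new.times <- data.frame(log.time = log(hours))

conf.int <- predict(meat, new.times, interval = "confidence")  # for the mean pH
pred.int <- predict(meat, new.times, interval = "prediction")  # for a single steer

# Same fitted values in both; the prediction limits are wider at every time
round(cbind(hours, conf.int, pred.lwr = pred.int[, "lwr"],
            pred.upr = pred.int[, "upr"]), 2)
```

At every time the prediction interval contains the confidence interval, since it adds the scatter of individual points to the uncertainty in the line.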
