CHAPTER 3

INFERENCE FOR REGRESSION

OVERVIEW

In Chapter 5 of the textbook, we first encountered regression. The assumptions that describe the regression model we use in this chapter are the following.

We have n observations on an explanatory variable x and a response variable y. Our goal is to study or predict the behavior of y for given values of x.

For any fixed value of x, the response y varies according to a Normal distribution. Repeated responses y are independent of each other.

The mean response μ_y has a straight-line relationship with x given by a population regression line:

$$\mu_y = \alpha + \beta x$$

The slope β and intercept α are unknown parameters.

The standard deviation of y (call it σ) is the same for all values of x. The value of σ is unknown.

The true (population) regression line is $\mu_y = \alpha + \beta x$. It says that the mean response μ_y moves along a straight line as the explanatory variable x changes. The parameters β and α are estimated by the slope b and intercept a of the least-squares regression line. The formulas for these estimates are

$$b = r\,\frac{s_y}{s_x} \qquad \text{and} \qquad a = \bar{y} - b\,\bar{x}$$

where r is the correlation between y and x, $\bar{y}$ is the mean of the y observations, $s_y$ is the standard deviation of the y observations, $\bar{x}$ is the mean of the x observations, and $s_x$ is the standard deviation of the x observations.

The standard error about the least-squares line is

$$s = \sqrt{\frac{1}{n-2}\sum \text{residual}^2} = \sqrt{\frac{1}{n-2}\sum (y - \hat{y})^2}$$

where $\hat{y} = a + bx$ is the value we would predict for the response variable based on the least-squares regression line. We use s to estimate the unknown σ in the regression model.
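As an illustration of the estimates described in the overview, the following minimal Python sketch computes the slope b, intercept a, correlation r, residuals, and the standard error s about the line. The function name `least_squares` and the five data points are hypothetical and chosen only for illustration; they are not data from the textbook. The sketch assumes at least three observations and that neither x nor y is constant.

```python
import math

def least_squares(x, y):
    """Return (a, b, r, s, residuals) for the least-squares line y-hat = a + b*x.

    Hypothetical helper for illustration; assumes n >= 3 and that
    neither x nor y is constant.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    s_x = math.sqrt(sxx / (n - 1))      # standard deviation of the x observations
    s_y = math.sqrt(syy / (n - 1))      # standard deviation of the y observations
    r = sxy / math.sqrt(sxx * syy)      # correlation between x and y

    b = r * s_y / s_x                   # slope: b = r * s_y / s_x
    a = y_bar - b * x_bar               # intercept: a = y-bar - b * x-bar

    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # standard error about the line uses n - 2 in the denominator
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return a, b, r, s, residuals

# Made-up data, not from the textbook
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b, r, s, residuals = least_squares(x, y)
print(f"a = {a:.3f}, b = {b:.3f}, r = {r:.3f}, s = {s:.3f}")
print("sum of residuals:", round(sum(residuals), 10))
```

For any data set, the residuals from the least-squares line sum to zero apart from floating-point rounding, which is a useful check on hand calculations.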
A level C confidence interval for β is

$$b \pm t^* SE_b$$

where t* is the critical value for the t distribution with n − 2 degrees of freedom with area C between −t* and t*, and

$$SE_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$$

is the standard error of the least-squares slope b. $SE_b$ is usually computed using a calculator or statistical software.

The test of the hypothesis $H_0: \beta = 0$ is based on the t statistic

$$t = \frac{b}{SE_b}$$

with P-values computed from the t distribution with n − 2 degrees of freedom. This test is also a test of the hypothesis that the correlation is 0 in the population.

A level C confidence interval for the mean response μ_y when x takes the value x* is

$$\hat{y} \pm t^* SE_{\hat{\mu}}$$

where $\hat{y} = a + bx^*$, t* is the critical value for the t distribution with n − 2 degrees of freedom and area C between −t* and t*, and

$$SE_{\hat{\mu}} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x - \bar{x})^2}}$$

$SE_{\hat{\mu}}$ is usually computed using a calculator or statistical software.

A level C prediction interval for a single observation on y when x takes the value x* is

$$\hat{y} \pm t^* SE_{\hat{y}}$$

where t* is the critical value for the t distribution with n − 2 degrees of freedom and area C between −t* and t*, and

$$SE_{\hat{y}} = s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x - \bar{x})^2}}$$

$SE_{\hat{y}}$ is usually computed using a calculator or statistical software.

Finally, it is always good practice to check that the data satisfy the linear regression model assumptions before doing inference. Scatterplots and residual plots are useful tools for checking these assumptions.
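The inference formulas above can be sketched in Python as follows. This is an illustration under stated assumptions, not the textbook's procedure: the function name `regression_inference`, its dictionary keys, and the data are hypothetical. The critical value t* is passed in by the caller, as if read from a t table for n − 2 degrees of freedom. The slope is computed as Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², which is algebraically the same as b = r s_y / s_x.

```python
import math

def regression_inference(x, y, x_star, t_star):
    """Slope CI, t statistic, mean-response CI, and prediction interval.

    Hypothetical helper: t_star must be the t critical value for
    n - 2 degrees of freedom at the desired confidence level,
    supplied by the caller (for example, from a t table).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b = sxy / sxx                        # equals r * s_y / s_x
    a = y_bar - b * x_bar
    s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

    se_b = s / math.sqrt(sxx)            # standard error of the slope
    y_hat = a + b * x_star               # predicted response at x*
    core = 1 / n + (x_star - x_bar) ** 2 / sxx
    se_mu = s * math.sqrt(core)          # SE for the mean response
    se_pred = s * math.sqrt(1 + core)    # SE for a single new observation

    return {
        "b": b,
        "se_b": se_b,
        "t": b / se_b,                   # test statistic for H0: beta = 0
        "slope_ci": (b - t_star * se_b, b + t_star * se_b),
        "y_hat": y_hat,
        "mean_ci": (y_hat - t_star * se_mu, y_hat + t_star * se_mu),
        "pred_int": (y_hat - t_star * se_pred, y_hat + t_star * se_pred),
    }

# Made-up data, not from the textbook; n = 5 so df = 3,
# and the 95% critical value for 3 df is 3.182.
result = regression_inference([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], x_star=3, t_star=3.182)
for name, value in result.items():
    print(name, value)
```

Because of the extra 1 under the square root, the prediction interval is always wider than the confidence interval for the mean response at the same x*.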
GUIDED SOLUTIONS

Exercise 3.

KEY CONCEPTS: Scatterplots, correlation, linear regression, residuals, standard error of the least-squares line

(a) First, examine the data and judge whether the relationship between Distance and Days is positive or negative. Sketch your scatterplot on the axes provided, or use software.

[Figure: axes for a scatterplot of Days (vertical) versus Distance (horizontal)]

Use your calculator (or statistical software) to compute the correlation r:

r =

(b) What does the slope β of the true regression line say about the number of days until group infection and a group's distance from the first infected group?

Enter your estimates of the slope β and intercept α of the true regression line. Use software or your calculator, or compute these values manually using the formulas in Chapter 6 of your textbook.

Estimate of β =
Estimate of α =

Although it isn't asked for in this part, write the equation of the least-squares regression line for predicting the number of days to infection for a gorilla group given its distance from the first group infected. You'll use this in part (c).

The least-squares regression line is: ŷ =

(c) To compute the residuals, complete the table. Remember, to compute the predicted number of days until infection, use the least-squares regression line.

Distance from first group infected | Predicted number of days until infection | Residual (prediction error)
3 | |
5 | |

Compute the sum of residuals (sum of prediction errors). They should sum to zero.

$\sum \text{residual}$ =

Now estimate the standard deviation σ by computing

$\sum \text{residual}^2$ =

and then completing the following calculation. This is an estimate of σ.

$$s = \sqrt{\frac{1}{n-2}\sum \text{residual}^2} =$$
Exercise 3.

KEY CONCEPTS: Tests for the slope of the least-squares regression line

(a) The test of the hypothesis $H_0: \beta = 0$ is based on the t statistic $t = b / SE_b$. In the statement of the problem, we are told that b = .63 and $SE_b$ = .59. The value of b is slightly different from the value we found in Exercise 3., due to differences in how much rounding was done at intermediate stages of the calculations. Compute the test statistic:

$t = b / SE_b$ =

(b) What are the degrees of freedom for t? Refer to the original data in Exercise 3. of your textbook to determine the sample size n.

Degrees of freedom = n − 2 =

Now, use Table C to estimate the P-value for testing $H_0$ against the alternative hypothesis $H_a: \beta > 0$, which hypothesizes a positive linear association between Days and Distance.

P-value:

What do you conclude?

Exercise 3.38

KEY CONCEPTS: Scatterplots, examining residuals, confidence intervals for the slope

(a) Use software or a calculator to compute the correlation between Time and Calories:

Use software or a calculator to compute the equation of the least-squares regression line. Don't forget to have the computer or your calculator save the residuals, as we'll use them in part (b):

ŷ =
Use software or the axes provided to make a scatterplot of Calories versus Time.

[Figure: axes for a scatterplot of Calories (vertical) versus Time (horizontal)]

(b) Here, we'll check the conditions needed for regression inference. First, to check for a linear relationship, and to check whether the spread about the line stays the same for all values of the explanatory variable, plot the residuals against Time (the explanatory variable):

[Figure: axes for a plot of Residuals (vertical) versus Time (horizontal)]

Does this plot show any systematic deviation from a roughly linear pattern?

Does this plot show any systematic change in spread as Time changes?
Are the observations independent? Is this obvious?

Finally, look for evidence that the variation about the line appears to be Normal. Use software or the axes that follow (with class intervals residual < 3, 3 residual <, residual <, and so on) to make a histogram.

[Figure: axes for a histogram of the Residuals (horizontal), with Frequency on the vertical axis]

Does this plot have strong skewness or outliers that might suggest a lack of Normality?

(c) In this problem, the rate of change in calories consumed as time at the table increases is the slope of the population line, β. Hence, we need to construct a 95% confidence interval for β. Recall that a level C confidence interval for β is

$$b \pm t^* SE_b$$

where t* is the critical value for the t distribution with n − 2 degrees of freedom with area C between −t* and t*, and

$$SE_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$$

is the standard error of the least-squares slope b.
In this exercise, b and $SE_b$ can be read directly from the output of statistical software. Record their values.

b =

$SE_b$ =

Now, find t* for a 95% confidence interval from Table C (what is n here?).

t* =

Compute the 95% confidence interval:

Interpret this confidence interval in the context of this problem.

Exercise 3.

KEY CONCEPTS: Prediction, prediction intervals

We used Minitab to compute a prediction of Calories for the specified value of Time. The output follows:

    The regression equation is
    Calories = 56 3.8 Time

    Predictor    Coef     Stdev    t-ratio    p
    Constant     56.65    9.37     9.9        .
    Time         3.77     .898     3.6        .

    s = 3.    R-sq = .%    R-sq(adj) = 38.9%

    Analysis of Variance
    SOURCE        DF    SS       MS       F     p
    Regression          777.6    777.6    3.    .
    Error         8     985.     57.5
    Total         9     73.

    Fit      Stdev.Fit    95.% C.I.    95.% P.I.
    37.57    7.3          (.3, 5.9)    ( 386.6, 89.8)

Where in this output does one find the 95% prediction interval for Rachel's calorie consumption at lunch? Refer to Examples 3.7 and 3.8 in the textbook if you need help.

95% prediction interval:
COMPLETE SOLUTIONS

Exercise 3.

(a) If we look at the data, we see that as a gorilla group's distance from the first infection increases, so does the number of days until that group is infected. Thus, there is a positive association between Days and Distance. A scatterplot of the data with Distance as the explanatory variable follows.

[Figure: Scatterplot of Days (vertical) versus Distance (horizontal)]

The scatterplot indicates a strong positive linear association between Distance and Days. The correlation is r = .96. This is consistent with the scatterplot, which suggests a strong linear relationship between Distance and Days.

(b) The slope of the population regression line, β, is the number of additional days (on average) required to infect a gorilla group one additional distance unit from the original infection group. You might think of this as a measure of the rate of the infection's spread: on average it takes β days for the infection to spread to an additional home range.

The estimate of β is b = .3 days per distance unit. The estimate of α is a = −8.9 days. The equation of the least-squares regression line for predicting days to infection for a gorilla group given its distance from the initial group infected is:

Days = −8.9 + .3 Distance
(c) The residuals for the six data points are given in the table. Each residual is the observed number of days minus the predicted number of days.

Distance from first group infected | Predicted number of days until infection | Residual (prediction error)
 | 3.8 | − 3.8 = .8
3 | 5.7 | − 5.7 = .7
 | 36.96 | 33 − 36.96 = −3.96
 | 36.96 | − 36.96 = .
 | 36.96 | 3 − 36.96 = 6.
5 | 8.3 | 6 − 8.3 = .3

The sum of the residuals listed is $\sum \text{residual}$ = .. The difference from 0 is due to rounding in the parameter estimates above.

To estimate the standard deviation σ in the regression model, we first calculate the sum of the squares of the residuals listed:

$\sum \text{residual}^2$ = (.8)² + (−.7)² + ⋯ + (−.3)² = 96.

Our estimate of the standard deviation σ in the regression model is therefore

$$s = \sqrt{\frac{1}{n-2}\sum \text{residual}^2} = \sqrt{\frac{1}{6-2}(96.)} = .9 \text{ days}$$

Exercise 3.

(a) b = .63 and $SE_b$ = .59, so

$t = b / SE_b$ = .63 / .59 = 7.79

(b) Referring to the original data in Exercise 3. of the textbook, we see that n = 6.

Degrees of freedom = n − 2 = 6 − 2 =

To estimate the P-value, we use Table C with df = and refer to the P-values corresponding to the two values of t* that bracket the computed value of t = 7.79:

t* | 5.598 | 7.73
One-sided P | .5 | .

Because the alternative hypothesis $H_a: \beta > 0$ is one-sided, . < P-value < .5. Statistical software (Minitab) gives a P-value of ..

There is extremely strong (overwhelming) evidence to support a positive linear association between the distance of a gorilla group from the primary infection group and the number of days it takes for the infection to reach the group.
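The table lookup used to bracket a P-value can be sketched in Python as below. This is a minimal illustration, not the textbook's procedure. The helper name `bracket_one_sided_p` is hypothetical, and the value t = 6.0 is made up for the example. The critical values are the standard t-table entries for 4 degrees of freedom, supplied here as an assumption about what the table row contains.

```python
# Upper-tail probabilities and the matching t critical values for 4 degrees
# of freedom (standard t-table row; supplied here as an assumption).
TAIL_P = [0.25, 0.20, 0.15, 0.10, 0.05, 0.025, 0.02, 0.01, 0.005, 0.0025, 0.001, 0.0005]
T_STAR_DF4 = [0.741, 0.941, 1.190, 1.533, 2.132, 2.776, 2.999, 3.747, 4.604, 5.598, 7.173, 8.610]

def bracket_one_sided_p(t, crit, tail_p):
    """Return (lower, upper) bounds on the one-sided P-value for a t statistic >= 0.

    crit must be increasing and tail_p the matching decreasing upper-tail areas.
    """
    lower, upper = 0.0, 0.5          # for t >= 0 the upper-tail area is at most 0.5
    for c, p in zip(crit, tail_p):
        if t >= c:
            upper = p                # t is at least this critical value, so P <= p
        else:
            lower = p                # t is below this critical value, so P > p
            break
    return lower, upper

# Hypothetical test statistic, not a value from the exercise
low, high = bracket_one_sided_p(6.0, T_STAR_DF4, TAIL_P)
print(f"{low} < one-sided P-value < {high}")
```

For a two-sided alternative, both bounds would be doubled. Software reports an exact P-value, so the bracketing is needed only when working from a printed table.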
Exercise 3.38

(a) Here is a scatterplot showing the relationship between time at the table and calories consumed.

[Figure: Scatterplot of Calories (vertical) versus Time (horizontal)]

The correlation between Calories and Time is r = −.69. The overall pattern is roughly (perhaps weakly) linear with a negative slope. There appear to be no clear outliers or strongly influential data points. Using statistical software, we find that the equation of the least-squares line is

ŷ = 56.65 − 3.8 Time

(b) A scatterplot of the residuals against Time follows.

[Figure: Plot of Residuals (vertical) versus Time (horizontal)]
This plot is useful for addressing the first two of the four conditions we check.

Does the relationship appear linear? This scatterplot magnifies deviations from the regression line, making it easier to detect any non-linear pattern in the data. Based on this plot, there is little reason to doubt that the relationship between Calories and Time is linear.

Does the spread about the line stay the same? The scatterplot of residuals versus Time suggests that the spread about the line is roughly constant. The points lie consistently within a horizontal band centered on zero.

Are the observations independent? The answer is not clear. These are observations on different children rather than on a single child, and that is good. However, we do not know if the children were selected at random. In addition, we do not know if the children were all together, so that the behavior of one child could influence the behavior of another. Are there children from the same family in this group? These issues would affect the independence of the observations.

Does the variation about the line appear to be Normal? The histogram that follows has a gap and is not particularly bell-shaped. On the other hand, there do not appear to be any outliers or extreme skewness. With so few observations, it's difficult to assess non-Normality here.

[Figure: Histogram of the Residuals (horizontal), with Frequency on the vertical axis]

The conditions for inference (for a sample of this size) are approximately satisfied.
(c) From statistical software, we find that

b = −3.8

$SE_b$ = .85

For a 95% confidence interval from Table C with n = (and n − 2 = 8), t* = .

We use these to compute the 95% confidence interval for the true slope of the regression line:

$b \pm t^* SE_b$ = −3.8 ± (.)(.85) = −3.8 ± .79

or −.87 to −.9 calories per minute. With 95% confidence, each additional minute spent at the table is associated with a decrease of between .9 calories and .87 calories consumed, on average.

Exercise 3.

Using software (Minitab, in this case), we obtain the following output:

    The regression equation is
    Calories = 56 3.8 Time

    Predictor    Coef     Stdev    t-ratio    p
    Constant     56.65    9.37     9.9        .
    Time         3.77     .898     3.6        .

    s = 3.    R-sq = .%    R-sq(adj) = 38.9%

    Analysis of Variance
    SOURCE        DF    SS       MS       F     p
    Regression          777.6    777.6    3.    .
    Error         8     985.     57.5
    Total         9     73.

    Fit      Stdev.Fit    95.% C.I.    95.% P.I.
    37.57    7.3          (.3, 5.9)    (386.6, 89.8)

The Fit entry gives the predicted calories. Minitab gives both the 95% confidence interval for the mean response and the 95% prediction interval for a single observation. We are predicting a single observation, so the column labeled 95% P.I. contains the interval we want. We see that this 95% prediction interval is (386.6, 89.8). With 95% confidence, we predict that the number of calories Rachel consumes at lunch is between roughly 386 and 89 calories. This interval describes a single observation (Rachel's own consumption), not the mean consumption of all children who spend the same time at the table.