AMS 315/576 Lecture Notes

Chapter 11. Simple Linear Regression

11.1 Motivation

A restaurant opening on a reservations-only basis would like to use the number of advance reservations x to predict the number of dinners y to be prepared. Data on reservations and numbers of dinners served for one day chosen at random from each week in a 100-week period gave the following results:

[Scatter plot: # of reservations (x-axis) versus # of meals (y-axis)]

Question: Suppose the number of reservations for a future week is 135; how many meals should be prepared?
11.2 A simple graphical representation: the scatter plot

11.3 Transformation to linearize data

11.4 The simple linear (regression) model:
$$y = \beta_0 + \beta_1 x + \epsilon,$$
where $\epsilon$ is a random error with mean 0 and variance $\sigma^2$ (unknown, but usually assumed to be constant).
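The model above can be illustrated by simulating data from it. A minimal sketch, assuming illustrative parameter values ($\beta_0 = 20$, $\beta_1 = .67$, $\sigma = 10$, chosen only for demonstration and not taken from these notes):

```python
import random

# Simulate n observations from y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2).
# The parameter values are illustrative, not from the notes.
random.seed(1)
beta0, beta1, sigma = 20.0, 0.67, 10.0
xs = [50 + 2 * i for i in range(50)]  # fixed design points
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for x in xs]
print(len(xs), len(ys))
```

Plotting `ys` against `xs` would produce a scatter plot like the one in Section 11.2, with the points scattered around the true line by the random error.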
11.5 The least squares method of model fitting

Suppose the fitted line is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$. The sum of the squared distances between the fitted values $\hat{y}_i$ and the observed values $y_i$ is
$$\delta = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.$$
The least squares estimators of the model parameters $\beta_0$ and $\beta_1$ are the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize $\delta$; they are
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \quad \text{and} \quad \hat{\beta}_1 = \frac{S_{XY}}{S_{XX}},$$
where
$$S_{XY} = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n} = \sum X_i Y_i - n\bar{X}\bar{Y},$$
$$S_{XX} = \sum (X_i - \bar{X})^2 = \sum X_i^2 - \frac{(\sum X_i)^2}{n} = \sum X_i^2 - n\bar{X}^2$$
[same for $S_{YY} = \sum (Y_i - \bar{Y})^2$].

A good estimator for the error variance $\sigma^2$ is the mean square error
$$s_\epsilon^2 = \frac{\sum (y_i - \hat{y}_i)^2}{n-2} = \frac{SSE}{n-2}, \qquad s_\epsilon = \sqrt{\frac{SSE}{n-2}}$$
($s_\epsilon$ is called the residual standard deviation).

11.6 Partitioning the variability

$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$$
$$\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$$
$$SSTotal = SSError + SSREG$$
(Note: SS stands for "Sum of Squares".)
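The least squares formulas above translate directly into code. A minimal sketch (the function name `least_squares` is ours, introduced only for illustration):

```python
def least_squares(xs, ys):
    """Least squares estimates (beta0_hat, beta1_hat) and s_eps
    for the simple linear regression y = beta0 + beta1*x + eps."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # S_XY and S_XX as defined in Section 11.5
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx                 # beta1_hat = S_XY / S_XX
    b0 = ybar - b1 * xbar          # beta0_hat = ybar - beta1_hat * xbar
    # residual standard deviation s_eps = sqrt(SSE / (n - 2))
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_eps = (sse / (n - 2)) ** 0.5
    return b0, b1, s_eps

# Tiny check on exact data lying on the line y = 1 + 2x (so s_eps = 0):
b0, b1, s = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(round(b0, 6), round(b1, 6), round(s, 6))  # 1.0 2.0 0.0
```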
A useful measure of model fit is the coefficient of determination ($R^2$):
$$R^2 = \frac{SSREG}{SSTotal}, \qquad 0 \le R^2 \le 1.$$
The larger the $R^2$, the closer the fit.

The sample correlation coefficient between $X$ and $Y$ is
$$r_{X,Y} = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}}, \qquad -1 \le r_{X,Y} \le 1.$$
It measures the linear relationship between $X$ and $Y$:
$$r_{X,Y} = +1 \iff Y = a + bX,\ b > 0; \qquad r_{X,Y} = -1 \iff Y = a - bX,\ b > 0.$$
For the simple linear regression $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, we have (prove!)
$$r^2_{Y,\hat{Y}} = r^2_{X,Y} = R^2.$$

11.7 Distributions of the estimated model parameters

In order to construct the CIs for the unknown parameters $\beta_0$ and $\beta_1$, or to do a hypothesis test such as $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$, we need to know the distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$. To do this, we assume the distribution of the random error $\epsilon$ to be normal, i.e. $\epsilon \sim N(0, \sigma^2)$. Under this normality assumption,
$$T_1 = \frac{\hat{\beta}_1 - \beta_1}{s_\epsilon / \sqrt{S_{XX}}} \sim t_{n-2}; \qquad T_0 = \frac{\hat{\beta}_0 - \beta_0}{s_\epsilon \sqrt{\frac{\sum X_i^2}{n\, S_{XX}}}} \sim t_{n-2}.$$
Under $H_0: \beta_1 = 0$,
$$T_1 = \frac{\hat{\beta}_1 - 0}{s_\epsilon / \sqrt{S_{XX}}}.$$
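The identity $r^2_{X,Y} = R^2$ can be checked numerically. A minimal sketch (the function name `r_squared` and the data values are ours, chosen only for illustration):

```python
def r_squared(xs, ys):
    """Coefficient of determination R^2 = SSREG / SSTotal, together with
    the sample correlation r_{X,Y}, so the identity r^2 = R^2 can be checked."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    # SSREG = sum (yhat_i - ybar)^2; SSTotal = S_YY
    ssreg = sum((b0 + b1 * x - ybar) ** 2 for x in xs)
    r = sxy / (sxx * syy) ** 0.5   # r_{X,Y} = S_XY / sqrt(S_XX * S_YY)
    return ssreg / syy, r

R2, r = r_squared([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(abs(R2 - r * r) < 1e-12)  # True: the identity r^2 = R^2 holds
```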
$$(T_1)^2 = \frac{(\hat{\beta}_1)^2 S_{XX}}{s_\epsilon^2} = \frac{SSREG}{s_\epsilon^2} \sim F_{1,n-2}.$$

11.8 Checking the model assumptions

The constant variance assumption can be checked via a scatter plot of the residuals $(y_i - \hat{y}_i)$ versus $x_i$ (or $\hat{y}_i$). This plot is often called the residual plot. The normality assumption can be checked via a normal p-p plot of the standardized residuals (each residual divided by its standard error).

EXAMPLE 11.1 A restaurant opening on a reservations-only basis would like to use the number of advance reservations x to predict the number of dinners y to be prepared. Data on reservations and number of dinners served for one day chosen at random from each week in a 100-week period gave the following results:
$$\bar{x} = 150, \qquad \bar{y} = 120,$$
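The identity $(T_1)^2 = SSREG / s_\epsilon^2$ can likewise be checked numerically. A minimal sketch (function name and data values are ours, for illustration only):

```python
def t1_squared_and_f(xs, ys):
    """Compute (T1)^2 under H0: beta1 = 0, and SSREG / s_eps^2 (the F statistic),
    which should agree exactly."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s2 = sse / (n - 2)                        # s_eps^2
    ssreg = sum((b0 + b1 * x - ybar) ** 2 for x in xs)
    t1_sq = b1 ** 2 * sxx / s2                # (T1)^2
    f = ssreg / s2                            # F_{1,n-2} statistic
    return t1_sq, f

t1_sq, f = t1_squared_and_f([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(abs(t1_sq - f) < 1e-9)  # True
```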
$$\sum (x - \bar{x})^2 = 90{,}000, \qquad \sum (y - \bar{y})^2 = 70{,}000, \qquad \sum (x - \bar{x})(y - \bar{y}) = 60{,}000.$$

a. Find the least squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ for the linear regression line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.
b. Predict the number of meals to be prepared if the number of reservations is 135.
c. Construct a 90% confidence interval for the slope. Does information on x (number of advance reservations) help in predicting y (number of dinners prepared)?

Solution:
a. The least squares estimates are given by
$$\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}} = \frac{60{,}000}{90{,}000} = .67 \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 120 - .67(150) = 19.50.$$
b. The predicted number of meals for 135 advance reservations is
$$\hat{y} = 19.50 + .67(135) = 109.95, \text{ or } 110.$$
c. The 90% confidence interval for $\beta_1$ uses the formula $\hat{\beta}_1 \pm t \cdot (\text{standard error})$, where the standard error is $s_\epsilon / \sqrt{S_{XX}}$. Although Table 4 in the Appendix does not list a t-value for $\alpha = .05$ and df = 98, we use the t-value for the next higher df (df = 120); this value is 1.658. The standard deviation $s_\epsilon$ can be computed using the summary sample data, where
$$s_\epsilon^2 = \frac{SSE}{n-2}, \qquad SSE = S_{YY} - \hat{\beta}_1 S_{XY} = 70{,}000 - 0.67(60{,}000) = 29{,}800.$$
Thus,
$$s_\epsilon = \sqrt{\frac{29{,}800}{98}} = \sqrt{304.08} = 17.44$$
and the 90% confidence interval for $\beta_1$ is
$$0.67 \pm 1.658 \cdot \frac{17.44}{\sqrt{90{,}000}}, \quad \text{or} \quad 0.67 \pm .10.$$
Since we are 90% confident that the true value of $\beta_1$ lies somewhere in the interval $.57 \le \beta_1 \le .77$, we are thus confident that the increase in y (number of dinners prepared) for every increase of one advance reservation is in the interval from .57 to .77. Also, since the interval for $\beta_1$ does not include 0 as a possible value for the slope, it appears that the number of advance reservations is a useful predictor of the number of meals to be prepared in the context of a linear regression model, $y = \beta_0 + \beta_1 x + \epsilon$.

EXAMPLE 11.2 Refer to the data of Example 11.1. Confirm the conclusion we reached concerning $\beta_1$ by conducting a test of $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$. Use $\alpha = .10$.

Solution: The parts of the statistical test are given here:
$$H_0: \beta_1 = 0, \qquad H_a: \beta_1 \neq 0$$
$$\text{T.S.}: \quad t = \frac{\hat{\beta}_1 \sqrt{S_{XX}}}{s_\epsilon} = \frac{0.67 \sqrt{90{,}000}}{17.44} = 11.53$$
R.R.: For a two-tailed test with $\alpha = .10$ and df = 98, we will reject $H_0$ if $|t| > 1.645$.
Conclusion: Since $t = 11.53$ is greater than 1.645, we have sufficient evidence to reject $H_0$. It does appear that x is useful in predicting y.
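The arithmetic of Examples 11.1 and 11.2 can be reproduced from the summary statistics alone. A minimal sketch carrying full precision throughout; the notes round $\hat{\beta}_1$ to .67 before the later steps, so the hand-computed SSE, interval, and t-statistic differ slightly from the values below:

```python
import math

# Summary statistics from Example 11.1
n = 100
xbar, ybar = 150.0, 120.0
sxx, syy, sxy = 90_000.0, 70_000.0, 60_000.0

b1 = sxy / sxx                      # slope estimate (2/3, unrounded)
b0 = ybar - b1 * xbar               # intercept estimate
yhat_135 = b0 + b1 * 135            # predicted meals for 135 reservations
sse = syy - b1 * sxy                # SSE = S_YY - beta1_hat * S_XY
s_eps = math.sqrt(sse / (n - 2))    # residual standard deviation
se_b1 = s_eps / math.sqrt(sxx)      # standard error of the slope
t_crit = 1.658                      # t-value for df = 120, from the notes
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
t_stat = b1 / se_b1                 # test statistic for H0: beta1 = 0
print(round(yhat_135), round(ci[0], 2), round(ci[1], 2), round(t_stat, 2))
```

Both routes lead to the same conclusions: roughly 110 meals, an interval of about (.57, .76) that excludes 0, and a highly significant t-statistic.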
[Diagram: The Simple Linear Regression model $y = \beta_0 + \beta_1 x + \epsilon$, with x the independent variable, y the dependent variable, $\epsilon$ the random error, and the unknown model parameters $\beta_0$ (intercept) and $\beta_1$ (slope).]