Ch 2: Simple Linear Regression

1. Simple Linear Regression Model

A simple regression model with a single regressor $x$ is
$$y = \beta_0 + \beta_1 x + \varepsilon,$$
where we assume the error $\varepsilon$ is an independent random component with zero mean and unknown variance $\sigma^2$.

The primary goal is to do statistical inference on $f(x) = \beta_0 + \beta_1 x$:
1. Estimate $\beta_0$ and $\beta_1$.
2. Test $H_0 : \beta_1 = 0$.

Note that the regressor $x$ is not a random variable, but the response $y$ is. Under the error assumption, the mean and variance of the response given $x$ are
$$E(y \mid x) = \beta_0 + \beta_1 x \quad\text{and}\quad \mathrm{Var}(y \mid x) = \sigma^2.$$
The true regression line $\beta_0 + \beta_1 x$ is the line of the mean response. The parameters $\beta_0$ (intercept) and $\beta_1$ (slope) are called regression coefficients.

2. Least-Squares Estimation

Suppose that we observe $n$ pairs of data $(x_1, y_1), \ldots, (x_n, y_n)$ from the model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n.$$
The question is how to estimate $\beta_0$ and $\beta_1$. We will use the method of least squares: find $\beta_0$ and $\beta_1$ to minimize the sum of the squares of the differences between the observations and a linear function.
[Figure: scatter plot of y versus x]

Least-Squares Estimation of $\beta_0$ and $\beta_1$

Using the least squares (LS) method, estimate $\beta_0$ and $\beta_1$ to minimize
$$S(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^n \varepsilon_i^2.$$
The LS estimators $\hat\beta_0$ and $\hat\beta_1$ satisfy
$$\frac{\partial S}{\partial \beta_0}\bigg|_{\hat\beta_0, \hat\beta_1} = -2 \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$$
$$\frac{\partial S}{\partial \beta_1}\bigg|_{\hat\beta_0, \hat\beta_1} = -2 \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i) x_i = 0$$

$\Longrightarrow$ Least-squares normal equations:
$$n \hat\beta_0 + \hat\beta_1 \sum_{i=1}^n x_i = \sum_{i=1}^n y_i$$
$$\hat\beta_0 \sum_{i=1}^n x_i + \hat\beta_1 \sum_{i=1}^n x_i^2 = \sum_{i=1}^n y_i x_i$$

The solution to the normal equations is
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x \quad\text{and}\quad \hat\beta_1 = \frac{S_{xy}}{S_{xx}},$$
where $\bar y = \sum_{i=1}^n y_i / n$, $\bar x = \sum_{i=1}^n x_i / n$, $S_{xy} = \sum_{i=1}^n y_i (x_i - \bar x)$, and $S_{xx} = \sum_{i=1}^n (x_i - \bar x)^2$.
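The closed-form solution of the normal equations can be sketched in Python. The data below are a made-up toy dataset (the actual rocket propellant values are not listed in these notes), chosen only to illustrate the formulas:

```python
import numpy as np

# Hypothetical toy data, used only to illustrate the closed-form LS solution
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.0, 7.0, 8.0, 12.0, 14.0])
n = len(x)

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)   # S_xx = sum (x_i - xbar)^2
Sxy = np.sum(y * (x - xbar))    # S_xy = sum y_i (x_i - xbar)

b1 = Sxy / Sxx                  # slope estimate, beta_1-hat
b0 = ybar - b1 * xbar           # intercept estimate, beta_0-hat
```

For this toy data $\hat\beta_1 = 1.35$ and $\hat\beta_0 = 0.7$; plugging $(\hat\beta_0, \hat\beta_1)$ back into the two normal equations verifies that both are satisfied.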
Define the fitted values and the residuals as
$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i \quad\text{and}\quad e_i = y_i - \hat y_i,$$
respectively, for $i = 1, 2, \ldots, n$. Both quantities will be used later for checking model adequacy.

Example (the rocket propellant data, n = 20)

Question: what is the statistical relationship between y (shear strength) and x (age of propellant)? Use a simple linear regression model.

[Figure: scatter plot of shear strength versus age of propellant]

Estimating the parameters: to estimate the model parameters, first calculate
$$\bar x = 13.3625, \quad \bar y = 2131.358, \quad S_{xx} = \sum_{i=1}^{20} (x_i - \bar x)^2 = 1106.56, \quad S_{xy} = \sum_{i=1}^{20} y_i (x_i - \bar x) = -41112.65.$$
Then we find that
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{-41112.65}{1106.56} = -37.15$$
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x = 2131.358 + 37.15 \times 13.3625 = 2627.82$$
The least-squares fit is
$$\hat y = 2627.82 - 37.15 x$$

[Figure: scatter plot of shear strength versus age of propellant, with the fitted line]

Output from R (the rocket propellant data):

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2627.822     44.184   59.48  < 2e-16 ***
x            -37.154      2.889  -12.86 1.64e-10 ***

Properties of $\hat\beta_0$ and $\hat\beta_1$

$\hat\beta_0$ and $\hat\beta_1$ are unbiased estimators; that is,
$$E(\hat\beta_0) = \beta_0 \quad\text{and}\quad E(\hat\beta_1) = \beta_1.$$
The variances of $\hat\beta_1$ and $\hat\beta_0$ are
$$\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{S_{xx}} \quad\text{and}\quad \mathrm{Var}(\hat\beta_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar x^2}{S_{xx}} \right).$$
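The unbiasedness and variance claims can be checked by a small Monte Carlo simulation. The true parameters, design points, and error scale below are all hypothetical choices made only for this demonstration:

```python
import numpy as np

# Monte Carlo check: E(b1) = beta1 and Var(b1) = sigma^2 / S_xx.
# True parameters and design points are hypothetical simulation choices.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 5.0, 2.0, 1.0
x = np.arange(10.0)                  # fixed (non-random) regressor values
xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)

b1_draws = []
for _ in range(5000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    b1_draws.append(np.sum(y * (x - xbar)) / Sxx)

b1_mean = np.mean(b1_draws)          # should be close to beta1 = 2
b1_var = np.var(b1_draws)            # should be close to sigma^2 / Sxx
```

Over many simulated datasets the average slope estimate settles near the true $\beta_1$, and its empirical variance near $\sigma^2 / S_{xx}$, consistent with the formulas above.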
Properties of the least-squares fit $\hat y$

- The sum of the residuals is always zero: $\sum (y_i - \hat y_i) = \sum e_i = 0$, which implies $\sum y_i = \sum \hat y_i$.
- The least-squares fit always passes through the point $(\bar x, \bar y)$.
- The sum of the residuals weighted by the corresponding $x_i$ always equals zero: $\sum x_i e_i = 0$.
- The sum of the residuals weighted by the corresponding fitted values $\hat y_i$ always equals zero: $\sum \hat y_i e_i = 0$.

Estimation of $\sigma^2$

An estimate of $\sigma^2$ is required for hypothesis testing on $\beta_0$ and $\beta_1$. To that end, decompose the corrected SS of $y_i$ into the residual (error) SS and the model (regression) SS:
$$\sum (y_i - \bar y)^2 = \sum (y_i - \hat y_i)^2 + \sum (\hat y_i - \bar y)^2 + 2 \sum (y_i - \hat y_i)(\hat y_i - \bar y).$$
Since the cross-product term equals zero,
$$\sum (y_i - \bar y)^2 = \sum (y_i - \hat y_i)^2 + \sum (\hat y_i - \bar y)^2$$
$$SS_T = SS_E + SS_R$$
For the $SS_E$ term,
$$SS_E = \sum e_i^2 = \sum (y_i - \hat y_i)^2 = \sum y_i^2 - n \bar y^2 - \hat\beta_1 S_{xy} = SS_T - \hat\beta_1 S_{xy}.$$
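The four residual identities above can be verified numerically on the same hypothetical toy data used earlier (any dataset would do, since these identities hold for every least-squares fit):

```python
import numpy as np

# Verify the residual identities of the LS fit on hypothetical toy data.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.0, 7.0, 8.0, 12.0, 14.0])
xbar, ybar = x.mean(), y.mean()
b1 = np.sum(y * (x - xbar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

yhat = b0 + b1 * x              # fitted values
e = y - yhat                    # residuals

sum_e = np.sum(e)               # sum of residuals            -> 0
sum_xe = np.sum(x * e)          # x-weighted residual sum     -> 0
sum_ye = np.sum(yhat * e)       # yhat-weighted residual sum  -> 0
fit_at_xbar = b0 + b1 * xbar    # fit evaluated at xbar       -> ybar
```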
$SS_E$ has $n - 2$ degrees of freedom.

As an unbiased estimator of $\sigma^2$, use
$$\hat\sigma^2 = \frac{SS_E}{n-2} = MS_E.$$
$MS_E$ is called the mean square error, and $\hat\sigma$ is called the standard error of regression. Notice that it is a model-dependent estimate of $\sigma^2$.

For the rocket propellant data, find
$$SS_T = \sum_{i=1}^{20} y_i^2 - 20 \bar y^2 = 1693737.6.$$
Then compute
$$SS_E = SS_T - \hat\beta_1 S_{xy} = 1693737.6 - (-37.15)(-41112.65) = 166402.65.$$
Finally, the estimate of $\sigma^2$ is
$$\hat\sigma^2 = \frac{SS_E}{n-2} = \frac{166402.65}{18} = 9244.59.$$

Alternative form of the model

Consider an alternative model obtained by rewriting the original model as
$$y_i = (\beta_0 + \beta_1 \bar x) + \beta_1 (x_i - \bar x) + \varepsilon_i = \beta_0' + \beta_1 (x_i - \bar x) + \varepsilon_i,$$
which is a version of the original model shifted by $\bar x$. The least-squares estimators are $\hat\beta_0' = \bar y$ and $\hat\beta_1 = S_{xy}/S_{xx}$. In this form the least-squares estimators are uncorrelated; that is, $\mathrm{Cov}(\hat\beta_0', \hat\beta_1) = 0$.
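The computational shortcut $SS_E = SS_T - \hat\beta_1 S_{xy}$ can be checked against the direct sum of squared residuals on the hypothetical toy data:

```python
import numpy as np

# Check the shortcut SS_E = SS_T - b1 * S_xy against the direct computation.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.0, 7.0, 8.0, 12.0, 14.0])
n = len(x)
xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum(y * (x - xbar))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

SST = np.sum((y - ybar) ** 2)                 # corrected total SS
SSE_direct = np.sum((y - (b0 + b1 * x)) ** 2) # sum of squared residuals
SSE_shortcut = SST - b1 * Sxy                 # shortcut form
MSE = SSE_shortcut / (n - 2)                  # unbiased estimate of sigma^2
```

Both routes give the same $SS_E$ (here 1.9, so $MS_E = 1.9/3$ with $n - 2 = 3$ degrees of freedom).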
3. Hypothesis Testing on the Slope

Now we need the assumption $\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. We wish to test the hypothesis that the slope is 0 (that is, the significance of regression):
$$H_0 : \beta_1 = 0 \quad\text{vs.}\quad H_1 : \beta_1 \neq 0.$$
$\hat\beta_1$ is distributed as $N(\beta_1, \sigma^2 / S_{xx})$. In the case that $\sigma^2$ is known, use the statistic
$$Z_0 = \frac{\hat\beta_1 - 0}{\sqrt{\sigma^2 / S_{xx}}},$$
which is distributed $N(0, 1)$ when the null hypothesis is true.

t-tests

Typically $\sigma^2$ is unknown. Use the t-test
$$t_0 = \frac{\hat\beta_1 - 0}{se(\hat\beta_1)},$$
which follows a t distribution with $n - 2$ degrees of freedom under the null hypothesis below. Here $se(\hat\beta_1) = \sqrt{MS_E / S_{xx}}$ is called the standard error of the slope.
$$H_0 : \beta_1 = 0 \quad\text{vs.}\quad H_1 : \beta_1 \neq 0.$$
Reject the null hypothesis if $|t_0| > t_{\alpha/2, n-2}$. If we fail to reject $H_0 : \beta_1 = 0$, there is no evidence of a linear relationship between $x$ and $y$.

Example (the rocket propellant data):
With $\hat\beta_1 = -37.15$, $MS_E = 9244.59$, and $S_{xx} = 1106.56$, the test statistic is
$$t_0 = \frac{-37.15 - 0}{\sqrt{9244.59 / 1106.56}} = -12.85.$$
With $\alpha = 0.05$, $t_{0.025,18} = 2.101$. Since $|t_0| > 2.101$, we reject $H_0$.

A second example (simulated data, n = 20): with $\hat\beta_1 \approx 0$, $MS_E = 2.889$, and $S_{xx} = 1106.56$, the test statistic is
$$t_0 = \frac{\hat\beta_1 - 0}{\sqrt{MS_E / S_{xx}}} \approx 0.$$
With $\alpha = 0.05$, $t_{0.025,18} = 2.101$. Since $|t_0| < 2.101$, we fail to reject $H_0$.

[Figure: scatter plot of the simulated response new.y versus x, showing no linear trend]

Output from Statistical Software

Many statistical software packages use the P-value approach for decision making. The P-value is the probability (when $H_0$ is true) that $t_0$ takes a value as extreme as, or more extreme than, the value actually observed.
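The t-test computation can be sketched on the hypothetical toy data ($n = 5$, so $df = 3$; the critical value $t_{0.025,3} = 3.182$ is taken from a standard t table):

```python
import math
import numpy as np

# t-test of H0: beta1 = 0 on hypothetical toy data (df = n - 2 = 3).
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.0, 7.0, 8.0, 12.0, 14.0])
n = len(x)
xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum(y * (x - xbar))
b1 = Sxy / Sxx
MSE = (np.sum((y - ybar) ** 2) - b1 * Sxy) / (n - 2)

se_b1 = math.sqrt(MSE / Sxx)    # standard error of the slope
t0 = (b1 - 0.0) / se_b1         # test statistic
t_crit = 3.182                  # t_{0.025,3} from a t table
reject = abs(t0) > t_crit       # True: regression is significant here
```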
Decision rule: if P-value $\le \alpha$, reject $H_0$.

Output from R (the rocket propellant data):

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2627.822     44.184   59.48  < 2e-16 ***
x            -37.154      2.889  -12.86 1.64e-10 ***

$\Longrightarrow$ Since the p-value $\approx 0$ ($< 0.05$), reject $H_0$.

The Analysis of Variance (ANOVA)

An approach to testing the significance of regression (that is, testing $H_0 : \beta_1 = 0$). ANOVA provides the same result as the t-test, but it will be very useful for the multiple regression model. ANOVA is based on a partition of the total variability in the response variable $y$. Consider the partition
$$y_i - \bar y = (y_i - \hat y_i) + (\hat y_i - \bar y).$$
Squaring both sides and summing over $i$,
$$\sum (y_i - \bar y)^2 = \sum (y_i - \hat y_i)^2 + \sum (\hat y_i - \bar y)^2$$
$$SS_T = SS_E + SS_R$$
Note that the cross-product term is equal to 0 and $SS_R = \hat\beta_1 S_{xy}$.

Degrees of freedom (df):
$$df_{SS_T} = df_{SS_R} + df_{SS_E}, \quad (n-1) = 1 + (n-2).$$
Use the F-test for testing $H_0 : \beta_1 = 0$:
$$F_0 = \frac{SS_R / 1}{SS_E / (n-2)} = \frac{MS_R}{MS_E},$$
where $F_0$ follows the $F_{1,n-2}$ distribution under $H_0$. The decision rule is: reject $H_0$ if $F_0 > F_{\alpha,1,n-2}$. The benefit of using the analysis of variance appears in multiple regression.

Example (the rocket propellant data): To obtain the statistic $F_0$, first compute
$$SS_T = \sum_{i=1}^{20} (y_i - \bar y)^2 = 1693737.6$$
$$SS_R = \hat\beta_1 S_{xy} = (-37.15)(-41112.65) = 1527334.95.$$
Then find
$$SS_E = SS_T - SS_R = 166402.65.$$
Finally, compute
$$F_0 = \frac{1527334.95 / 1}{166402.65 / 18} = 165.21.$$
Reject $H_0$ because $F_0 > F_{0.01,1,18} = 8.29$.

Output from R (P-value approach):

F-statistic: 165.4 on 1 and 18 DF, p-value: 1.643e-10

4. Interval Estimation

Confidence intervals on $\beta_0$ and $\beta_1$

Under the assumption that the errors are normally and independently distributed, both
$$\frac{\hat\beta_1 - \beta_1}{se(\hat\beta_1)} \quad\text{and}\quad \frac{\hat\beta_0 - \beta_0}{se(\hat\beta_0)}$$
follow a t distribution with $n - 2$ degrees of freedom.
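The ANOVA decomposition and F statistic can be computed on the hypothetical toy data; note that in simple regression $F_0 = t_0^2$:

```python
import numpy as np

# ANOVA partition SS_T = SS_E + SS_R and F statistic on hypothetical toy data.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.0, 7.0, 8.0, 12.0, 14.0])
n = len(x)
xbar, ybar = x.mean(), y.mean()
Sxy = np.sum(y * (x - xbar))
b1 = Sxy / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar
yhat = b0 + b1 * x

SST = np.sum((y - ybar) ** 2)       # total SS
SSR = np.sum((yhat - ybar) ** 2)    # regression SS, equals b1 * Sxy
SSE = np.sum((y - yhat) ** 2)       # error SS
F0 = (SSR / 1) / (SSE / (n - 2))    # MS_R / MS_E, df = (1, n - 2)
```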
Therefore, a $100(1-\alpha)$ percent confidence interval on $\beta_1$ is
$$\hat\beta_1 - t_{\alpha/2,n-2}\, se(\hat\beta_1) \le \beta_1 \le \hat\beta_1 + t_{\alpha/2,n-2}\, se(\hat\beta_1)$$
and a $100(1-\alpha)$ percent confidence interval on $\beta_0$ is
$$\hat\beta_0 - t_{\alpha/2,n-2}\, se(\hat\beta_0) \le \beta_0 \le \hat\beta_0 + t_{\alpha/2,n-2}\, se(\hat\beta_0),$$
where
$$se(\hat\beta_0) = \sqrt{MS_E \left( \frac{1}{n} + \frac{\bar x^2}{S_{xx}} \right)} \quad\text{and}\quad se(\hat\beta_1) = \sqrt{\frac{MS_E}{S_{xx}}}.$$

Example (the rocket propellant data): Construct 95% confidence intervals on $\beta_0$ and $\beta_1$. With $se(\hat\beta_0) = 44.184$, $se(\hat\beta_1) = 2.89$, and $t_{0.025,18} = 2.101$,
$$\hat\beta_0 - t_{0.025,18}\, se(\hat\beta_0) \le \beta_0 \le \hat\beta_0 + t_{0.025,18}\, se(\hat\beta_0)$$
$$2627.82 - (2.101)(44.184) \le \beta_0 \le 2627.82 + (2.101)(44.184)$$
$$2534.99 \le \beta_0 \le 2720.65$$
$$\hat\beta_1 - t_{0.025,18}\, se(\hat\beta_1) \le \beta_1 \le \hat\beta_1 + t_{0.025,18}\, se(\hat\beta_1)$$
$$-37.15 - (2.101)(2.89) \le \beta_1 \le -37.15 + (2.101)(2.89)$$
$$-43.22 \le \beta_1 \le -31.08$$

Confidence intervals on $\sigma^2$

If the errors $\varepsilon_i$ are normally and independently distributed, it can be shown that
$$\frac{(n-2)\, MS_E}{\sigma^2} \sim \chi^2_{n-2}.$$
Thus, a $100(1-\alpha)$ percent confidence interval on $\sigma^2$ is
$$\frac{(n-2)\, MS_E}{\chi^2_{\alpha/2,n-2}} \le \sigma^2 \le \frac{(n-2)\, MS_E}{\chi^2_{1-\alpha/2,n-2}}.$$
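Both interval formulas can be sketched on the hypothetical toy data ($df = 3$). The quantiles $t_{0.025,3} = 3.182$, $\chi^2_{0.025,3} = 9.348$, and $\chi^2_{0.975,3} = 0.216$ are standard table values:

```python
import math
import numpy as np

# 95% confidence intervals for beta1 and sigma^2 on hypothetical toy data.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.0, 7.0, 8.0, 12.0, 14.0])
n = len(x)
xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum(y * (x - xbar))
b1 = Sxy / Sxx
MSE = (np.sum((y - ybar) ** 2) - b1 * Sxy) / (n - 2)
se_b1 = math.sqrt(MSE / Sxx)

t_crit = 3.182                        # t_{0.025,3}
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

chi2_upper, chi2_lower = 9.348, 0.216  # chi2_{0.025,3}, chi2_{0.975,3}
ci_sigma2 = ((n - 2) * MSE / chi2_upper, (n - 2) * MSE / chi2_lower)
```

Each interval is centered (for $\beta_1$) or pivoted (for $\sigma^2$) around the point estimate, so the point estimate always lies inside its own interval.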
Interval estimation of the mean response

A major use of a regression model is to estimate the mean response $E(y)$ for a particular value of $x$. We want to estimate the mean response $E(y \mid x_0)$, where $x_0$ is any value of the regressor variable within the range of the original data on $x$ used to fit the model. Use the estimator
$$\widehat{E(y \mid x_0)} = \hat\mu_{y|x_0} = \hat\beta_0 + \hat\beta_1 x_0.$$
$\hat\mu_{y|x_0}$ is normally distributed with mean $E(y \mid x_0)$ and variance
$$\sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}} \right).$$
The sampling distribution of
$$\frac{\hat\mu_{y|x_0} - E(y \mid x_0)}{\sqrt{MS_E \left( \dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{xx}} \right)}}$$
is a t distribution with $n - 2$ degrees of freedom. Therefore, a $100(1-\alpha)$ percent confidence interval on the mean response at the point $x = x_0$ is
$$\hat\mu_{y|x_0} - t_{\alpha/2,n-2}\, A \le E(y \mid x_0) \le \hat\mu_{y|x_0} + t_{\alpha/2,n-2}\, A,$$
where
$$A = \sqrt{MS_E \left( \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}} \right)}.$$

For the rocket propellant data, a 95% confidence interval on $E(y \mid x_0)$ is
$$\hat\mu_{y|x_0} - (2.101) A(x_0) \le E(y \mid x_0) \le \hat\mu_{y|x_0} + (2.101) A(x_0),$$
where
$$A(x_0) = \sqrt{9244.59 \left( \frac{1}{20} + \frac{(x_0 - 13.3625)^2}{1106.56} \right)}.$$
For example, if $x_0 = 13.3625$, then the confidence interval is
$$2086.230 \le E(y \mid 13.3625) \le 2176.571.$$
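The mean-response interval can be sketched on the hypothetical toy data; the evaluation point $x_0 = 5$ is an arbitrary choice inside the range of the toy $x$ values:

```python
import math
import numpy as np

# 95% CI for the mean response E(y | x0) on hypothetical toy data (df = 3).
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.0, 7.0, 8.0, 12.0, 14.0])
n = len(x)
xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum(y * (x - xbar))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
MSE = (np.sum((y - ybar) ** 2) - b1 * Sxy) / (n - 2)

x0 = 5.0                                 # point of interest (hypothetical)
mu_hat = b0 + b1 * x0                    # estimated mean response at x0
A = math.sqrt(MSE * (1.0 / n + (x0 - xbar) ** 2 / Sxx))
t_crit = 3.182                           # t_{0.025,3} from a t table
ci_mean = (mu_hat - t_crit * A, mu_hat + t_crit * A)
```

The half-width $t_{\alpha/2,n-2} A(x_0)$ is smallest at $x_0 = \bar x$ and grows as $x_0$ moves away from $\bar x$.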
Prediction of new observations

Prediction is an important application of the regression model. If $x_0$ is the value of the regressor variable of interest, then
$$\hat y_0 = \hat\beta_0 + \hat\beta_1 x_0$$
is the point prediction of the future observation $y_0$. The variability of $\hat y_0 - y_0$ is
$$E(\hat y_0 - y_0)^2 = E\{[\hat y_0 - E(y_0)] - [y_0 - E(y_0)]\}^2 = E[\hat y_0 - E(y_0)]^2 + E[y_0 - E(y_0)]^2 = \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}} \right) + \sigma^2.$$
(The cross term vanishes because $\hat y_0$ depends only on the original sample, which is independent of the future observation $y_0$.)

Therefore, a $100(1-\alpha)$ percent prediction interval on a future observation at the point $x = x_0$ is
$$\hat y_0 - t_{\alpha/2,n-2}\, B \le y_0 \le \hat y_0 + t_{\alpha/2,n-2}\, B,$$
where
$$B = \sqrt{MS_E \left( 1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}} \right)}.$$

For the rocket propellant data, a 95% prediction interval on $y_0$ is
$$\hat y_0 - (2.101) B(x_0) \le y_0 \le \hat y_0 + (2.101) B(x_0),$$
where
$$B(x_0) = \sqrt{9244.59 \left( 1 + \frac{1}{20} + \frac{(x_0 - 13.3625)^2}{1106.56} \right)}.$$
If $x_0 = 10$, then a 95% prediction interval is $[2048.32, 2464.32]$.
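Comparing the mean-response half-width $A$ with the prediction half-width $B$ on the hypothetical toy data makes the extra $\sigma^2$ term concrete: the prediction interval must also absorb the noise of the new observation, so it is always wider.

```python
import math
import numpy as np

# 95% prediction interval for a new observation on hypothetical toy data.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.0, 7.0, 8.0, 12.0, 14.0])
n = len(x)
xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum(y * (x - xbar))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
MSE = (np.sum((y - ybar) ** 2) - b1 * Sxy) / (n - 2)

x0 = 5.0
y0_hat = b0 + b1 * x0
A = math.sqrt(MSE * (1.0 / n + (x0 - xbar) ** 2 / Sxx))        # mean response
B = math.sqrt(MSE * (1.0 + 1.0 / n + (x0 - xbar) ** 2 / Sxx))  # new observation
t_crit = 3.182                                                 # t_{0.025,3}
pi = (y0_hat - t_crit * B, y0_hat + t_crit * B)
```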
Coefficient of determination $R^2$

The coefficient of determination is defined as
$$R^2 = \frac{SS_R}{SS_T} \quad (0 \le R^2 \le 1).$$
$R^2$ is the proportion of variation explained by the regressor $x$. $R^2 \approx 1$ implies that most of the variability in $y$ is explained by the regression model.

For the rocket propellant data,
$$R^2 = \frac{SS_R}{SS_T} = \frac{1527334.95}{1693737.6} = 0.9018;$$
that is, 90.18% of the variability in strength is explained by the regression model.

Relationship to the correlation coefficient:
$$R^2 = \frac{SS_R}{SS_T} = \frac{\hat\beta_1 S_{xy}}{S_{yy}} = \frac{S_{xy}}{S_{xx}} \cdot \frac{S_{xy}}{S_{yy}} = \left( \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}} \right)^2 = r_{xy}^2,$$
the square of the sample correlation between $x$ and $y$.

For the rocket propellant data, $R^2 = r_{xy}^2 = (-0.9496533)^2 = 0.9018$.
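The identity $R^2 = r_{xy}^2$ can be verified numerically on the hypothetical toy data, cross-checking the hand-rolled correlation against NumPy's:

```python
import math
import numpy as np

# Verify R^2 = r_xy^2 on hypothetical toy data.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.0, 7.0, 8.0, 12.0, 14.0])
xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
Syy = np.sum((y - ybar) ** 2)
Sxy = np.sum((x - xbar) * (y - ybar))   # equals sum y_i (x_i - xbar)
b1 = Sxy / Sxx

SSR = b1 * Sxy                          # regression SS
R2 = SSR / Syy                          # coefficient of determination
r = Sxy / math.sqrt(Sxx * Syy)          # sample correlation r_xy
```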