Simple Linear Regression (ST 430/514)

Recall: a regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates) x1, x2, ..., xk.
The Straight-Line Probabilistic Model

Simplest case of a regression model: one independent variable (k = 1, so x1 = x) and linear dependence. The model equation is

E(Y) = β0 + β1 x,

or equivalently

Y = β0 + β1 x + ε.
Interpreting the parameters:

β0 is the intercept (so called because it is where the graph of y = β0 + β1 x meets the y-axis, at x = 0);

β1 is the slope; that is, the change in E(Y) when x is changed to x + 1.

Note: if β1 = 0, x has no effect on Y; that will often be an interesting hypothesis to test.
Advertising and Sales example

x = monthly advertising expenditure, in hundreds of dollars;
y = monthly sales revenue, in thousands of dollars;
β0 = expected revenue with no advertising;
β1 = expected revenue increase per $100 increase in advertising, in thousands of dollars.

Sample data for five months:

Advertising (x):  1  2  3  4  5
Revenue (y):      1  1  2  2  4
What do these data tell us about β0 and β1?

[Figure: scatterplot of revenue (y) against advertising (x) for the five months.]
We could try various values of β0 and β1. For given values of β0 and β1, we get predictions

pi = β0 + β1 xi, i = 1, 2, 3, 4, 5.

The difference between the observed value yi and the prediction pi is the residual

ri = yi − pi, i = 1, 2, 3, 4, 5.

A good choice of β0 and β1 gives accurate predictions, and hence generally small residuals.
One candidate line (β0 = 0.1, β1 = 0.7):

[Figure: scatterplot of advertising (x) and revenue (y) with the candidate line superimposed.]
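The predictions and residuals for this candidate line can be computed directly from the definitions on the previous slide; a minimal sketch in Python, using the five-month data from the example:

```python
# Predictions and residuals for the candidate line b0 = 0.1, b1 = 0.7,
# using the advertising/sales data from the example.
x = [1, 2, 3, 4, 5]   # advertising, hundreds of dollars
y = [1, 1, 2, 2, 4]   # revenue, thousands of dollars

b0, b1 = 0.1, 0.7
p = [b0 + b1 * xi for xi in x]          # predictions p_i
r = [yi - pi for yi, pi in zip(y, p)]   # residuals r_i = y_i - p_i
sse = sum(ri ** 2 for ri in r)          # sum of squared residuals, approx. 1.30
```

Whether this candidate is a good choice depends on whether some other (β0, β1) gives a smaller sum of squared residuals, which is the subject of the next slide.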
Fitting the Model

How do we measure the overall size of the residuals? The most common measure (but not the only possibility) is the sum of squares of the residuals:

Σ ri² = Σ (yi − pi)² = Σ {yi − (β0 + β1 xi)}² = S(β0, β1).

The least squares line is the one with the smallest sum of squares.

Note: the least squares line has the property that Σ ri = 0; Definition 3.1 (page 95) does not need to impose that as a constraint.
The least squares estimates of β0 and β1 are the coefficients of the least squares line. Some algebra shows that the least squares estimates are

β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = (Σ xi yi − n x̄ ȳ) / (Σ xi² − n x̄²)

and

β̂0 = ȳ − β̂1 x̄.

With a little luck, you will never need to use these formulae.
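Applying these formulas to the advertising/sales data is straightforward; a short Python sketch:

```python
# Least squares estimates for the advertising/sales data,
# computed directly from the formulas above.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

xbar = sum(x) / n   # = 3.0
ybar = sum(y) / n   # = 2.0
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # = 7.0
sxx = sum((xi - xbar) ** 2 for xi in x)                       # = 10.0

b1_hat = sxy / sxx               # slope estimate: 0.7
b0_hat = ybar - b1_hat * xbar    # intercept estimate: -0.1
```

So the least squares line for these data is ŷ = −0.1 + 0.7x.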
Other criteria

Why square the residuals? We could instead use least absolute deviations estimates, minimizing

S1(β0, β1) = Σ |yi − (β0 + β1 xi)|.

Convenience: we have closed-form equations for the least squares estimates, but to find the least absolute deviations estimates we have to solve a linear programming problem.

Optimality: least squares estimates are BLUE if the errors ε are uncorrelated with constant variance, and MVUE if additionally ε is normal.
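As a crude illustration of the least absolute deviations criterion (not the linear programming approach mentioned above), one can brute-force a grid of candidate lines; the grid bounds and step size here are arbitrary choices:

```python
# Crude grid search for the least absolute deviations line on the
# advertising/sales data. Grid bounds and step (0.1) are arbitrary.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

def sad(b0, b1):
    """S1(b0, b1): sum of absolute deviations."""
    return sum(abs(yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))

best = min(
    ((i / 10, j / 10) for i in range(-20, 21) for j in range(0, 21)),
    key=lambda b: sad(*b),
)
```

The two criteria need not select the same line: the grid minimizer of S1 does at least as well, in absolute deviation terms, as the least squares line (−0.1, 0.7).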
Model Assumptions

The least squares line gives point estimates of β0 and β1, and these estimates are always unbiased. But to use the other forms of statistical inference:

interval estimates, such as confidence intervals;
hypothesis tests;

we need some assumptions about the random errors ε.
1. Zero mean: E(εi) = 0; as noted earlier, this is not really an assumption, but a consequence of the definition ε = Y − E(Y).

2. Constant variance: V(εi) = σ²; this is a nontrivial assumption, often violated in practice.

3. Normality: εi ~ N(0, σ²); this is also a nontrivial assumption, always violated in practice, but sometimes a useful approximation.

4. Independence: εi and εj are statistically independent; another nontrivial assumption, often true in practice, but typically violated with time series and spatial data.
Notes: Assumptions 2 and 4 are the conditions under which least squares estimates are BLUE (Best Linear Unbiased Estimators); Assumptions 2, 3, and 4 are the conditions under which least squares estimates are MVUE (Minimum Variance Unbiased Estimators).
Estimating σ²

Recall that σ² is the variance of εi, which we have assumed to be the same for all i. That is,

σ² = V(εi) = V[Yi − E(Yi)] = V[Yi − (β0 + β1 xi)], i = 1, 2, ..., n.

We observe Yi = yi and xi; if we knew β0 and β1, we would estimate σ² by

(1/n) Σ {yi − (β0 + β1 xi)}² = (1/n) S(β0, β1).
We do not know β0 and β1, but we have the least squares estimates β̂0 and β̂1, so we could use S(β̂0, β̂1) as an approximation to S(β0, β1). But we know that S(β̂0, β̂1) < S(β0, β1), so (1/n) S(β̂0, β̂1) would be a biased estimate of σ².
We can show that, under Assumptions 2 and 4,

E[S(β̂0, β̂1)] = (n − 2)σ².

So

s² = S(β̂0, β̂1) / (n − 2) = Σ(yi − ŷi)² / (n − 2),

where ŷi = β̂0 + β̂1 xi, is an unbiased estimate of σ². This is sometimes written

s² = Mean Square for Error = MS_E = Sum of Squares for Error / degrees of freedom for Error = SS_E / df_E.
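For the advertising/sales data, the least squares estimates from the formulas above are β̂0 = −0.1 and β̂1 = 0.7, so s² can be computed as follows:

```python
# Unbiased estimate of sigma^2 for the advertising/sales data,
# using the least squares fit yhat = -0.1 + 0.7 x.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

b0_hat, b1_hat = -0.1, 0.7
yhat = [b0_hat + b1_hat * xi for xi in x]
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # SS_E = 1.10
s2 = sse / (n - 2)   # MS_E = SS_E / df_E, with df_E = n - 2 = 3
```

Note the divisor n − 2 = 3, not n = 5: two degrees of freedom are used up estimating β0 and β1.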
Inferences about the line

We are often interested in whether x has any effect on E(Y). Since E(Y) = β0 + β1 x, the independent variable x has some effect whenever β1 ≠ 0. So we need to test the null hypothesis H0: β1 = 0.
We also need to construct a confidence interval for β1, to indicate how precisely we know its value. For both purposes, we need the standard error

σ_β̂1 = σ / √SSxx, where SSxx = Σ(xi − x̄)².

As always, since σ is unknown, we replace it by its estimate s, to get the estimated standard error

σ̂_β̂1 = s / √SSxx.
A confidence interval for β1 is

β̂1 ± t(α/2, n−2) · σ̂_β̂1.

Note that we use the t-distribution with n − 2 degrees of freedom, because that is the degrees of freedom associated with s². To test H0: β1 = 0, we use the test statistic

t = β̂1 / σ̂_β̂1,

and reject H0 at significance level α if |t| > t(α/2, n−2).
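Putting the pieces together for the advertising/sales data (β̂1 = 0.7, SS_E = 1.10, SSxx = 10, all from the formulas above), and looking up the critical value t(0.025, 3) = 3.182 from a t table:

```python
import math

# Inference about the slope for the advertising/sales data,
# using the least squares quantities derived from the formulas above.
n = 5
b1_hat = 0.7
s = math.sqrt(1.10 / (n - 2))      # s = sqrt(SS_E / (n - 2))
ss_xx = 10.0                       # SSxx = sum of (x_i - xbar)^2
se_b1 = s / math.sqrt(ss_xx)       # estimated standard error of b1_hat

t_stat = b1_hat / se_b1            # test statistic for H0: beta1 = 0
t_crit = 3.182                     # t(0.025, 3), from a t table
ci = (b1_hat - t_crit * se_b1, b1_hat + t_crit * se_b1)
reject = abs(t_stat) > t_crit      # reject H0 at alpha = 0.05?
```

Here t ≈ 3.66 > 3.182, so H0 is rejected at the 5% level, and the 95% confidence interval for β1 lies entirely above 0.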
Compare the Confidence Interval and the Hypothesis Test

Note that we reject H0 if and only if the corresponding confidence interval does not include 0.