Linear regression model

We assume that two quantitative variables, x and y, are linearly related; that is, the entire population of (x, y) pairs is related by an ideal population regression line

    y = α + βx + e

where α and β represent the y-intercept and slope coefficients. The quantity e is included to represent the fact that the relation is subject to random error; e can be interpreted as either
- the deviation of that value of y from the mean given by the population regression line, or
- the error in using the line to predict a value of y from the corresponding given x.

We assume that e is a normally distributed random variable with mean μ_e = 0 and standard deviation σ_e, which is large when errors are large and small when errors are small.
Note that e is a different random variable for different values of x; all such e are assumed to be independent of each other and identically distributed.

For a fixed value x* of x, the quantity α + βx* represents the (fixed) height of the regression line at x = x*, so y = α + βx* + e is subject to the same kind of variability as e: namely, y is normally distributed with mean μ_y = α + βx* and standard deviation σ_y = σ_e.

β, being the slope of the line, represents the change in μ_y associated with a unit change in x; that is, β is the average change in y associated with a unit change in x.

The parameters of interest for the regression model are
- σ_e, which measures the typical size of errors in using the line to make predictions of y values, and
- β, which measures the average change in y associated with a unit change in x.
Estimating regression parameters

Estimating σ_e: the standard deviation about the regression line,

    s_e = √( SSResid / (n − 2) ),

is not an unbiased estimator of σ_e (although s_e² is an unbiased estimator of σ_e²). [TI-83: STAT TESTS LinRegTTest (denoted s).]

Estimating β: the slope of the least-squares regression line,

    b = r · (s_y / s_x),

is an unbiased estimator for β. [TI-83: STAT TESTS LinRegTTest; also STAT CALC LinReg(a+bx).]
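As an illustration (not a TI-83 procedure), these estimates can be computed directly from the defining formulas. The five-point data set below is hypothetical; a sketch in Python:

```python
import math

# hypothetical small data set
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 3.0, 5.0, 4.0, 6.0]
n = len(xs)

xbar = sum(xs) / n
ybar = sum(ys) / n

# least-squares slope: b = r * (s_y / s_x), equivalently Sxy / Sxx
Sxx = sum((x - xbar) ** 2 for x in xs)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b = Sxy / Sxx                 # point estimate of the population slope beta
a = ybar - b * xbar           # point estimate of the intercept alpha

# standard deviation about the regression line: s_e = sqrt(SSResid / (n - 2))
SSResid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s_e = math.sqrt(SSResid / (n - 2))

print(b, a, s_e)
```

For this data set the slope works out to b = 0.9 and the intercept to a = 1.3; on a TI-83 the same values come from STAT CALC LinReg(a+bx).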
The sampling distribution for b

The sampling distribution of b is studied to determine how estimates of β behave from sample to sample. Assuming that the n data points produce independent, identically normally distributed errors e, all with mean 0 and standard deviation σ_e, we have that
- μ_b = β,
- σ_b = σ_e / (s_x √(n − 1)),
- the sampling distribution of b is normal.

Since neither σ_e nor σ_b is known, we estimate σ_e with the statistic s_e, and σ_b with the statistic

    s_b = s_e / (s_x √(n − 1)),

and then standardize b with the statistic

    t = (b − β) / s_b,    having df = n − 2.
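One way to see the facts above is by simulation. The sketch below uses hypothetical population parameters (β = 2, σ_e = 1) and repeatedly generates samples from the model; the average of the slope estimates should be close to β (μ_b = β), and their spread close to σ_e / (s_x √(n − 1)) = σ_e / √Sxx:

```python
import math
import random

random.seed(0)

alpha, beta, sigma_e = 1.0, 2.0, 1.0      # assumed population parameters
xs = [float(i) for i in range(20)]         # fixed x values for every sample
n = len(xs)
xbar = sum(xs) / n
Sxx = sum((x - xbar) ** 2 for x in xs)     # (n - 1) * s_x^2

slopes = []
for _ in range(2000):
    # generate y = alpha + beta*x + e with e ~ N(0, sigma_e), independent
    ys = [alpha + beta * x + random.gauss(0.0, sigma_e) for x in xs]
    ybar = sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / Sxx
    slopes.append(b)

mean_b = sum(slopes) / len(slopes)
sd_b = math.sqrt(sum((s - mean_b) ** 2 for s in slopes) / (len(slopes) - 1))

# theory predicts mu_b = beta and sigma_b = sigma_e / sqrt(Sxx)
print(mean_b, sd_b, sigma_e / math.sqrt(Sxx))
```

The simulated mean and standard deviation of the 2000 slopes land close to the theoretical values, illustrating that b is unbiased for β.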
Confidence interval for β

Assuming that the n data points produce independent, identically normally distributed errors e, all with mean 0 and standard deviation σ_e, we obtain the following confidence interval for β:

    b ± (t crit.) · s_b

where the t critical value is based on df = n − 2.
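A sketch of the interval computation, reusing the hypothetical five-point data set from above and a hardcoded 95% critical value for df = 3 (t* = 3.182, from a t table):

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 3.0, 5.0, 4.0, 6.0]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / Sxx
a = ybar - b * xbar
SSResid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s_e = math.sqrt(SSResid / (n - 2))

# estimated standard deviation of b: s_b = s_e / (s_x * sqrt(n - 1)) = s_e / sqrt(Sxx)
s_b = s_e / math.sqrt(Sxx)

t_crit = 3.182            # 95% two-tailed critical value for df = n - 2 = 3
lo_ci = b - t_crit * s_b
hi_ci = b + t_crit * s_b
print((lo_ci, hi_ci))
```

For these data the 95% interval is roughly (0.10, 1.70); any β in that range is plausible at the 95% confidence level.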
Model utility test for linear regression

If the slope of the population regression line is β = 0, then the line is horizontal and values of y do not depend on x, so there is no point in trying to predict y from knowledge of x. A test of whether β = 0 can therefore determine whether it is appropriate to seek a linear regression between the variables x and y.

Hypotheses: H_0: β = 0 vs. H_a: β ≠ 0
Test statistic: t = (b − 0) / s_b, with df = n − 2
Assumptions: independent, normally distributed errors with mean 0 and equal standard deviations
[TI-83: STAT TESTS LinRegTTest]
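For the same hypothetical data set, the test statistic can be computed directly and compared with the two-tailed 5% critical value for df = 3 (t* = 3.182, from a t table); LinRegTTest reports the same t along with a P-value:

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 3.0, 5.0, 4.0, 6.0]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / Sxx
a = ybar - b * xbar
s_e = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
s_b = s_e / math.sqrt(Sxx)

# model utility test: H0: beta = 0 vs Ha: beta != 0
t = (b - 0.0) / s_b
t_crit = 3.182            # two-tailed 5% critical value, df = n - 2 = 3
reject_H0 = abs(t) > t_crit
print(t, reject_H0)
```

Here t ≈ 3.58 > 3.182, so at the 5% level H_0: β = 0 is rejected and a linear relation is worth pursuing.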
Residual analysis

We can use a residual plot (a plot of residuals vs. x values) to check whether it is reasonable to assume that the errors are identically distributed, independent normal variables. The z-scores of the residuals can be used to display a standardized residual plot:

    z_resid = (resid − 0) / s_resid,

but the standard deviation of each residual varies from point to point and is not automatically calculated by the TI-83. Many statistical packages, however, do perform these calculations.

What to look for:
- absence of patterns in the (standardized) residual plot
- very few large residuals (more than 2 standard deviations from the x-axis)
- no variation in the spread of the residuals (such variation would indicate that σ_e varies with x)
- influential observations (residual points far removed from the bulk of the plot)
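A sketch of computing standardized residuals by hand. The per-point standard deviation used here is the standard leverage-based formula, s_resid,i = s_e √(1 − h_i) with h_i = 1/n + (x_i − x̄)²/Sxx, which goes beyond what these notes state but is what most statistical packages compute; the data set is the same hypothetical one as above:

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 3.0, 5.0, 4.0, 6.0]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / Sxx
a = ybar - b * xbar
resids = [y - (a + b * x) for x, y in zip(xs, ys)]
s_e = math.sqrt(sum(r ** 2 for r in resids) / (n - 2))

# standardized residuals: each residual has its own standard deviation
z_resids = []
for x, r in zip(xs, resids):
    h = 1.0 / n + (x - xbar) ** 2 / Sxx       # leverage of this point
    s_resid = s_e * math.sqrt(1.0 - h)
    z_resids.append(r / s_resid)

# flag residuals more than 2 standard deviations from the x-axis
flagged = [z for z in z_resids if abs(z) > 2.0]
print(z_resids, flagged)
```

For these data no standardized residual exceeds 2 in absolute value, so nothing is flagged; plotting z_resids against xs would give the standardized residual plot described above.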
The sampling distribution for a + bx*

Assuming that the n data points produce independent, identically normally distributed errors e, all with mean 0 and standard deviation σ_e, we study the distribution of the prediction statistic a + bx* for some fixed choice of x = x*.

a + bx* is an unbiased estimate of the true regression value α + βx*, which thus represents μ_{a+bx*}. Its standard deviation is

    σ_{a+bx*} = σ_e √( 1/n + (x* − x̄)² / (s_x²(n − 1)) ),

and is estimated by the statistic

    s_{a+bx*} = s_e √( 1/n + (x* − x̄)² / (s_x²(n − 1)) ).

a + bx* is normally distributed, but replacing σ_{a+bx*} with the estimate s_{a+bx*} produces a standardized t variable with df = n − 2.
Confidence interval for a + bx*

With the same assumptions as above, the confidence interval formula for α + βx*, the mean value of the predicted y, is

    (a + bx*) ± (t crit.) · s_{a+bx*},

where t has df = n − 2.

Prediction intervals

With the same assumptions as above, the prediction interval formula for y*, the prediction of y for the x value x = x*, is

    (a + bx*) ± (t crit.) · √( s_e² + s²_{a+bx*} ),

where t has df = n − 2. (The variability comes not only from the size of the error but also from the extent to which the estimate a + bx* differs from the mean value.)
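Both intervals can be sketched for the hypothetical data set used earlier, at the choice x* = 4, again with a hardcoded 95% critical value for df = 3 (t* = 3.182):

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 3.0, 5.0, 4.0, 6.0]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / Sxx
a = ybar - b * xbar
s_e = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))

x_star = 4.0
y_hat = a + b * x_star
# estimated sd of a + bx*: s_e * sqrt(1/n + (x* - xbar)^2 / Sxx)
s_fit = s_e * math.sqrt(1.0 / n + (x_star - xbar) ** 2 / Sxx)

t_crit = 3.182            # 95% two-tailed critical value, df = n - 2 = 3
# confidence interval for the mean value alpha + beta*x*
ci = (y_hat - t_crit * s_fit, y_hat + t_crit * s_fit)
# prediction interval for an individual y observed at x = x*
s_pred = math.sqrt(s_e ** 2 + s_fit ** 2)
pi = (y_hat - t_crit * s_pred, y_hat + t_crit * s_pred)
print(ci, pi)
```

As the formulas predict, the prediction interval is strictly wider than the confidence interval at the same x*, since it must cover the error e of a single new observation in addition to the uncertainty in a + bx*.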