Inference for Regression: Inference about the Regression Model and Using the Regression Line (Sections 10.1, 10.2, 10.3)
Basic components of the regression setup
- Target of inference: the linear dependence of a response variable on one or more explanatory variables
  - One explanatory variable: simple linear regression (SLR)
  - More than one explanatory variable: multiple regression
- The least-squares regression line describes this dependence in the data
- A population regression line describes its underlying idealization, and is involved in describing the probabilities of observing certain values in the sample measurements
Brief review of least-squares regression
- The least-squares regression line ŷ = b0 + b1x makes the sum of squared prediction errors as small as possible
- The slope is b1 = r(s_y / s_x) and the intercept is b0 = ȳ − b1x̄
- Predictions are made by plugging values of x into the equation of the line
- The residuals describe the leftover variation in y after fitting the least-squares regression line
- The coefficient of determination, r², measures the proportion of variability in y that is explained by x
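As a sketch, the review formulas above can be computed directly; the data and the function name here are illustrative, not part of the example that follows.

```python
# Least-squares fit from the review formulas:
#   b1 = sum (xi - xbar)(yi - ybar) / sum (xi - xbar)^2,   b0 = ybar - b1*xbar
# (equivalently b1 = r * s_y / s_x).  Data below are made up for illustration.
from statistics import mean

def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared prediction errors."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

b0, b1 = least_squares([1, 2, 3, 4], [2.1, 3.9, 6.0, 8.0])
# b1 = 9.9 / 5 = 1.98 and b0 = 5.0 - 1.98 * 2.5 = 0.05
```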
Formal setup for inference in regression
- The data arise as n pairs of measurements (x_1, y_1), …, (x_n, y_n), where (x_i, y_i) are the measurements on the i-th individual
- The statistical model is y_i = β0 + β1x_i + ε_i
  - µ_y = β0 + β1x_i is the mean response when x = x_i
  - The ε_i are independent, and each ε_i is N(0, σ)
- The least-squares regression line ŷ = b0 + b1x is the sample estimate of µ_y = β0 + β1x
Example: Wages and experience
Do wages rise with experience? In a study of employment trends, wage (y, in $/week) and length of service (LOS = x, in months) measurements were obtained from n = 59 workers in similar customer-service positions.

Wages  LOS   Wages  LOS   Wages  LOS   Wages  LOS   Wages  LOS   Wages  LOS
 389    94    403    76    443   222    486    60    547   228    443   104
 395    48    378    48    353    58    393     7    347    27    566    34
 329   102    348    61    349    41    311    22    328    48    461   184
 295    20    488    30    499   153    316    57    327     7    436   156
 377    60    391   108    322    16    384    78    320    74    321    25
 479    78    541    61    408    43    360    36    404   204    221    43
 315    45    312    10    393    96    369    83    443    24    547    36
 316    39    418    68    277    98    529    66    261    13    362    60
 324    20    417    54    649   150    270    47    417    30    415   102
 307    65    516    24    272   124    332    97    450    95
Example: Wages and experience (continued)
- Summary statistics: [table]
- Least-squares regression line: ŷ = 349.4 + 0.59x [scatterplot]
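The reported fit can be reproduced from the data table; a minimal sketch (variable names are illustrative, and the pairs below are transcribed from the table):

```python
# Fit wage (y, $/week) on LOS (x, months) for the n = 59 workers listed above.
from math import sqrt
from statistics import mean

# (LOS, wage) pairs transcribed from the data table
data = [
    (94, 389), (76, 403), (222, 443), (60, 486), (228, 547), (104, 443),
    (48, 395), (48, 378), (58, 353), (7, 393), (27, 347), (34, 566),
    (102, 329), (61, 348), (41, 349), (22, 311), (48, 328), (184, 461),
    (20, 295), (30, 488), (153, 499), (57, 316), (7, 327), (156, 436),
    (60, 377), (108, 391), (16, 322), (78, 384), (74, 320), (25, 321),
    (78, 479), (61, 541), (43, 408), (36, 360), (204, 404), (43, 221),
    (45, 315), (10, 312), (96, 393), (83, 369), (24, 443), (36, 547),
    (39, 316), (68, 418), (98, 277), (66, 529), (13, 261), (60, 362),
    (20, 324), (54, 417), (150, 649), (47, 270), (30, 417), (102, 415),
    (65, 307), (24, 516), (124, 272), (97, 332), (95, 450),
]
x = [d[0] for d in data]
y = [d[1] for d in data]
n = len(x)
xbar, ybar = mean(x), mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar            # reported line: y-hat = 349.4 + 0.59x
s = sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
se_b1 = s / sqrt(sxx)            # reported: s = 82.2, SE_b1 = 0.21
```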
Sampling framework
- Idea: each value of x defines a subpopulation
- Multiple, independent SRSs: each SRS is drawn from a distinct subpopulation
  - y = response variable = measurement of interest
  - x = explanatory variable = subpopulation and sample label
- One SRS, with multiple measurements: measure (x_i, y_i) on the i-th individual, but treat the x_i as fixed quantities
- The model describes the conditional distribution of y given its associated subpopulation
Comments on the statistical model
- y_i = β0 + β1x_i + ε_i with independent ε_i, each N(0, σ)
- Data = Fit + Residual: y_i = (β0 + β1x_i) + ε_i
- Linearity: µ_y = β0 + β1x connects the subpopulation means
- Constant spread: σ does not depend on x
- Normality: response measurements are bell-shaped within each subpopulation
Residuals and the residual standard deviation
- Unknown population quantities:
  - The random variables ε_i are the residual deviations
  - The parameter σ is the residual standard deviation
- Analogous quantities calculated from the sample:
  - The i-th (sample) residual is e_i = y_i − ŷ_i
  - The regression standard error is s = √(Σ e_i² / (n − 2))
Properties of the slope estimate
Suppose (x_1, y_1), …, (x_n, y_n) satisfy the assumptions of the statistical model for SLR
- Mean: µ_b1 = β1 (b1 is an unbiased estimate of the slope)
- Standard deviation: σ_b1 = σ / √(Σ(x_i − x̄)²)
- Standard error: SE_b1 = s / √(Σ(x_i − x̄)²)
Some computational formulas
- Regression standard error: s² = ((n − 1)/(n − 2)) s_y² (1 − r²)
- Standard error for slope: SE_b1 = s / (s_x √(n − 1))
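These computational formulas can be checked against the definitions numerically; a quick sketch on made-up data:

```python
# Verify that the computational formulas agree with the definitions:
#   s^2 = (n-1)/(n-2) * s_y^2 * (1 - r^2)   and   SE_b1 = s / (s_x * sqrt(n-1)).
from math import sqrt
from statistics import mean, stdev

x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.3, 9.8, 12.2]   # made-up, not perfectly linear
n = len(x)
xbar, ybar = mean(x), mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Definition: s = sqrt(sum of squared residuals / (n - 2))
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = sqrt(sum(e ** 2 for e in resid) / (n - 2))

# Computational formula, using r = b1 * s_x / s_y
r = b1 * stdev(x) / stdev(y)
s_alt = sqrt((n - 1) / (n - 2) * stdev(y) ** 2 * (1 - r ** 2))

# Standard error for the slope, two equivalent ways
se_b1 = s / sqrt(sxx)
se_b1_alt = s / (stdev(x) * sqrt(n - 1))
```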
Example: Wages and experience (continued)
- Regression standard error: s = 82.2
- Standard error for slope: SE_b1 = 0.21
The t test and CI for slope in SLR
- Assumptions: the statistical model for SLR
- Hypotheses: H0: β1 = 0 versus a one- or two-sided Ha
- Test statistic: t = b1 / SE_b1
- P-value (where T is t(n − 2)):
  - P(T ≤ t) for Ha: β1 < 0
  - P(T ≥ t) for Ha: β1 > 0
  - 2P(T ≥ |t|) for Ha: β1 ≠ 0
- CI: for confidence level C, the interval is b1 ± t*·SE_b1, where t* is such that P(T ≥ t*) = (1 − C)/2
Example: Wages and experience (continued)
- Hypotheses: H0: β1 = 0 versus Ha: β1 > 0
- Summary statistics: b1 = 0.59, s = 82.2, and SE_b1 = 0.21
- Test statistic: t = b1 / SE_b1 = 2.85 (computed from unrounded values; the rounded values give 0.59/0.21 ≈ 2.81)
- P-value: P(T ≥ 2.85) = 0.003, with k = n − 2 = 57 d.f.
- Decision: reject H0 at significance level α = 0.05, and conclude that wages rise with experience
Example: Wages and experience (continued)
How much do wages rise with experience?
- 95% CI: P(T ≥ 2.00) = 0.025, using k = n − 2 = 57 d.f.
- t* = 2.00, and the interval is b1 ± t*·SE_b1 = 0.59 ± (2.00)(0.21) = 0.59 ± 0.41 = (0.18, 1.00) (margin computed from the unrounded SE_b1)
- Conclude an increase in weekly salary of between $0.18 and $1.00 per month of service, on average
Robustness
- A moderate lack of Normality may be tolerated, and tolerated better for large n
- Outliers or influential observations may be problematic
- Basic tool: residual plots
Example: Wages and experience (continued): [residual plot for the wages data]
Connections to correlation
- One SRS, with multiple measurements: (x_i, y_i) are paired measurements from one SRS
- Idea: treat x as random and work with correlation
  - r is the sample correlation; ρ is the population correlation
- A test of H0: ρ = 0 may be carried out with the same calculations as a test of H0: β1 = 0, but the CI formulas for ρ and β1 are very different
- Different interpretations: correlation is for two-way relationships; regression is for one-way relationships
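The claim that the two tests share the same calculations can be checked numerically: the correlation t statistic t = r√(n − 2)/√(1 − r²) equals the slope t statistic t = b1/SE_b1. A sketch on made-up data:

```python
# Check that the t statistic for H0: rho = 0 equals the t statistic for
# H0: beta_1 = 0 on the same data (made-up values for illustration).
from math import sqrt
from statistics import mean, stdev

x = [3, 7, 2, 9, 4, 6, 8]
y = [1.2, 3.4, 0.9, 4.8, 2.0, 2.9, 4.1]
n = len(x)
xbar, ybar = mean(x), mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Slope test: t = b1 / SE_b1 with SE_b1 = s / sqrt(Sxx)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = sqrt(sum(e ** 2 for e in resid) / (n - 2))
t_slope = b1 / (s / sqrt(sxx))

# Correlation test: t = r * sqrt(n - 2) / sqrt(1 - r^2)
r = b1 * stdev(x) / stdev(y)
t_corr = r * sqrt(n - 2) / sqrt(1 - r ** 2)
```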
Uncertainty in predicted values
- Plugging x into ŷ = b0 + b1x provides a prediction of the response. Two possible interpretations:
  - ŷ is an estimate of the subpopulation mean µ_y = β0 + β1x
  - ŷ is a prediction of an unobserved response, y, from a subpopulation with mean µ_y = β0 + β1x
- Note: there is more uncertainty in the second interpretation, since the target of inference is itself random
Confidence interval for µ_y
- Suppose ŷ is to be an estimate of µ_y = β0 + β1x*
- Standard error: SE_µ̂ = s √(1/n + (x* − x̄)² / Σ(x_i − x̄)²)
- CI: for confidence level C, the interval is ŷ ± t*·SE_µ̂, where t* is such that P(T ≥ t*) = (1 − C)/2 for T distributed t(n − 2)
Prediction interval for y
- Suppose ŷ is to be a prediction of y from a subpopulation with mean µ_y = β0 + β1x*
- Standard error: SE_ŷ = s √(1 + 1/n + (x* − x̄)² / Σ(x_i − x̄)²)
- PI: for confidence level C, the interval is ŷ ± t*·SE_ŷ, where t* is such that P(T ≥ t*) = (1 − C)/2 for T distributed t(n − 2)
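The CI and PI use the same ŷ and t*, differing only in the extra "1 +" inside the standard error; a sketch on made-up data (here n − 2 = 3, and t* = 3.182 is the t(3) critical value for 95%):

```python
# Compare the CI for mu_y with the PI for an individual y at the same x*.
from math import sqrt
from statistics import mean

x = [10, 20, 30, 40, 50]
y = [15.0, 24.0, 36.5, 44.0, 56.0]   # made-up, not perfectly linear
n = len(x)
xbar, ybar = mean(x), mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
s = sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

xstar = 35                            # new x value of interest
yhat = b0 + b1 * xstar
se_mean = s * sqrt(1 / n + (xstar - xbar) ** 2 / sxx)      # for the CI
se_pred = s * sqrt(1 + 1 / n + (xstar - xbar) ** 2 / sxx)  # for the PI
tstar = 3.182                         # P(T >= t*) = 0.025 for T ~ t(3)
ci = (yhat - tstar * se_mean, yhat + tstar * se_mean)
pi = (yhat - tstar * se_pred, yhat + tstar * se_pred)
```

As expected, the PI contains the CI, since predicting a single random response is harder than estimating a mean.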
Example: Wages and experience (continued)
What is the mean wage of the subpopulation of workers whose length of service is x = 125?
- Estimate of µ_y: ŷ = b0 + b1x = 349.4 + (0.59)(125) = 423.2
- Standard error: SE_µ̂ = s √(1/n + (125 − x̄)² / Σ(x_i − x̄)²)
- 95% CI: 423.2 ± (2.00)·SE_µ̂
Example: Wages and experience (continued)
Suppose the length of service of a particular worker is x = 125. What is his or her weekly wage?
- Prediction of y: ŷ = b0 + b1x = 423.2
- Standard error: SE_ŷ = s √(1 + 1/n + (125 − x̄)² / Σ(x_i − x̄)²)
- 95% PI: 423.2 ± (2.00)·SE_ŷ
Confidence and prediction bands
- Observe: PIs are less precise (wider) than CIs
- This reflects the greater uncertainty of the prediction problem
Decomposition of variation
- Analysis of variance (ANOVA) equation: Σ(y_i − ȳ)² = Σ(ŷ_i − ȳ)² + Σ(y_i − ŷ_i)²
- Total variation in y: Σ(y_i − ȳ)² (= 0 if all y_i are equal)
- Variation about the line: Σ(y_i − ŷ_i)² (= 0 if all y_i = ŷ_i)
- Variation along the line: Σ(ŷ_i − ȳ)² (= 0 if b1 = 0)
ANOVA setup
- Total variation in y: SST = Σ(y_i − ȳ)², with n − 1 d.f.
- Variation along the line: SSM = Σ(ŷ_i − ȳ)², with 1 d.f.
- Variation about the line: SSE = Σ(y_i − ŷ_i)², with n − 2 d.f.
- Note: Total d.f. = Regression d.f. + Residual d.f., i.e. n − 1 = 1 + (n − 2)
Related calculations
- Coefficient of determination: r² = SSM/SST, the proportion of total variation accounted for by the regression line
- Mean squares: MSM = SSM/1 and MSE = SSE/(n − 2)
- Alternative formula (which generalizes to multiple regression): s = √MSE
Example: Wages and experience (continued)
- Relevant summary statistics: n = 59, s = 82.2, t = 2.85
- Mean square statistics: MSE = s² = 82.2² ≈ 6757 and MSM = F·MSE = t²·MSE ≈ (2.85²)(6757) ≈ 54,900
- Sums of squares (SS = MS × d.f.): SSE ≈ (57)(6757) ≈ 385,100 and SSM = MSM ≈ 54,900
Testing in ANOVA
- The ANOVA F statistic is F = MSM/MSE
- May be used to test H0: β1 = 0 versus Ha: β1 ≠ 0, and its generalization to multiple regression
- Large values of F provide evidence against H0: β1 = 0
- In the SLR case, t² = F, and the P-value is 2P(T ≥ √F), where T is t(n − 2)
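Both the ANOVA decomposition and the t² = F identity can be verified numerically; a sketch on made-up data:

```python
# Verify SST = SSM + SSE and that t^2 = F in simple linear regression.
from math import sqrt
from statistics import mean

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.2, 2.9, 4.1, 4.8, 6.3, 6.9, 8.2, 8.8]   # made-up data
n = len(x)
xbar, ybar = mean(x), mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)               # total, n - 1 d.f.
ssm = sum((yh - ybar) ** 2 for yh in yhat)            # along the line, 1 d.f.
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # about the line, n - 2 d.f.

msm = ssm / 1
mse = sse / (n - 2)
F = msm / mse
t = b1 / (sqrt(mse) / sqrt(sxx))   # slope t statistic; t**2 equals F in SLR
```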