Simple Linear Regression
September 24, 2008
Reading: HH 8, Gill 4
Simple Linear Regression p.1/20

Problem
Data: observe pairs $(Y_i, x_i)$, $i = 1, \ldots, n$
- Response or dependent variable $Y$
- Predictor or independent variable $X$
Goals:
- Exploring $p(y \mid x)$ as a function of $x$
- Understanding the mean (variability) in $Y$ as a function of $x$
Special cases:
- linear regression (normal $Y$)
- logistic regression (binomial $Y$; success probability depends on $x$)

Models
Model with additive error: for $i = 1, \ldots, n$,
  $Y_i = f(x_i) + \epsilon_i$
Regression function: $E(Y \mid x) = f(x)$
Taylor series expansion about a point $x_0$:
  $f(x_i) = f(x_0) + f'(x_0)(x_i - x_0) + \text{Remainder}$
leads to the locally linear approximation
  $Y_i = \alpha + \beta x_i + \varepsilon_i$
$\varepsilon_i$: independent errors (sampling, measurement, lack of fit)

Big Picture
Model: $Y_i = \alpha + \beta x_i + \varepsilon_i$, with $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
- Estimate the parameters $(\alpha, \beta, \sigma^2)$
- Interpretation of the parameters: $\beta$, $\alpha$
- Assess model fit: adequate? good? if inadequate, how?
- Predict new ("future") responses at new $x_{n+1}, \ldots$
- How much variability does $x$ explain?
  - normal models: variance measures variability
  - analysis of variance (ANOVA)

Example: Body Fat Data
- For a group of subjects, various body measurements were obtained
- An accurate measurement of the percentage of body fat is recorded for each subject
- The goal is to use the other body measurements as a proxy for predicting body fat
- Focus on a simple linear regression model for predicting body fat as a function of abdomen circumference

Data
[Figure: scatterplot of percent body fat (vertical axis) versus circumference of abdomen in cm (horizontal axis)]

Sample Summary Statistics
Sample means: $\bar{x}$, $\bar{y}$
Sample variances: $s_y^2 = S_{yy}/(n-1)$, $s_x^2 = S_{xx}/(n-1)$
Sample covariance: $s_{xy} = S_{xy}/(n-1)$
where the sums of squares are
  $S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2$ (total variation in the response)
  $S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$
  $S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$

Correlation
Sample correlation is the covariance on a standardized (unit-less) scale; it is a measure of dependence:
  $r = \frac{s_{xy}}{s_x s_y}$, $-1 \le r \le 1$

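The summary statistics and the correlation can be checked numerically. The following is a minimal R sketch on simulated data; the variables `x` and `y` and the values used to generate them are hypothetical stand-ins, not the body fat data from these slides.

```r
# Simulated (hypothetical) data standing in for abdomen circumference (cm)
# and percent body fat; NOT the data set used in the slides.
set.seed(1)
n <- 50
x <- rnorm(n, mean = 92, sd = 10)
y <- -40 + 0.65 * x + rnorm(n, sd = 4.5)

# Sums of squares about the sample means
Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))

# Sample variances and covariance divide by n - 1
s2x <- Sxx / (n - 1)
s2y <- Syy / (n - 1)
sxy <- Sxy / (n - 1)

# Correlation: covariance on a standardized, unit-less scale
r <- sxy / sqrt(s2x * s2y)

# Compare the hand computations with R's built-in functions
print(c(var_x = isTRUE(all.equal(s2x, var(x))),
        cov_xy = isTRUE(all.equal(sxy, cov(x, y))),
        cor_xy = isTRUE(all.equal(r, cor(x, y)))))
```

Because $r$ is unit-less, rescaling `x` from centimeters to inches (dividing by 2.54) should leave it unchanged.
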
Ordinary Least Squares (OLS)
For any chosen $\alpha, \beta$,
  $Q(\alpha, \beta) = \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2$
measures the fit of the chosen line $\alpha + \beta x$ to the response data
- OLS estimator: choose $\hat\alpha, \hat\beta$ to minimize $Q(\alpha, \beta)$
- Ad hoc principle of least squares estimation
- Under the normal error assumption, OLS is equivalent to the MLE

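One way to see where the minimizer of $Q$ comes from, and why it coincides with the MLE under normal errors, is the standard calculus argument sketched below. It is a supplementary derivation, not taken from the slides, and it uses $S_{xy}$ and $S_{xx}$ as defined under Sample Summary Statistics.

```latex
% Set both partial derivatives of Q to zero (the normal equations).
\begin{align*}
\frac{\partial Q}{\partial \alpha}
  &= -2 \sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0
  &&\Rightarrow\quad \hat\alpha = \bar{y} - \hat\beta \bar{x}, \\
\frac{\partial Q}{\partial \beta}
  &= -2 \sum_{i=1}^n x_i (y_i - \alpha - \beta x_i) = 0 .
\end{align*}
% Substitute \hat\alpha into the second equation. The first equation says the
% residuals sum to zero, so x_i may be replaced by x_i - \bar{x}:
\begin{equation*}
\sum_{i=1}^n (x_i - \bar{x})\bigl[(y_i - \bar{y}) - \hat\beta (x_i - \bar{x})\bigr] = 0
\quad\Rightarrow\quad
\hat\beta = \frac{S_{xy}}{S_{xx}} .
\end{equation*}
% Normal log-likelihood: for any fixed \sigma^2 it is a decreasing function
% of Q, so maximizing over (\alpha, \beta) is the same as minimizing Q.
\begin{equation*}
\ell(\alpha, \beta, \sigma^2)
  = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{Q(\alpha, \beta)}{2 \sigma^2} .
\end{equation*}
```
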
Least Squares Estimates
Facts:
  $\hat\beta = \frac{s_{xy}}{s_x^2}$, $\hat\alpha = \bar{y} - \hat\beta \bar{x}$
or
  $\hat\beta = r \left( \frac{s_y}{s_x} \right)$
- $\hat\beta$ is the correlation coefficient $r$, corrected for the relative scales of $y : x$, so that the fitted values $\hat\alpha + \hat\beta x$ are on the scale of $Y$
- For use in theoretical derivations: $\hat\beta = \frac{S_{xy}}{S_{xx}}$

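The three expressions for $\hat\beta$ can be compared against the coefficients returned by `lm()`. This is a small sketch on simulated data; `x`, `y`, and the generating values are hypothetical and are not the body fat data.

```r
# Simulated (hypothetical) predictor and response
set.seed(2)
n <- 50
x <- rnorm(n, mean = 92, sd = 10)
y <- -40 + 0.65 * x + rnorm(n, sd = 4.5)

# Slope three ways: s_xy / s_x^2, r * (s_y / s_x), and S_xy / S_xx
beta_cov <- cov(x, y) / var(x)
beta_cor <- cor(x, y) * sd(y) / sd(x)
beta_ss  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: the fitted line passes through (x-bar, y-bar)
alpha_hat <- mean(y) - beta_cov * mean(x)

# Reference fit by least squares
fit <- lm(y ~ x)
print(rbind(by_hand = c(alpha_hat, beta_cov), lm = coef(fit)))
```
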
$R^2$ Measure of Model Fit
- Simplest model: $\beta = \hat\beta = 0$, so the $Y_i$ are a normal random sample with mean $\alpha$
  $\hat\alpha = \bar{y}$, $Q(\bar{y}, 0) = S_{yy}$ = Total Sum of Squares = TSS
- Any other model fit: SSE = Sum of Squares Error = $Q(\hat\alpha, \hat\beta)$
  $R^2 = 1 - \frac{Q(\hat\alpha, \hat\beta)}{Q(\bar{y}, 0)} = 1 - \frac{\text{SSE}}{\text{TSS}}$
- TSS = SS due to regression on $X$ + SSE

Facts
- $R^2 = r^2$
- A higher % of variation explained is better: higher correlation
- Measures linear correlation only
  - not general dependence
  - not causation
- Can be used to compare other simple linear regression models with transformations of $X$
- Cannot be used to compare models with transformed $Y$
- Does NOT provide a measure of model adequacy

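The sum-of-squares decomposition and the identity $R^2 = r^2$ can be verified directly. This is a sketch on simulated, hypothetical data (not the body fat data); `SSR` is a name introduced here for the sum of squares due to regression.

```r
# Simulated (hypothetical) data
set.seed(3)
n <- 50
x <- rnorm(n, mean = 92, sd = 10)
y <- -40 + 0.65 * x + rnorm(n, sd = 4.5)
fit <- lm(y ~ x)

TSS <- sum((y - mean(y))^2)            # Q(y-bar, 0): intercept-only model
SSE <- sum(residuals(fit)^2)           # Q(alpha-hat, beta-hat)
SSR <- sum((fitted(fit) - mean(y))^2)  # SS due to regression on x

R2 <- 1 - SSE / TSS

# R2 by hand, the squared correlation, and the value reported by summary()
print(c(R2 = R2, r_squared = cor(x, y)^2, lm = summary(fit)$r.squared))
# Decomposition: TSS = SSR + SSE
print(c(TSS = TSS, SSR_plus_SSE = SSR + SSE))
```
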
Summarizing Model Fit
- Fitted values: $\hat{Y}_i = \hat\alpha + \hat\beta x_i$
- Residuals: $\hat\varepsilon_i = Y_i - \hat{Y}_i$, estimates of $\varepsilon_i$
- Residual sum of squares: SSE $= Q(\hat\alpha, \hat\beta) = \sum_{i=1}^n \hat\varepsilon_i^2$ measures the remaining (residual) variation in the response data
- $s^2_{Y|X}$ is a point estimate of $\sigma^2$ from the fitted model:
  $s^2_{Y|X} = \text{MSE} = \frac{\text{SSE}}{n-2} = \frac{\sum_{i=1}^n \hat\varepsilon_i^2}{n-2}$
- Note: $n - 2$ degrees of freedom, not $n - 1$; 2 degrees of freedom are lost to the estimation of $\alpha, \beta$

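A short sketch of how these quantities map onto an `lm` fit in R, again on simulated, hypothetical data rather than the body fat data. The "Residual standard error" printed by `summary()` is the square root of the MSE.

```r
# Simulated (hypothetical) data
set.seed(4)
n <- 50
x <- rnorm(n, mean = 92, sd = 10)
y <- -40 + 0.65 * x + rnorm(n, sd = 4.5)
fit <- lm(y ~ x)

y_hat <- coef(fit)[1] + coef(fit)[2] * x  # fitted values
e_hat <- y - y_hat                        # residuals
SSE   <- sum(e_hat^2)
MSE   <- SSE / (n - 2)                    # n - 2: two parameters estimated

# Residual degrees of freedom and residual standard error from the fit
print(c(df = df.residual(fit), n_minus_2 = n - 2))
print(c(sqrt_MSE = sqrt(MSE), sigma_hat = summary(fit)$sigma))
```
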
Some R commands
  bodyfat.lm = lm(bodyfat ~ abdomin, data = bodyfat)   # fit the simple linear regression
  summary(bodyfat.lm)       # regression output
  plot(bodyfat.lm)          # residual plots
  anova(bodyfat.lm)         # analysis of variance
  regr1.plot(bodyfat.lm)    # in library(HH)

Model Assessment
Residual analysis: graphical exploration of the fitted residuals $\hat\varepsilon_i$
- Standardize: $r_i = \hat\varepsilon_i / \sqrt{\text{var}(\hat\varepsilon_i)}$, where $\text{var}(\hat\varepsilon_i) = \hat\sigma^2 (1 - h_{ii})$
- $h_{ii}$: leverage, a measure of potential influence
- Check the normality assumption, constant variance, outliers, influence
- Treat the $\hat\varepsilon_i$ as new data: look at structure, other predictors
Other predictors? Transformations? Revise the model before making interpretations...

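The standardization can be reproduced by hand and compared with R's `rstandard()`. The sketch below uses simulated, hypothetical data. The closed form for the leverage in simple linear regression, $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$, is a standard result that is not stated on the slide.

```r
# Simulated (hypothetical) data
set.seed(5)
n <- 50
x <- rnorm(n, mean = 92, sd = 10)
y <- -40 + 0.65 * x + rnorm(n, sd = 4.5)
fit <- lm(y ~ x)

h         <- hatvalues(fit)      # leverages h_ii
sigma_hat <- summary(fit)$sigma  # square root of the MSE

# Standardized residuals: e_i / sqrt(sigma^2 * (1 - h_ii))
r_std <- residuals(fit) / (sigma_hat * sqrt(1 - h))

# Leverage in simple linear regression depends only on the distance of x_i
# from the mean of x
h_formula <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

print(isTRUE(all.equal(r_std, rstandard(fit))))
print(isTRUE(all.equal(unname(h), h_formula)))
```

Points with `x` far from `mean(x)` have the largest leverage, which is why they have the greatest potential to influence the fitted line.
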
Residuals
[Figure: four diagnostic plots for the fitted model: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours at 0.5 and 1. Cases 39, 204, and 207 are labeled in the first three panels; cases 39, 41, and 216 in the Residuals vs Leverage panel.]

Diagnostics
- Residuals versus fitted values (or versus $x$)
- Normal quantile plot of residuals (checks the distributional assumptions on the errors)
- Scale-location plot: square root of the absolute standardized residuals, $\sqrt{|r_i|}$, versus $\hat{Y}_i$. Detects whether the spread of the residuals is constant over the range of fitted values.
- Cook's distance plot: shows whether any data points have a large influence on the predicted values of the response variable. Values greater than 1 are considered influential.
Case 39 appears influential...

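Cook's distance combines the standardized residual and the leverage of each case. The sketch below computes it by hand with the standard formula $D_i = r_i^2 h_{ii} / (p (1 - h_{ii}))$, where $p = 2$ is the number of coefficients, and compares it with `cooks.distance()`. The data are simulated, and the last case is a hypothetical outlier planted on purpose to mimic an influential point; none of this is the body fat data.

```r
# Simulated (hypothetical) data plus one planted high-leverage outlier
set.seed(6)
n <- 50
x <- c(rnorm(n, mean = 92, sd = 10), 150)
y <- c(-40 + 0.65 * x[1:n] + rnorm(n, sd = 4.5), 10)
fit <- lm(y ~ x)

p     <- length(coef(fit))   # 2 coefficients: intercept and slope
h     <- hatvalues(fit)      # leverages
r_std <- rstandard(fit)      # standardized residuals

# Cook's distance from standardized residuals and leverages
D <- r_std^2 * h / (p * (1 - h))
print(isTRUE(all.equal(D, cooks.distance(fit))))

# Cases exceeding the rule-of-thumb threshold of 1
print(which(D > 1))

# Refit without the planted case (same device as subset = c(-39))
refit <- lm(y ~ x, subset = -(n + 1))
print(rbind(all_cases = coef(fit), without_planted = coef(refit)))
```

Comparing the two rows of coefficients shows how much a single influential case can move the fitted line, which is the motivation for refitting without it.
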
Residuals Without Case 39
  lm(bodyfat ~ Abdomen, subset=c(-39))
[Figure: the same four diagnostic plots for the model refit without case 39: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours. Cases 180, 204, and 207 are labeled in the first three panels; cases 36, 41, and 216 in the Residuals vs Leverage panel.]

Summary
> summary(lm(bodyfat ~ Abdomen, data=bodyfat, subset=c(-39)))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -42.95774    2.71323  -15.83    <2e-
Abdomen       0.67195    0.02921   23.01    <2e-
---
Residual standard error: 4.717 on 249 df
Multiple R-squared: 0.6801,
F-statistic: 529.3 on 1 and 249 DF, p-value: < 2.2e-16

Interpretation
- For every additional centimeter of abdominal circumference, predicted percent body fat increases by 0.67 percentage points
- For every additional inch of abdominal circumference, predicted percent body fat increases by $2.54 \times 0.67 = 1.7$ percentage points
- Abdominal circumference explains roughly 68% of the variation in body fat
- Predicted percent body fat for a 34 inch abdomen: $-42.96 + 34 \times 2.54 \times 0.67 = 14.9\%$ (using the rounded slope 0.67; the unrounded coefficients from the summary give about 15.1%)

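The unit conversion and the prediction are plain arithmetic on the reported coefficients, sketched below in R. The coefficient values are copied from the summary output; the variable names are introduced here for illustration.

```r
# Coefficients as reported in the summary output (case 39 removed)
alpha_hat <- -42.95774
beta_hat  <- 0.67195   # percentage points of body fat per cm
cm_per_inch <- 2.54

# Slope per inch: rescaling x by a constant rescales the slope by the same constant
slope_per_inch <- beta_hat * cm_per_inch

# Prediction at 34 inches with rounded values, as on the slide
pred_rounded <- -42.96 + 34 * cm_per_inch * 0.67

# Prediction at 34 inches with the unrounded coefficients
pred_full <- alpha_hat + beta_hat * 34 * cm_per_inch

print(round(c(slope_per_inch = slope_per_inch,
              pred_rounded = pred_rounded,
              pred_full = pred_full), 2))
```

The rounded and unrounded predictions differ by roughly 0.2 percentage points because the rounding error in the slope is multiplied by 86.36 cm.
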