Simple Linear Regression

MATH 282A Introduction to Computational Statistics
University of California, San Diego
Instructor: Ery Arias-Castro
http://math.ucsd.edu/~eariasca/math282a.html

Hooker's data on boiling temperature in the Himalayas

Consider the hooker dataset taken from the package alr3.

    'data.frame': 31 obs. of 2 variables:
     $ Temp    : num 211 210 208 202 201 ...
     $ Pressure: num 29.2 28.6 28.0 24.7 23.7 ...

The data were collected by the botanist Dr. Joseph Hooker, who recorded boiling temperature and atmospheric pressure, often at high altitudes in the Himalaya Mountains.

Main question. Is there a relationship between boiling temperature and pressure and, if so, can we quantify it by building a predictive model?

Scatterplot

[Figure: scatterplot of Temp (vertical axis) against Pressure (horizontal axis).]

Simple linear model

We assume a simple linear model

$$\text{Temp} = \beta_0 + \beta_1\,\text{Pressure}.$$

With $(x_i, y_i)$ representing the pressure-temperature pairs, the formal model is

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i,$$

where
  $\beta_0$ is the intercept,
  $\beta_1$ is the slope,
  $\epsilon_i$ is the random error.

Assumptions: The $\epsilon_i$'s are i.i.d. normal with mean zero and same variance $\sigma^2$.

All parameters $\beta_0$, $\beta_1$ and $\sigma^2$ are in general unknown.

Least squares regression

A popular way to fit these data is by minimizing the error sum of squares

$$\mathrm{SSE}(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2.$$

Under the assumption that the $\epsilon_i$'s are i.i.d. normal, this corresponds to maximum likelihood estimation.

    > hooker.lm = lm(Temp ~ Pressure, data = hooker)

    Coefficients:
    (Intercept)     Pressure
        146.673        2.253
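
The link with maximum likelihood can be sketched as follows, treating the $x_i$ as fixed and assuming the errors are i.i.d. normal with mean zero and variance $\sigma^2$. The symbol $\ell$ for the log-likelihood is notation introduced here for illustration; it does not appear on the slides.

```latex
\ell(\beta_0, \beta_1, \sigma^2)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{\mathrm{SSE}(\beta_0, \beta_1)}{2\sigma^2}.
```

For any fixed $\sigma^2 > 0$, only the last term depends on $(\beta_0, \beta_1)$, so maximizing $\ell$ over $(\beta_0, \beta_1)$ is the same as minimizing $\mathrm{SSE}(\beta_0, \beta_1)$.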
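
Least squares regression

Define

$$\bar x = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \bar y = \frac{1}{n}\sum_{i=1}^n y_i,$$

$$\mathrm{SXX} = \sum_{i=1}^n (x_i - \bar x)^2, \qquad \mathrm{SYY} = \sum_{i=1}^n (y_i - \bar y)^2, \qquad \mathrm{SXY} = \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y).$$

Standard calculations provide explicit formulae:

$$\hat\beta_1 = \frac{\mathrm{SXY}}{\mathrm{SXX}} \qquad\text{and}\qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x.$$

Note. The least squares regression line passes through the mean $(\bar x, \bar y)$.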
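
The explicit formulae can be checked numerically. The sketch below uses synthetic stand-in data (the vectors `x` and `y` are made up for illustration and are not Hooker's measurements) and compares the hand-computed estimates with the output of `lm`.

```r
# Synthetic stand-in data (hypothetical; these are NOT Hooker's measurements).
set.seed(1)
n <- 31
x <- runif(n, 16, 29)                        # made-up pressures
y <- 146.7 + 2.25 * x + rnorm(n, sd = 0.8)   # made-up temperatures

# Summary quantities as defined on the slide.
xbar <- mean(x)
ybar <- mean(y)
SXX <- sum((x - xbar)^2)
SXY <- sum((x - xbar) * (y - ybar))

# Explicit least squares estimates.
b1 <- SXY / SXX
b0 <- ybar - b1 * xbar

# Same fit through lm(); the two sets of coefficients should agree.
fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1))
print(coef(fit))
```

Because $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$, evaluating the fitted line at `xbar` returns `ybar`, which is the note above in numerical form.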

Regression line

[Figure: scatterplot of Temp against Pressure with the fitted regression line.]

Residuals and Estimate for $\sigma^2$

The fitted values:

$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$$

The residuals:

$$e_i = y_i - \hat y_i$$

Estimate for $\sigma^2$:

$$\hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^n (y_i - \hat y_i)^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2 = \frac{\mathrm{SSE}}{n-2} = \mathrm{MSE}$$

Note. This estimate corresponds to the maximum likelihood estimate multiplied by $n/(n-2)$ to make it unbiased.
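
A minimal sketch of these quantities in R, again on synthetic stand-in data (made up for illustration, not Hooker's measurements). The names `sigma2_hat` and `sigma2_mle` are hypothetical labels for the unbiased estimate and the maximum likelihood estimate.

```r
# Synthetic stand-in data (hypothetical; these are NOT Hooker's measurements).
set.seed(1)
n <- 31
x <- runif(n, 16, 29)
y <- 146.7 + 2.25 * x + rnorm(n, sd = 0.8)

fit <- lm(y ~ x)

# Fitted values and residuals.
yhat <- fitted(fit)
e <- y - yhat

# Error sum of squares and the two variance estimates.
SSE <- sum(e^2)
sigma2_hat <- SSE / (n - 2)   # unbiased estimate (MSE)
sigma2_mle <- SSE / n         # maximum likelihood estimate

# summary(fit)$sigma is the "Residual standard error" reported by R.
print(c(sqrt(sigma2_hat), summary(fit)$sigma))
print(c(sigma2_hat, sigma2_mle * n / (n - 2)))
```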
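
Moments

Assumptions: the errors are uncorrelated, mean 0 and same variance $\sigma^2$.

$\hat\beta_0$ and $\hat\beta_1$ are unbiased:

$$E(\hat\beta_0) = \beta_0, \qquad E(\hat\beta_1) = \beta_1$$

Their second moments are:

$$\operatorname{var}(\hat\beta_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{\mathrm{SXX}}\right), \qquad \operatorname{var}(\hat\beta_1) = \sigma^2\,\frac{1}{\mathrm{SXX}}, \qquad \operatorname{cov}(\hat\beta_0, \hat\beta_1) = -\sigma^2\,\frac{\bar x}{\mathrm{SXX}}$$

$\hat\sigma^2$ is unbiased:

$$E(\hat\sigma^2) = \sigma^2$$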
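
The second-moment formulae can be compared with what R reports. In the sketch below, `vcov(fit)` returns the estimated covariance matrix of the coefficients, that is, the formulae above with $\hat\sigma^2$ plugged in for the unknown $\sigma^2$. The data are synthetic stand-ins, not Hooker's measurements.

```r
# Synthetic stand-in data (hypothetical; these are NOT Hooker's measurements).
set.seed(1)
n <- 31
x <- runif(n, 16, 29)
y <- 146.7 + 2.25 * x + rnorm(n, sd = 0.8)

fit <- lm(y ~ x)
xbar <- mean(x)
SXX <- sum((x - xbar)^2)
s2 <- summary(fit)$sigma^2    # estimate of sigma^2

# Plug-in versions of the variance and covariance formulae.
var_b0 <- s2 * (1 / n + xbar^2 / SXX)
var_b1 <- s2 / SXX
cov_b0_b1 <- -s2 * xbar / SXX

# Estimated covariance matrix reported by R, for comparison.
V <- vcov(fit)
print(rbind(formula = c(var_b0, var_b1, cov_b0_b1),
            vcov    = c(V[1, 1], V[2, 2], V[1, 2])))
```

The covariance is negative whenever $\bar x > 0$: overestimating the slope tends to go with underestimating the intercept.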

Distributions

Assumptions: the errors are i.i.d. normal, mean 0 and same variance $\sigma^2$.

$(\hat\beta_0, \hat\beta_1)$ is normally distributed.

$(n-2)\,\hat\sigma^2/\sigma^2$ has the chi-square distribution with $n-2$ degrees of freedom.

$(\hat\beta_0, \hat\beta_1)$ and $\hat\sigma^2$ are independent.

t-tests and confidence intervals

Define

$$\widehat{\mathrm{se}}(\hat\beta_0) = \hat\sigma\sqrt{\frac{1}{n} + \frac{\bar x^2}{\mathrm{SXX}}}, \qquad \widehat{\mathrm{se}}(\hat\beta_1) = \hat\sigma\sqrt{\frac{1}{\mathrm{SXX}}}$$

Assumptions: the errors are i.i.d. normal, mean 0 and same variance $\sigma^2$.

Then:

$$\frac{\hat\beta_0 - \beta_0}{\widehat{\mathrm{se}}(\hat\beta_0)} \qquad\text{and}\qquad \frac{\hat\beta_1 - \beta_1}{\widehat{\mathrm{se}}(\hat\beta_1)}$$

have the t-distribution with $n-2$ degrees of freedom.

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 146.67290    0.77641  188.91   <2e-16 ***
    Pressure      2.25260    0.03809   59.14   <2e-16 ***
    ---
    Residual standard error: 0.806 on 29 degrees of freedom
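
The columns of the coefficient table can be rebuilt from these definitions. The following is a sketch on synthetic stand-in data (not Hooker's measurements); the 95% level and the variable names are choices made here for illustration.

```r
# Synthetic stand-in data (hypothetical; these are NOT Hooker's measurements).
set.seed(1)
n <- 31
x <- runif(n, 16, 29)
y <- 146.7 + 2.25 * x + rnorm(n, sd = 0.8)

fit <- lm(y ~ x)
b1 <- unname(coef(fit)["x"])
SXX <- sum((x - mean(x))^2)
s <- summary(fit)$sigma               # sigma-hat

# Estimated standard error of the slope and t statistic for H0: beta_1 = 0.
se_b1 <- s * sqrt(1 / SXX)
t_b1 <- b1 / se_b1

# Two-sided p-value from the t-distribution with n - 2 degrees of freedom.
p_b1 <- 2 * pt(abs(t_b1), df = n - 2, lower.tail = FALSE)

# 95% confidence interval for the slope.
ci_b1 <- b1 + c(-1, 1) * qt(0.975, df = n - 2) * se_b1

print(summary(fit)$coefficients)      # R's own table, for comparison
print(c(se = se_b1, t = t_b1, p = p_b1))
print(ci_b1)
print(confint(fit)["x", ])
```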

t-tests and confidence intervals

We can also provide a confidence interval for $\beta_0 + \beta_1 x$.

The natural estimator is $\hat\beta_0 + \hat\beta_1 x$. Under the normal assumptions, it has the normal distribution with

$$E(\hat\beta_0 + \hat\beta_1 x) = \beta_0 + \beta_1 x, \qquad \operatorname{var}(\hat\beta_0 + \hat\beta_1 x) = \sigma^2\left(\frac{1}{n} + \frac{(x - \bar x)^2}{\mathrm{SXX}}\right)$$

We therefore estimate its standard error by

$$\widehat{\mathrm{se}}(\hat\beta_0 + \hat\beta_1 x) = \hat\sigma\sqrt{\frac{1}{n} + \frac{(x - \bar x)^2}{\mathrm{SXX}}}$$
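
t-tests and confidence intervals

Therefore:

$$\frac{\hat\beta_0 + \hat\beta_1 x - \beta_0 - \beta_1 x}{\widehat{\mathrm{se}}(\hat\beta_0 + \hat\beta_1 x)}$$

has the t-distribution with $n-2$ degrees of freedom.

As a consequence,

$$\hat\beta_0 + \hat\beta_1 x \pm t^{\alpha/2}_{n-2}\,\widehat{\mathrm{se}}(\hat\beta_0 + \hat\beta_1 x)$$

is a level-$(1-\alpha)$ confidence interval for $\beta_0 + \beta_1 x$, where $t^{\alpha/2}_{n-2}$ denotes the upper $\alpha/2$ quantile of the t-distribution with $n-2$ degrees of freedom.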
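
As a sketch, the interval for the mean response can be computed by hand and compared with `predict(..., interval = "confidence")`. The data are synthetic stand-ins (not Hooker's measurements), and the point `x0 = 25` and level `alpha = 0.05` are arbitrary choices for illustration.

```r
# Synthetic stand-in data (hypothetical; these are NOT Hooker's measurements).
set.seed(1)
n <- 31
x <- runif(n, 16, 29)
y <- 146.7 + 2.25 * x + rnorm(n, sd = 0.8)

fit <- lm(y ~ x)
xbar <- mean(x)
SXX <- sum((x - xbar)^2)
s <- summary(fit)$sigma

x0 <- 25        # hypothetical point at which to estimate the mean response
alpha <- 0.05

# Point estimate and its estimated standard error.
est <- sum(coef(fit) * c(1, x0))
se_fit <- s * sqrt(1 / n + (x0 - xbar)^2 / SXX)

# Level-(1 - alpha) confidence interval for beta_0 + beta_1 * x0.
ci <- est + c(-1, 1) * qt(1 - alpha / 2, df = n - 2) * se_fit

# The same interval from predict().
pr <- predict(fit, newdata = data.frame(x = x0),
              interval = "confidence", level = 1 - alpha)
print(ci)
print(pr)
```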
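
Confidence Bands

We want to provide confidence intervals for $\beta_0 + \beta_1 x$ that hold simultaneously, at level $1-\alpha$, for all $x$.

Under the normal assumptions, the following satisfies that property:

$$\hat\beta_0 + \hat\beta_1 x \pm \left(2 F^{\alpha}_{2,n-2}\right)^{1/2}\,\widehat{\mathrm{se}}(\hat\beta_0 + \hat\beta_1 x)$$

where $F^{\alpha}_{2,n-2}$ denotes the upper $\alpha$ quantile of the F-distribution with 2 and $n-2$ degrees of freedom.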
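
A rough sketch of the band computation, on synthetic stand-in data (not Hooker's measurements). The grid of 50 points and `alpha = 0.05` are arbitrary choices; the comparison with the pointwise t multiplier is added here for illustration.

```r
# Synthetic stand-in data (hypothetical; these are NOT Hooker's measurements).
set.seed(1)
n <- 31
x <- runif(n, 16, 29)
y <- 146.7 + 2.25 * x + rnorm(n, sd = 0.8)

fit <- lm(y ~ x)
xbar <- mean(x)
SXX <- sum((x - xbar)^2)
s <- summary(fit)$sigma
alpha <- 0.05

# Evaluate the fitted line and its standard error on a grid of x values.
grid <- seq(min(x), max(x), length.out = 50)
line <- coef(fit)[1] + coef(fit)[2] * grid
se_grid <- s * sqrt(1 / n + (grid - xbar)^2 / SXX)

# Simultaneous multiplier versus the pointwise t multiplier.
crit_band <- sqrt(2 * qf(1 - alpha, df1 = 2, df2 = n - 2))
crit_pt <- qt(1 - alpha / 2, df = n - 2)

band_lo <- line - crit_band * se_grid
band_hi <- line + crit_band * se_grid

print(c(band = crit_band, pointwise = crit_pt))
# To draw the band on an open scatterplot:
# plot(x, y); abline(fit); lines(grid, band_lo, lty = 2); lines(grid, band_hi, lty = 2)
```

The simultaneous multiplier is larger than the pointwise one, so the band is wider than the pointwise interval at every $x$; both are narrowest near $\bar x$, where the standard error is smallest.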

Confidence Bands

[Figure: scatterplot of Temp against Pressure with confidence bands.]

Prediction Intervals

The confidence intervals (and bands) above compute a range for the expected value at a given $x$, meaning the target is $\beta_0 + \beta_1 x$.

Suppose we want instead an interval that contains a new observation at $x$ with high confidence, meaning the target is now $y_{\text{new}} = \beta_0 + \beta_1 x + \epsilon_{\text{new}}$.

Our prediction is $\hat y_{\text{new}} = \hat\beta_0 + \hat\beta_1 x$, and given that the new observation is independent of the $n$ observations used to build the model, we have

$$y_{\text{new}} - \hat y_{\text{new}} \sim N\big(0,\ \operatorname{var}(\epsilon_{\text{new}}) + \operatorname{var}(\hat\beta_0 + \hat\beta_1 x)\big).$$

Therefore, we have the following $(1-\alpha)$-level prediction interval:

$$y_{\text{new}} \in \hat y_{\text{new}} \pm t^{\alpha/2}_{n-2}\,\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x - \bar x)^2}{\mathrm{SXX}}}$$
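
The prediction interval can be sketched the same way and compared with `predict(..., interval = "prediction")`. The data are synthetic stand-ins (not Hooker's measurements), and `x0 = 25` and `alpha = 0.05` are arbitrary illustrative choices.

```r
# Synthetic stand-in data (hypothetical; these are NOT Hooker's measurements).
set.seed(1)
n <- 31
x <- runif(n, 16, 29)
y <- 146.7 + 2.25 * x + rnorm(n, sd = 0.8)

fit <- lm(y ~ x)
xbar <- mean(x)
SXX <- sum((x - xbar)^2)
s <- summary(fit)$sigma

x0 <- 25        # hypothetical x at which a new observation is made
alpha <- 0.05

# Prediction and the standard error of the prediction error y_new - yhat_new.
yhat_new <- sum(coef(fit) * c(1, x0))
se_pred <- s * sqrt(1 + 1 / n + (x0 - xbar)^2 / SXX)

# (1 - alpha)-level prediction interval.
pi_hand <- yhat_new + c(-1, 1) * qt(1 - alpha / 2, df = n - 2) * se_pred

# The same interval from predict().
pr <- predict(fit, newdata = data.frame(x = x0),
              interval = "prediction", level = 1 - alpha)
print(pi_hand)
print(pr)
```

The extra "1 +" under the square root accounts for the variance of the new error, which is why the prediction interval is always wider than the confidence interval for the mean at the same $x$.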

Analysis of variance

Consider

$H_0$: the model is $y_i = \beta_0 + \epsilon_i$

against

$H_1$: the model is $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

Define the error sum of squares

$$\mathrm{SSE} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n e_i^2$$

Define the sum of squares due to regression

$$\mathrm{SSreg} = \sum_{i=1}^n (\bar y - \hat y_i)^2 = \frac{\mathrm{SXY}^2}{\mathrm{SXX}}$$

Analysis of variance

The ANOVA is based on

$$F = \frac{\mathrm{SSreg}/1}{\mathrm{SSE}/(n-2)}$$

Under the normal assumptions and under the null hypothesis, $F$ has an F-distribution with 1 and $n-2$ degrees of freedom.

Note. Equivalent to the two-sided t-test for $\beta_1 = 0$.

The computations are often summarized in an ANOVA table.

    Analysis of Variance Table

    Response: Temp
              Df  Sum Sq Mean Sq F value    Pr(>F)
    Pressure   1 2272.47 2272.47  3497.9 < 2.2e-16 ***
    Residuals 29   18.84    0.65
    ---
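
The entries of the ANOVA table, and the equivalence with the t-test, can be checked with a short sketch. The data are synthetic stand-ins (not Hooker's measurements); the name `Fstat` is a label chosen here.

```r
# Synthetic stand-in data (hypothetical; these are NOT Hooker's measurements).
set.seed(1)
n <- 31
x <- runif(n, 16, 29)
y <- 146.7 + 2.25 * x + rnorm(n, sd = 0.8)

fit <- lm(y ~ x)
xbar <- mean(x)
ybar <- mean(y)
SXX <- sum((x - xbar)^2)
SYY <- sum((y - ybar)^2)
SXY <- sum((x - xbar) * (y - ybar))

# Sums of squares as defined on the slides.
SSE <- sum(residuals(fit)^2)
SSreg <- sum((ybar - fitted(fit))^2)

# F statistic and its p-value under H0.
Fstat <- (SSreg / 1) / (SSE / (n - 2))
pval <- pf(Fstat, df1 = 1, df2 = n - 2, lower.tail = FALSE)

# Comparisons: closed form for SSreg, squared t statistic, and R's table.
tval <- summary(fit)$coefficients["x", "t value"]
print(c(SSreg = SSreg, closed_form = SXY^2 / SXX))
print(c(F = Fstat, t_squared = tval^2, p = pval))
print(anova(fit))
```

The decomposition $\mathrm{SYY} = \mathrm{SSreg} + \mathrm{SSE}$ also holds here, which is what the "Sum Sq" column of the table adds up to.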

Coefficient of determination $R^2$

The coefficient of determination measures the quality of the fit:

$$R^2 = \frac{\mathrm{SSreg}}{\mathrm{SYY}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SYY}}$$

Note that

$$R^2 = \rho(x, y)^2$$

where $\rho(x, y)$ is the correlation of $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$:

$$\rho(x, y) = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\operatorname{var}(y)}}$$
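
The three expressions for $R^2$ can be compared numerically. This is a sketch on synthetic stand-in data (not Hooker's measurements); `summary(fit)$r.squared` is the value R reports as "Multiple R-squared".

```r
# Synthetic stand-in data (hypothetical; these are NOT Hooker's measurements).
set.seed(1)
n <- 31
x <- runif(n, 16, 29)
y <- 146.7 + 2.25 * x + rnorm(n, sd = 0.8)

fit <- lm(y ~ x)
ybar <- mean(y)

SYY <- sum((y - ybar)^2)
SSE <- sum(residuals(fit)^2)
SSreg <- sum((ybar - fitted(fit))^2)

# Three equivalent ways of computing R^2, plus R's reported value.
r2_reg <- SSreg / SYY
r2_err <- 1 - SSE / SYY
r2_cor <- cor(x, y)^2

print(c(r2_reg, r2_err, r2_cor, summary(fit)$r.squared))
```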