# lm statistics

Chris Parrish
2017-04-01

## Contents

- s_e and R^2
- experiment 1
- experiment 2
- experiment 3
- experiment 4
- experiment 5
- experiment 6
- conclusions

## s_e and R^2

Regression problems are framed by imagining two numerical population variables $x$ and $y$ related to each other by an equation of the form

$$y = \beta_0 + \beta_1 x + \epsilon.$$

Here $\beta_0$ and $\beta_1$ are the y-intercept and slope of the regression line, and $\epsilon \sim \mathrm{Normal}(0, \sigma^2)$ expresses the fact that there is a random component to the values of $y$. Linear models calculated on random samples from the population, $\hat{y} = b_0 + b_1 x$, produce statistics $b_0$ and $b_1$, which capture information about the parameters $\beta_0$ and $\beta_1$, and $s_e$ and $R^2$, which measure how well the data in the sample match the model.

The $e$ in $s_e$ stands for errors, or residuals, and $s_e$ is an estimator of $\sigma$:

$$e_i = y_i - \hat{y}_i, \qquad s_e = \sqrt{\frac{\sum e_i^2}{n - 2}}, \qquad s_e = \hat{\sigma}.$$

$R^2$ is the proportion of the variation in $y$ that is explained by the linear model (EPS, p. 529).

We would like to perform some experiments illustrating the meaning of $s_e$ and $R^2$.
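Before the experiments, it may help to compute these two statistics by hand. The following minimal sketch (not part of the original worksheet; the intercept, slope, and seed are arbitrary choices for illustration) fits a line to simulated data, computes $s_e$ and $R^2$ directly from the residuals, and checks them against what `summary()` reports.

```r
# simulate a small data set and fit a linear model
# (intercept 2, slope 0.5, and the seed are arbitrary illustrative choices)
set.seed(1)
x <- seq(0, 10, by = 0.1)
y <- 2 + 0.5 * x + rnorm(length(x), 0, 1)
fit <- lm(y ~ x)

# s_e estimates sigma from the residuals, using n - 2 degrees of freedom
e <- residuals(fit)
n <- length(e)
s_e <- sqrt(sum(e^2) / (n - 2))

# R^2 is the proportion of the variation in y explained by the model
R_sq <- 1 - sum(e^2) / sum((y - mean(y))^2)

c(s_e = s_e, R_sq = R_sq)                     # computed by hand
c(summary(fit)$sigma, summary(fit)$r.squared) # reported by summary()
```

The two pairs of numbers should agree exactly, since `summary()` uses the same formulas.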
## experiment 1

Start with a horizontal line.

Load the package.

```r
library(ggplot2)
```

Assemble the data.

```r
xs <- seq(from = 0, to = 10, by = 0.01)
beta0 <- 0
beta1 <- 0
sigma <- 1
data <- data.frame(x = xs, y = rnorm(1001, 0, sigma))
```

### illustration

```r
ggplot(data, aes(x, y)) +
  geom_point(shape = 20, color = "darkred") +
  geom_smooth(method = "lm")
```

(figure: scatterplot of y against x with the fitted regression line)

Statistics.

```r
options(show.signif.stars = FALSE)
lm1 <- lm(y ~ x, data = data)
summary(lm1)
```
```
Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6019 -0.6949  0.0138  0.6730  2.8112 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.028807   0.064203   0.449    0.654
x           -0.007732   0.011118  -0.695    0.487

Residual standard error: 1.016 on 999 degrees of freedom
Multiple R-squared:  0.0004839, Adjusted R-squared:  -0.0005166
F-statistic: 0.4837 on 1 and 999 DF,  p-value: 0.4869
```

### observations

What values do you expect to see for $b_0$ and $b_1$? Why?

What values do you actually see for $b_0$ and $b_1$?

```r
data.frame(lm = 1,
           b0 = as.numeric(lm1$coefficients[1]),
           b1 = as.numeric(lm1$coefficients[2]))
```

```
  lm         b0           b1
1  1 0.02880689 -0.007731869
```

What values do you expect to see for $s_e$ and $R^2$? Why?

What do you actually see for $s_e$ and $R^2$?

```r
data.frame(lm = 1, s.e = summary(lm1)$sigma, R.sq = summary(lm1)$r.squared)
```

```
  lm      s.e         R.sq
1  1 1.016411 0.0004839214
```

## experiment 2

Design and run two more experiments in which the data are generated just as for experiment 1 except that $\sigma$ is set to 2 and then to 3. Comment on $s_e$ and $R^2$.

```r
beta0 <- 0
beta1 <- 0
sigma <- 2
data <- data.frame(x = xs, y = rnorm(1001, 0, sigma))
```
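An aside, not part of the original worksheet: with $\beta_1 = 0$ the model carries no linear signal, so $s_e$ should land near $\sigma = 2$ and $R^2$ near 0. A quick hedged check before plotting, repeating the simulation many times (the 200 replicates are an arbitrary choice):

```r
# refit the no-slope model on many fresh samples and look at the spread of s_e
s_es <- replicate(200, {
  d <- data.frame(x = xs, y = rnorm(1001, 0, sigma))
  summary(lm(y ~ x, data = d))$sigma
})
summary(s_es)  # should be tightly concentrated around sigma = 2
```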
### illustration

```r
ggplot(data, aes(x, y)) +
  geom_point(shape = 20, color = "darkred") +
  geom_smooth(method = "lm")
```

(figure: scatterplot of y against x with the fitted regression line)

Statistics.

```r
lm2 <- lm(y ~ x, data = data)
summary(lm2)
```

```
Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8720 -1.2731 -0.0441  1.2368  6.4095 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.14831    0.11983   1.238    0.216
x           -0.02933    0.02075  -1.414    0.158

Residual standard error: 1.897 on 999 degrees of freedom
Multiple R-squared:  0.001997, Adjusted R-squared:  0.0009977
F-statistic: 1.999 on 1 and 999 DF,  p-value: 0.1578
```
### observations

What values do you expect to see for $s_e$ and $R^2$? Why?

What do you actually see for $s_e$ and $R^2$?

```r
data.frame(lm = 2, s.e = summary(lm2)$sigma, R.sq = summary(lm2)$r.squared)
```

```
  lm      s.e        R.sq
1  2 1.897019 0.001996656
```

## experiment 3

```r
beta0 <- 0
beta1 <- 0
sigma <- 3
data <- data.frame(x = xs, y = rnorm(1001, 0, sigma))
```

### illustration

```r
ggplot(data, aes(x, y)) +
  geom_point(shape = 20, color = "darkred") +
  geom_smooth(method = "lm")
```
(figure: scatterplot of y against x with the fitted regression line)

Statistics.

```r
lm3 <- lm(y ~ x, data = data)
summary(lm3)
```

```
Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3921 -2.1614  0.0676  2.2159  9.8495 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.153028   0.200944  -0.762    0.447
x            0.008766   0.034796   0.252    0.801

Residual standard error: 3.181 on 999 degrees of freedom
Multiple R-squared:  6.353e-05, Adjusted R-squared:  -0.0009374
F-statistic: 0.06347 on 1 and 999 DF,  p-value: 0.8011
```

### observations

What values do you expect to see for $s_e$ and $R^2$? Why?

What do you actually see for $s_e$ and $R^2$?
```r
data.frame(lm = 3, s.e = summary(lm3)$sigma, R.sq = summary(lm3)$r.squared)
```

```
  lm     s.e         R.sq
1  3 3.18118 6.352856e-05
```

## experiment 4

In experiment 4, we set $\beta_0 = 0$ and $\beta_1 = 1$, and we reset $\sigma = 1$.

```r
beta0 <- 0
beta1 <- 1
sigma <- 1
data <- data.frame(x = xs, y = xs + rnorm(1001, 0, sigma))
```

### illustration

```r
ggplot(data, aes(x, y)) +
  geom_point(shape = 20, color = "darkred") +
  geom_smooth(method = "lm")
```

(figure: scatterplot of y against x with the fitted regression line)

Statistics.
```r
lm4 <- lm(y ~ x, data = data)
summary(lm4)
```

```
Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2745 -0.6817 -0.0380  0.6830  3.8101 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01879    0.06486   -0.29    0.772
x            1.01493    0.01123   90.37   <2e-16

Residual standard error: 1.027 on 999 degrees of freedom
Multiple R-squared:  0.891, Adjusted R-squared:  0.8909
F-statistic:  8167 on 1 and 999 DF,  p-value: < 2.2e-16
```

### observations

What values do you expect to see for $b_0$ and $b_1$? Why?

What values do you actually see for $b_0$ and $b_1$?

```r
data.frame(lm = 4,
           b0 = as.numeric(lm4$coefficients[1]),
           b1 = as.numeric(lm4$coefficients[2]))
```

```
  lm          b0       b1
1  4 -0.01878739 1.014931
```

What values do you expect to see for $s_e$ and $R^2$? Why?

What do you actually see for $s_e$ and $R^2$?

```r
data.frame(lm = 4, s.e = summary(lm4)$sigma, R.sq = summary(lm4)$r.squared)
```

```
  lm      s.e      R.sq
1  4 1.026768 0.8910073
```

## experiment 5

Design and run two more experiments in which the data are generated just as for experiment 4 except that $\sigma$ is set to 2 and then to 3.

```r
beta0 <- 0
beta1 <- 1
sigma <- 2
data <- data.frame(x = xs, y = xs + rnorm(1001, 0, sigma))
```
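An aside, not part of the original worksheet: when the slope is nonzero, one way to form an expectation for $R^2$ before looking at the output is the variance decomposition $R^2 \approx \operatorname{Var}(\beta_1 x) / (\operatorname{Var}(\beta_1 x) + \sigma^2)$. A quick sketch under that assumption:

```r
# back-of-envelope prediction of R^2 from the variance decomposition
signal <- beta1^2 * var(xs)  # variance of the deterministic part, about 8.36
noise  <- sigma^2            # variance of the noise, here 4
signal / (signal + noise)    # about 0.68, close to the fitted R^2 reported below
```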
### illustration

```r
ggplot(data, aes(x, y)) +
  geom_point(shape = 20, color = "darkred") +
  geom_smooth(method = "lm")
```

(figure: scatterplot of y against x with the fitted regression line)

Statistics.

```r
lm5 <- lm(y ~ x, data = data)
summary(lm5)
```

```
Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.2500 -1.3235 -0.0047  1.3183  5.7601 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.03192    0.12661   0.252    0.801
x            0.99071    0.02192  45.187   <2e-16

Residual standard error: 2.004 on 999 degrees of freedom
Multiple R-squared:  0.6715, Adjusted R-squared:  0.6711
F-statistic:  2042 on 1 and 999 DF,  p-value: < 2.2e-16
```
### observations

What values do you expect to see for $s_e$ and $R^2$? Why?

What do you actually see for $s_e$ and $R^2$?

```r
data.frame(lm = 5, s.e = summary(lm5)$sigma, R.sq = summary(lm5)$r.squared)
```

```
  lm      s.e      R.sq
1  5 2.004433 0.6714768
```

## experiment 6

```r
beta0 <- 0
beta1 <- 1
sigma <- 3
data <- data.frame(x = xs, y = xs + rnorm(1001, 0, sigma))
```

### illustration

```r
ggplot(data, aes(x, y)) +
  geom_point(shape = 20, color = "darkred") +
  geom_smooth(method = "lm")
```
(figure: scatterplot of y against x with the fitted regression line)

Statistics.

```r
lm6 <- lm(y ~ x, data = data)
summary(lm6)
```

```
Call:
lm(formula = y ~ x, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.0045  -2.0830   0.0968   1.8539   8.6591 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.007685   0.187651  -0.041    0.967
x            0.976397   0.032494  30.049   <2e-16

Residual standard error: 2.971 on 999 degrees of freedom
Multiple R-squared:  0.4747, Adjusted R-squared:  0.4742
F-statistic: 902.9 on 1 and 999 DF,  p-value: < 2.2e-16
```

### observations

What values do you expect to see for $s_e$ and $R^2$? Why?

What do you actually see for $s_e$ and $R^2$?
```r
data.frame(lm = 6, s.e = summary(lm6)$sigma, R.sq = summary(lm6)$r.squared)
```

```
  lm      s.e      R.sq
1  6 2.970728 0.4747404
```

## conclusions

Summarize these experiments by defining $s_e$ and $R^2$ in your own words.
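One way to wrap up (a sketch, not part of the original worksheet): rerun all six configurations in a loop and tabulate $s_e$ next to $\sigma$, and $R^2$ next to the variance-ratio prediction from the aside above. Each run draws fresh noise, so the numbers will differ slightly from those on the preceding pages.

```r
# rerun all six experiments and tabulate s_e and R^2 in one table
xs <- seq(from = 0, to = 10, by = 0.01)
grid <- expand.grid(sigma = c(1, 2, 3), beta1 = c(0, 1))
results <- do.call(rbind, lapply(seq_len(nrow(grid)), function(i) {
  b1 <- grid$beta1[i]
  sg <- grid$sigma[i]
  d  <- data.frame(x = xs, y = b1 * xs + rnorm(length(xs), 0, sg))
  sm <- summary(lm(y ~ x, data = d))
  data.frame(beta1 = b1, sigma = sg,
             s.e  = sm$sigma,      # should land near sigma
             R.sq = sm$r.squared,  # should land near the prediction
             R.sq.pred = b1^2 * var(xs) / (b1^2 * var(xs) + sg^2))
}))
results
```

In the resulting table, $s_e$ tracks $\sigma$ in every row, while $R^2$ is near 0 whenever $\beta_1 = 0$ and otherwise falls as the noise grows relative to the signal.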