lm statistics
Chris Parrish
2017-04-01

Contents

s_e and R^2
experiment1
experiment2
experiment3
experiment4
experiment5
experiment6
conclusions

s_e and R^2

Regression problems are framed by imagining two numerical population variables x and y related to each other by an equation of the form $y = \beta_0 + \beta_1 x + \epsilon$. Here $\beta_0$ and $\beta_1$ are the y-intercept and slope of the regression line, and $\epsilon \sim \mathrm{Normal}(0, \sigma^2)$ expresses the fact that there is a random component to the values of y. Linear models calculated on random samples from the population, $\hat{y} = b_0 + b_1 x$, produce statistics b_0 and b_1, which capture information about the parameters $\beta_0$ and $\beta_1$, and s_e and R^2, which measure how well the data in the sample match the model.

The e in s_e stands for errors, or residuals; s_e is an estimator of $\sigma$:

$$e_i = y_i - \hat{y}_i, \qquad s_e = \sqrt{\frac{\sum e_i^2}{n - 2}}, \qquad s_e = \hat{\sigma}.$$

R^2 is the proportion of the variation in y that is explained by the linear model (EPS, p. 529).

We would like to perform some experiments illustrating the meaning of s_e and R^2.
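Before generating any data, it helps to see how both statistics fall out of the residuals of a fitted model. The sketch below uses small made-up vectors toy_x and toy_y purely for illustration; it is not part of the experiments that follow.

# Hand computation of s_e and R^2 from the residuals of a toy fit
# (toy_x and toy_y are invented for this illustration only).
toy_x <- c(1, 2, 3, 4, 5)
toy_y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(toy_y ~ toy_x)
e <- resid(fit)                                       # e_i = y_i - y_hat_i
n <- length(toy_y)
s_e <- sqrt(sum(e^2) / (n - 2))                       # should match summary(fit)$sigma
R2 <- 1 - sum(e^2) / sum((toy_y - mean(toy_y))^2)     # should match summary(fit)$r.squared
c(s.e = s_e, R.sq = R2)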

experiment1

Start with a horizontal line.

Load package.

library(ggplot2)

Assemble the data.

xs <- seq(from = 0, to = 10, by = 0.01)
beta0 <- 0
beta1 <- 0
sigma <- 1
data <- data.frame(x = xs, y = rnorm(1001, 0, sigma))

illustration

ggplot(data, aes(x, y)) +
  geom_point(shape = 20, color = "darkred") +
  geom_smooth(method = "lm")

[scatterplot of y versus x with the fitted regression line]

Statistics.

options(show.signif.stars = FALSE)
lm1 <- lm(y ~ x, data = data)
summary(lm1)

Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6019 -0.6949  0.0138  0.6730  2.8112 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.028807   0.064203   0.449    0.654
x           -0.007732   0.011118  -0.695    0.487

Residual standard error: 1.016 on 999 degrees of freedom
Multiple R-squared:  0.0004839, Adjusted R-squared:  -0.0005166
F-statistic: 0.4837 on 1 and 999 DF,  p-value: 0.4869

observations

What values do you expect to see for b_0 and b_1? Why? What values do you actually see for b_0 and b_1?

data.frame(lm = 1,
           b0 = as.numeric(lm1$coefficients[1]),
           b1 = as.numeric(lm1$coefficients[2]))

  lm         b0           b1
1  1 0.02880689 -0.007731869

What values do you expect to see for s_e and R^2? Why? What do you actually see for s_e and R^2?

data.frame(lm = 1, s.e = summary(lm1)$sigma, R.sq = summary(lm1)$r.squared)

  lm      s.e         R.sq
1  1 1.016411 0.0004839214

experiment2

Design and run two more experiments in which the data is just as for experiment 1 except that sigma is set to 2 and then to 3. Comment on s_e and R^2.

beta0 <- 0
beta1 <- 0
sigma <- 2
data <- data.frame(x = xs, y = rnorm(1001, 0, sigma))
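Since s_e estimates $\sigma$, doubling $\sigma$ should roughly double the residual standard error while leaving R^2 near zero, because the true slope is still 0. A quick replication check, not part of the original write-up, makes that expectation concrete before plotting:

# Refit the horizontal-line model on many fresh samples drawn with sigma = 2;
# the resulting s_e values should cluster around 2.
s_e_reps <- replicate(100, summary(lm(rnorm(1001, 0, 2) ~ xs))$sigma)
mean(s_e_reps)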

illustration

ggplot(data, aes(x, y)) +
  geom_point(shape = 20, color = "darkred") +
  geom_smooth(method = "lm")

[scatterplot of y versus x with the fitted regression line]

Statistics.

lm2 <- lm(y ~ x, data = data)
summary(lm2)

Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8720 -1.2731 -0.0441  1.2368  6.4095 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.14831    0.11983   1.238    0.216
x           -0.02933    0.02075  -1.414    0.158

Residual standard error: 1.897 on 999 degrees of freedom
Multiple R-squared:  0.001997, Adjusted R-squared:  0.0009977
F-statistic: 1.999 on 1 and 999 DF,  p-value: 0.1578

observations

What values do you expect to see for s_e and R^2? Why? What do you actually see for s_e and R^2?

data.frame(lm = 2, s.e = summary(lm2)$sigma, R.sq = summary(lm2)$r.squared)

  lm      s.e        R.sq
1  2 1.897019 0.001996656

experiment3

beta0 <- 0
beta1 <- 0
sigma <- 3
data <- data.frame(x = xs, y = rnorm(1001, 0, sigma))

illustration

ggplot(data, aes(x, y)) +
  geom_point(shape = 20, color = "darkred") +
  geom_smooth(method = "lm")

[scatterplot of y versus x with the fitted regression line]

Statistics.

lm3 <- lm(y ~ x, data = data)
summary(lm3)

Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3921 -2.1614  0.0676  2.2159  9.8495 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.153028   0.200944  -0.762    0.447
x            0.008766   0.034796   0.252    0.801

Residual standard error: 3.181 on 999 degrees of freedom
Multiple R-squared:  6.353e-05, Adjusted R-squared:  -0.0009374
F-statistic: 0.06347 on 1 and 999 DF,  p-value: 0.8011

observations

What values do you expect to see for s_e and R^2? Why? What do you actually see for s_e and R^2?

data.frame(lm = 3, s.e = summary(lm3)$sigma, R.sq = summary(lm3)$r.squared)

  lm     s.e         R.sq
1  3 3.18118 6.352856e-05

experiment4

In experiment 4, we set $\beta_0 = 0$ and $\beta_1 = 1$, and we reset $\sigma = 1$.

beta0 <- 0
beta1 <- 1
sigma <- 1
data <- data.frame(x = xs, y = xs + rnorm(1001, 0, sigma))

illustration

ggplot(data, aes(x, y)) +
  geom_point(shape = 20, color = "darkred") +
  geom_smooth(method = "lm")

[scatterplot of y versus x with the fitted regression line]
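Because the true line now has slope 1, the fit should explain most of the variation. As a rough check before reading the summary (an aside, not part of the original analysis: it assumes the approximation $R^2 \approx \mathrm{Var}(\beta_1 x) / (\mathrm{Var}(\beta_1 x) + \sigma^2)$ for data generated this way, with s_e staying close to $\sigma$):

# Rough prediction of R^2 for experiment 4: signal variance over total variance.
v <- var(xs)        # spread of the fixed x grid, roughly 8.35
v / (v + sigma^2)   # with sigma = 1, roughly 0.89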

Statistics.

lm4 <- lm(y ~ x, data = data)
summary(lm4)

Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2745 -0.6817 -0.0380  0.6830  3.8101 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01879    0.06486   -0.29    0.772
x            1.01493    0.01123   90.37   <2e-16

Residual standard error: 1.027 on 999 degrees of freedom
Multiple R-squared:  0.891, Adjusted R-squared:  0.8909
F-statistic:  8167 on 1 and 999 DF,  p-value: < 2.2e-16

observations

What values do you expect to see for b_0 and b_1? Why? What values do you actually see for b_0 and b_1?

data.frame(lm = 4,
           b0 = as.numeric(lm4$coefficients[1]),
           b1 = as.numeric(lm4$coefficients[2]))

  lm          b0       b1
1  4 -0.01878739 1.014931

What values do you expect to see for s_e and R^2? Why? What do you actually see for s_e and R^2?

data.frame(lm = 4, s.e = summary(lm4)$sigma, R.sq = summary(lm4)$r.squared)

  lm      s.e      R.sq
1  4 1.026768 0.8910073

experiment5

Design and run two more experiments in which the data is just as for experiment 4 except that sigma is set to 2 and then to 3.

beta0 <- 0
beta1 <- 1
sigma <- 2
data <- data.frame(x = xs, y = xs + rnorm(1001, 0, sigma))

illustration

ggplot(data, aes(x, y)) +
  geom_point(shape = 20, color = "darkred") +
  geom_smooth(method = "lm")

[scatterplot of y versus x with the fitted regression line]

Statistics.

lm5 <- lm(y ~ x, data = data)
summary(lm5)

Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.2500 -1.3235 -0.0047  1.3183  5.7601 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.03192    0.12661   0.252    0.801
x            0.99071    0.02192  45.187   <2e-16

Residual standard error: 2.004 on 999 degrees of freedom
Multiple R-squared:  0.6715, Adjusted R-squared:  0.6711
F-statistic:  2042 on 1 and 999 DF,  p-value: < 2.2e-16

observations

What values do you expect to see for s_e and R^2? Why? What do you actually see for s_e and R^2?

data.frame(lm = 5, s.e = summary(lm5)$sigma, R.sq = summary(lm5)$r.squared)

  lm      s.e      R.sq
1  5 2.004433 0.6714768

experiment6

beta0 <- 0
beta1 <- 1
sigma <- 3
data <- data.frame(x = xs, y = xs + rnorm(1001, 0, sigma))

illustration

ggplot(data, aes(x, y)) +
  geom_point(shape = 20, color = "darkred") +
  geom_smooth(method = "lm")

[scatterplot of y versus x with the fitted regression line]

Statistics.

lm6 <- lm(y ~ x, data = data)
summary(lm6)

Call:
lm(formula = y ~ x, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.0045  -2.0830   0.0968   1.8539   8.6591 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.007685   0.187651  -0.041    0.967
x            0.976397   0.032494  30.049   <2e-16

Residual standard error: 2.971 on 999 degrees of freedom
Multiple R-squared:  0.4747, Adjusted R-squared:  0.4742
F-statistic: 902.9 on 1 and 999 DF,  p-value: < 2.2e-16

observations

What values do you expect to see for s_e and R^2? Why? What do you actually see for s_e and R^2?

data.frame(lm = 6, s.e = summary(lm6)$sigma, R.sq = summary(lm6)$r.squared)

  lm      s.e      R.sq
1  6 2.970728 0.4747404

conclusions

Summarize these experiments by defining s_e and R^2 in your own words.
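One convenient way to support that summary is to line up the six fits side by side. The sketch below is not part of the original document; it assumes the model objects lm1 through lm6 from the experiments above are still in the workspace.

# Collect s_e and R^2 from all six fits, together with the true parameters
# used to generate each data set.
fits <- list(lm1, lm2, lm3, lm4, lm5, lm6)
data.frame(lm    = 1:6,
           beta1 = rep(c(0, 1), each = 3),      # true slope
           sigma = rep(c(1, 2, 3), times = 2),  # true error standard deviation
           s.e   = sapply(fits, function(m) summary(m)$sigma),
           R.sq  = sapply(fits, function(m) summary(m)$r.squared))

Reading down such a table, s_e tracks the true sigma in every experiment, while R^2 depends on how large the slope signal is relative to sigma.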