STAT 3022 Spring 2007


Simple Linear Regression Example

These commands reproduce what we did in class. You should enter these in R and see what they do. Start by typing

> set.seed(42)

to reset the random number generator so you will get the same results we had in class. (Remember that you don't enter the > or + that R uses as a prompt at the beginning of each line.) Now pick 50 x's between 1 and 25:

> x <- sample( 25, 50, replace = TRUE )

We can make an approximately linear function of x by entering

> y <- 4 * x + 17 + 25 * rnorm(x)

This adds a random component to a line with slope 4 and y-intercept 17; the random part is normally distributed with mean 0 and standard deviation 25. Observe the data.

> plot( x, y, las = 1 )

There is a general linear trend, but lots of scatter, too. Find the center of the data, i.e., x̄ and ȳ, and add them to the graph.

> xbar <- mean( x ) ; ybar <- mean( y ) ; data.frame( xbar, ybar )
> abline( v = xbar, lty = 3 ) ; axis( 3, at = xbar )
> abline( h = ybar, lty = 3 ) ; axis( 4, at = ybar )

Now we use least squares to fit a line to the data. We can draw that on our graph, and we can compare it to the true regression line.

> output <- lm( y ~ x )
> abline( output )                                  # sample regression line
> abline( 17, 4, lty = 2, col = "red", lwd = 2 )    # true (population) regression line

It looks like a pretty good fit, but remember that the line we get depends on the points we started with, and they are random. Suppose we started with the same true relation between x and y, that is, with y = 4x + 17 plus a random component which is normally distributed with mean 0 and standard deviation 25, and repeated the process of finding a line based on a sample of 50 points. Every time we do that, we have a different batch of points, so we get a different line, even though all the lines we get are supposed to estimate the same true line, namely y = 4x + 17. We can use R to do this. Define a function to draw a sample of 50 points and compute the least-squares line.

> do.it.again <- function(){
+     y <- 4 * x + 17 + 25 * rnorm(x)
+     more.output <- lm( y ~ x )
+     abline( more.output, col = "gray" )
+ }

Now try it a few times to see how it works. Do lots more:

> for( i in 1:200 ){ do.it.again() }
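
If you want to see this sampling variability as numbers instead of a picture, one way is to save the fitted slope from each simulated sample; the sketch below (the name many.slopes is just a convenient label) collects 1000 slopes and compares their spread with the slope's standard error reported by summary( output ).

> many.slopes <- replicate( 1000, {
+     y.new <- 4 * x + 17 + 25 * rnorm(x)      # same true line, new random scatter
+     coef( lm( y.new ~ x ) )[ "x" ]           # keep only the fitted slope
+ } )
> mean( many.slopes )    # should be close to the true slope, 4
> sd( many.slopes )      # roughly matches the slope's standard error in summary( output )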

Show the true line again.

> abline( 17, 4, lty = 2, lwd = 3, col = "red" )    # true line

It should look like this:

[Figure: Regression lines. The gray sample regression lines scatter about the dashed red true line; x̄ = 15.44 and ȳ = 77.308 are marked on the added axes.]

More examples

Here are R commands to do what is shown in some of the worked-out examples in the text. These commands may also be useful for doing some of the homework. These examples use the meat data from one of the case studies.

> time <- c( 1, 1, 2, 2, 4, 4, 6, 6, 8, 8 )
> ph <- c( 7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36 )

The first thing to do is to look at the data, and the second is to try fitting a regression model.

> plot( time, ph, las = 1 )     # scatterplot of ph versus time
> abline( lm( ph ~ time ) )

[Figure: scatterplot of ph versus time with the fitted line. The line does not follow the curvature of the data.]

There is evidence that the model is inadequate; perhaps a transformation would help. Try the logarithm of time.

> log.time <- log( time )
> meat.data <- data.frame( time, log.time, ph )
> meat <- lm( ph ~ log.time, data = meat.data )
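
A residual plot makes this kind of inadequacy easier to judge than the scatterplot alone; here is a quick sketch (the name plain is just a label for the untransformed fit). A systematic arch-shaped pattern in the residuals is the usual sign of curvature.

> plain <- lm( ph ~ time )
> plot( time, resid( plain ), las = 1 )    # residuals versus time
> abline( h = 0, lty = 3 )                 # residuals should scatter evenly about this line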

[Figure: ph versus log.time with the fitted line. The line fits the transformed data much better.]

We'll use these transformed data.

> summary( meat )

Call:
lm(formula = ph ~ log.time, data = meat.data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.11466 -0.05889  0.02086  0.03611  0.11658

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.98363    0.04853  143.90 6.08e-15 ***
log.time    -0.72566    0.03443  -21.08 2.70e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08226 on 8 degrees of freedom
Multiple R-Squared: 0.9823,     Adjusted R-squared: 0.9801
F-statistic: 444.3 on 1 and 8 DF,  p-value: 2.695e-08

From this output we see that our estimated standard deviation is ˆσ = 0.08226 and our estimated slope coefficient is -0.72566, with standard error 0.03443. So

    ph = 6.9836 − 0.7257 log t

for t between 1 and 8 hours.
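
Rather than retyping numbers from the printout, you can also pull them out of the fitted model object; these standard extractor functions give the same quantities.

> coef( meat )              # intercept and slope estimates
> summary( meat )$sigma     # the residual standard error, 0.08226
> confint( meat )           # 95% confidence intervals for the coefficients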

Point Estimates and Standard Errors (Display 7.10)

We can use the line to estimate the value of ph for any time between 1 and 8 hours, whether or not a specific time was one we had data for. Even though we had two observations with time 4 hours, we still use the line to estimate the mean ph for steers at time 4 hours, just as we would for times (such as 5 hours) where we did not have any observations.

The point estimate is just the y-coordinate for a given value of time t. For example, we estimate that when t = 4 hours,

    ph = 6.9836 − 0.7257 log 4 = 6.9836 − 0.7257(1.386) = 5.98

but we'd like some idea of how reliable this is. We need to compute a standard error, and there are several ways to do that. One way involves the formula

    SE[ˆµ{Y | X₀}] = ˆσ sqrt( 1/n + (X₀ − X̄)² / ((n − 1) s_X²) )

for the standard error at a specified X value (here s_X is the sample standard deviation of the X values, and X₀ = log t = log 4 = 1.386 in this example). This approach is shown in the text as Display 7.10 on page 187.

The text also describes a computer centering trick to avoid having to do all the calculations shown in Display 7.10. Here's how that works in R. We create an artificial variable, in this case by subtracting log 4 from log(time).

> log.time.star <- log.time - log(4)

Then fit a model using this instead of the original explanatory variable.

> summary( lm( ph ~ log.time.star ) )

Call:
lm(formula = ph ~ log.time.star)

Residuals:
     Min       1Q   Median       3Q      Max
-0.11466 -0.05889  0.02086  0.03611  0.11658

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    5.97765    0.02688  222.42  < 2e-16 ***
log.time.star -0.72566    0.03443  -21.08 2.70e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08226 on 8 degrees of freedom
Multiple R-Squared: 0.9823,     Adjusted R-squared: 0.9801
F-statistic: 444.3 on 1 and 8 DF,  p-value: 2.695e-08

The only parts we want from this are the estimated intercept 5.97765 and its standard error 0.02688; they are the point estimate we already had (shown as 5.98 in Display 7.10) and its standard error (shown as 0.0269 in Display 7.10).
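
As a check on the centering trick, you can also evaluate the formula above directly in R; a short sketch (the names x0 and se.mu are just labels):

> n <- length( log.time )
> x0 <- log( 4 )
> se.mu <- summary( meat )$sigma *
+     sqrt( 1/n + ( x0 - mean( log.time ) )^2 / ( ( n - 1 ) * var( log.time ) ) )
> se.mu    # about 0.02688, agreeing with the intercept's standard error above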

Confidence Intervals (Display 7.10)

We can use the point estimate and its associated standard error to form a confidence interval for the mean ph of all steers measured at time 4 hours. The calculations are shown in the bottom of Display 7.10, and we can add this to our graph.

[Figure: 95% CI for mean ph at 4 hours after slaughter, added to the plot of ph versus log.time.]

Remember that this is an estimate for the true mean value of all steers. What if we wanted to predict the ph for a single steer? The point estimate would be the same 5.98, but our uncertainty would be different. Even if we knew the exact true regression line, there would still be sampling variability about that line. That's what σ describes, after all. But we have only our estimated line, and the confidence interval we've found describes only the variation between the true line and its estimates, such as our line.
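
If you want to reproduce the Display 7.10 calculation in R, it is the usual estimate-plus-or-minus form, using the intercept and its standard error from the centered fit:

> qt( 1 - .05/2, 8 )                          # t critical value on 8 degrees of freedom
> 5.97765 - qt( 1 - .05/2, 8 ) * 0.02688      # lower limit, about 5.92
> 5.97765 + qt( 1 - .05/2, 8 ) * 0.02688      # upper limit, about 6.04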

Prediction Intervals (Display 7.12)

We can form a different interval that allows for additional variability. As before, there are several ways to do this. One way uses the formula (from page 190) for the standard error of prediction:

    SE[Pred{Y | X₀}] = sqrt( ˆσ² + SE[ˆµ{Y | X₀}]² )

We can use the centering method to get SE[ˆµ{Y | X₀}], and that computer output also gives ˆσ, so this is really not too hard. For our example, we had

> summary( lm( ph ~ log.time.star ) )    # same centering as before

Call:
lm(formula = ph ~ log.time.star)

Residuals:
     Min       1Q   Median       3Q      Max
-0.11466 -0.05889  0.02086  0.03611  0.11658

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    5.97765    0.02688  222.42  < 2e-16 ***
log.time.star -0.72566    0.03443  -21.08 2.70e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08226 on 8 degrees of freedom
Multiple R-Squared: 0.9823,     Adjusted R-squared: 0.9801
F-statistic: 444.3 on 1 and 8 DF,  p-value: 2.695e-08

From this output we get SE[ˆµ{Y | X₀}] = 0.02688 and ˆσ = 0.08226. We combine these to get the SE for prediction.

> sqrt( 0.02688^2 + 0.08226^2 )    # SE for predicted value
[1] 0.0865404

This is shown as 0.0865 in Display 7.12. The rest of that display shows how to form a 95% prediction interval, and we can use R to do that, too.

> qt( 1 - .05/2, 8 )    # t critical value
[1] 2.306004
> 5.97765 - 0.0865404 * 2.306004    # lower limit
[1] 5.778087
> 5.97765 + 0.0865404 * 2.306004    # upper limit
[1] 6.177213
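
Another way to get both ingredients without refitting a centered model is predict() with se.fit = TRUE, which returns the standard error of the estimated mean along with ˆσ (as residual.scale); a sketch:

> p <- predict( meat, data.frame( log.time = log(4) ), se.fit = TRUE )
> p$se.fit                                    # SE of the estimated mean, about 0.0269
> sqrt( p$se.fit^2 + p$residual.scale^2 )     # SE for prediction, about 0.0865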

We can add this interval to our graph.

[Figure: 95% prediction interval for ph at 4 hours after slaughter, added to the plot of ph versus log.time along with the confidence interval.]

This shows both the prediction interval and the confidence interval. We can think of the confidence interval as reflecting our uncertainty about the location of the line itself, while the prediction interval incorporates the additional variability of points scattered about that line.

R can do all this at once. The preceding material is useful no matter what computer software you have. However, many packages, including R, have built-in routines for these tasks:

> predict( meat, data.frame( log.time = log(4) ), interval = "confidence" )
          fit      lwr      upr
[1,] 5.977651 5.915677 6.039625

Rounding these values, we have a point estimate of 5.98 and a confidence interval from 5.92 to 6.04.

> predict( meat, data.frame( log.time = log(4) ), interval = "prediction" )
          fit      lwr      upr
[1,] 5.977651 5.778092 6.177209

Here we still have the same point estimate of 5.98, but our prediction interval is from 5.78 to 6.18.
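
To see the intervals at every time rather than only at 4 hours, one possibility is to evaluate predict() over a grid of log.time values and overlay the bands; the grid size and line types below are arbitrary choices.

> grid <- data.frame( log.time = seq( 0, log(8), length.out = 100 ) )
> bands.ci <- predict( meat, grid, interval = "confidence" )
> bands.pi <- predict( meat, grid, interval = "prediction" )
> plot( log.time, ph, las = 1 )
> matlines( grid$log.time, bands.ci, lty = c( 1, 2, 2 ), col = "black" )    # fit and 95% CI
> matlines( grid$log.time, bands.pi[ , 2:3 ], lty = 3, col = "gray" )       # 95% prediction band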