INFERENCE FOR REGRESSION


CHAPTER 3
INFERENCE FOR REGRESSION

OVERVIEW

In Chapter 5 of the textbook, we first encountered regression. The assumptions that describe the regression model we use in this chapter are the following:

- We have n observations on an explanatory variable x and a response variable y. Our goal is to study or predict the behavior of y for given values of x.
- For any fixed value of x, the response y varies according to a Normal distribution. Repeated responses y are independent of each other.
- The mean response μy has a straight-line relationship with x given by a population regression line: μy = α + βx. The slope β and intercept α are unknown parameters.
- The standard deviation of y (call it σ) is the same for all values of x. The value of σ is unknown.

The true (population) regression line μy = α + βx says that the mean response μy moves along a straight line as the explanatory variable x changes. The parameters β and α are estimated by the slope b and intercept a of the least-squares regression line, and the formulas for these estimates are

b = r(sy/sx)   and   a = ȳ - b·x̄

where r is the correlation between y and x, ȳ is the mean of the y observations, sy is the standard deviation of the y observations, x̄ is the mean of the x observations, and sx is the standard deviation of the x observations.

The standard error about the least-squares line is

s = √( Σ residual² / (n - 2) ) = √( Σ (y - ŷ)² / (n - 2) )

where ŷ = a + bx is the value we would predict for the response variable based on the least-squares regression line. We use s to estimate the unknown σ in the regression model.
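The guide relies on Minitab or a calculator for these computations. As a supplementary sketch (not part of the original guide), the short Python snippet below shows one way to compute r, b, a, and s; the variable names and data values are illustrative assumptions, not the textbook's data.

```python
import numpy as np

# Illustrative (made-up) data, not from the textbook exercises.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # explanatory variable
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])  # response variable
n = len(x)

r = np.corrcoef(x, y)[0, 1]               # correlation between x and y
b = r * y.std(ddof=1) / x.std(ddof=1)     # slope estimate: b = r * (s_y / s_x)
a = y.mean() - b * x.mean()               # intercept estimate: a = y-bar - b * x-bar

y_hat = a + b * x                         # predicted values from the fitted line
residuals = y - y_hat
s = np.sqrt(np.sum(residuals**2) / (n - 2))  # regression standard error, estimates sigma

print(f"r = {r:.3f}, b = {b:.3f}, a = {a:.3f}, s = {s:.3f}")
```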

A level C confidence interval for β is

b ± t*·SEb

where t* is the critical value for the t distribution with n - 2 degrees of freedom with area C between -t* and t*, and

SEb = s / √( Σ(x - x̄)² )

is the standard error of the least-squares slope b. SEb is usually computed using a calculator or statistical software.

The test of the hypothesis H0: β = 0 is based on the t statistic

t = b / SEb

with P-values computed from the t distribution with n - 2 degrees of freedom. This test is also a test of the hypothesis that the correlation is 0 in the population.

A level C confidence interval for the mean response μy when x takes the value x* is

ŷ ± t*·SEμ̂

where ŷ = a + bx*, t* is the critical value for the t distribution with n - 2 degrees of freedom and area C between -t* and t*, and

SEμ̂ = s·√( 1/n + (x* - x̄)² / Σ(x - x̄)² )

SEμ̂ is usually computed using a calculator or statistical software.

A level C prediction interval for a single observation on y when x takes the value x* is

ŷ ± t*·SEŷ

where t* is the critical value for the t distribution with n - 2 degrees of freedom and area C between -t* and t*, and

SEŷ = s·√( 1 + 1/n + (x* - x̄)² / Σ(x - x̄)² )

SEŷ is usually computed using a calculator or statistical software.

Finally, it is always good practice to check that the data satisfy the linear regression model assumptions before doing inference. Scatterplots and residual plots are useful tools for checking these assumptions.
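As another supplementary sketch (not from the guide), the helper below packages these formulas; the function name and arguments are assumptions made for the illustration. It returns the slope confidence interval, the t statistic with its two-sided P-value, the confidence interval for the mean response at x*, and the prediction interval at x*. Statistical software such as Minitab reports the same quantities directly, which is how they appear in the exercises below.

```python
import numpy as np
from scipy import stats

def slope_and_prediction_inference(x, y, x_star, conf=0.95):
    """Apply the chapter's formulas: slope CI, t test of H0: beta = 0,
    CI for the mean response at x_star, and prediction interval at x_star."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    b, a = np.polyfit(x, y, 1)                               # least-squares slope and intercept
    s = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))    # regression standard error
    sxx = np.sum((x - x.mean()) ** 2)
    t_star = stats.t.ppf(0.5 + conf / 2, df=n - 2)           # critical value t*

    se_b = s / np.sqrt(sxx)                                  # SE_b = s / sqrt(sum (x - xbar)^2)
    slope_ci = (b - t_star * se_b, b + t_star * se_b)

    t_stat = b / se_b                                        # t = b / SE_b for H0: beta = 0
    p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 2)

    fit = a + b * x_star                                     # y-hat at x*
    se_mu = s * np.sqrt(1.0 / n + (x_star - x.mean()) ** 2 / sxx)
    mean_ci = (fit - t_star * se_mu, fit + t_star * se_mu)

    se_pred = s * np.sqrt(1.0 + 1.0 / n + (x_star - x.mean()) ** 2 / sxx)
    pred_interval = (fit - t_star * se_pred, fit + t_star * se_pred)

    return slope_ci, (t_stat, p_two_sided), mean_ci, pred_interval
```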

GUIDED SOLUTIONS

Exercise 3.

KEY CONCEPTS: Scatterplots, correlation, linear regression, residuals, standard error of the least-squares line

(a) First, examine the data and judge whether the relationship between Distance and Days is positive or negative. Sketch your scatterplot on the axes provided, or use software.

[Axes provided for a scatterplot of Days versus Distance]

Use your calculator (or statistical software) to compute the correlation r:

r = ______

(b) What does the slope β of the true regression line say about the number of days until group infection and a group's distance from the first infected group?

Enter your estimates of the slope β and intercept α of the true regression line. Use software or your calculator, or compute these values manually using the formulas in Chapter 6 of your textbook.

Estimate of β = ______

Estimate of α = ______

Although it isn't asked for in this part, write the equation of the least-squares regression line for predicting the number of days to infection for a gorilla group given its distance from the first group infected. You'll use this in part (c).

The least-squares regression line is: ŷ = ______

(c) To compute the residuals, complete the table. Remember, to compute the predicted number of days until infection, use the least-squares regression line.

Distance from          Predicted number of       Residual
first group infected   days until infection      (prediction error)
1
3
4
4
4
5

Compute the sum of residuals (sum of prediction errors). They should sum to zero.

Σ residual = ______

Now estimate the standard deviation σ by computing

Σ residual² = ______

and then completing the following calculation. This is an estimate of σ.

s = √( Σ residual² / (n - 2) ) = ______

Exercise 3.

KEY CONCEPTS: Tests for the slope of the least-squares regression line

(a) The test of the hypothesis H0: β = 0 is based on the t statistic t = b/SEb. In the statement of the problem, we are told that b = 11.263 and SEb = 1.591. The value of b is slightly different from the value we found in Exercise 3., due to differences in how much rounding was done at intermediate stages of the calculations.

Compute the test statistic:

t = b/SEb = ______

(b) What are the degrees of freedom for t? Refer to the original data in Exercise 3. of your textbook to determine the sample size n.

Degrees of freedom = n - 2 = ______

Now, use Table C to estimate the P-value for testing against the alternative hypothesis Ha: β > 0, which hypothesizes a positive linear association between Days and Distance.

P-value: ______

What do you conclude?

Exercise 3.38

KEY CONCEPTS: Scatterplots, examining residuals, confidence intervals for the slope

(a) Use software or a calculator to compute the correlation between Time and Calories:

Use software or a calculator to compute the equation of the least-squares regression line. Don't forget to have the computer or your calculator save the residuals, as we'll use them in part (b):

ŷ = ______
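If you want to check the arithmetic in this exercise with software rather than Table C, a short scipy computation (a sketch using the values quoted above; scipy is not used in the guide itself) looks like this:

```python
from scipy import stats

b, se_b, n = 11.263, 1.591, 6        # values quoted in the exercise statement
t = b / se_b                         # t = b / SE_b
df = n - 2                           # degrees of freedom for simple linear regression

p_one_sided = stats.t.sf(t, df)      # P-value for the one-sided alternative Ha: beta > 0
print(f"t = {t:.2f}, df = {df}, one-sided P = {p_one_sided:.4f}")
# Roughly t = 7.08 with df = 4 and P about 0.001.
```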

Use software or the axes provided to make a scatterplot of Calories versus Time.

[Axes provided for a scatterplot of Calories versus Time]

(b) Here, we'll check conditions needed for regression inference. First, to check for a Linear Relationship, and to check whether spread about the line stays the same for all values of the explanatory variable, plot the residuals against Time (the explanatory variable):

[Axes provided for a plot of the residuals versus Time]

Does this plot show any systematic deviation from a roughly linear pattern?

Does this plot show any systematic change in spread as Time changes?
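The worksheet assumes you will draw these plots by hand or in Minitab. Purely as an illustration (not part of the guide), a matplotlib sketch of the two diagnostic plots used in this exercise, the residuals versus the explanatory variable and a histogram of the residuals, might look like the following; the time and calories arrays are made-up stand-ins for the textbook data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up stand-in values for Time (minutes) and Calories; substitute the textbook data.
time = np.array([22.0, 25.0, 28.0, 30.0, 33.0, 35.0, 38.0, 40.0, 43.0, 45.0])
calories = np.array([500.0, 480.0, 490.0, 455.0, 470.0, 440.0, 450.0, 420.0, 430.0, 410.0])

b, a = np.polyfit(time, calories, 1)      # least-squares slope and intercept
residuals = calories - (a + b * time)     # observed minus predicted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals versus the explanatory variable: look for curvature or changing spread.
ax1.scatter(time, residuals)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Time")
ax1.set_ylabel("Residual")

# Histogram of the residuals: look for strong skewness or outliers.
ax2.hist(residuals, bins=5)
ax2.set_xlabel("Residual")
ax2.set_ylabel("Frequency")

plt.tight_layout()
plt.show()
```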

Are the observations independent? Is this obvious?

Finally, look for evidence that the variation about the line appears to be Normal. Use software or the axes that follow (with class intervals -40 ≤ residual < -30, -30 ≤ residual < -20, -20 ≤ residual < -10, and so on) to make a histogram.

[Axes provided for a histogram of the residuals]

Does this plot have strong skewness or outliers which might suggest lack of Normality?

(c) In this problem, the rate of change in calories consumed as time at the table increases is the slope of the population line, β. Hence, we need to construct a 95% confidence interval for β. Recall that a level C confidence interval for β is

b ± t*·SEb

where t* is the critical value for the t distribution with n - 2 degrees of freedom with area C between -t* and t*, and

SEb = s / √( Σ(x - x̄)² )

is the standard error of the least-squares slope b.
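Software can supply b, SEb, and the interval in one step. As a supplementary sketch (not part of the guide), scipy.stats.linregress does this on made-up stand-in data; with the real Time/Calories data you would compare the printed values against the computer output used in this exercise.

```python
from scipy import stats

# Made-up stand-in values; substitute the Time/Calories data from the textbook.
time = [22, 25, 28, 30, 33, 35, 38, 40, 43, 45]
calories = [500, 480, 490, 455, 470, 440, 450, 420, 430, 410]

res = stats.linregress(time, calories)     # slope, intercept, rvalue, pvalue, stderr
n = len(time)
t_star = stats.t.ppf(0.975, df=n - 2)      # critical value for 95% confidence

lower = res.slope - t_star * res.stderr    # b - t* * SE_b
upper = res.slope + t_star * res.stderr    # b + t* * SE_b
print(f"b = {res.slope:.3f}, SE_b = {res.stderr:.3f}, 95% CI: ({lower:.2f}, {upper:.2f})")
```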

In this exercise, b and SEb can be read directly from the output of statistical software. Record their values.

b = ______    SEb = ______

Now, find t* for a 95% confidence interval from Table C (what is n here?).

t* = ______

Compute the 95% confidence interval:

Interpret this confidence interval in the context of this problem.

Exercise 3.

KEY CONCEPTS: Prediction, prediction intervals

We used Minitab to compute a prediction of Calories when Time = 40. The output follows:

The regression equation is
Calories = 561 - 3.08 Time

Predictor       Coef     Stdev   t-ratio      p
Constant      560.65     29.37     19.09  0.000
Time         -3.0771    0.8498     -3.62  0.002

s = 23.40    R-sq = 42.1%    R-sq(adj) = 38.9%

Analysis of Variance
SOURCE      DF        SS       MS      F      p
Regression   1    7177.6   7177.6  13.11  0.002
Error       18    9854.4    547.5
Total       19   17032.0

   Fit  Stdev.Fit        95.0% C.I.          95.0% P.I.
437.57       7.30  (422.23, 452.91)    (386.06, 489.08)

Where in this output does one find the 95% confidence interval to predict Rachel's calorie consumption at lunch? Refer to Examples 3.7 and 3.8 in the textbook if you need help.

95% prediction interval: ______
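Minitab's Fit / 95.0% C.I. / 95.0% P.I. row can be reproduced with statsmodels. This is a supplementary sketch on made-up stand-in data (the arrays below are not the textbook's values); get_prediction returns both the confidence interval for the mean response and the prediction interval for a single observation.

```python
import numpy as np
import statsmodels.api as sm

# Made-up stand-in values; substitute the Time/Calories data from the textbook.
time = np.array([22.0, 25.0, 28.0, 30.0, 33.0, 35.0, 38.0, 40.0, 43.0, 45.0])
calories = np.array([500.0, 480.0, 490.0, 455.0, 470.0, 440.0, 450.0, 420.0, 430.0, 410.0])

X = sm.add_constant(time)                  # column of 1s plus Time
fit = sm.OLS(calories, X).fit()

# Prediction at Time = 40, analogous to Minitab's Fit / 95.0% C.I. / 95.0% P.I. row.
new_X = np.array([[1.0, 40.0]])
pred = fit.get_prediction(new_X)
print(pred.summary_frame(alpha=0.05))
# mean_ci_lower/upper is the CI for the mean response;
# obs_ci_lower/upper is the prediction interval for a single observation.
```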

COMPLETE SOLUTIONS

Exercise 3.

(a) If we look at the data, we see that as a gorilla group's distance from the first infection increases, so does the number of days until that group is infected. Thus, there is a positive association between Days and Distance. A scatterplot of the data with Distance as the explanatory variable follows.

[Scatterplot of Days versus Distance]

The scatterplot indicates a strong positive linear association between Distance and Days. The correlation is r = 0.96. This is consistent with the scatterplot in suggesting a strong linear relationship between Distance and Days.

(b) The slope of the population regression line, β, is the number of additional days (on average) required to infect a gorilla group one additional distance unit from the original infection group. You might think of this as a measure of the rate of the infection's spread: on average it takes β days for the infection to spread to an additional home range.

The estimate of β is b = 11.3 days per distance unit. The estimate of α is a = -8.09 days. The equation of the least-squares regression line for predicting days to infection for a gorilla group given its distance from the initial group infected is:

Days = -8.09 + 11.3 × Distance

(c) The residuals for the six data points are given in the table.

Distance from          Predicted number of       Residual
first group infected   days until infection      (prediction error)
1                      3.18                      4 - 3.18 = 0.82
3                      25.70                     21 - 25.70 = -4.70
4                      36.96                     33 - 36.96 = -3.96
4                      36.96                     41 - 36.96 = 4.04
4                      36.96                     43 - 36.96 = 6.04
5                      48.23                     46 - 48.23 = -2.23

The sum of the residuals listed is Σ residual = 0.01. The difference from 0 is due to rounding in the parameter estimates above.

To estimate the standard deviation σ in the regression model, we first calculate the sum of the squares of the residuals listed:

Σ residual² = (0.82)² + (-4.70)² + ... + (-2.23)² = 96.22

Our estimate of the standard deviation σ in the regression model is therefore

s = √( Σ residual² / (n - 2) ) = √( 96.22 / (6 - 2) ) = 4.9 days.

Exercise 3.

(a) b = 11.263 and SEb = 1.591, so

t = b/SEb = 11.263/1.591 = 7.079

(b) Referring to the original data in Exercise 3. of the textbook, we see that n = 6.

Degrees of freedom = n - 2 = 6 - 2 = 4

To estimate the P-value, we use Table C with df = 4 and refer to the P-values corresponding to the two values of t* that bracket the computed value of t = 7.079:

t*             5.598    7.173
One-sided P    0.0025   0.001

Because the alternative hypothesis is one-sided (Ha: β > 0), 0.001 < P-value < 0.0025. Statistical software (Minitab) gives a P-value of 0.001. There is extremely strong (overwhelming) evidence to support a positive linear association between distance of a gorilla group from the primary infection group and the number of days it takes for the infection to reach the group.
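As a software check of the Table C bracketing above (not part of the original solution), scipy gives the exact upper-tail areas:

```python
from scipy import stats

t_stat, df = 7.079, 4
# Table C brackets t between 5.598 (one-sided P = 0.0025) and 7.173 (one-sided P = 0.001).
print(stats.t.sf(t_stat, df))            # exact upper-tail area, about 0.001
print(stats.t.sf([5.598, 7.173], df))    # approximately [0.0025, 0.001]
```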

Exercise 3.38

(a) Here is a scatterplot showing the relationship between time at the table and calories consumed.

[Scatterplot of Calories versus Time]

The correlation between Calories and Time is r = -0.649. The overall pattern is roughly (perhaps weakly) linear with a negative slope. There are no clear outliers or strongly influential data points, it seems. Using statistical software, we find that the equation of the least-squares line is

ŷ = 560.65 - 3.08 × time

(b) A scatterplot of the residuals against Time follows.

[Plot of the residuals versus Time]

This plot is useful for addressing the first two of the four conditions we check.

Does the relationship appear linear? This scatterplot magnifies deviations from the regression line, making it easier to detect any non-linear pattern in the data. Based on this plot, there is little reason to doubt that the relationship between Calories and Time is linear.

Does the spread about the line stay the same? The scatterplot of residuals versus Time seems to suggest that the spread about the line is roughly constant. Points seem to lie consistently in a band between -40 and +40.

Are the observations independent? The answer is not clear. These are observations on different children rather than on a single child, and that is good. However, we do not know if the children were selected at random. In addition, we do not know if the children were all together, so that the behavior of one child could influence the behavior of another. Are there children from the same family in this group? These issues would impact the independence of the observations.

Does the variation about the line appear to be Normal? The histogram that follows has a gap and is not particularly bell-shaped. On the other hand, there do not appear to be any outliers or extreme skew. With only 20 observations, it's difficult to assess non-Normality here.

[Histogram of the residuals]

The conditions for inference (for a sample of size 20) are approximately satisfied.

(c) From statistical software, we find that

b = -3.08    SEb = 0.85

For a 95% confidence interval from Table C with n = 20 (and n - 2 = 18), t* = 2.101. We use these to compute the 95% confidence interval for the true slope of the regression line:

b ± t*·SEb = -3.08 ± (2.101)(0.85) = -3.08 ± 1.79, or -4.87 to -1.29 calories per minute.

With 95% confidence, each minute spent at the table reduces calories consumed by between 1.29 calories and 4.87 calories.

3 Chapter 3 (c) From statistical software, we find that b = 3.8 SE b =.85 For a 95% confidence interval from Table C with n = (and n = 8), t* =. We use these to compute the 95% confidence interval for the true slope of the regression line: b ± t*se b = 3.8 ± (.)(.85) = 3.8 ±.79 or.87 to.9 calories per minute. With 95% confidence, each minute spent at the table reduces calories consumed by between.9 calories and.87 calories. Exercise 3. Using software (Minitab, in this case): The output from Minitab follows: The regression equation is Calories = 56 3.8 Time Predictor Coef Stdev t-ratio p Constant 56.65 9.37 9.9. Time 3.77.898 3.6. s = 3. R-sq =.% R-sq(adj) = 38.9% Analysis of Variance SOURCE DF SS MS F p Regression 777.6 777.6 3.. Error 8 985. 57.5 Total 9 73. Fit Stdev.Fit 95.% C.I. 95.% P.I. 37.57 7.3 (.3, 5.9) (386.6, 89.8) The Fit entry gives the predicted calories. Minitab gives both the 95% confidence interval for the mean response and the prediction interval for a single observation. We are predicting a single observation, so the column labeled 95% PI contains the interval we want. We see that this 95% prediction interval is (386.6, 89.8). With 95% confidence, the mean number of calories consumed by Rachel at lunch is between 386 and 89 calories, roughly.