Assumptions, Diagnostics, and Inferences for the Simple Linear Regression Model with Normal Residuals


4 December 2018

1 The Simple Linear Regression Model with Normal Residuals

In previous class sessions, we saw that we had to make assumptions about the population a sample resulted from in order to use the inferential procedures for the population mean. For example, to perform a hypothesis test on the population mean µ of some characteristic of a population using the test statistic

Z = (X̄ − µ)/(σ/√n),

we had to assume that:

1. The sample in hand was a random sample from the population of interest.

2. The population standard deviation σ of the characteristic is known.

3. At least one of the two following requirements is met:

(a) The characteristic in the population is normally distributed.

(b) The sample size n is sufficiently large that the Central Limit Theorem guarantees normality of Z.

In developing the simple linear regression model in the previous class session, nothing we did required statistical assumptions about the population. We determined the line of best fit

ŷ = b₀ + b₁x    (1)

as a way to provide the best predictions ŷᵢ of the responses yᵢ using the predictors xᵢ in our sample, where "best" here means that we minimized the sum of squared residuals between our predictions ŷᵢ = b₀ + b₁xᵢ and the actual responses yᵢ. If all we want to do is prediction, and we are satisfied with the performance of the linear prediction ŷ = b₀ + b₁x, we can stop there. The regression function ŷ = b₀ + b₁x tells us how our prediction of y changes as the predictor x changes, and thus summarizes the data using a line.

Typically, however, we do not just care about how the values of x and y are associated in our sample, but rather how they are associated in the population from which they were sampled. In that case, we want to make an inference from the sample to the population, and a statistical model is required. The statistical model underlying all of the formulas in Chapter 9 of Triola & Triola is the linear regression model with normal residuals,

Yᵢ = β₀ + β₁xᵢ + ɛᵢ.    (2)

This model says that the response Yᵢ (now treated as a random variable) is constructed by multiplying the predictor xᵢ by β₁, adding that to β₀, and then adding a random residual term ɛᵢ to β₀ + β₁xᵢ. Here, β is the Greek letter beta (the analog to the Roman letter b), and ɛ is the Greek letter epsilon (the analog to the Roman letter e). Thus, the intercept β₀ and slope β₁ are the population analogs of b₀ and b₁, and the residual ɛ is the population analog of the in-sample residual / error e.
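To make the sample-level fit concrete, here is a minimal Python sketch of the least-squares computation. Python is not the tool used in this course (we use Minitab), and the data values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical sample: predictor x and response y (made-up values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8])

# Least-squares estimates b0 and b1, obtained by minimizing the sum
# of squared residuals sum_i (y_i - (b0 + b1 * x_i))^2.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Predictions yhat_i and in-sample residuals e_i = y_i - yhat_i.
y_hat = b0 + b1 * x
e = y - y_hat

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print("in-sample residuals:", np.round(e, 3))
```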

The simple linear regression model with normal residuals also makes assumptions about the distribution of ɛ. It assumes that the ɛᵢ are all independent normal random variables with mean 0 and standard deviation σ_ɛ. Summing up, the assumptions of the full simple linear regression model with normal residuals are the following:

Assumptions Needed for Inferences Derived from the Simple Linear Regression Model with Normal Residuals:

1. The sample in hand was a random sample from the population of interest.

2. The response Y can be modeled as a linear function of the predictor x by Yᵢ = β₀ + β₁xᵢ + ɛᵢ.

3. The residuals ɛᵢ are mutually independent. That is, each residual does not depend in any way on any of the other residuals.

4. The residuals ɛᵢ have mean 0 and standard deviation σ_ɛ, neither of which depends on xᵢ.

5. At least one of the two following requirements is met:

(a) In the population, the residuals ɛ are normally distributed.

(b) The sample size n is sufficiently large that the distribution of the residuals washes out in a Central Limit Theorem-like result.

In the final section of this handout, we will discuss margins of error and hypothesis tests for the population intercept β₀ and slope β₁. For these to be valid, we need each of the above assumptions to hold. This means that before we perform an inference on β₀ and β₁, we should check that the assumptions hold. One way to check these assumptions is to use a set of diagnostic plots, which we turn to next.

2 Diagnostic Plots for the Simple Linear Regression Model with Normal Residuals

To test the assumptions of the simple linear regression model with normal residuals, we would like to have the population residuals ɛ = Y − (β₀ + β₁x), which we could then test against each of the assumptions above. However, we do not have access to these residuals. Instead, we have access to the in-sample residuals e = y − (b₀ + b₁x) from each of our predictions. The in-sample residual e will not quite match ɛ, since we are using our estimates of the intercept and slope, b₀ and b₁, in place of the population values β₀ and β₁. If we have done a good job of estimating β₀ and β₁, however, then b₀ and b₁ will be close to the population values and e ≈ ɛ. This means that we can use the in-sample residuals to check the assumptions listed above: if the simple linear regression model with normal residuals is correct, then the sample residuals should approximately satisfy each of the assumptions in the box above.

When we perform a regression from the Stat > Regression > Regression > Fit Regression Model... menu item in Minitab, we can tell Minitab to produce a set of diagnostic plots by clicking the Graphs... button in the dialog box and selecting the Four in one radio button in the resulting window. Each of these diagnostic plots checks one of the assumptions above. One such set of diagnostic plots is shown in Figure 1 below. I have labeled the plots 1-4, and will take them one by one and explain the assumption each checks.
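Outside Minitab, a rough analog of the Four in one display can be built by hand. The following Python sketch is only an illustration under made-up population values (β₀ = 1, β₁ = 2, σ_ɛ = 1); matplotlib and scipy are my choices here, not tools used in this course. It simulates data from a well-specified model, fits the line, and draws the four plots discussed below:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)

# Simulate from a well-specified model with made-up population values.
n = 100
x = rng.uniform(0, 10, size=n)
eps = rng.normal(0, 1, size=n)   # independent normal residuals, sd = 1
y = 1 + 2 * x + eps

# Least-squares fit and in-sample residuals.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
e = y - y_hat

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Plot 1: normal probability plot of the residuals.
stats.probplot(e, dist="norm", plot=axes[0, 0])

# Plot 2: residuals versus fitted values (mean 0, constant spread).
axes[0, 1].scatter(y_hat, e)
axes[0, 1].axhline(0, color="red")
axes[0, 1].set(xlabel="Fitted value", ylabel="Residual")

# Plot 3: frequency histogram of the residuals.
axes[1, 0].hist(e, bins="auto")
axes[1, 0].set(xlabel="Residual", ylabel="Frequency")

# Plot 4: residuals versus observation order (independence check).
axes[1, 1].plot(np.arange(1, n + 1), e, marker="o")
axes[1, 1].axhline(0, color="red")
axes[1, 1].set(xlabel="Observation order", ylabel="Residual")

plt.tight_layout()
plt.show()
```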

[Figure 1 here: a four-in-one diagnostic display, panels labeled 1-4.]

Figure 1: A sample set of diagnostic plots produced by Minitab when the simple linear regression model with normal residuals is appropriate. See the Demo of Diagnostic Plots for Simple Linear Model on the course website to generate more plots like these where the model is well-specified and misspecified.

Plot 1: Normality of the Residuals

The first plot is one we have seen before: it plots the observed values of the residuals eᵢ (sorted from most negative to most positive) against the expected percentiles of those residuals if the residuals followed a normal distribution. This is a probability plot, and if the residuals are approximately normal, then the points on the probability plot will approximately follow the expected values given by the red line.

Plot 2: Residuals Have Mean 0 and Constant Standard Deviation for Each Value of x

The second plot tests the assumption that the residual terms have mean 0 and standard deviation σ_ɛ, neither of which should depend on x. To check this, we plot each in-sample residual eᵢ against the fitted value ŷᵢ, so each point in the second plot corresponds to (ŷᵢ, eᵢ). Since ŷ = b₀ + b₁x, the fitted value is just a rescaling of the predictor x, so this is the same as plotting the residuals against the predictor x. If the assumptions hold, then this plot should look like a blob without any discernible structure: the center (mean) of the blob along the vertical axis should be zero for all values of x, and the vertical spread of the blob should be constant across the values of x, since the standard deviations of the residual terms are assumed to be fixed at σ_ɛ.

Plot 3: Normality of the Residuals

The third plot gives us the same information as the first plot: it is the frequency histogram of the sample residuals. Thus, if the assumptions hold, the frequency histogram of the sample residuals should be bell-shaped, and if we overlay the frequency histogram with the density histogram of the appropriate normal distribution, the two should match up.

Plot 4: Independence of the Residuals

The third assumption, that the residuals are mutually independent, is a very strong condition. One way to look for obvious deviations from mutual independence is to plot the residuals as a function of their order in the data set. This is especially useful if the data are ordered in some natural way, such as by time, by space, or by order of entrance into a study. This is precisely what the fourth plot shows: each sample residual is plotted against its order in the sample. If the residuals are mutually independent, then this plot should show no pattern as a function of the observation index.

Conclusion from the Diagnostic Plots in Figure 1

Based on the four plots in Figure 1, all of the assumptions of the simple linear regression model with normal residuals seem to be satisfied. Plots 1 and 3 show that the sample residuals are approximately normally distributed. Plot 2 shows no strong pattern in either the mean or the standard deviation of the sample residuals as a function of the predictor x: the plot is a blob. Finally, Plot 4 shows no clear pattern in the sample residuals as a function of their index, so the sample residuals look approximately independent.

Demo on Course Website

You should experiment with the Demo of Diagnostic Plots for Simple Linear Model on the course website to generate more plots like these where the model is either well-specified or misspecified. Use the drop-down menu to change the population so that the population's distribution breaks one or more of the assumptions of the simple linear regression model with normal residuals. In that case, the simple linear regression model with normal residuals is "misspecified": it does not match the actual properties of the population, and therefore inferences using the simple linear regression model with normal residuals will not be valid. For example, confidence intervals for the population characteristics will not have the desired coverage, and hypothesis tests will not have the desired Type I error rate.
Whenever you find that your data do not match a pre-defined model, you should be very skeptical of the inferential statistics reported by Minitab.
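To see what a diagnostic failure can look like without the course demo, here is a small Python sketch (again under made-up population values, and not the demo itself) in which the residual standard deviation grows with x, violating assumption 4. The residuals-versus-fitted plot then fans out from left to right instead of forming a structureless blob:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Deliberately misspecified population: the residual standard
# deviation is 0.2 * x, so it depends on x (violates assumption 4).
n = 100
x = rng.uniform(0, 10, size=n)
eps = rng.normal(0, 0.2 * x, size=n)
y = 1 + 2 * x + eps

# Least-squares fit and in-sample residuals, as before.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
e = y - y_hat

# Analog of Plot 2: the spread of the residuals visibly grows
# with the fitted value, flagging the violated assumption.
plt.scatter(y_hat, e)
plt.axhline(0, color="red")
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.show()
```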

3 Inferences from the Simple Linear Model with Normal Residuals

All of the inferential statistics (standard errors, T-statistics, P-values, etc.) reported by Minitab after running a regression rely on the assumptions outlined in the first section of this handout. This is why you should check these assumptions before proceeding. Otherwise, the numbers mean nothing: garbage in, garbage out.

Once we have found that the assumptions seem to hold in our data set, we can proceed to perform inferences from our sample to the population the sample comes from. Remember that, under the simple linear regression model with normal residuals, we assume the response Y can be modeled as

Y = β₀ + β₁x + ɛ,    (3)

where β₀ is the population intercept and β₁ is the population slope. Notice that, if the model is true, then when β₁ = 0, our prediction of the response does not change per unit change in the predictor. Thus, the response and the predictor are not linearly associated. In fact, when the linear model with normal residuals is correct, zero linear association implies something much stronger: independence between the predictor and the response. For this reason, the most common hypothesis test to perform in the setting of a simple linear regression is:

H₀: β₁ = 0
H₁: β₁ ≠ 0

That is, we set up the null hypothesis as the boring case: there is no linear association between the predictor and the response. The alternative hypothesis is the opposite of this: there is some linear association between the predictor and the response.

To test this hypothesis, we will play the same hypothesis testing game as usual. We will compare b₁, our sample estimate of β₁, to the null value, in this case zero, and see how likely it was to get the observed value of b₁ when the null hypothesis that β₁ = 0 is actually true. So we set up a T-statistic under the null hypothesis that β₁ = 0, giving

T = (b₁ − β₁)/SE(b₁)    (4)
  = (b₁ − 0)/SE(b₁) = b₁/SE(b₁),    (5)

where SE(b₁) is the standard error (what Triola & Triola call the margin of error) of b₁. Minitab reports this standard error as SE Coef: the standard error of the coefficient. Under the null hypothesis, this T-statistic is T-distributed with n − 2 degrees of freedom. The P-value reported by Minitab thus uses the T-table with n − 2 degrees of freedom to determine the probability of observing a value of T at least as extreme as t = b₁/SE(b₁) when the null hypothesis is true. Notice that this is a two-tailed test. We interpret the P-value the usual way: we reject the null hypothesis at significance level α if the P-value is less than or equal to α.

We can also use the standard error of the coefficient to construct a 1 − α confidence interval for β₁, namely

b₁ ± t_{α/2, n−2} SE(b₁).

Thus, a rough 95% confidence interval for β₁ for large enough n is b₁ ± 1.96 SE(b₁) ≈ b₁ ± 2 SE(b₁).

You can use the same construction to perform a hypothesis test or construct a confidence interval for the population intercept β₀. However, this is typically of less interest, so I will not cover it in class or in these notes.
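These quantities can be computed directly. Here is a minimal Python sketch (scipy is my choice here, not a course tool; the toy data are the same made-up values as in the fitting sketch above) that reproduces the T-statistic, the two-tailed P-value, and a 95% confidence interval for β₁:

```python
import numpy as np
from scipy import stats

# Hypothetical sample (made-up values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8])
n = len(x)

# Least-squares fit and in-sample residuals.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)

# Estimate sigma_eps by the residual standard error (n - 2 df), then
# the standard error of b1 (what Minitab reports as SE Coef).
s_eps = np.sqrt(np.sum(e ** 2) / (n - 2))
se_b1 = s_eps / np.sqrt(np.sum((x - x.mean()) ** 2))

# T-statistic for H0: beta1 = 0, with a two-tailed P-value from the
# t-distribution with n - 2 degrees of freedom.
t_stat = (b1 - 0) / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# 95% confidence interval: b1 +/- t_{alpha/2, n-2} * SE(b1).
t_crit = stats.t.ppf(0.975, df=n - 2)
lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1

print(f"b1 = {b1:.3f}, SE Coef = {se_b1:.4f}")
print(f"T = {t_stat:.2f}, two-tailed P = {p_value:.4g}")
print(f"95% CI for beta1: ({lo:.3f}, {hi:.3f})")
```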