Inference for Regression: Simple Linear Regression

Inference for Regression: Simple Linear Regression. IPS Chapter 10.1. © 2009 W.H. Freeman and Company

Objectives (IPS Chapter 10.1): Simple linear regression
• Statistical model for linear regression
• Estimating the regression parameters
• Confidence intervals for regression parameters
• Significance test for the slope
• Confidence interval for µ_y
• Prediction intervals

ŷ = 0.125x − 41.4. The data in a scatterplot are a random sample from a population that may contain a linear relationship between x and y. Different sample, different plot. We want to describe the population mean response µ_y as a function of the explanatory variable x, µ_y = β0 + β1x, and to assess whether the observed relationship is statistically significant (not entirely explained by chance events due to random sampling).

Statistical model for linear regression. In the population, the linear regression equation is µ_y = β0 + β1x. Sample data then fit the model: data = fit + residual, y_i = (β0 + β1x_i) + ε_i, where the ε_i are independent and Normally distributed N(0, σ). Linear regression assumes equal variance of y (σ is the same for all values of x).
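As a minimal sketch of this model in Python (NumPy assumed; the parameter values below are invented, loosely echoing the line ŷ = 0.125x − 41.4 shown earlier):

```python
import numpy as np

# Assumed population parameters, chosen only for illustration.
rng = np.random.default_rng(1)
beta0, beta1, sigma = -41.4, 0.125, 5.0

x = rng.uniform(400.0, 600.0, size=50)   # explanatory variable
eps = rng.normal(0.0, sigma, size=50)    # independent N(0, sigma) errors
y = beta0 + beta1 * x + eps              # data = fit + residual
```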

Estimating the parameters: µ_y = β0 + β1x. The intercept β0, the slope β1, and the standard deviation σ of y are the unknown parameters of the regression model. We rely on the sample data to provide unbiased estimates of these parameters.
• The value of ŷ from the least-squares regression line (remember Chapter 2?) is really a prediction of the mean value of y (µ_y) for a given value of x.
• The least-squares regression line (ŷ = b0 + b1x) obtained from sample data is the best estimate of the true population regression line (µ_y = β0 + β1x).
ŷ: unbiased estimate for the mean response µ_y
b0: unbiased estimate for the intercept β0
b1: unbiased estimate for the slope β1

The population standard deviation σ for y at any given value of x represents the spread of the normal distribution of the ε_i around the mean µ_y. The estimate of σ is calculated from the residuals, e_i = y_i − ŷ_i:

s = √( Σ e_i² / (n − 2) ) = √( Σ (y_i − ŷ_i)² / (n − 2) )

s is an estimate of the population standard deviation σ.
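A short sketch of the least-squares estimates b0 and b1 and of s (the tiny dataset is made up purely for illustration):

```python
import numpy as np

# Hypothetical sample data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.6])
n = len(y)

# Least-squares estimates of the slope and intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residuals e_i = y_i - yhat_i, and the regression standard error s,
# which estimates sigma using n - 2 degrees of freedom.
y_hat = b0 + b1 * x
s = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))
print(b0, b1, s)
```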

Conditions for inference:
• The observations are independent.
• The relationship is linear.
• The standard deviation of y, σ, is the same for all values of x.
• The response y varies normally around its mean.

Using residual plots to check for regression validity. The residuals give useful information about the contribution of individual data points to the overall pattern of scatter. We view the residuals in a residual plot of e_i vs. x_i. If the residuals are scattered randomly around 0 with uniform variation, this indicates that the data fit a linear model, have normally distributed residuals for each value of x, and have constant standard deviation σ.

Residuals are randomly scattered → good! Curved pattern → the relationship is not linear. Change in variability across the plot → σ is not equal for all values of x.
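A sketch of such a residual plot (NumPy and Matplotlib assumed; same invented dataset as above):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data and least-squares fit; np.polyfit returns the
# same line as the explicit formulas above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.6])
b1, b0 = np.polyfit(x, y, deg=1)
resid = y - (b0 + b1 * x)

# Residual plot: e_i vs. x_i, with a reference line at 0.
plt.scatter(x, resid)
plt.axhline(0.0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```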

Confidence intervals for regression parameters. Estimating the regression parameters β0 and β1 is a case of one-sample inference with unknown population variance, so we rely on the t distribution, with n − 2 degrees of freedom. A level C confidence interval for the slope, β1, is based on an estimate of the standard deviation of the estimated slope, b1: b1 ± t* SE_b1. A level C confidence interval for the intercept, β0, is based on an estimate of the standard deviation of the estimated intercept, b0: b0 ± t* SE_b0. t* is the value for the t(n − 2) distribution with area C between −t* and +t*. We'll see formulas for the SE values in 10.2.

Significance test for the slope. We can test the hypothesis H0: β1 = 0 against a one- or two-sided alternative. We calculate t = b1 / SE_b1, which has a t(n − 2) distribution, to find the p-value of the test. Note: software typically provides two-sided p-values.
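A sketch of the slope CI and t test built from these formulas (SciPy assumed; dataset invented for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.6])
n = len(y)

b1, b0 = np.polyfit(x, y, deg=1)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
se_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))

# 95% CI for the slope: b1 +/- t* SE_b1, with t* from t(n - 2).
t_star = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)

# Two-sided test of H0: beta1 = 0.
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(ci, t_stat, p_value)
```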

Confidence interval for µ_y. We can calculate a confidence interval for the population mean µ_y of all responses y when x takes the value x*. This interval is centered on ŷ, the unbiased estimate of µ_y, and has the typical form of a CI: estimate ± t* SE_estimate. The true value of the population mean µ_y when x = x* will be within our confidence interval in C% of samples taken from the population.

The level C confidence interval for the mean response µ_y at a given value x* of x is: ŷ ± t* SE_µ̂. t* is the value from the t(n − 2) distribution with area C between −t* and +t*. Again, we'll hold off on the SE term for now. A confidence interval for µ_y can be found for any value x*. Graphically, these confidence intervals are shown as a pair of smooth curves centered on the least-squares regression line (a 95% confidence interval for µ_y at every x). We often call these confidence bands.

Inference for prediction. We have seen that one use of regression is to predict the value of y for a particular value of x: ŷ = b0 + b1x. Now that we understand the ideas of statistical inference, we can employ them to help us understand the variability of that simple estimate. To predict an individual response y for a given value of x, we use a prediction interval. The prediction interval is wider than the CI for mean response to reflect the fact that individual observations are more variable.

The level C prediction interval for a single observation of y when x takes the value x* is: ŷ ± t* SE_ŷ. t* is the value for the t(n − 2) distribution with area C between −t* and +t*. We can display the prediction intervals in the same way as the CI for mean response (a 95% prediction band for y).

• The confidence interval for µ_y estimates the mean value of y for all individuals in the population whose value of x is x*.
• The prediction interval predicts the value of y for one single individual whose value of x is x*.
The figure (showing the 95% prediction interval for y and the 95% confidence interval for µ_y) illustrates the fact that the CI for mean response is narrower than the corresponding prediction interval. Notice that both intervals are most narrow for values near the center of the x distribution.
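A sketch computing both intervals at a hypothetical x* = 3.5, using the standard error formulas that Section 10.2 spells out (SciPy assumed; dataset invented for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.6])
n = len(y)

b1, b0 = np.polyfit(x, y, deg=1)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
sxx = np.sum((x - x.mean()) ** 2)

x_star = 3.5                       # arbitrary value of interest
y_hat = b0 + b1 * x_star
t_star = stats.t.ppf(0.975, df=n - 2)

# SE for the mean response and for predicting one new observation;
# the prediction SE carries an extra "+ 1" inside the square root.
se_mean = s * np.sqrt(1 / n + (x_star - x.mean()) ** 2 / sxx)
se_pred = s * np.sqrt(1 + 1 / n + (x_star - x.mean()) ** 2 / sxx)

ci = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)  # CI for mu_y
pi = (y_hat - t_star * se_pred, y_hat + t_star * se_pred)  # PI for y
print(ci, pi)
```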

1918 influenza epidemic

Week   # Cases   # Deaths
 1        36         0
 2       531         0
 3      4233       130
 4      8682       552
 5      7164       738
 6      2229       414
 7       600       198
 8       164        90
 9        57        56
10       722        50
11      1517        71
12      1828       137
13      1539       178
14      2416       194
15      3148       290
16      3465       310
17      1440       149

[Figure: weekly counts of diagnosed cases and of reported deaths plotted over the 17 weeks.] The graph suggests that 7 to 9% of those diagnosed with the flu died within about a week of diagnosis. We look at the relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier.

1918 flu epidemic: relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier.

EXCEL
Regression Statistics
Multiple R          0.911   (r = 0.91)
R Square            0.830
Adjusted R Square   0.82
Standard Error      85.07   (s)
Observations        16.00

            Coefficients   St. Error      t Stat   P-value   Lower 95%   Upper 95%
Intercept   49.292         29.845         1.652    0.1209    −14.720     113.304
FluCases0   0.072 (b1)     0.009 (SE_b1)  8.263    0.0000    0.053       0.091

The P-value for H0: β1 = 0 is very small → reject H0 → β1 is significantly different from 0. There is a significant relationship between the number of flu cases and the number of deaths from flu a week later.
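As a check, a short sketch that should reproduce this output using SciPy's linregress on the table above (deaths in weeks 2-17 paired with cases in weeks 1-16, so n = 16):

```python
import numpy as np
from scipy import stats

# Data from the 1918 flu table above: deaths regressed on the number
# of new cases diagnosed one week earlier.
cases = np.array([36, 531, 4233, 8682, 7164, 2229, 600, 164,
                  57, 722, 1517, 1828, 1539, 2416, 3148, 3465])
deaths = np.array([0, 130, 552, 738, 414, 198, 90, 56,
                   50, 71, 137, 178, 194, 290, 310, 149])

res = stats.linregress(cases, deaths)
# Should agree with the Excel output: b1 ~ 0.072, SE_b1 ~ 0.009,
# r ~ 0.911, and a tiny two-sided p-value for H0: beta1 = 0.
print(res.slope, res.stderr, res.rvalue, res.pvalue, res.intercept)
```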

SPSS. CI for the mean weekly death count one week after 4000 flu cases are diagnosed: µ_y within about 300 to 380 deaths. Prediction interval for a weekly death count one week after 4000 flu cases are diagnosed: y within about 180 to 500 deaths. [Figure: least-squares regression line with the 95% prediction interval for y and the 95% confidence interval for µ_y.]

Inference for Regression: More Detail about Simple Linear Regression. IPS Chapter 10.2. © 2009 W.H. Freeman and Company

Objectives (IPS Chapter 10.2): Inference for regression, more details
• Analysis of variance for regression
• The ANOVA F test
• Calculations for regression inference
• Inference for correlation

Analysis of variance for regression. The regression model is: data = fit + residual, y_i = β0 + β1x_i + ε_i, where the ε_i are independent and normally distributed N(0, σ) for all values of x. The calculations are based on a technique called Analysis of Variance, or ANOVA. ANOVA partitions the total amount of variation in a data set into two or more components of variation.

Analysis of Variance (ANOVA). In the linear regression setting, there are two reasons that values of y differ from one another:
• the individuals have different values of x;
• there is random variation among individuals with the same value of x.
The total variation of the DATA splits into variation due to the model FIT plus variation due to the RESIDUALS:

Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ (y_i − ŷ_i)²   (sums over i = 1, …, n)

SST = SSM + SSE

Sums of Squares, Degrees of Freedom, and Mean Squares. We have just seen that SSTotal = SSModel + SSError. Each sum of squares has an associated degrees of freedom, and these also add: DFT = DFM + DFE. For SLR, DFT = n − 1, DFM = 1, DFE = n − 2. Each SS is also associated with a mean square, MS = SS/DF, and the mean squares do NOT add: MST = SST/DFT, MSM = SSM/DFM, MSE = SSE/DFE.
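A sketch of the decomposition (dataset invented for illustration; the assert verifies SST = SSM + SSE numerically):

```python
import numpy as np

# Hypothetical data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.6])
n = len(y)

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)       # total variation, DFT = n - 1
ssm = np.sum((y_hat - y.mean()) ** 2)   # variation due to fit, DFM = 1
sse = np.sum((y - y_hat) ** 2)          # residual variation, DFE = n - 2

assert np.isclose(sst, ssm + sse)       # SST = SSM + SSE
msm, mse = ssm / 1, sse / (n - 2)       # mean squares do NOT add
print(sst, ssm, sse, msm, mse)
```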

ANOVA and correlation. Fact: the squared correlation between x and y equals the fraction of variation in y explained by the FIT of the linear model: r² = SSM/SST. If x and y are highly correlated, SST is dominated by the SSM term, with little contribution from the residuals (SSE); conversely, if x and y are weakly correlated, SST is dominated by SSE. r² is often called the coefficient of determination.

The ANOVA F test. The null hypothesis that y is not linearly related to x, H0: β1 = 0 versus Ha: β1 ≠ 0, can be tested by comparing MSM to MSE: F = MSM/MSE. When H0 is true, F follows the F(1, n − 2) distribution. The p-value is the probability of observing a value of F greater than the observed one.
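A sketch of the F test (SciPy assumed; same invented dataset as above):

```python
import numpy as np
from scipy import stats

# Hypothetical data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.6])
n = len(y)

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
msm = np.sum((y_hat - y.mean()) ** 2) / 1        # SSM / DFM
mse = np.sum((y - y_hat) ** 2) / (n - 2)         # SSE / DFE

f_stat = msm / mse
p_value = stats.f.sf(f_stat, 1, n - 2)  # area above F under F(1, n-2)
# In SLR, F equals the square of the slope's t statistic, so this
# p-value matches the two-sided t test of H0: beta1 = 0.
print(f_stat, p_value)
```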

ANOVA table

Source   Sum of squares (SS)    DF      Mean square (MS)   F         P-value
Model    SSM = Σ(ŷ_i − ȳ)²      1       MSM = SSM/DFM      MSM/MSE   tail area above F
Error    SSE = Σ(y_i − ŷ_i)²    n − 2   MSE = SSE/DFE
Total    SST = Σ(y_i − ȳ)²      n − 1   MST = SST/DFT

The estimate of σ is calculated from the residuals e_i = y_i − ŷ_i:

s² = Σ e_i² / (n − 2) = Σ (y_i − ŷ_i)² / (n − 2) = SSE/DFE = MSE

Calculations for regression inference. To estimate the regression parameters, we calculate the standard errors for the estimated regression coefficients. The standard error of the least-squares slope b1 is:

SE_b1 = s / √( Σ (x_i − x̄)² )

The standard error of the intercept b0 is:

SE_b0 = s √( 1/n + x̄² / Σ (x_i − x̄)² )

To estimate mean responses or predict future responses, we calculate the following standard errors. The standard error of the mean response µ_y at x = x* is:

SE_µ̂ = s √( 1/n + (x* − x̄)² / Σ (x_i − x̄)² )

The standard error for predicting an individual response y at x = x* is:

SE_ŷ = s √( 1 + 1/n + (x* − x̄)² / Σ (x_i − x̄)² )
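A sketch computing all four standard errors from these formulas (dataset invented for illustration; x* = 3.5 is an arbitrary value of interest):

```python
import numpy as np

# Hypothetical data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.6])
n = len(y)

b1, b0 = np.polyfit(x, y, deg=1)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
sxx = np.sum((x - x.mean()) ** 2)
x_star = 3.5

se_b1 = s / np.sqrt(sxx)
se_b0 = s * np.sqrt(1 / n + x.mean() ** 2 / sxx)
se_mu = s * np.sqrt(1 / n + (x_star - x.mean()) ** 2 / sxx)
se_yhat = s * np.sqrt(1 + 1 / n + (x_star - x.mean()) ** 2 / sxx)
print(se_b1, se_b0, se_mu, se_yhat)
```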

Inference for correlation. The sample correlation coefficient, r, can be used as an estimate of the population-level correlation, ρ. We can test H0: ρ = 0 with a t statistic:

t = r √(n − 2) / √(1 − r²)

A P-value can be found using the t(n − 2) distribution.

The test of significance for ρ uses the one-sample t test for H0: ρ = 0. We compute the t statistic for sample size n and correlation coefficient r:

t = r √(n − 2) / √(1 − r²)

The p-value is the area under the t(n − 2) distribution for values of T as extreme as t in the direction of Ha.
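A sketch of this test (SciPy assumed; dataset invented for illustration). In simple linear regression it gives the same p-value as the slope t test:

```python
import numpy as np
from scipy import stats

# Hypothetical data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.6])
n = len(y)

r = np.corrcoef(x, y)[0, 1]
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_two_sided = 2 * stats.t.sf(abs(t), df=n - 2)
# Same p-value as the two-sided slope test of H0: beta1 = 0,
# since in SLR the two hypotheses coincide.
print(r, t, p_two_sided)
```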