Ch 2: Simple Linear Regression

1. Simple Linear Regression Model

A simple regression model with a single regressor x is

    y = β0 + β1 x + ε,

where we assume that the error ε is an independent random component with zero mean and unknown variance σ². The primary goal is to do statistical inference on f(x) = β0 + β1 x:

1. Estimate β0 and β1.
2. Test H0 : β1 = 0.

Note that the regressor x is not a random variable, but the response y is. With the error assumption, the mean and variance of the response given x are

    E(y | x) = β0 + β1 x  and  Var(y | x) = σ².

The true regression line, β0 + β1 x, is the line of the mean response. The parameters β0 (intercept) and β1 (slope) are called regression coefficients.

2. Least-Squares Estimation

Suppose that we observe n pairs of data (x1, y1), ..., (xn, yn) from the model

    yi = β0 + β1 xi + εi,  i = 1, ..., n.

The question is how to estimate β0 and β1. We will use the least-squares method: find β0 and β1 that minimize the sum of the squares of the differences between the observations and a linear function.
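As a quick illustration of the model (not from the notes themselves), here is a short Python sketch that simulates data from y = β0 + β1 x + ε; the parameter values are made up for illustration.

```python
import random

# Simulate from the simple linear regression model
# y = beta0 + beta1 * x + eps, with eps ~ N(0, sigma^2).
# beta0 = 10, beta1 = 2, sigma = 1 are illustrative choices.
random.seed(0)
beta0, beta1, sigma = 10.0, 2.0, 1.0

x = [i / 2 for i in range(1, 21)]   # fixed (non-random) regressor values
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]

# The mean response E(y | x) = beta0 + beta1 * x is the true regression line.
print(len(x), len(y))
```

Each simulated response scatters around the true regression line with standard deviation σ, which is exactly the structure the least-squares method below tries to recover.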

[Figure: scatter plot of y versus x with the fitted least-squares line]

Least-Squares Estimation of β0 and β1

The least-squares (LS) criterion is

    S(β0, β1) = Σ (yi − β0 − β1 xi)² = Σ εi².

The LS estimators β̂0 and β̂1 satisfy

    ∂S/∂β0 |(β̂0, β̂1) = −2 Σ (yi − β̂0 − β̂1 xi) = 0
    ∂S/∂β1 |(β̂0, β̂1) = −2 Σ (yi − β̂0 − β̂1 xi) xi = 0,

which give the least-squares normal equations

    n β̂0 + β̂1 Σ xi = Σ yi
    β̂0 Σ xi + β̂1 Σ xi² = Σ yi xi.

The solution to the normal equations is

    β̂0 = ȳ − β̂1 x̄  and  β̂1 = Sxy / Sxx,

where ȳ = Σ yi / n, x̄ = Σ xi / n, Sxy = Σ yi (xi − x̄), and Sxx = Σ (xi − x̄)².
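The closed-form solution above can be sketched in a few lines of Python; the data and true coefficients here are simulated purely for illustration.

```python
import random

# Least-squares estimates via the closed-form formulas:
# beta1_hat = Sxy / Sxx, beta0_hat = ybar - beta1_hat * xbar.
# True values beta0 = 5, beta1 = 3 are made up for this sketch.
random.seed(1)
x = [float(i) for i in range(1, 31)]
y = [5.0 + 3.0 * xi + random.gauss(0, 2.0) for xi in x]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum(yi * (xi - xbar) for xi, yi in zip(x, y))

beta1_hat = Sxy / Sxx                  # slope estimate
beta0_hat = ybar - beta1_hat * xbar    # intercept estimate
print(round(beta0_hat, 2), round(beta1_hat, 2))
```

With n = 30 points and moderate noise, the estimates land close to the true (5, 3), as the unbiasedness result later in the chapter suggests.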

Define the fitted values and the residuals as

    ŷi = β̂0 + β̂1 xi  and  ei = yi − ŷi,

respectively, for i = 1, 2, ..., n. Both quantities will be used later for checking model adequacy.

Example (the rocket propellant data, n = 20). Question: what is the statistical relationship between y (shear strength) and x (age of propellant)? Use a simple linear regression model.

[Figure: scatter plot of shear strength versus age of propellant]

Estimating the parameters: to estimate the model parameters, first calculate

    x̄ = 13.3625,  ȳ = 2131.358,
    Sxx = Σ (xi − x̄)² = 1106.56,
    Sxy = Σ yi (xi − x̄) = −41112.65.

Then we find that

    β̂1 = Sxy / Sxx = −41112.65 / 1106.56 = −37.15
    β̂0 = ȳ − β̂1 x̄ = 2131.358 + 37.15 × 13.3625 = 2627.82.
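The estimates above can be reproduced from the quoted summary statistics alone (the raw rocket propellant observations are not listed in these notes):

```python
# Rocket propellant data: recompute beta1_hat and beta0_hat
# from the summary statistics given in the text.
n = 20
xbar, ybar = 13.3625, 2131.358
Sxx = 1106.56
Sxy = -41112.65

beta1_hat = Sxy / Sxx                  # -> about -37.15
beta0_hat = ybar - beta1_hat * xbar    # -> about 2627.82
print(round(beta1_hat, 2), round(beta0_hat, 2))
```

This matches the coefficient column of the R output shown below.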

The least-squares fit is

    ŷ = 2627.82 − 37.15 x.

[Figure: scatter plot of shear strength versus age of propellant with the fitted line]

Output from R (the rocket propellant data):

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 2627.822     44.184   59.48  < 2e-16 ***
    x            -37.154      2.889  -12.86 1.64e-10 ***

Properties of β̂0 and β̂1

β̂0 and β̂1 are unbiased estimators, that is, E(β̂0) = β0 and E(β̂1) = β1. The variances of β̂0 and β̂1 are

    Var(β̂1) = σ² / Sxx  and  Var(β̂0) = σ² (1/n + x̄² / Sxx).
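The unbiasedness claim can be checked with a small Monte Carlo sketch: average the least-squares estimates over many simulated datasets from a model with known coefficients (the values β0 = 2, β1 = 1 below are made up for illustration).

```python
import random

# Monte Carlo check of unbiasedness: E(beta0_hat) = beta0, E(beta1_hat) = beta1.
random.seed(3)
x = [float(i) for i in range(1, 11)]
n = len(x)
xbar = sum(x) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)

b0s, b1s = [], []
for _ in range(2000):
    y = [2.0 + 1.0 * xi + random.gauss(0, 1.0) for xi in x]
    Sxy = sum(yi * (xi - xbar) for xi, yi in zip(x, y))
    b1 = Sxy / Sxx
    b0 = sum(y) / n - b1 * xbar
    b1s.append(b1)
    b0s.append(b0)

# Averages over 2000 replications should be close to (2, 1).
print(round(sum(b0s) / len(b0s), 2), round(sum(b1s) / len(b1s), 2))
```

The spread of the replicated estimates also illustrates the variance formulas: the slope estimates vary on the scale σ/√Sxx.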

Properties of the least-squares fit ŷ

- The sum of the residuals is always zero: Σ (yi − ŷi) = Σ ei = 0, which implies Σ yi = Σ ŷi.
- The least-squares fit always passes through the point (x̄, ȳ).
- The sum of the residuals weighted by the corresponding xi always equals zero: Σ xi ei = 0.
- The sum of the residuals weighted by the corresponding fitted values ŷi always equals zero: Σ ŷi ei = 0.

Estimation of σ²

An estimate of σ² is required to do hypothesis testing on β0 and β1. To that end, decompose the corrected SS of the yi into the residual (error) SS and the model (regression) SS:

    Σ (yi − ȳ)² = Σ (yi − ŷi)² + Σ (ŷi − ȳ)² + 2 Σ (yi − ŷi)(ŷi − ȳ)
                = Σ (yi − ŷi)² + Σ (ŷi − ȳ)²     (the cross-product term is zero)

    SST = SSE + SSR.

For the SSE term,

    SSE = Σ ei² = Σ (yi − ŷi)²
        = Σ yi² − n ȳ² − β̂1 Sxy
        = SST − β̂1 Sxy.
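The four residual identities above can be verified numerically; the simulated data here are for illustration only.

```python
import random

# Numerical check of the residual identities:
# sum(e_i) = 0, sum(x_i e_i) = 0, sum(yhat_i e_i) = 0.
random.seed(2)
x = [float(i) for i in range(1, 16)]
y = [1.0 + 0.5 * xi + random.gauss(0, 1.0) for xi in x]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum(yi * (xi - xbar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

yhat = [b0 + b1 * xi for xi in x]
e = [yi - yh for yi, yh in zip(y, yhat)]

print(round(sum(e), 10))                                   # ~0
print(round(sum(xi * ei for xi, ei in zip(x, e)), 10))     # ~0
print(round(sum(yh * ei for yh, ei in zip(yhat, e)), 10))  # ~0
```

Up to floating-point rounding, all three sums vanish, and the fitted line passes through (x̄, ȳ) by construction of β̂0.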

SSE has n − 2 degrees of freedom. As an unbiased estimator of σ², use

    σ̂² = SSE / (n − 2) = MSE.

MSE is called the mean square error, and σ̂ is called the standard error of regression. Notice that σ̂² is a model-dependent estimate of σ².

For the rocket propellant data, find

    SST = Σ yi² − n ȳ² = 1693737.6.

Then compute

    SSE = SST − β̂1 Sxy = 1693737.6 − (−37.15)(−41112.65) = 166402.65.

Finally, the estimate of σ² is

    σ̂² = SSE / (n − 2) = 166402.65 / 18 = 9244.59.

Alternative form of the model

Consider an alternative model obtained by rewriting the original model as

    yi = (β0 + β1 x̄) + β1 (xi − x̄) + εi = β0′ + β1 (xi − x̄) + εi,

which is a version of the original model shifted by x̄. The least-squares estimators are β̂0′ = ȳ and β̂1 = Sxy / Sxx, and they are uncorrelated: Cov(β̂0′, β̂1) = 0.
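The σ² estimate for the rocket data follows directly from the quantities already computed:

```python
# Estimate sigma^2 for the rocket propellant data from the
# summary quantities derived above.
SST = 1693737.6
beta1_hat = -37.15
Sxy = -41112.65
n = 20

SSE = SST - beta1_hat * Sxy    # SST - beta1_hat * Sxy
MSE = SSE / (n - 2)            # unbiased estimate of sigma^2
print(round(SSE, 2), round(MSE, 2))
```

Note that σ̂ = √MSE ≈ 96.1 is on the scale of the response (shear strength), which makes it easy to interpret.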

3. Hypothesis Testing on the Slope

Now we need the assumption εi ~ i.i.d. N(0, σ²). We wish to test the hypothesis that the slope is zero (that is, the significance of regression):

    H0 : β1 = 0  vs.  H1 : β1 ≠ 0.

β̂1 is distributed as N(β1, σ² / Sxx). In the case that σ² is known, use the statistic

    Z0 = (β̂1 − 0) / √(σ² / Sxx),

which is distributed N(0, 1) when the null hypothesis is true.

t-tests

Typically σ² is unknown, so use the t-test statistic

    t0 = (β̂1 − 0) / se(β̂1),

which follows a t distribution with n − 2 degrees of freedom under H0. Here se(β̂1) = √(MSE / Sxx) is called the standard error of the slope. Reject the null hypothesis if |t0| > t(α/2, n−2). If we fail to reject H0 : β1 = 0, there is no evidence of a linear relationship between x and y.

Example (the rocket propellant data):

With β̂1 = −37.15, MSE = 9244.59 and Sxx = 1106.56, the test statistic is

    t0 = (−37.15 − 0) / √(9244.59 / 1106.56) = −12.85.

With α = 0.05, t(0.025, 18) = 2.101. Since |t0| > 2.101, we reject H0.

Example (simulated data, n = 20, no linear relationship): with β̂1 ≈ 0, the test statistic is t0 ≈ 0. With α = 0.05, t(0.025, 18) = 2.101; since |t0| < 2.101, we fail to reject H0.

[Figure: scatter plot of new.y versus x showing no linear trend]

Output from a Statistical Software

Many statistical software packages use the P-value approach for decision making. The P-value is the probability (when H0 is true) that t0 takes a value as extreme as, or more extreme than, the value actually observed.
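The rocket-data t statistic can be recomputed from the quoted summaries; the critical value t(0.025, 18) = 2.101 is taken from the text rather than recomputed.

```python
import math

# t-test of H0: beta1 = 0 for the rocket propellant data.
beta1_hat = -37.15
MSE = 9244.59
Sxx = 1106.56

se_beta1 = math.sqrt(MSE / Sxx)     # standard error of the slope
t0 = (beta1_hat - 0) / se_beta1
t_crit = 2.101                      # t_{0.025, 18}, from tables

print(round(t0, 2), abs(t0) > t_crit)
```

The statistic (about −12.85) is far beyond the critical value, so H0 is rejected, in agreement with the R output below.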

Decision rule: if the P-value ≤ α, reject H0.

Output from R (the rocket propellant data):

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 2627.822     44.184   59.48  < 2e-16 ***
    x            -37.154      2.889  -12.86 1.64e-10 ***

Since the p-value ≈ 0 (< 0.05), reject H0.

The Analysis of Variance (ANOVA)

ANOVA is another approach to testing the significance of regression (that is, testing H0 : β1 = 0). It provides the same result as the t-test, but it will be very useful for the multiple regression model. ANOVA is based on a partition of the total variability in the response variable y. Consider the partition

    yi − ȳ = (yi − ŷi) + (ŷi − ȳ).

Squaring both sides and summing over i,

    Σ (yi − ȳ)² = Σ (yi − ŷi)² + Σ (ŷi − ȳ)²
    SST = SSE + SSR.

Note that the cross-product term equals 0 and SSR = β̂1 Sxy. The degrees of freedom (df) satisfy

    df(SST) = df(SSR) + df(SSE)
    (n − 1) = 1 + (n − 2).

Use the F-test for testing H0 : β1 = 0:

    F0 = (SSR / 1) / (SSE / (n − 2)) = MSR / MSE,

where F0 follows the F(1, n−2) distribution under H0. The decision rule is: reject H0 if F0 > F(α, 1, n−2). The benefit of using the analysis of variance comes in multiple regression.

Example (the rocket propellant data): to obtain the statistic F0, first compute

    SST = Σ (yi − ȳ)² = 1693737.6
    SSR = β̂1 Sxy = (−37.15)(−41112.65) = 1527334.95.

Then find

    SSE = SST − SSR = 166402.65.

Finally, compute

    F0 = (1527334.95 / 1) / (166402.65 / 18) = 165.21.

Reject H0 because F0 > F(0.01, 1, 18) = 8.29.

Output from R (P-value approach):

    F-statistic: 165.4 on 1 and 18 DF, p-value: 1.643e-10

4. Interval Estimation

Confidence intervals on β0 and β1: under the assumption that the errors are normally and independently distributed, both

    (β̂1 − β1) / se(β̂1)  and  (β̂0 − β0) / se(β̂0)

follow a t distribution with n − 2 degrees of freedom.
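The ANOVA computation for the rocket data, built from the sums of squares above:

```python
# ANOVA F-test of H0: beta1 = 0 for the rocket propellant data.
SST = 1693737.6
SSR = 1527334.95          # beta1_hat * Sxy
n = 20

SSE = SST - SSR
MSR = SSR / 1             # regression mean square (1 df)
MSE = SSE / (n - 2)       # error mean square (n - 2 df)
F0 = MSR / MSE
print(round(F0, 2))
```

The resulting F0 (about 165.2) is, up to rounding, the square of the t statistic computed earlier (−12.85² ≈ 165.1), which is exactly the t-test/F-test equivalence in simple linear regression.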

Therefore, a 100(1 − α) percent confidence interval on β1 is

    β̂1 − t(α/2, n−2) se(β̂1) ≤ β1 ≤ β̂1 + t(α/2, n−2) se(β̂1),

and a 100(1 − α) percent confidence interval on β0 is

    β̂0 − t(α/2, n−2) se(β̂0) ≤ β0 ≤ β̂0 + t(α/2, n−2) se(β̂0),

where

    se(β̂0) = √(MSE (1/n + x̄² / Sxx))  and  se(β̂1) = √(MSE / Sxx).

Example (the rocket propellant data): construct 95% confidence intervals on β0 and β1. With se(β̂0) = 44.184, se(β̂1) = 2.89 and t(0.025, 18) = 2.101,

    2627.82 − (2.101)(44.184) ≤ β0 ≤ 2627.82 + (2.101)(44.184)
    2534.99 ≤ β0 ≤ 2720.65

    −37.15 − (2.101)(2.89) ≤ β1 ≤ −37.15 + (2.101)(2.89)
    −43.22 ≤ β1 ≤ −31.08.

Confidence interval on σ²: if the errors εi are normally and independently distributed, it can be shown that

    (n − 2) MSE / σ² ~ χ²(n − 2).

Thus, a 100(1 − α) percent confidence interval on σ² is

    (n − 2) MSE / χ²(α/2, n−2) ≤ σ² ≤ (n − 2) MSE / χ²(1−α/2, n−2).
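The two 95% intervals for the rocket data follow from the standard errors and t quantile quoted above:

```python
# 95% confidence intervals for beta0 and beta1 (rocket propellant data).
beta0_hat, se_b0 = 2627.82, 44.184
beta1_hat, se_b1 = -37.15, 2.89
t_crit = 2.101                      # t_{0.025, 18}

ci_b0 = (beta0_hat - t_crit * se_b0, beta0_hat + t_crit * se_b0)
ci_b1 = (beta1_hat - t_crit * se_b1, beta1_hat + t_crit * se_b1)
print([round(v, 2) for v in ci_b0])   # about [2534.99, 2720.65]
print([round(v, 2) for v in ci_b1])   # about [-43.22, -31.08]
```

The interval for β1 excludes 0, consistent with rejecting H0 : β1 = 0 at the 5% level.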

Interval estimation of the mean response

A major use of a regression model is to estimate the mean response E(y) for a particular value of x. We want to estimate E(y | x0), where x0 is any value of the regressor variable within the range of the original data on x used to fit the model. Use the estimator

    μ̂(y|x0) = β̂0 + β̂1 x0.

μ̂(y|x0) is normally distributed with mean E(y | x0) and variance

    σ² (1/n + (x0 − x̄)² / Sxx).

The sampling distribution of

    (μ̂(y|x0) − E(y | x0)) / √(MSE (1/n + (x0 − x̄)² / Sxx))

is a t distribution with n − 2 degrees of freedom. Therefore, a 100(1 − α) percent confidence interval on the mean response at x = x0 is

    μ̂(y|x0) − t(α/2, n−2) A ≤ E(y | x0) ≤ μ̂(y|x0) + t(α/2, n−2) A,

where

    A = √(MSE (1/n + (x0 − x̄)² / Sxx)).

For the rocket propellant data, a 95% confidence interval on E(y | x0) is

    μ̂(y|x0) − (2.101) A(x0) ≤ E(y | x0) ≤ μ̂(y|x0) + (2.101) A(x0),

where

    A(x0) = √(9244.59 (1/20 + (x0 − 13.3625)² / 1106.56)).

For example, if x0 = 13.3625, then the confidence interval is

    2086.230 ≤ E(y | 13.3625) ≤ 2176.571.
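The half-width t(α/2, n−2)·A(x0) of this interval can be computed for any x0; as the formula shows, it is smallest at x0 = x̄ and grows as x0 moves away from x̄.

```python
import math

# Half-width of the 95% CI for the mean response at x0 (rocket data).
MSE, n = 9244.59, 20
xbar, Sxx = 13.3625, 1106.56
t_crit = 2.101                      # t_{0.025, 18}

def half_width(x0):
    # t * sqrt(MSE * (1/n + (x0 - xbar)^2 / Sxx))
    return t_crit * math.sqrt(MSE * (1 / n + (x0 - xbar) ** 2 / Sxx))

print(round(half_width(13.3625), 2))   # narrowest, at x0 = xbar
print(round(half_width(25.0), 2))      # wider, away from xbar
```

At x0 = x̄ the half-width is about 45.2, matching the interval 2086.230 ≤ E(y | 13.3625) ≤ 2176.571 around μ̂ = ȳ = 2131.358 up to rounding.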

Prediction of a new observation

Prediction is an important application of the regression model. If x0 is the value of the regressor variable of interest, then

    ŷ0 = β̂0 + β̂1 x0

is the point prediction of the future observation y0. The variability of ŷ0 − y0 is

    E(ŷ0 − y0)² = E{[ŷ0 − E(y0)] − [y0 − E(y0)]}²
                = E[ŷ0 − E(y0)]² + E[y0 − E(y0)]²
                = σ² (1/n + (x0 − x̄)² / Sxx) + σ².

Therefore, a 100(1 − α) percent prediction interval on a future observation at x = x0 is

    ŷ0 − t(α/2, n−2) B ≤ y0 ≤ ŷ0 + t(α/2, n−2) B,

where

    B = √(MSE (1 + 1/n + (x0 − x̄)² / Sxx)).

For the rocket propellant data, a 95% prediction interval on y0 is

    ŷ0 − (2.101) B(x0) ≤ y0 ≤ ŷ0 + (2.101) B(x0),

where

    B(x0) = √(9244.59 (1 + 1/20 + (x0 − 13.3625)² / 1106.56)).

If x0 = 10, then the 95% prediction interval is [2048.32, 2464.32].
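The x0 = 10 interval can be recomputed from the formulas above:

```python
import math

# 95% prediction interval for a new observation at x0 = 10 (rocket data).
beta0_hat, beta1_hat = 2627.82, -37.15
MSE, n = 9244.59, 20
xbar, Sxx = 13.3625, 1106.56
t_crit = 2.101                      # t_{0.025, 18}

x0 = 10.0
y0_hat = beta0_hat + beta1_hat * x0
B = math.sqrt(MSE * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx))
lo, hi = y0_hat - t_crit * B, y0_hat + t_crit * B
print(round(lo, 2), round(hi, 2))   # about [2048.32, 2464.32]
```

Note the extra "1 +" inside B compared with A(x0): the prediction interval must cover the noise σ² of the new observation itself, so it is always wider than the confidence interval for the mean response at the same x0.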

Coefficient of determination R²

The coefficient of determination is defined as

    R² = SSR / SST,  0 ≤ R² ≤ 1.

R² is the proportion of variation in y explained by the regressor x; R² near 1 implies that most of the variability in y is explained by the regression model. For the rocket propellant data,

    R² = SSR / SST = 1527334.95 / 1693737.6 = 0.9018,

that is, 90.18% of the variability in strength is explained by the regression model.

Relationship to the correlation coefficient:

    R² = SSR / SST = β̂1 Sxy / Syy = Sxy² / (Sxx Syy) = r_xy²,

the square of the sample correlation between x and y. For the rocket propellant data,

    R² = r_xy² = (−0.9496533)² = 0.9018.
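The chain of identities above can be confirmed numerically for the rocket data (Syy is the corrected sum of squares of y, i.e. SST):

```python
import math

# R^2 for the rocket propellant data and its link to the sample correlation.
SSR, SST = 1527334.95, 1693737.6
Sxy, Sxx = -41112.65, 1106.56

R2 = SSR / SST
Syy = SST                               # corrected SS of y
r_xy = Sxy / math.sqrt(Sxx * Syy)       # sample correlation
print(round(R2, 4), round(r_xy ** 2, 4))
```

Both routes give about 0.9018; the sample correlation itself is about −0.9497, negative because shear strength decreases with propellant age.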