Lecture 2: Simple Linear Regression
STAT 512, Spring 2011
Background Reading (KNNL): Chapter 1
Topic Overview
In this topic we will cover:
- Regression terminology
- Simple linear regression with a single predictor variable
Relationships Among Variables
Functional relationships: the value of the dependent variable Y can be computed exactly if we know the value of the independent variable X (e.g., Y = 2X).
Statistical relationships: not a perfect or exact relationship. The expected value of the response variable Y is a function of the explanatory (or predictor) variable X; the observed value of Y is the expected value plus a random deviation.
Simple Linear Regression
Uses of SLR
Why use simple linear regression?
- Descriptive/exploratory purposes (explore the strength of known cause/effect relationships)
- Administrative control (often the response variable is $$$)
- Prediction of outcomes (predict future needs; often overlaps with cost control)
Statistical Relationships vs. Causality
Statistical relationships do not imply causality!
Example: a Lafayette ice cream shop does more business on days when attendance at an Indianapolis swimming pool is high. Neither causes the other; both are driven by hot weather.
Data for Simple Linear Regression
We observe pairs of variables (X_i, Y_i); each pair is called a case or a data point.
Y_i is the i-th value of the response variable; X_i is the i-th value of the explanatory (or predictor) variable. In practice the value of X_i is treated as a known constant.
Simple Linear Regression Model
Statement of model:
Y_i = β0 + β1*X_i + ε_i,  i = 1, 2, ..., n,  where ε_i ~ iid N(0, σ²)
Model parameters (unknown):
β0 = intercept; may not have meaning.
β1 = slope; β1 = 0 if there is no relationship between X and Y.
σ² is the error variance.
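The course uses SAS, but the model statement above can be illustrated with a short simulation. This is an illustrative sketch in Python; the parameter values (β0 = 2, β1 = 0.5, σ = 1) are made up for the example, not taken from the lecture.

```python
import numpy as np

# Simulate n = 100 observations from the SLR model
# Y_i = beta0 + beta1*X_i + eps_i, eps_i ~ iid N(0, sigma^2).
# beta0, beta1, sigma are illustrative values, not from the lecture.
rng = np.random.default_rng(42)

beta0, beta1, sigma = 2.0, 0.5, 1.0
n = 100
X = rng.uniform(0, 10, size=n)      # X values treated as known constants
eps = rng.normal(0, sigma, size=n)  # random deviations (error terms)
Y = beta0 + beta1 * X + eps         # observed responses

# The "statistical relationship": observed Y is the expected value
# E(Y_i) = beta0 + beta1*X_i plus a random deviation.
EY = beta0 + beta1 * X
print(np.allclose(Y - EY, eps))     # → True: deviations are the errors
```

Rerunning with a different seed gives a different scatter of points around the same true line, which is exactly the sense in which the relationship is statistical rather than functional.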
[Figure: scatterplot of data generated by Y_i = β0 + β1*X_i + ε_i, with the true regression line E(Y_i) = β0 + β1*X_i]
Interpretation of the Regression Coefficients
β0 is the expected value of the response variable when X = 0.
β1 represents the increase (or decrease, if negative) in the mean response for a 1-unit increase in the value of X.
Features of SLR Model
The errors are independent, identically distributed normal random variables:
ε_i ~ iid Normal(0, σ²)
This implies Y_i ~ independent Normal(β0 + β1*X_i, σ²). (See A.36, p. 1303, for the proof.)
Fitted Regression Equation
The parameters β0, β1, σ must be estimated from the data. The estimates are denoted b0, b1, s.
The fitted (or estimated) regression line is
Ŷ_i = b0 + b1*X_i
The hat symbol is used to differentiate the fitted value Ŷ_i from the actual observed value Y_i.
Residuals
The deviations (or errors) from the true regression line, ε_i = Y_i − β0 − β1*X_i, cannot be known since the regression parameters β0 and β1 are unknown. We estimate them by the residuals:
e_i = Observed − Predicted = Y_i − Ŷ_i = Y_i − (b0 + b1*X_i)
Error Terms vs. Residuals
The error ε_i is the deviation of Y_i from the true (unknown) regression line; the residual e_i is the deviation from the fitted line.
Assumptions
The model assumes that the error terms are independent, normal, and have constant variance. Residuals may be used to explore the legitimacy of these assumptions. More on this topic later.
Least Squares Estimation
We want to find the best estimates b0, b1 for β0, β1. The best estimates minimize the sum of the squared residuals:
SSE = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (Y_i − b0 − b1*X_i)²
To do this, use calculus (see pages 17-18 of KNNL).
Least Squares Solution
The LS estimate for β1 can be written in terms of sums of squares:
b1 = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)² = SS_XY / SS_X
The LS estimate for β0 is
b0 = Ȳ − b1*X̄
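These formulas are easy to compute directly. Below is a sketch in Python (rather than the course's SAS) on a small made-up dataset, cross-checked against numpy's own least-squares line fit; the data values are illustrative only.

```python
import numpy as np

# Least-squares slope and intercept computed directly from the
# sums-of-squares formulas (toy data, not the diamond example).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

Xbar, Ybar = X.mean(), Y.mean()
SS_XY = np.sum((X - Xbar) * (Y - Ybar))  # sum of cross-products
SS_X = np.sum((X - Xbar) ** 2)           # sum of squares for X

b1 = SS_XY / SS_X        # slope estimate
b0 = Ybar - b1 * Xbar    # intercept estimate

# Cross-check against numpy's least-squares polynomial fit of degree 1.
slope, intercept = np.polyfit(X, Y, deg=1)
print(np.isclose(b1, slope), np.isclose(b0, intercept))  # → True True
```

For this toy data the formulas give b1 = 1.96 and b0 = 0.14, and `np.polyfit` agrees because it minimizes the same SSE criterion.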
About the LS Estimates
They are also the maximum likelihood estimates (see KNNL pages 27-32). These are the best estimates because they are unbiased (their expectation is the parameter they estimate) and they have minimum variance among all unbiased linear estimators. Big picture: we wouldn't want to use any other estimates because we can do no better.
Mean Square Error
We also need to estimate σ². This estimate is based on the sum of the squared residuals (SSE) and the available degrees of freedom:
s² = MSE = SSE / df_E = Σ e_i² / (n − 2)
The error degrees of freedom are n − 2 because we have n observations and have already estimated the 2 parameters β0 and β1.
Variance Notation
s² = MSE will always be the estimate for σ². This can be confusing, because there will be estimated variances for other quantities, and these will be denoted e.g. s²{b1}, s²{b0}, etc. These are not products, but single variance quantities. To avoid confusion, I will generally write MSE whenever referring to the estimate for σ².
EXAMPLE: Diamond Rings
Response variable: price in Singapore dollars (Y)
Explanatory variable: weight of the diamond in carats (X)
Associated SAS file: diamonds.sas
SAS Regression Procedure
PROC REG data=diamonds;
    model price=weight;
RUN;
Output (1)
Source    DF    Sum of Squares    Mean Square
Model      1         2098596          2098596
Error     46           46636       1013.81886
Total     47         2145232

Root MSE = 31.84052
Output (2)
Variable     DF    Parameter Estimate    Standard Error
Intercept     1            -259.62591          17.31886
weight        1            3721.02485          81.78588
Output Summary
From the output, we see that b0 ≈ −259.6, b1 ≈ 3721.0, MSE ≈ 1014, and √MSE ≈ 31.8. Note that the root MSE has a direct interpretation as the estimated standard deviation of the errors (in $).
Interpretations
It doesn't really make sense to talk about a 1-carat increase, but we can change this to a 0.01-carat increase by dividing by 100. From b1 we see that a 0.01-carat increase in the weight of a diamond leads to a $37.21 increase in the mean price. The interpretation of b0 would be that one would actually be paid $260 simply to take a 0-carat diamond ring. Why doesn't this make sense?
Scope of Model
The scope of a regression model is the range of X-values over which we actually have data. Using a model to look at X-values outside the scope of the model (extrapolation) is quite dangerous.
Prediction for 0.43 Carats
Does this make sense in light of the previous discussion? Suppose we assume that it does. Then the mean price for a 0.43-carat ring can be computed as follows:
Ŷ = −259.6 + 3721.0(0.43) ≈ 1340
How confident would you be in this estimate?
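The prediction above is just a plug-in of the fitted coefficients from the SAS output. A quick check in Python (the helper function name `predict_price` is mine, not from the lecture):

```python
# Point prediction at X = 0.43 carats using the fitted coefficients
# reported in the SAS output (b0 ≈ -259.6, b1 ≈ 3721.0).
b0, b1 = -259.6, 3721.0

def predict_price(weight_carats: float) -> float:
    """Fitted mean price (Singapore dollars) at the given weight."""
    return b0 + b1 * weight_carats

print(round(predict_price(0.43)))  # → 1340, matching the slide
```

Note that this gives only a point estimate of the mean response; quantifying how confident we should be in it requires the inference tools of the next lecture.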
Upcoming in Lecture 3...
We will discuss more about inference concerning the regression coefficients.
Background Reading: 2.1-2.6