Simple Linear Regression for the MPG Data
  [Figure: scatterplot of MPG versus Wgt]
What do we do with the data?
  $y_i$ = MPG of the $i$th car
  $x_i$ = Weight of the $i$th car
  $i = 1, \ldots, n$
  $n$ = Sample size
Exploratory Techniques
  1. Scatterplots
  [Figure: scatterplot of MPG versus Wgt]
    1. Form: linear?
    2. Direction: positive or negative?
    3. Strength
    4. Outliers
Exploratory Techniques
  1b. Scatterplots with smooth curve
  [Figure: scatterplot of mpg versus weight with a smooth curve]
    1. Form: linear?
    2. Direction: positive or negative?
    3. Strength
    4. Outliers
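Exploratory Techniques
  2. Numerical Summaries of the Data
  a) Covariance
  $\mathrm{Cov}(Y,X) = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{n - 1}$
  [Figure: scatterplot of MPG versus Wgt with quadrants I, II, III, IV marked]
  I and IV: $(y_i - \bar{y})(x_i - \bar{x}) < 0$
  II and III: $(y_i - \bar{y})(x_i - \bar{x}) > 0$
  $\mathrm{Cov}(Y,X) < 0 \Rightarrow$ negative relationship
  $\mathrm{Cov}(Y,X) > 0 \Rightarrow$ positive relationship
  For the MPG data: $\mathrm{Cov}(Y,X) = -2304.581$
  Covariance is affected by the scale of Y and X, so its magnitude is not interpretable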
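To make the scale problem concrete, here is a minimal sketch in Python. The weight and MPG values are hypothetical stand-ins (not the course data), and `covariance` is an illustrative helper name. It shows that re-expressing weight in thousands of pounds changes the covariance by a factor of 1000 while the sign stays the same.

```python
from statistics import mean

def covariance(y, x):
    """Sample covariance with the n - 1 denominator."""
    y_bar, x_bar = mean(y), mean(x)
    return sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x)) / (len(y) - 1)

# Hypothetical (weight in lbs, MPG) pairs, NOT the course data.
wgt_lbs = [2000, 2200, 2500, 2800, 3000, 3200, 3500]
mpg = [34.0, 31.0, 29.0, 24.0, 23.0, 19.0, 17.0]

cov_lbs = covariance(mpg, wgt_lbs)
# Same cars, weight re-expressed in thousands of lbs.
cov_klbs = covariance(mpg, [w / 1000 for w in wgt_lbs])

# Same sign, but the magnitudes differ by the unit conversion factor.
print(cov_lbs, cov_klbs)
```

Only the sign of the covariance carries over between the two unit choices, which is why the slides move on to correlation.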
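Exploratory Techniques
  2. Numerical Summaries of the Data
  b) Correlation
  $\mathrm{Corr}(Y,X) = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{y_i - \bar{y}}{s_y}\right) \left(\frac{x_i - \bar{x}}{s_x}\right) = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2} \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$
  where $s_y$ = SD of Y and $s_x$ = SD of X
  For the MPG data: $\mathrm{Corr}(Y,X) = -0.71$
  Properties:
    1. $-1 \le \mathrm{Corr}(Y,X) \le 1$
    2. Scale invariant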
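The two forms of the correlation formula and the scale-invariance property can be checked numerically. This is a sketch on hypothetical data (not the course data); `corr_standardized` and `corr_sums` are illustrative names for the two forms.

```python
from math import sqrt
from statistics import mean, stdev

def corr_standardized(y, x):
    """Sum of products of standardized values, divided by n - 1."""
    y_bar, x_bar, s_y, s_x = mean(y), mean(x), stdev(y), stdev(x)
    total = sum(((yi - y_bar) / s_y) * ((xi - x_bar) / s_x) for yi, xi in zip(y, x))
    return total / (len(y) - 1)

def corr_sums(y, x):
    """Cross-product sum over the product of root sums of squares."""
    y_bar, x_bar = mean(y), mean(x)
    s_xy = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / (sqrt(s_yy) * sqrt(s_xx))

# Hypothetical (weight in lbs, MPG) pairs, NOT the course data.
wgt_lbs = [2000, 2200, 2500, 2800, 3000, 3200, 3500]
mpg = [34.0, 31.0, 29.0, 24.0, 23.0, 19.0, 17.0]

r = corr_sums(mpg, wgt_lbs)
r_alt = corr_standardized(mpg, wgt_lbs)
# Converting weight to kilograms should leave the correlation unchanged.
r_kg = corr_sums(mpg, [w * 0.4536 for w in wgt_lbs])

print(r, r_alt, r_kg)
```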
Exploratory Techniques
  2. Numerical Summaries of the Data
  b) Correlation
  Warning #1: Correlation is only for linear relationships!
  [Figure: scatterplot of Y versus X with Corr(Y,X) = 0]
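A small sketch of this warning, using made-up data rather than the data behind the slide's figure: Y below is an exact function of X, yet the correlation is zero (up to floating-point rounding) because the relationship is a symmetric curve, not a line.

```python
from math import sqrt
from statistics import mean

def corr(y, x):
    """Sample correlation from sums of squares and cross-products."""
    y_bar, x_bar = mean(y), mean(x)
    s_xy = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / sqrt(s_yy * s_xx)

# Hypothetical data: Y is completely determined by X, but not linearly.
x = list(range(-4, 5))
y = [25 + xi ** 2 for xi in x]

r = corr(y, x)
print(r)  # essentially zero despite the perfect (quadratic) dependence
```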
Exploratory Techniques
  2. Numerical Summaries of the Data
  b) Correlation
  Warning #2: Correlation is highly affected by outliers!
  [Figure: two scatterplots of Y versus X, one with Corr(Y,X) = 0.81 and one with Corr(Y,X) = 0.99]
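The effect of a single outlier can be sketched as follows. The data are invented for illustration (they are not the points plotted on the slide): a perfectly linear set of eleven points, then the same set with one Y value shifted upward.

```python
from math import sqrt
from statistics import mean

def corr(y, x):
    """Sample correlation from sums of squares and cross-products."""
    y_bar, x_bar = mean(y), mean(x)
    s_xy = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / sqrt(s_yy * s_xx)

# Hypothetical data lying exactly on a line.
x = list(range(4, 15))
y_clean = [5.4 + 0.35 * xi for xi in x]

# Same data with one point (x = 13) moved 4 units up.
y_outlier = list(y_clean)
y_outlier[9] += 4.0

r_clean = corr(y_clean, x)
r_outlier = corr(y_outlier, x)
print(r_clean, r_outlier)  # one point is enough to pull the correlation down
```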
What is a good model for the MPG Data?
  $Y_i \overset{iid}{\sim} p_Y(y_i)$
  $E(y_i) = f(x_{i1}, \ldots, x_{ip})$
  What to use for $p$ and $f$?
Simple Linear Regression Model
  The Simple Linear Regression (SLR) Model is written as:
  $y_i \overset{iid}{\sim} N(\beta_0 + \beta_1 x_i, \sigma^2)$
  Or, equivalently (how?):
  $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
  A Few Notes:
    iid: Independent and identically distributed
    $\epsilon_i$: Residuals, distance to mean
    $\beta_0$: Intercept coefficient
    $\beta_1$: Slope coefficient
    $\sigma$: Std. deviation about line
    $\sigma^2$: Variance about line
Simple Linear Regression Model
  The Simple Linear Regression (SLR) Model is written as:
  $y_i \overset{iid}{\sim} N(\beta_0 + \beta_1 x_i, \sigma^2)$
  Interpretations:
    $\beta_0$: When $x_i$ (Weight) is zero, the mean MPG ($y$) is $\beta_0$
    $\beta_1$: As $x_i$ (Weight) increases by 1, the mean MPG goes up by $\beta_1$
    $\sigma$: For any $x_i$ (Weight), 99.7% of the MPG values will be within $3\sigma$ of $\beta_0 + \beta_1 x_i$
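The 99.7% figure in the last interpretation comes from the normal distribution. Below is a minimal sketch that computes it from the error function. The parameter values passed to `three_sigma_band` are hypothetical, chosen only to show the arithmetic, and both function names are illustrative.

```python
from math import erf, sqrt

def prob_within(k):
    """P(|Z| <= k) for a standard normal Z."""
    return erf(k / sqrt(2))

def three_sigma_band(beta0, beta1, sigma, x):
    """Interval mean +/- 3*sigma around the line at a given x."""
    center = beta0 + beta1 * x
    return center - 3 * sigma, center + 3 * sigma

p3 = prob_within(3)
print(round(p3, 4))  # about 0.9973

# Hypothetical parameter values (beta0 = 50, beta1 = -0.01, sigma = 5).
low, high = three_sigma_band(50.0, -0.01, 5.0, 3000)
print(low, high)
```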
Simple Linear Regression Model
  The Simple Linear Regression (SLR) Model is written as:
  $y_i \overset{iid}{\sim} N(\beta_0 + \beta_1 x_i, \sigma^2)$
  Assumptions:
    1. Linear
    2. Independent
    3. Normal
    4. Equal variance across the whole line (homoskedastic)
Fitting the SLR Model
  The Simple Linear Regression (SLR) Model is written as:
  $y_i \overset{iid}{\sim} N(\beta_0 + \beta_1 x_i, \sigma^2)$
  What are the unknowns (parameters)? $\beta_0$, $\beta_1$, $\sigma$
  How do we estimate them?
    1. Least squares estimation (this class)
    2. Maximum likelihood estimation (take Stat 340)
    3. Bayesian estimation (take Stat 451)
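Fitting the SLR Model
  The Simple Linear Regression (SLR) Model is written as:
  $y_i \overset{iid}{\sim} N(\beta_0 + \beta_1 x_i, \sigma^2)$
  Least Squares Estimation: Find $\hat{\beta}_0$ & $\hat{\beta}_1$ such that
  $\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = \min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
  In other words, minimize the objective function
  $O(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$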
DERIVE LEAST SQUARES ESTIMATES
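One standard way to carry out this derivation is sketched below; it may differ in presentation from what was done in lecture. It sets both partial derivatives of the objective function $O(\beta_0, \beta_1)$ to zero and solves, using $\sum (y_i - \bar{y}) = 0$ and $\sum (x_i - \bar{x}) = 0$ in the last step. The block assumes the amsmath package.

```latex
\begin{align*}
O(\beta_0, \beta_1) &= \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \\
\frac{\partial O}{\partial \beta_0} &= -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
  \quad\Rightarrow\quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\
\frac{\partial O}{\partial \beta_1} &= -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0 \\
\text{Substituting } \hat{\beta}_0: \quad
0 &= \sum_{i=1}^{n} x_i \left[ (y_i - \bar{y}) - \hat{\beta}_1 (x_i - \bar{x}) \right] \\
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})}
  = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```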
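Fitting the SLR Model
  The Simple Linear Regression (SLR) Model is written as:
  $y_i \overset{iid}{\sim} N(\beta_0 + \beta_1 x_i, \sigma^2)$
  Least Squares Estimators:
  $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
  $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\mathrm{Cov}(Y,X)}{\mathrm{Var}(X)} = \mathrm{Corr}(Y,X) \frac{s_y}{s_x}$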
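The estimators can be computed directly from these formulas. The sketch below uses hypothetical data (not the course data) and illustrative function names. It also checks two things numerically: the slope equals $\mathrm{Corr}(Y,X)\, s_y / s_x$, and nudging either coefficient away from the least squares values increases the objective function.

```python
from math import sqrt
from statistics import mean, stdev

def least_squares(y, x):
    """Return (b0, b1) from the closed-form least squares estimators."""
    y_bar, x_bar = mean(y), mean(x)
    s_xy = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def objective(b0, b1, y, x):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - b0 - b1 * xi) ** 2 for yi, xi in zip(y, x))

# Hypothetical (weight in lbs, MPG) pairs, NOT the course data.
wgt = [2000, 2200, 2500, 2800, 3000, 3200, 3500]
mpg = [34.0, 31.0, 29.0, 24.0, 23.0, 19.0, 17.0]

b0, b1 = least_squares(mpg, wgt)

# Slope via the correlation form: Corr(Y,X) * s_y / s_x.
y_bar, x_bar = mean(mpg), mean(wgt)
s_xy = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(mpg, wgt))
s_yy = sum((yi - y_bar) ** 2 for yi in mpg)
s_xx = sum((xi - x_bar) ** 2 for xi in wgt)
r = s_xy / sqrt(s_yy * s_xx)
b1_from_corr = r * stdev(mpg) / stdev(wgt)

best = objective(b0, b1, mpg, wgt)
print(b0, b1, b1_from_corr, best)
```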
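Fitting the SLR Model
  The Simple Linear Regression (SLR) Model is written as:
  $y_i \overset{iid}{\sim} N(\beta_0 + \beta_1 x_i, \sigma^2)$
  Least Squares Estimators:
  Note: We don't get an estimate of $\sigma$ from least squares, but (using maximum likelihood, with the denominator $n$ replaced by $n - 2$) we get:
  $\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{n - 2}}$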
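A sketch of the $\hat{\sigma}$ calculation on hypothetical data (not the course data), with illustrative function names. For comparison it also computes the version that divides by $n$, which is always smaller.

```python
from math import sqrt
from statistics import mean

def fit_line(y, x):
    """Least squares intercept and slope."""
    y_bar, x_bar = mean(y), mean(x)
    s_xy = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def residual_std_error(y, x, b0, b1):
    """sqrt(SSE / (n - 2)), the estimate of sigma used in the slides."""
    sse = sum((yi - b0 - b1 * xi) ** 2 for yi, xi in zip(y, x))
    return sqrt(sse / (len(y) - 2))

# Hypothetical (weight in lbs, MPG) pairs, NOT the course data.
wgt = [2000, 2200, 2500, 2800, 3000, 3200, 3500]
mpg = [34.0, 31.0, 29.0, 24.0, 23.0, 19.0, 17.0]

b0, b1 = fit_line(mpg, wgt)
sigma_hat = residual_std_error(mpg, wgt, b0, b1)

# Dividing the same sum of squares by n gives a smaller value.
sse = sum((yi - b0 - b1 * xi) ** 2 for yi, xi in zip(mpg, wgt))
sigma_n = sqrt(sse / len(mpg))

print(sigma_hat, sigma_n)
```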
Fitting the SLR Model
  [Figure: scatterplot of MPG versus Wgt]
  Least Squares Estimates:
  $\hat{\beta}_0 = 51.587$
  $\hat{\beta}_1 = -0.01$
  $\hat{\sigma} = 4.723$
  How do we interpret these numbers?
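Fitting the SLR Model
  Note: Notice that $\hat{\beta}_0$ & $\hat{\beta}_1$ are just estimators (guesses) of the true values $\beta_0$ & $\beta_1$. So we need to answer a few questions:
  1. Does calculating $\hat{\beta}_0$ & $\hat{\beta}_1$ return $\beta_0$ & $\beta_1$ on average?
     Yes, this property is called unbiasedness (expected value):
     $E(\hat{\beta}_0) = \beta_0$
     $E(\hat{\beta}_1) = \beta_1$
  2. How accurate are the estimators $\hat{\beta}_0$ & $\hat{\beta}_1$?
     Standard Error (how variable are the estimates):
     $SE(\hat{\beta}_0) = \hat{\sigma} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}$
     $SE(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\sum (x_i - \bar{x})^2}}$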
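The standard error formulas translate directly into code. This is a sketch on hypothetical data (not the course data); `fit_with_se` is an illustrative helper, not a library function.

```python
from math import sqrt
from statistics import mean

def fit_with_se(y, x):
    """Least squares fit plus standard errors of both coefficients."""
    n = len(y)
    y_bar, x_bar = mean(y), mean(x)
    s_xy = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for yi, xi in zip(y, x))
    sigma_hat = sqrt(sse / (n - 2))
    return {
        "n": n, "x_bar": x_bar, "s_xx": s_xx,
        "b0": b0, "b1": b1, "sigma_hat": sigma_hat,
        "se_b0": sigma_hat * sqrt(1 / n + x_bar ** 2 / s_xx),
        "se_b1": sigma_hat / sqrt(s_xx),
    }

# Hypothetical (weight in lbs, MPG) pairs, NOT the course data.
wgt = [2000, 2200, 2500, 2800, 3000, 3200, 3500]
mpg = [34.0, 31.0, 29.0, 24.0, 23.0, 19.0, 17.0]

fit = fit_with_se(mpg, wgt)
print(fit["se_b0"], fit["se_b1"])
```

Note that $SE(\hat{\beta}_0)$ grows with $\bar{x}^2$: when the data sit far from $x = 0$, as car weights do, the intercept is estimated much less precisely than the slope.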
Fitting the SLR Model
  2. How accurate are the estimators $\hat{\beta}_0$ & $\hat{\beta}_1$? (continued)
  Note: The Gauss-Markov theorem states that $\hat{\beta}_0$ & $\hat{\beta}_1$ have the smallest variance among all linear unbiased estimates (BLUE = "best linear unbiased estimates").
Fitting the SLR Model
  2. How accurate are the estimators $\hat{\beta}_0$ & $\hat{\beta}_1$? (continued)
  $\mathrm{Corr}(\hat{\beta}_0, \hat{\beta}_1) = \frac{-\bar{x}}{\sqrt{\sum (x_i - \bar{x})^2 / n + \bar{x}^2}} = -0.9512$
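The correlation between the two estimators depends only on the x values. The sketch below evaluates the formula on hypothetical weights (not the course data) and then on the same weights after subtracting their mean, which drives the correlation to zero. The function name is illustrative.

```python
from math import sqrt
from statistics import mean

def corr_of_estimates(x):
    """Corr(b0_hat, b1_hat) = -x_bar / sqrt(S_xx / n + x_bar^2)."""
    n = len(x)
    x_bar = mean(x)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return -x_bar / sqrt(s_xx / n + x_bar ** 2)

# Hypothetical weights in lbs, NOT the course data.
wgt = [2000, 2200, 2500, 2800, 3000, 3200, 3500]

rho_raw = corr_of_estimates(wgt)

# Centering the predictor makes x_bar (essentially) zero.
wgt_bar = mean(wgt)
rho_centered = corr_of_estimates([w - wgt_bar for w in wgt])

print(rho_raw, rho_centered)
```

A strongly negative value means that a sample which overestimates the slope tends to underestimate the intercept, and vice versa.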
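Fitting the SLR Model
  Note: Notice that $\hat{\sigma}$ is just an estimator (guess) of $\sigma$. So we need to answer a question:
  1. Does calculating $\hat{\sigma}$ return $\sigma$ on average? Yes! (Strictly, it is $\hat{\sigma}^2$ that returns $\sigma^2$ on average; $\hat{\sigma}$ itself is only approximately unbiased.)
  $\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{n - 2}}$
  Called the regression (or residual) standard error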
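Using the SLR Model
  How would we use the SLR model to predict the MPG for a weight of 3000 lbs?
  $E(y_{new}) = \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \times 3000 = 51.587 - 0.01 \times 3000 \approx 22.08$
  (The slope is shown rounded to -0.01; the value 22.08 evidently comes from the unrounded slope.)
  How do you interpret the number 22.08?
    1. The mean MPG of all cars that weigh 3000 lbs is 22.08.
    2. The predicted MPG of a car that weighs 3000 lbs is 22.08.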
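The prediction arithmetic can be checked in a few lines. With the slope rounded to -0.01 as printed, the result is about 21.587, not 22.08, so the reported prediction presumably used more digits of the slope. The "implied slope" below is a back-calculation from the two reported numbers, not a value given in the slides.

```python
def predict(b0, b1, x):
    """Point prediction from the fitted line."""
    return b0 + b1 * x

# Estimates as printed on the slide (slope rounded to two decimals).
b0_hat, b1_rounded = 51.587, -0.01

pred_rounded = predict(b0_hat, b1_rounded, 3000)
print(pred_rounded)  # about 21.587

# Slope that would reproduce the reported 22.08 (an inference, not a slide value).
implied_slope = (22.08 - b0_hat) / 3000
print(implied_slope)  # about -0.00984, which rounds to -0.01
```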
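Using the SLR Model
  Wait: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is just an estimate (a guess) of what the true value is going to be. We need to know:
  1. Does calculating $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ return $y$ on average?
     Yes: $E(\hat{y}) = y$
  2. How accurate is $\hat{y}$? It depends on what you are trying to predict!
     $SE_{\text{All Cars}}(\hat{y}) = \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
     $SE_{\text{1 Car}}(\hat{y}) = \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
  Recall: the mean is $\beta_0 + \beta_1 x$ but 1 obs. is $\beta_0 + \beta_1 x + \epsilon$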
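A sketch of the two standard errors, using hypothetical weights and a hypothetical $\hat{\sigma}$ (neither taken from the course data); the function names are illustrative. The single-car standard error exceeds the all-cars one by exactly $\hat{\sigma}^2$ on the variance scale, and both grow as $x$ moves away from $\bar{x}$.

```python
from math import sqrt
from statistics import mean

def se_mean_response(sigma_hat, x_obs, x_new):
    """SE of y_hat as an estimate of the mean of all cars at x_new."""
    n, x_bar = len(x_obs), mean(x_obs)
    s_xx = sum((xi - x_bar) ** 2 for xi in x_obs)
    return sigma_hat * sqrt(1 / n + (x_new - x_bar) ** 2 / s_xx)

def se_single_car(sigma_hat, x_obs, x_new):
    """SE of y_hat as a prediction for one new car at x_new."""
    n, x_bar = len(x_obs), mean(x_obs)
    s_xx = sum((xi - x_bar) ** 2 for xi in x_obs)
    return sigma_hat * sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / s_xx)

# Hypothetical weights and residual standard error, NOT the course data.
wgt = [2000, 2200, 2500, 2800, 3000, 3200, 3500]
sigma_hat = 4.7

se_all_3000 = se_mean_response(sigma_hat, wgt, 3000)
se_one_3000 = se_single_car(sigma_hat, wgt, 3000)
se_all_center = se_mean_response(sigma_hat, wgt, mean(wgt))

print(se_all_3000, se_one_3000, se_all_center)
```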
Using the SLR Model
  How would we use the SLR model to predict the MPG for an 18-wheeler with weight 20,000 lbs?
  $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \times 20000 = 51.587 - 0.01 \times 20000 \approx -145.08$
  What went wrong? A negative MPG is impossible. You can't extrapolate (predict outside the range of the data).
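One way to guard against accidental extrapolation is to refuse predictions outside the observed range of weights. The sketch below uses the rounded estimates printed on the slides and assumes an observed range of 2000 to 3500 lbs, read off the scatterplot axis; the true range of the data is not stated, so that range and the function name are assumptions.

```python
def predict_within_range(b0, b1, x_new, x_min, x_max):
    """Predict from the fitted line, but only inside the observed x range."""
    if not (x_min <= x_new <= x_max):
        raise ValueError(
            f"x = {x_new} is outside the observed range [{x_min}, {x_max}]"
        )
    return b0 + b1 * x_new

# Rounded slide estimates; the range is an ASSUMPTION based on the plot axis.
B0_HAT, B1_HAT = 51.587, -0.01
WGT_MIN, WGT_MAX = 2000, 3500

inside = predict_within_range(B0_HAT, B1_HAT, 3000, WGT_MIN, WGT_MAX)
print(inside)

try:
    predict_within_range(B0_HAT, B1_HAT, 20000, WGT_MIN, WGT_MAX)
    refused = False
except ValueError as err:
    refused = True
    print("Refused:", err)
```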
End of MPG Analysis (see webpage for R and SAS code)