Business Statistics Lecture 9: Simple Regression
On to Model Building!
- Up to now, this class has been about descriptive and inferential statistics: numerical and graphical summaries of data, confidence intervals, and hypothesis testing
- We finish by taking those tools and building models to try to explain data
- For each Y in my data, I also observe an X (e.g., Y = weight and X = height for a group of people)
- Can I use X to say something about Y?
Goals for this Lecture
- Understand correlation
- Describe a linear equation and a linear regression model
- Understand the criteria for fitting a line to data
- Learn to fit a simple linear regression model in Excel and JMP
- Understand how to assess the fit
Scatterplots
- Two continuous variables
- Graphically shows the relationship between X and Y
- This graph shows a positive association: as X increases, so does Y
[Scatterplot of Y against X showing an upward trend]
How to Express Numerically?
- Raw data (pairs of X and Y) are too hard to interpret or summarize
- Scatterplots are informative, but can also sometimes have too much information
- How do we reduce this to a number that tells us how strong the association is?
- Answer: correlation
- It's often said that "X and Y are correlated" without knowing what it really means
Correlation
- A measure of the strength of the linear relationship between X and Y
- The Xs and Ys are two different (continuous) variables observed on the same units in your sample
  - E.g., height and weight for each person
  - E.g., test score and amount of time spent studying
  - E.g., average household income in a region and average pizza sales
Correlation

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

- Correlation (r) close to:
  - 1: strong positive linear relationship
  - -1: strong negative linear relationship
  - 0: no linear relationship
Positive Correlation

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}$$

- When X is greater than its average and Y is greater than its average, the product is positive
- When X is less than its average and Y is less than its average, the product is also positive
[Scatterplot: points fall mostly in the lower-left and upper-right quadrants around the means, so the products are positive]
Negative Correlation

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}$$

- When X is less than its average and Y is greater than its average, the product is negative
- When X is greater than its average and Y is less than its average, the product is also negative
[Scatterplot: points fall mostly in the upper-left and lower-right quadrants around the means, so the products are negative]
Example: Predicting Pizza Sales from Income
- Data from eight towns: one month's pizza sales and income

Town   Income ($000)   Pizza Sales ($000)
  1          5                27
  2         10                46
  3         20                73
  4          8                40
  5          4                30
  6          6                28
  7         12                46
  8         15                59
Summarizing the Data
- Scatterplot of the data: [Pizza Sales ($000) vs. Income ($000)]
- Can also say: the correlation between income and pizza sales is 0.98, a high positive association
- Use the CORREL function in Excel
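The CORREL calculation can be sketched in a few lines of Python, using the eight towns from the table above (the helper name `pearson_r` is ours, not Excel's):

```python
import math

# Pizza data for the eight towns: income and one month's pizza sales (both $000)
income = [5, 10, 20, 8, 4, 6, 12, 15]
sales = [27, 46, 73, 40, 30, 28, 46, 59]

def pearson_r(x, y):
    """Correlation: sum of cross-deviations over the product of deviation norms."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(income, sales)
print(round(r, 2))  # 0.98, matching CORREL on this data
```

The value agrees with the 0.98 quoted on the slide.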
Estimating the Linear Relationship
- Correlation measures the strength of the linear relationship between X and Y
- Estimating the actual linear relationship is given by the regression of Y on X:

$$\hat{y} = \hat{a} + \hat{b}x$$

where $\hat{a}$ is the intercept and $\hat{b}$ is the slope
[Scatterplot with fitted line: $\hat{y} = 14.577 + 2.905x$]
Linear Equations
- General expression for a linear equation: Y = a + bX
- a is the intercept: the value of Y when X = 0
- b is the slope: how much Y increases per unit increase in X
A Numerical Example
- Equation: Y = 1 + 3X
- Y is 1 when X is 0
- Each time X increases by 1, Y increases by 3
[Line plot of Y = 1 + 3X over X from -4 to 4]
Linear Model
- General expression for a linear model: $y_i = a + bx_i + e_i$
- a and b are model parameters
- e is the error or noise term
- The error terms are often assumed to be independent observations from a $N(0, \sigma^2)$ distribution
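One quick way to get a feel for the model is to simulate from it. A minimal sketch, where the parameter values a = 14.6, b = 2.9, and sigma = 3 are made-up illustrative numbers, not estimates from any data:

```python
import random

random.seed(0)  # reproducible example

a, b, sigma = 14.6, 2.9, 3.0  # illustrative parameter values, chosen for this sketch
x = [5, 10, 20, 8, 4, 6, 12, 15]

# y_i = a + b*x_i + e_i, with each e_i drawn independently from N(0, sigma^2)
y = [a + b * xi + random.gauss(0, sigma) for xi in x]

for xi, yi in zip(x, y):
    print(f"x = {xi:2d}  y = {yi:6.2f}")
```

Rerunning with a different seed scatters the points differently around the same underlying line, which is exactly what the error term is meant to capture.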
Estimating the Linear Model
- Given some data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, we estimate the regression model parameters (a and b) with the coefficients $\hat{a}$ and $\hat{b}$:

$$\hat{y} = \hat{a} + \hat{b}x$$

where $\hat{y}$ is the predicted value of y
Which is the Y...
- Y is often called the dependent or response variable
- We have some reason to think it results from X (e.g., pizza sales are affected by income, not the other way around)
- We often are trying to make a statement about Y based on X
- Sometimes we are trying to predict Y as a function of X
...and Which is X?
- X is often called the independent or predictor variable
- We generally do not believe that Y has an effect on X (e.g., we do not believe that increasing the pizza sales in a town would cause household incomes to increase)
- You should think carefully about which variable should be the Y and which the X
Regression in Excel
- First, create a scatterplot of the data: [Pizza Sales ($000) vs. Income ($000)]
Regression in Excel (2)
- Right-click on one of the points in the scatterplot
- Choose "Add trendline"
- Click on "Linear"
- Then, on the Options tab, check "Display equation on chart" and "Display R-squared value on chart"
- Finally, click OK
Excel Regression Output
[Scatterplot with fitted trendline: y = 2.9048x + 14.577, R² = 0.9683]
Regression in JMP
- Again, first create a scatterplot
- Make sure both variables are continuous
- Analyze > Fit Y by X
- Put the dependent variable as Y and the independent as X
- Click on the red triangle above the scatterplot and choose "Fit Line"
JMP Regression Output
[JMP output: estimated equation of the line and the regression coefficients]
- We'll talk about the rest of the output shortly
Regression Chooses the Best Line
- Must make some assumptions:
  - y = a + bx + e: the data follow a straight line
  - The e's are normally distributed with mean 0
  - The e's have constant variance
  - The errors are independent of each other
- For now, just accept that regression is making a good choice
- Shortly we will define what "best line" means in statistical terms
Residuals: Vertical Deviations Between Line and Data Points
- The line is based on your choice of a and b
- The residual (or error) is $y - \hat{y}$: the vertical distance between the observed y from the data set and the predicted $\hat{y}$ from the regression
[Plot: data points with vertical segments connecting each observed y to the fitted line]
Minimizing the Sum of Squares
- For each observation in the data set, your line $\hat{y}_i = \hat{a} + \hat{b}x_i$ predicts where Y should be
- The residual from the i-th data point, $e_i = y_i - \hat{y}_i$, is how far the true Y value is from where the line predicts
- The sum of squared residuals (or sum of squared errors), $SE_{line} = e_1^2 + e_2^2 + \cdots + e_n^2$, gives an overall measure of how well the line fits
- Choose a and b to make $SE_{line}$ as small as possible
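For the pizza data, the residuals and SE_line can be computed directly. A sketch, plugging in the rounded coefficients from the regression output (14.577 and 2.9048):

```python
income = [5, 10, 20, 8, 4, 6, 12, 15]
sales = [27, 46, 73, 40, 30, 28, 46, 59]

a_hat, b_hat = 14.577, 2.9048  # rounded estimates from the fitted line

# Residual for each town: observed y minus predicted y-hat
residuals = [y - (a_hat + b_hat * x) for x, y in zip(income, sales)]

# Sum of squared errors: the quantity least squares makes as small as possible
se_line = sum(e ** 2 for e in residuals)
print(round(se_line, 1))  # about 58.0
```

Any other choice of intercept and slope gives a larger sum of squares on this data, which is what "best line" will mean.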
Why Square the Residuals?
- Squaring makes everything positive
- Some residuals are positive, and some are negative; if not squared, large positive residuals could balance out large negative residuals
- Why not absolute value? Squared terms are easier to do calculus with than absolute values, and minimizing something ($SE_{line}$) is a calculus problem
Doing the Calcs by Hand
- It turns out $SE_{line}$ is minimized for

$$\hat{b} = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2} \qquad \text{and} \qquad \hat{a} = \bar{y} - \hat{b}\bar{x}$$

- The calculations only need five quantities: $n$, $\sum_{i=1}^{n} x_i$, $\sum_{i=1}^{n} y_i$, $\sum_{i=1}^{n} x_i^2$, and $\sum_{i=1}^{n} x_i y_i$
- Though you can do these calculations easily in Excel, we'll still use JMP
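The by-hand recipe can be checked on the pizza data; a sketch that builds the five quantities and then the two coefficients:

```python
income = [5, 10, 20, 8, 4, 6, 12, 15]
sales = [27, 46, 73, 40, 30, 28, 46, 59]

# The five quantities the hand calculation needs
n = len(income)
sum_x = sum(income)                                  # 80
sum_y = sum(sales)                                   # 349
sum_x2 = sum(x * x for x in income)                  # 1010
sum_xy = sum(x * y for x, y in zip(income, sales))   # 4100

mean_x, mean_y = sum_x / n, sum_y / n

# b-hat = (mean of xy - xbar*ybar) / (mean of x^2 - xbar^2)
b_hat = (sum_xy / n - mean_x * mean_y) / (sum_x2 / n - mean_x ** 2)
a_hat = mean_y - b_hat * mean_x

print(round(b_hat, 4), round(a_hat, 3))  # 2.9048 14.577
```

These match the Excel trendline (y = 2.9048x + 14.577) exactly.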
Coefficient of Determination (R²)
- Some of the variation in Y can be explained by variation in X, and some cannot
- R-squared tells you the fraction of variance that can be explained by X:

$$R^2 = 1 - \frac{SE_{line}}{SE_{avg}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
Pizza Example
- R-squared = 0.97: 97% of the variance of Y has been explained by X
- SD of residuals = 3.1
- SD of Y = 16.2
Interpreting R²
- R² and correlation both tell you how well the regression is doing; in fact, R² = r²
- R² is always between 0 and 1
- R² = 0 means no linear relationship between X and Y
- A large R² means your regression has explained a lot
- "Large" can only be judged in the context of the specific problem
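Both facts, the definition of R² and the identity R² = r², can be verified numerically on the pizza data. A sketch using the least-squares formulas from the previous slides:

```python
import math

income = [5, 10, 20, 8, 4, 6, 12, 15]
sales = [27, 46, 73, 40, 30, 28, 46, 59]

n = len(income)
mean_x, mean_y = sum(income) / n, sum(sales) / n

# Deviation sums: Sxx, Syy, and the cross-term Sxy
sxx = sum((x - mean_x) ** 2 for x in income)
syy = sum((y - mean_y) ** 2 for y in sales)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, sales))

b_hat = sxy / sxx
a_hat = mean_y - b_hat * mean_x

# SE_line: squared residuals around the fitted line; Syy plays the role of SE_avg
se_line = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(income, sales))
r_squared = 1 - se_line / syy

r = sxy / math.sqrt(sxx * syy)
print(round(r_squared, 4))  # 0.9683, as in the Excel output
print(round(r ** 2, 4))     # same value: R^2 = r^2
```

The 0.9683 here is the same R² Excel printed on the chart.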
Back to the Pizza Problem: JMP Regression Output
[JMP output with $n$, $R^2$, $\hat{y}$, $\hat{b}$, and $\hat{a}$ labeled]
Estimating the Variance of the Error
- Remember, we assumed $e \sim N(0, \sigma^2)$
- We can estimate the variance of e as

$$\hat{\sigma}^2 = \frac{SE_{line}}{n-2} = \frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

- From this result, and some other details we won't go into here, we can construct hypothesis tests for a and b!
- Just skim the "Statistical Analysis of Regressions" section in your textbook
Back to the Pizza Problem: JMP Regression Output
- t statistic and p-value for the hypothesis test of the slope: $H_0\!: b = 0$ versus $H_a\!: b \neq 0$
- p-value < 0.05, so reject the null hypothesis; i.e., the slope is nonzero
- Interpretation: income is a relevant predictor of pizza sales
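The slope test statistic can be reproduced from the pieces above. This sketch uses the standard textbook formula SE(b̂) = σ̂ / √Σ(xᵢ - x̄)², which is one of the "other details" the slides skip:

```python
import math

income = [5, 10, 20, 8, 4, 6, 12, 15]
sales = [27, 46, 73, 40, 30, 28, 46, 59]

n = len(income)
mean_x, mean_y = sum(income) / n, sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in income)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, sales))

b_hat = sxy / sxx
a_hat = mean_y - b_hat * mean_x

# sigma-hat^2 = SE_line / (n - 2): the estimated variance of the error term
se_line = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(income, sales))
sigma_hat = math.sqrt(se_line / (n - 2))

# Standard error of the slope and the t statistic for H0: b = 0
se_b = sigma_hat / math.sqrt(sxx)
t_stat = b_hat / se_b
print(round(sigma_hat, 1))  # 3.1, the SD of the residuals quoted earlier
print(round(t_stat, 1))     # about 13.5
```

A t statistic this large on 6 degrees of freedom gives a tiny p-value, consistent with JMP rejecting the null hypothesis.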
Back to the Pizza Problem: JMP Regression Output
- t statistic and p-value for the hypothesis test of the intercept: $H_0\!: a = 0$ versus $H_a\!: a \neq 0$
- p-value < 0.05, so reject the null hypothesis; i.e., the intercept is nonzero
- Interpretation: the line does not pass through the origin
Lots More on Regression
- With more time, we could cover:
  - Using the regression for prediction
  - Analyzing the residuals to assess model appropriateness
  - Transformations
- Just skim these sections of your text
- And then there's multiple regression and more complicated modeling, but we'll have to stop here
Case: Mutual Fund Performance
- Can past performance of a mutual fund be used to predict future performance?
- Do funds perform consistently over time? I.e., if I pick a fund that has performed well in the past, should I expect it to continue to perform well in the future?
- Data: performance of 1,533 mutual funds from 1988-1993 (MutFunds.jmp)
Correlations by Year
- The correlation matrix summarizes each scatterplot with a single number
- More helpful?
Alternate Approach
- Use the average return of the past five years (1988-1992) to predict the next year (1993)
- So, which is the X and which is the Y?
- If past performance is associated with future returns, what would you expect to see?
Linear Regression Appropriate?
Regression Results (1)
- Slope and intercept significant?
- Equation of the line?
- Any issues?
Regression Results (2)
- What happens if we remove all the really poorly performing funds? I.e., exclude all those with negative 5-year average returns
Regression Results (3)
- Okay, what if we concentrate on the poorly performing funds? I.e., exclude all those with positive 5-year average returns
Regression Results (4)
- But if we exclude the three unusually poor performers... hmmm
Regression Results (5)
- Would you invest in these funds?
What We Have Covered
- Understand correlation
- Describe a linear equation and a linear regression model
- Understand the criteria for fitting a line to data
- Learn to fit a simple linear regression model in Excel and JMP
- Understand how to assess the fit
What We Have Learned in this Course
- Descriptive statistics
- A bit about probability
- Inference for a population mean
  - Confidence intervals
  - Hypothesis testing for the mean: one-sample tests, two-sample tests, paired tests
- Introduction to simple linear regression