Business Statistics. Lecture 9: Simple Regression


On to Model Building!
Up to now, this course has covered descriptive and inferential statistics: numerical and graphical summaries of data, confidence intervals, and hypothesis testing. We finish by taking those tools and building models to try to explain data. For each Y in my data I also observe an X (e.g., Y = weight and X = height for a group of people). Can I use X to say something about Y?

Goals for this Lecture
Understand correlation.
Describe a linear equation and a linear regression model.
Understand the criteria for fitting a line to data.
Learn to fit a simple linear regression model in Excel and JMP.
Understand how to assess the fit.

Scatterplots
A scatterplot shows two continuous variables, graphically displaying the relationship between X and Y. This graph shows a positive association: as X increases, so does Y.
[Scatterplot: Y versus X, points trending upward]

How to Express Numerically?
Raw data (pairs of X and Y) are too hard to interpret or summarize. Scatterplots are informative, but can also sometimes have too much information. How do we reduce this to a number that tells us how strong the association is? Answer: correlation. It's often said that "X and Y are correlated" without knowing what that really means.

Correlation
A measure of the strength of the linear relationship between X and Y, where the Xs and Ys are two different (continuous) variables observed on the same units in your sample. E.g., height and weight for each person; test score and amount of time spent studying; average household income in a region and average pizza sales.

Correlation
r = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / [ (n − 1) √( Var(x) · Var(y) ) ]
Correlation (r) close to:
+1: strong positive linear relationship.
−1: strong negative linear relationship.
0: no linear relationship.
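The definition above can be computed directly. A minimal sketch in plain Python (the helper name `correlation` and the example data are illustrative, not from the slides); it uses the equivalent form r = Sxy / √(Sxx · Syy):

```python
import math

def correlation(xs, ys):
    """Sample correlation: r = sum((xi - xbar)(yi - ybar)) / sqrt(Sxx * Syy)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

As a quick sanity check, perfectly linear increasing data give r = 1 and perfectly linear decreasing data give r = −1.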

Positive Correlation
In the numerator of r, each term is a product (xᵢ − x̄)(yᵢ − ȳ):
When X is greater than average X and Y is greater than average Y, the product is positive.
When X is less than average X and Y is less than average Y, the product is also positive.
[Scatterplot: upward-sloping cloud of points, quadrants marked relative to the means]

Negative Correlation
When X is less than average X and Y is greater than average Y, the product is negative.
When X is greater than average X and Y is less than average Y, the product is also negative.
[Scatterplot: downward-sloping cloud of points, quadrants marked relative to the means]

Example: Predicting Pizza Sales from Income
Data from eight towns: one month's pizza sales and income.

Town  Income ($000)  Pizza Sales ($000)
1     5              27
2     10             46
3     20             73
4     8              40
5     4              30
6     6              28
7     12             46
8     15             59

Summarizing the Data
[Scatterplot: Pizza Sales ($000) versus Income ($000), points trending upward]
Can also say: the correlation between income and pizza sales is 0.98. There is a high positive association. Use the CORREL function in Excel.
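The 0.98 figure can be reproduced from the table by hand (or, as the slide notes, with Excel's CORREL). A plain-Python sketch using the eight towns:

```python
import math

# Pizza data from the table: income and one month's pizza sales ($000) for 8 towns
income = [5, 10, 20, 8, 4, 6, 12, 15]
sales = [27, 46, 73, 40, 30, 28, 46, 59]

n = len(income)
xbar, ybar = sum(income) / n, sum(sales) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(income, sales))
sxx = sum((x - xbar) ** 2 for x in income)
syy = sum((y - ybar) ** 2 for y in sales)
r = sxy / math.sqrt(sxx * syy)  # about 0.98, matching the slide
```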

Estimating the Linear Relationship
Correlation measures the strength of the linear relationship between X and Y. An estimate of the actual linear relationship is given by the regression of Y on X:
ŷ = â + b̂x
where â is the intercept and b̂ is the slope. For the pizza data: ŷ = 14.577 + 2.905x.
[Scatterplot: Pizza Sales ($000) versus Income ($000) with the fitted line]

Linear Equations
General expression for a linear equation: Y = a + bX.
a is the intercept: the value of Y when X = 0.
b is the slope: how much Y increases per unit increase in X.

A Numerical Example
Equation: Y = 1 + 3X. Y is 1 when X is 0, and each time X increases by 1, Y increases by 3.
[Plot: the line Y = 1 + 3X over X from −4 to 4]

Linear Model
General expression for a linear model:
yᵢ = a + bxᵢ + eᵢ
a and b are model parameters; eᵢ is the error or noise term. The error terms are often assumed to be independent observations from a N(0, σ²) distribution.

Estimating the Linear Model
Given data (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ), we estimate the regression model parameters (a and b) with coefficients:
ŷᵢ = â + b̂xᵢ
where ŷᵢ is the predicted value of yᵢ.

Which Is the Y...
Y is often called the dependent or response variable. We have some reason to think it results from X (e.g., pizza sales are affected by income, not the other way around). We are often trying to make a statement about Y based on X; sometimes we are trying to predict Y as a function of X.

...and Which Is X?
X is often called the independent or predictor variable. We generally do not believe that Y has an effect on X; e.g., we do not believe that increasing the pizza sales in a town would cause household incomes to increase. Think carefully about which variable should be the Y and which the X.

Regression in Excel
First, create a scatterplot of the data.
[Scatterplot: Pizza Sales ($000) versus Income ($000)]

Regression in Excel (2)
Right-click on one of the points in the scatterplot and choose "Add trendline." Click on "Linear," then click on the Options tab and check "Display equation on chart" and "Display R-squared value on chart." Finally, click OK.

Excel Regression Output
[Scatterplot of Pizza Sales ($000) versus Income ($000) with fitted trendline: y = 2.9048x + 14.577, R² = 0.9683]

Regression in JMP
Again, first create a scatterplot. Make sure both variables are continuous. Choose Analyze > Fit Y by X, putting the dependent variable as Y and the independent as X. Then click on the red triangle above the scatterplot and choose "Fit Line."

JMP Regression Output
[JMP output: the estimated equation of the line and the regression coefficients. We'll talk about the rest of the output shortly.]

Regression Chooses the Best Line
We must make some assumptions:
The data follow a straight line: y = a + bx + e.
The e's are normally distributed with mean 0.
The e's have constant variance.
The errors are independent of each other.
For now, just accept that regression is making a good choice; shortly we will define what "best line" means in statistical terms.

Residuals: Vertical Deviations Between Line and Data Points
The line is based on your choice of a and b. For each observed y in the data set, the regression gives a predicted value ŷ; the residual (or error) is y − ŷ.
[Plot: data points, fitted line, and vertical segments from each point to the line]

Minimizing the Sum of Squares
For each observation in the data set, your line predicts where Y should be:
ŷᵢ = â + b̂xᵢ
The residual from the iᵗʰ data point, eᵢ = yᵢ − ŷᵢ, is how far the true Y value is from where the line predicts. The sum of squared residuals (or sum of squared errors),
SE_line = e₁² + e₂² + … + eₙ²
gives an overall measure of how well the line fits. Choose a and b to make SE_line as small as possible.

Why Square the Residuals?
Squaring makes everything positive. Some residuals are positive and some are negative; if they were not squared, large positive residuals could balance out large negative residuals. Why not absolute value? Squared quantities are easier to do calculus with than absolute values, and minimizing SE_line is a calculus problem.

Doing the Calculations by Hand
It turns out that SE_line is minimized for
b̂ = ( Σxᵢyᵢ − n·x̄·ȳ ) / ( Σxᵢ² − n·x̄² )  and  â = ȳ − b̂·x̄
The calculations only need five quantities: n, Σxᵢ, Σyᵢ, Σxᵢ², and Σxᵢyᵢ. Though these calculations are easy to do in Excel, we'll still use JMP.
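The closed-form solution above can be checked against the pizza data. A plain-Python sketch built from exactly the five quantities listed:

```python
# Pizza data from the earlier table: income and sales ($000) for 8 towns
income = [5, 10, 20, 8, 4, 6, 12, 15]
sales = [27, 46, 73, 40, 30, 28, 46, 59]

# The five quantities the formulas need
n = len(income)
sum_x, sum_y = sum(income), sum(sales)
sum_x2 = sum(x * x for x in income)
sum_xy = sum(x * y for x, y in zip(income, sales))

# Closed-form least-squares estimates
b_hat = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a_hat = sum_y / n - b_hat * sum_x / n
# b_hat is about 2.905 and a_hat about 14.577, matching the fitted line in the slides
```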

Coefficient of Determination (R²)
Some of the variation in Y can be explained by variation in X, and some cannot. R-squared tells you the fraction of the variance that can be explained by X:
R² = 1 − SE_line / SE_avg = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²
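The ratio above can be computed directly for the pizza data, reproducing the R² shown in the Excel output. A plain-Python sketch:

```python
# Pizza data from the earlier table
income = [5, 10, 20, 8, 4, 6, 12, 15]
sales = [27, 46, 73, 40, 30, 28, 46, 59]

n = len(income)
xbar, ybar = sum(income) / n, sum(sales) / n
b_hat = sum((x - xbar) * (y - ybar) for x, y in zip(income, sales)) / \
        sum((x - xbar) ** 2 for x in income)
a_hat = ybar - b_hat * xbar

# R^2 = 1 - SE_line / SE_avg
se_line = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(income, sales))
se_avg = sum((y - ybar) ** 2 for y in sales)
r_squared = 1 - se_line / se_avg  # about 0.968, matching the Excel output
```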

Pizza Example
R-squared = 0.97; SD of residuals = 3.1; SD of Y = 16.2. So 97% of the variance of Y has been explained by X.

Interpreting R²
R² and correlation both tell you how well the regression is doing; in fact, R² = r². R² is always between 0 and 1. R² = 0 means no linear relationship between X and Y; a large R² means your regression has explained a lot. "Large" can only be judged in the context of the specific problem.

Back to the Pizza Problem: JMP Regression Output
[JMP output with n, R², ŷ, b̂, and â annotated]

Estimating the Variance of the Error
Remember, we assumed e ~ N(0, σ²). We can estimate the variance of e as
σ̂² = SE_line / (n − 2) = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − 2)
From this result, and some other details we won't go into here, we can construct hypothesis tests for a and b! Just skim the "Statistical Analysis of Regressions" section in your textbook.
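This estimate can also be checked against the pizza example, where the slides report an SD of residuals of about 3.1. A plain-Python sketch:

```python
import math

# Pizza data from the earlier table
income = [5, 10, 20, 8, 4, 6, 12, 15]
sales = [27, 46, 73, 40, 30, 28, 46, 59]

n = len(income)
xbar, ybar = sum(income) / n, sum(sales) / n
b_hat = sum((x - xbar) * (y - ybar) for x, y in zip(income, sales)) / \
        sum((x - xbar) ** 2 for x in income)
a_hat = ybar - b_hat * xbar

# sigma_hat^2 = SE_line / (n - 2); its square root is the SD of the residuals
se_line = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(income, sales))
sigma_hat = math.sqrt(se_line / (n - 2))  # about 3.1, matching the slide
```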

Back to the Pizza Problem: JMP Regression Output
t statistic and p-value for the hypothesis test of the slope:
H₀: b = 0 versus Hₐ: b ≠ 0
The p-value < 0.05, so reject the null hypothesis; i.e., the slope is nonzero. Interpretation: income is a relevant predictor of pizza sales.
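The slope test JMP reports can be sketched by hand. One assumption here not shown on the slides: the standard error of the slope is the usual simple-regression formula SE(b̂) = σ̂ / √Sxx, where Sxx = Σ(xᵢ − x̄)².

```python
import math

# Pizza data from the earlier table
income = [5, 10, 20, 8, 4, 6, 12, 15]
sales = [27, 46, 73, 40, 30, 28, 46, 59]

n = len(income)
xbar, ybar = sum(income) / n, sum(sales) / n
sxx = sum((x - xbar) ** 2 for x in income)
b_hat = sum((x - xbar) * (y - ybar) for x, y in zip(income, sales)) / sxx
a_hat = ybar - b_hat * xbar

# Estimated error SD, then t = b_hat / SE(b_hat) with SE(b_hat) = sigma_hat / sqrt(Sxx)
se_line = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(income, sales))
sigma_hat = math.sqrt(se_line / (n - 2))
t_b = b_hat / (sigma_hat / math.sqrt(sxx))  # about 13.5
```

A t statistic near 13.5 on n − 2 = 6 degrees of freedom is far beyond the usual critical value (about 2.45 at the 5% level), consistent with the slide's rejection of H₀.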

Back to the Pizza Problem: JMP Regression Output
t statistic and p-value for the hypothesis test of the intercept:
H₀: a = 0 versus Hₐ: a ≠ 0
The p-value < 0.05, so reject the null hypothesis; i.e., the intercept is nonzero. Interpretation: the line does not pass through the origin.

Lots More on Regression
With more time, we could cover: using the regression for prediction, analyzing the residuals to assess model appropriateness, and transformations (just skim these sections of your text). And then there's multiple regression and more complicated modeling. But we'll have to stop here.

Case: Mutual Fund Performance
Can past performance of a mutual fund be used to predict future performance? Do funds perform consistently over time? I.e., if I pick a fund that has performed well in the past, should I expect it to continue to perform well in the future? Data: performance of 1,533 mutual funds from 1988-1993 (MutFunds.jmp).

[Scatterplots of fund returns across pairs of years]

Correlations by Year
The correlation matrix summarizes each scatterplot with a single number. More helpful?

Alternate Approach
Use the average return of the past five years (1988-1992) to predict the next year (1993). So, which is the X and which is the Y? If past performance is associated with future returns, what would you expect to see?

Linear Regression Appropriate?

Regression Results (1)
Are the slope and intercept significant? What is the equation of the line? Any issues?

Regression Results (2)
What happens if we remove all the really poorly performing funds? I.e., exclude all those with negative 5-year average returns.

Regression Results (3)
Okay, what if we concentrate on the poorly performing funds? I.e., exclude all those with positive 5-year average returns.

Regression Results (4)
But if we exclude the three unusually poor performers... hmmm.

Regression Results (5)
Would you invest in these funds?

What We Have Covered
Understand correlation.
Describe a linear equation and a linear regression model.
Understand the criteria for fitting a line to data.
Learn to fit a simple linear regression model in Excel and JMP.
Understand how to assess the fit.

What We Have Learned in this Course
Descriptive statistics.
A bit about probability.
Inference for a population mean: confidence intervals and hypothesis testing for the mean (one-sample, two-sample, and paired tests).
Introduction to simple linear regression.