Correlation and Regression

Correlation and Regression
Dr. Bob Gee, Dean Scott Bonney, Professor William G. Journigan
American Meridian University

Learning Objectives
Upon successful completion of this module, the student should be able to:
- Understand correlation
- Understand the terminology of regression
- Understand simple linear regression

Analyze: Determining Root Cause

Output (Y)   Input (X)    Assumptions Met   Analytic Method
Continuous   Continuous   Yes               Correlation; Simple Linear Regression; Multiple Linear Regression
Continuous   Continuous   No                Polynomial Regression; Nonlinear Regression; Transformations on Y or X
Continuous   Attribute    Yes               One-sample t-test; Two-sample t-test; ANOVA
Continuous   Attribute    No                Wilcoxon Signed Rank test; Mann-Whitney test; Mood's Median test; Kruskal-Wallis

Correlation
Correlation is a technique for quantifying the strength of association between an output variable and an input variable via the correlation coefficient, r.

Correlation
Correlation measures the linear relationship between two continuous variables with Pearson's correlation coefficient, r, which takes values between -1 and +1.
Guideline (usually tempered by sample size):
- If |r| > 0.80, the relationship is important
- If |r| < 0.20, the relationship is not significant
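To make the definition concrete, here is a minimal Python sketch of Pearson's r computed directly from its definition; the data values are hypothetical, and NumPy stands in for Minitab's correlation routine.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

# Hypothetical data showing a strong positive linear relationship
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
print(pearson_r(x, y))  # close to +1
```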

Correlation (r): The Strength of the Relationship
[Six scatterplot panels of Y versus X: strong positive correlation (r = .95, R² = 90%), moderate positive correlation (r = .70, R² = 49%), no correlation (r = .006, R² = .0036%), strong negative correlation (r = -.90, R² = 81%), moderate negative correlation (r = -.73, R² = 53%), and a curved pattern with no linear correlation (r = -.29, R² = 8%).]
Note: If the slope b1 = 0, then r = 0. Beyond that, there is no relationship between the size of the slope b1 and the correlation value r.

Abuse and Misuse of Correlation
If we establish a correlation between the output (y) and an input (x1), that does NOT necessarily mean that variation in the input caused variation in the output. A lurking third variable may cause both x1 and y to vary. Concluding that there is a relationship between two variables does NOT mean that there is a cause-and-effect relationship. Correlation does NOT determine causation!

Shark Attacks and Ice Cream
Studies show that as ice cream sales increase, the number of shark attacks increases. Does buying more ice cream lead to more shark attacks?
[Scatterplot of Shark Attacks versus Ice Cream Sold, showing a positive association.]

Correlation Example
A Six Sigma team assigned to determine why invoices are not being paid promptly by the accounts payable department needs to determine whether a relationship exists between the amount of an invoice and the number of days required to pay it.

Correlation Example
Practical question: Is there a relationship between the amount of the invoice and the number of days needed to process it?
Hypotheses:
- Null: The two variables are not correlated
- Alternative: The two variables are correlated

Example: Correlation
- Open Invoice Payments.mtw
- Select Graph > Scatterplot
- Select Simple
- Select Days to Pay as the Y variable
- Select Invoice Amount as the X variable
- Select OK

Example: Correlation
- Select Stat > Basic Statistics > Correlation
- Select Days to Pay and Invoice Amount
- Select OK

Example: Correlation
Correlations: Days to Pay, Invoice Amount
Pearson correlation of Days to Pay and Invoice Amount = 0.970
P-Value = 0.000
The r value of 0.97 is strong evidence of a linear relationship.
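Minitab reports r together with a p-value for the null hypothesis of no correlation. An equivalent check in Python uses scipy.stats.pearsonr; the invoice figures below are hypothetical stand-ins, since the Invoice Payments.mtw worksheet is not reproduced here.

```python
from scipy import stats

# Hypothetical stand-ins for the Invoice Amount and Days to Pay columns
invoice_amount = [1200, 3400, 560, 8900, 4300, 2100, 7600, 990]
days_to_pay    = [19,   27,   16,  45,   30,   22,   41,   18]

r, p = stats.pearsonr(invoice_amount, days_to_pay)
print(f"r = {r:.3f}, p-value = {p:.4f}")
# A p-value below 0.05 rejects the null hypothesis of no correlation.
```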

Simple Linear Regression
A simple linear regression analysis finds the line that best fits the data and describes the relationship between two continuous variables. Generally, the line that best fits the data is the one that minimizes the error variation, i.e., the residuals. The regression line, or model, may be used to predict values of the output (Y) as values of the input (X) change.

Simple Linear Regression
[Plot of Y versus X with a fitted line.] Slope = magnitude of change in Y given a one-unit change in X.

Simple Linear Regression
Y(X) = b0 + b1X + e   (compare Y = mX + b)
- b0 = Y-intercept
- b1 = slope of the line
- e = error term for the model
b0 and b1 are regression coefficients that will be estimated by the software.

Simple Linear Regression
[Plot decomposing the variation in Y about its mean: Total Variation = SST, Unexplained Variation = SSE, Explained Variation = SSM, where SST = SSM + SSE.]

Simple Linear Regression
Null hypothesis: The simple linear regression model does not fit the data better than the baseline model (the model does not explain the variation in Y); b1 = 0.
Alternative hypothesis: The simple linear regression model does fit the data better than the baseline model (the model explains the variation in Y); b1 ≠ 0.

Regression Terminology
- r: the correlation coefficient. The closer to +/-1, the better the fit of the model; 0 indicates no linear relationship.
- R-Sq: the correlation coefficient squared (R²). The closer R² is to 100%, the more likely a relationship exists and the more variation is explained.
- R-Sq (Adj): R² adjusted for an over-fit condition; takes into account the number of terms in the model.
- Standard error of the estimate (s): the expected deviation of the data about the fitted line; s = sqrt(MS_error).
- Mean square of regression (MS_regression): the between estimate of variance for the overall model; MS_regression = SS_regression / DF_regression (DF = degrees of freedom).

Regression Terminology
- Mean square of the residual (error) (MS_error): the within estimate of variance and the best estimate of the population variance; MS_error = SS_error / DF_error.
- F-ratio: F = MS_regression / MS_error. A higher value indicates the model can detect a relationship between the factors and the response.
- p-value: the probability of an error if a difference is claimed. A p-value < 0.05 indicates a significant difference; a p-value > 0.05 indicates that no conclusion of difference (significance) can be drawn. It is the probability that the model is not a good model; a small value indicates that a relationship between factors and response has been found.
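These quantities fall out of a regression's ANOVA decomposition. A minimal Python sketch, assuming a fitted line has already produced predictions y_hat for responses y (the function and variable names are illustrative):

```python
import numpy as np
from scipy import stats

def regression_anova(y, y_hat, n_predictors=1):
    """ANOVA quantities for a fitted regression line (sketch)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    ss_m = ((y_hat - y.mean()) ** 2).sum()     # explained (model) sum of squares
    ss_e = ((y - y_hat) ** 2).sum()            # residual (error) sum of squares
    df_m, df_e = n_predictors, n - n_predictors - 1
    ms_m, ms_e = ss_m / df_m, ss_e / df_e      # mean squares
    f_ratio = ms_m / ms_e
    p_value = stats.f.sf(f_ratio, df_m, df_e)  # upper-tail F probability
    s = np.sqrt(ms_e)                          # standard error of the estimate
    return f_ratio, p_value, s
```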

Coefficient of Determination
The test of b1 answers the question of whether or not a linear relationship exists between two continuous variables. It is also useful to measure the strength of the linear relationship with the coefficient of determination (R²):
R² = SS_M / SS_T
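In code, R² can be computed from the same sums of squares; a small sketch (equivalently R² = 1 - SS_E / SS_T, since SS_T = SS_M + SS_E):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: R^2 = SS_M / SS_T = 1 - SS_E / SS_T."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_t = ((y - y.mean()) ** 2).sum()   # total variation
    ss_e = ((y - y_hat) ** 2).sum()      # unexplained (residual) variation
    return 1.0 - ss_e / ss_t
```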

Assumptions
- There is a linear relationship between Y and X.
- The data points are independent.
- The residuals (residual = measured Y - predicted Y) are normally distributed.
- The residuals have a constant variance.

Example: Simple Linear Regression
- Open Invoice Payments.mtw
- Select Stat > Regression > Fitted Line Plot
- Select Days to Pay as the Response
- Select Invoice Amount as the Predictor
- Select Graphs, then select Four in one
- Select OK, OK

Example: Simple Linear Regression
Regression Analysis: Days to Pay versus Invoice Amount
The regression equation is: Days to Pay = 15.49 + 0.003304 Invoice Amount
S = 2.60182   R-Sq = 94.1%   R-Sq(adj) = 94.0%

Analysis of Variance
Source       DF        SS        MS         F      P
Regression    1   9274.91   9274.91   1370.11  0.000
Error        86    582.17      6.77
Total        87   9857.08

Use the ANOVA table to evaluate the null hypothesis. Can you predict the number of days needed to process an invoice if you know the amount of the invoice?
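The same kind of fit can be reproduced outside Minitab with scipy.stats.linregress; a sketch with hypothetical stand-in data, since the worksheet itself is not included here:

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for the worksheet columns
invoice_amount = np.array([1200, 3400, 560, 8900, 4300, 2100, 7600, 990])
days_to_pay    = np.array([19, 27, 16, 45, 30, 22, 41, 18])

fit = stats.linregress(invoice_amount, days_to_pay)
print(f"Days to Pay = {fit.intercept:.2f} + {fit.slope:.6f} * Invoice Amount")
print(f"R-Sq = {fit.rvalue ** 2:.1%}, p-value for the slope = {fit.pvalue:.4f}")
```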

Example: Validating Assumptions
Examine the residuals plot for patterns. If the residuals are randomly scattered about the average (the red line), the constant-variance assumption is validated.

Four in One Residuals Plots
Useful for comparing the plots to determine whether your model meets the assumptions of the analysis. The residual plots in the graph (sketched in code below) include:
- Histogram: indicates whether the data are skewed or outliers exist in the data
- Normal probability plot: indicates whether the data are normally distributed, other variables are influencing the response, or outliers exist in the data
- Residuals versus fitted values: indicates whether the variance is constant, a nonlinear relationship exists, or outliers exist in the data
- Residuals versus order of the data: indicates whether there are systematic effects in the data due to time or data-collection order
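Minitab draws these four panels automatically; the sketch below builds a comparable layout in Python with Matplotlib and SciPy, using simulated data in place of a worksheet.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 100, 80)
y = 15 + 0.3 * x + rng.normal(0, 3, 80)    # simulated linear data with noise

fit = stats.linregress(x, y)
fitted = fit.intercept + fit.slope * x
resid = y - fitted                         # residuals = measured - predicted

fig, ax = plt.subplots(2, 2, figsize=(9, 6))
stats.probplot(resid, plot=ax[0, 0])       # normal probability plot
ax[0, 0].set_title("Normal probability plot")
ax[0, 1].scatter(fitted, resid)
ax[0, 1].axhline(0, color="red")
ax[0, 1].set_title("Residuals versus fitted values")
ax[1, 0].hist(resid)
ax[1, 0].set_title("Histogram of residuals")
ax[1, 1].plot(resid, marker="o")
ax[1, 1].axhline(0, color="red")
ax[1, 1].set_title("Residuals versus observation order")
plt.tight_layout()
plt.show()
```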

Regression: Quantifies the Relationship Between X and Y
Regression analysis generates a line that quantifies the relationship between X and Y. The line, or regression equation, is represented as:
Y = b0 + b1X
- b0 = intercept (where the line crosses X = 0)
- b1 = slope (rise over run, or the change in Y per unit increase in X)
[Plot of Y (output) versus X (input) with the fitted regression line.]

Benefits of Quantifying a Relationship
- Prediction: the equation can be used to predict future Ys by plugging in an X value.
- Control: if X is controllable, you can manipulate process conditions to avoid undesirable results and/or generate desirable results.

Caution! Extrapolating Beyond the Data Is Risky
[Scatterplot of Cycle Time (Days) versus Amount ($M), with observed data only within a limited range of X. What is the relationship between X and Y for X > 120?]

Extrapolating Beyond the Data Is Risky
- Extrapolation is making predictions outside the range of the X data.
- It's a natural desire, but it's like walking from solid ground onto thin ice.
- Predictions from regression equations are more reliable for Xs within the range of the observed data.
- Extrapolation is less risky if you have a theory, process knowledge, or other data to guide you.

How the Regression Equation Is Determined: The Least Squares Method
The regression equation is determined by a procedure that minimizes the total squared distance of all points to the line. It finds the line where the squared vertical distance from each data point to the line is as small as possible; restated, it minimizes the squares of all the residuals.
Regression uses the least squares method to determine the best line:
- Data (both X and Y values) are used to obtain the b0 and b1 values
- The b0 and b1 values establish the equation
- We will use Minitab to do this
Least squares method (see the sketch below):
1. Measure the vertical distance from each point to the line
2. Square those distances
3. Sum the squared distances
4. Find the line that minimizes that sum
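For simple linear regression this minimization has a closed-form solution, which a short Python sketch makes explicit (hypothetical data; Minitab performs the same calculation internally):

```python
import numpy as np

def least_squares_line(x, y):
    """Fit Y = b0 + b1*X by minimizing the sum of squared residuals."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Hypothetical data with a roughly linear trend
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
b0, b1 = least_squares_line(x, y)
print(f"Y = {b0:.2f} + {b1:.2f} X")  # approximately Y = 0.08 + 1.98 X
```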

Minitab Follow Along: Make a Plot With a Regression Line
Objective: Practice using Minitab to make a plot with a regression line on it and interpret the results.
Data: Staff_Calls.mtw
Background: You are in charge of the help desk for Quantico, where technicians process trouble tickets over the phone from 6 AM to 6 PM, Monday through Friday. The current staffing plan begins with about 4 technicians at 6 AM and increases to about 35 technicians by 9 AM. At 3:30 PM the number of technicians begins to drop, falling to about 7 by 6 PM. You want to know how many calls can be answered in a 30-minute time interval at various staffing levels.

Minitab Example
You obtain data on the number of technicians and the number of calls answered for each 30-minute time interval over the last 2 weeks (n = 240).
Questions:
- What is X, and what type of data is it?
- What is Y, and what type of data is it?

Minitab Follow Along: Make a Plot With a Regression Line
Stat > Regression > Fitted Line Plot

Questions: Make a Plot With a Regression Line
1. What are the intercept and the slope?
2. What does the slope mean?
3. How many calls can be answered by 30 technicians in a 30-minute time interval, on average?
4. Because of growth in the business, management anticipates an increase in call volume. As many as 140 calls are expected within a 30-minute time interval in the middle of the day. How many technicians are needed to answer that many calls? What assumptions must you make to predict this?
5. What is the R-Sq value?
[Regression plot of CallsAnswd versus Staff: CallsAnswd = 7.42510 + 2.32708 Staff; S = 7.52932, R-Sq = 87.3%, R-Sq(adj) = 87.2%.]

Answers: Make a Plot With a Regression Line
1. Intercept = 7.43; slope = 2.33.
2. For each additional technician, about 2.3 more calls can be answered on average over a 30-minute interval.
3. 30 technicians can answer about 77 calls on average (7.43 + 2.33 x 30).
4. Solve for X in the regression equation: 140 = 7.43 + 2.33X, so X ≈ 57. If the linear relationship continues to hold beyond the range of the data shown, about 57 technicians will be needed to answer 140 calls in a 30-minute interval.
5. R-Sq = 87.3% (see the next page for more discussion).
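The arithmetic in answers 3 and 4 amounts to evaluating the fitted equation in each direction. A two-function Python sketch using the intercept and slope from the regression plot above (the helper names are illustrative):

```python
b0, b1 = 7.43, 2.33  # intercept and slope from the fitted line plot

def calls_answered(staff):
    """Predict Y (calls answered per 30 minutes) from X (staff)."""
    return b0 + b1 * staff

def staff_needed(calls):
    """Inverse prediction: solve Y = b0 + b1*X for X."""
    return (calls - b0) / b1

print(calls_answered(30))  # about 77 calls (within the observed data)
print(staff_needed(140))   # about 57 technicians (an extrapolation!)
```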

R-Squared (R-Sq or R²): The % Explained Variation
R-squared measures the percent of variation in the Y values that is explained by the linear relationship with X. It ranges from 0 to 1 (0% to 100%):
R-Sq = (explained variation / total variation) x 100%
[Plot illustrating explained variation: the distance from the mean of Y to the fitted line represents the explained variation; the remaining variation is unexplained and presumed to result from common causes.]

Discussion: Interpret R-Squared (R²)
1. What R-Sq value did you get for the data on the number of calls answered?
2. What does it mean?
3. How sure do you feel about the prediction for the number of calls answered by 30 technicians?

Answers: Interpret R-Squared (R²)
1. What R-Sq value did you get for the data on the number of calls answered? 87.3%.
2. What does it mean? That 87% of the variation in the number of calls answered is explained by the number of technicians answering the calls; about 13% of the variation is unexplained.
3. How sure do you feel about the prediction for the number of calls answered by 30 technicians? Since 30 technicians is within the range of data studied (we do not have to extrapolate), and since R² is relatively large, we can be reasonably comfortable with the prediction for the number of calls that can be answered.

Summary
In this module you have learned about:
- Correlation
- The terminology of regression
- Simple linear regression