10/8/18 Correlation and Regression Analysis Correlation Analysis is the study of the relationship between variables. It is also defined as group of techniques to measure the association between two variables. Regression Analysis is a technique used to express the relationship between two variables. If the relationship is assumed to be a straight line, this is called linear regression. Linear Regression and Correlation Dr. Rick Jerz 1 Three Questions Correlation and Linear Regression 1. Are two variables related? (correlation analysis). Is there a linear relationship between two variables? (linear regression analysis) 3. How strong are these relationships? 3 4 Correlation Correlation & Regression Example The sales manager of Copier Sales of America, which has a large sales force throughout the United States and Canada, wants to determine whether there is a relationship between the number of sales calls made in a month and the number of copiers sold that month. The manager selects a random sample of 10 representatives and determines the number of sales calls each representative made last month and the number of copiers sold. 5 6 1
Step 1: Look at the Data (Plot the Data) A Scatter Diagram is a chart that portrays the relationship between the two variables. It is the usual first step in correlation analysis The Dependent Variable is the variable being predicted or estimated. The Independent Variable provides the basis for estimation. It is the predictor variable. Step : Are they correlated? 7 8 The Coefficient of Correlation, r The Coefficient of Correlation (r) is a measure of the strength of the relationship between two variables. It requires interval or ratioscaled data. It can range from -1.00 to 1.00. Values of -1.00 or 1.00 indicate perfect and strong correlation. Values close to 0.0 indicate weak correlation. Negative values indicate an inverse relationship and positive values indicate a direct relationship. Correlation Coefficient Interpretation 9 10 Correlation Coefficient Equation, r Coefficient of Determination The coefficient of determination (r ) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X). It is the square of the coefficient of correlation. It ranges from 0 to 1. It does not give any information on the direction of the relationship between the variables. 11 1
Example: Correlation Coefficient How do we interpret a correlation of 0.759? First, it is positive, so we see there is a direct relationship between the number of sales calls and the number of copiers sold. The value of 0.759 is fairly close to 1.00, so we conclude that the association is strong. However, does this mean that more sales calls cause more sales? No, we have not demonstrated cause and effect here, only that the two variables sales calls and copiers sold are related. 13 Coefficient of Determination (r ) - Example The coefficient of determination, r,is 0.576, found by (0.759) This is a proportion or a percent; we can say that 57.6 percent of the variation in the number of copiers sold is explained, or accounted for, by the variation in the number of sales calls. 14 Step 3: How strong are they correlated? Testing the Significance of H 0 : r = 0 (the correlation in the population is 0) H 1 : r 0 (the correlation in the population is not 0) Reject H 0 if: t > t a/,n- or t < -t a/,n- 15 16 Example: Testing the Significance of H 0: r = 0 (the correlation in the population is 0) H 1: r 0 (the correlation in the population is not 0) Reject H 0 if: t > t a/,n- or t < -t a/,n- t > t 0.05,8 or t < -t 0.05,8 t >.306 or t < -.306 Testing the Significance of Equivalently, you can calculate the critical value for the correlation coefficient using This method gives a benchmark for the correlation coefficient. However, there is no p-value and is inflexible if you change your mind about α. 17 18 3
Testing Relationship with r Testing Relationship with r Step 1: State the Hypotheses Determine whether you are using a one or two-tailed test and the level of significance (a). H0 : ρ = 0 H 1 : ρ 0 Step : Calculate the Critical Value For degrees of freedom n = n -, determine t α then calculate Make the Decision If the sample correlation coefficient r exceeds the critical value r α, then reject H 0. If using the t statistic method, reject H 0 if t > t α or if the p-value α. 19 0 Ex: Testing the Significance of Step 4: Is there a linear relationship? Regression The computed t (3.97) is within the rejection region, therefore, we will reject H0. This means the correlation in the population is not zero. From a practical standpoint, it indicates to the sales manager that there is correlation with respect to the number of sales calls made and the number of copiers sold in the population of salespeople. 1 Example: Robot Repeatability Data 3 4 4
Bivariate Regression Analysis Linear Regression Bivariate Regression analyzes the relationship between two variables. It specifies one dependent (response) variable and one independent (predictor) variable. This hypothesized relationship may be linear, quadratic, or whatever. Unknown parameters are β 0 Intercept β 1 Slope The assumed model for a linear relationship is y i = β 0 + β 1x i + e i, for all observations (i = 1,,, n) The error term is not observable, is assumed normally distributed with mean of 0 and standard deviation σ. 5 6 Linear Regression Model Y ˆ = a + bx Differences between Actual Y Estimated Y Y Hat, is the estimate of Y given X a is the Y-intercept b is the slope of the line X is any value of the independent variable 7 8 Least Squares Principle Determining a regression equation by minimizing the sum of the squares (the variance) of the vertical distances between the actual Y values and the predicted values of Y. Calculating a and b (least squares) b = å XY å X - nxy - nx a = Y - bx 9 30 5
Example: Finding the Regression Equation Step 5: Plot the Estimated and the Actual Y s The regression equation is: Y = a + bx Y = 18.9476 + 1.184X Y = 18.9476 + 1.184(0) 31 Y = 4.6316 3 Computing the Estimates of Y Step 6: Predict other values Step 1 Using the regression equation, substitute the value of each X to solve for the estimated sales The regression equation is: Y = a + bx Y = 18.9476 + 1.184X Y = 18.9476 + 1.184(0) Tom Keller Y = 18.9476 + 1.184X Y = 18.9476 + 1.184(0) Soni Jones Y = 18.9476 + 1.184X Y = 18.9476 + 1.184(30) Y = 4.6316 33 Y = 4.6316 Y = 54.4736 34 Assumptions Underlying Linear Regression Assumptions: Graphic For each value of X, there is a group of Y values, and these Y values are normally distributed. The means of these normal distributions of Y values all lie on the straight line of regression. The standard deviations of these normal distributions are equal. The Y values are statistically independent. This means that in the selection of a sample, the Y values chosen for a particular X value do not depend on the Y values for any other X values. 35 36 6
Confidence Intervals for Slope and Intercept Confidence interval for the true slope: Ex: Confidence Interval Estimate Step 4 Use the formula above by substituting the numbers computed in previous slides Confidence interval for the true intercept: Thus, the 95 percent confidence interval for the average sales of all sales representatives who make 5 calls is from 40.9170 up to 56.188 copiers. 37 38 Ex: Prediction Interval Estimate Step Using the information computed earlier in the confidence interval estimation example, use the formula above. 39 If Sheila Baker makes 5 sales calls, the number of copiers she will sell will be between about 4 and 73 copiers 7