Chapter 11. Correlation and Regression

Correlation

A correlation is a relationship between two variables. The data can be represented by ordered pairs (x, y), where x is the independent (or explanatory) variable and y is the dependent (or response) variable.

A scatter plot can be used to determine whether a linear (straight-line) correlation exists between two variables.

Example:

x    1    2    3    4    5
y   -4   -2   -1    0    2

[Figure: scatter plot of the five ordered pairs.]

Types of Correlation

Negative linear correlation: as x increases, y tends to decrease.
Positive linear correlation: as x increases, y tends to increase.
No correlation.
Nonlinear correlation.

Example: Constructing a Scatter Plot

An economist wants to determine whether there is a linear relationship between a country's gross domestic product (GDP) and carbon dioxide (CO2) emissions. The data are shown in the table. Display the data in a scatter plot and determine whether there appears to be a positive or negative linear correlation or no linear correlation. (Source: World Bank and U.S. Energy Information Administration)

GDP (trillions of $), x    CO2 emissions (millions of metric tons), y
1.6                        428.2
3.6                        828.8
4.9                        1214.2
1.1                        444.6
0.9                        264.0
2.9                        415.3
2.7                        571.8
2.3                        454.9
1.6                        358.7
1.5                        573.5

Solution: Constructing a Scatter Plot

[Figure: scatter plot of CO2 emissions against GDP.]

There appears to be a positive linear correlation. As the gross domestic products increase, the carbon dioxide emissions tend to increase.

Example: Constructing a Scatter Plot Using Technology

Old Faithful, located in Yellowstone National Park, is the world's most famous geyser. The duration (in minutes) of several of Old Faithful's eruptions and the times (in minutes) until the next eruption are shown in the table. Using a TI-83/84, display the data in a scatter plot. Determine the type of correlation.

Duration, x    Time, y        Duration, x    Time, y
1.80           56             3.78           79
1.82           58             3.83           85
1.90           62             3.88           80
1.93           56             4.10           89
1.98           57             4.27           90
2.05           57             4.30           89
2.13           60             4.43           89
2.30           57             4.47           86
2.37           61             4.53           89
2.82           73             4.55           86
3.13           76             4.60           92
3.27           77             4.63           91
3.65           77

Solution: Enter the x-values into list L1 and the y-values into list L2 (STAT > Edit). Use Stat Plot (STATPLOT) to construct the scatter plot.

[Calculator screen: scatter plot in the viewing window 1 ≤ x ≤ 5, 50 ≤ y ≤ 100.]

From the scatter plot, it appears that the variables have a positive linear correlation.

Correlation Coefficient

The correlation coefficient is a measure of the strength and the direction of a linear relationship between two variables. The symbol r represents the sample correlation coefficient. A formula for r is

$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} $$

where n is the number of data pairs. The population correlation coefficient is represented by ρ (rho).

The range of the correlation coefficient is -1 to 1.

If r = -1, there is a perfect negative correlation.
If r is close to 0, there is no linear correlation.
If r = 1, there is a perfect positive correlation.
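The summation formula for r can be evaluated directly. The following is a minimal Python sketch, not part of the original slides; the function name correlation_coefficient is only illustrative, and the data are the GDP and CO2 values from the scatter plot example.

```python
from math import sqrt

def correlation_coefficient(xs, ys):
    """Sample correlation coefficient r, computed from the summation formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# GDP (trillions of $) and CO2 emissions (millions of metric tons)
gdp = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
co2 = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]

r = correlation_coefficient(gdp, co2)
print(f"r = {r:.3f}")
```

A value of r near 1 here would agree with the positive linear pattern seen in the scatter plot.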

Linear Correlation

[Figure: four scatter plots with their correlation coefficients.]

r = -0.91: strong negative correlation
r = 0.88: strong positive correlation
r = 0.42: weak positive correlation
r = 0.07: nonlinear correlation

Using Technology to Find a Correlation Coefficient

Use a technology tool to calculate the correlation coefficient for the Old Faithful data. What can you conclude?

To calculate r on a TI-83/84, enter the LinRegTTest command found in the Tests menu (STAT > TESTS).

r ≈ 0.979 suggests a strong positive correlation.

Using a Table to Test a Population Correlation Coefficient ρ

Once the sample correlation coefficient r has been calculated, we need to determine whether there is enough evidence to decide that the population correlation coefficient ρ is significant at a specified level of significance. Use the table of critical values for the correlation coefficient: find the row for the number of pairs of data in the sample, n, and the column for the level of significance, α. If |r| is greater than the critical value, there is enough evidence to decide that the correlation coefficient ρ is significant.

Example: Determine whether ρ is significant for five pairs of data (n = 5) at a level of significance of α = 0.01.

The critical value for n = 5 and α = 0.01 is 0.959. If |r| > 0.959, the correlation is significant. Otherwise, there is not enough evidence to conclude that the correlation is significant.

Example: Using the Old Faithful data, you used 25 pairs of data to find r ≈ 0.979. Is the correlation coefficient significant? Use α = 0.05.

Solution: For n = 25 and α = 0.05, the critical value is 0.396. Because |r| ≈ 0.979 > 0.396, there is enough evidence at the 5% level of significance to conclude that there is a significant linear correlation between the duration of Old Faithful's eruptions and the time between eruptions.
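The table lookup amounts to comparing |r| with one critical value. The sketch below is illustrative only: CRITICAL_VALUES holds just the two table entries quoted in the examples (a real table has many more rows), and is_significant is a hypothetical helper name.

```python
# Excerpt of the critical-value table, keyed by (n, alpha).
# Only the two entries quoted in the examples are included.
CRITICAL_VALUES = {
    (5, 0.01): 0.959,
    (25, 0.05): 0.396,
}

def is_significant(r, n, alpha):
    """Return True when |r| exceeds the tabulated critical value."""
    return abs(r) > CRITICAL_VALUES[(n, alpha)]

# Old Faithful: r is approximately 0.979 from 25 data pairs, tested at alpha = 0.05
old_faithful_significant = is_significant(0.979, 25, 0.05)
print("Old Faithful correlation significant:", old_faithful_significant)
```

Taking the absolute value means a strong negative correlation is judged against the same critical value as a strong positive one.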

What the VALUE of r tells us:

The value of r is always between -1 and +1: -1 ≤ r ≤ 1.
The size of the correlation r indicates the strength of the linear relationship between x and y. Values of r close to -1 or to +1 indicate a stronger linear relationship between x and y.
If r = 0, there is absolutely no linear relationship between x and y (no linear correlation).
If r = 1, there is perfect positive correlation. If r = -1, there is perfect negative correlation. In both these cases, all of the original data points lie on a straight line. Of course, in the real world, this will not generally happen.

What the SIGN of r tells us:

A positive value of r means that when x increases, y tends to increase, and when x decreases, y tends to decrease (positive correlation).
A negative value of r means that when x increases, y tends to decrease, and when x decreases, y tends to increase (negative correlation).
The sign of r is the same as the sign of the slope of the best fit line.

The Coefficient of Determination

r^2 is called the coefficient of determination. r^2 is the square of the correlation coefficient, but it is usually stated as a percent rather than in decimal form. r^2 has an interpretation in the context of the data:

r^2, when expressed as a percent, represents the percent of variation in the dependent variable y that can be explained by variation in the independent variable x using the regression (best fit) line.
1 - r^2, when expressed as a percent, represents the percent of variation in y that is NOT explained by variation in x using the regression line. This can be seen as the scattering of the observed data points about the regression line.
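As a small worked illustration of the coefficient of determination, the sketch below converts the Old Faithful value r ≈ 0.979 into explained and unexplained percentages. The function name explained_variation is only illustrative.

```python
def explained_variation(r):
    """Return (percent explained, percent unexplained) based on r squared."""
    r_squared = r ** 2
    return 100 * r_squared, 100 * (1 - r_squared)

# Old Faithful: r is approximately 0.979
explained, unexplained = explained_variation(0.979)
print(f"{explained:.1f}% of the variation in y is explained by x")
print(f"{unexplained:.1f}% is not explained")
```

Under this reading, most of the variation in the time until the next eruption is accounted for by eruption duration, with only a small share left as scatter about the regression line.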

Correlation and Causation

The fact that two variables are strongly correlated does not in itself imply a cause-and-effect relationship between the variables. If there is a significant correlation between two variables, you should consider the following possibilities.

Is there a direct cause-and-effect relationship between the variables? Does x cause y?
Is there a reverse cause-and-effect relationship between the variables? Does y cause x?
Is it possible that the relationship between the variables can be caused by a third variable or by a combination of several other variables?
Is it possible that the relationship between two variables may be a coincidence?

Linear Regression: Regression Lines

After verifying that the linear correlation between two variables is significant, next we determine the equation of the line that best models the data (the regression line). The regression line can be used to predict the value of y for a given value of x.

Residual

The difference between the observed y-value and the predicted y-value for a given x-value on the line. For a given x-value,

$$ d_i = (\text{observed } y\text{-value}) - (\text{predicted } y\text{-value}) $$

[Figure: scatter plot with a regression line; the residuals d_1 through d_6 are the vertical segments between each observed y-value and the predicted y-value on the line.]

Regression line (line of best fit)

The line for which the sum of the squares of the residuals is a minimum. The equation of a regression line for an independent variable x and a dependent variable y is

$$ \hat{y} = mx + b $$

where ŷ is the predicted y-value for a given x-value, m is the slope, and b is the y-intercept:

$$ m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}, \qquad b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n} $$

Example: Using Technology to Find a Regression Equation

Use a technology tool to find the equation of the regression line for the Old Faithful data (the 25 pairs of duration x and time y from the scatter plot example).

Solution: ŷ = 12.481x + 33.683

[Calculator screen: scatter plot and regression line in the viewing window 1 ≤ x ≤ 5, 50 ≤ y ≤ 100.]
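The slope and intercept formulas can be checked without a calculator. The following Python sketch, which is not part of the original slides, applies them to the Old Faithful data; fit_line is an illustrative name.

```python
def fit_line(xs, ys):
    """Least-squares slope m and y-intercept b from the summation formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n
    return m, b

# Old Faithful: eruption duration (min) and time until the next eruption (min)
duration = [1.80, 1.82, 1.90, 1.93, 1.98, 2.05, 2.13, 2.30, 2.37, 2.82,
            3.13, 3.27, 3.65, 3.78, 3.83, 3.88, 4.10, 4.27, 4.30, 4.43,
            4.47, 4.53, 4.55, 4.60, 4.63]
time = [56, 58, 62, 56, 57, 57, 60, 57, 61, 73,
        76, 77, 77, 79, 85, 80, 89, 90, 89, 89,
        86, 89, 86, 92, 91]

m, b = fit_line(duration, time)
# Residual = observed y-value minus predicted y-value
residuals = [y - (m * x + b) for x, y in zip(duration, time)]
print(f"y-hat = {m:.3f}x + {b:.3f}")
```

For a least-squares line with an intercept, the residuals sum to zero up to rounding, which is a quick sanity check on the fit.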

Example: Finding the Equation of a Regression Line

Find the equation of the regression line for the gross domestic products and carbon dioxide emissions data (the ten pairs of GDP x, in trillions of $, and CO2 emissions y, in millions of metric tons, from the scatter plot example).

Solution: ŷ = 196.152x + 102.289

To sketch the regression line, use any two x-values within the range of the data and calculate the corresponding y-values from the regression line.

Use this equation to predict the expected carbon dioxide emissions for the following gross domestic products. (Recall from section 9.1 that x and y have a significant linear correlation.)

1. 1.2 trillion dollars: ŷ = 196.152(1.2) + 102.289 ≈ 337.671
2. 2.0 trillion dollars: ŷ = 196.152(2.0) + 102.289 = 494.593
3. 2.5 trillion dollars: ŷ = 196.152(2.5) + 102.289 = 592.669

Prediction values are meaningful only for x-values in (or close to) the range of the data. The x-values in the original data set range from 0.9 to 4.9. So, it would not be appropriate to use the regression line to predict carbon dioxide emissions for gross domestic products such as $0.2 trillion or $14.5 trillion.
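The three predictions and the warning about extrapolation can be sketched together in Python. The function name predict_co2 and the strict range check are assumptions made for illustration; the text allows x-values "in (or close to)" the data range, so rejecting everything outside 0.9 to 4.9 is a deliberate simplification.

```python
def predict_co2(gdp, slope=196.152, intercept=102.289, x_min=0.9, x_max=4.9):
    """Predict CO2 emissions (millions of metric tons) from GDP (trillions of $).

    Refuses to extrapolate outside the range of GDP values in the data set.
    """
    if not x_min <= gdp <= x_max:
        raise ValueError(f"GDP {gdp} is outside the data range [{x_min}, {x_max}]")
    return slope * gdp + intercept

for gdp in (1.2, 2.0, 2.5):
    print(f"GDP {gdp}: predicted CO2 = {predict_co2(gdp):.3f}")
```

Calling predict_co2(14.5) raises a ValueError instead of returning a number that the data cannot support.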

Outliers

In some data sets, there are values (observed data points) called outliers. Outliers are observed data points that are far from the least squares line. They have large "errors", where the "error" or residual is the vertical distance from the line to the point.

Outliers need to be examined closely. Sometimes, for some reason or another, they should not be included in the analysis of the data. It is possible that an outlier is a result of erroneous data. Other times, an outlier may hold valuable information about the population under study and should remain included in the data. The key is to carefully examine what causes a data point to be an outlier.

Besides outliers, a sample may contain one or a few points that are called influential points. Influential points are observed data points that are far from the other observed data points in the horizontal direction. These points may have a big effect on the slope of the regression line. To begin to identify an influential point, you can remove it from the data set and see if the slope of the regression line is changed significantly.

Identifying Outliers

We could guess at outliers by looking at a graph of the scatterplot and best fit line. However, we would like some guideline as to how far away a point needs to be in order to be considered an outlier. As a rough rule of thumb, we can flag any point that is located farther than two standard deviations above or below the best fit line as an outlier. The standard deviation used is the standard deviation of the residuals or errors.

We can do this visually in the scatterplot by drawing an extra pair of lines that are two standard deviations above and below the best fit line. Any data points that are outside this extra pair of lines are flagged as potential outliers. Or we can do this numerically by calculating each residual and comparing it to twice the standard deviation.

A random sample of 11 statistics students produced the following data, where x is the third exam score, out of 80, and y is the final exam score, out of 200. Can you predict the final exam score of a random student if you know the third exam score?

x (third exam score)    y (final exam score)
65                      175
67                      133
71                      185
71                      163
66                      126
75                      198
67                      153
70                      163
71                      159
69                      151
69                      159

The least squares regression line (best fit line) for the third exam/final exam example has the equation:

ŷ = -173.51 + 4.83x

In this example, you can determine if there is an outlier or not. If there is an outlier, as an exercise, delete it and fit the remaining data to a new line. For this example, the new line ought to fit the remaining data better. This means the variation should be smaller and the correlation coefficient ought to be closer to 1 or -1.

Using the LinRegTTest with this data, scroll down through the output screens to find s = 16.412.

The two extra lines are Y2 = -173.5 + 4.83x - 2(16.4) and Y3 = -173.5 + 4.83x + 2(16.4), where ŷ = -173.5 + 4.83x is the line of best fit. Y2 and Y3 have the same slope as the line of best fit.

Graph the scatterplot with the best fit line in equation Y1, then enter the two extra lines as Y2 and Y3 in the "Y=" equation editor and press ZOOM 9. You will find that the only data point that is not between lines Y2 and Y3 is the point x = 65, y = 175. On the calculator screen it is just barely outside these lines. The outlier is the student who had a grade of 65 on the third exam and 175 on the final exam; this point is farther than 2 standard deviations away from the best fit line.

Numerical Identification of Outliers

In the table below, the first two columns are the third exam and final exam data. The third column shows the predicted ŷ values calculated from the line of best fit: ŷ = -173.5 + 4.83x. The residuals, or errors, have been calculated in the fourth column of the table: observed y-value - predicted y-value = y - ŷ.

s is the standard deviation of all the y - ŷ values, where n is the total number of data points.

x     y      ŷ      y - ŷ
65    175    140    175 - 140 = 35
67    133    150    133 - 150 = -17
71    185    169    185 - 169 = 16
71    163    169    163 - 169 = -6
66    126    145    126 - 145 = -19
75    198    189    198 - 189 = 9
67    153    150    153 - 150 = 3
70    163    164    163 - 164 = -1
71    159    169    159 - 169 = -10
69    151    160    151 - 160 = -9
69    159    160    159 - 160 = -1

For this example, the calculator function LinRegTTest found s = 16.4 as the standard deviation of the residuals 35; -17; 16; -6; -19; 9; 3; -1; -10; -9; -1.

We are looking for all data points for which the residual is greater than 2s = 2(16.4) = 32.8 or less than -32.8. Compare these values to the residuals in column 4 of the table. The only such data point is the student who had a grade of 65 on the third exam and 175 on the final exam; the residual for this student is 35.
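The numerical check can be sketched in Python. One assumption is made explicit here: s is computed as the square root of SSE / (n - 2), the standard error of estimate, which reproduces the s = 16.412 reported by LinRegTTest. The names fit_line and flag_outliers are illustrative, and residuals are computed from the unrounded line, so they differ slightly from the rounded table values.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    return slope, sum_y / n - slope * sum_x / n

def flag_outliers(xs, ys):
    """Return s and the points whose residual is more than 2s from the line."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    sse = sum(e * e for e in residuals)
    s = sqrt(sse / (len(xs) - 2))  # assumption: n - 2 in the denominator
    flagged = [(x, y) for x, y, e in zip(xs, ys, residuals) if abs(e) > 2 * s]
    return s, flagged

third_exam = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
final_exam = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

s, outliers = flag_outliers(third_exam, final_exam)
print(f"s = {s:.3f}, potential outliers: {outliers}")
```

The flagged points are only candidates; whether to keep or remove one still depends on examining the underlying data.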

How does the outlier affect the best fit line?

Numerically and graphically, we have identified the point (65, 175) as an outlier. We should re-examine the data for this point to see if there are any problems with the data. If there is an error, we should fix the error if possible, or delete the data. If the data is correct, we would leave it in the data set. For this problem, we will suppose that we examined the data and found that this outlier data was an error. Therefore we will continue on and delete the outlier, so that we can explore how it affects the results, as a learning experience.

Compute a new best-fit line and correlation coefficient using the 10 remaining points: on the TI-83, TI-83+, and TI-84+ calculators, delete the outlier from L1 and L2. Using the LinRegTTest, the new line of best fit and the correlation coefficient are:

ŷ = -355.19 + 7.39x and r = 0.9121

The new line with r = 0.9121 is a stronger correlation than the original (r = 0.6631) because r = 0.9121 is closer to 1. This means that the new line is a better fit to the 10 remaining data values. The line can better predict the final exam score given the third exam score.

EXAMPLE: Using this new line of best fit (based on the remaining 10 data points), what would a student who receives a 73 on the third exam expect to receive on the final exam? Is this the same as the prediction made using the original line?

SOLUTION: Using the new line of best fit, ŷ = -355.19 + 7.39(73) = 184.28. A student who scored 73 points on the third exam would expect to earn 184 points on the final exam. The original line predicted ŷ = -173.51 + 4.83(73) = 179.08, so the prediction using the new line with the outlier eliminated differs from the original prediction.
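The effect of deleting the outlier can be reproduced with a short Python sketch. The helper name fit_with_r is illustrative. Because the code keeps unrounded coefficients, its predictions at x = 73 differ slightly from the hand calculations above, which use coefficients rounded to two decimal places.

```python
from math import sqrt

def fit_with_r(points):
    """Return slope m, intercept b, and correlation r for a list of (x, y) pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_y2 = sum(y * y for _, y in points)
    sxy = n * sum_xy - sum_x * sum_y
    sxx = n * sum_x2 - sum_x ** 2
    syy = n * sum_y2 - sum_y ** 2
    m = sxy / sxx
    b = sum_y / n - m * sum_x / n
    return m, b, sxy / sqrt(sxx * syy)

exam_scores = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
               (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]
without_outlier = [p for p in exam_scores if p != (65, 175)]

m_all, b_all, r_all = fit_with_r(exam_scores)
m_new, b_new, r_new = fit_with_r(without_outlier)
pred_all = m_all * 73 + b_all
pred_new = m_new * 73 + b_new

print(f"all 11 points:   y-hat = {b_all:.2f} + {m_all:.2f}x, r = {r_all:.4f}")
print(f"outlier removed: y-hat = {b_new:.2f} + {m_new:.2f}x, r = {r_new:.4f}")
print(f"predictions at x = 73: {pred_all:.1f} (original), {pred_new:.1f} (new)")
```

Comparing the two fits side by side shows how much a single point can move both the slope and the correlation coefficient in a sample this small.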