Chapter 6 Scatterplots, Association and Correlation
Looking for Correlation

Example: Does the number of hours you watch TV per week affect your average grade in a class?

Hours   12   10    5    3   15   16    8
Grade   70   85   82   88   65   75   68

To see whether there is a relationship, we will create a scatterplot and analyze it.

Definition: A scatterplot is a graphical representation of the relationship between two quantitative variables. The two values may come from the same individual (e.g., education v. income, height v. weight) or from paired individuals (e.g., ages of partners in a relationship).
Scatterplots

When working with scatterplots, there are two variables, and they may play two different roles.

Definition: A response variable measures the outcome of a study.

Definition: An explanatory variable may explain or influence changes in a response variable.

Explanatory variables are often called independent variables and go on the x-axis. Response variables are often called dependent variables and go on the y-axis.
Back to Our Example

In our example, which is the explanatory variable? The hours of TV watched. The response variable is therefore the average grade. So the question we are trying to answer is: does watching TV influence the average grade in a class?

Let's plot the data and see what we have.
The Scatterplot

[Figure: scatterplot "Grades v. Hours of TV"; x-axis: Hours of TV, y-axis: Grade]
How Does the Relationship Look?

What do we think? It looks like the more hours of TV that are watched, the lower the average grade. But how strong is the relationship? We can describe it in two ways: by direction (+ or -) and by strength. Both are captured by the correlation coefficient.
Facts About Correlation Coefficients

1. -1 ≤ r ≤ 1. The weakest correlation is 0 and the strongest is ±1. The sign of r only tells us the direction of the relationship: whether y increases as x increases or y decreases as x increases. Being negative is not bad.
2. Correlation makes no distinction between x and y, that is, between the choice of explanatory and response variables. We still need to choose carefully, though, because the next topic (the regression line) depends heavily on the correct choice.
3. Correlation measures only the linear relationship.
4. Correlation is not resistant: outliers can change it substantially.
5. Correlation has no units.
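Facts 2, 4, and 5 can be checked numerically. The following is a minimal sketch using the TV data from this chapter; the helper name `correlation` and the extra outlier point (40 hours, grade 100) are made up for illustration and are not part of the original data.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

hours = [12, 10, 5, 3, 15, 16, 8]
grades = [70, 85, 82, 88, 65, 75, 68]

r = correlation(hours, grades)

# Fact 2: swapping the roles of x and y does not change r.
r_swapped = correlation(grades, hours)

# Fact 5: changing units (hours -> minutes) does not change r.
minutes = [60 * h for h in hours]
r_minutes = correlation(minutes, grades)

# Fact 4: a single hypothetical outlier (40 hours, grade 100) changes r a lot.
r_outlier = correlation(hours + [40], grades + [100])

print(f"r = {r:.4f}, swapped = {r_swapped:.4f}, minutes = {r_minutes:.4f}")
print(f"with the made-up outlier: r = {r_outlier:.4f}")
```

With this particular made-up point, the sign of r flips from negative to positive, which is what "not resistant" means in practice.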
So How Do We Find This Correlation Coefficient?

The Correlation Coefficient

\[ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{S_x} \right) \left( \frac{y_i - \bar{y}}{S_y} \right) = \frac{1}{n-1} \sum z_x z_y \]

Let's find the correlation coefficient for our example. First, we need a few values: $\bar{x}$, $\bar{y}$, $S_x$, and $S_y$.

$\bar{x} = 9.857$    $\bar{y} = 76.143$    $S_x = 4.880$    $S_y = 8.971$
Finding the Correlation Coefficient

For each pair, find the z-score of each value, then multiply the two z-scores together. After summing the products, divide by n - 1.

i       z_x       z_y    product
1     .4391    -.6848     -.3007
2     .0293     .9873      .0289
3    -.9953     .6529     -.6498
4   -1.4050    1.3217    -1.8570
5    1.0539   -1.2421    -1.3090
6    1.2588    -.1274     -.1604
7    -.3805    -.9077      .3454
                  sum    -3.9026

r = (1/6)(-3.9026) = -.6504

Interpretation: moderate negative correlation.
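The hand computation above can be reproduced in a few lines. This is a sketch, not the method used to build the table: it carries full precision rather than rounded z-scores, so the sum of products and r may differ from the table in the last decimal place.

```python
from statistics import mean, stdev

hours = [12, 10, 5, 3, 15, 16, 8]
grades = [70, 85, 82, 88, 65, 75, 68]
n = len(hours)

# Means and sample standard deviations (stdev divides by n - 1).
x_bar, y_bar = mean(hours), mean(grades)
s_x, s_y = stdev(hours), stdev(grades)

# z-score of each value, then the product for each pair.
z_x = [(x - x_bar) / s_x for x in hours]
z_y = [(y - y_bar) / s_y for y in grades]
products = [zx * zy for zx, zy in zip(z_x, z_y)]

# Sum the products and divide by n - 1.
r = sum(products) / (n - 1)

for i, (zx, zy, p) in enumerate(zip(z_x, z_y, products), start=1):
    print(f"{i}  {zx:8.4f}  {zy:8.4f}  {p:8.4f}")
print(f"sum of products = {sum(products):.4f}")
print(f"r = {r:.4f}")
```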
So Can We Say There Is a Relationship?

So, can we say that there is a direct relationship between the number of hours of TV watched and the average grade? Not so fast...

Correlation does not necessarily imply causation.

Just because the data look the part does not mean we have evidence of a direct relationship. We have to consider a couple of other things. One is lurking variables: variables that may affect the relationship but are not included in the data.

Can you think of any lurking variables that would affect our example?
Significance

We also need to test for significance.

If |r|√n > 3, the correlation is significant. Otherwise, it is not significant.

The smaller this value, the less likely it is that the correlation is significant.

Reasons why a correlation may not be significant:
1. Genuine lack of correlation
2. Not enough data

Our example is not significant because of the quantity of data (n = 7). So we cannot conclude that watching TV has a direct impact on grades.
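A minimal sketch of this rule of thumb, assuming it is read as |r| · √n > 3. That reading is an assumption, as are the function name `is_significant` and the hypothetical sample size of 30 used for comparison.

```python
from math import sqrt

def is_significant(r, n, threshold=3.0):
    """Rule of thumb, ASSUMED to read |r| * sqrt(n) > threshold."""
    return abs(r) * sqrt(n) > threshold

# TV example from this chapter: r = -.6504 with n = 7 pairs.
r, n = -0.6504, 7
value = abs(r) * sqrt(n)
print(f"|r| * sqrt(n) = {value:.2f}; significant: {is_significant(r, n)}")

# Same r with a hypothetical larger sample, to show the effect of n.
print(f"n = 30: significant: {is_significant(r, 30)}")
```

Under this reading, the same r would pass the threshold with more data, which matches the "not enough data" explanation for our example.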
Assumptions and Conditions for Correlation

Quantitative Variables Condition: Don't make the common error of calling an association involving a categorical variable a correlation. Correlation is only about quantitative variables.

Straight Enough Condition: The best check for the assumption that the variables are truly linearly related is to look at the scatterplot to see whether it looks reasonably straight. That's a judgment call, but not a difficult one.

No Outliers Condition: Outliers can distort the correlation dramatically, making a weak association look strong or a strong one look weak. Outliers can even change the sign of the correlation. But it's easy to see outliers in the scatterplot, so to check this condition, just look.
Another Example

Example: The following table gives the power numbers for the starting nine of the 2007 Boston Red Sox. Is there a relationship between the number of home runs and the number of RBIs? Does the number of home runs affect the number of RBIs? Produce a scatterplot and discuss the correlation.

Player     Home Runs   RBIs
Varitek        17        68
Youkilis       16        83
Pedroia         8        50
Lowell         21       120
Lugo            8        73
Ramirez        20        88
Crisp           6        60
Drew           11        64
Ortiz          35       117
Red Sox Example

Which is the explanatory variable? Which is the response variable?

Since we are asking whether home runs affect RBIs, home runs is the explanatory variable and therefore x. RBIs is the response variable, y.

[Figure: scatterplot "2007 Red Sox Power Numbers"; x-axis: Home Runs, y-axis: RBIs]
Before We Go On

Something to notice: we have two points with the same x-coordinate (Pedroia and Lugo each hit 8 home runs).

[Figure: scatterplot "2007 Red Sox Power Numbers"; x-axis: Home Runs, y-axis: RBIs]
Finding the Correlation Coefficient

What is our guess as to the correlation?

Now let's find the correlation coefficient. But there must be an easier way than computing by hand... and that way is technology.

- Input the data in the usual way, with the explanatory variable under L1 and the response variable under L2.
- Press STAT and scroll to TESTS.
- Select LinRegTTest.
- Make sure the XList and YList are the lists where the data for the explanatory and response variables are located, respectively.
- Press Calculate and scroll to find r and r².
Using Technology

For our example, we have

r = .8463    r² = .7162

So the correlation coefficient tells us that there is a strong positive correlation.
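For readers without the calculator, the same two numbers can be obtained with a short script. This is a sketch using an algebraically equivalent form of the z-score formula (the n - 1 factors cancel); the helper name `correlation` is an assumption, not something from the slides.

```python
from math import sqrt

# 2007 Red Sox data: home runs (explanatory, x) and RBIs (response, y).
home_runs = [17, 16, 8, 21, 8, 20, 6, 11, 35]
rbis = [68, 83, 50, 120, 73, 88, 60, 64, 117]

def correlation(xs, ys):
    """r = S_xy / sqrt(S_xx * S_yy), equivalent to the z-score formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

r = correlation(home_runs, rbis)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```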
What Does r² Tell Us?

r² tells us how much better our predictions will be if we go through the trouble of finding the regression line rather than just making our predictions with the means. Ours is pretty good here, indicating that we should find the regression line.

[Figure: scatterplot "2007 Red Sox Power Numbers"; x-axis: Home Runs, y-axis: RBIs]
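One way to make "how much better" concrete is to compare squared prediction errors. The sketch below assumes the standard least-squares line (slope S_xy / S_xx), which this chapter has not yet introduced, and shows that the fraction of squared error removed by using the line instead of the mean equals r².

```python
home_runs = [17, 16, 8, 21, 8, 20, 6, 11, 35]
rbis = [68, 83, 50, 120, 73, 88, 60, 64, 117]
n = len(home_runs)

x_bar = sum(home_runs) / n
y_bar = sum(rbis) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(home_runs, rbis))
s_xx = sum((x - x_bar) ** 2 for x in home_runs)

# Least-squares line (assumed here; covered with regression).
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Squared error when every prediction is the mean of y.
sse_mean = sum((y - y_bar) ** 2 for y in rbis)
# Squared error when predictions come from the line.
sse_line = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(home_runs, rbis))

r_squared = 1 - sse_line / sse_mean
print(f"error using the mean: {sse_mean:.1f}")
print(f"error using the line: {sse_line:.1f}")
print(f"fraction of error removed (r^2): {r_squared:.4f}")
```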
Technology and Scatterplots

We can also create a scatterplot on the calculator.

- Make sure there are no functions in the grapher (press Y= to check).
- Input the data in the usual way (we already have it there for this example).
- Press 2nd and Y= to get into the STAT PLOT menu.
- Make sure only the plot we want is turned on.
- Select the first graph in the first row, and then make sure the XList and YList are correct.
- Press ZOOM 9.
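The calculator steps above have a rough software analog. The sketch below draws a crude text scatterplot with no plotting library; the function name `text_scatter` and the grid size are arbitrary choices for illustration, and the axes are scaled to the data range much as ZOOM 9 fits the window to the data.

```python
def text_scatter(xs, ys, width=40, height=12):
    """Return a rough text scatterplot: x runs left to right, y bottom to top.

    Assumes xs and ys each contain at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each point to a column and row inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width

home_runs = [17, 16, 8, 21, 8, 20, 6, 11, 35]
rbis = [68, 83, 50, 120, 73, 88, 60, 64, 117]

plot = text_scatter(home_runs, rbis)
print(plot)
print(" x: Home Runs, y: RBIs")
```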
One More Example

Example: There is some evidence that drinking moderate amounts of wine helps prevent heart attacks. The accompanying table gives data on yearly wine consumption (in liters of alcohol from drinking wine per person) and yearly deaths from heart disease (per 100,000 people) in 19 developed nations. Construct a scatterplot and describe what you see.

Country           Alcohol   Deaths
Australia           2.5      211
Austria             3.9      167
Belgium             2.9      131
Canada              2.4      191
Denmark             2.9      220
Finland             0.8      297
France              9.1       71
Iceland             0.8      211
Ireland             0.7      300
Italy               7.9      107
Netherlands         1.8      167
New Zealand         1.9      266
Norway              0.8      227
Spain               6.5       86
Sweden              1.6      207
Switzerland         5.8      115
United Kingdom      1.3      285
United States       1.2      199
West Germany        2.7      172
The Scatterplot

[Figure: scatterplot "Heart Disease v. Alcohol from Wine"; x-axis: Alcohol from Wine (in liters), y-axis: Deaths (per 100,000)]

r = -.8428, strong negative correlation
r² = .7103, worthwhile to find the linear regression line
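These two values can be checked directly from the table. The sketch below assumes Sweden's alcohol value is 1.6 and lists the countries in the table's alphabetical order; the helper name `correlation` is an assumption.

```python
from math import sqrt

# Liters of alcohol from wine per person, Australia through West Germany.
# Sweden's value is taken to be 1.6 (an assumption about the table entry).
alcohol = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
           1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
# Heart disease deaths per 100,000 people, in the same order.
deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
          167, 266, 227, 86, 207, 115, 285, 199, 172]

def correlation(xs, ys):
    """r = S_xy / sqrt(S_xx * S_yy), equivalent to the z-score formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

r = correlation(alcohol, deaths)
print(f"n = {len(alcohol)}, r = {r:.4f}, r^2 = {r * r:.4f}")
```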