Chapter 4 Data with Two Variables

1 Scatter Plots and Correlation and 2 Pearson's Correlation Coefficient

Looking for Correlation
Example: Does the number of hours you watch TV per week affect your average grade in a class?
  Hours   12  10   5   3  15  16   8
  Grade   70  85  82  88  65  75  68
To see whether there is a relationship, we will create a scatter plot and analyze it.
Definition: A scatter plot is a graphical representation of the relationship between two quantitative variables. The two values may come from the same individual (e.g., education v. income, height v. weight) or from paired individuals (e.g., ages of partners in a relationship).

Scatter Plots
When working with scatter plots, there are two variables, and they may play two different roles.
Definition: A response variable measures the outcome of a study.
Definition: An explanatory variable may explain or influence changes in a response variable.
Explanatory variables are often called independent variables and go on the x-axis. Response variables are often called dependent variables and go on the y-axis.

Back to Our Example
In our example, which is the explanatory variable? The number of hours of TV watched.
The response variable is therefore the average grade. So the question we are trying to answer is: Does watching TV influence the average grade in a class?

The Scatter Plot
[Figure: scatter plot "Grades v. Hours of TV"; x-axis: Hours of TV (5 to 15), y-axis: Grade (65 to 90)]

How Does the Relationship Look?
What do we think? It looks like the more hours of TV that are watched, the lower the average grade.
But how good is the relationship? We can measure this in different ways. One is direction (+, −) and another is strength. Both are captured by Pearson's Correlation Coefficient.

Facts About Pearson's Correlation Coefficient
1. $-1 \le r \le 1$. The weakest correlation is 0 and the strongest is ±1. The sign of r only tells us the direction of the relationship: whether y increases as x increases (positive) or y decreases as x increases (negative). Being negative is not bad.
2. Correlation makes no distinction between x and y, that is, between the choice of explanatory and response variables. We still need to choose carefully, though, because the next part (the regression line) depends heavily on the correct choice.

So How Do We Find Pearson's Correlation Coefficient?
The Correlation Coefficient
$$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{S_x}\right)\left(\frac{y_i - \bar{y}}{S_y}\right) = \frac{1}{n-1} \sum z_x z_y$$
Let's find the correlation coefficient for our example. First, we need a few values: $\bar{x}$, $\bar{y}$, $S_x$, $S_y$.

Finding Pearson's Correlation Coefficient
For each pair, find the z-score of each value, then multiply the two z-scores together. After summing the products, divide by n − 1.
  i      z_x       z_y     product
  1     .4391    -.6848    -.3007
  2     .0293     .9873     .0289
  3    -.9953     .6529    -.6498
  4   -1.4050    1.3217   -1.8570
  5    1.0539   -1.2421   -1.3090
  6    1.2588    -.1274    -.1604
  7    -.3805    -.9077     .3454
  Sum                     -3.9026
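As a rough check of the z-score method, the steps can be sketched in Python. The helper name `pearson_r_zscores` is an illustrative assumption, not something from the slides; it uses the sample standard deviation (dividing by n − 1), which is what the formula for r requires.

```python
import math

def pearson_r_zscores(xs, ys):
    """Pearson's r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Multiply the paired z-scores and add up the products
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

hours = [12, 10, 5, 3, 15, 16, 8]
grades = [70, 85, 82, 88, 65, 75, 68]
r_tv = pearson_r_zscores(hours, grades)
print(round(r_tv, 4))
```

The result is about −0.65. Dividing the table's rounded sum of −3.9026 by 6 gives a value that can differ in the fourth decimal place, because the tabulated z-scores were rounded first.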

Another Way To Calculate
We can use a formula similar to the one we used for the standard deviation to find the correlation coefficient. First, we need a few formulas.
The Three S's
$$S_{xx} = \sum (x - \bar{x})^2$$
$$S_{yy} = \sum (y - \bar{y})^2$$
$$S_{xy} = \sum (x - \bar{x})(y - \bar{y})$$

Another Way To Calculate
This looks just as complicated as the other method, or worse, but there are simpler ways to calculate the three S's that make this method less painful.
Calculating the Three S's
$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$$
$$S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$$
$$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$$

Another Way To Calculate
  Σx² = 823      Σx = 69     (Σx)² = 4761        (Σx)²/7 = 680.14
  Σy² = 41067    Σy = 533    (Σy)² = 284089      (Σy)²/7 = 40584.14
  Σxy = 5083                 (Σx)(Σy) = 36777    (Σx)(Σy)/7 = 5253.86

Another Way To Calculate
So we have
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{5083 - 5253.86}{\sqrt{(823 - 680.14)(41067 - 40584.14)}} = \frac{-170.86}{\sqrt{(142.86)(482.86)}} = -.6505$$
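The shortcut calculation can be sketched in Python as well. The function name `three_s` is a hypothetical helper chosen for illustration; it applies the computational formulas for the three S's to the TV data.

```python
import math

def three_s(xs, ys):
    """Return (S_xx, S_yy, S_xy) using the computational (shortcut) formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy

hours = [12, 10, 5, 3, 15, 16, 8]
grades = [70, 85, 82, 88, 65, 75, 68]
s_xx, s_yy, s_xy = three_s(hours, grades)
r_shortcut = s_xy / math.sqrt(s_xx * s_yy)
print(round(s_xx, 2), round(s_yy, 2), round(s_xy, 2), round(r_shortcut, 4))
```

Only running totals of x, y, x², y², and xy are needed, which is why this route is convenient by hand or on a basic calculator.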

So Can We Say There Is A Relationship?
So, can we say that there is a direct relationship between the number of hours of TV watched and the average grade? Not so fast...
Correlation does not necessarily imply causation. Just because the data look the part does not mean we have evidence of a cause-and-effect relationship. We have to consider a couple of other things. One is lurking variables: variables that may be present but that we are not actually considering within the data.

Significance
We also need to test for significance to see what is going on.
If $|r|\sqrt{n} > 3$, the correlation is significant. Otherwise, it is not significant.
The smaller the value of $|r|\sqrt{n}$, the weaker the evidence that the correlation is significant.
Reasons why a correlation may not be significant:
1. Genuine lack of correlation
2. Not enough data
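A minimal sketch of this rule of thumb in Python, applied to the TV example (r = −.6505, n = 7). The function name `is_significant` and its `threshold` parameter are illustrative assumptions; the cutoff of 3 is the one stated in this section.

```python
import math

def is_significant(r, n, threshold=3.0):
    """Rule of thumb from this section: significant when |r| * sqrt(n) > threshold."""
    return abs(r) * math.sqrt(n) > threshold

# TV example: r = -.6505 from n = 7 pairs
tv_value = abs(-0.6505) * math.sqrt(7)
print(round(tv_value, 2), is_significant(-0.6505, 7))
```

For the TV data the value comes out near 1.72, below 3, so by this rule the correlation is not significant. With only seven pairs, "not enough data" is a plausible reason.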

Assumptions and Conditions for Correlation
Quantitative Variables Condition: Don't make the common error of calling an association involving a categorical variable a correlation. Correlation is only about quantitative variables.
Straight Enough Condition: The best check for the assumption that the variables are truly linearly related is to look at the scatter plot to see whether it looks reasonably straight. That's a judgment call, but not a difficult one.
No Outliers Condition: Outliers can distort the correlation dramatically, making a weak association look strong or a strong one look weak. Outliers can even change the sign of the correlation. But it's easy to see outliers in the scatter plot, so to check this condition, just look.

Another Example
Example: The following gives the power numbers for the starting 9 of the 2007 Boston Red Sox. Is there a relationship between the number of home runs and the number of RBIs? Does the number of home runs affect the number of RBIs? Produce a scatter plot and discuss the correlation.
  Player     Home Runs   RBIs
  Varitek       17         68
  Youkilis      16         83
  Pedroia        8         50
  Lowell        21        120
  Lugo           8         73
  Ramirez       20         88
  Crisp          6         60
  Drew          11         64
  Ortiz         35        117

Red Sox Example
Which is the response variable? Since we are asking whether home runs affect RBIs, home runs is the explanatory variable and therefore x. So RBIs is the response variable, y.
[Figure: scatter plot "2007 Red Sox Power Numbers"; x-axis: Home Runs (10 to 30), y-axis: RBIs (20 to 120)]

Before We Go On
Something to notice: we have two values with the same x-coordinate (Pedroia and Lugo each hit 8 home runs).
[Figure: scatter plot "2007 Red Sox Power Numbers"; x-axis: Home Runs (10 to 30), y-axis: RBIs (20 to 120)]

Finding Pearson's Correlation Coefficient
What is our guess as to the correlation? Now let's find Pearson's correlation coefficient.
1. Input the data in the usual way, with the explanatory variable under L1 and the response variable under L2.
2. Press STAT and scroll to TESTS.
3. Select LinRegTTest.
4. Make sure the XList and YList are the lists where the data for the explanatory and response variables are located, respectively.

Using Technology
For our example, we have
  r = .8463
  r² = .7162
So the correlation coefficient tells us that there is a strong positive correlation.
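For readers without the calculator at hand, the same two values can be reproduced with a short Python sketch. The helper `pearson_r` is an illustrative name, not a library function; it works from deviations about the means.

```python
import math

home_runs = [17, 16, 8, 21, 8, 20, 6, 11, 35]
rbis = [68, 83, 50, 120, 73, 88, 60, 64, 117]

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

r_sox = pearson_r(home_runs, rbis)
print(round(r_sox, 4), round(r_sox ** 2, 4))
```

The printed values should agree with the calculator output for r and r² to four decimal places.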

What Does $r^2$ Tell Us?
$r^2$ tells us how much better our predictions will be if we go through the trouble of finding the regression line rather than just making our predictions with the means. Ours is pretty good here, indicating that we should find the regression line.
[Figure: scatter plot "2007 Red Sox Power Numbers"; x-axis: Home Runs (10 to 30), y-axis: RBIs (20 to 120)]

Technology and Scatter Plots
We can also create a scatter plot on the calculator.
1. Make sure there are no functions in the grapher (press Y= to check).
2. Input the data in the usual way (we already have it there for this example).
3. Press 2nd and Y= to get into the STAT PLOT menu.
4. Make sure only the plot we want is turned on.

One More Example
Example: There is some evidence that drinking moderate amounts of wine helps prevent heart attacks. The accompanying table gives data on yearly wine consumption (in liters of alcohol from drinking wine per person) and yearly deaths from heart disease (per 100,000 people) in 19 developed nations. Construct a scatter plot and describe what you see.
  Country          Alcohol   Deaths
  Australia          2.5       211
  Austria            3.9       167
  Belgium            2.9       131
  Canada             2.4       191
  Denmark            2.9       220
  Finland            0.8       297
  France             9.1        71
  Iceland            0.8       211
  Ireland            0.7       300
  Italy              7.9       107
  Netherlands        1.8       167
  New Zealand        1.9       266
  Norway             0.8       227
  Spain              6.5        86
  Sweden             1.6       207
  Switzerland        5.8       115
  United Kingdom     1.3       285
  United States      1.2       199
  West Germany       2.7       172

The Scatter Plot
[Figure: scatter plot "Heart Disease v. Alcohol from Wine"; x-axis: Alcohol from Wine (in liters, 2 to 8), y-axis: Deaths (per 100,000, 50 to 300)]
r = −.8428, a strong negative correlation
$r^2$ = .7103, so it is worthwhile to find the linear regression line

3 Slopes and Equations of Fitted Lines

This Section Is...
The focus of this section is the idea of a fitted curve and the formulas used to find the equation of a line given two points. We will not cover this section here unless we need to.

4 The Least Squares Line

Getting Started
Definition: A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.
Linear functions are of the form $y = mx + b$, but we will write them as $\hat{y} = b_0 + b_1 x$, where $b_0$ is the y-intercept and $b_1$ is the slope. The calculator actually uses the form $y = a + bx$, so be careful.

Formulas
What we will be finding is the least squares regression line of y on x. This is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
Least Squares Regression Line
$$b_1 = r\,\frac{s_y}{s_x} \qquad b_0 = \bar{y} - b_1 \bar{x}$$
If the correlation coefficient is too small, there is no point in finding $\hat{y}$, since $b_0$ and $b_1$ both depend on r.

Example Using Given Values
Example: The following list gives the power numbers for the starting 9 Red Sox players for the 2007 season.
  Name             Home Runs   RBIs
  Jason Varitek       17         68
  Kevin Youkilis      16         83
  Dustin Pedroia       8         50
  Mike Lowell         21        120
  Julio Lugo           8         73
  Manny Ramirez       20         88
  Coco Crisp           6         60
  J.D. Drew           11         64
  David Ortiz         35        117
We want to know if the number of home runs affects the number of RBIs.

Variable Types
What is the response variable?
What is the explanatory variable?

The Needed Values
We can find the mean and standard deviation of both sets of data quickly using our technology. And, since the data is already in the calculator, we can obtain the values of r and $r^2$ quickly as well.
  Variable   Mean    Standard Deviation
  x          15.78         9.05
  y          80.33        24.47
  r = .8463
  r² = .7162

The Correlation Coefficient
How would we describe this correlation coefficient? It indicates a strong, positive correlation.
We can use these values to find the equation of the regression line.
$$b_1 = r\,\frac{s_y}{s_x} = .8463\left(\frac{24.47}{9.05}\right) = 2.29$$

Alternative Calculation
We can also use the three S's to find the equation of the least squares line.
Least Squares Regression Line
$$m = \frac{S_{xy}}{S_{xx}} \qquad \text{and} \qquad b = \bar{y} - m\bar{x}$$

Alternative Calculation
$$S_{xy} = 12907 - \frac{(142)(723)}{9} = 1499.67$$
$$S_{xx} = 2896 - \frac{(142)^2}{9} = 655.56$$
$$m = \frac{1499.67}{655.56} = 2.29$$
$$b = 80.33 - 2.29(15.78) = 44.19$$
And we have the same results.
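To see that the two routes to the slope agree, here is a hedged Python sketch using the 2007 Red Sox data. The variable names are illustrative; nothing here is rounded until the final print, so the intercept can differ slightly from a hand calculation that rounds the slope first.

```python
import math

home_runs = [17, 16, 8, 21, 8, 20, 6, 11, 35]
rbis = [68, 83, 50, 120, 73, 88, 60, 64, 117]
n = len(home_runs)
x_bar, y_bar = sum(home_runs) / n, sum(rbis) / n

# The three S's (computational formulas)
s_xx = sum(x * x for x in home_runs) - sum(home_runs) ** 2 / n
s_yy = sum(y * y for y in rbis) - sum(rbis) ** 2 / n
s_xy = sum(x * y for x, y in zip(home_runs, rbis)) - sum(home_runs) * sum(rbis) / n

# Route 1: slope from r and the sample standard deviations
r = s_xy / math.sqrt(s_xx * s_yy)
s_x, s_y = math.sqrt(s_xx / (n - 1)), math.sqrt(s_yy / (n - 1))
slope_from_r = r * s_y / s_x

# Route 2: slope directly from the three S's
slope_from_s = s_xy / s_xx
intercept = y_bar - slope_from_s * x_bar

print(round(slope_from_r, 2), round(slope_from_s, 2), round(intercept, 2))
```

Both slopes round to 2.29. The unrounded intercept is about 44.24; a hand calculation that plugs in the rounded slope 2.29 and rounded means lands a few hundredths lower.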

Practical Interpretation
What do these values mean in practical terms?
The slope tells us that a one-unit change in the explanatory variable is associated with a change in the response variable by the amount and direction of the slope. In our example, the slope is $b_1 = 2.29$, which tells us that for every additional home run hit, you'd expect an additional 2.29 RBIs.
The y-intercept tells us the value of the response variable when the explanatory variable is 0. Here, the line predicts about 44 RBIs for a player with no home runs.

The Scatter Plot
Let's see how good the regression line is by plotting it over the scatter plot. To do so, we press Y= and put the line under Y1, then select GRAPH.
[Figure: scatter plot "2007 Red Sox Power Numbers"; x-axis: Home Runs (10 to 30), y-axis: RBIs (20 to 120)]

Plot and Line
And now with the regression line $\hat{y} = 44.19 + 2.29x$:
[Figure: scatter plot "2007 Red Sox Power Numbers" with the regression line; x-axis: Home Runs (10 to 30), y-axis: RBIs (20 to 120)]

Predictions
One use of the regression line is making predictions. Suppose we wanted to know about how many RBIs we could expect a player to have if they hit 60 home runs. We are looking to predict the value of y (so we want $\hat{y}$) and we are given a value of x = 60.
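Substituting x = 60 into the fitted line from this example gives a worked prediction. This is only a sketch of the arithmetic; because 60 is far above the largest home run total in the data (35), the result is an extrapolation and should be read cautiously.

```latex
\hat{y} = 44.19 + 2.29(60) = 44.19 + 137.40 = 181.59
```

So the line predicts roughly 182 RBIs for a player with 60 home runs.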

Facts about Regression Lines
1. The distinction between explanatory and response variables is essential (remember the formulas).
2. There is a close connection between slope and correlation.
3. $(\bar{x}, \bar{y})$ is always on the line.
4. $r^2$ gives the fraction of the variation in the values of y that is explained by the least squares regression line of y on x. In other words, $r^2$ gives the fraction of the data's variation accounted for by the model.
5. This only shows us the linear model; it is possible that there is little linear correlation but that the data has a strong relationship under some other type of model.
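Facts 3 and 4 can be checked numerically. The sketch below uses the 2007 Red Sox data from earlier in the chapter; the names `predict`, `gap_at_means`, and `explained_fraction` are illustrative assumptions introduced for this check.

```python
home_runs = [17, 16, 8, 21, 8, 20, 6, 11, 35]
rbis = [68, 83, 50, 120, 73, 88, 60, 64, 117]
n = len(home_runs)
x_bar, y_bar = sum(home_runs) / n, sum(rbis) / n

s_xx = sum((x - x_bar) ** 2 for x in home_runs)
s_yy = sum((y - y_bar) ** 2 for y in rbis)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(home_runs, rbis))

b1 = s_xy / s_xx           # slope
b0 = y_bar - b1 * x_bar    # intercept

def predict(x):
    """Predicted RBIs from the least squares line."""
    return b0 + b1 * x

# Fact 3: the line passes through (x_bar, y_bar)
gap_at_means = predict(x_bar) - y_bar

# Fact 4: r^2 equals the share of the variation in y explained by the line
sse = sum((y - predict(x)) ** 2 for x, y in zip(home_runs, rbis))
explained_fraction = 1 - sse / s_yy
r_squared = s_xy ** 2 / (s_xx * s_yy)

print(abs(gap_at_means) < 1e-9, round(explained_fraction, 4), round(r_squared, 4))
```

The explained fraction is computed from the residuals and $r^2$ from the three S's, so their agreement is a genuine check rather than the same formula written twice.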

Another Sox Example
Suppose we wanted to know if a player was expected to score more runs if he got more hits. To answer this question, we will use the roster of the 2011 Boston Red Sox.
  Name                     Runs   Hits
  Jarrod Saltalamacchia     52     84
  Adrian Gonzalez          108    213
  Dustin Pedroia           102    195
  Marco Scutaro             59    118
  Kevin Youkilis            68    111
  Carl Crawford             65    129
  Jacoby Ellsbury          119    212
  J.D. Drew                 23     55
  David Ortiz               84    162
  Jed Lowrie                40     78
  Josh Reddick              41     71
  Jason Varitek             32     49
  Darnell McDonald          26     37
  Mike Aviles               17     32
  Mike Cameron               9     14
  Drew Sutton               11     17
  Ryan Lavarnway             5      9
  Yamaico Navarro            6      8
  Conor Jackson              2      3
  Jose Iglesias              3      2
  Lars Anderson              2      0
  Joey Gathright             1      0

The Correlation Coefficient
The first thing we will do is find the correlation coefficient. When we plug all of the data into our technology, we get r = .9942.
Interpretation? There is a strong, positive correlation between hits and runs scored.

Interpretation of $r^2$
What is the value of $r^2$ here? $r^2$ = .9885. What does this mean?
$r^2$ tells us the fraction of the variability of the response variable that is accounted for by the variation in the explanatory variable. Here, this means that we can explain 98.85% of the variability in the runs total by the variability in the number of hits.
For each standard deviation that x is above the mean of the explanatory variable, we predict y to be r standard deviations above the mean of the response variable. Here, for every standard deviation we are above the mean number of hits, we predict being .9942 standard deviations above the mean number of runs.

Producing the Scatter Plot
Now, let's produce a scatter plot for the data.
[Figure: scatter plot "2011 Red Sox"; x-axis: Hits (50 to 200), y-axis: Runs (20 to 120)]

The Assumptions
The points give us a pretty good indication that there is a very strong positive correlation here. Before we go on, we want to make sure all of the assumptions about regression lines are met.
Quantitative Variable Condition: If either y or x is categorical, you cannot make a scatter plot and you cannot perform a regression.
Straight Enough Condition: Does the data look straight enough that we can see a linear relationship in the data set?
Outlier Condition: Are there any outliers that dramatically influence the fit of the regression line?

Linear Regression Line
Since all of these are satisfied, we will continue on to find the formula of the regression line. Using our technology, we have
$$\hat{y} = b_0 + b_1 x = 1.92 + .52x$$
What is the practical interpretation of the slope $b_1$? For each additional hit, we expect a player to score .52 additional runs.

Scatter Plot With Regression Line
[Figure: scatter plot "2011 Red Sox" with the regression line; x-axis: Hits (50 to 200), y-axis: Runs (20 to 120)]
So, when we plot the regression line over the scatter plot, we see that the line is a good fit.

Predictions
1. If a player got 200 hits, how many runs would we expect them to have? Here, we are given the x value and, using our regression line, we find the predicted value.
$$\hat{y} = 1.92 + .52(200) = 105.92$$
So, we'd expect about 106 runs for a player with 200 hits.
2. What if we wanted to know how many hits a player had if they scored 120 runs? We are given the value of y and want to find the value of x. So, we use our algebra skills...
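One way to carry out the algebra for question 2, assuming we simply solve the fitted line $\hat{y} = 1.92 + .52x$ for x (a sketch; the slides do not show this step):

```latex
120 = 1.92 + .52x
\quad\Longrightarrow\quad
x = \frac{120 - 1.92}{.52} = \frac{118.08}{.52} \approx 227.08
```

Under that assumption, a player who scored 120 runs would be expected to have about 227 hits.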

Important Points
A few important points to keep in mind:
1. An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in either the x or y direction are often influential.
2. Correlation and least squares regression lines are not resistant.
3. They only describe linear relationships.
4. There could be lurking variables. Those are variables that are not among the explanatory or response variables but may influence the interpretation of the relationship.
5. An association between an explanatory variable x and a response variable y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y. The phrase to remember is that correlation does not necessarily imply causation.

Example Where You Are Doing The Work
Example: We want to know if there is a relationship between the score on the math portion of the SAT exam and the number of hours spent studying for the test. The question is: Does studying more increase the score on the exam? The following data was taken from a study of 20 students as they prepared for and took the SAT exam.
  Hours    4    9   10   14    4    7   12   22    1    3
  Score  390  580  650  730  410  530  600  790  350  400
  Hours    8   11    5    6   10   11   16   13   13   10
  Score  590  640  450  520  690  690  770  700  730  640

Variable Types
What is the response variable? Math SAT score
What is the explanatory variable? Hours of study

Pearson's Correlation Coefficient
So let's first find the correlation coefficient to see what we are dealing with.
r = .9336
Our interpretation? This tells us there is a strong positive correlation.

And Now $r^2$
What about $r^2$? $r^2$ = .8716
This tells us what? 87.16% of the variation in the Math SAT score can be explained by the variation in the hours of study.

Is The Data Significant?
What is the inequality we are using?
$$|r|\sqrt{n} > 3$$
Is this data significant?
$$|r|\sqrt{n} = .9336\sqrt{20} \approx 4.18 > 3$$
So, the data is significant based on this criterion.
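The whole check for the SAT example can be sketched in Python, starting from the raw data. The variable names are illustrative; the cutoff of 3 is the rule of thumb used in this chapter.

```python
import math

hours = [4, 9, 10, 14, 4, 7, 12, 22, 1, 3, 8, 11, 5, 6, 10, 11, 16, 13, 13, 10]
scores = [390, 580, 650, 730, 410, 530, 600, 790, 350, 400,
          590, 640, 450, 520, 690, 690, 770, 700, 730, 640]

n = len(hours)
# The three S's (computational formulas)
s_xx = sum(x * x for x in hours) - sum(hours) ** 2 / n
s_yy = sum(y * y for y in scores) - sum(scores) ** 2 / n
s_xy = sum(x * y for x, y in zip(hours, scores)) - sum(hours) * sum(scores) / n

r_sat = s_xy / math.sqrt(s_xx * s_yy)
test_value = abs(r_sat) * math.sqrt(n)

print(round(r_sat, 4), round(r_sat ** 2, 4), test_value > 3)
```

The test value comes out a little over 4, comfortably above the cutoff, so the rule classifies this correlation as significant.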

Visual Representation
Next, let's produce our scatter plot so we can see what we are dealing with.
[Figure: scatter plot "Math SAT Score v. Hours of Study"; x-axis: Hours of Study (5 to 20), y-axis: SAT Score (300 to 800)]

Checking Conditions/Assumptions
We are feeling pretty good about this: it seems to have a strong, positive correlation. When we consider the conditions, are we still happy with this?
Quantitative Variable Condition: Both variables are quantitative.
Straight Enough Condition: The data looks reasonably straight.
Outlier Condition: There do not seem to be any outliers.