Chapter 6 Scatterplots, Association and Correlation


Looking for Correlation

Example: Does the number of hours of TV you watch per week affect your average grade in a class?

Hours: 12, 10, 5, 3, 15, 16, 8
Grade: 70, 85, 82, 88, 65, 75, 68

To see whether there is a relationship, we will create a scatterplot and analyze it.

Definition: A scatterplot is a graphical representation of the relationship between two quantitative variables. The two values may come from the same individual (e.g., education v. income, height v. weight) or from paired individuals (e.g., ages of partners in a relationship).

Scatterplots

When working with scatterplots, there are two variables, and they may play two different roles.

Definition: A response variable measures the outcome of a study.

Definition: An explanatory variable may explain or influence changes in a response variable.

Explanatory variables are often called independent variables and go on the x-axis. Response variables are often called dependent variables and go on the y-axis.

Back to Our Example

In our example, which is the explanatory variable? Hours of TV watched. The response variable is therefore the average grade. So the question we are trying to answer is: does watching TV influence the average grade in a class? Let's plot the data and see what we have.

The Scatterplot

[Figure: scatterplot "Grades v. Hours of TV"; x-axis: Hours of TV, y-axis: Grade]

How Does the Relationship Look?

What do we think? It looks like the more hours of TV watched, the lower the average grade. But how strong is the relationship? We can describe it in two ways: by its direction (+ or −) and by its strength. Both are captured by the correlation coefficient.

Facts About Correlation Coefficients

1. $-1 \le r \le 1$. The weakest correlation is 0 and the strongest is ±1. The sign of r only tells us the direction of the relationship: whether y increases as x increases, or y decreases as x increases. Being negative is not bad.
2. Correlation makes no distinction between x and y, that is, between the choice of explanatory and response variables. We still need to choose carefully, though, because the next part (the regression line) depends heavily on the correct choice.
3. Correlation measures only the strength of the linear relationship.
4. Correlation is not resistant: it is strongly affected by outliers.
5. Correlation has no units.
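Facts 2 and 5 are easy to check numerically. The following is a minimal sketch using the TV example data; the helper name `pearson_r` and the conversion of hours to minutes are illustrative choices, not part of the slides.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

hours = [12, 10, 5, 3, 15, 16, 8]
grades = [70, 85, 82, 88, 65, 75, 68]

r_xy = pearson_r(hours, grades)
r_yx = pearson_r(grades, hours)        # fact 2: swap the roles of x and y
minutes = [60 * h for h in hours]      # fact 5: change the units of x
r_minutes = pearson_r(minutes, grades)

print(round(r_xy, 4), round(r_yx, 4), round(r_minutes, 4))
```

All three printed values should agree, since r depends on neither the assignment of x and y nor the units of measurement.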

So How Do We Find This Correlation Coefficient?

The Correlation Coefficient

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{S_x}\right)\left(\frac{y_i-\bar{y}}{S_y}\right) = \frac{1}{n-1}\sum z_x z_y$$

Let's find the correlation coefficient for our example. First, we need a few values: $\bar{x}$, $\bar{y}$, $S_x$, $S_y$.

$\bar{x} = 9.857$, $\bar{y} = 76.143$, $S_x = 4.880$, $S_y = 8.971$

Finding the Correlation Coefficient

For each pair, find the z-score of each value, then multiply the two z-scores together. Sum the products and divide by $n - 1$.

i     z_x        z_y        product
1     0.4391    -0.6848    -0.3007
2     0.0293     0.9873     0.0289
3    -0.9953     0.6529    -0.6498
4    -1.4050     1.3217    -1.8570
5     1.0539    -1.2421    -1.3090
6     1.2588    -0.1274    -0.1604
7    -0.3805    -0.9077     0.3454
Sum                        -3.9026

$$r = \frac{1}{6}(-3.9026) = -0.6504$$

Interpretation: moderate negative correlation.
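The hand calculation can be reproduced in a few lines. This is a sketch using Python's standard `statistics` module; the variable names are illustrative. Because it does not round the intermediate z-scores, the final digit of r may differ slightly from the hand-computed value.

```python
from statistics import mean, stdev

hours = [12, 10, 5, 3, 15, 16, 8]
grades = [70, 85, 82, 88, 65, 75, 68]
n = len(hours)

x_bar, y_bar = mean(hours), mean(grades)
s_x, s_y = stdev(hours), stdev(grades)   # sample standard deviations (divide by n - 1)

# z-score of each value, then the product for each pair
z_x = [(x - x_bar) / s_x for x in hours]
z_y = [(y - y_bar) / s_y for y in grades]
products = [zx * zy for zx, zy in zip(z_x, z_y)]

# Sum the products and divide by n - 1
r = sum(products) / (n - 1)

for i, (zx, zy, p) in enumerate(zip(z_x, z_y, products), start=1):
    print(f"{i}  {zx:8.4f}  {zy:8.4f}  {p:8.4f}")
print(f"r = {r:.4f}")
```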

So Can We Say There Is A Relationship?

So, can we say that there is a direct relationship between the number of hours of TV watched and the average grade? Not so fast... Correlation does not necessarily imply causation. Just because the scatterplot looks the part does not mean we have evidence of a direct relationship. We have to consider a couple of other things. One is lurking variables: variables that may affect the relationship but are not among the variables we measured.

Can you think of any lurking variables that would affect our example?

Significance

We also need to test for significance to see what is going on.

If $|r|\sqrt{n} > 3$, the correlation is significant. Otherwise it is not significant.

The smaller this value, the smaller the probability that the correlation will be significant.

Reasons why a correlation may not be significant:
1. Genuine lack of correlation
2. Not enough data

Our example is not significant because of quantity: there are only 7 data pairs. So we cannot conclude that watching TV has a direct impact on grades.
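As a rough check of the TV example, here is a small sketch that assumes the rule of thumb on this slide is $|r|\sqrt{n} > 3$. That reading of the formula is an assumption, and the function name `rule_of_thumb_significant` is hypothetical.

```python
from math import sqrt

def rule_of_thumb_significant(r, n, threshold=3.0):
    """Return (statistic, verdict) for the assumed |r| * sqrt(n) > threshold rule."""
    statistic = abs(r) * sqrt(n)
    return statistic, statistic > threshold

# TV example: r = -0.6504 from n = 7 pairs
stat, significant = rule_of_thumb_significant(-0.6504, 7)
print(f"|r|*sqrt(n) = {stat:.3f}, significant: {significant}")
```

Under this assumed rule, the statistic for the TV data falls well short of 3, which matches the conclusion that the example is not significant.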

Assumptions and Conditions for Correlation

Quantitative Variables Condition: Don't make the common error of calling an association involving a categorical variable a correlation. Correlation is only about quantitative variables.

Straight Enough Condition: The best check for the assumption that the variables are truly linearly related is to look at the scatterplot to see whether it looks reasonably straight. That's a judgment call, but not a difficult one.

No Outliers Condition: Outliers can distort the correlation dramatically, making a weak association look strong or a strong one look weak. Outliers can even change the sign of the correlation. But it's easy to see outliers in the scatterplot, so to check this condition, just look.
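To see how much a single outlier can matter, the sketch below adds one hypothetical student (40 hours of TV, grade 100) to the TV example. That extra point is invented purely for illustration and is not part of the original data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

hours = [12, 10, 5, 3, 15, 16, 8]
grades = [70, 85, 82, 88, 65, 75, 68]
r_original = pearson_r(hours, grades)

# Hypothetical outlier (not in the slides): 40 hours of TV, grade 100
r_with_outlier = pearson_r(hours + [40], grades + [100])

print(f"without outlier: r = {r_original:.3f}")
print(f"with outlier:    r = {r_with_outlier:.3f}")
```

With this one added point the correlation changes sign, from negative to positive.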

Another Example

Example: The following table gives the power numbers for the starting nine of the 2007 Boston Red Sox. Is there a relationship between the number of home runs and the number of RBIs? Does the number of home runs affect the number of RBIs? Produce a scatterplot and discuss the correlation.

Player      Home Runs   RBIs
Varitek     17          68
Youkilis    16          83
Pedroia     8           50
Lowell      21          120
Lugo        8           73
Ramirez     20          88
Crisp       6           60
Drew        11          64
Ortiz       35          117

Red Sox Example

Which is the explanatory variable? Which is the response variable? Since we are asking whether home runs affect RBIs, home runs is the explanatory variable and therefore x. So RBIs is the response variable, y.

[Figure: scatterplot "2007 Red Sox Power Numbers"; x-axis: Home Runs, y-axis: RBIs]

Before We Go On

Something to notice: we have two values with the same x-coordinate (Pedroia and Lugo each hit 8 home runs).

[Figure: scatterplot "2007 Red Sox Power Numbers"; x-axis: Home Runs, y-axis: RBIs]

Finding the Correlation Coefficient

What is our guess as to the correlation? Now let's find the correlation coefficient. But there must be an easier way than computing it by hand... and that way is technology.

1. Input the data in the usual way, with the explanatory variable under L1 and the response variable under L2.
2. Press STAT and scroll to TESTS.
3. Select LinRegTTest.
4. Make sure the XList and YList are the lists where the data for the explanatory and response variables are located, respectively.
5. Select Calculate and scroll to find r and r².

Using Technology

For our example, we have r = 0.8463 and r² = 0.7162. So the correlation coefficient tells us that there is a strong positive correlation.
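For readers without the calculator, the same two numbers can be obtained with a short script. This is a sketch in plain Python using the Red Sox table; the variable names are illustrative.

```python
from math import sqrt

home_runs = [17, 16, 8, 21, 8, 20, 6, 11, 35]      # explanatory variable (x)
rbis = [68, 83, 50, 120, 73, 88, 60, 64, 117]      # response variable (y)

n = len(home_runs)
mean_x = sum(home_runs) / n
mean_y = sum(rbis) / n

# Sums of squared deviations and cross-products
sxx = sum((x - mean_x) ** 2 for x in home_runs)
syy = sum((y - mean_y) ** 2 for y in rbis)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(home_runs, rbis))

r = sxy / sqrt(sxx * syy)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```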

What Does r² Tell Us?

r² tells us how much better our predictions will be if we go through the trouble of finding the regression line rather than just predicting with the mean. Here r² = 0.7162, so about 72% of the variation in RBIs is accounted for by the linear relationship with home runs. That is pretty good, indicating that we should find the regression line.

[Figure: scatterplot "2007 Red Sox Power Numbers"; x-axis: Home Runs, y-axis: RBIs]
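One way to make this concrete is to compare the squared prediction errors from the mean alone with those from a fitted line. The sketch below assumes the regression line is the ordinary least-squares line (slope = Sxy / Sxx), which this chapter has not yet derived; the names `sst` and `sse` are illustrative.

```python
home_runs = [17, 16, 8, 21, 8, 20, 6, 11, 35]
rbis = [68, 83, 50, 120, 73, 88, 60, 64, 117]

n = len(home_runs)
mean_x = sum(home_runs) / n
mean_y = sum(rbis) / n

# Least-squares line: y_hat = intercept + slope * x
sxx = sum((x - mean_x) ** 2 for x in home_runs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(home_runs, rbis))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Squared error when every prediction is just the mean of y
sst = sum((y - mean_y) ** 2 for y in rbis)
# Squared error when predictions come from the fitted line
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(home_runs, rbis))

r_squared = 1 - sse / sst
print(f"error using mean: {sst:.1f}")
print(f"error using line: {sse:.1f}")
print(f"r^2 = {r_squared:.4f}")
```

The fraction by which the line reduces the squared error is r², so it should match the value reported by the calculator.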

Technology and Scatterplots

We can also create a scatterplot on the calculator.

1. Make sure there are no functions in the grapher (press Y= to check).
2. Input the data in the usual way (we already have it there for this example).
3. Press 2nd and Y= to get into the STAT PLOT menu.
4. Make sure only the plot we want is turned on.
5. Select the first graph in the first row and then make sure the XList and YList are correct.
6. Press ZOOM 9.

One More Example

Example: There is some evidence that drinking moderate amounts of wine helps prevent heart attacks. The accompanying table gives data on yearly wine consumption (in liters of alcohol from drinking wine per person) and yearly deaths from heart disease (per 100,000 people) in 19 developed nations. Construct a scatterplot and describe what you see.

Country          Alcohol   Deaths
Australia        2.5       211
Austria          3.9       167
Belgium          2.9       131
Canada           2.4       191
Denmark          2.9       220
Finland          0.8       297
France           9.1       71
Iceland          0.8       211
Ireland          0.7       300
Italy            7.9       107
Netherlands      1.8       167
New Zealand      1.9       266
Norway           0.8       227
Spain            6.5       86
Sweden           1.6       207
Switzerland      5.8       115
United Kingdom   1.3       285
United States    1.2       199
West Germany     2.7       172

The Scatterplot

[Figure: scatterplot "Heart Disease v. Alcohol from Wine"; x-axis: Alcohol from Wine (in liters), y-axis: Deaths (per 100,000)]

r = −0.8428: strong negative correlation.

r² = 0.7103: worthwhile to find the linear regression line.
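These values can be checked from the table with a short script. This is a sketch in plain Python; the list-of-tuples layout and the variable names are illustrative, and the Sweden entry is read as 1.6 liters.

```python
from math import sqrt

# (country, liters of alcohol from wine per person, heart disease deaths per 100,000)
wine_data = [
    ("Australia", 2.5, 211), ("Austria", 3.9, 167), ("Belgium", 2.9, 131),
    ("Canada", 2.4, 191), ("Denmark", 2.9, 220), ("Finland", 0.8, 297),
    ("France", 9.1, 71), ("Iceland", 0.8, 211), ("Ireland", 0.7, 300),
    ("Italy", 7.9, 107), ("Netherlands", 1.8, 167), ("New Zealand", 1.9, 266),
    ("Norway", 0.8, 227), ("Spain", 6.5, 86), ("Sweden", 1.6, 207),
    ("Switzerland", 5.8, 115), ("United Kingdom", 1.3, 285),
    ("United States", 1.2, 199), ("West Germany", 2.7, 172),
]

alcohol = [row[1] for row in wine_data]
deaths = [row[2] for row in wine_data]
n = len(wine_data)

mean_x = sum(alcohol) / n
mean_y = sum(deaths) / n
sxx = sum((x - mean_x) ** 2 for x in alcohol)
syy = sum((y - mean_y) ** 2 for y in deaths)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(alcohol, deaths))

r = sxy / sqrt(sxx * syy)
print(f"n = {n}, r = {r:.4f}, r^2 = {r * r:.4f}")
```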