Notes 21: Scatterplots, Association, Causation


STA 6166, Fall 2007, Web-based Course Notes 21

We used two-way tables and segmented bar charts to examine the relationship between two categorical variables, and side-by-side boxplots to examine the relationship between a quantitative variable and a categorical variable. Scatterplots are the graphical tool for examining the relationship between two quantitative variables. The response variable goes on the y-axis and the explanatory variable on the x-axis. Often, we are trying to predict the response variable from the explanatory variable. Sometimes neither variable is obviously the explanatory or the response; then it doesn't matter which variable we plot on the y-axis.

What do we look for in examining a scatterplot? Is there a relationship between the two variables? That is, does the distribution of the y-variable change as the x-variable changes? If there is a relationship, we look for:

- the direction of the relationship: positive, negative, or some combination
- the form of the relationship: linear, curved, etc.
- the strength of the relationship (the more scatter of the points around the form, the weaker the relationship)
- outliers: points that don't fit the overall pattern or fall far away from the rest of the data (outliers in a scatterplot may or may not be outliers in the x-variable or the y-variable individually)
- other interesting features, such as clusters of points, or different relationships in different parts of the scatterplot.

Use these guidelines to describe the relationships in the following scatterplots. The data for the first three are taken from a data set on education and related data for the 50 states, year unspecified (source: Table 1.6 in Moore (2000), The Basic Practice of Statistics, 2nd ed.). The variables are all averages unless otherwise specified. The data for the fourth scatterplot are from Florida's 2000 election results.

[Figure: scatterplot, "Average SAT verbal vs. average score" (y-axis: SAT verbal)]

[Figure: scatterplot, "Average score vs. percent of high school seniors taking SAT" (x-axis: Pct. taking SAT)]

[Figure: scatterplot, "Average Math SAT Scores vs. Teachers' Pay" (x-axis: Teachers' pay ($1,000))]

[Figure: scatterplot, "County vote totals for Bush versus Buchanan, Florida 2000" (x-axis: Bush votes; y-axis: Buchanan votes)]

Correlation

The correlation coefficient r is a measure of the strength of the linear relationship between two quantitative variables.

It has the following properties:

- -1 ≤ r ≤ 1
- r = 0 indicates no linear relationship, r > 0 indicates a positive relationship, and r < 0 indicates a negative relationship.
- r = 1 occurs only when the data fall perfectly on a line with positive slope; r = -1 occurs only when the data fall perfectly on a line with negative slope.

Computing the correlation coefficient

\[
z_x = \frac{x - \bar{x}}{s_x}, \qquad z_y = \frac{y - \bar{y}}{s_y}, \qquad r = \frac{\sum z_x z_y}{n - 1}
\]

This is sometimes called Pearson's r or Pearson's correlation to distinguish it from other measures of association; however, the phrase "correlation coefficient" in statistics refers specifically to r.

Example: Airfare and distance to 12 destinations from Baltimore on Jan. 8, 1995:

[Figure: scatterplot of Airfare ($) versus Distance (miles)]
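The z-score formula above can be sketched in Python for the airfare example. This is a minimal illustration rather than part of the original notes: the function names `z_scores` and `correlation` are hypothetical, and the data are the twelve distance and airfare pairs from the example's data table.

```python
from statistics import mean, stdev

# Distance (miles) and airfare ($) for the 12 destinations in the example,
# listed in the same city order as the data table (Atlanta ... St. Louis).
distance = [576, 370, 612, 1216, 409, 1502, 946, 998, 189, 787, 210, 737]
airfare = [178, 138, 94, 278, 158, 258, 198, 188, 98, 179, 138, 98]


def z_scores(values):
    """Standardize values using the sample standard deviation (n - 1)."""
    center = mean(values)
    spread = stdev(values)
    return [(v - center) / spread for v in values]


def correlation(xs, ys):
    """Pearson's r: sum of the z-score products divided by n - 1."""
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return sum(products) / (len(xs) - 1)


r = correlation(distance, airfare)
print(f"r = {r:.3f}")
```

With the sum of products of about 8.745 reported in the table and n = 12, r works out to roughly 8.745 / 11, or about 0.795.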

Destination          Distance   z-score   Airfare   z-score   Product
Atlanta                   576     -.339       178      .186     -.063
Boston                    370                 138
Chicago                   612     -.250        94    -1.226      .307
Dallas/Fort Worth        1216     1.250       278     1.868     2.335
Detroit                   409     -.754       158     -.150      .113
Denver                   1502     1.960       258     1.532     3.003
Miami                     946      .579       198      .523      .303
New Orleans               998      .709       188      .355      .251
New York                  189    -1.300        98    -1.159     1.507
Orlando                   787      .185       179      .203      .038
Pittsburgh                210    -1.248       138     -.486      .607
St. Louis                 737                  98
Mean                   712.67              166.92       Sum     8.745
Std. Dev.              402.69               59.45       r =     ?

Checkpoint 1: Why is r a measure of the linear relationship between two variables?

Simulated Example: Let x ~ N(0, 1) and y = x + 3. What is the correlation between x and y?

       x         y       z_x       z_y   z_x z_y
  -.4326    2.5674    -.4802    -.4802     .2306
 -1.6656    1.3344   -1.8451   -1.8451    3.4042
   .1253    3.1253     .1373     .1373     .0189
   .2877    3.2877     .3170     .3170     .1005
 -1.1465    1.8535   -1.2704   -1.2704    1.6140
  1.1909    4.1909    1.3168    1.3168    1.7340
  1.1892    4.1892    1.3149    1.3149    1.7289
  -.0376    2.9624    -.0431    -.0431     .0019
   .3273    3.3273     .3609     .3609     .1302
   .1746    3.1746     .1919     .1919     .0368

Sum of z_x z_y = 9, N = 10, r = 1

Since z-scores tell us how far each value is from its mean, if each x lies exactly as far from its mean as the corresponding y lies from its mean, then z_x = z_y for every point and the slope between the z-scores is one. If the z-scores deviate only slightly from this pattern, the correlation will be close to one. If the deviation is large, the correlation will be close to zero.

Other properties of correlation:

- it makes no difference which variable you call x and which you call y in computing correlation
- the correlation is unchanged by changing the units of measurement for x or y

The correlations between pairs of variables in a data set with more than two variables are often reported in a correlation matrix. For example, here are the correlations among four variables, including SAT verbal, Percent taking SAT, and Teachers' pay ($1,000):

    1       .97    -.887   -.455
     .97   1       -.869   -.379
    -.887  -.869   1        .63
    -.455  -.379    .63    1

Note that the correlation between a variable and itself is 1.

Checkpoint 2: Why?

A scatterplot matrix is a graphical analog to the correlation matrix. Remember that correlations should never be examined without also examining the scatterplots.

[Figure: scatterplot matrix; panel labels include SAT verbal, Percent taking SAT, and Teachers' pay ($1,000)]
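The two properties listed above (symmetry in x and y, and invariance to units) and the idea of a correlation matrix can be checked numerically. The sketch below is illustrative only: the data set and the names `correlation` and `correlation_matrix` are invented for this example and do not come from the notes.

```python
from itertools import product
from statistics import mean, stdev


def correlation(xs, ys):
    """Pearson's r as the sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def correlation_matrix(columns):
    """Return {(name_a, name_b): r} for every ordered pair of columns."""
    return {(a, b): correlation(columns[a], columns[b])
            for a, b in product(columns, repeat=2)}


# Hypothetical data, invented purely for illustration.
data = {
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "score": [52.0, 55.0, 61.0, 60.0, 68.0, 71.0],
    "errors": [9.0, 8.0, 6.0, 7.0, 3.0, 2.0],
}

# Swapping the roles of x and y leaves r unchanged.
r_xy = correlation(data["hours"], data["score"])
r_yx = correlation(data["score"], data["hours"])

# Changing units (hours to minutes) leaves r unchanged.
minutes = [60.0 * h for h in data["hours"]]
r_units = correlation(minutes, data["score"])

# Print the matrix; the diagonal entries are each variable with itself.
matrix = correlation_matrix(data)
for a in data:
    print(a.ljust(8), "  ".join(f"{matrix[(a, b)]:6.3f}" for b in data))
```

The unit change works because z-scores subtract the mean and divide by the standard deviation, so any positive rescaling cancels out before the products are summed.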

Further explorations of the correlation coefficient

Describe the relationship between the two variables in each of the following scatterplots:

[Figure: two side-by-side scatterplots of y versus x]

Checkpoint 3: Using the z-score interpretation, guess approximately what the correlations are.

The actual correlations are .36 and .975. The left-hand plot illustrates that the correlation coefficient is a measure of linear association. The right-hand plot illustrates, however, that relationships that are curved but monotone may nonetheless have a very high value of r. That's because the data still fall close to a line.

Checkpoint 4: Is the correlation coefficient resistant? Guess what the correlations would be with and without the outlier in each of the following scatterplots.

[Figure: two side-by-side scatterplots of y versus x, each containing one outlier]

Without outlier:                With outlier:
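The points made in Checkpoints 3 and 4 can be explored with a short Python sketch. All of the data here are hypothetical and chosen only to make the behavior easy to see; they are not the data behind the scatterplots in the notes.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Pearson's r as the sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Curved and NOT monotone: a parabola centered at zero.
xs_sym = list(range(-5, 6))
ys_sym = [x ** 2 for x in xs_sym]
r_sym = correlation(xs_sym, ys_sym)

# Curved but monotone: the increasing half of a parabola.
xs_mono = list(range(1, 11))
ys_mono = [x ** 2 for x in xs_mono]
r_mono = correlation(xs_mono, ys_mono)

# A perfect line, then the same line with one far-away point added.
xs_line = list(range(1, 11))
ys_line = [2 * x + 1 for x in xs_line]
r_without = correlation(xs_line, ys_line)
r_with = correlation(xs_line + [30], ys_line + [0])

print(f"symmetric parabola:  r = {r_sym:.3f}")
print(f"monotone curve:      r = {r_mono:.3f}")
print(f"line, no outlier:    r = {r_without:.3f}")
print(f"line, with outlier:  r = {r_with:.3f}")
```

The symmetric parabola has a strong relationship but essentially zero correlation, the monotone curve has a high correlation despite not being linear, and a single added point changes r substantially, which is what it means for r not to be resistant.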

Resistant measures of association:

Kendall's tau: consider all pairs of points (except those with the same x-value); count the number of slopes that are positive, negative, and zero. Kendall's tau equals

\[
\tau = \frac{\#\,\text{positive slopes} - \#\,\text{negative slopes}}{\#\,\text{positive slopes} + \#\,\text{negative slopes} + \#\,\text{zero slopes}}
\]

Spearman's rho: replace the x-values by their ranks (smallest = 1, largest = n), replace the y-values by their ranks, and compute the correlation between the two sets of ranks (practice on the airfare data from earlier).

City                 Distance   Rank   Airfare   Rank
Atlanta                   576      5       178      5
Boston                    370      3       138      3
Chicago                   612      6        94      1
Dallas/Ft. Worth         1216     11       278     10
Denver                   1502     12       258      9
Detroit                   409      4       158      4
Miami                     946      9       198      8
New Orleans               998     10       188      7
New York                  189      1        98      2
Orlando                   787      8       179      6
Pittsburgh                210      2       138      3
St. Louis                 737      7        98      2

Checkpoint 5: When will Kendall's tau and Spearman's rho be equal to 1 or -1?

Hence, Kendall's tau and Spearman's rho are measures of how monotone the relationship between x and y is.

Checkpoint 6: Are they more resistant than r? Are they completely resistant to outliers?
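The two definitions above can be sketched in Python using the airfare data. The function names are hypothetical. One assumption to note: tied values are given the same integer rank, which reproduces the ranks shown in the table; many textbooks and software packages assign tied values the average of their ranks instead, which can give a slightly different rho.

```python
from itertools import combinations
from statistics import mean, stdev

# Airfare data in the city order of the rank table (Atlanta ... St. Louis).
distance = [576, 370, 612, 1216, 1502, 409, 946, 998, 189, 787, 210, 737]
airfare = [178, 138, 94, 278, 258, 158, 198, 188, 98, 179, 138, 98]


def correlation(xs, ys):
    """Pearson's r as the sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def kendall_tau(xs, ys):
    """(# positive - # negative slopes) / (# positive + negative + zero)."""
    pos = neg = zero = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # pairs with the same x-value are excluded
        sign = (y2 - y1) * (x2 - x1)  # same sign as the slope
        if sign > 0:
            pos += 1
        elif sign < 0:
            neg += 1
        else:
            zero += 1
    return (pos - neg) / (pos + neg + zero)


def shared_ranks(values):
    """Rank from smallest (1) upward; tied values share one rank."""
    rank_of = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [rank_of[v] for v in values]


def spearman_rho(xs, ys):
    """Pearson's r computed on the ranks instead of the raw values."""
    return correlation(shared_ranks(xs), shared_ranks(ys))


tau = kendall_tau(distance, airfare)
rho = spearman_rho(distance, airfare)
print(f"Kendall's tau  = {tau:.3f}")
print(f"Spearman's rho = {rho:.3f}")
```

Because both measures depend only on the ordering of the values, any strictly increasing relationship, linear or not, gives tau = 1 and rho = 1.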

Examine the four scatterplots from Checkpoints 3 and 4. Roughly, what are the values of Kendall's tau and Spearman's rho for these four scatterplots?

                   Upper left   Upper right   Lower left                  Lower right
                                              w/o outlier   w/ outlier    w/o outlier   w/ outlier
Kendall's tau:
Spearman's rho:

Like r, the actual value of Kendall's tau or Spearman's rho is hard to judge in an absolute sense. Hence, we mainly use them to compare the strength of the association between different pairs of variables.

The correlation coefficient r is only appropriate as a measure of the strength of the relationship between two quantitative variables if the relationship is linear and there are no outliers. So why would we ever use it instead of a resistant measure like Kendall's tau or Spearman's rho? Because, if the relationship is linear with no outliers, then r (actually, the square of r) has a very nice interpretation, as we'll see in the next chapter. This is analogous to the mean and standard deviation; they're not resistant measures, but they have a nice interpretation (the 68-95-99.7 Rule) if the distribution is symmetric and unimodal with no outliers.