Scatterplots. STAT22000 Autumn 2013 Lecture 4. What to Look for in a Scatter Plot? Form of an Association

Transcription:

Scatterplots
STAT22000 Autumn 2013 Lecture 4
Yibi Huang
October 7, 2013

2.1 Scatterplots
2.2 Correlation

Scatter Plots

A scatter plot shows the relationship between two numerical variables measured on the same cases (e.g., same person, same company, etc.):

(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_n, y_n)

The values of one variable are on the x-axis, and the values of the other are on the y-axis. Each individual is represented by a point in the graph.

2010 Mean Math SAT Scores (Source: College Board)

State   Participation Rate (%)   Mean Math SAT
AK      48                       515
AL       7                       550
AR       4                       566
AZ      25                       525
CA      50                       516
CO      18                       572
CT      84                       514
DC      76                       464
DE      71                       495
FL      59                       498
...
IL       6                       600
...
WY       5                       567

The participation rate is the percentage of high school graduates in the class of 2010 who took the SAT.

[Figure: scatter plot of 2010 mean Math SAT score against participation rate, with the points for IL, AR, AL, AZ, and AK labeled.]

    > A = read.table("sat2010.csv", h=T)
    > A[1:2,]   # to show the first two rows of the data
        State Abbr Region Percent X2010R X2010M X2010W
    1  Alaska   AK      U      48    518    515    491
    2 Alabama   AL     SE       7    556    550    544
    > plot(A$Percent, A$X2010M, xlab="", ylab="2010 Mean Math SAT by State")

[Figure: the resulting scatter plot, "2010 Mean Math SAT by State".]

What to Look for in a Scatter Plot?

1. What is the overall pattern?
   - What is the form of the relationship? (linear, curved, clustered, ...)
   - What is the direction of the relationship? (positive association, negative association)
   - What is the strength of the relationship? (strong, weak, ...)
2. Are there any deviations from the overall pattern?
   - A point that falls outside the overall pattern is an outlier.

Form of an Association

[Figure: three scatter plots, titled "Linear Association", "Nonlinear Association", and "No Association".]
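The three forms of association named above (linear, nonlinear, none) can be illustrated with simulated data. The following is only a sketch: the variables `x`, `y_linear`, `y_nonlinear`, and `y_none`, the seed, and the noise levels are all made up for illustration and are not from the lecture.

```r
# Simulated (hypothetical) data illustrating three forms of association
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 100
x <- runif(n)                    # made-up explanatory variable on [0, 1]

y_linear    <- 2 * x + rnorm(n, sd = 0.2)                 # linear form
y_nonlinear <- 1 - 4 * (x - 0.5)^2 + rnorm(n, sd = 0.1)   # curved form
y_none      <- rnorm(n)                                   # unrelated to x

# Draw the three scatter plots side by side
op <- par(mfrow = c(1, 3))
plot(x, y_linear,    main = "Linear Association")
plot(x, y_nonlinear, main = "Nonlinear Association")
plot(x, y_none,      main = "No Association")
par(op)                          # restore the previous plot layout
```

Changing the noise standard deviations makes each pattern easier or harder to see, which previews the idea of strength.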

Positive and Negative Association

Positive association: High values of one variable tend to occur together with high values of the other variable.
Negative association: High values of one variable tend to occur together with low values of the other variable.

[Figure: three scatter plots, titled "positive linear", "negative linear", and "positive nonlinear".]

No Association

No relationship: X and Y vary independently. Knowing X tells you nothing about Y.

Strength of the Association (1)

The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.

- Weak association: for a fixed (known) X, the unknown Y has a large spread.
- Strong association: for a fixed (known) X, the unknown Y has a small spread.

[Figure: six scatter plots, titled "weak positive linear", "moderate positive linear", "strong positive linear", "moderate negative linear", "moderate nonlinear", and "strong positive nonlinear".]

Clusters (1)

Sometimes the points in a scatter plot form more than one cluster. Clusters in a graph suggest that the data may describe several distinct kinds of individuals.

Clusters (2)

The scatter plot of mean Math SAT score against participation rate contains at least two clusters.

[Figure: scatter plot "2010 Mean Math SAT by State".]

Different clusters may exhibit different forms of association. E.g., the left cluster above shows a moderate negative association, while the right cluster shows no (or a very weak) association.
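The description of strength given above (how much Y still varies once X is fixed) can be checked numerically. This is a rough sketch on simulated data; the names `y_weak`, `y_strong`, and `band`, and all numeric settings, are hypothetical choices made for this illustration.

```r
# Strength of association as the spread of Y near a fixed value of X
set.seed(2)                      # arbitrary seed
n <- 500
x <- runif(n, 0, 10)             # made-up explanatory variable

# Same linear form, different amounts of scatter around it
y_weak   <- 1 + 0.5 * x + rnorm(n, sd = 3)     # weak association
y_strong <- 1 + 0.5 * x + rnorm(n, sd = 0.3)   # strong association

# "Fix" X by keeping only the cases with X close to 5
band <- x > 4.5 & x < 5.5

# Spread of Y among those cases: large for the weak association,
# small for the strong one
c(weak = sd(y_weak[band]), strong = sd(y_strong[band]))
```

The narrower the band, the closer this gets to the spread of Y at a single fixed X, at the cost of having fewer cases in the band.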

Adding Categorical Variables to Scatter Plots

Points in different categories can be marked with different colors or symbols.

[Figure: scatter plot of 2010 Math SAT mean against participation rate, with points marked by region (North East, South East, Middle West, South West, West, Other) and the states IL, OH, AK, HI, and IN labeled.]

Outliers

In a scatter plot, outliers are points that fall outside of the overall pattern of the relationship.

If One of the Variables Is Categorical

The two variables in a scatter plot both have to be numerical. If one is categorical and the other is numerical, one can use side-by-side box plots to examine their relation.

[Figure: side-by-side box plots, "Lifetimes of female mice fed on 6 different diets". Vertical axis: Months Survived. Horizontal axis: Diet (NP, N/N85, lopro, N/R50, R/R50, N/R40).]

Be careful in your interpretation: this is NOT a positive association, because diet is not numerical.

2.2 Correlation

Outline

- The correlation coefficient r
- r does not distinguish between x and y
- r has no units of measurement
- r ranges from -1 to +1
- Influential points

Which of the 4 scatter plots below shows the strongest association? The weakest?

[Figure: four scatter plots.]

The Correlation Coefficient r

While judging the strength of associations, our eyes can be fooled by changing the plotting scales or the space around the cloud of points. We need a more objective approach.

Recall that the sample SDs of the variables X and Y are, respectively,

$$s_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad s_y = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2}.$$

The correlation coefficient r (or simply, correlation) is defined as

$$r = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right),$$

where the first factor is the z-score of x_i and the second factor is the z-score of y_i.
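The definition of r as an average product of z-scores can be checked against R's built-in `cor()`. The sketch below uses a small made-up data set (`x` and `y` are hypothetical) and assumes the divisor n both in the SDs and in the average. Using n - 1 in both places instead gives exactly the same r, because the two factors cancel, and that is why the result agrees with `cor()`.

```r
# Correlation computed by hand from z-scores (hypothetical data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# SDs with divisor n (an assumption of this sketch; R's sd() uses n - 1)
sx <- sqrt(mean((x - mean(x))^2))
sy <- sqrt(mean((y - mean(y))^2))

# z-scores of each observation
zx <- (x - mean(x)) / sx
zy <- (y - mean(y)) / sy

# r is the average product of the paired z-scores
r <- mean(zx * zy)

r
cor(x, y)                 # built-in Pearson correlation, same value
all.equal(r, cor(x, y))
```

Writing r this way makes it visible that each case contributes the product of its two z-scores, which is the quantity whose sign the lecture examines next.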

Correlation r Ranges From -1 to +1

Correlation r is a measure of the direction and strength of the linear relationship between two quantitative variables. r always lies between -1 and 1; the strength increases as you move away from 0 toward either -1 or 1.

- r > 0: positive association
- r < 0: negative association
- r ≈ 0: very weak linear relationship
- large |r|: strong linear relationship
- r = -1 or r = 1: only when all the data points on the scatter plot lie exactly along a straight line

[Figure: scale from -1 to 1. From left to right: perfect negative association at -1, strong negative, weak negative, no association at 0, weak positive, strong positive, perfect positive association at 1.]

[Figure: scatter plots with r = 0, 0.2, 0.4, 0.6, 0.8, 0.9.]

[Figure: scatter plots with r = 0.1, 0.3, 0.5, 0.7, 0.95, 0.99.]

Why Does r Measure the Strength of a Linear Relationship?

[Figure: scatter plot divided into four quadrants by the lines x = x̄ and y = ȳ. In each quadrant the signs of x_i - x̄ and y_i - ȳ are shown. The product is positive in the upper-right quadrant (both deviations > 0) and the lower-left quadrant (both deviations < 0), and negative in the other two.]

What is the sign of (x_i - x̄)(y_i - ȳ)? Here r > 0: there are more positive contributions than negative ones.

Correlation r Has No Unit

$$r = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

Recall that after standardization, neither the z-score of x_i nor the z-score of y_i has a unit, so r is unit free. We can therefore compare r between data sets in which the variables are measured in different units, or in which the variables themselves are different. E.g., we may compare the r between [swim time and pulse] with the r between [swim time and breathing rate].

Correlation r Has No Unit (2)

Changing the units of the variables does not change the correlation coefficient r, because we get rid of all units when we standardize (get z-scores).

E.g., no matter whether the temperatures are recorded in °F or in °C, the correlations are both r = 0.74, because C = (5/9)(F - 32).

[Figure: two scatter plots of daily high against daily low temperature, one in °F and one in °C.]
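The Fahrenheit/Celsius claim above is easy to verify. The temperatures in this sketch are made up (they are not the lecture's data, so the correlation will not be 0.74), and `to_C` is a helper defined here only for illustration.

```r
# Changing units does not change r (hypothetical temperature data)
low_F  <- c(28, 35, 41, 30, 47, 52, 38)   # made-up daily lows in Fahrenheit
high_F <- c(49, 55, 66, 50, 70, 74, 61)   # made-up daily highs in Fahrenheit

# Fahrenheit to Celsius: C = (5/9)(F - 32)
to_C <- function(f) 5 / 9 * (f - 32)

r_F <- cor(low_F, high_F)                 # correlation in Fahrenheit
r_C <- cor(to_C(low_F), to_C(high_F))     # correlation in Celsius

c(fahrenheit = r_F, celsius = r_C)        # the two values agree
```

The conversion shifts and rescales each variable by a positive factor, and both effects vanish once the values are turned into z-scores.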

r Does Not Distinguish x & y

Sometimes one uses the X variable to predict the Y variable. In this case, X is called the explanatory variable and Y the response. The correlation coefficient r does not distinguish between the two. It treats x and y symmetrically:

$$r = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

Swapping the x- and y-axes doesn't change r (both r = 0.74).

[Figure: two scatter plots of the same data, daily high (F) against daily low (F) and daily low (F) against daily high (F).]

Linear Transformation of x & y Does Not Change r

Correlation does not depend on the units of measurement. Translation and scaling have no effect on the absolute value of the correlation. For

$$x'_i = a + b\,x_i \quad\text{and}\quad y'_i = c + d\,y_i, \qquad i = 1, 2, \ldots, n,$$

with b and d nonzero,

$$r_{x'y'} = \operatorname{sign}(bd)\, r_{xy}.$$

In other words, r_{x'y'} = ±r_{xy}: a linear transformation does not change the absolute value of the correlation, and it flips the sign only when exactly one of b and d is negative.

Correlation r Describes Linear Relationships Only

The scatter plot below shows a perfect nonlinear association. All points fall on the quadratic curve y = 1 - 4(x - 0.5)^2.

[Figure: points on the curve y = 1 - 4(x - 0.5)^2, drawn as black and white dots.]

r of all black dots = 0.803
r of all dots (black + white) = 0.019

No matter how strong the association, the r of a curved relationship is NEVER 1 or -1. It can even be close to 0, as in the plot above.

Correlation Is VERY Sensitive to Outliers

Sometimes a single outlier can change r drastically. For the plot on the left,

r = 0.0031 with the outlier
r = 0.6895 without the outlier

Outliers that markedly change the form of the association when removed are called influential points.

Remark: Not all outliers are influential points.

When Data Points Are Clustered

[Figure: scatter plot with two clusters.]

In the plot above, each of the two clusters exhibits a weak negative association (r = -0.336 and -0.323), but the whole diagram shows a moderately strong positive association (r = 0.849). This is an example of Simpson's paradox.

An overall r can be misleading when data points are clustered. Cluster-wise r's should be reported as well.

Always Check the Scatter Plots (1)

The 4 data sets below have identical x̄, ȳ, s_x, s_y, and r.

        Dataset 1      Dataset 2      Dataset 3      Dataset 4
        x     y        x     y        x     y        x     y
        10    8.04     10    9.14     10    7.46     8     6.58
        8     6.96     8     8.14     8     6.77     8     5.76
        13    7.58     13    8.75     13    12.76    8     7.71
        9     8.81     9     8.77     9     7.11     8     8.84
        11    8.33     11    9.26     11    7.81     8     8.47
        14    9.96     14    8.10     14    8.84     8     7.04
        6     7.24     6     6.13     6     6.08     8     5.25
        4     4.26     4     3.10     4     5.36     19    12.50
        12    10.84    12    9.13     12    8.15     8     5.56
        7     4.82     7     7.26     7     6.42     8     7.91
        5     5.68     5     4.74     5     5.73     8     6.89
Ave     9     7.5      9     7.5      9     7.5      9     7.5
SD      3.16  1.94     3.16  1.94     3.16  1.94     3.16  1.94
r          0.82           0.82           0.82           0.82

How about their scatter plots?
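The summary rows of the table can be recomputed from the listed values. This is a sketch: `quartet`, `sd_n`, and `summary_table` are names introduced here, and the y values are typed in from the table with decimal points restored. The SD row appears to use the divisor n (for x, sqrt(110/11) is about 3.16), so `sd_n` is written that way as an assumption. R's own `sd()` divides by n - 1 and would give slightly larger values.

```r
# Recompute the summary rows of the four-dataset table
x123 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)   # shared x of Datasets 1-3

quartet <- list(
  "Dataset 1" = data.frame(x = x123,
    y = c(8.04, 6.96, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)),
  "Dataset 2" = data.frame(x = x123,
    y = c(9.14, 8.14, 8.75, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74)),
  "Dataset 3" = data.frame(x = x123,
    y = c(7.46, 6.77, 12.76, 7.11, 7.81, 8.84, 6.08, 5.36, 8.15, 6.42, 5.73)),
  "Dataset 4" = data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89))
)

# SD with divisor n (assumed here to match the table's SD row)
sd_n <- function(v) sqrt(mean((v - mean(v))^2))

# One column of summaries per dataset
summary_table <- sapply(quartet, function(d) c(
  mean_x = mean(d$x), mean_y = mean(d$y),
  sd_x   = sd_n(d$x), sd_y   = sd_n(d$y),
  r      = cor(d$x, d$y)
))
round(summary_table, 2)
```

Looping over `quartet` with `plot(d$x, d$y)` draws the four scatter plots, which is where the datasets stop looking alike.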

Always Check the Scatter Plots (2)

[Figure: scatter plots of Dataset 1, Dataset 2, Dataset 3, and Dataset 4.]

- In Dataset 2, y can be predicted exactly from x. But r < 1, because r only measures linear association.
- In Dataset 3, r would be 1 instead of 0.82 if the outlier were actually on the line.
- In Dataset 4, the correlation is produced entirely by the single outlier. If it were removed, every remaining x would equal 8 and no linear association would be left (strictly speaking, r would then be undefined).

The correlation coefficient can be misleading in the presence of outliers, multiple clusters, or nonlinear association.

The R Command to Find r

[Figure: scatter plot "2010 Mean Math SAT by State".]

The R command to find the correlation of two variables is cor():

    > cor(A$Percent, A$X2010M)
    [1] -0.8547424

Correlation Indicates Association, Not Causation

Brainstorm on Correlation

- Why do both variables have to be quantitative?
- Why doesn't a tight fit to a horizontal line imply a strong correlation?
- If the law required women to marry only men 2 years older than themselves, what would be the correlation between the ages of husbands and wives?
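The warning that r can be misleading in the presence of outliers can be tried out with `cor()` on simulated data. Everything in this sketch is hypothetical: the 30 unrelated points, the seed, and the single added point at (10, 10).

```r
# Effect of one influential point on r (simulated, hypothetical data)
set.seed(3)                     # arbitrary seed
x <- rnorm(30)                  # 30 cases with no built-in relationship
y <- rnorm(30)

# Add one extreme point far from the rest of the cloud
x_out <- c(x, 10)
y_out <- c(y, 10)

r_without <- cor(x, y)          # correlation of the original 30 cases
r_with    <- cor(x_out, y_out)  # correlation after adding the outlier

c(without_outlier = r_without, with_outlier = r_with)

# The scatter plot shows immediately what the single number hides
plot(x_out, y_out, xlab = "x", ylab = "y")
```

One point out of 31 is enough here to produce a sizable positive r, which is the sense in which such a point is influential.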