Notes 21: Scatterplots, Association, Causation

STA 6166 Fall 2007 Web-based Course Notes 21

We used two-way tables and segmented bar charts to examine the relationship between two categorical variables, and side-by-side boxplots to examine the relationship between a quantitative variable and a categorical variable. Scatterplots are the graphical tool for examining the relationship between two quantitative variables. The response variable goes on the y-axis and the explanatory variable on the x-axis. Often, we are trying to predict the response variable from the explanatory variable. Sometimes neither variable is obviously the explanatory or the response; then it doesn't matter which variable we plot on the y-axis.

What do we look for in examining a scatterplot? Is there a relationship between the two variables? That is, does the distribution of the y-variable change as the x-variable changes? If there is a relationship, we look for:

- the direction of the relationship: positive, negative, or some combination
- the form of the relationship: linear, curved, etc.
- the strength of the relationship (the more scatter of the points around the form, the weaker the relationship)
- outliers: points that don't fit the overall pattern or fall far away from the rest of the data (outliers in a scatterplot may or may not be outliers in the x-variable or the y-variable individually)
- other interesting features, such as clusters of points, or different relationships in different parts of the scatterplot.

Use these guidelines to describe the relationships in the following scatterplots. The data for the first three are taken from a data set on education and related data for the 50 states, year unspecified (source: Table 1.6 in Moore (2000), The Basic Practice of Statistics, 2nd ed.). The variables are all averages unless otherwise specified. The data for the fourth scatterplot are from Florida's 2000 election results.

[Figure: Average SAT verbal vs. average score. y-axis: SAT verbal.]

[Figure: Average score vs. percent of high school seniors taking SAT. x-axis: Pct. taking SAT.]

[Figure: Average Math SAT Scores vs. Teachers' Pay. x-axis: Teachers' pay ($1,000).]

[Figure: County vote totals for Bush versus Buchanan, Florida 2000. x-axis: Bush votes; y-axis: Buchanan votes.]

Correlation

The correlation coefficient r is a measure of the strength of the linear relationship between two quantitative variables.

It has the following properties:

- $-1 \le r \le 1$
- $r = 0$ indicates no linear relationship, $r > 0$ indicates a positive relationship, and $r < 0$ indicates a negative relationship.
- $r = 1$ occurs only when the data fall perfectly on a line with positive slope; $r = -1$ occurs only when the data fall perfectly on a line with negative slope.

Computing the correlation coefficient

$$z_x = \frac{x - \bar{x}}{s_x}, \qquad z_y = \frac{y - \bar{y}}{s_y}, \qquad r = \frac{\sum z_x z_y}{n - 1}$$

This is sometimes called Pearson's r or Pearson's correlation to distinguish it from other measures of association; however, the phrase "correlation coefficient" in statistics refers specifically to r.

Example: Airfare and distance to 12 destinations from Baltimore on Jan. 8, 1995:

[Figure: scatterplot of Airfare ($) on the y-axis vs. Distance (miles) on the x-axis]
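The z-score formula for r can be sketched in a few lines of Python. This is only an illustration, not part of the original notes: the function names `z_scores` and `correlation` are made up for the example, and the data are the 12 distance and airfare values tabulated for the Baltimore example in these notes. The standard deviation uses the n - 1 divisor, matching the formula.

```python
from statistics import mean, stdev

# Distance (miles) and airfare ($) for the 12 destinations in the airfare example.
cities = ["Atlanta", "Boston", "Chicago", "Dallas/Fort Worth", "Detroit", "Denver",
          "Miami", "New Orleans", "New York", "Orlando", "Pittsburgh", "St. Louis"]
distance = [576, 370, 612, 1216, 409, 1502, 946, 998, 189, 787, 210, 737]
airfare = [178, 138, 94, 278, 158, 258, 198, 188, 98, 179, 138, 98]


def z_scores(values):
    """Standardize with the sample mean and sample standard deviation (n - 1 divisor)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def correlation(x, y):
    """Pearson's r as the sum of z-score products divided by n - 1."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


# Print each city's z-scores and their product, then r itself.
zx, zy = z_scores(distance), z_scores(airfare)
for city, a, b in zip(cities, zx, zy):
    print(f"{city:18s} {a:7.3f} {b:7.3f} {a * b:7.3f}")
r = correlation(distance, airfare)
print(f"r = {r:.3f}")
```

Printing the per-city products makes it easy to see which destinations contribute most to r: points far from both means in the same direction add large positive products.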

City                 Distance   z-score   Airfare   z-score   Product
Atlanta                   576     -.339       178      .186     -.063
Boston                    370                 138
Chicago                   612     -.250        94    -1.226      .307
Dallas/Fort Worth        1216     1.250       278     1.868     2.335
Detroit                   409     -.754       158     -.150      .113
Denver                   1502     1.960       258     1.532     3.003
Miami                     946      .579       198      .523      .303
New Orleans               998      .709       188      .355      .251
New York                  189    -1.300        98    -1.159     1.507
Orlando                   787      .185       179      .203      .038
Pittsburgh                210    -1.248       138     -.486      .607
St. Louis                 737                  98
Mean                   712.67              166.92       Sum     8.745
Std. Dev.              402.69               59.45         r         ?

Checkpoint 1: Why is r a measure of the linear relationship between two variables?

Simulated Example: Let x ~ N(0, 1) and y = x + 3. What is the correlation between x and y?

      x         y       z_x       z_y   z_x z_y
 -.4326    2.5674    -.4802    -.4802     .2306
-1.6656    1.3344   -1.8451   -1.8451    3.4042
  .1253    3.1253     .1373     .1373     .0189
  .2877    3.2877     .3170     .3170     .1005
-1.1465    1.8535   -1.2704   -1.2704    1.6140
 1.1909    4.1909    1.3168    1.3168    1.7340
 1.1892    4.1892    1.3149    1.3149    1.7289
 -.0376    2.9624    -.0431    -.0431     .0019
  .3273    3.3273     .3609     .3609     .1302
  .1746    3.1746     .1919     .1919     .0368

Sum of z_x z_y = 9, N = 10, r = 1

A z-score tells us how far a value is from its mean, in standard deviations. If each x-value lies as far from the mean of x as its paired y-value lies from the mean of y, the two sets of z-scores will be equal and the slope between them will be one. If the paired z-scores differ only slightly, the correlation will be close to one. If they differ greatly, the correlation will be close to zero.

Other properties of correlation:

- it makes no difference which variable you call x and which you call y in computing correlation
- the correlation is unchanged by changing the units of measurement for x or y

The correlations between pairs of variables in a data set with more than two variables are often reported in a correlation matrix. For example,

Correlations
                         SAT verbal   SAT math   Percent taking SAT   Teachers' pay ($1,000)
SAT verbal                        1        .97                -.887                    -.455
SAT math                        .97          1                -.869                    -.379
Percent taking SAT            -.887      -.869                    1                      .63
Teachers' pay ($1,000)        -.455      -.379                  .63                        1

Note that the correlation between a variable and itself is 1. Checkpoint 2: Why?

A scatterplot matrix is a graphical analog to the correlation matrix. Remember that correlations should never be examined without also examining the scatterplots.

[Figure: scatterplot matrix; variable labels: SAT verbal, Percent taking SAT, Teachers' pay ($1,000)]
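The properties listed above can be checked numerically. The sketch below is an illustration only: it reuses the airfare data from these notes, the helper `correlation` is a made-up name, and the kilometre conversion is just one arbitrary change of units.

```python
from math import isclose
from statistics import mean, stdev


def correlation(x, y):
    """Pearson's r: sum of z-score products divided by n - 1."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (len(x) - 1)


# Distance (miles) and airfare ($) for the 12 destinations in the airfare example.
distance_miles = [576, 370, 612, 1216, 409, 1502, 946, 998, 189, 787, 210, 737]
airfare = [178, 138, 94, 278, 158, 258, 198, 188, 98, 179, 138, 98]

r_xy = correlation(distance_miles, airfare)

# Swapping the roles of x and y leaves r unchanged.
r_yx = correlation(airfare, distance_miles)

# Changing the units of x (miles to kilometres) leaves r unchanged.
distance_km = [1.609344 * d for d in distance_miles]
r_km = correlation(distance_km, airfare)

# The correlation of a variable with itself is 1.
r_self = correlation(airfare, airfare)

# As in the simulated example, y = x + 3 gives a correlation of 1.
shifted = [d + 3 for d in distance_miles]
r_shift = correlation(distance_miles, shifted)

print(isclose(r_xy, r_yx), isclose(r_xy, r_km))
print(isclose(r_self, 1.0), isclose(r_shift, 1.0))
```

The comparisons use `isclose` rather than `==` because the results agree only up to floating-point rounding.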

Further explorations of the correlation coefficient

Describe the relationship between the two variables in each of the following scatterplots:

[Figure: two scatterplots of y vs. x, side by side]

Checkpoint 3: Using the z-score interpretation, guess approximately what the correlations are.

The actual correlations are .36 and .975. The left-hand plot illustrates that the correlation coefficient is a measure of linear association. The right-hand plot illustrates, however, that relationships which are curved but monotone may nonetheless have a very high value of r. That's because the data still fall close to a line.

Checkpoint 4: Is the correlation coefficient resistant? Guess what the correlations would be with and without the outlier in each of the following scatterplots.

[Figure: two scatterplots of y vs. x, each containing an outlier]

Without outlier:
With outlier:
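A small experiment can make these two points concrete. The data below are hypothetical and are not the points plotted in these notes: a curved but monotone pattern, a symmetric U-shaped pattern, and a patternless cloud with and without one distant point. The helper name `correlation` is also just an illustrative choice.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Pearson's r: sum of z-score products divided by n - 1."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (len(x) - 1)


x = list(range(1, 11))

# Curved but monotone: y rises with x throughout, so the points stay fairly close to a line.
y_monotone = [v ** 2 for v in x]

# Curved and not monotone: y falls and then rises symmetrically, so there is no linear trend.
y_u_shape = [(v - 5.5) ** 2 for v in x]

# Hypothetical cloud with little linear pattern, then the same cloud plus one distant point.
x_cloud = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y_cloud = [2, 1, 3, 2, 3, 1, 2, 3, 2]
x_with_outlier = x_cloud + [20]
y_with_outlier = y_cloud + [20]

r_monotone = correlation(x, y_monotone)
r_u_shape = correlation(x, y_u_shape)
r_cloud = correlation(x_cloud, y_cloud)
r_with_outlier = correlation(x_with_outlier, y_with_outlier)

print(f"curved, monotone:     r = {r_monotone:.3f}")
print(f"curved, U-shaped:     r = {r_u_shape:.3f}")
print(f"cloud, no outlier:    r = {r_cloud:.3f}")
print(f"cloud, with outlier:  r = {r_with_outlier:.3f}")
```

In this sketch a single added point moves r from near zero to a large positive value, which is the sense in which r is not resistant.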

Resistant measures of association:

Kendall's tau: consider all pairs of points (except those with the same x-value); count the number of slopes that are positive, negative, and zero. Kendall's tau equals

$$\tau = \frac{\#\,\text{positive slopes} - \#\,\text{negative slopes}}{\#\,\text{positive slopes} + \#\,\text{negative slopes} + \#\,\text{zero slopes}}$$

Spearman's rho: replace the x-values by their ranks (smallest = 1, largest = n), replace the y-values by their ranks, and compute the correlation between the two sets of ranks (practice on the airfare data earlier).

City                 Distance   Rank   Airfare   Rank
Atlanta                   576      5       178      5
Boston                    370      3       138      3
Chicago                   612      6        94      1
Dallas/Ft. Worth         1216     11       278     10
Denver                   1502     12       258      9
Detroit                   409      4       158      4
Miami                     946      9       198      8
New Orleans               998     10       188      7
New York                  189      1        98      2
Orlando                   787      8       179      6
Pittsburgh                210      2       138      3
St. Louis                 737      7        98      2

Checkpoint 5: When will Kendall's tau and Spearman's rho be equal to 1 or -1?

Hence, Kendall's tau and Spearman's rho are measures of how monotone the relationship between x and y is.

Checkpoint 6: Are they more resistant than r? Are they completely resistant to outliers?
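Both measures are short to compute by hand or in code. The sketch below follows the definitions given in these notes rather than any particular library: `kendall_tau` skips pairs with equal x-values and counts slope signs, and `spearman_rho` correlates ranks. Tied values share a rank and the next distinct value gets the next integer, which reproduces the rank columns in the table above; many textbooks and packages instead assign tied values the average of their ranks, so results on tied data can differ. All function names are illustrative.

```python
from itertools import combinations
from statistics import mean, stdev


def correlation(x, y):
    """Pearson's r: sum of z-score products divided by n - 1."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (len(x) - 1)


def kendall_tau(x, y):
    """Tau as defined in these notes; pairs with the same x-value are skipped."""
    pos = neg = zero = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # slope undefined for equal x-values
        sign = (y2 - y1) * (x2 - x1)  # same sign as the slope between the two points
        if sign > 0:
            pos += 1
        elif sign < 0:
            neg += 1
        else:
            zero += 1
    return (pos - neg) / (pos + neg + zero)


def tied_ranks(values):
    """Rank 1 for the smallest value; ties share a rank (assumption matching the table)."""
    rank_of = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [rank_of[v] for v in values]


def spearman_rho(x, y):
    """Pearson correlation between the two sets of ranks."""
    return correlation(tied_ranks(x), tied_ranks(y))


# Airfare example, cities in alphabetical order as in the rank table.
distance = [576, 370, 612, 1216, 1502, 409, 946, 998, 189, 787, 210, 737]
airfare = [178, 138, 94, 278, 258, 158, 198, 188, 98, 179, 138, 98]

print("distance ranks:", tied_ranks(distance))
print("airfare ranks: ", tied_ranks(airfare))
print(f"Kendall's tau  = {kendall_tau(distance, airfare):.3f}")
print(f"Spearman's rho = {spearman_rho(distance, airfare):.3f}")
```

Trying the functions on a strictly increasing or strictly decreasing data set is a quick way to explore Checkpoint 5.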

Examine the scatterplots on the previous page. Roughly, what are the values of Kendall's tau and Spearman's rho for these four scatterplots?

                                                     Lower left                  Lower right
                  Upper left   Upper right   w/o outlier   w/ outlier   w/o outlier   w/ outlier
Kendall's tau:
Spearman's rho:

Like r, the actual value of Kendall's tau or Spearman's rho is hard to judge in an absolute sense. Hence, we mainly use them to compare the strength of the association between different pairs of variables.

The correlation coefficient r is only appropriate as a measure of the strength of the relationship between two quantitative variables if the relationship is linear and there are no outliers. So why would we ever use it instead of a resistant measure like Kendall's tau or Spearman's rho? Because, if the relationship is linear with no outliers, then r (actually, the square of r) has a very nice interpretation, as we'll see in the next chapter. This is analogous to the mean and standard deviation: they're not resistant measures, but they have a nice interpretation (the 68-95-99.7 Rule) if the distribution is symmetric and unimodal with no outliers.