Correlation

Correlation: Relationships between Variables
So far, nearly all of our discussion of inferential statistics has focused on testing for differences between group means.
However, researchers are often interested in graded relationships between variables, such as how well one variable can predict another.
Examples:
- How well do SAT scores predict a student's GPA?
- How is the amount of time a student takes to complete an exam related to her grade on that exam?
- How well do IQ scores correlate with income?
- How does a child's height correlate with his running speed?
- How does class size affect student performance?
Correlation: Relationships between Variables
Correlation is a statistical technique used to describe the relationship between two variables.
- Usually the two variables are simply observed as they exist in the environment, with no experimental manipulation (a correlational study).
- However, results from experimental studies (in which one of the variables is systematically manipulated) can also be analyzed using correlation.
Mean Comparison Approach

Height  Weight
70      150
67      140
72      180
75      190
68      145
69      150
71.5    164
71      140
72      142
69      136
67      123
68      155
66      140
72      145
73.5    160
73      190
69      155
73      165
72      150

Weights
Short   Tall
140     164
140     180
123     142
145     145
155     150
150     190
136     165
155     160
150     190
        140
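The grouping on this slide can be reproduced with a short Python sketch. The cutoff of 70 is an assumption: the slide does not state the split rule, but treating heights of 70 or less as "Short" yields exactly the nine Short and ten Tall weights listed.

```python
from statistics import mean

# (height, weight) pairs copied from the slide
data = [
    (70, 150), (67, 140), (72, 180), (75, 190), (68, 145), (69, 150),
    (71.5, 164), (71, 140), (72, 142), (69, 136), (67, 123), (68, 155),
    (66, 140), (72, 145), (73.5, 160), (73, 190), (69, 155), (73, 165),
    (72, 150),
]

CUTOFF = 70  # assumed split point; not stated on the slide

# Reduce height to two categories and keep only the weights in each group
short = [w for h, w in data if h <= CUTOFF]
tall = [w for h, w in data if h > CUTOFF]

print(len(short), round(mean(short), 1))  # 9 143.8
print(len(tall), round(mean(tall), 1))    # 10 162.6
```

Comparing the two group means discards the graded information in height: a 71-inch and a 75-inch person are both simply "Tall".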
Correlation: Scatter Plots
[Figure: scatter plot of the 19 height and weight pairs listed under Mean Comparison Approach]
Characteristics of the Correlation
A correlation coefficient is a single number describing the relationship between two variables. This number describes:
- The direction of the relationship
  - Variables sharing a positive correlation tend to change in the same direction (e.g., height and weight). As the value of one of the variables (height) increases, the value of the other variable (weight) also increases.
  - Variables sharing a negative correlation tend to change in opposite directions (e.g., snowfall and beach visitors). As the value of one of the variables (amount of snowfall) increases, the value of the other variable (number of beach visitors) decreases.
- The strength of the relationship
  - Variables that share a strong correlation (close to +1 or -1) strongly predict one another, while variables that share a weak correlation (near 0) do not.
Positive versus Negative Correlations
[Figure panels: Positive Correlation, Negative Correlation]
Strong versus Weak Correlations
Correlation is not Causation
Possible Sources of Correlation
- The relationship is causal. Manipulating the predictor variable causes an increase or decrease in the criterion variable.
  - E.g., leg strength and sprinting speed
- The causal relationship is backwards (reverse causality). Manipulating the criterion variable causes changes in the predictor variable.
- The two variables work together systematically to cause an effect.
- The relationship may be due to one or more confounding variables. Changes in both variables reflect the effect of a confounding variable.
  - E.g., intelligence as an explanation for correlated performance on different exams
  - E.g., increasing population density in cities increases both the number of physicians and the number of crimes
Measuring Correlation: Pearson's r
To compute a correlation you need a pair of scores, X and Y, for each individual in the sample.
The most commonly used measure of correlation is Pearson's product-moment correlation coefficient, or more simply, Pearson's r.
Conceptually, Pearson's r is a ratio between the degree to which two variables (X and Y) vary together and the degree to which they vary separately.
$$r = \frac{\text{co-variability}(X, Y)}{\text{variability}(X) \cdot \text{variability}(Y)}$$
The Covariance
The term in the numerator of Pearson's r is the covariance, an unnormalized statistic representing the degree to which two variables (X and Y) vary together.
$$\mathrm{cov}_{XY} = \frac{\sum (X - M_X)(Y - M_Y)}{n - 1}$$
- Mathematically, it is the average of the products of the deviations of two paired variables (with n - 1 rather than n in the denominator).
- The covariance depends both on how consistently X and Y tend to vary together and on the individual variability of the variables (X and Y).
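As a minimal sketch of the definition above, the covariance can be computed directly from the deviation products. The function name and the scores below are hypothetical, chosen only to show how the sign and the size of the result behave.

```python
def covariance(xs, ys):
    """Sample covariance: sum of deviation products divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    n = len(xs)
    m_x = sum(xs) / n
    m_y = sum(ys) / n
    return sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical scores (not from the slides)
hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]          # rises with hours
score_down = [10, 8, 6, 4, 2]        # falls as hours rise
score_up_x10 = [20, 40, 60, 80, 100] # same pattern, ten times the spread

print(covariance(hours, score_up))      # 5.0
print(covariance(hours, score_down))    # -5.0
print(covariance(hours, score_up_x10))  # 50.0
```

The last line shows why the covariance is called unnormalized: the relationship is equally consistent in the first and third cases, yet the value grows with the variability of the scores.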
The Covariance
Notice that the formula for covariance looks a lot like the formula for variance:
$$s_X^2 = \frac{\sum (X - M_X)^2}{n - 1} = \frac{\sum (X - M_X)(X - M_X)}{n - 1}$$
$$\mathrm{cov}_{XY} = \frac{\sum (X - M_X)(Y - M_Y)}{n - 1}$$
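The Covariance
Moreover, they share a similar computational formula:
$$s_X^2 = \frac{SS_X}{n - 1}; \quad \text{where } SS_X = \sum X^2 - \frac{\left(\sum X\right)^2}{n} = \sum XX - \frac{\sum X \sum X}{n}$$
$$\mathrm{cov}_{XY} = \frac{SP_{XY}}{n - 1}; \quad \text{where } SP_{XY} = \sum XY - \frac{\sum X \sum Y}{n}$$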
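The step from the definitional to the computational form is not shown on the slide. A short derivation, using only $M_X = \sum X / n$ and $M_Y = \sum Y / n$, fills it in:

```latex
\begin{aligned}
\sum (X - M_X)(Y - M_Y)
  &= \sum XY - M_Y \sum X - M_X \sum Y + n M_X M_Y \\
  &= \sum XY - \frac{\sum X \sum Y}{n} - \frac{\sum X \sum Y}{n} + \frac{\sum X \sum Y}{n} \\
  &= \sum XY - \frac{\sum X \sum Y}{n} = SP_{XY}
\end{aligned}
```

Setting Y = X in the same steps gives $SS_X = \sum X^2 - (\sum X)^2 / n$.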
Computing Pearson's r
- Pearson's r is computed by dividing the covariance by the product of the standard deviations of the two variables.
- This removes the effect of the variability of the individual variables.
$$r = \frac{\mathrm{cov}_{XY}}{s_X s_Y} = \frac{SP_{XY}}{\sqrt{SS_X \, SS_Y}}$$
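A minimal Python sketch of this formula, using the computational forms of SS and SP. The function name and the data are hypothetical; the second call rescales Y to illustrate that dividing by the standard deviations removes the effect of each variable's own variability.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r as SP_XY / sqrt(SS_X * SS_Y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    n = len(xs)
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sp_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sp_xy / math.sqrt(ss_x * ss_y)


# Hypothetical scores (not from the slides)
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
score_x10 = [10 * s for s in score]  # same pattern, ten times the spread

print(round(pearson_r(hours, score), 3))      # 0.775
print(round(pearson_r(hours, score_x10), 3))  # 0.775
```

Unlike the covariance, r is unchanged when one variable is multiplied by a positive constant.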
Computing Pearson's r: Example

X    Y
0    2
10   6
4    2
8    4
8    6
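Computing Pearson's r: Example

X    Y    XY
0    2    0
10   6    60
4    2    8
8    4    32
8    6    48

Compute SS_X, SS_Y, and SP_XY (N = 5 pairs):
$$SS_X = \sum X^2 - \frac{\left(\sum X\right)^2}{N} = 244 - \frac{30^2}{5} = 244 - 180 = 64$$
$$SS_Y = \sum Y^2 - \frac{\left(\sum Y\right)^2}{N} = 96 - \frac{20^2}{5} = 96 - 80 = 16$$
$$SP_{XY} = \sum XY - \frac{\sum X \sum Y}{N} = 148 - 120 = 28$$
Compute r:
$$r = \frac{SP_{XY}}{\sqrt{SS_X \, SS_Y}} = \frac{28}{\sqrt{64 \cdot 16}} = \frac{28}{32} = 0.875$$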
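The hand calculation can be checked with a few lines of Python. This is only a sketch that mirrors the slide's steps; the variable names are chosen here for readability.

```python
import math

# Paired scores from the example
x = [0, 10, 4, 8, 8]
y = [2, 6, 2, 4, 6]
n = len(x)

# Column totals used by the computational formulas
sum_x, sum_y = sum(x), sum(y)                # 30, 20
sum_x2 = sum(v * v for v in x)               # 244
sum_y2 = sum(v * v for v in y)               # 96
sum_xy = sum(a * b for a, b in zip(x, y))    # 148

ss_x = sum_x2 - sum_x ** 2 / n               # 244 - 180
ss_y = sum_y2 - sum_y ** 2 / n               # 96 - 80
sp_xy = sum_xy - sum_x * sum_y / n           # 148 - 120

r = sp_xy / math.sqrt(ss_x * ss_y)
print(ss_x, ss_y, sp_xy, r)  # 64.0 16.0 28.0 0.875
```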
Computing Pearson's r: Example
Hypothesis testing for r:
- The null hypothesis is that the population correlation coefficient ρ = 0.
- The alternative hypothesis is that ρ ≠ 0.
$$r_{crit}(df) = \frac{t_{crit}(df)}{\sqrt{df + t_{crit}(df)^2}}; \quad df = N - 2$$
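The slide gives the conversion from a critical t to a critical r without showing where it comes from. One way to see it, assuming the standard t statistic for a correlation (this relation is not stated on the slides), is to solve that statistic for r:

```latex
\begin{aligned}
t &= \frac{r \sqrt{df}}{\sqrt{1 - r^2}} \\
t^2 (1 - r^2) &= r^2 \, df \\
t^2 &= r^2 \left( df + t^2 \right) \\
r &= \frac{t}{\sqrt{df + t^2}}
\end{aligned}
```

Substituting $t_{crit}(df)$ for t gives $r_{crit}(df)$.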
t-distribution Table
[Figure: one-tailed test (α in one tail beyond t) and two-tailed test (α/2 in each tail, beyond -t and t)]

Level of significance for one-tailed test
       0.25    0.2     0.15    0.1     0.05    0.025   0.01    0.005   0.0005
Level of significance for two-tailed test
df     0.5     0.4     0.3     0.2     0.1     0.05    0.02    0.01    0.001
1      1.000   1.376   1.963   3.078   6.314   12.706  31.821  63.657  636.619
2      0.816   1.061   1.386   1.886   2.920   4.303   6.965   9.925   31.599
3      0.765   0.978   1.250   1.638   2.353   3.182   4.541   5.841   12.924
4      0.741   0.941   1.190   1.533   2.132   2.776   3.747   4.604   8.610
5      0.727   0.920   1.156   1.476   2.015   2.571   3.365   4.032   6.869
6      0.718   0.906   1.134   1.440   1.943   2.447   3.143   3.707   5.959
7      0.711   0.896   1.119   1.415   1.895   2.365   2.998   3.499   5.408
8      0.706   0.889   1.108   1.397   1.860   2.306   2.896   3.355   5.041
9      0.703   0.883   1.100   1.383   1.833   2.262   2.821   3.250   4.781
10     0.700   0.879   1.093   1.372   1.812   2.228   2.764   3.169   4.587
11     0.697   0.876   1.088   1.363   1.796   2.201   2.718   3.106   4.437
12     0.695   0.873   1.083   1.356   1.782   2.179   2.681   3.055   4.318
13     0.694   0.870   1.079   1.350   1.771   2.160   2.650   3.012   4.221
14     0.692   0.868   1.076   1.345   1.761   2.145   2.624   2.977   4.140
15     0.691   0.866   1.074   1.341   1.753   2.131   2.602   2.947   4.073
16     0.690   0.865   1.071   1.337   1.746   2.120   2.583   2.921   4.015
17     0.689   0.863   1.069   1.333   1.740   2.110   2.567   2.898   3.965
18     0.688   0.862   1.067   1.330   1.734   2.101   2.552   2.878   3.922
19     0.688   0.861   1.066   1.328   1.729   2.093   2.539   2.861   3.883
20     0.687   0.860   1.064   1.325   1.725   2.086   2.528   2.845   3.850
21     0.686   0.859   1.063   1.323   1.721   2.080   2.518   2.831   3.819
22     0.686   0.858   1.061   1.321   1.717   2.074   2.508   2.819   3.792
23     0.685   0.858   1.060   1.319   1.714   2.069   2.500   2.807   3.768
24     0.685   0.857   1.059   1.318   1.711   2.064   2.492   2.797   3.745
25     0.684   0.856   1.058   1.316   1.708   2.060   2.485   2.787   3.725
26     0.684   0.856   1.058   1.315   1.706   2.056   2.479   2.779   3.707
27     0.684   0.855   1.057   1.314   1.703   2.052   2.473   2.771   3.690
28     0.683   0.855   1.056   1.313   1.701   2.048   2.467   2.763   3.674
29     0.683   0.854   1.055   1.311   1.699   2.045   2.462   2.756   3.659
30     0.683   0.854   1.055   1.310   1.697   2.042   2.457   2.750   3.646
40     0.681   0.851   1.050   1.303   1.684   2.021   2.423   2.704   3.551
50     0.679   0.849   1.047   1.299   1.676   2.009   2.403   2.678   3.496
100    0.677   0.845   1.042   1.290   1.660   1.984   2.364   2.626   3.390
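Computing Pearson's r: Example
$$r_{crit}(df) = \frac{t_{crit}(df)}{\sqrt{df + t_{crit}(df)^2}}; \quad df = N - 2 = 5 - 2 = 3$$
$$t_{crit}(3) = 3.182 \quad \text{(two-tailed, } \alpha = 0.05\text{)}$$
$$r_{crit} = \frac{3.182}{\sqrt{3 + 3.182^2}} = \frac{3.182}{3.623} = 0.878$$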
Critical values for Pearson's r

Level of significance for one-tailed test
           0.05    0.025   0.01    0.005   0.0005
Level of significance for two-tailed test
df = n-2   0.1     0.05    0.02    0.01    0.001
1          0.988   0.997   1.000   1.000   1.000
2          0.900   0.950   0.980   0.990   0.999
3          0.805   0.878   0.934   0.959   0.991
4          0.729   0.811   0.882   0.917   0.974
5          0.669   0.754   0.833   0.875   0.951
6          0.621   0.707   0.789   0.834   0.925
7          0.582   0.666   0.750   0.798   0.898
8          0.549   0.632   0.715   0.765   0.872
9          0.521   0.602   0.685   0.735   0.847
10         0.497   0.576   0.658   0.708   0.823
11         0.476   0.553   0.634   0.684   0.801
12         0.458   0.532   0.612   0.661   0.780
13         0.441   0.514   0.592   0.641   0.760
14         0.426   0.497   0.574   0.623   0.742
15         0.412   0.482   0.558   0.606   0.725
16         0.400   0.468   0.543   0.590   0.708
17         0.389   0.456   0.529   0.575   0.693
18         0.378   0.444   0.516   0.561   0.679
19         0.369   0.433   0.503   0.549   0.665
20         0.360   0.423   0.492   0.537   0.652
21         0.352   0.413   0.482   0.526   0.640
22         0.344   0.404   0.472   0.515   0.629
23         0.337   0.396   0.462   0.505   0.618
24         0.330   0.388   0.453   0.496   0.607
25         0.323   0.381   0.445   0.487   0.597
26         0.317   0.374   0.437   0.479   0.588
27         0.311   0.367   0.430   0.471   0.579
28         0.306   0.361   0.423   0.463   0.570
29         0.301   0.355   0.416   0.456   0.562
30         0.296   0.349   0.409   0.449   0.554
40         0.257   0.304   0.358   0.393   0.490
50         0.231   0.273   0.322   0.354   0.443
100        0.164   0.195   0.230   0.254   0.321
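Entries in this table can be reproduced from the t table with the conversion formula given earlier. The sketch below does this for a handful of rows in the two-tailed 0.05 column; the function name and the choice of rows are illustrative, and the critical t values are copied from the t-distribution table.

```python
import math


def r_crit(t_crit, df):
    """Convert a critical t value into the matching critical r."""
    return t_crit / math.sqrt(df + t_crit ** 2)


# Two-tailed alpha = 0.05 critical t values, keyed by df
t_05 = {1: 12.706, 2: 4.303, 3: 3.182, 10: 2.228, 30: 2.042}

for df, t in t_05.items():
    print(df, f"{r_crit(t, df):.3f}")
# 1 0.997
# 2 0.950
# 3 0.878
# 10 0.576
# 30 0.349
```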
Computing Pearson's r: Example
With $df = N - 2 = 5 - 2 = 3$ and $t_{crit}(3) = 3.182$:
$$r_{crit} = \frac{3.182}{\sqrt{3 + 3.182^2}} = \frac{3.182}{3.623} = 0.878$$
$$|r| = 0.875 < r_{crit} = 0.878$$
Retain $H_0$ (fail to reject it): the correlation is not significant.
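The same decision can be reached on the t scale. The sketch below assumes the standard t statistic for a correlation, $t = r\sqrt{df} / \sqrt{1 - r^2}$, which the slides do not state explicitly, and compares both forms of the test for the example.

```python
import math

r, n = 0.875, 5
df = n - 2
t_crit = 3.182  # two-tailed, alpha = 0.05, df = 3 (from the t table)

# Observed t for the sample correlation (assumed standard formula)
t_obs = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Critical r obtained from the critical t, as on the slide
r_crit = t_crit / math.sqrt(df + t_crit ** 2)

reject_on_r = abs(r) > r_crit
reject_on_t = abs(t_obs) > t_crit
print(reject_on_r, reject_on_t)  # False False
```

Both comparisons are algebraically equivalent, so they always agree; here neither rejects the null hypothesis.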
Linear Correlation: Assumptions
1. Linearity
   Assumes that the relationship between the paired scores is best described by a straight line.
2. Normality
   Assumes that the marginal score distributions, their joint distribution, and any conditional distributions are normally distributed.
3. Homoscedasticity
   Assumes that the variability around the regression line is homogeneous across different score values.
Other Correlation Coefficients
- Spearman's correlation coefficient ($r_s$) for ranked data
  As the name suggests, Spearman's correlation is used when the scores for both X and Y consist of (or have been converted to) ordinal ranks.
- The point biserial correlation coefficient ($r_{pb}$)
  This correlation is used when one of the scores is continuous and the other is dichotomous, taking on one of only two possible values.
- The phi correlation coefficient ($r_\phi$)
  The phi correlation is used when both scores are dichotomous.
All of the above can be computed in the same manner as Pearson's correlation.
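To illustrate the last point, the sketch below codes each dichotomous variable as 0 or 1 and then applies the ordinary Pearson formula. All data, variable names, and the 0/1 coding are hypothetical, chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r as SP_XY / sqrt(SS_X * SS_Y)."""
    n = len(xs)
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sp_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sp_xy / math.sqrt(ss_x * ss_y)


# Point biserial: one dichotomous variable (coded 0/1), one continuous
group = [0, 0, 0, 1, 1, 1]           # hypothetical group membership
score = [60, 65, 70, 75, 80, 85]     # hypothetical exam scores
r_pb = pearson_r(group, score)

# Phi: both variables dichotomous (coded 0/1)
studied = [1, 1, 1, 1, 0, 0, 0, 0]   # hypothetical
passed = [1, 1, 1, 0, 1, 0, 0, 0]    # hypothetical
r_phi = pearson_r(studied, passed)

print(round(r_pb, 3), r_phi)  # 0.878 0.5
```

Which category receives the code 1 only affects the sign of the result, not its magnitude.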
Converting Data for Spearman's Correlation

Original Data      Converted Scores
Age   Height       Age rank   Height rank
10    31.4         1          1
11    41           2          2
12    47.8         3          3
13    52.8         4          4
14    55.7         5          5
15    58.3         6          6
16    60.7         7          7
17    62.1         8          8
18    62.7         9          9
19    63.3         10         10
20    64.1         11         11
21    64.3         12         12.5
22    64.6         13         15
23    64.7         14         16
24    64.5         15         14
25    64.3         16         12.5

r = 0.86           r = 0.97
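The conversion to ranks can be sketched in Python as follows. Tied scores share the mean of the ranks they occupy, which is how the two heights of 64.3 both receive 12.5; Spearman's correlation is then Pearson's r applied to the ranks. The helper names are chosen here for illustration.

```python
import math


def average_ranks(values):
    """Rank from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of rank positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    n = len(xs)
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sp_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sp_xy / math.sqrt(ss_x * ss_y)


age = list(range(10, 26))
height = [31.4, 41, 47.8, 52.8, 55.7, 58.3, 60.7, 62.1,
          62.7, 63.3, 64.1, 64.3, 64.6, 64.7, 64.5, 64.3]

age_rank = average_ranks(age)
height_rank = average_ranks(height)
r_s = pearson_r(age_rank, height_rank)

print(height_rank[-5:])  # [12.5, 15.0, 16.0, 14.0, 12.5]
print(round(r_s, 2))     # 0.97
```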
Converting Data for the Point Biserial Correlation
Converting Data for Phi Correlation