Correlation and Linear Regression
Correlation: Relationships between Variables

So far, nearly all of our discussion of inferential statistics has focused on testing for differences between group means or proportions. However, researchers are often interested in graded relationships between variables, such as how well one variable can predict another.

Examples:
- How well do SAT scores predict a student's GPA?
- How is the amount of time a student takes to complete an exam related to her grade on that exam?
- How well do IQ scores correlate with income?
- How does a child's height correlate with his running speed?
- How does class size affect student performance?
Sample Data

Height  Weight
70      150
67      140
72      180
75      190
68      145
69      150
71.5    164
71      140
72      142
69      136
67      123
68      155
66      140
72      145
73.5    160
73      190
69      155
73      165
72      150
Characteristics of the Correlation
Correlation and Regression

A correlation coefficient is a single number describing the relationship between two variables. This number describes:

The direction of the relationship
- Variables sharing a positive correlation tend to change in the same direction (e.g., height and weight). As the value of one variable (height) increases, the value of the other variable (weight) also increases.
- Variables sharing a negative correlation tend to change in opposite directions (e.g., snowfall and beach visitors). As the value of one variable (amount of snowfall) increases, the value of the other variable (number of beach visitors) decreases.

The strength of the relationship
- Variables that share a strong correlation (close to +1 or -1) strongly predict one another, while variables that share a weak correlation (near 0) do not.
Positive versus Negative Correlations
[Figure: scatter plots illustrating a positive correlation and a negative correlation]
Strong versus Weak Correlations
[Figure: scatter plots illustrating strong and weak correlations]
Correlation is not Causation
http://www.tylervigen.com/spurious-correlations
Possible Sources of Correlation

- The relationship is causal: manipulating the predictor variable causes an increase or decrease in the criterion variable. E.g., leg strength and sprinting speed.
- The causal relationship is backwards (reverse causality): manipulating the criterion variable causes changes in the predictor variable. E.g., the relationship between hospital visits and illness.
- The two variables work together systematically to cause an effect.
- The relationship may be due to one or more confounding variables: changes in both variables reflect the effect of a confounding variable. E.g., intelligence as an explanation for correlated performance on different exams; e.g., increasing density in cities increases both the number of physicians and the number of crimes.
Measuring Correlation: Pearson's r

To compute a correlation you need a pair of scores, X and Y, for each individual in the sample. The most commonly used measure of correlation is Pearson's product-moment correlation coefficient, or more simply, Pearson's r. Conceptually, Pearson's r is a ratio between the degree to which two variables (X and Y) vary together and the degree to which they vary separately:

    r = co-variability(X, Y) / [variability(X) × variability(Y)]
The Covariance

The term in the numerator of Pearson's r is the covariance, an unnormalized statistic representing the degree to which two variables (X and Y) vary together:

    cov_XY = E[(X − E[X])(Y − E[Y])] = Σ(X − E[X])(Y − E[Y]) / n

Mathematically, it is the average of the product of the deviations of two paired variables. The covariance depends both on how consistently X and Y tend to vary together and on the individual variability of the variables (X and Y).
Computing Pearson's r

Pearson's r is computed by dividing the covariance by the product of the standard deviations of each of the variables. This removes the effect of the variability of the individual variables:

    ρ = E[z_X · z_Y] = cov_XY / (σ_X · σ_Y)

    ρ̂ = r = cov_XY / (s_X · s_Y)
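The two formulas above can be sketched in Python using the height/weight sample from the earlier data slide. (This is an illustrative sketch, not part of the original slides; note that the numerator and denominator must use the same convention, so the code uses the sample convention, n − 1, for both the covariance and the standard deviations.)

```python
import numpy as np

# Height/weight sample from the earlier slide
height = np.array([70, 67, 72, 75, 68, 69, 71.5, 71, 72, 69,
                   67, 68, 66, 72, 73.5, 73, 69, 73, 72])
weight = np.array([150, 140, 180, 190, 145, 150, 164, 140, 142, 136,
                   123, 155, 140, 145, 160, 190, 155, 165, 150])
n = len(height)

# Covariance: average product of the paired deviations (sample convention, n - 1)
cov_xy = np.sum((height - height.mean()) * (weight - weight.mean())) / (n - 1)

# Pearson's r: covariance divided by the product of the standard deviations
r = cov_xy / (height.std(ddof=1) * weight.std(ddof=1))

# Sanity check against NumPy's built-in correlation matrix
assert np.isclose(r, np.corrcoef(height, weight)[0, 1])
```

Dividing by the two standard deviations is what confines r to the range −1 to +1, regardless of the units of X and Y.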
Linear Correlation: Assumptions

1. Linearity: assumes that the relationship between the paired scores is best described by a straight line.
2. Normality: technically, assumes that the marginal score distributions, their joint distribution, and any conditional distributions are normally distributed. However, the most important assumption is that the residuals are normally distributed.
3. Homoscedasticity (constant variance): assumes that the variability around the regression line is homogeneous across different score values.
Breaking the Linearity and Normality Assumptions
[Figure: scatter plots of several datasets whose shapes violate the assumptions in different ways]
r = 0.816 for all of the above
Which has the strongest correlation coefficient?
[Figure: several scatter plots for comparison]
Other Correlation Coefficients

- Spearman's correlation coefficient (r_s) for ranked data: as the name suggests, Spearman's correlation is used when the scores for both X and Y consist of (or have been converted to) ordinal ranks.
- The point-biserial correlation coefficient (r_pb): this correlation is used when one of the scores is continuous and the other is dichotomous, taking on one of only two possible values.
- The phi correlation coefficient (r_φ): the phi correlation is used when both scores are dichotomous.

All of the above can be computed in the same manner as Pearson's correlation.
Converting Data for Spearman's Correlation

Original Data
Age   Height
10    31.4
11    41
12    47.8
13    52.8
14    55.7
15    58.3
16    60.7
17    62.1
18    62.7
19    63.3
20    64.1
21    64.3
22    64.6
23    64.7
24    64.5
25    64.3

r = 0.86
Converting Data for Spearman's Correlation

Original Data        Converted Scores
Age   Height         Age Rank   Height Rank
10    31.4           1          1
11    41             2          2
12    47.8           3          3
13    52.8           4          4
14    55.7           5          5
15    58.3           6          6
16    60.7           7          7
17    62.1           8          8
18    62.7           9          9
19    63.3           10         10
20    64.1           11         11
21    64.3           12         12.5
22    64.6           13         15
23    64.7           14         16
24    64.5           15         14
25    64.3           16         12.5

r = 0.86 (original scores)    r = 0.97 (ranks)
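As a sketch of the conversion above (not part of the original slides), scipy can rank-transform and correlate the age/height data in one call; tied scores receive averaged ranks, exactly as in the table:

```python
import numpy as np
from scipy import stats

# Age/height data from the slide: monotone but nonlinear (growth flattens out)
age = np.arange(10, 26)
height = np.array([31.4, 41, 47.8, 52.8, 55.7, 58.3, 60.7, 62.1,
                   62.7, 63.3, 64.1, 64.3, 64.6, 64.7, 64.5, 64.3])

# Pearson on the raw scores understates the monotone relationship
r_pearson, _ = stats.pearsonr(age, height)

# Spearman = Pearson computed on the rank-converted scores
r_spearman, _ = stats.spearmanr(age, height)

# Equivalent by hand: rank both variables (ties averaged), then correlate the ranks
ranks_h = stats.rankdata(height)   # the two 64.3 values both get rank 12.5
assert np.isclose(r_spearman, stats.pearsonr(stats.rankdata(age), ranks_h)[0])
```

Because ranking straightens out any monotone curve, Spearman's coefficient comes out higher than Pearson's here, matching the slide's 0.86 versus 0.97.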
The Coefficient of Determination

For linear correlations (and regression generally), the strength of fit of a model is most commonly evaluated using r², the coefficient of determination. r² is calculated as the square of the correlation coefficient r and represents the proportion of variability explained by the model. The remainder of the variability in the data, which cannot be explained by the variables included in the model, is called residual error.
Testing for the Significance of r

Under a null hypothesis of ρ = 0 (and assuming normal residuals), the point estimate r is distributed as a t variable, with standard error:

    σ̂_r = s_r = √[(1 − r²) / df_r],  where df_r = n − 2

This means that we can test whether r is significantly different from 0 by computing the t-statistic:

    t(df_r) = r / s_r = r · √[df_r / (1 − r²)]

and computing the tail probability associated with the corresponding t distribution.
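This test can be sketched as a small helper (illustrative code, not from the slides; the function name and the example values r = 0.5, n = 20 are mine):

```python
import numpy as np
from scipy import stats

def r_significance(r, n):
    """t-statistic and two-tailed p-value for H0: rho = 0."""
    df = n - 2
    se = np.sqrt((1 - r**2) / df)   # standard error of r under the null
    t = r / se                      # equivalently r * sqrt(df / (1 - r**2))
    p = 2 * stats.t.sf(abs(t), df)  # two-tailed tail probability
    return t, p

# Example: r = 0.5 observed in a sample of n = 20 pairs
t, p = r_significance(0.5, 20)
```

With these example values the t-statistic is about 2.45 on 18 degrees of freedom, which is significant at the conventional .05 level.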
Introduction to Linear Regression

Regression is a statistical procedure for modeling the relationship among variables in order to predict the value of a response (or dependent) variable from one or more explanatory (or predictor) variables. Linear regression is the model assumed in computing a correlation coefficient; in regression, however, the statistical model is made explicit.

Imagine that I ask you to guess the weight of a college-aged male who is hidden from view. What would your best guess be?
Introduction to Regression

    µ_weight = 158.26
    σ_weight = 18.64
Introduction to Regression

Imagine that I ask you to guess the weight of a college-aged male who is hidden from view. What would your best guess be? What if I also gave you his height? Intuitively, it should be clear that you can do better.
Introduction to Regression

The Pearson correlation (ρ or r) measures the degree to which a set of data points forms a linear (straight-line) relationship. Simple regression describes the linear relationship between a dependent variable (Y) and one predictor variable (X). The resulting line is called the regression line.
Regression and Linear Equations

You should remember the following from your high school algebra course:
- Any straight line can be represented by an equation of the form Y = b1·X + b0, where b1 and b0 are constants.
- The value of b1 is called the slope and determines the direction and degree to which the line is tilted.
- The value of b0 is called the Y-intercept and determines the point where the line crosses the Y-axis.

In the context of linear regression, b0 and b1 are called regression coefficients.
Regression and Linear Equations

Example:
    b1 = 0.5
    b0 = 1.0

    Ŷ = b1·X + b0 = 0.5X + 1
Residuals: Errors of Prediction

How well a regression line fits a set of data points can be measured by calculating the distance between the data points and the line. Using the formula Ŷ = b1·X + b0, it is possible to find the predicted value Ŷ for any X. The residual, or error of prediction, between the predicted value and the actual value can be found by computing the difference Y − Ŷ.

The regression line is selected to be the best fit in the least-squares sense. This means that we want to compute the line that minimizes the sum of squared residuals:

    SS_residual = Σ(Y − Ŷ)²
Residuals: Errors of Prediction
[Figure: scatter plot showing the regression line Ŷ = b1·X + b0, a data point (X, Y), and its residual (Y − Ŷ)]
The Standard Error of Estimate

The measure of unpredicted variability, or error, for the regression line is called the standard error of estimate (s_e or s_{Y−Ŷ}). You can think of it as analogous to the standard deviation we would compute if we used the mean Ȳ as our estimate of the variable Y:

    s_Y = √(SS_Y / df_Y) = √[Σ(Y − Ȳ)² / (n − 1)]

    s_{Y−Ŷ} = √(SS_residual / df_residual) = √[Σ(Y − Ŷ)² / (n − 2)]
Computing the Regression Coefficients

Slope:
    b1 = (change in Y as a function of X) / (change in X) = SP_XY / SS_X = r · (s_Y / s_X)

Intercept:
    b0 = Ȳ − b1·X̄
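These formulas, together with the standard error of estimate from the previous slide, can be sketched on the height/weight sample from the earlier data slide (illustrative code; the variable names are mine):

```python
import numpy as np

height = np.array([70, 67, 72, 75, 68, 69, 71.5, 71, 72, 69,
                   67, 68, 66, 72, 73.5, 73, 69, 73, 72])
weight = np.array([150, 140, 180, 190, 145, 150, 164, 140, 142, 136,
                   123, 155, 140, 145, 160, 190, 155, 165, 150])
n = len(height)

# Slope: SP_XY / SS_X (sum of cross-products over sum of squares of X)
sp_xy = np.sum((height - height.mean()) * (weight - weight.mean()))
ss_x = np.sum((height - height.mean()) ** 2)
b1 = sp_xy / ss_x

# Equivalent form of the slope: r * (s_Y / s_X)
r = np.corrcoef(height, weight)[0, 1]
assert np.isclose(b1, r * weight.std(ddof=1) / height.std(ddof=1))

# Intercept: chosen so the line passes through the point of means
b0 = weight.mean() - b1 * height.mean()

# Standard error of estimate: sqrt(SS_residual / (n - 2))
residuals = weight - (b1 * height + b0)
s_y_yhat = np.sqrt(np.sum(residuals ** 2) / (n - 2))
```

A useful check of the least-squares fit is that the residuals sum to zero and that the line passes exactly through (X̄, Ȳ).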
Testing for the Significance of b1

Earlier, I pointed out that the sample correlation coefficient r is distributed as a t variable, with standard error:

    σ̂_r = s_r = √[(1 − r²) / df_r],  where df_r = n − 2

The slope b1 is similarly distributed, but with standard error:

    σ̂_b1 = s_b1 = s_{Y−Ŷ} / (s_X · √(n − 1)),  with df = n − 2

This one's a bit more tricky to explain, but you can use the same procedure (a t-test) to evaluate the significance of the slope.
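One way to see why the same t-test applies: for simple regression, the t-statistic for the slope is numerically identical to the t-statistic computed from r. A sketch on the height/weight sample (illustrative, not from the slides):

```python
import numpy as np

height = np.array([70, 67, 72, 75, 68, 69, 71.5, 71, 72, 69,
                   67, 68, 66, 72, 73.5, 73, 69, 73, 72])
weight = np.array([150, 140, 180, 190, 145, 150, 164, 140, 142, 136,
                   123, 155, 140, 145, 160, 190, 155, 165, 150])
n = len(height)

# Least-squares fit (np.polyfit returns [slope, intercept] for degree 1)
b1, b0 = np.polyfit(height, weight, 1)

# Standard error of estimate, then standard error of the slope
residuals = weight - (b1 * height + b0)
s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))
s_b1 = s_e / (height.std(ddof=1) * np.sqrt(n - 1))

t_slope = b1 / s_b1

# Same t as from the correlation coefficient: t = r * sqrt(df / (1 - r^2))
r = np.corrcoef(height, weight)[0, 1]
t_from_r = r * np.sqrt((n - 2) / (1 - r**2))
assert np.isclose(t_slope, t_from_r)
```

Both statistics are referred to a t distribution with n − 2 degrees of freedom, so the two tests always agree.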
Prediction

Using the linear model to predict the value of the response variable for a given value of the explanatory variable is called prediction; we simply plug the value of x into the linear model equation. There will be some uncertainty associated with the predicted value.
Example: Predicting Y from X

You are told that a college student is 74 inches tall. Given the computed regression coefficients, what is your best estimate of his weight?

    X: height,  Y: weight
    b0 = −228.56
    b1 = 5.44

    Ŷ = b1·X + b0 = 5.44X − 228.56
    Ŷ = 5.44(74) − 228.56 = 402.56 − 228.56 = 174
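The arithmetic in this example can be checked directly (a sketch; the coefficients are taken from the slide rather than refit, and the helper function name is mine):

```python
b0 = -228.56   # intercept from the slide (note the sign: Y-hat = 5.44X - 228.56)
b1 = 5.44      # slope from the slide, in pounds per inch

def predict_weight(height_in):
    """Plug a height (inches) into the fitted line to predict weight (pounds)."""
    return b1 * height_in + b0

y_hat = predict_weight(74)   # best estimate for a 74-inch-tall student, about 174
```

Plugging in x is all there is to point prediction; the next slides quantify the uncertainty around it.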
Example: Computing the Accuracy of Prediction

Just as σ or s can be used to compute confidence intervals for population means, s_{Y−Ŷ} can be used to compute predictive intervals for Y:

    Ŷ ± t_crit(df) · s_{Y−Ŷ}
Example: Computing the Accuracy of Prediction

Just as σ or s can be used to compute confidence intervals for population means, s_{Y−Ŷ} can be used to compute predictive intervals for Y:

    Ŷ ± t_crit(df) · s_{Y−Ŷ} · √[1 + 1/n + (x − X̄)² / SS_X]

Note that the actual formula for the predictive interval is slightly more complicated and depends on x.
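A sketch of the full predictive interval on the height/weight sample (illustrative code; the 95% level and the function name are my choices, not from the slides):

```python
import numpy as np
from scipy import stats

height = np.array([70, 67, 72, 75, 68, 69, 71.5, 71, 72, 69,
                   67, 68, 66, 72, 73.5, 73, 69, 73, 72])
weight = np.array([150, 140, 180, 190, 145, 150, 164, 140, 142, 136,
                   123, 155, 140, 145, 160, 190, 155, 165, 150])
n = len(height)

b1, b0 = np.polyfit(height, weight, 1)
resid = weight - (b1 * height + b0)
s_e = np.sqrt(np.sum(resid ** 2) / (n - 2))        # standard error of estimate
ss_x = np.sum((height - height.mean()) ** 2)

def predictive_interval(x, level=0.95):
    """Predictive interval for a new Y observed at a given x."""
    y_hat = b1 * x + b0
    t_crit = stats.t.ppf((1 + level) / 2, df=n - 2)
    half = t_crit * s_e * np.sqrt(1 + 1/n + (x - height.mean())**2 / ss_x)
    return y_hat - half, y_hat + half

lo, hi = predictive_interval(74)
```

The (x − X̄)²/SS_X term is why the interval depends on x: predictions far from the mean height get wider intervals than predictions near it.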
Extrapolation

Applying a model estimate to values outside the range of the original data is called extrapolation. Sometimes even the intercept is an extrapolation.
The Hazards of Extrapolation
[Figures: examples of extrapolation beyond the range of the data leading to absurd predictions]