THE PEARSON CORRELATION COEFFICIENT

CORRELATION
Two variables are said to have a relation if knowing the value of one variable gives you information about the likely value of the second variable. This is known as a bivariate relation.
There are several different ways to categorize relations.
One way is based on the precision of the relation:
  a. Functional relations
  b. Statistical relations
We can also categorize relations by whether or not they imply that changes in one variable actually cause changes in the second variable:
  a. Causal relations
  b. Correlational relations

THE PEARSON CORRELATION COEFFICIENT
We often use the Pearson correlation coefficient (r) to measure a correlational relation between two variables that are both continuous.
There is no true independent variable here.
Example: How is height related to weight?

WHAT DO CORRELATIONS TELL US?
Direction of the Relation
  a. Positive correlation (+): variables move in the same direction
  b. Negative correlation (-): variables move in opposite directions
Form of the Relation
  a. Linear
  b. Nonlinear
Magnitude of the Relation
  a. Ranges from -1.00 to +1.00
  b. No correlation = 0

SCATTERPLOTS
We often use scatterplots to detect the presence of a correlation between variables.
These are easy to draw by hand, as well as to graph in Excel or SPSS.
Let's do some examples of judging correlational relations from scatterplots.

HOW R IS CALCULATED
$$r = \frac{\text{degree to which } X \text{ and } Y \text{ vary together}}{\text{degree to which } X \text{ and } Y \text{ vary separately}}$$
$$r = \frac{\sum z_X z_Y}{n}$$
$$df = n - 2$$
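The z-score formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `pearson_r` and the height and weight values are made up for the example, and the z-scores are assumed to use the population standard deviation (dividing by n), which is what makes dividing the summed products by n give r.

```python
import math

def pearson_r(xs, ys):
    """Pearson r as the mean product of z-scores: r = sum(zX * zY) / n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Assumption: population standard deviations (divide by n),
    # matching the division by n in the formula.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    z_x = [(x - mean_x) / sd_x for x in xs]
    z_y = [(y - mean_y) / sd_y for y in ys]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / n

# Hypothetical heights (inches) and weights (pounds) for five people.
heights = [62, 65, 68, 70, 74]
weights = [120, 140, 155, 165, 190]

r = pearson_r(heights, weights)
df = len(heights) - 2  # degrees of freedom for testing r
print(f"r = {r:.3f}, df = {df}")
```

If the z-scores were instead computed with the sample standard deviation (dividing by n - 1), the summed products would be divided by n - 1 to get the same r.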

ISSUES INTERPRETING CORRELATIONS
Correlation does not equal causation.
The correlation value can be greatly affected when the range of scores is limited (the restricted range problem).
One or two extreme data points, or outliers, can dramatically affect the correlation value.
To describe how accurately one variable predicts the other, use $r^2$ (the coefficient of determination).
  For $r = .5$, 25% of the variability in one variable can be predicted from the other variable ($.5^2 = .25$).
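The outlier problem and the coefficient of determination can both be seen in a small sketch. The scores below are hypothetical and were chosen only to make the effect obvious; `pearson_r` is a helper written for this example, using the sum-of-products form of r.

```python
import math

def pearson_r(xs, ys):
    """Pearson r computed as SP / sqrt(SS_X * SS_Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

# Hypothetical scores with almost no linear trend.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 1, 4, 2, 4, 1]
r_without = pearson_r(xs, ys)

# The same scores plus one extreme data point at (20, 20).
r_with = pearson_r(xs + [20], ys + [20])

print(f"without outlier: r = {r_without:.2f}, r squared = {r_without ** 2:.2f}")
print(f"with outlier:    r = {r_with:.2f}, r squared = {r_with ** 2:.2f}")

# The slide's own example: r = .5 means 25% of the variability is predictable.
print(f"r = .5 gives r squared = {0.5 ** 2}")
```

A single added point is enough to move these made-up data from a correlation near zero to one near +1, which is why a scatterplot is worth checking before trusting r.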

WHAT DOES A REGRESSION LINE DO?
Makes the relationship between two variables easier to see.
Identifies the center of the relationship, providing a simplified description of the relationship.
Establishes a precise relationship between each X value and a corresponding Y value.
  Thus, the line can be used for prediction.

WHY USE REGRESSION?
Goal of regression: find the equation for the line that best describes the relation for a set of X and Y data, in order to predict future values.
Much like correlation, we use regression when the variables involved are continuous.
The same limitations on causal inference apply: a regression line does not show that X causes Y.

LINEAR EQUATIONS
Express a linear relationship between two variables (X and Y) as: Y = bX + a
b is called the slope
  Determines how much the Y variable changes when X is increased by one unit
  Tells us the direction of the line and how steep it will be
a is called the Y-intercept
  Tells us where the line crosses the Y axis (the value of Y when X = 0)

LINEAR EQUATIONS
[Figure: line graph with X: Hours of Exercise at YMCA on the horizontal axis and Y: Amount Due on the vertical axis]
Y = 5(X) + 20
Amt Due = $5(Hrs Exercise) + $20
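The YMCA example can be checked with a short sketch. The function name `amount_due` is made up for illustration; the slope of 5 and the intercept of 20 come straight from the equation above.

```python
def amount_due(hours_exercise, slope=5, intercept=20):
    """Y = bX + a, with b = 5 dollars per hour and a = 20 dollars."""
    return slope * hours_exercise + intercept

# The intercept is the amount due at zero hours;
# each additional hour adds the slope.
for hours in (0, 1, 6, 12):
    print(f"{hours:2d} hours -> ${amount_due(hours)}")
```

The difference between any two consecutive hours is always the slope, and the value at 0 hours is always the intercept.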

THE LEAST-SQUARES SOLUTION
Y = bX + a
Numerically define the distance between the line and each data point
  For each X, the regression equation predicts a Y
  This is called Ŷ ("Y hat")
  Distance between the actual Y and the predicted value is Y - Ŷ
  Can be positive or negative
The best-fitting line is the one that has the smallest total squared error (the sum of the squared distances)
  Commonly called the least-squared-error solution
Ŷ = bX + a is the regression equation for Y

THE LEAST-SQUARES SOLUTION
[Figure: scatterplot with regression line showing the relationship between average hours of sleep per night (X) and grades (Y)]
Regression Equation: Ŷ = 6.6X + 36.4
Grade = 6.6(Hrs Sleep) + 36.4
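One way to find the least-squares line is with the standard formulas b = SP / SS_X and a = M_Y - b(M_X), which the slides do not spell out. The sketch below assumes those formulas and uses hypothetical sleep and grade values, so it will not reproduce the 6.6 and 36.4 in the example above; the function name `least_squares_line` is also made up.

```python
def least_squares_line(xs, ys):
    """Return slope b and intercept a of the least-squares line Y-hat = bX + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # SP: sum of products of deviations; SS_X: sum of squared X deviations.
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = sp / ss_x
    a = mean_y - b * mean_x
    return b, a

def total_squared_error(xs, ys, b, a):
    """Sum of squared distances (Y - Y-hat) for the line Y-hat = bX + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: average hours of sleep per night and grades.
hours_sleep = [4, 5, 6, 7, 8]
grades = [62, 70, 75, 84, 89]

b, a = least_squares_line(hours_sleep, grades)
residuals = [y - (b * x + a) for x, y in zip(hours_sleep, grades)]
sse = total_squared_error(hours_sleep, grades, b, a)

print(f"Y-hat = {b:.1f}X + {a:.1f}")
print("residuals:", [round(e, 2) for e in residuals])
print(f"total squared error = {sse:.2f}")
```

The residuals are a mix of positive and negative values that sum to zero, which is why the errors are squared before being added up. Nudging b or a away from the fitted values makes the total squared error larger.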

THE STANDARD ERROR OF THE ESTIMATE
In reality, data points rarely all fall exactly on the regression line (which would indicate a perfect correlation).
The regression equation allows us to make predictions, but does not provide information about the accuracy of the predictions.
Compute the standard error of estimate to measure the typical distance between the regression line and the actual data points.

THE STANDARD ERROR OF THE ESTIMATE
Returning to our example of the relationship between average hours of sleep per night (X) and grades (Y):
[Figure: scatterplot of Grade against Hours of Sleep with the regression line Ŷ = 6.6X + 36.4]
Standard Error of Estimate = 4.77
The typical distance between data points and the regression line is 4.77.
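The slides report the standard error of estimate without showing the computation. A common definition, assumed here, is the square root of SS_residual / df with df = n - 2. The sketch below applies it to hypothetical sleep and grade data, so the result will not match the 4.77 above; the function names are made up for the example.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope b and intercept a for Y-hat = bX + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = sp / ss_x
    return b, mean_y - b * mean_x

def standard_error_of_estimate(xs, ys):
    """sqrt(SS_residual / (n - 2)) around the least-squares line."""
    b, a = fit_line(xs, ys)
    ss_residual = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(ss_residual / (len(xs) - 2))

# Hypothetical data: average hours of sleep per night and grades.
hours_sleep = [4, 5, 6, 7, 8]
grades = [62, 70, 75, 84, 89]

see = standard_error_of_estimate(hours_sleep, grades)
print(f"standard error of estimate = {see:.2f}")
```

The result is in the units of Y, so for these made-up data it reads as a typical prediction miss measured in grade points.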

RELATIONSHIP BETWEEN STANDARD ERROR AND CORRELATION
When the correlation is large in magnitude (close to +1 or -1), the standard error of estimate will be small
  Points are clustered close to the regression line
When the correlation is small in magnitude (close to zero), the standard error of estimate will be large
  Points are spread out and not close to the regression line
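The link between r and the standard error of estimate can be made concrete with the identity SS_residual = (1 - r²)(SS_Y), which is a standard result the slides do not state. The sketch below assumes that identity and checks it numerically on two hypothetical data sets, one tightly clustered and one widely scattered; `summarize` is a helper invented for the example.

```python
import math

def summarize(xs, ys):
    """Return (r, SEE from residuals, SEE from r) for one data set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    r = sp / math.sqrt(ss_x * ss_y)
    # Least-squares line and its residual sum of squares.
    b = sp / ss_x
    a = mean_y - b * mean_x
    ss_residual = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    see = math.sqrt(ss_residual / (n - 2))
    # Same quantity computed from r alone (assumed identity).
    see_from_r = math.sqrt((1 - r ** 2) * ss_y / (n - 2))
    return r, see, see_from_r

xs = [1, 2, 3, 4, 5, 6]
tight_ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]  # hypothetical, close to a line
loose_ys = [6, 1, 9, 3, 12, 5]               # hypothetical, widely scattered

tight = summarize(xs, tight_ys)
loose = summarize(xs, loose_ys)
print(f"tight: r = {tight[0]:.3f}, SEE = {tight[1]:.3f}")
print(f"loose: r = {loose[0]:.3f}, SEE = {loose[1]:.3f}")
```

As r approaches +1 or -1 the factor (1 - r²) shrinks toward zero, and the standard error of estimate shrinks with it.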

WHAT IS THE CHI-SQUARE?
All of our tests through chapter 14 tested the relation between categorical IVs and a continuous DV.
Correlation and regression examined the relation between two continuous variables.
A chi-square is a nonparametric test used to test the relation between two categorical variables.
Example: How is gender related to political orientation?

THE HOW OF THE CHI-SQUARE
The chi-square ($\chi^2$) uses sample frequencies to test a hypothesis about the presence of a relationship between two variables.
How well do the observed frequencies fit the expected frequencies specified by the null hypothesis?
The null hypothesis states that the two variables are independent:
  No consistent, predictable relationship
  Frequency distribution for one variable has the same shape for all categories of the second variable
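The comparison of observed and expected frequencies can be sketched as follows. The slides do not give the formulas, so this assumes the usual ones for a test of independence: each expected frequency is (row total × column total) / N, the statistic is the sum of (observed - expected)² / expected over all cells, and df = (rows - 1)(columns - 1). The counts are hypothetical and the function name is made up.

```python
def chi_square_independence(observed):
    """Return (chi-square, df, expected frequencies) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Under the null hypothesis every row has the same proportions
    # as the column totals.
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    chi2 = sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(observed, expected)
        for fo, fe in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# Hypothetical counts: rows are two gender categories,
# columns are three political orientations.
observed = [
    [30, 10, 20],
    [20, 20, 20],
]

chi2, df, expected = chi_square_independence(observed)
print("expected frequencies:", expected)
print(f"chi-square = {chi2:.2f}, df = {df}")
```

Both rows of the expected table have the same shape, which is exactly what independence means: the distribution of one variable does not change across categories of the other. The computed statistic would then be compared with a critical value for the given df.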