Correlation
Correlation
- A statistical method for measuring the relationship between two variables
- Three characteristics:
  - Direction of the relationship
  - Form of the relationship
  - Strength/consistency
Direction of the Relationship
- Positive correlation: the variables move in the same direction
- Negative correlation: the variables move in opposite directions
Form of the Relationship
- Linear or non-linear
- Predicting data
Strength/Consistency
- How well do the data fit the specific form?
- Measured by the distance between the actual data and the predicted data
- The absolute value of a correlation measures the goodness of fit
  - 1: perfect fit
  - 0: no fit at all
Correlation Measures
- The Pearson correlation
  - Measures a linear relationship
  - The sign of the correlation: direction
  - The numerical value: the degree of the relationship
- The Spearman correlation
  - For an ordinal scale of measurement: both the X values and the Y values are ranks
  - Measures the consistency of the relationship, which is not necessarily linear
- The point-biserial correlation
  - Measures the correlation between a regular variable and a dichotomous variable
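The Pearson Correlation
$$r = \frac{\text{degree to which } X \text{ and } Y \text{ vary together}}{\text{degree to which } X \text{ and } Y \text{ vary separately}} = \frac{\text{covariability of } X \text{ and } Y}{\text{variability of } X \text{ and } Y \text{ separately}} = \frac{SP}{\sqrt{SS_X \, SS_Y}}$$
where
$$SP = \sum (X - M_X)(Y - M_Y) = \sum XY - \frac{(\sum X)(\sum Y)}{n}$$
$$SS_X = \sum (X - M_X)^2 = \sum X^2 - \frac{(\sum X)^2}{n}$$
and $SS_Y$ is computed in the same way from the Y values.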
Check the Result
- Use a scatterplot of the data
- Draw an envelope around all the data points
- Check the direction and shape of the envelope

X    Y
0    1
10   3
4    1
8    2
8    3

[Figure: scatterplot of the data; X axis from 0 to 12, Y axis from 0 to 5]
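A short sketch of the SP and SS computation may help when checking a result by hand. It assumes the five X, Y pairs on this slide are paired row by row as read from the extracted table; the function name `pearson_r` is introduced here for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from the definitional SP and SS formulas."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # sum of products
    ss_x = sum((x - mx) ** 2 for x in xs)                  # sum of squares for X
    ss_y = sum((y - my) ** 2 for y in ys)                  # sum of squares for Y
    return sp / math.sqrt(ss_x * ss_y)

# The five (X, Y) pairs from the slide (pairing assumed from the table layout).
X = [0, 10, 4, 8, 8]
Y = [1, 3, 1, 2, 3]

r = pearson_r(X, Y)
print(round(r, 3))  # 0.875
```

Here SP = 14, SS_X = 64 and SS_Y = 4, so r = 14 / 16 = 0.875: a strong positive value, which should agree with an upward-sloping, fairly narrow envelope in the scatterplot.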
Interpreting Correlations
- Prediction
- Correlation is only about the relationship between two variables; it does not necessarily mean causation!
Interpreting Correlations (continued)
- The value of a correlation can be greatly affected by the range of the data.
Data Range and Correlation
Interpreting Correlations (continued)
- Outliers can dramatically affect the value of a correlation.
Outlier and Correlation
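The effect of a single outlier can be illustrated with a small sketch. The scores below are hypothetical (they are not from the slides) and were chosen so that the first five points show no linear relationship; `pearson_r` is an illustrative helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from the definitional SP and SS formulas."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

# Hypothetical scores with SP = 0, so r is essentially zero.
X = [1, 2, 3, 4, 5]
Y = [2, 3, 1, 3, 2]
r_without = pearson_r(X, Y)

# The same data plus one extreme (hypothetical) point at (15, 15).
r_with = pearson_r(X + [15], Y + [15])

print(round(abs(r_without), 3), round(r_with, 3))
```

With these numbers the correlation moves from essentially 0 to roughly 0.95 because of one added point, which is why it is worth inspecting the scatterplot before interpreting r.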
The Strength of Relationship
The Strength of Relationship
- The coefficient of determination: the squared value of the correlation, $r^2$
- It measures how much of the variance in the dependent variable is accounted for by the independent variable
- Similar in role to the effect-size measures used with z- and t-tests
Hypothesis Tests with the Pearson Correlation
- The Pearson correlation is usually computed for sample data, but is used to test hypotheses about the relationship in the population
- The population correlation is denoted by the Greek letter rho (ρ)
- Non-directional: $H_0: \rho = 0$ and $H_1: \rho \neq 0$
- Directional: $H_0: \rho \le 0$ and $H_1: \rho > 0$, or
- Directional: $H_0: \rho \ge 0$ and $H_1: \rho < 0$
Population vs. Sample
Correlation Hypothesis Test
- The sample correlation r is used to test the population ρ
- The hypothesis test can be computed using either t or F
- Use the t table to find the critical value
$$t = \frac{r}{s_r}, \qquad s_r = \sqrt{\frac{1 - r^2}{df}}$$
About df
- What should the df be? Suppose the sample size is n; then $df = n - 2$:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
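Example
- α = .05, n = 30, r = 0.35
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.35}{\sqrt{\dfrac{1 - 0.35^2}{28}}} \approx 1.98$$
- Two-tailed test: critical value ±2.048; fail to reject the null hypothesis
- One-tailed test: critical value 1.701; reject the null hypothesis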
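The arithmetic of this example can be sketched in a few lines. The critical values are copied from the slide (t table, df = 28, α = .05) rather than computed; the function name `t_from_r` is an illustrative assumption.

```python
import math

def t_from_r(r, n):
    """t statistic for testing H0: rho = 0, with df = n - 2."""
    df = n - 2
    s_r = math.sqrt((1 - r ** 2) / df)  # standard error of r
    return r / s_r, df

t, df = t_from_r(0.35, 30)  # t is about 1.977 with df = 28

# Critical values taken from the slide, not derived here.
T_CRIT_TWO_TAILED = 2.048
T_CRIT_ONE_TAILED = 1.701

reject_two_tailed = abs(t) > T_CRIT_TWO_TAILED  # False: fail to reject
reject_one_tailed = t > T_CRIT_ONE_TAILED       # True: reject

print(df, reject_two_tailed, reject_one_tailed)  # 28 False True
```

The same sample correlation is not significant in a two-tailed test but is significant in a one-tailed test, because the one-tailed critical value is smaller.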
Using r Directly
Report Correlations
A correlation for the data revealed a significant relationship between amount of education and annual income, r(28) = 0.65, p < .01, two-tailed.
Usually, Multiple Variables Involved in Correlation Tests
Partial Correlation
- Are other factors involved in the correlation?
Partial Correlation
Partial Correlation
- A partial correlation measures the relationship between two variables while mathematically controlling the influence of a third variable by holding it constant
$$r_{xy \cdot z} = \frac{r_{xy} - r_{xz} \, r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}$$
Example

Number of Churches (X)   Number of Crimes (Y)   Population (Z)
1                        4                      1
2                        3                      1
3                        1                      1
4                        2                      1
5                        5                      1
7                        8                      2
8                        11                     2
9                        9                      2
10                       7                      2
11                       10                     2
13                       15                     3
14                       14                     3
15                       16                     3
16                       17                     3
17                       13                     3

Result: $r_{xy \cdot z} = 0$
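The result for this example can be checked with a short sketch. It assumes the flattened numbers on the slide are (X, Y, Z) triples read row by row; `pearson_r` and `partial_r` are illustrative helper names.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from the definitional SP and SS formulas."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

def partial_r(r_xy, r_xz, r_yz):
    """Correlation of X and Y with Z held constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Data as read from the slide (row-by-row pairing is an assumption).
churches = [1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17]
crimes = [4, 3, 1, 2, 5, 8, 11, 9, 7, 10, 15, 14, 16, 17, 13]
population = [1] * 5 + [2] * 5 + [3] * 5

r_xy = pearson_r(churches, crimes)      # about 0.923
r_xz = pearson_r(churches, population)  # about 0.961
r_yz = pearson_r(crimes, population)    # about 0.961
r_xy_z = partial_r(r_xy, r_xz, r_yz)

print(round(r_xy, 3), round(r_xz, 3), round(r_yz, 3), round(abs(r_xy_z), 3))
```

Churches and crimes are strongly correlated overall, yet the partial correlation is zero (up to floating-point rounding): once population is held constant, no relationship between the two remains.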
What if the relationship looks like this?
The Spearman Correlation
- Measures the degree of consistency of direction; the relationship is not necessarily linear
- One extra step before calculating the Pearson correlation:
  - Rank the X values and the Y values
  - Analyze the correlation of the ranks

X (value)   Y (value)   X (rank)   Y (rank)
1           3           2          2
6           4           4          3
2           5           3          4
0           2           1          1
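Ranking Tied Scores
- Use the same rank for the same scores:
  - Rank all the scores
  - Compute the mean of the ranked positions of the tied scores

X (value)   Y (value)   X (rank)   Y (rank)
1           3           2          2 (2.5)
6           3           4          3 (2.5)
2           5           3          4
0           2           1          1

Values in parentheses are the final ranks: the mean of ranked positions 2 and 3.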
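Special Formula for Spearman Correlation
- For the ranks 1 to n: $SS = \dfrac{n(n^2 - 1)}{12}$
$$r_s = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$$
where D is the difference between the X rank and the Y rank for each individual.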
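A minimal sketch of the ranking step and the special formula, using the four X, Y pairs from the Spearman slides. The helper names are illustrative, and the special formula is exact only when there are no tied ranks; with ties, the usual approach is to apply the Pearson formula to the ranks instead.

```python
def average_ranks(values):
    """Rank from 1 (smallest); tied scores share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def spearman_special(xs, ys):
    """r_s = 1 - 6 * sum(D^2) / (n * (n^2 - 1)); assumes no tied ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

X = [1, 6, 2, 0]
Y = [3, 4, 5, 2]

x_ranks = average_ranks(X)             # [2.0, 4.0, 3.0, 1.0]
tied_ranks = average_ranks([3, 3, 5, 2])  # [2.5, 2.5, 4.0, 1.0]
r_s = spearman_special(X, Y)

print(x_ranks, tied_ranks, round(r_s, 3))
```

For the untied data the rank differences are 0, 1, -1 and 0, so ΣD² = 2 and r_s = 1 - 12/60 = 0.8.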
The Point-Biserial Correlation
- Computed just like the Pearson correlation
- One variable has only two values: gender, success/failure, college education or not, etc.
- The size of the correlation does not depend on the numerical codes assigned to the two categories (1/0, 1/-1, etc.)
Point-Biserial Correlation vs. t Test
- t test: t = 4, df = 18, p < .001
- Point-biserial: r = 0.686
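The two results on this slide are linked by the standard identity $r^2 = t^2 / (t^2 + df)$ for an independent-measures t test. The sketch below only checks that the slide's numbers agree with that identity; the function name is illustrative.

```python
import math

def point_biserial_from_t(t, df):
    """Magnitude of the point-biserial r implied by an independent-measures t."""
    return math.sqrt(t ** 2 / (t ** 2 + df))

r_pb = point_biserial_from_t(4, 18)
print(round(r_pb, 3))  # 0.686
```

Here $r^2 = 16 / 34 \approx 0.47$, so about 47% of the variance in the scores is accounted for by group membership.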
If we know two variables are linearly related, how can we describe such a relationship?
- Using a linear equation: y = bx + a
Regression
Goal of Regression
- Determining two constants for a linear equation, y = bx + a
  - b: slope
  - a: intercept
- Method: the least-squares solution
Distance = $Y - \hat{Y}$
Minimizing $\sum (Y - \hat{Y})^2$
Formula
$$b = \frac{SP}{SS_X}, \qquad a = M_Y - b \, M_X$$
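As a sketch of the least-squares solution, the code below applies the slope and intercept formulas to the eight X, Y pairs listed on the "In Excel" slide later in this deck (pairing assumed from the flattened list). The function name `least_squares_line` is illustrative.

```python
def least_squares_line(xs, ys):
    """Return slope b and intercept a of the least-squares line Y-hat = bX + a."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # SP
    ss_x = sum((x - mx) ** 2 for x in xs)                  # SS_X
    b = sp / ss_x
    a = my - b * mx
    return b, a

# Data from the "In Excel" slide (row-by-row pairing is an assumption).
X = [5, 1, 4, 7, 6, 4, 3, 2]
Y = [10, 4, 5, 11, 15, 6, 5, 0]

b, a = least_squares_line(X, Y)
print(b, a)  # 2.0 -1.0
```

With SP = 56, SS_X = 28, M_X = 4 and M_Y = 7, the fitted line is Ŷ = 2X - 1, which should match the trendline equation Excel reports for the same data.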
Regression in Excel
- Draw a scatterplot
- Show the trendline
Linear Equations and Regression
- The Pearson correlation measures a linear relationship between two variables
- This figure:
  - Makes the relationship easier to see
  - Shows the central tendency of the relationship
  - Can be used for prediction
Linear Equations
- General equation for a line: Y = bX + a
- X and Y are variables
- a and b are fixed constants
Regression
- Regression is a method of finding an equation describing the best-fitting line for a set of data
- Least squares: minimizing the errors for the known data, that is, the error of prediction
Error of Prediction
- With a linear function from regression, we can calculate the predicted value Ŷ for a given X
- Error of prediction: Y - Ŷ
- Often squared
Standard Error of Estimate
- The regression equation makes a prediction
- The precision of the estimate is measured by the standard error of estimate (SEoE)
$$\text{SEoE} = \sqrt{\frac{SS_{residual}}{df}} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$$
Relationship Between Correlation and Standard Error of Estimate
- $SS_{regression} = r^2 \, SS_Y$
- $SS_{residual} = (1 - r^2) \, SS_Y$
$$\text{SEoE} = \sqrt{\frac{SS_{residual}}{df}} = \sqrt{\frac{(1 - r^2) \, SS_Y}{n - 2}}$$
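The two routes to the standard error of estimate can be compared in a short sketch: one sums the squared residuals directly, the other goes through r. It assumes the eight X, Y pairs from the "In Excel" slide, paired row by row; variable names are illustrative.

```python
import math

# Data from the "In Excel" slide (row-by-row pairing is an assumption).
X = [5, 1, 4, 7, 6, 4, 3, 2]
Y = [10, 4, 5, 11, 15, 6, 5, 0]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sp = sum((x - mx) * (y - my) for x, y in zip(X, Y))
ss_x = sum((x - mx) ** 2 for x in X)
ss_y = sum((y - my) ** 2 for y in Y)

b = sp / ss_x    # slope
a = my - b * mx  # intercept

# Route 1: sum the squared errors of prediction directly.
ss_residual = sum((y - (b * x + a)) ** 2 for x, y in zip(X, Y))
see_direct = math.sqrt(ss_residual / (n - 2))

# Route 2: use SS_residual = (1 - r^2) * SS_Y.
r = sp / math.sqrt(ss_x * ss_y)
see_from_r = math.sqrt((1 - r ** 2) * ss_y / (n - 2))

print(round(see_direct, 3), round(see_from_r, 3))  # 2.708 2.708
```

Both routes give SS_residual = 44 with df = 6, so the standard error of estimate is about 2.71 either way.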
Testing Regression Significance
- Analysis of regression
- Similar to analysis of variance
- Uses an F-ratio of two mean square (MS) values
- Each MS is an SS divided by its df
- $H_0$: the slope of the regression line (b or beta) is zero, that is, no regression
Mean Squares and F-ratio
$$MS_{regression} = \frac{SS_{regression}}{df_{regression}}, \qquad MS_{residual} = \frac{SS_{residual}}{df_{residual}}$$
$$F = \frac{MS_{regression}}{MS_{residual}}$$
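A sketch of the F-ratio computation, again using the eight X, Y pairs from the "In Excel" slide. It assumes the usual degrees of freedom for a regression with one predictor, df_regression = 1 and df_residual = n - 2; the slides' own df table is not reproduced in the text, so treat these values as an assumption.

```python
import math

# Data from the "In Excel" slide (row-by-row pairing is an assumption).
X = [5, 1, 4, 7, 6, 4, 3, 2]
Y = [10, 4, 5, 11, 15, 6, 5, 0]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sp = sum((x - mx) * (y - my) for x, y in zip(X, Y))
ss_x = sum((x - mx) ** 2 for x in X)
ss_y = sum((y - my) ** 2 for y in Y)
r = sp / math.sqrt(ss_x * ss_y)

# Partition the variability of Y.
ss_regression = r ** 2 * ss_y     # about 112
ss_residual = (1 - r ** 2) * ss_y  # about 44

# Assumed df for one predictor: 1 and n - 2.
df_regression = 1
df_residual = n - 2

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
F = ms_regression / ms_residual

print(round(F, 2))  # 15.27
```

The resulting F would be compared against the critical value from an F table with df = 1, 6. For a single predictor, F also equals the square of the t statistic for testing r.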
SS and df in Regression Analysis
SPSS Output Example
In Excel

X    Y
5    10
1    4
4    5
7    11
6    15
4    6
3    5
2    0
ANOVA and Regression
- Basically the same method, but with different perspectives on the results
- A main effect in ANOVA corresponds to a variable in regression
- An interaction between two factors corresponds to the product of two variables in regression
- Regression not only tells whether there is a difference, but also predicts by how much
- Multivariate regression
Linear or Non-Linear Regression?
- Linear models are usually good enough for most research in IST.
- If the relationship may be non-linear, how can you tell that the linear model you have is not appropriate?
  - Look at the distribution of the residuals
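A small sketch of what looking at the residuals can reveal. The data are hypothetical (Y = X² for X = 0 to 6, not from the slides), chosen so that a straight line is clearly the wrong form.

```python
# Hypothetical curved data: Y = X squared.
X = [0, 1, 2, 3, 4, 5, 6]
Y = [x ** 2 for x in X]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sp = sum((x - mx) * (y - my) for x, y in zip(X, Y))
ss_x = sum((x - mx) ** 2 for x in X)

b = sp / ss_x    # slope of the least-squares line
a = my - b * mx  # intercept

# Residual = actual Y minus predicted Y, in order of increasing X.
residuals = [y - (b * x + a) for x, y in zip(X, Y)]
print([round(e, 6) for e in residuals])
```

The fitted line is Ŷ = 6X - 5, and the residuals come out as 5, 0, -3, -4, -3, 0, 5: positive at both ends and negative in the middle. A systematic pattern like this, rather than residuals scattered randomly around zero, suggests that a linear model is not appropriate.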
In Summary
- Correlation: the relationship between two variables
  - Direction, form, degree
  - Three methods, for different purposes
- Regression
  - Determining the linear equation that the data best fit
  - Slope and intercept
Homework Three problems to solve.