Statistics Introductory Correlation Session 10 oscardavid.barrerarodriguez@sciencespo.fr April 9, 2018
Statistics are not used only to describe central tendency and variability for a single variable. Rather, statistics can be used to describe relationships between variables.
A correlation exists when changes in one variable are statistically associated with systematic changes in another variable; hence, a correlation is a type of bivariate relationship. Objective of this lecture This lecture introduces methods for measuring and describing the strength of the relationship between two quantitative variables.
Examples of relationships between variables appear in the table below. Can you identify which examples show a relationship between X and Y and which do not?
1. Ex. 1 shows a relationship: as values of X increase, values of Y increase. 2. Ex. 2 also shows a relationship: as values of X decrease, values of Y increase. Even though scores for the two variables are heading in opposite directions, there is still a systematic change in Y that corresponds to the changes in X. 3. Ex. 3 shows a curvilinear relationship. Notice that as X increases from 3 to 5 to 7, Y increases from 2 to 4 to 6, but the values of Y begin to decrease from 6 to 4 to 2 as X continues to increase. This is still a relationship, because as values of X increase there is a systematic change in Y, even though the direction of that change reverses. 4. Ex. 4 and 5 do not show relationships between X and Y.
Scatterplots Determining whether a relationship is present between two variables can be difficult when simply looking at pairs of x-y data. So, to see relationships, most people begin by creating a scatterplot. Example: say we measure n = 20 people on the following variables: Family Guy Watching, the number of Family Guy episodes a person watches per week, and Intelligence, as measured by an IQ test.
Scatterplots You can see from the data that as the number of Family Guy episodes watched (X) increases, Intelligence scores (Y) also increase. Scatterplots are used to display whether a relationship between two variables is positive linear, negative linear, curvilinear, or absent.
Scatterplots A positive linear relationship is observed when the values of both variables have a trend that occurs in the same direction.
Scatterplots A negative linear relationship is observed when the values of variables have trends that occur in opposite directions (inverse relationship). That is, as the values of one variable increase the values of the other variable tend to decrease.
Scatterplots There are many forms of a curvilinear relationship, but generally, such a relationship exists whenever the direction of the relationship between the variables changes across the range of X.
Scatterplots A relationship is absent whenever there is no systematic change between variables.
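The three scatterplot patterns above can be tied to the sign of the correlation coefficient introduced in the next slides. Below is a minimal sketch, using made-up data (not the Family Guy example), of how an upward trend yields a positive coefficient and a downward trend a negative one.

```python
# Minimal sketch: the sign of the Pearson correlation mirrors the
# direction a scatterplot shows. Data sets below are invented.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    scp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return scp / (ssx * ssy) ** 0.5

positive = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])  # upward trend
negative = pearson_r([1, 2, 3, 4, 5], [6, 4, 5, 4, 2])  # downward trend
```

A curvilinear pattern would produce a coefficient near zero even though a relationship is present, which is why the scatterplot should always be inspected first.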
Pearson Correlation Coefficient The statistic most commonly used to measure the correlation between two quantitative variables is the Pearson Product-Moment Correlation Coefficient or, more succinctly, the Pearson correlation (r or rxy). What for The Pearson correlation is used to measure the strength and direction of a linear relationship between two quantitative variables. However, it has to meet some conditions.
Pearson Correlation Coefficient However, it has to meet some conditions: 1. The variables must be quantitative (interval or ratio scale); they cannot be categorical (nominal) or ordinal. The Spearman correlation is used to measure the correlation when at least one variable is ordinal; chi-square analyses are used if the data come from nominal scales. 2. Each variable must span a wide range of its potential values: if only a limited or restricted range of values is measured, you may not observe the true relationship. 3. The relationship must not be curvilinear. The Pearson correlation measures the direction and degree of the relationship between two variables with a linear relationship.
Pearson Correlation Coefficient Some characteristics: 1. It measures the degree of linearity between variables. 2. It has a range from -1.00 to +1.00. 3. The closer to -1.00 or +1.00, the more linear the relationship between the variables. 4. A Pearson correlation of exactly -1.00 or +1.00 indicates a perfect linear relationship between the variables. 5. The sign (+/-) of the Pearson correlation tells you whether the relationship is positive linear or negative linear. 6. The sign says nothing about the strength of the linear relationship. 7. A zero correlation is a case in which the Pearson correlation is equal to zero (r = 0); in this case there is no linear relationship between the variables.
Calculating the Pearson Correlation (r) The Pearson correlation is defined as the standardized covariance between two quantitative variables. Recall from standard scores that when you calculate a z-score, you divide the difference between a raw score and the mean of a distribution by the standard deviation of the distribution. Analogously, the covariance is divided by the product of the two standard deviations; hence, this measure of covariance is standardized. So what is covariance?
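The "standardized covariance" idea can be checked numerically: dividing the covariance by both standard deviations gives exactly the same number as averaging the products of the two variables' z-scores. A small sketch with made-up data:

```python
# Sketch: the Pearson correlation as standardized covariance.
# cov / (sx * sy) equals the average product of z-scores (dividing by
# n - 1, as the lecture does). The data below are invented; for these
# particular values r works out to 0.8 exactly.
def mean(v):
    return sum(v) / len(v)

def sample_sd(v):
    m = mean(v)
    return (sum((x - m) ** 2 for x in v) / (len(v) - 1)) ** 0.5

x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]

mx, my, sx, sy = mean(x), mean(y), sample_sd(x), sample_sd(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

r_from_cov = cov / (sx * sy)                      # standardized covariance
zx = [(a - mx) / sx for a in x]
zy = [(b - my) / sy for b in y]
r_from_z = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)
```

Both routes give the same r, which is why standardizing the covariance removes the units of X and Y and bounds the result between -1 and +1.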
Before introducing covariance, let me introduce a set of data that will be used to calculate the Pearson correlation. Below is a set of data for n = 10 people. In this hypothetical set of data, assume I measured the age (X) of these ten people and measured the number of books that each of these 10 people read per month (Y).
Variance is the average variability among scores for a single variable. You can examine the scores for the variable Age (X) and for the variable Books Read per Month (Y) in the table and see that those scores vary; this is variance. Covariance is the average co-variation of scores between two variables, that is, the average amount by which the two variables change together. The difference between variance and covariance: covariance is the average variation between scores from two variables; variance is the average variation among scores of a single variable.
Variance: ŝ² = Σ(X - X̄)² / (n - 1) Covariance: cov = Σ[(X - X̄)(Y - Ȳ)] / (n - 1) To calculate covariance you divide the numerator by n - 1, rather than by n, because we are estimating the covariance in a population from sample data.
Covariance is formally defined as the average sum of the cross products between two variables. The sum of the cross products (SCP) is the sum of the products of the mean-centered scores from each variable: SCP = Σ[(X - X̄)(Y - Ȳ)]. Calculating the sum of cross products is similar to calculating a sum of squares.
Importantly: unlike the sum of squares, which is always positive, the sum of cross products, SCP = Σ[(X - X̄)(Y - Ȳ)], can be positive or negative. If SCP is positive, the covariance will be positive and the correlation is positive linear. If SCP is negative, the covariance will be negative and the correlation is negative linear. If SCP = 0, it indicates a zero relationship.
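The three sign cases above can be seen directly with tiny invented data sets (not the lecture's age/books table):

```python
# Sketch: the sign of SCP determines the sign of the covariance and
# hence the direction of the correlation. Data sets are made up.
def scp(xs, ys):
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

pos = scp([1, 2, 3], [1, 2, 3])   # Y rises with X -> positive SCP
neg = scp([1, 2, 3], [3, 2, 1])   # Y falls as X rises -> negative SCP
zero = scp([1, 2, 3], [1, 2, 1])  # no systematic linear change -> SCP = 0
```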
The next step is to calculate the covariance. Conceptually, all you need to do is divide the SCP from above by n - 1: cov = Σ[(X - X̄)(Y - Ȳ)] / (n - 1) = SCP / (n - 1) = 43 / (10 - 1) = 4.778. The next step in calculating a Pearson correlation is to standardize the covariance. The formula for the Pearson correlation is r = cov / (ŝX ŝY). The denominator is the product of the estimated standard deviation of each variable.
The numerator is the covariance (cov XY = 4.778). The denominator is the product of the estimated standard deviations of the two variables: ŝX = sqrt(SSX / (n - 1)) = sqrt(647 / (10 - 1)) = 8.479 and ŝY = sqrt(SSY / (n - 1)) = sqrt(26 / (10 - 1)) = 1.700. Now we have all three pieces needed to calculate the Pearson correlation: r = cov / (ŝX ŝY) = 4.778 / ((8.479)(1.700)) = 4.778 / 14.414 = 0.331. The question is what a correlation coefficient of this size indicates.
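The three-step recipe on these slides (SCP, then covariance, then r) can be sketched end to end. The ages and books-per-month values below are invented stand-ins, since the lecture's own table is not reproduced here; the structure of the calculation is the same.

```python
# Sketch of the slides' recipe: SCP -> covariance -> Pearson r.
# The data are hypothetical, not the lecture's n = 10 table.
ages  = [23, 25, 31, 35, 40, 44, 50, 52, 60, 65]   # X (invented)
books = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]             # Y (invented)

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(books) / n

# Step 1: sum of cross products of mean-centered scores
scp = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, books))

# Step 2: covariance, dividing by n - 1 (sample estimate)
cov = scp / (n - 1)

# Step 3: standardize by the product of the sample standard deviations
sd_x = (sum((x - mean_x) ** 2 for x in ages) / (n - 1)) ** 0.5
sd_y = (sum((y - mean_y) ** 2 for y in books) / (n - 1)) ** 0.5
r = cov / (sd_x * sd_y)
```

With the lecture's actual table, the same three steps produce SCP = 43, cov = 4.778, and r = 0.331.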
The larger the absolute value of the Pearson correlation (that is, the closer to +1.00 or to -1.00), the stronger the linear relationship. A correlation of r = .33 is generally considered large in the behavioral sciences, because there is so much variation in behavior.
Proportion of Explained Variance and Residual (Unexplained) Variance The coefficient of determination, r², is calculated by squaring the Pearson correlation. This value is the proportion of variance that is accounted for (explained) by the relationship between the two variables. In the case of a positive linear relationship, it is the proportion of X scores that increase with the Y scores; in the case of a negative linear relationship, it is the proportion of X scores that decrease as Y increases. You can think of r² as the proportion of scores that are correctly predicted by the relationship. IMPORTANT: the coefficient of determination can never be negative and has a range of 0 to 1. In the example, the coefficient of determination is r² = 0.331² = 0.11.
Proportion of Explained Variance and Residual (Unexplained) Variance Residual variance is calculated by subtracting the coefficient of determination (r²) from 1: (1 - r²). Residual variance is the proportion of variance between two variables that is not accounted for by the relationship; it is the proportion of X-Y scores that do not co-vary together in the direction indicated by the relationship. IMPORTANT: the residual variance can never be negative and has a range of 0 to 1. In the example, the residual variance is (1 - r²) = 1 - 0.11 = 0.89.
Characteristics of the Pearson Correlation having a correlation between two variables does not mean one variable caused the variation in the other variable: Correlation does not mean causation!
Characteristics of the Pearson Correlation The only way to determine whether changes in one variable cause changes in another variable is by manipulating an independent variable and conducting an experiment.
Statistical significance of a Pearson correlation Testing the significance of a correlation depends on whether the correlation under the null hypothesis is assumed to be zero or some non-zero value. When the correlation under the null hypothesis is assumed to be zero, you use a type of t-test to assess the significance of r. When, under the null hypothesis, r is assumed to be a value other than zero, use Fisher's z-test to assess significance. Recall, the value of a correlation has a range of -1.00 to +1.00, and a zero correlation (r = 0) indicates no association between the variables. The symbol for the Pearson correlation in a population is the Greek lowercase rho (ρ). Thus, the null and alternate hypotheses predict that: H0: ρ = 0, H1: ρ ≠ 0.
Statistical significance of a Pearson correlation H0: ρ = 0, H1: ρ ≠ 0. Notice: this alternate hypothesis does not say whether the correlation will be positive linear or negative linear. If the alternate hypothesis predicts the correlation will be positive linear, the hypotheses are: H0: ρ = 0, H1: ρ > 0. If the alternate hypothesis H1 predicts the correlation will be negative linear: H0: ρ = 0, H1: ρ < 0.
Determining Statistical Significance Say you are interested in whether the Pearson correlation between age and book-reading behavior from the earlier sections is statistically significant. H0: ρ = 0, H1: ρ ≠ 0. Recall, from that example, n = 10 and r = 0.331. To determine whether this correlation is statistically significant, we use the following t-test: t = r / sqrt((1 - r²) / (n - 2)).
Determining Statistical Significance t = r / sqrt((1 - r²) / (n - 2)). BEFORE THE CALCULATIONS, IMPORTANT: note this t-test is used only when ρ is predicted to be zero under the null hypothesis. We'll select an alpha of α = .05 for a non-directional alternate hypothesis. For the Pearson correlation, degrees of freedom are equal to n - 2 (we need to account for the degrees of freedom in each variable): df = 10 - 2 = 8. t = 0.331 / sqrt((1 - 0.331²) / (10 - 2)) = 0.331 / sqrt((1 - 0.11) / 8)
Determining Statistical Significance t = 0.331 / sqrt(0.89 / 8) = 0.331 / 0.334 = 0.99. This is the test statistic that we use to assess the statistical significance of the Pearson correlation. Looking up the table at 95% confidence with df = 8, the critical value is t = 2.306; given that my t is lower than the critical value, I can't reject the null hypothesis.
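The significance test above can be sketched directly from the lecture's numbers (r = 0.331, n = 10); the critical value 2.306 is the two-tailed t for α = .05 with df = 8:

```python
import math

# Sketch of the slides' t-test for H0: rho = 0, with the lecture's values.
r, n = 0.331, 10
df = n - 2
t = r / math.sqrt((1 - r ** 2) / df)  # test statistic, ~0.99
t_critical = 2.306                    # two-tailed, alpha = .05, df = 8
reject_null = abs(t) > t_critical     # False: fail to reject H0
```

Since t falls well inside the critical region's bounds, the r = 0.331 observed in this small sample is not statistically distinguishable from zero.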
Power and Effect size for Pearson Correlation The effect size of the Pearson correlation is the absolute value of the Pearson correlation.