Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1
Outline of Topics
1 Associations
2 Scatter plot
3 Correlation
4 Regression
5 Testing and estimation
6 Goodness-of-fit
Example
We are often interested in the association between two or more variables. Suppose the Midterm (X) and Final (Y) exam scores of a sample of n = 8 students (a SRS of independent observations) are recorded and we wish to study the association between X and Y in the population of students.

Midterm (X)   Final (Y)
55            45
60            75
80            85
77            62
35            50
75            72
92            78
65            53

We consider three approaches:
(1) a graphical summary: scatter plot (cf. Class 3)
(2) a numerical measure: correlation coefficient (cf. Class 3)
(3) a model: regression
Scatter plot (1): Example
Each observation (student) is represented by a symbol on the plot.
A scatter plot is useful for giving an overall impression of the kind of relationship between the variables, e.g., linear, nonlinear or no apparent relationship.
[Figure: scatter plot of Final (vertical axis) against Midterm (horizontal axis), with illustrative patterns labelled linear, nonlinear and none]
Scatter plot (2)
Outliers are observations that deviate from the general trend of the rest of the data.
If we have a new observation (X, Y) = (99, 10), it will appear as the red open circle.
The scatter plot shows the new observation is unusual.
Scatter plots are generally not useful when there are more than two variables, e.g., Projects, Midterm, Final, etc.
[Figure: scatter plot of Final against Midterm with the new observation (99, 10) drawn as a red open circle]
Pearson correlation (Karl Pearson, 1857-1936)
In Class 3, cov(X, Y) is used to measure association between X and Y.
[Figure: two scatter plots of Y (Final) against X (Midterm). With X recorded on (0,100), cov(X, Y) = 183.57; with X recorded on (0,10), cov(X, Y) = 18.357]
cov(X, Y) is not invariant to scale transformation, e.g., its value changes if Midterm is recorded on (0,10) instead of (0,100).
The sign of cov(X, Y) (+ vs. -) can be used to tell the direction of the association, but its magnitude depends on the units of measurement and so has no meaning on its own.
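The scale dependence of the covariance can be checked directly on the eight (Midterm, Final) pairs. The following is a minimal Python sketch; the function name `sample_cov` and the variable names are illustrative choices rather than anything defined in the slides, and the divisor n - 1 is assumed.

```python
# Midterm and Final scores of the n = 8 students from the Example slide.
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]


def sample_cov(x, y):
    """Sample covariance with divisor n - 1 (assumed convention)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


# Midterm recorded on (0,100) and then rescaled to (0,10).
cov_100 = sample_cov(midterm, final)
cov_10 = sample_cov([x / 10 for x in midterm], final)
print(round(cov_100, 2), round(cov_10, 3))
```

Dividing every Midterm score by 10 divides the covariance by 10, although the association between the two exams is unchanged.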
Pearson correlation (Karl Pearson, 1857-1936)
A Pearson (product moment) correlation coefficient, $r \equiv \mathrm{corr}(X, Y)$, is a number that summarizes the linear relationship between X and Y.
For X from a population with mean $\mu_X$ and variance $\sigma_X^2$, a Z-score
$$Z_X = \frac{X - \mu_X}{\sigma_X}$$
tells us where X lies relative to the rest of the population.
$$r = E(Z_X Z_Y) = E\left[\left(\frac{X - \mu_X}{\sigma_X}\right)\left(\frac{Y - \mu_Y}{\sigma_Y}\right)\right] = \frac{E\{(X - \mu_X)(Y - \mu_Y)\}}{\sigma_X \sigma_Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$
Here E denotes an average, so r measures, on average, whether X and Y are in tandem relative to their populations.
Using n observations $(X_1, Y_1), \ldots, (X_n, Y_n)$:
$$r = \frac{\dfrac{\sum (X_i - \bar X)(Y_i - \bar Y)}{n-1}}{\sqrt{\dfrac{\sum (X_i - \bar X)^2}{n-1}}\sqrt{\dfrac{\sum (Y_i - \bar Y)^2}{n-1}}} = \frac{\sum (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum (X_i - \bar X)^2 \sum (Y_i - \bar Y)^2}}$$
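Correlation: Example
For calculation, the equivalent formula is more convenient:
$$r = \frac{\dfrac{\sum X_i Y_i - n\bar X \bar Y}{n-1}}{\sqrt{\dfrac{\sum X_i^2 - n\bar X^2}{n-1}}\sqrt{\dfrac{\sum Y_i^2 - n\bar Y^2}{n-1}}} = \frac{\sum X_i Y_i - n\bar X \bar Y}{\sqrt{\left(\sum X_i^2 - n\bar X^2\right)\left(\sum Y_i^2 - n\bar Y^2\right)}}$$

X recorded on (0,100):
$\bar X = 67.375$, $\bar Y = 65$, $\sum_{i=1}^{8} X_i Y_i = 36320$, $\sum_{i=1}^{8} X_i^2 = 38493$, $\sum_{i=1}^{8} Y_i^2 = 35296$
$$r = \frac{\dfrac{36320 - 8(67.375)(65)}{8-1}}{\sqrt{\dfrac{38493 - 8(67.375)^2}{8-1}}\sqrt{\dfrac{35296 - 8(65)^2}{8-1}}} = \frac{183.57}{17.63874 \times 14.61897} \approx 0.712$$

X recorded on (0,10):
$\bar X = 6.7375$, $\bar Y = 65$, $\sum_{i=1}^{8} X_i Y_i = 3632$, $\sum_{i=1}^{8} X_i^2 = 384.93$, $\sum_{i=1}^{8} Y_i^2 = 35296$
$$r = \frac{\dfrac{3632 - 8(6.7375)(65)}{8-1}}{\sqrt{\dfrac{384.93 - 8(6.7375)^2}{8-1}}\sqrt{\dfrac{35296 - 8(65)^2}{8-1}}} = \frac{18.357}{1.763874 \times 14.61897} \approx 0.712$$

On average, $Z_X Z_Y = 0.712 > 0$ ⇒ $Z_X$ and $Z_Y$ tend to be of the same sign (both + or both -) ⇒ X and Y are either both big or both small relative to their own populations.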
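The shortcut formula above can be sketched in a few lines of Python. The function name `pearson_r` is a hypothetical helper introduced for illustration; only the data and the formula come from the slides.

```python
import math

# Midterm and Final scores of the n = 8 students from the Example slide.
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]


def pearson_r(x, y):
    """Pearson correlation via the shortcut (sums of products) formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(xi * xi for xi in x) - n * x_bar ** 2
    syy = sum(yi * yi for yi in y) - n * y_bar ** 2
    return sxy / math.sqrt(sxx * syy)


# Same data with Midterm on (0,100) and on (0,10).
r_100 = pearson_r(midterm, final)
r_10 = pearson_r([x / 10 for x in midterm], final)
print(round(r_100, 3), round(r_10, 3))
```

Unlike the covariance, r is the same on both scales, because the factors of n - 1 and the units of X cancel between numerator and denominator.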
Sample correlation under various relationships (Fig. 3)
$-1 \le r \le 1$
The magnitude of r measures the strength of the association. If $|r| \approx 1$, the association is strong (B, C and D); if $r \approx 0$, the association is weak (A) or non-linear.
The sign of r measures the direction of the association. If r > 0, large X tends to be associated with large Y (B and C); if r < 0, large X tends to be associated with small Y (D).
[Figure 3: four scatter plots of Y against X: (A) r = 0.063, (B) r = 0.935, (C) r = 0.652, (D) r = -0.439]
Correlation measures linear relationships (Fig. 4)
r measures linear associations (A).
A non-linear relationship may distort the value of r (B).
Outliers may distort the value of r (C).
A restrictive range (open circles) in X or Y may lead to a smaller r (D).
[Figure 4: four scatter plots, panels A to D, illustrating the four points above]
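Two of these distortions can be illustrated numerically. The sketch below is an assumption-laden example rather than a reproduction of Fig. 4: it adds the unusual observation (99, 10) from the scatter plot slide to the exam data, and it uses a made-up quadratic data set to show a perfect non-linear relationship. The helper `pearson_r` is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation via the shortcut (sums of products) formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(xi * xi for xi in x) - n * x_bar ** 2
    syy = sum(yi * yi for yi in y) - n * y_bar ** 2
    return sxy / math.sqrt(sxx * syy)


midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]

# Outlier: append the unusual observation (99, 10) from the scatter plot slide.
r_original = pearson_r(midterm, final)
r_outlier = pearson_r(midterm + [99], final + [10])

# Non-linear: made-up data where Y is exactly X squared (illustration only).
x_quad = [-3, -2, -1, 0, 1, 2, 3]
y_quad = [x ** 2 for x in x_quad]
r_quadratic = pearson_r(x_quad, y_quad)

print(round(r_original, 3), round(r_outlier, 3), round(r_quadratic, 3))
```

A single outlier moves r from about 0.71 to slightly below zero, and the quadratic data give r = 0 even though Y is completely determined by X.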
Prediction under a linear model (Fig. 5)
A regression analysis allows us to determine if Midterm score (X) can be used to predict Final score (Y).
The scatter plot suggests there may be a linear relationship between X and Y (i.e., each additional point in the Midterm is associated with b extra points in the Final).
A regression analysis uses a sample of students to determine whether a linear relationship exists for the population of students.
[Figure 5: scatter plot of Final against Midterm with a straight line drawn through the observations]
Simple linear regression
We postulate that the relationship between Midterm score (X) and Final score (Y) in the population can be represented by a straight line:
$$Y = a + bX$$
where a is the intercept and b is the slope.
The variable X is called an independent or predictor variable and Y is called a dependent or outcome variable.
A simple linear regression is a regression with only one predictor in which the relationship between the predictor and the outcome variable is assumed to be linear.
The intercept a gives the prediction of Y when X = 0 or b = 0. Often a is not of interest or may even be meaningless, e.g., if X represents the height of a person and Y represents the weight, then no person has a height (X) of zero.
The value of b is the change in Y for every unit difference in X.
Figure 5 shows that the observations do not fall on the straight line. In fact, there is no straight line that fits all observations. We therefore assume
$$Y = a + bX + e, \quad e \sim N(0, \sigma^2)$$
Simple linear regression (2)
$$Y = \underbrace{a + bX}_{(A)} + \underbrace{e}_{(B)}, \quad e \sim N(0, \sigma^2)$$
(A) a + bX is the average value of Y for observations with a particular value of X.
(B) Each observation Y differs from the average by an amount e, and $e \sim N(0, \sigma^2)$.
(A)+(B) For each known value of X, the values of $Y \sim N(a + bX, \sigma^2)$.
Therefore, in a regression, we assume we have known values of X at $X_1, \ldots, X_n$ and we investigate how Y changes at these values, which is captured by the regression model.
We use maximum likelihood estimation (MLE), which is equivalent to a method called ordinary least squares (OLS) in this setting.
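Statement (A)+(B) can be illustrated by simulation. The sketch below is only an illustration: the parameter values a = 25, b = 0.6 and σ = 10 are hypothetical and are not estimates from the slides, and `simulate_final` is a made-up helper name.

```python
import random

random.seed(151)  # fixed seed so the sketch is reproducible

# Hypothetical population parameters, chosen only for illustration.
a, b, sigma = 25.0, 0.6, 10.0


def simulate_final(x, size):
    """Draw `size` values of Y = a + b*x + e with e ~ N(0, sigma^2)."""
    return [a + b * x + random.gauss(0.0, sigma) for _ in range(size)]


# Many students who all scored X = 60 on the Midterm.
draws = simulate_final(60, 20000)
mean_y = sum(draws) / len(draws)
sd_y = (sum((y - mean_y) ** 2 for y in draws) / (len(draws) - 1)) ** 0.5
print(round(mean_y, 1), round(sd_y, 1))
```

Under these assumed values the simulated Final scores at X = 60 average close to a + b(60) = 61 with a standard deviation close to σ = 10, which is what Y ∼ N(a + bX, σ²) states.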
Maximum Likelihood (1)
Data, with the mean of Y under the model at each observed X:

Midterm (X)   Final (Y)   Mean of Y
55            45          a + b(55)
60            75          a + b(60)
80            85          a + b(80)
77            62          a + b(77)
35            50          a + b(35)
75            72          a + b(75)
92            78          a + b(92)
65            53          a + b(65)
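Maximum Likelihood (2)
We have a sample $Y_1, \ldots, Y_n$ at $X_1, \ldots, X_n$, respectively. Assuming $Y_i \sim N(a + bX_i, \sigma^2)$, where $a, b, \sigma^2$ are unknown, we can find the MLE of these parameters. The MLEs are the $a, b, \sigma^2$ that jointly maximize the likelihood
$$L(a, b, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\{Y_i - (a + bX_i)\}^2}{2\sigma^2}}$$
Taking the (natural) logarithm of $L(a, b, \sigma^2)$ gives the log-likelihood
$$l(a, b, \sigma^2) = \sum_{i=1}^{n} \log\left[\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\{Y_i - (a + bX_i)\}^2}{2\sigma^2}}\right] = \sum_{i=1}^{n}\left[-\frac{\{Y_i - (a + bX_i)\}^2}{2\sigma^2} - \frac{1}{2}\log 2\pi - \frac{1}{2}\log \sigma^2\right]$$
The MLEs are found by
$$\frac{\partial l(\hat a, \hat b, \hat\sigma^2)}{\partial a} = 0, \quad \frac{\partial l(\hat a, \hat b, \hat\sigma^2)}{\partial b} = 0, \quad \frac{\partial l(\hat a, \hat b, \hat\sigma^2)}{\partial \sigma^2} = 0$$
which give
$$\hat b = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n}(X_i - \bar X)^2} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar X \bar Y}{\sum_{i=1}^{n} X_i^2 - n(\bar X)^2} = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)}$$
$$\hat a = \bar Y - \hat b \bar X, \quad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\{Y_i - (\hat a + \hat b X_i)\}^2$$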
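The step from the three derivative conditions to the closed-form estimates is not written out on the slide. One way to fill it in, using only the log-likelihood defined above, is the following worked sketch:

```latex
\begin{align*}
\frac{\partial l}{\partial a} &= \frac{1}{\sigma^2}\sum_{i=1}^{n}\{Y_i - (a + bX_i)\} = 0
  &&\Rightarrow\quad \hat a = \bar Y - \hat b \bar X \\
\frac{\partial l}{\partial b} &= \frac{1}{\sigma^2}\sum_{i=1}^{n} X_i\{Y_i - (a + bX_i)\} = 0
  &&\Rightarrow\quad \sum_{i=1}^{n} X_i Y_i - \hat a\, n\bar X - \hat b \sum_{i=1}^{n} X_i^2 = 0 \\
\frac{\partial l}{\partial \sigma^2} &= \sum_{i=1}^{n}\left[\frac{\{Y_i - (a + bX_i)\}^2}{2\sigma^4} - \frac{1}{2\sigma^2}\right] = 0
  &&\Rightarrow\quad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\{Y_i - (\hat a + \hat b X_i)\}^2
\end{align*}
```

Substituting $\hat a = \bar Y - \hat b \bar X$ into the second condition gives $\hat b\left(\sum X_i^2 - n\bar X^2\right) = \sum X_i Y_i - n\bar X \bar Y$, which is the shortcut form of $\hat b$.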
Least squares
For any value of $\sigma^2$ in the log-likelihood
$$l(a, b, \sigma^2) = \sum_{i=1}^{n}\left[-\frac{\{Y_i - (a + bX_i)\}^2}{2\sigma^2} - \frac{1}{2}\log 2\pi - \frac{1}{2}\log \sigma^2\right]$$
$l(a, b, \sigma^2)$ is maximized if
$$\sum_{i=1}^{n}\{Y_i - (a + bX_i)\}^2$$
is minimized (hence "least squares").
The best fitting line using MLE or OLS is the line that minimizes the sum of squared deviations of the observations from the line.
[Figure: scatter plot of Final against Midterm with the fitted line and the deviations of the observations from the line]
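The least squares property can be checked numerically on the exam data. This is a rough sketch, not a proof: it evaluates the sum of squared deviations at the closed-form estimates and at a handful of nearby lines. The function name `sse` and the perturbation sizes are arbitrary illustrative choices.

```python
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]

n = len(midterm)
x_bar = sum(midterm) / n
y_bar = sum(final) / n

# Closed-form MLE (OLS) estimates of slope and intercept.
b_hat = (sum(x * y for x, y in zip(midterm, final)) - n * x_bar * y_bar) / (
    sum(x * x for x in midterm) - n * x_bar ** 2
)
a_hat = y_bar - b_hat * x_bar


def sse(a, b):
    """Sum of squared deviations of the observations from the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(midterm, final))


best = sse(a_hat, b_hat)

# Sum of squares for lines with slightly different intercepts and slopes.
nearby = [
    sse(a_hat + da, b_hat + db)
    for da in (-1.0, 0.0, 1.0)
    for db in (-0.05, 0.0, 0.05)
    if (da, db) != (0.0, 0.0)
]
print(round(best, 2), round(min(nearby), 2))
```

Every perturbed line has a larger sum of squares than the fitted line, as the least squares criterion requires.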
Example
Using our sample of n = 8 students, what is the predicted Final score for a student who scored 65 on the Midterm using the MLE (OLS) estimates?
$$\hat b = \frac{36320 - 8(67.375)(65)}{38493 - 8(67.375)^2} = \frac{1285}{2177.875} \approx 0.59, \quad \hat a = 65 - \hat b(67.375) \approx 25.247$$
(the value of $\hat a$ is computed with the unrounded $\hat b$).
The fitted regression line is
Final = 25.247 + 0.59 × Midterm
For a student whose Midterm score is 65, her predicted Final score is
25.247 + 0.59 × 65 = 63.597
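The calculation above can be reproduced with a short Python sketch. The helper name `fit_line` is hypothetical; the formulas are the closed-form MLE (OLS) estimates from the Maximum Likelihood slide.

```python
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(xi * xi for xi in x) - n * x_bar ** 2
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


a_hat, b_hat = fit_line(midterm, final)

# Predicted Final score for a Midterm score of 65 (inside the observed range).
pred_65 = a_hat + b_hat * 65
print(round(a_hat, 3), round(b_hat, 2), round(pred_65, 1))
```

Because the code keeps the unrounded slope, its prediction can differ from the hand calculation in the third decimal place.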
Quality of the regression - Residual plots
Under the regression model
$$Y_i = a + bX_i + e_i, \quad e_i \sim N(0, \sigma^2)$$
the residuals are
$$\hat e_i = Y_i - \hat Y_i = Y_i - (\hat a + \hat b X_i)$$
If the model is correct, the $\hat e_i$'s should resemble a set of random observations from a normal distribution with mean zero, like panel (a).
[Figure: four plots of residuals against X: (a) Random, (b) Non-linear, (c) Skewed distribution, (d) Non-constant variance]
Residual plot - Example
Based on the regression model $\hat Y = 25.247 + 0.59X$,
$$\hat e_i = Y_i - \hat Y_i = Y_i - (25.247 + 0.59 X_i)$$

Y_i   Ŷ_i     ê_i
45    57.70   -12.70
75    60.65    14.35
85    72.45    12.55
62    70.68    -8.68
50    45.90     4.10
72    69.50     2.50
78    79.53    -1.53
53    63.60   -10.60

[Figure: plot of the residuals against X, with a horizontal reference line at 0]
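The residual table can be recomputed as follows. This sketch deliberately uses the rounded coefficients 25.247 and 0.59 from the slide so that the output matches the table; the variable names are illustrative.

```python
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]

# Rounded coefficients of the fitted line, as printed on the slide.
a_hat, b_hat = 25.247, 0.59

fitted = [a_hat + b_hat * x for x in midterm]
residuals = [y - y_hat for y, y_hat in zip(final, fitted)]

# One row per student: observed Final, fitted Final, residual.
for y, y_hat, e in zip(final, fitted, residuals):
    print(f"{y:3d} {y_hat:7.2f} {e:7.2f}")
```

The residuals sum to approximately zero, which is a general property of a least squares line with an intercept; the small remainder here comes from using rounded coefficients.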
Notes about a regression analysis
A linear regression model makes 3 assumptions:
1. The relationship between X and Y is linear, i.e., Y = a + bX + e
2. The values of the $Y_i$'s are normally distributed about the regression line
3. The variances of the $Y_i$'s about the regression line are the same
The regression line is fitted by MLE (= OLS), which means the sum of the squared distances of the observations to the regression line is minimized.
Prediction can only be made in the range of X used to obtain the regression line. In the example, the lowest and the highest Midterm scores among the 8 students are 35 and 92, so prediction can be made for other students whose Midterm scores are within this range. For someone whose Midterm score falls outside (35,92), no reliable prediction is possible. This restriction does not apply to the dependent variable, so the predicted Final score can be outside the range of Y values observed in the 8 students.
Observed relationship: Fact or Fiction?
$$\text{Final} = \overbrace{25.247}^{\hat a} + \overbrace{0.59}^{\hat b} \times \text{Midterm}$$
shows each additional point in the Midterm is associated with an extra 0.59 point in the Final for the 8 students.
Our estimate $\hat b$ comes from a sample and hence there is sampling error, i.e., $\hat b \ne b$.
Does the association generalise to the population of students?
Two approaches to answering this question:
(1) Test the hypotheses: $H_0: b = 0$ (no relationship) vs. $H_1: b \ne 0$ (some relationship)
(2) Find an interval estimate: $\hat b$ ± margin of error of $\hat b$
Hypothesis testing
For a sample of students such that Midterm (X) and Final (Y) are unrelated:
(1) $\hat b$ is expected to be zero
(2) sampling variation allows $\hat b \ne 0$, but it is unlikely to be far from 0
[Figure: sampling distribution of $\hat b$ centred at 0; values between the critical values are expected, values in the tails beyond the critical values (5%) are unexpected]
We use a test statistic to determine whether $\hat b$ for our sample is far from 0:
$$z = \frac{\overbrace{\hat b}^{\text{our sample}} - \overbrace{0}^{X \text{ and } Y \text{ unrelated}}}{\underbrace{\sqrt{\mathrm{var}(\hat b)}}_{\text{allowance for sampling variation}}} = \frac{0.59 - 0}{\sqrt{\mathrm{var}(\hat b)}}$$
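Hypothesis testing (2): estimating var($\hat b$)
Earlier, we learned
$$\hat b = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n}(X_i - \bar X)^2}$$
The numerator simplifies as
$$\sum (X_i - \bar X)(Y_i - \bar Y) = \sum (X_i - \bar X)Y_i - \sum (X_i - \bar X)\bar Y = \sum (X_i - \bar X)Y_i - \bar Y \overbrace{\sum (X_i - \bar X)}^{=0} = \sum (X_i - \bar X)Y_i$$
$X_1, \ldots, X_n$ are assumed known and hence constants, and the $Y_i$'s are independent, so
$$\mathrm{var}(\hat b) = \mathrm{var}\left[\frac{\sum (X_i - \bar X)(Y_i - \bar Y)}{\sum (X_i - \bar X)^2}\right] = \mathrm{var}\left[\frac{\sum (X_i - \bar X)Y_i}{\sum (X_i - \bar X)^2}\right] = \frac{\sum (X_i - \bar X)^2\, \mathrm{var}(Y_i)}{\left[\sum (X_i - \bar X)^2\right]^2} = \frac{\sum (X_i - \bar X)^2\, \sigma^2}{\left[\sum (X_i - \bar X)^2\right]^2} = \frac{\sigma^2}{\sum (X_i - \bar X)^2}$$
where $\sigma^2$ can be estimated using the MLE
$$\hat\sigma^2 = \frac{\sum_{i=1}^{n}\{Y_i - (\hat a + \hat b X_i)\}^2}{n} = \frac{\sum_{i=1}^{n}(Y_i - \hat Y_i)^2}{n}$$
Sometimes, the denominator of $\hat\sigma^2$ uses n - 2 to give an unbiased estimator for $\sigma^2$.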
Hypothesis testing (3)
For large n, we find:
$$z = \frac{\hat b - 0}{\sqrt{\mathrm{var}(\hat b)}} = \frac{\hat b - 0}{\hat\sigma \Big/ \sqrt{\sum_{i=1}^{n} X_i^2 - n(\bar X)^2}} = \frac{0.59 - 0}{10.305 \Big/ \sqrt{38493 - 8(67.375)^2}} = 2.671 > 1.96$$
For small n, we replace the critical value of 1.96 by a new critical value that depends on the degrees of freedom (df), defined as df = n - 2. Critical values for selected df's are given below:

df = n - 2       5      6      10     20     120    >120
critical value   2.571  2.447  2.228  2.086  1.98   1.96

In our study, df = 8 - 2 = 6 and the critical value is 2.447. Since z = 2.671 > 2.447, we arrive at the same conclusion of rejecting $H_0: b = 0$.
We are rarely interested in a one-sided test of b.
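The test can be sketched in Python as below. The function name `slope_z` and the dictionary `critical` are illustrative; the numerical inputs (b̂ = 0.59, σ̂ = 10.305 and the critical values) are taken from the slide as given. If σ̂ is instead recomputed from the residuals, its value depends on whether the divisor is n or n - 2.

```python
import math


def slope_z(b_hat, sigma_hat, sxx):
    """Test statistic for H0: b = 0, i.e. b_hat divided by its estimated SD."""
    return (b_hat - 0) / (sigma_hat / math.sqrt(sxx))


n = 8
sxx = 38493 - n * 67.375 ** 2  # sum of X_i^2 minus n times X_bar squared

z = slope_z(0.59, 10.305, sxx)

# Two-sided 5% critical values by degrees of freedom, copied from the slide.
critical = {5: 2.571, 6: 2.447, 10: 2.228, 20: 2.086, 120: 1.98}
df = n - 2
reject = abs(z) > critical[df]
print(round(z, 2), critical[df], reject)
```

Using the absolute value of z makes the comparison two-sided, which matches the alternative hypothesis that b differs from 0 in either direction.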
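95% Confidence and prediction intervals

Slope b
MLE (OLS): $\hat b$
95% confidence interval:
$$\hat b \pm 1.96\,\mathrm{SD}(\hat b) = \hat b \pm 1.96\,\hat\sigma \sqrt{\frac{1}{\sum_{i=1}^{n} X_i^2 - n(\bar X)^2}}$$

Average value of Y given X, (a + bX)
MLE (OLS): $\hat a + \hat b X$
95% confidence interval:
$$\hat a + \hat b X \pm 1.96\,\mathrm{SD}(\hat a + \hat b X) = \hat a + \hat b X \pm 1.96\,\hat\sigma \sqrt{\frac{1}{n} + \frac{(X - \bar X)^2}{\sum_{i=1}^{n} X_i^2 - n(\bar X)^2}}$$

Individual value of Y given X, (a + bX + e)
MLE (OLS): $\hat a + \hat b X + \overbrace{\hat e}^{0}$
95% interval (also called a prediction interval):
$$\hat a + \hat b X \pm 1.96\,\mathrm{SD}(\hat a + \hat b X + \hat e) = \hat a + \hat b X \pm 1.96\,\hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(X - \bar X)^2}{\sum_{i=1}^{n} X_i^2 - n(\bar X)^2}}$$

For small values of n, 1.96 can be replaced by an appropriate value in the t-table.
Note: $\hat a + \hat b X = (\bar Y - \hat b \bar X) + \hat b X = \bar Y + \hat b (X - \bar X)$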
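The three interval formulas can be sketched as small Python functions. This is an illustration under stated assumptions: it reuses the rounded estimates and σ̂ = 10.305 from the earlier slides, and it uses the t-table value 2.447 (df = 6) in place of 1.96 because n = 8 is small. The function names are hypothetical.

```python
import math

n = 8
x_bar = 67.375
sxx = 38493 - n * x_bar ** 2   # sum of X_i^2 minus n times X_bar squared
a_hat, b_hat = 25.247, 0.59    # fitted line from the Example slide
sigma_hat = 10.305             # value used on the hypothesis testing slide
crit = 2.447                   # t-table value for df = 6; 1.96 for large n


def slope_interval():
    """Confidence interval for the slope b."""
    margin = crit * sigma_hat * math.sqrt(1 / sxx)
    return b_hat - margin, b_hat + margin


def mean_interval(x):
    """Confidence interval for the average value of Y given X = x."""
    centre = a_hat + b_hat * x
    margin = crit * sigma_hat * math.sqrt(1 / n + (x - x_bar) ** 2 / sxx)
    return centre - margin, centre + margin


def prediction_interval(x):
    """Prediction interval for an individual value of Y given X = x."""
    centre = a_hat + b_hat * x
    margin = crit * sigma_hat * math.sqrt(1 + 1 / n + (x - x_bar) ** 2 / sxx)
    return centre - margin, centre + margin


print(slope_interval())
print(mean_interval(65))
print(prediction_interval(65))
```

The prediction interval is always wider than the confidence interval at the same X because of the extra 1 under the square root, and both widen as X moves away from the sample mean of the Midterm scores.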
Example
[Figure: scatter plot of Final against Midterm with the fitted regression line, the confidence band and the wider prediction band]
Goodness-of-fit: $R^2$
How well does the model fit the data? We answer this question using a goodness-of-fit measure called the coefficient of determination $R^2$ ("R-square"). $R^2$ can be justified as follows.
Consider using n observations $(X_1, Y_1), \ldots, (X_n, Y_n)$ of (X, Y) to predict the next observation, $Y_{n+1}$, of Y. Two possible estimates are:
(1) $\bar Y = \frac{1}{n}\sum_{i=1}^{n} Y_i$ and (2) $\hat Y_i = \hat a + \hat b X_i$
How do they compare? Since $Y_{n+1}$ is unknown, we cannot tell whether $\bar Y$ or $\hat Y_i$ is closer to $Y_{n+1}$. However, we can compare their performances in predicting the observed $Y_i$, i = 1, ..., n. For $Y_i$, the errors incurred by these estimates are:
$(Y_i - \bar Y)$ and $(Y_i - \hat Y_i)$
$R^2$ is then defined as
$$R^2 = \frac{\text{Total error using } \bar Y - \text{Total error using } \hat Y_i}{\text{Total error using } \bar Y} = \frac{\sum_{i=1}^{n}(Y_i - \bar Y)^2 - \sum_{i=1}^{n}(Y_i - \hat Y_i)^2}{\sum_{i=1}^{n}(Y_i - \bar Y)^2}$$
$R^2$
$$R^2 = \frac{\overbrace{\sum_{i=1}^{n}(Y_i - \bar Y)^2}^{SST} - \overbrace{\sum_{i=1}^{n}(Y_i - \hat Y_i)^2}^{SSE}}{\underbrace{\sum_{i=1}^{n}(Y_i - \bar Y)^2}_{SST}}$$
[Figure: two scatter plots of Final against Midterm showing the errors of each observation, one about $\bar Y$ and one about the fitted line]
SST is defined as the sum of the squared errors about $\bar Y$, whereas SSE is defined as the sum of the squared errors about the fitted line; SSE ≤ SST since SSE is the total error from the least squares line.
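SST, SSE and the resulting R² can be computed for the exam data with the following sketch. Variable names are illustrative, and the line is refitted from the data rather than taken from the rounded slide values.

```python
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]

n = len(midterm)
x_bar = sum(midterm) / n
y_bar = sum(final) / n

# Least squares line of Final on Midterm.
b_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(midterm, final)) / sum(
    (x - x_bar) ** 2 for x in midterm
)
a_hat = y_bar - b_hat * x_bar

# Total squared error when every Final score is predicted by the mean.
sst = sum((y - y_bar) ** 2 for y in final)
# Total squared error when each Final score is predicted by the fitted line.
sse = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(midterm, final))

r_squared = (sst - sse) / sst
print(round(sst, 2), round(sse, 2), round(r_squared, 3))
```

The fitted line cannot do worse than the horizontal line at the mean, since that horizontal line is itself one of the candidate lines with slope 0, which is why SSE ≤ SST and R² lies between 0 and 1.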
Example
For a simple linear regression model, a simple relationship exists between $R^2$ and r:
$$R^2 = \mathrm{corr}(X, Y)^2 = r^2 = 0.712^2 = 0.507$$
in our example between Midterm and Final score, so the total squared error is reduced by about half compared to predicting without the model.
Multiplying $R^2$ by 100% gives the percent variation explained,
$$R^2 \times 100\% = 50.7\%$$
which tells us that about 50.7% of the differences in Final score between students can be accounted for by their Midterm score, while the remaining differences, i.e., 49.3%, are due to other (unknown) factors.
When there is more than one predictor, a single r between the outcome and the predictors cannot be calculated; in that case, $R^2$ is the squared correlation between the outcome and its fitted values, and so summarizes the association between the outcome and the predictors jointly.
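The identity R² = r² for a simple linear regression can be checked numerically. The sketch below computes r from the shortcut sums and R² from its definition as a proportional reduction in error; all names are illustrative.

```python
import math

midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]

n = len(midterm)
x_bar = sum(midterm) / n
y_bar = sum(final) / n
sxy = sum(x * y for x, y in zip(midterm, final)) - n * x_bar * y_bar
sxx = sum(x * x for x in midterm) - n * x_bar ** 2
syy = sum(y * y for y in final) - n * y_bar ** 2  # equals SST

# Correlation coefficient from the shortcut formula.
r = sxy / math.sqrt(sxx * syy)

# R-square from its definition, using the fitted least squares line.
b_hat = sxy / sxx
a_hat = y_bar - b_hat * x_bar
sse = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(midterm, final))
r_squared = (syy - sse) / syy

percent_explained = 100 * r_squared
print(round(r ** 2, 3), round(r_squared, 3), round(percent_explained, 1))
```

The two quantities agree up to floating-point rounding, and the percentage corresponds to the 50.7% quoted on the slide.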