Correlation and Regression

STAT 151 Class 9, October 25, 2017

Outline of Topics
1. Associations
2. Scatter plot
3. Correlation
4. Regression
5. Testing and estimation
6. Goodness-of-fit

Example

We are often interested in the association between two or more variables. Suppose the Midterm (X) and Final (Y) exam scores of a sample of n = 8 students (a SRS of independent observations) are recorded, and we wish to study the association between X and Y in the population of students.

Midterm (X)   55  60  80  77  35  75  92  65
Final (Y)     45  75  85  62  50  72  78  53

We consider three approaches:
(1) a graphical summary: the scatter plot (c.f., Class 3)
(2) a numerical measure: the correlation coefficient (c.f., Class 3)
(3) a model: regression

Scatter plot (1): Example

Each observation (student) is represented by a symbol on the plot. A scatter plot is useful for giving an overall impression of the kind of relationship between the variables, e.g., linear, nonlinear, or no apparent relationship.

[Figure: scatter plot of Final against Midterm for the 8 students, with small inset sketches of a linear, a nonlinear, and no apparent relationship]
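To reproduce a plot like this, here is a minimal Python sketch (assuming matplotlib is installed; the variable names are mine, not from the slides) that enters the eight (Midterm, Final) pairs and draws the scatter plot:

```python
import matplotlib.pyplot as plt

# Midterm (X) and Final (Y) scores of the n = 8 students in the example
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]

plt.scatter(midterm, final)
plt.xlabel("Midterm")
plt.ylabel("Final")
plt.title("Final vs. Midterm (n = 8)")
plt.show()
```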

Scatter plot (2)

Outliers are observations that deviate from the general trend of the rest of the data. If we have a new observation (X, Y) = (99, 10), it will appear as the red open circle, and the scatter plot shows that this new observation is unusual. Scatter plots are generally not useful when there are more than two variables, e.g., Projects, Midterm, Final, etc.

[Figure: the same scatter plot with the outlier (99, 10) shown as an open circle]

Pearson correlation (Karl Pearson, 1857-1936)

In Class 3, cov(X, Y) is used to measure the association between X and Y.

[Figure: two scatter plots of Final against Midterm; with Midterm recorded on a (0, 100) scale, cov(X, Y) = 183.57; with Midterm rescaled to (0, 10), cov(X, Y) = 18.357]

cov(X, Y) is not invariant to scale transformation, e.g., its value changes if the Midterm is recorded on (0, 10) instead of (0, 100). The sign of cov(X, Y) (+ vs. −) can be used to tell the direction of the association, but its magnitude has no meaning.

Pearson correlation (2)

The Pearson (product moment) correlation coefficient, r = corr(X, Y), is a number that summarizes the linear relationship between X and Y. For X from a population with mean µ_X and variance σ²_X, the Z-score

$$Z_X = \frac{X - \mu_X}{\sigma_X}$$

tells us where X stands relative to the rest of the population. The correlation

$$r = E(Z_X Z_Y) = E\left[\left(\frac{X - \mu_X}{\sigma_X}\right)\left(\frac{Y - \mu_Y}{\sigma_Y}\right)\right] = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$

measures, on average, whether X and Y move in tandem relative to their populations. Using n observations (X_1, Y_1), ..., (X_n, Y_n),

$$r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})/(n-1)}{\sqrt{\frac{\sum_i (X_i - \bar{X})^2}{n-1}}\,\sqrt{\frac{\sum_i (Y_i - \bar{Y})^2}{n-1}}} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \sum_i (Y_i - \bar{Y})^2}}$$

Correlation: Example

For calculation, the equivalent formula is more convenient (the 1/(n−1) factors cancel):

$$r = \frac{\left(\sum_i X_i Y_i - n\bar{X}\bar{Y}\right)/(n-1)}{\sqrt{\frac{\sum_i X_i^2 - n\bar{X}^2}{n-1}}\,\sqrt{\frac{\sum_i Y_i^2 - n\bar{Y}^2}{n-1}}} = \frac{\sum_i X_i Y_i - n\bar{X}\bar{Y}}{\sqrt{\sum_i X_i^2 - n\bar{X}^2}\,\sqrt{\sum_i Y_i^2 - n\bar{Y}^2}}$$

X recorded on (0, 100): X̄ = 67.375, Ȳ = 65, ΣX_iY_i = 36320, ΣX_i² = 38493, ΣY_i² = 35296, so

$$r = \frac{\left(36320 - 8(67.375)(65)\right)/7}{\sqrt{\frac{38493 - 8(67.375)^2}{7}}\,\sqrt{\frac{35296 - 8(65)^2}{7}}} = \frac{183.57}{17.63874 \times 14.61897} \approx 0.712$$

X recorded on (0, 10): X̄ = 6.7375, Ȳ = 65, ΣX_iY_i = 3632, ΣX_i² = 384.93, ΣY_i² = 35296, so

$$r = \frac{\left(3632 - 8(6.7375)(65)\right)/7}{\sqrt{\frac{384.93 - 8(6.7375)^2}{7}}\,\sqrt{\frac{35296 - 8(65)^2}{7}}} = \frac{18.357}{1.763874 \times 14.61897} \approx 0.712$$

On average, Z_X Z_Y = 0.712 > 0: Z_X and Z_Y tend to be of the same sign (both + or both −), i.e., the two scores are either both big or both small relative to their own populations.
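The arithmetic is easy to verify in plain Python (no packages needed; pearson_r is my own helper, not something from the slides):

```python
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]

def pearson_r(x, y):
    """Sample Pearson correlation via the computational formula."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
    sxx = sum(xi ** 2 for xi in x) - n * xbar ** 2
    syy = sum(yi ** 2 for yi in y) - n * ybar ** 2
    return sxy / (sxx * syy) ** 0.5

print(round(pearson_r(midterm, final), 3))                    # 0.712
# Rescaling X from (0,100) to (0,10) changes cov(X, Y) but not r:
print(round(pearson_r([x / 10 for x in midterm], final), 3))  # 0.712
```

Unlike the covariance, r is unchanged by the rescaling, which is exactly the scale invariance the previous slide pointed out.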

Sample correlation under various relationships (Fig. 3)

−1 ≤ r ≤ 1. The magnitude of r measures the strength of the association: if |r| is near 1, the association is strong (B, C and D); if r is near 0, the association is weak (A) or nonlinear. The sign of r measures the direction of the association: if r > 0, large X tends to be associated with large Y (B and C); if r < 0, large X tends to be associated with small Y (D).

[Figure 3: four scatter plots, (A) r = 0.063, (B) r = 0.935, (C) r = 0.652, (D) r = −0.439]

Correlation measures linear relationships (Fig. 4)

(A) r measures linear associations
(B) A nonlinear relationship may distort the value of r
(C) Outliers may distort the value of r
(D) A restricted range (open circles) in X or Y may lead to a smaller r

[Figure 4: four panels (A)-(D) illustrating these cases]

Prediction under a linear model (Fig. 5)

A regression analysis allows us to determine whether Midterm score (X) can be used to predict Final score (Y). The scatter plot suggests there may be a linear relationship between X and Y (i.e., each additional point in the Midterm is associated with b extra points in the Final). A regression analysis uses a sample of students to determine whether a linear relationship exists for the population of students.

[Figure 5: scatter plot of Final against Midterm with a straight line drawn through the points]

Simple linear regression

We postulate that the relationship between Midterm score (X) and Final score (Y) in the population is represented by a straight line:

$$Y = a + bX$$

where a is the intercept and b is the slope. The variable X is called an independent or predictor variable, and Y is called a dependent or outcome variable. A simple linear regression is a regression with only one predictor, where the relationship between the predictor and the outcome variable is assumed to be linear.

The intercept a gives the prediction of Y when X = 0 or when b = 0. Often a is not of interest or may even be meaningless; e.g., if X represents the height of a person and Y represents the weight, then no person has a height (X) of zero. The value of b is the change in Y for every unit difference in X.

Figure 5 shows that the observations do not fall on the straight line; in fact, there is no straight line that fits all observations. We therefore assume

$$Y = a + bX + e, \qquad e \sim N(0, \sigma^2)$$

Simple linear regression (2)

$$Y = \underbrace{a + bX}_{(A)} + \underbrace{e}_{(B)}, \qquad e \sim N(0, \sigma^2)$$

(A) a + bX is the average value of Y for observations with a particular value of X
(B) Each observation Y differs from that average by an amount e, where e ~ N(0, σ²)
(A)+(B) For each known value of X, the values of Y are distributed as N(a + bX, σ²)

Therefore, in a regression we assume we have known values of X at X_1, ..., X_n, and we investigate how Y changes at these values, which is captured by the regression model. We use maximum likelihood estimation (MLE), which is equivalent to a method called ordinary least squares (OLS) in this setting.

Maximum Likelihood (1)

Data: each Final score is modeled as a draw from a normal distribution whose mean depends on the corresponding Midterm score.

Midterm (X)   Final (Y)   Mean of Y
55            45          a + b(55)
60            75          a + b(60)
80            85          a + b(80)
77            62          a + b(77)
35            50          a + b(35)
75            72          a + b(75)
92            78          a + b(92)
65            53          a + b(65)

Maximum Likelihood (2)

We have a sample Y_1, ..., Y_n observed at X_1, ..., X_n, respectively. Assuming Y_i ~ N(a + bX_i, σ²), where a, b, σ² are unknown, we can find the MLEs of these parameters. The MLEs are the values of a, b, σ² that jointly maximize the likelihood

$$L(a, b, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\{Y_i - (a + bX_i)\}^2}{2\sigma^2}}$$

Taking the (natural) logarithm of L(a, b, σ²) gives the log-likelihood

$$\ell(a, b, \sigma^2) = \log L(a, b, \sigma^2) = \sum_{i=1}^n \left[-\frac{\{Y_i - (a + bX_i)\}^2}{2\sigma^2} - \frac{1}{2}\log 2\pi - \frac{1}{2}\log \sigma^2\right]$$

The MLEs are found by setting

$$\frac{\partial \ell}{\partial a} = 0, \qquad \frac{\partial \ell}{\partial b} = 0, \qquad \frac{\partial \ell}{\partial \sigma^2} = 0 \quad \text{at } (\hat{a}, \hat{b}, \hat{\sigma}^2),$$

which gives

$$\hat{b} = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} = \frac{\sum_{i=1}^n X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^n X_i^2 - n\bar{X}^2} = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)},$$

$$\hat{a} = \bar{Y} - \hat{b}\bar{X}, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \{Y_i - (\hat{a} + \hat{b}X_i)\}^2$$
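Here is a small Python sketch of these closed-form MLE (OLS) formulas applied to the example data (fit_line and the other names are my own):

```python
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]

def fit_line(x, y):
    """Closed-form MLE/OLS estimates for the model Y = a + bX + e."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b_hat = ((sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar)
             / (sum(xi ** 2 for xi in x) - n * xbar ** 2))
    a_hat = ybar - b_hat * xbar
    # MLE of sigma^2 divides by n; dividing by n - 2 gives the unbiased version
    sigma2_hat = sum((yi - (a_hat + b_hat * xi)) ** 2
                     for xi, yi in zip(x, y)) / n
    return a_hat, b_hat, sigma2_hat

a_hat, b_hat, sigma2_hat = fit_line(midterm, final)
print(round(a_hat, 3), round(b_hat, 3))  # roughly 25.244 0.59 (slides: 25.247, 0.59)
```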

Least squares

For any value of σ² in the log-likelihood

$$\ell(a, b, \sigma^2) = \sum_{i=1}^n \left[-\frac{\{Y_i - (a + bX_i)\}^2}{2\sigma^2} - \frac{1}{2}\log 2\pi - \frac{1}{2}\log \sigma^2\right],$$

ℓ(a, b, σ²) is maximized over a and b when

$$\sum_{i=1}^n \{Y_i - (a + bX_i)\}^2$$

is minimized (hence "least squares"). The best fitting line using MLE or OLS is the line that minimizes the sum of squared deviations of the observations from the line.

[Figure: scatter plot of Final against Midterm with the fitted line and the vertical deviations of the points from it]
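As a numerical sanity check (my own illustration, not from the slides), the sum of squared deviations really is smallest at the OLS slope; nudging b away from b̂ in either direction increases it:

```python
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]

def sse(a, b):
    """Sum of squared deviations of the observations from the line a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(midterm, final))

b_hat = 0.59                     # OLS slope from the slides
for b in (b_hat - 0.1, b_hat, b_hat + 0.1):
    a = 65 - b * 67.375          # keep each candidate line through (x-bar, y-bar)
    print(round(b, 2), round(sse(a, b), 1))
# The middle line (b = 0.59) gives the smallest sum of squared deviations
```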

Example

Using our sample of n = 8 students, what is the predicted Final score for a student who scored 65 on the Midterm, using the MLE (OLS) estimates?

$$\hat{b} = \frac{36320 - 8(67.375)(65)}{38493 - 8(67.375)^2} = 0.59, \qquad \hat{a} = 65 - 0.59(67.375) = 25.247$$

The fitted regression line is

Final = 25.247 + 0.59 × Midterm

For a student whose Midterm score is 65, the predicted Final score is 25.247 + 0.59 × 65 = 63.597.
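A two-line check in Python, plugging in the fitted coefficients from the slide:

```python
a_hat, b_hat = 25.247, 0.59   # fitted intercept and slope from the slide
print(a_hat + b_hat * 65)     # 63.597: predicted Final for Midterm = 65
```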

Quality of the regression: residual plots

Under the regression model Y_i = a + bX_i + e_i with e_i ~ N(0, σ²), the residuals are

$$\hat{e}_i = Y_i - \hat{Y}_i = Y_i - (\hat{a} + \hat{b}X_i)$$

If the model is correct, the ê_i's should resemble a set of random observations from a normal distribution with mean zero, as in panel (a).

[Figure: four residual plots against X: (a) random, (b) nonlinear, (c) skewed distribution, (d) non-constant variance]

Residual plot: Example

Based on the regression model Ŷ = 25.247 + 0.59X,

$$\hat{e}_i = Y_i - \hat{Y}_i = Y_i - (25.247 + 0.59 X_i)$$

Y_i    Ŷ_i      ê_i
45     57.70   -12.70
75     60.65    14.35
85     72.45    12.55
62     70.68    -8.68
50     45.90     4.10
72     69.50     2.50
78     79.53    -1.53
53     63.60   -10.60

[Figure: residuals ê_i plotted against X, scattered around the horizontal line at 0]
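The table is easy to reproduce in Python (a small sketch; the formatting choices are mine):

```python
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]
a_hat, b_hat = 25.247, 0.59   # fitted coefficients from the slides

print("  Y   fitted  residual")
for x, y in zip(midterm, final):
    fitted = a_hat + b_hat * x
    print(f"{y:3d}  {fitted:6.2f}  {y - fitted:8.2f}")
```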

Notes about a regression analysis

A linear regression model makes three assumptions:
1. The relationship between X and Y is linear, i.e., Y = a + bX + e
2. The values of the Y_i's are normally distributed about the regression line
3. The variances of the Y_i's about the regression line are the same

The regression line is fitted by MLE (= OLS), which means the sum of the squared distances of the observations to the regression line is minimized.

Prediction can only be made within the range of X used to obtain the regression line. In the example, since the lowest and highest Midterm scores among the 8 students are 35 and 92, prediction can be made for other students whose Midterm scores are within this range; for someone whose Midterm score falls outside (35, 92), no prediction is possible. This restriction does not apply to the dependent variable, so the predicted Final score can be outside the range of Y values observed in the 8 students.

Observed relationship: fact or fiction?

The fitted line

$$\mathrm{Final} = \overbrace{25.247}^{\hat{a}} + \overbrace{0.59}^{\hat{b}} \times \mathrm{Midterm}$$

shows that each additional point in the Midterm is associated with an extra 0.59 point in the Final for the 8 students. Our estimate b̂ comes from a sample and hence there is sampling error, i.e., b̂ ≠ b. Does the association generalise to the population of students? Two approaches to answering this question:

(1) Test the hypotheses: H₀: b = 0 (no relationship) vs. H₁: b ≠ 0 (some relationship)
(2) Find an interval estimate: b̂ ± margin of error of b̂

Hypothesis testing

For a sample of students in which Midterm (X) and Final (Y) are unrelated: (1) b̂ is expected to be zero; (2) sampling variation allows b̂ ≠ 0, but it is unlikely to be far from 0.

[Figure: sampling distribution of b̂ centered at 0; values beyond the critical values (5% in total) are marked "unexpected", values in between "expected"]

We use a test statistic to determine whether b̂ for our sample is far from 0:

$$z = \frac{\overbrace{\hat{b}}^{\text{our sample}} - \overbrace{0}^{X,\,Y \text{ unrelated}}}{\underbrace{\sqrt{\mathrm{var}(\hat{b})}}_{\text{allowance for sampling variation}}} = \frac{0.59 - 0}{\sqrt{\mathrm{var}(\hat{b})}}$$

Hypothesis testing (2): estimating var(b̂)

Earlier, we learned

$$\hat{b} = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} = \frac{\sum_{i=1}^n (X_i - \bar{X})Y_i}{\sum_{i=1}^n (X_i - \bar{X})^2},$$

since

$$\sum_i (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_i (X_i - \bar{X})Y_i - \bar{Y}\underbrace{\sum_i (X_i - \bar{X})}_{=0} = \sum_i (X_i - \bar{X})Y_i$$

X_1, ..., X_n are assumed known and hence constants, so

$$\mathrm{var}(\hat{b}) = \mathrm{var}\left(\frac{\sum_i (X_i - \bar{X})Y_i}{\sum_i (X_i - \bar{X})^2}\right) = \frac{\sum_i (X_i - \bar{X})^2\, \mathrm{var}(Y_i)}{\left[\sum_i (X_i - \bar{X})^2\right]^2} = \frac{\sigma^2 \sum_i (X_i - \bar{X})^2}{\left[\sum_i (X_i - \bar{X})^2\right]^2} = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2},$$

where σ² can be estimated using the MLE

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^n \{Y_i - (\hat{a} + \hat{b}X_i)\}^2}{n} = \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{n}$$

Sometimes the denominator of σ̂² uses n − 2 instead, to give an unbiased estimator of σ².

Hypothesis testing (3)

For large n, we compute

$$z = \frac{\hat{b} - 0}{\sqrt{\mathrm{var}(\hat{b})}} = \frac{\hat{b} - 0}{\hat{\sigma}\big/\sqrt{\sum_{i=1}^n X_i^2 - n\bar{X}^2}} = \frac{0.59 - 0}{10.305\big/\sqrt{38493 - 8(67.375)^2}} = 2.671 > 1.96$$

For small n, we replace the critical value 1.96 by a new critical value that depends on the degrees of freedom, df = n − 2. Critical values for selected df are given below:

df = n − 2        5      6      10     20     120    >120
critical value    2.571  2.447  2.228  2.086  1.98   1.96

In our study, df = 8 − 2 = 6, so the critical value is 2.447. Since z = 2.671 > 2.447, we arrive at the same conclusion of rejecting H₀: b = 0. We are rarely interested in a one-sided test of b.
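A Python sketch of this test, using scipy.stats for the t critical value. Note that it uses the conventional unbiased n − 2 estimator of σ², so the statistic comes out around 2.48 rather than the slide's 2.671 (which is based on a different σ̂); it still exceeds the df = 6 critical value 2.447, so the conclusion is the same:

```python
from scipy import stats

midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]
n = len(midterm)
xbar, ybar = sum(midterm) / n, sum(final) / n

sxx = sum((x - xbar) ** 2 for x in midterm)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(midterm, final))
b_hat = sxy / sxx
a_hat = ybar - b_hat * xbar

# Standard error of b-hat using the unbiased (n - 2) estimate of sigma^2
sse = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(midterm, final))
se_b = (sse / (n - 2) / sxx) ** 0.5

t_stat = (b_hat - 0) / se_b
t_crit = stats.t.ppf(0.975, df=n - 2)
print(round(t_stat, 3), round(t_crit, 3))  # about 2.483 and 2.447
print("reject H0" if abs(t_stat) > t_crit else "fail to reject H0")
```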

95% Confidence and prediction intervals

Slope b. MLE (OLS): b̂. 95% confidence interval:

$$\hat{b} \pm 1.96\,\mathrm{SD}(\hat{b}) = \hat{b} \pm 1.96\,\hat{\sigma}\sqrt{\frac{1}{\sum_{i=1}^n X_i^2 - n\bar{X}^2}}$$

Average value of Y given X (i.e., a + bX). MLE (OLS): â + b̂X. 95% confidence interval:

$$\hat{a} + \hat{b}X \pm 1.96\,\mathrm{SD}(\hat{a} + \hat{b}X) = \hat{a} + \hat{b}X \pm 1.96\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(X - \bar{X})^2}{\sum_{i=1}^n X_i^2 - n\bar{X}^2}}$$

Individual value of Y given X (i.e., a + bX + e). MLE (OLS): â + b̂X. 95% interval (also called a prediction interval):

$$\hat{a} + \hat{b}X \pm 1.96\,\mathrm{SD}(\hat{a} + \hat{b}X + \hat{e}) = \hat{a} + \hat{b}X \pm 1.96\,\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum_{i=1}^n X_i^2 - n\bar{X}^2}}$$

For small values of n, 1.96 is replaced by the appropriate value from the t-table. Note also that â + b̂X = (Ȳ − b̂X̄) + b̂X = Ȳ + b̂(X − X̄).
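A sketch of both intervals at X = 65, following the formulas above with the t critical value in place of 1.96 (since n = 8 is small) and the n − 2 estimate of σ²; x0, se_mean, and se_pred are my own names:

```python
from scipy import stats

midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]
n = len(midterm)
xbar, ybar = sum(midterm) / n, sum(final) / n
sxx = sum((x - xbar) ** 2 for x in midterm)
b_hat = sum((x - xbar) * (y - ybar) for x, y in zip(midterm, final)) / sxx
a_hat = ybar - b_hat * xbar
sigma_hat = (sum((y - (a_hat + b_hat * x)) ** 2
                 for x, y in zip(midterm, final)) / (n - 2)) ** 0.5
t = stats.t.ppf(0.975, df=n - 2)   # replaces 1.96 for small n

x0 = 65
fit = a_hat + b_hat * x0
se_mean = sigma_hat * (1 / n + (x0 - xbar) ** 2 / sxx) ** 0.5
se_pred = sigma_hat * (1 + 1 / n + (x0 - xbar) ** 2 / sxx) ** 0.5
print(f"95% CI for the average Final at Midterm=65: "
      f"({fit - t * se_mean:.1f}, {fit + t * se_mean:.1f})")
print(f"95% prediction interval for one student:    "
      f"({fit - t * se_pred:.1f}, {fit + t * se_pred:.1f})")
```

The prediction interval is much wider than the confidence interval because it must also allow for the individual error e, the extra "1 +" under the square root.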

Example

[Figure: scatter plot of Final against Midterm with the fitted line, a 95% confidence band for the average Final score, and a wider 95% prediction band for an individual student's Final score]

Goodness-of-fit: R²

How well does the model fit the data? We answer this question using a goodness-of-fit measure called the coefficient of determination, R² ("R-square"). R² can be justified as follows. Consider using n observations (X_1, Y_1), ..., (X_n, Y_n) of (X, Y) to predict the next observation, Y_{n+1}, of Y. Two possible estimates are:

(1) Ȳ = (1/n) Σ Y_i, which ignores X, and (2) Ŷ_i = â + b̂X_i, which uses X.

How do they compare? Since Y_{n+1} is unknown, we cannot tell whether Ȳ or Ŷ_i would be closer to it. However, we can compare their performance in predicting the observed Y_i, i = 1, ..., n. For Y_i, the errors incurred by these estimates are (Y_i − Ȳ) and (Y_i − Ŷ_i). R² is then defined as

$$R^2 = \frac{\text{Total error using } \bar{Y} - \text{Total error using } \hat{Y}_i}{\text{Total error using } \bar{Y}} = \frac{\sum_i (Y_i - \bar{Y})^2 - \sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}$$

R²

$$R^2 = \frac{\overbrace{\sum_{i=1}^n (Y_i - \bar{Y})^2}^{SST} - \overbrace{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}^{SSE}}{\underbrace{\sum_{i=1}^n (Y_i - \bar{Y})^2}_{SST}}$$

SST is the total sum of squared errors when every Y_i is predicted by Ȳ, whereas SSE is the sum of squared errors when each Y_i is predicted by Ŷ_i from the fitted line. SSE ≤ SST, since SSE is the total error from the least squares line, which minimizes the sum of squared errors.

[Figure: two scatter plots of Final against Midterm, one showing the errors about the horizontal line at Ȳ (SST), the other the errors about the fitted line (SSE)]

Example

For a simple linear regression model, a simple relationship exists between R² and r:

$$R^2 = \mathrm{corr}(X, Y)^2 = r^2 = 0.712^2 = 0.507$$

in our example between Midterm and Final score, so the error is reduced by about half compared to predicting without the model. Multiplying R² by 100% gives the percent of variation explained: R² × 100% = 50.7%, which tells us that about 50.7% of the differences in Final score between students can be accounted for by their Midterm score, while the remaining 49.3% are due to other (unknown) factors. When there is more than one predictor, r cannot be calculated; in that case R² still applies, and its square root gives the correlation between the outcome and the model's fitted values.
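A final Python sketch verifying the identity R² = r² on the example data (again using only plain Python; the names are mine):

```python
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]
n = len(midterm)
xbar, ybar = sum(midterm) / n, sum(final) / n
sxx = sum((x - xbar) ** 2 for x in midterm)
b_hat = sum((x - xbar) * (y - ybar) for x, y in zip(midterm, final)) / sxx
a_hat = ybar - b_hat * xbar

sst = sum((y - ybar) ** 2 for y in final)                                # errors using ybar
sse = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(midterm, final))  # errors using the line
r_squared = (sst - sse) / sst
print(round(r_squared, 3))   # about 0.507
print(round(0.712 ** 2, 3))  # 0.507: matches r squared, as the slide states
```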