Correlation and Linear Regression


Correlation: Relationships between Variables

So far, nearly all of our discussion of inferential statistics has focused on testing for differences between group means or proportions. However, researchers are often interested in graded relationships between variables, such as how well one variable can predict another. Examples:

- How well do SAT scores predict a student's GPA?
- How is the amount of time a student takes to complete an exam related to her grade on that exam?
- How well do IQ scores correlate with income?
- How does a child's height correlate with his running speed?
- How does class size affect student performance?

Height (inches) | Weight (pounds)
70   | 150
67   | 140
72   | 180
75   | 190
68   | 145
69   | 150
71.5 | 164
71   | 140
72   | 142
69   | 136
67   | 123
68   | 155
66   | 140
72   | 145
73.5 | 160
73   | 190
69   | 155
73   | 165
72   | 150

Characteristics of the Correlation

A correlation coefficient is a single number describing the relationship between two variables. This number describes:

- The direction of the relationship. Variables sharing a positive correlation tend to change in the same direction (e.g., height and weight): as the value of one variable (height) increases, the value of the other variable (weight) also increases. Variables sharing a negative correlation tend to change in opposite directions (e.g., snowfall and beach visitors): as the value of one variable (amount of snowfall) increases, the value of the other variable (number of beach visitors) decreases.
- The strength of the relationship. Variables that share a strong correlation (close to +1 or -1) strongly predict one another, while variables that share a weak correlation (near 0) do not.

Positive versus Negative Correlations

[Figure: two scatterplots, one showing a positive correlation and one showing a negative correlation]

Strong versus Weak Correlations

[Figure: scatterplots contrasting strong and weak correlations]

Correlation is not Causation

http://www.tylervigen.com/spurious-correlations

Possible Sources of Correlation

- The relationship is causal: manipulating the predictor variable causes an increase or decrease in the criterion variable. E.g., leg strength and sprinting speed.
- The causal relationship is backwards (reverse causality): manipulating the criterion variable causes changes in the predictor variable. E.g., the relationship between hospital visits and illness.
- The two variables work together systematically to cause an effect.
- The relationship may be due to one or more confounding variables: changes in both variables reflect the effect of a confounding variable. E.g., intelligence as an explanation for correlated performance on different exams, or increasing density in cities increasing both the number of physicians and the number of crimes.

Measuring Correlation: Pearson's r

To compute a correlation you need a pair of scores, X and Y, for each individual in the sample. The most commonly used measure of correlation is Pearson's product-moment correlation coefficient, or more simply, Pearson's r. Conceptually, Pearson's r is a ratio between the degree to which two variables (X and Y) vary together and the degree to which they vary separately:

r = covariability(X, Y) / [variability(X) × variability(Y)]

The Covariance

The term in the numerator of Pearson's r is the covariance, an unnormalized statistic representing the degree to which two variables (X and Y) vary together:

cov_XY = E[(X − E[X])(Y − E[Y])] = Σ(X − E[X])(Y − E[Y]) / n

Mathematically, it is the average of the product of the deviations of two paired variables. The covariance depends both on how consistently X and Y tend to vary together and on the individual variability of each variable (X and Y).
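As a concrete illustration, here is a minimal NumPy sketch of the covariance formula above, applied to the height/weight data from the earlier slide. The function name covariance is just for illustration; note that it divides by n, matching the formula on this slide, whereas NumPy's np.cov divides by n − 1 unless told otherwise.

```python
import numpy as np

def covariance(x, y):
    # Average product of deviations; divides by n, as in the slide's formula.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))

# Height/weight pairs from the earlier slide
height = [70, 67, 72, 75, 68, 69, 71.5, 71, 72, 69, 67, 68, 66, 72, 73.5, 73, 69, 73, 72]
weight = [150, 140, 180, 190, 145, 150, 164, 140, 142, 136, 123, 155, 140, 145, 160, 190, 155, 165, 150]

print(covariance(height, weight))            # positive: taller students tend to weigh more
print(np.cov(height, weight, ddof=0)[0, 1])  # NumPy agrees when ddof=0 (divide by n)
```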

Computing Pearson's r

Pearson's r is computed by dividing the covariance by the product of the standard deviations of each of the variables. This removes the effect of the variability of the individual variables:

ρ = E[z_X · z_Y] = cov_XY / (σ_X σ_Y)

ρ̂ = r = cov_XY / (s_X s_Y)
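Continuing the sketch: dividing the covariance by the product of the standard deviations gives r. As long as the covariance and the standard deviations use the same divisor (n here), the result matches NumPy's built-in np.corrcoef. pearson_r is an illustrative name, not a library function.

```python
import numpy as np

def pearson_r(x, y):
    # Covariance divided by the product of the standard deviations.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    return cov_xy / (x.std() * y.std())  # np.std divides by n by default, matching cov_xy

height = [70, 67, 72, 75, 68, 69, 71.5, 71, 72, 69, 67, 68, 66, 72, 73.5, 73, 69, 73, 72]
weight = [150, 140, 180, 190, 145, 150, 164, 140, 142, 136, 123, 155, 140, 145, 160, 190, 155, 165, 150]

print(pearson_r(height, weight))
print(np.corrcoef(height, weight)[0, 1])  # library check: should agree
```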

Linear Correlation: Assumptions

1. Linearity. Assumes that the relationship between the paired scores is best described by a straight line.
2. Normality. Technically, assumes that the marginal score distributions, their joint distribution, and any conditional distributions are normally distributed. However, the most important assumption is that the residuals are normally distributed.
3. Homoscedasticity (constant variance). Assumes that the variability around the regression line is homogeneous across different score values.
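One informal way to check these assumptions is to fit a line and inspect the residuals: a flat, even band of residuals suggests linearity and homoscedasticity, and a roughly bell-shaped histogram suggests normal residuals. This is a minimal sketch on synthetic data, not a formal test.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, 100)  # synthetic data that meets the assumptions

b1, b0 = np.polyfit(x, y, deg=1)             # least-squares line
residuals = y - (b1 * x + b0)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.scatter(b1 * x + b0, residuals)
ax1.axhline(0, color="gray")
ax1.set(xlabel="fitted value", ylabel="residual",
        title="Flat, even band: linear & homoscedastic")
ax2.hist(residuals, bins=15)
ax2.set(title="Roughly bell-shaped: normal residuals")
plt.tight_layout()
plt.show()
```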

Breaking Linearity and Normality Assumptions

[Figure: four scatterplots with very different shapes (Anscombe's quartet); r = 0.816 for all of the above]

Which has the strongest correlation coefficient?

[Figure: several scatterplots to compare]

Other Correlation Coefficients

- Spearman's correlation coefficient (r_s) for ranked data. As the name suggests, Spearman's correlation is used when the scores for both X and Y consist of (or have been converted to) ordinal ranks.
- The point-biserial correlation coefficient (r_pb). This correlation is used when one of the scores is continuous and the other is dichotomous, taking on one of only two possible values.
- The phi correlation coefficient (r_φ). The phi correlation is used when both scores are dichotomous.

All of the above can be computed in the same manner as Pearson's correlation.
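The last point is easy to verify numerically: applying Pearson's formula to ranks reproduces Spearman's r_s, and applying it to a 0/1 variable reproduces the point-biserial r_pb. A small SciPy sketch with made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = x + rng.normal(size=30)

# Spearman's r_s is just Pearson's r computed on the ranks
print(stats.spearmanr(x, y)[0])
print(stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0])

# Point-biserial r_pb is just Pearson's r with one dichotomous (0/1) variable
d = (x > 0).astype(int)
print(stats.pointbiserialr(d, y)[0])
print(stats.pearsonr(d, y)[0])
```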

Converting Data for Spearman's Correlation

Original Data:

Age | Height
10  | 31.4
11  | 41
12  | 47.8
13  | 52.8
14  | 55.7
15  | 58.3
16  | 60.7
17  | 62.1
18  | 62.7
19  | 63.3
20  | 64.1
21  | 64.3
22  | 64.6
23  | 64.7
24  | 64.5
25  | 64.3

r = 0.86

Converting Data for Spearman's Correlation

Age | Height | Age rank | Height rank
10  | 31.4   | 1        | 1
11  | 41     | 2        | 2
12  | 47.8   | 3        | 3
13  | 52.8   | 4        | 4
14  | 55.7   | 5        | 5
15  | 58.3   | 6        | 6
16  | 60.7   | 7        | 7
17  | 62.1   | 8        | 8
18  | 62.7   | 9        | 9
19  | 63.3   | 10       | 10
20  | 64.1   | 11       | 11
21  | 64.3   | 12       | 12.5
22  | 64.6   | 13       | 15
23  | 64.7   | 14       | 16
24  | 64.5   | 15       | 14
25  | 64.3   | 16       | 12.5

Original data: r = 0.86. Converted ranks: r = 0.97.
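SciPy reproduces both numbers from this table: Pearson's r on the raw scores and Spearman's r_s, which is Pearson's r on the ranks. Note how the tied heights (64.3 twice) each receive the average rank, 12.5, as in the table.

```python
import numpy as np
from scipy import stats

# Age/height data from the table above
age = np.arange(10, 26)
height = np.array([31.4, 41, 47.8, 52.8, 55.7, 58.3, 60.7, 62.1,
                   62.7, 63.3, 64.1, 64.3, 64.6, 64.7, 64.5, 64.3])

print(stats.pearsonr(age, height)[0])   # ~0.86: a straight line understates the relationship
print(stats.spearmanr(age, height)[0])  # ~0.97: ranks capture the monotonic trend
print(stats.rankdata(height))           # the two 64.3 values share the average rank, 12.5
```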

The Coefficient of Determination

For linear correlations (and regression generally), the strength of fit of a model is most commonly evaluated using r², the coefficient of determination. r² is calculated as the square of the correlation coefficient r, and represents the proportion of variability explained by the model. For example, the age/height correlation of r = 0.86 gives r² ≈ 0.74, meaning about 74% of the variability in height is explained by age. The remainder of the variability in the data, which cannot be explained by the variables included in the model, is called residual error.

Testing for Significance of r

Under a null hypothesis of ρ = 0 (and assuming normal residuals), the point estimate r is distributed as a t variable, with standard error:

σ̂_r = s_r = √((1 − r²) / df_r), where df_r = n − 2

This means that we can test whether r is significantly different from 0 by computing the t-statistic:

t(df_r) = r / s_r = r√(df_r) / √(1 − r²)

and computing the tail probability from the corresponding t distribution.
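A minimal sketch of this test, applied to the age/height correlation above (r = 0.86, n = 16). The function name r_significance is illustrative.

```python
import numpy as np
from scipy import stats

def r_significance(r, n):
    # t-test for H0: rho = 0, using the standard-error formula above
    df = n - 2
    s_r = np.sqrt((1 - r**2) / df)
    t = r / s_r
    p = 2 * stats.t.sf(abs(t), df)  # two-tailed p-value
    return t, p

t, p = r_significance(r=0.86, n=16)  # the age/height correlation from earlier
print(t, p)                          # t is about 6.3, p well below .05
```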

Introduction to Linear Regression

Regression is a statistical procedure for modeling the relationship among variables in order to predict the value of a response (or dependent) variable from one or more explanatory (or predictor) variables. Linear regression is the model assumed in computing a correlation coefficient; in regression, the statistical model is made explicit. Imagine that I ask you to guess the weight of a college-aged male who is hidden from view. What would your best guess be?

Introduction to Regression

[Figure: distribution of weights; μ_weight = 158.26, σ_weight = 18.64]

Introduction to Regression

Again, imagine that I ask you to guess the weight of a college-aged male who is hidden from view. What would your best guess be? What if I also gave you his height? Intuitively, it should be clear that you can do better.

Introduction to Regression

[Figure: scatterplot of the height/weight data]

Introduction to Regression

The Pearson correlation (ρ or r) measures the degree to which a set of data points form a linear (straight-line) relationship. Simple regression describes the linear relationship between a dependent variable (Y) and one predictor variable (X). The resulting line is called the regression line.

Regression and Linear Equations

You should remember the following from your high school algebra course: any straight line can be represented by an equation of the form Y = b₁X + b₀, where b₁ and b₀ are constants. The value of b₁ is called the slope and determines the direction and degree to which the line is tilted. The value of b₀ is called the Y-intercept and determines the point where the line crosses the Y-axis. In the context of linear regression, b₀ and b₁ are called regression coefficients.

Regression and Linear Equations

b₁ = 0.5, b₀ = 1.0

Ŷ = b₁X + b₀ = 0.5X + 1

[Figure: graph of the line Ŷ = 0.5X + 1]

Residuals: Errors of Prediction

How well a regression line fits a set of data points can be measured by calculating the distance between the data points and the line. Using the formula Ŷ = b₁X + b₀, it is possible to find the predicted value Ŷ for any X. The residual, or error of prediction, between the predicted value and the actual value can be found by computing the difference Y − Ŷ. The regression line is selected to be the best fit in the least-squares sense. This means that we want to compute the line that minimizes the sum of squared residuals:

SS_residual = Σ(Y − Ŷ)²

Residuals: Errors of Prediction

[Figure: regression line Ŷ = b₁X + b₀, with a data point (X, Y) and its residual (Y − Ŷ)]


The Standard Error of Estimate

The measure of unpredicted variability, or error, for the regression line is called the standard error of estimate (s_e or s_{Y−Ŷ}). You can think of it as analogous to the standard deviation, which measures our error if we were to use the mean Ȳ as our estimate of the variable Y:

s_Y = √(SS_Y / df_Y) = √(Σ(Y − Ȳ)² / (n − 1))

s_{Y−Ŷ} = √(SS_residual / df_residual) = √(Σ(Y − Ŷ)² / (n − 2))

Computing Regression Coefficients

Slope:

b₁ = (change in Y as a function of X) / (change in X) = SP_XY / SS_X = r(s_Y / s_X)

Intercept:

b₀ = Ȳ − b₁X̄
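A minimal sketch of these formulas on the height/weight data from the earlier slide, checked against NumPy's least-squares fit. (The later worked example on these slides uses b₁ = 5.44 and b₀ = −228.56, which appear to come from a larger dataset, so the numbers here need not match exactly.)

```python
import numpy as np

height = np.array([70, 67, 72, 75, 68, 69, 71.5, 71, 72, 69, 67, 68, 66, 72, 73.5, 73, 69, 73, 72])
weight = np.array([150, 140, 180, 190, 145, 150, 164, 140, 142, 136, 123, 155, 140, 145, 160, 190, 155, 165, 150])

sp_xy = np.sum((height - height.mean()) * (weight - weight.mean()))  # SP_XY
ss_x = np.sum((height - height.mean()) ** 2)                         # SS_X

b1 = sp_xy / ss_x                        # slope
b0 = weight.mean() - b1 * height.mean()  # intercept: Y-bar minus b1 times X-bar

print(b1, b0)
print(np.polyfit(height, weight, deg=1))  # NumPy's least-squares fit should agree
```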

Testing for Significance of b₁

Earlier, I pointed out that the sample correlation coefficient r is distributed as a t variable, with standard error:

σ̂_r = s_r = √((1 − r²) / df_r), where df_r = n − 2

The slope b₁ is similarly distributed, but with standard error:

σ̂_{b₁} = s_{b₁} = s_{Y−Ŷ} / (s_X √(n − 1)), with df = n − 2

This one is a bit trickier to explain, but you can use the same procedure (a t-test) to evaluate the significance of the slope.
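A sketch of the slope test using the formulas above; slope_test is an illustrative name. As a check, scipy.stats.linregress reports the same slope and a matching two-tailed p-value.

```python
import numpy as np
from scipy import stats

def slope_test(x, y):
    # t-test for H0: beta_1 = 0, using the standard-error formula above
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    b1, b0 = np.polyfit(x, y, deg=1)
    resid = y - (b1 * x + b0)
    s_yhat = np.sqrt(np.sum(resid**2) / (n - 2))          # standard error of estimate
    s_b1 = s_yhat / (np.std(x, ddof=1) * np.sqrt(n - 1))  # s_X * sqrt(n-1) = sqrt(SS_X)
    t = b1 / s_b1
    p = 2 * stats.t.sf(abs(t), n - 2)
    return b1, t, p

height = [70, 67, 72, 75, 68, 69, 71.5, 71, 72, 69, 67, 68, 66, 72, 73.5, 73, 69, 73, 72]
weight = [150, 140, 180, 190, 145, 150, 164, 140, 142, 136, 123, 155, 140, 145, 160, 190, 155, 165, 150]
print(slope_test(height, weight))
print(stats.linregress(height, weight))  # cross-check: same slope and p-value
```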

Prediction

Using the linear model to predict the value of the response variable for a given value of the explanatory variable is called prediction: we simply plug the value of x into the linear model equation. There will be some uncertainty associated with the predicted value.

Example: Predicting Y from X

You are told that a college student is 74 inches tall. Given the computed regression coefficients, what is your best estimate of his weight?

X: height, Y: weight
b₀ = −228.56, b₁ = 5.44

Ŷ = b₁X + b₀ = 5.44X − 228.56

Ŷ = 5.44(74) − 228.56 = 402.56 − 228.56 = 174

Example: Computing Accuracy of Prediction

Just as σ or s can be used to compute confidence intervals for population means, s_{Y−Ŷ} can be used to compute predictive intervals for Y:

Ŷ ± t_crit(df) · s_{Y−Ŷ}

Note that the actual formula for the predictive interval is slightly more complicated and depends on x:

Ŷ ± t_crit(df) · s_{Y−Ŷ} · √(1 + 1/n + (x − X̄)² / SS_X)

Extrapolation

Applying a model estimate to values outside the range of the original data is called extrapolation. Sometimes the intercept itself is an extrapolation: for example, the intercept of the height/weight model predicts the weight of a student whose height is 0, far outside the observed data.

The Hazards of Extrapolation

[Figures: examples of predictions that go badly wrong outside the range of the observed data]
