Regression and correlation


Correlation & Regression, I
9.07, 4/1/2004

Regression and correlation involve bivariate, paired data, X & Y:
- Height & weight measured for the same individual
- IQ & exam scores for each individual
- Height of mother paired with height of daughter
Sometimes more than two variables are involved (W, X, Y, Z, ...).

Regression & correlation
Concerned with the questions:
- Does a statistical relationship exist between X & Y, which allows some predictability of one of the variables from the other?
- How strong is the apparent relationship, in the sense of predictive ability?
- Can a simple linear rule be used to predict one variable from the other, and if so, how good is this rule? E.g., Y = 5X + 6.

Regression vs. correlation
- Regression: predicting Y from X (or X from Y) by a linear rule.
- Correlation: how good is this relationship?

First tool: the scatter plot
For each pair of points, plot one member of a pair against the corresponding other member of that pair. In an experimental study, the convention is to plot the independent variable on the x-axis and the dependent variable on the y-axis. Often we are describing the results of observational or correlational studies, in which case it doesn't matter which variable is on which axis.
[Scatter plot: height (inches) vs. weight (lbs)]

Second tool: find the regression line
We attempt to predict the values of y from the values of x by fitting a straight line to the data. The data probably doesn't fit on a straight line: there is scatter, and the relationship between x & y may not be quite linear (or it could be far from linear, in which case this technique isn't appropriate). The regression line is like a perfect version of what the linear relationship in the data would look like.
[Scatter plot with fitted regression line: height (inches) vs. weight (lbs)]
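As an illustration of this first tool, here is a minimal matplotlib sketch (mine, not the lecture's) that plots the height/weight data used in the worked example later in this lecture:

```python
import matplotlib.pyplot as plt

# Height/weight pairs from the worked example later in the lecture.
height = [60, 62, 64, 66, 68, 70, 72, 74, 76]          # inches
weight = [84, 95, 140, 155, 119, 175, 145, 197, 150]   # lbs

plt.scatter(height, weight)
plt.xlabel("height (inches)")   # explanatory variable on the x-axis
plt.ylabel("weight (lbs)")      # outcome variable on the y-axis
plt.title("Scatter plot: weight vs. height")
plt.show()
```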

How do we find the regression line that best fits the data?
We don't just sketch in something that looks good:
- First, recall the equation for a line.
- Next, what do we mean by "best fit"?
- Finally, based upon that definition of best fit, find the equation of the best-fit line.

Straight line
The general formula for any line is y = bx + a, where b is the slope of the line and a is the intercept (i.e., the value of y when x = 0).
[Figure: a line with slope b and intercept a]

Least-squares regression: what does "best fit" mean?
If y_i is the true value of y paired with x_i, let ŷ_i = our prediction of y_i from x_i. We want to minimize the error in our prediction of y over the full range of x. We'll do this by minimizing sse = Σ(y_i - ŷ_i)². Express the formula as ŷ_i = a + b·x_i. We want to find the values of a and b that give us the least squared error, sse; thus this is called least-squares regression.
[Figure: minimizing the sum of squared errors, i.e. the vertical deviations y_i - ŷ_i between the data and the line]
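To make the criterion concrete, here is a small Python sketch of the sse objective (the function name and candidate lines are mine). The least-squares line found later in the lecture (a = -200, b = 5) gives a smaller sse on the example data than any other line:

```python
def sse(a, b, xs, ys):
    """Sum of squared errors for the candidate line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

height = [60, 62, 64, 66, 68, 70, 72, 74, 76]
weight = [84, 95, 140, 155, 119, 175, 145, 197, 150]

print(sse(-200, 5, height, weight))  # 4426 -- the least-squares line
print(sse(-190, 5, height, weight))  # 5326 -- shifting the line up makes it worse
```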

For fun, we're going to derive the equations for the best-fit a and b. But first, some preliminary work: other forms of the variance, and the definition of covariance.

A different form of the variance
Recall:
var(x) = E[(x - µ_x)²] = E(x² - 2xµ_x + µ_x²) = E(x²) - 2µ_x² + µ_x² = E(x²) - µ_x²
In sample terms:
var(x) = Σx_i²/N - (Σx_i)²/N² = (Σx_i² - (Σx_i)²/N) / N
(divide by N - 1 instead of N for the unbiased estimate)
You may recognize this equation from the practice midterm (where it may have confused you).

The covariance
We talked briefly about covariance a few lectures ago, when we talked about the variance of the difference of two random variables, when the random variables are not independent:
var(m_1 - m_2) = σ_1²/n_1 + σ_2²/n_2 - 2·cov(m_1, m_2)
The covariance is a measure of how x varies with y ("co-variance" = "varies with"):
cov(x, y) = E[(x - µ_x)(y - µ_y)], and note that var(x) = cov(x, x)
Using algebra like that from two slides ago, we get an alternate form:
cov(x, y) = E[(x - µ_x)(y - µ_y)]
          = E(xy - xµ_y - yµ_x + µ_xµ_y)
          = E(xy) - µ_xµ_y - µ_xµ_y + µ_xµ_y
          = E(xy) - µ_xµ_y
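As a quick numerical check of the alternate form (my own sketch, not part of the lecture), both expressions for the covariance agree on simulated data:

```python
import random

random.seed(0)
N = 100_000
# Two correlated variables: y depends on x plus independent noise,
# so the true covariance here is 2.
xs = [random.gauss(0, 1) for _ in range(N)]
ys = [2 * x + random.gauss(0, 1) for x in xs]

mx = sum(xs) / N
my = sum(ys) / N

# Definition: cov(x, y) = E[(x - mu_x)(y - mu_y)]
cov_def = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / N
# Alternate form: cov(x, y) = E(xy) - mu_x * mu_y
cov_alt = sum(x * y for x, y in zip(xs, ys)) / N - mx * my

print(cov_def, cov_alt)  # identical up to floating-point rounding
```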

OK, deriving the equations for a and b
ŷ_i = a + b·x_i. We want the a and b that minimize
sse = Σ(y_i - ŷ_i)² = Σ(y_i - a - b·x_i)²
Recall from calculus that to minimize this equation, we need to take derivatives and set them to zero.

Derivative with respect to a:
∂/∂a Σ(y_i - a - b·x_i)² = -2 Σ(y_i - a - b·x_i) = 0
Σy_i - a·N - b·Σx_i = 0
a = Σy_i/N - b·Σx_i/N = ȳ - b·x̄
This is the equation for a; however, it's still in terms of b.

Derivative with respect to b:
∂/∂b Σ(y_i - a - b·x_i)² = -2 Σ(y_i - a - b·x_i)·x_i = 0
Substituting a = ȳ - b·x̄:
(1/N)·Σx_i·y_i - ȳ·(1/N)·Σx_i + b·(x̄·(1/N)·Σx_i - (1/N)·Σx_i²) = 0
(1/N)·Σx_i·y_i - x̄·ȳ = b·((1/N)·Σx_i² - x̄²)
b = cov(x, y) / s_x²

Least-squares regression equations
b = cov(x, y)/s_x²
a = m_y - b·m_x
(x̄ = m_x; PowerPoint doesn't make it easy to create a bar over a letter, so we'll go back to our old notation.)

Alternative notation: ss = "sum of squares". Let
ss_xx = Σ(x_i - m_x)²
ss_yy = Σ(y_i - m_y)²
ss_xy = Σ(x_i - m_x)(y_i - m_y)
Then b = ss_xy / ss_xx.
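The boxed equations translate directly into code. Here is a minimal sketch (the function name is mine), run on the height/weight data from the example that follows:

```python
def least_squares(xs, ys):
    """Fit y-hat = a + b*x by b = ss_xy / ss_xx, a = m_y - b*m_x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    ss_xx = sum((x - mx) ** 2 for x in xs)
    ss_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = ss_xy / ss_xx
    a = my - b * mx
    return a, b

height = [60, 62, 64, 66, 68, 70, 72, 74, 76]
weight = [84, 95, 140, 155, 119, 175, 145, 197, 150]
print(least_squares(height, weight))  # (-200.0, 5.0)
```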

A typical question
Can we predict the weight of a student if we are given their height? We need to create a regression equation relating the outcome variable, weight, to the explanatory variable, height. Start with the obligatory scatter plot.

Example: predicting weight from height
x_i: 60  62  64  66  68  70  72  74  76
y_i: 84  95 140 155 119 175 145 197 150
First, plot a scatter plot, and see if the relationship seems even remotely linear:
[Scatter plot of the nine (height, weight) pairs] Looks OK.

Steps for computing the regression equation
- Compute m_x and m_y
- Compute (x_i - m_x) and (y_i - m_y)
- Compute (x_i - m_x)² and (x_i - m_x)(y_i - m_y)
- Compute ss_xx and ss_xy
- b = ss_xy / ss_xx
- a = m_y - b·m_x
(A code sketch of these steps follows the worked table below.)

Example: predicting weight from height
x_i   y_i   (x_i - m_x)  (y_i - m_y)  (x_i - m_x)²  (y_i - m_y)²  (x_i - m_x)(y_i - m_y)
60    84    -8           -56          64            3136          448
62    95    -6           -45          36            2025          270
64    140   -4           0            16            0             0
66    155   -2           15           4             225           -30
68    119   0            -21          0             441           0
70    175   2            35           4             1225          70
72    145   4            5            16            25            20
74    197   6            57           36            3249          342
76    150   8            10           64            100           80
Sum: Σx_i = 612, Σy_i = 1260; m_x = 68, m_y = 140; ss_xx = 240, ss_yy = 10426, ss_xy = 1200
b = ss_xy/ss_xx = 1200/240 = 5;  a = m_y - b·m_x = 140 - 5(68) = -200
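The hand computation can be checked mechanically. This sketch (mine) reproduces the table's columns and summary quantities:

```python
height = [60, 62, 64, 66, 68, 70, 72, 74, 76]          # x_i
weight = [84, 95, 140, 155, 119, 175, 145, 197, 150]   # y_i
n = len(height)

mx = sum(height) / n   # m_x = 68
my = sum(weight) / n   # m_y = 140

# Reproduce the deviation columns of the table.
for x, y in zip(height, weight):
    dx, dy = x - mx, y - my
    print(f"{x:3.0f} {y:4.0f} {dx:4.0f} {dy:4.0f} {dx*dx:5.0f} {dy*dy:5.0f} {dx*dy:5.0f}")

ss_xx = sum((x - mx) ** 2 for x in height)                        # 240
ss_yy = sum((y - my) ** 2 for y in weight)                        # 10426
ss_xy = sum((x - mx) * (y - my) for x, y in zip(height, weight))  # 1200

b = ss_xy / ss_xx   # 5.0
a = my - b * mx     # -200.0
print(f"b = {b}, a = {a}")
```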

Plot the regression line
[Plot: the data with the fitted line, slope = 5. The intercept at x = 0 is -200; the value at x = 60 is -200 + 60·5 = 100.]

What weight do we predict for someone who is 65" tall?
Weight = -200 + 5·height = -200 + 5·65 = 125 lbs
(See the short usage sketch below.)

Caveats
- Outliers and influential observations can distort the equation.
- Be careful with extrapolations beyond the data.
- For every bivariate relationship there are two regression lines.
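A short usage sketch of the fitted equation (the helper name is mine, not the lecture's):

```python
a, b = -200.0, 5.0  # intercept and slope from the fit above

def predict_weight(height_inches):
    """Predicted weight (lbs) from the regression equation."""
    return a + b * height_inches

print(predict_weight(65))  # 125.0, matching the slide
# Extrapolation caveat: the data only cover heights of 60-76 inches,
# so a prediction like predict_weight(85) falls outside the fitted range.
```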

Effect of outliers / Effect of influential observations
[Plots: the height/weight data refit with an outlier and with an influential observation added; each distorts the regression line]

Extrapolation
Be careful when using the linear regression equation to estimate, e.g., the weight of a person 85" tall. The equation may only be a good fit within the x-range of your data.
[Plot: the regression line extended well beyond the observed heights, with a "?" at 85"]

Two regression lines
Note that the definition of best fit that we used for least-squares regression was asymmetric with respect to x and y: it cared about error in y, but not error in x. Essentially, we were assuming that x was known (no error), we were trying to estimate y, and our y-values had some noise in them that kept the relationship from being perfectly linear.
[Figure: the errors y_i - ŷ_i are measured vertically, parallel to the y-axis]

But in observational or correlational studies, the assignment of, e.g., weight to the y-axis and height to the x-axis is arbitrary. We could just as easily have tried to predict height from weight. If we do this, in general we will get a different regression line when we predict x from y than when we predict y from x.

Swapping height and weight
[Plot: height (inches) against weight (lbs), with fitted line y = 0.1151x + 51.886]
height = 0.1151·weight + 51.89, versus weight = 5·height - 200
(A sketch comparing the two fits follows.)
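The asymmetry is easy to see numerically. A minimal sketch (reusing the least_squares helper from before) fits both directions; note that the second line is not the algebraic inverse of the first (inverting weight = 5·height - 200 would give height = 0.2·weight + 40):

```python
def least_squares(xs, ys):
    """Fit y-hat = a + b*x by minimizing squared error in the y-direction."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b  # (intercept, slope)

height = [60, 62, 64, 66, 68, 70, 72, 74, 76]
weight = [84, 95, 140, 155, 119, 175, 145, 197, 150]

print(least_squares(height, weight))  # (-200.0, 5.0):    weight = 5*height - 200
print(least_squares(weight, height))  # (51.89, 0.1151):  height = 0.1151*weight + 51.89
```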

Residual plots
Plotting the residuals (y_i - ŷ_i) against x_i can reveal how well the linear equation explains the data. It can suggest that the relationship is significantly non-linear, or show other oddities. The best structure to see is no structure at all.

What we like to see: no pattern
[Residual plot: residuals scattered evenly around zero across the range of heights, with no structure]

If it looks like this, you did something wrong: there's still a linear component!
[Residual plot: residuals with a linear trend in height]

If there's a pattern, it was inappropriate to fit a line (instead of some other function)
[Residual plot: residuals with a curved, non-linear pattern]

What to do if a linear function isn't appropriate
Often you can transform the data so that it is linear, and then fit the transformed data. This is equivalent to fitting the data with a model ŷ = M(x), then plotting y vs. ŷ and fitting that with a linear model. This is outside of the scope of this class.
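Although the lecture leaves transformation out of scope, here is a minimal sketch of the idea (my own, with made-up data that grows roughly exponentially): transform y so the relationship becomes linear, then fit the transformed data:

```python
import math

# Hypothetical data where y grows roughly exponentially with x.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.8, 7.2, 19.5, 55.0, 148.0, 400.0]

# Transform: if y = c * exp(k*x), then log(y) = log(c) + k*x is linear in x.
log_ys = [math.log(y) for y in ys]

n = len(xs)
mx = sum(xs) / n
my = sum(log_ys) / n
ss_xx = sum((x - mx) ** 2 for x in xs)
ss_xy = sum((x - mx) * (ly - my) for x, ly in zip(xs, log_ys))
b = ss_xy / ss_xx   # estimate of k
a = my - b * mx     # estimate of log(c)

print(math.exp(a), b)  # roughly c = 1 and k = 1 for this data
```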

If it looks like this, again the regression procedure is inappropriate
[Residual plot: the scatter of the residuals grows with height]

Heteroscedastic data
Data for which the amount of scatter depends upon the x-value (vs. homoscedastic, where it doesn't depend on x). This leads to residual plots like the one on the previous slide. It happens a lot in behavioral research because of Weber's law:
- Ask people how much of an increment in sound volume they can just distinguish from a standard volume.
- How big a difference is required (and how much variability there is in the result) depends upon the standard volume.
You can often deal with this problem by transforming the data, or doing a modified, weighted regression. (Again, outside of the scope of this class.)

Coming up next
- The regression fallacy
- Assumptions implicit in regression
- Confidence intervals on the parameters of the regression line
- Confidence intervals on the predicted value y, given x
- Correlation