Correlation: basic properties.


Correlation: basic properties.
- −1 ≤ r_xy ≤ 1 for all sets of paired data. The closer r_xy is to ±1, the stronger the linear relationship between the x-data and the y-data.
- If r_xy = ±1, then there is a perfect linear relationship between the two variables: y_j = m·x_j + b for all (x_j, y_j).
- r_xy does not identify nonlinear relationships. E.g., you can have r_xy ≈ 0 even though y = f(x) for some nonlinear function f. (Relationships between paired data with significant outliers are also not well described by the correlation coefficient.)
Conclusion: If r_xy is not too close to 0, then we may expect the y-values to be somewhat well approximated by a linear function of the corresponding x-values (or vice versa). The question is: which one?
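A quick numerical illustration of these properties (a minimal sketch in Python with NumPy; the data are made up): an exactly linear relationship gives r_xy = 1, while a perfect but nonlinear relationship such as y = x² on a range symmetric about 0 gives r_xy ≈ 0.

```python
import numpy as np

def correlation(x, y):
    """Pearson correlation r_xy: average product of x and y in standard units."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return np.mean(zx * zy)

x = np.arange(-5, 6)
print(correlation(x, 3 * x + 2))  # 1.0: perfect linear relationship
print(correlation(x, x ** 2))     # ~0.0, even though y is a function of x
```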

Lines and linear functions: a refresher. A straight line is the graph of a linear equation. These equations are most frequently written as (i) ax + by = c or (ii) y = mx + b. Version (ii) expresses y as a (linear) function of x. The slope of a line is the ratio rise/run = (change in y)/(change in x) = Δy/Δx. In equation (ii), the slope is given by m. The slope is the amount by which y changes for every unit change in x. I.e., Δy = m·Δx.
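For example (two made-up points, just to fix the idea): the line through (1, 3) and (4, 9) has slope m = Δy/Δx = (9 − 3)/(4 − 1) = 2, and since 3 = 2 · 1 + b forces b = 1, its equation is y = 2x + 1; every 1-unit increase in x raises y by Δy = 2.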

What (straight) line best describes the relationship seen in the data? First guess: the SD-line. This is the line that passes through the point of averages and whose slope is given by ±SD_y/SD_x (+SD_y/SD_x for positive relationships and −SD_y/SD_x for negative ones).
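In code (a minimal Python sketch; the function name sd_line is mine, not from the text), the SD-line is determined by the point of averages, the two SDs, and the sign of the correlation:

```python
import numpy as np

def sd_line(x, y):
    """Slope and intercept of the SD-line for paired data x, y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    slope = np.sign(r) * y.std() / x.std()   # ±SD_y/SD_x, sign from r
    intercept = y.mean() - slope * x.mean()  # passes through (x̄, ȳ)
    return slope, intercept
```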

The SD-line does...
- ...run through the center of the scatterplot, i.e., the points in the scatterplot are spread more-or-less symmetrically around the SD-line;
- ...describe the trend in the data as a whole. I.e., it gives the correct direction for the cloud of points in the scatterplot.
The SD-line does not...
- ...do a good job of predicting y-values from given x-values. Why not? The SD-line doesn't take the correlation between the variables into account.

What we want: A formula for the approximate y-value of an observation with a given x-value. What we can reasonably hope to find: A formula for the approximate average y-value for all observations with the same x-value. The SD-line prediction: For every 1 SD_x increase in x, there is a 1 SD_y increase in the average value of y... But this is usually wrong: As the observations move away from the point of averages, the points on the SD-line tend to lie above or below the average y-value that we are trying to estimate, and the further we move from the point of averages, the bigger the errors become.

Conclusion: The SD-line goes up (or down) too steeply. I.e., its slope ±SD_y/SD_x is too big (in absolute value)... To better predict the average y-value for a given x-value, we use...

The regression line. The regression line passes through the point of averages (like the SD-line). The slope of the regression line (for y on x) is given by r_xy · SD_y/SD_x. In practical terms, the regression line predicts that for every SD_x change in x, there is an approximate r_xy · SD_y change in the average value of the corresponding y's. Paired data and the relationship between the two variables (x and y) are summarized by the five statistics: x̄, SD_x, ȳ, SD_y, and r_xy.
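The following simulation illustrates the point of the last few slides (a sketch under made-up parameters: bivariate normal data in standard units with r = 0.5). The average y-value within each thin vertical strip of the scatterplot tracks the regression prediction r·x_0, not the steeper SD-line prediction x_0:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 100_000, 0.5

# Bivariate normal data in standard units: SD_x = SD_y = 1, correlation r.
x = rng.standard_normal(n)
y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)

for x0 in [-2.0, -1.0, 1.0, 2.0]:
    strip = np.abs(x - x0) < 0.05              # thin vertical strip at x0
    print(f"x ≈ {x0:+.1f}: avg y = {y[strip].mean():+.3f}, "
          f"regression: {r * x0:+.3f}, SD-line: {x0:+.1f}")
```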

Example: Regression of weight on height for women in Great Britain in 1951. h̄ ≈ 63 inches, w̄ ≈ 132 lbs, SD_h ≈ 2.7 inches, SD_w ≈ 22.5 lbs, r_hw ≈ 0.32. These numbers can be used to answer questions about average weights for women of different heights...

How much do 5′6″-tall women weigh on average? These women are 3/SD_h = 3/2.7 ≈ 1.11 standard deviations above average (height), so, on average, they will be about 0.32 × 1.11 ≈ 0.355 standard deviations above the average weight. I.e., the average weight for these women is about 132 + 0.355 × 22.5 ≈ 140 lbs.

How much does average weight go up when height increases by 1 inch? 1 inch represents 1/SD_h = 1/2.7 ≈ 0.37 standard deviations for height, so each additional inch of height adds about 0.32 × 0.37 ≈ 0.118 SD_w to the average weight: this is about 2.66 lbs more.

How accurate are the predictions in this example?
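A minimal check of this arithmetic in Python (the helper name predict_avg_weight is mine):

```python
h_bar, w_bar = 63.0, 132.0   # average height (inches), average weight (lbs)
sd_h, sd_w, r = 2.7, 22.5, 0.32

def predict_avg_weight(h):
    """Estimated average weight for women of height h, via standard units."""
    z_h = (h - h_bar) / sd_h       # height in standard units
    return w_bar + r * z_h * sd_w  # move r·z_h weight-SDs from average weight

print(predict_avg_weight(66))                           # ≈ 140 lbs
print(predict_avg_weight(64) - predict_avg_weight(63))  # ≈ 2.67 lbs per inch
```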

[Figure: scatterplot of the height-weight data. The green line is the regression line; the red asterisks are the average weights for each height class.] The graph of averages is the graph that plots the average y-value for each x-value. In the example above, the graph of averages is represented by the red asterisks.

The regression equation. Since the regression line is a line, we can find its equation: ŷ = β_0 + β_1·x. Comment: ŷ is the predicted or estimated value of the average y-value for all observations whose x-value is x. β_1 is the slope of the regression line: β_1 = r_xy · SD_y/SD_x. To find β_0, we remember that the regression line passes through the point of averages, (x̄, ȳ). Therefore ȳ = β_0 + β_1·x̄, so that β_0 = ȳ − β_1·x̄. ⇒ First find β_1, then use that to find β_0.
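In code, the two formulas become (a small sketch; the function name is mine, and the usage line anticipates the weight-on-height numbers worked out on the next slide):

```python
def regression_coeffs(x_bar, y_bar, sd_x, sd_y, r):
    """Slope and intercept of the regression line of y on x,
    computed from the five summary statistics."""
    beta1 = r * sd_y / sd_x        # first find the slope...
    beta0 = y_bar - beta1 * x_bar  # ...then the intercept, via (x̄, ȳ)
    return beta0, beta1

print(regression_coeffs(63, 132, 2.7, 22.5, 0.32))  # ≈ (-36.0, 2.667)
```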

In the weight-on-height example, we have β_1 = 0.32 × 22.5/2.7 ≈ 2.667 and β_0 = 132 − 2.667 × 63 ≈ −36, so the regression equation for weight (w) on height (h) is ŵ = −36 + 2.667·h. We can use this equation to answer the same questions as before. How much do 5′6″-tall women weigh on average? The average weight of women who are 66 inches tall is estimated by ŵ: w̄(66) ≈ ŵ(66) = −36 + 2.667 × 66 ≈ 140 lbs. How much does average weight go up when height increases by 1 inch? For every additional inch of height, average weight increases by about 2.667 lbs.

Using the regression line for individual predictions. Example: A study of education and income is done for men age 30-35. A representative sample of 10,000 men in this age group is surveyed, and the following statistics are collected: Ē = 13, SD_E = 1.5, Ī = 46, SD_I = 10, r_IE = 0.45, where E = years of education and I = annual income, in $1000s. What is the best guess for the income of a man in this age range? The average income for the whole group: $46,000.

What is the best guess for the income of a man in this age range, given that he has 16 years of education? Once again, the best guess for the income of an individual is the average income in the class to which he belongs, i.e., the average income for men (in the given age group) with 16 years of education. We can use the regression equation to estimate this: β_1 = 0.45 × 10/1.5 = 3 and β_0 = 46 − 3 × 13 = 7, so the regression equation is Î = 7 + 3·E. Using this, we estimate the average income for men with 16 years of education to be Ī(16) ≈ Î(16) = 7 + 3 × 16 = 55 (i.e., $55,000), which is our best guess for the income of any individual in the group.

Question: How can we make our estimate of the y-value of a subject whose observed x-value is x_0 more precise? And more precise in what way? In data that is approximately normally distributed, roughly 68% of the data lies within 1 SD of the average value and 95% of the data lies within 2 standard deviations of the average value. The average y-value for all observations with x-value x_0 is ȳ(x_0) ≈ ŷ(x_0) = β_0 + β_1·x_0. Applying the normal approximation, we can say that if we observe an x-value of x_0, it is reasonable to conclude that the corresponding y-value is likely to be in the range ȳ(x_0) ± SD ≈ ŷ(x_0) ± SD, or is very likely to be in the range ȳ(x_0) ± 2SD ≈ ŷ(x_0) ± 2SD.

This leaves two important questions:
1. How do we know that the y-data for a given x-value has an approximately normal distribution?
2. What do we use for the SD?

The R.M.S. error of the regression. If we compare y_j to the predicted value ŷ_j = β_0 + β_1·x_j, then we will typically observe an error ε_j = y_j − ŷ_j, which may be either positive or negative, depending on whether the prediction was too small or too big. These errors reflect the fact that the data does not lie exactly on any straight line. The square root of the average of the squares of these errors,

√( (1/n) · Σ_{j=1}^n (y_j − ŷ_j)² ),

is called the R.M.S. error of the regression. This is roughly equal to the average distance of points in the scatterplot to the regression line. The R.M.S. error of the regression is also called the standard error of the regression, or SER.
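The computation, as a minimal sketch in Python with NumPy (x and y stand for whatever paired sample is at hand), together with the rough "very likely" range suggested by the normal approximation two slides back:

```python
import numpy as np

def rms_error(x, y, beta0, beta1):
    """R.M.S. error of the regression line ŷ = β0 + β1·x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    residuals = y - (beta0 + beta1 * x)  # ε_j = y_j − ŷ_j
    return np.sqrt(np.mean(residuals ** 2))

# Rough range for the y-value of an observation with x-value x0:
#   (beta0 + beta1 * x0) ± 2 * rms_error(x, y, beta0, beta1)
```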