Chapter 4 Data with Two Variables

1 Scatter Plots and Correlation and 2 Pearson's Correlation Coefficient

Looking for Correlation

Example: Does the number of hours you watch TV per week impact your average grade in a class?

Hours   12  10   5   3  15  16   8
Grade   70  85  82  88  65  75  68

To see if there is a relationship, we will create a scatter plot and analyze it.

Definition: A scatter plot is a graphical representation of the relationship between two quantitative variables. The two values may come from the same individual (e.g. education v. income, height v. weight) or from paired individuals (e.g. the ages of partners in a relationship).

Scatter Plots

When working with scatter plots, there are two variables, and they may play two different roles.

Definition: A response variable measures the outcome of a study.

Definition: An explanatory variable may explain or influence changes in a response variable.

Explanatory variables are often called independent variables and go on the x-axis. Response variables are often called dependent variables and go on the y-axis.

Back to Our Example

In our example, which is the explanatory variable? Hours of TV watched. The response variable is therefore the average grade. So the question we are trying to answer is: does watching TV influence the average grade in a class?

Let's plot the data and see what we have.

The Scatter Plot

[Figure: scatter plot "Grades v. Hours of TV", with Hours of TV on the horizontal axis and Grade on the vertical axis.]

How Does the Relationship Look?

What do we think? It looks like the more hours of TV that are watched, the lower the average grade. But how good is the relationship? We can measure this in different ways. One is direction (+ or -) and another is strength. Both are captured by Pearson's correlation coefficient.

Facts About Pearson's Correlation Coefficient

1. $-1 \le r \le 1$. The weakest correlation is 0 and the strongest is ±1. The sign of r only tells us the direction of the relationship: whether y increases as x increases (positive) or y decreases as x increases (negative). Being negative is not bad.
2. Correlation makes no distinction between x and y, that is, between the choice of explanatory and response variables. We still need to be careful, though, because the next part (the regression line) depends heavily on the correct choice.
3. Correlation measures only the linear relationship.
4. Correlation is not resistant: outliers can change it substantially.
5. Correlation has no units.
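Facts 2 and 5 are easy to check numerically. The following is a minimal sketch using the TV data from our example; the helper name `pearson_r` is introduced here for illustration only, and it computes r from deviations about the means.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

hours = [12, 10, 5, 3, 15, 16, 8]
grades = [70, 85, 82, 88, 65, 75, 68]

r_xy = pearson_r(hours, grades)
r_yx = pearson_r(grades, hours)        # swap the roles of x and y (fact 2)
minutes = [60 * h for h in hours]      # change the units of x (fact 5)
r_minutes = pearson_r(minutes, grades)

print(round(r_xy, 4), round(r_yx, 4), round(r_minutes, 4))
```

All three printed values should agree: swapping the variables or measuring TV time in minutes instead of hours leaves r unchanged.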

So How Do We Find Pearson's Correlation Coefficient?

The Correlation Coefficient

$$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{S_x}\right)\left(\frac{y_i - \bar{y}}{S_y}\right) = \frac{1}{n-1} \sum z_x z_y$$

Let's find the correlation coefficient for our example. First, we need a few values: $\bar{x}$, $\bar{y}$, $S_x$, $S_y$.

$\bar{x} = 9.857$, $\bar{y} = 76.143$, $S_x = 4.880$, $S_y = 8.971$

Finding Pearson's Correlation Coefficient

For each pair, find the z-score of each value, then multiply the two z-scores together. After summing the products, divide by n - 1.

i     z_x       z_y       product
1     0.4391   -0.6848   -0.3007
2     0.0293    0.9873    0.0289
3    -0.9953    0.6529   -0.6498
4    -1.4050    1.3217   -1.8570
5     1.0539   -1.2421   -1.3090
6     1.2588   -0.1274   -0.1604
7    -0.3805   -0.9077    0.3454
                Sum      -3.9026

$$r = \frac{1}{6}(-3.9026) = -0.6504$$

Interpretation: moderate negative correlation.
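The z-score method can be sketched in a few lines of Python. This is only an illustration with the standard library, not the calculator procedure; `statistics.stdev` is the sample (n - 1) standard deviation, which matches the S_x and S_y used here.

```python
import statistics

hours = [12, 10, 5, 3, 15, 16, 8]
grades = [70, 85, 82, 88, 65, 75, 68]
n = len(hours)

x_bar, y_bar = statistics.mean(hours), statistics.mean(grades)
s_x, s_y = statistics.stdev(hours), statistics.stdev(grades)  # sample standard deviations

# Product of the two z-scores for each (x, y) pair.
z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
              for x, y in zip(hours, grades)]

r = sum(z_products) / (n - 1)
print(round(r, 4))
```

Because the code does not round the z-scores along the way, its last digit may differ slightly from the hand calculation.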

Another Way To Calculate

We can use a formula similar to the one we used for the standard deviation to find the correlation coefficient. First, we need a few formulas.

The Three S's

$$S_{xx} = \sum (x - \bar{x})^2 \qquad S_{yy} = \sum (y - \bar{y})^2 \qquad S_{xy} = \sum (x - \bar{x})(y - \bar{y})$$

Pearson's Correlation Coefficient

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

This makes it look just as complicated as the other method, or worse, but there are simpler ways to calculate the three S's that make this method not so painful.

Calculating the Three S's

$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} \qquad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} \qquad S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$$

And we can get these sums from the calculator.

For our example (n = 7):

Σx² = 823      Σx = 69     (Σx)² = 4761        (Σx)²/7 = 680.14
Σy² = 41067    Σy = 533    (Σy)² = 284089      (Σy)²/7 = 40584.14
Σxy = 5083                 (Σx)(Σy) = 36777    (Σx)(Σy)/7 = 5253.86

So we have

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{5083 - 5253.86}{\sqrt{(823 - 680.14)(41067 - 40584.14)}} = \frac{-170.86}{\sqrt{(142.86)(482.86)}} = -0.6505$$
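The same shortcut calculation can be checked in Python. This is a sketch under the assumption that we work from the raw sums, as on the calculator; the variable names are ours.

```python
import math

hours = [12, 10, 5, 3, 15, 16, 8]
grades = [70, 85, 82, 88, 65, 75, 68]
n = len(hours)

# Raw sums, as read off the calculator.
sum_x, sum_y = sum(hours), sum(grades)
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in grades)
sum_xy = sum(x * y for x, y in zip(hours, grades))

# The three S's via the shortcut formulas.
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

r = s_xy / math.sqrt(s_xx * s_yy)
print(sum_x2, sum_x, sum_y2, sum_y, sum_xy)
print(round(s_xx, 2), round(s_yy, 2), round(s_xy, 2), round(r, 4))
```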

So Can We Say There Is A Relationship?

So, can we say that there is a direct relationship between the number of hours of TV watched and the average grade? Not so fast...

Correlation does not necessarily imply causation. Just because the data look the part does not mean we have evidence of a cause-and-effect relationship. We have to consider a couple of other things. One is lurking variables: variables that may be affecting the relationship but are not included in the data.

Can you think of any lurking variables that would impact our example?

Significance

We also need to test for significance to see what is going on.

If $|r|\sqrt{n} > 3$, the correlation is significant. Otherwise it is not significant.

The smaller this value, the less likely it is that the correlation is significant.

Reasons why a correlation may not be significant:
1. Genuine lack of correlation
2. Not enough data

Our example is not significant because of the small quantity of data: $|r|\sqrt{n} = 0.6505\sqrt{7} \approx 1.72$, which is less than 3. So we cannot conclude that watching TV has a direct impact on grades.
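As a quick check of this rule of thumb, here is a small sketch. It assumes the rule on the slide reads |r|√n > 3 (the symbols were lost in transcription), and `is_significant` is a hypothetical helper name.

```python
import math

def is_significant(r, n, threshold=3.0):
    """Rule of thumb: significant when |r| * sqrt(n) exceeds the threshold (assumed to be 3)."""
    return abs(r) * math.sqrt(n) > threshold

r, n = -0.6505, 7   # TV hours v. grades example
statistic = abs(r) * math.sqrt(n)
print(round(statistic, 2), is_significant(r, n))
```

With only 7 data pairs, even a moderate correlation falls short of the cutoff.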

Assumptions and Conditions for Correlation

Quantitative Variables Condition: Don't make the common error of calling an association involving a categorical variable a correlation. Correlation is only about quantitative variables.

Straight Enough Condition: The best check for the assumption that the variables are truly linearly related is to look at the scatter plot to see whether it looks reasonably straight. That's a judgment call, but not a difficult one.

No Outliers Condition: Outliers can distort the correlation dramatically, making a weak association look strong or a strong one look weak. Outliers can even change the sign of the correlation. But it's easy to see an outlier in the scatter plot, so to check this condition, just look.
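To see how much a single outlier can matter, the sketch below adds one made-up point to the TV data. The outlier (40 hours of TV, grade of 100) is hypothetical and is not part of the original data; `pearson_r` is a helper name introduced for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

hours = [12, 10, 5, 3, 15, 16, 8]
grades = [70, 85, 82, 88, 65, 75, 68]

r_original = pearson_r(hours, grades)

# Hypothetical outlier, NOT in the original data: 40 hours of TV, grade of 100.
r_with_outlier = pearson_r(hours + [40], grades + [100])

print(round(r_original, 2), round(r_with_outlier, 2))
```

One extra point is enough to flip the sign of r here, which is why we look at the scatter plot before trusting the number.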

Another Example

Example: The following gives the power numbers for the starting 9 for the 2007 Boston Red Sox. Is there a relationship between the number of home runs and the number of RBIs? Does the number of home runs affect the number of RBIs? Produce a scatter plot and discuss the correlation.

Player      Home Runs   RBIs
Varitek         17        68
Youkilis        16        83
Pedroia          8        50
Lowell          21       120
Lugo             8        73
Ramirez         20        88
Crisp            6        60
Drew            11        64
Ortiz           35       117

Red Sox Example

Which is the explanatory variable? Which is the response variable?

Since we are asking if home runs (HR) affect RBIs, HR is the explanatory variable and therefore x. So RBIs is the y variable.

[Figure: scatter plot "2007 Red Sox Power Numbers", with Home Runs on the horizontal axis and RBIs on the vertical axis.]

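Before We Go On

Something to notice: we have two values with the same x-coordinate (Pedroia and Lugo both hit 8 home runs).

[Figure: the "2007 Red Sox Power Numbers" scatter plot again, with Home Runs on the horizontal axis and RBIs on the vertical axis.]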

Finding Pearson's Correlation Coefficient

What is our guess as to the correlation? Now let's find Pearson's correlation coefficient.

1. Input the data in the usual way, with the explanatory variable under L1 and the response variable under L2
2. Press STAT and scroll to TESTS
3. Select LinRegTTest
4. Make sure the XList and YList are the lists where the data for the explanatory and response variables are located, respectively
5. Press Calculate and scroll to find r and r²

Using Technology

For our example, we have r = 0.8463 and r² = 0.7162.

So the correlation coefficient tells us that there is a strong positive correlation.

What Does r² Tell Us?

r² tells us how much better our predictions will be if we go through the trouble of finding the regression line rather than just predicting with the mean of y. Ours is pretty good here (about 72% of the variation in RBIs is accounted for by the linear relationship with home runs), indicating that we should find the regression line.

[Figure: scatter plot "2007 Red Sox Power Numbers", with Home Runs on the horizontal axis and RBIs on the vertical axis.]
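One way to see this interpretation is to compare the squared prediction errors from the mean with those from the least squares line. The sketch below does that for the Red Sox data; it is an illustration in Python rather than the calculator route, and the variable names are ours.

```python
home_runs = [17, 16, 8, 21, 8, 20, 6, 11, 35]
rbis = [68, 83, 50, 120, 73, 88, 60, 64, 117]
n = len(home_runs)

x_bar = sum(home_runs) / n
y_bar = sum(rbis) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(home_runs, rbis))
s_xx = sum((x - x_bar) ** 2 for x in home_runs)

# Least squares line fitted to the data.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Squared error when every prediction is simply the mean of y.
sse_mean = sum((y - y_bar) ** 2 for y in rbis)
# Squared error when predictions come from the least squares line.
sse_line = sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(home_runs, rbis))

r_squared = 1 - sse_line / sse_mean
print(round(r_squared, 4))
```

The result is the fraction by which the line reduces the squared prediction error compared with always guessing the mean.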

Technology and Scatter Plots

We can also create a scatter plot on the calculator.

1. Make sure there are no functions in the grapher (press Y= to check)
2. Input the data in the usual way (we already have it there for this example)
3. Press 2nd and Y= to get into the STAT PLOT menu
4. Make sure only the plot we want is turned on
5. Select the first graph in the first row and then make sure the XList and YList are correct
6. Press ZOOM 9

One More Example

Example: There is some evidence that drinking moderate amounts of wine helps prevent heart attacks. The accompanying table gives data on yearly wine consumption (in liters of alcohol from drinking wine per person) and yearly deaths from heart disease (per 100,000 people) in 19 developed nations. Construct a scatter plot and describe what you see.

Country           Alcohol   Deaths
Australia           2.5      211
Austria             3.9      167
Belgium             2.9      131
Canada              2.4      191
Denmark             2.9      220
Finland             0.8      297
France              9.1       71
Iceland             0.8      211
Ireland             0.7      300
Italy               7.9      107
Netherlands         1.8      167
New Zealand         1.9      266
Norway              0.8      227
Spain               6.5       86
Sweden              1.6      207
Switzerland         5.8      115
United Kingdom      1.3      285
United States       1.2      199
West Germany        2.7      172

The Scatter Plot

[Figure: scatter plot "Heart Disease v. Alcohol from Wine", with Alcohol from Wine (in liters) on the horizontal axis and Deaths (per 100,000) on the vertical axis.]

r = -0.8428, a strong negative correlation.
r² = 0.7103, so it is worthwhile to find the linear regression line.

3 Slopes and Equations of Fitted Lines

This Section Is...

The focus of this section is the idea of a fitted curve and the formulas used to find the equation of a line given two points. We will not talk about this section here unless we need to.

4 The Least Squares Line

Getting Started

Definition: A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.

Linear functions are of the form y = mx + b, but we will write them as $\hat{y} = b_0 + b_1 x$, where $b_0$ is the y-intercept and $b_1$ is the slope. The calculator actually uses the form y = a + bx, so be careful.

Formulas

What we will be finding is the least squares regression line of y on x. This is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.

Least Squares Regression Line

$$b_1 = r \frac{s_y}{s_x} \qquad b_0 = \bar{y} - b_1 \bar{x}$$

If the correlation coefficient is too small, there is no point in finding $\hat{y}$, since $b_0$ and $b_1$ both depend on r.

Example Using Given Values

Example: The following list gives the power numbers for the starting 9 Red Sox players for the 2007 season.

Name              Home Runs   RBIs
Jason Varitek         17        68
Kevin Youkilis        16        83
Dustin Pedroia         8        50
Mike Lowell           21       120
Julio Lugo             8        73
Manny Ramirez         20        88
Coco Crisp             6        60
J.D. Drew             11        64
David Ortiz           35       117

We want to know if the number of home runs affects the number of RBIs.

Variable Types

What is the response variable? What is the explanatory variable?

The Needed Values

We can find the mean and standard deviation of both sets of data quickly using our technology.

Variable   Mean    Standard Deviation
x          15.78    9.05
y          80.33   24.47

And, since the data are already in the calculator, we can obtain the values of r and r² quickly as well: r = 0.8463 and r² = 0.7162.

The Correlation Coefficient

How would we describe this correlation coefficient? It indicates a strong, positive correlation.

We can use these values to find the equation of the regression line.

$$b_1 = r \frac{s_y}{s_x} = 0.8463 \left(\frac{24.47}{9.05}\right) = 2.29$$

$$b_0 = \bar{y} - b_1 \bar{x} = 80.33 - 2.29(15.78) = 44.19$$

So, the regression line is

$$\hat{y} = 44.19 + 2.29x$$
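The same slope and intercept can be reproduced from the raw data. This is a sketch with Python's standard library, assuming sample (n - 1) standard deviations as on the calculator; the variable names are ours.

```python
import statistics

home_runs = [17, 16, 8, 21, 8, 20, 6, 11, 35]
rbis = [68, 83, 50, 120, 73, 88, 60, 64, 117]
n = len(home_runs)

x_bar, y_bar = statistics.mean(home_runs), statistics.mean(rbis)
s_x, s_y = statistics.stdev(home_runs), statistics.stdev(rbis)  # sample standard deviations

# r as the sum of z-score products divided by n - 1.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(home_runs, rbis)) / (n - 1)

b1 = r * s_y / s_x        # slope
b0 = y_bar - b1 * x_bar   # y-intercept
print(round(r, 4), round(b1, 2), round(b0, 2))
```

Because the code carries full precision instead of rounding the slope to 2.29 first, it gives a slightly different intercept (near 44.24 rather than 44.19). The difference is only rounding.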

Alternative Calculation
We can also use the three S's to find the equation of the least squares line.
Least Squares Regression Line: $\hat{y} = mx + b$, where $m = \frac{S_{xy}}{S_{xx}}$ and $b = \bar{y} - m\bar{x}$.

Alternative Calculation
$S_{xy} = 12907 - \frac{(142)(723)}{9} = 1499.67$
$S_{xx} = 2896 - \frac{(142)^2}{9} = 655.56$
$m = \frac{1499.67}{655.56} = 2.29$
$b = 80.33 - 2.29(15.78) = 44.19$
And we have the same results.
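
The same calculation can be sketched in Python from the sums used on the slide ($n = 9$, $\sum x = 142$, $\sum y = 723$, $\sum xy = 12907$, $\sum x^2 = 2896$). The variable names are illustrative. The sketch computes the intercept twice: once with the rounded slope and rounded means, as the slide does, and once at full precision.

```python
# Sums taken from the slide's arithmetic.
n = 9
sum_x, sum_y = 142, 723
sum_xy, sum_x2 = 12907, 2896

# S_xy and S_xx from the shortcut formulas.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n

# Slope at full precision, and rounded as on the slide.
m = s_xy / s_xx
m_rounded = round(m, 2)

# Intercept the way the slide computes it (rounded slope, rounded means).
b_slide = round(80.33 - m_rounded * 15.78, 2)

# Intercept with no intermediate rounding.
b_full = sum_y / n - m * sum_x / n

print(round(s_xy, 2), round(s_xx, 2), m_rounded, b_slide, round(b_full, 2))
```

Carrying full precision gives an intercept near 44.24 rather than 44.19. The gap comes entirely from rounding the slope to 2.29 before multiplying by the mean.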

Practical Interpretation
What do these values mean in practical terms?
The slope tells us that a change in the explanatory variable by one unit is associated with a predicted change in the response variable by the amount and direction of the slope. In our example, the slope is $b_1 = 2.29$, which tells us that for every additional home run hit, you'd expect an additional 2.29 RBIs.
The y-intercept tells us the predicted value of the response variable when the explanatory variable is 0. In our example $b_0 = 44.19$, which means that we expect a player who hits no home runs to have 44.19 RBIs.
Note: In context, we may need to round to whole numbers for the answers to make any sense.

The Scatter Plot
Let's see how good the regression line is by plotting it over the scatter plot.
[Figure: scatter plot "2007 Red Sox Power Numbers", RBIs versus Home Runs]
To do so, we press Y= and put the line under Y1, then select GRAPH.

Plot and Line
And now with the regression line $\hat{y} = 44.19 + 2.29x$:
[Figure: scatter plot "2007 Red Sox Power Numbers", RBIs versus Home Runs, with the regression line drawn over the points]

Predictions
One use of the regression line is making predictions. Suppose we wanted to know about how many RBIs we could expect a player to have if they hit 60 home runs. We are looking to predict the value of y (so we want $\hat{y}$) and we are given a value of $x = 60$.
Our prediction would be $\hat{y} = 44.19 + 2.29(60) = 181.59$.
So, our prediction is 182 RBIs.
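
As a small sketch, the prediction step amounts to evaluating the fitted line at the given x. The function name `predict_rbi` is made up for illustration; the coefficients are the rounded ones from the slides.

```python
def predict_rbi(home_runs: float) -> float:
    """Predicted RBIs from the fitted line y-hat = 44.19 + 2.29x."""
    return 44.19 + 2.29 * home_runs

prediction = predict_rbi(60)

# Report the raw prediction and the whole-number answer used in context.
print(round(prediction, 2), round(prediction))
```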

Facts about Regression Lines
1 The distinction between explanatory and response variables is essential - remember the formulas...
2 There is a close connection between slope and correlation.
3 $(\bar{x}, \bar{y})$ is always on the line.
4 $r^2$ gives the fraction of the variation in the values of y that is explained by the least squares regression line of y on x. In other words, $r^2$ gives the fraction of the data's variation accounted for by the model.
5 This only shows us the linear model; it is possible that there is little linear correlation but that the data has a strong relationship under some other type of model.
6 We will not always get perfect correlation (probably never), but we need the data to be straight enough for the line to make sense. What counts as straight enough varies. Depending on the situation, $r = 0.3$ could be good enough; other times $r = 0.8$ would be a minimum.
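
Facts 2 and 3 follow from the two formulas used earlier, $b_1 = r\,s_y/s_x$ and $b_0 = \bar{y} - b_1\bar{x}$. A short check, written with those same symbols:

```latex
\begin{align*}
\hat{y}(\bar{x}) &= b_0 + b_1\bar{x}
                  = (\bar{y} - b_1\bar{x}) + b_1\bar{x}
                  = \bar{y}, \\
b_1 &= r\,\frac{s_y}{s_x}
     \quad\Longrightarrow\quad
     \operatorname{sign}(b_1) = \operatorname{sign}(r)
     \quad\text{since } s_x > 0 \text{ and } s_y > 0 .
\end{align*}
```

The first line shows that the line passes through the point of means. The second shows that the slope always has the same sign as the correlation and is just $r$ rescaled by the ratio of the standard deviations.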

Another Sox Example
Suppose we wanted to know if a player was expected to score more runs if he got more hits. To answer this question, we will use the roster of the 2011 Boston Red Sox.
Name                    Runs  Hits
Jarrod Saltalamacchia   52    84
Adrian Gonzalez         108   213
Dustin Pedroia          102   195
Marco Scutaro           59    118
Kevin Youkilis          68    111
Carl Crawford           65    129
Jacoby Ellsbury         119   212
J.D. Drew               23    55
David Ortiz             84    162
Jed Lowrie              40    78
Josh Reddick            41    71
Jason Varitek           32    49
Darnell McDonald        26    37
Mike Aviles             17    32
Mike Cameron            9     14
Drew Sutton             11    17
Ryan Lavarnway          5     9
Yamaico Navarro         6     8
Conor Jackson           2     3
Jose Iglesias           3     2
Lars Anderson           2     0
Joey Gathright          1     0

The Correlation Coefficient
The first thing we will do is find the correlation coefficient. When we plug all of the data into our technology, we get $r = 0.9942$.
Interpretation? There is a strong, positive correlation between hits and runs scored.
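
For readers without a graphing calculator, the correlation coefficient can be sketched in plain Python from the roster table. The helper name `pearson_r` is an illustrative choice, not something from the slides, and the lists simply transcribe the Runs and Hits columns in table order.

```python
import math

# Runs and hits for the 22 players, in the order of the roster table.
runs = [52, 108, 102, 59, 68, 65, 119, 23, 84, 40, 41,
        32, 26, 17, 9, 11, 5, 6, 2, 3, 2, 1]
hits = [84, 213, 195, 118, 111, 129, 212, 55, 162, 78, 71,
        49, 37, 32, 14, 17, 9, 8, 3, 2, 0, 0]

def pearson_r(x, y):
    """Pearson correlation: S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hits is the explanatory variable, runs the response.
r = pearson_r(hits, runs)
print(round(r, 4))
```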

Interpretation of $r^2$
What is the value of $r^2$ here? $r^2 = 0.9885$. What does this mean?
$r^2$ tells us the fraction of the variability of the response variable that is accounted for by the variation in the explanatory variable. Here, this means that we can explain 98.85% of the variability in the runs total by the variability in the number of hits.
For each standard deviation above the mean for the explanatory variable x, the predicted y will be r standard deviations above the mean of the response variable. Here, we would say that for every standard deviation we are above the mean number of hits, we predict being $r = 0.9942$ standard deviations above the mean number of runs.
Also note that $1 - r^2$ is the fraction of the variation in the original data left in the residuals.
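
Both readings of the numbers on this slide can be checked against the roster data. The sketch below uses illustrative variable names. It fits the least squares line, measures how much variation is left in the residuals, and evaluates the prediction one standard deviation of hits above the mean.

```python
import math

runs = [52, 108, 102, 59, 68, 65, 119, 23, 84, 40, 41,
        32, 26, 17, 9, 11, 5, 6, 2, 3, 2, 1]
hits = [84, 213, 195, 118, 111, 129, 212, 55, 162, 78, 71,
        49, 37, 32, 14, 17, 9, 8, 3, 2, 0, 0]

n = len(hits)
x_bar, y_bar = sum(hits) / n, sum(runs) / n
s_xx = sum((x - x_bar) ** 2 for x in hits)
s_yy = sum((y - y_bar) ** 2 for y in runs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hits, runs))

# Least squares line of runs on hits, and the correlation.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
r = s_xy / math.sqrt(s_xx * s_yy)

# Fraction of the variation in runs left in the residuals (1 - r^2).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hits, runs))
unexplained = sse / s_yy
explained = 1 - unexplained

# Prediction one SD of hits above the mean, expressed in SDs of runs.
s_x = math.sqrt(s_xx / (n - 1))
s_y = math.sqrt(s_yy / (n - 1))
z_pred = ((b0 + b1 * (x_bar + s_x)) - y_bar) / s_y

print(round(explained, 4), round(z_pred, 4))
```

The explained fraction matches $r^2$, while the standardized prediction matches $r$ itself. That is why the standard-deviation statement uses 0.9942 rather than 0.9885.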

Producing the Scatter Plot
Now, let's produce a scatter plot for the data.
[Figure: scatter plot "2011 Red Sox", Runs versus Hits]

The Assumptions
The points give us a pretty good indication that there is a very strong positive correlation here. Before we go on, we want to make sure all of the assumptions about regression lines are met.
Quantitative Variable Condition: If either y or x is categorical, you cannot make a scatter plot and you cannot perform a regression.
Straight Enough Condition: Does the data look straight enough that we can see a linear relationship in the data set?
Outlier Condition: Are there any outliers that dramatically influence the fit of the regression line?
Does the Plot Thicken? Condition: Does the spread of the data around the generally straight relationship seem to be consistent for all values of x?

Linear Regression Line
Since all of these are satisfied, we will continue on to find the formula of the regression line. Using our technology, we have $\hat{y} = b_0 + b_1 x = 1.92 + 0.52x$.
What is the practical interpretation of the slope $b_1$? For each additional hit, we expect a player to score 0.52 additional runs.
What is the practical interpretation of the y-intercept $b_0$? If a player has no hits, we expect 1.92 runs to be scored.
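
The coefficients reported by the calculator can be reproduced from the roster data with the formulas $b_1 = S_{xy}/S_{xx}$ and $b_0 = \bar{y} - b_1\bar{x}$. This is a minimal sketch; `least_squares` is a hypothetical helper name, not a calculator or library function.

```python
runs = [52, 108, 102, 59, 68, 65, 119, 23, 84, 40, 41,
        32, 26, 17, 9, 11, 5, 6, 2, 3, 2, 1]
hits = [84, 213, 195, 118, 111, 129, 212, 55, 162, 78, 71,
        49, 37, 32, 14, 17, 9, 8, 3, 2, 0, 0]

def least_squares(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hits is the explanatory variable, runs the response.
b0, b1 = least_squares(hits, runs)
print(round(b0, 2), round(b1, 2))
```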

Scatter Plot With Regression Line
[Figure: scatter plot "2011 Red Sox", Runs versus Hits, with the regression line drawn over the points]
So, when we plot the regression line over the scatter plot, we see that the line is a good fit.

Predictions
1 If a player got 200 hits, how many runs would we expect them to have? Here, we are given the x value and, using our regression line, we find the predicted value.
$\hat{y} = 1.92 + 0.52(200) = 105.92$
So, we'd expect about 106 runs for a player with 200 hits.
2 What if we wanted to know how many hits a player had if they scored 120 runs? We are given the value of y and want to find the value of x. So, we use our algebra skills...
$120 = 1.92 + 0.52x$
$118.08 = 0.52x$
$227.08 \approx x$
We expect about 227 hits.
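
Both directions of this calculation can be sketched with the rounded coefficients from the slides. The function names are illustrative only.

```python
B0, B1 = 1.92, 0.52  # intercept and slope of the runs-on-hits line

def predict_runs(hits: float) -> float:
    """Evaluate the regression line at a given number of hits."""
    return B0 + B1 * hits

def hits_for_runs(runs: float) -> float:
    """Solve runs = B0 + B1 * hits for hits (plain algebra on the same line)."""
    return (runs - B0) / B1

runs_at_200 = predict_runs(200)
hits_at_120 = hits_for_runs(120)
print(round(runs_at_200, 2), round(hits_at_120, 2))
```

The second function inverts the runs-on-hits line algebraically, as the slide does. A separate regression of hits on runs would in general give a different line, which is why the distinction between explanatory and response variables matters.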

Important Points
A few important points to keep in mind:
1 An observation is influential for a statistical calculation if removing it would markedly change the results of the calculation. Points that are outliers in either the x or y direction are often influential points.
2 Correlation and least squares regression lines are not resistant.
3 They only describe linear relationships.
4 There could be lurking variables. Those are variables that are not among the explanatory or response variables but may influence the interpretation of the relationship.
5 An association between an explanatory variable x and a response variable y, even if the correlation is very strong, is not itself good evidence that changes in x actually cause changes in y. The phrase to remember is that correlation does not necessarily imply causation.

Example Where You Are Doing The Work
Example: We want to know if there is a relationship between the score on the math portion of the SAT exam and the number of hours spent studying for the test. The question is, "Does studying more increase the score on the exam?" The following data was taken from a study of 20 students as they prepared for and took the SAT exam.
Hours   4    9    10   14   4    7    12   22   1    3
Score   390  580  650  730  410  530  600  790  350  400
Hours   8    11   5    6    10   11   16   13   13   10
Score   590  640  450  520  690  690  770  700  730  640

Variable Types
What is the response variable? Math SAT score
What is the explanatory variable? Hours of study

Pearson's Correlation Coefficient
So let's first find the correlation coefficient to see what we are dealing with: $r = 0.9336$.
Our interpretation? This tells us there is a strong positive correlation.

And Now $r^2$
What about $r^2$? $r^2 = 0.8716$
This tells us what? 87.16% of the variation in the Math SAT score can be explained by the variation in the hours of study.
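
The values of $r$ and $r^2$ for the SAT data can be reproduced from the table of 20 students. In this sketch the lists transcribe the Hours and Score rows in order, and the variable names are illustrative.

```python
import math

# Hours of study and Math SAT score for the 20 students, in table order.
hours = [4, 9, 10, 14, 4, 7, 12, 22, 1, 3,
         8, 11, 5, 6, 10, 11, 16, 13, 13, 10]
score = [390, 580, 650, 730, 410, 530, 600, 790, 350, 400,
         590, 640, 450, 520, 690, 690, 770, 700, 730, 640]

n = len(hours)
x_bar, y_bar = sum(hours) / n, sum(score) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, score))
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_yy = sum((y - y_bar) ** 2 for y in score)

# Correlation and the fraction of variation explained by the line.
r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = r ** 2
print(round(r, 4), round(r_squared, 4))
```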

Is The Data Significant?
What is the inequality we are using? $|r|\sqrt{n} > 3$
Is this data significant? $|r|\sqrt{n} = 0.9336\sqrt{20} \approx 4.18 > 3$
So, the data is significant based on this criterion.
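
The rule of thumb on this slide is easy to wrap in a small check. This is only a sketch: the function name and the `cutoff` parameter are illustrative, and the test statistic is read as $|r|\sqrt{n}$, which is an assumption about how the slide's inequality was originally typeset.

```python
import math

def passes_rule_of_thumb(r: float, n: int, cutoff: float = 3.0) -> bool:
    """Return True when |r| * sqrt(n) exceeds the cutoff."""
    return abs(r) * math.sqrt(n) > cutoff

# SAT example: r = 0.9336 from n = 20 students.
stat = abs(0.9336) * math.sqrt(20)
print(round(stat, 2), passes_rule_of_thumb(0.9336, 20))
```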

Visual Representation
Next, let's produce our scatter plot so we can see what we are dealing with.
[Figure: scatter plot "Math SAT Score v. Hours of Study", SAT Score versus Hours of Study]

Checking Conditions/Assumptions
We are feeling pretty good about this - it seems to have a strong, positive correlation. When we consider the conditions, are we still happy with this?
Quantitative Variable Condition: Both variables are quantitative.
Straight Enough Condition: Data looks reasonably straight.
Outlier Condition: There do not seem to be any outliers.
Does the Plot Thicken? Condition: Pretty much - other than the one person who studied for 22 hours, the relationship seems very strong.