Scatterplots and Correlation


Bivariate Data Page 1

Scatterplots and Correlation

Essential Question: What is the correlation coefficient and what does it tell you?

Most statistical studies examine data on more than one variable. Fortunately, analysis of several-variable data builds on the tools we used to examine individual variables, and the principles that guide our work remain the same: plot the data, then add numerical summaries; look for overall patterns and deviations from those patterns; and when there is a regular overall pattern, use a simplified model to describe it.

We think that car weight helps explain accident deaths and that smoking influences life expectancy. In these relationships the two variables play different roles: accident death rate and life expectancy are the response variables of interest, while car weight and number of cigarettes smoked are the explanatory variables. It is easiest to identify explanatory and response variables when we actually specify values of one variable to see how it affects another variable. When we don't specify the values of either variable but just observe both, there may or may not be explanatory and response variables; whether there are depends on how you plan to use the data.

Scatterplots

The most useful graph for displaying the relationship between two quantitative variables is a scatterplot. Always plot the explanatory variable, if there is one, on the horizontal axis (the x-axis) of a scatterplot. To make a scatterplot: 1) decide which variable should go on each axis, 2) label and scale your axes, 3) plot the individual data values.

example: Sprint time (seconds) vs. long-jump distance (inches)

Sprint time: 5.41 5.05 9.49 8.09 7.01 7.17 6.83 6.73 8.01 5.68 5.78 6.31 6.04
Long jump:   171  184  48   151  90   65   94   78   71   130  173  143  141

**AP EXAM TIP: Always be sure to label the axes of your graph.
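The three steps above can be sketched in code. This is a minimal example, assuming matplotlib is available; the filename `sprint_longjump.png` is just an illustrative choice.

```python
# Minimal scatterplot sketch for the sprint-time / long-jump example.
# Sprint time is treated as the explanatory variable, so it goes on the x-axis.
import matplotlib
matplotlib.use("Agg")          # render off-screen; no display needed
import matplotlib.pyplot as plt

sprint_time = [5.41, 5.05, 9.49, 8.09, 7.01, 7.17, 6.83, 6.73,
               8.01, 5.68, 5.78, 6.31, 6.04]    # seconds (explanatory)
long_jump = [171, 184, 48, 151, 90, 65, 94, 78,
             71, 130, 173, 143, 141]            # inches (response)

plt.scatter(sprint_time, long_jump)
plt.xlabel("Sprint time (seconds)")             # AP tip: label both axes
plt.ylabel("Long-jump distance (inches)")
plt.title("Sprint time vs. long-jump distance")
plt.savefig("sprint_longjump.png")              # hypothetical output file
```

Plotting the points shows the negative association discussed below: faster sprinters (smaller times) tend to jump farther.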

Interpreting Scatterplots

When interpreting scatterplots we use DOFS: D = direction, O = outliers, F = form, and S = strength.

Look for a clear direction (correlation): positive, negative, or none. Sometimes it can be difficult to see a direction, but you should look for the overall pattern in the graph.

Unusual points are classified as outliers or influential points. A point with a typical x-value but an extreme y-value is an outlier. A point with an unusual x-value whose y-value does not follow the pattern is an influential point.

The graph displays a scatterplot of the number of missing assignments and a student's score on the quiz. There is a negative correlation between the two variables, and points A, B, and C are all unusual. Point A has a typical x-value but an extreme y-value; therefore point A is an outlier. Point B has an unusual x-value and its y-value does not follow the pattern; therefore point B is an influential point. Point C has an unusual x-value, but its y-value follows the pattern; therefore it is an outlier.

Next we look at the form of the scatterplot: does the graph appear linear, or might it be quadratic, cubic, or exponential?

Finally, we look at the strength of the correlation. The correlation coefficient, represented by the letter r, gives a value for the strength of the correlation. The value of r always falls between -1 and 1 inclusive, and it has no units. When r is negative we have a negative correlation, and when r is positive we have a positive correlation. The strength of the correlation is judged by |r|:

0.8 ≤ |r| ≤ 1.0 : strong correlation
0.5 ≤ |r| < 0.8 : moderate correlation
0.0 ≤ |r| < 0.5 : weak correlation

We use |r| because a value of 0.85, whether positive or negative, is considered a strong correlation. The correlation coefficient can be calculated using the formula:

r = (1/(n - 1)) · Σ [ ((x_i - x̄)/s_x) · ((y_i - ȳ)/s_y) ]

where (x_i, y_i) represents each ordered pair, x̄ and ȳ are the average x- and y-values, s_x and s_y are the standard deviations for each variable, and n is the number of observations.

example: Body weight and backpack weight

Body weight: 120 187 109 103 131 165 158 116
Backpack weight: 26 30 26 24 29 35 31 28

x      (x - x̄)/s_x    y     (y - ȳ)/s_y    product
120    -0.5322         26    -0.7582         0.4036
187     1.6793         30     0.3972         0.6670
109    -0.8953         26    -0.7582         0.6789
103    -1.0934         24    -1.3359         1.4607
131    -0.1692         29     0.1083        -0.0183
165     0.9531         35     1.8414         1.7551
158     0.7220         31     0.6860         0.4953
116    -0.6642         28    -0.1805         0.1199

with x̄ = 136.125, s_x = 30.296, ȳ = 28.625, s_y = 3.462. The products sum to 5.5622; dividing by n - 1 = 7 gives r = 0.795.
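The table's computation can be reproduced directly from the formula. This is a sketch in plain Python (no libraries beyond the standard `math` module):

```python
# Compute r for the body-weight / backpack-weight example by standardizing
# each value and averaging the products, exactly as in the table above.
import math

body = [120, 187, 109, 103, 131, 165, 158, 116]
pack = [26, 30, 26, 24, 29, 35, 31, 28]
n = len(body)

def mean(v):
    return sum(v) / len(v)

def stdev(v):  # sample standard deviation (divide by n - 1)
    m = mean(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

xbar, ybar = mean(body), mean(pack)   # 136.125 and 28.625
sx, sy = stdev(body), stdev(pack)     # about 30.296 and 3.462

# r = (1/(n-1)) * sum of z_x * z_y
r = sum(((x - xbar) / sx) * ((y - ybar) / sy)
        for x, y in zip(body, pack)) / (n - 1)
print(round(r, 3))  # 0.795
```

Working in full precision rather than with the rounded z-scores from the table gives the same answer to three decimal places.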

Least Squares Regression

Essential Question: What is a least squares regression line and what does it tell us?

A regression line is a line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.

example: Does fidgeting keep you slim? Some people don't gain weight even when they overeat. Perhaps fidgeting and other "nonexercise activity" (NEA) explains why - some people may spontaneously increase nonexercise activity when fed more. Researchers deliberately overfed 16 healthy young adults for 8 weeks. They measured fat gain (in kilograms) as the response variable and change in energy use (in calories) from activity other than deliberate exercise - fidgeting, daily living, and the like - as the explanatory variable. Below is the data:

NEA change: -94 -57 -29 135 143 151 245 355 392 473 486 535 571 580 620 690
Fat gain: 4.2 3.0 3.7 2.7 3.2 3.6 2.4 1.3 3.8 1.7 1.6 2.2 1.0 0.4 2.3 1.1

A scatterplot of the data shows a moderately strong negative linear correlation, with r = -0.7786.

Using our calculator we can find the Least Squares Regression Line (LSRL) for the data. The linear equation has the form ŷ = a + bx. We use ŷ (read "y hat") for the predicted value of y, a for the y-intercept, and b for the slope. The calculator finds:

ŷ = 3.505 - 0.00344x

When interpreting the slope we say: "On average, for every 1-unit increase in the explanatory variable we would see a (slope value) unit increase/decrease in the response variable." For this problem: "On average, for every additional calorie of NEA we would see a 0.00344 kg decrease in fat gain."

We can now use our LSRL to make predictions about y. For example, if we want to know the fat gain when NEA is 450 calories, we just plug in 450 for x and find ŷ = 1.957 kg. Be sure to use ŷ any time you are referring to a predicted value.

Sometimes you are asked to predict a y-value for an x-value much larger or smaller than the values used to create the LSRL. When this happens, be aware that an extrapolation error could occur: the predicted value you find for y may not be valid because of the extreme x-value used to find it.

Bivariate Data Page 4
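Instead of a calculator, the LSRL can be found from the summary statistics. A sketch in plain Python, using the NEA data and checked against the slope and intercept quoted in the example:

```python
# Fit the LSRL for the NEA example from the formulas
# b = r * (s_y / s_x) and a = ybar - b * xbar.
import math

nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]   # calories (explanatory)
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]   # kilograms (response)
n = len(nea)

xbar, ybar = sum(nea) / n, sum(fat) / n
sxx = sum((x - xbar) ** 2 for x in nea)          # sum of squared x-deviations
syy = sum((y - ybar) ** 2 for y in fat)          # sum of squared y-deviations
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(nea, fat))

r = sxy / math.sqrt(sxx * syy)   # correlation, about -0.7786
b = r * math.sqrt(syy / sxx)     # slope: r * (s_y/s_x); the (n-1)'s cancel
a = ybar - b * xbar              # intercept

def predict(x):
    return a + b * x             # y-hat = a + bx

print(round(a, 3), round(b, 5))  # 3.505 -0.00344
```

Note that `predict` should only be trusted for x-values inside the range of the data (here roughly -94 to 690 calories); outside it we would be extrapolating.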

A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is:

residual = observed y - predicted y = y - ŷ

A positive residual means the observed value is greater than the value predicted by the LSRL; a negative residual means the observed value is less than the value predicted by the LSRL. To find a residual, we select an x-value from our table and use the LSRL to predict the y-value. For example, for x = 135 we have an observed y-value of 2.7 and ŷ = 3.0406, so the residual is y - ŷ = 2.7 - 3.0406 = -0.3406.

Note: The least-squares regression line of y on x is the line that makes the sum of the squared residuals as small as possible.

Graphing residuals

Plotting the residuals on the y-axis against the explanatory values on the x-axis gives a residual plot. The residual plot lets us see whether a line is a good fit for the data: if the points appear randomly scattered, a line is an appropriate fit. Because the points in our residual plot appear random, a linear regression is an appropriate fit for these data.

The LSRL can be found using the calculator or by using the formulas:

b = r(s_y/s_x) and a = ȳ - b·x̄

where r is the correlation coefficient, x̄ and ȳ are the average values of x and y, and s_x and s_y are the standard deviations of the x- and y-values.

The coefficient of determination tells us how well the LSRL fits the data. We use r² to represent the coefficient of determination. r² falls between 0 and 1; a value closer to 1 means the line is a better fit.

r² = 1 - SSE/SST

where SSE (the sum of squared errors) = Σ(y - ŷ)² and SST (the total sum of squares) = Σ(y - ȳ)².
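The residual and r² calculations can be checked by hand in a few lines. A sketch using the NEA data and the fitted line ŷ = 3.505 - 0.00344x:

```python
# Residuals and coefficient of determination for the NEA example.
nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]

def y_hat(x):
    return 3.505 - 0.00344 * x   # LSRL from the example

# residual = observed y - predicted y
residuals = [y - y_hat(x) for x, y in zip(nea, fat)]
print(round(residuals[3], 4))    # x = 135 is the 4th point: -0.3406

ybar = sum(fat) / len(fat)
sse = sum(res ** 2 for res in residuals)    # sum of squared errors
sst = sum((y - ybar) ** 2 for y in fat)     # total sum of squares
r_sq = 1 - sse / sst
print(round(r_sq, 3))            # about 0.606
```

The first printed value matches the worked residual above, and r² ≈ 0.606 says the LSRL accounts for about 61% of the variation in fat gain.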


The coefficient of determination is the fraction of the variation in the values of y that is accounted for by the LSRL of y on x. We would say: "The LSRL accounts for (r² value)% of the variation in the (response variable)."

The standard deviation of the residuals gives the approximate size of a "typical" or "average" prediction error (residual). To calculate it we use:

s = √( Σ(y - ŷ)² / (n - 2) )

Here we divide by n - 2 because the line estimates two quantities (the slope and the intercept) rather than just one.
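As a final check on the n - 2 divisor, here is a short sketch computing s for the NEA example with the fitted line ŷ = 3.505 - 0.00344x:

```python
# Standard deviation of the residuals for the NEA example.
# We divide by n - 2 because the line uses two estimated quantities
# (slope and intercept).
import math

nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]
n = len(nea)

sse = sum((y - (3.505 - 0.00344 * x)) ** 2 for x, y in zip(nea, fat))
s = math.sqrt(sse / (n - 2))
print(round(s, 3))   # about 0.740
```

So a "typical" prediction error for this model is roughly 0.74 kg of fat gain.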