Objectives. 2.1 Scatterplots. Scatterplots Explanatory and response variables. Interpreting scatterplots Outliers

Objectives
2.1 Scatterplots
Scatterplots
Explanatory and response variables
Interpreting scatterplots
Outliers
Adapted from authors' slides, 2012 W.H. Freeman and Company

Relationships
A very important aspect of statistics is the study of relationships between two variables. We have already partly studied this problem when doing two-sample procedures: the relationship between location and level of student debt, and the relationship between gender and height. We have also looked at relationships between categorical variables, such as binge drinking and gender. In this section we start to quantify and model these relationships.
There are situations in which the relationship is so clear that we do not need any form of statistical analysis. For example, suppose we want to buy a latte at a coffee shop. The barista explains that the latte comes in three sizes, small, medium and large, priced at $3.50, $4.00 and $4.50 respectively. In this example, knowing the size tells you exactly the price of the coffee. However, in many situations the relationship is not so clear-cut. This is where statistical tools become useful.

Relationship of two numerical variables
Most statistical studies involve more than one variable, and the primary questions are about their relationships. Questions one can ask:
Which variable(s) are explanatory and which are responses? Do we want to know how one variable affects the value of another, or do we simply want to measure their association?
How is the relationship best described? Is the association positive or negative? How can we predict one variable from the value of the other(s)? Can a straight line be used effectively, or is the relationship more complex?
How well (how closely) do the data fit the relationship we describe? How strong (or weak) is the relationship? Is the relationship significant? (Can we reject H0: no association?) How do the data deviate from the overall pattern?

Looking at relationships: Scatterplots
In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph. We look for an overall pattern and for deviations from the pattern.

Student  Beers  BAC
1        5      0.1
2        2      0.03
3        9      0.19
6        7      0.095
7        3      0.07
9        3      0.02
11       4      0.07
13       5      0.085
4        8      0.12
5        3      0.04
8        5      0.06
10       5      0.05
12       6      0.1
14       7      0.09
15       1      0.01
16       4      0.05
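As a rough illustration of what a scatterplot does with these numbers, the sketch below draws the beers and BAC data as a character grid using only the Python standard library. The function name text_scatter and the choice of 10 rows are assumptions made for this example; a real plot would normally be drawn with a plotting package.

```python
# Beers/BAC pairs copied from the slide's table, in the order listed.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]


def text_scatter(xs, ys, rows=10):
    """Return a list of strings forming a crude character-grid scatterplot.

    Hypothetical helper for illustration: columns are the integer x values,
    rows are equal-width bins of y, with the largest y values on the top row.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    width = x_max - x_min + 1
    grid = [[" "] * width for _ in range(rows)]
    for x, y in zip(xs, ys):
        # Map y onto a bin index 0..rows-1, where 0 is the bottom bin.
        frac = (y - y_min) / (y_max - y_min)
        row = min(int(frac * rows), rows - 1)
        grid[rows - 1 - row][x - x_min] = "*"
    return ["|" + "".join(line) for line in grid]


plot_lines = text_scatter(beers, bac)
for line in plot_lines:
    print(line)
print("+" + "-" * (max(beers) - min(beers) + 1))
```

The points drift from the lower left (few beers, low BAC) to the upper right (many beers, high BAC), which is the overall pattern the slide asks us to look for.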

Example: Relationships in weight gain
A study was done to investigate why some people do not gain weight even when they overeat. One theory is that these people tend to do "non-exercise activity" (such as fidgeting and twitching), which prevents their weight gain. To investigate this issue, researchers overfed 16 healthy volunteers for a period of 8 weeks. Before the study they measured the average amount of NEA (non-exercise activity) each volunteer did per day (measured in calories). Then during the study they also measured the amount of NEA that each volunteer did. The difference in the NEA (before and after the study) and the weight gain is given on my website.

Scatterplot: NEA against weight gain
From the plot it is clear that the people with larger increases in non-exercise activity gained the least weight. How can we quantify the strength of this relationship?

Positive or Negative?
Positive association: High values of the response variable tend to occur together with high values of the explanatory variable.
Negative association: High values of the response variable tend to occur together with low values of the explanatory variable.
Flat (no) association: The values of the response variable are similarly distributed for all values of the other variable. There is no information about the response variable that can be predicted from the explanatory variable.
Complex association: For some values of the explanatory variable the variables appear to be positively associated, but for other values of that variable they appear to be negatively associated (curvature). Or information other than the general (average) level of the response variable can be predicted from the explanatory variable.

Form and direction of an association
[Figure panels: straight-line relationship (negative, positive); curved relationship (positive, neither); no relationship.]

Example: Negative association for weights
From the plot it is clear that the people with larger increases in non-exercise activity gained the least weight. This means the association is negative.

Example: Positive association for temp and CO2
This is a scatterplot of average global yearly temperatures against the yearly man-made CO2 emissions. There are 150 points, each corresponding to one year from 1855 to 2005. We can see a clear positive association: large CO2 values tend to correspond to higher temperatures.

Strength of the association
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.
Weak positive relationship: for a particular median household income (X), you cannot predict the state per capita income (Y) very well. Y varies widely for a given X.
Very strong positive relationship: the daily amount of gas consumed can be predicted quite accurately for a given temperature value. Y varies very little for a given X.

Issues: How to scale a scatterplot
The same data appear in all four plots. There is a negative relationship between swim time and pulse rate. Using an inappropriate scale for a scatterplot can give an incorrect impression and interpretation of the data. Both variables should be given a similar amount of space: the plot is roughly square, and space cannot be reduced without removing some points.

Issues: Outliers
An outlier is a data point that is exceptionally unusual or unexpected. Outliers fall outside of the overall pattern of the relationship.
[First marked point] This point is unusual in its values, but it is not an outlier of the relationship.
[Second marked point] This point is not in line with the others. It is an outlier of the relationship.

Review: Interpreting scatterplots
After plotting two variables on a scatterplot, we describe the relationship by examining the direction, form, and strength of the association. We look for an overall pattern and for deviations from that pattern.
Direction: positive, negative, no direction.
Form: straight line, curved, clusters, no pattern.
Strength: how closely the points fit the form.
Deviations: Do the points fit more closely for one part of the form than they do for another? Are there outliers? Would it be appropriate to extrapolate the relationship we see?

Objectives
2.2 Correlation
The correlation coefficient r
Properties of the correlation coefficient

Measuring the strength of a linear relationship
We recall from the previous section: the midterm grades appeared to be positively associated, but the strength of the association was weak. In particular, the association between midterm 1 and the other midterms seemed very weak, while the association between midterms 2 and 3 appeared to be stronger. In contrast, weight gain and NEA appeared to have a strong negative association.
How can we quantify and compare these associations? How can we compare the associations between the midterms?
The linear association between two numerical variables can be measured using the notion of correlation. The correlation coefficient is a number which lies between -1 and 1.
1 = complete positive association (no spread)
-1 = complete negative association (no spread)
0 = no linear association, but there could be other types of nonlinear association.

Measuring relationship: correlation
The correlation coefficient r is calculated using the standardized values (z-scores) of both the x and y variables:

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

Here the first factor is the z-score for x and the second factor is the z-score for y.
r is positive if the relationship is positive and negative if the relationship is negative.
r is always between -1 and 1. The closer it is to -1 or 1, the stronger the relationship. But close to 0 does not necessarily mean no relationship.
r has no units of measurement and does not depend on the units for x and y.
It does not matter whether you plot x against y or y against x; the correlation coefficient will be the same.
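The z-score calculation can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the function name correlation is chosen for this example, and the data are the beers and BAC values from the scatterplot table in section 2.1.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r as the sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]
    return sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)


# Beers and BAC for the 16 students, in the order listed on the slide.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]

r = correlation(beers, bac)
print(round(r, 3))  # approximately 0.894
```

A value near 0.89 indicates a strong positive linear relationship between the number of beers and BAC.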

Weight gain and NEA
The correlation for the weight gain example is -0.778. It is negative because larger increases in NEA correspond to smaller weight gain, and it is fairly close to -1 because there is not much spread about the line.

Yearly temperature and man-made CO2
This is a scatterplot of average global yearly temperatures against the yearly man-made CO2 emissions. There are 150 points, each corresponding to one year from 1855 to 2005. The correlation between temperature and CO2 is 0.799. The correlation is positive because large amounts of CO2 emissions tend to correspond to high temperatures. The correlation is relatively close to one: there is some spread about the line, but not a huge amount.

The correlation coefficient r
Time to swim: x̄ = 35, s_x = 0.70
Pulse rate: ȳ = 140, s_y = 9.5
Correlation: r = -0.75
This indicates a moderately strong negative relationship. The value of r would be the same if, for example, Time to Swim was measured in seconds and Pulse Rate was measured in beats per hour. "Time to Swim" is the explanatory variable here, and belongs on the x axis. However, the value of r is the same regardless of how we label or plot the variables.
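The two invariance claims above (changing units, and swapping x and y) can be checked numerically. The sketch below uses made-up swim times and pulse rates, not the data behind the slide, and assumes the times are in minutes and the pulse rates in beats per minute purely for the sake of the unit conversion.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r as the sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# Hypothetical values for illustration only (assumed minutes and beats/minute).
swim_minutes = [34.1, 35.7, 34.5, 35.0, 36.2, 34.8]
pulse_per_minute = [152, 124, 140, 136, 128, 148]

# Same measurements expressed in seconds and beats per hour.
swim_seconds = [t * 60 for t in swim_minutes]
pulse_per_hour = [p * 60 for p in pulse_per_minute]

r_original = correlation(swim_minutes, pulse_per_minute)
r_rescaled = correlation(swim_seconds, pulse_per_hour)
r_swapped = correlation(pulse_per_minute, swim_minutes)

print(r_original, r_rescaled, r_swapped)
```

All three printed values agree up to floating-point rounding, because the z-scores are unchanged when a variable is multiplied by a positive constant and the product of z-scores does not depend on which variable is called x.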

r ranges from -1 to +1
The correlation coefficient r quantifies the strength and direction of a linear relationship between two quantitative variables.
Strength: how closely the points follow a straight line.
Direction: r is positive when individuals with higher X values tend to have higher values of Y, and negative when individuals with higher X values tend to have lower values of Y.

Direction? Form? Strength?
Automobiles in Albuquerque were randomly selected (at a shopping center) in 1974 and given an emissions test. Total hydrocarbon emissions level and model year were observed.
Direction: negative. Form: straight line? Strength: weak. r = -0.483

Direction? Form? Strength?
Pollutants were observed over a 28-day period. The carbon pollutants and the ozone level are to be related.
Direction: positive. Form: straight line. Strength: moderate. r = 0.687

Direction? Form? Strength?
The efficiency of an industrial biofilter is tested at different temperature levels.
Direction: positive. Form: straight line. Strength: moderate to strong. r = 0.891

Direction? Form? Strength?
The nickel-to-iron ratio was measured in oat plants and the plant age (in days after emergence) was also recorded.
Direction: complex (positive until 50 days, then negative). Form: curved. Strength: strong (if the curve is taken into account). r = .479
The correlation measures the degree to which the points fit a straight line, not a curve.
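To see how a strong curved relationship can produce a small r, here is a minimal sketch with invented data that rise until day 50 and then fall symmetrically. These numbers are hypothetical and are not the oat plant measurements; the symmetry is chosen deliberately so that the linear correlation cancels out.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r as the sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# Hypothetical ages (days) and a response that peaks at day 50.
days = [10, 20, 30, 40, 50, 60, 70, 80, 90]
response = [25 - ((d - 50) / 10) ** 2 for d in days]

r_curve = correlation(days, response)
print(r_curve)  # zero up to floating-point rounding
```

Here the response is determined exactly by the age, yet r is essentially 0, because the positive association before day 50 is cancelled by the negative association after it.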

What's wrong with the statement?
Statement: "In my genetics class there is a perfect correlation (correlation coefficient = 1) between midterm 2 and midterm 3. Both midterms were out of 15, so if a student scored 12 in midterm 2 then he scored 12 in midterm 3 too."
A perfect (or high) correlation does not mean that the numbers for both variables are the same. For example, the students could have scored less in midterm 2 than in midterm 3, and there can still be a perfect correlation (this is most easily seen with a graph).

Statement: "There is a high correlation between the age of American workers and their occupation."
Occupation is a categorical variable (teacher, lorry driver, miner, etc.), so it is impossible to define a correlation between age and occupation. The article probably means a strong association between age (where age was grouped, e.g. 20-29, 30-39, ...) and occupation, which is assessed by comparing conditional probabilities (see previous lectures). But the word correlation makes no sense here: how can a higher age correspond to a "higher" occupation?

Statement: "We found a correlation of 1.19 between students' ratings of faculty teaching and ratings made by other faculty."
A correlation can only lie between -1 and 1!
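The first point can be checked with a small sketch. The scores below are invented for illustration: every hypothetical student scores exactly 2 points more in midterm 3 than in midterm 2, so no two paired scores are equal, yet the points lie exactly on a straight line.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r as the sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# Hypothetical scores out of 15; midterm 3 is always 2 points higher.
midterm2 = [8, 10, 12, 9, 13]
midterm3 = [score + 2 for score in midterm2]

r_midterms = correlation(midterm2, midterm3)
print(r_midterms)  # 1 up to floating-point rounding
```

The correlation is 1 even though no student has the same score on both midterms, which is why "perfect correlation" should not be read as "identical values".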