Chapter 7 Linear Regression



7.1 Least Squares: The Line of Best Fit

The Linear Model: Fat Versus Protein at Burger King. The correlation is 0.76, which indicates a strong linear association, but which line describes it best? The line should come as close as possible to the points.

The Residual. ŷ is called the predicted value. For each data point (x, y), look at the point (x, ŷ) on the line with the same x-coordinate. The residual is y − ŷ: the difference between the observed value and the predicted value.

More on Residuals. Residual = Observed − Predicted. Points above the line have positive residuals; points below the line have negative residuals. This line gives the average fat content expected for a given amount of protein.

The Line of Best Fit. The best-fitting line will have small residuals. Large negative residuals are just as bad as large positive ones, and squaring the residuals makes them all positive. The line of best fit is the line for which the sum of the squared residuals is smallest, also called the least squares line.
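
The least squares idea can be sketched in a few lines of code. The (x, y) data below are made up for illustration (the actual Burger King menu values are not reproduced here); the slope and intercept come from the standard least squares formulas, and any other candidate line produces a larger sum of squared residuals.

```python
# Minimal sketch of least squares, with made-up (x, y) data.
def sum_squared_residuals(points, b0, b1):
    """Sum of (y - yhat)^2 for the line yhat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in points)

points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]

# Least squares estimates from the standard formulas.
n = len(points)
xbar = sum(x for x, _ in points) / n
ybar = sum(y for _, y in points) / n
b1 = (sum((x - xbar) * (y - ybar) for x, y in points)
      / sum((x - xbar) ** 2 for x, _ in points))
b0 = ybar - b1 * xbar

# Any other line has a larger sum of squared residuals.
best = sum_squared_residuals(points, b0, b1)
other = sum_squared_residuals(points, b0 + 0.5, b1 - 0.2)
print(best < other)  # True
```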

7.2 The Linear Model

The Line of Best Fit. Line from algebra: y = mx + b. Line of best fit: ŷ = b₀ + b₁x. b₁ is the slope: how rapidly ŷ changes with respect to x. b₀ is the y-intercept: the value of ŷ when x is 0.

Interpreting the Line of Best Fit: Protein and Fat. Slope = 0.91: a Burger King item with one more gram of protein is expected to have 0.91 additional grams of fat. y-intercept = 8.4: a Burger King item with no protein is expected to have 8.4 grams of fat. In reality, the two items with no protein also have no fat.

A Linear Model for Hurricanes. Line of Best Fit: Slope = −0.897. For every 1 mb increase in central pressure, we can expect a 0.897 knot decrease in the maximum wind speed.

A Linear Model for Hurricanes, Continued. Line of Best Fit: y-intercept = 955.27. The y-intercept is not meaningful, since a central pressure of 0 mb cannot happen.

7.3 Finding the Least Squares Line

Slope and Correlation. Formula: b₁ = r(s_y / s_x). Since standard deviations are always positive, the slope and the correlation always have the same sign. The correlation has no units, but the slope has units of y per unit of x. For the Burger King example, the units of the slope are grams of fat per gram of protein.

The y-intercept. The y-intercept and the slope are related by ȳ = b₀ + b₁x̄. The point (x̄, ȳ), corresponding to the means of x and y, always lies on the line of best fit. Given the mean of x, the mean of y, and the slope, we can find the y-intercept: b₀ = ȳ − b₁x̄.
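
These two formulas can be checked with a short sketch. The numbers below are made up (this is not the Burger King data): compute r, then the slope b₁ = r·s_y/s_x, then the intercept b₀ = ȳ − b₁x̄.

```python
from statistics import mean, stdev

# Made-up data, for illustration only.
xs = [10, 20, 30, 40]
ys = [15, 25, 40, 50]

n = len(xs)
xbar, ybar = mean(xs), mean(ys)
sx, sy = stdev(xs), stdev(ys)
r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

b1 = r * sy / sx       # slope: same sign as r
b0 = ybar - b1 * xbar  # intercept: forces (xbar, ybar) onto the line
print(round(b1, 3), round(b0, 3))  # 1.2 2.5
```

Note that (x̄, ȳ) satisfies the fitted equation by construction: b₀ + b₁x̄ = ȳ.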

Best Fit Line With Technology. The formulas are useful for seeing relationships, but for computations use a computer. The equation of the best fit line, along with the correlation, can be found in StatCrunch: Stat → Regression → Simple Linear.

Predicting With the Line of Best Fit. Use the line of best fit to predict the fat content of a Tendercrisp Chicken Sandwich, which contains 31 g of protein: ŷ = 8.4 + 0.91(31) ≈ 36.6, so we predict the sandwich contains 36.6 g of fat. The Tendercrisp Chicken Sandwich actually contains 22 g of fat. Residual = 22 − 36.6 = −14.6 g.
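
The prediction and residual amount to two lines of arithmetic, using the slope 0.91 and intercept 8.4 quoted earlier:

```python
# Burger King fit from the slides: yhat = 8.4 + 0.91x (x = grams of protein).
b0, b1 = 8.4, 0.91
protein = 31       # Tendercrisp Chicken Sandwich, grams of protein
actual_fat = 22    # grams of fat actually reported

predicted_fat = b0 + b1 * protein
residual = actual_fat - predicted_fat
print(round(predicted_fat, 1), round(residual, 1))  # 36.6 -14.6
```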

Conditions for Using Regression. The line of best fit is also called the least squares line or the regression line. Only use the regression line to make predictions if: both variables are quantitative, the relationship is straight enough, and there are no outliers.

Step-by-Step Example: Bridge Age and Safety Score. Think. Plan: Is there a relationship between the age of a bridge and its safety score? Variables: quantitative data on safety score and age for 194 bridges.

Step-by-Step Example: Bridge Age and Safety Score. Requirements. Quantitative variables: age and score are both quantitative. Straight enough: the scatterplot looks relatively linear. No outliers: nothing stands out on the scatterplot.

Step-by-Step Example: Bridge Age and Safety Score. Mechanics. Use StatCrunch to find: r = −0.681, slope = −0.0228, y-intercept = 6.30.

Step-by-Step Example: Bridge Age and Safety Score. Conclusions. The correlation of −0.681 shows that older bridges tend to be less safe. The slope of −0.0228 indicates that, on average, bridge safety scores go down about 0.0228 points for each year of age. The y-intercept of 6.30 means that new bridges tend to have a safety score of around 6.3.

7.4 Regression to the Mean

Correlation and Prediction. A new male student joins the class. How tall is he in inches? The best guess would be the mean (z = 0). What if you also knew his GPA was 3.94 (z_GPA = 2)? Since GPA is essentially unrelated to height, the best guess for height would not change. But if you were told his height in centimeters had z_cm = 2, you could find his height in inches exactly: z_in = 2.

Correlation and Prediction, Continued. What would your guess for height be if you knew the student's shoe size had z = 2? Since the correlation is positive but less than 1, the best guess lies in between: 0 < ẑ_in < 2. For standardized variables, the means are both 0, so b₀ = ȳ − b₁x̄ gives b₀ = 0; the standard deviations are both 1, so b₁ = r(s_y/s_x) gives b₁ = r. Plugging into ŷ = b₀ + b₁x gives ẑ_y = r·z_x.
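
This can be verified numerically: after standardizing both variables, the least squares intercept is 0 and the slope equals r. The data below are made up for illustration.

```python
from statistics import mean, stdev

# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

def z_scores(v):
    """Convert a list of values to z-scores."""
    m, s = mean(v), stdev(v)
    return [(x - m) / s for x in v]

zx, zy = z_scores(xs), z_scores(ys)
r = sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Least squares slope and intercept computed on the z-scores:
b1 = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)
b0 = mean(zy) - b1 * mean(zx)
print(abs(b1 - r) < 1e-12, abs(b0) < 1e-12)  # True True
```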

Interpreting z-score Regression. If x is k standard deviations from its mean, then the predicted y will be r·k standard deviations from its mean.

Galton's Discovery. Tall parents have tall children, but the children's heights are likely to be closer to the mean than the parents' heights. Since −1 ≤ r ≤ 1, r·z_x is smaller in absolute value than z_x. This is called regression to the mean.

7.5 Examining the Residuals

Residuals Revisited. The residual is the difference between the y value of the data point and the ŷ value found by plugging the x value into the least squares equation: Residual = y − ŷ. To find the residual: 1. Plug x into the least squares equation to get ŷ. 2. Subtract ŷ from y.

Residual Example. The data comparing central pressure and maximum wind speed gave the least squares line ŷ = 955.27 − 0.897x. Hurricane Katrina's central pressure was x = 920 millibars and its maximum wind speed was y = 110 knots. Plugging in 920 gives ŷ = 955.27 − 0.897(920) = 130.03, so the residual is 110 − 130.03 ≈ −20 knots.
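
The same computation in code, using the fitted line quoted above:

```python
# Hurricane fit from the slides: yhat = 955.27 - 0.897x (x = pressure in mb).
b0, b1 = 955.27, -0.897
pressure = 920   # Katrina's central pressure, millibars
wind = 110       # Katrina's maximum wind speed, knots

predicted = b0 + b1 * pressure
residual = wind - predicted
print(round(predicted, 2), round(residual, 2))  # 130.03 -20.03
```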

A Good Regression Model. The regression model is a good model if the residual scatterplot has no interesting features: no direction, no shape, no bends, no outliers, no identifiable pattern.

The Residual Standard Deviation. Since the mean of the residuals is 0, the standard deviation of the residuals is a measure of how small the residuals are. Equal Variance Assumption: a good model will have residual spread that is consistent and small. A histogram of the residuals helps us understand their distribution.

7.6 R²: The Variation Accounted for by the Model

Comparing the Variation of y with the Variation of the Residuals. r = 1 or −1: the residuals are all 0, so there is no variation in the residuals. r = 0: the regression line is horizontal through the mean, the residuals are the y values minus the mean, and the variation of the residuals is the same as the variation of the original y values.

Comparing the Variation of y with the Variation of the Residuals: General r. The variation of the residuals for protein versus fat in Burger King menu items is less than the variation of the fat values themselves. r² (written R²) gives the fraction of the data's variation accounted for by the model.
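
The identity behind this claim, R² = r² = 1 − (variation of the residuals)/(variation of y), can be checked with made-up data:

```python
from statistics import mean

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 2.5, 4.0, 4.5, 6.0]

xbar, ybar = mean(xs), mean(ys)
sxx = sum((x - xbar) ** 2 for x in xs)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
syy = sum((y - ybar) ** 2 for y in ys)

# Least squares fit, then residuals.
b1 = sxy / sxx
b0 = ybar - b1 * xbar
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

r_squared = 1 - sum(e ** 2 for e in residuals) / syy
r = sxy / (sxx * syy) ** 0.5
print(abs(r_squared - r ** 2) < 1e-12)  # True
```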

Variation of y and the Variation of the Residuals (Continued). R² = 0.76² = 0.58: 58% of the variability in fat content of Burger King's menu items is accounted for by the variation in protein content. The remaining 42% of the variability in fat content is left in the residuals; other factors, such as how the food is prepared, account for it.

R² for Hurricanes. R² = 0.773 for pressure and maximum wind speed: 77.3% of the variability in the maximum wind speeds of hurricanes is accounted for by the variation in pressure. The remaining 22.7% is left in the residuals; other factors, such as temperature and location, account for it.

When is R² Big Enough? R² provides a measure of how useful the regression line is as a prediction tool. If R² is close to 1, the regression line is useful; if R² is close to 0, it is not. What "close to" means depends on who is using it. Good practice: always report R² and let the researcher decide.

Beware of Just Switching x and y. Switching x and y in the regression equation and solving for x does not give the regression line in reverse. Instead, you must start over with all the computations. This is no big deal if you use a computer or calculator, since the data are already entered.
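
A quick sketch with made-up data shows why: the least squares slope for predicting x from y is not the reciprocal of the slope for predicting y from x (they agree only when r = ±1).

```python
from statistics import mean

# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

def ls_slope(u, v):
    """Least squares slope for predicting v from u."""
    ubar, vbar = mean(u), mean(v)
    return (sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
            / sum((a - ubar) ** 2 for a in u))

slope_y_on_x = ls_slope(xs, ys)       # slope for predicting y from x
slope_x_on_y = ls_slope(ys, xs)       # slope for predicting x from y
algebraic_inverse = 1 / slope_y_on_x  # what "just solving for x" would give
print(slope_x_on_y == algebraic_inverse)  # False
```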

7.7 Regression Assumptions and Conditions

Conditions to Check For. Quantitative Variables Condition: regression analysis cannot be used for qualitative variables. Straight Enough Condition: the scatterplot should show a relatively straight pattern. Outlier Condition: outliers dramatically influence the fit of the least squares line. Does the Plot Thicken? Condition: the data should not become more spread out as the values of x increase; the spread should be relatively consistent for all x.

Conditions on the Scatterplot of the Residuals. There should be no bends, no outliers, and no changes in spread from one part of the plot to another.

Sugar and Calories. Think. Plan: What is the relationship between sugar content and calories in cereal? Variables: sugar content and calories are both quantitative. The scatterplot looks straight enough, there are no obvious outliers, and the plot does not thicken: the spread is consistent from left to right.

Sugar and Calories. Mechanics: from StatCrunch, r = 0.564, ŷ = 89.5 + 2.50x, R² = 0.318, s_y = 19.5, s_e = 16.2. Conclusions. Slope: a serving of cereal has about 2.50 additional calories per gram of sugar, on average. Intercept: we predict that a serving of sugar-free cereal will have about 89.5 calories.

Sugar and Calories, Continued. Conclusions. R²: 31.8% of the variation in calories is accounted for by the variation in sugar content. The standard deviation of the residuals, 16.2, is somewhat smaller than the standard deviation of calories, 19.5.

Sugar and Calories. Think Again. Check again: the residual plot shows a horizontal direction, a shapeless pattern, and equal scatter from left to right. The linear model appears appropriate.

Reality Check: Always Check Whether the Prediction is Reasonable. Is it reasonable to believe that a serving of cereal has about 2.50 additional calories for each additional gram of sugar? Checking a few actual boxes of cereal would be a good way to see whether this slope is reasonable.

Causation and Regression. Never report a cause-and-effect relationship based solely on regression analysis. Even though the correlation was high and the model was reasonably linear for pressure versus wind in the hurricane data, we would need a scientific explanation to conclude cause and effect. Regression analysis alone can never prove causation.

What Can Go Wrong? Don't fit a straight line to a nonlinear relationship: if there are curves and bends in the scatterplot, don't use linear regression. Don't ignore outliers: report them, and think twice before using regression analysis. Don't invert the regression: switching x and y does not mean just solving for x in the least squares equation; you must start over.