
Chapter 08: Linear Regression

There are lots of ways to model the relationships between variables. It is important that you not think that what we do is THE way. There are many paths to the summit. We are concerned with linear relationships: the simplest type of function that we can use. So, we'll be trying to fit a line to the data.

As we've seen, it's quite rare that all the data actually fall on a line. There's going to be some error in any model that we construct; not all of the points are going to actually fall on the line; rather, they will fall near the line. Naturally, our goal will be to construct a model with a minimum amount of error. As it turns out, there are many ways to do this, but there is only one method that we'll be discussing. In fact, the method that we'll use is so popular that some people refer to it as "The line of best fit." I take some exception to that definite article.

Is a Linear Model Appropriate?

Naturally, you should not just blindly launch into finding a linear model for data. There are a few things that you can (and should!) do before you attempt to establish a model for your data.

First (and most obvious): graph the data! Does the relationship even appear vaguely linear?

Next, calculate the correlation coefficient (we learned about that in Chapter 7). If r is far from zero, this lends support to your observation that the data have a linear relationship.

Finally, calculate the coefficient of determination, r². This measures the proportion of variation in the response variable that can be attributed to a linear relationship with the explanatory variable. OK, maybe that needs a little more explanation.

The response variable, all by itself, varies. Specifically, it varies about its mean value. Here's a visual of that (remember that I'm looking at variation in the response variable, the variable that is always plotted on the vertical axis):

Figure 1 - Variation in the Response Variable

Make a mental note of the amount of variation around that line! That's a range of almost 140. Now let's look at these data plotted against another variable.
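If you want to see the "reduction in variation" idea in action with software, here is a minimal Python sketch (the tiny data set is invented purely for illustration; it is not the data in Figure 1). It measures the spread of y around its own mean, then the spread left over around the least squares line, and shows that the proportional reduction is exactly r².

```python
import numpy as np

# Invented illustrative data (not from the chapter's examples)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Total variation: sum of squared deviations of y about its mean
sst = np.sum((y - y.mean()) ** 2)

# Variation remaining around the least squares line
b, a = np.polyfit(x, y, 1)          # slope, intercept
residuals = y - (a + b * x)
sse = np.sum(residuals ** 2)

r = np.corrcoef(x, y)[0, 1]
print(1 - sse / sst)                # proportion of variation explained
print(r ** 2)                       # the same number: r squared
```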

Figure 2 - Variation about the Linear Model

Remember that we measure variation around some center; in this case, around an oblique line. Is the variation around this oblique line larger or smaller than the amount of variation around the horizontal line? Much smaller, of course! In fact, the amount of reduction in variation is precisely the value of r². When r² is close to 1, a linear relationship can explain most of the variation. This is a good thing if you are trying to find a linear model of the data!

The Method of Least Squares

The Method of Least Squares defines "best fit" using something called residuals. A residual is the difference between the observed response (y) and the predicted response (ŷ).

Figure 3 - Residuals, Illustrated

(Notice that the residual can be thought of as the observed value minus the expected value; this is a common theme in statistics!)

You can think of a residual as the error for a single observation. A model that fits perfectly would have no error: all of the residuals would be zero. Naturally, this never happens! Thus,

we'll be wanting a model that makes these errors (the residuals) as small as possible. Mathematically, that would mean making the sum of the residuals as small as possible.

Actually, that is not quite correct. What I meant to say is that the sum of the residuals should be as close to zero as possible. This is an important distinction, since some residuals will be negative (when the prediction is larger than the observation). In particular, since some residuals will be negative, just summing the residuals won't work; in fact, you'll always get a sum of zero (which ought to mean "perfect fit"). Fortunately, we've run into this before, and we know exactly what to do: sum the squares of the residuals. Thus, the method of least squares creates a model where the sum of the squares of the residuals is as close to zero as possible. Happily, you don't need to know the algebra (and calculus) that makes this work; you just need the result.

The Formulas

As you recall from algebra, we need two pieces of information in order to write the equation of a line: a point (often the y-intercept, but it doesn't have to be!) and the slope. I'll go ahead and tell you that one point that will be on this line is (x̄, ȳ). Hopefully it makes perfect sense that the mean value of the explanatory variable should predict the mean value of the response variable!

As for the slope: remember that slope is "rise over run." A better way to say this is that it's "change in y over change in x." One thing that we've learned this year is that another word for change would be variation, so it isn't a big leap to see that the slope of the least squares regression line involves the variation in y divided by the variation in x.

Equation 1 - Slope of the Least Squares Model

b = r (sy / sx)

Notice that the correlation coefficient modifies that slope; this forces the slope to be closer to zero when there is little or no strength to the relationship between the variables. Your textbook does a nice job of showing you exactly how these formulas arise, but you are not required to know how to do that; I'd rather you just have the idea instead.

Once you have the slope, you can write the equation of the least squares regression line (using the point-slope form)!

Equation 2 - The Least Squares Regression Equation

ŷ - ȳ = r (sy / sx)(x - x̄)

Most teachers want you to solve that for ŷ. Hopefully that isn't hard.

Now, some good news: it will be highly unusual for you to need to use these formulas! It is much more likely that you will have a set of raw data, in which case you should use the built-in features of your calculator. Don't waste time doing things by hand if your technology will do them for you!
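For those using software instead of a calculator, here is a quick Python check of Equation 1 (again with invented toy data): the slope from b = r(sy/sx), together with the point (x̄, ȳ), reproduces exactly what a canned least squares routine reports.

```python
import numpy as np

# Invented toy data, just to check the formulas
x = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y = np.array([8.0, 11.5, 14.0, 18.5, 20.0])

r = np.corrcoef(x, y)[0, 1]
b = r * y.std(ddof=1) / x.std(ddof=1)   # Equation 1: b = r * (sy / sx)
a = y.mean() - b * x.mean()             # line passes through (x-bar, y-bar)

slope, intercept = np.polyfit(x, y, 1)  # least squares fit, for comparison
print(b, slope)                         # same value
print(a, intercept)                     # same value
```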

Examples

[1.] When cars were still a fairly new idea, someone measured the stopping distance (feet) for cars travelling at various rates of speed (miles per hour). The speeds of the cars had a mean of 15.4 mph and a standard deviation of 5.2876 mph. The stopping distances had a mean of 42.98 feet and a standard deviation of 25.7694 feet. The correlation between the variables was 0.8069. Find the equation of the least squares regression line that predicts stopping distance from speed.

First of all: since we want to predict stopping distance, that will be the y variable, and speed will be the x variable. The slope of the line will be 0.8069(25.7694/5.2876) ≈ 3.9325, so the equation is ŷ - 42.98 = 3.9325(x - 15.4), or ŷ = 3.9325x - 60.56 + 42.98, or ŷ = 3.9325x - 17.58.

(By the way: a graph of the data appeared fairly linear, so constructing the least squares regression model was appropriate.)

[2.] Concerning the previous example: what percentage of the variation in stopping distance can be explained by a linear relationship with speed?

That would be r² = (0.8069)² ≈ 0.6511, so 65.11% of the variation in stopping distance can be explained by the speed.

[3.] Pottery in Great Britain was analyzed using atomic absorption spectrophotometry. In particular, the percentage of aluminum oxide (Al) and the percentage of calcium oxide (Ca) were measured for 26 pieces of pottery. The data are shown in tables 1 and 2 below. Find the least squares regression model that predicts aluminum oxide percentage from calcium oxide percentage.

Table 1 - Pottery Data, Part 1

        1     2     3     4     5     6     7     8     9     10    11    12    13
Al    14.4  13.8  14.6  11.5  13.8  10.9  10.1  11.6  11.1  13.4  12.4  13.1  12.7
Ca    0.15  0.12  0.13  0.16  0.20  0.17  0.20  0.18  0.29  0.28  0.22  0.31  0.20

Table 2 - Pottery Data, Part 2

        14    15    16    17    18    19    20    21    22    23    24    25    26
Al    12.5  11.8  11.6  18.3  15.8  18.0  18.0  20.8  17.7  18.3  16.7  14.8  19.1
Ca    0.22  0.30  0.29  0.03  0.01  0.01  0.01  0.07  0.06  0.06  0.01  0.03  0.10

First of all, we should graph the data to see whether linear regression is even appropriate. Since we are predicting aluminum oxide percentage, that should be graphed on the vertical axis.
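The arithmetic in Examples 1 and 2 is easy to mis-key, so here is a minimal Python version that uses only the summary statistics given in the problem:

```python
# Summary statistics from Example 1
r, x_bar, s_x = 0.8069, 15.4, 5.2876      # speed (mph)
y_bar, s_y = 42.98, 25.7694               # stopping distance (ft)

b = r * s_y / s_x                # slope: about 3.9325
a = y_bar - b * x_bar            # intercept: about -17.58
print(b, a)

r_squared = r ** 2               # Example 2: about 0.6511, i.e. 65.11%
print(r_squared)
```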

Figure 4 - Plot for Example 3 (scatterplot of aluminum oxide % against calcium oxide %)

Since the plot appears fairly linear, constructing a least squares model should be OK. There's no need to go through the formulas; just let your calculator do the work! I get ŷ = -22.57x + 17.8.

Interpreting the Coefficients

We will, naturally, need to interpret these values. Happily, you should have been doing this ever since you took Algebra 1, so interpreting these coefficients shouldn't be brand-new or scary. There are, however, a few differences that need to be addressed.

In Algebra, you (probably) interpreted slope as the amount of change in y for a one-unit change in x. That's the right idea, but not acceptable for us. In Algebra, the relationships are definite; there's no error or variation. Here, there is some variation: the least squares line doesn't actually touch the data points; rather, it produces the mean response for a particular observation. Thus, when we interpret the slope, we need a few wiggle-words to make it clear to the reader that we know the line doesn't touch the data points. Hence, it is better to say that the slope is the average amount of change in y that is predicted by the model for a one-unit change in x.

We must adjust the interpretation of the y-intercept as well. In Algebra, you said that it was the y-value when x was zero. Now, we should say that it is the average predicted y-value for an x value of zero.

Examples

[4.] From the speed and stopping distance example above: interpret the coefficients of the least squares equation in the context of the situation.

The slope of 3.93 means that the model predicts an average increase of 3.93 feet of stopping distance for each additional mile per hour of speed.

The y-intercept of -17.58 means that the model predicts an average stopping distance of -17.58 feet for a car moving at 0 miles per hour.
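If you would rather fit from the raw data in software than on a calculator, here is a sketch of the same fit in Python; the 26 (Ca, Al) pairs are the ones from Tables 1 and 2.

```python
import numpy as np

# Calcium oxide % (x) and aluminum oxide % (y) from Tables 1 and 2
ca = np.array([0.15, 0.12, 0.13, 0.16, 0.20, 0.17, 0.20, 0.18, 0.29,
               0.28, 0.22, 0.31, 0.20, 0.22, 0.30, 0.29, 0.03, 0.01,
               0.01, 0.01, 0.07, 0.06, 0.06, 0.01, 0.03, 0.10])
al = np.array([14.4, 13.8, 14.6, 11.5, 13.8, 10.9, 10.1, 11.6, 11.1,
               13.4, 12.4, 13.1, 12.7, 12.5, 11.8, 11.6, 18.3, 15.8,
               18.0, 18.0, 20.8, 17.7, 18.3, 16.7, 14.8, 19.1])

slope, intercept = np.polyfit(ca, al, 1)
print(slope, intercept)   # roughly -22.57 and 17.8
```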

(Don't worry that the value is unreasonable; it would be an example of extrapolation anyway!)

[5.] From the pottery example above: interpret the coefficients of the least squares equation in the context of the situation.

The slope of -22.57 means that the model predicts an average decrease of 22.57% of aluminum oxide content for each additional percentage of calcium oxide content.

The y-intercept of 17.8 means that the model predicts an average aluminum oxide percentage of 17.8% when the calcium oxide content is 0%.

Making Predictions

Using the least squares model to make predictions couldn't be any simpler: just substitute! Be careful: this model can only be used to predict y based on x. Back in Algebra, you spent quite a bit of time substituting y into equations and then solving for x. Since this is not an algebraic equation (there is variation in the y-variable), our equation is not that versatile. Our equation minimizes the total squared variation in the residuals, which are measured vertically. To get an equation that predicts x, we'd need to consider horizontal variation. Since we didn't (and won't!) do that, you must always substitute for x.

Your calculator stores the regression equation for you: use it! The specifics on how to do so are sufficiently varied (based on calculator model and operating system) that I won't try to give instructions here. I will remind you that you must show some work on your paper, even though the calculator is doing the grunt work for you.

Examples

[6.] From the speed and stopping distance example above: predict the stopping distance of a car travelling at 15 mph.

The model predicts ŷ = 3.9325(15) - 17.58 ≈ 41.4075 feet of stopping distance for a car travelling at 15 mph.

[7.] From the pottery example above: predict the aluminum oxide content for a sample with a calcium oxide content of 0.25%.

The model predicts ŷ = -22.57(0.25) + 17.8 ≈ 12.1575% aluminum oxide content when the calcium oxide content is 0.25%.

[8.] A study in Western Australia measured the carapace length and carapace width (both measured in millimeters) for 200 crabs. A random sample of the collected data are shown in table 3.
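Since a prediction is nothing but substitution, a sketch of Examples 6 and 7 in Python is just a few lines of arithmetic (the coefficients are the ones fitted above):

```python
def predict(slope, intercept, x):
    """Plug an x value into a fitted least squares line."""
    return slope * x + intercept

# Example 6: stopping distance (ft) at 15 mph
print(predict(3.9325, -17.58, 15))    # about 41.41

# Example 7: aluminum oxide % when calcium oxide is 0.25%
print(predict(-22.57, 17.8, 0.25))    # about 12.16
```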

Predict the carapace width for a crab that has a carapace length of 42mm.

Table 3 - Crab Data

Crab      1     2     3     4     5     6     7     8     9    10
Length  47.6  47.2  37.8  26.1  33.9  29.0  37.9  33.8  26.6  37.6
Width   52.8  52.1  41.9  31.6  39.5  32.1  42.2  39.0  29.6  43.9

First, let's graph the data. Since we're predicting width, that needs to be on the vertical axis.

Figure 5 - Plot for Example 8 (scatterplot of width (mm) against length (mm))

The data appear linear, so a least squares model should be appropriate. The model is ŷ = 1.04885x + 2.9736. When the length is 42mm, the model predicts a width of ŷ = 1.04885(42) + 2.9736 ≈ 47.0253 millimeters.

Evaluating the Model

Not all models are created equal. Even a single model might be good at making some predictions and bad at making other predictions. Thus, we need to evaluate the model that the least squares method produces. There are several ways in which we might do this.

Assessing Fit Visually

First (and perhaps most obvious): look at the graph of the model with the data. How well does the model follow the data? Are there any patterns: regions where the data all fall to one side of the line, or where the variation around the line swells? Take a look and describe what you see!

Correlation and the Coefficient of Determination

The correlation coefficient (r) can be used to support your (previous!) observation that the data appear to have a linear relationship.
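Before we turn to graphs of the residuals, it may help to compute some. Here is a minimal Python sketch using the crab data from Example 8; notice that the residuals really do sum to (essentially) zero, just as promised in the Method of Least Squares section.

```python
import numpy as np

length = np.array([47.6, 47.2, 37.8, 26.1, 33.9, 29.0, 37.9, 33.8, 26.6, 37.6])
width  = np.array([52.8, 52.1, 41.9, 31.6, 39.5, 32.1, 42.2, 39.0, 29.6, 43.9])

slope, intercept = np.polyfit(length, width, 1)
predicted = slope * length + intercept
residuals = width - predicted            # observed minus predicted

print(residuals)
print(residuals.sum())                   # essentially zero, as promised
```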

Using The Residuals

A Histogram

When the model does a good job of following the data, the line ought to run through the middle of the data points. That would cause the residuals to be equally distributed around zero. Also, a good model would lie amongst the data so that more points are close to the line than are far from the line. That would cause more residuals to have values close to zero (rather than far from zero). Both of these ideas suggest that a graph of the residuals ought to appear symmetric and mound-shaped, centered at zero. Thus, one way to assess the model is to look at a graph of the residuals; perhaps a histogram?

A Scatterplot

We expect the response variable to vary around the regression line; ideally, the amount of variation is the same everywhere along the regression line. One way to check this is to create a plot of the residuals against the predicted responses. This type of graph is called a residual plot.

There are some people who create residual plots with the explanatory variable on the horizontal axis. For the kind of regression that we are doing, that is fine; it's basically equivalent to the kind that I defined. The reason why I use my version is that it is the only version you'll use if you continue with statistics and learn about multiple regression.

If the model is doing a good job with the data, then a residual plot will have no pattern: the points will be scattered along the x-axis in a sort of random cloud. If there is a pattern in this plot, then there is some potential problem with the model. There are a few patterns that you should know (including what the patterns indicate):

1. A curve (usually a parabola-like shape) indicates that there is some curvature in the data. Maybe a linear model isn't the best choice.

Figure 6 - A Curved Residual Plot (residual (lbs) against predicted weight (lbs))

2. An oblique line indicates that there is a linear relationship in the data, but there is one outlier that is pulling the regression line away from the rest of the data. This is sufficiently unusual that I don't have an image to illustrate it.

3. A triangle indicates that there is more variation amongst some predictions than other predictions (or, if you used the explanatory variable on the x-axis, there is more

variation on one end of the regression line than at the other end). This should be taken into account when making predictions.

Figure 7 - Non-Constant Variation in a Residual Plot

Example

[9.] In the 1980s, researchers took various body measurements from over 500 women of Pima Indian heritage. Two of the measured variables were triceps skin fold thickness (measured in millimeters) and body mass index (kg/m²). The data from a random sample of these measurements are shown below in table 4. Establish a model that could be used to predict skin fold thickness from body mass index. Use your model to predict the triceps skin fold thickness for a woman with a BMI of 35 kg/m².

Table 4 - Pima Indian Measurements

Woman    1     2     3     4     5     6     7     8     9    10
BMI    35.9  22.9  37.6  32.4  33.7  43.3  18.2  32.8  24.2  28.4
TSFT    28    13    34    32    39    37    19    31    20    23

Since we're predicting skin fold thickness, that should be on the y-axis; and we do want to graph the data before we do anything else.

Figure 8 - Scatterplot for Example 9 (skin fold (mm) against BMI (kg/m²))

The relationship appears linear, so a least squares model may be appropriate. r = 0.852, which indicates a moderately strong linear relationship. r² = 0.726: 72.6% of the variation in skin fold thickness can be explained by a linear relationship with BMI. Both of these indicate that a least squares regression model is appropriate.

The least squares model is ŷ = 0.9632x - 2.2016, where ŷ is the predicted skin fold thickness and x is the observed BMI.

Now for some model diagnostics, starting with the residual plot.

Figure 9 - Residual Plot for Example 9 (residual against predicted skin fold thickness (mm))

The residual plot shows no clear pattern, which indicates that the least squares model is appropriate. Here's a histogram of the residuals:

Figure 10 - Histogram of Residuals for Example 9

The graph shows no clear sign of skew or outliers, which further indicates that a least squares regression model is appropriate.

Since all of the signs are favorable, I'll go ahead and use the model for the prediction: ŷ = 0.9632(35) - 2.2016 ≈ 31.5106. The model predicts a skin fold thickness of 31.5106 mm for a woman of Pima Indian heritage with a BMI of 35 kg/m².
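Finally, for anyone who wants to reproduce Example 9 outside of a calculator, here is a minimal end-to-end sketch in Python (matplotlib is used only to draw versions of the two diagnostic graphs; the exact appearance will differ from Figures 9 and 10):

```python
import numpy as np
import matplotlib.pyplot as plt

# Table 4: BMI (kg/m^2) and triceps skin fold thickness (mm)
bmi  = np.array([35.9, 22.9, 37.6, 32.4, 33.7, 43.3, 18.2, 32.8, 24.2, 28.4])
tsft = np.array([28, 13, 34, 32, 39, 37, 19, 31, 20, 23], dtype=float)

r = np.corrcoef(bmi, tsft)[0, 1]
print(r, r ** 2)                      # about 0.852 and 0.726

slope, intercept = np.polyfit(bmi, tsft, 1)
print(slope, intercept)               # about 0.9632 and -2.2016

predicted = slope * bmi + intercept
residuals = tsft - predicted

# Diagnostics: residual plot and histogram of the residuals
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(predicted, residuals)
ax1.axhline(0)
ax1.set_xlabel("Predicted skin fold thickness (mm)")
ax1.set_ylabel("Residual")
ax2.hist(residuals)
ax2.set_xlabel("Residual")
plt.show()

print(slope * 35 + intercept)         # about 31.51 mm
```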