Chapter 7: Linear Regression (Pt. 1)

7.1 Introduction

Recall that r, the correlation coefficient, measures the linear association between two quantitative variables. Linear regression is the method of fitting a linear model to a scatterplot. As you already learned, r is closely related to this model: it tells us both the strength and the direction of the linear relationship being modeled.

7.2 The Least-Squares Regression Line

Given a scatterplot, there are many different lines you could draw through the data points. However, rather than using an arbitrary line to model our data, a more principled approach is better. In many applications, people use the least-squares regression line (hereafter, the regression line) to model linear relationships in scatterplots. To find the regression line, we do some calculations (which we will go over later) and end up with a line of the form

    ŷ = f(x) = b₁x + b₀

Recall from high school math that this equation is in slope-intercept form, where b₁ is the slope of the line and b₀ is the y-intercept. Taken together, b₀ and b₁ are the parameters of our linear model. The following list summarizes the properties of the regression line:

1. The regression line minimizes the sum of squared residuals between the line and the actual data.
2. The slope (b₁) of the line describes how the response variable (Y) changes with the predictor variable (X).
3. For a given value of x (whether or not it appears in our data), we can get a prediction ŷ = f(x) for the response variable.

Below is a scatterplot with its regression line drawn over the data.

[Figure: "Stopping Distance of Cars" — scatterplot of Stopping Distance (feet) versus Speed (mph), with the regression line overlaid.]
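You will never be asked to fit a regression line by computer in this class, but if you are curious, the properties above can be seen in a few lines of Python. This is just an illustrative sketch: the data points below are made up, and `np.polyfit` is one of several library routines that compute the least-squares line.

```python
import numpy as np

# Made-up illustrative data (roughly linear)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.polyfit with deg=1 returns the least-squares coefficients,
# highest power first: [b1, b0]
b1, b0 = np.polyfit(x, y, deg=1)

# Property 3: for any x (even one not in the data), the model
# gives a prediction y-hat = b1*x + b0
y_hat = b1 * 2.5 + b0
```

Of all possible lines, the one returned here is the unique line minimizing the sum of squared vertical distances to the data points (property 1).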
7.2.1 Residuals

Suppose we have data {(x₁, y₁), ..., (x_N, y_N)}, from which we derive the following regression model:

    ŷ = f(x) = b₁x + b₀

The ith residual is the difference between the actual value yᵢ and the predicted value ŷᵢ = f(xᵢ). That is,

    resᵢ = yᵢ − ŷᵢ = Real y − Predicted y

Sometimes residuals are called errors, as they represent the degree to which the model deviates from the data. As mentioned before, the least-squares line is intended to minimize the sum of the squared residuals between data and model. That is, we choose b₀ and b₁ so that the line f(x) = ŷ = b₁x + b₀ minimizes

    Sum of squared residuals = Σᵢ₌₁ᴺ (yᵢ − ŷᵢ)²

Example (Looking at Residual Plots)

The graphic below is the residual plot of the previous cars example. The horizontal line represents points where the model's prediction is exactly the same as the data. Points below the line represent cases where the model overestimates the data; points above the line represent cases where the model underestimates the data. The parameters of the model are b₁ = 3.932 and b₀ = −17.579 (think about these for a minute). The correlation coefficient is r = 0.807.
[Figure: "Residual Plot of Cars Data" — residuals (Real Distance − Predicted Distance, in feet) plotted against Speed.]

1. Suppose we have a car driving 33 mph. What is the model's prediction for its stopping distance?
2. Suppose we know that a car driving at 15 mph takes 45 feet to stop. Calculate the residual.
3. Suppose a car driving at 22 mph has a residual of 11 feet. What was its actual stopping distance?
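After you have worked the questions above by hand, you can check your answers with a short sketch like the following. It uses the slope b₁ = 3.932 and intercept b₀ = −17.579 from the cars example (note the negative intercept), and nothing beyond the definitions of prediction and residual.

```python
# Parameters of the cars-example regression line
b1, b0 = 3.932, -17.579

def predict(speed):
    """Model's predicted stopping distance (feet) at the given speed (mph)."""
    return b1 * speed + b0

# Q1: prediction at 33 mph
pred_33 = predict(33)            # 112.177 feet

# Q2: residual = actual - predicted, for a car at 15 mph that took 45 feet
residual_15 = 45 - predict(15)   # 3.599 feet

# Q3: actual = predicted + residual, for a 22 mph car with residual 11 feet
actual_22 = predict(22) + 11     # 79.925 feet
```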
7.2.2 Interpreting Slope and Intercept

Again, suppose we have the linear regression model

    ŷ = b₁x + b₀

In general terms, we might interpret the model as follows:

1. For every 1-unit increase in x, y increases (or decreases) by b₁.
2. When x = 0, then y = b₀.

This is a template for how you want to give your own interpretation. Remember to include the proper units when writing it. Note that sometimes, especially for the intercept, you will arrive at an absurd conclusion. Let us take a look at the next example.

Example (Interpreting Slope and Intercept)

New snowboarders (those who have snowboarded a year or less) often suffer minor injuries. A random sample of seven new snowboarders produced data on the number of months spent snowboarding and the number of minor injuries in the last month they snowboarded. The linear regression equation is:

    Minor Injuries = 9.5904 − 0.7349 × Months Snowboarding    (r = −0.9614)

1. Identify the slope and write a one-sentence interpretation.
2. Identify the y-intercept and write a one-sentence interpretation.
3. If a new snowboarder has snowboarded for five (5) months, how many injuries would you predict he or she had in the last month of snowboarding?
4. If a new snowboarder had 4 minor injuries after snowboarding for only 5 months, what is the residual for this amount of time?

7.3 Determining the Regression Line

In this class, you will not be asked to compute the parameters of the regression line by hand. However, we give the formulas here to make some observations:

    b₁ = r (s_y / s_x)        b₀ = ȳ − b₁x̄

where, in case you don't remember, r is the correlation coefficient, and s_x and s_y are the sample standard deviations of x and y, respectively. Note that the correlation plays a role in determining the line. Also note that, since the means and standard deviations appear in the formulas, the regression line is influenced by outliers, just as those statistics are. Finally, note that most calculators compute the coefficients of the regression line at the same time as the correlation coefficient.

7.4 Preparation for the Quiz

Practice Problems — Chapter 8: 1, 5, 27, 33, 34, 38, 40
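The formulas in Section 7.3 can also be checked numerically. The sketch below (with made-up data; you will not need to do this in class) computes b₁ and b₀ from r, the standard deviations, and the means, and confirms that they match a direct least-squares fit.

```python
import numpy as np

# Made-up illustrative data
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 4.0, 5.0, 9.0])

r = np.corrcoef(x, y)[0, 1]                      # correlation coefficient
s_x = np.std(x, ddof=1)                          # sample standard deviations
s_y = np.std(y, ddof=1)

# The formulas from Section 7.3
b1 = r * (s_y / s_x)
b0 = y.mean() - b1 * x.mean()

# Direct least-squares fit for comparison
b1_fit, b0_fit = np.polyfit(x, y, deg=1)
# Both routes give the same line: b1 = 1.25, b0 = -1.5
```

This also makes the outlier remark concrete: because the formulas are built from means and standard deviations, moving a single point far from the others shifts both b₁ and b₀.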