Name                Class                Date
Chapter 4, Lesson 8

4-8 Residuals and Linear Regression
Going Deeper

Essential question: How can you use residuals and linear regression to fit a line to data?

You can evaluate a linear model's goodness of fit using residuals. A residual is the difference between an actual value of the dependent variable and the value predicted by the linear model. After calculating residuals, you can draw a residual plot, which is a scatter plot of points whose x-coordinates are the values of the independent variable and whose y-coordinates are the corresponding residuals.

Whether a linear fit is suitable, and how good that fit is, depends on the distribution of the residuals, as illustrated below.

[Figure: three residual plots with the following captions.]
Distribution of residuals about the x-axis is random and tight. A linear fit to the data is suitable and strong.
Distribution of residuals about the x-axis is random but loose. A linear fit to the data is suitable but weak.
Distribution of residuals about the x-axis is not random. A linear fit to the data may not be suitable.

1  EXAMPLE  Creating a Residual Plot and Evaluating Fit    S-ID.2.6b

Using t as the years since 1970 and a_F as the median age of females, a student fit the line a_F = 0.25t + 29 to the data shown in the table. Make a residual plot and evaluate the goodness of fit.

    Year    Median Age of Females
    1970    29.2
    1980    31.3
    1990    34.0
    2000    36.5
    2010    38.2

A  Calculate the residuals. Substitute each value of t into the equation to find the value predicted for a_F by the linear model. Then subtract the predicted value from the actual value to find the residual.

    t     a_F actual    a_F predicted    Residual
    0     29.2          29.0             0.2
    10    31.3
    20    34.0
    30    36.5
    40    38.2
B  Plot the residuals.

[Blank grid for the residual plot: horizontal axis "t values" from 0 to 50 in steps of 10; vertical axis (residuals) from -0.8 to 0.8 in steps of 0.4.]

C  Evaluate the suitability of a linear fit and the goodness of the fit.

Is there a balance between positive and negative residuals?

Is there a pattern to the residuals? If so, describe it.

Is the absolute value of each residual small relative to a_F (actual)? For instance, when t = 0, the residual is 0.2 and the value of a_F is 29.2, so the relative size of the residual is 0.2/29.2 ≈ 0.7%, which is quite small.

What is your overall evaluation of the suitability and goodness of the linear fit?

REFLECT

1a. Suppose the line of fit with equation a_F = 0.25t + 29 is changed to a_F = 0.25t + 28.8. What effect does this change have on the residuals? On the residual plot? Is the new line a better fit to the data? Explain.
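
The residual calculation in Part A and the relative-size check in Part C can be sketched in Python. This is only an illustration and not part of the original lesson: the names `t_values`, `actual`, `predict`, `residuals`, and `relative_pct` are chosen here, and the line used is the student's 0.25t + 29 from the Example.

```python
# Data from the table: t = years since 1970, actual = median age of females.
t_values = [0, 10, 20, 30, 40]
actual = [29.2, 31.3, 34.0, 36.5, 38.2]


def predict(t):
    """Value predicted by the student's line 0.25t + 29."""
    return 0.25 * t + 29


# Residual = actual value - predicted value (rounded to remove float noise).
residuals = [round(a - predict(t), 1) for t, a in zip(t_values, actual)]

# Relative size of each residual, as a percent of the actual value.
relative_pct = [round(abs(r) / a * 100, 1) for r, a in zip(residuals, actual)]

for t, a, r, pct in zip(t_values, actual, residuals, relative_pct):
    print(f"t={t:2d}  actual={a}  predicted={predict(t)}  residual={r}  ({pct}%)")
```

Plotting `residuals` against `t_values` gives the points of the residual plot in Part B.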
You can use a graphing calculator to fit a line to a set of paired numerical data that have a strong positive or negative correlation. The calculator uses a method called linear regression, which involves minimizing the sum of the squares of the residuals.

2  EXPLORE  Comparing Sums of Squared Residuals    S-ID.2.6c

Suppose in the first Example one person came up with the equation a_F = 0.25t + 29.0 while another came up with a_F = 0.25t + 28.8 where, in each case, t is the time in years since 1970 and a_F is the median age of females.

A  Complete each table below in order to calculate the squares of the residuals for each line of fit.

Table for a_F = 0.25t + 29.0

    t     a_F (actual)    a_F = 0.25t + 29.0 (predicted)    Residual    Square of residual
    0     29.2            29.0                              0.2         0.04
    10    31.3
    20    34.0
    30    36.5
    40    38.2

Table for a_F = 0.25t + 28.8

    t     a_F (actual)    a_F = 0.25t + 28.8 (predicted)    Residual    Square of residual
    0     29.2            28.8                              0.4         0.16
    10    31.3
    20    34.0
    30    36.5
    40    38.2
B  Find the sum of the squared residuals for each line of fit.

Sum of squared residuals for a_F = 0.25t + 29.0: ______
Sum of squared residuals for a_F = 0.25t + 28.8: ______

C  Identify the line that has the smaller sum of the squared residuals.

REFLECT

2a. If you use a graphing calculator to perform linear regression on the data, you obtain the equation a_F = 0.232t + 29.2. Complete the table to calculate the squares of the residuals and then the sum of the squares for this line of fit.

    t     a_F (actual)    a_F = 0.232t + 29.2 (predicted)    Residual    Square of residual
    0     29.2            29.2                               0           0
    10    31.3
    20    34.0
    30    36.5
    40    38.2

Sum of squared residuals: ______

2b. Explain why the model a_F = 0.232t + 29.2 is a better fit to the data than a_F = 0.25t + 29.0 or a_F = 0.25t + 28.8.
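
The comparison in this Explore can be checked with a short Python sketch. It is an illustration only: the function name `sum_squared_residuals` and the dictionary `lines` are introduced here, and the three lines are the two hand-fitted lines plus the calculator's regression line from Reflect 2a.

```python
# Data from the first Example: t = years since 1970, actual = median age.
t_values = [0, 10, 20, 30, 40]
actual = [29.2, 31.3, 34.0, 36.5, 38.2]


def sum_squared_residuals(slope, intercept):
    """Sum of (actual - predicted)^2 for the line slope * t + intercept."""
    return sum((a - (slope * t + intercept)) ** 2
               for t, a in zip(t_values, actual))


# The three candidate lines, stored as (slope, intercept) pairs.
lines = {
    "0.25t + 29.0": (0.25, 29.0),
    "0.25t + 28.8": (0.25, 28.8),
    "0.232t + 29.2": (0.232, 29.2),
}

# Round to three decimals to remove floating-point noise.
ssr = {name: round(sum_squared_residuals(m, b), 3)
       for name, (m, b) in lines.items()}
best = min(ssr, key=ssr.get)  # line with the smallest sum

for name, value in ssr.items():
    print(f"{name}: {value}")
print("smallest sum of squared residuals:", best)
```

A smaller sum means the line's predictions are, overall, closer to the actual values, which is the criterion linear regression minimizes.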
Because linear regression produces an equation for which the sum of the squared residuals is as small as possible, the line obtained from linear regression is sometimes called the least-squares regression line. It is also called the line of best fit. Not only will a graphing calculator automatically find the equation of the line of best fit, but it will also give you the correlation coefficient and display the residual plot.

3  EXAMPLE  Performing Linear Regression on a Graphing Calculator    S-ID.2.6c

The table gives the distances (in meters) that a discus was thrown by men to win the gold medal at the Olympic Games from 1920 to 1964. (No Olympic Games were held during World War II.) Use a graphing calculator to find the line of best fit, to find the correlation coefficient, and to evaluate the goodness of fit.

    Year of Olympic Games    Men's Gold Medal Discus Throw (meters)
    1920                     44.685
    1924                     46.155
    1928                     47.32
    1932                     49.49
    1936                     50.48
    1940                     No Olympics
    1944                     No Olympics
    1948                     52.78
    1952                     55.03
    1956                     56.36
    1960                     59.18
    1964                     61.00

A  Identify the independent and dependent variables, and specify how you will represent them.

The independent variable is time. Since the graphing calculator uses the variables x and y, let x represent time. To simplify the values of x, define x as years since 1920 so that, for instance, x = 0 represents 1920 and x = 44 represents 1964. Then x = ____ represents 1924, x = ____ represents 1928, x = ____ represents 1932, and so on.

The dependent variable is the distance that won the gold medal for the men's discus throw. Let y represent that distance.

B  Enter the paired data into two lists, L1 and L2, on your graphing calculator after pressing STAT.

Do the distances increase or decrease over time? What does this mean for the correlation?
C  Create a scatter plot of the paired data using STAT PLOT. The calculator will choose a good viewing window and plot the points automatically if you press ZOOM and select ZoomStat.

Describe the correlation.

D  Perform linear regression by pressing STAT and selecting LinReg(ax+b) from the CALC menu. The calculator reports the slope a and y-intercept b of the line of best fit. It also reports the correlation coefficient r.

Does the correlation coefficient agree with your description of the correlation in Part C? Explain.

E  Graph the line of best fit by pressing Y=, entering the equation of the line of best fit, and then pressing GRAPH. You should round the values of a and b when entering them so that each has at most 4 significant digits.

What is the equation of the line of best fit?

F  Create a residual plot by replacing L2 with RESID in STAT PLOT as the choice for Ylist. (You can select RESID from the NAMES menu after pressing 2nd STAT.)

Evaluate the suitability and goodness of the fit.
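
For readers without a graphing calculator, the quantities reported in Parts D and F can be sketched in Python. This is an assumption-laden illustration rather than the calculator's actual routine: it uses the standard least-squares formulas for slope and intercept and the usual formula for the correlation coefficient, and the variable names (`sxx`, `sxy`, `syy`, `resid`) are chosen here.

```python
import math

# x = years since 1920, y = winning distance in meters (from the table).
x = [0, 4, 8, 12, 16, 28, 32, 36, 40, 44]
y = [44.685, 46.155, 47.32, 49.49, 50.48, 52.78, 55.03, 56.36, 59.18, 61.00]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squared deviations and cross deviations from the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

a = sxy / sxx                    # slope of the least-squares line
b = mean_y - a * mean_x          # y-intercept
r = sxy / math.sqrt(sxx * syy)   # correlation coefficient

# Residuals, the analogue of the calculator's RESID list.
resid = [yi - (a * xi + b) for xi, yi in zip(x, y)]

print(f"a = {a:.4f}, b = {b:.4f}, r = {r:.4f}")
```

Plotting `resid` against `x` gives the same kind of residual plot as Part F, and the residuals of a least-squares line always sum to zero apart from rounding.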
REFLECT

3a. Interpret the slope and y-intercept of the line of best fit in the context of the data.

3b. Use the line of best fit to make predictions about the distances that would have won gold medals if the Olympic Games had been held in 1940 and 1944. Are the predictions interpolations or extrapolations?

3c. Several Olympic Games were held prior to 1920. Use the line of best fit to make a prediction about the distance that would have won a gold medal in the 1908 Olympics. What value of x must you use? Is the prediction an interpolation or an extrapolation? How does the prediction compare with the actual value of 40.89 meters?

PRACTICE

Throughout these exercises, use a graphing calculator.

1. The table gives the distances (in meters) that a discus was thrown by men to win the gold medal at the Olympic Games from 1968 to 2008.

    Year of Olympic Games    Men's Gold Medal Discus Throw (meters)
    1968                     64.78
    1972                     64.40
    1976                     67.50
    1980                     66.64
    1984                     66.60
    1988                     68.82
    1992                     65.12
    1996                     69.40
    2000                     69.30
    2004                     69.89
    2008                     68.82

   a. Find the equation of the line of best fit.
   b. Find the correlation coefficient.
   c. Evaluate the suitability and goodness of the fit.
   d. Does the slope of the line of best fit for the 1968-2008 data equal the slope of the line of best fit for the 1920-1964 data? If not, speculate about why this is so.
2. Women began competing in the discus throw in the 1928 Olympic Games. The table gives the distances (in meters) that a discus was thrown by women to win the gold medal at the Olympic Games from 1928 to 1964.

    Year of Olympic Games    Women's Gold Medal Discus Throw (meters)
    1928                     39.62
    1932                     40.58
    1936                     47.63
    1940                     No Olympics
    1944                     No Olympics
    1948                     41.92
    1952                     51.42
    1956                     53.69
    1960                     55.10
    1964                     57.27

   a. Find the equation of the line of best fit.
   b. Find the correlation coefficient.
   c. Evaluate the suitability and goodness of the fit.

3. Research the distances that a discus was thrown by women to win the gold medal at the Olympic Games from 1968 to 2008. Explain why a linear model is not appropriate for the data.

4. The table lists the median heights (in centimeters) of girls and boys from age 2 to age 10. Choose either the data for girls or the data for boys.

    Age (years)    Median Height (cm) of Girls    Median Height (cm) of Boys
    2              84.98                          86.45
    3              93.92                          94.96
    4              100.75                         102.22
    5              107.66                         108.90
    6              114.71                         115.39
    7              121.49                         121.77
    8              127.59                         128.88
    9              132.92                         133.51
    10             137.99                         138.62

   a. Identify the real-world variables that x and y will represent.
   b. Find the equation of the line of best fit.
   c. Find the correlation coefficient.
   d. Evaluate the suitability and goodness of the fit.
Additional Practice 4-8

1. The data in the table are graphed at right along with two lines of fit.

    0 2 4 6 7 3 4 6

   a. Find the sum of the squares of the residuals for 3 9.
   b. Find the sum of the squares of the residuals for 1 5. 2
   c. Which line is a better fit for the data?

2. Use the data in the table to answer the questions that follow.

    x    5    6    6.5    7.5    9
    y    0    1    3      2      4

   a. Find an equation for a line of best fit.
   b. What is the correlation coefficient?
   c. How well does the line represent the data?
   d. Describe the correlation.

3. Use the data in the table to answer the questions that follow.

    x    10    8      6      4      2
    y    1     1.1    1.2    1.3    1.5

   a. Find an equation for a line of best fit.
   b. What is the correlation coefficient?
   c. How well does the line represent the data?
   d. Describe the correlation.

4. The table shows the number of pickles four students ate during the week versus their grades on a test. The equation of the least-squares line is y ≈ 2.11x + 79.28, and r ≈ 0.97. Discuss correlation and causation for the data set.

    Pickles Eaten    0     2     5     10
    Test Score       77    85    92    99
Problem Solving

1. The table shows the number of hours different players practice basketball each week and the number of baskets each player scored during a game.

    Player      Hours Practiced    Baskets Scored
    Alan        5                  6
    Brenda      10                 11
    Caleb       7                  8
    Shawn       2                  4
    Fernando    0                  2
    Gabriela    21                 19

   a. Find an equation for a line of best fit. Round decimals to the nearest tenth.
   b. Interpret the meaning of the slope and y-intercept.
   c. Find the correlation coefficient.

2. Use your equation above to predict the number of baskets scored by a player who practices 40 hours a week. Round to the nearest whole number.
   A  32 baskets
   B  33 baskets
   C  34 baskets
   D  35 baskets

3. Which is the best description of the correlation?
   F  strong positive
   G  weak positive
   H  weak negative
   J  strong negative

4. Given the data, what advice can you give to a player who wants to increase the number of baskets he or she scores during a game?
   A  Practice more hours per week.
   B  Practice fewer hours per week.
   C  Practice the same hours per week.
   D  There is no way to increase baskets.

5. Do the data support causation, correlation, or chance?
   F  correlation
   G  causation
   H  chance
   J  chance and correlation