Activity: Regression Exploration

TEKS: (A.2) Foundations for functions. The student uses the properties and attributes of functions. The student is expected to:
(D) collect and organize data, make and interpret scatterplots (including recognizing positive, negative, or no correlation for data approximating linear situations), and model, predict, and make decisions and critical judgments in problem situations.

Overview: This is the second activity in a series of three activities that explore the concept of linear regression and residuals. Students will investigate the concept of goodness-of-fit and its significance in determining the regression line, or best-fit line, for the data. Students will develop the idea behind the least squares regression method via absolute value regression.

Materials: Graphing calculator; Regression Exploration handout; Scatterplot; Spaghetti (if students need to make their regression lines); Tape for each group; Overhead 1

Grouping: 3-4

Time: 50 to 60 minutes

Prerequisites: An activity called Spaghetti Regression is the precursor of this activity. Students use the scatterplot and the line they marked with spaghetti in that activity. If the class did not do the Spaghetti Regression activity, have students complete the steps described in procedure 1 below. Students will need to be able to write the equation of a line from a graph or from two points.

Lesson:

1. Before the lesson
   Procedure: Students will need their Scatterplot and spaghetti line from the Spaghetti Regression lesson. If students have not done the Spaghetti Regression lesson, give them a copy of the Scatterplot and a piece of spaghetti. Give each group a roll of tape. Have students examine the plot and visually determine a line of best-fit (or trend line) using a piece of spaghetti, then tape the spaghetti line onto their graph.
   Notes: This activity builds on the ideas presented in the Spaghetti Regression lesson. Each student creates their own trend line. The equations that they find later will not necessarily match anyone else's in their group.

2. Procedure: Hand out Regression Exploration.

3. Procedure: Before starting question 1 on Regression Exploration, students write the equation of their spaghetti line on the Scatterplot using function notation. Working with a partner, students verify each other's calculations in finding the equations of their spaghetti lines.
   Notes: In the directions, the lower left corner of the graph is (0, 0). Students can choose two points that are on their spaghetti line to write a linear equation in slope-intercept form.

4. Procedure: After students have verified their equations, they work questions 1 to 5 on Regression Exploration in their groups. Tell students to look for the FYIs on the handout for calculator help.
   Notes: Each student will need a graphing calculator. Encourage the calculator-capable students to help out within their groups. Discuss with the class how to set up the WINDOW on the calculator to match the scatterplot. You may also have to address with students what is meant by predicted values in question 3.

5. Procedure: Have students stop when they have finished question 5 and answered all the corresponding questions. Use Overhead 1 to facilitate a class discussion of the additional questions in #5 before proceeding further. It is important that participants understand why the absolute value or the square of each residual must be taken before the residuals are summed.

6. Procedure: Each group needs to complete question 6. In each group, determine who has the best line. As a class, discuss who has the best line.

7. Procedure: Students go on to question 7 and repeat the procedure. Make sure the class discusses the question: Did the best line in the group change? Why or why not?
   Notes: Did using a different method of examining the residuals yield different results? If the results are different, are there situations where one measure would be preferred over the other?

8. Procedure: Have groups complete questions 8 to 10. Take the time to compare the calculator's equation with those in the class. Discuss with the class the final two paragraphs of the handout. These paragraphs summarize how we can measure the goodness of fit of a linear model.
   Notes: You may need to remind students of the calculator steps needed to do a linear regression and to paste the equation into Y1. Students may feel their lines are a better fit than the calculator's. Least squares regression, as used by the calculator, is sensitive to outliers. When students write their own equations, they can ignore points that lie far outside the pattern.

Assessment/Extension: Give different groups real-life data and have them create their own line of best-fit and compare it to the calculator's equation. Explore median-median regression (another statistical method of finding a line of best fit). Students could be asked to conjecture whether the method of evaluating residuals can be extended to examine models of nonlinear data.

Modifications: Posters or a handout can be created to review the calculator steps for entering data and doing regression.
Scatterplot
Overhead 1
Regression Exploration

Objective: Investigate various methods of regression.

Whose model makes the best predictions? Let's compare everyone's line using the residuals.

Before we begin, we need to know the equation for your spaghetti function, f(x) = mx + b. Assume the lower left corner of the graph is (0, 0).

f(x) = ____________

1. Enter your function at Y1= in the calculator.

2. The data from the scatterplot is given below. Enter the x-values into L1 and the y-values into L2. Make certain that each x is entered in the same row as its corresponding y.

   FYI: You will need to turn the scatterplot on ([2nd] [Y=]). You also need to adjust the WINDOW of the calculator to match that of the scatterplot.

   x |  2   5   6  10  12  15  16  20  20
   y | 14  19   9  21   7  21  18  10  22

3. Place the predicted values, f(x_i), created by your function, in L3. To do this, place your cursor on the name L3 at the very top of the list and create the values using your function with L1 as the inputs. (Your calculator screen should look like the one pictured below.)

   FYI: Y1 can be found under [vars] [Y-vars] [1:function] 1:Y1
4. Compute the residuals (the signed vertical differences between the actual y values and the predicted values, f(x_i)) and place them in L4. This can be done by entering L4 = L2 - L3.

   FYI: Go to the top of L4 and, where it says L4 =, type L2 - L3

5. a) On your home screen compute the sum of L4, Sum(L4). Record your group's functions and the corresponding sums in the table below.

   FYI: Sum can be found under [2nd] [stat] [math] 5:sum

   Function | Sum of the residual errors

   b) Examine your values in L4. What is the meaning of a negative residual in terms of the graph and in terms of the function's predictions? What is the meaning of a positive or negative total for the functions?
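For classes that also want to check the list arithmetic away from the calculator, steps 3 through 5a can be sketched in Python. This is only an illustration: the line f(x) = 0.5x + 10 is a hypothetical spaghetti line chosen for the example, not one taken from the handout, and the variable names simply mirror the calculator lists.

```python
# Data from the handout's table (L1 and L2 on the calculator).
x = [2, 5, 6, 10, 12, 15, 16, 20, 20]
y = [14, 19, 9, 21, 7, 21, 18, 10, 22]


def f(t):
    """Hypothetical spaghetti line; substitute your own slope and intercept."""
    return 0.5 * t + 10


# Predicted values, playing the role of L3 = Y1(L1).
predicted = [f(xi) for xi in x]

# Signed residuals, playing the role of L4 = L2 - L3.
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Signed total, playing the role of Sum(L4).
total = sum(residuals)

print(residuals)
print(total)
```

With this hypothetical line the signed residuals total -2, even though several individual residuals are much larger in size than that.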
   c) Examine the following student work. In L4, what is the meaning of 39.23? What is the corresponding value in your table? Describe its meaning. What is the meaning of a low total residual error? Is it a good measure of fit? Why or why not?

There are two possible ways to fix the above problem. One way is to take the absolute value of each residual; the other is to square each residual. Taking the absolute value of the residuals is equivalent to using our spaghetti segments to measure the vertical error.
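A small Python sketch shows how a signed total can hide large errors. The line f(x) = -3x + 51 is a hypothetical example (it is not the student work referred to in part c), chosen because its positive and negative residuals happen to cancel on the handout's data.

```python
# Data from the handout's table.
x = [2, 5, 6, 10, 12, 15, 16, 20, 20]
y = [14, 19, 9, 21, 7, 21, 18, 10, 22]


def f(t):
    """Hypothetical poorly fitting line whose signed residuals cancel."""
    return -3 * t + 51


residuals = [yi - f(xi) for xi, yi in zip(x, y)]

signed_total = sum(residuals)                    # Sum(L4)
absolute_total = sum(abs(r) for r in residuals)  # Sum(abs(L4))
squared_total = sum(r ** 2 for r in residuals)   # Sum((L4)^2)

print(residuals)
print(signed_total, absolute_total, squared_total)
```

Here the signed residuals cancel exactly (total 0), while the absolute values total 160 and the squares total 3662. Both of the repaired measures expose the poor fit that the signed total conceals.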
6. Find Sum(abs(L4)). Record your group's functions and the corresponding sums in the table below.

   FYI: abs can be found under [2nd] [0]

   Function | Sum of the residual error

   Compare with those in the class to determine who now has the lowest total error.

Note: The calculator's regression method uses the squared residuals when measuring the goodness-of-fit of a regression line. Let's compare our lines of best-fit using the squared residuals.

7. Find the total of the squared residuals with Sum((L4)²). This is often referred to as the Sum of the Squared Error, denoted SSE. Record your group's functions and the corresponding sums in the table below.

   Function | SSE

   Compare with those in the class to determine who has the lowest sum of the squared error. Did the best line in the group change? Why or why not?
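The comparison in questions 6 and 7 can be sketched in Python as well. The two lines below, A: f(x) = 0.1x + 17 and B: f(x) = 0.2x + 13, are hypothetical examples invented for this illustration, and the helper name `totals` is likewise an assumption rather than anything from the handout.

```python
# Data from the handout's table.
x = [2, 5, 6, 10, 12, 15, 16, 20, 20]
y = [14, 19, 9, 21, 7, 21, 18, 10, 22]


def totals(m, b):
    """Return (sum of absolute residuals, sum of squared residuals) for f(x) = m*x + b."""
    residuals = [yi - (m * xi + b) for xi, yi in zip(x, y)]
    sum_abs = sum(abs(r) for r in residuals)
    sse = sum(r ** 2 for r in residuals)
    return sum_abs, sse


abs_a, sse_a = totals(0.1, 17)  # hypothetical line A
abs_b, sse_b = totals(0.2, 13)  # hypothetical line B

print("A:", round(abs_a, 2), round(sse_a, 2))
print("B:", round(abs_b, 2), round(sse_b, 2))
```

Line A has the smaller absolute total (about 42.6 versus 44.0), but line B has the smaller SSE (about 261.2 versus 317.5), so which line counts as best can depend on the measure used.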
Let's compare our lines against the calculator's regression line.

8. Use your calculator to compute the linear regression function, f(x) = mx + b.

   f(x) = ____________

9. Enter the function into Y1 and place the function's predicted values, f(x_i), in L3, i.e., L3 = Y1(L1).

10. Quickly compute the total squared error by using Sum((L2 - L3)²).

   SSE = ____________

   How do the functions in the class compare to this one?

At least two methods exist for evaluating goodness of fit:
(1) taking the absolute value of the residuals
(2) squaring the residuals

Although taking the absolute value seems most intuitive, squaring has several advantages. The most desirable one is that it simplifies the mathematics needed to guarantee the best line.

Understanding what you are looking for is always the toughest part of any problem, so the hard part is done. You now know how to measure goodness of fit. We can also say exactly what the calculator means by the line of best-fit. If we compute the residuals (i.e., the errors in the y direction), square each one, and add up the squares, we say the line of best-fit is the line for which that sum is the least. Since it's a sum of squares, the method is called the Method of Least Squares! This is the most commonly used method but, as we have seen, it isn't the only way!
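As a cross-check on questions 8 through 10, the least squares line can also be computed directly. The sketch below uses the standard closed-form formulas for the least squares slope and intercept; it is not a claim about how any particular calculator carries out its regression internally, and the variable names are chosen only for this example.

```python
# Data from the handout's table.
x = [2, 5, 6, 10, 12, 15, 16, 20, 20]
y = [14, 19, 9, 21, 7, 21, 18, 10, 22]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of products of deviations from the means.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)

# Closed-form least squares slope and intercept.
m = s_xy / s_xx
b = mean_y - m * mean_x

# Sum of the squared error for the fitted line, as in Sum((L2 - L3)^2).
sse = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))

print(round(m, 5), round(b, 3), round(sse, 2))
```

Replacing `m` and `b` with the slope and intercept of any spaghetti line gives that line's SSE for comparison.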
Regression Exploration Teacher Solutions

Objective: Investigate various methods of regression.

Whose model makes the best predictions? Let's compare everyone's line using the residuals.

Before we begin, we need to know the equation for your spaghetti function, f(x) = mx + b. Assume the lower left corner of the graph is (0, 0).

f(x) = Answers will vary.

1. Enter your function at Y1= in the calculator.

2. The data from the scatterplot is given below. Enter the x-values into L1 and the y-values into L2. Make certain that each x is entered in the same row as its corresponding y.

   FYI: You will need to turn the scatterplot on ([2nd] [Y=]). You also need to adjust the WINDOW of the calculator to match that of the scatterplot.

   x |  2   5   6  10  12  15  16  20  20
   y | 14  19   9  21   7  21  18  10  22

   The calculator window should match the scatterplot: XMIN = 0, XMAX = 22, YMIN = 0, YMAX = 25.

3. Place the predicted values, f(x_i), created by your function, in L3. To do this, place your cursor on the name L3 at the very top of the list and create the values using your function with L1 as the inputs. (Your calculator screen should look like the one pictured below.)

   The most common mistake students make on the calculator is entering Y1(L1) on the wrong line. The cursor needs to be on the name of the list, L3.
   FYI: Y1 can be found under [vars] [Y-vars] [1:function] 1:Y1

4. Compute the residuals (the signed vertical differences between the actual y values and the predicted values, f(x_i)) and place them in L4. This can be done by entering L4 = L2 - L3.

   FYI: Go to the top of L4 and, where it says L4 =, type L2 - L3

5. a) On your home screen compute the sum of L4, Sum(L4). Record your group's functions and the corresponding sums in the table below.

   FYI: Sum can be found under [2nd] [stat] [math] 5:sum

   Function | Sum of the residual errors

   Answers will vary. You may want different groups to post their tables for the class.
   b) Examine your values in L4. What is the meaning of a negative residual in terms of the graph and in terms of the function's predictions? What is the meaning of a positive or negative total for the functions?

   Answers will vary.

   c) Examine the following student work. In L4, what is the meaning of 39.23? What is the corresponding value in your table? Describe its meaning. What is the meaning of a low total residual error? Is it a good measure of fit? Why or why not?

   Possible answer: There was almost as much residual error below the actual values as above the actual values. That made the sum of the residual errors close to 0. This measure is not good. The sum is close to 0, which might imply that the predicted and actual values are close. If you look at the line, however, it does not seem to follow the data, and the residual errors seem very large. At x = 2, the predicted value is 40 below the actual value, and at x = 20, the predicted value is 39 above. Overall, the errors show that the predicted values of their line are not really close to any of the actual values except at x = 10.

There are two possible ways to fix the above problem. One way is to take the absolute value of each residual; the other is to square each residual. Taking the absolute value of the residuals is equivalent to using our spaghetti segments to measure the vertical error.

6. Find Sum(abs(L4)). Record your group's functions and the corresponding sums in the table below.

   FYI: abs can be found under [2nd] [0]

   Function | Sum of the residual error

   Answers will vary.

   Compare with those in the class to determine who now has the lowest total error.

   The best fit using this measure would be the equation whose sum of the absolute values of the errors is the smallest.

Note: The calculator's regression method uses the squared residuals when measuring the goodness-of-fit of a regression line. Let's compare our lines of best-fit using the squared residuals.

7. Find the total of the squared residuals with Sum((L4)²). This is often referred to as the Sum of the Squared Error, denoted SSE. Record your group's functions and the corresponding sums in the table below.
   Function | SSE

   Answers will vary.

   Compare with those in the class to determine who has the lowest sum of the squared error. Did the best line in the group change? Why or why not?

   The person with the lowest sum could change. If all the residual errors for a particular line are moderate in size, squaring them increases the sum, but not by as much as for a line with one very large residual. If just one of the residual errors is large, squaring it increases the sum by a large amount. It is possible that the person with the lowest sum in #6 does not have the lowest sum for this question.

Let's compare our lines against the calculator's regression line.

8. Use your calculator to compute the linear regression function, f(x) = mx + b.

   f(x) = 0.15615x + 13.828

9. Enter the function into Y1 and place the function's predicted values, f(x_i), in L3, i.e., L3 = Y1(L1).

10. Quickly compute the total squared error by using Sum((L2 - L3)²).

   SSE = 259.67

   How do the functions in the class compare to this one?

   No student equation can have a smaller SSE than the calculator's regression line when the SSE is computed over all of the data points, because the least squares line is, by definition, the line that minimizes this sum. Students may still feel that their own line fits better. When a student fits a line by eye, they can ignore data points that appear NOT to follow the pattern of the data (outliers), but the calculator has to consider every data point entered. A point that falls far outside the pattern can pull the calculator's regression line away from the general pattern of the majority of the data points.
At least two methods exist for evaluating goodness of fit:
(1) taking the absolute value of the residuals
(2) squaring the residuals

Although taking the absolute value seems most intuitive, squaring has several advantages. The most desirable one is that it simplifies the mathematics needed to guarantee the best line.

Understanding what you are looking for is always the toughest part of any problem, so the hard part is done. You now know how to measure goodness of fit. We can also say exactly what the calculator means by the line of best-fit. If we compute the residuals (i.e., the errors in the y direction), square each one, and add up the squares, we say the line of best-fit is the line for which that sum is the least. Since it's a sum of squares, the method is called the Method of Least Squares! This is the most commonly used method but, as we have seen, it isn't the only way!
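For teachers who want to see why squaring "simplifies the mathematics," the following sketch may help. It uses standard calculus that is beyond the scope of the student activity, and the symbols n (the number of data points) and the means \bar{x} and \bar{y} are introduced here only for illustration. Because the SSE is a smooth function of m and b, setting its two partial derivatives to zero gives a pair of linear equations with a closed-form solution:

```latex
\begin{align*}
\mathrm{SSE}(m, b) &= \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2 \\
\frac{\partial\,\mathrm{SSE}}{\partial b} &= -2 \sum_{i=1}^{n} \bigl(y_i - m x_i - b\bigr) = 0 \\
\frac{\partial\,\mathrm{SSE}}{\partial m} &= -2 \sum_{i=1}^{n} x_i \bigl(y_i - m x_i - b\bigr) = 0 \\
m &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad b = \bar{y} - m\,\bar{x}
\end{align*}
```

The second line also says that the signed residuals of the least squares line always sum to zero. The sum of absolute residuals has no comparable formula, because the absolute value function is not differentiable at zero.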