Student Notes Prep Session Topic: Exploring Content

The AP Statistics topic outline contains a long list of items in the category titled Exploring Data. Section D topics will be reviewed in this session.

D. Exploring bivariate data
1. Analyzing patterns in scatterplots
2. Correlation and linearity
3. Least squares regression line
4. Residual plots, outliers, and influential points
5. Transformations to achieve linearity: logarithmic and power transformations

Formulas Provided
The following formulas related to this topic are provided on the formula sheet:

$\hat{y} = b_0 + b_1 x$ (note: your calculator uses a and b rather than $b_0$ and $b_1$)

$b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ (note: it is unlikely you will need this formula. The slope will be found on the calculator or given in computer output)

$b_0 = \bar{y} - b_1 \bar{x}$ (note: this formula simply tells us that the ordered pair $(\bar{x}, \bar{y})$ lies on the line $\hat{y} = b_0 + b_1 x$)

$r = \dfrac{1}{n-1} \sum \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$

$b_1 = r \dfrac{s_y}{s_x}$

Calculator Use
You may need to use your calculator to create a scatter plot, compute the equation of a least squares regression line (and the values of r and r²), graph the regression line with the data, and create a residual plot. Computer output and graphs are generally provided with bivariate data analysis questions, but you cannot be sure that they will be.
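To see how the formula-sheet expressions fit together, here is a minimal Python sketch. The five data points are hypothetical (invented for illustration, not taken from any exam), and the variable names are assumptions of this sketch. It computes the slope, intercept, and correlation directly from the formulas and checks that the slope also equals r times s_y / s_x.

```python
import math

# Hypothetical data, invented purely for illustration (not from any exam).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross products about the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Slope and intercept from the formula sheet.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Sample standard deviations (divisor n - 1) and the correlation coefficient.
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

print(f"b1 = {b1:.3f}, b0 = {b0:.3f}, r = {r:.3f}")
print(f"r * s_y / s_x = {r * s_y / s_x:.3f}")  # same value as b1
```

On the exam you would read these values from the calculator or from computer output; the sketch only shows that the different formulas agree with one another.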
Communication, skills, and understanding

1. If you are asked to make a scatter plot or residual plot, be sure to include a title, labels on the horizontal and vertical axes, and scales on both axes. In a scatter plot, the explanatory (predictor) variable is on the horizontal axis and the response variable is on the vertical axis.
2. When you describe the information provided by a scatter plot, be sure to comment on the direction, shape (form), and strength of the relationship in context. Also comment on unusual features (such as outliers).
3. The least squares regression line passes through the point $(\bar{x}, \bar{y})$ and has slope $r \frac{s_y}{s_x}$. So given the values of $\bar{x}$, $\bar{y}$, $s_x$, $s_y$, and $r$, you can find the equation of the least squares regression line using the point-slope form of a linear equation: $\hat{y} - \bar{y} = r \frac{s_y}{s_x} (x - \bar{x})$.
4. When you write the equation of a least squares regression line, be sure to include the hat on the y variable and be sure to identify both variables in your equation.
5. If asked to interpret the intercept or slope, be sure that your interpretation is in the context of the problem. Also make sure that you do not make a deterministic statement about the slope or intercept. The intercept provides an estimate for the value of y when x is zero. The slope provides information about the estimated amount that the y variable changes (or the amount that the y variable changes on average) for each unit change in the x variable.
6. Residual = observed y value − predicted y value = $y - \hat{y}$.
7. To determine whether the model is a good fit for the data, examine a residual plot and make sure that the residuals are randomly scattered about the horizontal axis.
8. Be careful in using the least squares line to predict outside the domain of the observed values of the explanatory variable. Extrapolation is risky!
9. The least squares line is the line that minimizes the sum of the squared residuals.
10. An influential point is a point that noticeably affects the slope of the regression line when removed from (or added to) the data set. An outlier is a point that noticeably stands apart from the other points.
11. The magnitude of the correlation coefficient gives information about the strength of the linear relationship between two quantitative variables over the observed domain. That is, correlation provides information about how tightly points are clustered about a line.
12. If asked to interpret the correlation coefficient, be sure to comment on the strength and direction of the relationship in context.
13. The correlation coefficient is sensitive to the effect of outliers.
14. The magnitude of the correlation coefficient does not provide information about whether a linear model is appropriate. You must also consider the residual plot.
15. When there is a strong linear association between two variables, the value of r is close to 1 or −1; when there is a very weak linear association between two variables, the value of r is close to 0. A value of r close to 0 could be associated with a strong curved relationship.
16. A strong association does not imply causation.
17. The coefficient of determination, r², gives the proportion of variation in the observed y values that can be attributed to the linear relationship with the x variable. You must be able to interpret r² in context.
18. The transformation (x, ln y) or (x, log y) will straighten data that can be modeled by an exponential function. An exponential function $y = ab^x$ does not pass through the origin.
19. The transformation (ln x, ln y) or (log x, log y) will straighten data that can be modeled by a power function. A power function $y = ax^n$ does pass through the origin.
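The two transformations in items 18 and 19 can be illustrated with a short Python sketch. The data below are hypothetical and noise-free (generated from chosen constants, not taken from any exam), and `fit_line` is a helper written for this sketch. Fitting a line to the transformed points recovers the constants of the original exponential or power function.

```python
import math

def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line for the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

# Hypothetical exponential data y = a * b**x with a = 3, b = 2 (no noise).
ex = [0, 1, 2, 3, 4, 5]
ey = [3 * 2 ** x for x in ex]

# (x, ln y): ln y = ln a + (ln b) x is linear in x.
slope, intercept = fit_line(ex, [math.log(y) for y in ey])
a_exp, b_exp = math.exp(intercept), math.exp(slope)

# Hypothetical power data y = a * x**n with a = 3, n = 2 (x > 0 so ln x exists).
px = [1, 2, 3, 4, 5]
py = [3 * x ** 2 for x in px]

# (ln x, ln y): ln y = ln a + n ln x is linear in ln x.
n_pow, ln_a = fit_line([math.log(x) for x in px], [math.log(y) for y in py])
a_pow = math.exp(ln_a)

print(f"exponential: a = {a_exp:.3f}, b = {b_exp:.3f}")
print(f"power:       a = {a_pow:.3f}, n = {n_pow:.3f}")
```

With real data the transformed points will not fall exactly on a line, so a residual plot of the transformed fit is still needed to judge which model is appropriate.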
Multiple Choice Questions from 1997 Exam

8. There is a linear relationship between the number of chirps made by the striped ground cricket and the air temperature. A least squares fit of some data collected by a biologist gives the model ŷ = 25.2 + 3.3x, 9 < x < 25, where x is the number of chirps per minute and ŷ is the estimated temperature in degrees Fahrenheit. What is the estimated increase in temperature that corresponds to an increase of 5 chirps per minute?
A. 3.3°F
B. 16.5°F
C. 25.2°F
D. 28.5°F
E. 41.7°F

31. The equation of the least squares regression line for the points on the scatterplot (not pictured) is ŷ = 1.3 + 0.73x. What is the residual for the point (4, 7)?
A. 2.78
B. 3.00
C. 4.00
D. 4.22
E. 7.00

Multiple Choice Questions from 2002 Exam

6. The correlation between two scores X and Y equals 0.8. If both the X scores and the Y scores are converted to z-scores, then the correlation between the z-scores for X and the z-scores for Y would be
A. −0.8
B. −0.2
C. 0.0
D. 0.2
E. 0.8

17. A least squares regression line was fitted to the weights (in pounds) versus age (in months) of a group of many young children. The equation of the line is ŷ = 16.6 + 0.65x, where ŷ is the predicted weight and x is the age of the child. A 20-month-old child in this group has an actual weight of 25 pounds. Which of the following is the residual weight, in pounds, for this child?
A. −7.85
B. −4.60
C. 4.60
D. 5.00
E. 7.85
31. A wildlife biologist is interested in the relationship between the number of chirps per minute for crickets and temperature. Based on the collected data, the least squares regression line is ŷ = 10.53 + 3.41x, where x is the number of degrees Fahrenheit by which the temperature exceeds 50°F and ŷ is the number of chirps per minute. Which of the following best describes the meaning of the slope of the least squares regression line?
A. For each increase in temperature of 1°F, the estimated number of chirps per minute increases by 10.53.
B. For each increase in temperature of 1°F, the estimated number of chirps per minute increases by 3.41.
C. For each increase of one chirp per minute, there is an estimated increase in temperature of 10.53°F.
D. For each increase of one chirp per minute, there is an estimated increase in temperature of 3.41°F.
E. The slope has no meaning because the units of measure for x and y are not the same.

34. Each of 100 laboratory rats has available both plain water and a mixture of water and caffeine in their cages. After 24 hours, two measures were recorded for each rat: the amount of caffeine the rat consumed, X, and the rat's blood pressure, Y. The correlation between X and Y was 0.428. Which of the following conclusions is justified on the basis of this study?
A. The correlation between X and Y in the population of rats is also 0.428.
B. If the rats stop drinking the water/caffeine mixture, this would cause a reduction in their blood pressure.
C. About 18 percent of the variation in blood pressure can be explained by a linear relationship between blood pressure and caffeine consumed.
D. Rats with lower blood pressure do not like the water/caffeine mixture as much as do rats with higher blood pressure.
E. Since the correlation is not very high, the relationship between the amount of caffeine consumed and blood pressure is not linear.
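The slope and residual calculations asked for in the two 1997 questions above can be checked with a few lines of Python. This is only a sketch: the function names `predict` and `residual` are chosen for illustration, and the numbers are the ones printed in those questions (slope 3.3 per chirp, and the line ŷ = 1.3 + 0.73x with the point (4, 7)).

```python
def predict(intercept, slope, x):
    """Predicted response from a fitted line y-hat = intercept + slope * x."""
    return intercept + slope * x

def residual(intercept, slope, x, y_observed):
    """Residual = observed y minus predicted y."""
    return y_observed - predict(intercept, slope, x)

# Cricket question: the intercept cancels when comparing two x values,
# so the estimated change is just slope * (change in x).
temp_change = 3.3 * 5

# Residual question: line y-hat = 1.3 + 0.73x, observed point (4, 7).
res = residual(1.3, 0.73, 4, 7)

print(round(temp_change, 1))  # 16.5
print(round(res, 2))          # 2.78
```

Note the order in the residual: observed minus predicted, so a point above the line has a positive residual.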
MC Answers: 8-B 31-A 6-E 17-B 31-B 34-C

1999 #1

1. Lydia and Bob were searching the Internet to find information on air travel in the United States. They found data on the number of commercial aircraft flying in the United States during the years 1990-1998. The dates were recorded as years since 1990. Thus, the year 1990 was recorded as year 0. They fit a least squares regression line to the data. The graph of the residuals and part of the computer output for their regression are given below.

A. Is a line an appropriate model to use for these data? What information tells you this?
B. What is the value of the slope of the least squares regression line? Interpret the slope in the context of this situation.
C. What is the value of the intercept of the least squares regression line? Interpret the intercept in the context of this situation.
D. What is the predicted number of commercial aircraft flying in 1992?
E. What was the actual number of commercial aircraft flying in 1992?
AP STATISTICS 1999 SCORING GUIDELINES Question 1

Solution:
a. Yes. The test for slope indicates that the linear model is useful (H0: β = 0, Ha: β ≠ 0, t = 54.11, p-value = .000), and the residual plot shows no pattern, indicating a linear model is appropriate.
b. Slope = 233.517 aircraft/year. On average, the number of commercial aircraft flying in the U.S. increased by approximately 233.517 each year. (OK if rounded to 234 in interpretation)
c. Intercept = 2939.93 aircraft. The predicted number of commercial aircraft that were flying in 1990 (since x = 0 corresponds to year 1990) was 2939.93. (OK if rounded to 2940 in interpretation)
d. For 1992, x = 2, so the predicted number of commercial aircraft flying is 2939.93 + 233.517(2) = 3406.964 aircraft.
e. From the residual plot, the residual for 1992 is +40, so actual − predicted = 40 and actual = 3406.964 + 40 = 3446.964 aircraft. Since the actual number flying must be an integer, the actual number must have been 3447.

Notes:
Part (a) can be considered essentially correct even if it fails to mention the t test, as long as it discusses the residual plot.
Parts (b) and (c) should draw the distinction between the model and the data. They can be considered essentially correct if the student incorporates the idea of estimation using words such as on average, predicted, approximately, about, etc.
Parts (b) and (c) can be considered partially correct if the student (1) incorrectly identifies the values for the slope and intercept but gives an essentially correct interpretation OR (2) correctly identifies the values for the slope and intercept but gives an incomplete interpretation or an interpretation not in context for one or both.
Parts (d) and (e) can be considered essentially correct if incorrect numbers from previous parts are correctly substituted.
Part (e) can be considered essentially correct even if it fails to round to an integer.

Points:
4 Complete Response
Gives an essentially correct response to all 5 parts.
3 Substantial Response
Essentially correct on 4 of the 5 parts. OR Essentially correct responses on a, d, and e AND partially correct responses on both b and c.
2 Developing Response
Essentially correct on 3 of the 5 parts. OR Partially correct responses on both b and c AND essentially correct responses on 2 of the remaining parts.
1 Minimal Response
Essentially correct on 1 or 2 of the 5 parts. OR Partially correct responses on both b and c.
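The arithmetic in parts (d) and (e) of the 1999 solution can be retraced in Python. This sketch assumes the fitted line has intercept 2939.93 and slope 233.517 and that the year in question is 1992 (x = 2); those are the values that reproduce the prediction of 3406.964 stated in the solution. The residual of +40 is the value the solution reads from the residual plot.

```python
# Assumed fitted line (values consistent with the stated prediction 3406.964).
intercept = 2939.93   # predicted aircraft in 1990 (x = 0)
slope = 233.517       # estimated increase in aircraft per year

# Years are recorded as years since 1990; 1992 is an assumption of this sketch.
x = 1992 - 1990

# Part (d): predicted value from the regression line.
predicted = intercept + slope * x

# Part (e): residual = actual - predicted, so actual = predicted + residual.
residual = 40         # read from the residual plot, per the solution
actual = predicted + residual

print(round(predicted, 3))  # 3406.964
print(round(actual))        # 3447, since a count of aircraft is an integer
```

Rearranging the residual definition in this way is the key step in part (e): the plot gives the residual and the line gives the prediction, so the observed value follows by addition.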
2005 #3

3. The Great Plains Railroad is interested in studying how fuel consumption is related to the number of railcars for its trains on a certain route between Oklahoma City and Omaha. A random sample of 10 trains on this route has yielded the data in the table below. A scatterplot, a residual plot, and the output from the regression analysis for these data are shown below.
A. Is a linear model appropriate for modeling these data? Clearly explain your reasoning.
B. Suppose the fuel consumption cost is $25 per unit. Give a point estimate (single value) for the change in the average cost of fuel per mile for each additional railcar attached to a train. Show your work.
C. Interpret the value of r² in the context of this problem.
D. Would it be reasonable to use the fitted regression equation to predict the fuel consumption for a train on this route if the train had 65 railcars? Explain.
AP STATISTICS 2005 SCORING GUIDELINES Question 3

Solution
Part (a): Yes, the linear model is appropriate for these data. The scatterplot shows a strong, positive, linear association between the number of railcars and fuel consumption, and the residual plot shows a reasonably random scatter of points above and below zero.
Part (b): According to the regression output, fuel consumption will increase by 2.15 units for each additional railcar. Since the fuel consumption cost is $25 per unit, the average cost of fuel per mile will increase by approximately ($25)(2.15) = $53.75 for each railcar that is added to the train.
Part (c): The regression output indicates that r² = 96.7% or 0.967. Thus, 96.7% of the variation in the fuel consumption values is explained by using the linear regression model with number of railcars as the explanatory variable.
Part (d): No, the data set does not contain any information about fuel consumption for any trains with more than 50 cars. Using the regression model to predict the fuel consumption for a train with 65 railcars, known as extrapolation, is not reasonable.

Scoring
Each part is scored as essentially correct (E), partially correct (P), or incorrect (I).

Part (a) is essentially correct (E) if the model is deemed appropriate AND the explanation clearly indicates:
There is a linear pattern in the scatterplot; OR
There is no pattern in the residual plot.
Part (a) is partially correct (P) if the:
Model is deemed appropriate AND the student refers to the scatterplot or residual plot but fails to state the relevant characteristic of the plot; OR
Student refers to the relevant characteristic of the scatterplot or residual plot without deeming the model appropriate.
Part (a) is incorrect (I) if the student:
States that the model is appropriate without an explanation; OR
States that the model is inappropriate; OR
Makes a decision based only on numeric values from the computer output.
Part (b) is essentially correct (E) if the point estimate for the slope (2.15 or 2.1495) and the fuel consumption cost per unit ($25) are used to calculate the correct point estimate ($53.75 or $53.7375 ≈ $53.74).
Part (b) is partially correct (P) if only the point estimate for the slope (2.15 or 2.1495) is stated with a supporting calculation or interpretation.
Part (c) is essentially correct (E) if the student states:
96.7% of the variation in fuel consumption is explained by the linear regression model; OR
96.7% of the variation in fuel consumption is explained by the number of railcars.
Part (c) is partially correct (P) if the student makes one of the above statements using R-Sq(adj) = 96.3%.
Part (d) is essentially correct (E) if the student states that this is unreasonable due to extrapolation.
Part (d) is partially correct (P) if the student states this is:
Unreasonable but provides a weak explanation; OR
Reasonable even though it is considered a slight extrapolation.

Note: Any answer appearing without supporting work is scored as incorrect (I).

Each essentially correct (E) response counts as 1 point; each partially correct (P) response counts as 1/2 point.
4 Complete Response
3 Substantial Response
2 Developing Response
1 Minimal Response

Note: If a response is in between two scores (for example, 2 1/2 points), use a holistic approach to determine whether to score up or down depending on the strength of the response and communication.
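The calculation in part (b) and the extrapolation check in part (d) of the 2005 solution can be sketched in Python. The sketch assumes a slope of 2.15 (or 2.1495 unrounded) fuel units per railcar and a cost of $25 per unit, the values that reproduce the $53.75 and $53.7375 point estimates in the guidelines. The upper bound of 50 railcars comes from the part (d) solution; the variable names are illustrative.

```python
# Assumed values consistent with the point estimates in the guidelines.
slope_rounded = 2.15      # fuel units per additional railcar (rounded)
slope_unrounded = 2.1495  # same slope as printed in the regression output
cost_per_unit = 25        # dollars per unit of fuel

# Part (b): change in average fuel cost per mile for each added railcar.
cost_rounded = slope_rounded * cost_per_unit
cost_unrounded = slope_unrounded * cost_per_unit

print(f"${cost_rounded:.2f}")    # $53.75
print(f"${cost_unrounded:.4f}")  # $53.7375

# Part (d): the solution states no train in the data had more than 50 cars.
max_railcars_observed = 50
railcars_to_predict = 65
is_extrapolation = railcars_to_predict > max_railcars_observed
print(is_extrapolation)  # True
```

The check in part (d) compares the new x value only with the range of observed x values; it says nothing about how well the line fits inside that range.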
Sample: 3A Score: 4

In part (a) the student's comment that the original data appears linear is too vague on its own to earn credit. However, the subsequent statement about the residual plot being randomly distributed is sufficient. The student gives a clear explanation for a correct calculation in part (b). Although the response makes no mention of the linear model in part (c), it does convey a generally correct understanding of what r² measures. In part (d) the response shows a clear understanding of why extrapolation is not appropriate.