Bivariate Data Page 1

Scatterplots and Correlation

Essential Question: What is the correlation coefficient and what does it tell you?

Most statistical studies examine data on more than one variable. Fortunately, analysis of several-variable data builds on the tools we used to examine individual variables. The principles that guide our work also remain the same: plot the data, then add numerical summaries. Look for overall patterns and deviations from those patterns. When there is a regular overall pattern, use a simplified model to describe it.

We think that car weight helps explain accident deaths and that smoking influences life expectancy. In these relationships, the two variables play different roles. Accident death rate and life expectancy are the response variables of interest. Car weight and number of cigarettes smoked are the explanatory variables. It is easiest to identify explanatory and response variables when we actually specify values of one variable to see how it affects another variable. When we don't specify the values of either variable but just observe both, there may or may not be explanatory and response variables. Whether there are depends on how you plan to use the data.

Scatterplots

The most useful graph for displaying the relationship between two quantitative variables is a scatterplot. Always plot the explanatory variable, if there is one, on the horizontal axis (the x-axis) of a scatterplot. To make a scatterplot:
1) Decide which variable should go on each axis.
2) Label and scale your axes.
3) Plot individual data values.

example: Sprint time (seconds) vs. long-jump distance (inches)
Sprint time: 5.41  5.05  9.49  8.09  7.01  7.17  6.83  6.73  8.01  5.68  5.78  6.31  6.04
Long jump:   171   184   48    151   90    65    94    78    71    130   173   143   141

**AP EXAM TIP: Always be sure to label the axes of your graph.
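The three steps above can be sketched in Python. This is only an illustration (the notes assume a graphing calculator); matplotlib is just one common way to draw the plot, and the variable names are mine:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line to view the plot interactively
import matplotlib.pyplot as plt

# Sprint time is the explanatory variable (x-axis); long-jump distance is the response (y-axis).
sprint = [5.41, 5.05, 9.49, 8.09, 7.01, 7.17, 6.83, 6.73, 8.01, 5.68, 5.78, 6.31, 6.04]
jump = [171, 184, 48, 151, 90, 65, 94, 78, 71, 130, 173, 143, 141]

plt.scatter(sprint, jump)
plt.xlabel("Sprint time (seconds)")        # label the axes (AP exam tip)
plt.ylabel("Long-jump distance (inches)")
plt.title("Sprint time vs. long-jump distance")
plt.savefig("sprint_vs_jump.png")
```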
Interpreting Scatterplots

When interpreting scatterplots we use DOFS: D = direction, O = outliers, F = form, and S = strength.

Look for a clear direction (correlation): positive, negative, or none. (Example plots show a positive, a negative, and no correlation.) Sometimes it can be difficult to see a direction, but you should look for the overall pattern in the graph.

Unusual points are classified as outliers or influential points. If a point has a typical x-value but an extreme y-value, it is an outlier. If a point has an unusual x-value and a y-value that does not follow the pattern, it is an influential point.

The graph displays a scatterplot of the number of missing assignments and a student's score on the quiz. We can see there is a negative correlation between the two variables. Points A, B, and C are all unusual. Point A has a typical x-value but an extreme y-value; therefore point A is an outlier. Point B has an unusual x-value and a y-value that does not follow the pattern; therefore point B is an influential point. Point C has an unusual x-value, but its y-value follows the pattern; therefore it is an outlier.
Next we look at the form of the scatterplot. Here we are looking to see if the graph appears linear, or if it may be quadratic, cubic, or exponential. (Example plots show linear, quadratic, and exponential forms.)

Finally, we look at the strength of the correlation. The correlation coefficient, represented by the letter r, gives us a value for the strength of the correlation. The value of r falls between -1 and 1 inclusive, and it has no units. When the value of r is negative we have a negative correlation, and when r is positive we have a positive correlation. The strength of the correlation is given below:
0.8 ≤ |r| ≤ 1.0 : strong correlation
0.5 ≤ |r| < 0.8 : moderate correlation
0 ≤ |r| < 0.5 : weak correlation
|r| is used because a value of 0.85, whether positive or negative, would be considered a strong correlation.

The correlation coefficient can be calculated using the formula:
r = (1/(n-1)) * Σ [ ((x_i - x̄)/s_x) * ((y_i - ȳ)/s_y) ]
where (x_i, y_i) represents each ordered pair, x̄ and ȳ are the average x and y values, s_x and s_y are the standard deviations for each variable, and n is the number of observations.

example: Body weight and backpack weight
Body weight:     120  187  109  103  131  165  158  116
Backpack weight:  26   30   26   24   29   35   31   28

  x      z_x       y     z_y      z_x * z_y
 120   -0.5322    26   -0.7582     0.4036
 187    1.6793    30    0.3972     0.6670
 109   -0.8953    26   -0.7582     0.6789
 103   -1.0934    24   -1.3359     1.4607
 131   -0.1692    29    0.1083    -0.0183
 165    0.9531    35    1.8414     1.7551
 158    0.7220    31    0.6860     0.4953
 116   -0.6642    28   -0.1805     0.1199

x̄ = 136.125, s_x = 30.296; ȳ = 28.625, s_y = 3.462
The products sum to 5.5622, and we divide by 7 because of the n - 1 in the formula: r = .795
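The hand computation above can be checked with a short Python sketch that applies the same z-score formula for r (the variable names are mine, not from the notes):

```python
from statistics import mean, stdev

# Body weight (x) and backpack weight (y) from the example above
x = [120, 187, 109, 103, 131, 165, 158, 116]
y = [26, 30, 26, 24, 29, 35, 31, 28]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1 in the denominator)

# r = (1/(n-1)) * sum of z_x * z_y
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)) / (n - 1)
print(round(r, 3))  # 0.795, matching the worked example
```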
Least Squares Regression

Essential Question: What is a least squares regression line and what does it tell us?

A regression line is a line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.

example: Does fidgeting keep you slim? Some people don't gain weight even when they overeat. Perhaps fidgeting and other "nonexercise activity" (NEA) explain why: some people may spontaneously increase nonexercise activity when fed more. Researchers deliberately overfed 16 healthy young adults for 8 weeks. They measured fat gain (in kilograms) as the response variable and change in energy use (in calories) from activity other than deliberate exercise (fidgeting, daily living, and the like) as the explanatory variable. Below are the data:

NEA change: -94  -57  -29  135  143  151  245  355  392  473  486  535  571  580  620  690
Fat gain:   4.2  3.0  3.7  2.7  3.2  3.6  2.4  1.3  3.8  1.7  1.6  2.2  1.0  0.4  2.3  1.1

A scatterplot of the data shows a moderately strong negative linear correlation: r = -.7786

Using our calculator we can find the Least Squares Regression Line (LSRL) for the data. The linear equation will be of the form ŷ = a + bx. We use ŷ (read "y hat") for the predicted value of y, a for the y-intercept, and b for the slope. The calculator finds:
ŷ = 3.505 - 0.00344x

When interpreting the slope we say: "On average, for every 1 unit increase in the explanatory variable we would see a (slope value) unit increase/decrease in the response variable." For this problem: "On average, for every additional calorie of NEA we would see a 0.00344 kg decrease in fat gain."

We can now use our LSRL to make predictions about y. For example, if we wanted to know the fat gain when NEA is 425 calories, we just plug in 425 for x: ŷ = 3.505 - 0.00344(425) ≈ 2.04 kg. Be sure to use ŷ any time you are referring to a predicted value.
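The calculator's output can be reproduced with the formulas b = r(s_y/s_x) and a = ȳ - bx̄. This NumPy sketch is only an illustration of that arithmetic, not the notes' calculator procedure:

```python
import numpy as np

# NEA change (calories) and fat gain (kg) from the overfeeding example
nea = np.array([-94, -57, -29, 135, 143, 151, 245, 355, 392, 473, 486, 535, 571, 580, 620, 690])
fat = np.array([4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3, 3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1])

r = np.corrcoef(nea, fat)[0, 1]            # correlation coefficient, about -0.7786
b = r * fat.std(ddof=1) / nea.std(ddof=1)  # slope: b = r * s_y / s_x
a = fat.mean() - b * nea.mean()            # intercept: a = y_bar - b * x_bar

print(f"y_hat = {a:.3f} + ({b:.5f})x")     # y_hat = 3.505 + (-0.00344)x
print(f"predicted fat gain at NEA = 425: {a + b * 425:.2f} kg")
```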
Sometimes you are asked to predict a y-value for an x-value much larger or smaller than the values used to create the LSRL. When this happens, be aware that an extrapolation error could occur. In other words, the predicted value you find for y may not be valid because of the extreme x-value used to find it.
A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is:
residual = observed y - predicted y, or residual = y - ŷ
A positive residual means that the observed value is greater than the value predicted by the LSRL. A negative residual means that the observed value is less than the value predicted by the LSRL.

To find a residual, we select an x-value from our table and use the LSRL to make a prediction for the y-value. For example, for x = 135 we have an observed y-value of 2.7 and ŷ = 3.0406. Therefore the residual is y - ŷ = 2.7 - 3.0406 = -0.3406.

Note: The least squares regression line of y on x is the line that makes the sum of the squared residuals as small as possible.

Graphing residuals
You can plot the residual values on the y-axis and the explanatory values on the x-axis; this gives you a residual plot. The residual plot allows us to see if a line is a good fit for the data. If the points appear to be randomly scattered, then a line is an appropriate fit for the data. Because the points in this example appear random, we can say that a linear regression is an appropriate fit for the data.

The LSRL can be found using the calculator or by using the formulas:
b = r * (s_y / s_x)  and  a = ȳ - b*x̄
where r is the correlation coefficient, x̄ and ȳ are the average values of x and y, and s_x and s_y are the standard deviations for the x and y values.

The Coefficient of Determination tells us how well the LSRL fits the data. We use r² to represent the coefficient of determination. r² values fall between 0 and 1; a value closer to 1 means the line is a better fit.
r² = 1 - SSE/SST
where SSE (the sum of squared errors, i.e. the squared residuals) = Σ(y_i - ŷ_i)² and SST (the total sum of squares) = Σ(y_i - ȳ)².
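The residual and r² computations can be sketched in Python, using the rounded LSRL from the example (variable names are mine):

```python
import numpy as np

nea = np.array([-94, -57, -29, 135, 143, 151, 245, 355, 392, 473, 486, 535, 571, 580, 620, 690])
fat = np.array([4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3, 3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1])

a, b = 3.505, -0.00344    # LSRL from the example (rounded)
pred = a + b * nea        # y_hat for every x in the table
resid = fat - pred        # residual = observed y - predicted y

print(round(resid[3], 4)) # residual at x = 135: -0.3406, as in the example

# Coefficient of determination: r^2 = 1 - SSE/SST
sse = np.sum(resid ** 2)               # sum of squared residuals
sst = np.sum((fat - fat.mean()) ** 2)  # total variation in y
print(round(1 - sse / sst, 3))
```

A residual plot is then just a scatter of `nea` against `resid` (for example with matplotlib's `plt.scatter`).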
The coefficient of determination is the fraction of the variation in the values of y that is accounted for by the LSRL of y on x. We would say: "The LSRL accounts for (r² value)% of the variation in the (response variable)."

The standard deviation of the residuals gives the approximate size of a "typical" or "average" prediction error (residual). To calculate the standard deviation we use:
s = sqrt( Σ(y_i - ŷ_i)² / (n - 2) )
Here we divide by n - 2 because the regression line estimates two parameters, the slope and the intercept.
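The standard deviation of the residuals for the NEA example can be checked the same way (a sketch with my own names, again using the rounded LSRL from the notes):

```python
import numpy as np

nea = np.array([-94, -57, -29, 135, 143, 151, 245, 355, 392, 473, 486, 535, 571, 580, 620, 690])
fat = np.array([4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3, 3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1])

resid = fat - (3.505 - 0.00344 * nea)      # residuals from the rounded LSRL
n = len(nea)

s = np.sqrt(np.sum(resid ** 2) / (n - 2))  # divide by n - 2, not n - 1
print(round(s, 2))                         # about 0.74 kg: a typical prediction error
```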