Chapter 3. Introduction to Linear Correlation and Regression Part 3


Richard Lowry, 1999-2000. All rights reserved.

Regression

The appearance of the term regression at this point (literally, "backward movement") is something of an historical accident. It could just as easily have been called progression. The basic concept is the same one we found for correlation, though now it has added to it the visual imagery of movement: essentially, of two things, two variables, moving together. As indicated earlier, correlation and regression are two sides of the same statistical coin. When you measure the linear correlation of two variables, what you are in effect doing is laying out a straight line that best fits the average "together-movement" of those two variables. That line is spoken of as the line of regression, and its utility is not only as a device for helping us visualize the relationship between the two variables. It can also serve very usefully as a basis for making rational predictions.

To illustrate, consider again our 1993 SAT correlation. Assuming that the negative correlation observed for that year is likely to recur in subsequent years, you are now in a position to predict a state's average SAT score for some subsequent year, before the results are reported, simply on the basis of knowing the percentage of students within the state who take the SAT that year. If 10% of the high school seniors within a state take the SAT, it is a fairly safe bet that the average combined SAT score for that state will be somewhere in the vicinity of 1,010; perhaps a bit higher or lower, but in any event somewhere in the vicinity. If 70% of the high school seniors in some other state take the SAT, it is a fairly safe bet that the average for that state will be nowhere near 1,010, but rather somewhere in the vicinity of 880. Regression analysis provides a rational foundation for making such predictions; it also provides a basis for specifying precisely what we mean by "somewhere in the vicinity."

As we noted earlier, when you perform the computational procedures for linear correlation and regression, what you are essentially doing is defining the straight line that best fits the bivariate distribution of data points. The criterion for "best fit" is that the sum of the squared vertical distances between the data points and the regression line must be as small as possible. The slant of the resulting line will correspond to the direction of correlation (upward, +; downward, −), and the tightness of the data points around the line will correspond to the strength of the correlation. You can think of the regression line as representing the average relationship that exists between X and Y, as observed within this particular sample.

The location and orientation of the regression line are defined by two quantities, spoken of as regression constants, that can be easily derived from the results of calculations already performed in Table 3.2. These are

a = the point at which the line crosses the Y axis (the "intercept"); and
b = the rate at which the line angles upward or downward along the X axis (the "slope").

The computational formulas for these two quantities are quite simple and can be introduced without elaborate comment.

For the slope: b = SC_XY / SS_X

For the intercept: a = M_Y − b·M_X

Before we perform these calculations for the SAT data, I think it might be useful to illustrate the process with a simpler data set. For this purpose, consider yet again the pairing of X_i and Y_i values that produced the positive correlation shown in Example II of Figure 3.3.
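If you prefer to see this arithmetic spelled out in code, here is a minimal Python sketch of the two formulas. The function name fit_regression_line and the variable names are my own, not part of the text.

def fit_regression_line(x, y):
    """Return (intercept a, slope b) of the least-squares regression line of y on x.

    Uses the same quantities as the text: SS_X (sum of squared deviates of X),
    SC_XY (sum of co-deviates of X and Y), and the means M_X and M_Y.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)                           # SS_X
    sc_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))   # SC_XY
    b = sc_xy / ss_x             # slope
    a = mean_y - b * mean_x      # intercept
    return a, b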

Pair:   a    b    c    d    e    f
X_i:    1    2    3    4    5    6        mean of X = 3.5
Y_i:    6    2    4   10   12    8        mean of Y = 7.0

SS_X = 17.5    SS_Y = 70.0    SC_XY = 23.0

Given these previously calculated values:

slope: b = SC_XY / SS_X = 23.0 / 17.5 = +1.31

intercept: a = M_Y − b·M_X = 7.0 − [1.31 × 3.5] = 2.4

In the following graph I show the same figure that appears above, but now constructed in such a way as to emphasize the intercept and slope of the regression line. The intercept, shown on the left-hand side of the graph, is the point at which the regression line crosses the vertical Y axis, providing that the Y axis is lined up with the point on the horizontal axis where X is equal to zero. (Be careful with this, because scatter plots do not always begin the X axis at X = 0.) The slope of the regression line is indicated by the green pattern in the graph that looks like a flight of stairs. What this pattern shows is that for each increase of one unit in the value of X, the value of Y increases by 1.31 units. Thus, when X is equal to zero, Y is equal to the intercept, which is 2.4; when X = 1.0, Y is equal to the intercept plus 1.31 (i.e., 2.4 + 1.31 = 3.71); when X = 2.0, Y is equal to the intercept plus 2.62 (i.e., 2.4 + 2.62 = 5.02); and so on.
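As a check on the arithmetic, the fit_regression_line sketch above can be applied to this six-pair data set; the expected output (b close to +1.31, a close to 2.4) is noted in the comments.

x = [1, 2, 3, 4, 5, 6]
y = [6, 2, 4, 10, 12, 8]

a, b = fit_regression_line(x, y)
print(f"slope b = {b:.2f}")       # expected: about +1.31
print(f"intercept a = {a:.2f}")   # expected: about 2.4 (2.415 before rounding)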

Now we perform the same calculations for the data set of our 1993 SAT correlation. In Table 3.2 we have already arrived at the summary values

X = percentage of high school seniors taking the SAT
Y = state average combined SAT score

mean of X = 36.32    mean of Y = 952.54
SS_X = 36,764.88    SS_Y = 231,478.42    SC_XY = −79,627.64

Given these values, the slope of the regression line can be calculated as

b = SC_XY / SS_X = −79,627.64 / 36,764.88 = −2.17

and the intercept as

a = M_Y − b·M_X = 952.54 − [−2.17 × 36.32] = 1,031.35

For this data set, the regression line intercepts the vertical axis at the point where Y is equal to 1,031.35, and then slants downward (−) 2.17 units of Y for each unit of X. Thus, when X is equal to zero, Y is equal to 1,031.35; when X = 10, Y is equal to the intercept minus 2.17 × 10 (i.e., 1,031.35 − 21.7 = 1,009.65); when X = 20, Y is equal to the intercept minus 2.17 × 20 (i.e., 1,031.35 − 43.4 = 987.95); and so on.
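When only the summary values are at hand, as in Table 3.2, the two constants can be computed directly from them. The short sketch below does just that, using the figures quoted above; because it carries the unrounded slope, its intercept differs slightly from the text's 1,031.35, which uses the rounded slope −2.17.

mean_x, mean_y = 36.32, 952.54
ss_x, sc_xy = 36_764.88, -79_627.64

b = sc_xy / ss_x           # slope: about -2.17
a = mean_y - b * mean_x    # intercept: about 1031.2 here; 1031.35 in the text with b rounded to -2.17
print(f"b = {b:.2f}, a = {a:.2f}")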

These are the mechanics of regression in a nutshell; now to the logic and strategy of prediction. If the observed correlation between two variables, X and Y, proves to be statistically significant, that is, unlikely to have occurred through mere chance coincidence, the rational presumption is that it pertains not just to this particular sample of X_iY_i pairs, but to the relationship between X and Y in general. And once you know the relationship between X and Y in general, you are then in a position to figure out the value of Y_i that is likely to be associated with any particular newly observed value of X_i.

The procedure for making such a prediction is illustrated pictorially below. From the observed correlation in this 1993 sample, we infer that the general relationship between X and Y can be described by a regression line that has an intercept of a = 1,031.35 and a slope of b = −2.17. Suppose, now, that for some subsequent year a certain state has X_i = 10% of its high school seniors taking the SAT. If you wanted to predict the average SAT score for that state, the obvious way to proceed would be to start with the observed value of X_i = 10%, go straight up to the line of regression, and then turn left to see where you end up on the Y axis. That will be your predicted value of Y, which as you can see from the graph is something quite close to Y = 1,010. For X_i = 50%, on the other hand, the predicted value is in the vicinity of Y = 925.

In practice, of course, the predicted values of Y are not arrived at graphically, but through calculation. For any particular observed linear correlation between two variables, X and Y, the value of Y to be predicted on the basis of a newly observed value of X_i is given by the following formula. Please note, however, that this version of the formula is only preliminary. There is something we will need to add to it a bit later.

predicted Y:  Y′ = a + bX_i

Try this formula out with a few different values of X_i and you will see that it arrives mathematically, hence more precisely, at the same result that would be reached through the graphical method shown above. The formula does it by starting at a, the point at which the regression line intercepts the Y axis, and then moving up or down the Y axis (depending on the direction of the correlation) one unit of slope (b) for each unit of X.

for X_i = 10%:  Y′ = 1,031.35 + (−2.17 × 10) = 1,009.65
for X_i = 50%:  Y′ = 1,031.35 + (−2.17 × 50) = 922.85

Now we are of course not claiming for either of these cases that the actual values of Y will fall precisely at the points we have calculated. All we can rationally assert is that actual values of Y for the case where X_i = 10% will tend to approximate the predicted regression-line value of 1,009.65; that actual values of Y for the case where X_i = 50% will tend to approximate the predicted regression-line value of 922.85; and so on for any other values of X_i that fall within the range of X_i values observed within the sample. It will probably be intuitively obvious to you that the strength of this "tendency to approximate" will be determined by the strength of the correlation observed within the original sample: the stronger the observed correlation, the more closely the actual values of Y will tend to approximate their predicted values; and conversely, the weaker the correlation, the greater will be the tendency of the actual values of Y to deviate from their predicted values.
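A compact way to express the prediction step is as a small helper that applies Y′ = a + bX_i. The sketch below is my own illustration, not part of the original text, and simply reproduces the two worked values.

def predict_y(a, b, x_new):
    """Preliminary predicted value Y' = a + b * X_i (no error term yet)."""
    return a + b * x_new

a, b = 1031.35, -2.17
print(f"{predict_y(a, b, 10):.2f}")   # 1009.65
print(f"{predict_y(a, b, 50):.2f}")   # 922.85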

A moment ago I indicated that the formula for a predicted value of Y needs to have something added to it. What needs to be added is a measure of probable error, something that reflects the strength of the observed correlation, hence the strength of the tendency for actual values of Y to approximate their predicted values. Although the full conceptual background for this step will not be available until we have covered some basic concepts of probability, it is possible at this point to convey at least a practical working knowledge of it. Within the context of linear regression, the measure of probable error is a quantity spoken of as the standard error of estimate. Essentially, it is a kind of standard deviation.

Here again is the scatter plot for the 1993 SAT correlation. In your mind's eye, please try to envision a green line extending straight up or straight down from each of the blue data points to the red regression line. Each of these imaginary green lines is a measure of the degree to which the associated data point deviates (along the Y axis) from the regression line. Square each of these distances, then take the sum of those squares, and you will have a sum of squared deviates. In statistical parlance, each deviate (the imaginary green line) is spoken of as a residual, so the sum of their squares can be denoted as the sum of squared residuals, which we will abbreviate as SS_residual. At any rate, divide this sum of squared deviates (residuals) by N, and you will have a variance. Take the square root of that variance, and you will have a standard deviation. As it happens, the sum of squared residuals can be arrived at mathematically through the simple formula

SS_residual = SS_Y × (1 − r²)

Recall that r² is the proportion of variability in Y that is associated with variability in X, and that 1 − r² is the proportion (residual) that is not associated with variability in X. Multiplying SS_Y by 1 − r² therefore gives you the portion of SS_Y that is residual, "left over," not accounted for by the correlation between X and Y.

For the 1993 SAT example, this yields

SS_residual = 231,478.42 × (1 − 0.86²) = 60,184.38

Divide this quantity by N, and you will have the residual variance of Y: 60,184.38 / 50 = 1,203.69. Take the square root of it, and you will have the standard deviation of the residuals: sqrt[1,203.69] = ±34.69.

This standard deviation of the residuals is almost, but not quite, equivalent to the standard error of estimate. The difference is that the quantity we have just calculated is purely descriptive; it pertains only to this particular sample of paired X_iY_i values, whereas the standard error of estimate aims to reach beyond the sample into the realm of events as yet unobserved. This extension from the particular sample of X_iY_i pairs to the relationship between X and Y in general is achieved through the simple expedient of dividing SS_residual by N − 2 rather than by N. The rationale for this N − 2 denominator will have to wait until a later chapter. For the moment, suffice it to say that the standard error of estimate, which we will abbreviate as SE, is given by the formula

SE = sqrt[SS_residual / (N − 2)]

For the present example, our standard error of estimate is therefore

SE = sqrt[60,184.38 / (50 − 2)] = ±35.41

In brief: on the basis of what we have observed within our sample of X_iY_i pairs, we estimate that if the regression line of the sample were to be applied to the entire population of pairs, the Y residuals of the population would have a standard deviation somewhere very close to ±35.41.

The next version of the SAT scatter plot shows how all of this applies to the task of prediction. A parallel line drawn 35.41 units of Y above the regression line will give you +1 standard error of estimate; one drawn 35.41 units of Y below the regression line will give you −1 standard error of estimate; and the inference (details in a later chapter) is that the range between +1 SE and −1 SE will include approximately two-thirds of all the X_iY_i pairs within the population. Thus, when you predict an unknown value of Y according to the formula Y′ = a + bX_i, the true value of Y has about a two-thirds chance of falling within plus-or-minus 35.41 points of your predicted value, that is, within plus-or-minus 1 standard error of estimate.
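These quantities follow directly from SS_Y, r, and N. Here is a minimal sketch, assuming the rounded summary values quoted above; small rounding differences from the text's figures are to be expected.

import math

ss_y, r, n = 231_478.42, -0.86, 50

ss_residual = ss_y * (1 - r ** 2)               # about 60,277 with r rounded to -0.86; the text reports 60,184.38
sd_residuals = math.sqrt(ss_residual / n)       # descriptive standard deviation of the residuals (~34.7)
se_estimate = math.sqrt(ss_residual / (n - 2))  # standard error of estimate (~35.4)

print(f"SS_residual = {ss_residual:.2f}")
print(f"SD of residuals = {sd_residuals:.2f}")
print(f"SE of estimate = {se_estimate:.2f}")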

In making predictions of this type, the convention is to state the predicted value not simply as Y′ but rather as Y′ plus-or-minus 1 standard error of estimate, that is, as

Y′ ± SE

Thus, our predicted state average SAT scores for the cases where 10% and 50% of a state's high school seniors take the test are, in their full form,

for X_i = 10%:  Y′ = 1,031.35 + (−2.17 × 10) ± 35.41 = 1,009.65 ± 35.41
for X_i = 50%:  Y′ = 1,031.35 + (−2.17 × 50) ± 35.41 = 922.85 ± 35.41

That is, for X_i = 10% we predict that the corresponding value of Y has a two-thirds chance of falling between Y = 974.24 and Y = 1,045.06; for X_i = 50%, we predict that the corresponding value of Y has a two-thirds chance of falling between Y = 887.44 and Y = 958.26; and so on. Providing that the sample is adequately representative of the relationship between X and Y in general, we can expect approximately two-thirds of the entire "population" of X_iY_i pairs to fall within the range defined by plus-or-minus 1 standard error of estimate, and only about one-third to fall outside that range. Hence, any particular prediction of the general form Y′ ± SE will have about a two-thirds chance of catching the true value of Y in its net and only a one-third chance of missing it.

Another way of expressing this concept is in terms of confidence. For a linear-regression prediction of this general form, you can be about two-thirds confident that the true value of Y will fall within ±1 SE of the predicted value. In a later chapter we will examine procedures by which you can increase the confidence you might have in an estimate or a prediction to much higher levels, such as 95% or 99%.
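Putting the pieces together, here is a short sketch (again my own illustration, not the text's) that reports a prediction in the conventional Y′ ± SE form.

def predict_with_interval(a, b, se, x_new):
    """Return (Y', lower, upper) for the rough two-thirds range Y' plus-or-minus 1 SE."""
    y_hat = a + b * x_new
    return y_hat, y_hat - se, y_hat + se

a, b, se = 1031.35, -2.17, 35.41
for x_new in (10, 50):
    y_hat, lo, hi = predict_with_interval(a, b, se, x_new)
    print(f"X = {x_new}%: Y' = {y_hat:.2f}, about a 2/3 chance the true Y lies in [{lo:.2f}, {hi:.2f}]")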

But the proof, as they say, is in the pudding. If you examine the SAT data for any testing year subsequent to 1993, you will find that about two-thirds of the actual values of Y do in fact fall within the range defined by the regression line of the 1993 sample, plus-or-minus 1 SE. Hence any particular prediction of the form Y′ ± SE would have had about a two-thirds chance of falling within the net.

In Part 2 of this chapter we noted briefly that the first question to be asked of an observed correlation is whether it reflects anything other than mere chance coincidence. It is now time to take that question up in greater depth; however, as it is a question whose implications extend far beyond the confines of correlation and regression, we will make it a separate chapter.*

*Note, however, that Chapter 3 also has two subchapters, 3a [Partial Correlation] and 3b [Rank-Order Correlation], examining a couple of aspects of correlation not covered in the main body of the chapter.

End of Chapter 3.