Chapter 14
Simple Linear Regression

Regression analysis is too big a topic for just one chapter in these notes. If you have an interest in this methodology, I recommend that you consider taking a course on regression. At the UW-Madison, for example, there is an excellent semester course on regression analysis, Statistics 333, that is offered at least once per year.

14.1 The Scatterplot and Correlation Coefficient

For each subject/trial/unit, or case, as they tend to be called in regression, we have two numbers, denoted by X and Y. The number of greater interest to us is denoted by Y and is called the response; predictor is the common term for the X variable. Very roughly speaking, we want to study whether there is an association or relationship between X and Y, with special interest in the question of using a case's value of X to describe or predict its value of Y. It is very important to remember that the distinction between experimental and observational studies introduced in Chapter 9 applies here too, in a way that will be discussed below.

We have data on n cases. When we think of them as random variables we use upper-case letters, and when we think of specific numerical values we use lower-case letters. Thus, we have the n pairs
$$(X_1, Y_1), (X_2, Y_2), (X_3, Y_3), \ldots, (X_n, Y_n),$$
which take on specific numerical values
$$(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n).$$

The difference between experimental and observational lies in how we obtain the X's. There are two possibilities:

1. Experimental: The researcher deliberately selects the values of $X_1, X_2, X_3, \ldots, X_n$.
2. Observational: The researcher selects units (usually assumed to be selected at random from a population or to be i.i.d. trials) and observes the values of two random variables per unit.

Here are two brief examples.

1. Experimental: The researcher is interested in the yield, per acre, of a certain crop; denote the yield by Y. The researcher believes that the yield will be affected by the concentration of a certain fertilizer that will be applied to the plants; the variable X represents the concentration of the fertilizer. The researcher selects n one-acre plots for study and selects n numbers, $x_1, x_2, x_3, \ldots, x_n$, for the concentrations of the fertilizer. Finally, the researcher assigns the n selected values of X to the n plots by randomization.

2. Observational: The researcher selects n men at random from a population of men and measures each man's height, X, and weight, Y.

Note that for the experimental setting, the X's are not random variables, but for the observational setting, the X's are random variables. It turns out that several key components of the desired analysis become impossible for random X's, so the standard practice (and all that we will cover here) is to condition on the values of the X's when they are random and henceforth pretend that they are not random. The main consequence of conditioning this way is that the computations and predictions still make sense, but we must temper our enthusiasm in our conclusions. Just as in Chapter 9, for the experimental setting we can claim causation, but for the observational setting, the most we get is association.

First, we will learn how to draw a picture of our data, called the scatterplot. Below I present eight scatterplots that come from three sets of data:

1. Seven offensive variables were determined for each of the 16 teams in Major League Baseball's National League. These data are presented in Table 14.1.
2. Three additional variables for the same 16 teams. (Actually, one variable, runs, is common to both data sets.) These data are presented in Table 14.2.
3. Two variables, the scores on the midterm and final exams, for 36 students in one of my sections of Statistics 371. These data are presented later in this chapter.

In most scientific problems, we are quite sure which variable should be the response Y, but might have a number of candidates for X. For example, for the data in Table 14.1, the obvious choice, to me, for Y is the number of runs scored by the team. (If you are a baseball fan, this position of mine likely makes sense; if you are not a baseball fan, don't worry about this issue.) Any one of the remaining six variables could be taken as X.

Before I proceed, let me digress and mention a topic we will not be considering in this chapter, but that is of great interest in science. This topic is covered in any course devoted to regression analysis. In this chapter, we restrict our attention to problems with exactly one X variable. The use of one predictor is conveyed by the adjective simple, and the method we learn is simple regression analysis. It is often desirable in science to allow for two or more predictors; this situation is conveyed by the adjective multiple, and the method is referred to as multiple regression analysis. A regression analyst often has either or both of the goals of using the predictor(s) to describe or predict the value of the response. Rather obviously, there is no reason to restrict our attention to only one predictor. (For example, if we want to predict the height of an adult male, it seems sensible to use both of his parents' heights as predictors.)
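Although this chapter sticks to one predictor, the multiple case mentioned in the digression above is easy to preview in code. Here is a minimal sketch, not part of the original notes, that fits the parents'-heights example by least squares; all of the height numbers are invented for illustration.

```python
# A preview of multiple regression (not covered in this chapter): predicting a
# son's height from both parents' heights. All numbers below are invented.
import numpy as np

father = np.array([70.0, 68.5, 72.0, 66.0, 69.5, 71.0, 67.5, 73.0])
mother = np.array([64.0, 62.5, 66.0, 61.0, 63.5, 65.0, 60.5, 67.0])
son    = np.array([70.5, 68.0, 72.5, 65.5, 70.0, 71.5, 66.0, 74.0])

# Design matrix with an intercept column; lstsq minimizes the sum of
# squared errors, the same criterion used throughout this chapter.
X = np.column_stack([np.ones_like(father), father, mother])
coef, *_ = np.linalg.lstsq(X, son, rcond=None)
b0, b1, b2 = coef
print(b0, b1, b2)          # intercept and the two slopes
print(X @ coef)            # fitted heights for the eight sons
```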

Table 14.1: Various Team Statistics, National League. The columns are Team, Runs, Triples, Home Runs, BA, OBP, SLG and OPS, with one row for each of the 16 teams: Philadelphia, Colorado, Milwaukee, LA Dodgers, Florida, Atlanta, St. Louis, Arizona, Washington, Chicago Cubs, Cincinnati, NY Mets, San Francisco, Houston, San Diego and Pittsburgh. [Numerical entries not reproduced.]

For the data in Table 14.2, the natural choice for Y is the number of wins achieved by the team during the 162-game regular season. Either of the remaining variables (or both, but not in this chapter) is an attractive candidate for the predictor. As baseball fans know, the total number of runs a team scores is a measure of its offensive performance, and its earned run average (ERA) is a measure of the effectiveness of its pitching staff. (Side note: of the ten variables in the two tables that have been discussed, ERA is unique in that it is the only variable for which smaller values reflect a better performance.)

For the last of our three examples, it seems more natural, to me, to let Y denote the student's score on the final and X denote the student's score on the midterm. With this set-up we will learn how to use a particular student's midterm score to predict his/her score on the final exam. But the methods we learn could be applied to the reverse problem: using the score on the final to predict the score on the midterm. For this latter situation, Y would be the midterm score and X the final score.

Take a minute and quickly do a visual scan of the scatterplots in Figures 14.1 through 14.8. First, I need to explain how to read a scatterplot. Look at Figure 14.1. Locate the circle that is farthest to the left in the picture. You can see that its x (horizontal) coordinate is approximately 20 and its y (vertical) coordinate is approximately 730. Now, look at Table 14.1 again and note that Atlanta has x = 20 and y = 735; thus, this circle in the picture presents the values of x and y for Atlanta. Similarly, each of the 16 circles in the scatterplot represents a different team's values of x and y. Take a minute now, please, and make sure you are able to locate the circles for Philadelphia (x = 35 and y = 820) and Colorado.
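Since reading scatterplots is central to this chapter, here is a minimal sketch of how a plot like Figure 14.1 could be drawn with matplotlib. All of the (triples, runs) pairs below are hypothetical stand-ins except the two quoted in the text, Atlanta (20, 735) and Philadelphia (35, 820); the tables' actual entries are not reproduced in these notes.

```python
# A minimal sketch of drawing a scatterplot like Figure 14.1.
# All pairs are hypothetical except Atlanta (20, 735) and Philadelphia (35, 820).
import matplotlib.pyplot as plt

triples = [20, 35, 23, 29, 34, 26, 31, 40, 27, 24, 30, 37, 42, 33, 36, 25]
runs = [735, 820, 673, 780, 772, 640, 730, 720,
        710, 707, 673, 671, 657, 643, 638, 636]

# Open circles, matching the look described in the text.
plt.scatter(triples, runs, facecolors="none", edgecolors="black")
plt.xlabel("Triples")
plt.ylabel("Runs scored")
plt.title("Runs scored versus triples (hypothetical data)")
plt.show()
```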

Table 14.2: Wins, Runs Scored and Earned Run Average for the National League Teams. (One game, Pittsburgh at Chicago, was canceled; I arbitrarily count it as a victory for Chicago.) The columns are Team, Wins, Runs and ERA, with one row for each of the 16 teams: LA Dodgers, Milwaukee, Philadelphia, Cincinnati, Colorado, San Diego, St. Louis, Houston, San Francisco, Arizona, Florida, NY Mets, Atlanta, Pittsburgh, Chicago Cubs and Washington. [Numerical entries not reproduced.]

Figure 14.1: Scatterplot of Runs Scored Versus the Number of Triples. [Plot and its r value not reproduced.]

Figure 14.2: Scatterplot of Runs Scored Versus Batting Average. [Plot and its r value not reproduced.]

Figure 14.3: Scatterplot of Runs Scored Versus the Number of Home Runs. [Plot and its r value not reproduced.]

Figure 14.4: Scatterplot of Runs Scored Versus OPS. [Plot and its r value not reproduced.]

Figure 14.5: Scatterplot of Wins Versus Runs Scored. [Plot and its r value not reproduced.]

Figure 14.6: Scatterplot of Wins Versus Earned Run Average. [Plot and its r value not reproduced.]

Figure 14.7: Final Exam Score Versus Midterm Exam Score for 36 Students. There is a 2 in the scatterplot because two subjects had (x, y) = (55.5, 96.0). [Plot and its r value not reproduced.]

Figure 14.8: Final Exam Score Versus Midterm Exam Score for 35 Students, After Deleting One Isolated Case. There is a 2 in the scatterplot because two subjects had (x, y) = (55.5, 96.0). [Plot and its r value not reproduced.]

Now look at Figure 14.3, the scatterplot of runs scored versus home runs. Scan the picture, running your eyes from left to right, which corresponds to increasing the value of x (going from the lowest value of x, which is farthest to the left, to the largest value of x, which is farthest to the right). As your eyes scan left to right, what happens to the circles? Well, there is not a deterministic relationship, such as all the circles being on a line or a curve; but there is a tendency for the circles to rise (i.e., for y to become larger) as the x values become larger. Also, in my judgment, the tendency looks like a straight-line tendency rather than a (more complicated) curved tendency.

Now look at the remaining three scatterplots of runs versus some X. Here are the conclusions I draw:

1. In all scatterplots the tendency between x and y appears to be linear (a straight line) rather than curved.

2. In the first scatterplot, in which X is the number of triples, the tendency, while linear, is neither increasing nor decreasing; it is just flat. In the remaining three scatterplots the tendency is definitely increasing. The increasing tendency is strongest for X equal to OPS and weakest for X equal to BA, with X equal to Home Runs falling between these extremes.

Each scatterplot also presents a number denoted by r, one value for each choice of X (the number of triples, BA, Home Runs and OPS). This number r is called the (Pearson's product moment) correlation coefficient. There is an algebraic formula for r, but I will not present it, for two reasons:

1. It is quite a mess to compute; as a result, we will use a computer to obtain its value.

2. There is some insight to be gained by examining the algebraic formula for r, but not enough to squeeze this topic into our ever-diminishing remaining time. (I went online and found a presentation of the formula for r on Wikipedia.)

Please take a minute to look at the remaining four scatterplots. The values of the correlation coefficient for these four data sets are printed on the plots: one value for victories versus runs scored; one for victories versus ERA; one for final versus midterm; and r = 0.464 for final versus midterm after the deletion of the unusual case with x = 35.5. Note the following.

1. We have our first example of a negative value of the correlation coefficient: the r for victories versus ERA. This reflects the (visual) observation that the tendency between x and y is decreasing: as the ERA increases (a bad thing, as noted earlier), the number of wins decreases.

2. Consider the two scatterplots for the scores on exams. The deletion of only one case (out of 36) results in a substantial increase in the value of the correlation coefficient. In the terminology of Chapter 10, the value of the correlation coefficient is fragile to unusual cases. Also note that both scatterplots contain the numeral 2 because two students had (x, y) = (55.5, 96.0), and, of course, two circles placed at the same spot look like one circle.

The correlation coefficient has several important properties. They are listed below, after a bit of terminology in the first item.

1. If the correlation coefficient is greater than zero, the variables Y and X are said to have a positive linear relationship; if it is less than zero, the variables are said to have a negative linear relationship; if it equals zero, the variables are said to have no linear relationship, or to be uncorrelated.

2. The correlation coefficient is not appropriate for summarizing a curved relationship between Y and X. This fact is illustrated in Figure 14.9, in which there is a perfect (deterministic) curved relationship between Y and X, yet the correlation coefficient equals zero. Therefore, it is always necessary to examine a scatterplot of the data to determine whether computation of the correlation coefficient is appropriate.

3. The value of the correlation coefficient is always between $-1$ and $+1$. It equals $+1$ if, and only if, all data points lie on a straight line with positive slope; it equals $-1$ if, and only if, all data points lie on a straight line with negative slope. (Extra: Why is it that statisticians and scientists are not interested in data sets for which all points lie on a horizontal or vertical line?)

4. The farther the value of the correlation coefficient is from zero, in either direction, the stronger the linear relationship. This fact will be justified and made more precise later in this chapter.

5. The value of the correlation coefficient does not depend on the units of measurement chosen by the experimenter. More precisely, if X is replaced by $aX + b$ and Y is replaced by $cY + d$, where a, b, c and d are any numbers with a and c bigger than zero, then the correlation coefficient of the new variables is equal to the correlation coefficient of X and Y. (The numbers a and c are required to be positive to avoid reversing the direction of the relationship: if a and c are both negative, r is unchanged; if exactly one of a and c is negative, r becomes its negative.) This result is true because the correlation coefficient is defined in terms of the standardized values of X and Y (not shown in this text), and these do not change when the units change. Among other examples, this result shows that changing from miles to inches, pounds to kilograms, degrees Celsius to degrees Fahrenheit, or seconds to hours will not change the correlation coefficient.

6. The correlation coefficient is symmetric in X and Y. In other words, if the researcher interchanges the labels predictor and response, the correlation coefficient will not change. In particular, if there is no natural assignment of the labels predictor and response to the two numerical variables, the value of the correlation coefficient is not affected by which assignment is chosen.

Here is an example of item 6: Suppose that in the population of married couples, you want to study the relationship between the husband's IQ and the wife's IQ. To me, there is no natural way to assign the labels response and predictor to these variables.

In view of item 4 in the above list, let's revisit the two scatterplots of wins versus runs and wins versus ERA. The correlation coefficient is farther from zero for ERA than for runs; by item 4, descriptively, ERA is a better predictor than runs. Sadly, there is no way to compare these two predictors inferentially, for example, with a hypothesis test; this result is beyond the scope of these notes.

14.2 The Simple Linear Regression Model

I am now ready to show you the simple linear regression model. We assume the following relationship between X and Y:
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad \text{for } i = 1, 2, 3, \ldots, n. \qquad (14.1)$$
It will take some time to examine and explain this model. First, $\beta_0$ and $\beta_1$ are parameters of the model; this means, of course, that they are numbers, and by changing either or both of their values we get a different model. The actual numerical values of $\beta_0$ and $\beta_1$ are known by Nature and unknown to the researcher; thus, the researcher will want to estimate both of these parameters and, perhaps, test hypotheses about them. The $\epsilon_i$'s are random variables with the following properties: they are i.i.d. with mean 0 and variance $\sigma^2$. Thus, $\sigma^2$ is the third parameter of the model. Again, its value is known by Nature but

unknown to the researcher. It is very important to note that we assume these $\epsilon_i$'s, which are called errors, are statistically independent. In addition, we assume that every case's error has the same variance. Oh, and by the way, not only is $\sigma^2$ unknown to the researcher, the researcher does not get to observe the $\epsilon_i$'s. Remember this: all that the researcher observes are the n pairs of (x, y) values.

Figure 14.9: An Example of a Curved Relationship Between Y and X, for which r = 0. [Plot not reproduced.]

Now, we look at some consequences of our model. I will be using the rules of means and variances from Chapter 7. It is not necessary for you to know how to recreate the arguments below. Remember, the $Y_i$'s are random variables; the $X_i$'s are viewed as constants. The mean of $Y_i$ given $X_i$ is
$$\mu_{Y_i|X_i} = \beta_0 + \beta_1 X_i + \mu_{\epsilon_i} = \beta_0 + \beta_1 X_i, \qquad (14.2)$$
because the mean of a constant is the constant and the mean of each error is 0. The variance of $Y_i$ is
$$\sigma^2_{Y_i} = \sigma^2_{\epsilon_i} = \sigma^2, \qquad (14.3)$$
because the variance of a constant ($\beta_0 + \beta_1 X_i$) is 0. In words, we see that the relationship between X and Y is that the mean of Y given the value of X is a linear function of X, with y-intercept given by $\beta_0$ and slope given by $\beta_1$.

The first issue we turn to is: How do we use data to estimate the values of $\beta_0$ and $\beta_1$? First, note that this is a familiar question, but the current situation is much more difficult than any we have encountered. It is familiar because we previously have asked the questions: How do we estimate p? Or $\mu$? Or $\nu$? In all previous cases, our estimate was simply the sample version of the parameter; for example, to estimate the proportion of successes in a population, we use the proportion of successes in the sample. There is no such easy analogy for the current situation: for example, what is the sample version of $\beta_1$?
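Before turning to estimation, here is a small simulation sketch of the model just described. This is my own illustration, not part of the original notes; the parameter values are invented, and in practice, of course, Nature's values are hidden.

```python
# Simulating Model 14.1: fixed x's, i.i.d. errors with mean 0 and variance sigma^2.
# The parameter values below are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 68.4, 0.45, 4.6      # Nature's (hidden) parameter values
x = np.linspace(39.0, 59.0, 35)            # the x's are treated as constants

eps = rng.normal(0.0, sigma, size=x.size)  # the errors: never observed
y = beta0 + beta1 * x + eps                # the researcher sees only (x, y)

# Brute-force check of Equations 14.2 and 14.3 at a single value of x:
many = beta0 + beta1 * x[0] + rng.normal(0.0, sigma, size=200_000)
print(many.mean(), beta0 + beta1 * x[0])   # mean of Y given x lies on the line
print(many.var(ddof=1), sigma**2)          # variance is sigma^2, free of x
```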

Instead of a sample analogue, we adopt the Principle of Least Squares to guide us. Here is the idea. Suppose that we have two candidate pairs for the values of $\beta_0$ and $\beta_1$; namely,

1. $\beta_0 = c_0$ and $\beta_1 = c_1$ for the first candidate pair, and
2. $\beta_0 = d_0$ and $\beta_1 = d_1$ for the second candidate pair,

where $c_0$, $c_1$, $d_0$ and $d_1$ are known numbers, possibly calculated from our data. The Principle of Least Squares tells us which candidate pair is better. Our data are the values of the y's and the x's. We know from above that the mean of Y given X is $\beta_0 + \beta_1 X$, so we evaluate the candidate pair $c_0, c_1$ by comparing the value of $y_i$ to the value of $c_0 + c_1 x_i$, for all i. We compare, as we usually do, by subtracting, to obtain
$$y_i - [c_0 + c_1 x_i].$$
This quantity can be negative, zero or positive. Ideally, it is 0, which indicates that $c_0 + c_1 x_i$ is an exact match for $y_i$. The farther this quantity is from 0, in either direction, the worse the job $c_0 + c_1 x_i$ is doing in describing or predicting $y_i$. The Principle of Least Squares tells us to take this quantity and square it,
$$(y_i - [c_0 + c_1 x_i])^2,$$
and then sum this square over all cases:
$$\sum_{i=1}^{n} (y_i - [c_0 + c_1 x_i])^2.$$
Thus, to compare two candidate pairs, $c_0, c_1$ and $d_0, d_1$, we calculate two sums of squares:
$$\sum_{i=1}^{n} (y_i - [c_0 + c_1 x_i])^2 \quad \text{and} \quad \sum_{i=1}^{n} (y_i - [d_0 + d_1 x_i])^2.$$
If these sums are equal, then we say that the pairs are equally good. Otherwise, whichever sum is smaller designates the better pair. This is the Principle of Least Squares, although with only two candidate pairs, it should be called the Principle of Lesser Squares.

Of course, it is impossible to compute the sum of squares for every possible set of candidates, but we don't need to do this if we use calculus. Define the function
$$f(c_0, c_1) = \sum_{i=1}^{n} (y_i - [c_0 + c_1 x_i])^2.$$
Our goal is to find the pair of numbers $(c_0, c_1)$ that minimizes the function f. If you have studied calculus, you know that differentiation is a method that can be used to minimize a function. If we take two partial derivatives of f, one with respect to $c_0$ and one with respect to $c_1$; set the two resulting equations equal to 0; and solve for $c_0$ and $c_1$, we find that f is minimized at $(b_0, b_1)$ with
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x}. \qquad (14.4)$$
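As a quick numerical check of Equation 14.4, the following sketch (mine, using simulated data like the block above) computes $(b_0, b_1)$ directly and compares the result with numpy's polyfit, which minimizes the same sum of squares and therefore must agree.

```python
# Computing the least squares estimates of Equation 14.4 on simulated data.
import numpy as np

def least_squares(x, y):
    """Return (b0, b1) minimizing f(c0, c1) = sum (y_i - [c0 + c1*x_i])^2."""
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar
    return b0, b1

rng = np.random.default_rng(1)
x = np.linspace(39.0, 59.0, 35)
y = 68.4 + 0.45 * x + rng.normal(0, 4.6, size=x.size)

b0, b1 = least_squares(x, y)
slope, intercept = np.polyfit(x, y, deg=1)   # same criterion, so must agree
print(b0, b1, np.allclose([b0, b1], [intercept, slope]))
```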

We write
$$\hat{y} = b_0 + b_1 x. \qquad (14.5)$$
This is called the regression line, or least squares regression line, or least squares prediction line, or even just the best line. You will not be required to compute $(b_0, b_1)$ by hand from raw data. Instead, I will show you how to interpret computer output from the Minitab statistical software system.

There is another way to write the expression for $b_1$ given in Equation 14.4. This alternative formula does not help us computationally (we don't compute by hand), but, as we will see later, it gives us additional insight into the regression line. Remember that every case gives us the values of two numerical variables, X and Y. Regression analysis focuses on finding an association between these variables, but it is certainly allowable to look at them separately. In particular, let $\bar{x}$ and $s_x$ denote the mean and standard deviation of the x's in the data set, and let $\bar{y}$ and $s_y$ denote the mean and standard deviation of the y's in the data set. With this notation, it can be shown that (details not given)
$$b_1 = r(s_y/s_x), \qquad (14.6)$$
where, recall, r denotes the correlation coefficient of X and Y. Note that $s_y$ ($s_x$) is a summary of the y (x) values alone. With this observation, Equation 14.6 has the following interesting implication: in order to compute the regression line, the correlation coefficient contains all the information we need to know about the association between X and Y. Thus, we see that r has at least one important feature.

In a regression analysis, every case begins with two values: x and y. Applying Equation 14.5, we can obtain a third value for each case, namely its predicted response $\hat{y}$. Finally, a fourth value for each case is its residual, denoted by e and defined by
$$e_i = y_i - \hat{y}_i = y_i - b_0 - b_1 x_i. \qquad (14.7)$$
Note the following features of residuals:

1. A case's residual can be positive, zero or negative. A residual of zero is ideal in that it means $y_i = \hat{y}_i$; in words, the predicted response is equal to the actual response. A positive residual means that $y_i$ is larger than $\hat{y}_i$; i.e., the actual response is larger than predicted. Finally, a negative residual means that $y_i$ is smaller than $\hat{y}_i$; i.e., the actual response is smaller than predicted.

2. In view of the previous item, the farther the residual is from 0, in either direction, the worse the agreement between the actual and predicted responses.

It turns out that the residuals satisfy two linear constraints:
$$\sum e_i = 0 \quad \text{and} \quad \sum (x_i e_i) = 0. \qquad (14.8)$$
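The next sketch (again mine, on simulated data) verifies Equation 14.6, the two residual constraints of Equation 14.8, and, while we are at it, property 5 of the correlation coefficient: r is unchanged when x is replaced by $ax + b$ with $a > 0$.

```python
# Checking Equation 14.6, the constraints of Equation 14.8, and the
# unit-invariance of r (property 5), all on simulated data.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(39.0, 59.0, 35)
y = 68.4 + 0.45 * x + rng.normal(0, 4.6, size=x.size)

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

r = np.corrcoef(x, y)[0, 1]
print(np.isclose(b1, r * y.std(ddof=1) / x.std(ddof=1)))         # Eq. 14.6

e = y - (b0 + b1 * x)                                            # residuals (14.7)
print(np.isclose(e.sum(), 0.0), np.isclose(np.sum(x * e), 0.0))  # Eq. 14.8

# Property 5: rescaling x (e.g., changing its units) leaves r unchanged.
print(np.isclose(r, np.corrcoef(2.2 * x + 3.0, y)[0, 1]))
```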

14.3 Reading Computer Output

In this section we will study a data set on n = 35 students in my Statistics 371 class. The two variables are Y, the score on the final exam, and X, the score on the midterm exam. Recall that Figure 14.8 presents a scatterplot of these data and that the correlation coefficient is r = 0.464. Table 14.3 presents output obtained from my statistical software package, Minitab. The output has been edited in order to fit on one page; the deleted item will be presented later and is not necessary for our analysis. Note that there is some redundancy in this output; the most obvious example is that, to the right of the observation numbers 21 and 22, the remaining entries are identical. We will work through the output in some detail.

Before turning to an inspection of this computer output, I want to give you a few facts about these data. I grade my exams in half-point increments. Thus, for example, 88.0, 88.5 and 89.0 are all possible scores on my final exam. The maximum number of possible points on the midterm exam was 60.0. The midterm exam scores ranged from a low of 39.0 to a high of 59.0; the final exam scores ranged from a low of 81.0 to a high of 99.5. The means and standard deviations are below:
$$\bar{x} = \ldots, \quad s_x = 5.295, \quad \bar{y} = \ldots, \quad s_y = \ldots \qquad (14.9)$$

The first information in the output is the line

The regression equation is: Final = 68.4 + 0.452 Midterm

This is Minitab's way of telling us that the regression line $\hat{y} = b_0 + b_1 x$ is
$$\hat{y} = 68.4 + 0.452x.$$
Thus, we see that the least squares estimates of $\beta_0$ and $\beta_1$ are $b_0 = 68.4$ and $b_1 = 0.452$, respectively. Minitab is user friendly in that it uses words (admittedly, provided by me, the programmer) to remind us that the response is the score on the final exam and the predictor is the score on the midterm. But Minitab is quite anachronistic (we will see more of this below) in not having a hat above Final. (Reason: Minitab is a very old software package. It was designed to create output for a typewriter-like machine; i.e., no hats, no subscripts, no Greek letters and no math symbols. For some reason unknown to me, the package has never been updated (well, at least not in my version, which is a few years old) to utilize modern printers.)

Next in the output are the three lines

Predictor    Coef      SE Coef    T       P
Constant     ...       ...        ...     ...
Midterm      0.4516    0.1501     3.01    0.005

The first column in this group provides labels for the rows of this table. Its heading, Predictor, seems strange because we only have one predictor, namely Midterm. The output has promoted the intercept to Predictor status because that is a natural thing to do in multiple regression (again, details not given).

Table 14.3: Edited Minitab Output for the Regression of Final Exam Score on Midterm Exam Score for 35 Students.

The regression equation is: Final = 68.4 + 0.452 Midterm

Predictor    Coef      SE Coef    T       P
Constant     ...       ...        ...     ...
Midterm      0.4516    0.1501     3.01    0.005

S = 4.635    R-Sq = 21.5%

[The output continues with a 35-row case listing, with columns Obs, Midterm, Final, Fit, SE Fit, Residual and St Resid; the entries are not reproduced here. In that listing, R denotes an observation with a large standardized residual and X denotes an observation whose X value gives it large influence.]

Figure 14.10: Final Exam Score Versus Midterm Exam Score for 35 Students, showing the regression line $\hat{y} = 68.4 + 0.452x$ and the two dotted lines $\hat{y} + s$ and $\hat{y} - s$. [Plot not reproduced.]

So, to summarize, don't fret about the heading on the first column above; the important thing is that row Constant (Midterm) provides information about $b_0$ ($b_1$). The first feature to note in these three lines is that the computer has given us more precision in the values of $b_0$ and $b_1$; for example, earlier we had $b_0 = 68.4$, but the output reports additional decimal places.

Next, skip down to the 36 lines of output headed by

Obs   Midterm   Final   Fit   SE Fit   Residual   St Resid

The first column numbers the observations 1 through 35 for ease of reference. The second and third columns, Midterm and Final, list the values of x (Midterm) and y (Final) for each of the 35 cases. The fourth column, Fit, lists the value of $\hat{y}$ for each case. It is important to pause for a moment and make sure you understand the values x, y and $\hat{y}$. The values of x and y are the actual exam scores for the case (student). The value of $\hat{y}$ is the predicted response for the case, which is obtained by plugging the case's value of x into the regression line. Thus, for example, case 1 scored 39.0 on the midterm and 83.5 on the final. The predicted final score for this student is
$$\hat{y} = 68.4 + 0.452(39.0) = 86.028,$$
which the computer, carrying more decimal places in $b_0$ and $b_1$, reports as 86.032.

Let me digress briefly and mention a curious convention in reporting the value of $\hat{y}$. In Chapter 7 we learned how to predict the total number of successes in m future Bernoulli trials. At that time, I said, for example, that a point prediction of 72.3 should be rounded to 72 because it is impossible to obtain a fractional number of successes. Following this principle, you might expect me to round 86.032 to 86.0 because, as mentioned earlier, exam scores can occur only in one-half point increments; i.e., 85.5, 86.0 and 86.5 are all possible scores on the final, but 86.032 is not. Surprisingly (?), we never round in regression. There are two reasons for this:

1. If we rounded, then the regression line would not really be a straight line; it would be horizontal between jumps, a so-called step-function. Statisticians won't do this; we won't call it a line when it is a step-function!

2. In many (most?) applications the variable Y is a measurement, for example, a distance or time or weight. In these cases we would not have any reason to round. It's confusing to round sometimes and not others. Thus, we don't round.

Let's return to our examination of case 1. This student's final exam score is a disappointment in two ways. First, only four students scored lower than 83.5; thus, relative to the class as a whole, this student did poorly on the final. Second, the student's actual y is smaller than the score predicted by his/her score on the midterm: 83.5 versus $\hat{y} = 86.032$. As statisticians, we are much more interested in this second form of disappointment. Indeed, we see that for this student the residual is
$$e = y - \hat{y} = 83.5 - 86.032 = -2.532,$$
as reported in column 6. (We will learn about the meaning and utility of the entries in column 5 later.) After a brief reflection, we see that a negative (positive) number for a residual means that the student performed worse (better) than predicted by the regression line. For these data, overall, 15 students have negative residuals and 20 have positive residuals. To be fair, we should note that two students with positive residuals (numbers 11 and 30) scored exactly as predicted, if we round off $\hat{y}$.

Now, please refer to Figure 14.10. With three additions, this is the scatterplot of final versus midterm that was presented earlier in Figure 14.8. First, note the solid line, which is the graph of our regression line. This allows us to visualize how the regression line performs in making predictions for our data set. Below are a few of many possible things to note.

1. The student with the largest value of y (99.5; case 8 with x = 49.0) is, happily for the student, represented by the circle that is 8.952 points above the regression line's prediction.

2. The two students with residuals closest to 0 (cases 11 and 30, as mentioned earlier) have circles that touch the line. This touching shows the near agreement between y and $\hat{y}$.

3. Student 19 has the distinction of having the negative residual farthest from 0. The actual final (y = 83.0) is more than 10 points lower than the prediction based on the fairly high midterm exam score (x = 54.5).

There are n = 35 residuals in our data set, with two linear restrictions, given earlier in Equation 14.8. Thus, the residuals have $(n - 2)$ degrees of freedom. I want to examine the spread in the distribution of the residuals. Following the ideas of Chapter 10, I begin with the variance of the residuals,
$$s^2 = \frac{\sum (e - \bar{e})^2}{n - 2} = \frac{\sum e^2}{n - 2},$$

because the mean of the residuals is 0 (because their sum is 0). Minitab does this computation for us and reports the value of the standard deviation of the residuals, which can be found in the computer output immediately above the listing of the cases; s = 4.635 for these data.

The value of s is very important. Here is the reasoning. The residuals, as a group, tell us how well the actual y's agree with their predicted values. If we apply the empirical rule from Chapter 10, we conclude that for approximately 68% of the cases the value of e falls between $-s$ and $+s$ (remembering, again, that $\bar{e} = 0$). In other words, for approximately 68% of the cases the value of y is within s units of its predicted value; i.e., the absolute error in prediction is at most s. Thus, the value of s tells us how good the regression line is at making predictions. It is natural to wonder whether the empirical rule approximation is any good for our data. It is a reasonable approximation; here are two ways to see why I say this.

1. Scan down the numbers in column 6 of Table 14.3. I count seven residuals that lie outside the interval $[-s, +s] = [-4.635, +4.635]$. If you want to check this, note that the extreme residuals belong to cases 4, 8, 10, 13, 14, 17 and 19.

2. (This way is more fun!) Look again at Figure 14.10. In addition to the regression line, the picture contains two dotted lines. The 28 circles that lie between the two dotted lines correspond to the 28 cases that have a residual in the interval $[-s, +s]$. Thus, we see that 28/35 = 0.80 = 80% of the residuals lie in the interval $[-s, +s]$, which is in rough agreement with the empirical rule's 68% approximation.

To summarize, if asked the question, "How well does the midterm score predict the final score?" I would reply as follows: for approximately two-thirds of the students, the final score can be predicted to within 4.635 points; for the remaining one-third of the students, the prediction errs by more than 4.635 points.

Clearly, in many scientific questions, if s is small enough, the regression is very useful. But if s is too large, the regression is scientifically useless. As George Box, the founder of the Statistics Department at UW-Madison, would say, "Just because something is the best doesn't mean it's any good." The regression line is the best way (according to the Principle of Least Squares) to use X in a linear way to predict Y. This is a math result. Whether this best line is any good in science must be determined by a scientist.

14.4 Inference for Regression

Literally, $\beta_0$ is the mean value of Y when X = 0. But there are two problems. First, X = 0 is not even close to the range of X values in our data set; thus, we have no data-based reason to believe that the linear relation in our data for x values between 39.0 and 59.0 will extend all the way down to X = 0. Second, if a student scored 0 on the midterm he/she would likely drop

the course. In other words, first, we don't want to extend our model to X = 0 because it might not be statistically valid to do so, and, second, we don't want to because X = 0 is uninteresting scientifically. Thus, viewed alone, $\beta_0$, and hence its estimate $b_0$, is not of interest to us. This is not to say that the intercept is totally uninteresting; it is an important component of the regression line. But because $\beta_0$ alone is not interesting, we will not learn how to estimate it with confidence, nor will we learn how to test a hypothesis about its value.

Inference for the slope. Unlike the intercept, the slope, $\beta_1$, is always of great interest to the researcher. As you have learned in math, the slope is the change in y for a unit change in x. Still in math, we learned that a slope of 0 means that changing x has no effect on y, and that a positive (negative) slope means that as x increases, y increases (decreases). Similar comments are true in regression. Now, the interpretation is that $\beta_1$ is the true change in the mean of Y for a unit change in X, and $b_1$ is the estimated change in the mean of Y for a unit change in X. In the current example, we see that for each additional point on the midterm, we estimate that the mean number of points on the final increases by 0.452 points.

Not surprisingly, one of our first tasks of interest is to estimate the slope with confidence. The computer output allows us to do this with only a small amount of work. The confidence interval for $\beta_1$ is given by
$$b_1 \pm t \cdot SE(b_1). \qquad (14.10)$$
Note that we follow Gosset in using the t curves as our reference. As discussed earlier, the residuals have $(n - 2)$ degrees of freedom and, even though one can't tell from our notation, the "estimated" in the estimated SE of $b_1$ comes from estimating the unknown $\sigma^2$ by $s^2$. Thus, it is no surprise that our reference curve is the t-curve with $(n - 2)$ degrees of freedom, which equals 33 for this study. The SE (estimated standard error) of $b_1$ is given in the output and is equal to 0.1501; the output also gives a more precise value of $b_1$, 0.4516, in the event you want to be more precise; I do. With the help of our online calculator, the t for df = 33 and 95% confidence is 2.035. Thus, the 95% CI for $b_1$ is
$$0.4516 \pm 2.035(0.1501) = 0.4516 \pm 0.3055 = [0.1461, 0.7571].$$
This confidence interval is very wide. (All of the interval and test computations in this section are collected in a short code sketch at the end of the section.)

Next, we can test a hypothesis about $\beta_1$. Consider the null hypothesis $\beta_1 = \beta_{10}$, where $\beta_{10}$ (read "beta-one-zero," not "beta-ten") is a known number specified by the researcher. As usual, there are three possible alternatives: $\beta_1 > \beta_{10}$; $\beta_1 < \beta_{10}$; and $\beta_1 \neq \beta_{10}$. The observed value of the test statistic is given by
$$t = \frac{b_1 - \beta_{10}}{SE(b_1)}. \qquad (14.11)$$
We obtain the P-value by using the t-curve with df = $n - 2$ and the familiar rules relating the direction of the area to the alternative. Here are two examples.

Bert believes that, on average, each additional point on the midterm should result in an additional 5/3 = 1.667 points on the final. (Can you create a plausible reason why he might believe

this?) Thus, his choice is $\beta_{10} = 1.667$. He chooses the alternative < because he thinks it is inconceivable that the slope could be larger than 1.667. Thus, his observed test statistic is
$$t = \frac{0.4516 - 1.667}{0.1501} = -8.10.$$
With the help of our calculator, we find that the area under the t(33) curve to the left of $-8.10$ is 0.0000, to four decimal places; this is Bert's P-value. With a more precise program, I found that the P-value is just a bit larger than 1 in one billion. Bert's theory looks pretty (how can I say this?) dumb.

The next example involves the computer's attempt to be helpful. In many, but not all, applications the researcher is interested in $\beta_{10} = 0$. The idea is that if the slope is 0, then there is no reason to bother using X to predict or describe Y. For this choice the test statistic becomes
$$t = \frac{b_1}{SE(b_1)},$$
which for this example is $t = 0.4516/0.1501 = 3.009$, which, of course, could be rounded to 3.01. Notice that the computer has done this calculation for us! (Look in the T column of the Midterm row.) For the alternative $\neq$ and our calculator, we find that the P-value is 2(0.0025) = 0.005, which the computer also gives us (located in the output to the right of 3.01).

Inference for the mean response for a given value of the predictor. Next, let us consider a specific possible value of X; call it $x_0$. Now, given that X = $x_0$, the mean of Y is $\beta_0 + \beta_1 x_0$; call this $\mu_0$. We can use the computer output to obtain a point estimate and CI estimate of $\mu_0$. For example, suppose we select $x_0 = 49.0$. Then the point estimate is
$$b_0 + b_1 x_0 = 68.4 + 0.452(49.0) = 90.548.$$
If, however, you look at the output, you will find this value in the column Fit, in a row with Midterm = 49.0. (This is observation 8, 9 or 10.) Of course, this is not a huge aid because, frankly, the above computation of 90.548 was pretty easy. But it's the next entry in the output that is useful. Just to the right of Fit is SE Fit, with, as before, SE the abbreviation for estimated standard error. Thus, we are able to calculate a CI for $\mu_0$:
$$\text{Fit} \pm t \cdot (\text{SE Fit}). \qquad (14.12)$$
For the current example, the 95% CI for the mean of Y given X = 49.0 is
$$90.548 \pm 2.035(0.968) = 90.548 \pm 1.970 = [88.578, 92.518].$$
The obvious question is: We were pretty lucky that $x_0 = 49.0$ was in our computer output; what do we do if our $x_0$ isn't there? It turns out that the answer is easy: trick the computer. Here is how. Suppose we want to estimate the mean of Y given X = 51.0. An inspection of the Midterm column in the computer output reveals that there is no row for X = 51.0. Go back to the data set

and add a 36th student. For this student, enter 51.0 for Midterm and a missing value for Final. (For Minitab, my software package, this means you enter a *.) Then rerun the regression analysis. In all of the computations the computer will ignore the 36th student because, of course, it has no value for Y and therefore cannot be used. Thus, the computer output is unchanged by the addition of this extra student. But, and this is the key point, the computer includes observation 36 in the last section of the output, creating the row:

Obs   Midterm   Final   Fit      SE Fit   Residual   St Resid
36    51.0      *       91.451   0.828    *          *

From this we see that the point estimate of the mean of Y given X = 51.0 is 91.451. (This, of course, is easy to verify from the regression equation.) But now we also have the SE Fit, so we can obtain the 95% CI for this mean:
$$91.451 \pm 2.035(0.828) = 91.451 \pm 1.685 = [89.766, 93.136].$$
I will remark that we could test a hypothesis about the value of the mean of Y for a given X, but people rarely do this. And if you wanted to do it, I believe you could work out the details.

Prediction of the response for a given value of the predictor. Suppose that, beyond our n cases for which we have data, we have an additional case. For this new case, we know that X = $x_{n+1}$, for a known number $x_{n+1}$, and we want to predict the value of $Y_{n+1}$. Now, of course,
$$Y_{n+1} = \beta_0 + \beta_1 x_{n+1} + \epsilon_{n+1}.$$
We assume that $\epsilon_{n+1}$ is independent of all previous $\epsilon$'s and, like the previous errors, has mean 0 and variance $\sigma^2$. The natural prediction of $Y_{n+1}$ is obtained by replacing the $\beta$'s by their estimates and $\epsilon_{n+1}$ by its mean, 0. The result is
$$\hat{y}_{n+1} = b_0 + b_1 x_{n+1}.$$
We recognize this as the Fit for X = $x_{n+1}$; as such, its value and its SE are both presented in (or can be made to be presented in) our computer output. Using the results of Chapter 7 on variances, we find that, after replacing the unknown $\sigma^2$ by its estimate $s^2$, the estimated variance of our prediction is
$$s^2 + [SE(\text{Fit})]^2.$$
For example, suppose that $x_{n+1} = 49.0$. From our computer output, and our earlier work, the point prediction of $y_{n+1}$ is the Fit, which is 90.548. The estimated variance of this prediction is $(4.635)^2 + (0.968)^2 = 22.420$; thus, the estimated standard error of this prediction is $\sqrt{22.420} = 4.735$. It now follows that we can obtain a prediction interval. In particular, the 95% prediction interval for $y_{n+1}$ is
$$90.548 \pm 2.035(4.735) = 90.548 \pm 9.636 = [80.912, 100.184].$$
This is not a particularly useful prediction interval. (Why?)
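The arithmetic of this section is easy to reproduce. Here is a sketch using scipy, plugging in the values quoted above ($b_1 = 0.4516$, SE = 0.1501, Fit = 90.548, SE Fit = 0.968, s = 4.635, df = 33); treat these inputs as numbers copied from the text, not as values recomputed from the raw data, which are not reproduced in these notes.

```python
# Reproducing the Section 14.4 interval and test arithmetic with scipy.
# Input numbers are the values quoted in the text, not recomputed from raw data.
import math
from scipy import stats

b1, se_b1, df = 0.4516, 0.1501, 33
tstar = stats.t.ppf(0.975, df)                   # about 2.035

# 95% CI for the slope (Equation 14.10):
print(round(tstar, 3), b1 - tstar * se_b1, b1 + tstar * se_b1)

# Bert's test of beta_1 = 5/3 against "<" (Equation 14.11):
t_bert = (b1 - 5 / 3) / se_b1
print(round(t_bert, 2), stats.t.cdf(t_bert, df))  # left-tail P-value, tiny

# The computer's default test of beta_1 = 0 against "not equal":
t_zero = b1 / se_b1
print(round(t_zero, 2), 2 * stats.t.sf(abs(t_zero), df))  # about 0.005

# At x0 = 49.0: CI for the mean response (14.12), then a prediction interval.
fit, se_fit, s = 90.548, 0.968, 4.635
print(fit - tstar * se_fit, fit + tstar * se_fit)          # CI for the mean
se_pred = math.sqrt(s**2 + se_fit**2)                      # adds s^2 for a new Y
print(fit - tstar * se_pred, fit + tstar * se_pred)        # prediction interval
```

In a modern package this bookkeeping is automatic; for instance, statsmodels' OLS results objects expose conf_int() and get_prediction() for the same purposes.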

Table 14.4: The ANOVA Table for the Regression of Final Exam Score on Midterm Exam Score for 35 Students.

Analysis of Variance
Source            DF    SS     MS     F      P
Regression         1    ...    ...    ...    ...
Residual Error    33    ...    ...
Total             34    ...

14.5 Some Loose Ends

There are a few loose ends, from the computer output and from regression analysis in general, that are worth mentioning. In the earlier computer output I omitted the Analysis of Variance Table (ANOVA Table) in order to fit the output onto one page. The ANOVA table is presented now in Table 14.4. The first thing to note in this table is that the column heading SS is short for Sum of Squares. There are three sums of squares in this table, one each for regression, (residual) error and total; these are abbreviated by SSR, SSE and SST, respectively. The next thing to note is that these sums of squares add:
$$SSR + SSE = SST.$$
This is actually a fairly amazing result, a Pythagorean Theorem for Statistics. We begin by considering the n values of the response: $y_1, y_2, \ldots, y_n$. Following Chapter 10, these give rise to n deviations: $y_1 - \bar{y}, y_2 - \bar{y}, \ldots, y_n - \bar{y}$. We then write a generic deviation as follows:
$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}). \qquad (14.13)$$
In words, we take the deviation for case i and break it into the sum of two pieces: the deviation of $y_i$ from its (regression line) predicted value $\hat{y}_i$, and the deviation of its predicted value from the overall mean. For example, refer back to the computer output in Table 14.3 and examine case 29. For this case, $y_{29} = 99.0$; with its fitted value $\hat{y}_{29}$ from the Fit column and the value of $\bar{y}$ from Equation 14.9, Equation 14.13 breaks this response's deviation from the mean into two pieces: the amount by which 99.0 exceeds its predicted value, plus the amount by which that predicted value exceeds the mean of all responses.

This identity remains true if we sum over all cases:
$$\sum (y_i - \bar{y}) = \sum (y_i - \hat{y}_i) + \sum (\hat{y}_i - \bar{y}).$$
What is remarkable, however, is that this last identity remains true if we square all of its terms:
$$\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2.$$

In words, this last identity is SST = SSE + SSR. The degrees of freedom (DF) in the ANOVA Table also sum, as I will now explain. From Chapter 10, the deviations have $n - 1$ degrees of freedom, which for our current data set is $35 - 1 = 34$. As argued earlier in this chapter, the residuals have two linear constraints and, hence, have $n - 2$ degrees of freedom, which is 33 for the current data set. The degrees of freedom for regression is a bit trickier: the regression line is determined by two numbers (intercept and slope), which is one more than the number determined by the mean (itself); hence, there are $2 - 1 = 1$ degrees of freedom for regression.

I will now explain why these sums of squares are so interesting. First, consider SST; this sum of squares measures the variation in the responses, ignoring the predictor. Of course, we could divide SST by its degrees of freedom to obtain $s_y^2$, the variance of the y values, and, of course, we could take the square root of that variance to obtain $s_y$, which appeared in Equation 14.9. But I don't want to do this! I just want to focus on SST itself. This number measures the total squared variation in the y values. It must be nonnegative. If it were 0, we could infer that all y values are the same. Finally, the larger the value of SST, the more squared variation we have in our y values.

Now, contrast SST with SSE. SSE is the sum of squared residuals: the sum of the squared differences between the actual responses and their respective predicted values. Next, consider the difference between SST and SSE; this difference is exactly SSR. Here is a picturesque way to interpret this difference: SSR is the amount of squared error that has been removed by using the predictor X. Removing error is good. (Why?) Thus, this number is a measure of how much using X has improved our predictions. The problem, however, is how to interpret SSR: Is it large? Is it small? Our answer lies in comparing it to something, but what? We compare it to SST by taking a ratio, which for our data is
$$\frac{SSR}{SST} = 0.215, \text{ or } 21.5\%.$$
Thus, of the total squared error in the response, SST, 21.5% is removed, or explained, or accounted for by a linear relationship with the predictor. The above ratio is called the coefficient of determination; it is denoted by $R^2$ and, as above, is often reported as a percentage. Thus,
$$R^2 = \frac{SST - SSE}{SST}. \qquad (14.14)$$
Recall the correlation coefficient, r, discussed earlier. It can be shown that $R^2 = r^2$. For example, with our current data set, r = 0.464, which gives $r^2 = (0.464)^2 = 0.215 = R^2$.
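Here is a sketch (mine, on simulated data once more) verifying both the Pythagorean identity SST = SSE + SSR and the fact that $R^2 = r^2$.

```python
# Verifying SST = SSE + SSR and R^2 = r^2 on simulated data.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(39.0, 59.0, 35)
y = 68.4 + 0.45 * x + rng.normal(0, 4.6, size=x.size)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)     # total squared variation, df = n - 1
sse = np.sum((y - yhat) ** 2)         # squared error left after using X, df = n - 2
ssr = np.sum((yhat - y.mean()) ** 2)  # squared error removed by X, df = 1

print(np.isclose(sst, sse + ssr))     # the Pythagorean identity
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(ssr / sst, r ** 2))  # R-squared equals r squared
```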

I am now able to fulfill an earlier promise to make item 4 in our list of properties of r more precise. We now have an interpretation for the value of r: its square is equal to the coefficient of determination. Thus, the farther r is from 0, the larger the value of $r^2$ and, hence, the better X is at predicting/describing Y.

The fact that $r^2 = R^2$ has led many analysts to state that in order to interpret r we must first square it. Indeed, I have heard several people say that it is dishonest to report r instead of $r^2$. Here is their reasoning: if one reports, for example, that r = 0.8, this sounds much stronger than the more accurate $r^2 = 0.64$. This argument is flawed, in my opinion, for two reasons. First, whereas I believe that $R^2$ is a useful way to summarize one aspect of a regression analysis, we must remember that it is based on squaring errors, which, at the very least, is an unnatural activity. My second reason requires a bit more work.

Combining the results in Equations 14.4, 14.5 and 14.6, we find that
$$\hat{y} = b_0 + b_1 x = \bar{y} - b_1 \bar{x} + b_1 x = \bar{y} + b_1(x - \bar{x}) = \bar{y} + r(s_y/s_x)(x - \bar{x}).$$
Rewriting, we obtain the following:
$$\frac{\hat{y} - \bar{y}}{s_y} = r \left( \frac{x - \bar{x}}{s_x} \right). \qquad (14.15)$$
Now, this equation is not useful for obtaining $\hat{y}$ for a given x, but it gives us great insight into the meaning of r, as I will now argue. Recall the means and standard deviations given in Equation 14.9, including $s_x = 5.295$. Consider a new case whose value of x is $x = \bar{x} + s_x$; that is, one standard deviation above the mean. (Please ignore that such a score is impossible with my one-half point system.) This is a good student; she scored one standard deviation above the mean on the midterm. According to Equation 14.15, her predicted response satisfies
$$\frac{\hat{y} - \bar{y}}{s_y} = r = 0.464.$$
Rewriting this, we obtain
$$\hat{y} = \bar{y} + 0.464 s_y.$$
In words, her predicted response is only 0.464 standard deviations above the mean response! I ended this last sentence with an exclamation point because something quite remarkable is happening. On the midterm, this student is exactly one standard deviation above the mean, but we predict that she will be much closer to the mean on the final. In picturesque language, we predict that only 46.4% of her advantage on the midterm will persist to the final. Thus, only 46.4% of her advantage on the midterm reflects her superior ability; the other 53.6%, for lack of a better term, is luck.
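Equation 14.15 is easy to watch in action. The sketch below (simulated data yet again) confirms that a case one $s_x$ above the mean on x is predicted to sit only r standard deviations above the mean on y, which is exactly the regression-to-the-mean effect just described.

```python
# Equation 14.15 in action: the predicted y for a case one SD above the
# mean on x sits only r standard deviations above the mean on y.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(39.0, 59.0, 35)
y = 68.4 + 0.45 * x + rng.normal(0, 4.6, size=x.size)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
r = np.corrcoef(x, y)[0, 1]

x_new = x.mean() + x.std(ddof=1)        # one standard deviation above the mean
yhat = b0 + b1 * x_new
z = (yhat - y.mean()) / y.std(ddof=1)   # standardized predicted response
print(round(z, 3), round(r, 3))         # these two numbers agree exactly
```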


More information

Chapter 8. Linear Regression. Copyright 2010 Pearson Education, Inc.

Chapter 8. Linear Regression. Copyright 2010 Pearson Education, Inc. Chapter 8 Linear Regression Copyright 2010 Pearson Education, Inc. Fat Versus Protein: An Example The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu: Copyright

More information

Correlation. January 11, 2018

Correlation. January 11, 2018 Correlation January 11, 2018 Contents Correlations The Scattterplot The Pearson correlation The computational raw-score formula Survey data Fun facts about r Sensitivity to outliers Spearman rank-order

More information

Chapter 19 Sir Migo Mendoza

Chapter 19 Sir Migo Mendoza The Linear Regression Chapter 19 Sir Migo Mendoza Linear Regression and the Line of Best Fit Lesson 19.1 Sir Migo Mendoza Question: Once we have a Linear Relationship, what can we do with it? Something

More information

Note that we are looking at the true mean, μ, not y. The problem for us is that we need to find the endpoints of our interval (a, b).

Note that we are looking at the true mean, μ, not y. The problem for us is that we need to find the endpoints of our interval (a, b). Confidence Intervals 1) What are confidence intervals? Simply, an interval for which we have a certain confidence. For example, we are 90% certain that an interval contains the true value of something

More information

Review of Multiple Regression

Review of Multiple Regression Ronald H. Heck 1 Let s begin with a little review of multiple regression this week. Linear models [e.g., correlation, t-tests, analysis of variance (ANOVA), multiple regression, path analysis, multivariate

More information

Applied Regression Analysis

Applied Regression Analysis Applied Regression Analysis Lecture 2 January 27, 2005 Lecture #2-1/27/2005 Slide 1 of 46 Today s Lecture Simple linear regression. Partitioning the sum of squares. Tests of significance.. Regression diagnostics

More information

Algebra Exam. Solutions and Grading Guide

Algebra Exam. Solutions and Grading Guide Algebra Exam Solutions and Grading Guide You should use this grading guide to carefully grade your own exam, trying to be as objective as possible about what score the TAs would give your responses. Full

More information

Introduction to Algebra: The First Week

Introduction to Algebra: The First Week Introduction to Algebra: The First Week Background: According to the thermostat on the wall, the temperature in the classroom right now is 72 degrees Fahrenheit. I want to write to my friend in Europe,

More information

The following are generally referred to as the laws or rules of exponents. x a x b = x a+b (5.1) 1 x b a (5.2) (x a ) b = x ab (5.

The following are generally referred to as the laws or rules of exponents. x a x b = x a+b (5.1) 1 x b a (5.2) (x a ) b = x ab (5. Chapter 5 Exponents 5. Exponent Concepts An exponent means repeated multiplication. For instance, 0 6 means 0 0 0 0 0 0, or,000,000. You ve probably noticed that there is a logical progression of operations.

More information

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n =

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n = Hypothesis testing I I. What is hypothesis testing? [Note we re temporarily bouncing around in the book a lot! Things will settle down again in a week or so] - Exactly what it says. We develop a hypothesis,

More information

In the previous chapter, we learned how to use the method of least-squares

In the previous chapter, we learned how to use the method of least-squares 03-Kahane-45364.qxd 11/9/2007 4:40 PM Page 37 3 Model Performance and Evaluation In the previous chapter, we learned how to use the method of least-squares to find a line that best fits a scatter of points.

More information

The simple linear regression model discussed in Chapter 13 was written as

The simple linear regression model discussed in Chapter 13 was written as 1519T_c14 03/27/2006 07:28 AM Page 614 Chapter Jose Luis Pelaez Inc/Blend Images/Getty Images, Inc./Getty Images, Inc. 14 Multiple Regression 14.1 Multiple Regression Analysis 14.2 Assumptions of the Multiple

More information

Linear Regression. Linear Regression. Linear Regression. Did You Mean Association Or Correlation?

Linear Regression. Linear Regression. Linear Regression. Did You Mean Association Or Correlation? Did You Mean Association Or Correlation? AP Statistics Chapter 8 Be careful not to use the word correlation when you really mean association. Often times people will incorrectly use the word correlation

More information

Chapter 2: Tools for Exploring Univariate Data

Chapter 2: Tools for Exploring Univariate Data Stats 11 (Fall 2004) Lecture Note Introduction to Statistical Methods for Business and Economics Instructor: Hongquan Xu Chapter 2: Tools for Exploring Univariate Data Section 2.1: Introduction What is

More information

Sociology 593 Exam 2 Answer Key March 28, 2002

Sociology 593 Exam 2 Answer Key March 28, 2002 Sociology 59 Exam Answer Key March 8, 00 I. True-False. (0 points) Indicate whether the following statements are true or false. If false, briefly explain why.. A variable is called CATHOLIC. This probably

More information

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals (SW Chapter 5) Outline. The standard error of ˆ. Hypothesis tests concerning β 3. Confidence intervals for β 4. Regression

More information

2.1 Definition. Let n be a positive integer. An n-dimensional vector is an ordered list of n real numbers.

2.1 Definition. Let n be a positive integer. An n-dimensional vector is an ordered list of n real numbers. 2 VECTORS, POINTS, and LINEAR ALGEBRA. At first glance, vectors seem to be very simple. It is easy enough to draw vector arrows, and the operations (vector addition, dot product, etc.) are also easy to

More information

y response variable x 1, x 2,, x k -- a set of explanatory variables

y response variable x 1, x 2,, x k -- a set of explanatory variables 11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate

More information

Finite Mathematics : A Business Approach

Finite Mathematics : A Business Approach Finite Mathematics : A Business Approach Dr. Brian Travers and Prof. James Lampes Second Edition Cover Art by Stephanie Oxenford Additional Editing by John Gambino Contents What You Should Already Know

More information

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Lines and Their Equations

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Lines and Their Equations ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER 1 017/018 DR. ANTHONY BROWN. Lines and Their Equations.1. Slope of a Line and its y-intercept. In Euclidean geometry (where

More information

Topic 10 - Linear Regression

Topic 10 - Linear Regression Topic 10 - Linear Regression Least squares principle Hypothesis tests/confidence intervals/prediction intervals for regression 1 Linear Regression How much should you pay for a house? Would you consider

More information

Everything Old Is New Again: Connecting Calculus To Algebra Andrew Freda

Everything Old Is New Again: Connecting Calculus To Algebra Andrew Freda Everything Old Is New Again: Connecting Calculus To Algebra Andrew Freda (afreda@deerfield.edu) ) Limits a) Newton s Idea of a Limit Perhaps it may be objected, that there is no ultimate proportion of

More information

Ch 13 & 14 - Regression Analysis

Ch 13 & 14 - Regression Analysis Ch 3 & 4 - Regression Analysis Simple Regression Model I. Multiple Choice:. A simple regression is a regression model that contains a. only one independent variable b. only one dependent variable c. more

More information

Section 5.4. Ken Ueda

Section 5.4. Ken Ueda Section 5.4 Ken Ueda Students seem to think that being graded on a curve is a positive thing. I took lasers 101 at Cornell and got a 92 on the exam. The average was a 93. I ended up with a C on the test.

More information

Multiple Regression Examples

Multiple Regression Examples Multiple Regression Examples Example: Tree data. we have seen that a simple linear regression of usable volume on diameter at chest height is not suitable, but that a quadratic model y = β 0 + β 1 x +

More information

WSMA Algebra - Expressions Lesson 14

WSMA Algebra - Expressions Lesson 14 Algebra Expressions Why study algebra? Because this topic provides the mathematical tools for any problem more complicated than just combining some given numbers together. Algebra lets you solve word problems

More information

Chapter 3. Introduction to Linear Correlation and Regression Part 3

Chapter 3. Introduction to Linear Correlation and Regression Part 3 Tuesday, December 12, 2000 Ch3 Intro Correlation Pt 3 Page: 1 Richard Lowry, 1999-2000 All rights reserved. Chapter 3. Introduction to Linear Correlation and Regression Part 3 Regression The appearance

More information

y = a + bx 12.1: Inference for Linear Regression Review: General Form of Linear Regression Equation Review: Interpreting Computer Regression Output

y = a + bx 12.1: Inference for Linear Regression Review: General Form of Linear Regression Equation Review: Interpreting Computer Regression Output 12.1: Inference for Linear Regression Review: General Form of Linear Regression Equation y = a + bx y = dependent variable a = intercept b = slope x = independent variable Section 12.1 Inference for Linear

More information

MATH 1150 Chapter 2 Notation and Terminology

MATH 1150 Chapter 2 Notation and Terminology MATH 1150 Chapter 2 Notation and Terminology Categorical Data The following is a dataset for 30 randomly selected adults in the U.S., showing the values of two categorical variables: whether or not the

More information

Grade 8 Chapter 7: Rational and Irrational Numbers

Grade 8 Chapter 7: Rational and Irrational Numbers Grade 8 Chapter 7: Rational and Irrational Numbers In this chapter we first review the real line model for numbers, as discussed in Chapter 2 of seventh grade, by recalling how the integers and then the

More information

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI Introduction of Data Analytics Prof. Nandan Sudarsanam and Prof. B Ravindran Department of Management Studies and Department of Computer Science and Engineering Indian Institute of Technology, Madras Module

More information

Correlation and Regression

Correlation and Regression Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class

More information

Chapter 14: Finding the Equilibrium Solution and Exploring the Nature of the Equilibration Process

Chapter 14: Finding the Equilibrium Solution and Exploring the Nature of the Equilibration Process Chapter 14: Finding the Equilibrium Solution and Exploring the Nature of the Equilibration Process Taking Stock: In the last chapter, we learned that equilibrium problems have an interesting dimension

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

Please bring the task to your first physics lesson and hand it to the teacher.

Please bring the task to your first physics lesson and hand it to the teacher. Pre-enrolment task for 2014 entry Physics Why do I need to complete a pre-enrolment task? This bridging pack serves a number of purposes. It gives you practice in some of the important skills you will

More information

Unit 27 One-Way Analysis of Variance

Unit 27 One-Way Analysis of Variance Unit 27 One-Way Analysis of Variance Objectives: To perform the hypothesis test in a one-way analysis of variance for comparing more than two population means Recall that a two sample t test is applied

More information

Chapter 6 The Standard Deviation as a Ruler and the Normal Model

Chapter 6 The Standard Deviation as a Ruler and the Normal Model Chapter 6 The Standard Deviation as a Ruler and the Normal Model Overview Key Concepts Understand how adding (subtracting) a constant or multiplying (dividing) by a constant changes the center and/or spread

More information

Chapter 7. Practice Exam Questions and Solutions for Final Exam, Spring 2009 Statistics 301, Professor Wardrop

Chapter 7. Practice Exam Questions and Solutions for Final Exam, Spring 2009 Statistics 301, Professor Wardrop Practice Exam Questions and Solutions for Final Exam, Spring 2009 Statistics 301, Professor Wardrop Chapter 6 1. A random sample of size n = 452 yields 113 successes. Calculate the 95% confidence interval

More information

Chapter 5: Ordinary Least Squares Estimation Procedure The Mechanics Chapter 5 Outline Best Fitting Line Clint s Assignment Simple Regression Model o

Chapter 5: Ordinary Least Squares Estimation Procedure The Mechanics Chapter 5 Outline Best Fitting Line Clint s Assignment Simple Regression Model o Chapter 5: Ordinary Least Squares Estimation Procedure The Mechanics Chapter 5 Outline Best Fitting Line Clint s Assignment Simple Regression Model o Parameters of the Model o Error Term and Random Influences

More information

LECTURE 15: SIMPLE LINEAR REGRESSION I

LECTURE 15: SIMPLE LINEAR REGRESSION I David Youngberg BSAD 20 Montgomery College LECTURE 5: SIMPLE LINEAR REGRESSION I I. From Correlation to Regression a. Recall last class when we discussed two basic types of correlation (positive and negative).

More information

Chapter 1 Review of Equations and Inequalities

Chapter 1 Review of Equations and Inequalities Chapter 1 Review of Equations and Inequalities Part I Review of Basic Equations Recall that an equation is an expression with an equal sign in the middle. Also recall that, if a question asks you to solve

More information

Simple Linear Regression: One Quantitative IV

Simple Linear Regression: One Quantitative IV Simple Linear Regression: One Quantitative IV Linear regression is frequently used to explain variation observed in a dependent variable (DV) with theoretically linked independent variables (IV). For example,

More information

Simple Linear Regression. (Chs 12.1, 12.2, 12.4, 12.5)

Simple Linear Regression. (Chs 12.1, 12.2, 12.4, 12.5) 10 Simple Linear Regression (Chs 12.1, 12.2, 12.4, 12.5) Simple Linear Regression Rating 20 40 60 80 0 5 10 15 Sugar 2 Simple Linear Regression Rating 20 40 60 80 0 5 10 15 Sugar 3 Simple Linear Regression

More information

where Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc.

where Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc. Notes on regression analysis 1. Basics in regression analysis key concepts (actual implementation is more complicated) A. Collect data B. Plot data on graph, draw a line through the middle of the scatter

More information

Week 8: Correlation and Regression

Week 8: Correlation and Regression Health Sciences M.Sc. Programme Applied Biostatistics Week 8: Correlation and Regression The correlation coefficient Correlation coefficients are used to measure the strength of the relationship or association

More information

Introduction to Uncertainty and Treatment of Data

Introduction to Uncertainty and Treatment of Data Introduction to Uncertainty and Treatment of Data Introduction The purpose of this experiment is to familiarize the student with some of the instruments used in making measurements in the physics laboratory,

More information

Draft Proof - Do not copy, post, or distribute. Chapter Learning Objectives REGRESSION AND CORRELATION THE SCATTER DIAGRAM

Draft Proof - Do not copy, post, or distribute. Chapter Learning Objectives REGRESSION AND CORRELATION THE SCATTER DIAGRAM 1 REGRESSION AND CORRELATION As we learned in Chapter 9 ( Bivariate Tables ), the differential access to the Internet is real and persistent. Celeste Campos-Castillo s (015) research confirmed the impact

More information

MITOCW ocw f99-lec16_300k

MITOCW ocw f99-lec16_300k MITOCW ocw-18.06-f99-lec16_300k OK. Here's lecture sixteen and if you remember I ended up the last lecture with this formula for what I called a projection matrix. And maybe I could just recap for a minute

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there

More information

Correlation and Linear Regression

Correlation and Linear Regression Correlation and Linear Regression Correlation: Relationships between Variables So far, nearly all of our discussion of inferential statistics has focused on testing for differences between group means

More information

Multiple Regression Theory 2006 Samuel L. Baker

Multiple Regression Theory 2006 Samuel L. Baker MULTIPLE REGRESSION THEORY 1 Multiple Regression Theory 2006 Samuel L. Baker Multiple regression is regression with two or more independent variables on the right-hand side of the equation. Use multiple

More information

REVIEW 8/2/2017 陈芳华东师大英语系

REVIEW 8/2/2017 陈芳华东师大英语系 REVIEW Hypothesis testing starts with a null hypothesis and a null distribution. We compare what we have to the null distribution, if the result is too extreme to belong to the null distribution (p

More information

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Quadratic Equations

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Quadratic Equations ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER 1 018/019 DR ANTHONY BROWN 31 Graphs of Quadratic Functions 3 Quadratic Equations In Chapter we looked at straight lines,

More information

Relationships Between Quantities

Relationships Between Quantities Algebra 1 Relationships Between Quantities Relationships Between Quantities Everyone loves math until there are letters (known as variables) in problems!! Do students complain about reading when they come

More information

SMAM 314 Exam 42 Name

SMAM 314 Exam 42 Name SMAM 314 Exam 42 Name Mark the following statements True (T) or False (F) (10 points) 1. F A. The line that best fits points whose X and Y values are negatively correlated should have a positive slope.

More information

Chapter 12 : Linear Correlation and Linear Regression

Chapter 12 : Linear Correlation and Linear Regression Chapter 1 : Linear Correlation and Linear Regression Determining whether a linear relationship exists between two quantitative variables, and modeling the relationship with a line, if the linear relationship

More information

y n 1 ( x i x )( y y i n 1 i y 2

y n 1 ( x i x )( y y i n 1 i y 2 STP3 Brief Class Notes Instructor: Ela Jackiewicz Chapter Regression and Correlation In this chapter we will explore the relationship between two quantitative variables, X an Y. We will consider n ordered

More information

1. Create a scatterplot of this data. 2. Find the correlation coefficient.

1. Create a scatterplot of this data. 2. Find the correlation coefficient. How Fast Foods Compare Company Entree Total Calories Fat (grams) McDonald s Big Mac 540 29 Filet o Fish 380 18 Burger King Whopper 670 40 Big Fish Sandwich 640 32 Wendy s Single Burger 470 21 1. Create

More information

Open book, but no loose leaf notes and no electronic devices. Points (out of 200) are in parentheses. Put all answers on the paper provided to you.

Open book, but no loose leaf notes and no electronic devices. Points (out of 200) are in parentheses. Put all answers on the paper provided to you. ISQS 5347 Final Exam Spring 2017 Open book, but no loose leaf notes and no electronic devices. Points (out of 200) are in parentheses. Put all answers on the paper provided to you. 1. Recall the commute

More information

{ }. The dots mean they continue in that pattern to both

{ }. The dots mean they continue in that pattern to both INTEGERS Integers are positive and negative whole numbers, that is they are;... 3, 2, 1,0,1,2,3... { }. The dots mean they continue in that pattern to both positive and negative infinity. Before starting

More information