Chapter 14
Simple Linear Regression

Regression analysis is too big a topic for just one chapter in these notes. If you have an interest in this methodology, I recommend that you consider taking a course on regression. At the UW-Madison, for example, there is an excellent semester course on regression analysis, Statistics 333, that is offered at least once per year.

14.1 The Scatterplot and Correlation Coefficient

For each subject/trial/unit, or case, as they tend to be called in regression, we have two numbers, denoted by X and Y. The number of greater interest to us is denoted by Y and is called the response; predictor is the common term for the X variable. Very roughly speaking, we want to study whether there is an association or relationship between X and Y, with special interest in the question of using a case's value of X to describe or predict its value of Y. It is very important to remember that the distinction between experimental and observational studies introduced in Chapter 9 applies here too, in a way that will be discussed below.

We have data on n cases. When we think of them as random variables we use upper-case letters, and when we think of specific numerical values we use lower-case letters. Thus, we have the n pairs
$$(X_1, Y_1), (X_2, Y_2), (X_3, Y_3), \ldots, (X_n, Y_n),$$
which take on specific numerical values
$$(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n).$$

The difference between experimental and observational lies in how we obtain the X's. There are two possibilities:

1. Experimental: The researcher deliberately selects the values of $X_1, X_2, X_3, \ldots, X_n$.
2. Observational: The researcher selects units (usually assumed to be selected at random from a population or to be i.i.d. trials) and observes the values of two random variables per unit.

Here are two brief examples.

1. Experimental: The researcher is interested in the yield, per acre, of a certain crop; denote the yield by Y. The researcher believes that the yield will be affected by the concentration of a certain fertilizer that will be applied to the plants; the variable X represents the concentration of the fertilizer. The researcher selects n one-acre plots for study and selects n numbers, $x_1, x_2, x_3, \ldots, x_n$, for the concentrations of the fertilizer. Finally, the researcher assigns the n selected values of X to the n plots by randomization.

2. Observational: The researcher selects n men at random from a population of men and measures each man's height, X, and weight, Y.

Note that for the experimental setting, the X's are not random variables, but for the observational setting, the X's are random variables. It turns out that several key components of the desired analysis become impossible for random X's, so the standard practice (and all that we will cover here) is to condition on the values of the X's when they are random and henceforth pretend that they are not random. The main consequence of conditioning this way is that the computations and predictions still make sense, but we must temper our enthusiasm in our conclusions. Just as in Chapter 9, for the experimental setting we can claim causation, but for the observational setting, the most we get is association.

First, we will learn how to draw a picture of our data, called the scatterplot. Below I present eight scatterplots that come from three sets of data:

1. Seven offensive variables were determined for each of the 16 teams in Major League Baseball's National League. These data are presented in Table 14.1.
2. Three additional variables for the same 16 teams. (Actually, one variable, runs, is common to both data sets.) These data are presented in Table 14.2.
3. Two variables, the scores on the midterm and final exams, for 36 students in one of my sections of Statistics 371. These data are presented later in this chapter.

In most scientific problems, we are quite sure which variable should be the response Y, but might have a number of candidates for X. For example, for the data in Table 14.1, the obvious choice, to me, for Y is the number of runs scored by the team. (If you are a baseball fan, this position of mine likely makes sense; if you are not a baseball fan, don't worry about this issue.) Any one of the remaining six variables could be taken as X.

Before I proceed, let me digress and mention a topic we will not be considering in this chapter, but that is of great interest in science. This topic is covered in any course devoted to regression analysis. In this chapter, we restrict our attention to problems with exactly one X variable. The use of one predictor is conveyed by the adjective simple, and the method we learn is simple regression analysis. It is often desirable in science to allow for two or more predictors; this situation is conveyed by the adjective multiple, and the method is referred to as multiple regression analysis. A regression analyst often has either or both of the goals of using the predictor(s) to describe or predict the value of the response. Rather obviously, there is no reason to restrict our attention to only one predictor. (For example, if we want to predict the height of an adult male, it seems sensible to use both of his parents' heights as predictors.)
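Although this chapter sticks to one predictor, the multiple case mentioned in the digression above is easy to preview in code. Here is a minimal sketch, not part of the original notes, that fits the parents'-heights example by least squares; all of the height numbers are invented for illustration.

```python
# A preview of multiple regression (not covered in this chapter): predicting a
# son's height from both parents' heights. All numbers below are invented.
import numpy as np

father = np.array([70.0, 68.5, 72.0, 66.0, 69.5, 71.0, 67.5, 73.0])
mother = np.array([64.0, 62.5, 66.0, 61.0, 63.5, 65.0, 60.5, 67.0])
son    = np.array([70.5, 68.0, 72.5, 65.5, 70.0, 71.5, 66.0, 74.0])

# Design matrix with an intercept column; lstsq minimizes the sum of
# squared errors, the same criterion used throughout this chapter.
X = np.column_stack([np.ones_like(father), father, mother])
coef, *_ = np.linalg.lstsq(X, son, rcond=None)
b0, b1, b2 = coef
print(b0, b1, b2)          # intercept and the two slopes
print(X @ coef)            # fitted heights for the eight sons
```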

Table 14.1: Various Team Statistics, National League. The columns are Team, Runs, Triples, Home Runs, BA, OBP, SLG and OPS, with one row for each of the 16 teams: Philadelphia, Colorado, Milwaukee, LA Dodgers, Florida, Atlanta, St. Louis, Arizona, Washington, Chicago Cubs, Cincinnati, NY Mets, San Francisco, Houston, San Diego and Pittsburgh. [Numerical entries not reproduced.]

For the data in Table 14.2, the natural choice for Y is the number of wins achieved by the team during the 162-game regular season. Either of the remaining variables (or both, but not in this chapter) is an attractive candidate for the predictor. As baseball fans know, the total number of runs a team scores is a measure of its offensive performance, and its earned run average (ERA) is a measure of the effectiveness of its pitching staff. (Side note: of the ten variables in the two tables that have been discussed, ERA is unique in that it is the only variable for which smaller values reflect a better performance.)

For the last of our three examples, it seems more natural, to me, to let Y denote the student's score on the final and X denote the student's score on the midterm. With this set-up we will learn how to use a particular student's midterm score to predict his/her score on the final exam. But the methods we learn could be applied to the reverse problem: using the score on the final to predict the score on the midterm. For this latter situation, Y would be the midterm score and X the final score.

Take a minute and quickly do a visual scan of the scatterplots in Figures 14.1 through 14.8. First, I need to explain how to read a scatterplot. Look at Figure 14.1. Locate the circle that is farthest to the left in the picture. You can see that its x (horizontal) coordinate is approximately 20 and its y (vertical) coordinate is approximately 730. Now, look at Table 14.1 again and note that Atlanta has x = 20 and y = 735; thus, this circle in the picture presents the values of x and y for Atlanta. Similarly, each of the 16 circles in the scatterplot represents a different team's values of x and y. Take a minute now, please, and make sure you are able to locate the circles for Philadelphia (x = 35 and y = 820) and Colorado.
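Since reading scatterplots is central to this chapter, here is a minimal sketch of how a plot like Figure 14.1 could be drawn with matplotlib. All of the (triples, runs) pairs below are hypothetical stand-ins except the two quoted in the text, Atlanta (20, 735) and Philadelphia (35, 820); the tables' actual entries are not reproduced in these notes.

```python
# A minimal sketch of drawing a scatterplot like Figure 14.1.
# All pairs are hypothetical except Atlanta (20, 735) and Philadelphia (35, 820).
import matplotlib.pyplot as plt

triples = [20, 35, 23, 29, 34, 26, 31, 40, 27, 24, 30, 37, 42, 33, 36, 25]
runs = [735, 820, 673, 780, 772, 640, 730, 720,
        710, 707, 673, 671, 657, 643, 638, 636]

# Open circles, matching the look described in the text.
plt.scatter(triples, runs, facecolors="none", edgecolors="black")
plt.xlabel("Triples")
plt.ylabel("Runs scored")
plt.title("Runs scored versus triples (hypothetical data)")
plt.show()
```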

Table 14.2: Wins, Runs Scored and Earned Run Average for the National League Teams. (One game, Pittsburgh at Chicago, was canceled; I arbitrarily count it as a victory for Chicago.) The columns are Team, Wins, Runs and ERA, with one row for each of the 16 teams: LA Dodgers, Milwaukee, Philadelphia, Cincinnati, Colorado, San Diego, St. Louis, Houston, San Francisco, Arizona, Florida, NY Mets, Atlanta, Pittsburgh, Chicago Cubs and Washington. [Numerical entries not reproduced.]

Figure 14.1: Scatterplot of Runs Scored Versus the Number of Triples. [Plot and its r value not reproduced.]

Figure 14.2: Scatterplot of Runs Scored Versus Batting Average. [Plot and its r value not reproduced.]

Figure 14.3: Scatterplot of Runs Scored Versus the Number of Home Runs. [Plot and its r value not reproduced.]

Figure 14.4: Scatterplot of Runs Scored Versus OPS. [Plot and its r value not reproduced.]

Figure 14.5: Scatterplot of Wins Versus Runs Scored. [Plot and its r value not reproduced.]

Figure 14.6: Scatterplot of Wins Versus Earned Run Average. [Plot and its r value not reproduced.]

Figure 14.7: Final Exam Score Versus Midterm Exam Score for 36 Students. There is a 2 in the scatterplot because two subjects had (x, y) = (55.5, 96.0). [Plot and its r value not reproduced.]

Figure 14.8: Final Exam Score Versus Midterm Exam Score for 35 Students, After Deleting One Isolated Case. There is a 2 in the scatterplot because two subjects had (x, y) = (55.5, 96.0). [Plot and its r value not reproduced.]

Now look at Figure 14.3, the scatterplot of runs scored versus home runs. Scan the picture, running your eyes from left to right, which corresponds to increasing the value of x (going from the lowest value of x, which is farthest to the left, to the largest value of x, which is farthest to the right). As your eyes scan left to right, what happens to the circles? Well, there is not a deterministic relationship, such as all the circles being on a line or a curve; but there is a tendency for the circles to rise (i.e., for y to become larger) as the x values become larger. Also, in my judgment, the tendency looks like a straight-line tendency rather than a (more complicated) curved tendency.

Now look at the remaining three scatterplots of runs versus some X. Here are the conclusions I draw:

1. In all scatterplots the tendency between x and y appears to be linear (a straight line) rather than curved.

2. In the first scatterplot, in which X is the number of triples, the tendency, while linear, is neither increasing nor decreasing; it is just flat. In the remaining three scatterplots the tendency is definitely increasing. The increasing tendency is strongest for X equal to OPS and weakest for X equal to BA, with X equal to Home Runs falling between these extremes.

Each scatterplot also presents a number denoted by r, one value for each choice of X (the number of triples, BA, Home Runs and OPS). This number r is called the (Pearson's product moment) correlation coefficient. There is an algebraic formula for r, but I will not present it, for two reasons:

1. It is quite a mess to compute; as a result, we will use a computer to obtain its value.

2. There is some insight to be gained by examining the algebraic formula for r, but not enough to squeeze this topic into our ever-diminishing remaining time. (I went online and found a presentation of the formula for r on Wikipedia.)

Please take a minute to look at the remaining four scatterplots. The values of the correlation coefficient for these four data sets are printed on the plots: one value for victories versus runs scored; one for victories versus ERA; one for final versus midterm; and r = 0.464 for final versus midterm after the deletion of the unusual case with x = 35.5. Note the following.

1. We have our first example of a negative value of the correlation coefficient: the r for victories versus ERA. This reflects the (visual) observation that the tendency between x and y is decreasing: as the ERA increases (a bad thing, as noted earlier), the number of wins decreases.

2. Consider the two scatterplots for the scores on exams. The deletion of only one case (out of 36) results in a substantial increase in the value of the correlation coefficient. In the terminology of Chapter 10, the value of the correlation coefficient is fragile to unusual cases. Also note that both scatterplots contain the numeral 2 because two students had (x, y) = (55.5, 96.0), and, of course, two circles placed at the same spot look like one circle.

The correlation coefficient has several important properties. They are listed below, after a bit of terminology in the first item.

1. If the correlation coefficient is greater than zero, the variables Y and X are said to have a positive linear relationship; if it is less than zero, the variables are said to have a negative linear relationship; if it equals zero, the variables are said to have no linear relationship, or to be uncorrelated.

2. The correlation coefficient is not appropriate for summarizing a curved relationship between Y and X. This fact is illustrated in Figure 14.9, in which there is a perfect (deterministic) curved relationship between Y and X, yet the correlation coefficient equals zero. Therefore, it is always necessary to examine a scatterplot of the data to determine whether computation of the correlation coefficient is appropriate.

3. The value of the correlation coefficient is always between $-1$ and $+1$. It equals $+1$ if, and only if, all data points lie on a straight line with positive slope; it equals $-1$ if, and only if, all data points lie on a straight line with negative slope. (Extra: Why is it that statisticians and scientists are not interested in data sets for which all points lie on a horizontal or vertical line?)

4. The farther the value of the correlation coefficient is from zero, in either direction, the stronger the linear relationship. This fact will be justified and made more precise later in this chapter.

5. The value of the correlation coefficient does not depend on the units of measurement chosen by the experimenter. More precisely, if X is replaced by $aX + b$ and Y is replaced by $cY + d$, where a, b, c and d are any numbers with a and c bigger than zero, then the correlation coefficient of the new variables is equal to the correlation coefficient of X and Y. (The numbers a and c are required to be positive to avoid reversing the direction of the relationship: if a and c are both negative, r is unchanged; if exactly one of a and c is negative, r becomes its negative.) This result is true because the correlation coefficient is defined in terms of the standardized values of X and Y (not shown in this text), and these do not change when the units change. Among other examples, this result shows that changing from miles to inches, pounds to kilograms, degrees Celsius to degrees Fahrenheit, or seconds to hours will not change the correlation coefficient.

6. The correlation coefficient is symmetric in X and Y. In other words, if the researcher interchanges the labels predictor and response, the correlation coefficient will not change. In particular, if there is no natural assignment of the labels predictor and response to the two numerical variables, the value of the correlation coefficient is not affected by which assignment is chosen.

Here is an example of item 6: Suppose that in the population of married couples, you want to study the relationship between the husband's IQ and the wife's IQ. To me, there is no natural way to assign the labels response and predictor to these variables.

In view of item 4 in the above list, let's revisit the two scatterplots of wins versus runs and wins versus ERA. The correlation coefficient is farther from zero for ERA than for runs; by item 4, descriptively, ERA is a better predictor than runs. Sadly, there is no way to compare these two predictors inferentially, for example, with a hypothesis test; this result is beyond the scope of these notes.

14.2 The Simple Linear Regression Model

I am now ready to show you the simple linear regression model. We assume the following relationship between X and Y:
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad \text{for } i = 1, 2, 3, \ldots, n. \qquad (14.1)$$
It will take some time to examine and explain this model. First, $\beta_0$ and $\beta_1$ are parameters of the model; this means, of course, that they are numbers, and by changing either or both of their values we get a different model. The actual numerical values of $\beta_0$ and $\beta_1$ are known by Nature and unknown to the researcher; thus, the researcher will want to estimate both of these parameters and, perhaps, test hypotheses about them. The $\epsilon_i$'s are random variables with the following properties: they are i.i.d. with mean 0 and variance $\sigma^2$. Thus, $\sigma^2$ is the third parameter of the model. Again, its value is known by Nature but

unknown to the researcher. It is very important to note that we assume these $\epsilon_i$'s, which are called errors, are statistically independent. In addition, we assume that every case's error has the same variance. Oh, and by the way, not only is $\sigma^2$ unknown to the researcher, the researcher does not get to observe the $\epsilon_i$'s. Remember this: all that the researcher observes are the n pairs of (x, y) values.

Figure 14.9: An Example of a Curved Relationship Between Y and X, for which r = 0. [Plot not reproduced.]

Now, we look at some consequences of our model. I will be using the rules of means and variances from Chapter 7. It is not necessary for you to know how to recreate the arguments below. Remember, the $Y_i$'s are random variables; the $X_i$'s are viewed as constants. The mean of $Y_i$ given $X_i$ is
$$\mu_{Y_i|X_i} = \beta_0 + \beta_1 X_i + \mu_{\epsilon_i} = \beta_0 + \beta_1 X_i, \qquad (14.2)$$
because the mean of a constant is the constant and the mean of each error is 0. The variance of $Y_i$ is
$$\sigma^2_{Y_i} = \sigma^2_{\epsilon_i} = \sigma^2, \qquad (14.3)$$
because the variance of a constant ($\beta_0 + \beta_1 X_i$) is 0. In words, we see that the relationship between X and Y is that the mean of Y given the value of X is a linear function of X, with y-intercept given by $\beta_0$ and slope given by $\beta_1$.

The first issue we turn to is: How do we use data to estimate the values of $\beta_0$ and $\beta_1$? First, note that this is a familiar question, but the current situation is much more difficult than any we have encountered. It is familiar because we previously have asked the questions: How do we estimate p? Or $\mu$? Or $\nu$? In all previous cases, our estimate was simply the sample version of the parameter; for example, to estimate the proportion of successes in a population, we use the proportion of successes in the sample. There is no such easy analogy for the current situation: for example, what is the sample version of $\beta_1$?
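Before turning to estimation, here is a small simulation sketch of the model just described. This is my own illustration, not part of the original notes; the parameter values are invented, and in practice, of course, Nature's values are hidden.

```python
# Simulating Model 14.1: fixed x's, i.i.d. errors with mean 0 and variance sigma^2.
# The parameter values below are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 68.4, 0.45, 4.6      # Nature's (hidden) parameter values
x = np.linspace(39.0, 59.0, 35)            # the x's are treated as constants

eps = rng.normal(0.0, sigma, size=x.size)  # the errors: never observed
y = beta0 + beta1 * x + eps                # the researcher sees only (x, y)

# Brute-force check of Equations 14.2 and 14.3 at a single value of x:
many = beta0 + beta1 * x[0] + rng.normal(0.0, sigma, size=200_000)
print(many.mean(), beta0 + beta1 * x[0])   # mean of Y given x lies on the line
print(many.var(ddof=1), sigma**2)          # variance is sigma^2, free of x
```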

Instead of a sample analogue, we adopt the Principle of Least Squares to guide us. Here is the idea. Suppose that we have two candidate pairs for the values of $\beta_0$ and $\beta_1$; namely,

1. $\beta_0 = c_0$ and $\beta_1 = c_1$ for the first candidate pair, and
2. $\beta_0 = d_0$ and $\beta_1 = d_1$ for the second candidate pair,

where $c_0$, $c_1$, $d_0$ and $d_1$ are known numbers, possibly calculated from our data. The Principle of Least Squares tells us which candidate pair is better. Our data are the values of the y's and the x's. We know from above that the mean of Y given X is $\beta_0 + \beta_1 X$, so we evaluate the candidate pair $c_0, c_1$ by comparing the value of $y_i$ to the value of $c_0 + c_1 x_i$, for all i. We compare, as we usually do, by subtracting, to obtain
$$y_i - [c_0 + c_1 x_i].$$
This quantity can be negative, zero or positive. Ideally, it is 0, which indicates that $c_0 + c_1 x_i$ is an exact match for $y_i$. The farther this quantity is from 0, in either direction, the worse the job $c_0 + c_1 x_i$ is doing in describing or predicting $y_i$. The Principle of Least Squares tells us to take this quantity and square it,
$$(y_i - [c_0 + c_1 x_i])^2,$$
and then sum this square over all cases:
$$\sum_{i=1}^{n} (y_i - [c_0 + c_1 x_i])^2.$$
Thus, to compare two candidate pairs, $c_0, c_1$ and $d_0, d_1$, we calculate two sums of squares:
$$\sum_{i=1}^{n} (y_i - [c_0 + c_1 x_i])^2 \quad \text{and} \quad \sum_{i=1}^{n} (y_i - [d_0 + d_1 x_i])^2.$$
If these sums are equal, then we say that the pairs are equally good. Otherwise, whichever sum is smaller designates the better pair. This is the Principle of Least Squares, although with only two candidate pairs, it should be called the Principle of Lesser Squares.

Of course, it is impossible to compute the sum of squares for every possible set of candidates, but we don't need to do this if we use calculus. Define the function
$$f(c_0, c_1) = \sum_{i=1}^{n} (y_i - [c_0 + c_1 x_i])^2.$$
Our goal is to find the pair of numbers $(c_0, c_1)$ that minimizes the function f. If you have studied calculus, you know that differentiation is a method that can be used to minimize a function. If we take two partial derivatives of f, one with respect to $c_0$ and one with respect to $c_1$; set the two resulting equations equal to 0; and solve for $c_0$ and $c_1$, we find that f is minimized at $(b_0, b_1)$ with
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x}. \qquad (14.4)$$
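As a quick numerical check of Equation 14.4, the following sketch (mine, using simulated data like the block above) computes $(b_0, b_1)$ directly and compares the result with numpy's polyfit, which minimizes the same sum of squares and therefore must agree.

```python
# Computing the least squares estimates of Equation 14.4 on simulated data.
import numpy as np

def least_squares(x, y):
    """Return (b0, b1) minimizing f(c0, c1) = sum (y_i - [c0 + c1*x_i])^2."""
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar
    return b0, b1

rng = np.random.default_rng(1)
x = np.linspace(39.0, 59.0, 35)
y = 68.4 + 0.45 * x + rng.normal(0, 4.6, size=x.size)

b0, b1 = least_squares(x, y)
slope, intercept = np.polyfit(x, y, deg=1)   # same criterion, so must agree
print(b0, b1, np.allclose([b0, b1], [intercept, slope]))
```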

We write
$$\hat{y} = b_0 + b_1 x. \qquad (14.5)$$
This is called the regression line, or least squares regression line, or least squares prediction line, or even just the best line. You will not be required to compute $(b_0, b_1)$ by hand from raw data. Instead, I will show you how to interpret computer output from the Minitab statistical software system.

There is another way to write the expression for $b_1$ given in Equation 14.4. This alternative formula does not help us computationally (we don't compute by hand), but, as we will see later, it gives us additional insight into the regression line. Remember that every case gives us the values of two numerical variables, X and Y. Regression analysis focuses on finding an association between these variables, but it is certainly allowable to look at them separately. In particular, let $\bar{x}$ and $s_x$ denote the mean and standard deviation of the x's in the data set, and let $\bar{y}$ and $s_y$ denote the mean and standard deviation of the y's in the data set. With this notation, it can be shown that (details not given)
$$b_1 = r(s_y/s_x), \qquad (14.6)$$
where, recall, r denotes the correlation coefficient of X and Y. Note that $s_y$ ($s_x$) is a summary of the y (x) values alone. With this observation, Equation 14.6 has the following interesting implication: in order to compute the regression line, the correlation coefficient contains all the information we need to know about the association between X and Y. Thus, we see that r has at least one important feature.

In a regression analysis, every case begins with two values: x and y. Applying Equation 14.5, we can obtain a third value for each case, namely its predicted response $\hat{y}$. Finally, a fourth value for each case is its residual, denoted by e and defined by
$$e_i = y_i - \hat{y}_i = y_i - b_0 - b_1 x_i. \qquad (14.7)$$
Note the following features of residuals:

1. A case's residual can be positive, zero or negative. A residual of zero is ideal in that it means $y_i = \hat{y}_i$; in words, the predicted response is equal to the actual response. A positive residual means that $y_i$ is larger than $\hat{y}_i$; i.e., the actual response is larger than predicted. Finally, a negative residual means that $y_i$ is smaller than $\hat{y}_i$; i.e., the actual response is smaller than predicted.

2. In view of the previous item, the farther the residual is from 0, in either direction, the worse the agreement between the actual and predicted responses.

It turns out that the residuals satisfy two linear constraints:
$$\sum e_i = 0 \quad \text{and} \quad \sum (x_i e_i) = 0. \qquad (14.8)$$
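The next sketch (again mine, on simulated data) verifies Equation 14.6, the two residual constraints of Equation 14.8, and, while we are at it, property 5 of the correlation coefficient: r is unchanged when x is replaced by $ax + b$ with $a > 0$.

```python
# Checking Equation 14.6, the constraints of Equation 14.8, and the
# unit-invariance of r (property 5), all on simulated data.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(39.0, 59.0, 35)
y = 68.4 + 0.45 * x + rng.normal(0, 4.6, size=x.size)

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

r = np.corrcoef(x, y)[0, 1]
print(np.isclose(b1, r * y.std(ddof=1) / x.std(ddof=1)))         # Eq. 14.6

e = y - (b0 + b1 * x)                                            # residuals (14.7)
print(np.isclose(e.sum(), 0.0), np.isclose(np.sum(x * e), 0.0))  # Eq. 14.8

# Property 5: rescaling x (e.g., changing its units) leaves r unchanged.
print(np.isclose(r, np.corrcoef(2.2 * x + 3.0, y)[0, 1]))
```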

14.3 Reading Computer Output

In this section we will study a data set on n = 35 students in my Statistics 371 class. The two variables are Y, the score on the final exam, and X, the score on the midterm exam. Recall that Figure 14.8 presents a scatterplot of these data and that the correlation coefficient is r = 0.464. Table 14.3 presents output obtained from my statistical software package, Minitab. The output has been edited in order to fit on one page; the deleted item will be presented later and is not necessary for our analysis. Note that there is some redundancy in this output; the most obvious example is that, to the right of the observation numbers 21 and 22, the remaining entries are identical. We will work through the output in some detail.

Before turning to an inspection of this computer output, I want to give you a few facts about these data. I grade my exams in half-point increments. Thus, for example, 88.0, 88.5 and 89.0 are all possible scores on my final exam. The maximum number of possible points on the midterm exam was 60.0. The midterm exam scores ranged from a low of 39.0 to a high of 59.0; the final exam scores ranged from a low of 81.0 to a high of 99.5. The means and standard deviations are below:
$$\bar{x} = \ldots, \quad s_x = 5.295, \quad \bar{y} = \ldots, \quad s_y = \ldots \qquad (14.9)$$

The first information in the output is the line

The regression equation is: Final = 68.4 + 0.452 Midterm

This is Minitab's way of telling us that the regression line $\hat{y} = b_0 + b_1 x$ is
$$\hat{y} = 68.4 + 0.452x.$$
Thus, we see that the least squares estimates of $\beta_0$ and $\beta_1$ are $b_0 = 68.4$ and $b_1 = 0.452$, respectively. Minitab is user friendly in that it uses words (admittedly, provided by me, the programmer) to remind us that the response is the score on the final exam and the predictor is the score on the midterm. But Minitab is quite anachronistic (we will see more of this below) in not having a hat above Final. (Reason: Minitab is a very old software package. It was designed to create output for a typewriter-like machine; i.e., no hats, no subscripts, no Greek letters and no math symbols. For some reason unknown to me, the package has never been updated (well, at least not in my version, which is a few years old) to utilize modern printers.)

Next in the output are the three lines

Predictor    Coef      SE Coef    T       P
Constant     ...       ...        ...     ...
Midterm      0.4516    0.1501     3.01    0.005

The first column in this group provides labels for the rows of this table. Its heading, Predictor, seems strange because we only have one predictor, namely Midterm. The output has promoted the intercept to Predictor status because that is a natural thing to do in multiple regression (again, details not given).

Table 14.3: Edited Minitab Output for the Regression of Final Exam Score on Midterm Exam Score for 35 Students.

The regression equation is: Final = 68.4 + 0.452 Midterm

Predictor    Coef      SE Coef    T       P
Constant     ...       ...        ...     ...
Midterm      0.4516    0.1501     3.01    0.005

S = 4.635    R-Sq = 21.5%

[The output continues with a 35-row case listing, with columns Obs, Midterm, Final, Fit, SE Fit, Residual and St Resid; the entries are not reproduced here. In that listing, R denotes an observation with a large standardized residual and X denotes an observation whose X value gives it large influence.]

Figure 14.10: Final Exam Score Versus Midterm Exam Score for 35 Students, showing the regression line $\hat{y} = 68.4 + 0.452x$ and the two dotted lines $\hat{y} + s$ and $\hat{y} - s$. [Plot not reproduced.]

So, to summarize, don't fret about the heading on the first column above; the important thing is that row Constant (Midterm) provides information about $b_0$ ($b_1$). The first feature to note in these three lines is that the computer has given us more precision in the values of $b_0$ and $b_1$; for example, earlier we had $b_0 = 68.4$, but the output reports additional decimal places.

Next, skip down to the 36 lines of output headed by

Obs   Midterm   Final   Fit   SE Fit   Residual   St Resid

The first column numbers the observations 1 through 35 for ease of reference. The second and third columns, Midterm and Final, list the values of x (Midterm) and y (Final) for each of the 35 cases. The fourth column, Fit, lists the value of $\hat{y}$ for each case. It is important to pause for a moment and make sure you understand the values x, y and $\hat{y}$. The values of x and y are the actual exam scores for the case (student). The value of $\hat{y}$ is the predicted response for the case, which is obtained by plugging the case's value of x into the regression line. Thus, for example, case 1 scored 39.0 on the midterm and 83.5 on the final. The predicted final score for this student is
$$\hat{y} = 68.4 + 0.452(39.0) = 86.028,$$
which the computer, carrying more decimal places in $b_0$ and $b_1$, reports as 86.032.

Let me digress briefly and mention a curious convention in reporting the value of $\hat{y}$. In Chapter 7 we learned how to predict the total number of successes in m future Bernoulli trials. At that time, I said, for example, that a point prediction of 72.3 should be rounded to 72 because it is impossible to obtain a fractional number of successes. Following this principle, you might expect me to round 86.032 to 86.0 because, as mentioned earlier, exam scores can occur only in one-half point increments; i.e., 85.5, 86.0 and 86.5 are all possible scores on the final, but 86.032 is not. Surprisingly (?), we never round in regression. There are two reasons for this:

1. If we rounded, then the regression line would not really be a straight line; it would be horizontal between jumps, a so-called step-function. Statisticians won't do this; we won't call it a line when it is a step-function!

2. In many (most?) applications the variable Y is a measurement, for example, a distance or time or weight. In these cases we would not have any reason to round. It's confusing to round sometimes and not others. Thus, we don't round.

Let's return to our examination of case 1. This student's final exam score is a disappointment in two ways. First, only four students scored lower than 83.5; thus, relative to the class as a whole, this student did poorly on the final. Second, the student's actual y is smaller than the score predicted by his/her score on the midterm: 83.5 versus $\hat{y} = 86.032$. As statisticians, we are much more interested in this second form of disappointment. Indeed, we see that for this student the residual is
$$e = y - \hat{y} = 83.5 - 86.032 = -2.532,$$
as reported in column 6. (We will learn about the meaning and utility of the entries in column 5 later.) After a brief reflection, we see that a negative (positive) number for a residual means that the student performed worse (better) than predicted by the regression line. For these data, overall, 15 students have negative residuals and 20 have positive residuals. To be fair, we should note that two students with positive residuals (numbers 11 and 30) scored exactly as predicted, if we round off $\hat{y}$.

Now, please refer to Figure 14.10. With three additions, this is the scatterplot of final versus midterm that was presented earlier in Figure 14.8. First, note the solid line, which is the graph of our regression line. This allows us to visualize how the regression line performs in making predictions for our data set. Below are a few of many possible things to note.

1. The student with the largest value of y (99.5; case 8 with x = 49.0) is, happily for the student, represented by the circle that is 8.952 points above the regression line's prediction.

2. The two students with residuals closest to 0 (cases 11 and 30, as mentioned earlier) have circles that touch the line. This touching shows the near agreement between y and $\hat{y}$.

3. Student 19 has the distinction of having the negative residual farthest from 0. The actual final (y = 83.0) is more than 10 points lower than the prediction based on the fairly high midterm exam score (x = 54.5).

There are n = 35 residuals in our data set, with two linear restrictions, given earlier in Equation 14.8. Thus, the residuals have $(n - 2)$ degrees of freedom. I want to examine the spread in the distribution of the residuals. Following the ideas of Chapter 10, I begin with the variance of the residuals,
$$s^2 = \frac{\sum (e - \bar{e})^2}{n - 2} = \frac{\sum e^2}{n - 2},$$

because the mean of the residuals is 0 (because their sum is 0). Minitab does this computation for us and reports the value of the standard deviation of the residuals, which can be found in the computer output immediately above the listing of the cases; s = 4.635 for these data.

The value of s is very important. Here is the reasoning. The residuals, as a group, tell us how well the actual y's agree with their predicted values. If we apply the empirical rule from Chapter 10, we conclude that for approximately 68% of the cases the value of e falls between $-s$ and $+s$ (remembering, again, that $\bar{e} = 0$). In other words, for approximately 68% of the cases the value of y is within s units of its predicted value; i.e., the absolute error in prediction is at most s. Thus, the value of s tells us how good the regression line is at making predictions. It is natural to wonder whether the empirical rule approximation is any good for our data. It is a reasonable approximation; here are two ways to see why I say this.

1. Scan down the numbers in column 6 of Table 14.3. I count seven residuals that lie outside the interval $[-s, +s] = [-4.635, +4.635]$. If you want to check this, note that the extreme residuals belong to cases 4, 8, 10, 13, 14, 17 and 19.

2. (This way is more fun!) Look again at Figure 14.10. In addition to the regression line, the picture contains two dotted lines. The 28 circles that lie between the two dotted lines correspond to the 28 cases that have a residual in the interval $[-s, +s]$. Thus, we see that 28/35 = 0.80 = 80% of the residuals lie in the interval $[-s, +s]$, which is in rough agreement with the empirical rule's 68% approximation.

To summarize, if asked the question, "How well does the midterm score predict the final score?" I would reply as follows: for approximately two-thirds of the students, the final score can be predicted to within 4.635 points; for the remaining one-third of the students, the prediction errs by more than 4.635 points.

Clearly, in many scientific questions, if s is small enough, the regression is very useful. But if s is too large, the regression is scientifically useless. As George Box, the founder of the Statistics Department at UW-Madison, would say, "Just because something is the best doesn't mean it's any good." The regression line is the best way (according to the Principle of Least Squares) to use X in a linear way to predict Y. This is a math result. Whether this best line is any good in science must be determined by a scientist.

14.4 Inference for Regression

Literally, $\beta_0$ is the mean value of Y when X = 0. But there are two problems. First, X = 0 is not even close to the range of X values in our data set; thus, we have no data-based reason to believe that the linear relation in our data for x values between 39.0 and 59.0 will extend all the way down to X = 0. Second, if a student scored 0 on the midterm he/she would likely drop

the course. In other words, first, we don't want to extend our model to X = 0 because it might not be statistically valid to do so, and, second, we don't want to because X = 0 is uninteresting scientifically. Thus, viewed alone, $\beta_0$, and hence its estimate $b_0$, is not of interest to us. This is not to say that the intercept is totally uninteresting; it is an important component of the regression line. But because $\beta_0$ alone is not interesting, we will not learn how to estimate it with confidence, nor will we learn how to test a hypothesis about its value.

Inference for the slope. Unlike the intercept, the slope, $\beta_1$, is always of great interest to the researcher. As you have learned in math, the slope is the change in y for a unit change in x. Still in math, we learned that a slope of 0 means that changing x has no effect on y, and that a positive (negative) slope means that as x increases, y increases (decreases). Similar comments are true in regression. Now, the interpretation is that $\beta_1$ is the true change in the mean of Y for a unit change in X, and $b_1$ is the estimated change in the mean of Y for a unit change in X. In the current example, we see that for each additional point on the midterm, we estimate that the mean number of points on the final increases by 0.452 points.

Not surprisingly, one of our first tasks of interest is to estimate the slope with confidence. The computer output allows us to do this with only a small amount of work. The confidence interval for $\beta_1$ is given by
$$b_1 \pm t \cdot SE(b_1). \qquad (14.10)$$
Note that we follow Gosset in using the t curves as our reference. As discussed earlier, the residuals have $(n - 2)$ degrees of freedom and, even though one can't tell from our notation, the "estimated" in the estimated SE of $b_1$ comes from estimating the unknown $\sigma^2$ by $s^2$. Thus, it is no surprise that our reference curve is the t-curve with $(n - 2)$ degrees of freedom, which equals 33 for this study. The SE (estimated standard error) of $b_1$ is given in the output and is equal to 0.1501; the output also gives a more precise value of $b_1$, 0.4516, in the event you want to be more precise; I do. With the help of our online calculator, the t for df = 33 and 95% confidence is 2.035. Thus, the 95% CI for $b_1$ is
$$0.4516 \pm 2.035(0.1501) = 0.4516 \pm 0.3055 = [0.1461, 0.7571].$$
This confidence interval is very wide. (All of the interval and test computations in this section are collected in a short code sketch at the end of the section.)

Next, we can test a hypothesis about $\beta_1$. Consider the null hypothesis $\beta_1 = \beta_{10}$, where $\beta_{10}$ (read "beta-one-zero," not "beta-ten") is a known number specified by the researcher. As usual, there are three possible alternatives: $\beta_1 > \beta_{10}$; $\beta_1 < \beta_{10}$; and $\beta_1 \neq \beta_{10}$. The observed value of the test statistic is given by
$$t = \frac{b_1 - \beta_{10}}{SE(b_1)}. \qquad (14.11)$$
We obtain the P-value by using the t-curve with df = $n - 2$ and the familiar rules relating the direction of the area to the alternative. Here are two examples.

Bert believes that, on average, each additional point on the midterm should result in an additional 5/3 = 1.667 points on the final. (Can you create a plausible reason why he might believe

this?) Thus, his choice is $\beta_{10} = 1.667$. He chooses the alternative < because he thinks it is inconceivable that the slope could be larger than 1.667. Thus, his observed test statistic is
$$t = \frac{0.4516 - 1.667}{0.1501} = -8.10.$$
With the help of our calculator, we find that the area under the t(33) curve to the left of $-8.10$ is 0.0000, to four decimal places; this is Bert's P-value. With a more precise program, I found that the P-value is just a bit larger than 1 in one billion. Bert's theory looks pretty (how can I say this?) dumb.

The next example involves the computer's attempt to be helpful. In many, but not all, applications the researcher is interested in $\beta_{10} = 0$. The idea is that if the slope is 0, then there is no reason to bother using X to predict or describe Y. For this choice the test statistic becomes
$$t = \frac{b_1}{SE(b_1)},$$
which for this example is $t = 0.4516/0.1501 = 3.009$, which, of course, could be rounded to 3.01. Notice that the computer has done this calculation for us! (Look in the T column of the Midterm row.) For the alternative $\neq$ and our calculator, we find that the P-value is 2(0.0025) = 0.005, which the computer also gives us (located in the output to the right of 3.01).

Inference for the mean response for a given value of the predictor. Next, let us consider a specific possible value of X; call it $x_0$. Now, given that X = $x_0$, the mean of Y is $\beta_0 + \beta_1 x_0$; call this $\mu_0$. We can use the computer output to obtain a point estimate and CI estimate of $\mu_0$. For example, suppose we select $x_0 = 49.0$. Then the point estimate is
$$b_0 + b_1 x_0 = 68.4 + 0.452(49.0) = 90.548.$$
If, however, you look at the output, you will find this value in the column Fit, in a row with Midterm = 49.0. (This is observation 8, 9 or 10.) Of course, this is not a huge aid because, frankly, the above computation of 90.548 was pretty easy. But it's the next entry in the output that is useful. Just to the right of Fit is SE Fit, with, as before, SE the abbreviation for estimated standard error. Thus, we are able to calculate a CI for $\mu_0$:
$$\text{Fit} \pm t \cdot (\text{SE Fit}). \qquad (14.12)$$
For the current example, the 95% CI for the mean of Y given X = 49.0 is
$$90.548 \pm 2.035(0.968) = 90.548 \pm 1.970 = [88.578, 92.518].$$
The obvious question is: We were pretty lucky that $x_0 = 49.0$ was in our computer output; what do we do if our $x_0$ isn't there? It turns out that the answer is easy: trick the computer. Here is how. Suppose we want to estimate the mean of Y given X = 51.0. An inspection of the Midterm column in the computer output reveals that there is no row for X = 51.0. Go back to the data set

and add a 36th student. For this student, enter 51.0 for Midterm and a missing value for Final. (For Minitab, my software package, this means you enter a *.) Then rerun the regression analysis. In all of the computations the computer will ignore the 36th student because, of course, it has no value for Y and therefore cannot be used. Thus, the computer output is unchanged by the addition of this extra student. But, and this is the key point, the computer includes observation 36 in the last section of the output, creating the row:

Obs   Midterm   Final   Fit      SE Fit   Residual   St Resid
36    51.0      *       91.451   0.828    *          *

From this we see that the point estimate of the mean of Y given X = 51.0 is 91.451. (This, of course, is easy to verify from the regression equation.) But now we also have the SE Fit, so we can obtain the 95% CI for this mean:
$$91.451 \pm 2.035(0.828) = 91.451 \pm 1.685 = [89.766, 93.136].$$
I will remark that we could test a hypothesis about the value of the mean of Y for a given X, but people rarely do this. And if you wanted to do it, I believe you could work out the details.

Prediction of the response for a given value of the predictor. Suppose that, beyond our n cases for which we have data, we have an additional case. For this new case, we know that X = $x_{n+1}$, for a known number $x_{n+1}$, and we want to predict the value of $Y_{n+1}$. Now, of course,
$$Y_{n+1} = \beta_0 + \beta_1 x_{n+1} + \epsilon_{n+1}.$$
We assume that $\epsilon_{n+1}$ is independent of all previous $\epsilon$'s and, like the previous errors, has mean 0 and variance $\sigma^2$. The natural prediction of $Y_{n+1}$ is obtained by replacing the $\beta$'s by their estimates and $\epsilon_{n+1}$ by its mean, 0. The result is
$$\hat{y}_{n+1} = b_0 + b_1 x_{n+1}.$$
We recognize this as the Fit for X = $x_{n+1}$; as such, its value and its SE are both presented in (or can be made to be presented in) our computer output. Using the results of Chapter 7 on variances, we find that, after replacing the unknown $\sigma^2$ by its estimate $s^2$, the estimated variance of our prediction is
$$s^2 + [SE(\text{Fit})]^2.$$
For example, suppose that $x_{n+1} = 49.0$. From our computer output, and our earlier work, the point prediction of $y_{n+1}$ is the Fit, which is 90.548. The estimated variance of this prediction is $(4.635)^2 + (0.968)^2 = 22.420$; thus, the estimated standard error of this prediction is $\sqrt{22.420} = 4.735$. It now follows that we can obtain a prediction interval. In particular, the 95% prediction interval for $y_{n+1}$ is
$$90.548 \pm 2.035(4.735) = 90.548 \pm 9.636 = [80.912, 100.184].$$
This is not a particularly useful prediction interval. (Why?)
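The arithmetic of this section is easy to reproduce. Here is a sketch using scipy, plugging in the values quoted above ($b_1 = 0.4516$, SE = 0.1501, Fit = 90.548, SE Fit = 0.968, s = 4.635, df = 33); treat these inputs as numbers copied from the text, not as values recomputed from the raw data, which are not reproduced in these notes.

```python
# Reproducing the Section 14.4 interval and test arithmetic with scipy.
# Input numbers are the values quoted in the text, not recomputed from raw data.
import math
from scipy import stats

b1, se_b1, df = 0.4516, 0.1501, 33
tstar = stats.t.ppf(0.975, df)                   # about 2.035

# 95% CI for the slope (Equation 14.10):
print(round(tstar, 3), b1 - tstar * se_b1, b1 + tstar * se_b1)

# Bert's test of beta_1 = 5/3 against "<" (Equation 14.11):
t_bert = (b1 - 5 / 3) / se_b1
print(round(t_bert, 2), stats.t.cdf(t_bert, df))  # left-tail P-value, tiny

# The computer's default test of beta_1 = 0 against "not equal":
t_zero = b1 / se_b1
print(round(t_zero, 2), 2 * stats.t.sf(abs(t_zero), df))  # about 0.005

# At x0 = 49.0: CI for the mean response (14.12), then a prediction interval.
fit, se_fit, s = 90.548, 0.968, 4.635
print(fit - tstar * se_fit, fit + tstar * se_fit)          # CI for the mean
se_pred = math.sqrt(s**2 + se_fit**2)                      # adds s^2 for a new Y
print(fit - tstar * se_pred, fit + tstar * se_pred)        # prediction interval
```

In a modern package this bookkeeping is automatic; for instance, statsmodels' OLS results objects expose conf_int() and get_prediction() for the same purposes.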

Table 14.4: The ANOVA Table for the Regression of Final Exam Score on Midterm Exam Score for 35 Students.

Analysis of Variance
Source            DF    SS     MS     F      P
Regression         1    ...    ...    ...    ...
Residual Error    33    ...    ...
Total             34    ...

14.5 Some Loose Ends

There are a few loose ends, from the computer output and from regression analysis in general, that are worth mentioning. In the earlier computer output I omitted the Analysis of Variance Table (ANOVA Table) in order to fit the output onto one page. The ANOVA table is presented now in Table 14.4. The first thing to note in this table is that the column heading SS is short for Sum of Squares. There are three sums of squares in this table, one each for regression, (residual) error and total; these are abbreviated by SSR, SSE and SST, respectively. The next thing to note is that these sums of squares add:
$$SSR + SSE = SST.$$
This is actually a fairly amazing result, a Pythagorean Theorem for Statistics. We begin by considering the n values of the response: $y_1, y_2, \ldots, y_n$. Following Chapter 10, these give rise to n deviations: $y_1 - \bar{y}, y_2 - \bar{y}, \ldots, y_n - \bar{y}$. We then write a generic deviation as follows:
$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}). \qquad (14.13)$$
In words, we take the deviation for case i and break it into the sum of two pieces: the deviation of $y_i$ from its (regression line) predicted value $\hat{y}_i$, and the deviation of its predicted value from the overall mean. For example, refer back to the computer output in Table 14.3 and examine case 29. For this case, $y_{29} = 99.0$; with its fitted value $\hat{y}_{29}$ from the Fit column and the value of $\bar{y}$ from Equation 14.9, Equation 14.13 breaks this response's deviation from the mean into two pieces: the amount by which 99.0 exceeds its predicted value, plus the amount by which that predicted value exceeds the mean of all responses.

This identity remains true if we sum over all cases:
$$\sum (y_i - \bar{y}) = \sum (y_i - \hat{y}_i) + \sum (\hat{y}_i - \bar{y}).$$
What is remarkable, however, is that this last identity remains true if we square all of its terms:
$$\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2.$$

In words, this last identity is SST = SSE + SSR. The degrees of freedom (DF) in the ANOVA Table also sum, as I will now explain. From Chapter 10, the deviations have $n - 1$ degrees of freedom, which for our current data set is $35 - 1 = 34$. As argued earlier in this chapter, the residuals have two linear constraints and, hence, have $n - 2$ degrees of freedom, which is 33 for the current data set. The degrees of freedom for regression is a bit trickier: the regression line is determined by two numbers (intercept and slope), which is one more than the number determined by the mean (itself); hence, there are $2 - 1 = 1$ degrees of freedom for regression.

I will now explain why these sums of squares are so interesting. First, consider SST; this sum of squares measures the variation in the responses, ignoring the predictor. Of course, we could divide SST by its degrees of freedom to obtain $s_y^2$, the variance of the y values, and, of course, we could take the square root of that variance to obtain $s_y$, which appeared in Equation 14.9. But I don't want to do this! I just want to focus on SST itself. This number measures the total squared variation in the y values. It must be nonnegative. If it were 0, we could infer that all y values are the same. Finally, the larger the value of SST, the more squared variation we have in our y values.

Now, contrast SST with SSE. SSE is the sum of squared residuals: the sum of the squared differences between the actual responses and their respective predicted values. Next, consider the difference between SST and SSE; this difference is exactly SSR. Here is a picturesque way to interpret this difference: SSR is the amount of squared error that has been removed by using the predictor X. Removing error is good. (Why?) Thus, this number is a measure of how much using X has improved our predictions. The problem, however, is how to interpret SSR: Is it large? Is it small? Our answer lies in comparing it to something, but what? We compare it to SST by taking a ratio, which for our data is
$$\frac{SSR}{SST} = 0.215, \text{ or } 21.5\%.$$
Thus, of the total squared error in the response, SST, 21.5% is removed, or explained, or accounted for by a linear relationship with the predictor. The above ratio is called the coefficient of determination; it is denoted by $R^2$ and, as above, is often reported as a percentage. Thus,
$$R^2 = \frac{SST - SSE}{SST}. \qquad (14.14)$$
Recall the correlation coefficient, r, discussed earlier. It can be shown that $R^2 = r^2$. For example, with our current data set, r = 0.464, which gives $r^2 = (0.464)^2 = 0.215 = R^2$.
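Here is a sketch (mine, on simulated data once more) verifying both the Pythagorean identity SST = SSE + SSR and the fact that $R^2 = r^2$.

```python
# Verifying SST = SSE + SSR and R^2 = r^2 on simulated data.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(39.0, 59.0, 35)
y = 68.4 + 0.45 * x + rng.normal(0, 4.6, size=x.size)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)     # total squared variation, df = n - 1
sse = np.sum((y - yhat) ** 2)         # squared error left after using X, df = n - 2
ssr = np.sum((yhat - y.mean()) ** 2)  # squared error removed by X, df = 1

print(np.isclose(sst, sse + ssr))     # the Pythagorean identity
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(ssr / sst, r ** 2))  # R-squared equals r squared
```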

I am now able to fulfill an earlier promise to make item 4 in our list of properties of r more precise. We now have an interpretation for the value of r: its square is equal to the coefficient of determination. Thus, the farther r is from 0, the larger the value of $r^2$ and, hence, the better X is at predicting/describing Y.

The fact that $r^2 = R^2$ has led many analysts to state that in order to interpret r we must first square it. Indeed, I have heard several people say that it is dishonest to report r instead of $r^2$. Here is their reasoning: if one reports, for example, that r = 0.8, this sounds much stronger than the more accurate $r^2 = 0.64$. This argument is flawed, in my opinion, for two reasons. First, whereas I believe that $R^2$ is a useful way to summarize one aspect of a regression analysis, we must remember that it is based on squaring errors, which, at the very least, is an unnatural activity. My second reason requires a bit more work.

Combining the results in Equations 14.4, 14.5 and 14.6, we find that
$$\hat{y} = b_0 + b_1 x = \bar{y} - b_1 \bar{x} + b_1 x = \bar{y} + b_1(x - \bar{x}) = \bar{y} + r(s_y/s_x)(x - \bar{x}).$$
Rewriting, we obtain the following:
$$\frac{\hat{y} - \bar{y}}{s_y} = r \left( \frac{x - \bar{x}}{s_x} \right). \qquad (14.15)$$
Now, this equation is not useful for obtaining $\hat{y}$ for a given x, but it gives us great insight into the meaning of r, as I will now argue. Recall the means and standard deviations given in Equation 14.9, including $s_x = 5.295$. Consider a new case whose value of x is $x = \bar{x} + s_x$; that is, one standard deviation above the mean. (Please ignore that such a score is impossible with my one-half point system.) This is a good student; she scored one standard deviation above the mean on the midterm. According to Equation 14.15, her predicted response satisfies
$$\frac{\hat{y} - \bar{y}}{s_y} = r = 0.464.$$
Rewriting this, we obtain
$$\hat{y} = \bar{y} + 0.464 s_y.$$
In words, her predicted response is only 0.464 standard deviations above the mean response! I ended this last sentence with an exclamation point because something quite remarkable is happening. On the midterm, this student is exactly one standard deviation above the mean, but we predict that she will be much closer to the mean on the final. In picturesque language, we predict that only 46.4% of her advantage on the midterm will persist to the final. Thus, only 46.4% of her advantage on the midterm reflects her superior ability; the other 53.6%, for lack of a better term, is luck.
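Equation 14.15 is easy to watch in action. The sketch below (simulated data yet again) confirms that a case one $s_x$ above the mean on x is predicted to sit only r standard deviations above the mean on y, which is exactly the regression-to-the-mean effect just described.

```python
# Equation 14.15 in action: the predicted y for a case one SD above the
# mean on x sits only r standard deviations above the mean on y.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(39.0, 59.0, 35)
y = 68.4 + 0.45 * x + rng.normal(0, 4.6, size=x.size)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
r = np.corrcoef(x, y)[0, 1]

x_new = x.mean() + x.std(ddof=1)        # one standard deviation above the mean
yhat = b0 + b1 * x_new
z = (yhat - y.mean()) / y.std(ddof=1)   # standardized predicted response
print(round(z, 3), round(r, 3))         # these two numbers agree exactly
```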


More information

Chapter 8. Linear Regression. Copyright 2010 Pearson Education, Inc.

Chapter 8. Linear Regression. Copyright 2010 Pearson Education, Inc. Chapter 8 Linear Regression Copyright 2010 Pearson Education, Inc. Fat Versus Protein: An Example The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu: Copyright

More information

Correlation. January 11, 2018

Correlation. January 11, 2018 Correlation January 11, 2018 Contents Correlations The Scattterplot The Pearson correlation The computational raw-score formula Survey data Fun facts about r Sensitivity to outliers Spearman rank-order

More information

Chapter 19 Sir Migo Mendoza

Chapter 19 Sir Migo Mendoza The Linear Regression Chapter 19 Sir Migo Mendoza Linear Regression and the Line of Best Fit Lesson 19.1 Sir Migo Mendoza Question: Once we have a Linear Relationship, what can we do with it? Something

More information

Note that we are looking at the true mean, μ, not y. The problem for us is that we need to find the endpoints of our interval (a, b).

Note that we are looking at the true mean, μ, not y. The problem for us is that we need to find the endpoints of our interval (a, b). Confidence Intervals 1) What are confidence intervals? Simply, an interval for which we have a certain confidence. For example, we are 90% certain that an interval contains the true value of something

More information

Review of Multiple Regression

Review of Multiple Regression Ronald H. Heck 1 Let s begin with a little review of multiple regression this week. Linear models [e.g., correlation, t-tests, analysis of variance (ANOVA), multiple regression, path analysis, multivariate

More information

Applied Regression Analysis

Applied Regression Analysis Applied Regression Analysis Lecture 2 January 27, 2005 Lecture #2-1/27/2005 Slide 1 of 46 Today s Lecture Simple linear regression. Partitioning the sum of squares. Tests of significance.. Regression diagnostics

More information

Algebra Exam. Solutions and Grading Guide

Algebra Exam. Solutions and Grading Guide Algebra Exam Solutions and Grading Guide You should use this grading guide to carefully grade your own exam, trying to be as objective as possible about what score the TAs would give your responses. Full

More information

Introduction to Algebra: The First Week

Introduction to Algebra: The First Week Introduction to Algebra: The First Week Background: According to the thermostat on the wall, the temperature in the classroom right now is 72 degrees Fahrenheit. I want to write to my friend in Europe,

More information

The following are generally referred to as the laws or rules of exponents. x a x b = x a+b (5.1) 1 x b a (5.2) (x a ) b = x ab (5.

The following are generally referred to as the laws or rules of exponents. x a x b = x a+b (5.1) 1 x b a (5.2) (x a ) b = x ab (5. Chapter 5 Exponents 5. Exponent Concepts An exponent means repeated multiplication. For instance, 0 6 means 0 0 0 0 0 0, or,000,000. You ve probably noticed that there is a logical progression of operations.

More information

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n =

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n = Hypothesis testing I I. What is hypothesis testing? [Note we re temporarily bouncing around in the book a lot! Things will settle down again in a week or so] - Exactly what it says. We develop a hypothesis,

More information

In the previous chapter, we learned how to use the method of least-squares

In the previous chapter, we learned how to use the method of least-squares 03-Kahane-45364.qxd 11/9/2007 4:40 PM Page 37 3 Model Performance and Evaluation In the previous chapter, we learned how to use the method of least-squares to find a line that best fits a scatter of points.

More information

The simple linear regression model discussed in Chapter 13 was written as

The simple linear regression model discussed in Chapter 13 was written as 1519T_c14 03/27/2006 07:28 AM Page 614 Chapter Jose Luis Pelaez Inc/Blend Images/Getty Images, Inc./Getty Images, Inc. 14 Multiple Regression 14.1 Multiple Regression Analysis 14.2 Assumptions of the Multiple

More information

Linear Regression. Linear Regression. Linear Regression. Did You Mean Association Or Correlation?

Linear Regression. Linear Regression. Linear Regression. Did You Mean Association Or Correlation? Did You Mean Association Or Correlation? AP Statistics Chapter 8 Be careful not to use the word correlation when you really mean association. Often times people will incorrectly use the word correlation

More information

Chapter 2: Tools for Exploring Univariate Data

Chapter 2: Tools for Exploring Univariate Data Stats 11 (Fall 2004) Lecture Note Introduction to Statistical Methods for Business and Economics Instructor: Hongquan Xu Chapter 2: Tools for Exploring Univariate Data Section 2.1: Introduction What is

More information

Sociology 593 Exam 2 Answer Key March 28, 2002

Sociology 593 Exam 2 Answer Key March 28, 2002 Sociology 59 Exam Answer Key March 8, 00 I. True-False. (0 points) Indicate whether the following statements are true or false. If false, briefly explain why.. A variable is called CATHOLIC. This probably

More information

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals (SW Chapter 5) Outline. The standard error of ˆ. Hypothesis tests concerning β 3. Confidence intervals for β 4. Regression

More information

2.1 Definition. Let n be a positive integer. An n-dimensional vector is an ordered list of n real numbers.

2.1 Definition. Let n be a positive integer. An n-dimensional vector is an ordered list of n real numbers. 2 VECTORS, POINTS, and LINEAR ALGEBRA. At first glance, vectors seem to be very simple. It is easy enough to draw vector arrows, and the operations (vector addition, dot product, etc.) are also easy to

More information

y response variable x 1, x 2,, x k -- a set of explanatory variables

y response variable x 1, x 2,, x k -- a set of explanatory variables 11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate

More information

Finite Mathematics : A Business Approach

Finite Mathematics : A Business Approach Finite Mathematics : A Business Approach Dr. Brian Travers and Prof. James Lampes Second Edition Cover Art by Stephanie Oxenford Additional Editing by John Gambino Contents What You Should Already Know

More information

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Lines and Their Equations

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Lines and Their Equations ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER 1 017/018 DR. ANTHONY BROWN. Lines and Their Equations.1. Slope of a Line and its y-intercept. In Euclidean geometry (where

More information

Topic 10 - Linear Regression

Topic 10 - Linear Regression Topic 10 - Linear Regression Least squares principle Hypothesis tests/confidence intervals/prediction intervals for regression 1 Linear Regression How much should you pay for a house? Would you consider

More information

Everything Old Is New Again: Connecting Calculus To Algebra Andrew Freda

Everything Old Is New Again: Connecting Calculus To Algebra Andrew Freda Everything Old Is New Again: Connecting Calculus To Algebra Andrew Freda (afreda@deerfield.edu) ) Limits a) Newton s Idea of a Limit Perhaps it may be objected, that there is no ultimate proportion of

More information

Ch 13 & 14 - Regression Analysis

Ch 13 & 14 - Regression Analysis Ch 3 & 4 - Regression Analysis Simple Regression Model I. Multiple Choice:. A simple regression is a regression model that contains a. only one independent variable b. only one dependent variable c. more

More information

Section 5.4. Ken Ueda

Section 5.4. Ken Ueda Section 5.4 Ken Ueda Students seem to think that being graded on a curve is a positive thing. I took lasers 101 at Cornell and got a 92 on the exam. The average was a 93. I ended up with a C on the test.

More information

Multiple Regression Examples

Multiple Regression Examples Multiple Regression Examples Example: Tree data. we have seen that a simple linear regression of usable volume on diameter at chest height is not suitable, but that a quadratic model y = β 0 + β 1 x +

More information

WSMA Algebra - Expressions Lesson 14

WSMA Algebra - Expressions Lesson 14 Algebra Expressions Why study algebra? Because this topic provides the mathematical tools for any problem more complicated than just combining some given numbers together. Algebra lets you solve word problems

More information

Chapter 3. Introduction to Linear Correlation and Regression Part 3

Chapter 3. Introduction to Linear Correlation and Regression Part 3 Tuesday, December 12, 2000 Ch3 Intro Correlation Pt 3 Page: 1 Richard Lowry, 1999-2000 All rights reserved. Chapter 3. Introduction to Linear Correlation and Regression Part 3 Regression The appearance

More information

y = a + bx 12.1: Inference for Linear Regression Review: General Form of Linear Regression Equation Review: Interpreting Computer Regression Output

y = a + bx 12.1: Inference for Linear Regression Review: General Form of Linear Regression Equation Review: Interpreting Computer Regression Output 12.1: Inference for Linear Regression Review: General Form of Linear Regression Equation y = a + bx y = dependent variable a = intercept b = slope x = independent variable Section 12.1 Inference for Linear

More information

MATH 1150 Chapter 2 Notation and Terminology

MATH 1150 Chapter 2 Notation and Terminology MATH 1150 Chapter 2 Notation and Terminology Categorical Data The following is a dataset for 30 randomly selected adults in the U.S., showing the values of two categorical variables: whether or not the

More information

Grade 8 Chapter 7: Rational and Irrational Numbers

Grade 8 Chapter 7: Rational and Irrational Numbers Grade 8 Chapter 7: Rational and Irrational Numbers In this chapter we first review the real line model for numbers, as discussed in Chapter 2 of seventh grade, by recalling how the integers and then the

More information

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI Introduction of Data Analytics Prof. Nandan Sudarsanam and Prof. B Ravindran Department of Management Studies and Department of Computer Science and Engineering Indian Institute of Technology, Madras Module

More information

Correlation and Regression

Correlation and Regression Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class

More information

Chapter 14: Finding the Equilibrium Solution and Exploring the Nature of the Equilibration Process

Chapter 14: Finding the Equilibrium Solution and Exploring the Nature of the Equilibration Process Chapter 14: Finding the Equilibrium Solution and Exploring the Nature of the Equilibration Process Taking Stock: In the last chapter, we learned that equilibrium problems have an interesting dimension

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

Please bring the task to your first physics lesson and hand it to the teacher.

Please bring the task to your first physics lesson and hand it to the teacher. Pre-enrolment task for 2014 entry Physics Why do I need to complete a pre-enrolment task? This bridging pack serves a number of purposes. It gives you practice in some of the important skills you will

More information

Unit 27 One-Way Analysis of Variance

Unit 27 One-Way Analysis of Variance Unit 27 One-Way Analysis of Variance Objectives: To perform the hypothesis test in a one-way analysis of variance for comparing more than two population means Recall that a two sample t test is applied

More information

Chapter 6 The Standard Deviation as a Ruler and the Normal Model

Chapter 6 The Standard Deviation as a Ruler and the Normal Model Chapter 6 The Standard Deviation as a Ruler and the Normal Model Overview Key Concepts Understand how adding (subtracting) a constant or multiplying (dividing) by a constant changes the center and/or spread

More information

Chapter 7. Practice Exam Questions and Solutions for Final Exam, Spring 2009 Statistics 301, Professor Wardrop

Chapter 7. Practice Exam Questions and Solutions for Final Exam, Spring 2009 Statistics 301, Professor Wardrop Practice Exam Questions and Solutions for Final Exam, Spring 2009 Statistics 301, Professor Wardrop Chapter 6 1. A random sample of size n = 452 yields 113 successes. Calculate the 95% confidence interval

More information

Chapter 5: Ordinary Least Squares Estimation Procedure The Mechanics Chapter 5 Outline Best Fitting Line Clint s Assignment Simple Regression Model o

Chapter 5: Ordinary Least Squares Estimation Procedure The Mechanics Chapter 5 Outline Best Fitting Line Clint s Assignment Simple Regression Model o Chapter 5: Ordinary Least Squares Estimation Procedure The Mechanics Chapter 5 Outline Best Fitting Line Clint s Assignment Simple Regression Model o Parameters of the Model o Error Term and Random Influences

More information

LECTURE 15: SIMPLE LINEAR REGRESSION I

LECTURE 15: SIMPLE LINEAR REGRESSION I David Youngberg BSAD 20 Montgomery College LECTURE 5: SIMPLE LINEAR REGRESSION I I. From Correlation to Regression a. Recall last class when we discussed two basic types of correlation (positive and negative).

More information

Chapter 1 Review of Equations and Inequalities

Chapter 1 Review of Equations and Inequalities Chapter 1 Review of Equations and Inequalities Part I Review of Basic Equations Recall that an equation is an expression with an equal sign in the middle. Also recall that, if a question asks you to solve

More information

Simple Linear Regression: One Quantitative IV

Simple Linear Regression: One Quantitative IV Simple Linear Regression: One Quantitative IV Linear regression is frequently used to explain variation observed in a dependent variable (DV) with theoretically linked independent variables (IV). For example,

More information

Simple Linear Regression. (Chs 12.1, 12.2, 12.4, 12.5)

Simple Linear Regression. (Chs 12.1, 12.2, 12.4, 12.5) 10 Simple Linear Regression (Chs 12.1, 12.2, 12.4, 12.5) Simple Linear Regression Rating 20 40 60 80 0 5 10 15 Sugar 2 Simple Linear Regression Rating 20 40 60 80 0 5 10 15 Sugar 3 Simple Linear Regression

More information

where Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc.

where Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc. Notes on regression analysis 1. Basics in regression analysis key concepts (actual implementation is more complicated) A. Collect data B. Plot data on graph, draw a line through the middle of the scatter

More information

Week 8: Correlation and Regression

Week 8: Correlation and Regression Health Sciences M.Sc. Programme Applied Biostatistics Week 8: Correlation and Regression The correlation coefficient Correlation coefficients are used to measure the strength of the relationship or association

More information

Introduction to Uncertainty and Treatment of Data

Introduction to Uncertainty and Treatment of Data Introduction to Uncertainty and Treatment of Data Introduction The purpose of this experiment is to familiarize the student with some of the instruments used in making measurements in the physics laboratory,

More information

Draft Proof - Do not copy, post, or distribute. Chapter Learning Objectives REGRESSION AND CORRELATION THE SCATTER DIAGRAM

Draft Proof - Do not copy, post, or distribute. Chapter Learning Objectives REGRESSION AND CORRELATION THE SCATTER DIAGRAM 1 REGRESSION AND CORRELATION As we learned in Chapter 9 ( Bivariate Tables ), the differential access to the Internet is real and persistent. Celeste Campos-Castillo s (015) research confirmed the impact

More information

MITOCW ocw f99-lec16_300k

MITOCW ocw f99-lec16_300k MITOCW ocw-18.06-f99-lec16_300k OK. Here's lecture sixteen and if you remember I ended up the last lecture with this formula for what I called a projection matrix. And maybe I could just recap for a minute

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there

More information

Correlation and Linear Regression

Correlation and Linear Regression Correlation and Linear Regression Correlation: Relationships between Variables So far, nearly all of our discussion of inferential statistics has focused on testing for differences between group means

More information

Multiple Regression Theory 2006 Samuel L. Baker

Multiple Regression Theory 2006 Samuel L. Baker MULTIPLE REGRESSION THEORY 1 Multiple Regression Theory 2006 Samuel L. Baker Multiple regression is regression with two or more independent variables on the right-hand side of the equation. Use multiple

More information

REVIEW 8/2/2017 陈芳华东师大英语系

REVIEW 8/2/2017 陈芳华东师大英语系 REVIEW Hypothesis testing starts with a null hypothesis and a null distribution. We compare what we have to the null distribution, if the result is too extreme to belong to the null distribution (p

More information

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Quadratic Equations

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Quadratic Equations ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER 1 018/019 DR ANTHONY BROWN 31 Graphs of Quadratic Functions 3 Quadratic Equations In Chapter we looked at straight lines,

More information

Relationships Between Quantities

Relationships Between Quantities Algebra 1 Relationships Between Quantities Relationships Between Quantities Everyone loves math until there are letters (known as variables) in problems!! Do students complain about reading when they come

More information

SMAM 314 Exam 42 Name

SMAM 314 Exam 42 Name SMAM 314 Exam 42 Name Mark the following statements True (T) or False (F) (10 points) 1. F A. The line that best fits points whose X and Y values are negatively correlated should have a positive slope.

More information

Chapter 12 : Linear Correlation and Linear Regression

Chapter 12 : Linear Correlation and Linear Regression Chapter 1 : Linear Correlation and Linear Regression Determining whether a linear relationship exists between two quantitative variables, and modeling the relationship with a line, if the linear relationship

More information

y n 1 ( x i x )( y y i n 1 i y 2

y n 1 ( x i x )( y y i n 1 i y 2 STP3 Brief Class Notes Instructor: Ela Jackiewicz Chapter Regression and Correlation In this chapter we will explore the relationship between two quantitative variables, X an Y. We will consider n ordered

More information

1. Create a scatterplot of this data. 2. Find the correlation coefficient.

1. Create a scatterplot of this data. 2. Find the correlation coefficient. How Fast Foods Compare Company Entree Total Calories Fat (grams) McDonald s Big Mac 540 29 Filet o Fish 380 18 Burger King Whopper 670 40 Big Fish Sandwich 640 32 Wendy s Single Burger 470 21 1. Create

More information

Open book, but no loose leaf notes and no electronic devices. Points (out of 200) are in parentheses. Put all answers on the paper provided to you.

Open book, but no loose leaf notes and no electronic devices. Points (out of 200) are in parentheses. Put all answers on the paper provided to you. ISQS 5347 Final Exam Spring 2017 Open book, but no loose leaf notes and no electronic devices. Points (out of 200) are in parentheses. Put all answers on the paper provided to you. 1. Recall the commute

More information

{ }. The dots mean they continue in that pattern to both

{ }. The dots mean they continue in that pattern to both INTEGERS Integers are positive and negative whole numbers, that is they are;... 3, 2, 1,0,1,2,3... { }. The dots mean they continue in that pattern to both positive and negative infinity. Before starting

More information