15.8 MULTIPLE REGRESSION WITH MANY EXPLANATORY VARIABLES

The method of multiple regression that we studied with the two-explanatory-variable life expectancies example can be extended to any number of explanatory (sometimes called predictor) variables. This is very useful when we have several potentially useful explanatory variables measured for each observation and wish to explore which of them help predict a response variable of interest. Performing a multiple regression analysis on all the available potential explanatory variables can help with this. Consider the following example.

Example. Data were gathered for a period of 17 years on the average price of beef and a number of factors believed to have potential effects on the price of beef. But we do not know, without statistical analysis, whether all six explanatory variables are simultaneously needed to predict beef pricing. These explanatory variables are as follows:*

CBE: consumption of beef per capita (lb)
PPO: price of pork (cents/lb)
CPO: consumption of pork per capita (lb)
DINC: disposable income per capita index
CFO: food consumption per capita index
RDINC: index of real disposable income per capita

It is easy to see why each of these variables could have an effect on the price of beef. For example, the more pork that is consumed, or the cheaper pork is, the less the demand for beef should be. The usual least squares multiple regression analysis, obtainable from any standard statistical computer package, yields the following regression relationship between the price of beef (in cents/lb) and the six explanatory variables:

Beef price = ... (CBE) + 0.32(PPO) + 0.87(CPO) + 0.07(DINC) + 0.37(CFO) - 0.16(RDINC)

The ANOVA table for regression, as introduced in the previous section, is as follows:

Source        Sum of squares   Degrees of freedom   Mean square   F
Regression        743.13              6                  ...       ...
Error              ...               10                  ...
Total              ...               16

Clearly, we strongly reject the null hypothesis that the regression is not worthwhile (consult the F tables for the 5% and 1% significance values to see whether you agree!). The equation we have constructed thus does have the power to explain the price of beef. Two essential issues arise, however. First, why does the RDINC coefficient have a sign opposite to the one we might expect? That is, we would expect that as real disposable income (RDINC) increased, the price of beef would increase, since people would consume more high-priced beef, increasing its demand and driving up its price. However, the negative sign for that term indicates the opposite relationship. It is possible that prediction is helped by the RDINC term, but we wonder!

*F. V. Waugh, Graphic Analysis in Agricultural Economics, Agricultural Handbook 128 (Washington, D.C.: U.S. Department of Agriculture, 1957). The term per capita means per person here.
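The computation behind such a fit is easy to sketch. The following Python/NumPy fragment is a minimal illustration (not the package or data used in the text) of obtaining least squares coefficients for six explanatory variables and forming the overall ANOVA F statistic; the simulated arrays simply stand in for the 17 years of beef data.

    # Minimal sketch: least squares fit and overall F statistic (illustrative data only).
    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 17, 6                       # 17 years, 6 explanatory variables
    X = rng.normal(size=(n, k))        # stand-in for CBE, PPO, CPO, DINC, CFO, RDINC
    y = rng.normal(size=n)             # stand-in for beef price (cents/lb)

    X1 = np.column_stack([np.ones(n), X])           # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # least squares coefficients

    fitted = X1 @ beta
    ss_total = np.sum((y - y.mean()) ** 2)
    ss_error = np.sum((y - fitted) ** 2)
    ss_regression = ss_total - ss_error

    df_reg, df_err = k, n - k - 1                   # 6 and 10 degrees of freedom
    F = (ss_regression / df_reg) / (ss_error / df_err)
    print("coefficients:", beta)
    print("overall F statistic:", F)

The F statistic produced this way is the one compared with the tabled 5% and 1% points of the F distribution with 6 and 10 degrees of freedom.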

A second and crucial question is: could we have gotten by with fewer explanatory variables and done just as well in explaining the price of beef? Perhaps we are overfitting the data by including some useless explanatory variables. We will address these questions later in this section.

Estimation of the Regression Equation and Using the Equation for Prediction

The idea of using least squares to find the coefficients in linear regression was introduced in Section 3.5 and discussed briefly in the last section. In Section 3.5 we had only one explanatory variable, and we determined what value of the slope of the regression line would minimize the mean squared error. The method here is exactly the same, except that we are minimizing over all explanatory variables simultaneously: we want to find the set of coefficients that, when applied in the regression equation, minimizes the mean squared error. Of course, we cannot perform this minimization by hand without severe difficulty and a great amount of time. That is why we turn to a convenient computer package to provide the least squares estimates of the coefficients of the explanatory variables. Once this regression equation has been found, we can use it, as we did in Chapter 3, to make predictions of future responses based on observed explanatory values. As was explained in Chapter 3, we want to be careful to use only interpolation, not extrapolation. In multiple regression, interpolation means that all of the observed explanatory values used to make the prediction should be within the range of the data values of the corresponding explanatory variable used in forming the regression equation. For example, with the beef data, the range of each of the explanatory variables was as follows:

CBE: ...
PPO: ...
CPO: ...
DINC: ...
CFO: ...
RDINC: ...

So, if we were going to use this equation for prediction, we would want to make sure that each of the explanatory values we were using was within these ranges (a few exceptions, as long as they are not too far out of the range, would be acceptable).
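As a small illustration of the interpolation check just described, the sketch below (Python; the range values and the new observation are hypothetical, not the actual beef data ranges) flags any explanatory value that falls outside the range observed when the equation was fit.

    # Sketch of an extrapolation guard; the ranges and the new observation are made up.
    ranges = {
        "CBE": (50.0, 60.0), "PPO": (40.0, 90.0), "CPO": (45.0, 70.0),
        "DINC": (20.0, 50.0), "CFO": (85.0, 100.0), "RDINC": (60.0, 100.0),
    }

    def outside_ranges(new_obs, ranges):
        """Return the variables whose new value lies outside the observed range."""
        return [name for name, value in new_obs.items()
                if not (ranges[name][0] <= value <= ranges[name][1])]

    new_obs = {"CBE": 55.0, "PPO": 65.0, "CPO": 58.0,
               "DINC": 35.0, "CFO": 92.0, "RDINC": 110.0}
    flagged = outside_ranges(new_obs, ranges)
    if flagged:
        print("Prediction would be extrapolation for:", flagged)
    else:
        print("All values fall within the observed ranges; prediction is interpolation.")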

Testing Explanatory Variables for Usefulness

We now want to explore the questions we posed at the end of the Example: namely, why one of the coefficients in the regression equation has a sign opposite to what was expected, and whether we can eliminate certain explanatory variables without lessening our ability to explain the price of beef. It turns out that the answers to these questions are related.

First, it is important to understand that which of the other explanatory variables are present in a particular multiple regression equation changes the coefficient of a particular variable such as RDINC. There exists a certain amount of total variation in the response variable, as measured by its total sum of squares. Although one of the explanatory variables may explain some of this total response variation, other explanatory variables will also share in explaining the total variation in the response variable, depending on which of them are included in the model. To understand this perhaps puzzling idea, let's consider an example. Consider prediction of college freshman grade point average (GPA) using both SAT and ACT college entrance scores. First, we note that although they are somewhat different, these two college entrance tests measure very similar things and are highly correlated in the population entering college. We would expect, as is true, that the SAT score by itself does a good job of predicting freshman GPA, and hence will have an influential coefficient in the regression equation GPA = c + m(SAT score). But if both scores are used together as explanatory variables, the coefficient of the SAT score will be much smaller, because the ACT score is now also sharing in predicting freshman GPA. The point is that if several explanatory variables are present, then the coefficient of each variable represents the explanatory capacity of that variable viewed in cooperation with the explanatory capacities of the other variables.

Let's see how this relates to our beef pricing prediction equation. Consider the case of the variable RDINC. We now understand that the variation it explains in the regression equation of the Example is variation not being explained by the other five variables, including in particular the variable DINC. It is natural to assume that RDINC and DINC, which are indeed defined to be very similar, would be explaining much the same variation in the price of beef. In fact, the sample correlation coefficient between RDINC and DINC is 0.82, indicating a strong relationship between them and hence a similar prediction role for them. The point is that the coefficient for any particular explanatory variable in a regression equation has to be understood in the context of all the other explanatory variables present in the equation.
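The SAT/ACT point can be seen numerically. The short simulation below (Python/NumPy, with fabricated scores, so the specific numbers carry no meaning) fits GPA on SAT alone and then on SAT and ACT together; the SAT coefficient shrinks once the highly correlated ACT score shares the explanatory work.

    # Sketch: a coefficient shrinks when a highly correlated predictor is added.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    sat = rng.normal(size=n)
    act = 0.9 * sat + 0.3 * rng.normal(size=n)            # strongly correlated with SAT
    gpa = 0.5 * sat + 0.4 * act + rng.normal(scale=0.5, size=n)

    def ols(columns, y):
        X = np.column_stack([np.ones(len(y))] + list(columns))
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return beta

    print("corr(SAT, ACT):", np.corrcoef(sat, act)[0, 1])
    print("GPA on SAT alone:   ", ols([sat], gpa))        # larger SAT coefficient
    print("GPA on SAT and ACT: ", ols([sat, act], gpa))   # SAT coefficient shrinks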

This leads to the issue of whether it is of value to include both RDINC and DINC in the regression. If they both explain approximately the same thing, then why include them both? This is a very important issue that statisticians doing multiple regression address: in developing models to predict and explain the world around us, we are always interested in creating the simplest model possible (in particular, one with the fewest explanatory variables) while retaining our ability to explain or predict the response variable as well as possible.

The ANOVA table shown in the Example considers the regression in one line of the table. However, it is possible (though we do not explain how here) to split off, from the regression sum of squares, a sum of squares component for RDINC. For our example, this expanded ANOVA table is as follows:

Source                 Sum of squares   Degrees of freedom   Mean square   F
Other five variables        ...                 5                 ...
RDINC                       ...                 1                 ...      0.16
Error                       ...                10                 ...
Total                       ...                16

Note that the six-degrees-of-freedom regression sum of squares (743.13) of the Example has here been decomposed into the five-degrees-of-freedom sum of squares for the combined influence of CBE, PPO, CPO, DINC, and CFO and the single-degree-of-freedom sum of squares for RDINC, which, as the theory says, must add to 743.13 (check it!). Now we have a separate F test for the explanatory variable RDINC. It is important to recall, however, that this sum of squares for RDINC is the sum of squares assuming that the other five explanatory variables are included in the equation. Thus the F test asks whether including the RDINC variable helps predict the response variable (the price of beef) given that the other five variables are part of the prediction equation. Because of this, if we find by using the F distribution that an explanatory variable is not important, we will want to redo the least squares regression with that variable removed. When a variable is removed, the coefficients of the remaining explanatory variables will change.

In our example, the RDINC F statistic is 0.16. We test the null hypothesis that the RDINC variable is of no use in the model (that its coefficient is 0) by comparing 0.16 to the 5% value of the F distribution with 1 numerator and 10 denominator degrees of freedom. This 5% point is 4.96, so we clearly cannot reject the null hypothesis. Thus we conclude that RDINC is not of use in the presence of the other explanatory variables (and hence its original negative coefficient was not to be trusted). We remove this variable from the regression equation.
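The test just described is often called an extra-sum-of-squares (or partial) F test. A minimal sketch of the computation, assuming simulated data in place of the beef data, is given below; it fits the full model and the model without the variable under test, then compares the resulting drop in the error sum of squares to the full-model mean squared error.

    # Sketch of the extra-sum-of-squares F test (illustrative data only).
    import numpy as np

    def sse(X, y):
        """Error sum of squares from a least squares fit with an intercept."""
        X1 = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        return np.sum((y - X1 @ beta) ** 2)

    def partial_F(X_full, X_reduced, y):
        n, k_full = X_full.shape
        df_err = n - k_full - 1
        q = k_full - X_reduced.shape[1]                   # number of variables tested
        return ((sse(X_reduced, y) - sse(X_full, y)) / q) / (sse(X_full, y) / df_err)

    rng = np.random.default_rng(2)
    X = rng.normal(size=(17, 6))                          # six candidate predictors
    y = X[:, :4] @ np.array([1.3, 0.3, -0.8, 0.4]) + rng.normal(size=17)

    # Test the sixth variable (playing the role of RDINC) given the other five.
    F = partial_F(X, X[:, :5], y)
    print("partial F with (1, 10) degrees of freedom:", F)
    # Compare with the tabled 5% point of the F(1, 10) distribution, 4.96.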

We could explore removing other explanatory variables. Indeed, it is important to ask how many and which variables are needed to obtain a regression equation in which each included variable is useful for prediction in addition to the others present, and in which adding any other explanatory variable does not improve prediction. An advanced ANOVA analysis that considers all possible regression equations formed by including various subsets of the six explanatory variables produces a solution to this question:

Beef price = ... (CFO) + 1.27(CBE) + 0.78(CPO) + 0.31(PPO)

Here the F test for each coefficient rejects the null hypothesis that the coefficient is zero, indicating the predictive usefulness of each of the variables, even with the other three explanatory variables present. Recall that the multiple correlation coefficient was only 66% in the life expectancy example discussed earlier. By contrast, the value of 100R² here is 97%, a very high value indicating very effective predictive capability. Comparing the coefficients of the four explanatory variables in the equation above and in the original six-variable equation of the Example, we note that two of the coefficients changed little, one changed a moderate amount, and one is now much different. Interestingly, this four-explanatory-variable equation has dropped both RDINC and DINC because they are ineffective in the presence of the other four variables.
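Before turning to the exercises, here is a rough sketch of the "all possible regressions" idea: with six candidate variables there are 63 non-empty subsets, and each can be fit and compared. The fragment below (Python/NumPy, simulated data with hypothetical coefficients) ranks subsets by R²; a real analysis, like the one quoted above, would also apply F tests to each coefficient in the candidate models.

    # Sketch of all-subsets regression ranked by R-squared (illustrative data only).
    import numpy as np
    from itertools import combinations

    def r_squared(X, y):
        X1 = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

    rng = np.random.default_rng(3)
    names = ["CBE", "PPO", "CPO", "DINC", "CFO", "RDINC"]
    X = rng.normal(size=(17, 6))                      # stand-in for the beef data
    y = X @ np.array([1.2, 0.3, -0.8, 0.0, -1.5, 0.0]) + rng.normal(size=17)

    results = []
    for size in range(1, 7):
        for cols in combinations(range(6), size):
            results.append((r_squared(X[:, list(cols)], y), [names[c] for c in cols]))

    for r2, subset in sorted(results, reverse=True)[:5]:
        print(f"R^2 = {r2:.3f}  variables: {subset}")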

SECTION 15.8 EXERCISES

1. For both of the following sets of values, using the regression equation found in the section, predict the price of beef, if it is appropriate. If it is not appropriate, explain why not.
   a. CBE = 52, PPO = 51.2, CPO = 56.3, DINC = 48.7, CFO = 90.5, RDINC = ...
   b. CBE = 48, PPO = 86.3, CPO = 48.1, DINC = 21.9, CFO = 96.2, RDINC = ...

2. Answer true or false, and explain: Since the sign for the explanatory variable DINC is positive, the correlation between DINC and the price of beef is necessarily also positive.

3. Can you determine the value of R² for the beef price example? Refer back to the Example on page 656.

4. Consider the regression with two predictor variables based on 59 metropolitan areas, where Y = average income in $1000s, X = average educational level, and Z = percentage of workers who are white-collar.
   a. The ANOVA table below shows the sum of squares for X as the explanatory variable, and the sum of squares for Z after X has explained what it can:

      Source   Sum of squares   Degrees of freedom   Mean square   F
      X             ...                1                  ?         ?
      Z             ...                1                  ?         ?
      Error         ...               56                  ?
      Total         ...               58

      Fill in the mean squares for X, Z, and Error, and the F statistics for X and Z.
   b. Perform the F test for the X variable, with significance level .05. What do you conclude?
   c. Perform the F test for the Z variable. What do you conclude?
   d. The next ANOVA table shows the sum of squares for Z as the explanatory variable, and a blank for the sum of squares for X after Z has explained what it can:

      Source   Sum of squares   Degrees of freedom   Mean square   F
      Z             ...                1                  ?         ?
      X              ?                 1                  ?         ?
      Error         ...               56                  ?
      Total         ...               58

      Fill in the sum of squares for X, the mean squares, and the F's.
   e. Perform the F test for the Z variable. What do you conclude? Compare your conclusion to that in part (c). Is there a contradiction? Explain.
   f. Perform the F test for the X variable. What do you conclude?
   g. Which equation would you prefer? Explain.
      (i) Y = a + bX + cZ
      (ii) Y = a + bX
      (iii) Y = a + cZ

5. The scores for 107 statistics students included the following: HW, score on book homework; Labs, score on computer laboratory assignments; In Class, score on in-class assignments; Exams, score on exams during the semester (not including the final); and Final, score on the final exam.
   a. Let Y = Final. Which single one of the other variables would you expect to best predict Y?
   b. The ANOVA table below has the sums of squares for predicting the final exam score from the others. The first line has the sum of squares due to the three variables Labs, In Class, and Exams, and the second has the sum of squares due to HW after the other three have explained what they can:

      Source                   Sum of squares   Degrees of freedom   Mean square   F
      Labs, In Class, Exams         ...                ?                  ?         ?
      HW                             ?                 ?                  ?         ?
      Error                         ...                ?                  ?
      Total                         ...                ?

      Fill in the spaces that have question marks.
   c. Test whether Labs, In Class, and Exams together have significant predictive value for the final exam score.
   d. Test whether HW has significant additional predictive power after the other three variables have explained what they can.
   e. The next ANOVA table has the sum of squares for Labs and In Class together, then the sum of squares for the additional effect of Exams:

      Source             Sum of squares   Degrees of freedom   Mean square   F
      Labs, In Class          ...                ?                  ?         ?
      Exams                    ?                 ?                  ?         ?
      Error                   ...                ?                  ?
      Total                   ...                ?

      Fill in the missing information.

   f. Test whether the Labs and In Class variables combined have significant predictive power.
   g. Test whether Exams has significant additional predictive power after Labs and In Class have explained what they can.
   h. Which of the following equations would you prefer for predicting the final score? Why?
      (i) Final = a + b(Labs) + c(In Class) + d(Exams)
      (ii) Final = a + b(HW) + c(Labs) + d(In Class) + e(Exams)
      (iii) Final = a + b(Labs) + c(In Class)
