12.12 MODEL BUILDING, AND THE EFFECTS OF MULTICOLLINEARITY (OPTIONAL)


Although Excel and MegaStat are emphasized in Business Statistics in Practice, Second Canadian Edition, some examples in the additional material on Connect can only be demonstrated using other programs, such as MINITAB, SPSS, and SAS. Please consult the user guides for these programs for instructions on their use.

Multicollinearity

Recall the sales territory performance data in Table 12.2 (page 422). These data consist of values of the dependent variable y (Sales) and of the independent variables x1 (Time), x2 (MktPoten), x3 (Adver), x4 (MktShare), and x5 (Change). The complete sales territory performance data analyzed by Cravens, Woodruff, and Stamper (1972) consist of the data presented in Table 12.2 and data concerning three additional independent variables. These three additional variables are defined as follows:

x6 = the number of accounts handled by the representative (we will denote this variable as Accts),

x7 = the average workload per account, measured by using a weighting based on the sizes of the orders by the accounts and other workload-related criteria (we will denote this variable as WkLoad),

x8 = an aggregate rating on eight dimensions of the representative's performance, made by a sales manager and expressed on a 1 to 7 scale (we will denote this variable as Rating).

[Table: observed values of Accounts (x6), Workload (x7), and Rating (x8) for the 25 sales representatives; the individual data values are not reproduced in this excerpt.]

A table accompanying this section gives the observed values of x6, x7, and x8, and a figure presents the MINITAB output of a correlation matrix for the sales territory performance data. Examining the first column of this matrix, we see that the simple correlation coefficient between Sales and WkLoad is close to zero and the p value for testing the significance of the relationship between Sales and WkLoad is large. This indicates that there is little or no relationship between Sales and WkLoad. However, the simple correlation coefficients between Sales and the other seven independent variables are considerably larger (the largest is roughly 0.75), with small associated p values. This indicates the existence of potentially useful relationships between Sales and these seven independent variables. While simple correlation coefficients (and scatter plots) give us a preliminary understanding of the data, they cannot be relied upon alone to tell us which independent variables are significantly related to the dependent variable.
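A correlation matrix like the MINITAB output just described, with a p value for each pair of variables, could be computed along the following lines. This is a minimal sketch using synthetic stand-in data, since the actual sales territory values are not reproduced in this excerpt; the variable names and data are assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 25  # the sales territory data set has 25 representatives

# Synthetic stand-ins for three of the variables.
df = pd.DataFrame({
    "Time": rng.normal(85, 30, n),
    "WkLoad": rng.normal(3, 1, n),
})
df["Sales"] = 30.0 * df["Time"] + rng.normal(0, 500, n)

# Pearson correlation and p value for each pair, as in the MINITAB matrix.
cols = list(df.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        r, p = stats.pearsonr(df[a], df[b])
        print(f"{a:>7} vs {b:>7}: r = {r:6.3f}, p = {p:.4f}")
```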

[Figure: MINITAB output of a correlation matrix for the sales territory performance data, giving the Pearson correlation and the p value for each pair of the variables Sales, Time, MktPoten, Adver, MktShare, Change, Accts, WkLoad, and Rating.]

One reason simple correlations can mislead in this way is a condition called multicollinearity. Multicollinearity is said to exist among the independent variables in a regression situation if these independent variables are related to or dependent upon each other. One way to investigate multicollinearity is to examine the correlation matrix. To understand this, note that all of the simple correlation coefficients not located in the first column of this matrix measure the simple correlations between the independent variables. For example, the simple correlation coefficient between Accts and Time is approximately 0.78, which says that the Accts values tend to increase as the Time values increase. Such a relationship makes sense because it is logical that the longer a sales representative has been with the company, the more accounts they handle. Statisticians often regard multicollinearity in a data set to be severe if at least one simple correlation coefficient between the independent variables is at least 0.9. Since the largest such simple correlation coefficient for the sales territory performance data is approximately 0.78, multicollinearity is not severe here by this standard. Note, however, that even moderate multicollinearity can be a problem. This will be demonstrated later using the sales territory performance data.

Another way to measure multicollinearity is to use variance inflation factors. Consider a regression model relating a dependent variable y to a set of independent variables x1, ..., x(j-1), xj, x(j+1), ..., xk. The variance inflation factor for the independent variable xj in this set is denoted VIFj and is defined by the equation

VIFj = 1 / (1 − Rj²),

where Rj² is the multiple coefficient of determination for the regression model that relates xj to all the other independent variables x1, ..., x(j-1), x(j+1), ..., xk in the set. For example, a MegaStat output accompanying this section gives the t statistics, p values, and variance inflation factors for the sales territory performance model that relates y to all eight independent variables. The largest variance inflation factor on that output is VIF6, the variance inflation factor for Accts. To calculate VIF6, MegaStat first calculates the multiple coefficient of determination R6² for the regression model that relates x6 to x1, x2, x3, x4, x5, x7, and x8, and it then computes VIF6 = 1/(1 − R6²). In general, if Rj² = 0, which says that xj is not related to the other independent variables, then the variance inflation factor VIFj equals 1. On the other hand, if Rj² > 0, which says that xj is related to the other independent variables, then 1 − Rj² is less than 1, making VIFj greater than 1.
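Variance inflation factors can be computed directly from this definition by regressing each independent variable on all the others. The following is an illustrative sketch on synthetic data (the real data are not reproduced here); only the induced Time-Accts correlation mimics the situation described in the text.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 25

# Synthetic stand-ins for eight predictors; the second column (standing in
# for Accts) is built to be correlated with the first (standing in for Time).
time = rng.normal(85, 30, n)
accts = 1.2 * time + rng.normal(0, 40, n)
X = np.column_stack([time, accts, rng.normal(size=(n, 6))])
names = ["Time", "Accts", "z3", "z4", "z5", "z6", "z7", "z8"]

# VIF_j = 1 / (1 - R_j^2), where R_j^2 is from regressing x_j on the others.
vifs = []
for j, name in enumerate(names):
    aux = sm.OLS(X[:, j], sm.add_constant(np.delete(X, j, axis=1))).fit()
    vifs.append(1.0 / (1.0 - aux.rsquared))
    print(f"VIF for {name}: {vifs[-1]:.2f}")
print(f"mean VIF: {np.mean(vifs):.2f}")  # MegaStat also reports this average
```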

[Figure: MegaStat output of the t statistics, p values, and variance inflation factors for the sales territory performance model y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + β8x8 + ε, giving for the intercept and each of Time, MktPoten, Adver, MktShare, Change, Accts, WkLoad, and Rating the coefficient, standard error, t statistic (df = 16), p value, and 95% confidence limits, plus each independent variable's VIF and the mean VIF.]

Both the largest variance inflation factor among the independent variables and the mean of the variance inflation factors for the independent variables indicate the severity of multicollinearity. Generally, the multicollinearity between independent variables is considered severe if

1. the largest variance inflation factor is greater than ten (which means that the largest Rj² is greater than 0.9), or

2. the mean of the variance inflation factors is substantially greater than one.

The largest variance inflation factor on the MegaStat output is not greater than ten, and the average of the variance inflation factors, which is 2.667, would probably not be considered substantially greater than one. Therefore, we would probably not consider the multicollinearity among the eight independent variables to be severe.

The reason that VIFj is called the variance inflation factor is that it can be shown that, when VIFj is greater than one, the standard deviation σ_bj of the population of all possible values of the least squares point estimate bj is likely to be inflated beyond its value when Rj² = 0. If σ_bj is greatly inflated, two slightly different samples of values of the dependent variable can yield two substantially different values of bj.

To intuitively understand why strong multicollinearity can significantly affect the least squares point estimates, consider the so-called picket fence display (a margin figure in the original text, showing pickets standing over the x1-x2 plane with y as the vertical axis). This figure depicts two independent variables (x1 and x2) exhibiting strong multicollinearity (note that as x1 increases, x2 increases). The heights of the pickets on the fence represent the y observations. If we assume that the model y = β0 + β1x1 + β2x2 + ε adequately describes these data, then calculating the least squares point estimates amounts to fitting a plane to the points on the top of the picket fence. Clearly, this plane would be quite unstable. That is, a slightly different height of one of the pickets (a slightly different y value) could cause the slant of the fitted plane (and the least squares point estimates that determine this slant) to change radically. It follows that when strong multicollinearity exists, sampling variation can result in least squares point estimates that differ substantially from the true values of the regression parameters. In fact, some of the least squares point estimates may have a sign (positive or negative) that differs from the sign of the true value of the parameter (you will see an example of this in the exercises). Therefore, when strong multicollinearity exists, it is dangerous to interpret the least squares point estimates individually.
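The picket-fence instability can be imitated numerically. In this illustrative sketch (synthetic data, not the textbook's), two predictors move almost in lockstep, and refitting the model to slightly different y samples makes the two slope estimates swing wildly, occasionally with wrong signs, even though their sum stays comparatively stable.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 25

# Two strongly multicollinear predictors: x2 tracks x1 almost exactly.
x1 = rng.uniform(0, 10, n)
x2 = x1 + rng.normal(0, 0.05, n)
X = sm.add_constant(np.column_stack([x1, x2]))
mean_y = 3.0 + 1.0 * x1 + 1.0 * x2  # true beta1 = beta2 = 1

# Slightly different "picket heights" (y samples) on each refit.
for trial in range(5):
    y = mean_y + rng.normal(0, 1.0, n)
    b = sm.OLS(y, X).fit().params
    print(f"trial {trial}: b1 = {b[1]:7.2f}, b2 = {b[2]:7.2f}, "
          f"b1 + b2 = {b[1] + b[2]:5.2f}")
```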

The most important problem caused by multicollinearity is that, even when multicollinearity is not severe, it can hinder our ability to use the t statistics and related p values to assess the importance of the independent variables. Recall that we can reject H0: βj = 0 in favour of Ha: βj ≠ 0 at level of significance α if and only if the absolute value of the corresponding t statistic is greater than t_{α/2} based on n − (k + 1) degrees of freedom or, equivalently, if and only if the related p value is less than α. Thus, the larger (in absolute value) the t statistic is and the smaller the p value is, the stronger is the evidence that we should reject H0: βj = 0 and that the independent variable xj is significant. When multicollinearity exists, the sizes of the t statistic and of the related p value measure the additional importance of the independent variable xj over the combined importance of the other independent variables in the regression model. Since two or more correlated independent variables contribute redundant information, multicollinearity often causes the t statistics obtained by relating a dependent variable to a set of correlated independent variables to be smaller (in absolute value) than the t statistics that would be obtained if separate regression analyses were run, where each separate regression analysis relates the dependent variable to a smaller set (for example, only one) of the correlated independent variables. Thus, multicollinearity can cause some of the correlated independent variables to appear less important, in terms of having small absolute t statistics and large p values, than they really are. Another way to understand this is to note that, since multicollinearity inflates σ_bj, it inflates the point estimate s_bj of σ_bj. Since t = bj / s_bj, an inflated value of s_bj can (depending on the size of bj) cause t to be small (and the related p value to be large). This would suggest that xj is not significant even though xj may actually be important.

For example, the MegaStat output discussed above shows that when we perform a regression analysis of the sales territory performance data using a model that relates y to all eight independent variables, several of the p values are quite large (the p value for WkLoad, for instance, is roughly 0.65). By contrast, recall from Figure 12.7 (page 430) that when we perform a regression analysis of the sales territory performance data using a model that relates y to the first five independent variables, Time, Adver, and Change have p values of 0.006, 0.002, and 0.030, respectively, and the p values for MktPoten and MktShare are smaller still. Note that Time (p value 0.006) seems highly significant and Change (p value 0.030) seems somewhat significant in the five-independent-variable model. However, when we consider the model that uses all eight independent variables, the p values for Time and Change increase substantially, making Time seem insignificant and Change only marginally significant. The reason that Time and Change seem more significant in the five-independent-variable model is that, since this model uses fewer variables, Time and Change contribute less overlapping information and thus have more additional importance in this model.

Comparing regression models on the basis of R², s, adjusted R², prediction interval length, and the C statistic

We have seen that when multicollinearity exists in a model, the p value associated with an independent variable in the model measures the additional importance of the variable over the combined importance of the other variables in the model. Therefore, it can be difficult to use the p values to determine which variables to retain in a model and which variables to remove from the model.
This implies that we need to evaluate more than the additional importance of each independent variable in a regression model. We also need to evaluate how well the independent variables work together to accurately describe, predict, and control the dependent variable. One way to do this is to determine whether the overall model gives a high R² and adjusted R², a small s, and short prediction intervals. It can be proven that adding any independent variable to a regression model, even an unimportant independent variable, will decrease the unexplained variation and increase the explained variation. Therefore, since the total variation Σ(yi − ȳ)² depends only on the observed y values and thus remains unchanged when we add an independent variable to a regression model, it follows that adding any independent variable to a regression model will increase

R² = (explained variation) / (total variation).
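A quick numeric check of this fact (an illustrative sketch on synthetic data): R² does not decrease even when the added variable is pure noise.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 25
x1 = rng.uniform(0, 10, n)
y = 2.0 + 0.8 * x1 + rng.normal(0, 1, n)
noise = rng.normal(size=n)  # an unimportant independent variable

r2_one = sm.OLS(y, sm.add_constant(x1)).fit().rsquared
r2_two = sm.OLS(y, sm.add_constant(np.column_stack([x1, noise]))).fit().rsquared

# Adding a variable can only leave R^2 unchanged or raise it.
print(f"R^2 with x1 only:        {r2_one:.4f}")
print(f"R^2 with x1 plus noise:  {r2_two:.4f}")
```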

This implies that R² cannot tell us (by decreasing) that adding an independent variable is undesirable. That is, although we wish to obtain a model with a large R², there are better criteria than R² that can be used to compare regression models. One better criterion is the standard error

s = sqrt( SSE / (n − (k + 1)) ).

When we add an independent variable to a regression model, the number of model parameters, k + 1, increases by one, and thus the number of degrees of freedom, n − (k + 1), decreases by one. If the decrease in n − (k + 1), which is used in the denominator to calculate s, is proportionally more than the decrease in the SSE (the unexplained variation) caused by adding the independent variable to the model, then s will increase. If s increases, this tells us that we should not add the independent variable to the model. To see one reason why, consider the formula for the prediction interval for y:

[ŷ ± t_{α/2} · s · sqrt(1 + distance value)].

Since adding an independent variable to a model decreases the number of degrees of freedom, adding the variable will increase the t_{α/2} point used to calculate the prediction interval. To understand this, look at any column of the t table in Appendix A and scan from the bottom of the column to the top: the t points increase as the degrees of freedom decrease. It can also be shown that adding any independent variable to a regression model will not decrease (and usually increases) the distance value. Therefore, since adding an independent variable increases t_{α/2} and does not decrease the distance value, if s increases, the length of the prediction interval for y will increase. This means the model will predict less accurately, and thus we should not add the independent variable.

On the other hand, if adding an independent variable to a regression model decreases s, the length of a prediction interval for y will decrease if and only if the decrease in s is enough to offset the increase in t_{α/2} and the (possible) increase in the distance value. Therefore, an independent variable should not be included in a final regression model unless it reduces s enough to reduce the length of the desired prediction interval for y. However, we must balance the length of the prediction interval (or, in general, the goodness of any criterion) against the difficulty and expense of using the model. For instance, predicting y requires knowing the corresponding values of the independent variables, so we must decide whether including an independent variable reduces s and the prediction interval lengths enough to offset the potential errors caused by possibly inaccurate determination of the values of the independent variables, or the possible expense of determining these values. If adding an independent variable provides prediction intervals that are only slightly shorter while making the model more difficult and/or more expensive to use, we might decide that including the variable is not desirable.

Since a key factor is the length of the prediction intervals provided by the model, one might wonder why we do not simply make direct comparisons of prediction interval lengths (without looking at s). It is useful to compare interval lengths, but these lengths depend on the distance value, which depends on how far the values of the independent variables for which we wish to predict are from the centre of the experimental region.
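For readers who want to reproduce this kind of calculation outside MegaStat or MINITAB, here is a sketch of computing a 95 percent prediction interval and its length with Python's statsmodels (synthetic data; the package choice is an assumption, not part of the textbook).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 25
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(0, 1, n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# 95% prediction interval for y at a new combination of predictor values.
x_new = np.array([[1.0, 5.0, 2.5]])  # constant term, x1, x2
frame = fit.get_prediction(x_new).summary_frame(alpha=0.05)
lo = frame["obs_ci_lower"].iloc[0]
hi = frame["obs_ci_upper"].iloc[0]
print(f"95% prediction interval: [{lo:.2f}, {hi:.2f}]  (length {hi - lo:.2f})")
```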
We often wish to compute prediction intervals for several different combinations of values of the independent variables (and thus for several different values of the distance value), so we would compute prediction intervals with slightly different lengths. However, the standard error s is a constant factor with respect to the lengths of these prediction intervals (as long as we are considering the same regression model). Thus, it is common practice to compare regression models on the basis of s (and s²). Finally, note that it can be shown that the standard error s decreases if and only if adjusted R² increases. It follows that, if we are comparing regression models, the model that gives the smallest s also gives the largest adjusted R².
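The following sketch checks this equivalence on synthetic data: across nested candidate models, the ranking produced by s and the ranking produced by adjusted R² agree.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 25
X_all = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X_all[:, 0] + rng.normal(0, 1, n)  # only the first predictor matters

# s = sqrt(SSE / (n - (k + 1))) falls exactly when adjusted R^2 rises.
for k in (1, 2, 3):
    fit = sm.OLS(y, sm.add_constant(X_all[:, :k])).fit()
    s = np.sqrt(fit.ssr / fit.df_resid)
    print(f"k = {k}: s = {s:.4f}, adjusted R^2 = {fit.rsquared_adj:.4f}")
```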

[Figure 12.4: MegaStat output of some of the best sales territory performance regression models: (a) the best single model of each size and (b) the best eight models. Each row indicates which of Time, MktPoten, Adver, MktShare, Change, Accts, WkLoad, and Rating are included and gives the model's s, adjusted R², R², Cp, and p value.]

Example: The Sales Territory Performance Case

Figure 12.4 gives MegaStat output resulting from calculating R², adjusted R², and s for all possible regression models based on all possible combinations of the eight independent variables in the sales territory performance situation (we will explain the values of Cp on the output after we complete this example). The first output gives the best single model of each size, and the second output gives the eight best models of any size, in terms of s and adjusted R². The output also gives the p values for the variables in each model. Examining the output, we see that the three models with the smallest values of s and the largest values of adjusted R² are as follows:

1. The six-variable model that contains Time, MktPoten, Adver, MktShare, Change, and Accts; we refer to this model as Model 1.

2. The five-variable model that contains Time, MktPoten, Adver, MktShare, and Change; we refer to this model as Model 2.

3. The seven-variable model that contains Time, MktPoten, Adver, MktShare, Change, Accts, and WkLoad; we refer to this model as Model 3.

To see that s can increase when we add an independent variable to a regression model, note that s increases when we add WkLoad to Model 1 to form Model 3. In this case, although it can be verified that adding WkLoad decreases the unexplained variation slightly (from approximately 3,297,000 to approximately 3,226,756), this decrease is not enough to offset the decrease in the denominator of

s² = SSE / (n − (k + 1)),

which decreases from 25 − 7 = 18 to 25 − 8 = 17. To see that prediction interval lengths might increase even though s decreases, consider adding Accts to Model 2 to form Model 1, which slightly decreases s. Consider a sales representative with a particular combination of values of Time, MktPoten, Adver, MktShare, Change, and Accts. The 95 percent prediction interval given by Model 2 for sales corresponding to this combination of values of the independent variables is approximately [3,234, 5,130], with a length of about 1,896. The 95 percent prediction interval given by Model 1 for such sales is approximately [3,194, 5,093], with a length of about 1,899. In other words, the slight decrease in s accomplished by adding Accts to Model 2 to form Model 1 is not enough to offset the increases in t_{α/2} and the distance value (which can be shown to increase), and thus the length of the prediction interval given by Model 1 increases. In addition, the extra independent variable Accts in Model 1 has a large p value. Therefore, we conclude that Model 2 is better than Model 1 and is, in fact, the best sales territory performance model (using only linear terms).

Another quantity that can be used to compare regression models is called the C statistic (also often called the Cp statistic). To show how to calculate the C statistic, suppose that we wish to choose an appropriate set of independent variables from p potential independent variables. We first calculate the mean square error, which we denote as s_p², for the model using all p potential independent variables. Then, if SSE denotes the unexplained variation for another particular model that has k independent variables, the C statistic for this model is

C = SSE / s_p² − [n − 2(k + 1)].

For example, consider the sales territory performance case. Substituting the mean square error for the model that uses all p = 8 independent variables and the SSE for the model that uses the first k = 5 independent variables (Model 2 in the previous example) into this formula, with n = 25, gives the C statistic for Model 2 that is shown in Figure 12.4.

Because the C statistic for a given model is a function of the model's SSE, and because we want the SSE to be small, we want C to be small. Although adding an unimportant independent variable to a regression model will decrease the SSE, adding such a variable can increase C. This can happen when the decrease in the SSE caused by the addition of the extra independent variable is not enough to offset the decrease in n − 2(k + 1) caused by the addition of the extra independent variable (which increases k by 1). It should be noted that, although adding an unimportant independent variable to a regression model can increase both s² and C, there is no exact relationship between s² and C. While we want C to be small, it can be shown from the theory behind the C statistic that we also wish to find a model for which the C statistic roughly equals k + 1, the number of parameters in the model. If a model has a C statistic substantially greater than k + 1, it can be shown that this model has substantial bias and is undesirable. Thus, although we want to find a model for which C is as small as possible, if C for such a model is substantially greater than k + 1, we may prefer to choose a different model for which C is slightly larger and more nearly equal to the number of parameters in that (different) model.
If a particular model has a small value of C, and C for this model is less than k + 1, then the model should be considered desirable. Finally, it should be noted that for the model that includes all p potential independent variables (and thus utilizes p + 1 parameters), it can be shown that C = p + 1.
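The C statistic is straightforward to compute from its definition. In the sketch below (synthetic data, illustrative variable ordering), the full model also reproduces the closing fact that C = p + 1 when all p potential variables are used.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n, p = 25, 4
X_full = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X_full[:, 0] + 1.5 * X_full[:, 1] + rng.normal(0, 1, n)

# s_p^2: the mean square error of the model using all p potential predictors.
s2_p = sm.OLS(y, sm.add_constant(X_full)).fit().mse_resid

# C = SSE / s_p^2 - [n - 2(k + 1)] for nested models with k predictors.
for k in range(1, p + 1):
    sse = sm.OLS(y, sm.add_constant(X_full[:, :k])).fit().ssr
    c = sse / s2_p - (n - 2 * (k + 1))
    print(f"k = {k}: C = {c:5.2f}  (k + 1 = {k + 1})")
```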

If we examine Figure 12.4, we see that Model 2 of the previous example has the smallest C statistic. Since its C statistic is less than k + 1 = 6, the model is not biased. Therefore, this model should be considered best with respect to the C statistic.

Stepwise regression and backward elimination

In some situations, it is useful to employ an iterative model selection procedure, where at each step a single independent variable is added to or deleted from a regression model, and a new regression model is evaluated. We discuss here two such procedures: stepwise regression and backward elimination.

There are slight variations in the way different computer packages carry out stepwise regression. Assuming that y is the dependent variable and x1, x2, ..., xp are the p potential independent variables, we explain how most of the computer packages perform stepwise regression. Stepwise regression uses t statistics (and related p values) to determine the significance of the independent variables in various regression models. In this context, we say that the t statistic indicates that the independent variable xj is significant at the α level if and only if the related p value is less than α. Stepwise regression is then carried out as follows.

Choice of α_entry and α_stay: Before beginning the stepwise procedure, we choose a value of α_entry, which we call the probability of a Type I error related to entering an independent variable into the regression model. We also choose a value of α_stay, which we call the probability of a Type I error related to retaining an independent variable that was previously entered into the model. Although there are many considerations in choosing these values, it is common practice to set both α_entry and α_stay equal to 0.05 or 0.10.

Step 1: The stepwise procedure considers the p possible one-independent-variable regression models of the form y = β0 + β1xj + ε. Each different model includes a different potential independent variable. For each model, the t statistic (and p value) related to testing H0: β1 = 0 versus Ha: β1 ≠ 0 is calculated. Denoting the independent variable giving the largest absolute value of the t statistic (and the smallest p value) by the symbol x[1], we consider the model y = β0 + β1x[1] + ε. If the t statistic does not indicate that x[1] is significant at the α_entry level, then the stepwise procedure terminates by concluding that none of the independent variables are significant at the α_entry level. If the t statistic indicates that x[1] is significant at the α_entry level, then x[1] is retained for use in Step 2.

Step 2: The stepwise procedure considers the p − 1 possible two-independent-variable regression models of the form y = β0 + β1x[1] + β2xj + ε. Each different model includes x[1], the independent variable chosen in Step 1, and a different potential independent variable chosen from the remaining p − 1 independent variables that were not chosen in Step 1. For each model, the t statistic (and p value) related to testing H0: β2 = 0 versus Ha: β2 ≠ 0 is calculated. Denoting the independent variable giving the largest absolute value of the t statistic (and the smallest p value) by the symbol x[2], we consider the model y = β0 + β1x[1] + β2x[2] + ε.
If the t statistic indicates that x[2] is significant at the α_entry level, then x[2] is retained in this model, and the stepwise procedure checks whether x[1] should be allowed to stay in the model. This check should be made because multicollinearity will probably cause the t statistic related to the importance of x[1] to change when x[2] is added to the model. If the t statistic does not indicate that x[1] is significant at the α_stay level, then the stepwise procedure returns to the beginning of Step 2.

Starting with a new one-independent-variable model that uses the new significant independent variable x[2], the stepwise procedure attempts to find a new two-independent-variable model y = β0 + β1x[2] + β2xj + ε. If, instead, the t statistic indicates that x[1] is significant at the α_stay level in the model y = β0 + β1x[1] + β2x[2] + ε, then both of the independent variables x[1] and x[2] are retained for use in further steps.

Further steps: The stepwise procedure continues by adding independent variables one at a time to the model. At each step, an independent variable is added to the model if it has the largest (in absolute value) t statistic of the independent variables not in the model and if its t statistic indicates that it is significant at the α_entry level. After adding an independent variable, the stepwise procedure checks all the independent variables already included in the model and removes an independent variable if it has the smallest (in absolute value) t statistic of the independent variables already included in the model and if its t statistic indicates that it is not significant at the α_stay level. This removal procedure is continued sequentially, and only after the necessary removals are made does the stepwise procedure attempt to add another independent variable to the model. The stepwise procedure terminates when all the independent variables not in the model are insignificant at the α_entry level or when the variable to be added to the model is the one just removed from it.

For example, again consider the sales territory performance data, and let x1, x2, x3, x4, x5, x6, x7, and x8 be the eight potential independent variables employed in the stepwise procedure. Figure 12.46(a) gives the MegaStat output of the stepwise regression employing these independent variables, where both α_entry and α_stay have been set equal to 0.10. The stepwise procedure

1. adds Accts (x6) on the first step;

2. adds Adver (x3) and retains Accts on the second step;

3. adds MktPoten (x2) and retains Accts and Adver on the third step;

4. adds MktShare (x4) and retains Accts, Adver, and MktPoten on the fourth step.

The procedure terminates after step 4, when no more independent variables can be added. Therefore, the stepwise procedure arrives at the model that utilizes x2, x3, x4, and x6.

To carry out backward elimination, we perform a regression analysis using a regression model containing all of the p potential independent variables. Then the independent variable having the smallest (in absolute value) t statistic is chosen. If the t statistic indicates that this independent variable is significant at the α_stay level (α_stay is chosen prior to the beginning of the procedure), then the procedure terminates by choosing the regression model containing all p independent variables. If this independent variable is not significant at the α_stay level, then it is removed from the model, and a regression analysis is performed using a regression model containing all of the remaining independent variables. The procedure continues by removing independent variables one at a time from the model. At each step, an independent variable is removed from the model if it has the smallest (in absolute value) t statistic of the independent variables remaining in the model and if it is not significant at the α_stay level. The procedure terminates when no independent variable remaining in the model can be removed.
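Here is a compact sketch of the stepwise logic just described, written in Python with statsmodels. The data are synthetic and the helper function is illustrative; commercial packages add refinements (for example, a guard against the add-remove cycle mentioned above) that are omitted here for brevity.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def stepwise(X, y, alpha_entry=0.10, alpha_stay=0.10):
    """Forward stepwise selection with a removal check, using t-test p values."""
    selected = []
    while True:
        # Entry step: among variables not in the model, find the smallest p value.
        entry_p = {}
        for c in (c for c in X.columns if c not in selected):
            fit = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            entry_p[c] = fit.pvalues[c]
        if not entry_p:
            break
        best = min(entry_p, key=entry_p.get)
        if entry_p[best] >= alpha_entry:
            break  # nothing left is significant at the alpha_entry level
        selected.append(best)
        # Removal step: drop entered variables that fail the alpha_stay test.
        while selected:
            fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
            worst = fit.pvalues[selected].idxmax()  # largest p = smallest |t|
            if fit.pvalues[worst] <= alpha_stay:
                break
            selected.remove(worst)
    return selected

# Illustrative synthetic data: only x1 and x3 actually drive y.
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(25, 5)), columns=[f"x{i}" for i in range(1, 6)])
y = 2.0 * X["x1"] + 1.5 * X["x3"] + rng.normal(0, 1.0, 25)
print(stepwise(X, y))  # typically ['x1', 'x3']
```

The default α_entry = α_stay = 0.10 mirrors the settings used in the MegaStat run of Figure 12.46(a).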
Backward elimination is generally considered a reasonable procedure, especially for analysts who like to start with all possible independent variables in the model so that they will not miss anything important. To illustrate backward elimination, we first note that choosing the independent variable that has the smallest (in absolute value) t statistic in a model is equivalent to choosing the independent variable that has the largest p value in the model. With this in mind, Figure 12.46(b) gives the MINITAB output of a backward elimination of the sales territory performance data. Here the backward elimination uses α_stay = 0.05, begins with the model using all eight independent variables, and removes (in order) Rating (x8), then WkLoad (x7), then Accts (x6), and finally Change (x5).
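Before reading off the final model from Figure 12.46(b), here is a matching sketch of backward elimination (again synthetic and illustrative; the analysis in the text was done in MINITAB):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X, y, alpha_stay=0.05):
    """Repeatedly drop the largest-p-value predictor until all pass alpha_stay."""
    remaining = list(X.columns)
    while remaining:
        fit = sm.OLS(y, sm.add_constant(X[remaining])).fit()
        worst = fit.pvalues[remaining].idxmax()  # largest p = smallest |t|
        if fit.pvalues[worst] <= alpha_stay:
            break  # every remaining variable is significant at alpha_stay
        remaining.remove(worst)
    return remaining

# Illustrative synthetic data: only x2 and x4 actually drive y.
rng = np.random.default_rng(4)
X = pd.DataFrame(rng.normal(size=(25, 5)), columns=[f"x{i}" for i in range(1, 6)])
y = 2.0 * X["x2"] + 1.5 * X["x4"] + rng.normal(0, 1.0, 25)
print(backward_eliminate(X, y))  # typically ['x2', 'x4']
```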

[Figure 12.46: (a) MegaStat output of the stepwise regression (α_entry = α_stay = 0.10) for the sales territory performance problem, displaying the best model of each size for the 25 observations, with Sales as the dependent variable; for each model the output gives the p values for the coefficients of Time (x1), MktPoten (x2), Adver (x3), MktShare (x4), Change (x5), Accts (x6), WkLoad (x7), and Rating (x8), together with s, adjusted R², R², Cp, and the model p value. (b) MINITAB output of the backward elimination (α_stay = 0.05); the response is Sales on 8 predictors, with N = 25, and each step reports the constant and the coefficient, t value, and p value of each remaining predictor, along with S, R-Sq, R-Sq(adj), and Mallows C-p.]

The procedure terminates when no independent variable remaining in the model can be removed (that is, when no remaining independent variable has a related p value greater than α_stay = 0.05) and arrives at a model that uses Time (x1), MktPoten (x2), Adver (x3), and MktShare (x4). This model has an s of approximately 464, and it is inferior to the model arrived at by stepwise regression, which has a smaller s and a larger R² (see Figure 12.46(a)). However, the backward elimination process allows us to find a model that is better than either of these. If we look at the model considered by backward elimination after Rating (x8), WkLoad (x7), and Accts (x6) have been removed, we have the model using x1, x2, x3, x4, and x5. This model has an s of approximately 430, and in the example above we reasoned that this model is perhaps the best sales territory performance model. Interestingly, this is the model that backward elimination would arrive at if we were to set α_stay equal to 0.10 rather than 0.05; note that this model has no p values greater than 0.10.

The sales territory performance example brings home two important points. First, the models obtained by backward elimination and by stepwise regression depend on the choices of α_entry and α_stay (whichever is appropriate). Second, it is best not to think of these methods as automatic model-building procedures. Rather, they should be regarded as processes that allow us to find and evaluate a variety of model choices.

Exercises for Section 12.12

CONCEPTS

12.9 What is multicollinearity? What problems can be caused by multicollinearity? Discuss how to compare regression models.

METHODS AND APPLICATIONS

THE HOSPITAL LABOUR NEEDS CASE

Table 12.5 (page 424) presents data concerning the need for labour in 16 hospitals. This table gives values of the dependent variable Hours (monthly labour hours) and of the independent variables Xray (monthly X-ray exposures), BedDays (monthly occupied bed days; a hospital has one occupied bed day if one bed is occupied for an entire day), and Length (average length of patients' stay, in days). The data in Table 12.5 are part of a larger data set. The complete data set includes two additional independent variables, Load (average daily patient load) and Pop (eligible population in the area, in thousands), whose values are given in an accompanying table (the individual data values are not reproduced in this excerpt). Figure 12.47 gives MegaStat output of multicollinearity analysis and model building for the complete hospital labour needs data set.

a. Find the three largest simple correlation coefficients between the independent variables in Figure 12.47(a). Also find the three largest variance inflation factors in Figure 12.47(b).

b. Based on your answers to part a, which independent variables are most strongly involved in multicollinearity?

c. Do any least squares point estimates have a sign (positive or negative) that is different from what we would intuitively expect (another indication of multicollinearity)?

d. The p value associated with F(model) for the model in Figure 12.47(b) is very small. In general, if the p value associated with F(model) is much smaller than any of the p values associated with the independent variables, this is another indication of multicollinearity. Is this true in this situation?

e. Figure 12.47(c) and (d) indicate that the two best hospital labour needs models are the model using Xray, BedDays, Pop, and Length, which we will call Model 1, and the model using Xray, BedDays, and Length, which we will call Model 2.
Which model gives the smallest value of s and the largest value of adjusted R²? Which model gives the smallest value of C? Consider a hospital with a given combination of values of Xray, BedDays, Pop, and Length (for instance, BedDays = 14,077.88 and Pop = 329.7). The 95 percent prediction intervals given by Models 1 and 2 for labour hours corresponding to this combination of values of the independent variables are, respectively, [14,888.43, 16,861.30] and [14,906.24, 16,886.26]. Which model gives the shorter prediction interval?

[Figure 12.47: MegaStat output of multicollinearity analysis and model building for the hospital labour needs data: (a) a correlation matrix for Load, Xray, BedDays, Pop, Length, and Hours, with the sample size and the 0.05 and 0.01 two-tail critical values; (b) the regression output for the five-variable model, giving for the intercept and each of Xray (x1), BedDays (x2), Length (x3), Load (x4), and Pop (x5) the coefficient, standard error, t statistic (df = 10), and p value, plus each variable's VIF and the mean VIF; (c) the best single model of each size; and (d) the best five models, each listed with its s, adjusted R², R², Cp, and p value.]

f. Consider the figure below showing the stepwise regression and backward elimination of the hospital labour needs data. Which model is chosen by both stepwise regression and backward elimination? Overall, which model seems best?

Market Planning, a marketing research firm, has obtained the prescription sales data shown below for n = 20 independent pharmacies.¹ In this table, y is the average weekly prescription sales over the past year (in units of $1,000), x1 is the floor space (in square feet), x2 is the percentage of floor space allocated to the prescription department, x3 is the number of parking spaces available to the store, x4 is the weekly per capita income for the surrounding community (in units of $100), and x5 is a dummy variable that equals 1 if the pharmacy is located in a shopping centre and 0 otherwise. Use the MegaStat output of the single best model of each size (in the final figure below) to discuss why the model using FloorSpace and Presc.Pct might be the best model describing prescription sales. The least squares point estimates b0, b1, and b2 of the parameters of this model can be calculated from the data. Discuss what b1 and b2 say about obtaining high prescription sales.

¹ This problem is taken from an example in An Introduction to Statistical Methods and Data Analysis, 2nd ed., by L. Ott (Boston: PWS-KENT Publishing Company, 1987). Used with permission.

[Figure: MegaStat output of a stepwise regression (α_entry = α_stay = 0.10; 16 observations; Hours(y) is the dependent variable) displaying the best model of each size, with p values for the coefficients of Xray(x1), BedDays(x2), Length(x3), Load(x4), and Pop(x5) along with s, adjusted R², R², Cp, and the model p value, and MINITAB output of a backward elimination (α_stay = 0.05) of the hospital labour needs data, reporting for each step the constant and each remaining predictor's coefficient, t value, and p value, together with S and R-Sq.]

[Table: Prescription Sales Data for the 20 pharmacies, giving for each pharmacy the Sales (y), Floor Space (x1), Prescription Percentage (x2), Parking (x3), Income (x4), and Shopping Centre (x5) values; the individual data values are not reproduced in this excerpt. Source: From An Introduction to Statistical Methods and Data Analysis, 2nd ed., by L. Ott. Reprinted with permission of Brooks/Cole, an imprint of the Wadsworth Group, a division of Thomson Learning.]

[Figure: MegaStat output of the single best model of each size for the prescription sales data, giving for each size the p values for the coefficients of FloorSpace, Presc.Pct, Parking, Income, and ShopCntr, together with s, adjusted R², R², Cp, and the model p value.]


More information

Steps to take to do the descriptive part of regression analysis:

Steps to take to do the descriptive part of regression analysis: STA 2023 Simple Linear Regression: Least Squares Model Steps to take to do the descriptive part of regression analysis: A. Plot the data on a scatter plot. Describe patterns: 1. Is there a strong, moderate,

More information

CHAPTER 5 LINEAR REGRESSION AND CORRELATION

CHAPTER 5 LINEAR REGRESSION AND CORRELATION CHAPTER 5 LINEAR REGRESSION AND CORRELATION Expected Outcomes Able to use simple and multiple linear regression analysis, and correlation. Able to conduct hypothesis testing for simple and multiple linear

More information

Ratio of Polynomials Fit Many Variables

Ratio of Polynomials Fit Many Variables Chapter 376 Ratio of Polynomials Fit Many Variables Introduction This program fits a model that is the ratio of two polynomials of up to fifth order. Instead of a single independent variable, these polynomials

More information

Multiple Linear Regression

Multiple Linear Regression 1. Purpose To Model Dependent Variables Multiple Linear Regression Purpose of multiple and simple regression is the same, to model a DV using one or more predictors (IVs) and perhaps also to obtain a prediction

More information

Introduction to Regression

Introduction to Regression Regression Introduction to Regression If two variables covary, we should be able to predict the value of one variable from another. Correlation only tells us how much two variables covary. In regression,

More information

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima Applied Statistics Lecturer: Serena Arima Hypothesis testing for the linear model Under the Gauss-Markov assumptions and the normality of the error terms, we saw that β N(β, σ 2 (X X ) 1 ) and hence s

More information

PubH 7405: REGRESSION ANALYSIS. MLR: INFERENCES, Part I

PubH 7405: REGRESSION ANALYSIS. MLR: INFERENCES, Part I PubH 7405: REGRESSION ANALYSIS MLR: INFERENCES, Part I TESTING HYPOTHESES Once we have fitted a multiple linear regression model and obtained estimates for the various parameters of interest, we want to

More information

Sections 7.1, 7.2, 7.4, & 7.6

Sections 7.1, 7.2, 7.4, & 7.6 Sections 7.1, 7.2, 7.4, & 7.6 Adapted from Timothy Hanson Department of Statistics, University of South Carolina Stat 704: Data Analysis I 1 / 25 Chapter 7 example: Body fat n = 20 healthy females 25 34

More information

Introducing Generalized Linear Models: Logistic Regression

Introducing Generalized Linear Models: Logistic Regression Ron Heck, Summer 2012 Seminars 1 Multilevel Regression Models and Their Applications Seminar Introducing Generalized Linear Models: Logistic Regression The generalized linear model (GLM) represents and

More information

A particularly nasty aspect of this is that it is often difficult or impossible to tell if a model fails to satisfy these steps.

A particularly nasty aspect of this is that it is often difficult or impossible to tell if a model fails to satisfy these steps. ECON 497: Lecture 6 Page 1 of 1 Metropolitan State University ECON 497: Research and Forecasting Lecture Notes 6 Specification: Choosing the Independent Variables Studenmund Chapter 6 Before we start,

More information

Sociology 593 Exam 1 February 17, 1995

Sociology 593 Exam 1 February 17, 1995 Sociology 593 Exam 1 February 17, 1995 I. True-False. (25 points) Indicate whether the following statements are true or false. If false, briefly explain why. 1. A researcher regressed Y on. When he plotted

More information

Correlation Analysis

Correlation Analysis Simple Regression Correlation Analysis Correlation analysis is used to measure strength of the association (linear relationship) between two variables Correlation is only concerned with strength of the

More information

2 Prediction and Analysis of Variance

2 Prediction and Analysis of Variance 2 Prediction and Analysis of Variance Reading: Chapters and 2 of Kennedy A Guide to Econometrics Achen, Christopher H. Interpreting and Using Regression (London: Sage, 982). Chapter 4 of Andy Field, Discovering

More information

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011) Ron Heck, Fall 2011 1 EDEP 768E: Seminar in Multilevel Modeling rev. January 3, 2012 (see footnote) Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October

More information

Regression-Discontinuity Analysis

Regression-Discontinuity Analysis Page 1 of 11 Home» Analysis» Inferential Statistics» Regression-Discontinuity Analysis Analysis Requirements The basic RD Design is a two-group pretestposttest model as indicated in the design notation.

More information

MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010

MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010 MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010 Part 1 of this document can be found at http://www.uvm.edu/~dhowell/methods/supplements/mixed Models for Repeated Measures1.pdf

More information

INFERENCE FOR REGRESSION

INFERENCE FOR REGRESSION CHAPTER 3 INFERENCE FOR REGRESSION OVERVIEW In Chapter 5 of the textbook, we first encountered regression. The assumptions that describe the regression model we use in this chapter are the following. We

More information

Analysis of Covariance. The following example illustrates a case where the covariate is affected by the treatments.

Analysis of Covariance. The following example illustrates a case where the covariate is affected by the treatments. Analysis of Covariance In some experiments, the experimental units (subjects) are nonhomogeneous or there is variation in the experimental conditions that are not due to the treatments. For example, a

More information

Multiple Linear Regression. Chapter 12

Multiple Linear Regression. Chapter 12 13 Multiple Linear Regression Chapter 12 Multiple Regression Analysis Definition The multiple regression model equation is Y = b 0 + b 1 x 1 + b 2 x 2 +... + b p x p + ε where E(ε) = 0 and Var(ε) = s 2.

More information

Inference with Simple Regression

Inference with Simple Regression 1 Introduction Inference with Simple Regression Alan B. Gelder 06E:071, The University of Iowa 1 Moving to infinite means: In this course we have seen one-mean problems, twomean problems, and problems

More information

MULTIPLE LINEAR REGRESSION IN MINITAB

MULTIPLE LINEAR REGRESSION IN MINITAB MULTIPLE LINEAR REGRESSION IN MINITAB This document shows a complicated Minitab multiple regression. It includes descriptions of the Minitab commands, and the Minitab output is heavily annotated. Comments

More information

9. Linear Regression and Correlation

9. Linear Regression and Correlation 9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,

More information

Time series and Forecasting

Time series and Forecasting Chapter 2 Time series and Forecasting 2.1 Introduction Data are frequently recorded at regular time intervals, for instance, daily stock market indices, the monthly rate of inflation or annual profit figures.

More information

CHAPTER 5 FUNCTIONAL FORMS OF REGRESSION MODELS

CHAPTER 5 FUNCTIONAL FORMS OF REGRESSION MODELS CHAPTER 5 FUNCTIONAL FORMS OF REGRESSION MODELS QUESTIONS 5.1. (a) In a log-log model the dependent and all explanatory variables are in the logarithmic form. (b) In the log-lin model the dependent variable

More information

Multiple Regression Part I STAT315, 19-20/3/2014

Multiple Regression Part I STAT315, 19-20/3/2014 Multiple Regression Part I STAT315, 19-20/3/2014 Regression problem Predictors/independent variables/features Or: Error which can never be eliminated. Our task is to estimate the regression function f.

More information

Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras

Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras Lecture - 39 Regression Analysis Hello and welcome to the course on Biostatistics

More information

Section 11: Quantitative analyses: Linear relationships among variables

Section 11: Quantitative analyses: Linear relationships among variables Section 11: Quantitative analyses: Linear relationships among variables Australian Catholic University 214 ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced or

More information

Applied Regression Modeling: A Business Approach Chapter 3: Multiple Linear Regression Sections

Applied Regression Modeling: A Business Approach Chapter 3: Multiple Linear Regression Sections Applied Regression Modeling: A Business Approach Chapter 3: Multiple Linear Regression Sections 3.1 3.3.2 by Iain Pardoe 3.1 Probability model for (X 1, X 2,...) and Y 2 Multiple linear regression................................................

More information

(ii) Scan your answer sheets INTO ONE FILE only, and submit it in the drop-box.

(ii) Scan your answer sheets INTO ONE FILE only, and submit it in the drop-box. FINAL EXAM ** Two different ways to submit your answer sheet (i) Use MS-Word and place it in a drop-box. (ii) Scan your answer sheets INTO ONE FILE only, and submit it in the drop-box. Deadline: December

More information

Interaction effects for continuous predictors in regression modeling

Interaction effects for continuous predictors in regression modeling Interaction effects for continuous predictors in regression modeling Testing for interactions The linear regression model is undoubtedly the most commonly-used statistical model, and has the advantage

More information

Linear model selection and regularization

Linear model selection and regularization Linear model selection and regularization Problems with linear regression with least square 1. Prediction Accuracy: linear regression has low bias but suffer from high variance, especially when n p. It

More information

Chapter Learning Objectives. Regression Analysis. Correlation. Simple Linear Regression. Chapter 12. Simple Linear Regression

Chapter Learning Objectives. Regression Analysis. Correlation. Simple Linear Regression. Chapter 12. Simple Linear Regression Chapter 12 12-1 North Seattle Community College BUS21 Business Statistics Chapter 12 Learning Objectives In this chapter, you learn:! How to use regression analysis to predict the value of a dependent

More information

STA121: Applied Regression Analysis

STA121: Applied Regression Analysis STA121: Applied Regression Analysis Linear Regression Analysis - Chapters 3 and 4 in Dielman Artin Department of Statistical Science September 15, 2009 Outline 1 Simple Linear Regression Analysis 2 Using

More information

Regression Models. Chapter 4. Introduction. Introduction. Introduction

Regression Models. Chapter 4. Introduction. Introduction. Introduction Chapter 4 Regression Models Quantitative Analysis for Management, Tenth Edition, by Render, Stair, and Hanna 008 Prentice-Hall, Inc. Introduction Regression analysis is a very valuable tool for a manager

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

Correlation and Regression

Correlation and Regression Correlation and Regression Dr. Bob Gee Dean Scott Bonney Professor William G. Journigan American Meridian University 1 Learning Objectives Upon successful completion of this module, the student should

More information

B. Weaver (24-Mar-2005) Multiple Regression Chapter 5: Multiple Regression Y ) (5.1) Deviation score = (Y i

B. Weaver (24-Mar-2005) Multiple Regression Chapter 5: Multiple Regression Y ) (5.1) Deviation score = (Y i B. Weaver (24-Mar-2005) Multiple Regression... 1 Chapter 5: Multiple Regression 5.1 Partial and semi-partial correlation Before starting on multiple regression per se, we need to consider the concepts

More information

Chapter 26 Multiple Regression, Logistic Regression, and Indicator Variables

Chapter 26 Multiple Regression, Logistic Regression, and Indicator Variables Chapter 26 Multiple Regression, Logistic Regression, and Indicator Variables 26.1 S 4 /IEE Application Examples: Multiple Regression An S 4 /IEE project was created to improve the 30,000-footlevel metric

More information

Multiple Regression Examples

Multiple Regression Examples Multiple Regression Examples Example: Tree data. we have seen that a simple linear regression of usable volume on diameter at chest height is not suitable, but that a quadratic model y = β 0 + β 1 x +

More information

Inferences for Regression

Inferences for Regression Inferences for Regression An Example: Body Fat and Waist Size Looking at the relationship between % body fat and waist size (in inches). Here is a scatterplot of our data set: Remembering Regression In

More information