12.12 MODEL BUILDING, AND THE EFFECTS OF MULTICOLLINEARITY (OPTIONAL)


Although Excel and MegaStat are emphasized in Business Statistics in Practice, Second Canadian Edition, some examples in the additional material on Connect can only be demonstrated using other programs, such as MINITAB, SPSS, and SAS. Please consult the user guides for these programs for instructions on their use.

Multicollinearity

Recall the sales territory performance data in Table 12.2 (page 422). These data consist of values of the dependent variable y (Sales) and of the independent variables x1 (Time), x2 (MktPoten), x3 (Adver), x4 (MktShare), and x5 (Change). The complete sales territory performance data analyzed by Cravens, Woodruff, and Stamper (1972) consist of the data presented in Table 12.2 and data concerning three additional independent variables. These three additional variables are defined as follows:

x6 = the number of accounts handled by the representative (we will denote this variable as Accts),

x7 = the average workload per account, measured by using a weighting based on the sizes of the orders by the accounts and other workload-related criteria (we will denote this variable as WkLoad),

x8 = an aggregate rating on eight dimensions of the representative's performance, made by a sales manager and expressed on a 1 to 7 scale (we will denote this variable as Rating).

[Table: observed values of Accounts (x6), Workload (x7), and Rating (x8) for the 25 sales representatives; the individual data values are not reproduced in this excerpt.]

A table accompanying this section gives the observed values of x6, x7, and x8, and a figure presents the MINITAB output of a correlation matrix for the sales territory performance data. Examining the first column of this matrix, we see that the simple correlation coefficient between Sales and WkLoad is close to zero and the p value for testing the significance of the relationship between Sales and WkLoad is large. This indicates that there is little or no relationship between Sales and WkLoad. However, the simple correlation coefficients between Sales and the other seven independent variables are considerably larger (the largest is roughly 0.75), with small associated p values. This indicates the existence of potentially useful relationships between Sales and these seven independent variables. While simple correlation coefficients (and scatter plots) give us a preliminary understanding of the data, they cannot be relied upon alone to tell us which independent variables are significantly related to the dependent variable.
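A correlation matrix like the MINITAB output just described, with a p value for each pair of variables, could be computed along the following lines. This is a minimal sketch using synthetic stand-in data, since the actual sales territory values are not reproduced in this excerpt; the variable names and data are assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 25  # the sales territory data set has 25 representatives

# Synthetic stand-ins for three of the variables.
df = pd.DataFrame({
    "Time": rng.normal(85, 30, n),
    "WkLoad": rng.normal(3, 1, n),
})
df["Sales"] = 30.0 * df["Time"] + rng.normal(0, 500, n)

# Pearson correlation and p value for each pair, as in the MINITAB matrix.
cols = list(df.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        r, p = stats.pearsonr(df[a], df[b])
        print(f"{a:>7} vs {b:>7}: r = {r:6.3f}, p = {p:.4f}")
```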

[Figure: MINITAB output of a correlation matrix for the sales territory performance data, giving the Pearson correlation and the p value for each pair of the variables Sales, Time, MktPoten, Adver, MktShare, Change, Accts, WkLoad, and Rating.]

One reason simple correlations can mislead in this way is a condition called multicollinearity. Multicollinearity is said to exist among the independent variables in a regression situation if these independent variables are related to or dependent upon each other. One way to investigate multicollinearity is to examine the correlation matrix. To understand this, note that all of the simple correlation coefficients not located in the first column of this matrix measure the simple correlations between the independent variables. For example, the simple correlation coefficient between Accts and Time is approximately 0.78, which says that the Accts values tend to increase as the Time values increase. Such a relationship makes sense because it is logical that the longer a sales representative has been with the company, the more accounts they handle. Statisticians often regard multicollinearity in a data set to be severe if at least one simple correlation coefficient between the independent variables is at least 0.9. Since the largest such simple correlation coefficient for the sales territory performance data is approximately 0.78, multicollinearity is not severe here by this standard. Note, however, that even moderate multicollinearity can be a problem. This will be demonstrated later using the sales territory performance data.

Another way to measure multicollinearity is to use variance inflation factors. Consider a regression model relating a dependent variable y to a set of independent variables x1, ..., x(j-1), xj, x(j+1), ..., xk. The variance inflation factor for the independent variable xj in this set is denoted VIFj and is defined by the equation

VIFj = 1 / (1 − Rj²),

where Rj² is the multiple coefficient of determination for the regression model that relates xj to all the other independent variables x1, ..., x(j-1), x(j+1), ..., xk in the set. For example, a MegaStat output accompanying this section gives the t statistics, p values, and variance inflation factors for the sales territory performance model that relates y to all eight independent variables. The largest variance inflation factor on that output is VIF6, the variance inflation factor for Accts. To calculate VIF6, MegaStat first calculates the multiple coefficient of determination R6² for the regression model that relates x6 to x1, x2, x3, x4, x5, x7, and x8, and it then computes VIF6 = 1/(1 − R6²). In general, if Rj² = 0, which says that xj is not related to the other independent variables, then the variance inflation factor VIFj equals 1. On the other hand, if Rj² > 0, which says that xj is related to the other independent variables, then 1 − Rj² is less than 1, making VIFj greater than 1.
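Variance inflation factors can be computed directly from this definition by regressing each independent variable on all the others. The following is an illustrative sketch on synthetic data (the real data are not reproduced here); only the induced Time-Accts correlation mimics the situation described in the text.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 25

# Synthetic stand-ins for eight predictors; the second column (standing in
# for Accts) is built to be correlated with the first (standing in for Time).
time = rng.normal(85, 30, n)
accts = 1.2 * time + rng.normal(0, 40, n)
X = np.column_stack([time, accts, rng.normal(size=(n, 6))])
names = ["Time", "Accts", "z3", "z4", "z5", "z6", "z7", "z8"]

# VIF_j = 1 / (1 - R_j^2), where R_j^2 is from regressing x_j on the others.
vifs = []
for j, name in enumerate(names):
    aux = sm.OLS(X[:, j], sm.add_constant(np.delete(X, j, axis=1))).fit()
    vifs.append(1.0 / (1.0 - aux.rsquared))
    print(f"VIF for {name}: {vifs[-1]:.2f}")
print(f"mean VIF: {np.mean(vifs):.2f}")  # MegaStat also reports this average
```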

[Figure: MegaStat output of the t statistics, p values, and variance inflation factors for the sales territory performance model y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + β8x8 + ε, giving for the intercept and each of Time, MktPoten, Adver, MktShare, Change, Accts, WkLoad, and Rating the coefficient, standard error, t statistic (df = 16), p value, and 95% confidence limits, plus each independent variable's VIF and the mean VIF.]

Both the largest variance inflation factor among the independent variables and the mean of the variance inflation factors for the independent variables indicate the severity of multicollinearity. Generally, the multicollinearity between independent variables is considered severe if

1. the largest variance inflation factor is greater than ten (which means that the largest Rj² is greater than 0.9), or

2. the mean of the variance inflation factors is substantially greater than one.

The largest variance inflation factor on the MegaStat output is not greater than ten, and the average of the variance inflation factors, which is 2.667, would probably not be considered substantially greater than one. Therefore, we would probably not consider the multicollinearity among the eight independent variables to be severe.

The reason that VIFj is called the variance inflation factor is that it can be shown that, when VIFj is greater than one, the standard deviation σ_bj of the population of all possible values of the least squares point estimate bj is likely to be inflated beyond its value when Rj² = 0. If σ_bj is greatly inflated, two slightly different samples of values of the dependent variable can yield two substantially different values of bj.

To intuitively understand why strong multicollinearity can significantly affect the least squares point estimates, consider the so-called picket fence display (a margin figure in the original text, showing pickets standing over the x1-x2 plane with y as the vertical axis). This figure depicts two independent variables (x1 and x2) exhibiting strong multicollinearity (note that as x1 increases, x2 increases). The heights of the pickets on the fence represent the y observations. If we assume that the model y = β0 + β1x1 + β2x2 + ε adequately describes these data, then calculating the least squares point estimates amounts to fitting a plane to the points on the top of the picket fence. Clearly, this plane would be quite unstable. That is, a slightly different height of one of the pickets (a slightly different y value) could cause the slant of the fitted plane (and the least squares point estimates that determine this slant) to change radically. It follows that when strong multicollinearity exists, sampling variation can result in least squares point estimates that differ substantially from the true values of the regression parameters. In fact, some of the least squares point estimates may have a sign (positive or negative) that differs from the sign of the true value of the parameter (you will see an example of this in the exercises). Therefore, when strong multicollinearity exists, it is dangerous to interpret the least squares point estimates individually.
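The picket-fence instability can be imitated numerically. In this illustrative sketch (synthetic data, not the textbook's), two predictors move almost in lockstep, and refitting the model to slightly different y samples makes the two slope estimates swing wildly, occasionally with wrong signs, even though their sum stays comparatively stable.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 25

# Two strongly multicollinear predictors: x2 tracks x1 almost exactly.
x1 = rng.uniform(0, 10, n)
x2 = x1 + rng.normal(0, 0.05, n)
X = sm.add_constant(np.column_stack([x1, x2]))
mean_y = 3.0 + 1.0 * x1 + 1.0 * x2  # true beta1 = beta2 = 1

# Slightly different "picket heights" (y samples) on each refit.
for trial in range(5):
    y = mean_y + rng.normal(0, 1.0, n)
    b = sm.OLS(y, X).fit().params
    print(f"trial {trial}: b1 = {b[1]:7.2f}, b2 = {b[2]:7.2f}, "
          f"b1 + b2 = {b[1] + b[2]:5.2f}")
```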

The most important problem caused by multicollinearity is that, even when multicollinearity is not severe, it can hinder our ability to use the t statistics and related p values to assess the importance of the independent variables. Recall that we can reject H0: βj = 0 in favour of Ha: βj ≠ 0 at level of significance α if and only if the absolute value of the corresponding t statistic is greater than t_{α/2} based on n − (k + 1) degrees of freedom or, equivalently, if and only if the related p value is less than α. Thus, the larger (in absolute value) the t statistic is and the smaller the p value is, the stronger is the evidence that we should reject H0: βj = 0 and that the independent variable xj is significant. When multicollinearity exists, the sizes of the t statistic and of the related p value measure the additional importance of the independent variable xj over the combined importance of the other independent variables in the regression model. Since two or more correlated independent variables contribute redundant information, multicollinearity often causes the t statistics obtained by relating a dependent variable to a set of correlated independent variables to be smaller (in absolute value) than the t statistics that would be obtained if separate regression analyses were run, where each separate regression analysis relates the dependent variable to a smaller set (for example, only one) of the correlated independent variables. Thus, multicollinearity can cause some of the correlated independent variables to appear less important, in terms of having small absolute t statistics and large p values, than they really are. Another way to understand this is to note that, since multicollinearity inflates σ_bj, it inflates the point estimate s_bj of σ_bj. Since t = bj / s_bj, an inflated value of s_bj can (depending on the size of bj) cause t to be small (and the related p value to be large). This would suggest that xj is not significant even though xj may actually be important.

For example, the MegaStat output discussed above shows that when we perform a regression analysis of the sales territory performance data using a model that relates y to all eight independent variables, several of the p values are quite large (the p value for WkLoad, for instance, is roughly 0.65). By contrast, recall from Figure 12.7 (page 430) that when we perform a regression analysis of the sales territory performance data using a model that relates y to the first five independent variables, Time, Adver, and Change have p values of 0.006, 0.002, and 0.030, respectively, and the p values for MktPoten and MktShare are smaller still. Note that Time (p value 0.006) seems highly significant and Change (p value 0.030) seems somewhat significant in the five-independent-variable model. However, when we consider the model that uses all eight independent variables, the p values for Time and Change increase substantially, making Time seem insignificant and Change only marginally significant. The reason that Time and Change seem more significant in the five-independent-variable model is that, since this model uses fewer variables, Time and Change contribute less overlapping information and thus have more additional importance in this model.

Comparing regression models on the basis of R², s, adjusted R², prediction interval length, and the C statistic

We have seen that when multicollinearity exists in a model, the p value associated with an independent variable in the model measures the additional importance of the variable over the combined importance of the other variables in the model. Therefore, it can be difficult to use the p values to determine which variables to retain in a model and which variables to remove from the model.
This implies that we need to evaluate more than the additional importance of each independent variable in a regression model. We also need to evaluate how well the independent variables work together to accurately describe, predict, and control the dependent variable. One way to do this is to determine whether the overall model gives a high R² and adjusted R², a small s, and short prediction intervals. It can be proven that adding any independent variable to a regression model, even an unimportant independent variable, will decrease the unexplained variation and increase the explained variation. Therefore, since the total variation Σ(yi − ȳ)² depends only on the observed y values and thus remains unchanged when we add an independent variable to a regression model, it follows that adding any independent variable to a regression model will increase

R² = (explained variation) / (total variation).
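A quick numeric check of this fact (an illustrative sketch on synthetic data): R² does not decrease even when the added variable is pure noise.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 25
x1 = rng.uniform(0, 10, n)
y = 2.0 + 0.8 * x1 + rng.normal(0, 1, n)
noise = rng.normal(size=n)  # an unimportant independent variable

r2_one = sm.OLS(y, sm.add_constant(x1)).fit().rsquared
r2_two = sm.OLS(y, sm.add_constant(np.column_stack([x1, noise]))).fit().rsquared

# Adding a variable can only leave R^2 unchanged or raise it.
print(f"R^2 with x1 only:        {r2_one:.4f}")
print(f"R^2 with x1 plus noise:  {r2_two:.4f}")
```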

This implies that R² cannot tell us (by decreasing) that adding an independent variable is undesirable. That is, although we wish to obtain a model with a large R², there are better criteria than R² that can be used to compare regression models. One better criterion is the standard error

s = sqrt( SSE / (n − (k + 1)) ).

When we add an independent variable to a regression model, the number of model parameters, k + 1, increases by one, and thus the number of degrees of freedom, n − (k + 1), decreases by one. If the decrease in n − (k + 1), which is used in the denominator to calculate s, is proportionally more than the decrease in the SSE (the unexplained variation) caused by adding the independent variable to the model, then s will increase. If s increases, this tells us that we should not add the independent variable to the model. To see one reason why, consider the formula for the prediction interval for y:

[ŷ ± t_{α/2} · s · sqrt(1 + distance value)].

Since adding an independent variable to a model decreases the number of degrees of freedom, adding the variable will increase the t_{α/2} point used to calculate the prediction interval. To understand this, look at any column of the t table in Appendix A and scan from the bottom of the column to the top: the t points increase as the degrees of freedom decrease. It can also be shown that adding any independent variable to a regression model will not decrease (and usually increases) the distance value. Therefore, since adding an independent variable increases t_{α/2} and does not decrease the distance value, if s increases, the length of the prediction interval for y will increase. This means the model will predict less accurately, and thus we should not add the independent variable.

On the other hand, if adding an independent variable to a regression model decreases s, the length of a prediction interval for y will decrease if and only if the decrease in s is enough to offset the increase in t_{α/2} and the (possible) increase in the distance value. Therefore, an independent variable should not be included in a final regression model unless it reduces s enough to reduce the length of the desired prediction interval for y. However, we must balance the length of the prediction interval (or, in general, the goodness of any criterion) against the difficulty and expense of using the model. For instance, predicting y requires knowing the corresponding values of the independent variables, so we must decide whether including an independent variable reduces s and the prediction interval lengths enough to offset the potential errors caused by possibly inaccurate determination of the values of the independent variables, or the possible expense of determining these values. If adding an independent variable provides prediction intervals that are only slightly shorter while making the model more difficult and/or more expensive to use, we might decide that including the variable is not desirable.

Since a key factor is the length of the prediction intervals provided by the model, one might wonder why we do not simply make direct comparisons of prediction interval lengths (without looking at s). It is useful to compare interval lengths, but these lengths depend on the distance value, which depends on how far the values of the independent variables for which we wish to predict are from the centre of the experimental region.
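For readers who want to reproduce this kind of calculation outside MegaStat or MINITAB, here is a sketch of computing a 95 percent prediction interval and its length with Python's statsmodels (synthetic data; the package choice is an assumption, not part of the textbook).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 25
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(0, 1, n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# 95% prediction interval for y at a new combination of predictor values.
x_new = np.array([[1.0, 5.0, 2.5]])  # constant term, x1, x2
frame = fit.get_prediction(x_new).summary_frame(alpha=0.05)
lo = frame["obs_ci_lower"].iloc[0]
hi = frame["obs_ci_upper"].iloc[0]
print(f"95% prediction interval: [{lo:.2f}, {hi:.2f}]  (length {hi - lo:.2f})")
```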
We often wish to compute prediction intervals for several different combinations of values of the independent variables (and thus for several different values of the distance value), so we would compute prediction intervals with slightly different lengths. However, the standard error s is a constant factor with respect to the lengths of these prediction intervals (as long as we are considering the same regression model). Thus, it is common practice to compare regression models on the basis of s (and s²). Finally, note that it can be shown that the standard error s decreases if and only if adjusted R² increases. It follows that, if we are comparing regression models, the model that gives the smallest s also gives the largest adjusted R².
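The following sketch checks this equivalence on synthetic data: across nested candidate models, the ranking produced by s and the ranking produced by adjusted R² agree.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 25
X_all = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X_all[:, 0] + rng.normal(0, 1, n)  # only the first predictor matters

# s = sqrt(SSE / (n - (k + 1))) falls exactly when adjusted R^2 rises.
for k in (1, 2, 3):
    fit = sm.OLS(y, sm.add_constant(X_all[:, :k])).fit()
    s = np.sqrt(fit.ssr / fit.df_resid)
    print(f"k = {k}: s = {s:.4f}, adjusted R^2 = {fit.rsquared_adj:.4f}")
```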

[Figure 12.4: MegaStat output of some of the best sales territory performance regression models: (a) the best single model of each size and (b) the best eight models. Each row indicates which of Time, MktPoten, Adver, MktShare, Change, Accts, WkLoad, and Rating are included and gives the model's s, adjusted R², R², Cp, and p value.]

Example: The Sales Territory Performance Case

Figure 12.4 gives MegaStat output resulting from calculating R², adjusted R², and s for all possible regression models based on all possible combinations of the eight independent variables in the sales territory performance situation (we will explain the values of Cp on the output after we complete this example). The first output gives the best single model of each size, and the second output gives the eight best models of any size, in terms of s and adjusted R². The output also gives the p values for the variables in each model. Examining the output, we see that the three models with the smallest values of s and the largest values of adjusted R² are as follows:

1. The six-variable model that contains Time, MktPoten, Adver, MktShare, Change, and Accts; we refer to this model as Model 1.

2. The five-variable model that contains Time, MktPoten, Adver, MktShare, and Change; we refer to this model as Model 2.

3. The seven-variable model that contains Time, MktPoten, Adver, MktShare, Change, Accts, and WkLoad; we refer to this model as Model 3.

To see that s can increase when we add an independent variable to a regression model, note that s increases when we add WkLoad to Model 1 to form Model 3. In this case, although it can be verified that adding WkLoad decreases the unexplained variation slightly (from approximately 3,297,000 to approximately 3,226,756), this decrease is not enough to offset the decrease in the denominator of

s² = SSE / (n − (k + 1)),

which decreases from 25 − 7 = 18 to 25 − 8 = 17. To see that prediction interval lengths might increase even though s decreases, consider adding Accts to Model 2 to form Model 1, which slightly decreases s. Consider a sales representative with a particular combination of values of Time, MktPoten, Adver, MktShare, Change, and Accts. The 95 percent prediction interval given by Model 2 for sales corresponding to this combination of values of the independent variables is approximately [3,234, 5,130], with a length of about 1,896. The 95 percent prediction interval given by Model 1 for such sales is approximately [3,194, 5,093], with a length of about 1,899. In other words, the slight decrease in s accomplished by adding Accts to Model 2 to form Model 1 is not enough to offset the increases in t_{α/2} and the distance value (which can be shown to increase), and thus the length of the prediction interval given by Model 1 increases. In addition, the extra independent variable Accts in Model 1 has a large p value. Therefore, we conclude that Model 2 is better than Model 1 and is, in fact, the best sales territory performance model (using only linear terms).

Another quantity that can be used to compare regression models is called the C statistic (also often called the Cp statistic). To show how to calculate the C statistic, suppose that we wish to choose an appropriate set of independent variables from p potential independent variables. We first calculate the mean square error, which we denote as s_p², for the model using all p potential independent variables. Then, if SSE denotes the unexplained variation for another particular model that has k independent variables, the C statistic for this model is

C = SSE / s_p² − [n − 2(k + 1)].

For example, consider the sales territory performance case. Substituting the mean square error for the model that uses all p = 8 independent variables and the SSE for the model that uses the first k = 5 independent variables (Model 2 in the previous example) into this formula, with n = 25, gives the C statistic for Model 2 that is shown in Figure 12.4.

Because the C statistic for a given model is a function of the model's SSE, and because we want the SSE to be small, we want C to be small. Although adding an unimportant independent variable to a regression model will decrease the SSE, adding such a variable can increase C. This can happen when the decrease in the SSE caused by the addition of the extra independent variable is not enough to offset the decrease in n − 2(k + 1) caused by the addition of the extra independent variable (which increases k by 1). It should be noted that, although adding an unimportant independent variable to a regression model can increase both s² and C, there is no exact relationship between s² and C. While we want C to be small, it can be shown from the theory behind the C statistic that we also wish to find a model for which the C statistic roughly equals k + 1, the number of parameters in the model. If a model has a C statistic substantially greater than k + 1, it can be shown that this model has substantial bias and is undesirable. Thus, although we want to find a model for which C is as small as possible, if C for such a model is substantially greater than k + 1, we may prefer to choose a different model for which C is slightly larger and more nearly equal to the number of parameters in that (different) model.
If a particular model has a small value of C, and C for this model is less than k + 1, then the model should be considered desirable. Finally, it should be noted that for the model that includes all p potential independent variables (and thus utilizes p + 1 parameters), it can be shown that C = p + 1.
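The C statistic is straightforward to compute from its definition. In the sketch below (synthetic data, illustrative variable ordering), the full model also reproduces the closing fact that C = p + 1 when all p potential variables are used.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n, p = 25, 4
X_full = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X_full[:, 0] + 1.5 * X_full[:, 1] + rng.normal(0, 1, n)

# s_p^2: the mean square error of the model using all p potential predictors.
s2_p = sm.OLS(y, sm.add_constant(X_full)).fit().mse_resid

# C = SSE / s_p^2 - [n - 2(k + 1)] for nested models with k predictors.
for k in range(1, p + 1):
    sse = sm.OLS(y, sm.add_constant(X_full[:, :k])).fit().ssr
    c = sse / s2_p - (n - 2 * (k + 1))
    print(f"k = {k}: C = {c:5.2f}  (k + 1 = {k + 1})")
```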

If we examine Figure 12.4, we see that Model 2 of the previous example has the smallest C statistic. Since its C statistic is less than k + 1 = 6, the model is not biased. Therefore, this model should be considered best with respect to the C statistic.

Stepwise regression and backward elimination

In some situations, it is useful to employ an iterative model selection procedure, where at each step a single independent variable is added to or deleted from a regression model, and a new regression model is evaluated. We discuss here two such procedures: stepwise regression and backward elimination.

There are slight variations in the way different computer packages carry out stepwise regression. Assuming that y is the dependent variable and x1, x2, ..., xp are the p potential independent variables, we explain how most of the computer packages perform stepwise regression. Stepwise regression uses t statistics (and related p values) to determine the significance of the independent variables in various regression models. In this context, we say that the t statistic indicates that the independent variable xj is significant at the α level if and only if the related p value is less than α. Stepwise regression is then carried out as follows.

Choice of α_entry and α_stay: Before beginning the stepwise procedure, we choose a value of α_entry, which we call the probability of a Type I error related to entering an independent variable into the regression model. We also choose a value of α_stay, which we call the probability of a Type I error related to retaining an independent variable that was previously entered into the model. Although there are many considerations in choosing these values, it is common practice to set both α_entry and α_stay equal to 0.05 or 0.10.

Step 1: The stepwise procedure considers the p possible one-independent-variable regression models of the form y = β0 + β1xj + ε. Each different model includes a different potential independent variable. For each model, the t statistic (and p value) related to testing H0: β1 = 0 versus Ha: β1 ≠ 0 is calculated. Denoting the independent variable giving the largest absolute value of the t statistic (and the smallest p value) by the symbol x[1], we consider the model y = β0 + β1x[1] + ε. If the t statistic does not indicate that x[1] is significant at the α_entry level, then the stepwise procedure terminates by concluding that none of the independent variables are significant at the α_entry level. If the t statistic indicates that x[1] is significant at the α_entry level, then x[1] is retained for use in Step 2.

Step 2: The stepwise procedure considers the p − 1 possible two-independent-variable regression models of the form y = β0 + β1x[1] + β2xj + ε. Each different model includes x[1], the independent variable chosen in Step 1, and a different potential independent variable chosen from the remaining p − 1 independent variables that were not chosen in Step 1. For each model, the t statistic (and p value) related to testing H0: β2 = 0 versus Ha: β2 ≠ 0 is calculated. Denoting the independent variable giving the largest absolute value of the t statistic (and the smallest p value) by the symbol x[2], we consider the model y = β0 + β1x[1] + β2x[2] + ε.
If the t statistic indicates that x[2] is significant at the α_entry level, then x[2] is retained in this model, and the stepwise procedure checks whether x[1] should be allowed to stay in the model. This check should be made because multicollinearity will probably cause the t statistic related to the importance of x[1] to change when x[2] is added to the model. If the t statistic does not indicate that x[1] is significant at the α_stay level, then the stepwise procedure returns to the beginning of Step 2.

Starting with a new one-independent-variable model that uses the new significant independent variable x[2], the stepwise procedure attempts to find a new two-independent-variable model y = β0 + β1x[2] + β2xj + ε. If, instead, the t statistic indicates that x[1] is significant at the α_stay level in the model y = β0 + β1x[1] + β2x[2] + ε, then both of the independent variables x[1] and x[2] are retained for use in further steps.

Further steps: The stepwise procedure continues by adding independent variables one at a time to the model. At each step, an independent variable is added to the model if it has the largest (in absolute value) t statistic of the independent variables not in the model and if its t statistic indicates that it is significant at the α_entry level. After adding an independent variable, the stepwise procedure checks all the independent variables already included in the model and removes an independent variable if it has the smallest (in absolute value) t statistic of the independent variables already included in the model and if its t statistic indicates that it is not significant at the α_stay level. This removal procedure is continued sequentially, and only after the necessary removals are made does the stepwise procedure attempt to add another independent variable to the model. The stepwise procedure terminates when all the independent variables not in the model are insignificant at the α_entry level or when the variable to be added to the model is the one just removed from it.

For example, again consider the sales territory performance data, and let x1, x2, x3, x4, x5, x6, x7, and x8 be the eight potential independent variables employed in the stepwise procedure. Figure 12.46(a) gives the MegaStat output of the stepwise regression employing these independent variables, where both α_entry and α_stay have been set equal to 0.10. The stepwise procedure

1. adds Accts (x6) on the first step;

2. adds Adver (x3) and retains Accts on the second step;

3. adds MktPoten (x2) and retains Accts and Adver on the third step;

4. adds MktShare (x4) and retains Accts, Adver, and MktPoten on the fourth step.

The procedure terminates after step 4, when no more independent variables can be added. Therefore, the stepwise procedure arrives at the model that utilizes x2, x3, x4, and x6.

To carry out backward elimination, we perform a regression analysis using a regression model containing all of the p potential independent variables. Then the independent variable having the smallest (in absolute value) t statistic is chosen. If the t statistic indicates that this independent variable is significant at the α_stay level (α_stay is chosen prior to the beginning of the procedure), then the procedure terminates by choosing the regression model containing all p independent variables. If this independent variable is not significant at the α_stay level, then it is removed from the model, and a regression analysis is performed using a regression model containing all of the remaining independent variables. The procedure continues by removing independent variables one at a time from the model. At each step, an independent variable is removed from the model if it has the smallest (in absolute value) t statistic of the independent variables remaining in the model and if it is not significant at the α_stay level. The procedure terminates when no independent variable remaining in the model can be removed.
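Here is a compact sketch of the stepwise logic just described, written in Python with statsmodels. The data are synthetic and the helper function is illustrative; commercial packages add refinements (for example, a guard against the add-remove cycle mentioned above) that are omitted here for brevity.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def stepwise(X, y, alpha_entry=0.10, alpha_stay=0.10):
    """Forward stepwise selection with a removal check, using t-test p values."""
    selected = []
    while True:
        # Entry step: among variables not in the model, find the smallest p value.
        entry_p = {}
        for c in (c for c in X.columns if c not in selected):
            fit = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            entry_p[c] = fit.pvalues[c]
        if not entry_p:
            break
        best = min(entry_p, key=entry_p.get)
        if entry_p[best] >= alpha_entry:
            break  # nothing left is significant at the alpha_entry level
        selected.append(best)
        # Removal step: drop entered variables that fail the alpha_stay test.
        while selected:
            fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
            worst = fit.pvalues[selected].idxmax()  # largest p = smallest |t|
            if fit.pvalues[worst] <= alpha_stay:
                break
            selected.remove(worst)
    return selected

# Illustrative synthetic data: only x1 and x3 actually drive y.
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(25, 5)), columns=[f"x{i}" for i in range(1, 6)])
y = 2.0 * X["x1"] + 1.5 * X["x3"] + rng.normal(0, 1.0, 25)
print(stepwise(X, y))  # typically ['x1', 'x3']
```

The default α_entry = α_stay = 0.10 mirrors the settings used in the MegaStat run of Figure 12.46(a).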
Backward elimination is generally considered a reasonable procedure, especially for analysts who like to start with all possible independent variables in the model so that they will not miss anything important. To illustrate backward elimination, we first note that choosing the independent variable that has the smallest (in absolute value) t statistic in a model is equivalent to choosing the independent variable that has the largest p value in the model. With this in mind, Figure 12.46(b) gives the MINITAB output of a backward elimination of the sales territory performance data. Here the backward elimination uses α_stay = 0.05, begins with the model using all eight independent variables, and removes (in order) Rating (x8), then WkLoad (x7), then Accts (x6), and finally Change (x5).
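Before reading off the final model from Figure 12.46(b), here is a matching sketch of backward elimination (again synthetic and illustrative; the analysis in the text was done in MINITAB):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X, y, alpha_stay=0.05):
    """Repeatedly drop the largest-p-value predictor until all pass alpha_stay."""
    remaining = list(X.columns)
    while remaining:
        fit = sm.OLS(y, sm.add_constant(X[remaining])).fit()
        worst = fit.pvalues[remaining].idxmax()  # largest p = smallest |t|
        if fit.pvalues[worst] <= alpha_stay:
            break  # every remaining variable is significant at alpha_stay
        remaining.remove(worst)
    return remaining

# Illustrative synthetic data: only x2 and x4 actually drive y.
rng = np.random.default_rng(4)
X = pd.DataFrame(rng.normal(size=(25, 5)), columns=[f"x{i}" for i in range(1, 6)])
y = 2.0 * X["x2"] + 1.5 * X["x4"] + rng.normal(0, 1.0, 25)
print(backward_eliminate(X, y))  # typically ['x2', 'x4']
```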

[Figure 12.46: (a) MegaStat output of the stepwise regression (α_entry = α_stay = 0.10) for the sales territory performance problem, displaying the best model of each size for the 25 observations, with Sales as the dependent variable; for each model the output gives the p values for the coefficients of Time (x1), MktPoten (x2), Adver (x3), MktShare (x4), Change (x5), Accts (x6), WkLoad (x7), and Rating (x8), together with s, adjusted R², R², Cp, and the model p value. (b) MINITAB output of the backward elimination (α_stay = 0.05); the response is Sales on 8 predictors, with N = 25, and each step reports the constant and the coefficient, t value, and p value of each remaining predictor, along with S, R-Sq, R-Sq(adj), and Mallows C-p.]

The procedure terminates when no independent variable remaining in the model can be removed (that is, when no remaining independent variable has a related p value greater than α_stay = 0.05) and arrives at a model that uses Time (x1), MktPoten (x2), Adver (x3), and MktShare (x4). This model has an s of approximately 464, and it is inferior to the model arrived at by stepwise regression, which has a smaller s and a larger R² (see Figure 12.46(a)). However, the backward elimination process allows us to find a model that is better than either of these. If we look at the model considered by backward elimination after Rating (x8), WkLoad (x7), and Accts (x6) have been removed, we have the model using x1, x2, x3, x4, and x5. This model has an s of approximately 430, and in the example above we reasoned that this model is perhaps the best sales territory performance model. Interestingly, this is the model that backward elimination would arrive at if we were to set α_stay equal to 0.10 rather than 0.05; note that this model has no p values greater than 0.10.

The sales territory performance example brings home two important points. First, the models obtained by backward elimination and by stepwise regression depend on the choices of α_entry and α_stay (whichever is appropriate). Second, it is best not to think of these methods as automatic model-building procedures. Rather, they should be regarded as processes that allow us to find and evaluate a variety of model choices.

Exercises for Section 12.12

CONCEPTS

12.9 What is multicollinearity? What problems can be caused by multicollinearity? Discuss how to compare regression models.

METHODS AND APPLICATIONS

THE HOSPITAL LABOUR NEEDS CASE

Table 12.5 (page 424) presents data concerning the need for labour in 16 hospitals. This table gives values of the dependent variable Hours (monthly labour hours) and of the independent variables Xray (monthly X-ray exposures), BedDays (monthly occupied bed days; a hospital has one occupied bed day if one bed is occupied for an entire day), and Length (average length of patients' stay, in days). The data in Table 12.5 are part of a larger data set. The complete data set includes two additional independent variables, Load (average daily patient load) and Pop (eligible population in the area, in thousands), whose values are given in an accompanying table (the individual data values are not reproduced in this excerpt). Figure 12.47 gives MegaStat output of multicollinearity analysis and model building for the complete hospital labour needs data set.

a. Find the three largest simple correlation coefficients between the independent variables in Figure 12.47(a). Also find the three largest variance inflation factors in Figure 12.47(b).

b. Based on your answers to part a, which independent variables are most strongly involved in multicollinearity?

c. Do any least squares point estimates have a sign (positive or negative) that is different from what we would intuitively expect (another indication of multicollinearity)?

d. The p value associated with F(model) for the model in Figure 12.47(b) is very small. In general, if the p value associated with F(model) is much smaller than any of the p values associated with the independent variables, this is another indication of multicollinearity. Is this true in this situation?

e. Figure 12.47(c) and (d) indicate that the two best hospital labour needs models are the model using Xray, BedDays, Pop, and Length, which we will call Model 1, and the model using Xray, BedDays, and Length, which we will call Model 2.
Which model gives the smallest value of s and the largest value of adjusted R²? Which model gives the smallest value of C? Consider a hospital with a given combination of values of Xray, BedDays, Pop, and Length (for instance, BedDays = 14,077.88 and Pop = 329.7). The 95 percent prediction intervals given by Models 1 and 2 for labour hours corresponding to this combination of values of the independent variables are, respectively, [14,888.43, 16,861.30] and [14,906.24, 16,886.26]. Which model gives the shorter prediction interval?

[Figure 12.47: MegaStat output of multicollinearity analysis and model building for the hospital labour needs data: (a) a correlation matrix for Load, Xray, BedDays, Pop, Length, and Hours, with the sample size and the 0.05 and 0.01 two-tail critical values; (b) the regression output for the five-variable model, giving for the intercept and each of Xray (x1), BedDays (x2), Length (x3), Load (x4), and Pop (x5) the coefficient, standard error, t statistic (df = 10), and p value, plus each variable's VIF and the mean VIF; (c) the best single model of each size; and (d) the best five models, each listed with its s, adjusted R², R², Cp, and p value.]

f. Consider the figure below showing the stepwise regression and backward elimination of the hospital labour needs data. Which model is chosen by both stepwise regression and backward elimination? Overall, which model seems best?

Market Planning, a marketing research firm, has obtained the prescription sales data shown below for n = 20 independent pharmacies.¹ In this table, y is the average weekly prescription sales over the past year (in units of $1,000), x1 is the floor space (in square feet), x2 is the percentage of floor space allocated to the prescription department, x3 is the number of parking spaces available to the store, x4 is the weekly per capita income for the surrounding community (in units of $100), and x5 is a dummy variable that equals 1 if the pharmacy is located in a shopping centre and 0 otherwise. Use the MegaStat output of the single best model of each size (in the final figure below) to discuss why the model using FloorSpace and Presc.Pct might be the best model describing prescription sales. The least squares point estimates b0, b1, and b2 of the parameters of this model can be calculated from the data. Discuss what b1 and b2 say about obtaining high prescription sales.

¹ This problem is taken from an example in An Introduction to Statistical Methods and Data Analysis, 2nd ed., by L. Ott (Boston: PWS-KENT Publishing Company, 1987). Used with permission.

[Figure: MegaStat output of a stepwise regression (α_entry = α_stay = 0.10; 16 observations; Hours(y) is the dependent variable) displaying the best model of each size, with p values for the coefficients of Xray(x1), BedDays(x2), Length(x3), Load(x4), and Pop(x5) along with s, adjusted R², R², Cp, and the model p value, and MINITAB output of a backward elimination (α_stay = 0.05) of the hospital labour needs data, reporting for each step the constant and each remaining predictor's coefficient, t value, and p value, together with S and R-Sq.]

[Table: Prescription Sales Data for the 20 pharmacies, giving for each pharmacy the Sales (y), Floor Space (x1), Prescription Percentage (x2), Parking (x3), Income (x4), and Shopping Centre (x5) values; the individual data values are not reproduced in this excerpt. Source: From An Introduction to Statistical Methods and Data Analysis, 2nd ed., by L. Ott. Reprinted with permission of Brooks/Cole, an imprint of the Wadsworth Group, a division of Thomson Learning.]

[Figure: MegaStat output of the single best model of each size for the prescription sales data, giving for each size the p values for the coefficients of FloorSpace, Presc.Pct, Parking, Income, and ShopCntr, together with s, adjusted R², R², Cp, and the model p value.]


More information

Steps to take to do the descriptive part of regression analysis:

Steps to take to do the descriptive part of regression analysis: STA 2023 Simple Linear Regression: Least Squares Model Steps to take to do the descriptive part of regression analysis: A. Plot the data on a scatter plot. Describe patterns: 1. Is there a strong, moderate,

More information

CHAPTER 5 LINEAR REGRESSION AND CORRELATION

CHAPTER 5 LINEAR REGRESSION AND CORRELATION CHAPTER 5 LINEAR REGRESSION AND CORRELATION Expected Outcomes Able to use simple and multiple linear regression analysis, and correlation. Able to conduct hypothesis testing for simple and multiple linear

More information

Ratio of Polynomials Fit Many Variables

Ratio of Polynomials Fit Many Variables Chapter 376 Ratio of Polynomials Fit Many Variables Introduction This program fits a model that is the ratio of two polynomials of up to fifth order. Instead of a single independent variable, these polynomials

More information

Multiple Linear Regression

Multiple Linear Regression 1. Purpose To Model Dependent Variables Multiple Linear Regression Purpose of multiple and simple regression is the same, to model a DV using one or more predictors (IVs) and perhaps also to obtain a prediction

More information

Introduction to Regression

Introduction to Regression Regression Introduction to Regression If two variables covary, we should be able to predict the value of one variable from another. Correlation only tells us how much two variables covary. In regression,

More information

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima Applied Statistics Lecturer: Serena Arima Hypothesis testing for the linear model Under the Gauss-Markov assumptions and the normality of the error terms, we saw that β N(β, σ 2 (X X ) 1 ) and hence s

More information

PubH 7405: REGRESSION ANALYSIS. MLR: INFERENCES, Part I

PubH 7405: REGRESSION ANALYSIS. MLR: INFERENCES, Part I PubH 7405: REGRESSION ANALYSIS MLR: INFERENCES, Part I TESTING HYPOTHESES Once we have fitted a multiple linear regression model and obtained estimates for the various parameters of interest, we want to

More information

Sections 7.1, 7.2, 7.4, & 7.6

Sections 7.1, 7.2, 7.4, & 7.6 Sections 7.1, 7.2, 7.4, & 7.6 Adapted from Timothy Hanson Department of Statistics, University of South Carolina Stat 704: Data Analysis I 1 / 25 Chapter 7 example: Body fat n = 20 healthy females 25 34

More information

Introducing Generalized Linear Models: Logistic Regression

Introducing Generalized Linear Models: Logistic Regression Ron Heck, Summer 2012 Seminars 1 Multilevel Regression Models and Their Applications Seminar Introducing Generalized Linear Models: Logistic Regression The generalized linear model (GLM) represents and

More information

A particularly nasty aspect of this is that it is often difficult or impossible to tell if a model fails to satisfy these steps.

A particularly nasty aspect of this is that it is often difficult or impossible to tell if a model fails to satisfy these steps. ECON 497: Lecture 6 Page 1 of 1 Metropolitan State University ECON 497: Research and Forecasting Lecture Notes 6 Specification: Choosing the Independent Variables Studenmund Chapter 6 Before we start,

More information

Sociology 593 Exam 1 February 17, 1995

Sociology 593 Exam 1 February 17, 1995 Sociology 593 Exam 1 February 17, 1995 I. True-False. (25 points) Indicate whether the following statements are true or false. If false, briefly explain why. 1. A researcher regressed Y on. When he plotted

More information

Correlation Analysis

Correlation Analysis Simple Regression Correlation Analysis Correlation analysis is used to measure strength of the association (linear relationship) between two variables Correlation is only concerned with strength of the

More information

2 Prediction and Analysis of Variance

2 Prediction and Analysis of Variance 2 Prediction and Analysis of Variance Reading: Chapters and 2 of Kennedy A Guide to Econometrics Achen, Christopher H. Interpreting and Using Regression (London: Sage, 982). Chapter 4 of Andy Field, Discovering

More information

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011) Ron Heck, Fall 2011 1 EDEP 768E: Seminar in Multilevel Modeling rev. January 3, 2012 (see footnote) Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October

More information

Regression-Discontinuity Analysis

Regression-Discontinuity Analysis Page 1 of 11 Home» Analysis» Inferential Statistics» Regression-Discontinuity Analysis Analysis Requirements The basic RD Design is a two-group pretestposttest model as indicated in the design notation.

More information

MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010

MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010 MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010 Part 1 of this document can be found at http://www.uvm.edu/~dhowell/methods/supplements/mixed Models for Repeated Measures1.pdf

More information

INFERENCE FOR REGRESSION

INFERENCE FOR REGRESSION CHAPTER 3 INFERENCE FOR REGRESSION OVERVIEW In Chapter 5 of the textbook, we first encountered regression. The assumptions that describe the regression model we use in this chapter are the following. We

More information

Analysis of Covariance. The following example illustrates a case where the covariate is affected by the treatments.

Analysis of Covariance. The following example illustrates a case where the covariate is affected by the treatments. Analysis of Covariance In some experiments, the experimental units (subjects) are nonhomogeneous or there is variation in the experimental conditions that are not due to the treatments. For example, a

More information

Multiple Linear Regression. Chapter 12

Multiple Linear Regression. Chapter 12 13 Multiple Linear Regression Chapter 12 Multiple Regression Analysis Definition The multiple regression model equation is Y = b 0 + b 1 x 1 + b 2 x 2 +... + b p x p + ε where E(ε) = 0 and Var(ε) = s 2.

More information

Inference with Simple Regression

Inference with Simple Regression 1 Introduction Inference with Simple Regression Alan B. Gelder 06E:071, The University of Iowa 1 Moving to infinite means: In this course we have seen one-mean problems, twomean problems, and problems

More information

MULTIPLE LINEAR REGRESSION IN MINITAB

MULTIPLE LINEAR REGRESSION IN MINITAB MULTIPLE LINEAR REGRESSION IN MINITAB This document shows a complicated Minitab multiple regression. It includes descriptions of the Minitab commands, and the Minitab output is heavily annotated. Comments

More information

9. Linear Regression and Correlation

9. Linear Regression and Correlation 9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,

More information

Time series and Forecasting

Time series and Forecasting Chapter 2 Time series and Forecasting 2.1 Introduction Data are frequently recorded at regular time intervals, for instance, daily stock market indices, the monthly rate of inflation or annual profit figures.

More information

CHAPTER 5 FUNCTIONAL FORMS OF REGRESSION MODELS

CHAPTER 5 FUNCTIONAL FORMS OF REGRESSION MODELS CHAPTER 5 FUNCTIONAL FORMS OF REGRESSION MODELS QUESTIONS 5.1. (a) In a log-log model the dependent and all explanatory variables are in the logarithmic form. (b) In the log-lin model the dependent variable

More information

Multiple Regression Part I STAT315, 19-20/3/2014

Multiple Regression Part I STAT315, 19-20/3/2014 Multiple Regression Part I STAT315, 19-20/3/2014 Regression problem Predictors/independent variables/features Or: Error which can never be eliminated. Our task is to estimate the regression function f.

More information

Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras

Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras Lecture - 39 Regression Analysis Hello and welcome to the course on Biostatistics

More information

Section 11: Quantitative analyses: Linear relationships among variables

Section 11: Quantitative analyses: Linear relationships among variables Section 11: Quantitative analyses: Linear relationships among variables Australian Catholic University 214 ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced or

More information

Applied Regression Modeling: A Business Approach Chapter 3: Multiple Linear Regression Sections

Applied Regression Modeling: A Business Approach Chapter 3: Multiple Linear Regression Sections Applied Regression Modeling: A Business Approach Chapter 3: Multiple Linear Regression Sections 3.1 3.3.2 by Iain Pardoe 3.1 Probability model for (X 1, X 2,...) and Y 2 Multiple linear regression................................................

More information

(ii) Scan your answer sheets INTO ONE FILE only, and submit it in the drop-box.

(ii) Scan your answer sheets INTO ONE FILE only, and submit it in the drop-box. FINAL EXAM ** Two different ways to submit your answer sheet (i) Use MS-Word and place it in a drop-box. (ii) Scan your answer sheets INTO ONE FILE only, and submit it in the drop-box. Deadline: December

More information

Interaction effects for continuous predictors in regression modeling

Interaction effects for continuous predictors in regression modeling Interaction effects for continuous predictors in regression modeling Testing for interactions The linear regression model is undoubtedly the most commonly-used statistical model, and has the advantage

More information

Linear model selection and regularization

Linear model selection and regularization Linear model selection and regularization Problems with linear regression with least square 1. Prediction Accuracy: linear regression has low bias but suffer from high variance, especially when n p. It

More information

Chapter Learning Objectives. Regression Analysis. Correlation. Simple Linear Regression. Chapter 12. Simple Linear Regression

Chapter Learning Objectives. Regression Analysis. Correlation. Simple Linear Regression. Chapter 12. Simple Linear Regression Chapter 12 12-1 North Seattle Community College BUS21 Business Statistics Chapter 12 Learning Objectives In this chapter, you learn:! How to use regression analysis to predict the value of a dependent

More information

STA121: Applied Regression Analysis

STA121: Applied Regression Analysis STA121: Applied Regression Analysis Linear Regression Analysis - Chapters 3 and 4 in Dielman Artin Department of Statistical Science September 15, 2009 Outline 1 Simple Linear Regression Analysis 2 Using

More information

Regression Models. Chapter 4. Introduction. Introduction. Introduction

Regression Models. Chapter 4. Introduction. Introduction. Introduction Chapter 4 Regression Models Quantitative Analysis for Management, Tenth Edition, by Render, Stair, and Hanna 008 Prentice-Hall, Inc. Introduction Regression analysis is a very valuable tool for a manager

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

Correlation and Regression

Correlation and Regression Correlation and Regression Dr. Bob Gee Dean Scott Bonney Professor William G. Journigan American Meridian University 1 Learning Objectives Upon successful completion of this module, the student should

More information

B. Weaver (24-Mar-2005) Multiple Regression Chapter 5: Multiple Regression Y ) (5.1) Deviation score = (Y i

B. Weaver (24-Mar-2005) Multiple Regression Chapter 5: Multiple Regression Y ) (5.1) Deviation score = (Y i B. Weaver (24-Mar-2005) Multiple Regression... 1 Chapter 5: Multiple Regression 5.1 Partial and semi-partial correlation Before starting on multiple regression per se, we need to consider the concepts

More information

Chapter 26 Multiple Regression, Logistic Regression, and Indicator Variables

Chapter 26 Multiple Regression, Logistic Regression, and Indicator Variables Chapter 26 Multiple Regression, Logistic Regression, and Indicator Variables 26.1 S 4 /IEE Application Examples: Multiple Regression An S 4 /IEE project was created to improve the 30,000-footlevel metric

More information

Multiple Regression Examples

Multiple Regression Examples Multiple Regression Examples Example: Tree data. we have seen that a simple linear regression of usable volume on diameter at chest height is not suitable, but that a quadratic model y = β 0 + β 1 x +

More information

Inferences for Regression

Inferences for Regression Inferences for Regression An Example: Body Fat and Waist Size Looking at the relationship between % body fat and waist size (in inches). Here is a scatterplot of our data set: Remembering Regression In

More information