Discussion Sheet 29.7.9 Qualitative Variables

We have devoted most of our attention in multiple regression (MR) to quantitative, or numerical, variables. MR models can become more useful and more complex when we consider qualitative variables: variables that represent a category, such as male/female. These variables can be numerically coded with 0 and 1 and entered into an MR model just as we would any other variable. This allows for more powerful models.

1. Consider the following data set:

x    Category    y
1    0            2
1    1            5
2    0           -1
2    1            8
3    0           -4
3    1           11
4    0           -7
4    1           14

Here is the scatterplot of the standardized residuals versus y for using x to predict y:

[Scatterplot: Regression Standardized Residual versus Y; Dependent Variable: Y]

Do the residuals look random to you?

2. Here is the ANOVA table for this model:

Model         Sum of Squares    df    Mean Square    F       Sig.
Regression        .000           1       .000        .000    1.000
Residual       378.000           6     63.000
Total          378.000           7

Does this model look like x is a significant help in predicting y? Is the model significant at the $\alpha = .05$ level?

3. Here is the coefficient information for this model:

            Unstandardized Coefficients    Standardized Coefficients
Model       B         Std. Error           Beta                         t       Sig.
Constant    3.500     6.874                                             .509    .629
X            .000     2.510                .000                         .000    1.000

If you were to perform hypothesis tests of $H_0: \beta_i = 0$, $i = 0, 1$, which of the coefficients would end up as significantly different from 0, indicating that the variables are useful to the model?
4. Here is the ANOVA table for an MR model using x and Category to predict y:

ANOVA
Model         Sum of Squares    df    Mean Square    F        Sig.
Regression     288.000           2    144.000        8.000    .028
Residual        90.000           5     18.000
Total          378.000           7

Would you say this model is significant at the $\alpha = .05$ level? Calculate its coefficient of determination. Does this seem large enough to indicate the model is helpful?

5. Here is a plot of the residuals versus y for the model above:

[Scatterplot: Regression Standardized Residual versus Y; Dependent Variable: Y]

Do the residuals seem random to you? What, if any, systematic pattern do they suggest?

6. Here is the information on the coefficients for the above model:

Coefficients
            Unstandardized Coefficients    Standardized Coefficients
Model       B          Std. Error          Beta                         t        Sig.
Constant    -2.500     3.969                                            -.630    .556
X             .000     1.342               .000                          .000    1.000
CATEGORY    12.000     3.000               .873                         4.000    .010

If you were to perform hypothesis tests of $H_0: \beta_i = 0$, $i = 0, 1, 2$, which of the coefficients would end up as significantly different from 0, indicating that the variables are useful to the model?
Qualitative Variables

There are times when we have categorical information that we would like to include in a regression model. These may be categories that imply no order, such as Male/Female, or they may be categories that have an implied order, such as Freshmen/Sophomores/Juniors/Seniors. This kind of information can be incorporated into an MR model by using qualitative, or dummy, variables. Each such variable represents one category and equals 1 for an observation in that category and 0 otherwise. For example,

$x = \begin{cases} 1 & \text{if female} \\ 0 & \text{if male} \end{cases}$

is a way of representing male/female information in a regression model. Since the Freshmen/Sophomore etc. information represents ordered categories, we could use a single variable with values 1, 2, 3, and 4 to capture this order. This is only done if the categories have some natural, theoretically significant order.

In general, for k categories with no implied order we use $k - 1$ variables as follows:

$x_i = \begin{cases} 1 & \text{if the observation is in category } i \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, \ldots, k - 1$

We would thus have a string of $k - 1$ 0s and 1s to indicate which category an observation fell into. What about an observation that fell into category k? It would not need its own variable, since it could be represented by all 0s on the $k - 1$ qualitative variables, indicating that it was not in any of the other categories.

7. Category in the data above seems to be such a qualitative variable. Comparing the model with it and without it, what does it seem to do for the model?
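The $k - 1$ coding rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `dummy_code` and the four region labels are hypothetical, not part of the sheet, and the choice of the last category as the all-0s baseline is an assumption made for the example.

```python
def dummy_code(label, categories):
    """Return the k - 1 indicator values (0 or 1) for one observation.

    Assumption for this sketch: the LAST entry of `categories` is the
    baseline. It gets no variable of its own and is coded as all 0s.
    """
    if label not in categories:
        raise ValueError(f"unknown category: {label!r}")
    # One indicator per non-baseline category: 1 if the label matches, else 0.
    return [1 if label == c else 0 for c in categories[:-1]]


# Hypothetical unordered categories, used only for illustration.
regions = ["North", "South", "East", "West"]

for region in regions:
    print(region, dummy_code(region, regions))
```

With k = 4 categories each observation gets a string of three 0s and 1s, and the baseline category ("West" here) is the one coded with all 0s.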
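Qualitative Variables in Regression Models

A single qualitative variable

$x_1 = \begin{cases} 1 & \text{if in Category 1} \\ 0 & \text{otherwise} \end{cases}$

in a regression model with no other variables would make the difference between a model $y = \beta_0 + \varepsilon$ and another model $y = \beta_0 + \beta_1 x_1 + \varepsilon$. If we found the prediction equations for the two models, they would be $\hat{y} = \hat{\beta}_0$ and $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$. Since $x_1$ is either 0 or 1, when $x_1 = 0$ in the second model we would have

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 = \hat{\beta}_0 + \hat{\beta}_1 (0) = \hat{\beta}_0$.

When $x_1 = 1$ in the second model,

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 = \hat{\beta}_0 + \hat{\beta}_1 (1) = \hat{\beta}_0 + \hat{\beta}_1$.

Since $\hat{\beta}_0$ and $\hat{\beta}_1$ are constants, the difference between $x_1 = 0$ and $x_1 = 1$ is that there are two horizontal lines. When $x_1 = 0$ we have the line $\hat{y} = \hat{\beta}_0$. When $x_1 = 1$ we have the line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1$, a different constant. The MR model then becomes that of two parallel horizontal lines.

Suppose we had the same qualitative variable and another variable, $x_2$, that was quantitative. Then the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ with prediction equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$ would, when $x_1 = 0$, represent

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 (0) + \hat{\beta}_2 x_2 = \hat{\beta}_0 + \hat{\beta}_2 x_2$

and, when $x_1 = 1$, would represent

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 (1) + \hat{\beta}_2 x_2 = (\hat{\beta}_0 + \hat{\beta}_1) + \hat{\beta}_2 x_2$.

That is, depending on whether $x_1 = 0$ ($\hat{y} = \hat{\beta}_0 + \hat{\beta}_2 x_2$) or $x_1 = 1$ ($\hat{y} = (\hat{\beta}_0 + \hat{\beta}_1) + \hat{\beta}_2 x_2$), the model would represent two parallel non-horizontal lines (assuming $\hat{\beta}_2 \neq 0$) with different intercepts (assuming $\hat{\beta}_1 \neq 0$).

Suppose we had the same qualitative variable and another variable, $x_2$, that was quantitative, and we also included their product. Then the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$ with prediction equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1 x_2$ would, when $x_1 = 0$, represent

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 (0) + \hat{\beta}_2 x_2 + \hat{\beta}_3 (0) x_2 = \hat{\beta}_0 + \hat{\beta}_2 x_2$

and, when $x_1 = 1$, would represent

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 (1) + \hat{\beta}_2 x_2 + \hat{\beta}_3 (1) x_2 = (\hat{\beta}_0 + \hat{\beta}_1) + (\hat{\beta}_2 + \hat{\beta}_3) x_2$.

That is, depending on whether $x_1 = 0$ ($\hat{y} = \hat{\beta}_0 + \hat{\beta}_2 x_2$) or $x_1 = 1$ ($\hat{y} = (\hat{\beta}_0 + \hat{\beta}_1) + (\hat{\beta}_2 + \hat{\beta}_3) x_2$), the model would represent two non-parallel lines (assuming $\hat{\beta}_3 \neq 0$) with different intercepts (assuming $\hat{\beta}_1 \neq 0$).

Higher Order and Interaction Terms

This process can be continued for quadratic terms and beyond.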
Terms like $x_1 x_2$ are called interaction terms. When one of the variables is a qualitative variable, it acts as a toggle switch that turns on and off differences in the intercept, the slope, or both (or even more with higher order terms).

8. What do you think would happen to the data above if we used a model of $y = \beta_0 + \beta_1 \text{Category} + \beta_2 x + \beta_3 x \cdot \text{Category} + \varepsilon$?

9. Here is the ANOVA table from such a model:

Model         Sum of Squares    df    Mean Square    F    Sig.
Regression     378.000           3    126.000        .    .
Residual          .000           4       .000
Total          378.000           7

Does it seem that including the interaction term was helpful in improving the model? What is the new coefficient of determination? Is the model significant at the $\alpha = .05$ level? (Careful!)
10. Here is the coefficient information for this new model:

                Unstandardized Coefficients    Standardized Coefficients
Model           B          Std. Error          Beta                         t    Sig.
Constant         5.000     .000                                             .    .
X               -3.000     .000                -.488                        .    .
CATEGORY        -3.000     .000                -.218                        .    .
x * category     6.000     .000                1.291                        .    .

Which of the variables is significant in this model? (Careful again!)

11. There is no plot of residuals since all residuals were 0; that is, the model fits extremely well. Judging from the model equation, what would be the graph of the model that was just fitted to these data?

Suggested Homework:.,.4
Solutions to be Posted:.,.4
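The toggle-switch behavior of a 0/1 variable in a model with an interaction term, as described in this sheet, can be sketched as follows. The function name `line_for_category` and the coefficient values are hypothetical and chosen only for illustration; they are not estimates from the data set in this sheet.

```python
def line_for_category(b0, b1, b2, b3, x1):
    """Return (intercept, slope) of y-hat as a function of x2 for a given x1.

    Model (notation as in the sheet):
        y-hat = b0 + b1*x1 + b2*x2 + b3*x1*x2
    where x1 is a 0/1 qualitative variable and x2 is quantitative.
    """
    if x1 not in (0, 1):
        raise ValueError("x1 must be 0 or 1")
    intercept = b0 + b1 * x1  # b1 is switched on only when x1 = 1
    slope = b2 + b3 * x1      # b3 is switched on only when x1 = 1
    return (intercept, slope)


# Hypothetical coefficient estimates, for illustration only.
b0, b1, b2, b3 = 1.0, 2.0, 0.5, -1.5

for x1 in (0, 1):
    intercept, slope = line_for_category(b0, b1, b2, b3, x1)
    print(f"x1 = {x1}: y-hat = {intercept} + ({slope}) * x2")
```

Setting `b3 = 0` in this sketch gives the same slope for both categories (two parallel lines), and additionally setting `b2 = 0` gives two horizontal lines, matching the three cases worked through in the sheet.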