Interaction effects for continuous predictors in regression modeling

Testing for interactions

The linear regression model is undoubtedly the most commonly used statistical model, having the advantages of wide applicability and ease of interpretation. The model has the form

    y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i,

where y is the response variable, {x_1, ..., x_p} are predictor variables, and \varepsilon is an error term. An implication of this model is that the partial relationship between y and any predictor x_j (given that the other predictors are held fixed) is the same across all values of the predictors; specifically, holding all else fixed, a one-unit change in x_j is associated with an expected \beta_j-unit change in y, for any value of x_j and any values of the other predictors. Since the relationship between y and x_j is thus constant over the values of any other predictor, this is often referred to as the absence of an interaction effect of x_j on y given the value of a third variable. From a mathematical point of view, this is represented by the fact that the partial derivative \partial y / \partial x_j is a constant.

It is not uncommon for researchers and data analysts to consider the possibility that the effect of a predictor on the response could be different depending on the value of a third variable; that is, the presence of an interaction effect. The classic situation where this occurs is when the third variable defines subgroups in the data, the implication being that the slope of x differs depending on group membership. It is well known that such a model can be fit by including in a regression model a set of indicator variables to define the groups, along with all of the pairwise products of the indicator variables and the variable x (this can also be accomplished using effect codings; see Mayhew and Simonoff, 2015, for a full discussion of the use of effect codings to define subgroups in a data set).

Consider the simplest situation, the presence of two subgroups A and B in the data and a single predictor x. Say an indicator variable I defines group membership, with I = 0 corresponding to membership in group A and I = 1 corresponding to membership in group B. Fitting the regression model based on I, x, and their product Ix,

    y_i = \beta_0 + \beta_1 x_i + \beta_2 I_i + \beta_3 I_i x_i + \varepsilon_i,

is equivalent to fitting the two separate lines

    y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

for members of group A (I = 0) and

    y_i = \beta_0^* + \beta_1^* x_i + \varepsilon_i, where \beta_0^* = \beta_0 + \beta_2 and \beta_1^* = \beta_1 + \beta_3,

for members of group B (I = 1). As can be seen, by including the product of I and x in the regression model, different slopes for the two groups are implied, representing the interaction effect of group membership and the numerical variable x. This generalizes to more than two subgroups as an analysis of covariance model (see Chatterjee and Simonoff, 2013, for extensive discussion of fitting such models).
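Such a model can be fit directly in R. The following is a minimal sketch, assuming a data frame dat with a numeric predictor x, a response y, and a two-level factor group (these names are illustrative):

    # Separate intercepts and slopes by group: the formula x * group
    # expands to x + group + x:group.
    fit <- lm(y ~ x * group, data = dat)
    summary(fit)   # the t-test for the x:group term tests equality of slopes

The coefficient on the x:group term estimates the slope difference \beta_3, so its t-test is a direct test of whether the two groups share a common slope.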
This fact has had the unfortunate effect of leading researchers to attempt to represent interactions between two numerical variables in the same way, by including their product as a predictor in the fitted regression,

    y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i} x_{2i} + \varepsilon_i.    (1)

This is problematic because using the t-test for whether the slope of the product variable equals 0 as an interaction test potentially results in errors of both types: Type I (mistakenly identifying a pattern that does not correspond to an interaction effect as an interaction) and Type II (mistakenly deciding that no interaction effect is present when one actually is), no matter how large the sample is or how strong the underlying relationships are. We will treat each of these issues in turn in the next two sections, illustrating them with simulated data. The data are a deliberately simplified version of the problem, with patterns that are obvious, in order to illustrate the issues clearly; in a real data situation with multiple additional predictors the patterns could easily be less obvious to the eye, but just as serious. We will then discuss how to graphically uncover an interaction effect between two numerical variables, and how the use of additive models (a generalization of the linear model) can be an appropriate way to avoid mistakenly identifying a supposed interaction effect. We will then suggest a simple alternative approach for identifying interactions between numerical variables.

Problems with the product test for interactions

Mistakenly identifying nonlinearity as an interaction (Type I error)

The key idea is to recognize that (1) is not an interaction equation, but rather a nonlinear one. If nonlinearity is mistakenly identified as an interaction, a Type I error occurs. This can easily happen if the variables x_1 and x_2 are correlated with each other. Consider the following situation. Say the true underlying relationship is a quadratic one in the variable x_1 alone; that is,

    y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{1i}^2 + \varepsilon_i.

A model that is linear in x_1 clearly cannot account for this quadratic relationship. If the product model (1) is fit instead, and if x_1 and x_2 are highly correlated, the fitted model will be of the form

    y_i = \beta_0^* + \beta_1^* x_{1i} + \beta_2^* x_{2i} + \beta_3^* x_{1i} x_{2i} + \varepsilon_i,

which can track the quadratic relationship because, up to constant terms or terms in x_1 or x_2 alone, x_{1i}^2 ≈ x_{1i} x_{2i}. Thus, if a product term is included in the regression, its t-statistic will be statistically significant, implying an interaction between x_1 and x_2, when in fact what is present is a nonlinear relationship in x_1 alone.

Consider the following simulated example.
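A sketch of how data of this general form might be simulated and analyzed in R follows; the sample size, coefficients, and noise levels are illustrative assumptions, not necessarily the values behind the output shown below:

    # Two highly correlated predictors; the response is quadratic in x1
    # alone, so no interaction of any kind is present in the true model.
    set.seed(101)
    n  <- 100
    x1 <- rnorm(n)
    x2 <- x1 + rnorm(n, sd = 0.1)      # corr(x1, x2) is close to 1
    y  <- 1 + x1 + x1^2 + rnorm(n)

    summary(lm(y ~ x1 + x2))           # linear main-effects model
    summary(lm(y ~ x1 + x2 + x1:x2))   # adds the product "interaction" term
    summary(lm(y ~ x1 + I(x1^2)))      # the true quadratic model

In data of this kind the product term typically appears highly significant even though no interaction is present, while the last fit shows that a quadratic term in x_1 alone accounts for the same structure.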
The following regression output is based on fitting a regression with two predictors, x1 and x2:

    The regression equation is
    y = .683 + . x1 - .8 x2

    Predictor     Coef  SE Coef      T      P
    Constant      .683     .449    4.7      .
    x1              .8      .49    .49     .4
    x2           -.799       .5    -.9    .37

    S = .47   R-Sq = 9.%   R-Sq(adj) = 7.%

    Analysis of Variance
    Source           DF      SS      MS     F    P
    Regression        2   9.496   9.748   4.8    .
    Residual Error   97   96.66
    Total            99    5.56

The overall regression is statistically significant, but neither predictor is; the reason for this is that the two predictors are highly correlated (the correlation between them is .994). The product test for an interaction now adds the product variable to the regression:

    The regression equation is
    y = -.65 + .5 x1 - .59 x2 + . x1x2

    Predictor     Coef  SE Coef      T      P
    Constant      -.65    .8655    -.7    .47
    x1            .543     .764    .99     .5
    x2           -.594    .7756   -.76   .447
    x1x2          .737      .65    6.6      .

    S = .7673   R-Sq = 76.5%   R-Sq(adj) = 75.8%

    Analysis of Variance
    Source           DF      SS       MS     F    P
    Regression        3   64.95   54.984   4.3    .
    Residual Error   96     5.6      .57
    Total            99    5.56

The t-test for the product term is extremely statistically significant, apparently indicating an extremely strong interaction between the two predictors, but that is not in fact the case. The scatter plot below demonstrates what is actually going on: there is a quadratic relationship between y and x_1, and the high correlation between x_1 and x_2 has resulted in the product of the two variables taking the place of the x_1^2 term. Thus, a nonlinear relationship in a single predictor has been misidentified as an interaction effect involving two predictors.
[Figure: Scatterplot of y vs x1]

Mistakenly missing the presence of an interaction (Type II error)

The product term in equation (1) can be viewed as an interaction effect on the response, as it does correspond to a differential effect of x_1 on y given the value of x_2; specifically,

    \partial y / \partial x_1 = \beta_1 + \beta_3 x_2.

The problem with the test is that this is a very specific form of an effect, and many interaction effects do not correspond to a relationship even close to this form. As a result, there are many situations where an actual interaction will be missed by the test of whether the slope of the product term equals 0.

Consider the following simulated example. The following regression output is based on fitting a regression with two predictors, x1 and x2 (note that y, x1, and x2 are not the same as in the previous example):

    The regression equation is
    y = -.6 + .65 x1 - .39 x2

    Predictor     Coef  SE Coef      T      P
    Constant      -.59     .956    -.8   .936
    x1             .65      .48     .3     .3
    x2           -.387    .3347     -.    .98
    S = .39   R-Sq = 5.%   R-Sq(adj) = 3.3%

    Analysis of Variance
    Source           DF     SS    MS     F    P
    Regression        2   56.4    8.   .68   .73
    Residual Error   97   69.3   4.8
    Total            99   73.7

The overall regression is marginally statistically significant, as is the slope coefficient for x_1. The product test for an interaction now adds the product variable to the regression:

    The regression equation is
    y = .3 + 3.6 x1 - .66 x2 - .9 x1x2

    Predictor     Coef  SE Coef      T      P
    Constant        .3     .999      .   .988
    x1             3.6       .6    .63     .6
    x2           -.658      .34    -.9   .847
    x1x2          -.89     .476    -.5     .6

    S = .78   R-Sq = 5.5%   R-Sq(adj) = .5%

    Analysis of Variance
    Source           DF     SS     MS     F    P
    Regression        3    59.   96.7   .86   .4
    Residual Error   96    4.6    5.6
    Total            99   73.7

As is apparent, the product variable is not close to being statistically significant here, apparently implying that there is no interaction effect, but that is not in fact the case. There is in fact a very strong interaction effect: the slope of x_1 is positive when x_2 is small or large, and negative for moderate values of x_2. This can be seen in the following scatter plot, where the regions of x_2 are labeled Low, Mid, and High:
[Figure: Scatterplot of y vs x1, with points identified by the region (Low, Mid, or High) of x2]

Since this interaction does not look like a product term, the test has no power to identify it, even though doing so correctly would result in a strong fit (an R² of more than 75%, and a highly statistically significant interaction effect corresponding to different slopes for the three regions of x_2).
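A sketch of a simulation of this general form follows; the thresholds, slopes, and noise level are illustrative assumptions. Because the slope change is a step function of x_2 rather than linear in it, the product term has essentially no power to detect it:

    # The slope of x1 is positive in the Low and High regions of x2 and
    # negative in the Mid region; no single product term can capture this.
    set.seed(202)
    n     <- 100
    x1    <- rnorm(n)
    x2    <- rnorm(n)
    slope <- ifelse(x2 < -0.35 | x2 > 0.7, 1, -1)
    y     <- slope * x1 + rnorm(n, sd = 0.5)

    summary(lm(y ~ x1 + x2 + x1:x2))   # product term: typically insignificant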
Identifying interaction effects

Given the deficiencies in using the product of two numerical predictors to test for the presence of an interaction effect, a natural question to ask is whether there are better methods. The answer is yes, as we discuss here. We first describe a graphical technique (termed a trellis display) that can help expose the presence of an interaction effect, and we then discuss how the linear regression model can be generalized to an additive model that is flexible enough to distinguish between nonlinear relationships and actual interaction effects. Both of these techniques are available as part of the free software package R. We then note how fitting an analysis of covariance model can easily test for the presence of an interaction effect in a way that is much more effective in general than is multiplying numerical variables.

Trellis displays

A trellis display is a version of a conditioning plot; it highlights patterns in the data conditional on the value of a specific variable. Since this is precisely what an interaction effect in regression represents (the relationship between the response and a predictor changing based on the value of another variable), such a display is ideal for exploring graphically the possibility of an interaction effect. The display below was prepared for the second data set given above using the lattice package in R (Sarkar, 2008). Recall that in that data set the slope between y and x_1 changes depending on the value of x_2. The plot is constructed by defining subregions based on the conditioning variable x_2; a simple default (used here) is to divide the data into regions with roughly equal numbers of observations. Each panel of the display is a scatter plot of y versus x_1 for the observations in that x_2 subregion. The subregions go from the smallest values of x_2 in the lower left to the largest values in the upper right, and are identified by the shading at the top of each panel in the display.
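A display of this kind can be produced with a call along the following lines; this is a minimal sketch, assuming the variables are in a data frame dat, with equal.count forming the overlapping subregions ("shingles") with roughly equal numbers of observations:

    library(lattice)
    # Panels of y versus x1, conditioned on six overlapping subregions of x2.
    xyplot(y ~ x1 | equal.count(x2, number = 6), data = dat)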
[Figure: Trellis display of y versus x1, conditioning on subregions of x2]

It is apparent in the display that for smaller values of x_2 there is a direct relationship between y and x_1, for moderate values there is an inverse relationship, and for large values there is again a direct relationship. Thus, the plot easily summarizes the interaction effect in the data. As is true for any scatter plot in a multiple regression, the display is in general only suggestive, since it cannot account for the effects of other predictors on the relationship between y and x_1 given x_2, but it is certainly worth constructing if the possibility of an interaction effect is contemplated.

Additive models

Additive models (Hastie and Tibshirani, 1990) are a generalization of linear models in which linear terms are replaced by arbitrary, usually smooth, functions of the predictors. The simplest version of the model takes the form

    y_i = f_1(x_{1i}) + \cdots + f_p(x_{pi}) + \varepsilon_i,

where the functions f_j(\cdot) can be generalizations beyond the linear terms in a linear model. These functions are typically assumed to be smooth, and are estimated using kernel-based local polynomials, smoothing splines, and so on (see Simonoff, 1996, for a discussion of smoothing methods). These models provide a compromise between linear models (with their ease of interpretation but strong assumption of linearity of effects) and arbitrary nonlinear models (with their greater flexibility but difficulties in specification and estimation) by hypothesizing that effects can be nonlinear, but do not interact with each other. They can be fit using either the gam or mgcv packages in R. So, for example, for the first data set given above, an additive model fit can automatically highlight the nonlinear relationship between y and x_1, and, given that, the unimportance of x_2.
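Using the mgcv package, for example, such a fit is a one-line call; this is a minimal sketch, with the smoothness of each term chosen automatically by generalized cross-validation rather than by eye:

    library(mgcv)
    # Additive model: smooth main effects for x1 and x2, but no interaction.
    fit.gam <- gam(y ~ s(x1) + s(x2), data = dat)
    summary(fit.gam)
    plot(fit.gam, pages = 1)   # plots the estimated functions f1 and f2

A plot of the estimated terms for the first data set is shown below: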
[Figure: Estimated additive model terms for x1 and x2 for the first data set]

In this display each of the plots gives the effect of one variable given the presence of the other. The superimposed lines correspond to estimates of the underlying functions f_1 and f_2, and show that once x_1 is included in the model, x_2 does not add anything, even though a simple scatter plot of y on x_2 would show a quadratic pattern because of the high correlation between x_1 and x_2 (the smoothness of the fitted curves must be chosen by the data analyst; Simonoff and Tsai, 1999, discuss this statistical problem, but from a practical point of view it is often satisfactory to choose the curves by eye).

Analysis of covariance

The additive model does not directly address the problem of identifying interactions if they exist, beyond identifying when a nonlinear relationship has been misidentified as an interaction. Thus, the plot of the additive terms for the second data set above (where there is an interaction effect) shows that an additive model is not an adequate representation of the relationships, as the additive model tries to use a parabolic curve to estimate a much more complex relationship between y and x_1:
[Figure: Estimated additive model terms for x1 and x2 for the second data set]

While it is possible to generalize the additive model to allow for terms that are explicitly smooth interactions of predictors, a more straightforward approach is to build on the trellis display, and explore a regression model that allows for different slopes for a predictor depending on the value of another variable. This is not exactly correct unless the groupings happen to correspond exactly to true subgroups in the data (recall, for example, that the true relationship in the second data set is based on three subgroups in the data, not the six automatically chosen in the trellis display), but it is flexible enough to usually identify the existence of a potential interaction that could then be explored further. That is, fit an analysis of covariance model that includes an interaction effect, and construct a partial F-test for whether this provides a significantly better fit than does a constant shift model. This corresponds to fitting separate lines to each of the subplots in the trellis display if there are no other predictors in the model, but generalizes the display to account for the potential effects of other variables if there are any.
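A minimal sketch of this procedure in R, assuming (as an illustrative grouping rule) that x_2 is cut into six subgroups with roughly equal numbers of observations, mirroring the trellis display:

    # Divide x2 into six groups of roughly equal size.
    grp <- cut(x2, breaks = quantile(x2, probs = seq(0, 1, length.out = 7)),
               include.lowest = TRUE)
    separate <- lm(y ~ x1 * grp)   # different slope and intercept per group
    parallel <- lm(y ~ x1 + grp)   # constant shift (parallel lines) model
    anova(parallel, separate)      # partial F-test for the interaction effect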
If that is done for the second data set above, the interaction effect is clearly supported, with a partial F-statistic equal to 46.5 on (5,88) degrees of freedom, yielding a p-value vanishingly close to 0, strongly implying improved performance for lines with different slopes over a set of parallel lines. Closer examination of the trellis display would then show that there seem to be three separate regimes defining the interaction, which could be explored further.

References

Chatterjee, S. and Simonoff, J.S. (2013), Handbook of Regression Analysis, Wiley: Hoboken, NJ.

Hastie, T.J. and Tibshirani, R.J. (1990), Generalized Additive Models, Chapman and Hall: London.

Mayhew, M.J. and Simonoff, J.S. (2015), "Nonwhite, No More: Effect Coding as an Alternative to Dummy Coding with Implications for Researchers in Higher Education," Journal of College Student Development, 56, 170-175.

Sarkar, D. (2008), Lattice: Multivariate Data Visualization with R, Springer: New York.

Simonoff, J.S. (1996), Smoothing Methods in Statistics, Springer: New York.

Simonoff, J.S. and Tsai, C.-L. (1999), "Semiparametric and Additive Model Selection Using an Improved Akaike Information Criterion," Journal of Computational and Graphical Statistics, 8, 22-40.