Interaction effects for continuous predictors in regression modeling

Testing for interactions

The linear regression model is undoubtedly the most commonly used statistical model, and has the advantages of wide applicability and ease of interpretation. The model has the form

$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i,$

where $y$ is the response variable, $\{x_1, \ldots, x_p\}$ are predictor variables, and $\varepsilon$ is an error term. An implication of this model is that the partial relationship between $y$ and any predictor $x_j$ (with the other predictors held fixed) is the same across all values of the predictors; specifically, holding all else fixed, a one unit change in $x_j$ is associated with an expected $\beta_j$ unit change in $y$, for any value of $x_j$ and any values of the other predictors. The constancy of this relationship across values of another predictor is often referred to as the lack of an interaction effect of $x_j$ on $y$ given the value of a third variable. From a mathematical point of view, this is represented by the fact that the partial derivative $\partial y / \partial x_j$ is a constant.

It is not uncommon for researchers and data analysts to consider the possibility that the effect of a predictor on the response could be different depending on the value of a third variable; that is, the presence of an interaction effect. The classic situation where this occurs is when the third variable defines subgroups in the data, the implication being that the slope of $x$ differs depending on group membership. It is well known that such a model can be fit by including in a regression model a set of indicator variables to define the groups, along with all of the pairwise products of the indicator variables and the variable $x$ (this can also be accomplished using effect codings; see Mayhew and Simonoff, 2015, for a full discussion of the use of effect codings to define subgroups in a data set).

Consider the simplest situation: two subgroups A and B in the data and a single predictor $x$. Say an indicator variable $I$ defines group membership, with $I = 0$ corresponding to membership in group A and $I = 1$ corresponding to membership in group B. Fitting the regression model based on $I$, $x$, and their product $Ix$,

$y_i = \beta_0 + \beta_1 x_i + \beta_2 I_i + \beta_3 I_i x_i + \varepsilon_i,$

is equivalent to fitting the two separate lines

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

for members of group A ($I = 0$) and

$y_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_i + \varepsilon_i \equiv \beta_0^* + \beta_1^* x_i + \varepsilon_i$

for members of group B ($I = 1$). As can be seen, including the product of $I$ and $x$ in the regression model implies different slopes for the two groups, representing the interaction effect of group membership and the numerical variable $x$. This generalizes for more than two subgroups to an analysis of covariance model (see Chatterjee and Simonoff, 2013, for extensive discussion of fitting such models).
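This structure is easy to fit directly in any regression software. As a quick illustration, the following R sketch fits the two-group interaction model in one call (the data, names, and coefficient values here are invented for illustration, not taken from the examples below):

set.seed(1)
n   <- 100
x   <- rnorm(n)
ind <- rbinom(n, 1, 0.5)                 # indicator: 0 = group A, 1 = group B
y   <- 1 + 2 * x + 1.5 * ind - 3 * ind * x + rnorm(n)

fit <- lm(y ~ x * ind)                   # expands to x + ind + x:ind
summary(fit)
# Group A (ind = 0): intercept beta0, slope beta1
# Group B (ind = 1): intercept beta0 + beta2, slope beta1 + beta3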

This fact has had the unfortunate effect of leading researchers to attempt to represent interactions between two numerical variables in the same way, by including their product as a predictor in a fitted regression:

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i} x_{2i} + \varepsilon_i. \qquad (1)$

This is problematic because using the t-test for whether the slope of the product variable equals 0 as an interaction test potentially results in errors of both types: Type I (mistakenly identifying a pattern that does not correspond to an interaction effect as an interaction) and Type II (mistakenly deciding that no interaction effect is present when one actually is), no matter how large the sample is or how strong the underlying relationships are. We will treat each of these issues in turn in the next two sections, illustrating them with simulated data. The data are a deliberately simplified version of the problem, in which the patterns are obvious, in order to illustrate the issues clearly; in a real data situation with multiple additional predictors the patterns could easily be less obvious to the eye, but just as serious. We will then discuss how to graphically uncover an interaction effect between two numerical variables, and how the use of additive models (a generalization of the linear model) can be an appropriate way to avoid mistakenly identifying a supposed interaction effect. We will then suggest a simple alternative approach for identifying interactions between numerical variables.

Problems with the product test for interactions

Mistakenly identifying nonlinearity as an interaction (Type I error)

The key idea is to recognize that (1) is not an interaction equation, but rather a nonlinear one. If nonlinearity is mistakenly identified as an interaction, a Type I error occurs. This can easily happen if the variables $x_1$ and $x_2$ are correlated with each other. Consider the following situation. Say the true underlying relationship is a quadratic one in the variable $x_1$ alone; that is,

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{1i}^2 + \varepsilon_i.$

A model linear in $x_1$ alone clearly cannot account for this quadratic relationship. If the product model (1) is fit instead, and if $x_1$ and $x_2$ are highly correlated, then

$y_i \approx \beta_0^* + \beta_1^* x_{1i} + \beta_2^* x_{2i} + \beta_3^* x_{1i} x_{2i} + \varepsilon_i^*,$

because, up to constant terms and terms in $x_1$ or $x_2$ alone, $x_{1i}^2 \approx x_{1i} x_{2i}$. Thus, if a product term is included in the regression, its t-statistic will be statistically significant, implying an interaction between $x_1$ and $x_2$, when in fact what is present is a nonlinear relationship in $x_1$ alone.

Consider the following simulated example. The following regression output is based on fitting a regression with two predictors, $x_1$ and $x_2$:

The regression equation is
y = .683 + . x1 - .8 x2

Predictor     Coef    SE Coef       T       P
Constant      .683     .449       4.7    .000
x1            .8       .49         .49   .4
x2           -.799     .5         -.9    .37

S = .47    R-Sq = 9.%    R-Sq(adj) = 7.%

Analysis of Variance
Source           DF        SS        MS       F       P
Regression        2     9.496     9.748     4.8    .000
Residual Error   97    96.66
Total            99     5.56

The overall regression is statistically significant, but neither predictor is; the reason for this is that the two predictors are highly correlated (the correlation between them is 0.994). The product test for an interaction now adds the product variable to the regression:

The regression equation is
y = -.65 + .5 x1 - .59 x2 + . x1x2

Predictor     Coef    SE Coef       T       P
Constant     -.65      .8655      -.7     .47
x1            .543     .764        .99    .5
x2           -.594     .7756      -.76    .447
x1x2          .737     .65        6.6     .000

S = .7673    R-Sq = 76.5%    R-Sq(adj) = 75.8%

Analysis of Variance
Source           DF        SS        MS       F       P
Regression        3     64.95    54.984     4.3    .000
Residual Error   96      5.6       .57
Total            99      5.56

The t-test for the product term is extremely highly statistically significant, apparently indicating an extremely strong interaction between the two predictors, but that is not in fact the case. The scatter plot below demonstrates what is actually going on: there is a quadratic relationship between $y$ and $x_1$, and the high correlation between $x_1$ and $x_2$ has resulted in the product of the two variables taking the place of the $x_1^2$ term. Thus, a nonlinear relationship in a single predictor has been misidentified as an interaction effect involving two predictors.

[Figure: scatter plot of y versus x1, showing a clear quadratic relationship.]
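The phenomenon is easy to reproduce. A minimal R sketch follows (the exact settings behind the output above are not given in the text, so the sample size, correlation, and coefficients here are assumptions chosen to mimic it):

set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)      # corr(x1, x2) is roughly 0.995
y  <- 1 + x1 + x1^2 + rnorm(n)     # the truth is quadratic in x1 alone

summary(lm(y ~ x1 + x2))           # individually weak predictors (collinearity)
summary(lm(y ~ x1 + x2 + x1:x2))   # the product term tests as highly "significant"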

Mistakenly missing the presence of an interaction (Type II error)

The product term in equation (1) can be viewed as an interaction effect on the response, as it does correspond to a differential effect of $x_1$ on $y$ given the value of $x_2$; specifically,

$\partial y / \partial x_1 = \beta_1 + \beta_3 x_2.$

The problem with the test is that this is a very specific form of an effect, and many interaction effects do not correspond to a relationship even close to this form. As a result, there are many situations where an actual interaction will be missed by the test of whether the slope of the product term equals 0.

Consider the following simulated example. The following regression output is based on fitting a regression with two predictors, $x_1$ and $x_2$ (note that $y$, $x_1$, and $x_2$ are not the same as in the previous example):

The regression equation is
y = -.6 + .65 x1 - .39 x2

Predictor     Coef    SE Coef       T       P
Constant     -.59      .956       -.8     .936
x1            .65      .48         .3     .3
x2           -.387     .3347      -.      .98

S = .39    R-Sq = 5.%    R-Sq(adj) = 3.3%

Analysis of Variance
Source           DF        SS        MS       F       P
Regression        2      56.4       8.      .68    .073
Residual Error   97      69.3       4.8
Total            99      73.7

The overall regression is marginally statistically significant, as is the slope coefficient for $x_1$. The product test for an interaction now adds the product variable to the regression:

The regression equation is
y = .3 + 3.6 x1 - .66 x2 - .9 x1x2

Predictor     Coef    SE Coef       T       P
Constant      .3       .999        .      .988
x1           3.6       .6          .63    .6
x2           -.658     .34        -.9     .847
x1x2         -.89      .476       -.5     .6

S = .78    R-Sq = 5.5%    R-Sq(adj) = .5%

Analysis of Variance
Source           DF        SS        MS       F       P
Regression        3      59.       96.7     .86     .4
Residual Error   96       4.6       5.6
Total            99      73.7

As is apparent, the product variable is not close to being statistically significant here, apparently implying that there is no interaction effect, but that is not in fact the case. There is in fact a very strong interaction effect: the slope of $x_1$ takes one value when $x_2$ falls below a lower cutoff or above an upper cutoff, and a different value in between. This can be seen in the following scatter plot, where the regions of $x_2$ are labeled Low, Mid, and High:

[Figure: scatter plot of y versus x1, with points identified by the region (Low, Mid, High) into which x2 falls.]
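A data-generating mechanism of this general form can be sketched in R as follows (the cutoffs and slope values here are hypothetical stand-ins, chosen only to mimic the Low/Mid/High structure; the values used for the output above are not recoverable from the text):

set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
slope <- ifelse(x2 < -0.35 | x2 > 0.7, 1, -1)   # slope depends on region of x2
y  <- slope * x1 + rnorm(n, sd = 0.5)

summary(lm(y ~ x1 + x2 + x1:x2))   # the product t-test has essentially no power here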

Since this interaction does not look like a product term, the test has no power to identify it, even though identifying it correctly would result in a strong fit (an $R^2$ of more than 75%, and a highly statistically significant interaction effect corresponding to different slopes for the three regions of $x_2$).

Identifying interaction effects

Given the deficiencies of using the product of two numerical predictors to test for the presence of an interaction effect, a natural question to ask is whether there are better methods. The answer is yes, as we discuss here. We first describe a graphical technique (termed a trellis display) that can help expose the presence of an interaction effect, and we then discuss how the linear regression model can be generalized to an additive model that is flexible enough to distinguish between nonlinear relationships and actual interaction effects. Both of these techniques are available as part of the free software package R. We then note how fitting an analysis of covariance model can easily test for the presence of an interaction effect in a way that is much more effective in general than is multiplying numerical variables.

Trellis displays

A trellis display is a version of a conditioning plot; it highlights patterns in the data conditional on the value of a specific variable. Since this is precisely what an interaction effect in regression represents (the relationship between the response and a predictor changing based on the value of another variable), such a display is ideal for exploring graphically the possibility of an interaction effect. The display below is for the second data set given above, prepared using the lattice package in R (Sarkar, 2008). Recall that in that data set the slope between $y$ and $x_1$ changes depending on the value of $x_2$. The plot is constructed by defining subregions based on the conditioning variable $x_2$; a simple default (used here) is to divide the data into regions with roughly equal numbers of observations. Each panel of the display is a scatter plot of $y$ versus $x_1$ for the observations in that $x_2$ subregion. The subregions go from the smallest values of $x_2$ in the lower left to the largest values in the upper right, and are identified by the shading at the top of each panel in the display.

[Figure: trellis display of y versus x1, conditioning on six subregions of x2.]
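A display of this kind can be produced with a single lattice call; a minimal sketch, assuming the data are in a data frame dat with columns y, x1, and x2:

library(lattice)
# condition on x2, divided into six subregions with roughly equal counts
xyplot(y ~ x1 | equal.count(x2, number = 6, overlap = 0),
       data = dat, type = c("p", "r"),   # points plus a least squares line
       xlab = "x1", ylab = "y")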

It is apparent in the display that for smaller values of $x_2$ there is a direct relationship between $y$ and $x_1$, for moderate values there is an inverse relationship, and for large values there is again a direct relationship. Thus, the plot easily summarizes the interaction effect in the data. As is true for any scatter plot in a multiple regression, the display is in general only suggestive, since it cannot account for the effects of predictors other than $x_2$ on the relationship between $y$ and $x_1$, but it is certainly worth constructing if the possibility of an interaction effect is contemplated.

Additive models

Additive models (Hastie and Tibshirani, 1990) are a generalization of linear models in which linear terms are replaced with arbitrary, usually smooth, functions of the predictors. The simplest version of the model takes the form

$y_i = \beta_0 + f_1(x_{1i}) + \cdots + f_p(x_{pi}) + \varepsilon_i,$

where the functions $f_j(\cdot)$ can be generalizations beyond the linear terms in a linear model. These functions are typically assumed to be smooth, and are estimated using kernel-based local polynomials, smoothing splines, and so on (see Simonoff, 1996, for a discussion of smoothing methods). These models provide a compromise between linear models (with their ease of interpretation but strong assumption of linearity of effects) and arbitrary nonlinear models (with their greater flexibility but difficulties in specification and estimation) by hypothesizing that effects can be nonlinear, but do not interact with each other. They can be fit using either the gam or mgcv packages in R. So, for example, for the first data set given above an additive model fit can automatically highlight the nonlinear relationship between $y$ and $x_1$ and, given that, the unimportance of $x_2$:

[Figure: estimated additive model effects of x1 and x2 for the first data set.]
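Such a fit takes one line with, for example, the mgcv package; a minimal sketch, again assuming a data frame dat:

library(mgcv)
fit.add <- gam(y ~ s(x1) + s(x2), data = dat)  # smooth, non-interacting effects
summary(fit.add)
plot(fit.add, pages = 1)    # one page showing both estimated smooth functions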

In this display each of the plots gives the effect of one variable given the presence of the other. The superimposed lines correspond to estimates of the underlying functions $f_1$ and $f_2$, and show that once $x_1$ is included in the model $x_2$ does not add anything, even though a simple scatter plot of $y$ on $x_2$ would show a quadratic pattern because of the high correlation between $x_1$ and $x_2$. (The smoothness of the fitted curves must be chosen by the data analyst; Simonoff and Tsai, 1999, discuss this statistical problem, but from a practical point of view it is often satisfactory to choose the curves by eye.)

Analysis of covariance

The additive model does not directly address the problem of identifying interactions if they exist, beyond identifying when a nonlinear relationship has been misidentified as an interaction. Thus, the plot of the additive terms for the second data set above (where there is an interaction effect) shows that an additive model is not an adequate representation of the relationships, as the additive model tries to use a parabolic curve to estimate a much more complex relationship between $y$ and $x_1$:

[Figure: estimated additive model effects of x1 and x2 for the second data set.]

While it is possible to generalize the additive model to allow for terms that are explicitly smooth interactions of predictors, a more straightforward approach is to build on the trellis display and explore a regression model that allows different slopes for a predictor depending on the value of another variable. This is not correct unless the groupings happen to correspond exactly to true subgroups in the data (recall, for example, that the true relationship in the second data set is based on three subgroups in the data, not the six automatically chosen in the trellis display), but it is flexible enough to usually identify the existence of a potential interaction that could then be explored further. That is, fit an analysis of covariance model that includes an interaction effect, and construct a partial F-test for whether this provides a significantly better fit than does a constant shift model. This corresponds to fitting separate lines to each of the subplots in the trellis display if there are no other predictors in the model, but generalizes the display to account for the potential effects of other variables if there are any.
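In R this amounts to converting the conditioning variable into a grouping factor and comparing nested models; a sketch, using six groups to match the trellis subregions (data frame and variable names assumed):

# six x2 groups with roughly equal numbers of observations
dat$grp <- cut(dat$x2, quantile(dat$x2, probs = seq(0, 1, length.out = 7)),
               include.lowest = TRUE)
fit0 <- lm(y ~ x1 + grp, data = dat)   # parallel lines: constant shift model
fit1 <- lm(y ~ x1 * grp, data = dat)   # a separate slope within each group
anova(fit0, fit1)                      # partial F-test for the interaction effect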

If that is done for the second data set above, the interaction effect is clearly supported, with a partial F-statistic equal to 46.5 on (5, 88) degrees of freedom, yielding a p-value vanishingly close to 0, strongly implying improved performance for lines with different slopes over a set of parallel lines. Closer examination of the trellis display would then show that there seem to be three separate regimes defining the interaction, which could be explored further.

References

Chatterjee, S. and Simonoff, J.S. (2013), Handbook of Regression Analysis, Wiley: Hoboken, NJ.

Hastie, T.J. and Tibshirani, R.J. (1990), Generalized Additive Models, Chapman and Hall: London.

Mayhew, M.J. and Simonoff, J.S. (2015), "Non-White, No More: Effect Coding as an Alternative to Dummy Coding with Implications for Researchers in Higher Education," Journal of College Student Development, 56, 170-175.

Sarkar, D. (2008), Lattice: Multivariate Data Visualization with R, Springer: New York.

Simonoff, J.S. (1996), Smoothing Methods in Statistics, Springer: New York.

Simonoff, J.S. and Tsai, C.-L. (1999), "Semiparametric and Additive Model Selection Using an Improved Akaike Information Criterion," Journal of Computational and Graphical Statistics, 8, 22-40.