MULTICOLLINEARITY AND VARIANCE INFLATION FACTORS
F. Chiaromonte
Given a pool of available predictors, and terms formed from them, in the data set, the questions related to model selection are: What is the relative importance of the different terms? What are the sign and magnitude of their effects on Y? What are their contributions to explaining the variability of Y? Can a term be dropped from (should a term be added to) the model because its contribution is small (large)?

When the terms are not correlated with one another, the answers are straightforward. Compare the full and reduced models

$$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \varepsilon_i, \qquad y_i = \tilde\beta_0 + \tilde\beta_1 x_{1,i} + \tilde\varepsilon_i$$

If $\mathrm{cor}(x_1, x_2) = 0$, then

$$b_1 = \tilde b_1, \qquad SSR(x_1 \mid x_2) = SSE(x_2) - SSE(x_1, x_2) = SSR(x_1)$$

i.e., the LS coefficient of $x_1$, and the response variability explained by $x_1$, alone and together with $x_2$, are the same.
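A minimal Python sketch (not part of the original slides) illustrating this: x2 is residualized against x1 so their sample correlation is exactly zero, and the LS coefficient of x1 and its SSR come out identical with and without x2 in the model. The simulated data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# center both, then residualize x2 on x1 so that cor(x1, x2) = 0 exactly in-sample
x1 = x1 - x1.mean()
x2 = x2 - x2.mean()
x2 = x2 - (x2 @ x1 / (x1 @ x1)) * x1

y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def fit_sse(terms, y):
    """LS fit with intercept; return coefficient vector and SSE."""
    X = np.column_stack([np.ones(len(y))] + terms)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b, float(np.sum((y - X @ b) ** 2))

b12, sse12 = fit_sse([x1, x2], y)   # full model
b1,  sse1  = fit_sse([x1], y)       # x1 alone
b2,  sse2  = fit_sse([x2], y)       # x2 alone
sse0 = float(np.sum((y - y.mean()) ** 2))   # intercept-only SSE

print(b12[1], b1[1])                # same LS coefficient on x1: b1 = b~1
print(sse0 - sse1, sse2 - sse12)    # SSR(x1) = SSE(x2) - SSE(x1, x2) = SSR(x1 | x2)
```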
However, when terms are correlated with one another, things become more complicated. $b_h$ and $SSR(x_h \mid x_{h'},\, h' \neq h)$, the LS coefficient and the contribution to explaining the variability of y, can change depending on what other terms are considered together with $x_h$ in the model; they are context dependent:

- $b_h$ can change in magnitude and even in sign depending on the presence of terms correlated with $x_h$.
- The interpretation of $b_h$ as the average change in y when $x_h$ increases by one unit and all other terms are held constant becomes ambiguous; in practice, can you hold the other terms constant while changing $x_h$?
- The SSR attributable to $x_h$ can decrease but also increase depending on the presence of terms correlated with $x_h$ (e.g., decrease if both $x_h$ and the other terms are correlated with y, and with one another; increase if $x_h$ is not correlated with y per se, but is correlated with other terms, which are in turn correlated with y); see the simulation sketch below.
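A small simulation (mine, not from the slides) showing the last point: y is built so that it is essentially uncorrelated with x1 on its own, yet because x1 is correlated with x2, it becomes strongly relevant, with a large negative coefficient, once x2 enters the model. Data and noise scales are hypothetical; statsmodels is assumed available.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n)                 # x2 correlated with x1
y = (x2 - x1) + 0.5 * rng.normal(size=n)     # y depends only on x2 - x1, so cor(y, x1) ~ 0

alone = sm.OLS(y, sm.add_constant(x1)).fit()
both  = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(alone.params[1], alone.rsquared)   # coefficient near 0, almost no SSR from x1 alone
print(both.params[1:], both.rsquared)    # near (-1, +1): sign/size change, SSR(x1 | x2) large
```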
In addition, when terms are correlated with one another, the sampling variances of the LS regression coefficients, and therefore their standard errors,

$$\sigma^2\{b_k\} = [\sigma^2 (X'X)^{-1}]_{kk}, \qquad se(b_k) = \sqrt{[s^2 (X'X)^{-1}]_{kk}}$$

increase: our estimates become less accurate. When correlations among the terms are very strong, the inversion of $X'X$ becomes numerically unstable ($\det(X'X)$ close to 0), so our estimates

$$b = (X'X)^{-1} X'Y$$

are not just very variable, they are poorly determined! Many different b vectors provide a very similar LS fit to the data.

[Figure: geometric illustration in the (x1, x2) plane.]
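To see this instability concretely, here is a numpy sketch (not from the slides; the noise scales are arbitrary choices): with two nearly identical terms, the condition number of X'X explodes, and tiny perturbations of the response swing the coefficient vector wildly while the SSE of the fit barely moves.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)     # nearly identical to x1
X = np.column_stack([np.ones(n), x1, x2])

print(np.linalg.cond(X.T @ X))          # enormous: X'X is close to singular

y = x1 + x2 + rng.normal(size=n)
for _ in range(3):
    yp = y + 1e-3 * rng.normal(size=n)          # tiny perturbation of the response
    b = np.linalg.solve(X.T @ X, X.T @ yp)      # b = (X'X)^{-1} X'Y
    print(np.round(b, 2), round(float(np.sum((yp - X @ b) ** 2)), 2))
# the coefficients on x1 and x2 swing wildly, while the SSE barely moves
```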
Because both the elements of b and their standard errors are affected, when terms are correlated it is hard to use the t ratios

$$t = \frac{b_k}{se(b_k)}$$

as indicators of importance: many p-values for individual t tests can be non-significant, although some terms are obviously relevant and the regression is significant as a whole (overall F test), and p-values can change dramatically when dropping/adding terms.

The regression equation is
Systol = 225 + 1.31 Years - 4.50 Weight + 0.0381 Years^2 + 0.0504 Weight^2 - 0.0503 Years*Weight

Predictor      Coef      SE Coef   T      P
Constant       225.1     115.5      1.95  0.060
Years          1.308     1.919      0.68  0.500
Weight         -4.499    3.906     -1.15  0.258
Years^2        0.03809   0.01388    2.74  0.010
Weight^2       0.05042   0.03325    1.52  0.139
Years*Weight   -0.05032  0.03247   -1.55  0.131

S = 9.46320   R-Sq = 54.8%   R-Sq(adj) = 47.9%

Analysis of Variance
Source          DF   SS        MS       F      P
Regression       5   3576.21   715.24   7.99   0.000
Residual Error  33   2955.22    89.55
Total           38   6531.44
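As a check on how these columns relate, a small sketch of mine recomputing the t ratio and two-sided p-value for the Years row above (coefficient 1.308, SE 1.919, 33 error df), using scipy:

```python
from scipy import stats

coef, se, df_error = 1.308, 1.919, 33    # Years row from the output above
t = coef / se                            # t = b / se(b)
p = 2 * stats.t.sf(abs(t), df_error)     # two-sided p-value
print(round(t, 2), round(p, 3))          # ~0.68 and ~0.50, matching the table
```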
Diagnose pair-wise correlations among terms with a scatter-plot matrix and the correlation coefficients matrix:

[Figure: Matrix Plot of x1, x2, x3]

Correlations:
      x1      x2
x2    0.74
x3    0.830   0.853
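In Python these diagnostics are one call each; a pandas sketch with hypothetical data standing in for the slide's x1, x2, x3:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
x1 = rng.normal(5, 1, size=100)
df = pd.DataFrame({
    "x1": x1,
    "x2": 0.7 * x1 + rng.normal(size=100),
    "x3": 0.6 * x1 + rng.normal(size=100),
})

print(df.corr().round(3))          # pair-wise correlation coefficients matrix
pd.plotting.scatter_matrix(df)     # scatter-plot matrix (requires matplotlib)
```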
However, these diagnostics are incomplete, because the real issue is the presence of linear interdependencies among the terms, which can be strong even when the pair-wise correlations are relatively weak. Given the model

$$y_i = \beta_0 + \beta_1 x_{1,i} + \dots + \beta_{p-1} x_{p-1,i} + \varepsilon_i$$

consider each term as a linear function of all the others, fitting regressions of the type

$$x_{l,i} = \alpha_0 + \sum_{l' \neq l} \alpha_{l'} x_{l',i} + err_i, \qquad R_l^2 = R^2(x_l \mid x_{l'},\, l' \neq l), \qquad l = 1, \dots, p-1$$

$R_l^2$ is the share of the variability of $x_l$ explained by a linear form in the other terms. The variance of the corresponding regression coefficient satisfies

$$\sigma^2\{b_l\} = [\sigma^2 (X'X)^{-1}]_{ll} \propto \frac{1}{1 - R_l^2} = VIF_l$$

the variance inflation factor. Rules of thumb: serious multicollinearity if

$$\max_{l=1,\dots,p-1} VIF_l \geq 10 \quad \text{and/or} \quad \overline{VIF} = \frac{1}{p-1}\sum_{l=1}^{p-1} VIF_l \gg 1$$
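A sketch of the VIF computation under these definitions, using statsmodels' variance_inflation_factor, which implements exactly $1/(1 - R_l^2)$; the data are hypothetical, with the second term nearly a copy of the first:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
X = np.column_stack([x1,
                     x1 + 0.1 * rng.normal(size=100),
                     rng.normal(size=100)])

exog = sm.add_constant(X)               # include the intercept, as in the model above
# VIF_l = 1 / (1 - R_l^2); index 0 is the intercept, so start at 1
vifs = [variance_inflation_factor(exog, l) for l in range(1, exog.shape[1])]
print(vifs)
print(max(vifs), float(np.mean(vifs)))  # compare to the rules of thumb
```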
Simulated example: $y_i = 0 + x_{1,i} + x_{2,i} + x_{3,i} + \varepsilon_i$, with $\varepsilon_i$ iid N(0,1), where

$$x_{2,i} = x_{1,i} + \text{small gaussian noise}, \qquad x_{3,i} = x_{1,i} + x_{2,i} + \text{small gaussian noise}$$

The regression equation is
y = 0.115 + 1.31 x1 - 0.109 x2 + 1.38 x3

Predictor   Coef      SE Coef   T      P      VIF
Constant    0.1148    0.3523     0.33  0.745
x1          1.312     1.031      1.27  0.204  224.145
x2          -0.1093   0.9971    -0.11  0.913  207.52
x3          1.3798    0.7323     1.88  0.061  454.378

S = 1.0131   R-Sq = 94.4%   R-Sq(adj) = 94.3%

Analysis of Variance
Source          DF    SS       MS       F         P
Regression        3   3411.1   1137.0   1106.70   0.000
Residual Error  196    201.4      1.0
Total           199   3612.4

Correlations:
      x1      x2
x2    0.995
x3    0.998   0.998

[Figure: scatter-plot matrix of x1, x2, x3]
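A simulation along these lines can be reproduced in Python; this is a sketch with my own arbitrary seed and noise scales, so the numbers will not match the output above exactly, but the qualitative pattern (non-significant individual t tests, huge overall F) should:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(5, 2, size=n)
x2 = x1 + 0.2 * rng.normal(size=n)            # x2 = x1 + small gaussian noise
x3 = x1 + x2 + 0.2 * rng.normal(size=n)       # x3 = x1 + x2 + small gaussian noise
y = x1 + x2 + x3 + rng.normal(size=n)         # true model, eps iid N(0,1)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, x3]))).fit()
print(fit.summary())                          # inflated SEs, non-significant t's
print(fit.pvalues[1:], fit.fvalue)            # ...yet an enormous overall F
```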
One remedy: dropping terms as needed. E.g., in the simulated example we have

The regression equation is
y = 0.051 + 2.75 x1 + 1.23 x2

Predictor   Coef     SE Coef   T      P      VIF
Constant    0.0507   0.3529    0.14   0.886
x1          2.7457   0.7001    3.92   0.000  102.00
x2          1.2299   0.7037    1.75   0.082  102.00

S = 1.02015   R-Sq = 94.3%   R-Sq(adj) = 94.3%

Analysis of Variance
Source          DF    SS       MS       F         P
Regression        2   3407.4   1703.7   1637.08   0.000
Residual Error  197    205.0      1.0
Total           199   3612.4

The regression equation is
y = 0.107 + 3.96 x1

Predictor   Coef     SE Coef   T       P      VIF
Constant    0.1073   0.3532     0.30   0.762
x1          3.9632   0.06965   56.90   0.000  1.000

S = 1.02543   R-Sq = 94.2%   R-Sq(adj) = 94.2%

Analysis of Variance
Source          DF    SS       MS       F         P
Regression        1   3404.2   3404.2   3237.51   0.000
Residual Error  198    208.2      1.1
Total           199   3612.4
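A sketch of this remedy on a rebuilt version of the same simulation (again with my own hypothetical seed and noise scales): refitting with x3 dropped, and then with x1 alone, R-Sq barely moves while the surviving coefficient becomes sharply estimated.

```python
import numpy as np
import statsmodels.api as sm

# same simulation set-up as in the sketch above
rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(5, 2, size=n)
x2 = x1 + 0.2 * rng.normal(size=n)
x3 = x1 + x2 + 0.2 * rng.normal(size=n)
y = x1 + x2 + x3 + rng.normal(size=n)

three = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, x3]))).fit()
two   = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
one   = sm.OLS(y, sm.add_constant(x1)).fit()

print(three.rsquared, two.rsquared, one.rsquared)  # R-Sq barely drops
print(one.tvalues[1], one.pvalues[1])              # x1 alone: sharply estimated
```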
More sophisticated remedies: orthogonalize the terms at the outset (but the new terms are linear combinations of the original ones, and harder to interpret), or fit a ridge regression (sketched below).

Important: multicollinearity does not affect prediction! The fitted values, their sampling variability and their stability are not affected; it's just that very similar fitted values can be produced by very different models!
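A minimal ridge regression sketch using scikit-learn (a library choice of mine; the slides do not prescribe one). Standardizing first matters because the ridge penalty is scale dependent; as the penalty alpha grows, the coefficients shrink toward a stable compromise while the in-sample fit changes very little, consistent with the point above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n)])   # two collinear terms
y = X[:, 0] + X[:, 1] + rng.normal(size=n)

Xs = StandardScaler().fit_transform(X)   # the ridge penalty is scale dependent
for alpha in (0.01, 1.0, 100.0):
    fit = Ridge(alpha=alpha).fit(Xs, y)
    mse = float(np.mean((y - fit.predict(Xs)) ** 2))
    print(alpha, np.round(fit.coef_, 2), round(mse, 3))
# coefficients shrink and stabilize as alpha grows; the fitted values change little
```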