Multicollinearity: Estimation and Elimination

S. S. Shantha Kumari 1

Abstract

Multiple regression fits a model to predict a dependent variable (Y) from two or more independent variables (X). If the model fits the data well, the overall R² value will be high and the corresponding P value will be low. In addition to the overall P value, multiple regression also reports an individual P value for each independent variable. A low P value here means that this particular independent variable significantly improves the fit of the model. It is calculated by comparing the goodness-of-fit of the entire model to the goodness-of-fit when that independent variable is omitted. If the fit is much worse when that variable is omitted, the P value will be low, indicating that the variable has a significant impact on the model. In some cases, however, multiple regression results may seem paradoxical: even though the overall P value is very low, all of the individual P values are high. The model as a whole fits the data well, yet none of the X variables makes a statistically significant contribution to predicting Y. This arises from high correlation between the independent variables. Neither variable contributes significantly to the model once the other is included, but together they contribute a great deal, and removing both would make the fit much worse. When this happens, the X variables are collinear and the results show multicollinearity. The best solution is to understand the cause of multicollinearity and remove it. This paper presents ways of identifying and eliminating multicollinearity so as to arrive at a best-fit model.

1 Faculty, PSG Institute of Management, PSG College of Technology, Coimbatore 641004.
Introduction

The past twenty years have seen an extraordinary growth in the use of quantitative methods in financial markets. This is one area where econometric methods have rapidly gained ground. As economic growth makes more and more people wealthier, and with the rapid progress in information technology, there will be a continuing need to improve the performance of financial models in forecasting returns, making use of all the information available, in particular ultra-high-frequency intra-daily data. With the development of multivariate and simultaneous extensions of financial models, finance professionals now routinely use sophisticated techniques in portfolio management, proprietary trading, risk management, financial consulting and securities regulation. Regression analysis is almost certainly the most important tool at the econometrician's disposal. The explanation and prediction of security returns and their relation to risk has received a great deal of attention in financial research. Both intuitive and theoretical models have been developed in which return or risk is expressed as a linear function of one or several macroeconomic, market or firm-related variables. Studies attempting to explore these relationships, however, have been plagued by the interdependent nature of corporate financial variables. In classical multiple regression analysis, these interdependencies may produce the various symptoms of multicollinearity, including overstated regression coefficients, incorrect signs and highly unstable predictive equations. The objective of this paper is to present ways and means for the detection and elimination of multicollinearity in order to improve the predictive power of a financial model.

Multicollinearity: Its Nature

One of the three basic assumptions in regression modeling is that the independent variables in the model are not linearly related. The other two assumptions are that the model residuals are normally distributed with zero mean and constant variance, and that they have no autocorrelation. The existence of a linear relationship among the independent variables is called multicollinearity. The term multicollinearity is due to Ragnar Frisch (1934). Multicollinearity can cause large forecasting errors and make it difficult to assess the relative importance of individual variables in the model. If two or more variables have an exact linear relationship between them, we have perfect multicollinearity. The following regression equation
Y_i = a + b X_{1i} + c X_{2i} + d X_{3i} + u_i                (1)

has three independent variables, X_{1i}, X_{2i} and X_{3i}. The assumption requires that these three variables are not linearly related in the following form:

X_{1i} = k_1 X_{2i} + k_2 X_{3i} + e_i                        (2)

If the assumption holds, then k_1 = k_2 = 0, equation (2) reduces to X_{1i} = e_i, and there is no multicollinearity among the independent variables included in the model. If either k_1 or k_2 in equation (2) is not zero, the model has a multicollinearity problem.

Consequences of Multicollinearity

1. In a two-variable model, when multicollinearity is present, the estimated standard errors of the coefficients will be large. This is because the coefficient variance formula contains a multiplying factor of the form 1/(1 - r²), where r is the correlation coefficient between the two variables and lies between -1 and +1. This factor is often called the variance inflation factor. When r = 0 there is no multicollinearity and the inflation factor equals 1. As r increases in absolute terms, the variances of the estimated coefficients increase too, and as r approaches +1 or -1 the inflation factor approaches infinity.

2. The estimated coefficients may become insignificant or have wrong signs and consequently will be sensitive to changes in the data. When the independent variables are correlated, the estimated standard errors of the coefficients are large and the t-statistics are therefore small. Estimated coefficients with large standard errors are unstable: the addition of a few more data points to the sample can cause a large change in the size of the coefficients and sometimes in their signs. When a coefficient changes sign from positive to negative, or from negative to positive, as the model is updated, the model will not produce a good forecast.

3. When the estimated coefficients have large standard errors and are unstable, it is difficult for the model user to properly assess the relative importance of the independent variables.

4. The presence of multicollinearity can lead the researcher to drop an important variable from the model because of its low t-statistic.

Detection of Multicollinearity

Multicollinearity is essentially a sample phenomenon, arising out of the largely non-experimental data collected in most social sciences. According to Kmenta (1986), multicollinearity is a question of degree and not of kind, and it is a feature of the sample and not of the population. Therefore, we do not test for multicollinearity but measure its degree in any particular sample.
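For reference, the inflation factor in consequence 1 can be made explicit. In the two-regressor model the variance of the estimated slope on X_1 is given by the standard OLS result (stated here for completeness rather than taken from the original paper):

$$\operatorname{Var}(\hat b) \;=\; \frac{\sigma^2}{\sum_i (X_{1i} - \bar X_1)^2}\cdot\frac{1}{1 - r_{12}^2},$$

where r_{12} is the correlation between X_1 and X_2. The term 1/(1 - r_{12}²) is the variance inflation factor: for example, r_{12} = 0.95 gives an inflation factor of 1/(1 - 0.9025) ≈ 10.3, so the standard error of the coefficient is inflated by a factor of about √10.3 ≈ 3.2.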
1. High R² but few significant t ratios

Table 1. Model Summary (b)

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .925(a)  .855       .782                .02424                       1.793

a Predictors: (Constant), logx6, logx5, logx2, logx3, logx4
b Dependent Variable: logy

Table 2. ANOVA (b)

Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   .035             5    .007          11.774   .001(a)
  Residual     .006             10   .001
  Total        .040             15

a Predictors: (Constant), logx6, logx5, logx2, logx3, logx4
b Dependent Variable: logy

Table 3. Coefficients (a)

               Unstandardized           Standardized                          Collinearity Statistics
Model          B         Std. Error     Beta           t        Sig.     Tolerance   VIF
1 (Constant)   1.414     8.302                         .170     .868
  logx2        1.790     .873           3.854          2.050    .068     .004        243.451
  logx3        -4.109    1.600          -12.083        -2.568   .028     .001        1524.347
  logx4        2.127     1.258          7.960          1.691    .122     .001        1525.689
  logx5        -.030     .122           -.084          -.250    .808     .130        7.718
  logx6        .278      2.037          .229           .136     .894     .005        194.975

a Dependent Variable: logy

It is clear from Table 1 that R² is .855 and that the F ratio (Table 2) is significant, indicating that the model as a whole fits the data. Most of the individual t-statistics, however, are insignificant, pointing to the possibility of multicollinearity.
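The paradox in Tables 1 to 3, a jointly significant model whose individual coefficients are insignificant, is easy to reproduce. The following sketch uses synthetic data with numpy and statsmodels; the data, variable names and seed are illustrative assumptions, not the paper's dataset.

```python
# A minimal sketch (not the paper's original data): simulate two highly
# correlated regressors and show a jointly significant model whose
# individual coefficients are insignificant.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)        # x2 is almost a copy of x1
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print("R-squared:      ", round(fit.rsquared, 3))
print("F-test p-value: ", round(fit.f_pvalue, 6))        # very small: the model fits jointly
print("t-test p-values:", np.round(fit.pvalues[1:], 3))  # typically both large under collinearity
```

The overall F-test uses the combined explanatory power of x1 and x2, while each t-test asks what one regressor adds after the other is already included, which is very little when the two move together.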
2. High pair-wise correlation among regressors

Table 4. Pearson correlations among the regressors (2-tailed significance in parentheses)

         logx2          logx3          logx4          logx5          logx6
logx2    1              .996** (.000)  .993** (.000)  .585* (.017)   .974** (.000)
logx3    .996** (.000)  1              .996** (.000)  .619* (.011)   .974** (.000)
logx4    .993** (.000)  .996** (.000)  1              .585* (.017)   .987** (.000)
logx5    .585* (.017)   .619* (.011)   .585* (.017)   1              .600* (.014)
logx6    .974** (.000)  .974** (.000)  .987** (.000)  .600* (.014)   1

** Correlation is significant at the 0.01 level (2-tailed).
*  Correlation is significant at the 0.05 level (2-tailed).

If the pair-wise correlation coefficient between two regressors is high, say in excess of 0.80, multicollinearity is likely to be a problem. A high pair-wise correlation is, however, a sufficient but not a necessary condition for the existence of multicollinearity.
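A sketch of this check, assuming the log-transformed regressors are available as columns of a pandas DataFrame (the name X_df and the column names are placeholders, not the paper's data file):

```python
# A minimal sketch: compute the pair-wise correlation matrix and flag pairs
# whose absolute correlation exceeds the 0.80 rule of thumb.
import pandas as pd

def flag_high_correlations(df: pd.DataFrame, threshold: float = 0.80):
    """Return the correlation matrix and the regressor pairs above the threshold."""
    corr = df.corr()                          # pair-wise Pearson correlations
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) > threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 3)))
    return corr, pairs

# Example usage (placeholder column names):
# corr_matrix, suspect_pairs = flag_high_correlations(X_df[["logx2", "logx3", "logx4", "logx5", "logx6"]])
# print(suspect_pairs)   # pairs above 0.80 suggest, but do not prove, multicollinearity
```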
3. Auxiliary Regressions

Table 5.1. Model Summary (b) - Dependent Variable: logx2
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .998(a)  .996       .994                .00837                       1.727
a Predictors: (Constant), logx6, logx5, logx3, logx4

Table 5.2. Model Summary (b) - Dependent Variable: logx3
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       1.000(a)  .999       .999                .00457                       2.642
a Predictors: (Constant), logx2, logx5, logx6, logx4

Table 5.3. Model Summary (b) - Dependent Variable: logx4
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       1.000(a)  .999       .999                .00581                       2.597
a Predictors: (Constant), logx3, logx5, logx6, logx2

Table 5.4. Model Summary (b) - Dependent Variable: logx5
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .933(a)  .870       .823                .05997                       2.625
a Predictors: (Constant), logx4, logx6, logx2, logx3

Table 5.5. Model Summary (b) - Dependent Variable: logx6
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .997(a)  .995       .993                .00359                       2.396
a Predictors: (Constant), logx5, logx4, logx2, logx3
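Auxiliary regressions of this kind can also be produced programmatically. A minimal sketch, again assuming a placeholder DataFrame X_df of regressors and using statsmodels:

```python
# A minimal sketch of the auxiliary regressions: regress each regressor on all
# the others and collect the R-squared values (X_df and its columns are placeholders).
import pandas as pd
import statsmodels.api as sm

def auxiliary_r2(X_df: pd.DataFrame) -> pd.Series:
    """Return the R-squared of each regressor regressed on all the other regressors."""
    r2 = {}
    for col in X_df.columns:
        y = X_df[col]
        X = sm.add_constant(X_df.drop(columns=[col]))
        r2[col] = sm.OLS(y, X).fit().rsquared
    return pd.Series(r2, name="auxiliary R2")

# Example usage (placeholder column names):
# print(auxiliary_r2(X_df[["logx2", "logx3", "logx4", "logx5", "logx6"]]))
# An auxiliary R2 that exceeds the overall model R2 (Klein's rule of thumb)
# signals troublesome collinearity, as discussed below.
```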
Tables 5.1 to 5.5 show that the R² values of the auxiliary regressions exceed the overall R² of the model, suggesting that multicollinearity is a troublesome problem.

4. Eigenvalues and Condition Index

From the eigenvalues we can derive the condition number k,

k = (maximum eigenvalue) / (minimum eigenvalue),

and the condition index,

CI = sqrt( (maximum eigenvalue) / (minimum eigenvalue) ) = sqrt(k).

If k is between 100 and 1,000 there is moderate to strong multicollinearity, and if it exceeds 1,000 the multicollinearity is severe. Equivalently, if the CI is between 10 and 30 there is moderate to strong multicollinearity, and if it exceeds 30 there is severe multicollinearity.

Table 6. Eigenvalues and condition index

Dimension   Eigenvalue      Condition Index
1           5.980990910     1.000000000
2           0.016218839     19.203336662
3           0.002761830     46.535899716
4           0.000020954     534.262945714
5           0.000007283     906.190115069
6           0.000000185     5687.945336822

k = 32352722.2, CI = 5687.94534

The k value is greater than 1,000, showing the existence of multicollinearity, and the condition index exceeds 30, confirming that the multicollinearity is severe.

5. Tolerance and Variance Inflation Factor

Table 7. Tolerance and VIF

Variable   Tolerance   VIF
logx2      0.004       243.451
logx3      0.001       1524.347
logx4      0.001       1525.689
logx5      0.130       7.718
logx6      0.005       194.975

The closer the tolerance is to zero, and the further the VIF is above 10, the greater the degree of multicollinearity.
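The eigenvalue/condition-index and tolerance/VIF diagnostics above can be computed directly. A minimal sketch, assuming a placeholder DataFrame X_df of regressors; statistical packages scale the design matrix in slightly different ways, so the figures need not match Tables 6 and 7 exactly.

```python
# A minimal sketch of the condition-number and VIF diagnostics
# (illustrative, not the paper's original computation).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def collinearity_diagnostics(X_df: pd.DataFrame) -> None:
    X = sm.add_constant(X_df).to_numpy(dtype=float)

    # Condition number and condition index from the scaled cross-product matrix
    Xs = X / np.linalg.norm(X, axis=0)          # scale each column to unit length
    eig = np.linalg.eigvalsh(Xs.T @ Xs)         # eigenvalues in ascending order
    k = eig.max() / eig.min()
    print("condition number k :", round(k, 1))            # k > 1,000 suggests severe collinearity
    print("condition index CI :", round(np.sqrt(k), 2))   # CI > 30 suggests severe collinearity

    # Tolerance and VIF for each regressor (column 0 of X is the constant, so it is skipped)
    for i, name in enumerate(X_df.columns, start=1):
        vif = variance_inflation_factor(X, i)
        print(f"{name}: tolerance = {1.0 / vif:.4f}, VIF = {vif:.1f}")   # VIF > 10 is a warning sign

# Example usage (placeholder column names):
# collinearity_diagnostics(X_df[["logx2", "logx3", "logx4", "logx5", "logx6"]])
```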
Elimination of Multicollinearity

The choice of a remedial measure depends on the circumstances the researcher encounters. The methods that solve the problem in one model may not be effective in another, so the researcher may have to try several procedures to obtain a best-fit model:

1. Dropping a variable (or variables)
2. Transformation of the variables
3. Additional or new data
4. Reducing collinearity in a polynomial regression

The tolerance, VIF and zero-order correlations direct attention to variables such as log X2, log X3, log X4 and log X6. After weighing these diagnostics together with the theoretical background, the variables X2 and X3 were eliminated from the model. The revised model results are presented below.

Y = -17.0582 - 0.9533 log X4 - 0.3099 log X5 + 4.90 log X6

Variable                                  Coefficient   Std. Error   t          p value
Constant                                  -17.0582      4.60097      -3.70752   0.002994
Personal disposable income (log X4)       -0.95333      0.233974     -4.07451   0.001541
Interest rate (log X5)                    -0.3099       0.064542     -4.80143   0.000432
Employed civilian labor force (log X6)    4.901277      1.074051     4.563356   0.000651

R Square: .759    Adjusted R Square: .699    F ratio (p value): 12.594 (0.001)    Sample size: 16

The F ratio is significant, confirming the joint impact of the explanatory variables on the sale of new passenger cars. The R square is 0.759, which means that about 76% of the variation in the dependent variable is explained by the explanatory variables, and the t-values of all the coefficients are significant.

Conclusion

The explanatory variables specified in an economic model usually come from economic theory or from a basic understanding of the behaviour the researchers are trying to model. The data for these variables typically come from uncontrolled experiments and often move together. In this situation, it is difficult to solve the problem simply by omitting a variable or adding a new one, so the researcher should take care to reduce the problem of multicollinearity while formulating a model using time series data.
References

i. Ragnar Frisch, Statistical Confluence Analysis by Means of Complete Regression Systems, Institute of Economics, Oslo University, Publication No. 5, 1934.
ii. Jan Kmenta, Elements of Econometrics, 2nd edition, Macmillan, New York, 1986.
iii. Ramu Ramanathan, Introductory Econometrics with Applications, 5th edition, Thomson South-Western, Bangalore, 2002.
iv. Chris Brooks, Introductory Econometrics for Finance, Cambridge University Press, 2002.
v. Damodar Gujarati & Sangeetha, Basic Econometrics, 4th edition, Tata McGraw-Hill, New Delhi, 2007.
vi. G. S. Maddala, Introduction to Econometrics, 3rd edition, Wiley India, New Delhi, 2005.