Multicollinearity: Estimation and Elimination


S. S. Shantha Kumari
Faculty, PSG Institute of Management, PSG College of Technology, Coimbatore 641004

Abstract

Multiple regression fits a model to predict a dependent variable (Y) from two or more independent variables (X). If the model fits the data well, the overall R² value will be high and the corresponding P value will be low. In addition to the overall P value, multiple regression also reports an individual P value for each independent variable. A low P value here means that this particular independent variable significantly improves the fit of the model. It is calculated by comparing the goodness of fit of the entire model to the goodness of fit when that independent variable is omitted. If the fit is much worse when the variable is omitted, its P value will be low, telling you that the variable has a significant impact on the model.

In some cases, multiple regression results may seem paradoxical: even though the overall P value is very low, all of the individual P values are high. This means that the model fits the data well even though none of the X variables has a statistically significant impact on predicting Y. This is due to high correlation between the independent variables. In that case neither variable may contribute significantly to the model after the other has been included, yet together they contribute a great deal; if both were removed, the fit would be much worse. So the overall model fits the data well, but no single X variable makes a significant contribution when it is added to the model. When this happens, the X variables are collinear and the results show multicollinearity. The best solution is to understand the cause of the multicollinearity and remove it. This paper presents methods for identifying and eliminating multicollinearity so as to arrive at a best-fit model.

Introduction

The past twenty years have seen an extraordinary growth in the use of quantitative methods in financial markets. This is one area where econometric methods have rapidly gained ground. As economic growth makes more and more people wealthier, and with the rapid progress in information technology, there will be a continuing need to improve the performance of financial models in forecasting returns, making use of all the information available, in particular ultra-high-frequency intra-daily data. The development of multivariate and simultaneous extensions of financial models means that finance professionals now routinely use sophisticated techniques in portfolio management, proprietary trading, risk management, financial consulting, and securities regulation. Regression analysis is almost certainly the most important tool at the econometrician's disposal.

The explanation and prediction of security returns and their relation to risk has received a great deal of attention in financial research. Both intuitive and theoretical models have been developed in which return or risk is expressed as a linear function of one or several macroeconomic, market or firm-related variables. Studies attempting to explore these relationships, however, have been plagued by the interdependent nature of corporate financial variables. When classical multiple regression analysis is used, these interdependencies may produce the various symptoms of multicollinearity, including overstated regression coefficients, incorrect signs, and highly unstable predictive equations. The objective of this paper is to present ways and means for the detection and elimination of multicollinearity in order to improve the predictive power of a financial model.

Multicollinearity: Its Nature

One of the three basic assumptions in regression modelling is that the independent variables in the model are not linearly related. The other two assumptions are that the model residuals are normally distributed with zero mean and constant variance, and that they have no autocorrelation. The existence of a linear relationship among the independent variables is called multicollinearity. The term multicollinearity is due to Ragnar Frisch (1934). Multicollinearity can cause large forecasting errors and make it difficult to assess the relative importance of individual variables in the model. If two or more variables have an exact linear relationship between them, we have perfect multicollinearity. The following regression equation

Y_i = a + b X_{1i} + c X_{2i} + d X_{3i} + u_i          (1)

has three independent variables, X_{1i}, X_{2i} and X_{3i}. The assumption requires that the three variables are not linearly related in the following form:

X_{1i} = k_1 X_{2i} + k_2 X_{3i} + e_i                  (2)

If the assumption holds, then k_1 = k_2 = 0 and e_i is simply X_{1i}; there is no multicollinearity among the independent variables included in the model. If either k_1 or k_2 in equation (2) is non-zero, the model has a multicollinearity problem.

Consequences of Multicollinearity

1. In a two-variable model, when multicollinearity is present, the estimated standard errors of the coefficients will be large. This is because the coefficient variance formula contains a multiplying factor of the form 1/(1 - r²), where r is the correlation coefficient between the two variables and its value falls between -1 and +1. This factor is often called the variance inflation factor. When r = 0 there is no multicollinearity and the inflation factor equals 1. As r increases in absolute value, the variances of the estimated coefficients increase too. As |r| approaches 1, the inflation factor approaches infinity.

2. The estimated coefficients may become insignificant or have wrong signs, and consequently will be sensitive to changes in the data. This is because when the independent variables are correlated, the estimated standard errors of the coefficients will be large and, as a result, the t-statistics will be small. Estimated coefficients with large standard errors are unstable; the addition of a few more data points to the sample can cause a large change in the size of the coefficients and sometimes in their signs. When any coefficient changes sign from positive to negative, or vice versa, at model updating, the model will not produce a good forecast.

3. When the estimated coefficients have large standard errors and are unstable, it is difficult for the model user to properly assess the relative importance of the independent variables.

4. The presence of multicollinearity can lead the researcher to drop an important variable from the model because of its low t-statistic.

Detection of Multicollinearity

Multicollinearity is essentially a sample phenomenon, arising out of the largely non-experimental data collected in most social sciences. According to Kmenta (1986), multicollinearity is a question of degree and not of kind, and it is a feature of the sample and not of the population. Therefore, we do not test for multicollinearity but can measure its degree in any particular sample.
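Before turning to the specific diagnostics, the variance-inflation effect described in consequence 1 above can be illustrated with a small simulation. The Python sketch below uses artificially generated data, not the data analysed in this paper: for several values of the correlation r between two regressors it reports the estimated standard error of one slope alongside the factor 1/(1 - r²); the coefficient variance is inflated by roughly this factor, so the standard error grows with its square root.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

for r in [0.0, 0.5, 0.9, 0.99]:
    # Two regressors drawn with correlation r
    X = rng.multivariate_normal([0.0, 0.0], [[1.0, r], [r, 1.0]], size=n)
    y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(size=n)

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    # Variance inflation factor for the two-regressor case: 1 / (1 - r^2)
    vif = 1.0 / (1.0 - r ** 2)
    print(f"r = {r:4.2f}   se(b1) = {fit.bse[1]:.4f}   1/(1 - r^2) = {vif:7.2f}")
```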

1. High R² but few significant t ratios

Table 1. Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .925(a)   .855       .782                .02424                       1.793

a Predictors: (Constant), logx6, logx5, logx2, logx3, logx4
b Dependent Variable: logy

Table 2. ANOVA(b)

Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   .035             5    .007          11.774   .001(a)
   Residual     .006             10   .001
   Total        .040             15

a Predictors: (Constant), logx6, logx5, logx2, logx3, logx4
b Dependent Variable: logy

Table 3. Coefficients(a)

Model          B        Std. Error   Beta      t        Sig.   Tolerance   VIF
1 (Constant)   1.414    8.302                  .170     .868
  logx2        1.790    .873         3.854     2.050    .068   .004        243.451
  logx3        -4.109   1.600        -12.083   -2.568   .028   .001        1524.347
  logx4        2.127    1.258        7.960     1.691    .122   .001        1525.689
  logx5        -.030    .122         -.084     -.250    .808   .130        7.718
  logx6        .278     2.037        .229      .136     .894   .005        194.975

a Dependent Variable: logy

It is clear from Table 1 that R² is .855, and the F ratio (Table 2) is also significant, showing that the model fits well. But most of the t-statistics are insignificant, suggesting the possibility of multicollinearity.
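A minimal way to reproduce this first check on one's own data is sketched below. The file name cars.csv and the column names (logy, logx2, ..., logx6) are assumed placeholders rather than the actual data set behind Tables 1 to 3.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cars.csv")  # hypothetical file holding the logged variables

fit = smf.ols("logy ~ logx2 + logx3 + logx4 + logx5 + logx6", data=df).fit()

print(f"R-squared = {fit.rsquared:.3f}")
print(f"F = {fit.fvalue:.3f} (p = {fit.f_pvalue:.4f})")
print(fit.tvalues.round(3))   # individual t ratios
print(fit.pvalues.round(3))   # individual p values
```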

2. High pair-wise correlation among regressors

Table 4. Correlations

                               logx2      logx3      logx4      logx5     logx6
logx2  Pearson Correlation     1          .996(**)   .993(**)   .585(*)   .974(**)
       Sig. (2-tailed)                    .000       .000       .017      .000
logx3  Pearson Correlation     .996(**)   1          .996(**)   .619(*)   .974(**)
       Sig. (2-tailed)         .000                  .000       .011      .000
logx4  Pearson Correlation     .993(**)   .996(**)   1          .585(*)   .987(**)
       Sig. (2-tailed)         .000       .000                  .017      .000
logx5  Pearson Correlation     .585(*)    .619(*)    .585(*)    1         .600(*)
       Sig. (2-tailed)         .017       .011       .017                 .014
logx6  Pearson Correlation     .974(**)   .974(**)   .987(**)   .600(*)   1
       Sig. (2-tailed)         .000       .000       .000       .014

** Correlation is significant at the 0.01 level (2-tailed).
*  Correlation is significant at the 0.05 level (2-tailed).

If the pair-wise correlation coefficient between two regressors is high, say in excess of 0.80, then multicollinearity is a problem. High pair-wise correlation is a sufficient but not a necessary condition for the existence of multicollinearity.
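The pair-wise correlations of Table 4 can be computed directly from the regressor columns. A sketch under the same assumed data frame as above, with the 0.80 rule of thumb applied to the upper triangle:

```python
import pandas as pd

df = pd.read_csv("cars.csv")  # hypothetical file, as in the previous sketch
regressors = ["logx2", "logx3", "logx4", "logx5", "logx6"]

corr = df[regressors].corr()  # Pearson correlations, as in Table 4
print(corr.round(3))

# Flag the pairs whose absolute correlation exceeds the 0.80 rule of thumb.
for i in range(len(regressors)):
    for j in range(i + 1, len(regressors)):
        if abs(corr.iloc[i, j]) > 0.80:
            print(regressors[i], regressors[j], round(corr.iloc[i, j], 3))
```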

3. Auxiliary Regressions

Table 5.1 Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .998(a)   .996       .994                .00837                       1.727

a Predictors: (Constant), logx6, logx5, logx3, logx4
b Dependent Variable: logx2

Table 5.2 Model Summary(b)

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       1.000(a)   .999       .999                .00457                       2.642

a Predictors: (Constant), logx2, logx5, logx6, logx4
b Dependent Variable: logx3

Table 5.3 Model Summary(b)

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       1.000(a)   .999       .999                .00581                       2.597

a Predictors: (Constant), logx3, logx5, logx6, logx2
b Dependent Variable: logx4

Table 5.4 Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .933(a)   .870       .823                .05997                       2.625

a Predictors: (Constant), logx4, logx6, logx2, logx3
b Dependent Variable: logx5

Table 5.5 Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .997(a)   .995       .993                .00359                       2.396

a Predictors: (Constant), logx5, logx4, logx2, logx3
b Dependent Variable: logx6

Tables 5.1 to 5.5 show that the R² values of the auxiliary regressions exceed the overall R² of .855, suggesting that multicollinearity is a troublesome problem.
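Each auxiliary regression simply regresses one regressor on all of the others, and the resulting R² values are compared with the overall R². A sketch on the same assumed data frame:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cars.csv")  # hypothetical file, as in the previous sketches
regressors = ["logx2", "logx3", "logx4", "logx5", "logx6"]

for target in regressors:
    others = " + ".join(x for x in regressors if x != target)
    aux = smf.ols(f"{target} ~ {others}", data=df).fit()
    print(f"{target} on the remaining regressors: R-squared = {aux.rsquared:.3f}")
```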

4. Eigenvalues and Condition Index

From the eigenvalues we can derive the condition number k:

    k = (maximum eigenvalue) / (minimum eigenvalue)

If k is between 100 and 1000 there is moderate to strong multicollinearity. The condition index (CI) is defined as

    CI = √( (maximum eigenvalue) / (minimum eigenvalue) ) = √k

If the CI is between 10 and 30 there is moderate to strong multicollinearity; if it exceeds 30 there is severe multicollinearity.

Table 6. Eigenvalues and condition index

Dimension   Eigenvalue      Condition Index
1           5.980990910     1.000000000
2           0.016218839     19.203336662
3           0.002761830     46.535899716
4           0.000020954     534.262945714
5           0.000007283     906.190115069
6           0.000000185     5687.945336822

k = 32352722.2, CI = 5687.94534

The k value is greater than 1000, showing the existence of multicollinearity. The condition index is also greater than 30, confirming the existence of severe multicollinearity.

5. Tolerance and Variance Inflation Factor

Table 7. Tolerance and VIF

Variable     Tolerance   VIF
(Constant)
logx2        0.004       243.451
logx3        0.001       1524.347
logx4        0.001       1525.689
logx5        0.130       7.718
logx6        0.005       194.975

The closer the tolerance is to zero, and the further the VIF exceeds 10, the greater the degree of multicollinearity.
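Both the eigenvalue-based diagnostics and the tolerance/VIF figures can be computed as sketched below, again on the assumed data frame used above. Note that packages differ in how they scale the design matrix before extracting eigenvalues, so the condition index will not necessarily reproduce Table 6 exactly.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("cars.csv")  # hypothetical file, as in the previous sketches
X = sm.add_constant(df[["logx2", "logx3", "logx4", "logx5", "logx6"]])

# Condition number k and condition index CI from the eigenvalues of X'X.
eigvals = np.linalg.eigvalsh(X.to_numpy().T @ X.to_numpy())
k = eigvals.max() / eigvals.min()
print(f"k = {k:.1f}   CI = {np.sqrt(k):.2f}")

# Tolerance and VIF for each regressor (column 0 is the constant).
for i, name in enumerate(X.columns[1:], start=1):
    vif = variance_inflation_factor(X.to_numpy(), i)
    print(f"{name}: tolerance = {1.0 / vif:.4f}   VIF = {vif:.3f}")
```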

Elimination of Multicollinearity

The choice of a remedial measure depends on the circumstances the researcher encounters. A method that solves the problem in one model may not be effective in another. The researcher has to try several procedures to obtain a best-fit model:

1. Dropping a variable (or variables)
2. Transformation of the variables
3. Additional or new data
4. Reducing collinearity in a polynomial regression

The tolerance, VIF and zero-order correlations point to the variables log X2, log X3, log X4 and log X6. After analysing these factors and the theoretical background, the variables X2 and X3 are eliminated from the model. The revised model results are presented below.

log Y = -17.0582 - 0.9533 log X4 - 0.3099 log X5 + 4.9013 log X6

Variable                                  Coefficient   Std. Error   t          p value
Constant                                  -17.0582      4.60097      -3.70752   0.002994
Personal disposable income (log X4)       -0.95333      0.233974     -4.07451   0.001541
Interest rate (log X5)                    -0.3099       0.064542     -4.80143   0.000432
Employed civilian labor force (log X6)    4.901277      1.074051     4.563356   0.000651

R Square             .759
Adjusted R Square    .699
F ratio (p value)    12.594 (0.001)
Sample size          16

The F ratio is significant, confirming the joint impact of the explanatory variables on the sale of new passenger cars. The R² is 0.759, meaning that about 76% of the variation in the dependent variable is explained by the explanatory variables. The t-value of every coefficient is significant.
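The drop-and-refit step can be scripted in the same way. Which regressors to drop is a judgment call guided by the diagnostics and by theory; the sketch below merely shows the pattern, dropping logx2 and logx3 as done above, on the same assumed data frame.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cars.csv")  # hypothetical file, as in the previous sketches

# Refit after dropping the regressors identified as collinear (logx2 and logx3).
revised = smf.ols("logy ~ logx4 + logx5 + logx6", data=df).fit()
print(revised.summary())  # check that R-squared stays acceptable and the t ratios improve
```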

Conclusion

The explanatory variables specified in an economic model usually come from economic theory or a basic understanding of the behaviour the researcher is trying to model. The data for these variables typically come from uncontrolled experiments and often move together. In this situation it is difficult to solve the problem simply by omitting or adding a variable, so the researcher should take care to reduce the problem of multicollinearity while formulating a model using time-series data.

References

1. Ragnar Frisch, Statistical Confluence Analysis by Means of Complete Regression Systems, Institute of Economics, Oslo University, Publication No. 5, 1934.
2. Jan Kmenta, Elements of Econometrics, 2nd edition, Macmillan, New York, 1986.
3. Ramu Ramanathan, Introductory Econometrics with Applications, 5th edition, Thomson South-Western, Bangalore, 2002.
4. Chris Brooks, Introductory Econometrics for Finance, Cambridge University Press, 2002.
5. Damodar N. Gujarati and Sangeetha, Basic Econometrics, 4th edition, Tata McGraw-Hill, New Delhi, 2007.
6. G. S. Maddala, Introduction to Econometrics, 3rd edition, Wiley India, New Delhi, 2005.