LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS
MODULE IX Lecture - 9: Multicollinearity
Dr. Shalabh, Department of Mathematics and Statistics, Indian Institute of Technology Kanpur

Multicollinearity diagnostics

An important question that arises is how to diagnose the presence of multicollinearity in the data on the basis of the given sample information. Several diagnostic measures are available, and each of them is based on a particular approach. It is difficult to say which of the diagnostics is the best or ultimate. Some of the popular and important diagnostics are described further. The detection of multicollinearity involves 3 aspects:
(i) determining its presence,
(ii) determining its severity,
(iii) determining its form or location.

1. Determinant of $X'X$

This measure is based on the fact that the matrix $X'X$ becomes ill-conditioned in the presence of multicollinearity. The value of the determinant of $X'X$, i.e., $|X'X|$, declines as the degree of multicollinearity increases. If $\mathrm{rank}(X'X) < k$, then $X'X$ is singular and so $|X'X| = 0$. So as $|X'X| \to 0$, the degree of multicollinearity increases, and it becomes exact or perfect at $|X'X| = 0$. Thus $|X'X|$ serves as a measure of multicollinearity, and $|X'X| = 0$ indicates that perfect multicollinearity exists.
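A minimal numerical sketch of this diagnostic, using simulated data (the correlations, seed, and sample size below are assumptions for illustration, not part of the lecture): with the columns of $X$ scaled to unit length, $|X'X|$ moves toward 0 as the correlation between two regressors increases.

```python
# Sketch: |X'X| shrinks toward 0 as collinearity between two unit-length regressors grows.
import numpy as np

rng = np.random.default_rng(0)
n = 100
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)

for rho in [0.0, 0.5, 0.9, 0.99, 0.999]:
    x1 = z1
    x2 = rho * z1 + np.sqrt(1 - rho**2) * z2           # correlation with x1 approximately rho
    X = np.column_stack([x1, x2])
    Xc = X - X.mean(axis=0)
    X = Xc / np.sqrt((Xc**2).sum(axis=0))               # unit-length scaling, so X'X is the correlation matrix
    print(f"rho={rho:6.3f}  |X'X| = {np.linalg.det(X.T @ X):.6f}")
```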

Limitations: This measure has the following limitations.
(i) It is not bounded, as $0 < |X'X| < \infty$.
(ii) It is affected by the dispersion of the explanatory variables. For example, if $k = 2$, then
$$|X'X| = \begin{vmatrix} \sum_i x_{i1}^2 & \sum_i x_{i1}x_{i2} \\ \sum_i x_{i1}x_{i2} & \sum_i x_{i2}^2 \end{vmatrix} = \sum_i x_{i1}^2 \sum_i x_{i2}^2 \,(1 - r^2),$$
where $r$ is the correlation coefficient between $X_1$ and $X_2$. So $|X'X|$ depends on the correlation coefficient as well as on the variability of the explanatory variables. If the explanatory variables have very low variability, then $|X'X|$ may tend to zero, which would indicate the presence of multicollinearity even when that is not the case.
(iii) It gives no idea about the relative effects on individual coefficients. If multicollinearity is present, it does not indicate which variable in $X'X$ is causing the multicollinearity, and this is hard to determine.
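The dispersion limitation in (ii) can be checked numerically; the sketch below (assumed simulated data) verifies the $k = 2$ identity and shows that rescaling one column changes $|X'X|$ even though the correlation, and hence the collinearity, is unchanged.

```python
# Sketch: |X'X| = (sum x1^2)(sum x2^2)(1 - r^2) for centered data; shrinking the
# dispersion of x2 drives |X'X| toward 0 while r stays the same.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.standard_normal(n)
x2 = 0.6 * x1 + 0.8 * rng.standard_normal(n)
x1, x2 = x1 - x1.mean(), x2 - x2.mean()

r = np.corrcoef(x1, x2)[0, 1]
for scale in [1.0, 0.1, 0.01]:                          # reduce the dispersion of x2
    X = np.column_stack([x1, scale * x2])
    det = np.linalg.det(X.T @ X)
    formula = (x1**2).sum() * ((scale * x2)**2).sum() * (1 - r**2)
    print(f"scale={scale:5.2f}  |X'X|={det:12.6f}  formula={formula:12.6f}  r={r:.3f}")
```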

2. Inspection of correlation matrix

The inspection of the off-diagonal elements $r_{ij}$ of $X'X$ gives an idea about the presence of multicollinearity. If $X_i$ and $X_j$ are nearly linearly dependent, then $|r_{ij}|$ will be close to 1. Note that the observations in $X$ are standardized in the sense that each observation is subtracted from the mean of that variable and divided by the square root of the corrected sum of squares of that variable.

When more than two explanatory variables are considered, and if they are involved in a near-linear dependency, then it is not necessary that any of the $r_{ij}$ will be large. Generally, pairwise inspection of the correlation coefficients is not sufficient for detecting multicollinearity in the data.

3. Determinant of correlation matrix

Let $D$ be the determinant of the correlation matrix; then $0 \le D \le 1$. If $D = 0$, it indicates the existence of an exact linear dependence among the explanatory variables. If $D = 1$, then the columns of the $X$ matrix are orthonormal. Thus a value close to 0 is an indication of a high degree of multicollinearity, and any value of $D$ between 0 and 1 gives an idea of the degree of multicollinearity.

Limitation: It gives no information about the number of linear dependencies among the explanatory variables.
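A small sketch of both diagnostics, on assumed simulated data: a near-linear dependency among three regressors can leave every pairwise correlation only moderate, while the determinant $D$ of the correlation matrix is still close to 0.

```python
# Sketch: pairwise correlations may not reveal a three-variable dependency, but D = det(R) does.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x3 = x1 + x2 + 0.05 * rng.standard_normal(n)            # near-linear dependency: x3 ~ x1 + x2
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)                        # correlation matrix of the regressors
print("off-diagonal correlations:", np.round(R[np.triu_indices(3, k=1)], 3))
print("D = det(R) =", round(np.linalg.det(R), 5))       # 0 <= D <= 1; close to 0 here
```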

Advantages over $|X'X|$:
(i) It is a bounded measure: $0 \le D \le 1$.
(ii) It is not affected by the dispersion of the explanatory variables. For example, when $k = 2$,
$$|X'X| = \begin{vmatrix} \sum_i x_{i1}^2 & \sum_i x_{i1}x_{i2} \\ \sum_i x_{i1}x_{i2} & \sum_i x_{i2}^2 \end{vmatrix} = \sum_i x_{i1}^2 \sum_i x_{i2}^2 \,(1 - r^2)$$
depends on the variability of the explanatory variables, whereas the determinant of the correlation matrix, $D = 1 - r^2$, depends only on the correlation coefficient.

4. Measure based on partial regression

A measure of multicollinearity can be obtained on the basis of coefficients of determination based on partial regression. Let $R^2$ be the coefficient of determination in the full model, i.e., based on all explanatory variables, and let $R_i^2$ be the coefficient of determination in the model when the $i$-th explanatory variable is dropped, $i = 1, 2, \ldots, k$. Let $R_L^2 = \max(R_1^2, R_2^2, \ldots, R_k^2)$.

Procedure:
(i) Drop one of the explanatory variables among the $k$ variables, say $X_1$.
(ii) Run the regression of $y$ on the rest of the $(k-1)$ variables $X_2, X_3, \ldots, X_k$.
(iii) Calculate $R_1^2$.
(iv) Similarly calculate $R_2^2, R_3^2, \ldots, R_k^2$.
(v) Find $R_L^2 = \max(R_1^2, R_2^2, \ldots, R_k^2)$.
(vi) Determine $(R^2 - R_L^2)$.

The quantity $(R^2 - R_L^2)$ provides a measure of multicollinearity. If multicollinearity is present, $R_L^2$ will be high: the higher the degree of multicollinearity, the higher the value of $R_L^2$. So in the presence of multicollinearity, $(R^2 - R_L^2)$ will be low. Thus, if $(R^2 - R_L^2)$ is close to 0, it indicates a high degree of multicollinearity.

Limitations:
(i) It gives no information about the underlying relations among the explanatory variables, i.e., how many relationships are present or how many explanatory variables are responsible for the multicollinearity.
(ii) A small value of $(R^2 - R_L^2)$ may also occur because of poor specification of the model, in which case it may be wrongly inferred that multicollinearity is present.
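A minimal sketch of this procedure on assumed simulated data (the `r_squared` helper and the data-generating choices are illustrative, not from the lecture): compute $R^2$ for the full model, $R_i^2$ with each regressor dropped in turn, and the measure $R^2 - R_L^2$.

```python
# Sketch: R^2 - R_L^2 with R_L^2 = max_i R_i^2; a value near 0 suggests multicollinearity.
import numpy as np

def r_squared(y, X):
    """R^2 of an OLS fit of y on X (an intercept column is added here)."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(3)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 0.05 * rng.standard_normal(n)                 # nearly collinear with x1
x3 = rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])
y = 1 + x1 + x2 + x3 + rng.standard_normal(n)

R2_full = r_squared(y, X)
R2_drop = [r_squared(y, np.delete(X, i, axis=1)) for i in range(X.shape[1])]
R2_L = max(R2_drop)
print("R^2 =", round(R2_full, 4), " R_L^2 =", round(R2_L, 4),
      " measure =", round(R2_full - R2_L, 4))
```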

5. Variance inflation factors (VIF)

The matrix $C = (X'X)^{-1}$ becomes ill-conditioned in the presence of multicollinearity in the data, so the diagonal elements of $C$ help in the detection of multicollinearity. If $R_j^2$ denotes the coefficient of determination obtained when $X_j$ is regressed on the remaining $(k-1)$ variables excluding $X_j$, then the $j$-th diagonal element of $C$ is
$$C_{jj} = \frac{1}{1 - R_j^2}.$$
If $X_j$ is nearly orthogonal to the remaining explanatory variables, then $R_j^2$ is small and consequently $C_{jj}$ is close to 1. If $X_j$ is nearly linearly dependent on a subset of the remaining explanatory variables, then $R_j^2$ is close to 1 and consequently $C_{jj}$ is large.

Since the variance of the $j$-th OLSE of $\beta_j$ is $\mathrm{Var}(b_j) = \sigma^2 C_{jj}$, $C_{jj}$ is the factor by which the variance of $b_j$ increases when the explanatory variables are near-linearly dependent. Based on this concept, the variance inflation factor for the $j$-th explanatory variable is defined as
$$VIF_j = \frac{1}{1 - R_j^2}.$$
This is the factor which is responsible for inflating the sampling variance. The combined effect of dependencies among the explanatory variables on the variance of a term is measured by the VIF of that term in the model. One or more large VIFs indicate the presence of multicollinearity in the data.

In practice, usually $VIF > 5$ or $10$ indicates that the associated regression coefficients are poorly estimated because of multicollinearity. If the regression coefficients are estimated by OLSE and their covariance matrix is $\sigma^2(X'X)^{-1}$, then $VIF_j$ indicates that a part of this variance is given by $VIF_j\,\sigma^2$.
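A minimal sketch on assumed simulated data: since $C_{jj} = 1/(1 - R_j^2)$ for standardized regressors, the VIFs can be read off the diagonal of the inverse of the correlation matrix of the regressors.

```python
# Sketch: VIF_j = 1/(1 - R_j^2) = j-th diagonal element of the inverse correlation matrix.
import numpy as np

def vif(X):
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))

rng = np.random.default_rng(4)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x3 = 0.95 * x1 + 0.05 * x2 + 0.1 * rng.standard_normal(n)   # near-linear dependency
X = np.column_stack([x1, x2, x3])

print("VIFs:", np.round(vif(X), 2))     # VIF > 5 or 10 flags poorly estimated coefficients
```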

Limitations:
(i) It sheds no light on the number of dependencies among the explanatory variables.
(ii) The rule $VIF > 5$ or $10$ is a rule of thumb, which may differ from one situation to another.

Another interpretation of VIF

The VIFs can also be viewed as follows. The confidence interval of the $j$-th OLSE of $\beta_j$ is given by
$$b_j \pm \sqrt{\hat\sigma^2 C_{jj}}\; t_{\alpha/2,\,n-k}.$$
The length of this confidence interval is
$$L_j = 2\sqrt{\hat\sigma^2 C_{jj}}\; t_{\alpha/2,\,n-k}.$$
Now consider a situation where $X$ is an orthogonal matrix, i.e., $X'X = I$ so that $C_{jj} = 1$, with the same sample size and the same root mean squares of the explanatory variables as earlier; then the length of the confidence interval becomes
$$L^* = 2\hat\sigma\, t_{\alpha/2,\,n-k}.$$
Consider the ratio
$$\frac{L_j}{L^*} = \sqrt{C_{jj}}.$$
Thus $\sqrt{VIF_j}$ indicates the increase in the length of the confidence interval of the $j$-th regression coefficient due to the presence of multicollinearity.
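A short sketch of this interpretation on assumed simulated data: with the regressors scaled to unit length, the confidence-interval inflation $L_j/L^*$ is just $\sqrt{C_{jj}} = \sqrt{VIF_j}$, computed directly from the scaled design.

```python
# Sketch: sqrt(C_jj) gives the factor by which each coefficient's confidence interval is lengthened.
import numpy as np

rng = np.random.default_rng(5)
n = 200
x1 = rng.standard_normal(n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.standard_normal(n)
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)
X = Xc / np.sqrt((Xc**2).sum(axis=0))                   # unit-length scaling

C = np.linalg.inv(X.T @ X)
print("sqrt(C_jj) = CI length inflation:", np.round(np.sqrt(np.diag(C)), 3))
```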

6. Condition number and condition index

Let $\lambda_1, \lambda_2, \ldots, \lambda_k$ be the eigenvalues (or characteristic roots) of $X'X$. Let
$$\lambda_{\max} = \max(\lambda_1, \lambda_2, \ldots, \lambda_k), \qquad \lambda_{\min} = \min(\lambda_1, \lambda_2, \ldots, \lambda_k).$$
The condition number (CN) is defined as
$$CN = \frac{\lambda_{\max}}{\lambda_{\min}}, \qquad 0 < CN < \infty.$$
Small values of the characteristic roots indicate the presence of near-linear dependencies in the data. The CN provides a measure of spread in the spectrum of characteristic roots of $X'X$.

The condition number provides a measure of multicollinearity. If $CN < 100$, it is considered non-harmful multicollinearity. If $100 < CN < 1000$, it indicates that the multicollinearity is moderate to severe (or strong); this range is referred to as the danger level. If $CN > 1000$, it indicates severe (or strong) multicollinearity.

The condition number is based on only two eigenvalues, $\lambda_{\min}$ and $\lambda_{\max}$. Other measures are the condition indices, which use information on the other eigenvalues as well. The condition indices of $X'X$ are defined as
$$C_j = \frac{\lambda_{\max}}{\lambda_j}, \qquad j = 1, 2, \ldots, k.$$
In fact, the largest $C_j$ equals the CN.

The number of condition indices that are large, say more than 1000, indicates the number of near-linear dependencies in $X'X$.

A limitation of CN and $C_j$ is that they are unbounded measures, as $0 < CN < \infty$ and $0 < C_j < \infty$.
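A minimal sketch on assumed simulated data: compute the eigenvalues of $X'X$ for unit-length regressors, then the condition number and the condition indices.

```python
# Sketch: condition number CN = lambda_max / lambda_min and condition indices C_j = lambda_max / lambda_j.
import numpy as np

rng = np.random.default_rng(6)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x3 = x1 + x2 + 0.01 * rng.standard_normal(n)            # near-linear dependency
X = np.column_stack([x1, x2, x3])
Xc = X - X.mean(axis=0)
X = Xc / np.sqrt((Xc**2).sum(axis=0))

lam = np.linalg.eigvalsh(X.T @ X)                       # eigenvalues of X'X
print("eigenvalues:", np.round(lam, 5))
print("CN =", round(lam.max() / lam.min(), 1), " condition indices:", np.round(lam.max() / lam, 1))
```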

7. Measure based on characteristic roots and proportion of variances

Let $\lambda_1, \lambda_2, \ldots, \lambda_k$ be the eigenvalues of $X'X$, let $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k)$ be a $k \times k$ diagonal matrix, and let $V = (V_1, V_2, \ldots, V_k)$ be the $k \times k$ matrix constructed from the eigenvectors of $X'X$. Obviously, $V$ is an orthogonal matrix. Then $X'X$ can be decomposed as
$$X'X = V \Lambda V'.$$
Let $V_1, V_2, \ldots, V_k$ be the columns of $V$. If there is a near-linear dependency in the data, then $\lambda_j$ is close to zero and the nature of the linear dependency is described by the elements of the associated eigenvector $V_j$.

The covariance matrix of the OLSE is
$$V(b) = \sigma^2 (X'X)^{-1} = \sigma^2 (V \Lambda V')^{-1} = \sigma^2 V \Lambda^{-1} V',$$
so
$$\mathrm{Var}(b_i) = \sigma^2 \left( \frac{v_{i1}^2}{\lambda_1} + \frac{v_{i2}^2}{\lambda_2} + \cdots + \frac{v_{ik}^2}{\lambda_k} \right),$$
where $v_{i1}, v_{i2}, \ldots, v_{ik}$ are the elements of the $i$-th row of $V$.

The condition indices are
$$C_j = \frac{\lambda_{\max}}{\lambda_j}, \qquad j = 1, 2, \ldots, k.$$
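A short sketch on assumed simulated data: decompose $X'X = V\Lambda V'$ and check numerically that the $i$-th diagonal element of $(X'X)^{-1}$ equals $\sum_j v_{ij}^2/\lambda_j$, the terms that apportion $\mathrm{Var}(b_i)$ over the eigenvalues.

```python
# Sketch: diag((X'X)^{-1})_i = sum_j v_ij^2 / lambda_j from the eigendecomposition of X'X.
import numpy as np

rng = np.random.default_rng(7)
n = 150
x1 = rng.standard_normal(n)
x2 = 0.98 * x1 + 0.2 * rng.standard_normal(n)
x3 = rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])

lam, V = np.linalg.eigh(X.T @ X)                        # X'X = V diag(lam) V'
terms = V**2 / lam                                      # terms[i, j] = v_ij^2 / lambda_j
print("sum_j v_ij^2/lambda_j :", np.round(terms.sum(axis=1), 6))
print("diag((X'X)^-1)        :", np.round(np.diag(np.linalg.inv(X.T @ X)), 6))
```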

Procedure:
(i) Find the condition indices $C_1, C_2, \ldots, C_k$.
(ii) (a) Identify those $\lambda_j$'s for which $C_j$ is greater than the danger level 1000. (b) This gives the number of linear dependencies. (c) Don't consider those $C_j$'s which are below the danger level.
(iii) For the $\lambda_j$'s with condition index above the danger level, choose one such eigenvalue, say $\lambda_j$.
(iv) Find the value of the proportion of variance corresponding to $\lambda_j$ in $\mathrm{Var}(b_1), \mathrm{Var}(b_2), \ldots, \mathrm{Var}(b_k)$ as
$$p_{ij} = \frac{v_{ij}^2/\lambda_j}{VIF_i} = \frac{v_{ij}^2/\lambda_j}{\sum_{j=1}^{k} v_{ij}^2/\lambda_j}.$$
Note that $v_{ij}^2/\lambda_j$ can be found from the expression
$$\mathrm{Var}(b_i) = \sigma^2 \left( \frac{v_{i1}^2}{\lambda_1} + \frac{v_{i2}^2}{\lambda_2} + \cdots + \frac{v_{ik}^2}{\lambda_k} \right),$$
i.e., as the term corresponding to the factor $\lambda_j$.

The proportion of variance $p_{ij}$ provides a measure of multicollinearity. If $p_{ij} > 0.5$, it indicates that $b_i$ is adversely affected by the multicollinearity, i.e., the estimate of $\beta_i$ is influenced by the presence of multicollinearity.

It is a good diagnostic tool in the sense that it tells about the presence of harmful multicollinearity and also indicates the number of linear dependencies responsible for the multicollinearity. This diagnostic is better than the other diagnostics.
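A minimal sketch of the variance-decomposition proportions on assumed simulated data: a proportion $p_{ij} > 0.5$ on an eigenvalue with a large condition index flags a coefficient harmed by that near-linear dependency.

```python
# Sketch: variance-decomposition proportions p_ij = (v_ij^2/lambda_j) / sum_j (v_ij^2/lambda_j).
import numpy as np

rng = np.random.default_rng(8)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x3 = x1 - x2 + 0.02 * rng.standard_normal(n)            # near-linear dependency
X = np.column_stack([x1, x2, x3])
Xc = X - X.mean(axis=0)
X = Xc / np.sqrt((Xc**2).sum(axis=0))

lam, V = np.linalg.eigh(X.T @ X)
terms = V**2 / lam                                      # terms[i, j] = v_ij^2 / lambda_j
P = terms / terms.sum(axis=1, keepdims=True)            # P[i, j] = p_ij
print("condition indices:", np.round(lam.max() / lam, 1))
print("variance proportions (rows = b_i, columns = lambda_j):")
print(np.round(P, 3))
```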

The condition indices can also be defined through the singular value decomposition of the $X$ matrix as follows:
$$X = U D V',$$
where $U$ is an $n \times k$ matrix, $V$ is a $k \times k$ matrix, $U'U = I$, $V'V = I$, $D$ is a $k \times k$ matrix with $D = \mathrm{diag}(\mu_1, \mu_2, \ldots, \mu_k)$, and $\mu_1, \mu_2, \ldots, \mu_k$ are the singular values of $X$. Here $V$ is the matrix whose columns are the eigenvectors corresponding to the eigenvalues of $X'X$, and $U$ is the matrix whose columns are the eigenvectors associated with the $k$ nonzero eigenvalues of $XX'$.

The condition indices of the $X$ matrix are defined as
$$\eta_j = \frac{\mu_{\max}}{\mu_j}, \qquad j = 1, 2, \ldots, k,$$
where $\mu_{\max} = \max(\mu_1, \mu_2, \ldots, \mu_k)$.

If $\lambda_1, \lambda_2, \ldots, \lambda_k$ are the eigenvalues of $X'X$, then
$$X'X = (UDV')'(UDV') = V D'D V' = V \Lambda V',$$
so $\lambda_j = \mu_j^2$, $j = 1, 2, \ldots, k$.

Note that, with $\mu_j^2 = \lambda_j$,
$$\mathrm{Var}(b_i) = \sigma^2 \sum_{j=1}^{k} \frac{v_{ij}^2}{\mu_j^2}, \qquad VIF_i = \sum_{j=1}^{k} \frac{v_{ij}^2}{\mu_j^2}, \qquad p_{ij} = \frac{v_{ij}^2/\mu_j^2}{VIF_i}.$$
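A brief sketch on assumed simulated data: obtain the singular values of $X$, check $\lambda_j = \mu_j^2$ against the eigenvalues of $X'X$, and form the condition indices $\eta_j = \mu_{\max}/\mu_j$.

```python
# Sketch: condition indices from the SVD of X; lambda_j = mu_j^2.
import numpy as np

rng = np.random.default_rng(9)
n = 120
x1 = rng.standard_normal(n)
x2 = x1 + 0.05 * rng.standard_normal(n)                 # near-linear dependency
X = np.column_stack([x1, x2, rng.standard_normal(n)])

U, mu, Vt = np.linalg.svd(X, full_matrices=False)       # X = U diag(mu) V'
lam = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]        # eigenvalues of X'X, descending
print("mu_j^2       :", np.round(mu**2, 4))
print("eigenvalues  :", np.round(lam, 4))               # matches mu_j^2
print("eta_j = mu_max/mu_j:", np.round(mu.max() / mu, 2))
```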

The ill-conditioning in $X$ is reflected in the size of the singular values: there will be one small singular value for each near-linear dependency. The extent of ill-conditioning is described by how small $\mu_j$ is relative to $\mu_{\max}$.

It is suggested that the explanatory variables should be scaled to unit length but should not be centered when computing $p_{ij}$. This will help in diagnosing the role of the intercept term in a near-linear dependence. No unique guidance is available in the literature on the issue of centering the explanatory variables. Centering makes the intercept orthogonal to the explanatory variables, so it may remove the ill-conditioning due to the intercept term in the model.