LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS
MODULE IX Lecture - 9: Multicollinearity
Dr. Shalabh, Department of Mathematics and Statistics, Indian Institute of Technology Kanpur

Multicollinearity diagnostics

An important question that arises is how to diagnose the presence of multicollinearity in the data on the basis of the given sample information. Several diagnostic measures are available, and each of them is based on a particular approach. It is difficult to say which of the diagnostics is the best or ultimate. Some of the popular and important diagnostics are described further. The detection of multicollinearity involves 3 aspects:
(i) determining its presence,
(ii) determining its severity,
(iii) determining its form or location.

1. Determinant of $X'X$

This measure is based on the fact that the matrix $X'X$ becomes ill-conditioned in the presence of multicollinearity. The value of the determinant of $X'X$, i.e., $|X'X|$, declines as the degree of multicollinearity increases. If $\mathrm{rank}(X'X) < k$, then $X'X$ is singular and so $|X'X| = 0$. So as $|X'X| \to 0$, the degree of multicollinearity increases, and it becomes exact or perfect at $|X'X| = 0$. Thus $|X'X|$ serves as a measure of multicollinearity, and $|X'X| = 0$ indicates that perfect multicollinearity exists.
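A minimal numerical sketch of this diagnostic, using simulated data (the correlations, seed, and sample size below are assumptions for illustration, not part of the lecture): with the columns of $X$ scaled to unit length, $|X'X|$ moves toward 0 as the correlation between two regressors increases.

```python
# Sketch: |X'X| shrinks toward 0 as collinearity between two unit-length regressors grows.
import numpy as np

rng = np.random.default_rng(0)
n = 100
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)

for rho in [0.0, 0.5, 0.9, 0.99, 0.999]:
    x1 = z1
    x2 = rho * z1 + np.sqrt(1 - rho**2) * z2           # correlation with x1 approximately rho
    X = np.column_stack([x1, x2])
    Xc = X - X.mean(axis=0)
    X = Xc / np.sqrt((Xc**2).sum(axis=0))               # unit-length scaling, so X'X is the correlation matrix
    print(f"rho={rho:6.3f}  |X'X| = {np.linalg.det(X.T @ X):.6f}")
```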

Limitations: This measure has the following limitations.
(i) It is not bounded, as $0 < |X'X| < \infty$.
(ii) It is affected by the dispersion of the explanatory variables. For example, if $k = 2$, then
$$|X'X| = \begin{vmatrix} \sum_i x_{i1}^2 & \sum_i x_{i1}x_{i2} \\ \sum_i x_{i1}x_{i2} & \sum_i x_{i2}^2 \end{vmatrix} = \sum_i x_{i1}^2 \sum_i x_{i2}^2 \,(1 - r^2),$$
where $r$ is the correlation coefficient between $X_1$ and $X_2$. So $|X'X|$ depends on the correlation coefficient as well as on the variability of the explanatory variables. If the explanatory variables have very low variability, then $|X'X|$ may tend to zero, which would indicate the presence of multicollinearity even when that is not the case.
(iii) It gives no idea about the relative effects on individual coefficients. If multicollinearity is present, it does not indicate which variable in $X'X$ is causing the multicollinearity, and this is hard to determine.
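The dispersion limitation in (ii) can be checked numerically; the sketch below (assumed simulated data) verifies the $k = 2$ identity and shows that rescaling one column changes $|X'X|$ even though the correlation, and hence the collinearity, is unchanged.

```python
# Sketch: |X'X| = (sum x1^2)(sum x2^2)(1 - r^2) for centered data; shrinking the
# dispersion of x2 drives |X'X| toward 0 while r stays the same.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.standard_normal(n)
x2 = 0.6 * x1 + 0.8 * rng.standard_normal(n)
x1, x2 = x1 - x1.mean(), x2 - x2.mean()

r = np.corrcoef(x1, x2)[0, 1]
for scale in [1.0, 0.1, 0.01]:                          # reduce the dispersion of x2
    X = np.column_stack([x1, scale * x2])
    det = np.linalg.det(X.T @ X)
    formula = (x1**2).sum() * ((scale * x2)**2).sum() * (1 - r**2)
    print(f"scale={scale:5.2f}  |X'X|={det:12.6f}  formula={formula:12.6f}  r={r:.3f}")
```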

2. Inspection of correlation matrix

The inspection of the off-diagonal elements $r_{ij}$ of $X'X$ gives an idea about the presence of multicollinearity. If $X_i$ and $X_j$ are nearly linearly dependent, then $|r_{ij}|$ will be close to 1. Note that the observations in $X$ are standardized in the sense that each observation is subtracted from the mean of that variable and divided by the square root of the corrected sum of squares of that variable.

When more than two explanatory variables are considered, and if they are involved in a near-linear dependency, then it is not necessary that any of the $r_{ij}$ will be large. Generally, pairwise inspection of the correlation coefficients is not sufficient for detecting multicollinearity in the data.

3. Determinant of correlation matrix

Let $D$ be the determinant of the correlation matrix; then $0 \le D \le 1$. If $D = 0$, it indicates the existence of an exact linear dependence among the explanatory variables. If $D = 1$, then the columns of the $X$ matrix are orthonormal. Thus a value close to 0 is an indication of a high degree of multicollinearity, and any value of $D$ between 0 and 1 gives an idea of the degree of multicollinearity.

Limitation: It gives no information about the number of linear dependencies among the explanatory variables.
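A small sketch of both diagnostics, on assumed simulated data: a near-linear dependency among three regressors can leave every pairwise correlation only moderate, while the determinant $D$ of the correlation matrix is still close to 0.

```python
# Sketch: pairwise correlations may not reveal a three-variable dependency, but D = det(R) does.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x3 = x1 + x2 + 0.05 * rng.standard_normal(n)            # near-linear dependency: x3 ~ x1 + x2
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)                        # correlation matrix of the regressors
print("off-diagonal correlations:", np.round(R[np.triu_indices(3, k=1)], 3))
print("D = det(R) =", round(np.linalg.det(R), 5))       # 0 <= D <= 1; close to 0 here
```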

Advantages over $|X'X|$:
(i) It is a bounded measure: $0 \le D \le 1$.
(ii) It is not affected by the dispersion of the explanatory variables. For example, when $k = 2$,
$$|X'X| = \begin{vmatrix} \sum_i x_{i1}^2 & \sum_i x_{i1}x_{i2} \\ \sum_i x_{i1}x_{i2} & \sum_i x_{i2}^2 \end{vmatrix} = \sum_i x_{i1}^2 \sum_i x_{i2}^2 \,(1 - r^2)$$
depends on the variability of the explanatory variables, whereas the determinant of the correlation matrix, $D = 1 - r^2$, depends only on the correlation coefficient.

4. Measure based on partial regression

A measure of multicollinearity can be obtained on the basis of coefficients of determination based on partial regression. Let $R^2$ be the coefficient of determination in the full model, i.e., based on all explanatory variables, and let $R_i^2$ be the coefficient of determination in the model when the $i$-th explanatory variable is dropped, $i = 1, 2, \ldots, k$. Let $R_L^2 = \max(R_1^2, R_2^2, \ldots, R_k^2)$.

Procedure:
(i) Drop one of the explanatory variables among the $k$ variables, say $X_1$.
(ii) Run the regression of $y$ on the rest of the $(k-1)$ variables $X_2, X_3, \ldots, X_k$.
(iii) Calculate $R_1^2$.
(iv) Similarly calculate $R_2^2, R_3^2, \ldots, R_k^2$.
(v) Find $R_L^2 = \max(R_1^2, R_2^2, \ldots, R_k^2)$.
(vi) Determine $(R^2 - R_L^2)$.

The quantity $(R^2 - R_L^2)$ provides a measure of multicollinearity. If multicollinearity is present, $R_L^2$ will be high: the higher the degree of multicollinearity, the higher the value of $R_L^2$. So in the presence of multicollinearity, $(R^2 - R_L^2)$ will be low. Thus, if $(R^2 - R_L^2)$ is close to 0, it indicates a high degree of multicollinearity.

Limitations:
(i) It gives no information about the underlying relations among the explanatory variables, i.e., how many relationships are present or how many explanatory variables are responsible for the multicollinearity.
(ii) A small value of $(R^2 - R_L^2)$ may also occur because of poor specification of the model, in which case it may be wrongly inferred that multicollinearity is present.
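A minimal sketch of this procedure on assumed simulated data (the `r_squared` helper and the data-generating choices are illustrative, not from the lecture): compute $R^2$ for the full model, $R_i^2$ with each regressor dropped in turn, and the measure $R^2 - R_L^2$.

```python
# Sketch: R^2 - R_L^2 with R_L^2 = max_i R_i^2; a value near 0 suggests multicollinearity.
import numpy as np

def r_squared(y, X):
    """R^2 of an OLS fit of y on X (an intercept column is added here)."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(3)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 0.05 * rng.standard_normal(n)                 # nearly collinear with x1
x3 = rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])
y = 1 + x1 + x2 + x3 + rng.standard_normal(n)

R2_full = r_squared(y, X)
R2_drop = [r_squared(y, np.delete(X, i, axis=1)) for i in range(X.shape[1])]
R2_L = max(R2_drop)
print("R^2 =", round(R2_full, 4), " R_L^2 =", round(R2_L, 4),
      " measure =", round(R2_full - R2_L, 4))
```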

5. Variance inflation factors (VIF)

The matrix $C = (X'X)^{-1}$ becomes ill-conditioned in the presence of multicollinearity in the data, so the diagonal elements of $C$ help in the detection of multicollinearity. If $R_j^2$ denotes the coefficient of determination obtained when $X_j$ is regressed on the remaining $(k-1)$ variables excluding $X_j$, then the $j$-th diagonal element of $C$ is
$$C_{jj} = \frac{1}{1 - R_j^2}.$$
If $X_j$ is nearly orthogonal to the remaining explanatory variables, then $R_j^2$ is small and consequently $C_{jj}$ is close to 1. If $X_j$ is nearly linearly dependent on a subset of the remaining explanatory variables, then $R_j^2$ is close to 1 and consequently $C_{jj}$ is large.

Since the variance of the $j$-th OLSE of $\beta_j$ is $\mathrm{Var}(b_j) = \sigma^2 C_{jj}$, $C_{jj}$ is the factor by which the variance of $b_j$ increases when the explanatory variables are near-linearly dependent. Based on this concept, the variance inflation factor for the $j$-th explanatory variable is defined as
$$VIF_j = \frac{1}{1 - R_j^2}.$$
This is the factor which is responsible for inflating the sampling variance. The combined effect of dependencies among the explanatory variables on the variance of a term is measured by the VIF of that term in the model. One or more large VIFs indicate the presence of multicollinearity in the data.

In practice, usually $VIF > 5$ or $10$ indicates that the associated regression coefficients are poorly estimated because of multicollinearity. If the regression coefficients are estimated by OLSE and their covariance matrix is $\sigma^2(X'X)^{-1}$, then $VIF_j$ indicates that a part of this variance is given by $VIF_j\,\sigma^2$.
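A minimal sketch on assumed simulated data: since $C_{jj} = 1/(1 - R_j^2)$ for standardized regressors, the VIFs can be read off the diagonal of the inverse of the correlation matrix of the regressors.

```python
# Sketch: VIF_j = 1/(1 - R_j^2) = j-th diagonal element of the inverse correlation matrix.
import numpy as np

def vif(X):
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))

rng = np.random.default_rng(4)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x3 = 0.95 * x1 + 0.05 * x2 + 0.1 * rng.standard_normal(n)   # near-linear dependency
X = np.column_stack([x1, x2, x3])

print("VIFs:", np.round(vif(X), 2))     # VIF > 5 or 10 flags poorly estimated coefficients
```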

Limitations:
(i) It sheds no light on the number of dependencies among the explanatory variables.
(ii) The rule $VIF > 5$ or $10$ is a rule of thumb, which may differ from one situation to another.

Another interpretation of VIF

The VIFs can also be viewed as follows. The confidence interval of the $j$-th OLSE of $\beta_j$ is given by
$$b_j \pm \sqrt{\hat\sigma^2 C_{jj}}\; t_{\alpha/2,\,n-k}.$$
The length of this confidence interval is
$$L_j = 2\sqrt{\hat\sigma^2 C_{jj}}\; t_{\alpha/2,\,n-k}.$$
Now consider a situation where $X$ is an orthogonal matrix, i.e., $X'X = I$ so that $C_{jj} = 1$, with the same sample size and the same root mean squares of the explanatory variables as earlier; then the length of the confidence interval becomes
$$L^* = 2\hat\sigma\, t_{\alpha/2,\,n-k}.$$
Consider the ratio
$$\frac{L_j}{L^*} = \sqrt{C_{jj}}.$$
Thus $\sqrt{VIF_j}$ indicates the increase in the length of the confidence interval of the $j$-th regression coefficient due to the presence of multicollinearity.
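A short sketch of this interpretation on assumed simulated data: with the regressors scaled to unit length, the confidence-interval inflation $L_j/L^*$ is just $\sqrt{C_{jj}} = \sqrt{VIF_j}$, computed directly from the scaled design.

```python
# Sketch: sqrt(C_jj) gives the factor by which each coefficient's confidence interval is lengthened.
import numpy as np

rng = np.random.default_rng(5)
n = 200
x1 = rng.standard_normal(n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.standard_normal(n)
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)
X = Xc / np.sqrt((Xc**2).sum(axis=0))                   # unit-length scaling

C = np.linalg.inv(X.T @ X)
print("sqrt(C_jj) = CI length inflation:", np.round(np.sqrt(np.diag(C)), 3))
```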

6. Condition number and condition index

Let $\lambda_1, \lambda_2, \ldots, \lambda_k$ be the eigenvalues (or characteristic roots) of $X'X$. Let
$$\lambda_{\max} = \max(\lambda_1, \lambda_2, \ldots, \lambda_k), \qquad \lambda_{\min} = \min(\lambda_1, \lambda_2, \ldots, \lambda_k).$$
The condition number (CN) is defined as
$$CN = \frac{\lambda_{\max}}{\lambda_{\min}}, \qquad 0 < CN < \infty.$$
Small values of the characteristic roots indicate the presence of near-linear dependencies in the data. The CN provides a measure of spread in the spectrum of characteristic roots of $X'X$.

The condition number provides a measure of multicollinearity. If $CN < 100$, it is considered non-harmful multicollinearity. If $100 < CN < 1000$, it indicates that the multicollinearity is moderate to severe (or strong); this range is referred to as the danger level. If $CN > 1000$, it indicates severe (or strong) multicollinearity.

The condition number is based on only two eigenvalues, $\lambda_{\min}$ and $\lambda_{\max}$. Other measures are the condition indices, which use information on the other eigenvalues as well. The condition indices of $X'X$ are defined as
$$C_j = \frac{\lambda_{\max}}{\lambda_j}, \qquad j = 1, 2, \ldots, k.$$
In fact, the largest $C_j$ equals the CN.

The number of condition indices that are large, say more than 1000, indicates the number of near-linear dependencies in $X'X$.

A limitation of CN and $C_j$ is that they are unbounded measures, as $0 < CN < \infty$ and $0 < C_j < \infty$.
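A minimal sketch on assumed simulated data: compute the eigenvalues of $X'X$ for unit-length regressors, then the condition number and the condition indices.

```python
# Sketch: condition number CN = lambda_max / lambda_min and condition indices C_j = lambda_max / lambda_j.
import numpy as np

rng = np.random.default_rng(6)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x3 = x1 + x2 + 0.01 * rng.standard_normal(n)            # near-linear dependency
X = np.column_stack([x1, x2, x3])
Xc = X - X.mean(axis=0)
X = Xc / np.sqrt((Xc**2).sum(axis=0))

lam = np.linalg.eigvalsh(X.T @ X)                       # eigenvalues of X'X
print("eigenvalues:", np.round(lam, 5))
print("CN =", round(lam.max() / lam.min(), 1), " condition indices:", np.round(lam.max() / lam, 1))
```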

7. Measure based on characteristic roots and proportion of variances

Let $\lambda_1, \lambda_2, \ldots, \lambda_k$ be the eigenvalues of $X'X$, let $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k)$ be a $k \times k$ diagonal matrix, and let $V = (V_1, V_2, \ldots, V_k)$ be the $k \times k$ matrix constructed from the eigenvectors of $X'X$. Obviously, $V$ is an orthogonal matrix. Then $X'X$ can be decomposed as
$$X'X = V \Lambda V'.$$
Let $V_1, V_2, \ldots, V_k$ be the columns of $V$. If there is a near-linear dependency in the data, then $\lambda_j$ is close to zero and the nature of the linear dependency is described by the elements of the associated eigenvector $V_j$.

The covariance matrix of the OLSE is
$$V(b) = \sigma^2 (X'X)^{-1} = \sigma^2 (V \Lambda V')^{-1} = \sigma^2 V \Lambda^{-1} V',$$
so
$$\mathrm{Var}(b_i) = \sigma^2 \left( \frac{v_{i1}^2}{\lambda_1} + \frac{v_{i2}^2}{\lambda_2} + \cdots + \frac{v_{ik}^2}{\lambda_k} \right),$$
where $v_{i1}, v_{i2}, \ldots, v_{ik}$ are the elements of the $i$-th row of $V$.

The condition indices are
$$C_j = \frac{\lambda_{\max}}{\lambda_j}, \qquad j = 1, 2, \ldots, k.$$
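A short sketch on assumed simulated data: decompose $X'X = V\Lambda V'$ and check numerically that the $i$-th diagonal element of $(X'X)^{-1}$ equals $\sum_j v_{ij}^2/\lambda_j$, the terms that apportion $\mathrm{Var}(b_i)$ over the eigenvalues.

```python
# Sketch: diag((X'X)^{-1})_i = sum_j v_ij^2 / lambda_j from the eigendecomposition of X'X.
import numpy as np

rng = np.random.default_rng(7)
n = 150
x1 = rng.standard_normal(n)
x2 = 0.98 * x1 + 0.2 * rng.standard_normal(n)
x3 = rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])

lam, V = np.linalg.eigh(X.T @ X)                        # X'X = V diag(lam) V'
terms = V**2 / lam                                      # terms[i, j] = v_ij^2 / lambda_j
print("sum_j v_ij^2/lambda_j :", np.round(terms.sum(axis=1), 6))
print("diag((X'X)^-1)        :", np.round(np.diag(np.linalg.inv(X.T @ X)), 6))
```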

Procedure:
(i) Find the condition indices $C_1, C_2, \ldots, C_k$.
(ii) (a) Identify those $\lambda_j$'s for which $C_j$ is greater than the danger level 1000. (b) This gives the number of linear dependencies. (c) Don't consider those $C_j$'s which are below the danger level.
(iii) For the $\lambda_j$'s with condition index above the danger level, choose one such eigenvalue, say $\lambda_j$.
(iv) Find the value of the proportion of variance corresponding to $\lambda_j$ in $\mathrm{Var}(b_1), \mathrm{Var}(b_2), \ldots, \mathrm{Var}(b_k)$ as
$$p_{ij} = \frac{v_{ij}^2/\lambda_j}{VIF_i} = \frac{v_{ij}^2/\lambda_j}{\sum_{j=1}^{k} v_{ij}^2/\lambda_j}.$$
Note that $v_{ij}^2/\lambda_j$ can be found from the expression
$$\mathrm{Var}(b_i) = \sigma^2 \left( \frac{v_{i1}^2}{\lambda_1} + \frac{v_{i2}^2}{\lambda_2} + \cdots + \frac{v_{ik}^2}{\lambda_k} \right),$$
i.e., as the term corresponding to the factor $\lambda_j$.

The proportion of variance $p_{ij}$ provides a measure of multicollinearity. If $p_{ij} > 0.5$, it indicates that $b_i$ is adversely affected by the multicollinearity, i.e., the estimate of $\beta_i$ is influenced by the presence of multicollinearity.

It is a good diagnostic tool in the sense that it tells about the presence of harmful multicollinearity and also indicates the number of linear dependencies responsible for the multicollinearity. This diagnostic is better than the other diagnostics.
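A minimal sketch of the variance-decomposition proportions on assumed simulated data: a proportion $p_{ij} > 0.5$ on an eigenvalue with a large condition index flags a coefficient harmed by that near-linear dependency.

```python
# Sketch: variance-decomposition proportions p_ij = (v_ij^2/lambda_j) / sum_j (v_ij^2/lambda_j).
import numpy as np

rng = np.random.default_rng(8)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x3 = x1 - x2 + 0.02 * rng.standard_normal(n)            # near-linear dependency
X = np.column_stack([x1, x2, x3])
Xc = X - X.mean(axis=0)
X = Xc / np.sqrt((Xc**2).sum(axis=0))

lam, V = np.linalg.eigh(X.T @ X)
terms = V**2 / lam                                      # terms[i, j] = v_ij^2 / lambda_j
P = terms / terms.sum(axis=1, keepdims=True)            # P[i, j] = p_ij
print("condition indices:", np.round(lam.max() / lam, 1))
print("variance proportions (rows = b_i, columns = lambda_j):")
print(np.round(P, 3))
```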

The condition indices can also be defined through the singular value decomposition of the $X$ matrix as follows:
$$X = U D V',$$
where $U$ is an $n \times k$ matrix, $V$ is a $k \times k$ matrix, $U'U = I$, $V'V = I$, $D$ is a $k \times k$ matrix with $D = \mathrm{diag}(\mu_1, \mu_2, \ldots, \mu_k)$, and $\mu_1, \mu_2, \ldots, \mu_k$ are the singular values of $X$. Here $V$ is the matrix whose columns are the eigenvectors corresponding to the eigenvalues of $X'X$, and $U$ is the matrix whose columns are the eigenvectors associated with the $k$ nonzero eigenvalues of $XX'$.

The condition indices of the $X$ matrix are defined as
$$\eta_j = \frac{\mu_{\max}}{\mu_j}, \qquad j = 1, 2, \ldots, k,$$
where $\mu_{\max} = \max(\mu_1, \mu_2, \ldots, \mu_k)$.

If $\lambda_1, \lambda_2, \ldots, \lambda_k$ are the eigenvalues of $X'X$, then
$$X'X = (UDV')'(UDV') = V D'D V' = V \Lambda V',$$
so $\lambda_j = \mu_j^2$, $j = 1, 2, \ldots, k$.

Note that, with $\mu_j^2 = \lambda_j$,
$$\mathrm{Var}(b_i) = \sigma^2 \sum_{j=1}^{k} \frac{v_{ij}^2}{\mu_j^2}, \qquad VIF_i = \sum_{j=1}^{k} \frac{v_{ij}^2}{\mu_j^2}, \qquad p_{ij} = \frac{v_{ij}^2/\mu_j^2}{VIF_i}.$$
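A brief sketch on assumed simulated data: obtain the singular values of $X$, check $\lambda_j = \mu_j^2$ against the eigenvalues of $X'X$, and form the condition indices $\eta_j = \mu_{\max}/\mu_j$.

```python
# Sketch: condition indices from the SVD of X; lambda_j = mu_j^2.
import numpy as np

rng = np.random.default_rng(9)
n = 120
x1 = rng.standard_normal(n)
x2 = x1 + 0.05 * rng.standard_normal(n)                 # near-linear dependency
X = np.column_stack([x1, x2, rng.standard_normal(n)])

U, mu, Vt = np.linalg.svd(X, full_matrices=False)       # X = U diag(mu) V'
lam = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]        # eigenvalues of X'X, descending
print("mu_j^2       :", np.round(mu**2, 4))
print("eigenvalues  :", np.round(lam, 4))               # matches mu_j^2
print("eta_j = mu_max/mu_j:", np.round(mu.max() / mu, 2))
```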

The ill-conditioning in $X$ is reflected in the size of the singular values: there will be one small singular value for each near-linear dependency. The extent of ill-conditioning is described by how small $\mu_j$ is relative to $\mu_{\max}$.

It is suggested that the explanatory variables should be scaled to unit length but should not be centered when computing $p_{ij}$. This will help in diagnosing the role of the intercept term in a near-linear dependence. No unique guidance is available in the literature on the issue of centering the explanatory variables. Centering makes the intercept orthogonal to the explanatory variables, so it may remove the ill-conditioning due to the intercept term in the model.