Comparison among Some Remedial Procedures for Solving Multicollinearity Problem in Regression Model Using Simulation, by Ashraf Noureddin Dawod Ababneh


Comparison among Some Remedial Procedures for Solving Multicollinearity Problem in Regression Model Using Simulation

By Ashraf Noureddin Dawod Ababneh

Supervisor: Prof. Faris M. Al-Athari

This Thesis was submitted in Partial Fulfillment of the Requirements for the Master's Degree in Mathematics

Faculty of Graduate Studies
Zarqa University
December 2016

COMMITTEE DECISION

This Thesis (Comparison among Some Remedial Procedures for Solving Multicollinearity Problem in Regression Model Using Simulation) was successfully defended and approved on 2/1/2017.

Examination Committee / Signature
Prof. Faris M. Al-Athari (Supervisor), Prof. of Mathematics
Prof. Gharib M. Gharib (Member), Prof. of Mathematics
Dr. Radwan Abou Qewidr (Member), Assoc. Prof. of Mathematics
Dr. Mustafa O. Abou Shawesh (Member), Assoc. Prof. of Mathematics

Dedication

To the soul of my dear mother; to my father, brothers, and sisters; to my wife, and my beloved children Omar and Jana.

Acknowledgements

I wish to thank all those who have provided me with support and assistance during my research. I am deeply grateful to my supervisor, Professor Faris M. Al-Athari, who has provided guidance and instruction that I value greatly. I would also like to thank the examination committee members for their cooperation.

Table of Contents

Committee Decision
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Abstract

Chapter Zero
0.1 Introduction
0.2 Purposes of the Thesis
0.3 Literature Review

Chapter One: The Linear Regression and Multicollinearity
1.1 Introduction
1.2 The Simple Linear Regression Model
1.3 The Multiple Linear Regression Model
    The Least Squares Method
1.4 The Multicollinearity

Chapter Two: The Remedial Measures of Multicollinearity
2.1 Introduction
2.2 Ridge Regression
2.3 The Principal Component Regression
2.4 The Partial Least Squares Regression

Chapter Three: The Simulation Study and Analysis
3.1 Introduction
3.2 The Simulation Study
3.3 Generating the Simulated Data Sets
3.4 Performance Measures
3.5 Diagnosing Multicollinearity
3.6 Comparison Analysis
Conclusion
Abstract (in Arabic)
References

Comparison among Some Remedial Procedures for Solving Multicollinearity Problem in Regression Model Using Simulation

By Ashraf Noureddin Dawod Ababneh
Supervisor: Professor Faris Al-Athari

ABSTRACT

Multicollinearity is a problem that always occurs when two or more predictor variables are correlated with each other. This problem can cause the values of the least squares estimated regression coefficients to be conditional upon the correlated predictor variables in the model, and it makes it difficult to distinguish the contributions of these predictor variables to the response variable. Several approaches for handling the multicollinearity problem have been developed. In this thesis we restrict our attention to three well-known methods: Principal Component Regression, Partial Least Squares Regression, and Ridge Regression. The purpose of this thesis is to find the best method for handling multicollinearity problems by comparing the performances of the three methods, to determine which method is superior to the others in terms of its practicality and efficiency. According to the simulation study results in this thesis, we found that the partial least squares method is the best of the three because it is more effective and better able to represent the data when multicollinearity exists.

List of Tables

Table (1.1): The data of detergent
Table (1.2): The VIF values of the detergent data
Table (1.3): The eigenvalues of the detergent data
Table (1.4): Least squares coefficient estimator values
Table (2.1): K values for the R.R. estimators
Table (2.2): R.R. coefficient values for the 4 estimators
Table (2.3): Eigenvalues for the experiment data
Table (2.4): Eigenvectors for the experiment data
Table (2.5): The principal components as the columns of Z
Table (2.6): Values for the variation percent
Table (2.7): PLS weight vector for the detergent data
Table (2.8): PLS components values for the detergent data
Table (2.9): PLS component for the detergent data
Table (2.10): RMSE value for the detergent data
Table (3.1): Factors and levels for the simulated data sets
Table (3.2): The value of y for p = 5, 10, 20, 30 and 50
Table (3.3): The correlation matrix for p = 5 with 4 correlated variables
Table (3.4): The correlation matrix for p = 10 with 4 correlated variables
Table (3.5): VIF values for p = 5
Table (3.6): VIF values for p = 10
Table (3.7): VIF values for p = 20
Table (3.8): VIF values for p = 30
Table (3.9): VIF values for p = 50
Table (3.10): R² values for p = 5 and n = 10, 15, 25
Table (3.11): R² values for p = 10 and n = 15, 20, 30
Table (3.12): R² values for p = 20 and n = 25, 30, 40
Table (3.13): R² values for p = 30 and n = 35, 40, 50
Table (3.14): R² values for p = 50 and n = 55, 60, 70
Table (3.15): RMSE values for p = 5 and n = 10, 15, 25
Table (3.16): RMSE values for p = 10 and n = 15, 20, 30
Table (3.17): RMSE values for p = 20 and n = 25, 30, 40
Table (3.18): RMSE values for p = 30 and n = 35, 40, 50
Table (3.19): RMSE values for p = 50 and n = 55, 60, 70
Table (3.20): MSE values for p = 5 and n = 10, 15, 25
Table (3.21): MSE values for p = 10 and n = 15, 20, 30
Table (3.22): MSE values for p = 20 and n = 25, 30, 40
Table (3.23): MSE values for p = 30 and n = 35, 40, 50
Table (3.24): MSE values for p = 50 and n = 55, 60, 70

List of Abbreviations

E(·)    Expected value
Var(·)  Variance
Cov(·)  Variance-covariance matrix
ŷ       Fitted values
e       Residuals
ε       Errors
σ²      Variance
SSE     Residual sum of squares
R²      Coefficient of determination
OLS     Ordinary Least Squares Method
RR      Ridge Regression
PCR     Principal Component Regression
PLSR    Partial Least Squares Regression
MSE     Mean Square Error of an Estimator
VIF     Variance Inflation Factor
CI      Condition Index
RMSE    Root Mean Square Error

List of Figures

Figure (1.1): The fitted line that represents the relation between X and Y
Figure (1.2): The scatter plot matrix for the detergent data
Figure (2.1): The steps in Ridge Regression for the four chosen k methods
Figure (2.2): The steps in the PCR algorithm
Figure (2.3): The steps in the SIMPLS algorithm
Figure (3.1): Plot of R² values against m = 100 replications for p = 20, n = 25
Figure (3.2): Plot of R² values against m = 100 replications for p = 50, n = 55
Figure (3.3): Plot of RMSE values against m = 100 replications for p = 20, n = 25
Figure (3.4): Plot of RMSE values against m = 100 replications for p = 50, n = 55
Figure (3.5): Plot of MSE values against m = 100 replications for p = 20, n = 25
Figure (3.6): Plot of MSE values against m = 100 replications for p = 30, n = 35

Chapter Zero

0.1 Introduction

Regression analysis is a statistical method widely used in many fields such as economics, technology, social sciences and finance. A linear regression model is constructed to describe the relationship between the dependent variable and one or several predictor variables. One of the primary conditions of the standard linear regression model is the linear independence of the predictor variables; multicollinearity is a problem that always occurs when two or more predictor variables are correlated with each other. The presence of serious multicollinearity reduces the precision of the estimated coefficients in a linear regression model.

The Ordinary Least Squares estimators are unbiased estimators that are used to estimate the unknown parameters in the model. The variance of the Ordinary Least Squares estimators can be very large in the presence of multicollinearity. Therefore, biased estimators with small mean square error and very small variance are suggested as alternatives to the Ordinary Least Squares estimator. Several approaches for handling the multicollinearity problem have been developed, such as Principal Components Regression, Partial Least Squares Regression and Ridge Regression.

Principal Components Regression (PCR) is a combination of principal components analysis (PCA) and ordinary least squares regression (OLS). Partial Least Squares Regression (PLSR) is an approach similar to PCR because one needs to construct components that can be used to reduce the number of variables.

Ridge Regression is a modified least squares method that allows biased estimators of the regression coefficients. This study will explore which method among Principal Component Regression, Partial Least Squares Regression, and Ridge Regression performs best as a method for handling the multicollinearity problem in regression analysis.

0.2 Purposes of the Thesis

The main purposes of this thesis are:
1. Studying the multicollinearity effect and diagnosis methods.
2. Reviewing the various remedial methods that are used to solve multicollinearity, and studying their properties and efficiency.
3. Comparing the remedial methods and their performance in different situations under different criteria.

0.3 Literature Review

Multicollinearity is an old subject; the first discussion of it began at the start of the last century. It was introduced by Fisher in 1934 when he studied time series involving many variables, and he showed that the presence of serious multicollinearity reduces the precision of the parameter estimates in a linear regression model.

A very useful method for dealing with multicollinearity is principal components analysis. This method of data analysis, described by Pearson (1901) and Hotelling (1933), concerns finding the best way to represent a sample of size n by using vectors with p variables (predictors), in such a manner that similar samples are represented by points

as close as possible. In order to find the principal components from a set of predictors, the method used is the analysis of eigenvalues and eigenvectors, which starts from a data representation using a symmetric matrix and transforms it.

In 1970, Hoerl and Kennard introduced the ridge regression method to handle multicollinearity. The ridge regression procedure is based on adding to the X'X matrix the identity matrix I multiplied by a positive scalar parameter k. The procedure can be used in ill-conditioned situations where correlations between the various predictors in the model cause the X'X matrix to be close to singular. In particular, we can obtain a point estimate with a smaller mean square error. Hoerl and Kennard (1970) suggested that, in order to control the inflation and general instability associated with the least squares estimates, one can use

β̂(k) = (X'X + kI)^(-1) X'Y,  k ≥ 0

The ridge estimator, though biased, has lower mean square error than the best linear unbiased estimator.

Partial least squares regression was introduced as an algorithm in the early 1980s. PLS regression is a recent technique that generalizes and combines features from principal component analysis and multiple regression. It is particularly useful when we need to predict a set of dependent variables from a very large set of independent variables. It originated in the social sciences (specifically economics, Herman Wold 1966) but became popular first in chemometrics, due in part to Herman's son Svante (Martens & Naes, 1989). However, PLS regression is also becoming a tool of choice in the social sciences as a multivariate technique for non-experimental and experimental data alike (e.g., neuroimaging, see McIntosh et al., 1996). It was first presented as an

algorithm akin to the power method (used for computing eigenvectors) but was rapidly interpreted in a statistical framework.

Many comparisons have been made among these methods. McDonald and Galarneau (1975) evaluated the performance of some ridge estimators by simulation experiments. Wichern and Churchill (1978) made another evaluation by simulation of the relative performance of several estimators. Gibbons (1981) made a comparison between some ridge estimators using Monte Carlo methods. Kejian (1993) proposed a new class of biased estimators that combines the advantages of the ridge and Stein estimators; this estimator is called the Liu estimator. Sakallioglu et al. (2001) compared the Liu estimator with the ridge estimator and found that the ridge and Liu estimators perform equally well. Al-Sayf (2002) compared some ridge estimators with the least squares method using a simulation technique and found that the ridge regression method is better. Alht and Qasab (2006) compared the Least Squares and Latent Roots Regression methods and found that the latent roots method is better if the number of variables is 3, 4, or 5, and worse if the number of variables is more than 5. Qwaydr (2007) made a comparison of the Unbiased Ridge Regression method with the Least Squares and Latent Roots Regression methods using a simulation technique and found that the Unbiased Ridge method is the best; it was also found that the Latent Roots method is better than the Least Squares method when the relation between the variables is high.

Ahmad (2007) compared the Unbiased Ridge Regression method with principal component methods using a simulation technique and found that the Unbiased Ridge Regression method is best. Al-Hassan (2007) found that ridge estimators are better than principal components estimators. Abu Al-Aesh (2007) made a comparison between principal component regression, using Kaiser-Guttman, and unbiased ridge regression methods, depending on data generated by the Monte Carlo method; he found that the ridge regression method is the best one.

Chapter One
The Linear Regression Model and Multicollinearity

1.1 Introduction

Multiple regression is a versatile statistical tool that is useful in many fields. It involves determining how a set of predictor variables is related to a response variable. As in all statistical analysis, the relationship between the response variable (usually denoted by Y) and a set of p predictor variables (X1, X2, ..., Xp) can be approximated by the regression model

Y = f(X1, X2, ..., Xp) + ε

where ε is the error. One of the most used functions f(X1, X2, ..., Xp) is the multiple linear regression

f(X1, X2, ..., Xp) = β0 + β1·X1 + β2·X2 + ... + βp·Xp

where β0, β1, β2, ..., βp are called the regression parameters or coefficients. Since these parameters are unknown, our goal in multiple linear regression is to estimate the values of these unknown parameters. However, in multiple regression certain assumptions must be met. Multiple regression assumes that each predictor variable contributes at least some unique information. This assumption is sometimes violated when some of the predictor variables are highly correlated; this problem is called multicollinearity. The discussion of simple linear regression and multiple linear regression is given in Sections 1.2 and 1.3. Multicollinearity is described in Section 1.4, and Section 1.5 discusses a practical example to explain multicollinearity.

1.2 The Simple Linear Regression Model

Linear regression is an approach for modeling the relationship between a dependent variable Y and one or more explanatory variables (or independent variables) denoted by X1, X2, ..., Xp. There are two cases of the linear regression model: simple, with one explanatory variable, or multiple, with more than one explanatory variable. The simple linear regression model is

Y_i = β0 + β1·X_i + ε_i,  i = 1, 2, 3, ..., n   (1.1)

where
Y: dependent variable
X: independent variable
β0, β1: regression coefficients (constants)
ε: the error

The simple linear regression model is used to find the best-fitting line that represents the relationship between Y and X, as shown in Figure (1.1).

Figure (1.1): The fitted line that represents the relation between X and Y

The most common criterion used to determine the best-fitting line is the least squares method, which finds the line that minimizes the sum of squared errors. This line does not need to go through any of the actual data points, and it can have a different number of points above it and below it.
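The least squares fit of model (1.1) can be sketched in a few lines of NumPy (this illustration is not part of the thesis; the data are made up, and the closed-form slope/intercept formulas are the standard ones implied by minimizing the sum of squared errors):

```python
import numpy as np

def simple_ols(x, y):
    """Least squares fit of the line y = b0 + b1*x (model (1.1))."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # Slope: Sxy / Sxx; intercept: forces the line through (x-bar, y-bar).
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Data lying exactly on y = 2 + 3x, so the fit should recover b0 = 2, b1 = 3.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x
b0, b1 = simple_ols(x, y)
print(round(b0, 6), round(b1, 6))  # 2.0 3.0
```

With noisy data the fitted line no longer passes through every point, but the same two formulas still minimize the sum of squared errors.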

1.3 The Multiple Linear Regression Model

As mentioned before, the multiple linear regression model involves more than one explanatory variable to describe the relationship between the response and explanatory variables. Formally, the multiple linear regression model, given n observations and p variables, is

Y_i = β0 + β1·X_i1 + β2·X_i2 + ... + βp·X_ip + ε_i,  i = 1, 2, 3, ..., n   (1.2)
    = β0 + Σ_{j=1}^{p} βj·X_ij + ε_i

Using matrix notation, we can write the MLR model as

Y = Xβ + ε   (1.3)

where Y = (y1, y2, ..., yn)' is the (n × 1) vector of responses; X is the (n × (p+1)) matrix of explanatory variables, whose ith row is (1, X_i1, X_i2, ..., X_ip); β = (β0, β1, ..., βp)' is the ((p+1) × 1) vector of regression coefficients; and ε = (ε1, ..., εn)' is the (n × 1) vector of random errors.

1.3.1 The Standardized Regression Model

Standardization is recommended when regression models are being built. When there are predictors with different units and ranges, the final model will have coefficients

which are very small for some predictors, which makes the model difficult to interpret. Furthermore, centering and scaling improve the numerical stability of the computations. Standardization involves two transformations of the data:

(i) The centering transformation subtracts the mean value of the sample from all observations, so that after this transformation the observations have a mean value of zero.

(ii) The scaling transformation divides the value of the predictor for each observation by the scale of the sample, so that the transformed values have a common scale.

If we consider the model (1.2), we standardize every X_j as

X*_ij = (X_ij − X̄_j) / s_j,  s_j = sqrt( Σ_{i=1}^{n} (X_ij − X̄_j)² ),  i = 1, ..., n; j = 1, ..., p   (1.4)

where X̄_j is the mean of the jth column of X. The regression model in the transformed variables Y* and X*, as defined by the standardized transformation, is obtained as follows:

Y_i = β0 + β1(X̄_1 + s_1·X*_i1) + ... + βp(X̄_p + s_p·X*_ip) + ε_i   (1.5)

but

Ȳ = β0 + β1·X̄_1 + ... + βp·X̄_p   (1.6)

so

Y_i − Ȳ = β1·s_1·X*_i1 + ... + βp·s_p·X*_ip + ε_i   (1.7)

This concludes that

(Y_i − Ȳ)/s_y = (β1·s_1/s_y)·X*_i1 + ... + (βp·s_p/s_y)·X*_ip + ε_i/s_y

Let Y*_i = (Y_i − Ȳ)/s_y, β*_j = (s_j/s_y)·βj and ε*_i = ε_i/s_y; then

Y* = X*β* + ε*   (1.8)

where X*, Y* and β* are the matrices of the standardized variables.

1.3.2 Assumptions of the Multiple Linear Regression Model

1. The expected value of the error is zero, that is, E(ε) = 0.
2. Constant variance (a.k.a. homoscedasticity): the different response variables have the same variance in their errors.
3. Independence of errors: the errors of the response variables are uncorrelated with each other.
4. The random errors are assumed to be normally distributed with mean 0 and constant variance.
5. The sample size should be greater than the number of regression coefficients: n > p + 1.
6. The explanatory variables are orthogonal, meaning there is no relationship between the explanatory variables.

1.3.3 The Least Squares Estimation of Regression Parameters

The least squares estimator, derived by minimizing the sum of squares of the error terms, is given by

β̂_LS = (X'X)^(-1) X'Y   (1.9)

The least squares estimator has the following properties:

1. Linearity: by the definition of β̂_LS in (1.9), β̂_LS is a linear function of Y.
2. The least squares estimator is unbiased: E(β̂_LS) = β.
3. The variance-covariance matrix of the least squares estimator is Var(β̂_LS) = σ²(X'X)^(-1).
4. Minimum mean square error: the matrix of the mean squared error is arguably the most important criterion used to evaluate the performance of a predictor or an estimator. The mean squared error is also useful for relating the concepts of bias, precision, and accuracy in statistical estimation. The mean square error (MSE) of an estimator θ̂ of a parameter θ is defined by E(θ̂ − θ)², and in the multivariate case, when θ is a parameter vector, the MSE function is defined by

MSE(θ̂) = trace[Var(θ̂)] + [bias(θ̂)]'[bias(θ̂)]   (1.10)
        = Σ_{j=1}^{p+1} Var(θ̂_j) + Σ_{j=1}^{p+1} [bias(θ̂_j)]²   (1.11)

where trace[Var(θ̂)] is the sum of the elements on the main diagonal of the Var(θ̂) matrix. Therefore, since β̂_LS is unbiased,

MSE(β̂_LS) = trace[Var(β̂_LS)] = σ² trace[(X'X)^(-1)] = σ² Σ_j 1/λ_j   (1.12)

where λ_j are the eigenvalues of X'X.

1.4 Multicollinearity

Multicollinearity in general occurs when there are high correlations between two or more predictor variables. The presence of serious multicollinearity reduces the precision of the parameter estimates in a linear regression model. In the case of perfect

multicollinearity the predictor matrix is singular and therefore cannot be inverted; under perfect multicollinearity, for a general linear model Y = Xβ + ε, the ordinary least squares estimator β̂ = (X'X)^(-1)X'Y does not exist.

Types of Multicollinearity

Two types of multicollinearity exist:

a- Perfect multicollinearity: We have perfect multicollinearity if the correlation between two predictor variables is equal to 1 or −1. Mathematically, a set of variables is perfectly multicollinear if there exist one or more exact linear relationships among some of the variables. For example, we may have

α0 + α1·X_1i + α2·X_2i + ... + αk·X_ki = 0   (1.13)

holding for all observations i, where the αj are constants and X_ji is the ith observation on the jth explanatory variable. Consider the following example. Suppose the model

Y_i = β0 + β1·X_1i + β2·X_2i + ε_i   (1.14)

is to be estimated, and suppose that in the sample it happens to be the case that

X_2i = 3 + 4·X_1i   (1.15)

Assume that there is no error term in this equation. Then the correlation between X1 and X2 is 1.0. This is an example of a model having perfect multicollinearity.

b- High multicollinearity: This occurs when there are strong (but not perfect) linear relationships among the independent variables.

High multicollinearity occurs when two variables have a correlation that is close to 1 or −1. When there is high but imperfect multicollinearity, a solution is still possible, but as the independent variables increase in correlation with each other, the standard errors of the regression coefficients become inflated (Younger, 1979).

The Sources of Multicollinearity

There are several sources of multicollinearity:

1. The data collection method employed; for example, sampling over a limited range of the values taken by the regressors in the population.
2. Constraints on the model or on the population being sampled. For example, in the regression of electricity consumption on income (X2) and house size (X3) there is a physical constraint in the population, in that families with higher incomes generally have larger homes than families with lower incomes.
3. Model specification; for example, adding polynomial terms to a regression model, especially when the range of the X variable is small.
4. An overdetermined model. This happens when the model has more explanatory variables than the number of observations. This could occur in medical research, where there may be a small number of patients about whom information is collected on a large number of variables.

The Consequences of Multicollinearity

In the case of perfect multicollinearity among the explanatory variables, the regression coefficients of the X variables are indeterminate and their standard errors are infinite. If multicollinearity is high, one is likely to encounter the following consequences.

1. The OLS estimators have large variances and covariances, making precise estimation difficult. To understand the effects of multicollinearity, consider a model with two predictor variables,

Y*_i = β*1·X*_i1 + β*2·X*_i2 + ε*_i   (1.16)

The normal equations are given by

[1  r; r  1] [β̂*1; β̂*2] = [r_1y; r_2y]   (1.17)

where r denotes the correlation between X1 and X2, and r_1y and r_2y denote the correlations of X1 and X2 with Y. From Eq. (1.17) we get

β̂*1 = (r_1y − r·r_2y) / (1 − r²)   (1.18)
β̂*2 = (r_2y − r·r_1y) / (1 − r²)   (1.19)

If the predictors are highly correlated, that is |r| → 1, then the divisor 1 − r², the determinant of (X*'X*), is close to zero. The variances of the estimates,

Var(β̂*1) = Var(β̂*2) = σ*² / (1 − r²)   (1.20)

taken from the diagonal of Var(β̂*) = σ*²(X*'X*)^(-1)   (1.21)

are inflated by this divisor. The estimates are highly correlated, with corr(β̂*1, β̂*2) = −r. If r is negative, the estimates are approximately equal with inflated magnitude; if r is positive, the estimates are approximately equal in magnitude but opposite in sign, with the magnitude being inflated by the divisor.
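The inflation in Eq. (1.20) is easy to see numerically: the (1,1) entry of the inverse of the 2 × 2 correlation matrix in (1.17) is exactly 1/(1 − r²). A minimal check (not from the thesis; the r values are arbitrary):

```python
import numpy as np

# Diagonal of (X*'X*)^{-1} for two standardized predictors with correlation r.
# By Eq. (1.20) it equals 1/(1 - r^2), so it blows up as |r| -> 1.
variances = []
for r in (0.0, 0.9, 0.99):
    XtX = np.array([[1.0, r], [r, 1.0]])   # correlation matrix of the two predictors
    variances.append(np.linalg.inv(XtX)[0, 0])
print([round(v, 4) for v in variances])  # [1.0, 5.2632, 50.2513]
```

Going from r = 0 to r = 0.99 multiplies the coefficient variance by a factor of about fifty, which is exactly the "wide confidence intervals" consequence listed next.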

25 15. Because of consequence 1, the confdence ntervals tend to be much wder, leadng to the acceptance of the zero null hypothess more readly. 3. Also because of consequence 1, the t rato of one or more coeffcents tends to be statstcally nsgnfcant. 4. Although the t rato of one or more coeffcents s statstcally nsgnfcant, R, the overall measure of goodness of ft can be very hgh. 5. he OLS estmators & ther standard error can be senstve to small changes n the data Method of Detectng Multcollnearty a) Correlaton among predctors We can fnd the correlaton among the predctor varables, f the absolute value of correlaton between two predctor varables s more than 0.8 or.0.9, then there s a serous multcollnearty. b) Determnaton of * * ( X X ) * * If the regressor varables are standardzed, so ( X X ) contans elements that are the smple correlaton coeffcents between the regressors. he determnant of * * ( ) X X falls between 0 and 1, f det * * ( ) X X = 1, then the columns * * of X are orthogonal, f det ( X X ) close to 0 that means we have hgh multcollnearty but f det * * ( ) X X = 0,then we have one or more exact lnear dependences ext that s means we have perfect multcollnearty.( Montgomery and Peck,198) c) Egenvalues and Condtonal Index Here we dscuss the method of Egenvalue and condtonal ndex to detect the multcollnearty. Frst, we have to calculate the data matrx. hen usng * * X X I 0,we get the values of λ whch s Egen value. Now we have

condition index

CI = (maximum eigenvalue) / (minimum eigenvalue)   (1.25)

After calculating CI, according to Montgomery and Peck (1982), if CI lies between 10 and 30 there is moderate multicollinearity, and if CI exceeds 30 there is severe multicollinearity.

d) Variance Inflation Factor (VIF)
We can calculate the VIF value for each X_i in three steps:

1. First we run a least squares regression of X_i on all the other explanatory variables; the equation would be
X_i = α0 + α1·X_1 + ... + α_{i−1}·X_{i−1} + α_{i+1}·X_{i+1} + ... + αp·X_p + ε

2. Calculate the VIF factor for X_i as

VIF_i = 1 / (1 − R_i²)   (1.26)

where R_i² is the coefficient of determination of the regression in step one.

3. Analyze the magnitude of multicollinearity by considering the size of VIF_i. There is no formal cutoff value to use with the VIF for determining the presence of multicollinearity, but Neter et al. (1996) recommended looking at the largest VIF value: a value greater than 10 is often used as an indication of a potential multicollinearity problem.

1.5 Example

Table 1.1 supplies the data; these data describe 30 detergent cases. The goal was to develop a linear equation that relates the percentage concentration of five important components to a response that measures the amount of stain removed by the detergent during the washing process.
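The two diagnostics above, the VIF of Eq. (1.26) and the condition index of Eq. (1.25), can be sketched together in NumPy (not from the thesis; the near-collinear test data are synthetic, and the eigenvalues are taken from the unit-length-scaled cross-product matrix as in transformation (1.4)):

```python
import numpy as np

def vif_and_ci(X):
    """VIF of each column (Eq. (1.26)) and the condition index (Eq. (1.25))."""
    X = np.asarray(X, float)
    n, p = X.shape
    vifs = []
    for i in range(p):
        # Step 1: regress X_i on all the other explanatory variables.
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, i], rcond=None)
        resid = X[:, i] - others @ coef
        r2 = 1.0 - resid.var() / X[:, i].var()   # R_i^2 of that regression
        vifs.append(1.0 / (1.0 - r2))            # Step 2: VIF_i = 1/(1 - R_i^2)
    # Eigenvalues of the standardized cross-product matrix X*'X*.
    Xc = X - X.mean(axis=0)
    Z = Xc / np.sqrt((Xc ** 2).sum(axis=0))
    eig = np.linalg.eigvalsh(Z.T @ Z)
    return np.array(vifs), eig.max() / eig.min()

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = x1 + 0.05 * rng.normal(size=50)   # nearly collinear with x1
x3 = rng.normal(size=50)               # unrelated to the others
vifs, ci = vif_and_ci(np.column_stack([x1, x2, x3]))
print(vifs[0] > 10, vifs[1] > 10, vifs[2] < 2, ci > 30)
```

Both cutoffs fire on the collinear pair (VIF above 10, CI above 30) while the independent third variable keeps a VIF near 1, matching the Neter et al. and Montgomery-Peck rules quoted above.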

Table 1.1: The data of detergent (ICM chemical magazine); columns Y, X1, X2, X3, X4, X5 (30 observations; values omitted)

From the VIF values shown in Table 1.2, the first four variables included are possibly multicollinear because they exceed the cutoff value of VIF used in determining whether collinearity is a problem. X1 appears to have strong multicollinearity problems, followed by X2, X3 and X4.

Table 1.2: The VIF values of the detergent data (VARIABLE, VIF columns for X1-X5; values omitted)

The correlation matrix gives another method of diagnosing multicollinearity, showing the same result as the VIF table.

Table 1.3: The eigenvalues of the detergent data (λ1-λ5; values omitted)

From Table (1.3) of eigenvalues, with the maximum and the minimum values in bold, the CI equals 35, which means that we have a serious multicollinearity problem. We can then compute the least squares coefficient estimates β̂, as Table 1.4 shows.

Table 1.4: Least squares coefficient estimator values (β0-β5; values omitted)

The final estimated least squares regression model for the detergent data is of the form

Ŷ = β̂0 + β̂1·x1 + β̂2·x2 + β̂3·x3 + β̂4·x4 + β̂5·x5 (numeric coefficients omitted)

Figure (1.2) shows a scatter plot matrix describing the relation between each pair of variables. The correlation between X1, X2, X3 and X4 is high and in the positive direction, which means we have a multicollinearity problem. The scatter plot matrix also shows the relationship between Y and each of X1, X2, X3 and X4 in the positive direction; this relationship should be reflected in positive signs for β1, β2, β3 and β4, but the least squares estimates of β1 and β3 have negative signs, which results from the multicollinearity effect.

Figure (1.2): Scatter plot matrix for the detergent data (variables Y, X1-X5)

Chapter Two
The Remedial Measures of Multicollinearity

2.1 Introduction

Several methods have been developed to overcome the deficiencies caused by multicollinearity. The current study explores the PLSR, PCR and RR methods for handling multicollinearity; hence these methods are discussed in this chapter. The Ridge Regression method is discussed in Section 2.2, followed by the PCR method in Section 2.3 and the PLS regression method in Section 2.4. In each section the algorithm of the method is described, and a numerical example using the detergent data set is used to illustrate the application of the method.

2.2 Ridge Regression

The ridge regression method was suggested by Hoerl and Kennard (1970) as an alternative procedure to the OLS method, especially when multicollinearity exists. The ridge technique is based on adding a biasing constant K to the diagonal of the (X*'X*) matrix before computing the estimator, using the method of Hoerl and Kennard (1976). Therefore, the ridge solution is given by

β̂*_RR = (X*'X* + KI)^(-1) X*'Y*,  K ≥ 0   (2.1)

where K is the ridge parameter and I is the identity matrix. Note that if K = 0, the ridge estimator reduces to the OLS estimator. If all K's are the same, the resulting estimators are called the ordinary ridge estimators.

2.2.1 The Properties of the Ridge Regression Estimator

The ridge regression estimator has several properties, which can be summarized as follows:

1. The ridge regression estimator is a biased estimator, but it reduces the variance of the estimates:

E[β̂*_RR] = E[(X*'X* + KI)^(-1) X*'Y*]
         = E[(X*'X*(I + K(X*'X*)^(-1)))^(-1) X*'Y*]
         = E[(I + K(X*'X*)^(-1))^(-1) (X*'X*)^(-1) X*'Y*]
         = (I + K(X*'X*)^(-1))^(-1) β*   (2.2)

2. The variance of the ridge regression estimates is

Var(β̂*_RR) = σ² (X*'X* + KI)^(-1) (X*'X*) (X*'X* + KI)^(-1)   (2.3)

To express the variance of the ridge regression estimators through eigenvalues and eigenvectors, let V be the matrix of eigenvectors of (X*'X*), so that V'(X*'X*)V = D = diag(λ1, λ2, ..., λp); then V also diagonalizes (X*'X*)^(-1), (X*'X* + KI)^(-1) and (X*'X* + KI), and Eq. (2.3) can be rewritten as

trace[Var(β̂*_RR)] = σ² trace[(X*'X* + KI)^(-1)(X*'X*)(X*'X* + KI)^(-1)]
                  = σ² trace[V(D + KI)^(-2)V' · VDV']
                  = σ² trace[(D + KI)^(-2) D]
                  = σ² Σ_{i=1}^{p} λ_i / (λ_i + K)²   (2.4)

3. The bias of the ridge regression estimates is

bias[β̂*_RR] = E[β̂*_RR] − β* = [(I + K(X*'X*)^(-1))^(-1) − I] β*

            = −K (X*'X* + KI)^(-1) β*   (2.5)

To express the bias of the ridge regression estimators through eigenvalues and eigenvectors, Eq. (2.5) can be rewritten as

[bias(β̂*_RR)]'[bias(β̂*_RR)] = K² β*'(X*'X* + KI)^(-2) β*
                             = K² β*'V(D + KI)^(-2)V' β*

Now let α = V'β*; then

[bias(β̂*_RR)]'[bias(β̂*_RR)] = K² Σ_{i=1}^{p} α_i² / (λ_i + K)²   (2.6)

4. The mean squared error of the ridge regression estimator, from Eq. (2.4) and Eq. (2.6), is

MSE(β̂*_RR) = σ² Σ_{i=1}^{p} λ_i / (λ_i + K)² + K² Σ_{i=1}^{p} α_i² / (λ_i + K)²   (2.7)

2.2.2 The Choice of the Ridge Parameter (K)

Below, several methods of selecting the ridge parameter are introduced:

1. Hoerl and Kennard's method: Hoerl and Kennard (1970) proposed estimating the ridge parameter as

K_HK = σ̂² / max(α̂_i²)   (2.8)

2. Hoerl, Kennard and Baldwin's method: Hoerl, Kennard and Baldwin (1975) proposed another estimator of the ridge parameter:

K_HKB = p·σ̂² / Σ_{i=1}^{p} α̂_i²   (2.9)

3. Lawless and Wang's method: Lawless and Wang (1976) proposed the following ridge parameter estimator:

K_LW = p·σ̂² / Σ_{i=1}^{p} λ_i·α̂_i²   (2.10)

4. Hocking, Speed and Lynn's method: Hocking, Speed and Lynn (1976) estimated the ridge parameter as

K_HSL = σ̂² Σ_{i=1}^{p} (λ_i·α̂_i)² / ( Σ_{i=1}^{p} λ_i·α̂_i² )²   (2.11)

5. Kibria (2003) proposed using the arithmetic mean, the geometric mean and the median as ridge parameters:

K_AM = (1/p) Σ_{i=1}^{p} σ̂²/α̂_i²   (2.12)
K_GM = σ̂² / ( Π_{i=1}^{p} α̂_i² )^(1/p)   (2.13)
K_MED = median( σ̂²/α̂_i² )   (2.14)

2.2.3 The Ridge Regression Steps and an Applied Example

Figure (2.1) shows the steps for computing the ridge regression estimators, where the four methods of choosing the parameter K used in this thesis are the Hoerl, Kennard and Baldwin method, the arithmetic mean, the geometric mean and the median.

Step 1: Center and scale the data:
x*_ij = (x_ij − x̄_j) / sqrt( Σ_{i=1}^{n} (x_ij − x̄_j)² )

Step 2: Compute σ̂*², β̂*_LS and α̂* for the standardized data:
σ̂*² = Σ (Y* − Ŷ*)² / (n − p)
β̂*_LS = (X*'X*)^(-1) X*'Y*
α̂* = V'β̂*_LS

Step 3: Compute the K values for the ridge estimators:
K_HKB = p·σ̂*² / Σ α̂*_i²
K_AM = (1/p) Σ σ̂*²/α̂*_i²
K_GM = σ̂*² / ( Π α̂*_i² )^(1/p)
K_MED = median( σ̂*²/α̂*_i² )

Step 4: Compute the coefficient estimators for the standardized data for every K:
β̂*_RR = (X*'X* + KI)^(-1) X*'Y*

Step 5: Compute the coefficients of the original variables:
β̂_j,RR = (s_y/s_xj)·β̂*_j,RR,  β̂_0,RR = Ȳ − Σ_{j=1}^{p} β̂_j,RR·X̄_j

Figure 2.1: The steps in Ridge Regression for the four chosen K methods
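Steps 1, 4 and 5 of Figure 2.1 (standardize, solve the ridge system, back-transform) can be sketched as a small NumPy function. This is an illustration rather than the thesis's own code; the test data are synthetic, and as a sanity check it uses the remark after Eq. (2.1) that K = 0 reproduces OLS:

```python
import numpy as np

def ridge_fit(X, y, K):
    """Ridge estimates via centering/scaling (Figure 2.1, steps 1, 4, 5)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    xm, ym = X.mean(axis=0), y.mean()
    sx = np.sqrt(((X - xm) ** 2).sum(axis=0))   # unit-length scaling (step 1)
    sy = np.sqrt(((y - ym) ** 2).sum())
    Xs, ys = (X - xm) / sx, (y - ym) / sy
    # Step 4: ridge solution on the standardized data, Eq. (2.1).
    b_std = np.linalg.solve(Xs.T @ Xs + K * np.eye(X.shape[1]), Xs.T @ ys)
    # Step 5: back-transform to the original scale and recover the intercept.
    b = b_std * sy / sx
    return ym - b @ xm, b

rng = np.random.default_rng(2)
x1 = rng.normal(size=40)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=40)])  # nearly collinear
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=40)

# With K = 0 the ridge estimator reduces to the OLS estimator.
b0, b = ridge_fit(X, y, K=0.0)
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(40), X]), y, rcond=None)
print(np.allclose(b0, coef[0]) and np.allclose(b, coef[1:]))  # True
```

For K > 0 the same function shrinks the standardized coefficients toward zero, trading the bias of Eq. (2.6) for the variance reduction of Eq. (2.4).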

Applying the RR algorithm to the example of the detergent data in Table 1.1, Table 2.1 presents the K values and Table 2.2 the ridge coefficient values for the four estimators.

Table 2.1: K values for the R.R. estimators (K_HKB, K_AM, K_GM, K_MED; values omitted)

After computing each K value for each estimator, we can compute each β̂ as illustrated in the algorithm; Table 2.2 shows the R.R. coefficient values for the four estimators.

Table 2.2: R.R. coefficient values for the four methods (β0-β5 for HKB, AM, GM, MED; values omitted)

The final estimated ridge regression models for the detergent data are given as follows:

For K_HKB: Ŷ = β̂0 + β̂1·x1 + ... + β̂5·x5 (numeric coefficients omitted)

For K_AM, K_GM and K_MED: the corresponding fitted models are obtained in the same way (numeric coefficients omitted).

2.3 The Principal Components Regression Method

In this section the principal components regression method is discussed. Since it depends on principal component analysis, we discuss that first.

2.3.1 Principal Component Analysis (PCA)

PCA was invented in 1901 by Karl Pearson and further developed by Hotelling (1933); comprehensive surveys of the field have been given by Jolliffe (1986), Jackson (1991) and Basilevsky (1994), who introduced the idea of population components. It is a multivariate technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components; adopted in an exploratory way, it can help in explaining and understanding the interrelationships between variables. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the

data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. The principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric. PCA is sensitive to the relative scaling of the original variables.

The goals of PCA are to:
(a) extract the most important information from the data table;
(b) compress the size of the data set by keeping only this important information;
(c) simplify the description of the data set;
(d) analyze its structure.

In order to achieve these goals, the PCA procedure is as follows. Suppose that we have a random vector X^*, where

X^* = (X^*_1, X^*_2, \ldots, X^*_p)'    (2.15)

with population variance-covariance matrix

\operatorname{Var}(X^*) = \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_p^2 \end{pmatrix}    (2.16)

Consider the linear combinations

Z_1 = v_{11}X^*_1 + v_{12}X^*_2 + \cdots + v_{1p}X^*_p
Z_2 = v_{21}X^*_1 + v_{22}X^*_2 + \cdots + v_{2p}X^*_p    (2.17)
\vdots
Z_p = v_{p1}X^*_1 + v_{p2}X^*_2 + \cdots + v_{pp}X^*_p

Each of these can be thought of as a linear regression, predicting Z_i from X_1, X_2, \ldots, X_p, with no intercept. To find the i-th principal component, the coefficients v_{i1}, v_{i2}, \ldots, v_{ip} are selected to maximize

\operatorname{Var}(Z_i) = \sum_{k=1}^{p} \sum_{m=1}^{p} v_{ik} v_{im} \sigma_{km} = v_i' \Sigma v_i    (2.18)

Moreover, Z_i and Z_j have population covariance

\operatorname{cov}(Z_i, Z_j) = \sum_{k=1}^{p} \sum_{m=1}^{p} v_{ik} v_{jm} \sigma_{km} = v_i' \Sigma v_j    (2.19)

The maximization is subject to the constraint that the sums of squared coefficients add up to one,

\sum_{j=1}^{p} v_{ij}^2 = v_i' v_i = 1    (2.20)

along with the additional constraint that each new component is uncorrelated with all the previously defined components:

\operatorname{cov}(Z_1, Z_i) = v_1' \Sigma v_i = 0, \quad \operatorname{cov}(Z_2, Z_i) = v_2' \Sigma v_i = 0, \quad \ldots, \quad \operatorname{cov}(Z_{i-1}, Z_i) = v_{i-1}' \Sigma v_i = 0    (2.21)
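A brief numerical illustration of (2.17)-(2.21), on simulated data rather than the thesis's detergent data: the v_i are the eigenvectors of the sample covariance matrix, the component variances are its eigenvalues, and the resulting scores are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(1)
# simulated, correlated data: 200 observations of 3 variables
X = rng.normal(size=(200, 3)) @ np.array([[1.0, 0.8, 0.0],
                                          [0.0, 1.0, 0.5],
                                          [0.0, 0.0, 1.0]])
Xc = X - X.mean(axis=0)            # center the data
S = np.cov(Xc, rowvar=False)       # sample covariance matrix (3 x 3)
lam, V = np.linalg.eigh(S)         # eigenvalues (ascending); columns of V are the v_i
Z = Xc @ V                         # principal-component scores

# Var(Z_i) = lambda_i and cov(Z_i, Z_j) = 0 for i != j, as in (2.18)-(2.21)
cov_Z = np.cov(Z, rowvar=False)
```

Up to floating-point error, `cov_Z` is the diagonal matrix of eigenvalues, confirming that the components are uncorrelated and ordered by variance.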

2.3.2 The Principal Component Regression Method

The principal components regression (PCR) method is a regression analysis technique based on principal components analysis (PCA). In the PCR method, instead of regressing the dependent variable on the explanatory variables directly, the principal components of the explanatory variables are used as regressors, or only a subset of all the principal components is used in the regression. PC regression simply starts by using the PCs of the predictor variables in place of the predictor variables themselves. As the PCs are uncorrelated, there are no multicollinearities between them, and the regression calculations are also simplified. If all the PCs are included in the regression, then the resulting model is equivalent to that obtained by least squares, so the large variances caused by multicollinearities have not gone away; however, calculation of the least squares estimates via PC regression may be numerically more stable than direct calculation. If some of the PCs are deleted from the regression equation, estimators are obtained for the coefficients in the original regression equation. These estimators are usually biased, but they can simultaneously greatly reduce any large variances of the regression coefficient estimators caused by multicollinearities.

The PCR method may be broadly divided into three major steps:

1. Perform PCA on the observed data matrix of the explanatory variables to obtain the principal components, and then (usually) select, based on some appropriate criteria, a subset of the principal components so obtained for further use.

2. Regress the observed vector of outcomes on the selected principal components as covariates, using ordinary least squares regression (linear regression), to get a vector of

estimated regression coefficients (with dimension equal to the number of selected principal components).

3. Transform this vector back to the scale of the actual covariates, using the selected PCA loadings (the eigenvectors corresponding to the selected principal components), to get the final PCR estimator (with dimension equal to the total number of covariates) for estimating the regression coefficients characterizing the original model.

2.3.3 Principal Component Regression Estimators

Consider the standard regression model, as defined in equation (1.4), that is,

Y^* = X^*\beta^* + \varepsilon^*

The values of the PCs for each observation are given by

Z = X^*A    (2.22)

where the (i,k)-th element of Z is the value (score) of the k-th PC for the i-th observation, and A = (A_1, A_2, \ldots, A_p) is the matrix whose k-th column A_k is the k-th eigenvector of X^{*'}X^*, with A'A = I. Because A is orthogonal, X^*\beta^* can be rewritten as follows:

X^*\beta^* = X^*AA'\beta^* = Z\alpha    (2.23)

where \alpha = A'\beta^*. Equation (1.4) can therefore be written as follows:

Y^* = X^*\beta^* + \varepsilon^* = X^*AA'\beta^* + \varepsilon^* = Z\alpha + \varepsilon^*    (2.24)

which simply replaces the predictor variables by their PCs in the regression model. Principal component regression applies the least squares method to predict the value of Y from the principal components Z, that is,

\hat{\alpha} = (Z'Z)^{-1} Z'Y    (2.25)

We can compute the variance by

\operatorname{Var}(\hat{\alpha}) = \sigma^{*2}(Z'Z)^{-1} = \sigma^{*2}(A'X^{*'}X^*A)^{-1}    (2.26)

\operatorname{Var}(\hat{\alpha}) = \sigma^{*2} D^{-1}, \qquad D = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)    (2.27)

Then

\operatorname{Var}(\hat{\alpha}_j) = \sigma^{*2} / \lambda_j, \qquad j = 1, 2, \ldots, p    (2.28)

Moreover, the total variance is given by

\sum_{j=1}^{p} \operatorname{Var}(\hat{\alpha}_j) = \sigma^{*2} \sum_{j=1}^{p} \frac{1}{\lambda_j}    (2.29)

Now if one of the eigenvalues is very close to zero then \operatorname{Var}(\hat{\alpha}_j) will be large and hence the total variance will be large, so we can delete the principal component corresponding to the eigenvalue which is close to zero.
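To make the effect of a near-zero eigenvalue concrete, here is a small arithmetic illustration of (2.28)-(2.29); the eigenvalues are made up for illustration, not taken from the detergent data.

```python
import numpy as np

lam = np.array([2.5, 1.2, 0.8, 0.45, 0.05])   # illustrative eigenvalues; lam[4] near zero
sigma2 = 1.0
var_alpha = sigma2 / lam                       # Var(alpha_j) = sigma^2 / lambda_j, eq. (2.28)
total = var_alpha.sum()                        # total variance, eq. (2.29)
share = var_alpha[-1] / total                  # contribution of the near-zero eigenvalue
# the single smallest eigenvalue contributes about 81% of the total variance,
# which is why its component is the natural candidate for deletion in PCR
```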

The reduced model can be defined as

Y^* = Z_m \alpha_m + \varepsilon_m    (2.30)

where \alpha_m is a vector of m elements that are a subset of the elements of \alpha, Z_m is an (n \times m) matrix whose columns are a subset of the columns of Z, and \varepsilon_m is the appropriate error term. Using the least squares method to estimate \alpha_m in (2.30) and then finding an estimator for \beta^* gives

\hat{\beta}^*_{pc} = A_m \hat{\alpha}_m    (2.31)

where A_m is the matrix of eigenvectors of X^{*'}X^* except those whose eigenvalues are close to zero.

2.3.4 Properties of PCR Estimators

This section discusses the properties of the PCR estimators.

1- The PCR estimators are biased

To find \operatorname{bias}(\hat{\beta}^*_{pc}), we divide the principal components into a first subset, used to find \hat{\beta}^*_{pc}, of m components, and a second subset, which is deleted, of d components, so that d = p - m. We can partition the matrix A as A = (A_m : A_d), where A_d is a (p \times (p - m)) orthogonal matrix whose columns are the eigenvectors associated with \lambda_j, j = m + 1, \ldots, p. Since A is orthogonal, then

A_m' A_m = I_m, \qquad A_d' A_d = I_d    (2.32)

A_m A_m' = I - A_d A_d'    (2.33)

Now from Eq. (2.31),

\hat{\beta}^*_{pc} = A_m \hat{\alpha}_m = A_m (Z_m' Z_m)^{-1} Z_m' Y    (2.34)

= A_m (A_m' X^{*'} X^* A_m)^{-1} A_m' X^{*'} Y    (2.35)

= A_m D_m^{-1} A_m' X^{*'} Y    (2.36)

Substituting Y^* = X^*\beta^* + \varepsilon^*,

\hat{\beta}^*_{pc} = A_m D_m^{-1} A_m' X^{*'} (X^*\beta^* + \varepsilon^*)    (2.37)

Taking the expectation of both sides,

E(\hat{\beta}^*_{pc}) = E\left[ A_m D_m^{-1} A_m' X^{*'} (X^*\beta^* + \varepsilon^*) \right]    (2.38)

E(\hat{\beta}^*_{pc}) = E(A_m D_m^{-1} A_m' X^{*'} X^* \beta^*) + E(A_m D_m^{-1} A_m' X^{*'} \varepsilon^*)    (2.39)

Since E(\varepsilon^*) = 0, the second term vanishes,    (2.40)

E(\hat{\beta}^*_{pc}) = A_m D_m^{-1} A_m' X^{*'} X^* \beta^*    (2.41)

But X^{*'}X^* = A_m D_m A_m' + A_d D_d A_d'. This gives

E(\hat{\beta}^*_{pc}) = (A_m D_m^{-1} A_m' A_m D_m A_m' + A_m D_m^{-1} A_m' A_d D_d A_d') \beta^*    (2.42)

E(\hat{\beta}^*_{pc}) = A_m A_m' \beta^*    (2.43)

since A_m' A_m = I and A_m' A_d = 0. Therefore

E(\hat{\beta}^*_{pc}) = (I - A_d A_d') \beta^*    (2.44)

E(\hat{\beta}^*_{pc}) = \beta^* - A_d A_d' \beta^*    (2.45)

then

\operatorname{bias}(\hat{\beta}^*_{pc}) = E(\hat{\beta}^*_{pc}) - \beta^* = -A_d A_d' \beta^*    (2.46)

2- The Variance

\operatorname{Var}(\hat{\beta}^*_{pc}) = \operatorname{Var}(A_m \hat{\alpha}_m) = A_m [\operatorname{Var}(\hat{\alpha}_m)] A_m'    (2.47)

Substituting equation (2.28) in equation (2.47),

\operatorname{Var}(\hat{\beta}^*_{pc}) = \sigma^{*2} A_m D_m^{-1} A_m'    (2.48)

3- Mean Squared Error

Having found \operatorname{bias}(\hat{\beta}^*_{pc}) and \operatorname{Var}(\hat{\beta}^*_{pc}), we can use them to find the mean squared error of the PCR estimator as follows. From the definition of MSE in chapter one, we get

\operatorname{MSE}(\hat{\beta}^*_{pc}) = \operatorname{tr}[\operatorname{Var}(\hat{\beta}^*_{pc})] + [\operatorname{bias}(\hat{\beta}^*_{pc})]'[\operatorname{bias}(\hat{\beta}^*_{pc})]    (2.49)

Now we substitute \operatorname{bias}(\hat{\beta}^*_{pc}) and \operatorname{Var}(\hat{\beta}^*_{pc}):

\operatorname{MSE}(\hat{\beta}^*_{pc}) = \operatorname{tr}[\sigma^{*2} A_m D_m^{-1} A_m'] + [A_d A_d' \beta^*]'[A_d A_d' \beta^*]    (2.50)

\operatorname{MSE}(\hat{\beta}^*_{pc}) = \sigma^{*2} \operatorname{tr}[A_m D_m^{-1} A_m'] + \beta^{*'} A_d A_d' A_d A_d' \beta^*    (2.51)

Now since A_m' A_m = I and A_d' A_d = I, then

\operatorname{MSE}(\hat{\beta}^*_{pc}) = \sigma^{*2} \operatorname{tr}[A_m D_m^{-1} A_m'] + \beta^{*'} A_d A_d' \beta^*    (2.52)

Now

\operatorname{tr}[A_m D_m^{-1} A_m'] = \frac{1}{\lambda_1} A_1' A_1 + \frac{1}{\lambda_2} A_2' A_2 + \cdots + \frac{1}{\lambda_m} A_m' A_m    (2.53)

and because A is orthogonal, with A_i' A_i = 1, i = 1, 2, \ldots, m, then

\operatorname{tr}[A_m D_m^{-1} A_m'] = \sum_{i=1}^{m} \frac{1}{\lambda_i}    (2.54)

Also we know that

\beta^{*'} A_d A_d' \beta^* = (A_d' \beta^*)'(A_d' \beta^*)    (2.55)

= (A_{m+1}' \beta^*, \ldots, A_p' \beta^*)(A_{m+1}' \beta^*, \ldots, A_p' \beta^*)' = \sum_{i=m+1}^{p} (A_i' \beta^*)^2    (2.56)

Then we can find the mean squared error of the PCR estimator after substituting (2.56) and (2.54) in (2.52), to get the following result:

\operatorname{MSE}(\hat{\beta}^*_{pc}) = \sigma^{*2} \sum_{i=1}^{m} \frac{1}{\lambda_i} + \sum_{i=m+1}^{p} (A_i' \beta^*)^2    (2.57)

2.3.5 Rules for Selecting Variables in PCR

From Eq. (2.57) it can be seen that the maximal reduction of variance is achieved by deleting the principal components associated with the smallest eigenvalues, but it is also desirable to keep principal components with large coefficients to avoid a large bias. There are many approaches to satisfying this; the two most famous are as follows:

1- Proportions of Variation

This method selects only the PCs with a high contribution to the variance. Since each eigenvalue is the variance of its PC, we first compute the percent of total variance for each PC as

\tau_i = \lambda_i \Big/ \sum_{j=1}^{p} \lambda_j    (2.58)

and then select only those PCs whose percent is larger than (100/p)%, where p is the number of variables. This method is advocated by statisticians such as Jolliffe (1986) and Jackson (1991).

2- Kaiser-Guttman

This method, following Jackson (1993), depends on selecting the PCs whose eigenvalues are larger than the mean of the eigenvalues, defined as

\bar{\lambda} = \frac{1}{p} \sum_{i=1}^{p} \lambda_i    (2.59)

2.3.6 The Principal Components Steps and an Applied Example

Figure 2.3 shows the steps of the PCR algorithm. This algorithm is used in this study.

STEP 1: Scale and center the data:

X^*_{ij} = (X_{ij} - \bar{X}_j) / \sqrt{\sum_{i=1}^{n} (X_{ij} - \bar{X}_j)^2}

STEP 2: Compute the correlation matrix for the centered and scaled data, (X^{*'}X^*).

STEP 3: Compute the eigenvalues and eigenvectors of the components. The component associated with the smallest eigenvalue will be deleted.

STEP 4: Compute the components Z:

Z = X^*A

STEP 5: Compute the coefficient estimate for the components after deletion:

\hat{\alpha}_m = (Z_m' Z_m)^{-1} Z_m' Y^*

STEP 6: Transform the coefficient estimate back to the original standardized variables:

\hat{\beta}^*_{pc} = A_m \hat{\alpha}_m

STEP 7: Compute the coefficients of PCR for the original variables:

\hat{\beta}_{j,pc} = \hat{\beta}^*_{j,pc} \, s_Y / s_{x_j}, \qquad \hat{\beta}_{0,pc} = \bar{Y} - \sum_{j=1}^{p} \hat{\beta}_{j,pc} \bar{X}_j

Figure 2.3: The steps of the PCR algorithm.

By applying the PCR algorithm to the detergent-data example given in Table 1.1, the following Tables 2.3 and 2.4 contain the eigenvalues and the resulting eigenvectors, respectively.
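The seven steps of Figure 2.3 can be sketched as follows. This is an illustrative NumPy sketch under the scaling conventions above; the function name `pcr_fit` and the parameter `n_drop` are assumptions for the sketch, not the thesis's own code.

```python
import numpy as np

def pcr_fit(X, y, n_drop=1):
    """PCR following Steps 1-7 of Figure 2.3: drop the n_drop components with
    the smallest eigenvalues, then map the estimates back to the original scale."""
    n, p = X.shape
    # Step 1: center and scale X (unit column length) and y
    Xc = X - X.mean(axis=0)
    sx = np.sqrt((Xc ** 2).sum(axis=0))
    Xs = Xc / sx
    yc = y - y.mean()
    sy = np.sqrt((yc ** 2).sum())
    ys = yc / sy
    # Steps 2-3: eigen-decomposition of X*'X*; drop the smallest eigenvalues
    lam, A = np.linalg.eigh(Xs.T @ Xs)        # eigenvalues in ascending order
    keep = np.argsort(lam)[n_drop:]
    Am = A[:, keep]
    # Step 4: component scores for the retained components
    Zm = Xs @ Am
    # Step 5: least squares on the retained components
    alpha_m = np.linalg.solve(Zm.T @ Zm, Zm.T @ ys)
    # Step 6: back to standardized-variable coefficients
    beta_std = Am @ alpha_m
    # Step 7: back to the original variables
    beta = beta_std * sy / sx
    beta0 = y.mean() - X.mean(axis=0) @ beta
    return beta0, beta
```

With `n_drop=0` all components are retained, and the result coincides with ordinary least squares, as noted in Section 2.3.2.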

Table 2.3: Eigenvalues for the experiment data (λ1, λ2, λ3, λ4, λ5).

Table 2.4: Eigenvectors for the experiment data.

The Z matrix of principal components is found by Z = X^*A, where X^* is centered and scaled and A is the matrix of eigenvectors shown in Table 2.4. X^* is a (30 × 5) matrix without the column of ones. Therefore, the principal components are the columns of Z shown in Table 2.5.

Table 2.5: The principal components (columns PC1, PC2, PC3, PC4, PC5 of Z).

For the proportions-of-variation method of selecting a subset of components, the values of the variation percent in Table 2.6 show that only the PC1 and PC2 columns of Z have percentages larger than (100/p)% = (100/5)% = 20%, which means only the first two PCs have been selected.

Table 2.6: Values of the variation percent for PC1, PC2, PC3, PC4 and PC5.

Therefore, the application of principal components regression involves the removal of PC3, PC4 and PC5, with the response being regressed against the remaining components. This regression gives the coefficients of the Z components. Equation (2.14) can then be used to determine the estimates of the coefficients in terms of the centered and scaled regressors, giving \hat{\beta}^*_{1,pc}, \hat{\beta}^*_{2,pc}, \hat{\beta}^*_{3,pc}, \hat{\beta}^*_{4,pc}, \hat{\beta}^*_{5,pc}. This is followed by a transformation to the coefficients of the natural variables, using the equation in STEP 7 to estimate the constant term, which gives \hat{\beta}_{0,pc}, \hat{\beta}_{1,pc}, \hat{\beta}_{2,pc}, \hat{\beta}_{3,pc}, \hat{\beta}_{4,pc}, \hat{\beta}_{5,pc}.
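The two selection rules of Section 2.3.5 can be written as a small helper; `select_pcs` is an illustrative name, not code from the thesis.

```python
import numpy as np

def select_pcs(lam, rule="proportion"):
    """Return the indices of the retained PCs given the eigenvalues lam.

    'proportion' keeps PCs whose share of total variance exceeds 1/p,
    i.e. the (100/p)% rule of eq. (2.58); 'kaiser' keeps PCs whose
    eigenvalue exceeds the mean eigenvalue, eq. (2.59).
    """
    lam = np.asarray(lam, dtype=float)
    p = lam.size
    if rule == "proportion":
        keep = lam / lam.sum() > 1.0 / p
    elif rule == "kaiser":
        keep = lam > lam.mean()
    else:
        raise ValueError(rule)
    return np.flatnonzero(keep)
```

Note that \lambda_i / \sum_j \lambda_j > 1/p is algebraically the same condition as \lambda_i > \bar{\lambda}, so with these particular cut-offs the two rules select the same components; they are presented separately in the literature because the proportion rule is often applied with other thresholds.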

The final PC regression model for the chemical data is given by

\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4 + \hat{\beta}_5 X_5

2.4 The Partial Least Squares Regression Method

This method was first developed by Wold (1966); it originated in the social sciences, specifically economics, but first became popular in chemometrics through Wold's son. Partial Least Squares (PLS) regression is a recent technique that generalizes and combines features from principal component analysis and multiple regression. It is based on a linear transition from a large number of original explanatory variables to a new variable space based on a small number of orthogonal factors. PLS decomposes the X and Y matrices into the form

X = TP' + E    (2.60)

Y = UQ' + F    (2.61)

where T and U, the (n \times N) matrices, are called score matrices; P and Q, the (p \times N) and (1 \times N) matrices respectively, are called loading matrices; and E and F, the (n \times p) and (n \times 1) matrices respectively, are the matrices of residuals. The PLS method finds weight vectors w, c such that

\operatorname{cov}(t, u) = \operatorname{cov}(Xw, Yc) = \max_{\lVert r \rVert = \lVert s \rVert = 1} \operatorname{cov}(Xr, Ys)    (2.62)

where

\operatorname{cov}(t, u) = t'u / n    (2.63)

denotes the sample covariance between the score vectors t and u, and w and c can be defined as

w = X'u / (u'u)    (2.64)

c = Y't / (t't)    (2.65)

where

t = Xw    (2.66)

u = Yc    (2.67)

The vectors of loadings p and q from Eq. (2.60) and Eq. (2.61) can be computed by regressing X on t and Y on u, respectively:

p = X't (t't)^{-1}    (2.68)

q = Y'u (u'u)^{-1}    (2.69)

A linear inner relation between the score vectors t and u exists; that is,

U = TD + H    (2.70)

where D is the (p \times p) diagonal matrix and H denotes the matrix of residuals. If we substitute Eq. (2.70) in Eq. (2.61) we have

Y = TDQ' + HQ' + F    (2.71)

Let C' = DQ' and F^* = HQ' + F; then we get

Y = TC' + F^*    (2.72)

Because of (2.61), (2.72) can be rewritten to look like a multiple regression model:

Y = XWC' + F^* = X\hat{\beta}_{pls} + F^*

so the PLS regression coefficients can be written as

\hat{\beta}_{pls} = WC'    (2.73)
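A sketch of the PLS computation for a single response (PLS1), following (2.64)-(2.69); with one response the inner iteration converges in a single pass, so u = y throughout. The back-transformation here uses the explicit form β = W(P'W)^{-1}c, a standard way of computing the coefficient expression (2.73) in practice; the function itself is an illustrative sketch, not the thesis's code.

```python
import numpy as np

def pls1(X, y, n_comp):
    """NIPALS-style PLS1 on centered data with n_comp latent components."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n, p = X.shape
    W = np.zeros((p, n_comp))       # weight vectors, eq. (2.64)
    P = np.zeros((p, n_comp))       # x-loadings, eq. (2.68)
    c = np.zeros(n_comp)            # y-loadings, eq. (2.65)
    Xd, yd = Xc.copy(), yc.copy()
    for a in range(n_comp):
        w = Xd.T @ yd               # w = X'u with u = y, eq. (2.64)
        w /= np.linalg.norm(w)
        t = Xd @ w                  # score vector, eq. (2.66)
        tt = t @ t
        p_a = Xd.T @ t / tt         # eq. (2.68)
        c_a = yd @ t / tt           # eq. (2.65)
        Xd = Xd - np.outer(t, p_a)  # deflate X and y before the next component
        yd = yd - t * c_a
        W[:, a], P[:, a], c[a] = w, p_a, c_a
    beta = W @ np.linalg.solve(P.T @ W, c)
    beta0 = y.mean() - X.mean(axis=0) @ beta
    return beta0, beta
```

When the number of components equals the rank of X, the PLS1 fit reproduces the ordinary least squares solution, mirroring the remark made for PCR with all components retained.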

2.4.1 SIMPLS Algorithm

SIMPLS is an extension of PLSR proposed by de Jong (1993). The SIMPLS algorithm is the leading PLSR algorithm because of its speed and efficiency. This algorithm is based on the empirical cross-variance matrix between the response variables and the regressors and on linear least squares regression. The SIMPLS method assumes that the X and Y variables are related through a bilinear model

x_i = \bar{x} + P t_i + g_i    (2.74)

y_i = \bar{y} + A' t_i + f_i    (2.75)

where \bar{x} and \bar{y} denote the means of the X and Y variables. The t_i are called the scores, which are k-dimensional with k \ll p, whereas P = P_{p,k} is the matrix of x-loadings. The residuals of each equation are represented by g_i and f_i respectively. The matrix A = A_{k,q} represents the slope matrix in the regression of y_i on t_i. The elements of the scores t_i are defined as linear combinations of the mean-centered data, t_{i,a} = \tilde{x}_i' r_a, or equivalently T_{n,k} = \tilde{X}_{n,p} R_{p,k} with R = (r_1, r_2, \ldots, r_k).

De Jong (1993) stated that the weights r_a should be determined so as to maximize the covariance of the score vectors t_a and u_a under some constraints. He also pointed out the following four conditions that are specified to control the solution:

1. Maximization of covariance: \operatorname{cov}(u_a, t_a) = u_a' t_a / n = q_a'(Y'X) r_a / n.

2. Normalization of the weights r_a: r_a' r_a = 1.

3. Normalization of the weights q_a: q_a' q_a = 1.
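For a single response, de Jong's SIMPLS can be sketched compactly: instead of deflating X itself, it deflates the cross-product vector s = X'y against an orthonormal basis of the x-loadings. The following is an illustrative reconstruction of the algorithm for the univariate case, not the thesis's implementation.

```python
import numpy as np

def simpls(X, y, n_comp):
    """SIMPLS for a single response: weight vectors from the deflated
    cross-product s = X'y, unit-norm scores, coefficients beta = R q."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n, p = X.shape
    R = np.zeros((p, n_comp))          # weight vectors r_a
    V = np.zeros((p, n_comp))          # orthonormal basis of the x-loadings
    q = np.zeros(n_comp)               # y-loadings q_a
    s = Xc.T @ yc                      # cross-product vector X'y
    for a in range(n_comp):
        r = s / np.linalg.norm(s)      # weight: dominant direction of s
        t = Xc @ r                     # score vector t_a = X r_a
        t_norm = np.linalg.norm(t)
        t /= t_norm                    # normalize the score ...
        r /= t_norm                    # ... and rescale r so t = Xc r still holds
        p_a = Xc.T @ t                 # x-loading
        q[a] = yc @ t                  # y-loading
        v = p_a.copy()                 # orthonormalize p_a against previous v's
        if a > 0:
            v -= V[:, :a] @ (V[:, :a].T @ p_a)
        v /= np.linalg.norm(v)
        s -= v * (v @ s)               # deflate the cross-product, not X
        R[:, a], V[:, a] = r, v
    beta = R @ q
    beta0 = y.mean() - X.mean(axis=0) @ beta
    return beta0, beta
```

Deflating only the p-dimensional vector s, rather than the full (n × p) data matrix, is what makes SIMPLS faster than NIPALS-style PLS; as with PLS1, using a number of components equal to the rank of X recovers the least squares solution.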


More information

NUMERICAL DIFFERENTIATION

NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

Properties of Least Squares

Properties of Least Squares Week 3 3.1 Smple Lnear Regresson Model 3. Propertes of Least Squares Estmators Y Y β 1 + β X + u weekly famly expendtures X weekly famly ncome For a gven level of x, the expected level of food expendtures

More information

Chapter 6. Supplemental Text Material

Chapter 6. Supplemental Text Material Chapter 6. Supplemental Text Materal S6-. actor Effect Estmates are Least Squares Estmates We have gven heurstc or ntutve explanatons of how the estmates of the factor effects are obtaned n the textboo.

More information

Outline. Zero Conditional mean. I. Motivation. 3. Multiple Regression Analysis: Estimation. Read Wooldridge (2013), Chapter 3.

Outline. Zero Conditional mean. I. Motivation. 3. Multiple Regression Analysis: Estimation. Read Wooldridge (2013), Chapter 3. Outlne 3. Multple Regresson Analyss: Estmaton I. Motvaton II. Mechancs and Interpretaton of OLS Read Wooldrdge (013), Chapter 3. III. Expected Values of the OLS IV. Varances of the OLS V. The Gauss Markov

More information

Linear Regression Analysis: Terminology and Notation

Linear Regression Analysis: Terminology and Notation ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented

More information

REGRESSION ANALYSIS II- MULTICOLLINEARITY

REGRESSION ANALYSIS II- MULTICOLLINEARITY REGRESSION ANALYSIS II- MULTICOLLINEARITY QUESTION 1 Departments of Open Unversty of Cyprus A and B consst of na = 35 and nb = 30 students respectvely. The students of department A acheved an average test

More information

SIMPLE LINEAR REGRESSION

SIMPLE LINEAR REGRESSION Smple Lnear Regresson and Correlaton Introducton Prevousl, our attenton has been focused on one varable whch we desgnated b x. Frequentl, t s desrable to learn somethng about the relatonshp between two

More information

Here is the rationale: If X and y have a strong positive relationship to one another, then ( x x) will tend to be positive when ( y y)

Here is the rationale: If X and y have a strong positive relationship to one another, then ( x x) will tend to be positive when ( y y) Secton 1.5 Correlaton In the prevous sectons, we looked at regresson and the value r was a measurement of how much of the varaton n y can be attrbuted to the lnear relatonshp between y and x. In ths secton,

More information

Chapter 5 Multilevel Models

Chapter 5 Multilevel Models Chapter 5 Multlevel Models 5.1 Cross-sectonal multlevel models 5.1.1 Two-level models 5.1.2 Multple level models 5.1.3 Multple level modelng n other felds 5.2 Longtudnal multlevel models 5.2.1 Two-level

More information

Chapter 7 Generalized and Weighted Least Squares Estimation. In this method, the deviation between the observed and expected values of

Chapter 7 Generalized and Weighted Least Squares Estimation. In this method, the deviation between the observed and expected values of Chapter 7 Generalzed and Weghted Least Squares Estmaton The usual lnear regresson model assumes that all the random error components are dentcally and ndependently dstrbuted wth constant varance. When

More information

Statistics MINITAB - Lab 2

Statistics MINITAB - Lab 2 Statstcs 20080 MINITAB - Lab 2 1. Smple Lnear Regresson In smple lnear regresson we attempt to model a lnear relatonshp between two varables wth a straght lne and make statstcal nferences concernng that

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Multivariate Ratio Estimator of the Population Total under Stratified Random Sampling

Multivariate Ratio Estimator of the Population Total under Stratified Random Sampling Open Journal of Statstcs, 0,, 300-304 ttp://dx.do.org/0.436/ojs.0.3036 Publsed Onlne July 0 (ttp://www.scrp.org/journal/ojs) Multvarate Rato Estmator of te Populaton Total under Stratfed Random Samplng

More information

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification E395 - Pattern Recognton Solutons to Introducton to Pattern Recognton, Chapter : Bayesan pattern classfcaton Preface Ths document s a soluton manual for selected exercses from Introducton to Pattern Recognton

More information

However, since P is a symmetric idempotent matrix, of P are either 0 or 1 [Eigen-values

However, since P is a symmetric idempotent matrix, of P are either 0 or 1 [Eigen-values Fall 007 Soluton to Mdterm Examnaton STAT 7 Dr. Goel. [0 ponts] For the general lnear model = X + ε, wth uncorrelated errors havng mean zero and varance σ, suppose that the desgn matrx X s not necessarly

More information

STAT 511 FINAL EXAM NAME Spring 2001

STAT 511 FINAL EXAM NAME Spring 2001 STAT 5 FINAL EXAM NAME Sprng Instructons: Ths s a closed book exam. No notes or books are allowed. ou may use a calculator but you are not allowed to store notes or formulas n the calculator. Please wrte

More information

Interval Estimation in the Classical Normal Linear Regression Model. 1. Introduction

Interval Estimation in the Classical Normal Linear Regression Model. 1. Introduction ECONOMICS 35* -- NOTE 7 ECON 35* -- NOTE 7 Interval Estmaton n the Classcal Normal Lnear Regresson Model Ths note outlnes the basc elements of nterval estmaton n the Classcal Normal Lnear Regresson Model

More information

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1 Random varables Measure of central tendences and varablty (means and varances) Jont densty functons and ndependence Measures of assocaton (covarance and correlaton) Interestng result Condtonal dstrbutons

More information

is the calculated value of the dependent variable at point i. The best parameters have values that minimize the squares of the errors

is the calculated value of the dependent variable at point i. The best parameters have values that minimize the squares of the errors Multple Lnear and Polynomal Regresson wth Statstcal Analyss Gven a set of data of measured (or observed) values of a dependent varable: y versus n ndependent varables x 1, x, x n, multple lnear regresson

More information

Feature Selection: Part 1

Feature Selection: Part 1 CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?

More information

Econ107 Applied Econometrics Topic 9: Heteroskedasticity (Studenmund, Chapter 10)

Econ107 Applied Econometrics Topic 9: Heteroskedasticity (Studenmund, Chapter 10) I. Defnton and Problems Econ7 Appled Econometrcs Topc 9: Heteroskedastcty (Studenmund, Chapter ) We now relax another classcal assumpton. Ths s a problem that arses often wth cross sectons of ndvduals,

More information

Chapter 15 - Multiple Regression

Chapter 15 - Multiple Regression Chapter - Multple Regresson Chapter - Multple Regresson Multple Regresson Model The equaton that descrbes how the dependent varable y s related to the ndependent varables x, x,... x p and an error term

More information

= = = (a) Use the MATLAB command rref to solve the system. (b) Let A be the coefficient matrix and B be the right-hand side of the system.

= = = (a) Use the MATLAB command rref to solve the system. (b) Let A be the coefficient matrix and B be the right-hand side of the system. Chapter Matlab Exercses Chapter Matlab Exercses. Consder the lnear system of Example n Secton.. x x x y z y y z (a) Use the MATLAB command rref to solve the system. (b) Let A be the coeffcent matrx and

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

Chat eld, C. and A.J.Collins, Introduction to multivariate analysis. Chapman & Hall, 1980

Chat eld, C. and A.J.Collins, Introduction to multivariate analysis. Chapman & Hall, 1980 MT07: Multvarate Statstcal Methods Mke Tso: emal mke.tso@manchester.ac.uk Webpage for notes: http://www.maths.manchester.ac.uk/~mkt/new_teachng.htm. Introducton to multvarate data. Books Chat eld, C. and

More information

Chapter 14 Simple Linear Regression

Chapter 14 Simple Linear Regression Chapter 4 Smple Lnear Regresson Chapter 4 - Smple Lnear Regresson Manageral decsons often are based on the relatonshp between two or more varables. Regresson analss can be used to develop an equaton showng

More information

Lecture 10 Support Vector Machines II

Lecture 10 Support Vector Machines II Lecture 10 Support Vector Machnes II 22 February 2016 Taylor B. Arnold Yale Statstcs STAT 365/665 1/28 Notes: Problem 3 s posted and due ths upcomng Frday There was an early bug n the fake-test data; fxed

More information

Resource Allocation and Decision Analysis (ECON 8010) Spring 2014 Foundations of Regression Analysis

Resource Allocation and Decision Analysis (ECON 8010) Spring 2014 Foundations of Regression Analysis Resource Allocaton and Decson Analss (ECON 800) Sprng 04 Foundatons of Regresson Analss Readng: Regresson Analss (ECON 800 Coursepak, Page 3) Defntons and Concepts: Regresson Analss statstcal technques

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Foundations of Arithmetic

Foundations of Arithmetic Foundatons of Arthmetc Notaton We shall denote the sum and product of numbers n the usual notaton as a 2 + a 2 + a 3 + + a = a, a 1 a 2 a 3 a = a The notaton a b means a dvdes b,.e. ac = b where c s an

More information

Y = β 0 + β 1 X 1 + β 2 X β k X k + ε

Y = β 0 + β 1 X 1 + β 2 X β k X k + ε Chapter 3 Secton 3.1 Model Assumptons: Multple Regresson Model Predcton Equaton Std. Devaton of Error Correlaton Matrx Smple Lnear Regresson: 1.) Lnearty.) Constant Varance 3.) Independent Errors 4.) Normalty

More information

Correlation and Regression

Correlation and Regression Correlaton and Regresson otes prepared by Pamela Peterson Drake Index Basc terms and concepts... Smple regresson...5 Multple Regresson...3 Regresson termnology...0 Regresson formulas... Basc terms and

More information

Chapter 4: Regression With One Regressor

Chapter 4: Regression With One Regressor Chapter 4: Regresson Wth One Regressor Copyrght 2011 Pearson Addson-Wesley. All rghts reserved. 1-1 Outlne 1. Fttng a lne to data 2. The ordnary least squares (OLS) lne/regresson 3. Measures of ft 4. Populaton

More information

Uncertainty in measurements of power and energy on power networks

Uncertainty in measurements of power and energy on power networks Uncertanty n measurements of power and energy on power networks E. Manov, N. Kolev Department of Measurement and Instrumentaton, Techncal Unversty Sofa, bul. Klment Ohrdsk No8, bl., 000 Sofa, Bulgara Tel./fax:

More information

Correlation and Regression. Correlation 9.1. Correlation. Chapter 9

Correlation and Regression. Correlation 9.1. Correlation. Chapter 9 Chapter 9 Correlaton and Regresson 9. Correlaton Correlaton A correlaton s a relatonshp between two varables. The data can be represented b the ordered pars (, ) where s the ndependent (or eplanator) varable,

More information

Lecture 16 Statistical Analysis in Biomaterials Research (Part II)

Lecture 16 Statistical Analysis in Biomaterials Research (Part II) 3.051J/0.340J 1 Lecture 16 Statstcal Analyss n Bomaterals Research (Part II) C. F Dstrbuton Allows comparson of varablty of behavor between populatons usng test of hypothess: σ x = σ x amed for Brtsh statstcan

More information

III. Econometric Methodology Regression Analysis

III. Econometric Methodology Regression Analysis Page Econ07 Appled Econometrcs Topc : An Overvew of Regresson Analyss (Studenmund, Chapter ) I. The Nature and Scope of Econometrcs. Lot s of defntons of econometrcs. Nobel Prze Commttee Paul Samuelson,

More information

T E C O L O T E R E S E A R C H, I N C.

T E C O L O T E R E S E A R C H, I N C. T E C O L O T E R E S E A R C H, I N C. B rdg n g En g neern g a nd Econo mcs S nce 1973 THE MINIMUM-UNBIASED-PERCENTAGE ERROR (MUPE) METHOD IN CER DEVELOPMENT Thrd Jont Annual ISPA/SCEA Internatonal Conference

More information

Fall 2012 Analysis of Experimental Measurements B. Eisenstein/rev. S. Errede

Fall 2012 Analysis of Experimental Measurements B. Eisenstein/rev. S. Errede Fall 0 Analyss of Expermental easurements B. Esensten/rev. S. Errede We now reformulate the lnear Least Squares ethod n more general terms, sutable for (eventually extendng to the non-lnear case, and also

More information

β0 + β1xi and want to estimate the unknown

β0 + β1xi and want to estimate the unknown SLR Models Estmaton Those OLS Estmates Estmators (e ante) v. estmates (e post) The Smple Lnear Regresson (SLR) Condtons -4 An Asde: The Populaton Regresson Functon B and B are Lnear Estmators (condtonal

More information

Polynomial Regression Models

Polynomial Regression Models LINEAR REGRESSION ANALYSIS MODULE XII Lecture - 6 Polynomal Regresson Models Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur Test of sgnfcance To test the sgnfcance

More information

ISSN: ISO 9001:2008 Certified International Journal of Engineering and Innovative Technology (IJEIT) Volume 3, Issue 1, July 2013

ISSN: ISO 9001:2008 Certified International Journal of Engineering and Innovative Technology (IJEIT) Volume 3, Issue 1, July 2013 ISSN: 2277-375 Constructon of Trend Free Run Orders for Orthogonal rrays Usng Codes bstract: Sometmes when the expermental runs are carred out n a tme order sequence, the response can depend on the run

More information

Time-Varying Systems and Computations Lecture 6

Time-Varying Systems and Computations Lecture 6 Tme-Varyng Systems and Computatons Lecture 6 Klaus Depold 14. Januar 2014 The Kalman Flter The Kalman estmaton flter attempts to estmate the actual state of an unknown dscrete dynamcal system, gven nosy

More information