RV COEFFICIENT & PRINCIPAL COMPONENT ANALYSIS OF INSTRUMENTAL VARIABLES USING SAS/IML SOFTWARE
Pascal SCHLICH

INTRODUCTION

When p numerical variables have been recorded on n samples, it is now quite usual to perform a principal component analysis (PCA) (Morrison 1976) on the data set, called Xnp. This method gives a representation of the samples in a low-dimensional space spanned by the first principal components (PCs). The PCs are obtained as linear combinations of the p variables, and the interpretation of a principal component amounts to the comparison of the p coefficients of the associated linear combination. A few problems appear when p is large (for instance, greater than 50). On the one hand, the memory size of some microcomputers may be too small, or the computing time may be prohibitive. On the other hand, and this is the main problem, the interpretation of a linear combination of so many variables would certainly be tiresome and not very convincing: it is often difficult to decide which correlations between variables and PCs are relevant. It would therefore be of great interest to know the relevant variables in advance and to perform the PCA with these variables only.

Let Y be another data set containing q numerical variables recorded on the same n samples. The comparison of the two sample configurations can be studied through the two PCAs; a numerical coefficient is needed to quantify this comparison. Now assume that Y has to be fitted by X. Some variables in X may be unnecessary for obtaining principal sample plots as close as possible to those of the PCA of Y. The problem is then to select a subset of r variables from X, called X(r), giving a PCA interpretation similar to that of Y. Here also, a coefficient is needed to measure the similarity between two sets of variables, X(r) and Y. Assuming X = Y, this last question reduces to the first one, about the selection of a subset of sufficient variables in a single PCA.
The RV coefficient (Escoufier 1973) brings useful answers to these questions. It is based on a presentation of PCA currently used in France (Cailliez and Pagès 1976), in which an important point is the choice of the metric. The RV coefficient can be used to select variables and metric simultaneously, in order to reduce the number of variables in a single PCA, or to establish links between two PCAs on the same samples (Escoufier and Robert 1979; Bonifas et al. 1984). This last method is called the principal component analysis of instrumental variables (PCAIV). It is a non-symmetric conjoint analysis including the selection of variables from one data set to fit another one. Applications of these methods in food science have been presented by Schlich et al. (1987). The aim of this communication is to announce a program for RV selection of variables written in SAS/IML. The first part of this communication gives the broad outlines of the RV coefficient and of PCAIV. The second one interprets a pedagogic example built from artificial data.
METHODS

PCA

Xnp is a data set containing p numerical variables recorded on n samples. Dnn is a diagonal matrix of sample weights; usually each sample gets the same weight 1/n. Ppp is a symmetric matrix defining the metric used to measure distances between samples. The PCA of the statistical study (X, P, D) leads to the computation of the eigensystem of the n*n matrix product WX = XPX'D, or equivalently of the p*p dual matrix product X'DXP (where ' means matrix transposition). When P is the identity, this last matrix product becomes the covariance matrix.

RV coefficient

Let (Ynq, Qqq, Dnn) be another statistical study on the same n samples, and let WY be the matrix product YQY'D. The RV coefficient between the two statistical studies is defined by:

    RV(WX, WY) = trace(WX.WY) / [trace(WX.WX) * trace(WY.WY)]^(1/2)

The RV coefficient appears as a generalized coefficient of correlation between the two PCAs. It varies between 0 and 1. When RV equals 0, the matrix of correlations between X and Y is null. When RV equals 1, all distances between samples are proportional in the two PCAs. Therefore, the closer RV is to 1, the nearer the two PCAs. In the most usual cases, in which the dual product reduces to the covariance or correlation matrix:

    RV = [ Sum(i=1..p) Sum(k=1..q) c2(Xi,Yk) ] / [ Sum(i,j=1..p) c2(Xi,Xj) * Sum(k,l=1..q) c2(Yk,Yl) ]^(1/2)

where c is a covariance or correlation coefficient and c2 its square. In the particular case in which q = 1, if R is the classical multiple correlation coefficient, then RV = R2/p^(1/2). This identity points out two peculiarities of the coefficient. First, RV appears as the squared cosine of the angle between the Y vector and the X space, inversely weighted by the dimension of this space. Second, the magnitude of an RV value is comparable to that of a squared correlation. Practically, an RV value of 0.95 leads to equivalent interpretations of the two PCAs.

Choosing variables in PCA

Let Y(r) be a subset of r variables from Y.
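As an illustration only (the paper's program is written in SAS/IML), the RV coefficient above can be sketched in Python/NumPy; the function name rv_coefficient is ours, and centering the columns corresponds to uniform sample weights 1/n with an identity metric:

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two data sets observed on the same n samples.

    X is n x p, Y is n x q. Columns are centered, which corresponds to
    uniform sample weights 1/n and an identity metric on the variables.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Wx = X @ X.T          # n x n matrix of the first statistical study
    Wy = Y @ Y.T          # n x n matrix of the second statistical study
    num = np.trace(Wx @ Wy)
    den = np.sqrt(np.trace(Wx @ Wx) * np.trace(Wy @ Wy))
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(26, 5))
print(rv_coefficient(X, X))        # identical studies: RV = 1 up to rounding
print(rv_coefficient(X, 3.0 * X))  # RV is invariant to a global rescaling
```

Note that RV compares the sample configurations, not the variables themselves, which is why a global rescaling of one data set leaves it unchanged.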
If RV(WY(r), WY) is close to 1, then the PCA of the reduced data set Y(r) can replace the PCA of the whole data set Y without disturbing the sample locations on the principal plots. The variables composing Y(r) are selected using a forward algorithm (Do Chi 1979). Three options are available to associate a metric Q(r) with the data set Y(r):
- Q(r) is the identity matrix, and each selected variable gets the same weight;
- Q(r) is a diagonal matrix containing the variable weights;
- Q(r) is a symmetric matrix containing a general metric.
For the last two options, the metric is constructed by the program iteratively and simultaneously with the choice of variables.
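The forward algorithm for the simplest of the three options (identity metric) can be sketched as follows; this is a hedged illustration in Python/NumPy, not the SAS/IML program, and the names rv and forward_rv_selection are ours:

```python
import numpy as np

def rv(Wa, Wb):
    """RV coefficient between two n x n study matrices."""
    return np.trace(Wa @ Wb) / np.sqrt(np.trace(Wa @ Wa) * np.trace(Wb @ Wb))

def forward_rv_selection(Y, r_max, rv_max=1.0):
    """Greedy forward selection with the identity metric: at each step, add
    the variable whose inclusion maximizes RV between the reduced study
    Y(r) and the whole study Y; stop at r_max variables or at rv_max."""
    Yc = Y - Y.mean(axis=0)
    Wy = Yc @ Yc.T
    selected, rv_path = [], []
    remaining = list(range(Y.shape[1]))
    while remaining and len(selected) < r_max:
        # evaluate every remaining candidate and keep the best one
        scores = []
        for j in remaining:
            Yr = Yc[:, selected + [j]]
            scores.append(rv(Yr @ Yr.T, Wy))
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
        rv_path.append(max(scores))
        if rv_path[-1] >= rv_max:
            break
    return selected, rv_path
```

With r_max equal to the total number of variables, the last RV value of the path is 1, since Y(p) then contains every variable of Y.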
Figure 1 shows and explains the input window of the program.

PCAIV

PCAIV applies to two different data sets, X and Y. The X(r) variables are selected from X to fit Y, using the third option for Q(r) described above. PCAIV is equivalent to the PCA of the q linear combinations of the r X(r) variables obtained by orthogonal projection of the q Y variables onto the space spanned by the X(r) variables. In fact, PCAIV looks like a multiple linear regression with several dependent variables Y, instead of only one, and one common subset of independent variables X(r). After the X(r) selection, the user has to compute the PCAIV by running the procedure PRINCOMP (with the COV option) on the output data set DTS3 (Figure 1). The principal components of the PCAIV are then interpreted through the correlations computed between both the X(r) and Y variables and those components.

EXAMPLE WITH SIMULATED DATA

The data set ps.dts16 contains 26 observations, called A, B, ..., Z, and 16 variables, called X1, ..., X16. It has been simulated to have the particular matrix of correlations shown in Table 1. More precisely, the variables X1, X10 and X15 were first simulated independently. Second, the variables X2, X3, ..., X9 were simulated to be well correlated with X1, with alternating plus and minus signs. The variables X11, X12, X13 and X14 were simulated in the same way from X10. Finally, the variable X16 was simulated to be well correlated with X15. Of course, the PCA of this data set exhibits three significant dimensions (Table 1): the first is highly correlated with the first 9 variables, the second with the 5 variables X10, ..., X14, and the third with X15 and X16. Table 2 shows the RV selection with and without weights on the selected variables. In both cases, the first two selected variables are X1 and X10, each the head of one variable group in the simulation process. The third group is represented by the variable X16 in both selections. Figure 2 shows where and how the RV selects variables.
In fact, only 6 variables are sufficient to get a first sample plot very similar to that of the whole PCA (Figure 3). Allowing weights on the variables, only 3 variables (one per dimension) are necessary to get an RV value greater than 0.95; the computed weights are then similar to those of the PCs. Table 3 gives and describes an example of PCAIV. It demonstrates clearly that PCAIV is able to discover hidden linear combinations of variables.

CONCLUSION

The RV coefficient is a useful tool to compare and to summarize subsets of variables recorded on the same samples. More generally, the RV coefficient can be considered as a unifying tool for linear multivariate statistical methods (Robert and Escoufier 1976). It can also be used for the classification of variables (Schlich 1989) or for a three-way data analysis called STATIS (Lavit 1988; Traissac 1990).
    RVSELECT =====================================================
    Command ===>
                 RV SELECTION: From X to fit Y

    DTS1 =          First input data set
    Y    = _NUM_    Names of the Y variables (_NUM_ for all numerical variables)
    OPT  = COR      COV for covariance / COR for correlation

    DTS2 =          Second input data set
    X    =          Names of the candidate variables
    XSEL =          Names of variables imposed into the selection
    OPT  = COR      COV for covariance / COR for correlation
    METR = ID       ID for identity / WG for weighted / IV for PCAIV

    RVMX = 1.00     Maximum value for the RV coefficient
    PMAX =          Maximum number of selected variables
    DTS3 =          Output data set of the selected variables transformed
                    by the metric, for the WG and IV options

To select a subset of variables from a single data set, to fit this whole data set, define DTS1 = DTS2; the XSEL variables will be selected first. OPT=COR means that each variable is divided by its standard deviation. RVMX and PMAX stop the selection at a given step. DTS3 is the output data set containing the selected variables transformed by the metric, for the IV and WG values of the METR option. When METR=IV, the PCA of this data set is the PCAIV; in this case, as for METR=WG, PRINCOMP must be performed on DTS3 using its COV option.

Figure 1: Input window of the program RVSELECT, with comments
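The PCAIV computation that the window leaves to the user (orthogonal projection of the Y variables onto the space spanned by the selected X(r) variables, then a covariance PCA of the projections, which is the role PROC PRINCOMP with its COV option plays on DTS3) can be sketched in Python/NumPy. This is an illustration under our own naming (pcaiv), not the SAS/IML program:

```python
import numpy as np

def pcaiv(Xr, Y, n_components=2):
    """PCAIV sketch: project Y onto the column space of the selected X(r)
    variables, then run a covariance PCA of the fitted values."""
    Xr = Xr - Xr.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # least-squares coefficients: Y_hat = Xr (Xr'Xr)^-1 Xr' Y
    coef, *_ = np.linalg.lstsq(Xr, Y, rcond=None)
    Y_hat = Xr @ coef
    # eigen-decomposition of the covariance matrix of the fitted values
    cov = np.cov(Y_hat, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:n_components]
    return Y_hat @ eigvec[:, order], eigval[order]

# When Y is exactly a linear combination of X(r), the projection recovers Y
rng = np.random.default_rng(2)
X = rng.normal(size=(26, 3))
Y = X @ rng.uniform(-1.0, 1.0, size=(3, 5))
scores, eigenvalues = pcaiv(X, Y)
```

The returned scores are the principal components of the PCAIV; their correlations with the X(r) and Y variables serve for interpretation, as described above.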
SIMULATED EXAMPLE OF SELECTION OF VARIABLES IN A SINGLE DATA SET

Table 1: the correlation matrix that the data set ps.dts16 (26 observations, 16 variables) was simulated to match, the eigenvalues of the PCA of the whole data set, and, for each PC, the variables highly correlated with it (X1-X9 with PC1, X10-X14 with PC2, X15 and X16 with PC3).

Table 2: RV selection from X to fit Y, with Y = X = ps.dts16 (COR option), under the identity metric (ID) and under the weighted metric (WG). The 16 variables had 3 relevant PCs of unequal weights. With the ID metric, the RV selects 6 variables, X1, X10, X3, X16, X5 and X11: 3 for PC1, 2 for PC2 and 1 for PC3. With the WG metric, the RV selects 3 variables, X1, X10 and X16, with 3 weights similar to the PC weights.
CORRELATIONS BETWEEN THE PCs OF THE WHOLE PCA AND THE VARIABLES

Figure 2: the RV selection (METR=ID) chooses variables (highlighted in the plot) among the different significant directions, proportionally to the number of variables highly correlated with each of them.

SUPERIMPOSED FIRST PRINCIPAL SAMPLE PLOTS OF THE WHOLE AND THE REDUCED PCA (METR=ID)

Figure 3: capital letters denote samples coming from the whole PCA; lower-case letters denote samples coming from the reduced PCA of the variables X1, X10, X3, X16, X5 and X11. Due to the good RV value (0.955), the sample locations are very similar in the two PCAs: the two interpretations are equivalent!
SIMULATED EXAMPLE OF PCAIV

Assume the variables Y1, Y2, Y3, Y4 and Y5 (data set ps.dts5) are simulated linear combinations of the variables X1, X9, X10, X14 and X16 from the data set ps.dts16, with coefficients uniformly and randomly chosen between -1 and +1. Is the RV able to discover this structure when all 16 X variables are submitted to the selection?

RUNNING THE THREE OPTIONS FOR THE METRIC

The RV selection is run from X (the 16 variables of ps.dts16, COR option) to fit Y (the 5 variables of ps.dts5, COR option):
- with the ID metric, the selected variables are X9, X16, X14, X2 and X15: with no choice for the metric, the RV fails to select the right variables;
- with the WG metric, the selected variables are X9, X16, X14, X13 and X11: using a metric of weights, the result is a little better, but 2 selected variables are still wrong;
- with the IV metric, the selected variables are X9, X16, X10, X14 and X1: using a general metric, the RV selects the right variables and discovers the structure.

Table 3
REFERENCES

Bernard-Do Chi, C. Choix de variables en analyse de données. Thèse, Université des Sciences et Techniques du Languedoc, 1979.

Bonifas, L.; Escoufier, Y.; Gonzales, P.L.; Sabatier, R. Choix de variables en analyse en composantes principales. Rev. Stat. Appl. 1984.

Cailliez, F.; Pagès, J.P. Introduction à l'analyse des données. SMASH: Paris, 1976.

Escoufier, Y. Echantillonnage dans une population de variables aléatoires réelles. Publ. Inst. Stat. Univ. Paris 1970.

Escoufier, Y. Le traitement des variables vectorielles. Biometrics 1973, 29, 751-760.

Escoufier, Y.; Robert, P. Choosing variables and metrics by optimizing the RV coefficient. In Optimizing Methods in Statistics; Rustagi, J.S., Ed.; Academic Press: New York, 1979.

Lavit, Ch. Analyse conjointe de tableaux quantitatifs. Masson: Paris, 1988.

Morrison, D.F. Multivariate Statistical Methods, 2nd ed.; McGraw-Hill: New York, 1976.

Robert, P.; Escoufier, Y. A unifying tool for linear multivariate statistical methods: the RV coefficient. Appl. Stat. 1976, 25, 257-265.

Schlich, P.; Issanchou, S.; Guichard, E.; Etievant, P.; Adda, J. RV coefficient: a new approach to select variables in PCA and to get correlations between sensory and instrumental data. In Flavour Science and Technology; Martens, M., Dalen, G.A., Russwurm, H. Jr., Eds.; Wiley: Chichester, 1987.

Schlich, P.; Guichard, E. Selection and classification of volatile compounds of apricot using the RV coefficient. J. Agric. Food Chem. 1989.

Traissac, P. Exploratory data analysis of a cube of data by the ACT (STATIS method) using SAS/IML and SAS/GRAPH software. In SEUGI '90 Proceedings (in press), 1990.

Address of the author: INRA, Laboratoire de Recherches sur les Arômes, 17 rue Sully, DIJON Cedex, FRANCE
More informationA direct formulation for sparse PCA using semidefinite programming
A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon
More informationCHAPTER 4 PRINCIPAL COMPONENT ANALYSIS-BASED FUSION
59 CHAPTER 4 PRINCIPAL COMPONENT ANALYSIS-BASED FUSION 4. INTRODUCTION Weighted average-based fusion algorithms are one of the widely used fusion methods for multi-sensor data integration. These methods
More informationPLS discriminant analysis for functional data
PLS discriminant analysis for functional data 1 Dept. de Statistique CERIM - Faculté de Médecine Université de Lille 2, 5945 Lille Cedex, France (e-mail: cpreda@univ-lille2.fr) 2 Chaire de Statistique
More informationMachine Learning - MT & 14. PCA and MDS
Machine Learning - MT 2016 13 & 14. PCA and MDS Varun Kanade University of Oxford November 21 & 23, 2016 Announcements Sheet 4 due this Friday by noon Practical 3 this week (continue next week if necessary)
More informationMachine Learning for Software Engineering
Machine Learning for Software Engineering Dimensionality Reduction Prof. Dr.-Ing. Norbert Siegmund Intelligent Software Systems 1 2 Exam Info Scheduled for Tuesday 25 th of July 11-13h (same time as the
More informationOverview of clustering analysis. Yuehua Cui
Overview of clustering analysis Yuehua Cui Email: cuiy@msu.edu http://www.stt.msu.edu/~cui A data set with clear cluster structure How would you design an algorithm for finding the three clusters in this
More informationMethods for territorial intelligence.
Methods for territorial intelligence. Serge Ormaux To cite this version: Serge Ormaux. Methods for territorial intelligence.. In International Conference of Territorial Intelligence, Sep 2006, Alba Iulia,
More informationFrank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c /9/9 page 147 le-tex
Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c08 2013/9/9 page 147 le-tex 8.3 Principal Component Analysis (PCA) 147 Figure 8.1 Principal and independent components
More informationReview (Probability & Linear Algebra)
Review (Probability & Linear Algebra) CE-725 : Statistical Pattern Recognition Sharif University of Technology Spring 2013 M. Soleymani Outline Axioms of probability theory Conditional probability, Joint
More informationPERFORMANCE OF THE EM ALGORITHM ON THE IDENTIFICATION OF A MIXTURE OF WAT- SON DISTRIBUTIONS DEFINED ON THE HYPER- SPHERE
REVSTAT Statistical Journal Volume 4, Number 2, June 2006, 111 130 PERFORMANCE OF THE EM ALGORITHM ON THE IDENTIFICATION OF A MIXTURE OF WAT- SON DISTRIBUTIONS DEFINED ON THE HYPER- SPHERE Authors: Adelaide
More informationLecture 3: Review of Linear Algebra
ECE 83 Fall 2 Statistical Signal Processing instructor: R Nowak, scribe: R Nowak Lecture 3: Review of Linear Algebra Very often in this course we will represent signals as vectors and operators (eg, filters,
More informationDISTATIS: The Analysis of Multiple Distance Matrices
DISTATIS: The Analysis of Multiple Distance Matrices Hervé Abdi The University of Texas at Dallas Alice J O Toole The University of Texas at Dallas Dominique Valentin Université de Bourgogne Betty Edelman
More informationCHAPTER 3 THE COMMON FACTOR MODEL IN THE POPULATION. From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum
CHAPTER 3 THE COMMON FACTOR MODEL IN THE POPULATION From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum 1997 19 CHAPTER 3 THE COMMON FACTOR MODEL IN THE POPULATION 3.0. Introduction
More informationPrincipal Component Analysis CS498
Principal Component Analysis CS498 Today s lecture Adaptive Feature Extraction Principal Component Analysis How, why, when, which A dual goal Find a good representation The features part Reduce redundancy
More informationSection 4.5 Eigenvalues of Symmetric Tridiagonal Matrices
Section 4.5 Eigenvalues of Symmetric Tridiagonal Matrices Key Terms Symmetric matrix Tridiagonal matrix Orthogonal matrix QR-factorization Rotation matrices (plane rotations) Eigenvalues We will now complete
More informationBiplots in Practice MICHAEL GREENACRE. Professor of Statistics at the Pompeu Fabra University. Chapter 6 Offprint
Biplots in Practice MICHAEL GREENACRE Proessor o Statistics at the Pompeu Fabra University Chapter 6 Oprint Principal Component Analysis Biplots First published: September 010 ISBN: 978-84-93846-8-6 Supporting
More informationMATH 829: Introduction to Data Mining and Analysis Principal component analysis
1/11 MATH 829: Introduction to Data Mining and Analysis Principal component analysis Dominique Guillot Departments of Mathematical Sciences University of Delaware April 4, 2016 Motivation 2/11 High-dimensional
More informationDimensionality Reduction Techniques (DRT)
Dimensionality Reduction Techniques (DRT) Introduction: Sometimes we have lot of variables in the data for analysis which create multidimensional matrix. To simplify calculation and to get appropriate,
More informationIntelligent Data Analysis. Principal Component Analysis. School of Computer Science University of Birmingham
Intelligent Data Analysis Principal Component Analysis Peter Tiňo School of Computer Science University of Birmingham Discovering low-dimensional spatial layout in higher dimensional spaces - 1-D/3-D example
More informationIndependent Component Analysis and Its Application on Accelerator Physics
Independent Component Analysis and Its Application on Accelerator Physics Xiaoying Pang LA-UR-12-20069 ICA and PCA Similarities: Blind source separation method (BSS) no model Observed signals are linear
More informationRECENT DEVELOPMENTS IN VARIANCE COMPONENT ESTIMATION
Libraries Conference on Applied Statistics in Agriculture 1989-1st Annual Conference Proceedings RECENT DEVELOPMENTS IN VARIANCE COMPONENT ESTIMATION R. R. Hocking Follow this and additional works at:
More informationSparse Covariance Selection using Semidefinite Programming
Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support
More informationAlignment and Analysis of Proteomics Data using Square Root Slope Function Framework
Alignment and Analysis of Proteomics Data using Square Root Slope Function Framework J. Derek Tucker 1 1 Department of Statistics Florida State University Tallahassee, FL 32306 CTW: Statistics of Warpings
More informationMS-E2112 Multivariate Statistical Analysis (5cr) Lecture 6: Bivariate Correspondence Analysis - part II
MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 6: Bivariate Correspondence Analysis - part II the Contents the the the Independence The independence between variables x and y can be tested using.
More information7. Variable extraction and dimensionality reduction
7. Variable extraction and dimensionality reduction The goal of the variable selection in the preceding chapter was to find least useful variables so that it would be possible to reduce the dimensionality
More informationPreprocessing & dimensionality reduction
Introduction to Data Mining Preprocessing & dimensionality reduction CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Dimensionality reduction Yale - Fall 2016
More informationAuffray and Enjalbert 441 MODAL THEOREM PROVING : EQUATIONAL VIEWPOINT
MODAL THEOREM PROVING : EQUATIONAL VIEWPOINT Yves AUFFRAY Societe des avions M. Dassault 78 quai Marcel Dassault 92210 Saint-Cloud - France Patrice ENJALBERT * Laboratoire d'informatique University de
More informationUnsupervised Learning: Dimensionality Reduction
Unsupervised Learning: Dimensionality Reduction CMPSCI 689 Fall 2015 Sridhar Mahadevan Lecture 3 Outline In this lecture, we set about to solve the problem posed in the previous lecture Given a dataset,
More informationPLS classification of functional data
Computational Statistics (27) 22:223 235 DOI 1.17/s18-7-41-4 ORIGINAL PAPER PLS classification of functional data Cristian Preda Gilbert Saporta Caroline Lévéder Published online: 23 February 27 Springer-Verlag
More informationComments on the method of harmonic balance
Comments on the method of harmonic balance Ronald Mickens To cite this version: Ronald Mickens. Comments on the method of harmonic balance. Journal of Sound and Vibration, Elsevier, 1984, 94 (3), pp.456-460.
More informationShort Answer Questions: Answer on your separate blank paper. Points are given in parentheses.
ISQS 6348 Final exam solutions. Name: Open book and notes, but no electronic devices. Answer short answer questions on separate blank paper. Answer multiple choice on this exam sheet. Put your name on
More information4 Linear Algebra Review
Linear Algebra Review For this topic we quickly review many key aspects of linear algebra that will be necessary for the remainder of the text 1 Vectors and Matrices For the context of data analysis, the
More informationTHIS PAPER SHOWS how a single method
A Single Matrix Method for Several Problems By Alvin C. Egbert Matrix algebra has become a familiar research tool in recent years, but the teaching and learning problem is still formidable for many individuals.
More informationPrincipal Component Analysis for a Spiked Covariance Model with Largest Eigenvalues of the Same Asymptotic Order of Magnitude
Principal Component Analysis for a Spiked Covariance Model with Largest Eigenvalues of the Same Asymptotic Order of Magnitude Addy M. Boĺıvar Cimé Centro de Investigación en Matemáticas A.C. May 1, 2010
More information