A DIFFERENT APPROACH TO MULTIPLE CORRESPONDENCE ANALYSIS (MCA) THAN THAT OF SPECIFIC MCA. Odysseas E. MOSCHIDIS 1

Math. Sci. hum / Mathematics and Social Sciences 47 e année, n 86, 009), p. 77-88) A DIFFERENT APPROACH TO MULTIPLE CORRESPONDENCE ANALYSIS MCA) THAN THAT OF SPECIFIC MCA Odysseas E. MOSCHIDIS RÉSUMÉ Un traitement de l analyse des correspondances multiples ACM) différent de celui de l ACM spécifique Dans lanalyse des correspondances multiples, la contribution de chaque variable nominale à l inertie est différente selon le nombre des modalités, ou des catégories de cette variable. Habituellement, pour les variables ayant beaucoup de modalités ou catégories) il y a des modalités peu fréquentes les «classes faibles») qui contribuent à linertie de la variable correspondante de façon disproportionnée. Souvent, ces modalités contribuent fortement à la détermination des premiers axes factoriels ce qui a pour conséquence de ne pas correctement représenter le problème étudié. Lanalyse spécifique des correspondances multiples traite le problème des modalités peu fréquentes en les supprimant. Cest-àdire, qu elle les ignore purement et simplement dans le calcul des distances entre les individus [Le Roux B., 999 ; Le Roux B., Rouanet H., 004]. Dans cet article, nous traitons ce problème dune façon différente. Nous maintenons les modalités faibles dans lanalyse. Ce que réalise notre analyse, c est de remplacer la métrique du χ d, χ ), par une nouvelle métrique qui prend également en compte le nombre de modalités de chaque variable : c est aussi d attribuer un effet raisonnable aux modalités faibles et d équilibrer toutes les variables nominales. En outre, nous tenons compte, uniformément, des modalités faibles, qu elles dérivent de beaucoup ou de peu de variables, bien que la plupart des cas «dangereux» sont ceux des variables qui ont beaucoup de modalités. Seules les variables à deux modalités ne sont pas concernées. MOTS-CLÉS Analyse des correspondances multiples, Analyse spécifique des correspondances multiples, Coefficient d austement, Nouvelle métrique SUMMARY In multiple correspondence analysis, each nominal variable affects the analysis with a different amount of inertia, depending on the number of its modalities or categories. Usually in variables with many modalities categories created infrequent weak classes) modalities which contribute disproportionally to the inertia of the corresponding variable. Often these modalities contribute heavily to the determination of the first factorial axes and as a result this cannot clearly represent the investigated problem. Specific multiple correspondence analysis deals with the problem of infrequent weak) modalities by removing them. That is, it simply ignores them in the calculation of distances between individuals [Le Roux B., 999; Le Roux B., Rouanet H., 004]. In this paper we deal with this problem in a different manner. We keep the weak modalities in the analysis. Replacing the X d x ) metric by a new metric which also takes into account the number of modalities of each variable, a reasonable effect of the weak modalities and a balancing of all the nominal variables is achieved in the analysis. We also encounter uniformly the weak modalities, whether they derive from many or few variables, even though the most dangerous case is the one variables where have many modalities. Only variables of two modalities are not affected. MCA KEYWORDS Adustment coefficient, Multiple correspondence analysis, New metric, Specific General manager and President of General Hospital of Thessaloniki, fmos@uom.gr

78 O. E. MOSCHIDIS. INTRODUCTION We deal with the processing of a table of the form individuals nominal variables) using multiple correspondence analysis. It is clear that with the metric X as a mechanism which accents the similarity of the individuals, the latter is affected by the number of modalities of each variable. Variables with few categories create columns with heavy weights compared to the other variables with many categories, which create columns with lower weights. As a result, the similarity of two individuals in relation to a variable with few categories is a priori more powerful than their similarity in relation to a variable with many categories. Initially we deal with this problem in order to let the nominal variables with a different number of modalities or categories contribute equivalently to the similarity of the individuals. For this reason, we introduce a balance or an adustment coefficient and a new metric d X ).. PROPOSAL OF THE NEW METRIC d x Suppose that E, E,...,E S, S) are the nominal variables or questions with p, p,..., p S respectively the number of modalities or the potential responses. We define as a balance or an adustment coefficient of the question E k the number: I k * p k ",.) Taking into account the adustment coefficient I * k, we define a new metric d X, by the formula: d X a i,a ) # + p k " f. ) +,.) k f. µ k * s"m k where the individuals α i and α give m common responses m S) to S questions, f " # k is the weight of the modality column) " k of the question E k which is selected by the individual " i and respectively f "µ k is the weight of the modality column) µ k of the question E k which is selected by the individual ". With the following example we want to show the impact of the new metric d X on the balance of the variables. Suppose the table 0- which corresponds to the nominal variables E, with 3 modalities we have E, with 5 modalities, which are answered by 00 individuals.

A DIFFERENT APPROACH TO MULTIPLE CORRESPONDENCE ANALYSIS 79 Table. The table 0- of the nominal variables E and E E E Weight " 0 0 0 0 0 0 /00 " 0 0 0 0 0 0 /00 " 3 0 0 0 0 0 0 /00 Weight 0. 0. 0. 0. 0. 0. 0. 0. With the metric of X it is: ) d X "," # 0. + 0. # + 0. + 0. 0 + 0 Contribution of E Contribution of E this means that in question Ε, it seems that the individuals are much more similar than question Ε, while with the new metric it is: d X ) "," 3 # 0. + 0. ) + 5 # 0. + 0. ) 5 + 5 Contribution of E Contribution of E this means that in each question we have the individuals with equal degree of similarity. For each variable, the new metric d X functions as an adustment coefficient of the weights of its columns, so it finally becomes a balancing measure of the similarity of the individuals against all the questions variables). 3. THE NEW METRIC d X AND THE INERTIA In this paragraph we investigate the consistence of the metric d in the total inertia I X o of the cloud of responses, which is equal to the inertia of the cloud of the individuals. The inertia of the cloud of responses through the metric of X as well as the implication of the latter in multiple correspondence analysis, has been investigated by J.-P. Benzécri [980b)], Greenacre [993], B. Escofier [988], L. Lebart, A. Morineau, M. Piron [000]. Below we give the basic conclusions. The inertia I, of the response or modality ), of question q or variable q) is given by the formula

80 O. E. MOSCHIDIS I S " # K n ),.) where S) is the number of the responses, K ) is the weight of column J and n) is the number of the individuals. Consequently, the smaller number K is, that is the more infrequent it is as a response or the weaker a modality is), the greater is its inertia. This results in a deformation of the real picture of analysis since this weak modality contributes decisively in the creation of the factorial axis. The inertia I q of question q is the sum of inertias of its responses and is given by the formula I q S m " ),.) where m is the number of potential responses to in question q. Consequently, the more responses a question has, the more this question contributes to the total inertia of the cloud of the responses. That is, questions with many responses contribute with great inertia. For this reason the aforementioned authors suggest that if we want the questions to have the same weight, they must have about the same number of responses [Lebart L., 995]. Finally, the total inertia I o of the cloud of responses, which is the sum of the inertias of all the questions, is given by the formula I o P S ",.3) where p is the number of potential responses, to all questions, as a result the more answers there are, the greater appears the total inertia. Now we investigate the corresponding magnitudes with the new metric Initially we compute the distance d X,g of the cloud of responses. d X It is n,g) # i" m " * k i " k n * ) n) m n n # k i " " k ) n i m n " n k i " n # # k i nk i" m n n # k i " k " k i + nk ) n i d. X ) of the response from the center of weight g ) ) it is k n i k i, because k i 0 or

A DIFFERENT APPROACH TO MULTIPLE CORRESPONDENCE ANALYSIS 8 m n " n k i " + k nk # ) i n n m " k k " n + ) n m n # " k " n In other words d X,g) m n " # " k n )..4) So, the smaller the weight of response, the greater the distance from the center of weight g. This is canceled by the divider m -. For the inertia I of the response J we have " I k # ns d X,g ) k ns n " m ) k ) # n k Sm ") # " k n ) Sm ") # " k n ) That is I Sm ") # " k n ).. ) So, the inertia created by an infrequent response small k ) is smaller because of the divider m. ). Consequently, the contribution of infrequent responses in the creation of the factorial axes is moderated. This also limits the possibility that the real picture is deformed from the infrequent responses. The inertia I q of question q is the sum of the inertias of its responses m I q " I m " # k Sm #) n ) Sm ") m " m # k n Sm #) ) ) Sm ") m " n ) n ) m # k " n ) Sm ") m " ) S that is I q S. ) So we believe that we obtain an important result: the inertia of a response is not dependent on the number of its potential responses. Instead questions have equal inertia " I q #. It is significant to rise the specific importance of each response in the S question. This results from formula. ), where its inertia depends on its weight k ).

8 O. E. MOSCHIDIS It is obvious that the total inertia I o of the cloud of responses, sum of the inertias of all the questions, is I o " I q " S s q s q that is I o.3 ) It is also clear that the role and the ability which gives this new metric d X efficient and useful in the analysis of questions with a different number of responses. is 4. TRANSFORMATION OF INITIAL DATA In order to proceed to a concrete application of the proposed approach of MCA and drax attention to the differences from classic MCA, we must develop a program wih the new proposed metric d X, in which every variable contributes to the analysis with equal inertia, without dependence on the number of its classes. There is another possibility which consists in transforming the initial data in the logical table and then applying classic MCA, in such a way that in the new transformed table, classic MCA with the classic " metric, lets variables contribute with equal inertia that is independent from the number of their classes. This is what we have chosen to represent. Initial table of 0- data E K i E m E s Sums of rows Center of weight g) " " " i x S S ns n " n ns K Variable E m consists of P m classes, where K i is equal to 0 or and this assigns of individual " i to class of E m variable. In order to create a new table we divide each E m variable with number P m " and then we proceed in the same manner for the rest of the variables. The new Transformed table is:

A DIFFERENT APPROACH TO MULTIPLE CORRESPONDENCE ANALYSIS 83 E E m E s Sums of rows Center of weight g) " " " i K i p m " p " + p " +...+ p m " +...+ p s " A A na n " n na K p m " We calculate consecutively, using the classic " metric the d x,g) distance of class from the center of weight g, the inertia I * of class, the inertia I m * E m and finally the total inertia I o" *. We have: k i n,g) " p # " m i k n) * n * * p m ) d x of the variable n # n k i " k n # ) n n " k ) n as.4) 3.) i i and : # I * K p m " na d x,g) K p m " # ) ) n ) A n K " n p m " # ) ) A " K n 3.) Therefore : P m I * * m " I p m #)A P m # K " n ) # p m " ) A p m # n n p m ")A p m ") A 3.3) each question contributes with same inertia equal to /Α, independent from the number of their classes. Finally we have:

84 O. E. MOSCHIDIS S I * o" # I µ A S m 3.4) From the above, we see that both proposals on one hand of the elaboration of the logical table with the new metric d x, and on the other hand of the elaboration of the new transformed table with the classic d x are equal. 5. THE APPLICATION EXPLOITS REAL DATA the answers of 68 citizens are summed up in 3 nominal variables: E,with 7 classes E, E,, E 7 ), E with 5 classes E, E,, E 5 ) and E 3 with 3 classes E 3, E 3, E 33 ). E is a very weak class chosen only by 3 citizens). This class contributes with 39,5 CTR395) cf. Table ) in the creation of 4 th factorial axis it does not contribute significantly neither in the creation of st nor in the nd nor in the 3 rd axis) which absorbs an important percentage 9,35 of the total inertia, while the st factorial axis absorbs,6 of the total inertia cf. Table 3). Table. Co-ordinates, Proections, Contributions #G COR CTR #G COR CTR #G3 COR CTR #G4 COR CTR ER -38 4 4-765 43 97 75 330 508 395 ER 50 3-4 0 70 579 396-60 30 5 ER3-607 9 7-045 9 30-008 46 34-456 9 0 ER4 880 33 5-6 0 7-74 8 6 6 ER5-83 8 68 8 83 77 58 4 4 677 79 59 ER6 37 0 305 44 3-373 66 35-47 87 5 ER7-00 8 49-938 69 48-86 3-063 89 7 ER -967 7 69 9 50 00-55 47 34 3 45 37 ER -94 5 87 77 03 66 3 30 93-75 ER3-44 34 9-460 40 6 4 0-00 93 38 ER4 735 0 99 548 6 63-49 8 6 56 77 ER5 54 4 7-675 8 8-30 59 9 4 0 4 ER3 733 399 5-3 7 495 8 78-30 0 4 ER3-9 3 566 53 77-84 34 64-39 7 43 ER33-34 48-50 86 49 6 555 0 67

A DIFFERENT APPROACH TO MULTIPLE CORRESPONDENCE ANALYSIS 85 Total inertia 4,00000 Table 3. Table of inertia axe inertia of explanation sum Histogramme of characteristic roots 0,5043067,6,6 ***************************************** 0,456748,4 4,0 ************************************* 3 0,448907,05 35,06 ************************************ 4 0,389946 9,75 44,8 ******************************* 5 0,367703 9,8 53,99 ****************************** 6 0,339080 8,48 6,47 *************************** 7 0,343848 7,86 70,33 ************************* 8 0,3049687 7,6 77,95 ************************* 9 0,55783 6,39 84,35 ********************* 0 0,434496 6,09 90,43 ******************** 0,670 5,67 96,0 ****************** 0,559998 3,90 00,00 ************* Therefore the class E alters the real image of the analysis. The above is elated to data analysis with classic Μ.Α.C. By applying multiple correspondence factor analysis in the proposed transformed table, E class does not contribute significantly to the creation of 4 th factorial axis nor, or course to the creation of previous axes) as it results from Table 4. Table 3. Co-ordinates, Proections, Contributions of transformed model #G COR CTR #G COR CTR #G3 COR CTR #G4 COR CTR ER -7 4 6 99 45 3 959 4 3-949 4 4 ER 454 6 5 980 76-7 08 56 00 79 44 ER3-3 79 7 0 733 38 75 369 6 5 ER4 439 80 6 40 0 58 7-48 7 3 ER5-379 4 6-57 3-043 87 89-43 3 4 ER6 84 3 3-67 33 7-7 0 0 3 ER7-609 9 9-4 68 36 0 47 3 69 ER -368 56 4-774 8 8-45 63 53-00 ER -569 55 4 3-66 475 349 350 34 5 ER3-380 7 0 04 8 4 48 45 3 9 0 0 ER4 369 5 7-93 4 4-409 64 40-77 536 36 ER5 4 8 9 59 844 44 34 397 98 56 ER3 938 654 336 638 30 65-80 4 6 3 0 3 ER3-68 4-47 973 68 47 4 6 6 8 ER33-53 76 5 757 9 37 74 4-04 3 9 The contribution of E is,4 of 8,8 of inertia of the 4 th factorial axis cf. Table 5)

86 O. E. MOSCHIDIS axe inertia Table 5. Inertia table of the transformed model of explanation sum Histogramme of characteristic roots 0,607477 8,56 8,56 ***************************************** 0,569947 7,4 35,98 ************************************** 3 0,368 9,66 45,64 ********************* 4 0,90560 8,88 54,5 ******************** 5 0,80637 8,59 63, ******************* 6 0,687994 8, 7,3 ****************** 7 0,840 5,53 76,85 ************ 8 0,785875 5,46 8,3 ************ 9 0,60374 4,89 87,0 *********** 0 0,583436 4,84 9,04 *********** 0,46049 4,46 96,50 ********** 0,4409 3,50 00,00 ******** The result of this is that the real explanatory importance of the axis does not change. It is also remarkable that with the proposed analysis, explanation percentages of the first factorial axes are also generally improved. We must note that each formula is verified for I o"* is: S 3number of variables) A p " + p " + p 3 " 6 + 4 + + 3 + 6 0,96 Therefore: I o"* S A 3 0,96 3,773 Exactly as it is shown in Table 5 of inertia I o"* 3,773

A DIFFERENT APPROACH TO MULTIPLE CORRESPONDENCE ANALYSIS 87 The table of the data in the application E E E 3 E E E 3 E E E 3 E E E 3 I 6 4 I 6 3 3 I4 4 5 I6 4 5 I 4 4 I 5 3 I4 4 5 I6 4 5 I3 6 3 I3 4 4 I43 4 3 3 I63 5 5 I4 4 I4 7 5 I44 6 I64 3 I5 6 3 I5 4 5 I45 6 5 I65 5 3 I6 4 4 I6 6 5 I46 6 4 3 I66 7 5 3 I7 6 5 I7 5 4 I47 4 I67 5 3 I8 6 5 I8 4 4 I48 5 4 I68 I9 6 4 I9 6 5 I49 6 I0 6 4 I30 6 5 I50 5 5 I 5 I3 7 I5 5 3 I 6 4 I3 6 5 I5 5 I3 6 4 I33 4 3 I53 4 4 I4 6 3 I34 6 3 I54 6 5 I5 7 5 I35 3 5 3 I55 I6 5 3 I36 4 5 I56 4 4 I7 4 3 I37 7 3 3 I57 4 3 3 I8 4 4 3 I38 4 4 I58 5 3 I9 3 5 3 I39 4 5 I59 5 4 I0 6 3 I40 4 5 I60 3 3 6. CONCLUSIONS With the balance or adustment coefficient and the new metric d x, it became feasible to let nominal variables with a different number of modalities, affect the analysis with " the same amount of inertia I q #. As a result, the specific weight and significance of S each modality of the variable is more clearly revealed in the analysis. Also, the inertia which is created by a weak modality small k ) is smaller because of the divider m ". ). Consequently, the contribution of infrequent responses in the determination of the factorial axes is tempered, that is, the possibility that the real picture is deformed by infrequent responses is limited. Finally, with the new transformed table and the classic MCA with classic the " metric, the variables contribute with equal inertia that is independent from the number of their classes. With an application, we show the differences between classic MCA and the new proposed approach.

88 O. E. MOSCHIDIS REFERENCES BENZECRI J.-P., «Sur l analyse des tableaux binaires associés à une correspondance multiple [ BinMult ]», Cahiers de l analyse des données ), 977, p. 55-7, [from a mimeographed note of 97]. ESCOFIER B., PAGES J., «Analyses factorielles multiples», Paris, Dunod, 988. GREENACRE M., Correspondence Analysis in Practice, London, Academic Press, 993. LEBART L., MORINEAU A., PIRON M., Statistique exploratoire multidimensionnelle, Paris, Dunod, 000. LE ROUX B., «Analyse spécifique d un nuage euclidien», Mathématiques et Sciences humaines 37, 999, p. 65-83. LE ROUX B., ROUANET H., Geometric Data Analysis from Correspondence Analysis to Structured Data Analysis, Dordrecht-Boston-London, Kluwer Academic Publisher, 004. MOSCHIDIS O., PAPADIMITRIOU I., Evaluation scales subectivity transformation of the initial degrees in probabilities.), Data Analysis Bulletin, 00, p. 4-5. MOSCHIDIS O., PAPADIMITRIOU I., Proposition of new indicators and a new metric for scale evaluation analysis with the methods of multidimensional data analysis, Data Analysis Bulletin, 00, p. 65-73. MOSCHIDIS O., Evaluation scales: metrics similarity in table 0-, Proceedings of 5 th Pan-Hellenic Statistic Congress, 00, p. 500-508. MOSCHIDIS O., A proposition for coding evaluation scales, Proceedings of the 6 th Pan-Hellenic Statistic Congress, 003, p. 337-346. VAN DE GEER J. P., Multivariate Analysis of Cate-orical Data: Theory, Sage, California, 993, 0 p.