Some Measures of Agreement Between Close Partitions

Some Measures of Agreement Between Close Partitions Genane Youness and Gilbert Saorta CEDRIC CNAM, BP - 466, Beirut, Lebanon, genane99@hotmail.com Chaire de Statistique Aliquée- CEDRIC, CNAM, 9 rue Saint Martin, 75003 Paris, France, saorta@cnam.fr Abstract. In order to measure the similarity between two artitions coming from the same data set, we study extensions of the RV-coefficient, the aa coefficient roosed by Cohen (in case of artitions with same number of classes, and the D coefficient roosed by Poing. We find that the RV coefficient is identical to the Janson and Vegelius index. We comare the result coming from aa s coefficient to the ordination given by corresondence analysis. We study the emirical distribution of these indices under the hyotheses of a common artition. For this urose, we use data coming from a latent rofile model to formulate the null hyothesis. Keywords. RV, Janson and Vegelius index, Cohen s Kaa, Poing D, Corresondence analysis, latent class. Introduction The roblem of measuring the agreement between two artitions of a same data set has attracted interest and reaears continually in the literature of classification. The most useful agreement measured roosed by Rand (97 has been rediscovered and modified by [FOW 83]. Based on the comarison of object triles, a measure of artition corresondence was introduced by [HUB 85]. A generalized Rand index method for consensus clustering was roosed by [KRI 99] for finding an amalgamated clustering of a set of contributory artitions. By using mathematical and statistical asects in the standardization of the comarison coefficients, [LER 88] shows how to tae into account the relational constraint, which results from the artition structure. A simle way to comare artitions is to study the emirical distribution of a measure of agreement under some null hyothesis. It is necessary to define measures of agreement between artitions, before testing the agreement. In revious aers [SAP 0,0], [YOU 03], we have examined the following indices: Rand, Mc Nemar, and Jaccard. In this aer, we resent new aroaches to find the similarity between clustering. The RV-coefficient introduced by Robert, P. and Escoufier, Y. [ROB 76] is roosed and written in term of aired comarisons. We rove that RV-coefficient is identical to the JV index [JAN 8]. The aa coefficient roosed by Cohen [COH 60] is another way to measure the agreement between artitions with equal number of classes. Kaa can be used only in situations where categories are secified in advance, which is not the case of artitions where the labels of the classes are arbitrary: for that, we identify the ermutation of classes of one of the artitions by maximizing the aa values. We comare the otimal ermutation with the raning given by corresondence analysis: the results are the same. Finally we also study the D index roosed by Poing [POP 83]. This less nown index is based on comarison of airs of subjects. We find the emirical distribution of this agreement measure when the artitions only differ by chance.

. Notations Let P and P be two artitions (or two categorical variables of the same subjects with and q classes. If K and K are the disjunctives nx and nxq tables and N the corresonding contingency table with elements ' n, we have: N K K Each artition P is also characterized by the n n aired comarison table C with general term ' : c if i and i' are in the same class of P, ii' 0 otherwise Note that c ii, i N ' ' We have C KK and C K K Given n subjects, n(n-/ airs of subjects can be comared. When both artitions assign airs of subjects to the same classes, we consider this as an agreement. Table. The four cases P \ P Same class Different class Same class a Agreement (same Different class d Disagreement c Disagreement b Agreement (Different c ii 3. RV- coefficient Robert, P. and Escoufier, Y. [ROB 76] have derived a measure of similarity of two configurations, called RV-coefficient or the vector correlation coefficient; it allows to measure the similarity between two data tables X and X of the same observations by comaring the scalar roduct between individuals associated to the two tables. If the scalar matrices roduct X i X i is noted as W i with dimension nxn, the RV-coefficient is defined as: RV (X, X trace (W W trace (W trace (W When alied to the disjunctive tables associated to two artitions, we find: RV(P,P trace (C trace (C C trace (C (c ii ' (c ii ' i i' i i' (c ii ' ( c i i' ii ' q q ( A high value of RV may lead to the conclusions that artitions are almost identical. Lazraq, A. and Cleroux, R. [LAZ 0,0] have roosed a test for the null hyothesis that the theoretical (oulation RV is zero, but only in the case of numerical data, which cannot be alied to indicator variables.

3. JV index. The JV index is a measure of association roosed by Janson, S. and Vegelius, J. [JAN 8]. Its initial exression is: q nuv n u. q n.v + n u v u v JV(P,P [( n + n ][q(q n + n ] It has been roved that, in terms of aired comarisons: JV(P,P u u. (c (c ii' (c ii' i,i' i,i' Idrissi [IDR 00] used this formula to study the robability distribution of JV under the hyothesis of indeendence. If the classes are equirobable, one finds that c ii' c ii' follows a binomial distribution i i' B(n(n-,. Idrissi claims that the JV index between two categorical variables with equirobable modalities has an exected value equal to: ii' (c E (JV n The RV coefficient and the JV index are identical, in terms of aired comarisons. i,i' ii' q v q.v 4. Cohen s Kaa. The aa coefficient for comuting nominal scale agreement between two raters was first roosed by Cohen [COH 60]. He defined his index as the roortion of agreement after chance agreement is removed from consideration. P o Pe κ P Where P 0 n i n i n P e n n ii i.. i In this formula, P o is the amount of observed agreement. In case of indeendence, given the marginals, the exected amount of agreement is P e. Therefore, a correction is made for this amount agreement. In order that the index assumes the value in case of erfect agreement, the correction is also made in the denominator. It measures the deviation of objects on the diagonals in the agreement table. We use the equivalent formula of the aa coefficient to comute the agreement between two artitions having the same number of classes: e 3

n n n n ii i.. i i i κ n n i. n. i i An imortant condition to use the aa coefficient is for the situation in which the identity of classes er observations is nown in advance. In our case, when we comare artitions coming from the same data set and found by the clustering methods, the numbering of classes is totally arbitrary: so we roose to identify the classes of the artitions from the maximum value of κ. We ermute the columns or rows in the cross table and we calculate κ at each time; we tae the numbering of ermutation with the maximum κ. Another method to find the otimal ordering is to use the order of the categories of both variables given by the first factor in the simle corresonding analysis. This method maximizes the weight of the diagonal of the cross table by ermuting their lines and columns. We comare the result found by the simle corresonding analysis to our method. Often, the same value of κ can be observed. 5. Poing s D The D index was roosed by Poing, R. [POP 83] and is based on the same rinciles of aa coefficient. It is used to study the agreement between the nominal classifications by two judges who indeendently categorize the same subjects in case that the categories are not nown in advance. The D index is based on the comarison of airs of subjects and start from the situation of same agreement (Table. The index contains a correction for agreement that can be exected on the basis of chance given the marginals of the original classification. Under the null hyothesis of indeendence, the D is a transformation of the Russel and Rao s coefficient a (. a + b + c + d The D index is defined as: where Do De c q n i j (n n(n q c i j n(n D D D g (h 0,5g 0,5 where o m De D ni. (ni. n. j(n. j i D, D q n(n n(n e ni..n. j h and g integer(h n and D m Max(D,D q Note that in D one considers D e as a reasonable minimum [POP 83], so for situations where we use the g in D e, we have no emirical roof to get an exected value smaller than the minimum. This is in favor of g 4

instead of the fractional number h. However, this results in the fact that D e is biased: the results are too high. The consequence for D is that the index will tae conservative values. The D index has been comared to the aa coefficient to the JV- index, in the articular following case: Table. The articular case found by Poing Categories + - Total + u v-u v - v-u u v Total v v v u v Poing obtained the results: κ v u v D v JV In the general case we found that there are a strong correlation between these indices in the simulation study. We find a relation between the Jaccard s index nown as measure of similarity between objects described by resence-absence attributes [SAP 0] to the D coefficient, we find the following relation: In case of Integer(h h D n (C bj max(a + c,a + d C n a where J a + c + d and C n n(n 6. Emirical distribution We use the algorithm resented in [SAP 0] for finding the emirical distribution of the indices of association when the two artitions only differ by chance. The difficulty consists in concetualising a null hyothesis of identical artitions and a rocedure to chec it. Note that dearture from indeendence does not mean that there exists a strong enough agreement. Now we have to define the sentence two artitions are identical. Our aroach consists to say that the units come from a same artition, where the two observed artitions are noised. The latent class model is well adated to this roblem for getting artitions. Note that Green and Krieger [GRE 99] have used it in their consensus artition research. More recisely, we use the latent rofiles models because we have numerical variables For getting near-identical artitions, we suose the existence of a common artition for the oulation according to a latent rofile model. The basic hyothesis is the indeendency of observed variables conditional to the latent classes that gives similar artitions from one or another grous of variables: f ( x π f ( x / j j 5

Theπ are the class roortions and x is the random vector of observed variables, where the comonent x j are indeendent in each class. We use here the latent class model to generate the data and not to estimate arameters. Once having choosen the number of classes, we first generate their frequencies from a multinomial distribution with robabilities π, and then we generate observations in each class according to the local indeendence model in other words a normal mixture model with indeendent comonents in each class. Then we slit arbitrarily the variables into two sets and erform a artitioning algorithm on each set. The two artitions should thus differ only by random. We calculate the indices for the two artitions: we reeat the rocedure N times to find a samling distribution of aa, RV (or J index, and D. Our algorithm has the following stes:. Generate the sizes n, n,.., n of the clusters according to a multinomial distribution M(n; π π. For each cluster, generate n i values from a random normal vector with indeendent comonents 3. Get two artitions: P of the units according to the first variables and P according to the last - variables. 4. Comute association measures (for RV, J and D for P and P 4. Permute the columns of the cross table (P, P to find the maximum value of aa, so the numbering of classes. 5. Reeat the rocedure N times. 7. Numerical Alications We alied the revious rocedure with 4 equirobable latent classes, 000 units and 4 variables. We obtained the two artitions P (with X and X and P (with X 3 and X 4 with 4 classes by the -means methods. We resent only one of our simulations (erformed with S+ software. Table 3. The normal mixture model Class n Class n Class n 3 Class n 4 X N(.,.5 X N( -,.5 X N( -5,.5 X N(-8,.5 X N(4,.5 X N(-4,.5 X N(-0,.5 X N(-5,.5 X3 N(7,3.5 X3 N(-6,3.5 X3 N(-3,3.5 X3 N(-0,3.5 X4 N(0,4.5 X4 N(-0,4.5 X4 N(-0,4.5 X4 N(-30,4.5 The following figure shows the satial distribution of one of the 000 iterations for the two artitions P and P. Comonent -6-4 - 0 4 6 Comonent -5-0 -5 0 5 0 5-0 0 0 Comonent These two comonents exlain 00 % of the oint variability. -0 0 0 40 Comonent These two comonents exlain 00 % of the oint variability. Figure. The first two rincial comonents of one of the 000 samles for the P and P 6

The cross table of the two artitions in one of the simulations is reresented as follows: Table 4. Cross tabulation of P and P for one of the simulation 3 4 48 0 0 98 7 9 6 43 0 0 58 9 We find a value of the aa coefficient equal to 0.335, and a value of JV index (or RV equal to 0.648. The value of the D index is equal to 0.647. To identify the labels of classes of P to those of P, we ermute the columns (4! ermutations of the cross table: The maximum value of the aa coefficient is 0.787, and the numbering of column s ermutation is,,4,3. So the table becomes: Table 5. Cross tabulation with the new ermutation of columns 4 3 48 0 0 98 9 7 6 0 43 0 58 9 Another way of getting the otimal ordering is by means of corresondence analysis: categories of both variables are ordered according to their coordinates along the first axis. Table 6. Order according to the first factor of CA 98 7 9 58 9 0 6 43 0 0 0 48 If we calculate the aa coefficient of this last table, we find the value 0.787. Here corresondence analysis ordering gives the same result as our method. The distribution of maximum aa coefficient values is resented below: 7

400 300 00 00 0 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 KAPPA 0.3 0.4 0.5 0.6 0.7 0.40 0.45 0.50 0.55 0.60 0.65 0.70 Figure 3: Distribution of aa for 000 individuals and 000 iterations, artitions with 4 classes. The scatter lot of JV against D in the 000 iterations. With the same choice of normal indeendent mixtures variables, we find that the aa coefficient varies between 0.4 and 0.875. Its means is 0.8. By simulation, the value of the JV index varies between 0.4 and 0.7. The most frequent value is 0.63 and the mean is equal to 0.67. (Fig. 4 Under the hyothesis of indeendence E(JV0.003, and with 000 observations, indeendence should have been rejected for JV>0.67 at 5% level. The 5% critical value is much higher than the corresonding one in the indeendence case. It shows that dearture from indeendence does not mean that the two artitions are close enough. The D index taes its values between 0.3 and 0.7 with a mode equal to 0.65. The distribution has a mean equal to 0.60. Bimodality is due to the use of -means, which gives a local otima [YOU 04]. There is a strong correlation between the JV and D equal to 0.983. So we can say that the two indices give same result in comaring artitions (Fig.3 Under the null hyothesis of close artitions, all indices have values around the mean, which is close to 0.6 so that we could say that the two artitions P and P are close enough. 8

0 50 00 50 00 50 300 0 00 00 300 400 500 600 0.4 0.5 0.6 0.7 JV 0.3 0.4 0.5 0.6 0.7 D Figure. The JV- index distribution and D for the artitions of 4 classes in 000 iterations 8. Conclusion In this aer, we have roved the identity between the RV coefficient and the JV- index for comaring two artitions. A latent class model has been used to solve the roblem of comaring close artitions and three agreement indices: JV (or RV, aa and D have been studied. The aa coefficient allows to test the similarity between two artitions after ermutation and when we they have the same number of classes. We comare the otimal ermutation with the order given by corresondence analysis: By simulation, it gives often the same results. The D index seems useful for comaring the classification of two artitions in the case that the identification of classes is not nown in advance. D gives stress on airs in the same class of both artitions, as Jaccard s index. We found a relation between them. Since we found a strong correlation between the JV index and the D in our simulation, we can deduce that JV (or RV and D lead us to the same result in comaring artitions. The distributions of these roosed indices have been found very different from the case of indeendence and are bimodal. The bimodality might be exlained by the resence of local otima in the -means algorithm. 9. References [COH 60] COHEN, J., A Coefficient of Agreement for Nominal Scales., Educ. Psychol. Meas., vol 0, 7-46, 960. [FOW 83] FOWLKES, E.B. AND MALLOWS, C.L., A Method for Comaring Two Hiearchical Clusterings, Journal of American Statistical Association, vol. 78, 553-569, 983. [GRE 99] GREEN, P., KRIEGER, A. A Generalized Rand-Index Method for Consensus Clustering of Searate Partitions of the Same Data Base, Journal of Classification, vol.6, 63-89, 999. [HUB 85] HUBERT, L., ARABIE, P., Comaring Partitions, Journal of Classification,vol., 93-8, 985. [IDR 00] IDRISSI A. Contribution à l Unification de Critères d Association our Variables Qualitatives, Thèse de doctorat de l Université de Paris 6, 000. 9

[LAZ 0] LAZRAQ, A., CLEROUX R.. Inférence Robuste sur un Indice de Redondance, Revue de Statistique Aliquée, vol.4, 39-54, 00. [LER 73] LERMAN, I.C, Etude Distributionnelle de Statistiques de Proximité entre Structures Finies de Memes Tyes; Alication à la Classification Automatique, Cahier no.9 du Bureau Universitaire de Recherche Oérationelle, Institut de Statistique des Universités de Paris, 973. [LER 88] LERMAN, I.C, Comaring Partitions (Mathematical and Statistical Asects, Classification and Related Methods of Data Analysis, H.H Boc Editor, -3, 988. [JAN 8] JANSON S., VEGELIUS J. The J- Index as a Measure of Association For Nominal Scale Resonse Agreement. Alied sychological measurement, vol.6, 43-50, 98. [POP 83] POPPING, R. Traces of Agreement. On the Dot-Product as a Coefficient of Agreement. Quality and Quantity, vol 7 (, -8, 983. [POP 9] POPPING, R. Taxonomy on Nominal Scale Agreement, Groningen ; iec ProGAMMA, 945-990, 99,. [ROB 76] ROBERT P., ESCOUFIER, Y. A Unifying Tool for Linear Multivariate Statistical Methods: The RV-Coefficient. Al. Statist., vol. 5, 57-65, 976. [SAP 0] SAPORTA G., YOUNESS G. Concordance entre Deux Partitions: Quelques Proositions et Exériences, in Proceedings SFC 00, 8èmes rencontres de la Société Francohone de Classification, Pointe à Pitre, 00. [SAP 0] SAPORTA G., YOUNESS G. Comaring Two Partitions: some roosals and Exeriments, Proceedings in Comutational Statistics edited by Wolfgang Härdle, Physica- Verlag, Berlin, 00. [YOU 04] YOUNESS G., SAPORTA G., Une Méthodologie Pour la Comaraison de Partitions, Revue de Statistique Aliquée, à araître. 0