Correcting Jaccard and other similarity indices for chance agreement in cluster analysis


Adv Data Anal Classif
REGULAR ARTICLE

Ahmed N. Albatineh · Magdalena Niewiadomska-Bugaj

Received: 1 June 2010 / Revised: 15 March 2011 / Accepted: 18 May 2011
© Springer-Verlag 2011

A. N. Albatineh (corresponding author), Department of Epidemiology and Biostatistics, Florida International University, Miami, FL, USA. e-mail: aalbatin@fiu.edu
M. Niewiadomska-Bugaj, Department of Statistics, Western Michigan University, Kalamazoo, MI, USA. e-mail: m.bugaj@wmich.edu

Abstract Correcting a similarity index for chance agreement requires computing its expectation under fixed marginal totals of a matching counts matrix. For some indices, such as Jaccard, Rogers and Tanimoto, Sokal and Sneath, and Gower and Legendre, the expectations cannot be easily found. We show how such similarity indices can be expressed as functions of other indices and how their expectations can be found by approximations, so that approximate correction is possible. A second approach is based on Taylor series expansion. A simulation study illustrates the effectiveness of the resulting correction of similarity indices using structured and unstructured data generated from bivariate normal distributions.

Keywords Similarity indices · Matching counts matrix · Correction for chance agreement · Jaccard index · Cluster analysis · Comparing partitions

Mathematics Subject Classification (2000) 62H30

1 Introduction

Measuring similarity between two different partitions (clusterings) of the same set of objects is an important issue in cluster analysis. Many similarity measures have been

proposed in the literature and are extensively used in cluster analysis applications, including validation studies and recovery of clustering structure; see Milligan et al. (1983), Saxena and Navaneerham (1991, 1993), Milligan and Cooper (1986) and Steinley (2004) for discussion. The problem with these indices is that they do not account for agreement due to chance. Morey and Agresti (1984) proposed a correction of the Rand index R (Rand 1971) for chance agreement based on an asymptotic multinomial distribution, while Hubert and Arabie (1985) used the exact generalized hypergeometric distribution for the same purpose. Albatineh et al. (2006, p. 308) showed that the difference between the Morey and Agresti (1984) and Hubert and Arabie (1985) expectations (asymptotic and exact) is negligible when the number of objects to be clustered is not too small.

Fligner et al. (2002) proposed a modification of the Jaccard–Tanimoto index to be used in diverse selection of chemical compounds using binary strings. These authors emphasized that the Jaccard–Tanimoto index has been widely used in computational chemistry and has become the standard for measuring the structural similarity of compounds. Historically, the Jaccard–Tanimoto coefficient appeared much earlier, in Jaccard (1908), in an ecological context, to measure the degree of relatedness between two biological communities with respect to their species composition.

Albatineh et al. (2006, p. 307) proposed a correction of the indices of Fowlkes and Mallows (1983), Hamann (1961), Russell and Rao (1940), Czekanowski (Cz) (1932) and Wallace (1983) for chance agreement. Their simulations showed that correction improves the performance of the indices in the sense that the indices take values close to zero when no clustering structure is present, while they take values close to the original index value when a clustering structure exists. Albatineh et al. (2006, p. 308) introduced a family L of similarity indices that are linear functions of the sum of the squares of the matching counts. Some of the indices that are not members of the L family are of great importance and wide applicability in botany, ecology and zoology, such as the index J of Jaccard (1908) and the indices of Sokal and Sneath (1963), Gower and Legendre (1986) and Rogers and Tanimoto (1960), to name a few.

In this paper, our goal is to find a general method to correct similarity indices such as J, RT, SS, and GL for agreement due to chance. Two approaches will be introduced. First, as the indices J, RT, SS, and GL are functions of two members of the L family, namely Czekanowski (Cz) (1932) and Rand (1971), this relationship can be approximated and the expectation in Eq. 2.5 can be approximately computed. Second, a Taylor series expansion of those indices is discussed, an approximation to their expectations is obtained, and thus a correction for chance can be computed.

The paper is organized as follows: Sect. 2 presents an overview of similarity indices; Sect. 3 presents some results relating the indices to each other, with a proposed method for approximating the relationships and hence the correction; Sect. 4 presents the Taylor series idea for finding the expectation of the indices; Sect. 5 presents the simulations showing the effect of the proposed methods, with conclusions in Sect. 6.

Table 1 Binary counts for two clustering (partitioning) methods

Number of pairs | Partition B: in the same cluster | Partition B: in different clusters | Total
Partition A: in the same cluster | a | b | a + b
Partition A: in different clusters | c | d | c + d
Total | a + c | b + d | N

2 Overview of similarity indices

A standard approach in comparing two partitions of the same data set is to calculate the similarity between the two obtained partitions of the underlying set of objects using similarity indices. Since the clusters are not predefined, the similarity of the results between different clustering procedures (algorithms) is usually based on the number of pairs of objects that are (not) placed together into the same cluster, according to each algorithm. Consequently a similarity table as in Table 1 is formed, where a, b, c, d, and N are defined as:

a: Number of pairs of objects which are joined in the same cluster by both clustering methods.
b: Number of pairs of objects which are joined together by method A and not joined together by method B.
c: Number of pairs of objects not joined together by method A, while joined together by method B.
d: Number of pairs of objects which are not joined together by either of the two methods.

The total number of pairs is $N = a + b + c + d = \binom{n}{2} = \frac{n(n-1)}{2}$, where n is the number of observations to be clustered. Let $U = \{u_1, u_2, \ldots, u_I\}$ and $V = \{v_1, v_2, \ldots, v_J\}$ be two partitions of the same data set resulting from the two clustering methods A and B and producing I and J clusters ($i = 1, 2, \ldots, I$ and $j = 1, 2, \ldots, J$), respectively. The entries of Table 1 can also be defined in terms of counts in the matching matrix M between the two partitions U, V as $M = (m_{ij})$, where the entry $m_{ij} = |u_i \cap v_j|$ is the number of common objects in cluster $u_i$ from method A and cluster $v_j$ of method B (Jain and Dubes 1988, p. 173):

$$a = \sum_{i=1}^{I}\sum_{j=1}^{J}\binom{m_{ij}}{2} = \frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - \frac{n}{2}. \tag{2.1}$$

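$$b = \sum_{j=1}^{J}\binom{m_{+j}}{2} - a = \frac{1}{2}\sum_{j=1}^{J} m_{+j}^2 - \frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2. \tag{2.2}$$

$$c = \sum_{i=1}^{I}\binom{m_{i+}}{2} - a = \frac{1}{2}\sum_{i=1}^{I} m_{i+}^2 - \frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2. \tag{2.3}$$

$$d = \binom{n}{2} - a - b - c = \frac{1}{2}\left[\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n^2 - \sum_{i=1}^{I} m_{i+}^2 - \sum_{j=1}^{J} m_{+j}^2\right], \tag{2.4}$$

where $m_{i+} = \sum_{j=1}^{J} m_{ij}$ and $m_{+j} = \sum_{i=1}^{I} m_{ij}$ are the ith row and jth column totals of the matching counts matrix M, respectively.

Any similarity index (SI), when corrected for chance agreement (CSI), takes the form

$$CSI = \frac{SI - E(SI)}{1 - E(SI)} \tag{2.5}$$

where E(SI) is the expected value of the index under fixed marginal totals of the matching counts matrix M and unity is the theoretical maximum value of the index; see Morey and Agresti (1984, p. 35). Any SI that takes the form $SI = \alpha + \beta \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$, where α and β are unique for each index, is said to be a member of the family L (Albatineh et al. 2006). Albatineh (2010) derived means and variances for any member of the family L under fixed marginal totals of the matching counts matrix and independence of the clustering algorithms. For example, the indices of Rand (1971) and Czekanowski (Cz) (1932) are among many that are members of the family L, and can be written as

$$R = \frac{a + d}{a + b + c + d} = \underbrace{1 - \frac{1}{n(n-1)}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}_{\alpha} + \underbrace{\frac{2}{n(n-1)}}_{\beta}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2. \tag{2.6}$$

$$Cz = \frac{2a}{2a + b + c} = \underbrace{\frac{-2n}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 - 2n}}_{\alpha} + \underbrace{\frac{2}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 - 2n}}_{\beta}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2. \tag{2.7}$$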
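As a minimal sketch of Eqs. 2.1–2.4 (an illustration, not code from the paper), the pair counts can be computed directly from two label vectors through the matching counts and their row and column totals. The names `pair_counts`, `rand_index`, `czekanowski`, `row_labels` and `col_labels` are hypothetical; rows index the first labeling and columns the second, so b and c take the column and row roles written in Eqs. 2.2 and 2.3. Every index in this paper uses b and c only through b + c, so swapping the two labelings leaves all index values unchanged.

```python
from collections import Counter


def pair_counts(row_labels, col_labels):
    """Return (a, b, c, d) of Table 1 from two labelings of the same objects."""
    if len(row_labels) != len(col_labels):
        raise ValueError("both labelings must cover the same objects")
    n = len(row_labels)
    # Sums of squares of the matching counts m_ij and of the totals m_i+, m_+j.
    sum_sq = sum(v * v for v in Counter(zip(row_labels, col_labels)).values())
    sum_rows = sum(v * v for v in Counter(row_labels).values())
    sum_cols = sum(v * v for v in Counter(col_labels).values())
    a = (sum_sq - n) // 2                              # Eq. 2.1
    b = (sum_cols - sum_sq) // 2                       # Eq. 2.2
    c = (sum_rows - sum_sq) // 2                       # Eq. 2.3
    d = (sum_sq + n * n - sum_rows - sum_cols) // 2    # Eq. 2.4
    return a, b, c, d


def rand_index(a, b, c, d):
    """Rand index R = (a + d) / (a + b + c + d)."""
    return (a + d) / (a + b + c + d)


def czekanowski(a, b, c, d):
    """Czekanowski index Cz = 2a / (2a + b + c)."""
    return 2 * a / (2 * a + b + c)
```

For example, `pair_counts([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])` counts the 15 pairs of six objects; the integer divisions are exact because each numerator is always even.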

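Table 2 Selected similarity indices

No. | Index | Symbol | Formula
1 | Rogers and Tanimoto (1960) | RT | (a + d) / (a + 2(b + c) + d)
2 | Gower and Legendre (1986) | GL | (a + d) / (a + (1/2)(b + c) + d)
3 | Jaccard (1912) | J | a / (a + b + c)
4 | Sokal and Sneath (1963) | SS | a / (a + 2(b + c))
5 | Sokal and Michener (1958); Rand (1971) | R | (a + d) / (a + b + c + d)
6 | Czekanowski (Cz) (1932); Dice (1945); Sørensen (1948) | CZ | 2a / (2a + b + c)
7 | Hamann (1961) | H | ((a + d) − (b + c)) / (a + b + c + d)
8 | Mcconnaughey (1964) | Mc | (a² − bc) / ((a + b)(a + c))
9 | Johnson (1967) | Jo | a/(a + b) + a/(a + c)
10 | Kulczynski (1927) | K | (1/2)(a/(a + b) + a/(a + c))
11 | Legendre and Legendre (1998) | LL | 3a / (3a + b + c)
12 | Lamont and Grant (1979) | LG | a / (2a + b + c)
13 | Maarel (1969) | M | (2a − (b + c)) / (2a + b + c)
14 | Sokal and Sneath (1963) | SS3 | 2(a + d) / (2(a + d) + (b + c))
15 | Sokal and Sneath (1963) | SS4 | (a + d) / (b + c)
16 | Southwood (1978) | S | a / (b + c)

Table 2 presents a partial list of similarity indices. The indices RT, SS, GL, and J can be written in terms of $m_{ij}$ as

$$RT = \frac{a + d}{a + 2(b + c) + d} = \frac{2\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n(n-1) - \left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 + n(n-1) - 2\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2}. \tag{2.8}$$

$$SS = \frac{a}{a + 2(b + c)} = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n}{2\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right) - n - 3\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2}. \tag{2.9}$$

$$GL = \frac{a + d}{a + \frac{1}{2}(b + c) + d} = \frac{2\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n(n-1) - \left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n(n-1) - \frac{1}{2}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}. \tag{2.10}$$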

$$J = \frac{a}{a + b + c} = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 - n - \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2}. \tag{2.11}$$

These indices are not linear in $\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$ and hence are not members of the family L. Their conditional expectations under fixed marginal totals of the matching matrix M cannot be found explicitly. Therefore, the first idea is to express them as functions of other indices whose expectations can be explicitly computed, and to use those in an approximating formula of the corresponding function. In Sect. 3, the indices RT, SS, GL, and J are shown to be functions of R and Cz, which are members of the L family and whose expectations are known.

3 Indices that are not in family L

In this paper, we focus on the indices RT, SS, GL, and J, which are not members of the family L. Other non-members of the family L can be handled in a similar way. Relationships between some indices are presented in Table 3 and have been studied in Hubálek (1982) and Janson and Vegelius (1981) in the context of their suitability for measuring coexistence between two species over different localities in ecology. Similarly, Snijders et al. (1990) established relationships between R, J, and CZ and derived some distributional results. The relationships numbered 1–5 in Table 3 were established in Janson and Vegelius (1981) and Snijders et al. (1990) and are presented here for completeness. Relationships 6–26 are newly formulated. Such relationships will form the basis for approximating expectations of indices that are not members of family L, as discussed in the next section.

Table 3 Relationships between some similarity indices

No. | Comparison | Relationship
1 | (R, RT) | RT = R/(2 − R)
2 | (R, GL) | R = GL/(2 − GL)
3 | (J, CZ) | J = CZ/(2 − CZ)
4 | (SS, J) | SS = J/(2 − J)
5 | (H, R) | H = 2R − 1
6 | (Mc, K) | Mc = 2K − 1
7 | (CZ, M) | M = 2CZ − 1
8 | (SS, CZ) | SS = CZ/(4 − 3CZ)
9 | (RT, GL) | RT = GL/(4 − 3GL)
10 | (Mc, Jo) | Mc = Jo − 1
11 | (LL, LG) | LG = LL/(3 − LL)
12 | (LG, J) | LG = J/(1 + J)
13 | (H, RT) | H = (3RT − 1)/(RT + 1)
14 | (LL, J) | LL = 3J/(1 + 2J)
15 | (LL, SS) | LL = 6SS/(5SS + 1)
16 | (LG, M) | M = 4LG − 1
17 | (J, M) | M = (3J − 1)/(J + 1)
18 | (LG, SS) | SS = LG/(2 − 3LG)
19 | (SS3, SS4) | SS4 = SS3/(2(1 − SS3))
20 | (SS3, R) | R = SS3/(2 − SS3)
21 | (SS4, R) | SS4 = R/(1 − R)
22 | (SS, S) | SS = S/(S + 2)
23 | (S, J) | J = S/(S + 1)
24 | (LL, M) | LL = (3 + 3M)/(5 + M)
25 | (GL, H) | GL = (2H + 2)/(H + 3)
26 | (SS, M) | SS = (M + 1)/(5 − 3M)
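The relationships in Table 3 that are used later (those linking J, SS, RT, and GL to R and Cz) can be checked numerically from the formulas in Table 2. The sketch below is only an illustration; the function names `indices_from_counts` and `relationship_gaps` are hypothetical and the pair counts passed in are arbitrary.

```python
def indices_from_counts(a, b, c, d):
    """Six indices of Table 2 computed from the pair counts of Table 1."""
    s = b + c  # b and c only ever enter through their sum
    return {
        "R": (a + d) / (a + s + d),
        "CZ": 2 * a / (2 * a + s),
        "J": a / (a + s),
        "SS": a / (a + 2 * s),
        "RT": (a + d) / (a + 2 * s + d),
        "GL": (a + d) / (a + 0.5 * s + d),
    }


def relationship_gaps(idx):
    """Absolute gap between both sides of selected Table 3 relationships."""
    return {
        "RT = R/(2 - R)": abs(idx["RT"] - idx["R"] / (2 - idx["R"])),
        "R = GL/(2 - GL)": abs(idx["R"] - idx["GL"] / (2 - idx["GL"])),
        "J = CZ/(2 - CZ)": abs(idx["J"] - idx["CZ"] / (2 - idx["CZ"])),
        "SS = J/(2 - J)": abs(idx["SS"] - idx["J"] / (2 - idx["J"])),
        "SS = CZ/(4 - 3CZ)": abs(idx["SS"] - idx["CZ"] / (4 - 3 * idx["CZ"])),
        "RT = GL/(4 - 3GL)": abs(idx["RT"] - idx["GL"] / (4 - 3 * idx["GL"])),
    }
```

Calling `relationship_gaps(indices_from_counts(2, 1, 4, 8))` should return gaps at the level of floating-point rounding, since each relationship is an algebraic identity in a, b + c, and d.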

Fig. 1 Exact versus approximate relationship between J and Cz indices

3.1 Correction of Jaccard index

As shown in Table 3, the indices J and Cz are related by the equation $J = h(Cz) = \frac{Cz}{2 - Cz}$, where Cz and J can be written in terms of $m_{ij}$ as given by (2.7) and (2.11), respectively. Note that the Cz index is a member of the family L and hence the expectation of Cz under fixed marginal totals of the matrix M can be found explicitly; see Eq. 2.7. In order to find the mean of the J index (not a member of family L), we use the fact that the function h can be closely approximated by a quadratic function (see Fig. 1) of the form

$$J^* = \xi_1 Cz^2 + \xi_2 Cz + \xi_3. \tag{3.1}$$

Least squares estimates of $\xi_1$, $\xi_2$, and $\xi_3$ were found using the R statistical software, and thus the relationship between J and Cz can be approximated by

$$J^* = [\ldots]\,Cz^2 + [\ldots]\,Cz + [\ldots]. \tag{3.2}$$

Point by point evaluation of the exact relationship between J and Cz and the quadratic approximation of J by Cz revealed that the approximation is very good, with maximum absolute difference […], which occurred at points close to 0 or 1. Therefore, the expectation of the approximate J index given by (3.1) can be calculated as

$$E[J^*] = \xi_1 E(Cz^2) + \xi_2 E(Cz) + \xi_3 = [\ldots]\,E(Cz^2) + [\ldots]\,E(Cz) + [\ldots]. \tag{3.3}$$
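The quadratic approximation of Eq. 3.1 can be reproduced in outline with an ordinary least-squares fit. The sketch below is an assumption-laden illustration: the evaluation grid (101 equally spaced points on [0, 1]) is chosen here and is not stated in the text, so the fitted coefficients need not coincide with the authors' estimates. The helper `fit_quadratic` is hypothetical and solves the 3 × 3 normal equations in plain Python; `numpy.polyfit(xs, ys, 2)` would serve the same purpose.

```python
def fit_quadratic(xs, ys):
    """Least-squares (c2, c1, c0) for y ~ c2*x**2 + c1*x + c0 via normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]               # power sums of x
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix of the normal equations for the unknowns (c0, c1, c2).
    A = [[s[0], s[1], s[2], t[0]],
         [s[1], s[2], s[3], t[1]],
         [s[2], s[3], s[4], t[2]]]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(3):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    c0, c1, c2 = (A[i][3] / A[i][i] for i in range(3))
    return c2, c1, c0


def h(cz):
    """Exact relationship J = Cz / (2 - Cz) from Table 3."""
    return cz / (2 - cz)


grid = [i / 100 for i in range(101)]          # assumed grid, not from the paper
xi1, xi2, xi3 = fit_quadratic(grid, [h(x) for x in grid])
max_abs_diff = max(abs(h(x) - (xi1 * x * x + xi2 * x + xi3)) for x in grid)
```

The same helper applies to the other relationships approximated in this section by swapping `h` for R/(2 − R), 2R/(1 + R), or Cz/(4 − 3Cz).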

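To calculate $E[J^*]$, we need $E(Cz)$ and $E(Cz^2)$, which are established in Theorem 1 not only for Cz, but for the whole family L.

Theorem 1 Let SI be any similarity index of the form $SI = \alpha + \beta \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$. Under fixed marginal totals $m_{i+}$ and $m_{+j}$ of the matching counts matrix $M = (m_{ij})$ and independence of the two clusterings, we have

$$E[SI] = \alpha + \beta\left(\frac{PQ}{n(n-1)} + n\right),$$

$$E[SI^2] = \alpha^2 + 2\alpha\beta\left(\frac{PQ}{n(n-1)} + n\right) + \beta^2\left(E[U^2] + \frac{2PQ}{n-1} + n^2\right),$$

where

$$E[U^2] = \frac{2PQ}{n(n-1)} + \frac{4P'Q'}{n(n-1)(n-2)} + \frac{(P^2 - 4P' - 2P)(Q^2 - 4Q' - 2Q)}{n(n-1)(n-2)(n-3)},$$

$$U = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}(m_{ij} - 1) = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{(2)}, \quad P = \sum_{i=1}^{I} m_{i+}^2 - n, \quad Q = \sum_{j=1}^{J} m_{+j}^2 - n,$$

$$P' = \sum_{i=1}^{I} m_{i+}(m_{i+} - 1)(m_{i+} - 2), \quad \text{and} \quad Q' = \sum_{j=1}^{J} m_{+j}(m_{+j} - 1)(m_{+j} - 2).$$

Proof For fixed marginal totals $m_{i+}$ and $m_{+j}$ of the matching counts matrix $M = (m_{ij})$ and independence of the two procedures, the elements of $M = (m_{ij})$ have a generalized hypergeometric distribution; see Lancaster (1969, p. 14) and Fowlkes and Mallows (1983). Define $m_{ij}^{(p)} = m_{ij}(m_{ij} - 1)\cdots(m_{ij} - p + 1)$; then the pth factorial moment of $m_{ij}$ is

$$E\left(m_{ij}^{(p)}\right) = m_{i+}^{(p)}\, m_{+j}^{(p)} / n^{(p)}. \tag{3.4}$$

This implies that $E\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right) = \frac{PQ}{n(n-1)} + n$ (see Hubert and Arabie 1985, p. 200). Therefore,

$$E[SI] = E\left[\alpha + \beta\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right] = \alpha + \beta\, E\left[\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right] = \alpha + \beta\left(\frac{PQ}{n(n-1)} + n\right). \tag{3.5}$$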

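Since

$$SI^2 = \left(\alpha + \beta\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right)^2 = \alpha^2 + 2\alpha\beta\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + \beta^2\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right)^2,$$

we obtain

$$E[SI^2] = \alpha^2 + 2\alpha\beta\, E\left[\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right] + \beta^2 E\left[\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right)^2\right] = \alpha^2 + 2\alpha\beta\left(\frac{PQ}{n(n-1)} + n\right) + \beta^2 E\left[\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right)^2\right]. \tag{3.6}$$

The expectation on the right hand side of (3.6) can be evaluated as follows: consider

$$\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{(2)} = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}(m_{ij} - 1) = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n.$$

Therefore,

$$\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}(m_{ij} - 1)\right)^2 = \left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right)^2 - 2n\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n^2.$$

Hence,

$$E\left[\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right)^2\right] = E\left[\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}(m_{ij} - 1)\right)^2\right] + 2n\, E\left[\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right] - n^2. \tag{3.7}$$

In order to find the first expectation on the right hand side of (3.7), i.e. $E[U^2]$, we use the fact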

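$$\left(m_{ij}^{(2)}\right)^2 = \left(m_{ij}(m_{ij} - 1)\right)^2 = m_{ij}^4 - 2m_{ij}^3 + m_{ij}^2 = m_{ij}^{(4)} + 4m_{ij}^{(3)} + 2m_{ij}^{(2)}$$

and therefore,

$$U^2 = \left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{(2)}\right)^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}\overbrace{\left(m_{ij}^{(2)}\right)^2}^{m_{ij}^{(4)} + 4m_{ij}^{(3)} + 2m_{ij}^{(2)}} + \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{\substack{j'=1\\ j' \ne j}}^{J} m_{ij}^{(2)} m_{ij'}^{(2)} + \sum_{i=1}^{I}\sum_{\substack{i'=1\\ i' \ne i}}^{I}\sum_{j=1}^{J} m_{ij}^{(2)} m_{i'j}^{(2)} + \sum_{i=1}^{I}\sum_{\substack{i'=1\\ i' \ne i}}^{I}\sum_{j=1}^{J}\sum_{\substack{j'=1\\ j' \ne j}}^{J} m_{ij}^{(2)} m_{i'j'}^{(2)}. \tag{3.8}$$

Hence,

$$E(U^2) = \sum_{i=1}^{I}\sum_{j=1}^{J}\left[\frac{m_{i+}^{(4)} m_{+j}^{(4)}}{n^{(4)}} + 4\frac{m_{i+}^{(3)} m_{+j}^{(3)}}{n^{(3)}} + 2\frac{m_{i+}^{(2)} m_{+j}^{(2)}}{n^{(2)}}\right] + \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{\substack{j'=1\\ j' \ne j}}^{J}\frac{m_{i+}^{(4)} m_{+j}^{(2)} m_{+j'}^{(2)}}{n^{(4)}} + \sum_{i=1}^{I}\sum_{\substack{i'=1\\ i' \ne i}}^{I}\sum_{j=1}^{J}\frac{m_{i+}^{(2)} m_{i'+}^{(2)} m_{+j}^{(4)}}{n^{(4)}} + \sum_{i=1}^{I}\sum_{\substack{i'=1\\ i' \ne i}}^{I}\sum_{j=1}^{J}\sum_{\substack{j'=1\\ j' \ne j}}^{J}\frac{m_{i+}^{(2)} m_{i'+}^{(2)} m_{+j}^{(2)} m_{+j'}^{(2)}}{n^{(4)}}. \tag{3.9}$$

After some simplifications and collecting identical terms we obtain the expression for $E[U^2]$ stated in Theorem 1, as desired. Furthermore, we obtain from (3.7):

$$E\left[\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right)^2\right] = E[U^2] + 2n\, E\left[\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right] - n^2 = E[U^2] + 2n\left(\frac{PQ}{n(n-1)} + n\right) - n^2 = E[U^2] + \frac{2PQ}{n-1} + n^2. \tag{3.10}$$

Substituting (3.10) into (3.6) results in

$$E[SI^2] = \alpha^2 + 2\alpha\beta\left(\frac{PQ}{n(n-1)} + n\right) + \beta^2\left(E[U^2] + \frac{2PQ}{n-1} + n^2\right) \tag{3.11}$$

with $E[U^2]$ as given in the statement of the theorem.

Since the Cz index belongs to the family L, we can obtain $E(Cz)$ and $E(Cz^2)$ from Theorem 1 and therefore compute $E[J^*]$ from (3.3) as an approximation to $E[J]$, the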

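expected Jaccard index. Hence, an approximation to the corrected Jaccard index (CJ) is given by

$$CJ = \frac{J - E[J^*]}{1 - E[J^*]}$$

where J and $E[J^*]$ are given by (2.11) and (3.3), respectively.

3.2 Correction of Rogers and Tanimoto index

It is shown in Table 3 that R and RT are related by the equation $RT = f(R) = \frac{R}{2 - R}$, where RT, not a member of the family L, can be written in terms of $m_{ij}$ as in (2.8). The curve representing the relationship between RT and R is similar to Fig. 1, and can be approximated by a quadratic equation of the form

$$RT^* = \gamma_1 R^2 + \gamma_2 R + \gamma_3. \tag{3.12}$$

Least squares estimates of $\gamma_1$, $\gamma_2$, and $\gamma_3$ were found using the R statistical software, and thus the relationship between RT and R can be approximated by

$$RT^* = [\ldots]\,R^2 + [\ldots]\,R + [\ldots]. \tag{3.13}$$

Therefore,

$$E(RT^*) = [\ldots]\,E(R^2) + [\ldots]\,E(R) + [\ldots]. \tag{3.14}$$

For determining $E[R]$ we write

$$R = \underbrace{1 - \frac{1}{n(n-1)}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}_{\alpha} + \underbrace{\frac{2}{n(n-1)}}_{\beta}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 = \alpha + \beta\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2. \tag{3.15}$$

Since R belongs to the family L, Theorem 1 yields

$$E(R) = E\left[1 - \frac{1}{n(n-1)}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right) + \frac{2}{n(n-1)}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right] = 1 - \frac{1}{n(n-1)}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right) + \frac{2}{n(n-1)}\left(\frac{PQ}{n(n-1)} + n\right)$$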

$$= 1 - \frac{P + Q + 2n}{n(n-1)} + \frac{2PQ}{n^2(n-1)^2} + \frac{2}{n-1} = 1 - \frac{P + Q}{n(n-1)} + \frac{2PQ}{n^2(n-1)^2}. \tag{3.16}$$

Note that Theorem 1 provides a general formula for $E[SI]$ and $E[SI^2]$. For the R index, $E[R]$ and $E[R^2]$ are given by Theorem 1 with α and β given in (3.15). In particular, $E[R]$ is given by (3.16) and

$$E[R^2] = \alpha^2 + 2\alpha\beta\left(\frac{PQ}{n(n-1)} + n\right) + \beta^2\left(E[U^2] + \frac{2PQ}{n-1} + n^2\right) \tag{3.17}$$

where α and β are as given in (3.15), and $E[U^2]$ is given in Theorem 1. Thus, the corrected RT (CRT) can be approximately calculated as

$$CRT = \frac{RT - E(RT^*)}{1 - E(RT^*)} \tag{3.18}$$

where RT and $E(RT^*)$ are given by (2.8) and (3.14), respectively.

3.3 Correction of Gower and Legendre index

In Table 3 it is shown that GL is related to R by the equation $GL = h(R) = \frac{2R}{1 + R}$. Equation (2.10) shows that GL can be written in terms of $m_{ij}$ but does not belong to the family L. The relationship between R and GL is displayed in Fig. 2; it can be approximated by a quadratic equation of the form

$$GL^* = \beta_1 R^2 + \beta_2 R + \beta_3. \tag{3.19}$$

Least squares estimates of $\beta_1$, $\beta_2$, and $\beta_3$ were found using the R statistical software, and thus the relationship between GL and R can be approximated by

$$GL^* = [\ldots]\,R^2 + [\ldots]\,R + [\ldots]. \tag{3.20}$$

The maximum absolute difference between the exact and approximate relationships (see Fig. 2) is […]. Therefore,

$$E[GL^*] = [\ldots]\,E(R^2) + [\ldots]\,E(R) + [\ldots]. \tag{3.21}$$

The values of $E[R]$ and $E[R^2]$ in (3.21) are given by (3.16) and (3.17), respectively. Hence the corrected GL index (CGL) can be approximately calculated as

$$CGL = \frac{GL - E(GL^*)}{1 - E(GL^*)} \tag{3.22}$$

where GL and $E(GL^*)$ are given by (2.10) and (3.21), respectively.
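The moment formulas of Theorem 1 can be sketched in code and compared with a brute-force average over all relabelings of a small example. This is an illustration under stated assumptions: the function names are hypothetical, the formula for $E[U^2]$ needs n ≥ 4, and the brute-force check treats every permutation of the second labeling as equally likely, which keeps both sets of marginal totals fixed.

```python
from collections import Counter
from itertools import permutations


def marginal_terms(row_labels, col_labels):
    """Return n, P, Q, P', Q' as defined in Theorem 1."""
    n = len(row_labels)
    rows = list(Counter(row_labels).values())   # m_i+
    cols = list(Counter(col_labels).values())   # m_+j
    P = sum(r * r for r in rows) - n
    Q = sum(c * c for c in cols) - n
    P1 = sum(r * (r - 1) * (r - 2) for r in rows)
    Q1 = sum(c * (c - 1) * (c - 2) for c in cols)
    return n, P, Q, P1, Q1


def sum_sq_moments(n, P, Q, P1, Q1):
    """E[X] and E[X^2] for X = sum of squared matching counts (requires n >= 4)."""
    e_x = P * Q / (n * (n - 1)) + n
    e_u2 = (2 * P * Q / (n * (n - 1))
            + 4 * P1 * Q1 / (n * (n - 1) * (n - 2))
            + (P * P - 4 * P1 - 2 * P) * (Q * Q - 4 * Q1 - 2 * Q)
            / (n * (n - 1) * (n - 2) * (n - 3)))
    e_x2 = e_u2 + 2 * P * Q / (n - 1) + n * n      # Eq. 3.10
    return e_x, e_x2


def index_moments(alpha, beta, e_x, e_x2):
    """E[SI] and E[SI^2] for SI = alpha + beta * X (Eqs. 3.5 and 3.11)."""
    return alpha + beta * e_x, alpha ** 2 + 2 * alpha * beta * e_x + beta ** 2 * e_x2


def brute_force_moments(row_labels, col_labels):
    """Average X and X^2 over every permutation of col_labels (small n only)."""
    total, total_sq, count = 0, 0, 0
    for perm in permutations(col_labels):
        x = sum(v * v for v in Counter(zip(row_labels, perm)).values())
        total, total_sq, count = total + x, total_sq + x * x, count + 1
    return total / count, total_sq / count
```

For the Rand index, `alpha = 1 - (P + Q + 2 * n) / (n * (n - 1))` and `beta = 2 / (n * (n - 1))` from Eq. 3.15 can be passed to `index_moments` to obtain $E[R]$ and $E[R^2]$.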

Fig. 2 Exact versus approximate relationship between R and GL indices

3.4 Correction of Sokal and Sneath index

It is shown in Table 3 that SS and Cz are related by the equation $SS = g(Cz) = \frac{Cz}{4 - 3Cz}$. In order to derive the moments of the SS index (not a member of the family L), we use the fact that the function g can be closely approximated by a quadratic function (see Fig. 3) of the form

$$SS^* = \eta_1 + \eta_2 Cz + \eta_3 Cz^2. \tag{3.23}$$

Using the R statistical software, least squares estimates of $\eta_1$, $\eta_2$, and $\eta_3$ were obtained, and hence the approximate relationship between Cz and SS is given by

$$SS^* = [\ldots] + [\ldots]\,Cz + [\ldots]\,Cz^2. \tag{3.24}$$

Point by point evaluation of the exact and approximate relationships between SS and Cz reveals that their maximum absolute difference is as large as […]. Using (3.23), E(SS) can be approximated by

$$E(SS) \approx E(SS^*) = [\ldots] + [\ldots]\,E(Cz) + [\ldots]\,E(Cz^2). \tag{3.25}$$

Note that Cz is a member of the family L, so using Theorem 1, $E(Cz)$ and $E(Cz^2)$ are given by

$$E(Cz) = \alpha + \beta\left(\frac{PQ}{n(n-1)} + n\right), \qquad E[Cz^2] = \alpha^2 + 2\alpha\beta\left(\frac{PQ}{n(n-1)} + n\right) + \beta^2\left(E[U^2] + \frac{2PQ}{n-1} + n^2\right)$$

with $\alpha = \frac{-2n}{P + Q}$, $\beta = \frac{2}{P + Q}$ and $E[U^2]$ as given in Theorem 1.

Fig. 3 Exact versus approximate relationship between Cz and SS indices

Thus the corrected SS can be approximately calculated as

$$CSS = \frac{SS - E(SS^*)}{1 - E(SS^*)} \tag{3.26}$$

where SS and $E(SS^*)$ are given by (2.9) and (3.25), respectively. In the following section we propose another way to find the expectation of the indices, based on a Taylor series expansion of the indices as functions of $\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$, which provides a better approximation for indices such as J, GL, RT, and SS.

4 Expectations based on Taylor series expansion

Consider the indices RT, SS, GL, and J as given by Eqs. 2.8, 2.9, 2.10, and 2.11, respectively. Clearly, each of these indices is non-linear in the quantity $\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$ and therefore can be thought of as a function $Y = g(X)$, where $X = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$, since n, $\sum_{i=1}^{I} m_{i+}^2$, and $\sum_{j=1}^{J} m_{+j}^2$ are constants. Consider the Taylor series expansion of Y around $\mu = E(X)$, which is given by

$$Y = g(X) \approx g(\mu) + \frac{1}{1!}\, g'(\mu)(X - \mu) + \frac{1}{2!}\, g''(\mu)(X - \mu)^2 + \cdots \tag{4.1}$$

Since $E(X - \mu) = 0$ and $E(X - \mu)^2 = \mathrm{Var}(X)$, Eq. 4.1 can be written as

$$E(Y) = E(g(X)) \approx g(\mu) + \frac{1}{2!}\, g''(\mu)\,\mathrm{Var}(X) + \cdots \tag{4.2}$$
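As a worked instance of Eq. 4.2 (an illustration added here, not a derivation given in the paper), take the Jaccard index of Eq. 2.11 as a function of X. The shorthand D for the constant part of the denominator is introduced only for this example.

```latex
% Jaccard index as a function of X = \sum_i \sum_j m_{ij}^2, with the constant
% D = \sum_i m_{i+}^2 + \sum_j m_{+j}^2 - n (shorthand assumed for this example).
\begin{align*}
  g(x)   &= \frac{x - n}{D - x}, \\
  g'(x)  &= \frac{(D - x) + (x - n)}{(D - x)^2} = \frac{D - n}{(D - x)^2}, \\
  g''(x) &= \frac{2\,(D - n)}{(D - x)^3}, \\
  E(J)   &\approx \frac{\mu - n}{D - \mu}
            + \frac{(D - n)\,\mathrm{Var}(X)}{(D - \mu)^3}.
\end{align*}
```

The first term is the first-order approximation $g(\mu)$; the second term is the second-order contribution $\frac{1}{2} g''(\mu)\,\mathrm{Var}(X)$, which is non-negative because $D - n \ge 0$ and $D - \mu > 0$ whenever the index is defined.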

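Two conditional expectation formulas for correcting the Rand (1971) index for chance agreement have been proposed. Hubert and Arabie (1985) proposed an expectation based on the exact generalized hypergeometric distribution of the matching counts in the matrix M, which is given by

$$E\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right) = \frac{1}{n(n-1)}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{i+}^2 m_{+j}^2 + \frac{n^2}{n-1} - \frac{1}{n-1}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right). \tag{4.3}$$

Morey and Agresti (1984) proposed an asymptotic expectation based on the multinomial distribution, given by

$$E\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right) = \sum_{i=1}^{I}\sum_{j=1}^{J}\left(\frac{m_{i+} m_{+j}}{n}\right)^2 = \frac{1}{n^2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{i+}^2 m_{+j}^2. \tag{4.4}$$

Albatineh et al. (2006, p. 308) showed that, as the sample size increases, the difference between the corrected Rand (1971) index using Eqs. 4.3 and 4.4 becomes negligible. For simplicity, the expectation in Eq. 4.4 will be used in the Taylor series expansion to obtain the expectation of the indices J, RT, GL, and SS as explained below. Initial evaluations revealed little contribution of the second term in Eq. 4.2, and therefore only the first term will be used to approximate the expectation of the indices as described below.

1. Correction of Jaccard index: The J index as a function of $X = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$ is given by

$$J = g(X) = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 - \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n}. \tag{4.5}$$

Therefore, using Eq. 4.4, the expected J index is given by

$$E(J) = E(g(X)) \approx g(E(X)) = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J}\left(\frac{m_{i+} m_{+j}}{n}\right)^2 - n}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 - \sum_{i=1}^{I}\sum_{j=1}^{J}\left(\frac{m_{i+} m_{+j}}{n}\right)^2 - n} = \frac{\frac{1}{n^2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{i+}^2 m_{+j}^2 - n}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 - \frac{1}{n^2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{i+}^2 m_{+j}^2 - n} = \frac{\frac{1}{n^2}(P + n)(Q + n) - n}{P + Q + n - \frac{1}{n^2}(P + n)(Q + n)} \tag{4.6}$$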

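where $P = \sum_{i=1}^{I} m_{i+}^2 - n$ and $Q = \sum_{j=1}^{J} m_{+j}^2 - n$. Therefore, the corrected J index is given by

$$CJ = \frac{J - E(J)}{1 - E(J)} \tag{4.7}$$

where J and E(J) are given by Eqs. 2.11 and 4.6, respectively.

2. Correction of Rogers and Tanimoto: The RT index is given by

$$RT = \frac{2\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n(n-1) - \left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 + n(n-1) - 2\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2}. \tag{4.8}$$

Therefore, using Eq. 4.4, the expected RT index is given by

$$E(RT) = \frac{\frac{2}{n^2}(P + n)(Q + n) + n(n-1) - (P + Q + 2n)}{P + n + Q + n + n(n-1) - \frac{2}{n^2}(P + n)(Q + n)} = \frac{\frac{2}{n^2}(P + n)(Q + n) + n(n-1) - (P + Q + 2n)}{P + Q + n(n+1) - \frac{2}{n^2}(P + n)(Q + n)}. \tag{4.9}$$

Thus, the corrected RT is given by

$$CRT = \frac{RT - E(RT)}{1 - E(RT)} \tag{4.10}$$

where RT and E(RT) are given by Eqs. 2.8 and 4.9, respectively.

3. Correction of Gower and Legendre: The GL index is given by

$$GL = \frac{2\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n(n-1) - \left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n(n-1) - \frac{1}{2}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}. \tag{4.11}$$

Therefore, using Eq. 4.4, the expected GL index is given by

$$E(GL) = \frac{\frac{2}{n^2}(P + n)(Q + n) + n(n-1) - (P + Q + 2n)}{\frac{1}{n^2}(P + n)(Q + n) + n(n-1) - \frac{1}{2}(P + Q + 2n)}. \tag{4.12}$$

Thus, the corrected GL index is given by

$$CGL = \frac{GL - E(GL)}{1 - E(GL)} \tag{4.13}$$

where GL and E(GL) are given by Eqs. 2.10 and 4.12, respectively.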

4. Correction of Sokal and Sneath: The SS index is given by

$$SS = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n}{2\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right) - n - 3\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2}. \tag{4.14}$$

Therefore, using Eq. 4.4, the expected SS index is given by

$$E(SS) = \frac{\frac{1}{n^2}(P + n)(Q + n) - n}{2(P + Q + 2n) - n - \frac{3}{n^2}(P + n)(Q + n)}. \tag{4.15}$$

Thus, the corrected SS index is given by

$$CSS = \frac{SS - E(SS)}{1 - E(SS)} \tag{4.16}$$

where SS and E(SS) are given by Eqs. 2.9 and 4.15, respectively. In the following section, results of numerical simulation for homogeneous and structured data are presented using the expectations obtained in Sects. 3 and 4.

5 Simulation results

In this section we investigate the performance of the correction using an expectation based on approximations of the relationships between indices and an approximate expectation based on Taylor series expansion. Data sets with and without clustering structure will be generated, and the values of the indices before and after correction will be compared.

5.1 Homogeneous data

In this case 500 observations are generated from a bivariate normal distribution with parameters μ = (10, […]) and Σ = 10 ([…]). The data (see Fig. 4) are clustered by the average linkage method (arbitrarily chosen) and we look at the obtained partition with 2, 3, 4, ..., 10 clusters (method A). In addition, the same data are randomly split into 2, 3, 4, ..., 10 clusters of equal size (method B). The similarity between the two resulting clusterings is calculated (for the same number of clusters) using the J, RT, GL, and SS indices along with their versions corrected for chance agreement as discussed in Sects. 3 and 4. This data generation and clustering process is repeated 1,000 times and the averages of the indices are calculated. Since the data have no clustering structure, we expect the values of the corrected indices to be very close to zero. Table 4 presents the results for the homogeneous data

Fig. 4 Random sample of size 500 generated from bivariate normal distribution (axes: x, y)

simulations using the average linkage method. For example, if we consider the column with three clusters in Table 4, the values of the original indices J, RT, GL, and SS are […], 0.314, 0.543, and […], while they are only […], […], […], and […] after correction for chance agreement (when using the Taylor series method). The values of the corrected indices are 0.004, […], […], and […], respectively, when using the proposed approximations from Sect. 3. This indicates that the proposed methods are very effective in correcting the J, RT, GL, and SS indices for chance agreement insofar as their values are close to zero when no cluster structure exists.

5.2 Clustered data

For this example, five clusters with 100 observations each were generated from five bivariate normal distributions with parameters $\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$, $\mu_5$, and Σ […]. A random sample obtained from these distributions is shown in Fig. 5. The average linkage method was used to cluster the 500 points by requesting 2, 3, 4, ..., 10 clusters. The similarity between the original five clusters (data sets) and the k-class partition (k = 2, 3, ..., 10) resulting from the average linkage method was calculated. This process was repeated 1,000 times and the averages of the indices were calculated.
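The behaviour expected for data without structure can be illustrated with a much simpler experiment than the one described in this section. The sketch below is a deliberate simplification and not the authors' simulation: both partitions are random equal-size labelings (no bivariate normal data and no average linkage), and the sample size, number of clusters, number of repetitions, and seed are arbitrary choices made here. It compares the raw Jaccard index with its Taylor-corrected version from Eqs. 4.4, 4.6, and 4.7; the function names are hypothetical.

```python
import random
from collections import Counter


def jaccard_raw_and_corrected(row_labels, col_labels):
    """Return (J, CJ) with E(J) from the first-order Taylor formula (Eq. 4.6)."""
    n = len(row_labels)
    x = sum(v * v for v in Counter(zip(row_labels, col_labels)).values())
    P = sum(v * v for v in Counter(row_labels).values()) - n
    Q = sum(v * v for v in Counter(col_labels).values()) - n
    j = (x - n) / (P + Q + n - x)               # Eq. 2.11 written with P and Q
    mu = (P + n) * (Q + n) / n ** 2             # Eq. 4.4 (Morey and Agresti)
    e_j = (mu - n) / (P + Q + n - mu)           # Eq. 4.6
    return j, (j - e_j) / (1 - e_j)             # Eq. 4.7


def average_over_random_partitions(n=150, k=3, reps=200, seed=0):
    """Average J and CJ over pairs of unrelated equal-size partitions."""
    rng = random.Random(seed)
    labels_a = [i % k for i in range(n)]        # fixed equal-size partition
    labels_b = labels_a[:]
    total_j = total_cj = 0.0
    for _ in range(reps):
        rng.shuffle(labels_b)                   # random equal-size partition
        j, cj = jaccard_raw_and_corrected(labels_a, labels_b)
        total_j += j
        total_cj += cj
    return total_j / reps, total_cj / reps


mean_j, mean_cj = average_over_random_partitions()
```

With unrelated partitions the raw index stays well above zero while the corrected index averages near zero, which is the qualitative pattern described for homogeneous data.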

Table 4 Values of J, RT, GL, and SS indices before and after correction, obtained for data of 500 observations generated from a bivariate normal distribution using the average linkage method (rows: J, CJ Taylor, CJ appr, RT, CRT Taylor, CRT appr, GL, CGL Taylor, CGL appr, SS, CSS Taylor, CSS appr; columns: number of clusters; entries […])

Fig. 5 Data from five bivariate normal distributions each with sample size 100 (axes: x, y)

It is expected that the indices will attain their maximum values at the correct number of clusters, which is five, and attain smaller values as the number of clusters gets further away from the correct number; see Milligan and Cooper (1986, p. 455) for more details on using similarity indices as tools for measuring clustering structure recovery. Table 5 presents the results obtained by the average linkage method, with values of the indices J, RT, GL, and SS along with their proposed corrected versions. The values of the indices at the correct number of clusters are close to each other, whereas

Table 5 Values of indices J, RT, GL, SS and their corrected versions using the average linkage method, with data generated from five bivariate normal distributions each with sample size 100 (rows: J, CJ Taylor, CJ appr, RT, CRT Taylor, CRT appr, GL, CGL Taylor, CGL appr, SS, CSS Taylor, CSS appr; columns: number of clusters; entries […]). The bold values represent values of the similarity indices at the correct number of clusters

the values of the corrected indices at 2 and 10 clusters are smaller than the uncorrected indices (as expected, since we are far from the target of five clusters). It must be noted that the values of the corrected indices drop faster once we have passed the correct number of clusters; see for example GL and CGL at 5 and 6 clusters.

In summary, for the homogeneous data set, the corrected indices attained values closer to zero (as desired) compared to the uncorrected indices. For clustered data, the corrected indices showed less similarity for numbers of clusters far from the target (five clusters in this case), while attaining their maximum value at the correct number of clusters. This shows the effectiveness of the proposed approximations in correcting the indices J, RT, GL, and SS for chance agreement, in the sense that the corrected indices attain values close to zero for homogeneous data and close to the original index for structured data. For more on using the corrected similarity indices to find the optimal number of clusters in a data set see Albatineh and Niewiadomska-Bugaj (2011).

6 Conclusion

In this paper a proposal for correcting the similarity indices J, RT, SS, and GL, which are not members of the family L, has been presented. Similar indices can be handled the same way. The indices J, RT, SS, and GL are either functions of each other or functions of the indices R and Cz. In order to correct the indices J, RT, SS, and GL for chance agreement, two ideas were discussed. The first idea is to approximate the relationship between the indices in order to find the expectation of the index and hence correct it for chance agreement. The second idea is to find the expectation based on a Taylor series expansion of the indices around $\mu = E(X)$, where $X = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$. Simulation results revealed that such a correction greatly

improves the performance (recovery of clustering structure) of the indices, in the sense that they produce similarity values closer to zero between two clusterings when the data have no clustering structure (homogeneous data). In the structured-data simulations with five clusters, the corrected indices showed the desirable small similarity when the number of clusters was far from the target and were very close to the original indices at the target. However, not all indices can be expressed in terms of other indices that are linear in the matching counts. In such cases, we can find expectations of the indices using the Taylor series idea, which is more general.

References

Albatineh AN, Niewiadomska-Bugaj M, Mihalko DP (2006) On similarity indices and correction for chance agreement. J Classif 23
Albatineh AN, Niewiadomska-Bugaj M (2011) MCS: a method for finding the number of clusters. J Classif 28
Albatineh AN (2010) Means and variances for a family of similarity indices used in cluster analysis. J Stat Plan Inference 140
Czekanowski J (1932) Coefficient of racial likeness und durchschnittliche Differenz. Anthropologischer Anzeiger 14:227–249
Dice LR (1945) Measures of the amount of ecological association between species. Ecology 26:297–302
Fligner MA, Verducci JS, Blower PE (2002) A modification of the Jaccard–Tanimoto similarity index for diverse selection of chemical compounds using binary strings. Technometrics 44
Fowlkes EB, Mallows CL (1983) A method for comparing two hierarchical clusterings. J Am Stat Assoc 78
Gower JC, Legendre P (1986) Metric and Euclidean properties of dissimilarity coefficients. J Classif 3:5–48
Hamann U (1961) Merkmalsbestand und Verwandtschaftsbeziehungen der Farinosae. Willdenowia 2
Hubálek Z (1982) Coefficients of association and similarity based on binary (presence–absence) data: an evaluation. Biol Rev 57
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2
Jaccard P (1908) Nouvelles recherches sur la distribution florale. Bull Soc Vaudoise Sci Nat 44:223–270
Jaccard P (1912) The distribution of the flora of the alpine zone. New Phytol 11:37–50
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, New Jersey
Janson S, Vegelius J (1981) Measures of ecological association. Oecologia 49
Johnson SC (1967) Hierarchical clustering schemes. Psychometrika 32:241–254
Kulczynski S (1927) Die Pflanzenassoziationen der Pieninen. Bulletin International de L'Académie Polonaise des Sciences et des Lettres, Classe des Sciences Mathématiques et Naturelles, Series B, Supplément II 2:57–203
Lamont BB, Grant KJ (1979) A comparison of twenty-one measures of site dissimilarity. In: Orlóci L, Rao CR, Stiteler WM (eds) Multivariate methods in ecological work. International Cooperation Publishing House, Fairland
Lancaster HO (1969) The Chi-squared distribution. John Wiley, New York
Lehmann EL (1959) Testing statistical hypotheses. Wiley, New York
Legendre P, Legendre L (1998) Numerical ecology. Elsevier, Amsterdam
Mcconnaughey BH (1964) The determination and analysis of plankton communities. Marine Research, Special No, Indonesia, pp 1–40
Milligan G, Cooper M (1986) A study of the comparability of external criteria for hierarchical cluster analysis. Multivar Behav Res 21
Milligan G, Soon S, Sokol L (1983) The effect of cluster size, dimensionality, and the number of clusters on recovery of true cluster structure. IEEE Trans Patt Anal Mach Intell PAMI-5:40–47
Morey L, Agresti A (1984) The measurement of classification agreement: an adjustment to the Rand statistic for chance agreement. Educ Psychol Meas 44:33–37
Rand W (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66
Rogers DJ, Tanimoto TT (1960) A computer program for classifying plants. Science 132
Russell PF, Rao TR (1940) On habitat and association of species of anopheline larvae in South-Eastern Madras. J Malar Inst India 3
Saxena PC, Navaneerham K (1991) The effect of cluster size, dimensionality, and number of clusters on recovery of true cluster structure through Chernoff-type faces. Statistician 40
Saxena PC, Navaneerham K (1993) Comparison of Chernoff-type face and non-graphical methods for clustering multivariate observations. Comput Stat Data Anal 15:63–79
Snijders TAB, Dormaar M, Van Schuur WH, Dijkman-Caes C, Driessen G (1990) Distribution of some similarity coefficients for dyadic binary data in the case of associated attributes. J Classif 7:5–31
Sokal RR, Michener CD (1958) A statistical method for evaluating systematic relationships. Univ Kansas Sci Bull 38
Sokal RR, Sneath PHA (1963) Principles of numerical taxonomy. WH Freeman, San Francisco
Sørensen T (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species content. Biologiske Skrifter 5:1–34
Southwood TS (1978) Ecological methods. Chapman and Hall, London
Steinley D (2004) Properties of the Hubert–Arabie adjusted Rand index. Psychol Methods 9
Van Der Maarel E (1969) On the use of ordination models in phytosociology. Vegetatio 19:21–46
Wallace DL (1983) A method for comparing two hierarchical clusterings: comment. J Am Stat Assoc 78
