Correcting Jaccard and other similarity indices for chance agreement in cluster analysis


Adv Data Anal Classif
REGULAR ARTICLE

Ahmed N. Albatineh · Magdalena Niewiadomska-Bugaj

Received: 1 June 2010 / Revised: 15 March 2011 / Accepted: 18 May 2011
© Springer-Verlag 2011

A. N. Albatineh (✉), Department of Epidemiology and Biostatistics, Florida International University, Miami, FL, USA, e-mail: aalbatin@fiu.edu
M. Niewiadomska-Bugaj, Department of Statistics, Western Michigan University, Kalamazoo, MI, USA, e-mail: m.bugaj@wmich.edu

Abstract Correcting a similarity index for chance agreement requires computing its expectation under fixed marginal totals of a matching counts matrix. For some indices, such as Jaccard, Rogers and Tanimoto, Sokal and Sneath, and Gower and Legendre, the expectations cannot be easily found. We show how such similarity indices can be expressed as functions of other indices, and how their expectations can be found by approximation, so that an approximate correction is possible. A second approach is based on Taylor series expansion. A simulation study illustrates the effectiveness of the resulting correction of similarity indices using structured and unstructured data generated from bivariate normal distributions.

Keywords Similarity indices · Matching counts matrix · Correction for chance agreement · Jaccard index · Cluster analysis · Comparing partitions

Mathematics Subject Classification (2000) 62H30

1 Introduction

Measuring similarity between two different partitions (clusterings) of the same set of objects is an important issue in cluster analysis. Many similarity measures have been

proposed in the literature and are extensively used in cluster analysis applications, including validation studies and recovery of clustering structure; see Milligan et al. (1983), Saxena and Navaneerham (1991, 1993), Milligan and Cooper (1986), and Steinley (2004) for discussion. The problem with these indices is that they do not account for agreement due to chance. Morey and Agresti (1984) proposed a correction of the Rand index R (Rand 1971) for chance agreement based on an asymptotic multinomial distribution, while Hubert and Arabie (1985) used the exact generalized hypergeometric distribution for the same purpose. Albatineh et al. (2006, p. 308) showed that the difference between the Morey and Agresti (1984) and Hubert and Arabie (1985) expectations (asymptotic and exact) is negligible when the number of objects to be clustered is not too small.

Fligner et al. (2002) proposed a modification of the Jaccard-Tanimoto index for diverse selection of chemical compounds using binary strings. These authors emphasized that the Jaccard-Tanimoto index has been widely used in computational chemistry and has become the standard for measuring the structural similarity of compounds. Historically, the Jaccard-Tanimoto coefficient appeared much earlier, in Jaccard (1908), in an ecological context, where it measured the degree of relatedness between two biological communities with respect to their species composition.

Albatineh et al. (2006, p. 307) proposed a correction of the indices of Fowlkes and Mallows (1983), Hamann (1961), Russell and Rao (1940), Czekanowski (Cz) (1932), and Wallace (1983) for chance agreement. Their simulations showed that correction improves the performance of the indices, in the sense that the indices take values close to zero when no clustering structure is present, while they take values close to the original index value when a clustering structure exists. Albatineh et al. (2006, p. 308) introduced a family L of similarity indices that are linear functions of the sum of the squares of the matching counts. Some indices that are not members of the L family are of great importance and wide applicability in botany, ecology, and zoology, such as the index J of Jaccard (1908) and the indices of Sokal and Sneath (1963), Gower and Legendre (1986), and Rogers and Tanimoto (1960), to name a few.

In this paper, our goal is to find a general method to correct similarity indices such as J, RT, SS, and GL for agreement due to chance. Two approaches will be introduced. First, as the indices J, RT, SS, and GL are functions of two members of the L family, namely Czekanowski (Cz) (1932) and Rand (1971), this relationship can be approximated and the expectation in Eq. 2.5 can be approximately computed. Second, the Taylor series expansion of those indices is discussed, an approximation to their expectations is obtained, and thus a correction for chance can be computed.

The paper is organized as follows: Sect. 2 presents an overview of similarity indices; Sect. 3 presents some results relating the indices to each other, with a proposed method for approximating the relationships and hence the correction; Sect. 4 presents the Taylor series idea for finding the expectation of the indices; Sect. 5 presents simulations showing the effect of the proposed methods, with conclusions in Sect. 6.

Table 1 Binary counts (numbers of pairs) for two clustering (partitioning) methods

                                        Partition B
Partition A              In the same clusters   In different clusters   Total
In the same clusters     a                      b                       a + b
In different clusters    c                      d                       c + d
Total                    a + c                  b + d                   N

2 Overview of similarity indices

A standard approach to comparing two partitions of the same data set is to calculate the similarity between the two partitions of the underlying set of objects using similarity indices. Since the clusters are not predefined, the similarity of the results of different clustering procedures (algorithms) is usually based on the number of pairs of objects that are (or are not) placed together in the same cluster by each algorithm. Consequently, a similarity table as in Table 1 is formed, where a, b, c, d, and N are defined as:

a: Number of pairs of objects which are joined in the same cluster by both clustering methods.
b: Number of pairs of objects which are joined together by method A and not joined together by method B.
c: Number of pairs of objects which are not joined together by method A but are joined together by method B.
d: Number of pairs of objects which are not joined together by either of the two methods.

The total number of pairs is $N = a + b + c + d = \binom{n}{2} = n(n-1)/2$, where n is the number of observations to be clustered.

Let $U = \{u_1, u_2, \ldots, u_I\}$ and $V = \{v_1, v_2, \ldots, v_J\}$ be two partitions of the same data set resulting from the two clustering methods A and B and producing I and J clusters ($i = 1, 2, \ldots, I$ and $j = 1, 2, \ldots, J$), respectively. The entries of Table 1 can also be defined in terms of counts in the matching matrix M between the two partitions U and V, where $M = (m_{ij})$ and the entry $m_{ij} = |u_i \cap v_j|$ is the number of objects common to cluster $u_i$ of method A and cluster $v_j$ of method B (Jain and Dubes 1988, p. 173):

\[ a = \sum_{i=1}^{I}\sum_{j=1}^{J}\binom{m_{ij}}{2} = \frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - \frac{n}{2}. \tag{2.1} \]

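\[ b = \sum_{j=1}^{J}\binom{m_{+j}}{2} - a = \frac{1}{2}\sum_{j=1}^{J} m_{+j}^2 - \frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2. \tag{2.2} \]

\[ c = \sum_{i=1}^{I}\binom{m_{i+}}{2} - a = \frac{1}{2}\sum_{i=1}^{I} m_{i+}^2 - \frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2. \tag{2.3} \]

\[ d = \binom{n}{2} - a - b - c = \frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + \frac{1}{2}n^2 - \frac{1}{2}\sum_{i=1}^{I} m_{i+}^2 - \frac{1}{2}\sum_{j=1}^{J} m_{+j}^2, \tag{2.4} \]

where $m_{i+} = \sum_{j=1}^{J} m_{ij}$ and $m_{+j} = \sum_{i=1}^{I} m_{ij}$ are the ith row and jth column totals of the matching counts matrix M, respectively.

Any similarity index (SI), when corrected for chance agreement (CSI), takes the form

\[ CSI = \frac{SI - E(SI)}{1 - E(SI)}, \tag{2.5} \]

where E(SI) is the expected value of the index under fixed marginal totals of the matching counts matrix M, and unity is the theoretical maximum value of the index; see Morey and Agresti (1984, p. 35).

Any SI that takes the form $SI = \alpha + \beta\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$, where $\alpha$ and $\beta$ are unique for each index, is said to be a member of the family L (Albatineh et al. 2006, p. 308). Albatineh (2010) derived means and variances for any member of the family L under fixed marginal totals of the matching counts matrix and independence of the clustering algorithms. For example, the indices of Rand (1971) and Czekanowski (Cz) (1932) are among the many members of the family L, and can be written as

\[ R = \frac{a + d}{a + b + c + d} = \underbrace{1 - \frac{1}{n(n-1)}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}_{\alpha} + \underbrace{\frac{2}{n(n-1)}}_{\beta}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2. \tag{2.6} \]

\[ Cz = \frac{2a}{2a + b + c} = \underbrace{\frac{-n}{\frac{1}{2}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right) - n}}_{\alpha} + \underbrace{\frac{1}{\frac{1}{2}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right) - n}}_{\beta}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2. \tag{2.7} \]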
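The pair counts in Eqs. 2.1 to 2.4 and the linear forms in Eqs. 2.6 and 2.7 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`sums_of_squares`, `pair_counts`, `rand_index`, `czekanowski`) and the toy label vectors are assumptions made here, not part of the paper. As in Eqs. 2.2 and 2.3 as printed, b is computed from the column totals and c from the row totals; every index discussed here depends on b and c only through b + c.

```python
from collections import Counter


def sums_of_squares(labels_a, labels_b):
    """Return n and the sums of squares of m_ij, m_i+ and m_+j."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))  # matching counts m_ij
    rows = Counter(labels_a)                  # row totals m_i+
    cols = Counter(labels_b)                  # column totals m_+j

    def sq(counts):
        return sum(v * v for v in counts.values())

    return n, sq(cells), sq(rows), sq(cols)


def pair_counts(labels_a, labels_b):
    """Pair counts (a, b, c, d) following Eqs. 2.1 to 2.4."""
    n, s, s_rows, s_cols = sums_of_squares(labels_a, labels_b)
    a = (s - n) // 2                  # Eq. 2.1
    b = (s_cols - s) // 2             # Eq. 2.2 (column totals)
    c = (s_rows - s) // 2             # Eq. 2.3 (row totals)
    d = n * (n - 1) // 2 - a - b - c  # Eq. 2.4
    return a, b, c, d


def rand_index(labels_a, labels_b):
    """Rand index written as alpha + beta * sum(m_ij^2), Eq. 2.6."""
    n, s, s_rows, s_cols = sums_of_squares(labels_a, labels_b)
    alpha = 1 - (s_rows + s_cols) / (n * (n - 1))
    beta = 2 / (n * (n - 1))
    return alpha + beta * s


def czekanowski(labels_a, labels_b):
    """Czekanowski index written as alpha + beta * sum(m_ij^2), Eq. 2.7."""
    n, s, s_rows, s_cols = sums_of_squares(labels_a, labels_b)
    denom = 0.5 * (s_rows + s_cols) - n
    return -n / denom + s / denom


# Toy example (hypothetical labels for eight objects).
labels_a = [0, 0, 0, 1, 1, 2, 2, 2]
labels_b = [0, 0, 1, 1, 1, 2, 2, 0]
print(pair_counts(labels_a, labels_b))  # (3, 4, 4, 17)
print(rand_index(labels_a, labels_b), czekanowski(labels_a, labels_b))
```

For the toy labels the linear forms agree with the ratio definitions, (a + d)/N = 20/28 and 2a/(2a + b + c) = 6/14.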

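Table 2 Selected similarity indices

No.  Index                                                   Symbol  Formula
1    Rogers and Tanimoto (1960)                              RT      (a + d) / (a + 2(b + c) + d)
2    Gower and Legendre (1986)                               GL      (a + d) / (a + (1/2)(b + c) + d)
3    Jaccard (1912)                                          J       a / (a + b + c)
4    Sokal and Sneath (1963)                                 SS      a / (a + 2(b + c))
5    Sokal and Michener (1958); Rand (1971)                  R       (a + d) / (a + b + c + d)
6    Czekanowski (Cz) (1932); Dice (1945); Sørensen (1948)   CZ      2a / (2a + b + c)
7    Hamann (1961)                                           H       ((a + d) - (b + c)) / (a + b + c + d)
8    Mcconnaughey (1964)                                     Mc      (a^2 - bc) / ((a + b)(a + c))
9    Johnson (1967)                                          Jo      a/(a + b) + a/(a + c)
10   Kulczynski (1927)                                       K       (1/2)(a/(a + b) + a/(a + c))
11   Legendre and Legendre (1998)                            LL      3a / (3a + b + c)
12   Lamont and Grant (1979)                                 LG      a / (2a + b + c)
13   Maarel (1969)                                           M       (2a - (b + c)) / (2a + b + c)
14   Sokal and Sneath (1963)                                 SS3     2(a + d) / (2(a + d) + (b + c))
15   Sokal and Sneath (1963)                                 SS4     (a + d) / (b + c)
16   Southwood (1978)                                        S       a / (b + c)

Table 2 presents a partial list of similarity indices. The indices RT, SS, GL, and J can be written in terms of $m_{ij}$ as

\[ RT = \frac{a + d}{a + 2(b + c) + d} = \frac{2\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n(n-1) - \left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 + n(n-1) - 2\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2}. \tag{2.8} \]

\[ SS = \frac{a}{a + 2(b + c)} = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n}{2\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right) - n - 3\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2}. \tag{2.9} \]

\[ GL = \frac{a + d}{a + \frac{1}{2}(b + c) + d} = \frac{2\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n(n-1) - \left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n(n-1) - \frac{1}{2}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}. \tag{2.10} \]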

\[ J = \frac{a}{a + b + c} = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 - \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n}. \tag{2.11} \]

These indices are not linear in $\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$ and hence are not members of the family L. Their conditional expectations under fixed marginal totals of the matching matrix M cannot be found explicitly. Therefore, the first idea is to express them as functions of other indices whose expectations can be computed explicitly, and to use those indices in an approximating formula for the corresponding function. In Sect. 3, the indices RT, SS, GL, and J are shown to be functions of R and Cz, which are members of the L family and whose expectations are known.

3 Indices that are not in family L

In this paper, we focus on the indices RT, SS, GL, and J, which are not members of the family L. Other non-members of the family L can be handled in a similar way. Relationships between some indices are presented in Table 3 and have been studied in Hubálek (1982) and Janson and Vegelius (1981) in the context of their suitability for measuring coexistence between two species over different localities in ecology. Similarly, Snijders et al. (1990) established relationships between R, J, and CZ and derived some distributional results.

Table 3 Relationships between some similarity indices

No.  Comparison   Relationship
1    (R, RT)      RT = R/(2 - R)
2    (R, GL)      R = GL/(2 - GL)
3    (J, CZ)      J = CZ/(2 - CZ)
4    (SS, J)      SS = J/(2 - J)
5    (H, R)       H = 2R - 1
6    (Mc, K)      Mc = 2K - 1
7    (CZ, M)      M = 2CZ - 1
8    (SS, CZ)     SS = CZ/(4 - 3CZ)
9    (RT, GL)     RT = GL/(4 - 3GL)
10   (Mc, Jo)     Mc = Jo - 1
11   (LL, LG)     LG = LL/(3 - LL)
12   (LG, J)      LG = J/(1 + J)
13   (H, RT)      H = (3RT - 1)/(RT + 1)
14   (LL, J)      LL = 3J/(1 + 2J)
15   (LL, SS)     LL = 6SS/(5SS + 1)
16   (LG, M)      M = 4LG - 1
17   (J, M)       M = (3J - 1)/(J + 1)
18   (LG, SS)     SS = LG/(2 - 3LG)
19   (SS3, SS4)   SS4 = SS3/(2(1 - SS3))
20   (SS3, R)     R = SS3/(2 - SS3)
21   (SS4, R)     SS4 = R/(1 - R)
22   (SS, S)      SS = S/(S + 2)
23   (S, J)       J = S/(S + 1)
24   (LL, M)      LL = (3 + 3M)/(5 + M)
25   (GL, H)      GL = (2H + 2)/(H + 3)
26   (SS, M)      SS = (M + 1)/(5 - 3M)

Relationships 1 to 5 in Table 3 were established in Janson and Vegelius (1981) and Snijders et al. (1990) and are presented here for completeness. Relationships 6 to 26 are newly formulated. Such relationships form the basis for approximating expectations of indices that are not members of family L, as discussed in the next section.
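The relationships used in the rest of the paper (J and SS as functions of Cz, RT and GL as functions of R) can be checked numerically from the pair-count formulas of Table 2. The sketch below is illustrative only; the function names and the example counts are assumptions introduced here. The form GL = 2R/(1 + R) is the inverse of relationship 2 in Table 3.

```python
def indices_from_counts(a, b, c, d):
    """Selected indices of Table 2 computed from the pair counts."""
    s = b + c
    return {
        "R": (a + d) / (a + s + d),
        "CZ": 2 * a / (2 * a + s),
        "J": a / (a + s),
        "SS": a / (a + 2 * s),
        "RT": (a + d) / (a + 2 * s + d),
        "GL": (a + d) / (a + 0.5 * s + d),
    }


def indices_from_r_and_cz(r, cz):
    """The same four non-linear indices obtained from R and Cz alone."""
    return {
        "J": cz / (2 - cz),       # relationship 3
        "SS": cz / (4 - 3 * cz),  # relationship 8
        "RT": r / (2 - r),        # relationship 1
        "GL": 2 * r / (1 + r),    # inverse of relationship 2
    }


# Hypothetical pair counts for illustration.
direct = indices_from_counts(3, 4, 4, 17)
derived = indices_from_r_and_cz(direct["R"], direct["CZ"])
for name, value in derived.items():
    print(name, round(direct[name], 6), round(value, 6))
```

Both columns of the printed output coincide, which is the property the approximations of Sect. 3 rely on: once E(R), E(R^2), E(Cz), and E(Cz^2) are available, the non-linear indices are reachable through these one-variable functions.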

3.1 Correction of Jaccard index

As shown in Table 3, the indices J and Cz are related by the equation $J = h(Cz) = \frac{Cz}{2 - Cz}$, where Cz and J can be written in terms of $m_{ij}$ as given by (2.7) and (2.11), respectively. Note that the Cz index is a member of the family L, and hence the expectation of Cz under fixed marginal totals of the matrix M can be found explicitly; see Eq. 2.7. In order to find the mean of the J index (not a member of family L), we use the fact that the function h can be closely approximated by a quadratic function (see Fig. 1) of the form

\[ J^* = \xi_1 Cz^2 + \xi_2 Cz + \xi_3. \tag{3.1} \]

[Fig. 1 Exact versus approximate relationship between J and Cz indices. Axes: Cz, J; curves: Exact, Approx]

Least squares estimates $\hat{\xi}_1$, $\hat{\xi}_2$, and $\hat{\xi}_3$ of $\xi_1$, $\xi_2$, and $\xi_3$ were found using the R statistical software, and thus the relationship between J and Cz can be approximated by

\[ J^* = \hat{\xi}_1 Cz^2 + \hat{\xi}_2 Cz + \hat{\xi}_3. \tag{3.2} \]

Point-by-point evaluation of the exact relationship between J and Cz and of the quadratic approximation of J by Cz revealed that the approximation is very good, with maximum absolute difference … which occurred at points close to 0 or 1. Therefore, the expectation of the approximate J index given by (3.1) can be calculated as

\[ E[J^*] = \hat{\xi}_1 E(Cz^2) + \hat{\xi}_2 E(Cz) + \hat{\xi}_3. \tag{3.3} \]
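A least squares fit of this kind can be sketched as follows. The paper used the R statistical software; the sketch below uses plain Python instead, and the evaluation grid (101 equally spaced points on [0, 1]) is an assumption made here, so the resulting coefficients need not match the paper's estimates. The helper names are hypothetical.

```python
def solve_linear(mat, rhs):
    """Solve a small dense linear system by Gaussian elimination."""
    n = len(rhs)
    aug = [row[:] + [v] for row, v in zip(mat, rhs)]
    for col in range(n):
        # Partial pivoting: bring the largest entry of the column up.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    sol = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * sol[k] for k in range(i + 1, n))
        sol[i] = (aug[i][n] - tail) / aug[i][i]
    return sol


def fit_quadratic(xs, ys):
    """Least squares coefficients (xi1, xi2, xi3) of xi1*x^2 + xi2*x + xi3."""
    basis = [[x * x, x, 1.0] for x in xs]
    # Normal equations (B^T B) xi = B^T y.
    btb = [[sum(r[i] * r[j] for r in basis) for j in range(3)] for i in range(3)]
    bty = [sum(r[i] * y for r, y in zip(basis, ys)) for i in range(3)]
    return solve_linear(btb, bty)


def h(cz):
    """Exact relationship J = Cz / (2 - Cz) from Table 3."""
    return cz / (2 - cz)


grid = [k / 100 for k in range(101)]  # assumed evaluation grid
xi1, xi2, xi3 = fit_quadratic(grid, [h(x) for x in grid])
errors = [abs(h(x) - (xi1 * x * x + xi2 * x + xi3)) for x in grid]
print(xi1, xi2, xi3, max(errors))
```

The same two functions can be reused for the other three relationships by replacing `h` with R/(2 - R), 2R/(1 + R), or Cz/(4 - 3Cz).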

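To calculate $E[J^*]$, we need $E(Cz)$ and $E(Cz^2)$, which are established in Theorem 1 not only for Cz, but for the whole family L.

Theorem 1 Let SI be any similarity index of the form $SI = \alpha + \beta\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$. Under fixed marginal totals $m_{i+}$ and $m_{+j}$ of the matching counts matrix $M = (m_{ij})$ and independence of the two clusterings, we have

\[ E[SI] = \alpha + \beta\left(\frac{PQ}{n(n-1)} + n\right), \]

\[ E[SI^2] = \alpha^2 + 2\alpha\beta\left(\frac{PQ}{n(n-1)} + n\right) + \beta^2\left(E[U^2] + \frac{2PQ}{n-1} + n^2\right), \]

where

\[ E[U^2] = \frac{2PQ}{n(n-1)} + \frac{4P'Q'}{n(n-1)(n-2)} + \frac{(P^2 - 4P' - 2P)(Q^2 - 4Q' - 2Q)}{n(n-1)(n-2)(n-3)}, \]

\[ U = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}(m_{ij} - 1) = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{(2)}, \quad P = \sum_{i=1}^{I} m_{i+}^2 - n, \quad Q = \sum_{j=1}^{J} m_{+j}^2 - n, \]

\[ P' = \sum_{i=1}^{I} m_{i+}(m_{i+} - 1)(m_{i+} - 2), \quad \text{and} \quad Q' = \sum_{j=1}^{J} m_{+j}(m_{+j} - 1)(m_{+j} - 2). \]

Proof For fixed marginal totals $m_{i+}$ and $m_{+j}$ of the matching counts matrix $M = (m_{ij})$ and independence of the two procedures, the elements of $M = (m_{ij})$ have a generalized hypergeometric distribution; see Lancaster (1969, p. 14) and Fowlkes and Mallows (1983). Define $m_{ij}^{(p)} = m_{ij}(m_{ij} - 1)\cdots(m_{ij} - p + 1)$; then the pth factorial moment of $m_{ij}$ is

\[ E\left(m_{ij}^{(p)}\right) = \frac{m_{i+}^{(p)}\, m_{+j}^{(p)}}{n^{(p)}}. \tag{3.4} \]

This implies that $E\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right) = \frac{PQ}{n(n-1)} + n$ (see Hubert and Arabie 1985, p. 200). Therefore,

\[ E[SI] = E\left[\alpha + \beta\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right] = \alpha + \beta E\left[\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right] = \alpha + \beta\left(\frac{PQ}{n(n-1)} + n\right). \tag{3.5} \]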

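Since

\[ SI^2 = \left(\alpha + \beta\sum_{i,j} m_{ij}^2\right)^2 = \alpha^2 + 2\alpha\beta\sum_{i,j} m_{ij}^2 + \beta^2\left(\sum_{i,j} m_{ij}^2\right)^2, \]

we obtain

\[ E[SI^2] = \alpha^2 + 2\alpha\beta E\left[\sum_{i,j} m_{ij}^2\right] + \beta^2 E\left[\left(\sum_{i,j} m_{ij}^2\right)^2\right] = \alpha^2 + 2\alpha\beta\left(\frac{PQ}{n(n-1)} + n\right) + \beta^2 E\left[\left(\sum_{i,j} m_{ij}^2\right)^2\right]. \tag{3.6} \]

The expectation on the right-hand side of (3.6) can be evaluated as follows. Consider

\[ \sum_{i,j} m_{ij}^{(2)} = \sum_{i,j} m_{ij}(m_{ij} - 1) = \sum_{i,j} m_{ij}^2 - n. \]

Therefore,

\[ \left(\sum_{i,j} m_{ij}^2\right)^2 = \left(\sum_{i,j} m_{ij}(m_{ij} - 1) + n\right)^2 = \left(\sum_{i,j} m_{ij}(m_{ij} - 1)\right)^2 + 2n\sum_{i,j} m_{ij}^2 - n^2. \]

Hence,

\[ E\left[\left(\sum_{i,j} m_{ij}^2\right)^2\right] = E\left[\left(\sum_{i,j} m_{ij}(m_{ij} - 1)\right)^2\right] + 2nE\left[\sum_{i,j} m_{ij}^2\right] - n^2. \tag{3.7} \]

In order to find the first expectation on the right-hand side of (3.7), i.e. $E[U^2]$, we use the fact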

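\[ \left(m_{ij}^{(2)}\right)^2 = \left(m_{ij}(m_{ij} - 1)\right)^2 = m_{ij}^2(m_{ij} - 1)^2 = m_{ij}^4 - 2m_{ij}^3 + m_{ij}^2 = m_{ij}^{(4)} + 4m_{ij}^{(3)} + 2m_{ij}^{(2)}, \]

and therefore,

\[ U^2 = \left(\sum_{i,j} m_{ij}^{(2)}\right)^2 = \sum_{i,j}\underbrace{\left(m_{ij}^{(2)}\right)^2}_{m_{ij}^{(4)} + 4m_{ij}^{(3)} + 2m_{ij}^{(2)}} + \sum_{i}\sum_{j}\sum_{j' \neq j} m_{ij}^{(2)} m_{ij'}^{(2)} + \sum_{i}\sum_{i' \neq i}\sum_{j} m_{ij}^{(2)} m_{i'j}^{(2)} + \sum_{i}\sum_{i' \neq i}\sum_{j}\sum_{j' \neq j} m_{ij}^{(2)} m_{i'j'}^{(2)}. \tag{3.8} \]

Hence,

\[ E(U^2) = \sum_{i,j}\frac{m_{i+}^{(4)} m_{+j}^{(4)}}{n^{(4)}} + 4\sum_{i,j}\frac{m_{i+}^{(3)} m_{+j}^{(3)}}{n^{(3)}} + 2\sum_{i,j}\frac{m_{i+}^{(2)} m_{+j}^{(2)}}{n^{(2)}} + \sum_{i}\sum_{j}\sum_{j' \neq j}\frac{m_{i+}^{(4)} m_{+j}^{(2)} m_{+j'}^{(2)}}{n^{(4)}} + \sum_{i}\sum_{i' \neq i}\sum_{j}\frac{m_{i+}^{(2)} m_{i'+}^{(2)} m_{+j}^{(4)}}{n^{(4)}} + \sum_{i}\sum_{i' \neq i}\sum_{j}\sum_{j' \neq j}\frac{m_{i+}^{(2)} m_{i'+}^{(2)} m_{+j}^{(2)} m_{+j'}^{(2)}}{n^{(4)}}. \tag{3.9} \]

After some simplifications and collecting identical terms, we obtain the expression for $E[U^2]$ stated in Theorem 1, as desired. Furthermore, we obtain from (3.7):

\[ E\left[\left(\sum_{i,j} m_{ij}^2\right)^2\right] = E[U^2] + 2nE\left[\sum_{i,j} m_{ij}^2\right] - n^2 = E[U^2] + 2n\left(\frac{PQ}{n(n-1)} + n\right) - n^2 = E[U^2] + \frac{2PQ}{n-1} + n^2. \tag{3.10} \]

Substituting (3.10) into (3.6) results in

\[ E[SI^2] = \alpha^2 + 2\alpha\beta\left(\frac{PQ}{n(n-1)} + n\right) + \beta^2\left(E[U^2] + \frac{2PQ}{n-1} + n^2\right) \tag{3.11} \]

with $E[U^2]$ as given in Theorem 1.

Since the Cz index belongs to the family L, we can obtain $E(Cz)$ and $E(Cz^2)$ from Theorem 1 and therefore compute $E[J^*]$ from (3.3) as an approximation to $E[J]$, the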

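expected Jaccard index. Hence, an approximation to the corrected Jaccard index (CJ) is given by

\[ CJ = \frac{J - E[J^*]}{1 - E[J^*]}, \]

where J and $E[J^*]$ are given by (2.11) and (3.3), respectively.

3.2 Correction of Rogers and Tanimoto index

It is shown in Table 3 that R and RT are related by the equation $RT = f(R) = \frac{R}{2 - R}$, where RT, not a member of the family L, can be written in terms of $m_{ij}$ as in (2.8). The curve representing the relationship between RT and R is similar to Fig. 1, and can be approximated by a quadratic equation of the form

\[ RT^* = \gamma_1 R^2 + \gamma_2 R + \gamma_3. \tag{3.12} \]

Least squares estimates $\hat{\gamma}_1$, $\hat{\gamma}_2$, and $\hat{\gamma}_3$ of $\gamma_1$, $\gamma_2$, and $\gamma_3$ were found using the R statistical software, and thus the relationship between RT and R can be approximated by

\[ RT^* = \hat{\gamma}_1 R^2 + \hat{\gamma}_2 R + \hat{\gamma}_3. \tag{3.13} \]

Therefore,

\[ E(RT^*) = \hat{\gamma}_1 E(R^2) + \hat{\gamma}_2 E(R) + \hat{\gamma}_3. \tag{3.14} \]

For determining $E[R]$ we write

\[ R = \underbrace{1 - \frac{1}{n(n-1)}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}_{\alpha} + \underbrace{\frac{2}{n(n-1)}}_{\beta}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 = \alpha + \beta\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2. \tag{3.15} \]

Since R belongs to the family L, Theorem 1 yields

\[ E(R) = E\left[1 - \frac{1}{n(n-1)}\left(\sum_{i} m_{i+}^2 + \sum_{j} m_{+j}^2\right) + \frac{2}{n(n-1)}\sum_{i,j} m_{ij}^2\right] = 1 - \frac{1}{n(n-1)}\left(\sum_{i} m_{i+}^2 + \sum_{j} m_{+j}^2\right) + \frac{2}{n(n-1)}\left(\frac{PQ}{n(n-1)} + n\right) \]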

\[ \phantom{E(R)} = 1 - \frac{P + Q + 2n}{n(n-1)} + \frac{2PQ}{n^2(n-1)^2} + \frac{2}{n-1} = 1 - \frac{P + Q}{n(n-1)} + \frac{2PQ}{n^2(n-1)^2}. \tag{3.16} \]

Note that Theorem 1 provides a general formula for $E[SI]$ and $E[SI^2]$. For the R index, $E[R]$ and $E[R^2]$ are given by Theorem 1 with $\alpha$ and $\beta$ given in (3.15). In particular, $E[R]$ is given by (3.16) and

\[ E[R^2] = \alpha^2 + 2\alpha\beta\left(\frac{PQ}{n(n-1)} + n\right) + \beta^2\left(E[U^2] + \frac{2PQ}{n-1} + n^2\right), \tag{3.17} \]

where $\alpha$ and $\beta$ are as given in (3.15), and $E[U^2]$ is given in Theorem 1. Thus, the corrected RT (CRT) can be approximately calculated as

\[ CRT = \frac{RT - E(RT^*)}{1 - E(RT^*)}, \tag{3.18} \]

where RT and $E(RT^*)$ are given by (2.8) and (3.14), respectively.

3.3 Correction of Gower and Legendre index

In Table 3 it is shown that GL is related to R by the equation $GL = h(R) = \frac{2R}{1 + R}$. Equation (2.10) shows that GL can be written in terms of $m_{ij}$ but does not belong to the family L. The graph showing the relationship between R and GL is displayed in Fig. 2; it can be approximated by a quadratic equation of the form

\[ GL^* = \beta_1 R^2 + \beta_2 R + \beta_3. \tag{3.19} \]

Least squares estimates $\hat{\beta}_1$, $\hat{\beta}_2$, and $\hat{\beta}_3$ of $\beta_1$, $\beta_2$, and $\beta_3$ were found using the R statistical software, and thus the relationship between GL and R can be approximated by

\[ GL^* = \hat{\beta}_1 R^2 + \hat{\beta}_2 R + \hat{\beta}_3. \tag{3.20} \]

The maximum absolute difference between the exact and approximate relationships (see Fig. 2) is … . Therefore,

\[ E[GL^*] = \hat{\beta}_1 E(R^2) + \hat{\beta}_2 E(R) + \hat{\beta}_3. \tag{3.21} \]

The values of $E[R]$ and $E[R^2]$ in (3.21) are given by (3.16) and (3.17), respectively. Hence the corrected GL index (CGL) can be approximately calculated as

\[ CGL = \frac{GL - E(GL^*)}{1 - E(GL^*)}, \tag{3.22} \]

where GL and $E(GL^*)$ are given by (2.10) and (3.21), respectively.
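The moments of Theorem 1 that feed (3.16) and (3.17) can be sketched in Python as below. The function names are hypothetical, and the brute-force helper is an addition made here purely as a sanity check: it averages over all relabelings of one partition, which reproduces the fixed-margins (generalized hypergeometric) model for small n. The closed form requires n > 3 because of the factor n(n-1)(n-2)(n-3).

```python
from collections import Counter
from itertools import permutations


def family_l_moments(row_totals, col_totals, alpha, beta):
    """E[SI] and E[SI^2] for SI = alpha + beta * sum(m_ij^2), Theorem 1."""
    n = sum(row_totals)
    p = sum(m * m for m in row_totals) - n
    q = sum(m * m for m in col_totals) - n
    p3 = sum(m * (m - 1) * (m - 2) for m in row_totals)  # P'
    q3 = sum(m * (m - 1) * (m - 2) for m in col_totals)  # Q'
    e_s = p * q / (n * (n - 1)) + n                      # E[sum m_ij^2]
    e_u2 = (2 * p * q / (n * (n - 1))
            + 4 * p3 * q3 / (n * (n - 1) * (n - 2))
            + (p * p - 4 * p3 - 2 * p) * (q * q - 4 * q3 - 2 * q)
            / (n * (n - 1) * (n - 2) * (n - 3)))
    e_s2 = e_u2 + 2 * p * q / (n - 1) + n * n            # Eq. 3.10
    mean = alpha + beta * e_s
    second = alpha ** 2 + 2 * alpha * beta * e_s + beta ** 2 * e_s2
    return mean, second


def rand_alpha_beta(row_totals, col_totals):
    """alpha and beta of the Rand index, Eq. 3.15."""
    n = sum(row_totals)
    total = sum(m * m for m in row_totals) + sum(m * m for m in col_totals)
    return 1 - total / (n * (n - 1)), 2 / (n * (n - 1))


def brute_force_moments(labels_a, labels_b, alpha, beta):
    """Average SI and SI^2 over every relabeling of labels_b (small n only)."""
    values = []
    for perm in permutations(labels_b):
        s = sum(v * v for v in Counter(zip(labels_a, perm)).values())
        values.append(alpha + beta * s)
    count = len(values)
    return sum(values) / count, sum(v * v for v in values) / count


labels_a = [0, 0, 0, 1, 1, 2]  # toy partitions, n = 6
labels_b = [0, 0, 1, 1, 2, 2]
rows = list(Counter(labels_a).values())
cols = list(Counter(labels_b).values())
alpha, beta = rand_alpha_beta(rows, cols)
print(family_l_moments(rows, cols, alpha, beta))
print(brute_force_moments(labels_a, labels_b, alpha, beta))
```

With the fitted coefficients of (3.13) or (3.20) in hand, E(RT*) and E(GL*) follow by plugging these two moments into (3.14) and (3.21).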

[Fig. 2 Exact versus approximate relationship between R and GL indices. Axes: R, GL; curves: Exact, Approx]

3.4 Correction of Sokal and Sneath index

It is shown in Table 3 that SS and Cz are related by the equation $SS = g(Cz) = \frac{Cz}{4 - 3Cz}$. In order to derive the moments of the SS index (not a member of the family L), we use the fact that the function g can be closely approximated by a quadratic function (see Fig. 3) of the form

\[ SS^* = \eta_1 + \eta_2 Cz + \eta_3 Cz^2. \tag{3.23} \]

Using the R statistical software, least squares estimates $\hat{\eta}_1$, $\hat{\eta}_2$, and $\hat{\eta}_3$ of $\eta_1$, $\eta_2$, and $\eta_3$ were obtained, and hence the approximate relationship between Cz and SS is given by

\[ SS^* = \hat{\eta}_1 + \hat{\eta}_2 Cz + \hat{\eta}_3 Cz^2. \tag{3.24} \]

Point-by-point evaluation of the exact and approximate relationships between SS and Cz reveals that their maximum absolute difference is as large as … . Using (3.23), E(SS) can be approximated by

\[ E(SS) \approx E(SS^*) = \hat{\eta}_1 + \hat{\eta}_2 E(Cz) + \hat{\eta}_3 E(Cz^2). \tag{3.25} \]

Note that Cz is a member of the family L, so using Theorem 1, $E(Cz)$ and $E(Cz^2)$ are given by

\[ E(Cz) = \alpha + \beta\left(\frac{PQ}{n(n-1)} + n\right), \qquad E[Cz^2] = \alpha^2 + 2\alpha\beta\left(\frac{PQ}{n(n-1)} + n\right) + \beta^2\left(E[U^2] + \frac{2PQ}{n-1} + n^2\right), \]

with $\alpha = \frac{-2n}{P + Q}$, $\beta = \frac{2}{P + Q}$, and $E[U^2]$ as given in Theorem 1.

[Fig. 3 Exact versus approximate relationship between Cz and SS indices. Axes: Cz, SS; curves: Exact, Approx]

Thus the corrected SS can be approximately calculated as

\[ CSS = \frac{SS - E(SS^*)}{1 - E(SS^*)}, \tag{3.26} \]

where SS and $E(SS^*)$ are given by (2.9) and (3.25), respectively. In the following section we propose another way to find the expectation of the indices, based on a Taylor series expansion of the indices as functions of $\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$, which provides a better approximation for indices such as J, GL, RT, and SS.

4 Expectations based on Taylor series expansion

Consider the indices RT, SS, GL, and J as given by Eqs. 2.8, 2.9, 2.10, and 2.11, respectively. Clearly, each of these indices is non-linear in the quantity $\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$ and can therefore be thought of as a function $Y = g(X)$, where $X = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$, since n, $\sum_{i=1}^{I} m_{i+}^2$, and $\sum_{j=1}^{J} m_{+j}^2$ are constants. Consider the Taylor series expansion of Y around $\mu = E(X)$, which is given by

\[ Y = g(X) \approx g(\mu) + \frac{1}{1!}g'(\mu)(X - \mu) + \frac{1}{2!}g''(\mu)(X - \mu)^2 + \cdots \tag{4.1} \]

Since $E(X - \mu) = 0$ and $E(X - \mu)^2 = \mathrm{Var}(X)$, taking expectations in Eq. 4.1 gives

\[ E(Y) = E(g(X)) \approx g(\mu) + \frac{1}{2!}g''(\mu)\mathrm{Var}(X) + \cdots \tag{4.2} \]
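As a worked instance of Eq. 4.2, the two terms can be written out for the Jaccard index of Eq. 2.11. The shorthand $T = \sum_{i} m_{i+}^2 + \sum_{j} m_{+j}^2$ is notation assumed here for compactness; it is not used in the paper.

```latex
\begin{align*}
g(x)   &= \frac{x - n}{T - n - x}, \\
g'(x)  &= \frac{(T - n - x) + (x - n)}{(T - n - x)^2}
        = \frac{T - 2n}{(T - n - x)^2}, \\
g''(x) &= \frac{2\,(T - 2n)}{(T - n - x)^3}, \\
E(J)   &\approx g(\mu) + \frac{1}{2!}\,g''(\mu)\,\mathrm{Var}(X)
        = \frac{\mu - n}{T - n - \mu}
        + \frac{(T - 2n)\,\mathrm{Var}(X)}{(T - n - \mu)^3}.
\end{align*}
```

Here $T - 2n = P + Q$ in the notation of Theorem 1, and $\mathrm{Var}(X) = E[X^2] - (E[X])^2$ can be taken from (3.10) if the second-order term is to be evaluated.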

Two conditional expectation formulas for correcting the Rand (1971) index for chance agreement have been proposed. Hubert and Arabie (1985) proposed an expectation based on the exact generalized hypergeometric distribution of the matching counts in the matrix M, which is given by

\[ E\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right) = \frac{1}{n(n-1)}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{i+}^2 m_{+j}^2 + \frac{n^2}{n-1} - \frac{1}{n-1}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right). \tag{4.3} \]

Morey and Agresti (1984) proposed an asymptotic expectation based on the multinomial distribution, given by

\[ E\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right) = \sum_{i=1}^{I}\sum_{j=1}^{J}\left(\frac{m_{i+} m_{+j}}{n}\right)^2 = \frac{1}{n^2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{i+}^2 m_{+j}^2. \tag{4.4} \]

Albatineh et al. (2006, p. 308) showed that, as the sample size increases, the difference between the corrected Rand (1971) index using Eqs. 4.3 and 4.4 becomes negligible. For simplicity, the expectation in Eq. 4.4 will be used in the Taylor series expansion to obtain the expectation of the indices J, RT, GL, and SS, as explained below. Initial evaluations revealed little contribution of the second term in Eq. 4.2, and therefore only the first term will be used to approximate the expectation of the indices, as described below.

1. Correction of Jaccard index: The J index as a function of $X = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$ is given by

\[ J = g(X) = \frac{\sum_{i,j} m_{ij}^2 - n}{\sum_{i} m_{i+}^2 + \sum_{j} m_{+j}^2 - \sum_{i,j} m_{ij}^2 - n}. \tag{4.5} \]

Therefore, using Eq. 4.4, the expected J index is given by

\[ E(J) = E(g(X)) \approx g(E(X)) = \frac{\sum_{i,j}\left(\frac{m_{i+} m_{+j}}{n}\right)^2 - n}{\sum_{i} m_{i+}^2 + \sum_{j} m_{+j}^2 - \sum_{i,j}\left(\frac{m_{i+} m_{+j}}{n}\right)^2 - n} = \frac{\frac{1}{n^2}\sum_{i,j} m_{i+}^2 m_{+j}^2 - n}{\sum_{i} m_{i+}^2 + \sum_{j} m_{+j}^2 - \frac{1}{n^2}\sum_{i,j} m_{i+}^2 m_{+j}^2 - n} = \frac{\frac{1}{n^2}(P + n)(Q + n) - n}{P + Q + n - \frac{1}{n^2}(P + n)(Q + n)}, \tag{4.6} \]

where $P = \sum_{i=1}^{I} m_{i+}^2 - n$ and $Q = \sum_{j=1}^{J} m_{+j}^2 - n$. Therefore, the corrected J index is given by

\[ CJ = \frac{J - E(J)}{1 - E(J)}, \tag{4.7} \]

where J and E(J) are given by Eqs. 2.11 and 4.6, respectively.

2. Correction of Rogers and Tanimoto: The RT index is given by

\[ RT = \frac{2\sum_{i,j} m_{ij}^2 + n(n-1) - \left(\sum_{i} m_{i+}^2 + \sum_{j} m_{+j}^2\right)}{\sum_{i} m_{i+}^2 + \sum_{j} m_{+j}^2 + n(n-1) - 2\sum_{i,j} m_{ij}^2}. \tag{4.8} \]

Therefore, using Eq. 4.4, the expected RT index is given by

\[ E(RT) = \frac{\frac{2}{n^2}(P + n)(Q + n) + n(n-1) - (P + Q + 2n)}{P + n + Q + n + n(n-1) - \frac{2}{n^2}(P + n)(Q + n)} = \frac{\frac{2}{n^2}(P + n)(Q + n) + n(n-1) - (P + Q + 2n)}{P + Q + n(n+1) - \frac{2}{n^2}(P + n)(Q + n)}. \tag{4.9} \]

Thus, the corrected RT is given by

\[ CRT = \frac{RT - E(RT)}{1 - E(RT)}, \tag{4.10} \]

where RT and E(RT) are given by Eqs. 2.8 and 4.9, respectively.

3. Correction of Gower and Legendre: The GL index is given by

\[ GL = \frac{2\sum_{i,j} m_{ij}^2 + n(n-1) - \left(\sum_{i} m_{i+}^2 + \sum_{j} m_{+j}^2\right)}{\sum_{i,j} m_{ij}^2 + n(n-1) - \frac{1}{2}\left(\sum_{i} m_{i+}^2 + \sum_{j} m_{+j}^2\right)}. \tag{4.11} \]

Therefore, using Eq. 4.4, the expected GL index is given by

\[ E(GL) = \frac{\frac{2}{n^2}(P + n)(Q + n) + n(n-1) - (P + Q + 2n)}{\frac{1}{n^2}(P + n)(Q + n) + n(n-1) - \frac{1}{2}(P + Q + 2n)}. \tag{4.12} \]

Thus, the corrected GL index is given by

\[ CGL = \frac{GL - E(GL)}{1 - E(GL)}, \tag{4.13} \]

where GL and E(GL) are given by Eqs. 2.10 and 4.12, respectively.

4. Correction of Sokal and Sneath: The SS index is given by

\[ SS = \frac{\sum_{i,j} m_{ij}^2 - n}{2\left(\sum_{i} m_{i+}^2 + \sum_{j} m_{+j}^2\right) - n - 3\sum_{i,j} m_{ij}^2}. \tag{4.14} \]

Therefore, using Eq. 4.4, the expected SS index is given by

\[ E(SS) = \frac{\frac{1}{n^2}(P + n)(Q + n) - n}{2(P + Q + 2n) - n - \frac{3}{n^2}(P + n)(Q + n)}. \tag{4.15} \]

Thus, the corrected SS index is given by

\[ CSS = \frac{SS - E(SS)}{1 - E(SS)}, \tag{4.16} \]

where SS and E(SS) are given by Eqs. 2.9 and 4.15, respectively.

In the following section, results of numerical simulations for homogeneous and structured data are presented using the expectations obtained in Sects. 3 and 4.

5 Simulation results

In this section we investigate the performance of the correction using an expectation based on approximations of the relationships between indices and an approximate expectation based on Taylor series expansion. Data sets with and without clustering structure will be generated, and the values of the indices before and after correction will be compared.

5.1 Homogeneous data

In this case, 500 observations are generated from a bivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. The data (see Fig. 4) are clustered by the average linkage method (arbitrarily chosen), and we look at the obtained partitions with 2, 3, 4, ..., 10 clusters (method A). In addition, the same data are randomly split into 2, 3, 4, ..., 10 clusters of equal size (method B). The similarity between the two resulting clusterings is calculated (for the same number of clusters) using the J, RT, GL, and SS indices along with their versions corrected for chance agreement as discussed in Sects. 3 and 4. This data generation and clustering process is repeated 1,000 times and the averages of the indices are calculated. Since the data have no clustering structure, we expect the values of the corrected indices to be very close to zero. Table 4 presents the results for the homogeneous data

simulations using the average linkage method. For example, if we consider the column with three clusters in Table 4, the values of the original indices J, RT, GL, and SS are …, 0.314, 0.543, and …, while they are only …, …, …, and … after correction for chance agreement (when using the Taylor series method). The values of the corrected indices are 0.004, …, …, and …, respectively, when using the proposed approximations from Sect. 3. This indicates that the proposed methods are very effective in correcting the J, RT, GL, and SS indices for chance agreement, insofar as their values are close to zero when no cluster structure exists.

[Fig. 4 Random sample of size 500 generated from bivariate normal distribution. Axes: x, y]

5.2 Clustered data

For this example, five clusters with 100 observations each were generated from five bivariate normal distributions with mean vectors $\mu_1, \ldots, \mu_5$ and a covariance matrix $\Sigma$. A random sample obtained from these distributions is shown in Fig. 5. The average linkage method was used to cluster the 500 points by requesting 2, 3, 4, ..., 10 clusters. The similarity between the original five clusters (data sets) and the k-class partition (k = 2, 3, ..., 10) resulting from the average linkage method was calculated. This process was repeated 1,000 times and the averages of the indices were calculated.
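The Taylor-based corrections of Eqs. 4.6, 4.9, 4.12, and 4.15 can be sketched in Python as below, together with a much simplified null experiment. Two points are assumptions of this sketch rather than the paper's design: the function names are hypothetical, and method A is replaced by a purely random labeling instead of average linkage on simulated bivariate normal data, so the numbers it prints are not comparable to Table 4. It only illustrates that the corrected values move toward zero when the two partitions are unrelated.

```python
import random
from collections import Counter


def taylor_corrected(labels_a, labels_b):
    """Return {index: (raw value, Taylor-corrected value)} for J, RT, GL, SS."""
    n = len(labels_a)
    x = sum(v * v for v in Counter(zip(labels_a, labels_b)).values())
    s_rows = sum(v * v for v in Counter(labels_a).values())  # P + n
    s_cols = sum(v * v for v in Counter(labels_b).values())  # Q + n
    t = s_rows + s_cols                                      # P + Q + 2n
    ex = s_rows * s_cols / n ** 2                            # Eq. 4.4

    def jaccard(v):           # Eq. 4.5
        return (v - n) / (t - v - n)

    def rogers_tanimoto(v):   # Eq. 4.8
        return (2 * v + n * (n - 1) - t) / (t + n * (n - 1) - 2 * v)

    def gower_legendre(v):    # Eq. 4.11
        return (2 * v + n * (n - 1) - t) / (v + n * (n - 1) - 0.5 * t)

    def sokal_sneath(v):      # Eq. 4.14
        return (v - n) / (2 * t - n - 3 * v)

    functions = {"J": jaccard, "RT": rogers_tanimoto,
                 "GL": gower_legendre, "SS": sokal_sneath}
    result = {}
    for name, g in functions.items():
        raw, expected = g(x), g(ex)  # first-order Taylor term: E g(X) ~ g(E X)
        result[name] = (raw, (raw - expected) / (1 - expected))
    return result


def null_simulation(n=200, k=3, reps=100, seed=0):
    """Average raw and corrected indices for two unrelated partitions."""
    rng = random.Random(seed)
    totals = {name: [0.0, 0.0] for name in ("J", "RT", "GL", "SS")}
    for _ in range(reps):
        labels_a = [rng.randrange(k) for _ in range(n)]  # random labeling
        labels_b = [i % k for i in range(n)]             # (near) equal sizes
        rng.shuffle(labels_b)
        for name, (raw, corr) in taylor_corrected(labels_a, labels_b).items():
            totals[name][0] += raw
            totals[name][1] += corr
    return {name: (r / reps, c / reps) for name, (r, c) in totals.items()}


for name, (raw, corrected) in null_simulation().items():
    print(f"{name}: raw={raw:.3f} corrected={corrected:.3f}")
```

For identical partitions every corrected index equals 1, since the raw index is 1 and (1 - E)/(1 - E) = 1, which is a quick way to check an implementation.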

[Table 4 Values of J, RT, GL, and SS indices before and after correction, obtained for data of size 500 observations generated from a bivariate normal distribution using the average linkage method. Rows: J, CJ Taylor, CJ appr, RT, CRT Taylor, CRT appr, GL, CGL Taylor, CGL appr, SS, CSS Taylor, CSS appr; columns: number of clusters]

[Fig. 5 Data from five bivariate normal distributions, each with sample size 100. Axes: x, y]

It is expected that the indices will attain their maximum values at the correct number of clusters, which is five, and attain smaller values as the number of clusters gets further away from the correct number; see Milligan and Cooper (1986, p. 455) for more details on using similarity indices as tools for measuring recovery of clustering structure. Table 5 presents the results obtained by the average linkage method, with values of the indices J, RT, GL, and SS along with their proposed corrected versions. The values of the indices at the correct number of clusters are close to each other, whereas

the values of the corrected indices at 2 and 10 clusters are smaller than the uncorrected indices (as expected, since we are far from the target of five clusters). It must be noted that the values of the corrected indices drop faster once we have passed the correct number of clusters; see for example GL and CGL at 5 and 6 clusters.

[Table 5 Values of indices J, RT, GL, SS and their corrected versions using the average linkage method, with data generated from five bivariate normal distributions each with sample size 100. Rows: J, CJ Taylor, CJ appr, RT, CRT Taylor, CRT appr, GL, CGL Taylor, CGL appr, SS, CSS Taylor, CSS appr; columns: number of clusters. The bold values represent values of the similarity indices at the correct number of clusters]

In summary, for the homogeneous data set, the corrected indices attained values closer to zero (as desired) than the uncorrected indices. For clustered data, the corrected indices showed less similarity when the number of clusters was far from the target (five clusters in this case), while attaining their maximum value at the correct number of clusters. This shows the effectiveness of the proposed approximations in correcting the indices J, RT, GL, and SS for chance agreement, in the sense that the corrected indices attain values close to zero for homogeneous data and close to the original index for structured data. For more on using the corrected similarity indices to find the optimal number of clusters in a data set, see Albatineh and Niewiadomska-Bugaj (2011).

6 Conclusion

In this paper, a proposal for correcting the similarity indices J, RT, SS, and GL, which are not members of the family L, has been presented. Similar indices can be handled the same way. The indices J, RT, SS, and GL are either functions of each other or functions of the indices R and Cz. In order to correct the indices J, RT, SS, and GL for chance agreement, two ideas were discussed. The first idea is to approximate the relationship between the indices in order to find the expectation of the index and hence correct it for chance agreement. The second idea is to find the expectation based on a Taylor series expansion of the indices around $\mu = E(X)$, where $X = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$. Simulation results revealed that such a correction greatly

improves the performance (recovery of clustering structure) of the indices, in the sense that they produce similarity values closer to zero between two clusterings when the data have no clustering structure (homogeneous data). In the structured-case simulations with five clusters, the corrected indices showed the desirable small similarity when the number of clusters was far from the target, and values very close to the original indices at the target. However, not all indices can be expressed in terms of other indices that are linear in the matching counts. In such cases, we can find expectations of the indices using the Taylor series approach, which is more general.
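The effect summarized in the conclusion can be illustrated with a small sketch. It does not reproduce the analytic approximations of the article: as a stand-in, it estimates the expectation of the Jaccard index by Monte Carlo, shuffling one label vector so that the cluster sizes (the marginal totals of the matching counts matrix) stay fixed. It then applies the usual chance-correction form (S - E(S)) / (1 - E(S)), assuming the index has maximum 1. The function names, the pair-count letters a, b, c, d, and the parameters n_perm and seed are hypothetical choices made for this illustration.

```python
import random
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Classify every pair of objects by co-membership in two partitions.

    a: together in both partitions, b: together only in the first,
    c: together only in the second, d: separated in both.
    (Letter names are an assumption of this sketch.)
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            a += 1
        elif same_a:
            b += 1
        elif same_b:
            c += 1
        else:
            d += 1
    return a, b, c, d


def jaccard(labels_a, labels_b):
    """Jaccard index on pairs: a / (a + b + c)."""
    a, b, c, _ = pair_counts(labels_a, labels_b)
    # Convention assumed here: if no pair is joined in either partition,
    # the two partitions are identical (all singletons), so return 1.0.
    return a / (a + b + c) if (a + b + c) else 1.0


def corrected_jaccard(labels_a, labels_b, n_perm=500, seed=0):
    """Chance-corrected Jaccard with a permutation estimate of E(J).

    Shuffling labels_b keeps both sets of cluster sizes fixed, which
    mimics the fixed-marginals model. This is a Monte Carlo substitute,
    not the analytic expectation.
    """
    rng = random.Random(seed)
    observed = jaccard(labels_a, labels_b)
    shuffled = list(labels_b)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        total += jaccard(labels_a, shuffled)
    expected = total / n_perm
    if expected == 1.0:  # degenerate case: nothing left to correct
        return 0.0
    return (observed - expected) / (1.0 - expected)
```

With two identical partitions the corrected value is 1, while for unrelated partitions it falls near or below zero, which is the behaviour the conclusion describes for data with and without clustering structure. The permutation estimate costs O(n_perm * n^2) pair comparisons, so it is only practical for small examples.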


VarCan (version 1): Variation Estimation and Partitioning in Canonical Analysis VarCan (version 1): Variation Estimation and Partitioning in Canonical Analysis Pedro R. Peres-Neto March 2005 Department of Biology University of Regina Regina, SK S4S 0A2, Canada E-mail: Pedro.Peres-Neto@uregina.ca

More information

CONTROL CHARTS FOR MULTIVARIATE NONLINEAR TIME SERIES

CONTROL CHARTS FOR MULTIVARIATE NONLINEAR TIME SERIES REVSTAT Statistical Journal Volume 13, Number, June 015, 131 144 CONTROL CHARTS FOR MULTIVARIATE NONLINEAR TIME SERIES Authors: Robert Garthoff Department of Statistics, European University, Große Scharrnstr.

More information

ELEMENTARY LINEAR ALGEBRA

ELEMENTARY LINEAR ALGEBRA ELEMENTARY LINEAR ALGEBRA K. R. MATTHEWS DEPARTMENT OF MATHEMATICS UNIVERSITY OF QUEENSLAND Corrected Version, 7th April 013 Comments to the author at keithmatt@gmail.com Chapter 1 LINEAR EQUATIONS 1.1

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

Czech J. Anim. Sci., 50, 2005 (4):

Czech J. Anim. Sci., 50, 2005 (4): Czech J Anim Sci, 50, 2005 (4: 163 168 Original Paper Canonical correlation analysis for studying the relationship between egg production traits and body weight, egg weight and age at sexual maturity in

More information