Correcting Jaccard and other similarity indices for chance agreement in cluster analysis
Adv Data Anal Classif
DOI /s y
REGULAR ARTICLE

Ahmed N. Albatineh · Magdalena Niewiadomska-Bugaj

Received: 1 June 2010 / Revised: 15 March 2011 / Accepted: 18 May 2011
© Springer-Verlag 2011

A. N. Albatineh (✉), Department of Epidemiology and Biostatistics, Florida International University, Miami, FL, USA. e-mail: aalbatin@fiu.edu
M. Niewiadomska-Bugaj, Department of Statistics, Western Michigan University, Kalamazoo, MI, USA. e-mail: m.bugaj@wmich.edu

Abstract Correcting a similarity index for chance agreement requires computing its expectation under fixed marginal totals of a matching counts matrix. For some indices, such as Jaccard, Rogers and Tanimoto, Sokal and Sneath, and Gower and Legendre, the expectations cannot be easily found. We show how such similarity indices can be expressed as functions of other indices, with expectations found by approximation, so that an approximate correction is possible. A second approach is based on Taylor series expansion. A simulation study illustrates the effectiveness of the resulting correction of similarity indices using structured and unstructured data generated from bivariate normal distributions.

Keywords Similarity indices · Matching counts matrix · Correction for chance agreement · Jaccard index · Cluster analysis · Comparing partitions

Mathematics Subject Classification (2000) 62H30

1 Introduction

Measuring similarity between two different partitions (clusterings) of the same set of objects is an important issue in cluster analysis. Many similarity measures have been
proposed in the literature and are extensively used in cluster analysis applications, including validation studies and recovery of clustering structure; see Milligan et al. (1983), Saxena and Navaneerham (1991, 1993), Milligan and Cooper (1986) and Steinley (2004) for discussion. The problem with these indices is that they do not account for agreement due to chance. Morey and Agresti (1984) proposed a correction of the Rand index R (Rand 1971) for chance agreement based on an asymptotic multinomial distribution, while Hubert and Arabie (1985) used the exact generalized hypergeometric distribution for the same purpose. Albatineh et al. (2006, p. 308) showed that the difference between the Morey and Agresti (1984) and Hubert and Arabie (1985) expectations (asymptotic and exact) is negligible when the number of objects to be clustered is not too small.

Fligner et al. (2002) proposed a modification of the Jaccard–Tanimoto index to be used in diverse selection of chemical compounds using binary strings. These authors emphasized that the Jaccard–Tanimoto index has been widely used in computational chemistry and has become the standard for measuring the structural similarity of compounds. Historically, the Jaccard–Tanimoto coefficient appeared much earlier, in Jaccard (1908), in an ecological context, to measure the degree of relatedness between two biological communities with respect to their species composition.

Albatineh et al. (2006, p. 307) proposed a correction of the indices of Fowlkes and Mallows (1983), Hamann (1961), Russell and Rao (1940), Czekanowski (Cz) (1932) and Wallace (1983) for chance agreement. Their simulations showed that correction improves the performance of the indices, in the sense that the indices take values close to zero when no clustering structure is present, while they take values close to the original index value when a clustering structure exists. Albatineh et al. (2006, p.
308) introduced a family L of similarity indices that are linear functions of the sum of the squares of the matching counts. Some of the indices that are not members of the L family are of great importance and wide applicability in botany, ecology, and zoology, such as the index J of Jaccard (1908) and the indices of Sokal and Sneath (1963), Gower and Legendre (1986) and Rogers and Tanimoto (1960), to name a few (denoted SS, GL, and RT, respectively).

In this paper, our goal is to find a general method to correct similarity indices such as J, RT, SS, and GL for agreement due to chance. Two approaches will be introduced. First, as the indices J, RT, SS, and GL are functions of two members of the L family, namely Czekanowski (Cz) (1932) and Rand (1971), this relationship can be approximated and the expectation in Eq. 2.5 can be approximately computed. Second, the Taylor series expansion of those indices is discussed, an approximation to their expectations is obtained, and thus a correction for chance can be computed.

The paper is organized as follows: Sect. 2 presents an overview of similarity indices. Sect. 3 presents some results relating the indices to each other, with a proposed method for approximating the relationships and hence the correction. Sect. 4 presents the Taylor series idea for finding the expectation of the indices. Sect. 5 presents the simulations showing the effect of the proposed methods, with conclusions in Sect. 6.
Table 1 Binary counts (numbers of pairs) for two clustering (partitioning) methods

                                      Partition B
Partition A              In the same clusters   In different clusters   Total
In the same clusters     a                      b                       a + b
In different clusters    c                      d                       c + d
Total                    a + c                  b + d                   N

2 Overview of similarity indices

A standard approach in comparing two partitions of the same data set is to calculate the similarity between the two obtained partitions of the underlying set of objects using similarity indices. Since the clusters are not predefined, the similarity of the results of different clustering procedures (algorithms) is usually based on the number of pairs of objects that are (not) placed together into the same cluster, according to each algorithm. Consequently, a similarity table as in Table 1 is formed, where a, b, c, d, and N are defined as:

a: Number of pairs of objects which are joined in the same cluster by both clustering methods.
b: Number of pairs of objects which are joined together by method A, and not joined together by method B.
c: Number of pairs of objects not joined together by method A, while joined together by method B.
d: Number of pairs of objects which are not joined together by either of the two methods.

The total number of pairs is $N = a + b + c + d = \binom{n}{2} = \frac{n(n-1)}{2}$, where n is the number of observations to be clustered.

Let $U = \{u_1, u_2, \ldots, u_I\}$ and $V = \{v_1, v_2, \ldots, v_J\}$ be two partitions of the same data set resulting from the two clustering methods A and B and producing I and J clusters ($i = 1, 2, \ldots, I$ and $j = 1, 2, \ldots, J$), respectively. The entries of Table 1 can also be defined in terms of counts in the matching matrix M between the two partitions U and V, $M = (m_{ij})$, where the entry $m_{ij} = |u_i \cap v_j|$ is the number of common objects in cluster $u_i$ from method A and cluster $v_j$ of method B (Jain and Dubes 1988, p. 173):

$$a = \sum_{i=1}^{I}\sum_{j=1}^{J}\binom{m_{ij}}{2} = \frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - \frac{n}{2}. \tag{2.1}$$
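$$b = \sum_{j=1}^{J}\binom{m_{+j}}{2} - a = \frac{1}{2}\sum_{j=1}^{J} m_{+j}^2 - \frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2. \tag{2.2}$$

$$c = \sum_{i=1}^{I}\binom{m_{i+}}{2} - a = \frac{1}{2}\sum_{i=1}^{I} m_{i+}^2 - \frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2. \tag{2.3}$$

$$d = \binom{n}{2} - a - b - c = \frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + \frac{n^2}{2} - \frac{1}{2}\sum_{i=1}^{I} m_{i+}^2 - \frac{1}{2}\sum_{j=1}^{J} m_{+j}^2, \tag{2.4}$$

where $m_{i+} = \sum_{j=1}^{J} m_{ij}$ and $m_{+j} = \sum_{i=1}^{I} m_{ij}$ are the ith row and jth column totals of the matching counts matrix M, respectively.

Any similarity index (SI), when corrected for chance agreement (CSI), takes the form

$$CSI = \frac{SI - E(SI)}{1 - E(SI)}, \tag{2.5}$$

where E(SI) is the expected value of the index under fixed marginal totals of the matching counts matrix M, and unity is the theoretical maximum value of the index; see Morey and Agresti (1984, p. 35).

Any SI that takes the form $SI = \alpha + \beta \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$, where α and β are unique for each index, is said to be a member of the family L (Albatineh et al. 2006). Albatineh (2010) derived means and variances for any member of the family L under fixed marginal totals of the matching counts matrix and independence of the clustering algorithms. For example, the indices of Rand (1971) and Czekanowski (Cz) (1932) are among many that are members of the family L, and can be written as

$$R = \frac{a + d}{a + b + c + d} = \underbrace{1 - \frac{1}{n(n-1)}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}_{\alpha} + \underbrace{\frac{2}{n(n-1)}}_{\beta}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2. \tag{2.6}$$

$$Cz = \frac{2a}{2a + b + c} = \underbrace{\frac{-n}{\frac{1}{2}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right) - n}}_{\alpha} + \underbrace{\frac{1}{\frac{1}{2}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right) - n}}_{\beta}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2. \tag{2.7}$$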
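The pair counts of Table 1 and the linear forms (2.6) and (2.7) can be illustrated with a minimal sketch. The function names `pair_counts` and `rand_and_cz` and the label-list input format are assumptions made for illustration and do not come from the paper; b is computed from the column totals and c from the row totals, following Eqs. 2.2 and 2.3 as printed.

```python
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Return (a, b, c, d) of Table 1 from two label sequences of equal length."""
    n = len(labels_a)
    m = Counter(zip(labels_a, labels_b))   # matching counts m_ij
    rows = Counter(labels_a)               # row totals m_{i+} (method A)
    cols = Counter(labels_b)               # column totals m_{+j} (method B)
    a = sum(comb(v, 2) for v in m.values())
    b = sum(comb(v, 2) for v in cols.values()) - a   # column totals, as in Eq. 2.2
    c = sum(comb(v, 2) for v in rows.values()) - a   # row totals, as in Eq. 2.3
    d = comb(n, 2) - a - b - c
    return a, b, c, d


def rand_and_cz(labels_a, labels_b):
    """Rand and Czekanowski indices in the linear form alpha + beta * sum(m_ij^2)."""
    n = len(labels_a)
    s = sum(v * v for v in Counter(zip(labels_a, labels_b)).values())
    margins = (sum(v * v for v in Counter(labels_a).values())
               + sum(v * v for v in Counter(labels_b).values()))
    alpha_r, beta_r = 1 - margins / (n * (n - 1)), 2 / (n * (n - 1))
    half = 0.5 * margins - n   # equals 2a + b + c; zero if no pair is ever joined
    alpha_cz, beta_cz = -n / half, 1 / half
    return alpha_r + beta_r * s, alpha_cz + beta_cz * s
```

Both routes should agree: the values returned by `rand_and_cz` should match (a + d)/(a + b + c + d) and 2a/(2a + b + c) computed from `pair_counts`, since the indices considered here depend on b and c only through their sum.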
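Table 2 Selected similarity indices

No.  Index                                                   Symbol  Formula
1    Rogers and Tanimoto (1960)                              RT      (a + d) / (a + 2(b + c) + d)
2    Gower and Legendre (1986)                               GL      (a + d) / (a + (1/2)(b + c) + d)
3    Jaccard (1912)                                          J       a / (a + b + c)
4    Sokal and Sneath (1963)                                 SS      a / (a + 2(b + c))
5    Sokal and Michener (1958); Rand (1971)                  R       (a + d) / (a + b + c + d)
6    Czekanowski (1932); Dice (1945); Sørensen (1948)        CZ      2a / (2a + b + c)
7    Hamann (1961)                                           H       ((a + d) − (b + c)) / (a + b + c + d)
8    Mcconnaughey (1964)                                     Mc      (a² − bc) / ((a + b)(a + c))
9    Johnson (1967)                                          Jo      a/(a + b) + a/(a + c)
10   Kulczynski (1927)                                       K       (1/2)(a/(a + b) + a/(a + c))
11   Legendre and Legendre (1998)                            LL      3a / (3a + b + c)
12   Lamont and Grant (1979)                                 LG      a / (2a + b + c)
13   Maarel (1969)                                           M       (2a − (b + c)) / (2a + b + c)
14   Sokal and Sneath (1963)                                 SS3     2(a + d) / (2(a + d) + (b + c))
15   Sokal and Sneath (1963)                                 SS4     (a + d) / (b + c)
16   Southwood (1978)                                        S       a / (b + c)

Table 2 presents a partial list of similarity indices. The indices RT, SS, GL, and J can be written in terms of $m_{ij}$ as

$$RT = \frac{a + d}{a + 2(b + c) + d} = \frac{2\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n(n-1) - \left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 + n(n-1) - 2\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2}. \tag{2.8}$$

$$SS = \frac{a}{a + 2(b + c)} = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n}{2\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right) - n - 3\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2}. \tag{2.9}$$

$$GL = \frac{a + d}{a + \frac{1}{2}(b + c) + d} = \frac{2\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n(n-1) - \left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n(n-1) - \frac{1}{2}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}. \tag{2.10}$$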
$$J = \frac{a}{a + b + c} = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 - \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n}. \tag{2.11}$$

These indices are not linear in $\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$ and hence are not members of the family L. Their conditional expectations under fixed marginal totals of the matching matrix M cannot be found explicitly. Therefore, the first idea is to express them as functions of other indices whose expectations can be computed explicitly, and to use these in an approximating formula for the corresponding function. In Sect. 3, the indices RT, SS, GL, and J are shown to be functions of R and Cz, which are members of the L family and whose expectations are known.

Table 3 Relationships between some similarity indices

No.  Comparison   Relationship
1    (R, RT)      RT = R/(2 − R)
2    (R, GL)      R = GL/(2 − GL)
3    (J, CZ)      J = CZ/(2 − CZ)
4    (SS, J)      SS = J/(2 − J)
5    (H, R)       H = 2R − 1
6    (Mc, K)      Mc = 2K − 1
7    (CZ, M)      M = 2CZ − 1
8    (SS, CZ)     SS = CZ/(4 − 3CZ)
9    (RT, GL)     RT = GL/(4 − 3GL)
10   (Mc, Jo)     Mc = Jo − 1
11   (LL, LG)     LG = LL/(3 − LL)
12   (LG, J)      LG = J/(1 + J)
13   (H, RT)      H = (3RT − 1)/(RT + 1)
14   (LL, J)      LL = 3J/(1 + 2J)
15   (LL, SS)     LL = 6SS/(5SS + 1)
16   (LG, M)      M = 4LG − 1
17   (J, M)       M = (3J − 1)/(J + 1)
18   (LG, SS)     SS = LG/(2 − 3LG)
19   (SS3, SS4)   SS4 = SS3/(2(1 − SS3))
20   (SS3, R)     R = SS3/(2 − SS3)
21   (SS4, R)     SS4 = R/(1 − R)
22   (SS, S)      SS = S/(S + 2)
23   (S, J)       J = S/(S + 1)
24   (LL, M)      LL = (3 + 3M)/(5 + M)
25   (GL, H)      GL = (2H + 2)/(H + 3)
26   (SS, M)      SS = (M + 1)/(5 − 3M)

3 Indices that are not in family L

In this paper, we will focus on the indices RT, SS, GL, and J, which are not members of the family L. Other non-members of the family L can be handled in a similar way. Relationships between some indices are presented in Table 3 and have been studied in Hubálek (1982) and Janson and Vegelius (1981) in the context of their suitability for measuring coexistence between two species over different localities in ecology. Similarly, Snijders et al. (1990) established relationships between R, J, and CZ and derived some distributional results.
The relationships numbered 1–5 in Table 3 were established in Janson and Vegelius (1981) and Snijders et al. (1990) and are presented here for completeness. Relationships 6–26 are newly formulated. Such relationships will form the basis for approximating expectations of indices that are not members of family L, as discussed in the next section.
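Relationships of this kind are algebraic identities in a, b, c, and d, so they can be checked numerically. The following sketch does this for a subset of them, using the index definitions of Table 2; the helper names `indices`, `RELATIONS`, and `max_violation` are hypothetical and introduced only for this illustration.

```python
def indices(a, b, c, d):
    """A few indices of Table 2 computed from the pair counts a, b, c, d."""
    s, t = b + c, a + d
    return {
        "R": t / (t + s), "RT": t / (t + 2 * s), "GL": t / (t + 0.5 * s),
        "H": (t - s) / (t + s), "J": a / (a + s), "SS": a / (a + 2 * s),
        "CZ": 2 * a / (2 * a + s), "M": (2 * a - s) / (2 * a + s),
    }


# Each entry returns left-hand side minus right-hand side of one relationship.
RELATIONS = {
    "RT = R/(2-R)": lambda v: v["RT"] - v["R"] / (2 - v["R"]),
    "R = GL/(2-GL)": lambda v: v["R"] - v["GL"] / (2 - v["GL"]),
    "J = CZ/(2-CZ)": lambda v: v["J"] - v["CZ"] / (2 - v["CZ"]),
    "SS = J/(2-J)": lambda v: v["SS"] - v["J"] / (2 - v["J"]),
    "H = 2R-1": lambda v: v["H"] - (2 * v["R"] - 1),
    "M = 2CZ-1": lambda v: v["M"] - (2 * v["CZ"] - 1),
    "SS = CZ/(4-3CZ)": lambda v: v["SS"] - v["CZ"] / (4 - 3 * v["CZ"]),
    "RT = GL/(4-3GL)": lambda v: v["RT"] - v["GL"] / (4 - 3 * v["GL"]),
    "H = (3RT-1)/(RT+1)": lambda v: v["H"] - (3 * v["RT"] - 1) / (v["RT"] + 1),
    "GL = (2H+2)/(H+3)": lambda v: v["GL"] - (2 * v["H"] + 2) / (v["H"] + 3),
}


def max_violation(a, b, c, d):
    """Largest absolute discrepancy over the relationships listed above."""
    values = indices(a, b, c, d)
    return max(abs(check(values)) for check in RELATIONS.values())
```

For any admissible pair counts (with a + b + c > 0), `max_violation` should be zero up to floating-point rounding.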
3.1 Correction of Jaccard index

As shown in Table 3, the indices J and Cz are related by the equation

$$J = h(Cz) = \frac{Cz}{2 - Cz},$$

where Cz and J can be written in terms of $m_{ij}$ as given by (2.7) and (2.11), respectively. Note that the Cz index is a member of the family L, and hence the expectation of Cz under fixed marginal totals of the matrix M can be found explicitly, see Eq. 2.7. In order to find the mean of the J index (not a member of family L), we use the fact that the function h can be closely approximated by a quadratic function (see Fig. 1) of the form

$$J^* = \xi_1 Cz^2 + \xi_2 Cz + \xi_3. \tag{3.1}$$

[Fig. 1 Exact versus approximate relationship between J and Cz indices (horizontal axis: Cz; vertical axis: J; curves: exact, approximate)]

Least squares estimates $\hat{\xi}_1$, $\hat{\xi}_2$, and $\hat{\xi}_3$ of $\xi_1$, $\xi_2$, and $\xi_3$ were found using the R statistical software, and thus the relationship between J and Cz can be approximated by

$$J^* = \hat{\xi}_1 Cz^2 + \hat{\xi}_2 Cz + \hat{\xi}_3 \tag{3.2}$$

(the numerical values of the estimates are missing from this transcription). Point by point evaluation of the exact relationship between J and Cz and of the quadratic approximation of J by Cz revealed that the approximation is very good; the maximum absolute difference [value missing in source] occurred at points close to 0 or 1. Therefore, the expectation of the approximate J index given by (3.1), with the coefficients replaced by their least squares estimates, can be calculated as

$$E[J^*] = \hat{\xi}_1 E(Cz^2) + \hat{\xi}_2 E(Cz) + \hat{\xi}_3. \tag{3.3}$$
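The least squares fit described above can be reproduced approximately with a short sketch. The paper used the R software; the version below is plain Python, and the evaluation grid (1001 equally spaced points on [0, 1]) is an assumption, since the grid used by the authors is not stated. The function name `fit_quadratic` is hypothetical, and the coefficients it returns are not claimed to equal the published estimates.

```python
def fit_quadratic(f, num_points=1001):
    """Least squares fit of c2*x^2 + c1*x + c0 to f on a uniform grid over [0, 1].

    Returns ((c2, c1, c0), maximum absolute error on the grid).
    """
    xs = [k / (num_points - 1) for k in range(num_points)]
    ys = [f(x) for x in xs]
    s = [sum(x ** p for x in xs) for p in range(5)]                  # power sums
    t = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(3)]  # moments of f
    # Augmented normal equations for the unknowns (c2, c1, c0).
    A = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(3):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    c2, c1, c0 = (A[i][3] / A[i][i] for i in range(3))
    max_err = max(abs(c2 * x * x + c1 * x + c0 - y) for x, y in zip(xs, ys))
    return (c2, c1, c0), max_err


# Quadratic approximation of J = h(Cz) = Cz / (2 - Cz).
coeffs_j, err_j = fit_quadratic(lambda cz: cz / (2 - cz))
```

The same helper applies to the other relationships used in this section, for example `fit_quadratic(lambda r: r / (2 - r))` for RT or `fit_quadratic(lambda cz: cz / (4 - 3 * cz))` for SS.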
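To calculate $E[J^*]$, we need E(Cz) and E(Cz²), which are established in Theorem 1 not only for Cz, but for the whole family L.

Theorem 1 Let SI be any similarity index of the form $SI = \alpha + \beta \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$. Under fixed marginal totals $m_{i+}$ and $m_{+j}$ of the matching counts matrix $M = (m_{ij})$ and independence of the two clusterings, we have

$$E[SI] = \alpha + \beta\left(\frac{PQ}{n(n-1)} + n\right),$$

$$E[SI^2] = \alpha^2 + 2\alpha\beta\left(\frac{PQ}{n(n-1)} + n\right) + \beta^2\left(E[U^2] + \frac{2PQ}{n-1} + n^2\right),$$

where

$$E[U^2] = \frac{2PQ}{n(n-1)} + \frac{4P'Q'}{n(n-1)(n-2)} + \frac{(P^2 - 4P' - 2P)(Q^2 - 4Q' - 2Q)}{n(n-1)(n-2)(n-3)},$$

$$U = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}(m_{ij} - 1) = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{(2)}, \quad P = \sum_{i=1}^{I} m_{i+}^2 - n, \quad Q = \sum_{j=1}^{J} m_{+j}^2 - n,$$

$$P' = \sum_{i=1}^{I} m_{i+}(m_{i+} - 1)(m_{i+} - 2), \quad \text{and} \quad Q' = \sum_{j=1}^{J} m_{+j}(m_{+j} - 1)(m_{+j} - 2).$$

Proof For fixed marginal totals $m_{i+}$ and $m_{+j}$ of the matching counts matrix $M = (m_{ij})$ and independence of the two procedures, the elements of M have a generalized hypergeometric distribution, see Lancaster (1969, p. 14) and Fowlkes and Mallows (1983). Define $m_{ij}^{(p)} = m_{ij}(m_{ij} - 1)\cdots(m_{ij} - p + 1)$; then the pth factorial moment of $m_{ij}$ is

$$E\left(m_{ij}^{(p)}\right) = \frac{m_{i+}^{(p)}\, m_{+j}^{(p)}}{n^{(p)}}. \tag{3.4}$$

This implies that $E\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right) = \frac{PQ}{n(n-1)} + n$ (see Hubert and Arabie 1985, p. 200). Therefore,

$$E[SI] = E\left(\alpha + \beta\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right) = \alpha + \beta\, E\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right) = \alpha + \beta\left(\frac{PQ}{n(n-1)} + n\right). \tag{3.5}$$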
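Since

$$SI^2 = \left(\alpha + \beta\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right)^2 = \alpha^2 + 2\alpha\beta\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + \beta^2\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right)^2,$$

we obtain

$$E[SI^2] = \alpha^2 + 2\alpha\beta\, E\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right) + \beta^2 E\left[\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right)^2\right] = \alpha^2 + 2\alpha\beta\left(\frac{PQ}{n(n-1)} + n\right) + \beta^2 E\left[\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right)^2\right]. \tag{3.6}$$

The expectation on the right hand side of (3.6) can be evaluated as follows. Consider

$$\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{(2)}\right)^2 = \left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}(m_{ij} - 1)\right)^2 = \left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n\right)^2 = \left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right)^2 - 2n\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n^2.$$

Therefore,

$$E\left[\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}(m_{ij} - 1)\right)^2\right] = E\left[\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right)^2\right] - 2n\, E\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right) + n^2.$$

Hence,

$$E\left[\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right)^2\right] = E\left[\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}(m_{ij} - 1)\right)^2\right] + 2n\, E\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right) - n^2. \tag{3.7}$$

In order to find the first expectation on the right hand side of (3.7), i.e. $E[U^2]$, we use the fact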
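that

$$\left(m_{ij}^{(2)}\right)^2 = \left(m_{ij}(m_{ij} - 1)\right)^2 = m_{ij}^2(m_{ij} - 1)^2 = m_{ij}^4 - 2m_{ij}^3 + m_{ij}^2 = m_{ij}^{(4)} + 4m_{ij}^{(3)} + 2m_{ij}^{(2)},$$

and therefore,

$$U^2 = \left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{(2)}\right)^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \overbrace{\left(m_{ij}^{(2)}\right)^2}^{m_{ij}^{(4)} + 4m_{ij}^{(3)} + 2m_{ij}^{(2)}} + \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{\substack{j'=1 \\ j' \neq j}}^{J} m_{ij}^{(2)} m_{ij'}^{(2)} + \sum_{i=1}^{I}\sum_{\substack{i'=1 \\ i' \neq i}}^{I}\sum_{j=1}^{J} m_{ij}^{(2)} m_{i'j}^{(2)} + \sum_{i=1}^{I}\sum_{\substack{i'=1 \\ i' \neq i}}^{I}\sum_{j=1}^{J}\sum_{\substack{j'=1 \\ j' \neq j}}^{J} m_{ij}^{(2)} m_{i'j'}^{(2)}. \tag{3.8}$$

Hence,

$$E(U^2) = \sum_{i=1}^{I}\sum_{j=1}^{J}\left[\frac{m_{i\cdot}^{(4)} m_{\cdot j}^{(4)}}{n^{(4)}} + 4\frac{m_{i\cdot}^{(3)} m_{\cdot j}^{(3)}}{n^{(3)}} + 2\frac{m_{i\cdot}^{(2)} m_{\cdot j}^{(2)}}{n^{(2)}}\right] + \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{\substack{j'=1 \\ j' \neq j}}^{J} \frac{m_{i\cdot}^{(4)} m_{\cdot j}^{(2)} m_{\cdot j'}^{(2)}}{n^{(4)}} + \sum_{i=1}^{I}\sum_{\substack{i'=1 \\ i' \neq i}}^{I}\sum_{j=1}^{J} \frac{m_{i\cdot}^{(2)} m_{i'\cdot}^{(2)} m_{\cdot j}^{(4)}}{n^{(4)}} + \sum_{i=1}^{I}\sum_{\substack{i'=1 \\ i' \neq i}}^{I}\sum_{j=1}^{J}\sum_{\substack{j'=1 \\ j' \neq j}}^{J} \frac{m_{i\cdot}^{(2)} m_{i'\cdot}^{(2)} m_{\cdot j}^{(2)} m_{\cdot j'}^{(2)}}{n^{(4)}}. \tag{3.9}$$

After some simplifications and collecting identical terms we obtain the expression for $E[U^2]$ stated in Theorem 1, as desired. Furthermore, we obtain from (3.7):

$$E\left[\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right)^2\right] = E[U^2] + 2n\, E\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right) - n^2 = E[U^2] + 2n\left(\frac{PQ}{n(n-1)} + n\right) - n^2 = E[U^2] + \frac{2PQ}{n-1} + n^2. \tag{3.10}$$

Substituting (3.10) into (3.6) results in

$$E[SI^2] = \alpha^2 + 2\alpha\beta\left(\frac{PQ}{n(n-1)} + n\right) + \beta^2\left(E[U^2] + \frac{2PQ}{n-1} + n^2\right), \tag{3.11}$$

with $E[U^2]$ as given in Theorem 1.

Since the Cz index belongs to the family L, we can obtain E(Cz) and E(Cz²) from Theorem 1 and therefore compute $E[J^*]$ from (3.3) as an approximation to E[J], the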
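expected Jaccard index. Hence, an approximation to the corrected Jaccard index (CJ) is given by

$$CJ = \frac{J - E[J^*]}{1 - E[J^*]},$$

where J and $E[J^*]$ are given by (2.11) and (3.3), respectively.

3.2 Correction of Rogers and Tanimoto index

It is shown in Table 3 that R and RT are related by the equation

$$RT = f(R) = \frac{R}{2 - R},$$

where RT, not a member of the family L, can be written in terms of $m_{ij}$ as in (2.8). The curve representing the relationship between RT and R is similar to Fig. 1, and can be approximated by a quadratic equation of the form

$$RT^* = \gamma_1 R^2 + \gamma_2 R + \gamma_3. \tag{3.12}$$

Least squares estimates $\hat{\gamma}_1$, $\hat{\gamma}_2$, and $\hat{\gamma}_3$ of $\gamma_1$, $\gamma_2$, and $\gamma_3$ were found using the R statistical software, and thus the relationship between RT and R can be approximated by

$$RT^* = \hat{\gamma}_1 R^2 + \hat{\gamma}_2 R + \hat{\gamma}_3 \tag{3.13}$$

(the numerical values of the estimates are missing from this transcription). Therefore,

$$E(RT^*) = \hat{\gamma}_1 E(R^2) + \hat{\gamma}_2 E(R) + \hat{\gamma}_3. \tag{3.14}$$

For determining E[R] we write

$$R = \underbrace{1 - \frac{1}{n(n-1)}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}_{\alpha} + \underbrace{\frac{2}{n(n-1)}}_{\beta}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 = \alpha + \beta\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2. \tag{3.15}$$

Since R belongs to the family L, Theorem 1 yields

$$E(R) = E\left[1 - \frac{1}{n(n-1)}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right) + \frac{2}{n(n-1)}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right] = 1 - \frac{1}{n(n-1)}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right) + \frac{2}{n(n-1)}\left(\frac{PQ}{n(n-1)} + n\right)$$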
$$= 1 - \frac{P + Q + 2n}{n(n-1)} + \frac{2PQ}{n^2(n-1)^2} + \frac{2}{n-1} = 1 - \frac{P + Q}{n(n-1)} + \frac{2PQ}{n^2(n-1)^2}. \tag{3.16}$$

Note that Theorem 1 provides general formulas for E[SI] and E[SI²]. For the R index, E[R] and E[R²] are given by Theorem 1 with α and β given in (3.15). In particular, E[R] is given by (3.16) and

$$E[R^2] = \alpha^2 + 2\alpha\beta\left(\frac{PQ}{n(n-1)} + n\right) + \beta^2\left(E[U^2] + \frac{2PQ}{n-1} + n^2\right), \tag{3.17}$$

where α and β are as given in (3.15), and $E[U^2]$ is given in Theorem 1. Thus, the corrected RT (CRT) can be approximately calculated as

$$CRT = \frac{RT - E(RT^*)}{1 - E(RT^*)}, \tag{3.18}$$

where RT and $E(RT^*)$ are given by (2.8) and (3.14), respectively.

3.3 Correction of Gower and Legendre index

In Table 3 it is shown that GL is related to R by the equation

$$GL = h(R) = \frac{2R}{1 + R}.$$

Equation (2.10) shows that GL can be written in terms of $m_{ij}$ but does not belong to the family L. The graph showing the relationship between R and GL is displayed in Fig. 2; it can be approximated by a quadratic equation of the form

$$GL^* = \beta_1 R^2 + \beta_2 R + \beta_3. \tag{3.19}$$

Least squares estimates $\hat{\beta}_1$, $\hat{\beta}_2$, and $\hat{\beta}_3$ of $\beta_1$, $\beta_2$, and $\beta_3$ were found using the R statistical software, and thus the relationship between GL and R can be approximated by

$$GL^* = \hat{\beta}_1 R^2 + \hat{\beta}_2 R + \hat{\beta}_3 \tag{3.20}$$

(the numerical values of the estimates are missing from this transcription). The maximum absolute difference between the exact and approximate relationships (see Fig. 2) is [value missing in source]. Therefore,

$$E[GL^*] = \hat{\beta}_1 E(R^2) + \hat{\beta}_2 E(R) + \hat{\beta}_3. \tag{3.21}$$

The values of E[R] and E[R²] in (3.21) are given by (3.16) and (3.17), respectively. Hence the corrected GL index (CGL) can be approximately calculated as

$$CGL = \frac{GL - E(GL^*)}{1 - E(GL^*)}, \tag{3.22}$$

where GL and $E(GL^*)$ are given by (2.10) and (3.21), respectively.
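The moment formulas of Theorem 1, on which the approximate corrections above rely, depend only on the marginal totals. The following is a minimal sketch under that reading; the function names `family_l_moments` and `rand_coefficients` are hypothetical, and the second-moment term uses $E[U^2] + 2PQ/(n-1) + n^2$ as derived in (3.10). The formulas require n > 3 because of the factor (n − 3) in the denominator of $E[U^2]$.

```python
def family_l_moments(row_totals, col_totals, alpha, beta):
    """Return (E[SI], E[SI^2]) for SI = alpha + beta * sum(m_ij^2), per Theorem 1."""
    n = sum(row_totals)
    if n != sum(col_totals) or n <= 3:
        raise ValueError("margins must share the same total n, with n > 3")
    P = sum(m * m for m in row_totals) - n
    Q = sum(m * m for m in col_totals) - n
    P3 = sum(m * (m - 1) * (m - 2) for m in row_totals)   # P' in Theorem 1
    Q3 = sum(m * (m - 1) * (m - 2) for m in col_totals)   # Q' in Theorem 1
    e_s = P * Q / (n * (n - 1)) + n                        # E[sum m_ij^2]
    e_u2 = (2 * P * Q / (n * (n - 1))
            + 4 * P3 * Q3 / (n * (n - 1) * (n - 2))
            + (P * P - 4 * P3 - 2 * P) * (Q * Q - 4 * Q3 - 2 * Q)
            / (n * (n - 1) * (n - 2) * (n - 3)))
    e_s2 = e_u2 + 2 * P * Q / (n - 1) + n * n              # Eq. 3.10
    mean = alpha + beta * e_s
    second = alpha ** 2 + 2 * alpha * beta * e_s + beta ** 2 * e_s2
    return mean, second


def rand_coefficients(row_totals, col_totals):
    """alpha and beta of the Rand index, as in Eq. 3.15."""
    n = sum(row_totals)
    margins = sum(m * m for m in row_totals) + sum(m * m for m in col_totals)
    return 1 - margins / (n * (n - 1)), 2 / (n * (n - 1))
```

For instance, with row totals (2, 2) and column totals (2, 2) the generalized hypergeometric distribution has only three possible tables, so the output can be compared with a direct enumeration; `family_l_moments([2, 2], [2, 2], *rand_coefficients([2, 2], [2, 2]))` gives E[R] and E[R²] for that case.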
[Fig. 2 Exact versus approximate relationship between R and GL indices (horizontal axis: R; vertical axis: GL; curves: exact, approximate)]

3.4 Correction of Sokal and Sneath index

It is shown in Table 3 that SS and Cz are related by the equation

$$SS = g(Cz) = \frac{Cz}{4 - 3Cz}.$$

In order to derive the moments of the SS index (not a member of the family L), we use the fact that the function g can be closely approximated by a quadratic function (see Fig. 3) of the form

$$SS^* = \eta_1 + \eta_2 Cz + \eta_3 Cz^2. \tag{3.23}$$

Using the R statistical software, least squares estimates $\hat{\eta}_1$, $\hat{\eta}_2$, and $\hat{\eta}_3$ of $\eta_1$, $\eta_2$, and $\eta_3$ were obtained, and hence the approximate relationship between Cz and SS is given by

$$SS^* = \hat{\eta}_1 + \hat{\eta}_2 Cz + \hat{\eta}_3 Cz^2 \tag{3.24}$$

(the numerical values of the estimates are missing from this transcription). Point by point evaluation of the exact and approximate relationships between SS and Cz reveals that their maximum absolute difference is as large as [value missing in source]. Using (3.23), E(SS) can be approximated by

$$E(SS) \approx E(SS^*) = \hat{\eta}_1 + \hat{\eta}_2 E(Cz) + \hat{\eta}_3 E(Cz^2). \tag{3.25}$$

Note that Cz is a member of the family L, so using Theorem 1, E(Cz) and E(Cz²) are given by

$$E(Cz) = \alpha + \beta\left(\frac{PQ}{n(n-1)} + n\right), \quad E[Cz^2] = \alpha^2 + 2\alpha\beta\left(\frac{PQ}{n(n-1)} + n\right) + \beta^2\left(E[U^2] + \frac{2PQ}{n-1} + n^2\right),$$

with $\alpha = \frac{-2n}{P + Q}$, $\beta = \frac{2}{P + Q}$, and $E[U^2]$ as given in Theorem 1.
[Fig. 3 Exact versus approximate relationship between Cz and SS indices (horizontal axis: Cz; vertical axis: SS; curves: exact, approximate)]

Thus the corrected SS can be approximately calculated as

$$CSS = \frac{SS - E(SS^*)}{1 - E(SS^*)}, \tag{3.26}$$

where SS and $E(SS^*)$ are given by (2.9) and (3.25), respectively.

In the following section we propose another way to find the expectation of the indices, based on a Taylor series expansion of the indices as functions of $\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$, which provides a better approximation for indices such as J, GL, RT, and SS.

4 Expectations based on Taylor series expansion

Consider the indices RT, SS, GL, and J as given by Eqs. 2.8, 2.9, 2.10, and 2.11, respectively. Clearly, each of these indices is non-linear in the quantity $\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$ and can therefore be thought of as a function Y = g(X), where $X = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$, since n, $\sum_{i=1}^{I} m_{i+}^2$, and $\sum_{j=1}^{J} m_{+j}^2$ are constants. Consider the Taylor series expansion of Y around μ = E(X), which is given by

$$Y = g(X) \approx g(\mu) + \frac{1}{1!} g'(\mu)(X - \mu) + \frac{1}{2!} g''(\mu)(X - \mu)^2 + \cdots \tag{4.1}$$

Since $E(X - \mu) = 0$ and $E(X - \mu)^2 = \mathrm{Var}(X)$, taking expectations in Eq. 4.1 gives

$$E(Y) = E(g(X)) \approx g(\mu) + \frac{1}{2!} g''(\mu)\,\mathrm{Var}(X) + \cdots \tag{4.2}$$
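The approximation in Eq. 4.2 can be tried on a toy case. The sketch below is illustrative only: the function name `taylor_mean` is hypothetical, the second derivative is obtained by a central finite difference (an assumption made to keep the example short), and the example uses the Jaccard index with n = 4 and marginal totals (2, 2) and (2, 2), for which X takes the value 8 with probability 1/3 and 4 with probability 2/3 under the generalized hypergeometric distribution.

```python
def taylor_mean(g, mu, var, order=2, h=1e-3):
    """Approximate E[g(X)] by g(mu), optionally plus 0.5 * g''(mu) * Var(X)."""
    approx = g(mu)
    if order >= 2:
        # Central finite difference for g''(mu); h is an arbitrary small step.
        second = (g(mu + h) - 2 * g(mu) + g(mu - h)) / (h * h)
        approx += 0.5 * second * var
    return approx


n = 4
margins = 16   # sum of squared row totals plus sum of squared column totals


def jaccard(x):
    """Jaccard index as a function of x = sum of squared matching counts (Eq. 2.11)."""
    return (x - n) / (margins - x - n)


support = {8: 1 / 3, 4: 2 / 3}   # distribution of X for margins (2, 2) and (2, 2)
mu = sum(x * p for x, p in support.items())
var = sum((x - mu) ** 2 * p for x, p in support.items())
exact = sum(jaccard(x) * p for x, p in support.items())
first_order = taylor_mean(jaccard, mu, var, order=1)
second_order = taylor_mean(jaccard, mu, var, order=2)
```

With such a small n the second-order term visibly moves the approximation toward the exact mean; this toy case says nothing about how large that term is for data sets of realistic size.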
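Two conditional expectation formulas for correcting the Rand (1971) index for chance agreement have been proposed. Hubert and Arabie (1985) proposed an expectation based on the exact generalized hypergeometric distribution of the matching counts in the matrix M, which is given by

$$E\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right) = \frac{1}{n(n-1)}\sum_{i=1}^{I} m_{i+}^2 \sum_{j=1}^{J} m_{+j}^2 + \frac{n^2}{n-1} - \frac{1}{n-1}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right). \tag{4.3}$$

Morey and Agresti (1984) proposed an asymptotic expectation based on the multinomial distribution, given by

$$E\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2\right) = \sum_{i=1}^{I}\sum_{j=1}^{J}\left(\frac{m_{i+} m_{+j}}{n}\right)^2 = \frac{1}{n^2}\sum_{i=1}^{I} m_{i+}^2 \sum_{j=1}^{J} m_{+j}^2. \tag{4.4}$$

Albatineh et al. (2006, p. 308) showed that, as the sample size increases, the difference between the corrected Rand (1971) index using Eqs. 4.3 and 4.4 becomes negligible. For simplicity, the expectation in Eq. 4.4 will be used in the Taylor series expansion to obtain the expectation of the indices J, RT, GL, and SS as explained below. Initial evaluations revealed little contribution of the second term in Eq. 4.2, and therefore only the first term will be used to approximate the expectation of the indices, as described below.

1. Correction of Jaccard index: The J index as a function of $X = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$ is given by

$$J = g(X) = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 - \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n}. \tag{4.5}$$

Therefore, using Eq. 4.4, the expected J index is given by

$$E(J) = E(g(X)) \approx g(E(X)) = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J}\left(\frac{m_{i+} m_{+j}}{n}\right)^2 - n}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 - \sum_{i=1}^{I}\sum_{j=1}^{J}\left(\frac{m_{i+} m_{+j}}{n}\right)^2 - n} = \frac{\frac{1}{n^2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{i+}^2 m_{+j}^2 - n}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 - \frac{1}{n^2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{i+}^2 m_{+j}^2 - n} = \frac{\frac{1}{n^2}(P + n)(Q + n) - n}{P + Q + n - \frac{1}{n^2}(P + n)(Q + n)}, \tag{4.6}$$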
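where $P = \sum_{i=1}^{I} m_{i+}^2 - n$ and $Q = \sum_{j=1}^{J} m_{+j}^2 - n$. Therefore, the corrected J index is given by

$$CJ = \frac{J - E(J)}{1 - E(J)}, \tag{4.7}$$

where J and E(J) are given by Eqs. 2.11 and 4.6, respectively.

2. Correction of Rogers and Tanimoto: The RT index is given by

$$RT = \frac{2\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n(n-1) - \left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}{\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2 + n(n-1) - 2\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2}. \tag{4.8}$$

Therefore, using Eq. 4.4, the expected RT index is given by

$$E(RT) = \frac{\frac{2}{n^2}(P + n)(Q + n) + n(n-1) - (P + Q + 2n)}{P + n + Q + n + n(n-1) - \frac{2}{n^2}(P + n)(Q + n)} = \frac{\frac{2}{n^2}(P + n)(Q + n) + n(n-1) - (P + Q + 2n)}{P + Q + n(n+1) - \frac{2}{n^2}(P + n)(Q + n)}. \tag{4.9}$$

Thus, the corrected RT is given by

$$CRT = \frac{RT - E(RT)}{1 - E(RT)}, \tag{4.10}$$

where RT and E(RT) are given by Eqs. 2.8 and 4.9, respectively.

3. Correction of Gower and Legendre: The GL index is given by

$$GL = \frac{2\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n(n-1) - \left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 + n(n-1) - \frac{1}{2}\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right)}. \tag{4.11}$$

Therefore, using Eq. 4.4, the expected GL index is given by

$$E(GL) = \frac{\frac{2}{n^2}(P + n)(Q + n) + n(n-1) - (P + Q + 2n)}{\frac{1}{n^2}(P + n)(Q + n) + n(n-1) - \frac{1}{2}(P + Q + 2n)}. \tag{4.12}$$

Thus, the corrected GL index is given by

$$CGL = \frac{GL - E(GL)}{1 - E(GL)}, \tag{4.13}$$

where GL and E(GL) are given by Eqs. 2.10 and 4.12, respectively.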
4. Correction of Sokal and Sneath: The SS index is given by

$$SS = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2 - n}{2\left(\sum_{i=1}^{I} m_{i+}^2 + \sum_{j=1}^{J} m_{+j}^2\right) - n - 3\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2}. \tag{4.14}$$

Therefore, using Eq. 4.4, the expected SS index is given by

$$E(SS) = \frac{\frac{1}{n^2}(P + n)(Q + n) - n}{2(P + Q + 2n) - n - \frac{3}{n^2}(P + n)(Q + n)}. \tag{4.15}$$

Thus, the corrected SS index is given by

$$CSS = \frac{SS - E(SS)}{1 - E(SS)}, \tag{4.16}$$

where SS and E(SS) are given by Eqs. 2.9 and 4.15, respectively.

In the following section, results of numerical simulations for homogeneous and structured data are presented using the expectations obtained in Sects. 3 and 4.

5 Simulation results

In this section we investigate the performance of the correction using an expectation based on approximations of the relationships between indices and an approximate expectation based on Taylor series expansion. Data sets with and without clustering structure will be generated, and the values of the indices before and after correction will be compared.

5.1 Homogeneous data

In this case 500 observations are generated from a bivariate normal distribution with mean vector μ and covariance matrix Σ (their numerical values are not legible in this transcription). The data (see Fig. 4) are clustered by the average linkage method (arbitrarily chosen), and we look at the obtained partitions with 2, 3, 4, ..., 10 clusters (method A). In addition, the same data are randomly split into 2, 3, 4, ..., 10 clusters of equal size (method B). The similarity between the two resulting clusterings is calculated (for the same number of clusters) using the J, RT, GL, and SS indices along with their versions corrected for chance agreement as discussed in Sects. 3 and 4. This data generation and clustering process is repeated 1,000 times and the averages of the indices are calculated. Since the data have no clustering structure, we expect the values of the corrected indices to be very close to zero. Table 4 presents the results for the homogeneous data
simulations using the average linkage method. For example, if we consider the column with three clusters in Table 4, the values of the original indices J, RT, GL, and SS are [value missing], 0.314, 0.543, and [value missing], while they are only [values missing] after correction for chance agreement when using the Taylor series method. The values of the corrected indices are 0.004, [value missing], [value missing], and [value missing], respectively, when using the proposed approximations from Sect. 3. This indicates that the proposed methods are very effective in correcting the J, RT, GL, and SS indices for chance agreement, insofar as their values are close to zero when no cluster structure exists.

[Fig. 4 Random sample of size 500 generated from bivariate normal distribution (axes: x, y)]

5.2 Clustered data

For this example, five clusters with 100 observations each were generated from five bivariate normal distributions with mean vectors $\mu_1, \ldots, \mu_5$ and covariance parameters whose numerical values are not legible in this transcription. A random sample obtained from these distributions is shown in Fig. 5. The average linkage method was used to cluster the 500 points by requesting 2, 3, 4, ..., 10 clusters. The similarity between the original five clusters (data sets) and the k-class partition (k = 2, 3, ..., 10) resulting from the average linkage method was calculated. This process was repeated 1,000 times and the averages of the indices were calculated.
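The Taylor-based corrections of Eqs. 4.5–4.16 can be tried on a much-simplified stand-in for these experiments. The sketch below does not reproduce the simulation design: instead of average linkage applied to bivariate normal samples, it compares two independent random labelings (no shared structure) and a partition with itself (perfect agreement). The function name `taylor_corrected`, the seed, and the labelings are assumptions made for illustration.

```python
import random
from collections import Counter


def taylor_corrected(labels_a, labels_b):
    """Return {index: (raw value, Taylor-corrected value)} for J, RT, GL, and SS."""
    n = len(labels_a)
    x = sum(v * v for v in Counter(zip(labels_a, labels_b)).values())
    row_sq = sum(v * v for v in Counter(labels_a).values())
    col_sq = sum(v * v for v in Counter(labels_b).values())
    sq = row_sq + col_sq
    e_x = row_sq * col_sq / n ** 2          # Morey-Agresti expectation, Eq. 4.4
    pairs2 = n * (n - 1)
    forms = {
        "J": lambda s: (s - n) / (sq - s - n),                          # Eq. 4.5
        "RT": lambda s: (2 * s + pairs2 - sq) / (sq + pairs2 - 2 * s),  # Eq. 4.8
        "GL": lambda s: (2 * s + pairs2 - sq) / (s + pairs2 - 0.5 * sq),  # Eq. 4.11
        "SS": lambda s: (s - n) / (2 * sq - n - 3 * s),                 # Eq. 4.14
    }
    result = {}
    for name, g in forms.items():
        value, expected = g(x), g(e_x)      # first-order term: E[g(X)] ~ g(E[X])
        result[name] = (value, (value - expected) / (1 - expected))
    return result


rng = random.Random(0)                      # fixed seed, chosen arbitrarily
random_a = [rng.randrange(5) for _ in range(500)]
random_b = [rng.randrange(5) for _ in range(500)]
no_structure = taylor_corrected(random_a, random_b)

same = [i % 5 for i in range(100)]
perfect_agreement = taylor_corrected(same, same)
```

Under these assumptions the corrected values for the independent labelings should be small in absolute value while the raw values are not, and identical partitions should give a corrected value of one for all four indices.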
Table 4 Values of J, RT, GL, and SS indices before and after correction, obtained for data of 500 observations generated from a bivariate normal distribution using the average linkage method
[Columns: number of clusters. Rows: J, CJ Taylor, CJ appr, RT, CRT Taylor, CRT appr, GL, CGL Taylor, CGL appr, SS, CSS Taylor, CSS appr. The numerical entries are missing from this transcription.]

[Fig. 5 Data from five bivariate normal distributions each with sample size 100 (axes: x, y)]

It is expected that the indices will attain their maximum values at the correct number of clusters, which is five, and attain smaller values as the number of clusters gets further away from the correct number; see Milligan and Cooper (1986, p. 455) for more details on using similarity indices as tools for measuring clustering structure recovery. Table 5 presents the results obtained by the average linkage method, with values of the indices J, RT, GL, and SS along with their proposed corrected versions. The values of the indices at the correct number of clusters are close to each other, whereas
the values of the corrected indices at 2 and 10 clusters are smaller than the uncorrected indices (as expected, since we are far from the target of five clusters). It must be noted that the values of the corrected indices drop faster once we have passed the correct number of clusters; see for example GL and CGL at 5 and 6 clusters.

Table 5 Values of indices J, RT, GL, SS and their corrected versions, with data generated from five bivariate normal distributions each with sample size 100, using the average linkage method
[Columns: number of clusters. Rows: J, CJ Taylor, CJ appr, RT, CRT Taylor, CRT appr, GL, CGL Taylor, CGL appr, SS, CSS Taylor, CSS appr. The numerical entries are missing from this transcription. In the original table, bold values represent values of the similarity indices at the correct number of clusters.]

In summary, for the homogeneous data set, the corrected indices attained values closer to zero (as desired) compared to the uncorrected indices. For clustered data, the corrected indices showed less similarity for numbers of clusters far from the target (five clusters in this case), while attaining their maximum value at the correct number of clusters. This clearly shows the effectiveness of the proposed approximations in correcting the indices J, RT, GL, and SS for chance agreement, in the sense that the corrected indices attain values close to zero for homogeneous data and close to the original index for structured data. For more on using the corrected similarity indices to find the optimal number of clusters in a data set see Albatineh and Niewiadomska-Bugaj (2011).

6 Conclusion

In this paper a proposal for correcting the similarity indices J, RT, SS, and GL, which are not members of the family L, has been presented. Similar indices can be handled the same way. The indices J, RT, SS, and GL are either functions of each other or functions of the indices R and Cz. In order to correct the indices J, RT, SS, and GL for chance agreement, two ideas were discussed.
The first idea is to approximate the relationship between the indices in order to find the expectation of the index and hence correct it for chance agreement. The second idea is to find the expectation based on a Taylor series expansion of the indices around μ = E(X), where $X = \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^2$. Simulation results revealed that such a correction greatly
improves the performance (recovery of clustering structure) of the indices, in the sense that they produce similarity values closer to zero between two clusterings when the data have no clustering structure (homogeneous data). In the structured-data simulations with five clusters, the corrected indices showed the desirable small similarity when the number of clusters was far from the target and were very close to the original indices at the target. However, not all indices can be expressed in terms of other indices that are linear in the matching counts. In such cases, we can find expectations of the indices using the Taylor series idea, which is more general.

References

Albatineh AN, Niewiadomska-Bugaj M, Mihalko DP (2006) On similarity indices and correction for chance agreement. J Classif 3:
Albatineh AN, Niewiadomska-Bugaj M (2011) MCS: a method for finding the number of clusters. J Classif 8. doi: /s
Albatineh AN (2010) Means and variances for a family of similarity indices used in cluster analysis. J Stat Plan Inference 140:
Czekanowski J (1932) Coefficient of racial likeness und durchschnittliche Differenz. Anthropologischer Anzeiger 14:7 49
Dice LR (1945) Measures of the amount of ecological association between species. Ecology 6:97 30
Fligner MA, Verducci JS, Blower PE (2002) A modification of the Jaccard–Tanimoto similarity index for diverse selection of chemical compounds using binary strings. Technometrics 44:
Fowlkes EB, Mallows CL (1983) A method for comparing two hierarchical clusterings. J Am Stat Assoc 78:
Gower JC, Legendre P (1986) Metric and Euclidean properties of dissimilarity coefficients. J Classif 3:5 48
Hamann U (1961) Merkmalsbestand und Verwandtschaftsbeziehungen der Farinosae. Willdenowia :
Hubálek Z (1982) Coefficients of association and similarity based on binary (presence–absence) data: an evaluation. Biol Rev 57:
Hubert L, Arabie P (1985) Comparing partitions.
J Classif :
Jaccard P (1908) Nouvelles recherches sur la distribution florale. Bull Soc Vaudoise Sci Nat 44:3 70
Jaccard P (1912) The distribution of the flora of the alpine zone. New Phytol 11:37 50
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, New Jersey
Janson S, Vegelius J (1981) Measures of ecological association. Oecologia 49:
Johnson SC (1967) Hierarchical clustering schemes. Psychometrika 3:41 54
Kulczynski S (1927) Die Pflanzenassoziationen der Pinien. Bulletin International de L'Académie Polonaise des Sciences et des Lettres, Classe des Sciences Mathématiques et Naturelles, Series B, Supplément II :57 03
Lamont BB, Grant KJ (1979) A comparison of twenty-one measures of site dissimilarity. In: Orlóci L, Rao CR, Stiteler WM (eds) Multivariate methods in ecological work. International Cooperation Publishing House, Fairland, pp
Lancaster HO (1969) The Chi-squared distribution. John Wiley, New York
Lehmann EL (1959) Testing statistical hypothesis. Wiley, New York
Legendre P, Legendre L (1998) Numerical ecology. Elsevier, Amsterdam
Mcconnaughey BH (1964) The determination and analysis of plankton communities. Marine Research, Special No, Indonesia, pp 1 40
Milligan G, Cooper M (1986) A study of the comparability of external criteria for hierarchical cluster analysis. Multivar Behav Res 1:
Milligan G, Soon S, Sokol L (1983) The effect of cluster size, dimensionality, and the number of clusters on recovery of true cluster structure. IEEE Trans Patt Anal Mach Intell PAMI-5:40 47
Morey L, Agresti A (1984) The measurement of classification agreement: an adjustment to the Rand statistic for chance agreement. Educ Psychol Meas 44:33 37
Rand W (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66:
Rogers DJ, Tanimoto TT (1960) A computer program for classifying plants. Science 13:
22 A. N. Albatineh, M. Niewiadomska-Bugaj Russell PF, Rao TR (1940 On habitat and association of species of anopheline larvae in South-Eastern Madras. J Malar Inst India 3: Saxena PC, Navaneerham K (1991 The effect of cluster size, dimensionality, and number of clusters on recovery of true cluster structure through Chernoff-type faces. Statistician 40: Saxena PC, Navaneerham K (1993 Comparison of Chernoff-type face and non-graphical methods for clustering multivariate observations. Comput Stat Data Anal 15:63 79 Snijders TAB, Dormaar M, Van Schuur WH, Dijkman-Caes C, Driessen G (1990 Distribution of some similarity coefficients for dyadic binary data in the case of associated attributes. J Classif 7:5 31 Sokal RR, Michener CD (1958 A statistical method for evaluating systematic relationships. Univ Kansas Sci Bull 38: Sokal RR, Sneath PHA (1963 Principles of numerical taxonomy. WH Freeman, San Francisco Sørensen T (1948 A Method of establishing groups of equal amplitude in plant sociology based on similarity of species content. Biologiske Skrifter 5:1 34 Southwood TS (1978 Ecological methods. Chapman and Hall, London Steinley D (004 Properties of the Hubert Arabie adjusted Rand index. Psychol Methods 9: Van Der Maarel E (1969 On the use of ordination models in phytosociology. Vegetatio 19:1 46 Wallace DL (1983 A method for comparing two hierarchical clusterings: comment. J Am Stat Assoc 78:
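To make the idea of correction for chance concrete, the following minimal sketch computes the pair-counting Jaccard index between two partitions and rescales it as (index - expectation) / (1 - expectation). It is an illustration only, not the approximation derived in this paper. The expectation is estimated by randomly permuting the labels of one partition, which keeps the marginal totals of the matching counts matrix fixed, rather than by the index-substitution or Taylor series approximations. The function names, the pair-count symbols a, b, c, d, and the parameters n_perm and seed are assumptions introduced for this example.

```python
import random
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Classify every pair of objects by co-membership in the two partitions.

    a: together in both, b: together only in the first,
    c: together only in the second, d: separated in both.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            a += 1
        elif same_a:
            b += 1
        elif same_b:
            c += 1
        else:
            d += 1
    return a, b, c, d


def jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index a / (a + b + c)."""
    a, b, c, _ = pair_counts(labels_a, labels_b)
    return a / (a + b + c) if (a + b + c) else 1.0


def corrected_jaccard(labels_a, labels_b, n_perm=2000, seed=0):
    """Chance-corrected Jaccard with a Monte Carlo estimate of the expectation.

    Shuffling one label vector preserves both sets of cluster sizes,
    i.e. the marginal totals of the matching counts matrix.
    """
    rng = random.Random(seed)
    observed = jaccard(labels_a, labels_b)
    shuffled = list(labels_b)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        total += jaccard(labels_a, shuffled)
    expected = total / n_perm
    if expected == 1.0:
        # Degenerate case (e.g. both partitions have a single cluster):
        # the correction is undefined, so report zero by convention here.
        return 0.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    truth = [0] * 5 + [1] * 5 + [2] * 5
    print(corrected_jaccard(truth, truth))  # identical partitions
    print(corrected_jaccard(truth, [i % 3 for i in range(15)]))
```

With identical partitions the observed index is 1, so the corrected value equals 1 regardless of the estimated expectation. For unrelated partitions the corrected value should fluctuate around zero, which is the behaviour the simulations above describe for homogeneous data.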