Correlation Analysis of Binary Similarity and Distance Measures on Different Binary Database Types

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert
Department of Computer Science, Pace University, New York, U.S.A.
{schoi, scha, ctappert}@pace.edu

Abstract

Binary similarity and dissimilarity measures are of great importance to pattern recognition and other fields. Here, correlations between pairs of 76 binary similarity and distance measures are studied. Some similarity measures are highly correlated while others are not, and the variability of the correlation can depend on the characteristics of the underlying binary data. To better understand the variation of the correlations, we define three basic types of binary databases. The variations of the correlations on these database types are statistically analyzed, and database variant and invariant correlations are identified. In addition to common linear correlation patterns between measures, numerous unusual and interesting correlation patterns are also presented.

Keywords: binary similarity measure, distance measure, correlation.

1. Introduction

The binary feature vector is one of the most common representations of patterns, and similarity and distance measures between such vectors play a critical role in many pattern recognition problems such as classification and clustering. Over the past hundred years, numerous binary similarity and distance measures have been proposed in various fields such as ecology [7, 8, 11], biology [10, 15], ethnology [5], taxonomy [19], geology [9], chemistry [21], computer vision [17], and biometrics [2, 22]. Finding an appropriate measure is an important issue in classification and clustering problems.

Numerous comparative studies to find the best measure can be found in the literature. Jackson et al. compared eight binary similarity measures on ecological data for 25 fish species [12]. Tubbs summarized seven conventional similarity measures for solving the template matching problem [20], and Zhang et al. compared those seven measures to show their recognition capability in handwriting identification [23]. Willett evaluated 13 similarity measures for binary fingerprint codes [22]. Cha et al. compared 11 measures and proposed a weighted binary measure to improve classification performance on handwritten character recognition and iris biometric authentication [2]. In our earlier survey work [3], we collected 76 binary similarity and distance measures.

Similarity among various binary similarity or distance measures has been studied through correlation analysis and clustering. Hubalek collected 43 similarity measures, and 20 of them were used for cluster analysis on fungi data to produce five clusters of related coefficients [10]. Hohn categorized binary measures into four types: similarity coefficients, association coefficients, matching coefficients, and distance coefficients. He demonstrated the cluster analysis of 9 binary similarity and distance measures with stratigraphic and taxa samples [9]. Batagelj et al. performed an equivalence study on 22 binary similarity measures using a cluster technique [1]. Murguia et al. compared the correlations of 9 binary similarity measures in biogeographic samples to show how the selection of measures affected the classification results [16]. Correlation and clustering results vary significantly depending on the characteristics of the data, and most comparative studies have been domain specific.

[Figure 1: Three basic types of binary data. Panels: Random Binary Database ($\mathrm{RBD}_{10}$), Equal Random Binary Database ($\mathrm{ERBD}_{10,4}$), and Flattened Binary Database ($\mathrm{FBD}_{10,4}$).]

In order to discover the correlations of various binary similarity or distance measures, we take a different approach. To capture the variety of characteristics of

binary data from different domains, we formally define the three different types of binary databases as shown in Figure 1. Correlations among the 76 measures are then computed for the three different types of binary databases. As a result, we identify those correlations that are database-type variant and those that are database-type invariant. We also observe interesting types of correlations.

This paper is organized as follows. Section 2 formally defines the three types of binary feature vector databases. Section 3 describes the correlations among the 76 binary similarity and distance measures on the three database types. In Section 4, we compute correlation matrices for each of five different types of binary feature databases in order to observe dramatic changes in their correlations. Various types of correlation patterns are also given in Section 4. Finally, Section 5 concludes this work.

2. Binary Database Types

A binary feature vector, $x = (x_1, \ldots, x_d)$, is a sequence of binary elements $x_i \in \{0, 1\}$ for $i = 1, \ldots, d$. Its length, $|x|$, is fixed to $d$. In other words, $x$ is a binary string of length $d$. There are $2^d$ possible binary feature vectors of length $d$. We call a database of $n$ arbitrary (random) binary feature vector instances a random binary database (RBD).

Definition 1. RBD (Random Binary Database)
$\mathrm{RBD}_d = \{x \mid |x| = d \wedge x_i \in \{0, 1\} \text{ for } i = 1, \ldots, d\}$

Let $|x|_1$ and $|x|_0$ be the number of $x_i$'s whose value is 1 and 0, respectively. Then any $x$ in an RBD has the following two properties.

Property 1. $0 \le |x|_1 \le d$.
Property 2. $|x|_0 = d - |x|_1$.

If every instance in a binary feature vector database has the same number of ones, i.e., $|x|_1 = p$, then we call the database an equal random binary database (ERBD).

Definition 2. ERBD (Equal Random Binary Database)
$\mathrm{ERBD}_{d,p} = \{x \mid x \in \mathrm{RBD}_d \wedge |x|_1 = p\}$

The number of possible binary feature vectors in an ERBD is $\binom{d}{p}$.

Consider a nominal or categorical feature vector where each feature can have a small number of possible disjoint values. A nominal feature vector with $p$ features, $z$, is represented by an ordered $p$-ary relation, $z = (z_1, z_2, \ldots, z_p)$, and each feature $z_i$ has its own finite number of possible values, $z_i \in \{v_1, \ldots, v_q\}$. Let $v(z_i)$ be an ordered list of possible values of the attribute $z_i$. $v(z_i)$ and $v(z_j)$ for $i \ne j$ are not necessarily the same. For example, a weather data schema might be (temp, humidity, windy), with the temp attribute having possible values {hot, mild, cool}, humidity {high, normal, low}, and windy {true, false}. Each categorical (nominal) attribute is converted into an asymmetric binary string of length $q$ where only one value in the string is 1 and all others are 0, as exemplified in Table 1. We denote this function as $f_b(z_i)$.

Table 1 Nominal attribute and flattened binary data

z_i        v(z_i)    f_b(z_i)
temp       hot       1 0 0
           mild      0 1 0
           cool      0 0 1
humidity   high      1 0 0
           normal    0 1 0
           low       0 0 1
windy      true      1 0
           false     0 1

An instance $z$ = (mild, low, false) is binarized to $f_b(z)$ = (010 001 01). We call a database of converted binary feature vectors a flattened binary database (FBD).

Definition 3. FBD (Flattened Binary Database)
$\mathrm{FBD}_{d,p} = \{x \mid x = (f_b(z_1), \ldots, f_b(z_p)) \wedge z \in \{(z_1, \ldots, z_p) \mid z_1 \in v(z_1) \wedge \cdots \wedge z_p \in v(z_p)\}\}$

For $x$ in an FBD, $|x|_1 = p$ must equal the number of nominal features in $z$. The number of possible binary feature vectors in an FBD is only $\prod_{i=1}^{p} |v(z_i)|$. The dimension $d$ of the binary vectors is $d = \sum_{i=1}^{p} |v(z_i)|$. Note that $\mathrm{FBD} \subset \mathrm{ERBD} \subset \mathrm{RBD}$, as shown in Figure 1. Figure 2 gives examples of each category.
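The three definitions can be made concrete with a minimal Python sketch. The function names (`rbd_vector`, `erbd_vector`, `flatten`) are hypothetical, and the RBD generator assumes independent fair coin flips per bit, which the definitions above do not require; the schema is the weather example of Table 1.

```python
import random
from math import comb, prod

# Schema of the weather example in Table 1; attribute order fixes the bit layout.
SCHEMA = {
    "temp": ["hot", "mild", "cool"],
    "humidity": ["high", "normal", "low"],
    "windy": ["true", "false"],
}


def rbd_vector(d, rng):
    """One RBD_d instance. ASSUMPTION: each bit is an independent fair coin flip."""
    return [rng.randint(0, 1) for _ in range(d)]


def erbd_vector(d, p, rng):
    """One ERBD_{d,p} instance: exactly p ones at uniformly chosen positions."""
    ones = set(rng.sample(range(d), p))
    return [1 if i in ones else 0 for i in range(d)]


def flatten(z, schema):
    """f_b(z): one-hot encode each nominal value and concatenate the strings."""
    bits = []
    for values, value in zip(schema.values(), z):
        if value not in values:
            raise ValueError(f"{value!r} is not a legal value")
        bits.extend(1 if v == value else 0 for v in values)
    return bits


rng = random.Random(0)
x = flatten(("mild", "low", "false"), SCHEMA)
d = sum(len(v) for v in SCHEMA.values())   # dimension of the flattened vector
p = len(SCHEMA)                            # number of nominal features

print(x)                                   # [0, 1, 0, 0, 0, 1, 0, 1]
print(sum(x) == p, len(x) == d)            # True True
print(sum(erbd_vector(10, 4, rng)))        # 4
# Sizes of FBD, ERBD and RBD for this schema (d = 8, p = 3).
print(prod(len(v) for v in SCHEMA.values()), comb(d, p), 2 ** d)  # 18 56 256
```

For this schema the counts 18 < 56 < 256 illustrate the inclusion of FBD in ERBD in RBD: every flattened vector has exactly p ones, but most vectors with p ones are not a valid flattening.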

[Figure 2: Three basic types of binary data, each shown as three sample vectors over 100 attributes. RBD, d = 100: row sums 30, 70, and 100. ERBD, d = 100, p = 10: every row sums to 10. FBD, d = 100, p = 10: every row sums to 10, with exactly one 1 in each of the ten feature groups f_1, ..., f_10.]

3. Correlations between Similarity Measures

A similarity measure, $s$, or distance measure, $d$, takes two binary feature vectors as input arguments and quantifies how similar or dissimilar they are. Table 2 enumerates 76 binary similarity and distance measures collected in our earlier survey study [3]. For simplicity, we denote the measures as $s_i$ where $i = 1, \ldots, 76$, even though the distance measures should perhaps be denoted as $d_i$ (e.g., $s_7$ is the Hamming distance measure).

Table 2 Binary similarity and distance measures

(1) Jaccard               (2) Dice & Sorenson       (3) Czekanowski               (4) Sokal & Sneath
(5) 3 weighted Jaccard    (6) Nei & Li              (7) Hamming                   (8) Bin Squared Euclid
(9) Canberra              (10) Manhattan            (11) City Block               (12) Minkowski
(13) Bin Euclid           (14) Size Difference      (16) Shape Difference         (16) Shape Difference
(17) Variance             (18) Mean Manhattan       (19) Lance Williams           (20) Bray & Curtis
(21) Sokal & Michener     (22) Sokal & Sneath 2     (23) Rogers & Tanimoto        (24) Faith
(25) Gower & Legendre     (26) Inner Product        (27) Intersection             (28) Russell & Rao
(29) Cosine               (30) Gilbert & Wells      (31) Ochiai 1                 (32) Forbes 1
(33) Fossum               (34) Sorgenfrei           (35) Mountford                (36) Otsuka
(37) Hellinger            (38) Chord                (39) McConnaughey             (40) Tarwid
(41) Kulczynski 2         (42) Driver & Kroeber     (43) Johnson                  (44) Dennis
(45) Simpson              (46) Braun-Blanquet       (47) Fager & McGowan          (48) Forbes 2
(49) Sokal & Sneath 4     (50) Gower                (51) Pearson & Heron 1        (52) Pearson 1
(53) Pearson 2            (54) Pearson 3            (55) Cole                     (56) Stiles
(57) Sokal & Sneath 5     (58) Ochiai 2             (59) Yule Q Distance          (60) Yule Q
(61) Yule w               (62) Pearson & Heron 2    (63) Kulczynski 1             (64) Sokal & Sneath 3
(65) Tanimoto             (66) Dispersion           (67) Hamann                   (68) Michael
(69) Goodman & Kruskal    (70) Anderberg            (71) Baroni-Urbani & Buser 1  (72) Baroni-Urbani & Buser 2
(73) Peirce               (74) Eyraud               (75) Tarantula                (76) AMPLE

A nearest neighbor classification algorithm is an instance based classifier which has been widely used due to its simplicity [6]. A query instance $q$ is classified to the class of the most similar instance in a reference database $R$. The classification accuracy depends highly on the choice of similarity or distance measure. Consider a random binary database $R = \{r_1, \ldots, r_n\}$ and a query instance $q$, where each $r_i$ and $q$ are binary feature vectors. If a certain measure $s_x$ is applied to all $n$ instances in the database $R$, then $n$ scalar similarity or distance values are computed by $s_x(R, q)$.
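The dependence of nearest neighbor classification on the chosen measure can be illustrated with a small sketch. It assumes the usual contingency-count forms of two measures from Table 2, Jaccard as a/(a+b+c) and Hamming as b+c, since the formulas themselves are given in the survey [3] rather than here; the helper names, the toy database, and the class labels are hypothetical.

```python
def counts(x, y):
    """Contingency counts: a = both 1, b = x only, c = y only, d = both 0."""
    a = sum(1 for s, t in zip(x, y) if s and t)
    b = sum(1 for s, t in zip(x, y) if s and not t)
    c = sum(1 for s, t in zip(x, y) if t and not s)
    return a, b, c, len(x) - a - b - c


def jaccard(x, y):
    """Jaccard similarity; 0.0 is an assumed convention when both are all-zero."""
    a, b, c, _ = counts(x, y)
    return a / (a + b + c) if a + b + c else 0.0


def hamming(x, y):
    """Hamming distance: number of positions where x and y disagree."""
    _, b, c, _ = counts(x, y)
    return b + c


def nearest_neighbor(R, labels, q, measure, is_similarity):
    """Label of the reference instance that is most similar (or least distant)."""
    scores = [measure(r, q) for r in R]
    pick = max if is_similarity else min
    best = pick(range(len(R)), key=lambda i: scores[i])
    return labels[best]


# Toy reference database: a dense vector and a sparse vector.
R = [[1, 1, 1, 1, 1, 1, 1, 1],
     [1, 0, 0, 0, 0, 0, 0, 0]]
labels = ["A", "B"]
q = [1, 1, 1, 1, 0, 0, 0, 0]

print(nearest_neighbor(R, labels, q, jaccard, is_similarity=True))   # A
print(nearest_neighbor(R, labels, q, hamming, is_similarity=False))  # B
```

Jaccard ignores the shared zeros and prefers the dense vector (0.5 against 0.25), while Hamming counts only mismatches and prefers the sparse one (3 against 4), so the two measures assign the same query to different classes.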
Figure 3 shows plots of $s_x(R, q)$ versus $s_y(R, q)$ for several pairs of measures on an RBD. The correlation coefficient in equation (1) below quantifies the relationship between a pair of similarity values computed from two similarity measures $s_x$ and $s_y$.

$$\mathrm{Corr}(s_x, s_y) = \frac{\sum_{i=1}^{n} \left(s_x(r_i, q) - \mu_x\right)\left(s_y(r_i, q) - \mu_y\right)}{\sqrt{\sum_{i=1}^{n} \left(s_x(r_i, q) - \mu_x\right)^2 \sum_{i=1}^{n} \left(s_y(r_i, q) - \mu_y\right)^2}} \qquad (1)$$

where $\mu_x = \frac{1}{n} \sum_{i=1}^{n} s_x(r_i, q)$ and $\mu_y$ is defined analogously for $s_y$.

It can have values between -1 and 1. If $\mathrm{Corr}(s_x, s_y) = 1$, $s_x$ and $s_y$ behave the same on the database $R$. If $\mathrm{Corr}(s_x, s_y) = -1$, one of the measures is a similarity measure and the other is a distance measure, but they also behave the same when applied to a nearest neighbor classifier.
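Equation (1) can be evaluated directly on a randomly generated RBD. The sketch below assumes the standard contingency-count forms of four measures from Table 2 (Jaccard, Dice, Hamming, Sokal & Michener) and draws bits as independent fair coin flips; the function names and the database size are illustrative choices, not taken from the experiments.

```python
import math
import random


def counts(x, y):
    """Contingency counts: a = both 1, b = x only, c = y only, d = both 0."""
    a = sum(1 for s, t in zip(x, y) if s and t)
    b = sum(1 for s, t in zip(x, y) if s and not t)
    c = sum(1 for s, t in zip(x, y) if t and not s)
    return a, b, c, len(x) - a - b - c


def jaccard(x, y):
    a, b, c, _ = counts(x, y)
    return a / (a + b + c) if a + b + c else 0.0


def dice(x, y):
    a, b, c, _ = counts(x, y)
    return 2 * a / (2 * a + b + c) if a + b + c else 0.0


def hamming(x, y):
    _, b, c, _ = counts(x, y)
    return b + c


def sokal_michener(x, y):
    a, b, c, d = counts(x, y)
    return (a + d) / (a + b + c + d)


def corr(u, v):
    """Pearson correlation of two equally long value lists, as in equation (1)."""
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    num = sum((s - mu_u) * (t - mu_v) for s, t in zip(u, v))
    den = math.sqrt(sum((s - mu_u) ** 2 for s in u) *
                    sum((t - mu_v) ** 2 for t in v))
    return num / den


rng = random.Random(1)
d, n = 100, 200                       # illustrative sizes
R = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n)]
q = [rng.randint(0, 1) for _ in range(d)]

j = [jaccard(r, q) for r in R]
dc = [dice(r, q) for r in R]
h = [hamming(r, q) for r in R]
sm = [sokal_michener(r, q) for r in R]

print(corr(j, dc))    # close to 1: Dice is a monotone transform of Jaccard
print(corr(h, sm))    # -1 up to rounding: Sokal & Michener = 1 - Hamming / d
print(corr(j, sm))
```

The Hamming and Sokal & Michener pair shows why a correlation of -1 still means identical behavior: since (a+d)/d = 1 - (b+c)/d, ranking by increasing distance and by decreasing similarity gives the same nearest neighbor.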

[Figure 3: Various correlations on an RBD. Panels: (a) $\mathrm{Corr}(s_1, s_2) = 0.9991$; (b) $\mathrm{Corr}(s_7, s_{21}) = -1$; (c) $\mathrm{Corr}(s_1, s_{21}) = 0.9085$; (d) $\mathrm{Corr}(s_{23}, s_{32}) = 0.2638$.]

As shown in Figure 3, the closer $\mathrm{Corr}(s_x, s_y)$ is to either 1 or -1, the more similar the two measures are. Hence, equation (2) is used to express the strength of the correlation between two binary measures as a distance.

$$d_{Corr}(s_x, s_y) = 1 - \left|\mathrm{Corr}(s_x, s_y)\right| \qquad (2)$$

When equation (2) is applied to all pairs of the 76 similarity or distance measures in Table 2, a 76 × 76 correlation distance matrix, $C$, is produced. It is visualized as a gray scale image in Figure 4, where the darker a cell is, the more similar the two measures are.

[Figure 4: Correlation matrix of 76 binary similarity and dissimilarity measures. Both axes run from measure (1) to measure (76); the gray scale runs from 0 (similar) to 1 (different).]

4. Statistical Experiments

As shown in Figure 3 (d), the Rogers & Tanimoto and Forbes 1 similarity measures have a weak correlation coefficient on an RBD. However, if the database is an FBD, they show a very strong correlation, $\mathrm{Corr}(s_{23}, s_{32}) = 0.99927$. Several pairs of similarity or distance measures are database-type variant (they show different correlations depending on the database type) while others are invariant. To assess the database-type variances, we performed the following statistical experiments.

First, we prepared five different types of binary databases: RBD, ERBD 10, ERBD 50, ERBD 90, and FBD, where the subscript numbers in ERBD represent the percentage of ones in the binary feature vector. The RBD and ERBDs are randomly generated, and the FBD is generated by flattening the nominal mushroom data set [13].

As shown in Figure 5, 30 correlation matrices are independently generated from each database. The $i$th correlation matrix from a certain database $D$ is denoted as $C_{D_i}$, and $C_D$ denotes the mean correlation matrix of all 30 $C_{D_i}$'s. In order to assess the degree of difference between two correlation matrices, the following equation (3) is used.

$$d(C_x, C_y) = \left|C_x - C_y\right| \qquad (3)$$

[Figure 5: The procedure of the mean test (a) and comparison of the mean test results between RBD and FBD (b). Panel (a): for each of RBD, ERBD 10%, ERBD 50%, ERBD 90%, and FBD, 30 trials of correlation matrices are generated, their mean is computed, and the mean equality tests T_1, ..., T_25 are applied. Panel (b): the matrices C_RBD1, ..., C_RBD30 with mean C_RBD and C_FBD1, ..., C_FBD30 with mean C_FBD; mean test result 344.84 / 81.46.]
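Equations (2) and (3) can be sketched on a small scale as follows. This is an illustration under stated assumptions, not the experimental code: five measures in their standard contingency-count forms stand in for the 76 of Table 2, the query is drawn from the same database type as the references, and equation (3) is read as the sum of absolute entrywise differences, which is one plausible reading of a form the text does not spell out.

```python
import math
import random


def counts(x, y):
    """Contingency counts: a = both 1, b = x only, c = y only, d = both 0."""
    a = sum(1 for s, t in zip(x, y) if s and t)
    b = sum(1 for s, t in zip(x, y) if s and not t)
    c = sum(1 for s, t in zip(x, y) if t and not s)
    return a, b, c, len(x) - a - b - c


# Hypothetical five-measure subset standing in for the 76 measures of Table 2.
MEASURES = {
    "Jaccard": lambda a, b, c, d: a / (a + b + c) if a + b + c else 0.0,
    "Dice": lambda a, b, c, d: 2 * a / (2 * a + b + c) if a + b + c else 0.0,
    "Hamming": lambda a, b, c, d: b + c,
    "Sokal & Michener": lambda a, b, c, d: (a + d) / (a + b + c + d),
    "Russell & Rao": lambda a, b, c, d: a / (a + b + c + d),
}


def corr(u, v):
    """Pearson correlation (equation (1)); 0.0 is assumed for constant input."""
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    num = sum((s - mu_u) * (t - mu_v) for s, t in zip(u, v))
    den = math.sqrt(sum((s - mu_u) ** 2 for s in u) *
                    sum((t - mu_v) ** 2 for t in v))
    return num / den if den else 0.0


def correlation_distance_matrix(R, q):
    """Matrix of 1 - |Corr| (equation (2)) over all pairs of measures."""
    cols = [[m(*counts(r, q)) for r in R] for m in MEASURES.values()]
    # max() clamps tiny negative values caused by floating-point rounding.
    return [[max(0.0, 1.0 - abs(corr(u, v))) for v in cols] for u in cols]


def matrix_distance(cx, cy):
    """ASSUMED reading of equation (3): sum of absolute entrywise differences."""
    return sum(abs(s - t) for rx, ry in zip(cx, cy) for s, t in zip(rx, ry))


def rbd(n, d, rng):
    return [[rng.randint(0, 1) for _ in range(d)] for _ in range(n)]


def erbd(n, d, p, rng):
    rows = []
    for _ in range(n):
        ones = set(rng.sample(range(d), p))
        rows.append([1 if i in ones else 0 for i in range(d)])
    return rows


rng = random.Random(2)
db_rbd = rbd(101, 50, rng)
db_erbd = erbd(101, 50, 10, rng)
# The first vector of each database serves as the query q.
C_rbd = correlation_distance_matrix(db_rbd[1:], db_rbd[0])
C_erbd = correlation_distance_matrix(db_erbd[1:], db_erbd[0])

names = list(MEASURES)
print(C_rbd[names.index("Hamming")][names.index("Sokal & Michener")])
print(matrix_distance(C_rbd, C_erbd))
```

Scaling this to the full experiment only requires extending `MEASURES` to all 76 entries; the matrix and distance functions do not depend on the number of measures.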

Table 3 is a symmetric matrix showing the degree of difference between all the pairs of the five mean correlation matrices. The diagonal entries are the means of the distributions as described below. The nondiagonal entries are computed by equation (3) for each pair of mean correlation matrices from different databases, e.g., $d(C_{RBD}, C_{FBD}) = 1220.8$. Higher numbers indicate greater differences, and as anticipated the greatest difference is between $C_{RBD}$ and $C_{FBD}$.

Table 3 Differences between the mean correlation matrices of the five database types

           C_RBD     C_ERBD10  C_ERBD50  C_ERBD90  C_FBD
RBD        344.84    851.15    703.59    877.57    1220.8
ERBD 10    851.15    164.71    312.36    113.23    725.97
ERBD 50    703.59    312.36    141.89    250.61    655.43
ERBD 90    877.57    113.23    250.61    140.89    651.8
FBD        1220.8    725.97    655.43    651.8     81.46

We now compute distribution curves. For each database, 30 distance values of the individual instances relative to the mean are computed by $d(C_D, C_{D_i})$ for $i = 1$ to 30. Figure 6 displays these distance distributions for each database, and the diagonal elements in Table 3 are the means of the 30 $d(C_D, C_{D_i})$ values. Higher means indicate greater differences (fewer similarities) among the 76 × 76 correlation distances, showing that the differences increase in going from FBD to RBD. The same trend can be seen in the increasing whiteness (the degree of whiteness indicates the degree of difference) of the mean correlation matrices of Figure 5 in going from FBD to RBD. The variation (spread) of the 30 instances relative to the mean also increases in going from FBD to RBD. As clearly shown in Table 3 and Figure 5, correlation matrices are significantly different depending on the binary database types.

[Figure 6: Distribution curves for five data sets: FBD (nominal mushroom data), ERBD 90%, ERBD 50%, ERBD 10%, and RBD.]

However, not all pairs of similarity measures differ significantly across database types. While some pairs of similarity measures are database-type variant (and some vary substantially from one database type to another), other pairs are invariant. Figure 7 shows some examples.
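The 30-trial procedure behind the diagonal of Table 3 can be sketched as follows. The code is a scaled-down illustration under several assumptions: three measures in their standard contingency-count forms replace the 76, only RBD and ERBD are generated, each trial draws a fresh database whose first vector serves as the query, and equation (3) is read as the sum of absolute entrywise differences. All names and sizes are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def counts(x, y):
    """Contingency counts: a = both 1, b = x only, c = y only, d = both 0."""
    a = sum(1 for s, t in zip(x, y) if s and t)
    b = sum(1 for s, t in zip(x, y) if s and not t)
    c = sum(1 for s, t in zip(x, y) if t and not s)
    return a, b, c, len(x) - a - b - c


MEASURES = [
    lambda a, b, c, d: a / (a + b + c) if a + b + c else 0.0,  # Jaccard
    lambda a, b, c, d: (a + d) / (a + b + c + d),               # Sokal & Michener
    lambda a, b, c, d: a / (a + b + c + d),                     # Russell & Rao
]


def corr(u, v):
    """Pearson correlation (equation (1)); 0.0 is assumed for constant input."""
    mu_u, mu_v = mean(u), mean(v)
    num = sum((s - mu_u) * (t - mu_v) for s, t in zip(u, v))
    den = math.sqrt(sum((s - mu_u) ** 2 for s in u) *
                    sum((t - mu_v) ** 2 for t in v))
    return num / den if den else 0.0


def corr_distance_matrix(R, q):
    """Equation (2) for all measure pairs, clamped against rounding error."""
    cols = [[m(*counts(r, q)) for r in R] for m in MEASURES]
    return [[max(0.0, 1.0 - abs(corr(u, v))) for v in cols] for u in cols]


def matrix_distance(cx, cy):
    """ASSUMED reading of equation (3): sum of absolute entrywise differences."""
    return sum(abs(s - t) for rx, ry in zip(cx, cy) for s, t in zip(rx, ry))


def mean_matrix(mats):
    k = len(mats[0])
    return [[mean(m[i][j] for m in mats) for j in range(k)] for i in range(k)]


def rbd(n, d, rng):
    return [[rng.randint(0, 1) for _ in range(d)] for _ in range(n)]


def erbd(n, d, p, rng):
    rows = []
    for _ in range(n):
        ones = set(rng.sample(range(d), p))
        rows.append([1 if i in ones else 0 for i in range(d)])
    return rows


def trial_spread(make_db, trials=30):
    """Mean matrix C_D and the distances d(C_D, C_Di) of each trial to it."""
    mats = []
    for _ in range(trials):
        db = make_db()
        mats.append(corr_distance_matrix(db[1:], db[0]))  # db[0] is the query
    c_mean = mean_matrix(mats)
    return c_mean, [matrix_distance(c_mean, c) for c in mats]


rng = random.Random(3)
for name, make_db in [("RBD", lambda: rbd(101, 50, rng)),
                      ("ERBD 20", lambda: erbd(101, 50, 10, rng))]:
    _, dists = trial_spread(make_db)
    print(name, "mean:", mean(dists), "spread:", stdev(dists))
```

One reason the database type matters can be read off the counts: when every vector has exactly p ones, b = c = p - a and d is fixed by a as well, so each of these three measures reduces to a function of the single count a, which constrains how their correlations can vary from trial to trial.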

[Figure 7: Data set dependent correlations (a)-(d) and data set invariant correlations (e)-(g), each plotted on RBD, ERBD 10, ERBD 50, ERBD 90, and FBD. Panels: (a) Jaccard / AMPLE; (b) Yule Q / Mountford; (c) Ochiai I / Stiles; (d) Dice & Sorenson / Pearson I; (e) Jaccard / Dice & Sorenson; (f) Ochiai I / Kulczynski II; (g) Pearson III / Sokal & Sneath IV.]

5. Conclusion

In this paper, three types of binary feature vector representations are formally defined: RBD, ERBD, and FBD. The choice of similarity or distance measure to use in a particular application must be made carefully depending on the characteristics of the data, such as the type of binary database. The impact of the data types is demonstrated statistically by analyzing correlations between similarity measures. Correlations of the 2,850 (76*75/2) possible pairs of the 76 binary similarity and distance measures are analyzed. The higher the correlation, the more similarly the two measures behave. Various shapes of correlation curves are found. Analyzing these patterns is ongoing work. Defining other types of binary databases such as uniform, normal, and hybrid binary databases is also future work.

6. References

[1] Batagelj, V. and Bren, M., Comparing Resemblance Measures, DISTANCIA 92, 1992.
[2] Cha, S.-H., Yoon, S., and Tappert, C.C., Enhancing Binary Feature Vector Similarity Measures, Journal of Pattern Recognition Research 1, 2006.
[3] Choi, S.-S., Cha, S.-H., and Tappert, C.C., A Survey of Binary Similarity and Distance Measures, WMSCI, 2009.
[4] Cormack, R.M., A review of classification, Journal of the Royal Statistical Society, Series A, 134, 321-353, 1971.
[5] Driver, H.E., Kroeber, A.L., Quantitative Expression of Cultural Relationships, University of California Press, 1932.
[6] Duda, R.O., Hart, P.E., Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[7] Forbes, S.A., On the local distribution of certain Illinois fishes. An essay in statistical ecology, Bulletin of the Illinois State Laboratory of Natural History, 1907.
[8] Forbes, S.A., Method of determining and measuring the associative relations of species, Science 61, 524, 1925.
[9] Hohn, M., Binary coefficients: A theoretical and empirical study, Mathematical Geology, Volume 8, Number 2, April, 1976.
[10] Hubalek, Z., Coefficients of Association and Similarity, Based on Binary (Presence-Absence) Data: An Evaluation, Biological Reviews, Vol. 57-4, 669-689, 1982.
[11] Jaccard, P., Étude comparative de la distribution florale dans une portion des Alpes et des Jura, Bull. Soc. Vaudoise Sci. Nat. 37: 547-579, 1901.
[12] Jackson, D.A., Somers, K.M., Harvey, H.H., Similarity Coefficients: Measures of Co-Occurrence and Association or Simply Measures of Occurrence?, The American Naturalist, Vol. 133, No. 3, pp. 436-453, 1989.
[13] Knopf, A.A., The Audubon Society Field Guide to North American Mushrooms. G. H. Lincoff (Pres.), New York, 1981.
[14] Kuhns, J.L., The continuum of coefficients of association, Statistical Association Methods for Mechanized Documentation (edited by Stevens et al.), National Bureau of Standards, Washington, 33-39, 1965.
[15] Michael, E.L., Marine ecology and the coefficient of association: a plea in behalf of quantitative biology, Ecology 8, 54-59, 1920.
[16] Murguia, M. and Villasenor, J.L., Estimating the effect of the similarity coefficient and the cluster algorithm on biogeographic classifications, Ann. Bot. Fennici 40: 415-421, 2003.
[17] Smith, J.R., Chang, S.-F., Automated binary texture feature sets for image retrieval, International Conf. Acoust., Speech, Signal Processing, Atlanta, GA, 1996.
[18] Sneath, P.H.A., Sokal, R.R., Numerical Taxonomy: The Principles and Practice of Numerical Classification, W.H. Freeman and Company, San Francisco, 1973.
[19] Sokal, R.R., Sneath, P.H., Principles of numeric taxonomy, San Francisco, W.H. Freeman, 1963.
[20] Tubbs, J.D., A note on binary template matching, Pattern Recognition, 22(4): 359-365, 1989.
[21] Willett, P., Barnard, J.M., Downs, G.M., Chemical similarity searching, J. Chem. Inf. Comput. Sci. 38: 983-996, 1998.
[22] Willett, P., Similarity-based approaches to virtual screening, Biochemical Society Transactions 31, 603-606, 2003.
[23] Zhang, B., Srihari, S.N., Binary vector dissimilarities for handwriting identification, Proceedings of SPIE, Document Recognition and Retrieval X, p 15-166, 2003.