Correlation Analysis of Binary Similarity and Distance Measures on Different Binary Database Types


Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert
Department of Computer Science, Pace University, New York, U.S.A.
{schoi, scha, ctappert}@pace.edu

Abstract

Binary similarity and dissimilarity measures are of great importance to pattern recognition and other fields. Here, correlations between pairs of 76 binary similarity and distance measures are studied. Some similarity measures are highly correlated while others are not, and the variability of the correlation can depend on the characteristics of the underlying binary data. To better understand the variation of the correlations, we define three basic types of binary databases. The variations of the correlations on these database types are statistically analyzed, and database-type variant and invariant correlations are identified. In addition to common linear correlation patterns between measures, numerous unusual and interesting correlation patterns are also presented.

Keywords: binary similarity measure, distance measure, correlation.

1. Introduction

The binary feature vector is one of the most common representations of patterns, and similarity and distance measures between such vectors play a critical role in many pattern recognition problems such as classification and clustering. For over a hundred years, numerous binary similarity and distance measures have been proposed in various fields such as ecology [7, 8, 11], biology [10, 15], ethnology [5], taxonomy [19], geology [9], chemistry [21], computer vision [17], and biometrics [2, 22].

Finding an appropriate measure is an important issue in classification and clustering problems, and numerous comparative studies that look for the best measure can be found in the literature. Jackson et al. compared eight binary similarity measures on ecological data for 25 fish species [12]. Tubbs summarized seven conventional similarity measures to solve the template matching problem [20], and Zhang et al. compared those seven measures to show their recognition capability in handwriting identification [23]. Willett evaluated 13 similarity measures for binary fingerprint codes [22]. Cha et al. compared 11 measures and proposed a weighted binary measure to improve classification performance on handwritten character recognition and iris biometric authentication [2]. In our earlier survey work [3], we collected 76 binary similarity and distance measures.

Similarity among various binary similarity or distance measures has been studied through correlation analysis and clustering. Hubalek collected 43 similarity measures, and 20 of them were used for cluster analysis on fungi data to produce five clusters of related coefficients [10]. Hohn categorized binary measures into four types: similarity coefficients, association coefficients, matching coefficients, and distance coefficients. He demonstrated the cluster analysis of 9 binary similarity and distance measures with stratigraphic and taxa samples [9]. Batagelj et al. performed an equivalence study on 22 binary similarity measures using a cluster technique [1]. Murguia et al. compared the correlations of 9 binary similarity measures in biogeographic samples to show how the selection of measures affected the classification results [16]. Correlation and clustering results vary significantly depending on the characteristics of the data, and most comparative studies have been domain specific.

Figure 1 Three basic types of binary data: Random Binary Database (RBD), Equal Random Binary Database (ERBD), and Flattened Binary Database (FBD).

In order to discover the correlations of various binary similarity or distance measures, we take a different approach. To capture the variety of characteristics of binary data from different domains, we formally define three different types of binary databases, as shown in Figure 1. Correlations among the 76 measures are then computed for the three different types of binary databases. As a result, we identify those correlations that are database-type variant and those that are database-type invariant. We also observe interesting types of correlations.

This paper is organized as follows. Section 2 formally defines the three types of binary feature vector databases. Section 3 describes the correlations among the 76 binary similarity and distance measures on the three database types. In Section 4, we compute correlation matrices for each of five different types of binary feature databases in order to observe dramatic changes in their correlations. Various types of correlation patterns are also given in Section 4. Finally, Section 5 concludes this work.

2. Binary Database Types

A binary feature vector, $x = (x_1, \ldots, x_d)$, is a sequence of binary elements $x_i \in \{0, 1\}$ for $i = 1, \ldots, d$. Its length, $|x|$, is fixed to $d$. In other words, $x$ is a binary string of length $d$. There are $2^d$ possible binary feature vectors of length $d$. We call a database of $n$ arbitrary (random) binary feature vector instances a random binary database (RBD).

Definition 1. RBD (Random Binary Database)

$$\mathrm{RBD}_d = \{\, x \mid |x| = d \,\wedge\, x_i \in \{0, 1\} \text{ for } i = 1, \ldots, d \,\}$$

Let $|x|_1$ and $|x|_0$ be the number of $x_i$'s whose value is 1 and 0, respectively. Then any $x$ in RBD has the following two properties.

Property 1. $0 \le |x|_1 \le d$.
Property 2. $|x|_0 = d - |x|_1$.

If every instance in a binary feature vector database has the same number of ones, i.e., $|x|_1 = p$, then we call the database an equal random binary database (ERBD).

Definition 2. ERBD (Equal Random Binary Database)

$$\mathrm{ERBD}_d^p = \{\, x \mid x \in \mathrm{RBD}_d \,\wedge\, |x|_1 = p \,\}$$

The number of possible binary feature vectors in ERBD is $\binom{d}{p}$.

Consider a nominal or categorical feature vector where each feature can have a small number of possible disjoint values. A nominal $p$-feature vector, $z$, is represented by an ordered $p$-ary relation, $z = (z_1, z_2, \ldots, z_p)$, and each feature $z_i$ has its own finite number of possible values, $z_i \in \{v_1, \ldots, v_q\}$. Let $v(z_i)$ be the ordered list of possible values of the attribute $z_i$; $v(z_i)$ and $v(z_j)$ for $i \ne j$ are not necessarily the same. For example, a weather data schema might be (temp, humidity, windy), with the temp attribute having possible values {hot, mild, cool}, humidity {high, normal, low}, and windy {true, false}. Each categorical (nominal) attribute is converted into an asymmetric binary string of length $q$ in which exactly one value is 1 and all others are 0, as exemplified in Table 1. We denote this function as $f_b(z_i)$.

Table 1 Nominal attribute and flattened binary data

z_i        v(z_i)    f_b(z_i)
temp       hot       1 0 0
           mild      0 1 0
           cool      0 0 1
humidity   high      1 0 0
           normal    0 1 0
           low       0 0 1
windy      true      1 0
           false     0 1

An instance $z$ = (mild, low, false) is binarized to $f_b(z)$ = (010 001 01). We call a database of such converted binary feature vectors a flattened binary database (FBD).

Definition 3. FBD (Flattened Binary Database)

$$\mathrm{FBD}_d^p = \{\, x \mid x = (f_b(z_1), \ldots, f_b(z_p)) \,\wedge\, z \in \{ (z_1, \ldots, z_p) \mid z_1 \in v(z_1) \wedge \cdots \wedge z_p \in v(z_p) \} \,\}$$

If $x \in \mathrm{FBD}$, then $|x|_1 = p$ must equal the number of nominal features in $z$. The number of possible binary feature vectors in FBD is only $\prod_{i=1}^{p} |v(z_i)|$, and the dimension of the binary vectors is $d = \sum_{i=1}^{p} |v(z_i)|$. Note that $\mathrm{FBD} \subset \mathrm{ERBD} \subset \mathrm{RBD}$, as shown in Figure 1. Figure 2 gives examples of each category.
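The three database types of Section 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' generator: the function names, the `WEATHER` schema encoding, and the choice of independent fair bits for RBD instances are assumptions made for the example.

```python
import random

def random_rbd_vector(d, rng):
    """One RBD_d instance (assumption: d independent fair bits)."""
    return [rng.randint(0, 1) for _ in range(d)]

def random_erbd_vector(d, p, rng):
    """One ERBD_d^p instance: exactly p ones among d positions."""
    ones = set(rng.sample(range(d), p))
    return [1 if i in ones else 0 for i in range(d)]

def flatten(z, schema):
    """Apply f_b to every nominal feature and concatenate the one-hot strings."""
    bits = []
    for value, (_, values) in zip(z, schema):
        bits.extend(1 if value == v else 0 for v in values)
    return bits

# Weather schema of Table 1: (attribute name, ordered value list v(z_i)).
WEATHER = [
    ("temp", ["hot", "mild", "cool"]),
    ("humidity", ["high", "normal", "low"]),
    ("windy", ["true", "false"]),
]

rng = random.Random(0)
x_rbd = random_rbd_vector(8, rng)
x_erbd = random_erbd_vector(8, 3, rng)
x_fbd = flatten(("mild", "low", "false"), WEATHER)
print(x_fbd)  # [0, 1, 0, 0, 0, 1, 0, 1], i.e. (010 001 01)
```

For this schema the flattened dimension is 3 + 3 + 2 = 8 and only 3 × 3 × 2 = 18 distinct vectors are possible, each with exactly p = 3 ones, which is why an FBD is a small subset of the corresponding ERBD.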

Figure 2 Three basic types of binary data, each with d = 100 attributes: RBD (rows with varying numbers of ones, e.g., Σ = 30, 70, 100), ERBD with p = 10 (every row has Σ = 10), and FBD with p = 10 (every row has Σ = 10, spread over the nominal features f_1, ..., f_10).

3. Correlations between Similarity Measures

A similarity measure, $s$, or distance measure, $d$, takes two binary feature vectors as input arguments and quantifies how similar or dissimilar they are. Table 2 enumerates the 76 binary similarity and distance measures collected in our earlier survey study [3]. For simplicity, we denote the measures as $s_i$, where $i = 1, \ldots, 76$, even though the distance measures should perhaps be denoted as $d_i$ (e.g., $s_7$ is the Hamming distance measure).
Table 2 Binary similarity and distance measures

(1) Jaccard  (2) Dice & Sorenson  (3) Czekanowski  (4) Sokal & Sneath
(5) 3 weighted Jaccard  (6) Nei & Li  (7) Hamming  (8) Bin Squared Euclid
(9) Canberra  (10) Manhattan  (11) City Block  (12) Minkowski
(13) Bin Euclid  (14) Size Difference  (16) Shape Difference  (16) Shape Difference
(17) Variance  (18) Mean Manhattan  (19) Lance Williams  (20) Bray & Curtis
(21) Sokal & Michener  (22) Sokal & Sneath 2  (23) Rogers & Tanimoto  (24) Faith
(25) Gower & Legendre  (26) Inner Product  (27) Intersection  (28) Russell & Rao
(29) Cosine  (30) Gilbert & Wells  (31) Ochiai 1  (32) Forbes 1
(33) Fossum  (34) Sorgenfrei  (35) Mountford  (36) Otsuka
(37) Hellinger  (38) Chord  (39) McConnaughey  (40) Tarwid
(41) Kulczynski 2  (42) Driver & Kroeber  (43) Johnson  (44) Dennis
(45) Simpson  (46) Braun-Blanquet  (47) Fager & McGowan  (48) Forbes 2
(49) Sokal & Sneath 4  (50) Gower  (51) Pearson & Heron 1  (52) Pearson 1
(53) Pearson 2  (54) Pearson 3  (55) Cole  (56) Stiles
(57) Sokal & Sneath 5  (58) Ochiai 2  (59) Yule Q Distance  (60) Yule Q
(61) Yule w  (62) Pearson & Heron 2  (63) Kulczynski 1  (64) Sokal & Sneath 3
(65) Tanimoto  (66) Dispersion  (67) Hamann  (68) Michael
(69) Goodman & Kruskal  (70) Anderberg  (71) Baroni-Urbani & Buser 1  (72) Baroni-Urbani & Buser 2
(73) Peirce  (74) Eyraud  (75) Tarantula  (76) AMPLE

A nearest neighbor classification algorithm is an instance-based classifier that has been widely used due to its simplicity [6]. A query instance $q$ is assigned the class of the most similar instance in a reference database $R$. The classification accuracy depends highly on the choice of similarity or distance measure. Consider a random binary database $R = \{r_1, \ldots, r_n\}$ and a query instance $q$, where each $r_i$ and $q$ are binary feature vectors. If a certain measure $s_x$ is applied to all $n$ instances in the database $R$, then $n$ scalar similarity or distance values are computed; we denote them by $s_x(R, q)$.
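As a rough illustration of how $s_x(R, q)$ feeds a nearest neighbor classifier, the sketch below implements two of the measures in Table 2, Jaccard (1) and Hamming (7), from the usual 2 × 2 match counts. The formulas are the standard textbook definitions rather than quotations from this paper, and the helper names, the toy data, and the zero-denominator convention are assumptions.

```python
def counts(x, y):
    """Match counts: a = both 1, b = x only, c = y only, d = both 0."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

def jaccard(x, y):
    """Jaccard similarity a / (a + b + c); 0.0 if both vectors are all zero (assumption)."""
    a, b, c, _ = counts(x, y)
    return a / (a + b + c) if a + b + c else 0.0

def hamming(x, y):
    """Hamming distance b + c."""
    _, b, c, _ = counts(x, y)
    return b + c

def scores(measure, R, q):
    """s_x(R, q): the n values obtained by applying one measure to every r_i."""
    return [measure(r, q) for r in R]

def nearest_neighbor(R, labels, q, measure, is_distance=False):
    """Label of the most similar (or least distant) reference instance."""
    vals = scores(measure, R, q)
    pick = min if is_distance else max
    best = pick(range(len(R)), key=lambda i: vals[i])
    return labels[best]

# Hypothetical toy reference database and query.
R = [[1, 1, 0, 0, 1, 0], [0, 0, 1, 1, 0, 1], [1, 0, 0, 0, 1, 0]]
labels = ["A", "B", "A"]
q = [1, 1, 0, 0, 0, 0]

print(scores(hamming, R, q))                                   # [1, 5, 2]
print(nearest_neighbor(R, labels, q, jaccard))                 # A
print(nearest_neighbor(R, labels, q, hamming, is_distance=True))  # A
```

Both measures pick the same neighbor here; whether two measures always rank the reference instances the same way is what the correlation analysis examines.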
Figure 3 shows plots of $s_x(R, q)$ versus $s_y(R, q)$ for several pairs of measures on an RBD. The correlation coefficient in equation (1) quantifies the relationship between the similarity values computed by two measures $s_x$ and $s_y$:

$$\mathrm{Corr}(s_x, s_y) = \frac{\sum_{i=1}^{n} \left(s_x(r_i, q) - \mu_x\right)\left(s_y(r_i, q) - \mu_y\right)}{\sqrt{\sum_{i=1}^{n} \left(s_x(r_i, q) - \mu_x\right)^2 \, \sum_{i=1}^{n} \left(s_y(r_i, q) - \mu_y\right)^2}} \qquad (1)$$

where $\mu_x = \frac{1}{n} \sum_{i=1}^{n} s_x(r_i, q)$ and $\mu_y$ is defined analogously.

The coefficient takes values between $-1$ and $1$. If $\mathrm{Corr}(s_x, s_y) = 1$, then $s_x$ and $s_y$ behave the same on the database $R$. If $\mathrm{Corr}(s_x, s_y) = -1$, one of the measures is a similarity measure and the other is a distance measure, but they also behave the same when applied to a nearest neighbor classifier.
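Equation (1) is the ordinary Pearson correlation applied to the two score lists. The following sketch computes it for Hamming (7) and Sokal & Michener (21) on a small random database; the database size, the seed, and the function names are illustrative assumptions, and the measure formulas are the standard definitions.

```python
import math
import random

def pearson(u, v):
    """Pearson correlation of two equally long score lists (equation (1))."""
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    num = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu_u) ** 2 for a in u) * sum((b - mu_v) ** 2 for b in v))
    return num / den  # undefined (division by zero) if either list is constant

def hamming(x, y):
    """Number of mismatching positions."""
    return sum(xi != yi for xi, yi in zip(x, y))

def sokal_michener(x, y):
    """Simple matching coefficient: fraction of matching positions."""
    return sum(xi == yi for xi, yi in zip(x, y)) / len(x)

rng = random.Random(1)
d, n = 20, 200
R = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n)]
q = [rng.randint(0, 1) for _ in range(d)]

corr = pearson([hamming(r, q) for r in R], [sokal_michener(r, q) for r in R])
print(round(corr, 6))  # -1.0
```

Because the simple matching coefficient equals (d − Hamming) / d, the two score lists are exact linear functions of each other with negative slope, so the coefficient is −1 on any database on which the scores are not constant.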

Figure 3 Various correlations on an RBD: (a) $\mathrm{Corr}(s_1, s_2) = 0.9991$; (b) $\mathrm{Corr}(s_7, s_{21}) = -1$; (c) $\mathrm{Corr}(s_1, s_{21}) = 0.9085$; (d) $\mathrm{Corr}(s_{23}, s_{32}) = 0.2638$.

As shown in Figure 3, the closer $\mathrm{Corr}(s_x, s_y)$ is to either 1 or $-1$, the more similar the two measures are. Hence, equation (2) is used to express the strength of the correlation between two binary measures as a distance:

$$d_{\mathrm{Corr}}(s_x, s_y) = 1 - \left|\mathrm{Corr}(s_x, s_y)\right| \qquad (2)$$

When equation (2) is applied to all pairs of the 76 similarity or distance measures in Table 2, a $76 \times 76$ correlation distance matrix, $C$, is produced. It is visualized as a gray scale image in Figure 4, where the darker a cell is, the more similar the two measures are.

Figure 4 Correlation matrix of 76 binary similarity and dissimilarity measures (both axes run over measures (1) to (76); the gray scale runs from 0 = similar to 1 = different).

4. Statistical Experiments

As shown in Figure 3 (d), the Rogers & Tanimoto and Forbes 1 similarity measures have a weak correlation coefficient on an RBD. However, if the database is an FBD, they show a very strong correlation, $\mathrm{Corr}(s_{23}, s_{32}) = 0.99927$. Several pairs of similarity/distance measures are database-type variant (they show different correlations depending on the database type) while others are invariant. To assess the database-type variance, we performed the following statistical experiments.

First, we prepared five different types of binary databases: RBD, $\mathrm{ERBD}_{10}$, $\mathrm{ERBD}_{50}$, $\mathrm{ERBD}_{90}$, and FBD, where the subscript numbers in ERBD represent the percentage of ones in the binary feature vector. The RBD and ERBDs are randomly generated, and the FBD is generated by flattening the nominal attributes of a mushroom data set [13].

As shown in Figure 5, 30 correlation matrices are independently generated from each database. The $i$th correlation matrix from a certain database $D$ is denoted as $C_{D_i}$, and $C_D$ denotes the mean correlation matrix of all 30 $C_{D_i}$'s. In order to assess the degree of difference between two correlation matrices, equation (3) is used:

$$d(C_x, C_y) = \left| C_x - C_y \right| \qquad (3)$$

Figure 5 The procedure of the mean test (a) and comparison of the mean test results between RBD and FBD (b). Panel (a): for each of RBD, ERBD 10%, ERBD 50%, ERBD 90%, and FBD, 30 trials of correlation matrices are generated, their mean is computed, and mean equality tests $T_1, \ldots, T_{25}$ are performed. Panel (b): the matrices $C_{RBD_1}, \ldots, C_{RBD_{30}}$ and $C_{FBD_1}, \ldots, C_{FBD_{30}}$ with their means $C_{RBD}$ and $C_{FBD}$; mean test result 344.84 / 81.46.
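Equations (2) and (3) can be sketched as follows for a handful of measures instead of all 76. Two points are assumptions rather than statements from the paper: the absolute value in equation (2), and reading equation (3) as the sum of absolute entrywise differences. The second reading is suggested by the magnitudes reported in Table 3, which exceed 76, the largest Frobenius norm possible for a 76 × 76 matrix with entries in [−1, 1]. Function names and data sizes are illustrative.

```python
import math
import random

def pearson(u, v):
    """Equation (1); returns 0.0 when a score list is constant (assumption)."""
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    num = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu_u) ** 2 for a in u) * sum((b - mu_v) ** 2 for b in v))
    return num / den if den else 0.0

def jaccard(x, y):
    a = sum(xi & yi for xi, yi in zip(x, y))
    union = sum(xi | yi for xi, yi in zip(x, y))
    return a / union if union else 0.0

def dice(x, y):
    a = sum(xi & yi for xi, yi in zip(x, y))
    total = sum(x) + sum(y)  # equals 2a + b + c
    return 2 * a / total if total else 0.0

def hamming(x, y):
    return sum(xi != yi for xi, yi in zip(x, y))

def sokal_michener(x, y):
    return sum(xi == yi for xi, yi in zip(x, y)) / len(x)

MEASURES = [jaccard, dice, hamming, sokal_michener]

def corr_distance_matrix(measures, R, q):
    """Matrix of d_Corr = 1 - |Corr| over all pairs of measures (equation (2))."""
    cols = [[m(r, q) for r in R] for m in measures]
    return [[1.0 - abs(pearson(ci, cj)) for cj in cols] for ci in cols]

def matrix_difference(Cx, Cy):
    """Equation (3), read here as the sum of absolute entry differences."""
    return sum(abs(a - b) for rx, ry in zip(Cx, Cy) for a, b in zip(rx, ry))

rng = random.Random(7)
d, n = 20, 200
R1 = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n)]
R2 = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n)]
q = [rng.randint(0, 1) for _ in range(d)]

C1 = corr_distance_matrix(MEASURES, R1, q)
C2 = corr_distance_matrix(MEASURES, R2, q)
for row in C1:
    print(["%.4f" % v for v in row])
print("d(C1, C2) =", matrix_difference(C1, C2))
```

In `C1` the Hamming and Sokal & Michener entry is 0 because their correlation is −1, which is the situation equation (2) is designed to treat as "same behavior".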

Table 3 is a symmetric matrix showing the degree of difference between all pairs of the five mean correlation matrices. The diagonal entries are the means of the distributions described below. The off-diagonal entries are computed by equation (3) for each pair of mean correlation matrices from different databases, e.g., $d(C_{RBD}, C_{FBD}) = 1220.8$. Higher numbers indicate greater differences, and, as anticipated, the greatest difference is between $C_{RBD}$ and $C_{FBD}$.

Table 3 Binary similarity and distance measures

            C_RBD     C_ERBD10  C_ERBD50  C_ERBD90  C_FBD
RBD         344.84    851.15    703.59    877.57    1220.8
ERBD 10     851.15    164.71    312.36    113.23    725.97
ERBD 50     703.59    312.36    141.89    250.61    655.43
ERBD 90     877.57    113.23    250.61    140.89    651.8
FBD         1220.8    725.97    655.43    651.8     81.46

We now compute distribution curves. For each database, 30 distance values of the individual instances relative to the mean are computed by $d(C_D, C_{D_i})$ for $i = 1$ to 30. Figure 6 displays these distance distributions for each database, and the diagonal elements in Table 3 are the means of the 30 $d(C_D, C_{D_i})$'s. Higher means indicate greater differences (fewer similarities) among the $76 \times 76$ correlation distances, showing that the differences increase in going from FBD to RBD. This increase can also be seen in the increasing whiteness (the degree of whiteness indicates the degree of difference) of the mean correlation matrices of Figure 5 in going from FBD to RBD. The variation (spread) of the 30 instances relative to the mean also increases in going from FBD to RBD. As clearly shown in Table 3 and Figure 5, correlation matrices differ significantly depending on the binary database type.

Figure 6 Distribution curves for five data sets: FBD (nominal mushroom data), ERBD 90%, ERBD 50%, ERBD 10%, and RBD.

However, not all pairs of similarity measures differ significantly. While some pairs are database-type variant (and some vary substantially from one database type to another), other pairs are invariant. Figure 7 shows some examples.
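The experimental procedure (30 trial matrices per database type, their mean, and the distances of the trials to that mean) can be sketched as below. It is a scaled-down illustration under stated assumptions: three measures instead of 76, synthetic RBD and ERBD 50% data only, a fresh query per trial, and the same absolute-value and entrywise-sum readings of equations (2) and (3) as assumed above. It does not reproduce the numbers in Table 3.

```python
import math
import random

def pearson(u, v):
    """Equation (1); returns 0.0 for a constant score list (assumption)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den if den else 0.0

def jaccard(x, y):
    a = sum(xi & yi for xi, yi in zip(x, y))
    union = sum(xi | yi for xi, yi in zip(x, y))
    return a / union if union else 0.0

def hamming(x, y):
    return sum(xi != yi for xi, yi in zip(x, y))

def russell_rao(x, y):
    return sum(xi & yi for xi, yi in zip(x, y)) / len(x)

MEASURES = [jaccard, hamming, russell_rao]

def rbd(d, rng):
    return [rng.randint(0, 1) for _ in range(d)]

def erbd(d, p, rng):
    ones = set(rng.sample(range(d), p))
    return [1 if i in ones else 0 for i in range(d)]

def corr_distance_matrix(R, q):
    cols = [[m(r, q) for r in R] for m in MEASURES]
    return [[1.0 - abs(pearson(ci, cj)) for cj in cols] for ci in cols]

def l1(Cx, Cy):
    return sum(abs(a - b) for rx, ry in zip(Cx, Cy) for a, b in zip(rx, ry))

def trial_matrices(make_vector, trials, n, rng):
    """One correlation distance matrix per independently generated database."""
    mats = []
    for _ in range(trials):
        R = [make_vector(rng) for _ in range(n)]
        q = make_vector(rng)
        mats.append(corr_distance_matrix(R, q))
    return mats

def mean_matrix(mats):
    k = len(mats[0])
    return [[sum(m[i][j] for m in mats) / len(mats) for j in range(k)] for i in range(k)]

rng = random.Random(42)
d, n, trials = 30, 50, 30
databases = {
    "RBD": lambda r: rbd(d, r),
    "ERBD50": lambda r: erbd(d, d // 2, r),
}
summary = {}
for name, make in databases.items():
    mats = trial_matrices(make, trials, n, rng)
    C_mean = mean_matrix(mats)
    dists = [l1(C_mean, C) for C in mats]  # d(C_D, C_Di), i = 1..30
    summary[name] = (C_mean, dists)
    print(name, "mean distance to mean matrix:", sum(dists) / trials)

off_diagonal = l1(summary["RBD"][0], summary["ERBD50"][0])
print("d(C_RBD, C_ERBD50):", off_diagonal)
```

The per-database averages correspond to the diagonal entries of Table 3 and `off_diagonal` to one off-diagonal entry; the lists of 30 distances are what a plot like Figure 6 would summarize.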

Figure 7 Data set dependent correlations (a)-(d) and data set invariant correlations (e)-(g), each plotted for RBD, ERBD 10, ERBD 50, ERBD 90, and FBD: (a) Jaccard / AMPLE; (b) Yule Q / Mountford; (c) Ochiai I / Stiles; (d) Dice & Sorenson / Pearson I; (e) Jaccard / Dice & Sorenson; (f) Ochiai I / Kulczynski II; (g) Pearson III / Sokal & Sneath IV.

5. Conclusion

In this paper, three types of binary feature vector representations are formally defined: RBD, ERBD, and FBD. The choice of similarity or distance measure for a particular application must be made carefully, depending on characteristics of the data such as the type of binary database. The impact of the data type is demonstrated statistically by analyzing correlations between similarity measures. Correlations of the 2,850 (76 × 75 / 2) possible pairs of the 76 binary similarity and distance measures are analyzed. The higher the correlation, the more similarly the two measures behave. Various shapes of correlation curves are found. Analyzing these patterns is ongoing work. Defining other types of binary databases, such as uniform, normal, and hybrid binary databases, is also future work.

6. References

[1] Batagelj, V. and Bren, M., Comparing Resemblance Measures, DISTANCIA 92, 1992.
[2] Cha, S.-H., Yoon, S., and Tappert, C.C., Enhancing Binary Feature Vector Similarity Measures, Journal of Pattern Recognition Research 1, 2006.
[3] Choi, S.-S., Cha, S.-H., and Tappert, C.C., A Survey of Binary Similarity and Distance Measures, WMSCI, 2009.
[4] Cormack, R.M., A review of classification, Journal of the Royal Statistical Society, Series A, 134, 321-353, 1971.
[5] Driver, H.E., Kroeber, A.L., Quantitative Expression of Cultural Relationships, University of California Press, 1932.
[6] Duda, R.O., Hart, P.E., Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[7] Forbes, S.A., On the local distribution of certain Illinois fishes. An essay in statistical ecology, Bulletin of the Illinois State Laboratory of Natural History, 1907.
[8] Forbes, S.A., Method of determining and measuring the associative relations of species, Science 61, 524, 1925.
[9] Hohn, M., Binary coefficients: A theoretical and empirical study, Mathematical Geology, Volume 8, Number 2, April, 1976.
[10] Hubalek, Z., Coefficients of Association and Similarity, Based on Binary (Presence-Absence) Data: An Evaluation, Biological Reviews, Vol. 57-4, 669-689, 1982.
[11] Jaccard, P., Étude comparative de la distribution florale dans une portion des Alpes et des Jura, Bull Soc Vaudoise Sci Nat 37:547-579, 1901.
[12] Jackson, D.A., Somers, K.M., Harvey, H.H., Similarity Coefficients: Measures of Co-Occurrence and Association or Simply Measures of Occurrence?, The American Naturalist, Vol. 133, No. 3, pp. 436-453, 1989.
[13] Knopf, A.A., The Audubon Society Field Guide to North American Mushrooms. G. H. Lincoff (Pres.), New York, 1981.
[14] Kuhns, J.L., The continuum of coefficients of association, Statistical Association Methods for Mechanized Documentation, (Edited by Stevens et al.) National Bureau of Standards, Washington, 33-39, 1965.
[15] Michael, E.L., Marine ecology and the coefficient of association: a plea in behalf of quantitative biology, Ecology 8, 54-59, 1920.
[16] Murguia, M. and Villasenor, J.L., Estimating the effect of the similarity coefficient and the cluster algorithm on biogeographic classifications, Ann. Bot. Fennici 40: 415-421, 2003.
[17] Smith, J.R., Chang, S.-F., Automated binary texture feature sets for image retrieval, International Conf. Acoust., Speech, Signal Processing, Atlanta, GA, 1996.
[18] Sneath, P.H.A., Sokal, R.R., Numerical Taxonomy: The Principles and Practice of Numerical Classification, W.H. Freeman and Company, San Francisco, 1973.
[19] Sokal, R.R., Sneath, P.H., Principles of Numerical Taxonomy, San Francisco, W.H. Freeman, 1963.
[20] Tubbs, J.D., A note on binary template matching, Pattern Recognition, 22(4):359-365, 1989.
[21] Willett, P., Barnard, J.M., Downs, G.M., Chemical similarity searching, J. Chem. Inf. Comput. Sci. 38: 983-996, 1998.
[22] Willett, P., Similarity-based approaches to virtual screening, Biochemical Society Transactions 31, 603-606, 2003.
[23] Zhang, B., Srihari, S.N., Binary vector dissimilarities for handwriting identification, Proceedings of SPIE, Document Recognition and Retrieval X, p 15-166, 2003.