Generalized Ward and Related Clustering Problems

Vladimir BATAGELJ
Department of Mathematics, Edvard Kardelj University, Jadranska 19, 61 000 Ljubljana, Yugoslavia

Abstract. In the paper an attempt to legalize the use of Ward and related clustering methods also for dissimilarities different from the squared Euclidean distance is presented.

1 Introduction

The Ward method is often successfully used for solving clustering problems over dissimilarity matrices which do not consist of squared Euclidean distances between units [2, p. 45]. The use of the centroid and incremental sum of squares (ISS) strategies with measures other than distance measures and the squared Euclidean distance has been regarded as dubious; yet both strategies can be viewed as purely algebraically formulated, and then they retain their established space-distortion properties. There is, moreover, some empirical evidence that the use of the ISS strategy with non-metrics does produce readily interpreted classifications [1, p. 439-440]. In this paper an attempt to legalize such extended uses of Ward and related methods is presented.

We start with an overview of the properties of the ordinary Ward criterion function (2-11). Replacing the squared Euclidean distance in (5) by an arbitrary dissimilarity d we obtain the generalized Ward clustering problem (1,2,13). To preserve the analogy with the ordinary problem, and for notational convenience, we introduce the notion of the generalized center of a cluster (15,16). In general there is no evident interpretation of generalized centers. By analogy with the ordinary case, where the dissimilarity d is the squared Euclidean distance, we also generalize the Gower-Bock dissimilarity between clusters (28) and the dissimilarities based on inertia, variance and weighted increase of variance (32-34). For all these generalized dissimilarities we obtain the same coefficients in the Lance-Williams-Jambu formula (29,35-38). Therefore the corresponding agglomerative clustering methods can be used for any dissimilarity d and not only for the squared Euclidean distance. At the end the generalized Huyghens theorem and some properties of the generalized dissimilarities are given.

Published in: Classification and Related Methods of Data Analysis, H.H. Bock (editor). North-Holland, Amsterdam, 1988, p. 67-74.

2 Ward clustering problem

Let us first recall some basic facts about the Ward clustering problem. It can be posed as follows: determine the clustering $\mathbf{C}^* \in \Pi_k$ for which

$$P(\mathbf{C}^*) = \min_{\mathbf{C} \in \Pi_k} P(\mathbf{C}) \qquad (1)$$

where

$$\Pi_k = \{\mathbf{C} : \mathbf{C} \text{ is a partition of the set of units } E \text{ and } \mathrm{card}(\mathbf{C}) = k\}$$

and the Ward criterion function $P(\mathbf{C})$ has the form

$$P(\mathbf{C}) = \sum_{C \in \mathbf{C}} p(C) \qquad (2)$$

with

$$p(C) = \sum_{X \in C} d_2^2(X, \bar{C}) \qquad (3)$$

where $\bar{C}$ is the center (of gravity) of the cluster $C$,

$$[\bar{C}] = \frac{1}{n_C} \sum_{X \in C} [X], \qquad n_C = \mathrm{card}(C), \qquad [X] \in \mathbb{R}^m \qquad (4)$$

and $d_2$ is the Euclidean distance. We make a distinction between the (name of the) unit $X$ and its value (description) $[X]$.

In the French literature the quantity $p(C)$ (the error sum of squares of the cluster $C$) is often called the inertia of the cluster $C$ [4, p. 30] and denoted $I(C)$.

Let us list some well known equalities relating the above quantities [11, p. 50]:

$$p(C) = \frac{1}{2 n_C} \sum_{X,Y \in C} d_2^2(X, Y) \qquad (5)$$

$$\sum_{X \in C} d_2^2(X, U) = p(C) + n_C\, d_2^2(U, \bar{C}) \qquad (6)$$

$$d_2^2(U, \bar{C}) = \frac{1}{n_C} \sum_{X \in C} \left( d_2^2(X, U) - d_2^2(X, \bar{C}) \right) \qquad (7)$$

$$\bar{C} = \mathop{\mathrm{argmin}}_{U} \sum_{X \in C} d_2^2(X, U) \qquad (8)$$

and, if $C_u \cap C_v = \emptyset$,

$$(n_u + n_v)\, p(C_u \cup C_v) = n_u\, p(C_u) + n_v\, p(C_v) + \sum_{X \in C_u,\, Y \in C_v} d_2^2(X, Y) \qquad (9)$$

$$p(C_u \cup C_v) = p(C_u) + p(C_v) + \frac{n_u n_v}{n_u + n_v}\, d_2^2(\bar{C}_u, \bar{C}_v) \qquad (10)$$

The last term in (10),

$$d_W(C_u, C_v) = \frac{n_u n_v}{n_u + n_v}\, d_2^2(\bar{C}_u, \bar{C}_v) \qquad (11)$$

is also called the Ward dissimilarity between the clusters $C_u$ and $C_v$.
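
These identities are easy to check numerically. The following minimal sketch (not part of the original paper; the random points, cluster sizes and the use of NumPy are illustrative choices) verifies (5) and (10):

```python
# Numerical check of the Ward identities (5) and (10) for two disjoint clusters in R^3.
import numpy as np

rng = np.random.default_rng(0)
Cu, Cv = rng.normal(size=(4, 3)), rng.normal(size=(6, 3))   # two disjoint clusters

def p(C):
    """Error sum of squares (3): squared Euclidean distances to the centroid."""
    return np.sum((C - C.mean(axis=0)) ** 2)

def p_pairwise(C):
    """Equivalent form (5): (1 / (2 n_C)) * sum over all pairs of squared distances."""
    diff = C[:, None, :] - C[None, :, :]
    return np.sum(diff ** 2) / (2 * len(C))

C = np.vstack([Cu, Cv])
nu, nv = len(Cu), len(Cv)
ward = nu * nv / (nu + nv) * np.sum((Cu.mean(axis=0) - Cv.mean(axis=0)) ** 2)  # (11)

assert np.isclose(p(C), p_pairwise(C))            # (3) agrees with (5)
assert np.isclose(p(C), p(Cu) + p(Cv) + ward)     # identity (10)
```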

3 The generalized Ward clustering problem

To justify the use of Ward and related agglomerative clustering methods also for dissimilarities different from the squared Euclidean distance we have to appropriately generalize the Ward clustering problem. Afterwards we shall show that for the generalized problems we obtain the same coefficients in the Lance-Williams-Jambu formula as for the ordinary problems (based on the squared Euclidean distance). Therefore the agglomerative (hierarchical) procedures corresponding to these problems can be used for any dissimilarity.

To generalize the Ward clustering problem we proceed as follows. Let $E \subseteq \mathcal{E}$, where $\mathcal{E}$ is the space of units (the set of all possible units; the set $[\mathcal{E}]$ of descriptions of units is not necessarily a subset of $\mathbb{R}^m$), be a finite set, let

$$d : \mathcal{E} \times \mathcal{E} \to \mathbb{R}_0^+$$

be a dissimilarity between units, and let

$$w : \mathcal{E} \to \mathbb{R}^+$$

be a weight of units, which is extended to clusters by:

$$\forall X \in \mathcal{E}: \ w(\{X\}) = w(X), \qquad C_u \cap C_v = \emptyset \Rightarrow w(C_u \cup C_v) = w(C_u) + w(C_v) \qquad (12)$$

To obtain the generalized Ward clustering problem we must appropriately replace formula (3). Relying on (5), we define

$$p(C) = \frac{1}{2\, w(C)} \sum_{X,Y \in C} w(X)\, w(Y)\, d(X, Y) \qquad (13)$$

Note that $d$ in (13) can be any dissimilarity on $\mathcal{E}$ and not only the squared Euclidean distance. From (13) we can easily derive the following generalization of (9): if $C_u \cap C_v = \emptyset$ then

$$w(C_u \cup C_v)\, p(C_u \cup C_v) = w(C_u)\, p(C_u) + w(C_v)\, p(C_v) + \sum_{X \in C_u,\, Y \in C_v} w(X)\, w(Y)\, d(X, Y) \qquad (14)$$

All the other properties are expressed using the notion of the center $\bar{C}$ of the cluster $C$, and for dissimilarities different from the squared Euclidean distance these properties usually do not hold. There is another possibility: to replace $\bar{C}$ by a generalized, possibly imaginary (with description not necessarily in the set $[\mathcal{E}]$) central element and to try to preserve the properties characteristic of the Ward problem. For this purpose let us introduce the extended space of units

$$\tilde{\mathcal{E}} = \{\tilde{C} : C \subseteq \mathcal{E},\ 0 < \mathrm{card}(C) < \infty\} \cup \mathcal{E} \qquad (15)$$

The generalized center of the cluster $C$ is an (abstract) element $\tilde{C} \in \tilde{\mathcal{E}}$ for which the dissimilarity between it and any $U \in \tilde{\mathcal{E}}$ is determined by

$$d(U, \tilde{C}) = d(\tilde{C}, U) = \frac{1}{w(C)} \left( \sum_{X \in C} w(X)\, d(X, U) - p(C) \right) \qquad (16)$$
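
As a quick consistency check (not spelled out in the paper), for unit weights and $d = d_2^2$ the generalized center behaves exactly like the ordinary center of gravity: by (6),

$$d(U, \tilde{C}) = \frac{1}{n_C} \left( \sum_{X \in C} d_2^2(X, U) - p(C) \right) = \frac{1}{n_C} \left( p(C) + n_C\, d_2^2(U, \bar{C}) - p(C) \right) = d_2^2(U, \bar{C}),$$

so in the ordinary case $\tilde{C}$ can be identified with $\bar{C}$ as far as dissimilarities are concerned.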

When for all units w(x) =, the right part of (6) can be read: average dissimilarity between unit/center U and cluster C diminished by average radius of cluster C. It is easy to verify that for every X cale and U E C = {X} d(u, C) = d(u, X) (7) Note that the equality (6) tells us only how to determine the dissimilarity between two units/centers; but does not provide us with an explicit expression for C. Therefore the generalized centers can be viewed as a notationaly convenient notion generalizing our intuition relaying on the squared euclidean distance; although in some cases it may be possible to find a representation for them. It would be interesting to obtain some results in this direction. Can the Young-Householder theorem [9] be used for this purpose? Multiplying (6) by w(u) summing on U running over C we get p(c) = w(x) d(x, C) (8) which generalizes the original definition of p(c) given by formula (3). Substituting (8) in (6) we obtain generalized (7): d(u, C) = w(c) which gives for U = C : d( C, C) = 0. Again, applying (6) twice we get: w(x) (d(x, U) d(x, C)) (9) d( C u, C v ) = w(c u ) w(c v ) u Y Cv w(x) w(y ) d(x, Y ) p(c u) w(c u ) p(c v) w(c v ) (20) Multiplying (20) by w(c u ) w(c v ), rearanging and substituting so obtained double sum into (4) we get for C u C v = : p(c u C v ) = p(c u ) + p(c v ) + w(c u) w(c v ) w(c u C v ) d( C u, C v ) (2) which generalizes (0). Therefore we can define the generalized Ward dissimilarity between clusters C u and C v as: D W (C u, C v ) = w(c u) w(c v ) w(c u C v ) d( C u, C v ) (22) 4 Agglomerative methods for Ward and related clustering problems It is well known that there are no efficient (exact) algorithms for solving the clustering problem (), except for some special criterion functions; and there are strong arguments 4

4 Agglomerative methods for Ward and related clustering problems

It is well known that there are no efficient (exact) algorithms for solving the clustering problem (1), except for some special criterion functions, and there are strong arguments that no such algorithm exists [5]. Therefore approximative methods such as local optimization and hierarchical (agglomerative and divisive) methods should be used.

The equality (21) gives rise to the following reasoning. Suppose that the criterion function $P(\mathbf{C})$ takes the form

$$P(\mathbf{C}) = \sum_{C \in \mathbf{C}} p_D(C) \qquad (23)$$

where

$$p_D(C_u \cup C_v) = p_D(C_u) + p_D(C_v) + D(C_u, C_v) \qquad (24)$$

and $D(C_u, C_v)$ is a dissimilarity between the clusters $C_u$ and $C_v$. Some examples of such dissimilarities will be given in the continuation. Let $\mathbf{C}' = (\mathbf{C} \setminus \{C_u, C_v\}) \cup \{C_u \cup C_v\}$; then

$$P(\mathbf{C}') = P(\mathbf{C}) + D(C_u, C_v) \qquad (25)$$

On the last equality we can base the following greedy heuristic: if we start with $\mathbf{C}_0 = \{\{X\} : X \in E\}$ and at each step join the most similar pair of clusters, we probably obtain clusterings which are (almost) optimal. This heuristic is the basis of the agglomerative methods for solving the clustering problem. It is useful to slightly generalize ([7], [3]) the equality (24) to

$$p_D(C) = \min_{C' \subset C} \left( p_D(C') + p_D(C \setminus C') + D(C', C \setminus C') \right) \qquad (26)$$

which still leads to the same heuristic. Therefore, for a given dissimilarity between clusters $D(C_u, C_v)$, we can view the agglomerative methods as approximative methods for solving the clustering problem (1) for the criterion function defined by (23,26) and $p_D(\{X\}) = 0$.

A very popular scheme of agglomerative clustering algorithms is the scheme based on the Lance-Williams-Jambu formula for updating the dissimilarities between clusters [10, 6]:

$$D(C_s, C_p \cup C_q) = \alpha_1\, D(C_s, C_p) + \alpha_2\, D(C_s, C_q) + \beta\, D(C_p, C_q) + \gamma\, |D(C_s, C_p) - D(C_s, C_q)| + \delta_1 f(C_p) + \delta_2 f(C_q) + \delta_3 f(C_s) \qquad (27)$$

where $f(C)$ is a value of the cluster $C$. In the following (29,35-38) we shall show that the (generalized) Ward and some related dissimilarities satisfy this formula.
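
As an illustration of the scheme (not given in the paper), the sketch below performs one complete agglomerative pass driven by an update of the form (27), instantiated with the Ward coefficients $\alpha_1 = w_{ps}/w$, $\alpha_2 = w_{qs}/w$, $\beta = -w_s/w$ of formula (35) derived below; the dictionary-based bookkeeping, the function name `ward_agglomeration` and the unit starting weights are illustrative choices:

```python
def ward_agglomeration(D0):
    """Agglomerative pass over a symmetric matrix D0 of dissimilarities between the
    singleton clusters; at each step the most similar pair is merged (the greedy step
    based on (25)) and the dissimilarities to the new cluster are recomputed with the
    Ward coefficients of formula (35). Returns the merges as (p, q, new label, D at merge)."""
    n = len(D0)
    active = set(range(n))                                    # labels of the current clusters
    w = {i: 1.0 for i in range(n)}                            # cluster weights, extended by (12)
    D = {frozenset((i, j)): float(D0[i][j])
         for i in range(n) for j in range(i + 1, n)}
    merges, new = [], n
    while len(active) > 1:
        pair = min((k for k in D if k <= active), key=D.get)  # most similar pair of clusters
        p, q = tuple(pair)
        w[new] = w[p] + w[q]
        for s in active - {p, q}:                             # Lance-Williams-Jambu step (27)
            tot = w[p] + w[q] + w[s]
            D[frozenset((s, new))] = ((w[p] + w[s]) / tot * D[frozenset((s, p))]
                                      + (w[q] + w[s]) / tot * D[frozenset((s, q))]
                                      - w[s] / tot * D[pair])
        merges.append((p, q, new, D[pair]))
        active = (active - {p, q}) | {new}
        new += 1
    return merges
```

For the generalized Ward method the input would be $D0[i][j] = D_W(\{i\},\{j\}) = \frac{w_i w_j}{w_i + w_j}\, d(i, j)$, which for unit weights is $d(i, j)/2$; by (35) the updated entries then remain the generalized Ward dissimilarities (22) between the current clusters, whatever the underlying dissimilarity $d$.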

We can generalize the Gower-Bock dissimilarity by defining

$$D_G(C_u, C_v) = d(\tilde{C}_u, \tilde{C}_v) \qquad (28)$$

It is not difficult to verify, using (14) and (21), the Gower-Bock equality (generalized median theorem, [4, p. 57]):

$$D_G(C_s, C_p \cup C_q) = \frac{w_p}{w_{pq}}\, D_G(C_s, C_p) + \frac{w_q}{w_{pq}}\, D_G(C_s, C_q) - \frac{w_p w_q}{w_{pq}^2}\, D_G(C_p, C_q) \qquad (29)$$

For this purpose it is useful to introduce the quantity

$$a(C_u, C_v) = \frac{1}{w_u w_v} \sum_{X \in C_u,\, Y \in C_v} w(X)\, w(Y)\, d(X, Y) \qquad (30)$$

for which the relations

$$w_{uv}\, a(C_t, C_u \cup C_v) = w_u\, a(C_t, C_u) + w_v\, a(C_t, C_v)$$

$$w_{uv}\, p(C_u \cup C_v) = w_u\, p(C_u) + w_v\, p(C_v) + w_u w_v\, a(C_u, C_v) \qquad (31)$$

$$D_G(C_u, C_v) = d(\tilde{C}_u, \tilde{C}_v) = a(C_u, C_v) - \frac{p(C_u)}{w_u} - \frac{p(C_v)}{w_v}$$

hold. Using (29) and the relations among the following dissimilarities [6, 10, p. 27]:

Inertia:
$$D_I(C_u, C_v) = p(C_u \cup C_v) \qquad (32)$$

Variance:
$$D_V(C_u, C_v) = \mathrm{var}(C_u \cup C_v) = \frac{1}{w_{uv}}\, p(C_u \cup C_v) \qquad (33)$$

Weighted increase of variance:
$$D_v(C_u, C_v) = \mathrm{var}(C_u \cup C_v) - \frac{1}{w_{uv}} \left( w_u\, \mathrm{var}(C_u) + w_v\, \mathrm{var}(C_v) \right) = \frac{1}{w_{uv}}\, D_W(C_u, C_v) \qquad (34)$$

we derive the well known equalities (with $w = w(C_p \cup C_q \cup C_s)$):

$$D_W(C_s, C_p \cup C_q) = \frac{w_{ps}}{w}\, D_W(C_s, C_p) + \frac{w_{qs}}{w}\, D_W(C_s, C_q) - \frac{w_s}{w}\, D_W(C_p, C_q) \qquad (35)$$

$$D_I(C_s, C_p \cup C_q) = \frac{w_{ps}}{w}\, D_I(C_s, C_p) + \frac{w_{qs}}{w}\, D_I(C_s, C_q) + \frac{w_{pq}}{w}\, D_I(C_p, C_q) - \frac{w_p}{w}\, p(C_p) - \frac{w_q}{w}\, p(C_q) - \frac{w_s}{w}\, p(C_s) \qquad (36)$$

$$D_V(C_s, C_p \cup C_q) = \left( \frac{w_{ps}}{w} \right)^2 D_V(C_s, C_p) + \left( \frac{w_{qs}}{w} \right)^2 D_V(C_s, C_q) + \left( \frac{w_{pq}}{w} \right)^2 D_V(C_p, C_q) - \frac{w_p}{w^2}\, p(C_p) - \frac{w_q}{w^2}\, p(C_q) - \frac{w_s}{w^2}\, p(C_s) \qquad (37)$$

$$D_v(C_s, C_p \cup C_q) = \left( \frac{w_{ps}}{w} \right)^2 D_v(C_s, C_p) + \left( \frac{w_{qs}}{w} \right)^2 D_v(C_s, C_q) - \frac{w_s\, w_{pq}}{w^2}\, D_v(C_p, C_q) \qquad (38)$$

These equalities therefore hold also when we take for $p(C)$ in relations (27-29) the generalized cluster error function (13). So we can conclude that the coefficients in the Lance-Williams-Jambu formula for Ward and related agglomerative clustering methods are valid for any dissimilarity $d$.
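
This concluding claim can be illustrated numerically: with unit weights and a Manhattan dissimilarity (which is not a squared Euclidean distance), the update formula (35) is satisfied by the generalized Ward dissimilarity computed directly from (13) and (21)/(22). The sketch below (not part of the paper; the data and the cluster split are arbitrary) checks this on three disjoint clusters:

```python
# Numerical check of the Ward update (35) for the generalized D_W and a Manhattan d.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(12, 3))
d = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)    # arbitrary symmetric dissimilarity

def p(idx):
    """Generalized cluster error (13), unit weights."""
    return d[np.ix_(idx, idx)].sum() / (2 * len(idx))

def DW(iu, iv):
    """Generalized Ward dissimilarity (22), obtained via (21) as p(Cu u Cv) - p(Cu) - p(Cv)."""
    return p(iu + iv) - p(iu) - p(iv)

s, cp, q = list(range(0, 3)), list(range(3, 7)), list(range(7, 12))
w_s, w_p, w_q = len(s), len(cp), len(q)
w = w_s + w_p + w_q

lhs = DW(s, cp + q)
rhs = ((w_p + w_s) / w) * DW(s, cp) + ((w_q + w_s) / w) * DW(s, q) - (w_s / w) * DW(cp, q)
assert np.isclose(lhs, rhs)                              # update formula (35)
```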

5 The generalized Huyghens theorem

Taking $C_u = C$ and $C_v = E$ in (20), multiplying it by $w(C)\, w(E)$ and summing over all $C \in \mathbf{C}$, we obtain

$$p(E) = \sum_{C \in \mathbf{C}} \left( w(C)\, d(\tilde{C}, \tilde{E}) + p(C) \right) \qquad (39)$$

If we denote [4, p. 50]

$$I = p(E), \qquad I_W = \sum_{C \in \mathbf{C}} p(C), \qquad I_B = \sum_{C \in \mathbf{C}} w(C)\, d(\tilde{C}, \tilde{E}) \qquad (40)$$

we can express (39) in the form of the generalized Huyghens theorem:

$$I = I_W + I_B \qquad (41)$$

6 Some properties of the generalized dissimilarities

For dissimilarities different from $d_2^2$ the extended dissimilarity $d(U, \tilde{C})$ is not necessarily a non-negative quantity for every $U \in \tilde{\mathcal{E}}$; and neither are $D_G(C_u, C_v)$ and $D_W(C_u, C_v)$. To obtain a partial answer to this problem we substitute (13) into (16), giving

$$d(U, \tilde{C}) = \frac{1}{2\, w(C)^2} \sum_{X,Y \in C} w(X)\, w(Y) \left( 2\, d(X, U) - d(X, Y) \right) = \frac{1}{2\, w(C)^2} \sum_{X,Y \in C} w(X)\, w(Y) \left( d(X, U) + d(U, Y) - d(X, Y) \right) \qquad (42)$$

Therefore: if the dissimilarity $d$ satisfies the triangle inequality, then for each $U \in \mathcal{E}$ it holds that

$$d(U, \tilde{C}) \geq 0 \qquad (43)$$

The property (43) has another important consequence. Considering it in (16) we get: for each $U \in \mathcal{E}$

$$\sum_{X \in C} w(X)\, d(X, U) \geq p(C) \qquad (44)$$

which in some sense generalizes (8). Note that (42) does not imply $d(\tilde{C}_u, \tilde{C}_v) \geq 0$, because we do not know whether

$$d(X, \tilde{C}) + d(\tilde{C}, Y) \geq d(X, Y) \qquad (45)$$

but it is easy to verify that if $d$ satisfies the triangle inequality then also

$$d(\tilde{C}, Z) + d(Z, X) \geq d(\tilde{C}, X), \qquad d(\tilde{C}_u, Z) + d(Z, \tilde{C}_v) \geq d(\tilde{C}_u, \tilde{C}_v) \qquad (46)$$

for $X, Z \in \mathcal{E}$. So it is still possible that there exists a center $\tilde{C}'$ for which

$$\sum_{X \in C} w(X)\, d(X, \tilde{C}') < p(C) \qquad (47)$$

Note also that $d_2^2$ does not satisfy the triangle inequality. The main open question remains: for which dissimilarities $d$ does $d(U, V) \geq 0$ hold for every $U, V \in \tilde{\mathcal{E}}$? This inequality combined with (16) gives us the generalized (8).
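
Both the generalized Huyghens theorem (41) and the non-negativity property (43) can be checked numerically for a metric dissimilarity. The following minimal sketch (not part of the paper; unit weights, a Manhattan dissimilarity and an arbitrary three-cluster partition are illustrative choices) does so:

```python
# Check of the generalized Huyghens theorem (41) and of (43) for a Manhattan dissimilarity.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))
d = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

def p(idx):
    """Generalized cluster error (13), unit weights."""
    return d[np.ix_(idx, idx)].sum() / (2 * len(idx))

def d_center(ic, ie):
    """d(C~, E~) from (20), unit weights; for a singleton ic it reduces to (16) by (17)."""
    cross = d[np.ix_(ic, ie)].sum() / (len(ic) * len(ie))
    return cross - p(ic) / len(ic) - p(ie) / len(ie)

E = list(range(10))
clustering = [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]]         # an arbitrary partition of E

I  = p(E)                                                 # total inertia
IW = sum(p(C) for C in clustering)                        # within-cluster part (40)
IB = sum(len(C) * d_center(C, E) for C in clustering)     # between-cluster part (40)
assert np.isclose(I, IW + IB)                             # generalized Huyghens theorem (41)

# (43): Manhattan distance satisfies the triangle inequality, so d(U, E~) >= 0 for units U.
assert min(d_center([i], E) for i in E) >= -1e-9
```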

References

[1] Abel, D.J., Williams, W.T., A re-examination of four classificatory fusion strategies. The Computer Journal, 28(1985), 439-443.

[2] Anderberg, M.R., Cluster Analysis for Applications. Academic Press, New York, 1973.

[3] Batagelj, V., Agglomerative methods in clustering with constraints. Submitted, 1984.

[4] Bouroche, J-M., and Saporta, G., L'analyse des données, Que sais-je? Presses Universitaires de France, Paris, 1980.

[5] Brucker, P., On the complexity of clustering problems, in: Henn, R., Korte, B., and Oettli, W. (eds.), Optimization and Operations Research; Proceedings. Lecture Notes in Economics and Mathematical Systems, Vol. 157. Springer-Verlag, Berlin, 1978.

[6] Diday, E., Inversions en classification hiérarchique: application à la construction adaptative d'indices d'agrégation. Revue de Statistique Appliquée, 31(1983), 45-62.

[7] Ferligoj, A., and Batagelj, V., Clustering with relational constraint. Psychometrika, 47(1982), 413-426.

[8] Gower, J.C., Legendre, P., Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1986), 5-48.

[9] Gregson, R.A.M., Psychometrics of Similarity. Academic Press, New York, 1975.

[10] Jambu, M., Lebeaux, M-O., Cluster Analysis and Data Analysis. North-Holland, Amsterdam, 1983.

[11] Späth, H., Cluster-Analyse-Algorithmen. R. Oldenbourg Verlag, München, 1977.