
The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ² distance

Marina Meilă
University of Washington, Department of Statistics
Box 354322, Seattle, WA 98195-4322
phone: (206) 543-8484, e-mail: mmp@stat.washington.edu

February 20, 2006

Abstract

We prove that the above two distances between partitions of a finite set are equivalent in a neighborhood of 0. In other words, if the two partitions are very similar, then $d_{\chi^2}$ defines upper and lower bounds on $d_{ME}$ and vice versa. The proof is geometric and relies on the convexity of a certain set of probability measures. The motivation for this work is in the area of data clustering, where these distances are frequently used to compare two clusterings of a set of observations. Moreover, our result applies to any pair of finite-valued random variables, and provides simple yet tight upper and lower bounds on the χ² measure of (in)dependence that are valid when the two variables are strongly dependent.

1 Motivation

Clustering, or finding partitions in data, has become an increasingly popular part of data analysis. In order to study clustering theoretically, or to assess its behaviour empirically, one needs to compare clusterings of a finite set in a meaningful way. The Misclassification Error and the χ² distance are two distinct criteria for comparing clusterings, the first widely used in the computer science literature on clustering and the second originating in statistics. Here we show that these two distances are equivalent in the case when the two partitions are very similar. In other words, if $d_{ME}$ is small, then $d_{\chi^2}$ is small too, and vice versa. This result is, to my knowledge, the first to give a detailed local comparison of two distances between partitions.

The case of small distances is of utmost importance, as it is in this regime that one desires the behaviour of any clustering algorithm to lie. Therefore, this proof provides a theoretical tool for the analysis of algorithms' behaviour and of clustering criteria. In the empirical evaluation of clustering algorithms, understanding the small-distance case allows one to make fine distinctions between various algorithms. The present equivalence theorems represent a step towards removing the dependence of the evaluation outcome on the choice of distance.

2 Definitions and representation

We consider a finite set $D_n$ of cardinality $n$. A clustering is a partition of $D_n$ into sets $C_1, C_2, \dots, C_K$ called clusters, such that $C_k \cap C_l = \emptyset$ for $k \neq l$ and $\bigcup_{k=1}^K C_k = D_n$. Let the cardinality of cluster $C_k$ be $n_k$; we have, of course, $n = \sum_{k=1}^K n_k$. We also assume that $n_k > 0$; in other words, $K$ represents the number of non-empty clusters.

Representing clusterings as matrices. W.l.o.g. the set $D_n$ can be taken to be $\{1, 2, \dots, n\} \stackrel{def}{=} [n]$. Denote by $X$ a clustering $\{C_1, C_2, \dots, C_K\}$; $X$ can be represented by the $n \times K$ matrix $A_X$ with $A_{ik} = 1$ if $i \in C_k$ and 0 otherwise. In this representation, the columns of $A_X$ are indicator vectors of the clusters and are orthogonal.

Representing clusterings as random variables. The clustering $X$ can also be represented as the random variable (denoted abusively by) $X : [n] \to [K]$ taking value $k \in [K]$ with probability $n_k/n$.

One typically requires distances between partitions to be invariant to permutations of the labels $1, \dots, K$. Under this representation, any distance between two clusterings can be seen as a particular type of distance between random variables which is invariant to permutations.

Let a second clustering of $D_n$ be $Y = \{C'_1, C'_2, \dots, C'_{K'}\}$, with cluster sizes $n'_{k'}$. Note that the two clusterings may have different numbers of clusters.

Lemma 1 The joint distribution of the variables $X, Y$ is given by
$$p_{XY} = \frac{1}{n} A_X^T A_Y \qquad(2.1)$$
In other words, $p_{XY}(x, y)$ is the $(x, y)$-th element of the $K \times K'$ matrix in (2.1). In the above, the superscript $(\cdot)^T$ denotes matrix transposition. The proof is immediate and is left to the reader.

We now define the two distances between clusterings in terms of the joint probability matrix defined above.

Definition 2 The misclassification error distance $d_{ME}$ between the clusterings $X, Y$ (with $K \le K'$) is
$$d_{ME}(X, Y) = 1 - \max_{\pi \in \Pi_{K'}} \sum_{x \in [K]} p_{XY}(x, \pi(x))$$
where $\Pi_{K'}$ is the set of all permutations of $K'$ objects. Although the maximization above is over a set of size $K'!$, $d_{ME}$ can be computed in polynomial time by a maximum bipartite matching algorithm [Papadimitriou and Steiglitz, 1998]. It can be shown that $d_{ME}$ is a metric (see e.g. [Meila, 2005]). This distance is widely used in the computer science literature on clustering, due to its direct relationship with the misclassification error cost of classification. It has indeed very appealing properties as long as $X, Y$ are close [Meilă, 2006]; otherwise, its poor resolution represents a major hindrance.

Definition 3 The χ² distance $d_{\chi^2}$ is defined as
$$d_{\chi^2}(X, Y) = \min(K, K') - \chi^2(p_{XY}) \quad\text{with}\quad \chi^2(p_{XY}) = \sum_{x,y} \frac{p_{XY}(x, y)^2}{p_X(x)\, p_Y(y)} \qquad(2.2)$$
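For concreteness, here is a minimal computational sketch (an illustration added here, not part of the paper) of the quantities just defined, assuming NumPy and SciPy are available; the toy label vectors are arbitrary, and the maximum bipartite matching of Definition 2 is delegated to scipy.optimize.linear_sum_assignment.

```python
# Illustrative sketch (not from the paper): the joint p_XY of Lemma 1 and the two
# distances of Definitions 2 and 3, for two hypothetical clusterings of n = 6 points.
import numpy as np
from scipy.optimize import linear_sum_assignment

x = np.array([0, 0, 1, 1, 2, 2])              # clustering X, K = 3 clusters
y = np.array([0, 0, 1, 2, 2, 2])              # clustering Y, K' = 3 clusters
n, K, Kp = len(x), 3, 3

A_X = np.eye(K)[x]                            # n x K indicator matrix A_X
A_Y = np.eye(Kp)[y]                           # n x K' indicator matrix A_Y
p_XY = A_X.T @ A_Y / n                        # Lemma 1: joint distribution of (X, Y)
p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)

# Definition 2: d_ME via a maximum-weight bipartite matching
rows, cols = linear_sum_assignment(-p_XY)     # maximize the matched probability mass
d_ME = 1.0 - p_XY[rows, cols].sum()

# Definition 3: chi^2 statistic and the chi^2 distance
chi2 = (p_XY**2 / np.outer(p_X, p_Y)).sum()
d_chi2 = min(K, Kp) - chi2

print(d_ME, d_chi2)                           # here: 1/6 and 2/3
```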

The above definition and notation are motivated as follows.

Lemma 4 Let $p_X = (p_x)_{x \in [K]}$, $p_Y = (p'_y)_{y \in [K']}$ be the marginals of $p_{XY}$. Then the function $\chi^2(p_{XY})$ defined in (2.2) equals the functional $\chi^2(f, g) + 1$ applied to $f = p_{XY}$, $g = p_X p_Y$.

Proof Denote $p_{xy} = p_{XY}(x, y)$. Then
$$\chi^2(f, g) = \sum_{x,y} \frac{(p_{xy} - p_x p'_y)^2}{p_x p'_y} = \sum_{x,y}\left[\frac{p_{xy}^2}{p_x p'_y} - 2 p_{xy} + p_x p'_y\right] = \sum_{x,y}\frac{p_{xy}^2}{p_x p'_y} - 2 + 1$$

Hence, $d_{\chi^2}$ is a measure of independence. It is equal to 0 when the random variables $X, Y$ are identical up to a permutation of the labels, while $\chi^2(p_{XY})$ equals 1 when they are independent. From Lemma 4 one can see that $d_{\chi^2}$ is non-negative. This distance, with slight variants, has been used as a distance between partitions by [Hubert and Arabie, 1985, Bach and Jordan, 2004], with the obvious motivation of being related to the familiar χ² functional. The following lemma gives another, technical motivation for paying attention to $d_{\chi^2}$.

Lemma 5 Let $\tilde A_X$, $\tilde A_Y$ be the normalized matrix representations of $X, Y$, defined by $\tilde A_X(i, k) = \frac{1}{\sqrt{n_k}}$ if $i \in C_k$ and 0 otherwise; hence $\tilde A_X$ (respectively $\tilde A_Y$) has orthonormal columns. Then
$$\chi^2(p_{XY}) = \|\tilde A_X^T \tilde A_Y\|_F^2 \qquad(2.3)$$
where $\|\cdot\|_F$ denotes the Frobenius norm.

Proof Note that $(\tilde A_X^T \tilde A_Y)_{xy} = \frac{p_{xy}}{\sqrt{p_x p'_y}}$.

The above lemma shows that the $d_{\chi^2}$ distance is a quadratic function, making it a convenient instrument in proofs. Contrast this with the apparently simple $d_{ME}$ distance, which is not everywhere differentiable and is theoretically much harder to analyze.

We close this section by noting that $d_{\chi^2}$ is concave in $p_{XY}$ while $d_{ME}$ is convex. For $d_{\chi^2}$, this follows from the convexity of the χ² functional [Vajda, 1989]. The distance $d_{ME}$ can be expressed as the minimum of a set of linear functions¹; therefore it is convex, which completes the argument.

¹ $d_{ME}$ is the minimum of the off-diagonal mass of $p_{XY}$ over all permutations.
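Both identities are easy to verify numerically. The following self-contained snippet (an illustration added here, not part of the paper) checks Lemma 4 and Lemma 5 on the same toy clusterings used in the sketch above.

```python
# Numerical check (illustrative) of Lemma 4 and Lemma 5.
import numpy as np

x = np.array([0, 0, 1, 1, 2, 2])
y = np.array([0, 0, 1, 2, 2, 2])
n, K, Kp = len(x), 3, 3
A_X, A_Y = np.eye(K)[x], np.eye(Kp)[y]
p_XY = A_X.T @ A_Y / n
p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)
chi2 = (p_XY**2 / np.outer(p_X, p_Y)).sum()

# Lemma 4: chi^2(p_XY) equals the chi^2 functional of (p_XY, p_X p_Y) plus 1
indep = np.outer(p_X, p_Y)
assert np.isclose(chi2, ((p_XY - indep)**2 / indep).sum() + 1.0)

# Lemma 5: chi^2(p_XY) equals the squared Frobenius norm of A~_X^T A~_Y,
# where the columns of A_X, A_Y are rescaled by 1/sqrt(n_k)
A_Xn = A_X / np.sqrt(A_X.sum(axis=0))
A_Yn = A_Y / np.sqrt(A_Y.sum(axis=0))
assert np.isclose(chi2, np.linalg.norm(A_Xn.T @ A_Yn, 'fro') ** 2)
```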

3 Small $d_{\chi^2}$ implies small $d_{ME}$

To prove this statement, we adopt the following framework. First, for simplicity, we assume that $K' = K$; the generalization to $K' \neq K$ is straightforward. Second, we assume w.l.o.g. that partition $X$ is fixed, while $Y$ is allowed to vary. In terms of random variables, the two assumptions describe the set of distributions over $[K] \times [K]$ that have a fixed marginal $p_X = (p_1, \dots, p_K)$. We denote this domain by $P$. In the rest of the section we adopt the following notation: $\tilde p$ represents a distribution from $P$, $\tilde p_{xy}$ is the probability of the pair $(x, y) \in [K] \times [K]$ under $\tilde p$, and $\tilde p_Y = (p'_1, \dots, p'_K)$ is the second marginal of $\tilde p$. Thus,
$$P = \{\, \tilde p = [\tilde p_{xy}]_{x,y \in [K]} : \tilde p_{xy} \ge 0,\ \textstyle\sum_y \tilde p_{xy} = p_x \text{ for all } x \in [K] \,\}.$$
Consequently, $P$ is convex and bounded.

We will show that the maxima of $\chi^2$ over $P$ have value $K$ and are attained when the second random variable is a one-to-one function of the first. We call such a point optimal; the set of optimal points of $P$ is denoted by $E^*$. Any element $\tilde p^\pi$ of $E^*$ is defined as
$$\tilde p^\pi_{kk'} = \begin{cases} p_k, & k' = \pi(k) \\ 0, & \text{otherwise}\end{cases}$$
where $\pi$ is a permutation of the indices $1, 2, \dots, K$. In the following it will be proved that if a joint distribution $\tilde p$ in $P$ is more than $\epsilon$ away from any optimal point, then $\chi^2(\tilde p)$ will be bounded away from $K$.

Theorem 6 For two clusterings represented by the joint distribution $p_{XY}$, denote $p_{min} = \min_{x \in [K]} p_x$ and $p_{max} = \max_{x \in [K]} p_x$. Then, for any $\epsilon \le p_{min}$, if $d_{\chi^2}(p_{XY}) \le \frac{\epsilon}{p_{max}}$ then $d_{ME}(p_{XY}) \le \epsilon$.

Outline of proof For a fixed $\pi$, we denote the corresponding optimal point by $\tilde p^\pi$, and the point which differs from $\tilde p^\pi$ by $\epsilon$ in the cells $p_{aa}$, $p_{ab}$ by $\tilde p^{\epsilon,\pi}(a, b)$. Below is the definition of $\tilde p^{\epsilon,\pi}$ in the case of the identity permutation. In what follows, whenever we consider one optimal point only, we shall assume w.l.o.g. that $\pi$ is the identity permutation and omit it from the notation.
$$[\tilde p^{\epsilon}(a, b)]_{xy} = \begin{cases} \epsilon, & x = a,\ y = b \\ p_a - \epsilon, & x = y = a \\ p_x, & x = y \neq a \\ 0, & \text{otherwise}\end{cases}\qquad(3.1)$$
and thus
$$[\tilde p^{\pi} - \tilde p^{\epsilon}(a, b)]_{xy} = \begin{cases} \epsilon, & x = y = a \\ -\epsilon, & x = a,\ y = b \\ 0, & \text{otherwise}\end{cases}\qquad(3.2)$$

For $\epsilon \le p_{min} = \min_x p_x$, let $E^{\epsilon\pi} = \{\tilde p^{\epsilon,\pi}(a, b) : (a, b) \in [K] \times [K],\ a \neq b\}$. We bound from above the value of $\chi^2$ at all points in $E^{\epsilon\pi}$; then we show that if $d_{ME}$ is greater than $\epsilon$, the value of $\chi^2$ cannot be larger than this bound. These results will be proved as a series of lemmas, after which the formal proof of the theorem will close this section.

Lemma 7 (i) The set of extreme points of $P$ is
$$E = \{\, \tilde p^\phi \;:\; \phi : [K] \to [K'],\ \tilde p^\phi_{xy} = p_x \text{ if } y = \phi(x),\ 0 \text{ otherwise} \,\} \qquad(3.3)$$
(ii) For $\tilde p^\phi \in E$, $\chi^2(\tilde p^\phi) = |\mathrm{Range}\,\phi|$.

Proof The proof of (i) is immediate and left to the reader. To prove (ii), let $\tilde p^\phi \in E$. We can write successively
$$\chi^2(\tilde p^\phi) \;=\; \sum_{y:\, p'_y > 0} \; \sum_{x \in \phi^{-1}(y)} \frac{p_x^2}{p_x \sum_{z \in \phi^{-1}(y)} p_z} \;=\; \sum_{y:\, p'_y > 0} \frac{\sum_{x \in \phi^{-1}(y)} p_x}{\sum_{z \in \phi^{-1}(y)} p_z} \;=\; \sum_{y:\, p'_y > 0} 1 \;=\; |\mathrm{Range}\,\phi|$$

If $|\mathrm{Range}\,\phi| = K$, then $\phi$ is a permutation and we denote it by $\pi$. Let $E^* = \{\tilde p^\pi\}_{\pi \in \Pi_K}$ be the set of extreme points for which $\chi^2 = K$, and $E' = E \setminus E^*$ the set of extreme points for which $\chi^2 \le K - 1$.
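As a quick illustration (not part of the paper), the extreme-point formula of Lemma 7(ii) can be checked numerically for an arbitrary map φ; the marginal p_X and the random φ below are illustrative choices.

```python
# Illustrative check of Lemma 7(ii): chi^2 at an extreme point equals |Range(phi)|.
import numpy as np

rng = np.random.default_rng(1)
p_X = np.array([0.1, 0.2, 0.3, 0.4])
K = Kp = len(p_X)
phi = rng.integers(0, Kp, size=K)           # an arbitrary map phi: [K] -> [K']
p = np.zeros((K, Kp))
p[np.arange(K), phi] = p_X                  # p_xy = p_x if y = phi(x), 0 otherwise
p_Y = p.sum(axis=0)
nz = p_Y > 0                                # sum only over non-empty columns
chi2 = (p[:, nz] ** 2 / np.outer(p_X, p_Y[nz])).sum()
assert np.isclose(chi2, len(np.unique(phi)))
```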

Lemma 8 Let $B_1(r)$ be the 1-norm ball of radius $r$ centered at $\tilde p^\pi \in E^*$. Then
$$B_1(2\epsilon) \cap P = \mathrm{convex}\big(\{\tilde p^\pi\} \cup E^{\epsilon\pi}\big)$$

Proof First we show that $\|\tilde p^\pi - \tilde p^{\epsilon,\pi}(a, b)\|_1 = 2\epsilon$:
$$\|\tilde p^\pi - \tilde p^{\epsilon,\pi}(a, b)\|_1 = \sum_{x,y} \big|\tilde p^\pi_{xy} - \tilde p^{\epsilon,\pi}(a, b)_{xy}\big| = \big|\tilde p^\pi_{aa} - \tilde p^{\epsilon,\pi}(a, b)_{aa}\big| + \big|\tilde p^\pi_{ab} - \tilde p^{\epsilon,\pi}(a, b)_{ab}\big| = \epsilon + \epsilon = 2\epsilon \qquad(3.4)$$
For any point $\tilde p \in B_1(2\epsilon) \cap P$ denote
$$e_\epsilon = \frac{1}{\epsilon} \sum_x \sum_{y \neq x} \tilde p_{xy}$$
Then it is easy to check that
$$\tilde p = (1 - e_\epsilon)\, \tilde p^\pi + \sum_a \sum_{b \neq a} \frac{\tilde p_{ab}}{\epsilon}\, \tilde p^{\epsilon,\pi}(a, b)$$
and
$$(1 - e_\epsilon) + \sum_a \sum_{b \neq a} \frac{\tilde p_{ab}}{\epsilon} = 1$$

Lemma 9 For all $\tilde p \in B_1(2\epsilon) \cap P$, $d_{ME}(\tilde p) \le \epsilon$.

Proof Obvious, since $d_{ME}(\tilde p) \le \sum_x \sum_{y \neq x} \tilde p_{xy} = \epsilon\, e_\epsilon \le \epsilon$.

Lemma 10 Let $x = \sum_i \alpha_i x_i$ with $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$, and, for all $i$, let $y_i$ be a point of the segment $(x, x_i]$. Then $x$ is a convex combination of $\{y_i\}$.

Proof Let $y_i = \beta_i x + (1 - \beta_i) x_i$, $\beta_i \in [0, 1)$. Then
$$x_i = \frac{y_i - \beta_i x}{1 - \beta_i}$$
and replacing the above in the expression of $x$ we get successively
$$x = \sum_i \left[ \frac{\alpha_i}{1 - \beta_i}\, y_i - \frac{\alpha_i \beta_i}{1 - \beta_i}\, x \right] \qquad(3.5)$$
$$= \sum_i \frac{\alpha_i}{1 - \beta_i}\, y_i \;-\; x \sum_i \frac{\alpha_i \beta_i}{1 - \beta_i} \qquad(3.6)$$
Hence
$$x = \sum_i \underbrace{\frac{\dfrac{\alpha_i}{1 - \beta_i}}{1 + \sum_j \dfrac{\alpha_j \beta_j}{1 - \beta_j}}}_{\gamma_i}\, y_i \qquad(3.7)$$
with $\gamma_i \ge 0$ and
$$\sum_i \gamma_i = \frac{\sum_i \dfrac{\alpha_i}{1 - \beta_i}}{1 + \sum_j \dfrac{\alpha_j \beta_j}{1 - \beta_j}} = 1 \qquad(3.8)$$
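Before moving on, the convex decomposition used in the proof of Lemma 8 can be verified directly. The sketch below (illustrative, not from the paper) builds a random point of P whose off-diagonal mass is at most ε and reconstructs it as the stated convex combination, for the identity permutation; p_X, ε, and the random perturbation are arbitrary choices.

```python
# Illustrative check of the decomposition in the proof of Lemma 8 (identity pi).
import numpy as np

rng = np.random.default_rng(2)
p_X = np.array([0.25, 0.35, 0.40])
K, eps = len(p_X), 0.1                           # eps <= p_min

def p_eps(a, b):
    """The point p~^eps(a,b) of (3.1): mass eps moved from cell (a,a) to (a,b)."""
    q = np.diag(p_X).copy()
    q[a, a] -= eps
    q[a, b] += eps
    return q

# a point of P within the 1-norm ball B_1(2*eps) around diag(p_X)
p = np.diag(p_X).copy()
for a in range(K):
    for b in range(K):
        if a != b:
            m = rng.random() * eps / (2 * K)     # small mass leaks from (a,a) to (a,b)
            p[a, a] -= m
            p[a, b] += m

off_diag = ~np.eye(K, dtype=bool)
e_eps = p[off_diag].sum() / eps                  # relative off-diagonal mass, <= 1
recon = (1 - e_eps) * np.diag(p_X) + sum(
    (p[a, b] / eps) * p_eps(a, b) for a in range(K) for b in range(K) if a != b)
assert np.allclose(recon, p)
```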

Lemma 11 The set $A_\epsilon = \{\tilde p \in P : d_{ME}(\tilde p) \ge \epsilon\}$, with $\epsilon \le p_{min}$, is included in the convex hull of $\bigcup_{\pi \in \Pi_K} E^{\epsilon\pi} \,\cup\, E'$.

Proof Let $\tilde p \in A_\epsilon$. Because $\tilde p \in P$, it is a convex combination of the extreme points of $P$; it can be written as
$$\tilde p = \sum_{i=1}^{|E|} \alpha_i\, \tilde p_i, \qquad \alpha_i \ge 0,\ \sum_i \alpha_i = 1$$
$$= \sum_{i=1}^{K!} \alpha_i\, \tilde p^{\pi_i} + \sum_{i=1}^{|E'|} \alpha_{i+K!}\, \tilde p'_i$$
Let us look at the segment $[\tilde p, \tilde p^{\pi_i}]$; its first end, $\tilde p$, is in $A_\epsilon$, while its other end is outside $A_\epsilon$ and inside the ball $B_1^{\pi_i}(2\epsilon)$. As the ball is convex, there is a (unique) point $\bar p_i = [\tilde p, \tilde p^{\pi_i}] \cap \partial B_1^{\pi_i}(2\epsilon)$. This point, being on the boundary of the ball, can be written as a convex combination of points in $E^{\epsilon\pi_i}$ by Lemma 8. We now apply Lemma 10, with $y_i = \bar p_i$ for $i = 1, \dots, K!$ and $y_i = \tilde p'_{i - K!}$ for $i > K!$. It follows that $\tilde p$ is a convex combination of the $y_i$, $i = 1, \dots, |E|$, which completes the proof.

Lemma 12 For $\epsilon \le p_{min}$,
$$\chi^2(\tilde p^\pi) - \chi^2(\tilde p^{\epsilon,\pi}(a, b)) \ge \frac{\epsilon}{p_{max}}$$

Proof Compute $\chi^2(\tilde p^\epsilon(a, b))$:
$$\chi^2(\tilde p^\epsilon(a, b)) = K - 2 + \frac{(p_a - \epsilon)^2}{p_a (p_a - \epsilon)} + \frac{\epsilon^2}{p_a (p_b + \epsilon)} + \frac{p_b^2}{p_b (p_b + \epsilon)} \qquad(3.9)$$
$$= K - 2 + 1 - \frac{\epsilon}{p_a} + \frac{\epsilon^2}{p_a (p_b + \epsilon)} + 1 - \frac{\epsilon}{p_b + \epsilon} \qquad(3.10)$$
$$= K - \frac{\epsilon\, (p_a + p_b)}{p_a (p_b + \epsilon)} \qquad(3.11)$$
$$\le K - \frac{\epsilon}{p_a} \qquad(3.12)$$
Therefore
$$\chi^2(\tilde p^\pi) - \chi^2(\tilde p^\epsilon(a, b)) \ge \frac{\epsilon}{p_a} \ge \frac{\epsilon}{p_{max}}$$

Proof of Theorem 6 By contradiction. Assume $d_{ME}(\tilde p) > \epsilon$. Then $\tilde p \in A_\epsilon$ and, by Lemma 11, $\tilde p$ lies in the convex hull of $E' \cup (\bigcup_\pi E^{\epsilon\pi})$. Since $\chi^2$ is convex, $\chi^2(\tilde p)$ cannot be larger than its maximum value at the extreme points of this hull, which are contained in $E' \cup (\bigcup_\pi E^{\epsilon\pi})$. But we know by Lemma 12 that the value of $\chi^2$ is bounded above by $K - \epsilon/p_{max}$ at any point in $E^{\epsilon\pi}$, and by $K - 1 \le K - \epsilon/p_{max}$ at any point in $E'$ (since $\epsilon \le p_{max}$). Hence $\chi^2(\tilde p) \le K - \epsilon/p_{max}$, that is, $d_{\chi^2}(\tilde p) \ge \epsilon/p_{max}$, which contradicts the hypothesis of the theorem.

Note also that a tight, non-linear bound can be obtained by maximizing (3.11) over all $a, b$.
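To close the section, here is a Monte Carlo sanity check (an illustration added here, not part of the paper) of Theorem 6 as reconstructed above: joint distributions with a fixed row marginal are sampled near the diagonal and, whenever the hypothesis d_χ² ≤ ε/p_max holds, the conclusion d_ME ≤ ε is asserted. The marginal, the Dirichlet concentration, and the number of trials are arbitrary illustrative choices.

```python
# Monte Carlo sanity check (illustrative) of Theorem 6 for K = K' = 3.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
p_X = np.array([0.2, 0.3, 0.5])
K = len(p_X)
eps = p_X.min()                                   # Theorem 6 requires eps <= p_min
hits = 0
for _ in range(20000):
    # random joint in P (row marginal p_X), concentrated near the diagonal
    alpha = np.ones((K, K)) + 60.0 * np.eye(K)
    p = np.vstack([p_X[a] * rng.dirichlet(alpha[a]) for a in range(K)])
    d_chi2 = K - (p**2 / np.outer(p_X, p.sum(axis=0))).sum()
    r, c = linear_sum_assignment(-p)
    d_ME = 1.0 - p[r, c].sum()
    if d_chi2 <= eps / p_X.max():                 # hypothesis of Theorem 6
        hits += 1
        assert d_ME <= eps + 1e-9                 # conclusion of Theorem 6
print("trials satisfying the hypothesis:", hits)
```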

4 Small $d_{ME}$ implies small $d_{\chi^2}$

Theorem 13 Let $p_{XY}$ represent a pair of clusterings with $d_{ME}(p_{XY}) \le \epsilon_0$. Then
$$d_{\chi^2}(p_{XY}) \le \frac{2\epsilon_0}{p_{min}}$$

The proof is based on the fact that a convex function always lies above any tangent to its graph. We pick a point $\tilde p$ that has $d_{ME}(\tilde p) = \epsilon$ and lower bound $\chi^2(\tilde p)$ by the tangent to $\chi^2$ at the nearest optimal point $\tilde p^*$. We start by proving three lemmas, then follow with the formal proof of the theorem.

Lemma 14 The unconstrained partial derivatives of $\chi^2$ at $\tilde p^*$ are
$$\left. \frac{\partial \chi^2}{\partial p_{xy}} \right|_{\tilde p^*} = \begin{cases} -\dfrac{1}{p_y}, & x \neq y \\[4pt] \dfrac{1}{p_x}, & x = y \end{cases} \qquad(4.1)$$

Proof
$$\frac{\partial \chi^2}{\partial p_{ab}} = \frac{\partial}{\partial p_{ab}} \sum_x \frac{1}{p_x} \sum_y \frac{p_{xy}^2}{p'_y} = \frac{\partial}{\partial p_{ab}} \left[ \frac{1}{p_a} \left( \frac{p_{ab}^2}{p'_b} + \sum_{y \neq b} \frac{p_{ay}^2}{p'_y} \right) + \sum_{x \neq a} \frac{1}{p_x} \sum_y \frac{p_{xy}^2}{p'_y} \right] \qquad(4.2)$$
$$= \frac{1}{p_a} \cdot \frac{2 p_{ab}\, p'_b - p_{ab}^2}{p'^{\,2}_b} - \sum_{x \neq a} \frac{1}{p_x} \cdot \frac{p_{xb}^2}{p'^{\,2}_b} \qquad(4.3)$$
$$= \frac{2 p_{ab}}{p_a\, p'_b} - \sum_x \frac{p_{xb}^2}{p_x\, p'^{\,2}_b} \qquad(4.4)$$
The result follows now by setting $p_{xb} = p_x \delta_{xb}$ and $p'_b = p_b$.
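The derivative formula of Lemma 14 can be checked by finite differences; the short sketch below (illustrative, not from the paper) follows the convention of the proof, holding the row marginals p_x fixed in the denominator while recomputing the column marginals from p. The marginal p_X and the step size h are arbitrary.

```python
# Finite-difference check (illustrative) of the derivatives in Lemma 14 at p~* = diag(p_X).
import numpy as np

p_X = np.array([0.2, 0.3, 0.5])
K, h = len(p_X), 1e-6

def chi2(p):
    # chi^2 with row marginals fixed at p_X and column marginals recomputed from p,
    # matching the "unconstrained" differentiation in the proof of Lemma 14
    return (p**2 / np.outer(p_X, p.sum(axis=0))).sum()

p_star = np.diag(p_X)
for a in range(K):
    for b in range(K):
        e = np.zeros((K, K))
        e[a, b] = h
        grad = (chi2(p_star + e) - chi2(p_star - e)) / (2 * h)   # central difference
        expected = 1 / p_X[a] if a == b else -1 / p_X[b]
        assert np.isclose(grad, expected, atol=1e-4)
```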

Lemma 15 For any $\tilde p \in P$,
$$\chi^2(\tilde p^*) - \chi^2(\tilde p) \le \sum_x \sum_{y \neq x} \left( \frac{\tilde p_{xy}}{p_x} + \frac{\tilde p_{xy}}{p_y} \right)$$

Proof $\chi^2$ is convex, therefore $\chi^2(\tilde p)$ lies above the tangent at $\tilde p^*$, i.e.
$$\chi^2(\tilde p) \ge \chi^2(\tilde p^*) + \mathrm{vec}\big(\nabla \chi^2(\tilde p^*)\big) \cdot \mathrm{vec}(\tilde p - \tilde p^*) \qquad(4.5)$$
By Lemma 14,
$$\mathrm{vec}\big(\nabla \chi^2(\tilde p^*)\big) \cdot \mathrm{vec}(\tilde p - \tilde p^*) = \sum_x \frac{\tilde p_{xx} - p_x}{p_x} - \sum_x \sum_{y \neq x} \frac{\tilde p_{xy}}{p_y} \qquad(4.6)$$
$$= -\sum_x \sum_{y \neq x} \left( \frac{\tilde p_{xy}}{p_x} + \frac{\tilde p_{xy}}{p_y} \right) = -\sum_x \epsilon_x - \sum_y \epsilon'_y \qquad(4.7)$$
where
$$\epsilon_x = \frac{1}{p_x} \sum_{y \neq x} \tilde p_{xy}, \quad x \in [K] \qquad(4.8)$$
$$\epsilon'_y = \frac{1}{p_y} \sum_{x \neq y} \tilde p_{xy}, \quad y \in [K] \qquad(4.9)$$
These quantities represent the relative leak of probability mass from the diagonal to the off-diagonal cells in row $x$, respectively in column $y$, of the matrix $\tilde p$ w.r.t. $\tilde p^*$.

Lemma 16 Let $\epsilon_x$, $x \in [K]$, be as defined above, and assume that the marginals $p_x$ are sorted so that $p_{min} = p_1 \le p_2 \le p_3 \le \dots \le p_K = p_{max}$, with $\sum_x p_x \epsilon_x = \epsilon$. Then
$$\max_{\{\epsilon_x\}} \sum_x \epsilon_x = \begin{cases} \dfrac{\epsilon}{p_1}, & \text{if } \epsilon \in [0, p_1] \\[6pt] 1 + \dfrac{\epsilon - p_1}{p_2}, & \text{if } \epsilon \in (p_1, p_1 + p_2] \\[2pt] \quad\vdots \\ k + \dfrac{\epsilon - \sum_{x \le k} p_x}{p_{k+1}}, & \text{if } \epsilon \in (p_1 + \dots + p_k,\ p_1 + \dots + p_{k+1}] \end{cases}$$

Proof It is easy to verify the solution for $\epsilon \le p_1$. For the other intervals, one verifies the solution by induction over $k \in [K]$.

Proof of Theorem 13 Assume that $d_{ME}(\tilde p) = \epsilon \le \epsilon_0$. Then, w.l.o.g., one can assume that the off-diagonal elements of $\tilde p$ sum to $\epsilon$. It is easy to see that, under the conditions of Lemma 16,
$$\sum_x \epsilon_x \le \frac{\epsilon}{p_{min}}$$
By symmetry, this bound also holds for $\sum_y \epsilon'_y$. Therefore, by Lemma 15,
$$\chi^2(\tilde p^*) - \chi^2(\tilde p) \le \frac{2\epsilon}{p_{min}} \qquad(4.10)$$
or
$$d_{\chi^2}(\tilde p) \le \frac{2\epsilon}{p_{min}} \le \frac{2\epsilon_0}{p_{min}}$$
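As with Theorem 6, the bound of Theorem 13 (in the reconstruction above) is easy to exercise numerically. The following sketch (illustrative, not from the paper) samples joint distributions near the diagonal and checks d_χ² ≤ 2 d_ME / p_min with ε₀ taken equal to d_ME; the marginal and constants are arbitrary.

```python
# Monte Carlo sanity check (illustrative) of Theorem 13 with eps_0 = d_ME, K = K' = 3.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(3)
p_X = np.array([0.2, 0.3, 0.5])
K = len(p_X)
for _ in range(20000):
    # random joint with row marginal p_X, concentrated near the diagonal
    alpha = np.ones((K, K)) + 40.0 * np.eye(K)
    p = np.vstack([p_X[a] * rng.dirichlet(alpha[a]) for a in range(K)])
    d_chi2 = K - (p**2 / np.outer(p_X, p.sum(axis=0))).sum()
    r, c = linear_sum_assignment(-p)
    d_ME = 1.0 - p[r, c].sum()
    assert d_chi2 <= 2 * d_ME / p_X.min() + 1e-9   # Theorem 13
```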

5 Remarks

Although the original motivation for this work stems from comparing partitions, we have proved a result which holds for any two finite-valued random variables. In particular, the two theorems give lower and upper bounds on the χ² measure of independence between two random variables, holding locally when the two variables are strongly dependent.

The present approximation complements an older approximation of χ² by the mutual information $I_{XY} = \sum_{xy} p_{xy} \ln \frac{p_{xy}}{p_x p'_y}$. It is known [Cover and Thomas, 1991] that the second order Taylor approximation of $I_{XY}$ is $\frac{1}{2}(\chi^2(p_{XY}) - 1)$, with $\chi^2$ defined as in (2.2). This approximation is good around $p_{XY} = p_X p_Y$, hence in the weak dependence region.

The non-linear bound (3.11) in Theorem 6 is tight. The proofs hold when the condition $K' = K$ is replaced by $K' \ge K$, or even by $K' = \infty$. It can be seen that both sets of bounds are tighter, and hold for a larger range of $\epsilon$, when the clusterings have approximately equal clusters, that is, when $p_{min}$ and $p_{max}$ both approach $1/K$. This confirms the general intuition that clusterings with equal sized clusters are "easier" (and its counterpart, that clusterings containing very small clusters are "hard").

Finally, a useful property of the theorems presented here is that they involve the values $p_{min}$, $p_{max}$ of one clustering only. Hence they can be applied in cases when only one clustering is known. For example, [Meilă et al., 2005] used this result in the context of spectral clustering, to prove that any clustering with low enough normalized cut is close to the (unknown) optimal clustering of that data set.

References

[Bach and Jordan, 2004] Bach, F. and Jordan, M. I. (2004). Learning spectral clustering. In Thrun, S. and Saul, L., editors, Advances in Neural Information Processing Systems 16, Cambridge, MA. MIT Press.

[Cover and Thomas, 1991] Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley.

[Hubert and Arabie, 1985] Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2:193-218.

[Meila, 2005] Meila, M. (2005). Comparing clusterings: an axiomatic view. In Wrobel, S. and De Raedt, L., editors, Proceedings of the International Machine Learning Conference (ICML). Morgan Kaufmann.

[Meilă, 2006] Meilă, M. (2006). Comparing clusterings: an information based metric. Journal of Multivariate Analysis. (In press.)

[Meilă et al., 2005] Meilă, M., Shortreed, S., and Xu, L. (2005). Regularized spectral learning. In Cowell, R. and Ghahramani, Z., editors, Proceedings of the Artificial Intelligence and Statistics Workshop (AISTATS 05).

[Papadimitriou and Steiglitz, 1998] Papadimitriou, C. and Steiglitz, K. (1998). Combinatorial Optimization: Algorithms and Complexity. Dover Publications, Inc., Mineola, NY.

[Vajda, 1989] Vajda, I. (1989). Theory of Statistical Inference and Information. Theory and Decision Library, Series B: Mathematical and Statistical Methods. Kluwer Academic Publishers, Norwell, MA.