
The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ² distance

Marina Meilă
University of Washington, Department of Statistics
Box 354322, Seattle, WA 98195-4322
phone: (206) 543-8484, e-mail: mmp@stat.washington.edu

February 20, 2006

Abstract

We prove that the above two distances between partitions of a finite set are equivalent in a neighborhood of 0. In other words, if the two partitions are very similar, then $d_{\chi^2}$ defines upper and lower bounds on $d_{ME}$ and vice versa. The proof is geometric and relies on the convexity of a certain set of probability measures. The motivation for this work is in the area of data clustering, where these distances are frequently used to compare two clusterings of a set of observations. Moreover, our result applies to any pair of finite-valued random variables, and provides simple yet tight upper and lower bounds on the χ² measure of (in)dependence that are valid when the two variables are strongly dependent.

1 Motivation

Clustering, or finding partitions in data, has become an increasingly popular part of data analysis. In order to study clustering theoretically, or to assess its behaviour empirically, one needs to compare clusterings of a finite set in a meaningful way. The Misclassification Error and the χ² distance are two distinct criteria for comparing clusterings, the first widely used in the computer science literature on clustering and the second originating in statistics. Here we show that these two distances are equivalent in the case when the two partitions are very similar. In other words, if $d_{ME}$ is small, then $d_{\chi^2}$ is small too, and vice versa. This result is, to my knowledge, the first to give a detailed local comparison of two distances between partitions.

The case of small distances is of utmost importance, as it is in this regime that one desires the behaviour of any clustering algorithm to lie. Therefore, this proof provides a theoretical tool for the analysis of algorithms' behaviour and of clustering criteria. In the empirical evaluation of clustering algorithms, understanding the small-distance case allows one to make fine distinctions between various algorithms. The present equivalence theorems represent a step towards removing the dependence of the evaluation outcome on the choice of distance.

2 Definitions and representation

We consider a finite set $D_n$ of cardinality $n$. A clustering is a partition of $D_n$ into sets $C_1, C_2, \dots, C_K$ called clusters, such that $C_k \cap C_l = \emptyset$ for $k \neq l$ and $\bigcup_{k=1}^K C_k = D_n$. Let the cardinality of cluster $C_k$ be $n_k$; we have, of course, $n = \sum_{k=1}^K n_k$. We also assume that $n_k > 0$; in other words, $K$ represents the number of non-empty clusters.

Representing clusterings as matrices. W.l.o.g. the set $D_n$ can be taken to be $\{1, 2, \dots, n\} \stackrel{def}{=} [n]$. Denote by $X$ a clustering $\{C_1, C_2, \dots, C_K\}$; $X$ can be represented by the $n \times K$ matrix $A_X$ with $A_{ik} = 1$ if $i \in C_k$ and 0 otherwise. In this representation, the columns of $A_X$ are indicator vectors of the clusters and are orthogonal.

Representing clusterings as random variables. The clustering $X$ can also be represented as the random variable (denoted abusively by) $X : [n] \to [K]$ taking value $k \in [K]$ with probability $n_k/n$.

One typically requires distances between partitions to be invariant to permutations of the labels $1, \dots, K$. Under this representation, any distance between two clusterings can be seen as a particular type of distance between random variables which is invariant to permutations.

Let a second clustering of $D_n$ be $Y = \{C'_1, C'_2, \dots, C'_{K'}\}$, with cluster sizes $n'_{k'}$. Note that the two clusterings may have different numbers of clusters.

Lemma 1 The joint distribution of the variables $X, Y$ is given by
$$p_{XY} = \frac{1}{n} A_X^T A_Y \qquad(2.1)$$
In other words, $p_{XY}(x, y)$ is the $(x, y)$-th element of the $K \times K'$ matrix in (2.1). In the above, the superscript $(\cdot)^T$ denotes matrix transposition. The proof is immediate and is left to the reader.

We now define the two distances between clusterings in terms of the joint probability matrix defined above.

Definition 2 The misclassification error distance $d_{ME}$ between the clusterings $X, Y$ (with $K \le K'$) is
$$d_{ME}(X, Y) = 1 - \max_{\pi \in \Pi_{K'}} \sum_{x \in [K]} p_{XY}(x, \pi(x))$$
where $\Pi_{K'}$ is the set of all permutations of $K'$ objects. Although the maximization above is over a set of size $K'!$, $d_{ME}$ can be computed in polynomial time by a maximum bipartite matching algorithm [Papadimitriou and Steiglitz, 1998]. It can be shown that $d_{ME}$ is a metric (see e.g. [Meila, 2005]). This distance is widely used in the computer science literature on clustering, due to its direct relationship with the misclassification error cost of classification. It has indeed very appealing properties as long as $X, Y$ are close [Meilă, 2006]; otherwise, its poor resolution represents a major hindrance.

Definition 3 The χ² distance $d_{\chi^2}$ is defined as
$$d_{\chi^2}(X, Y) = \min(K, K') - \chi^2(p_{XY}) \quad\text{with}\quad \chi^2(p_{XY}) = \sum_{x,y} \frac{p_{XY}(x, y)^2}{p_X(x)\, p_Y(y)} \qquad(2.2)$$
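For concreteness, here is a minimal computational sketch (an illustration added here, not part of the paper) of the quantities just defined, assuming NumPy and SciPy are available; the toy label vectors are arbitrary, and the maximum bipartite matching of Definition 2 is delegated to scipy.optimize.linear_sum_assignment.

```python
# Illustrative sketch (not from the paper): the joint p_XY of Lemma 1 and the two
# distances of Definitions 2 and 3, for two hypothetical clusterings of n = 6 points.
import numpy as np
from scipy.optimize import linear_sum_assignment

x = np.array([0, 0, 1, 1, 2, 2])              # clustering X, K = 3 clusters
y = np.array([0, 0, 1, 2, 2, 2])              # clustering Y, K' = 3 clusters
n, K, Kp = len(x), 3, 3

A_X = np.eye(K)[x]                            # n x K indicator matrix A_X
A_Y = np.eye(Kp)[y]                           # n x K' indicator matrix A_Y
p_XY = A_X.T @ A_Y / n                        # Lemma 1: joint distribution of (X, Y)
p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)

# Definition 2: d_ME via a maximum-weight bipartite matching
rows, cols = linear_sum_assignment(-p_XY)     # maximize the matched probability mass
d_ME = 1.0 - p_XY[rows, cols].sum()

# Definition 3: chi^2 statistic and the chi^2 distance
chi2 = (p_XY**2 / np.outer(p_X, p_Y)).sum()
d_chi2 = min(K, Kp) - chi2

print(d_ME, d_chi2)                           # here: 1/6 and 2/3
```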

The above definition and notation are motivated as follows.

Lemma 4 Let $p_X = (p_x)_{x \in [K]}$, $p_Y = (p'_y)_{y \in [K']}$ be the marginals of $p_{XY}$. Then the function $\chi^2(p_{XY})$ defined in (2.2) equals the functional $\chi^2(f, g) + 1$ applied to $f = p_{XY}$, $g = p_X p_Y$.

Proof Denote $p_{xy} = p_{XY}(x, y)$. Then
$$\chi^2(f, g) = \sum_{x,y} \frac{(p_{xy} - p_x p'_y)^2}{p_x p'_y} = \sum_{x,y}\left[\frac{p_{xy}^2}{p_x p'_y} - 2 p_{xy} + p_x p'_y\right] = \sum_{x,y}\frac{p_{xy}^2}{p_x p'_y} - 2 + 1$$

Hence, $d_{\chi^2}$ is a measure of independence. It is equal to 0 when the random variables $X, Y$ are identical up to a permutation of the labels, while $\chi^2(p_{XY})$ equals 1 when they are independent. From Lemma 4 one can see that $d_{\chi^2}$ is non-negative. This distance, with slight variants, has been used as a distance between partitions by [Hubert and Arabie, 1985, Bach and Jordan, 2004], with the obvious motivation of being related to the familiar χ² functional. The following lemma gives another, technical motivation for paying attention to $d_{\chi^2}$.

Lemma 5 Let $\tilde A_X$, $\tilde A_Y$ be the normalized matrix representations of $X, Y$, defined by $\tilde A_X(i, k) = \frac{1}{\sqrt{n_k}}$ if $i \in C_k$ and 0 otherwise; hence $\tilde A_X$ (respectively $\tilde A_Y$) has orthonormal columns. Then
$$\chi^2(p_{XY}) = \|\tilde A_X^T \tilde A_Y\|_F^2 \qquad(2.3)$$
where $\|\cdot\|_F$ denotes the Frobenius norm.

Proof Note that $(\tilde A_X^T \tilde A_Y)_{xy} = \frac{p_{xy}}{\sqrt{p_x p'_y}}$.

The above lemma shows that the $d_{\chi^2}$ distance is a quadratic function, making it a convenient instrument in proofs. Contrast this with the apparently simple $d_{ME}$ distance, which is not everywhere differentiable and is theoretically much harder to analyze.

We close this section by noting that $d_{\chi^2}$ is concave in $p_{XY}$ while $d_{ME}$ is convex. For $d_{\chi^2}$, this follows from the convexity of the χ² functional [Vajda, 1989]. The distance $d_{ME}$ can be expressed as the minimum of a set of linear functions¹; therefore it is convex, which completes the argument.

¹ $d_{ME}$ is the minimum of the off-diagonal mass of $p_{XY}$ over all permutations.
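Both identities are easy to verify numerically. The following self-contained snippet (an illustration added here, not part of the paper) checks Lemma 4 and Lemma 5 on the same toy clusterings used in the sketch above.

```python
# Numerical check (illustrative) of Lemma 4 and Lemma 5.
import numpy as np

x = np.array([0, 0, 1, 1, 2, 2])
y = np.array([0, 0, 1, 2, 2, 2])
n, K, Kp = len(x), 3, 3
A_X, A_Y = np.eye(K)[x], np.eye(Kp)[y]
p_XY = A_X.T @ A_Y / n
p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)
chi2 = (p_XY**2 / np.outer(p_X, p_Y)).sum()

# Lemma 4: chi^2(p_XY) equals the chi^2 functional of (p_XY, p_X p_Y) plus 1
indep = np.outer(p_X, p_Y)
assert np.isclose(chi2, ((p_XY - indep)**2 / indep).sum() + 1.0)

# Lemma 5: chi^2(p_XY) equals the squared Frobenius norm of A~_X^T A~_Y,
# where the columns of A_X, A_Y are rescaled by 1/sqrt(n_k)
A_Xn = A_X / np.sqrt(A_X.sum(axis=0))
A_Yn = A_Y / np.sqrt(A_Y.sum(axis=0))
assert np.isclose(chi2, np.linalg.norm(A_Xn.T @ A_Yn, 'fro') ** 2)
```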

3 Small $d_{\chi^2}$ implies small $d_{ME}$

To prove this statement, we adopt the following framework. First, for simplicity, we assume that $K' = K$; the generalization to $K' \neq K$ is straightforward. Second, we assume w.l.o.g. that partition $X$ is fixed, while $Y$ is allowed to vary. In terms of random variables, the two assumptions describe the set of distributions over $[K] \times [K]$ that have a fixed marginal $p_X = (p_1, \dots, p_K)$. We denote this domain by $P$. In the rest of the section we adopt the following notation: $\tilde p$ represents a distribution from $P$, $\tilde p_{xy}$ is the probability of the pair $(x, y) \in [K] \times [K]$ under $\tilde p$, and $\tilde p_Y = (p'_1, \dots, p'_K)$ is the second marginal of $\tilde p$. Thus,
$$P = \{\, \tilde p = [\tilde p_{xy}]_{x,y \in [K]} : \tilde p_{xy} \ge 0,\ \textstyle\sum_y \tilde p_{xy} = p_x \text{ for all } x \in [K] \,\}.$$
Consequently, $P$ is convex and bounded.

We will show that the maxima of $\chi^2$ over $P$ have value $K$ and are attained when the second random variable is a one-to-one function of the first. We call such a point optimal; the set of optimal points of $P$ is denoted by $E^*$. Any element $\tilde p^\pi$ of $E^*$ is defined as
$$\tilde p^\pi_{kk'} = \begin{cases} p_k, & k' = \pi(k) \\ 0, & \text{otherwise}\end{cases}$$
where $\pi$ is a permutation of the indices $1, 2, \dots, K$. In the following it will be proved that if a joint distribution $\tilde p$ in $P$ is more than $\epsilon$ away from any optimal point, then $\chi^2(\tilde p)$ will be bounded away from $K$.

Theorem 6 For two clusterings represented by the joint distribution $p_{XY}$, denote $p_{min} = \min_{x \in [K]} p_x$ and $p_{max} = \max_{x \in [K]} p_x$. Then, for any $\epsilon \le p_{min}$, if $d_{\chi^2}(p_{XY}) \le \frac{\epsilon}{p_{max}}$ then $d_{ME}(p_{XY}) \le \epsilon$.

Outline of proof For a fixed $\pi$, we denote the corresponding optimal point by $\tilde p^\pi$, and the point which differs from $\tilde p^\pi$ by $\epsilon$ in the cells $p_{aa}$, $p_{ab}$ by $\tilde p^{\epsilon,\pi}(a, b)$. Below is the definition of $\tilde p^{\epsilon,\pi}$ in the case of the identity permutation. In what follows, whenever we consider one optimal point only, we shall assume w.l.o.g. that $\pi$ is the identity permutation and omit it from the notation.
$$[\tilde p^{\epsilon}(a, b)]_{xy} = \begin{cases} \epsilon, & x = a,\ y = b \\ p_a - \epsilon, & x = y = a \\ p_x, & x = y \neq a \\ 0, & \text{otherwise}\end{cases}\qquad(3.1)$$
and thus
$$[\tilde p^{\pi} - \tilde p^{\epsilon}(a, b)]_{xy} = \begin{cases} \epsilon, & x = y = a \\ -\epsilon, & x = a,\ y = b \\ 0, & \text{otherwise}\end{cases}\qquad(3.2)$$

For $\epsilon \le p_{min} = \min_x p_x$, let $E^{\epsilon\pi} = \{\tilde p^{\epsilon,\pi}(a, b) : (a, b) \in [K] \times [K],\ a \neq b\}$. We bound from above the value of $\chi^2$ at all points in $E^{\epsilon\pi}$; then we show that if $d_{ME}$ is greater than $\epsilon$, the value of $\chi^2$ cannot be larger than this bound. These results will be proved as a series of lemmas, after which the formal proof of the theorem will close this section.

Lemma 7 (i) The set of extreme points of $P$ is
$$E = \{\, \tilde p^\phi \;:\; \phi : [K] \to [K'],\ \tilde p^\phi_{xy} = p_x \text{ if } y = \phi(x),\ 0 \text{ otherwise} \,\} \qquad(3.3)$$
(ii) For $\tilde p^\phi \in E$, $\chi^2(\tilde p^\phi) = |\mathrm{Range}\,\phi|$.

Proof The proof of (i) is immediate and left to the reader. To prove (ii), let $\tilde p^\phi \in E$. We can write successively
$$\chi^2(\tilde p^\phi) \;=\; \sum_{y:\, p'_y > 0} \; \sum_{x \in \phi^{-1}(y)} \frac{p_x^2}{p_x \sum_{z \in \phi^{-1}(y)} p_z} \;=\; \sum_{y:\, p'_y > 0} \frac{\sum_{x \in \phi^{-1}(y)} p_x}{\sum_{z \in \phi^{-1}(y)} p_z} \;=\; \sum_{y:\, p'_y > 0} 1 \;=\; |\mathrm{Range}\,\phi|$$

If $|\mathrm{Range}\,\phi| = K$, then $\phi$ is a permutation and we denote it by $\pi$. Let $E^* = \{\tilde p^\pi\}_{\pi \in \Pi_K}$ be the set of extreme points for which $\chi^2 = K$, and $E' = E \setminus E^*$ the set of extreme points for which $\chi^2 \le K - 1$.
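As a quick illustration (not part of the paper), the extreme-point formula of Lemma 7(ii) can be checked numerically for an arbitrary map φ; the marginal p_X and the random φ below are illustrative choices.

```python
# Illustrative check of Lemma 7(ii): chi^2 at an extreme point equals |Range(phi)|.
import numpy as np

rng = np.random.default_rng(1)
p_X = np.array([0.1, 0.2, 0.3, 0.4])
K = Kp = len(p_X)
phi = rng.integers(0, Kp, size=K)           # an arbitrary map phi: [K] -> [K']
p = np.zeros((K, Kp))
p[np.arange(K), phi] = p_X                  # p_xy = p_x if y = phi(x), 0 otherwise
p_Y = p.sum(axis=0)
nz = p_Y > 0                                # sum only over non-empty columns
chi2 = (p[:, nz] ** 2 / np.outer(p_X, p_Y[nz])).sum()
assert np.isclose(chi2, len(np.unique(phi)))
```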

Lemma 8 Let $B_1(r)$ be the 1-norm ball of radius $r$ centered at $\tilde p^\pi \in E^*$. Then
$$B_1(2\epsilon) \cap P = \mathrm{convex}\big(\{\tilde p^\pi\} \cup E^{\epsilon\pi}\big)$$

Proof First we show that $\|\tilde p^\pi - \tilde p^{\epsilon,\pi}(a, b)\|_1 = 2\epsilon$:
$$\|\tilde p^\pi - \tilde p^{\epsilon,\pi}(a, b)\|_1 = \sum_{x,y} \big|\tilde p^\pi_{xy} - \tilde p^{\epsilon,\pi}(a, b)_{xy}\big| = \big|\tilde p^\pi_{aa} - \tilde p^{\epsilon,\pi}(a, b)_{aa}\big| + \big|\tilde p^\pi_{ab} - \tilde p^{\epsilon,\pi}(a, b)_{ab}\big| = \epsilon + \epsilon = 2\epsilon \qquad(3.4)$$
For any point $\tilde p \in B_1(2\epsilon) \cap P$ denote
$$e_\epsilon = \frac{1}{\epsilon} \sum_x \sum_{y \neq x} \tilde p_{xy}$$
Then it is easy to check that
$$\tilde p = (1 - e_\epsilon)\, \tilde p^\pi + \sum_a \sum_{b \neq a} \frac{\tilde p_{ab}}{\epsilon}\, \tilde p^{\epsilon,\pi}(a, b)$$
and
$$(1 - e_\epsilon) + \sum_a \sum_{b \neq a} \frac{\tilde p_{ab}}{\epsilon} = 1$$

Lemma 9 For all $\tilde p \in B_1(2\epsilon) \cap P$, $d_{ME}(\tilde p) \le \epsilon$.

Proof Obvious, since $d_{ME}(\tilde p) \le \sum_x \sum_{y \neq x} \tilde p_{xy} = \epsilon\, e_\epsilon \le \epsilon$.

Lemma 10 Let $x = \sum_i \alpha_i x_i$ with $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$, and, for all $i$, let $y_i$ be a point of the segment $(x, x_i]$. Then $x$ is a convex combination of $\{y_i\}$.

Proof Let $y_i = \beta_i x + (1 - \beta_i) x_i$, $\beta_i \in [0, 1)$. Then
$$x_i = \frac{y_i - \beta_i x}{1 - \beta_i}$$
and replacing the above in the expression of $x$ we get successively
$$x = \sum_i \left[ \frac{\alpha_i}{1 - \beta_i}\, y_i - \frac{\alpha_i \beta_i}{1 - \beta_i}\, x \right] \qquad(3.5)$$
$$= \sum_i \frac{\alpha_i}{1 - \beta_i}\, y_i \;-\; x \sum_i \frac{\alpha_i \beta_i}{1 - \beta_i} \qquad(3.6)$$
Hence
$$x = \sum_i \underbrace{\frac{\dfrac{\alpha_i}{1 - \beta_i}}{1 + \sum_j \dfrac{\alpha_j \beta_j}{1 - \beta_j}}}_{\gamma_i}\, y_i \qquad(3.7)$$
with $\gamma_i \ge 0$ and
$$\sum_i \gamma_i = \frac{\sum_i \dfrac{\alpha_i}{1 - \beta_i}}{1 + \sum_j \dfrac{\alpha_j \beta_j}{1 - \beta_j}} = 1 \qquad(3.8)$$
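Before moving on, the convex decomposition used in the proof of Lemma 8 can be verified directly. The sketch below (illustrative, not from the paper) builds a random point of P whose off-diagonal mass is at most ε and reconstructs it as the stated convex combination, for the identity permutation; p_X, ε, and the random perturbation are arbitrary choices.

```python
# Illustrative check of the decomposition in the proof of Lemma 8 (identity pi).
import numpy as np

rng = np.random.default_rng(2)
p_X = np.array([0.25, 0.35, 0.40])
K, eps = len(p_X), 0.1                           # eps <= p_min

def p_eps(a, b):
    """The point p~^eps(a,b) of (3.1): mass eps moved from cell (a,a) to (a,b)."""
    q = np.diag(p_X).copy()
    q[a, a] -= eps
    q[a, b] += eps
    return q

# a point of P within the 1-norm ball B_1(2*eps) around diag(p_X)
p = np.diag(p_X).copy()
for a in range(K):
    for b in range(K):
        if a != b:
            m = rng.random() * eps / (2 * K)     # small mass leaks from (a,a) to (a,b)
            p[a, a] -= m
            p[a, b] += m

off_diag = ~np.eye(K, dtype=bool)
e_eps = p[off_diag].sum() / eps                  # relative off-diagonal mass, <= 1
recon = (1 - e_eps) * np.diag(p_X) + sum(
    (p[a, b] / eps) * p_eps(a, b) for a in range(K) for b in range(K) if a != b)
assert np.allclose(recon, p)
```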

Lemma 11 The set $A_\epsilon = \{\tilde p \in P : d_{ME}(\tilde p) \ge \epsilon\}$, with $\epsilon \le p_{min}$, is included in the convex hull of $\bigcup_{\pi \in \Pi_K} E^{\epsilon\pi} \,\cup\, E'$.

Proof Let $\tilde p \in A_\epsilon$. Because $\tilde p \in P$, it is a convex combination of the extreme points of $P$; it can be written as
$$\tilde p = \sum_{i=1}^{|E|} \alpha_i\, \tilde p_i, \qquad \alpha_i \ge 0,\ \sum_i \alpha_i = 1$$
$$= \sum_{i=1}^{K!} \alpha_i\, \tilde p^{\pi_i} + \sum_{i=1}^{|E'|} \alpha_{i+K!}\, \tilde p'_i$$
Let us look at the segment $[\tilde p, \tilde p^{\pi_i}]$; its first end, $\tilde p$, is in $A_\epsilon$, while its other end is outside $A_\epsilon$ and inside the ball $B_1^{\pi_i}(2\epsilon)$. As the ball is convex, there is a (unique) point $\bar p_i = [\tilde p, \tilde p^{\pi_i}] \cap \partial B_1^{\pi_i}(2\epsilon)$. This point, being on the boundary of the ball, can be written as a convex combination of points in $E^{\epsilon\pi_i}$ by Lemma 8. We now apply Lemma 10, with $y_i = \bar p_i$ for $i = 1, \dots, K!$ and $y_i = \tilde p'_{i - K!}$ for $i > K!$. It follows that $\tilde p$ is a convex combination of the $y_i$, $i = 1, \dots, |E|$, which completes the proof.

Lemma 12 For $\epsilon \le p_{min}$,
$$\chi^2(\tilde p^\pi) - \chi^2(\tilde p^{\epsilon,\pi}(a, b)) \ge \frac{\epsilon}{p_{max}}$$

Proof Compute $\chi^2(\tilde p^\epsilon(a, b))$:
$$\chi^2(\tilde p^\epsilon(a, b)) = K - 2 + \frac{(p_a - \epsilon)^2}{p_a (p_a - \epsilon)} + \frac{\epsilon^2}{p_a (p_b + \epsilon)} + \frac{p_b^2}{p_b (p_b + \epsilon)} \qquad(3.9)$$
$$= K - 2 + 1 - \frac{\epsilon}{p_a} + \frac{\epsilon^2}{p_a (p_b + \epsilon)} + 1 - \frac{\epsilon}{p_b + \epsilon} \qquad(3.10)$$
$$= K - \frac{\epsilon\, (p_a + p_b)}{p_a (p_b + \epsilon)} \qquad(3.11)$$
$$\le K - \frac{\epsilon}{p_a} \qquad(3.12)$$
Therefore
$$\chi^2(\tilde p^\pi) - \chi^2(\tilde p^\epsilon(a, b)) \ge \frac{\epsilon}{p_a} \ge \frac{\epsilon}{p_{max}}$$

Proof of Theorem 6 By contradiction. Assume $d_{ME}(\tilde p) > \epsilon$. Then $\tilde p \in A_\epsilon$ and, by Lemma 11, $\tilde p$ lies in the convex hull of $E' \cup (\bigcup_\pi E^{\epsilon\pi})$. Since $\chi^2$ is convex, $\chi^2(\tilde p)$ cannot be larger than its maximum value at the extreme points of this hull, which are contained in $E' \cup (\bigcup_\pi E^{\epsilon\pi})$. But we know by Lemma 12 that the value of $\chi^2$ is bounded above by $K - \epsilon/p_{max}$ at any point in $E^{\epsilon\pi}$, and by $K - 1 \le K - \epsilon/p_{max}$ at any point in $E'$ (since $\epsilon \le p_{max}$). Hence $\chi^2(\tilde p) \le K - \epsilon/p_{max}$, that is, $d_{\chi^2}(\tilde p) \ge \epsilon/p_{max}$, which contradicts the hypothesis of the theorem.

Note also that a tight, non-linear bound can be obtained by maximizing (3.11) over all $a, b$.
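To close the section, here is a Monte Carlo sanity check (an illustration added here, not part of the paper) of Theorem 6 as reconstructed above: joint distributions with a fixed row marginal are sampled near the diagonal and, whenever the hypothesis d_χ² ≤ ε/p_max holds, the conclusion d_ME ≤ ε is asserted. The marginal, the Dirichlet concentration, and the number of trials are arbitrary illustrative choices.

```python
# Monte Carlo sanity check (illustrative) of Theorem 6 for K = K' = 3.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
p_X = np.array([0.2, 0.3, 0.5])
K = len(p_X)
eps = p_X.min()                                   # Theorem 6 requires eps <= p_min
hits = 0
for _ in range(20000):
    # random joint in P (row marginal p_X), concentrated near the diagonal
    alpha = np.ones((K, K)) + 60.0 * np.eye(K)
    p = np.vstack([p_X[a] * rng.dirichlet(alpha[a]) for a in range(K)])
    d_chi2 = K - (p**2 / np.outer(p_X, p.sum(axis=0))).sum()
    r, c = linear_sum_assignment(-p)
    d_ME = 1.0 - p[r, c].sum()
    if d_chi2 <= eps / p_X.max():                 # hypothesis of Theorem 6
        hits += 1
        assert d_ME <= eps + 1e-9                 # conclusion of Theorem 6
print("trials satisfying the hypothesis:", hits)
```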

4 Small $d_{ME}$ implies small $d_{\chi^2}$

Theorem 13 Let $p_{XY}$ represent a pair of clusterings with $d_{ME}(p_{XY}) \le \epsilon_0$. Then
$$d_{\chi^2}(p_{XY}) \le \frac{2\epsilon_0}{p_{min}}$$

The proof is based on the fact that a convex function always lies above any tangent to its graph. We pick a point $\tilde p$ that has $d_{ME}(\tilde p) = \epsilon$ and lower bound $\chi^2(\tilde p)$ by the tangent to $\chi^2$ at the nearest optimal point $\tilde p^*$. We start by proving three lemmas, then follow with the formal proof of the theorem.

Lemma 14 The unconstrained partial derivatives of $\chi^2$ at $\tilde p^*$ are
$$\left. \frac{\partial \chi^2}{\partial p_{xy}} \right|_{\tilde p^*} = \begin{cases} -\dfrac{1}{p_y}, & x \neq y \\[4pt] \dfrac{1}{p_x}, & x = y \end{cases} \qquad(4.1)$$

Proof
$$\frac{\partial \chi^2}{\partial p_{ab}} = \frac{\partial}{\partial p_{ab}} \sum_x \frac{1}{p_x} \sum_y \frac{p_{xy}^2}{p'_y} = \frac{\partial}{\partial p_{ab}} \left[ \frac{1}{p_a} \left( \frac{p_{ab}^2}{p'_b} + \sum_{y \neq b} \frac{p_{ay}^2}{p'_y} \right) + \sum_{x \neq a} \frac{1}{p_x} \sum_y \frac{p_{xy}^2}{p'_y} \right] \qquad(4.2)$$
$$= \frac{1}{p_a} \cdot \frac{2 p_{ab}\, p'_b - p_{ab}^2}{p'^{\,2}_b} - \sum_{x \neq a} \frac{1}{p_x} \cdot \frac{p_{xb}^2}{p'^{\,2}_b} \qquad(4.3)$$
$$= \frac{2 p_{ab}}{p_a\, p'_b} - \sum_x \frac{p_{xb}^2}{p_x\, p'^{\,2}_b} \qquad(4.4)$$
The result follows now by setting $p_{xb} = p_x \delta_{xb}$ and $p'_b = p_b$.
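The derivative formula of Lemma 14 can be checked by finite differences; the short sketch below (illustrative, not from the paper) follows the convention of the proof, holding the row marginals p_x fixed in the denominator while recomputing the column marginals from p. The marginal p_X and the step size h are arbitrary.

```python
# Finite-difference check (illustrative) of the derivatives in Lemma 14 at p~* = diag(p_X).
import numpy as np

p_X = np.array([0.2, 0.3, 0.5])
K, h = len(p_X), 1e-6

def chi2(p):
    # chi^2 with row marginals fixed at p_X and column marginals recomputed from p,
    # matching the "unconstrained" differentiation in the proof of Lemma 14
    return (p**2 / np.outer(p_X, p.sum(axis=0))).sum()

p_star = np.diag(p_X)
for a in range(K):
    for b in range(K):
        e = np.zeros((K, K))
        e[a, b] = h
        grad = (chi2(p_star + e) - chi2(p_star - e)) / (2 * h)   # central difference
        expected = 1 / p_X[a] if a == b else -1 / p_X[b]
        assert np.isclose(grad, expected, atol=1e-4)
```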

Lemma 15 For any $\tilde p \in P$,
$$\chi^2(\tilde p^*) - \chi^2(\tilde p) \le \sum_x \sum_{y \neq x} \left( \frac{\tilde p_{xy}}{p_x} + \frac{\tilde p_{xy}}{p_y} \right)$$

Proof $\chi^2$ is convex, therefore $\chi^2(\tilde p)$ lies above the tangent at $\tilde p^*$, i.e.
$$\chi^2(\tilde p) \ge \chi^2(\tilde p^*) + \mathrm{vec}\big(\nabla \chi^2(\tilde p^*)\big) \cdot \mathrm{vec}(\tilde p - \tilde p^*) \qquad(4.5)$$
By Lemma 14,
$$\mathrm{vec}\big(\nabla \chi^2(\tilde p^*)\big) \cdot \mathrm{vec}(\tilde p - \tilde p^*) = \sum_x \frac{\tilde p_{xx} - p_x}{p_x} - \sum_x \sum_{y \neq x} \frac{\tilde p_{xy}}{p_y} \qquad(4.6)$$
$$= -\sum_x \sum_{y \neq x} \left( \frac{\tilde p_{xy}}{p_x} + \frac{\tilde p_{xy}}{p_y} \right) = -\sum_x \epsilon_x - \sum_y \epsilon'_y \qquad(4.7)$$
where
$$\epsilon_x = \frac{1}{p_x} \sum_{y \neq x} \tilde p_{xy}, \quad x \in [K] \qquad(4.8)$$
$$\epsilon'_y = \frac{1}{p_y} \sum_{x \neq y} \tilde p_{xy}, \quad y \in [K] \qquad(4.9)$$
These quantities represent the relative leak of probability mass from the diagonal to the off-diagonal cells in row $x$, respectively in column $y$, of the matrix $\tilde p$ w.r.t. $\tilde p^*$.

Lemma 16 Let $\epsilon_x$, $x \in [K]$, be as defined above, and assume that the marginals $p_x$ are sorted so that $p_{min} = p_1 \le p_2 \le p_3 \le \dots \le p_K = p_{max}$, with $\sum_x p_x \epsilon_x = \epsilon$. Then
$$\max_{\{\epsilon_x\}} \sum_x \epsilon_x = \begin{cases} \dfrac{\epsilon}{p_1}, & \text{if } \epsilon \in [0, p_1] \\[6pt] 1 + \dfrac{\epsilon - p_1}{p_2}, & \text{if } \epsilon \in (p_1, p_1 + p_2] \\[2pt] \quad\vdots \\ k + \dfrac{\epsilon - \sum_{x \le k} p_x}{p_{k+1}}, & \text{if } \epsilon \in (p_1 + \dots + p_k,\ p_1 + \dots + p_{k+1}] \end{cases}$$

Proof It is easy to verify the solution for $\epsilon \le p_1$. For the other intervals, one verifies the solution by induction over $k \in [K]$.

Proof of Theorem 13 Assume that $d_{ME}(\tilde p) = \epsilon \le \epsilon_0$. Then, w.l.o.g., one can assume that the off-diagonal elements of $\tilde p$ sum to $\epsilon$. It is easy to see that, under the conditions of Lemma 16,
$$\sum_x \epsilon_x \le \frac{\epsilon}{p_{min}}$$
By symmetry, this bound also holds for $\sum_y \epsilon'_y$. Therefore, by Lemma 15,
$$\chi^2(\tilde p^*) - \chi^2(\tilde p) \le \frac{2\epsilon}{p_{min}} \qquad(4.10)$$
or
$$d_{\chi^2}(\tilde p) \le \frac{2\epsilon}{p_{min}} \le \frac{2\epsilon_0}{p_{min}}$$
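As with Theorem 6, the bound of Theorem 13 (in the reconstruction above) is easy to exercise numerically. The following sketch (illustrative, not from the paper) samples joint distributions near the diagonal and checks d_χ² ≤ 2 d_ME / p_min with ε₀ taken equal to d_ME; the marginal and constants are arbitrary.

```python
# Monte Carlo sanity check (illustrative) of Theorem 13 with eps_0 = d_ME, K = K' = 3.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(3)
p_X = np.array([0.2, 0.3, 0.5])
K = len(p_X)
for _ in range(20000):
    # random joint with row marginal p_X, concentrated near the diagonal
    alpha = np.ones((K, K)) + 40.0 * np.eye(K)
    p = np.vstack([p_X[a] * rng.dirichlet(alpha[a]) for a in range(K)])
    d_chi2 = K - (p**2 / np.outer(p_X, p.sum(axis=0))).sum()
    r, c = linear_sum_assignment(-p)
    d_ME = 1.0 - p[r, c].sum()
    assert d_chi2 <= 2 * d_ME / p_X.min() + 1e-9   # Theorem 13
```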

5 Remarks

Although the original motivation for this work stems from comparing partitions, we have proved a result which holds for any two finite-valued random variables. In particular, the two theorems give lower and upper bounds on the χ² measure of independence between two random variables, holding locally when the two variables are strongly dependent.

The present approximation complements an older approximation of χ² by the mutual information $I_{XY} = \sum_{xy} p_{xy} \ln \frac{p_{xy}}{p_x p'_y}$. It is known [Cover and Thomas, 1991] that the second order Taylor approximation of $I_{XY}$ is $\frac{1}{2}(\chi^2(p_{XY}) - 1)$, with $\chi^2$ defined as in (2.2). This approximation is good around $p_{XY} = p_X p_Y$, hence in the weak dependence region.

The non-linear bound (3.11) in Theorem 6 is tight. The proofs hold when the condition $K' = K$ is replaced by $K' \ge K$, or even by $K' = \infty$. It can be seen that both sets of bounds are tighter, and hold for a larger range of $\epsilon$, when the clusterings have approximately equal clusters, that is, when $p_{min}$ and $p_{max}$ both approach $1/K$. This confirms the general intuition that clusterings with equal sized clusters are "easier" (and its counterpart, that clusterings containing very small clusters are "hard").

Finally, a useful property of the theorems presented here is that they involve the values $p_{min}$, $p_{max}$ of one clustering only. Hence they can be applied in cases when only one clustering is known. For example, [Meilă et al., 2005] used this result in the context of spectral clustering, to prove that any clustering with low enough normalized cut is close to the (unknown) optimal clustering of that data set.

References

[Bach and Jordan, 2004] Bach, F. and Jordan, M. I. (2004). Learning spectral clustering. In Thrun, S. and Saul, L., editors, Advances in Neural Information Processing Systems 16, Cambridge, MA. MIT Press.

[Cover and Thomas, 1991] Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley.

[Hubert and Arabie, 1985] Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2:193-218.

[Meila, 2005] Meila, M. (2005). Comparing clusterings: an axiomatic view. In Wrobel, S. and De Raedt, L., editors, Proceedings of the International Machine Learning Conference (ICML). Morgan Kaufmann.

[Meilă, 2006] Meilă, M. (2006). Comparing clusterings: an information based metric. Journal of Multivariate Analysis. (In press.)

[Meilă et al., 2005] Meilă, M., Shortreed, S., and Xu, L. (2005). Regularized spectral learning. In Cowell, R. and Ghahramani, Z., editors, Proceedings of the Artificial Intelligence and Statistics Workshop (AISTATS 05).

[Papadimitriou and Steiglitz, 1998] Papadimitriou, C. and Steiglitz, K. (1998). Combinatorial Optimization: Algorithms and Complexity. Dover Publications, Inc., Mineola, NY.

[Vajda, 1989] Vajda, I. (1989). Theory of Statistical Inference and Information. Theory and Decision Library, Series B: Mathematical and Statistical Methods. Kluwer Academic Publishers, Norwell, MA.