A Bayesian Criterion for Clustering Stability


1 A Bayesian Criterion for Clustering Stability. B. Clarke, Dept. of Medicine, CCS, DEPH, University of Miami. Joint with H. Koepke, Dept. of Statistics, University of Washington. 26 June 2012, ISBA Kyoto.

2 Outline: 1. Assessing Stability.

3 Stability of Clustering, not Clustering per se. Imagine n points, say x_i = (x_{1,i}, ..., x_{D,i}) for i = 1, ..., n. Our goal is to put the points that belong together in the same set and to make sure points that don't belong together are in different sets. Such a set is called a cluster; a set of clusters is called a clustering (of the points). There are many ways to form clusters... Given a proposed clustering of almost any sort, we propose a technique for assessing whether it is reasonable, i.e., validation. There are also many ways to do this. Most are based on some notion of clustering stability... and so is ours.

4 Key Definitions. Suppose we have Ĉ = {Ĉ_1, ..., Ĉ_K}, where the Ĉ_k's are disjoint and ∪_k Ĉ_k = {x_1, ..., x_n}. We evaluate the stability of a fixed cluster Ĉ_k that has x_i as a member using sets of the form

Ŝ_{ik} = {(λ_1, ..., λ_K) : λ_k d(x_i, μ̂_k) ≤ min_{l ≠ k} λ_l d(x_i, μ̂_l)},

where the λ_k's are parameters, d is a distance, and μ̂_k is the centroid of Ĉ_k. The larger Ŝ_{ik} is, the more stable Ĉ is, so choose a prior distribution F and calculate F(Ŝ_{ik}).

5 The Main Stability Assessment. Let I be an indicator function... Then

[φ_{ik}]_{i=1,...,n; k=1,...,K} = ∫ I_{Ŝ_{ik}}(λ^K) dF(λ^K).

This is the averaged assignment matrix. If F puts all its mass on λ^K = (1, ..., 1), then φ_{ik} = 0 or 1 depending on whether x_i ∈ Ĉ_k. More generally, F spreads the membership of x_i across the K clusters. The pointwise stability of x_i is

PW(x_i) = PW_i = φ_{i,h_i},

where h_i = k such that x_i ∈ Ĉ_k.
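
As a concrete illustration, here is a minimal Monte Carlo sketch of the averaged assignment matrix. The helper names (estimate_phi, sample_prior) and the use of Euclidean distance are assumptions for the example, not the authors' implementation; the integral over F is replaced by prior draws.

```python
# Minimal Monte Carlo sketch of the averaged assignment matrix [phi_ik].
import numpy as np

def estimate_phi(X, centroids, sample_prior, n_draws=2000, rng=None):
    """phi[i, k] estimates F(S_ik): the prior probability that
    lambda_k * d(x_i, mu_k) <= lambda_l * d(x_i, mu_l) for all l != k."""
    rng = np.random.default_rng(rng)
    # Euclidean point-to-centroid distances, shape (n, K)
    D = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    n, K = D.shape
    phi = np.zeros((n, K))
    for _ in range(n_draws):
        lam = sample_prior(K, rng)            # one draw of (lambda_1, ..., lambda_K)
        winners = (lam[None, :] * D).argmin(axis=1)   # cluster each point would join
        phi[np.arange(n), winners] += 1.0
    return phi / n_draws                      # each row sums to 1
```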

6 Behavior of PW Stability. Suppose we fix the centroids μ̂_k of the Ĉ_k's and use PW for a general point x. Thus

PW(x) = F({λ^K : λ_k d(x, μ̂_k) ≤ λ_l d(x, μ̂_l) for all l ≠ k}),

where k is the cluster x is assigned to. Then we can visualize (in 2 dimensions) the effect of F. Take K = 5 and F IID shifted exponential, i.e., f(λ | β) = β e^{−β(λ−1)} 1_{λ ≥ 1}, with β = .1, .75, 2, 10. As the hyperparameter β increases, the regions of stability grow because it becomes harder and harder to regard a point as belonging to a cluster it wasn't assigned to.
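
A sketch of PW(x) for a general point under the IID shifted-exponential prior; the function names are hypothetical and the prior integral is again approximated by Monte Carlo.

```python
# Sketch: pointwise stability PW(x) under the IID shifted-exponential prior.
import numpy as np

def sample_shifted_exponential(K, beta, rng):
    # density f(lambda | beta) = beta * exp(-beta * (lambda - 1)) for lambda >= 1
    return 1.0 + rng.exponential(scale=1.0 / beta, size=K)

def pointwise_stability(x, centroids, beta, n_draws=5000, rng=None):
    rng = np.random.default_rng(rng)
    d = np.linalg.norm(centroids - x, axis=1)   # d(x, mu_k) for each k
    k_hat = d.argmin()                          # cluster x is assigned to
    hits = 0
    for _ in range(n_draws):
        lam = sample_shifted_exponential(len(d), beta, rng)
        if (lam * d).argmin() == k_hat:         # is lambda_k d(x, mu_k) still smallest?
            hits += 1
    return hits / n_draws                       # estimate of PW(x)
```

Evaluating this function over a grid of x values reproduces the kind of stability maps shown on the next slides.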

7 Figure: pointwise stability for β = .1.

8 Figure: pointwise stability for β = .75.

9 Figure: pointwise stability for β = 2.

10 Figure: pointwise stability for β = 10.

11 Figure: pointwise stability with data and adaptive β; the clusters are labeled A–E.

12 A Visualization not Limited to 2 Dimensions. Recall i indexes data points and k indexes clusters. Sort the elements φ_{ik} by cluster; then, for i ∈ Ĉ_{k_0}, put the φ_{i,k_0}'s in decreasing order. The result is a heatmap...
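
A sketch of this ordering, assuming the φ matrix from the earlier sketch; the function name is illustrative.

```python
# Sketch: arrange the rows of phi into the heatmap order described above.
import numpy as np

def stability_heatmap_matrix(phi, labels):
    """phi: (n, K) averaged assignment matrix; labels[i] = cluster x_i was assigned to.
    Rows are grouped by assigned cluster and, within each block, sorted so the most
    stable points (largest phi_{i,k0}) come first."""
    K = phi.shape[1]
    ordered_rows = []
    for k0 in range(K):
        block = np.where(labels == k0)[0]
        block = block[np.argsort(-phi[block, k0])]   # decreasing phi_{i, k0}
        ordered_rows.append(block)
    order = np.concatenate(ordered_rows)
    return phi[order], order   # display with, e.g., matplotlib's imshow
```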

13 Heatmap for the data (clusters A–E). Dark lines at the bottom of the diagonal blocks show instability.

14 But Is This Bayes? No: the prior F is not multiplying a likelihood. There is no posterior. There is no credible set, point estimator, hypothesis test... Yes: the prior F represents a distinct source of information, and we are making inferences from a combination of the data and the prior. Editorializing: I think this is extended Bayes in the sense that we are combining the data with pre-experimental information. In this case, the prior represents our pre-experimental views about how much perturbation is reasonable.

15 And what does this have to do with prediction? This is intended for prediction with classification data: a sanity check that the classes are reasonable. An example below gives good reason for such skepticism. So, you want to find the classes from the clustering and then use the clusters to make predictions. Natural predictor: assign x_new to class k* where k* = arg min_u d(x_new, μ̂_u). At this point we have only developed the stability method... we have not tested how well it predicts. We have only (so far) shown that clustering stability may give more reasonable classes than the classes in the data.
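
The natural predictor is simple enough to state as a short sketch (Euclidean distance assumed for illustration).

```python
# Sketch of the natural nearest-centroid predictor: assign x_new to the
# cluster whose centroid minimizes d(x_new, mu_u).
import numpy as np

def predict(x_new, centroids):
    return int(np.linalg.norm(centroids - x_new, axis=1).argmin())
```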

16 Let's look at the φ_{ik}'s in more detail. We actually want the average pointwise stability for Ĉ:

APW = (1/n) Σ_{i=1}^{n} φ_{i,h_i}.

To understand the APW, imagine there is a true clustering C = (C_1, ..., C_{K_T}) with centroids μ_k and

S_{ik} = {(λ_1, ..., λ_K) : λ_k d(x_i, μ_k) ≤ min_{l ≠ k} λ_l d(x_i, μ_l)}.
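
Given the φ matrix and the assignments h_i, the APW is a one-liner; this sketch reuses the notation of the earlier examples.

```python
# Sketch: average pointwise stability APW = (1/n) * sum_i phi_{i, h_i}.
import numpy as np

def average_pointwise_stability(phi, labels):
    n = phi.shape[0]
    return float(phi[np.arange(n), labels].mean())
```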

17 Relation Between Empirical and Population Quantities. Let's also assume "convexity":

x ∈ Ĉ_k ⟺ d(x, μ̂_k) ≤ d(x, μ̂_l) for all l ≠ k, and x ∈ C_k ⟺ d(x, μ_k) ≤ d(x, μ_l) for all l ≠ k.

This lets us go back and forth between treating Ĉ as a partition of the data and as a partition of the sample space. Suppose the empirical clusters Ĉ_k go to the population clusters C_k as n → ∞. As long as the μ̂_k's converge to their corresponding μ_k's, we get convergence of the clusters too. This is basically a law of large numbers.

18 Empirical and Population Criteria. We approximate the APW by

Q_n(K) = Σ_{j=1}^{K} (1/n) Σ_{i=1}^{n} I{x_i ∈ Ĉ_{Kj}} ∫ I_{Ŝ_{ij}}(λ^K) dF(λ^K).

We approximate the limit of the APW by

Q(K) = Σ_{j=1}^{K} E[ I{X_1 ∈ C_{Kj}} ∫ I_{S_{1j}}(λ^K) dF(λ^K) ].

Theorem: Q_n(K) → Q(K).

19 Choosing K. Regard Q_n(K) as a data-dependent objective function for choosing K. Let [K_1, K_2] be the (compact) set of integers strictly between K_1 − 1 and K_2 + 1. Write

K̂ = arg max_{K ∈ [K_1, K_2]} Q_n(K).   (1)

By Theorem 1, for any bounded interval [K_1, K_2] we have that Q_n(K) → Q(K) uniformly in probability on [K_1, K_2]. Let

K_opt = arg max_{K ∈ [K_1, K_2]} Q(K).
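
A hedged sketch of (1): it reuses the helpers sketched earlier (estimate_phi, average_pointwise_stability, sample_shifted_exponential), approximates Q_n(K) by the APW of each fit, and takes the argmax; scikit-learn's KMeans stands in for whichever base clusterer is used.

```python
# Sketch: choose K_hat = argmax_K Q_n(K), with Q_n(K) approximated by the APW.
import numpy as np
from sklearn.cluster import KMeans

def choose_K(X, K1, K2, beta=2.0, rng=0):
    prior = lambda K, r: sample_shifted_exponential(K, beta, r)
    scores = {}
    for K in range(K1, K2 + 1):
        km = KMeans(n_clusters=K, n_init=10, random_state=rng).fit(X)
        phi = estimate_phi(X, km.cluster_centers_, prior, rng=rng)
        scores[K] = average_pointwise_stability(phi, km.labels_)
    K_hat = max(scores, key=scores.get)
    return K_hat, scores
```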

20 Consistency Theorem. Theorem: Suppose that K_opt is the unique maximum of Q(K) over K and that K_opt ∈ [K_1, K_2]. Then

arg max_{K ∈ [K_1, K_2]} Q_n(K) → arg max_{K ∈ [K_1, K_2]} Q(K),   (2)

i.e., K̂ → K_opt in probability as n → ∞. Proof: Convergence follows from a simple modification of Theorem 2.1 in Newey and McFadden (1994).

21 Three Steps to Make this Work. Step 1: Get the input. Assume that for each K we have an optimal clustering, e.g., from K-means. Step 2: Rescale. For each K, generate B bootstrap samples and compute the values S_{K,b} = log(APW / APW_b). Using this bootstrap distribution for S_K, choose K̂ to be the smallest plausible maximum of (1/B) Σ_b S_{K,b}. This is like using a permutation distribution to get at relative stability.
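
One plausible reading of Step 2, written as a sketch; the exact resampling scheme (here, reusing the fitted centroids on each resample) is an assumption, not the authors' code, and it relies on the helpers sketched earlier.

```python
# Sketch of the rescaling step: compare the APW of the fitted clustering with
# APWs computed on bootstrap resamples, S_{K,b} = log(APW / APW_b).
import numpy as np

def relative_stability(X, centroids, prior, B=100, rng=None):
    rng = np.random.default_rng(rng)
    phi = estimate_phi(X, centroids, prior, rng=rng)
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    apw = average_pointwise_stability(phi, labels)
    S = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))        # bootstrap resample
        phi_b = estimate_phi(X[idx], centroids, prior, rng=rng)
        apw_b = average_pointwise_stability(phi_b, labels[idx])
        S.append(np.log(apw / apw_b))
    return float(np.mean(S)), np.array(S)
```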

22 Prior Selection. Step 3: After some testing we found that the shifted exponential prior with hyperparameter β worked well. So... we estimated the hyperparameter essentially by maximizing the empirical form of

(1/(K_2 − K_1)) Σ_{K=K_1}^{K_2} E(S_K).

For one-dimensional hyperparameters this can be done by a simple bisection algorithm.
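
A sketch of the one-dimensional search: a golden-section search stands in for the bisection mentioned on the slide, and objective(β) is assumed to be the empirical average above (evaluated, e.g., via the relative_stability sketch over K = K_1, ..., K_2). The objective is assumed unimodal in β.

```python
# Sketch: maximize objective(beta) over a bracket by golden-section search.
import math

def maximize_beta(objective, lo=0.1, hi=10.0, tol=0.05):
    gr = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c, d = b - gr * (b - a), a + gr * (b - a)
        if objective(c) >= objective(d):
            b = d          # maximum lies in [a, d]
        else:
            a = c          # maximum lies in [c, b]
    return (a + b) / 2
```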

23 Synthetic Data and Five Stability Methods. We used three classes of normal mixtures in 2, 5, 10, and 20 dimensions, with 100, 150, 200, and 300 data points each, respectively. These classes were: T1: four mixture components, with strong non-linear transformations to give the components more difficult shapes. T2: five mixture components, but easier non-linear operations, so the clusters are better modeled by centroid-based methods. T3: samples from a single unimodal normal density, to test the ability of the methods to detect when there is no actual clustering, i.e., K = 1.

24 Five Stability Methods for Choosing K. We used two versions of our stability measure: one with d set to squared error, the other with d based on average linkage (take the average of the point-to-centroid distances). We also used three conceptually different assessments of stability, more or less in their standard form: subsampling, silhouette distance, and the gap statistic.

25 T1, K_T = 4: 10d, n = 200 and 20d, n = 400. Tables: distribution of K̂ under the gap statistic, subsampling, silhouette, perturbation, and average-linkage perturbation methods (table values omitted).

26 T2, K_T = 5: 10d, n = 200 and 20d, n = 400. Tables: distribution of K̂ under the gap statistic, subsampling, silhouette, perturbation, and average-linkage perturbation methods (table values omitted).

27 T3 Data. In this case the correct number of clusters was 1. In all cases we examined, both forms of the perturbation statistic were best... except for one. In 2 dimensions, the gap statistic chose K = 1 correctly only about 14% of the time, and all the other methods did worse; in that one setting no method did well. Otherwise our perturbation method was best, and the average-linkage form did better than the Euclidean one.

28 General Comparisons. The gap statistic performs poorly: it captures the first statistically significant clustering, not the most significant clustering. It can't tell two separated groups of clusters from 2 clusters, and it tends to break long thin clusters into several clusters. Silhouette: our method outperforms it in every case we looked at; its intuition is similar to ours, but ours is more thorough. Subsampling methods are the only real competitor... but they can't detect the K_T = 1 case and they share the gap statistic's problems. On the other hand, in high dimensions they may be competitive.

29 Yeast Data. This is classification data with 10 classes; we threw out the class indicators and clustered on the 8 explanatory variables. First written up in Nakai et al. (1991), Osaka; available from the UC Irvine repository. Clustering this dataset is quite difficult, as the classes do not separate easily. We sphered the data and then found the K-means clusterings for a range of K. The stability sequence shows K = 8 is most stable, but values five through nine were not bad.

30 Stability Plot. Figure: the stability sequence for the K-means clusterings, K = 2, ..., 12, for the yeast data set. The most stable choice is K = 8, not the K = 10 that is physically correct.

31 What We Did... We sphered the data, rather than studentizing it or using it as is, because sphering gave the most reasonable stability sequence with K-means clustering. Also, the heatmap of the most stable clustering showed clear distinctions between the classes. The stability sequences for the raw or studentized data had a maximum at two and then declined, and with 2 clusters the heatmaps did not show distinct components. It remains unclear when to sphere, studentize, or use the data as is; here we show the most stable version. We plotted stability heatmaps for K = 5, ..., 10.

32 Figure: stability heatmaps for K = 6 and K = 7.

33 Figure: stability heatmaps for K = 8 and K = 9.

34 What the Heatmaps Show. When K = 6, the blocks on the main diagonal are not very light, and on each row there is not much difference between the main-diagonal cluster and its competing clusters. For K = 7, the blocks off the main diagonal are a little darker than those on the main diagonal. The contrast is stronger for K = 8 and K = 9. However, there is little (if any) improvement in the contrast between the on- and off-diagonal blocks in moving from K = 8 to K = 9. The heatmaps for K = 5 and K = 10 show worse stability.

35 The Class Label Bar on the Right. The class label bars indicate how well the clustering reproduces the class labels. These classes do not separate well, but there are distinct groups forming each cluster, particularly at K = 8. The bars show that the most stable blocks tend to come from the same class, though the tendency is not strong. A practitioner can note which clusters are well separated and which are separated so little that merging them should be considered. (The reverse happens too!) Our analysis finds that 7 or 8 classes are more stable than the ten apparent classes, suggesting that some of the classes overlap significantly.

36 Summary I. We have formulated a stability criterion based on how easily inequalities between point-to-cluster-centroid distances are preserved. Its absolute form approximates the average pointwise stability and gives consistent selection of K. Large values of our stability criterion indicate stability. It accurately encapsulates our usual notion of stability in several key cases; in particular, it responds to how heavily populated the boundary regions are.

37 Summary II. For usage we need a sequence of clusterings, one optimal for each K. We convert our criterion to a relative form. We estimate the hyperparameter using the relative form. We transform the data (studentize, sphere, leave as is, etc.) to find the K̂ of the most stable clustering. We analyze the most stable clusterings by heatmaps and class label bars. Sometimes stable clusterings suggest more or fewer classes than the classification data contain; this reveals structure that might otherwise be overlooked.

38 Special Case: Concentration at the μ_k's. If the distribution of X concentrates at μ_1 and μ_2, then Q(2) goes to 1; this generalizes to K clusters. For K = 2, let μ_j = E(X | C_j) and D_j = d(X, μ_j). Let Λ_1 = λ_2/λ_1, Λ_2 = λ_1/λ_2, and let G_{Λ_u} be the survival function of Λ_u. One can show

Q(2) = E[ I{D_1/D_2 ≤ 1} G_{Λ_1}(D_1/D_2) ] + E[ I{D_2/D_1 ≤ 1} G_{Λ_2}(D_2/D_1) ].

So, if D_1/D_2 is small on C_1, the first term is near P(C_1), so C_1 is stable. If D_2/D_1 is small on C_2, the second term is near P(C_2). This means Q(2) is near its maximum P(C_1) + P(C_2) = 1, so Q̂_n(2) should be close to its maximum 1, too.

39 Special Case: Mass of P Near the Boundary. If the mass of P accumulates near the boundary between C_1 and C_2, then Q(2) goes to 1/2. For K clusters, Q(K) can decrease to 1/K if P concentrates at a point. For K = 2, if D_1/D_2 is large on C_1, i.e., D_1/D_2 ≈ 1, we expect many points in C_1 to be close to the boundary between C_1 and C_2; similarly if D_2/D_1 is close to 1 on C_2. In these cases,

Q(2) ≈ P(C_1) G_{Λ_1}(1) + P(C_2) G_{Λ_2}(1) = 1/2,

and one can prove Q(2) → 1/2. Since 1/2 ≤ Q(2) ≤ 1, it seems reasonable to regard Q(2) as indicating stability.
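
A numerical sanity check of these two limiting cases (a sketch, not from the paper): estimate Q(2) = E[PW(X)] by Monte Carlo, once for mass concentrated at the centroids and once for mass hugging the boundary, reusing the pointwise_stability and sample_shifted_exponential sketches above. The centroids, β, and samplers here are illustrative choices.

```python
# Sketch: Q(2) should be near 1 for tight clusters at the centroids and near
# 1/2 for mass piled up along the boundary between them.
import numpy as np

rng = np.random.default_rng(0)
centroids = np.array([[-1.0, 0.0], [1.0, 0.0]])
beta = 2.0

def estimate_Q2(sampler, n_points=300):
    X = sampler(n_points)
    return float(np.mean([pointwise_stability(x, centroids, beta, n_draws=500, rng=rng)
                          for x in X]))

# (a) tight mass at the centroids: Q(2) close to 1
tight = lambda n: centroids[rng.integers(0, 2, n)] + 0.01 * rng.standard_normal((n, 2))
# (b) points hugging the boundary x1 = 0: Q(2) close to 1/2
boundary = lambda n: np.column_stack([0.01 * rng.standard_normal(n), rng.uniform(-1, 1, n)])

print(estimate_Q2(tight), estimate_Q2(boundary))
```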

40 Special Case: Moving the Centroids. If P has 2 modes, and they separate while the region between them has probability going to zero, then Q(2) tends to increase; this generalizes to arbitrary K. If P has 2 modes and they get closer, then Q(2) tends to decrease; this also generalizes to arbitrary K. In general, 1/K ≤ φ(K) ≤ 1. Overall, φ(K) seems to assess stability, and therefore we think Q_n(K) does too. Informal statement of a formal theorem: as P concentrates at its modes/centroids, Q(K) is largest at the correct number K of clusters.

41 Subsampling. Take a subsample of the data, possibly perturb it with noise, then recluster and measure the difference between the original clustering and the clustering of the perturbed data; take the average over several runs. The adjusted Rand and variation-of-information distances are the most popular.
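
For concreteness, a generic sketch of a subsampling assessment using the adjusted Rand index; this is one common variant, not necessarily the exact one used in the comparison.

```python
# Sketch: subsampling stability for a given K, scored by adjusted Rand
# agreement between the full-data clustering and subsample clusterings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def subsampling_stability(X, K, n_runs=20, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    base = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X)
    scores = []
    for _ in range(n_runs):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        sub = KMeans(n_clusters=K, n_init=10).fit(X[idx])
        # compare labels on the shared points; ARI is permutation-invariant
        scores.append(adjusted_rand_score(base.labels_[idx], sub.labels_))
    return float(np.mean(scores))
```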

42 Silhouette Distance. Define the point-to-cluster distance for any fixed point and any fixed cluster as the average of the distances from that point to all the other points in that cluster. The silhouette score for a point is then the difference between its point-to-cluster distances for the cluster it was assigned to and the next best cluster, scaled by the maximum of these two point-to-cluster distances. Thus each point is scored by its proximity to a boundary. The silhouette score for a clustering is the average of the silhouette scores of its points. K is chosen from the clustering with the smallest average silhouette distance.

43 Gap Statistic. The gap statistic of Tibshirani et al. (2001) uses the difference in total cluster spread, defined in terms of the sum of all pairwise distances within a cluster, between the actual dataset and the clusterings of reference distributions with no true clusters. In Euclidean space with the squared-error distance, the spread is the total empirical variance of all the clusters. The reference distributions adjust for the dependence of the measure on K and guard against spurious clusters. Tibshirani et al. (2001) propose using the uniform distribution within the bounding box of the original dataset, in principal-component coordinates.
