Clustering Method without Given Number of Clusters

Peng Xu, Fei Liu

1 Introduction

As we know, the k-means method is a very effective clustering algorithm, and its most powerful features are its scalability and simplicity. However, its most serious disadvantage is that the number of clusters must be known in the first place, which is usually a difficult problem in practice. In this paper, we propose a new approach, peak-searching clustering, to realize clustering without being given the number of clusters. Our method is based on the similarity graph [1] of the data points. Through the relationships between points, we capture those points which are near the cluster centers. By finding the peak areas of the stationary distribution of the random walk on the corresponding similarity graph, we obtain a way to capture the points near the cluster center areas, which we call peak points in this paper. The advantage of our method is that we do not need the number of clusters as an input; our algorithm estimates the number of clusters in the data set, which we believe can be a good indicator of the true value.

2 Peak-searching clustering

Denote X := {x^{(i)}}_{i=1,...,m} as the sample data set, where m is the sample size, n is the dimension, and x^{(i)} ∈ R^n. Since we do not know how many clusters there are in X, we try to obtain a good estimate of the number of clusters, or of the possible cluster centers. Assume the data set is dividable; then the points in one cluster tend to be near each other. Our simple strategy is to capture those points which are near the cluster centers.

2.1 degree and stationary distribution

Consider the similarity graph G = (V, E), with V = {x^{(1)}, x^{(2)}, ..., x^{(m)}}, and let W ∈ R^{m×m} be the corresponding weighted adjacency matrix. The generalized degree of a point x^{(i)} is defined as d_i = \sum_{j=1}^{m} W_{i,j}, and the degree matrix is defined as D = diag(d) = diag(d_1, d_2, ..., d_m). Here, d_i is determined by i, or equivalently by x^{(i)}; there is no essential difference. In the following passages we may use d = d(x^{(i)}) to denote the mapping x^{(i)} ↦ d(x^{(i)}) to simplify the illustration. Similarly, W_{i,j} is no different from W(x^{(i)}, x^{(j)}), and i can be used to denote the point x^{(i)} where it causes no ambiguity.

Now consider the random walk on the graph G with weighted adjacency matrix W. This random walk has a stationary distribution π over all the points. It is reasonable to claim that the closer a point i is to a cluster center, the higher π_i is; that is to say, π_i is a local maximum of the stationary distribution if i is the point closest to a center. From the theory of stochastic processes, we know that π is the solution of the linear equation

    πP = π, subject to \sum_{i=1}^{m} π_i = 1,

where P = D^{-1}W is the corresponding transition matrix. Since W is symmetric, we can directly solve for π as π_i = d_i / \sum_{j=1}^{m} d_j, i.e., π ∝ d. So we can use d to indicate the relative values of π.
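To make the construction concrete, here is a minimal Python/NumPy sketch of these quantities, assuming the fully connected Gaussian similarity graph used later in the experiments; the function names are illustrative, not part of the method.

import numpy as np

def gaussian_similarity(X, sigma):
    # W[i, j] = exp(-||x^(i) - x^(j)||^2 / (2 * sigma^2)), no self-loops.
    sq = (X ** 2).sum(axis=1)
    sqdist = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-sqdist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

def stationary_distribution(W):
    # For symmetric W, the random walk with transition matrix P = D^{-1} W
    # has stationary distribution pi = d / sum(d), i.e., pi is proportional to d.
    d = W.sum(axis=1)
    return d / d.sum(), d

In practice only d is needed, since π is just a fixed rescaling of it.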
2.2 peak-searching process

For a well-defined clustering problem, there should be several peaks in the set {(x^{(i)}, d(x^{(i)})) : i = 1, 2, ..., m}; that is, if we consider the mapping d from {x^{(1)}, x^{(2)}, ..., x^{(m)}} to R, then this mapping should have several local maximum areas, which are the center areas of the clusters. Given the degrees of all the points, we want to capture one point per such area, so as to locate one cluster center each. We call these points the peak points of the clusters. To find such points, we first take the point with the highest degree as the first peak point. Note that peak points should not be close to each other, because each of them is near the center of a different cluster; therefore, if we have an appropriate neighborhood of the first peak point, then the point of highest degree outside the neighborhood will be the second peak point. Theoretically, we can capture all the remaining peak points by cutting off appropriate neighborhoods of the existing peak points, but it is difficult to determine the size of the neighborhoods. Here we introduce the concept of the persistency of a point. As we increase the size of the neighborhood, the highest degree outside the neighborhood decreases. Specifically, we consider the k-nearest neighborhood of the first peak point; as we increase k from 1 to m, more and more points are included in the neighborhood, and the highest degree outside the growing neighborhood decreases. But there exists some resistance against this dropping tendency. We illustrate this with a 1D example, Figure 1.

Figure 1: In the left figure, there are two peaks and we need to capture the peak points x_1 and x_2. We get x_1 as the first peak point, because it has the maximum value of d. Then we continuously remove the neighborhood of x_1, that is, we remove the interval (x_1 - s, x_1 + s) with s growing from 0. As s grows, the maximum value of d outside the interval (x_1 - s, x_1 + s) drops, as shown in the right figure. But it drops to d(x_2) and keeps that level until x_2 is included in the neighborhood of x_1, after which the value goes down again. In this example s is continuous, but the basic idea is the same in the discrete situation.

We call such resistance against the dropping tendency the persistency of a point. In the example above, as we cut off neighborhoods of x_1 of larger and larger size, x_2 shows up to resist the dropping tendency. The persistency of x_2 is defined as the length of the period during which it holds the maximum value of d outside the neighborhood of x_1. Similarly, we can define the persistency of any other point. To simplify the illustration, we denote by s(k) the maximum value of d outside the k-nearest neighborhood of some given points; s(k) also depends on those points, but for simplicity we just write s(k) when it causes no ambiguity. Peak points tend to have higher persistency than non-peak points. In the example above, x_2 has the highest persistency, so we pick it out as the second peak point. After we have obtained N peak points, we search for the (N+1)-th peak point (if there is any) by the following rule: we cut off the k-nearest neighborhoods of the N current peak points simultaneously, observe the point with the highest d among the remaining points, and pick out the most persistent point during the growth of k from 1 to m. The point picked out is the potential (N+1)-th peak point.

Now we consider the termination of the searching process. One way is to set a lower bound on the persistency, since a short persistency does not reflect stable structure. But the lower bound varies between data sets and is not easy to select. Here we use another method: we preprocess the data to find those points that are likely to be near the centers of clusters. Recall that d indicates how likely a point is to be near a center. If d_i is higher than most of the d_j with j near i, then i is more likely to be a center. To compare d_i with the d_j of nearby points j, we simply compare it with the weighted average of the d_j, where the weight is W_{i,j}, an indicator of how near j is to i. Calling this weighted average h_i, we can compute h by

    h = dP^T = dWD^{-1}.

The termination condition is then: for a potential peak point x^{(i)}, if d_i > h_i, then i can be considered a real peak point, and the searching process proceeds; if not, then i is not considered a real peak point, and the process terminates.
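The smoothing h and the termination test translate directly from this formula; continuing the sketch above (the helper names are again ours):

import numpy as np

def smoothed_degree(W, d):
    # h = d P^T = d W D^{-1}; for symmetric W, h_i = (sum_j W[i, j] * d[j]) / d[i].
    return (W @ d) / d

def is_real_peak(c, d, h):
    # A candidate peak point c is accepted only if d_c > h_c.
    return d[c] > h[c]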
The idea of finding h comes from the theory of the heat equation: h is indeed a smoothed version of d, obtained by convolving d with a probability distribution, and the smoothing drags down the peaks and raises up the valleys. The behavior of heat is the same: as time goes on, heat flows from positions with high temperature to positions with low temperature, and the solution to the heat equation is a convolution of the initial data with the heat kernel. If we want to know where the peaks are, we simply find where the values are dragged down, and the dragged-down areas should be the peak areas.

Algorithm 1 Peak-searching clustering

Input: {x^{(i)} : i = 1, 2, ..., m}, the input data; W, the weighted adjacency matrix
Output: peak points (cluster centers); cluster labels of all points
Pseudo code:
1. Compute d_i = \sum_{j=1}^{m} W_{i,j}.
2. Compute h_i = (\sum_{j=1}^{m} W_{i,j} d_j) / d_i.
3. Add the x^{(i)} with the highest d_i to the set of peak points.
4. Find all peak points; repeat until stop:
       set the persistency of all points to 0
       for k = 1, 2, ..., m:
           find the x^{(i)} with the highest d_i among the points that are not in
           the k-nearest neighborhood of any current peak point
           increase the persistency of x^{(i)} by 1
       find the x^{(c)} with the highest persistency
       if d_c > h_c: add x^{(c)} to the set of peak points
       else: stop finding peak points
5. Set the peak points as the cluster centers. Let each of the other points share
   the same cluster label as its nearest peak point.
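Below is a compact NumPy sketch of Algorithm 1. It is one direct reading of the pseudocode, with neighborhoods taken in Euclidean distance to match the graph construction; it is not the authors' reference implementation, and the O(m^2) distance matrix limits it to modest m.

import numpy as np

def peak_search(X, W):
    m = X.shape[0]
    d = W.sum(axis=1)                       # step 1: generalized degrees
    h = (W @ d) / d                         # step 2: smoothed degrees
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    order = np.argsort(dist, axis=1)        # order[p] ranks all points by distance to p
    peaks = [int(np.argmax(d))]             # step 3: highest-degree point is the first peak
    while True:                             # step 4
        persistency = np.zeros(m, dtype=int)
        for k in range(1, m):
            cut = np.zeros(m, dtype=bool)   # union of the k-nearest neighborhoods of all peaks
            for p in peaks:
                cut[order[p, :k + 1]] = True
            outside = np.flatnonzero(~cut)
            if outside.size == 0:
                break
            # the highest-degree point still outside gains one unit of persistency
            persistency[outside[np.argmax(d[outside])]] += 1
        c = int(np.argmax(persistency))
        # stop when the best candidate has no persistency or fails the d > h test
        if persistency[c] == 0 or d[c] <= h[c]:
            return peaks
        peaks.append(c)

def assign_labels(X, peaks):
    # step 5: each point takes the label of its nearest peak point
    dist_to_peaks = np.linalg.norm(X[:, None, :] - X[peaks][None, :, :], axis=2)
    return np.argmin(dist_to_peaks, axis=1)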
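This evaluation protocol can be reproduced along the following lines with scikit-learn, reusing gaussian_similarity, peak_search, and assign_labels from the sketches above. The PCA threshold is our reading of the 90% criterion, and dp-means is omitted since it has no standard scikit-learn implementation.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def evaluate(X, y_true):
    Xp = PCA(n_components=0.9).fit_transform(X)  # keep components explaining >90% variance
    sigma = Xp.var(axis=0).mean()                # sigma = mean variance over attributes
    W = gaussian_similarity(Xp, sigma)
    peaks = peak_search(Xp, W)
    psc_labels = assign_labels(Xp, peaks)
    # k-means receives PSC's estimate of the number of clusters
    km_labels = KMeans(n_clusters=len(peaks), n_init=10).fit_predict(Xp)
    return (normalized_mutual_info_score(y_true, psc_labels),
            normalized_mutual_info_score(y_true, km_labels))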
Figure 2: The left panel ("data with labels") shows the original sample points generated by 3 Gaussian distributions with covariance matrices equal to I and means equal to ( 3, ) for the red-colored set, ( , 3) for the green-colored set, and (3, ) for the blue-colored set; the right panel ("data without labels") shows the same points without labels. The number of sample points in each colored set is: 5 red, green, 5 blue.

Figure 3: In the left figure, d represents the stationary distribution of the random walk on the graph G; the values of d successfully reflect how near a point is to a cluster center, and the peak areas of d are exactly the center areas of the clusters. The middle figure shows h, a smoothed version of d. In the right figure, we mark the points with d > h in red and those with d < h in blue; we see that this criterion is a good indicator of whether or not a point is in the center area of a cluster.

Figure 4: These three figures show the values of s(k) during the search (panel titles: persistency with 1 peak point, with 2 peak points, with 3 peak points). s(k) drops as k grows, but, as we can see, there are points that resist this drop; in the figures, the largest persistency periods are marked with red lines. In the first figure, we use 1 peak point (the point with the highest value of d), keep cutting off its k-nearest neighborhood, and capture the second peak point through its persistent behavior. In the middle figure, we use 2 peak points, simultaneously cut off their k-nearest neighborhoods, and capture the third peak point. In the right figure, as we have already found 3 peak points, there is no significant persistency anymore, and what we get is a false peak point; by the corresponding values of h and d, it is not in a center area, and the searching process terminates.
Figure 5: As the search for peak points terminates, we can assign cluster labels to all the points. The left panel ("peak points") shows the peak points found, marked in red; the right panel ("assigned labels") shows the assigned labels.

data set      PSC         dp-means    k-means
Iris (3)      .7 ()       .77 ()      .79 ()
Wine (3)      .35 ()      .33 ()      .59 ()
Seeds (3)     .97 (3)     .57 ()      .99 (3)
Soybean (4)   .7375 ()    .93 (5)     .73 ()
Pima (2)      .57 ()      .9 (5)      . ()

Table 1: UCI data sets: NMI (number of clusters). The first column gives the name of each data set with the number of true classes in parentheses; in the other columns, the numbers in parentheses are the numbers of clusters the algorithms output. In the case of k-means, it is the same as that of peak-searching clustering (PSC).

4 Discussion

In the previous sections we exhibited a brand-new clustering method, peak-searching clustering, which performs well in many cases. A few points should be mentioned here. First, our method is based on the similarity graph, and essentially on the Euclidean distances between data points. As a consequence, our method can only deal with linearly separable problems, the same as the k-means approach; for a structure like a ring within a ring, we cannot separate the two rings. Second, PSC is a very intuitive and straightforward method: we start from the connections between points, try to distinguish the points lying near a center from those lying on the margin, and give reasonable clusters based on those center points. However, as to the clustering problem itself, the number of clusters is probably undeterminable in practice, since there is no absolute standard of clustering; a good clustering method can only provide a reasonable solution instead of the right solution, which is exactly what our approach does. The last point is the parameter-selection problem. In PSC, the construction of the similarity graph turns out to be a very important step. Whether we use the k-NN graph or the Gaussian weighted graph, we have a parameter k or σ to determine, and in some extreme cases the clustering result is quite sensitive to this selection. In our experiments, we choose σ as the mean of the variances of the data points over all attributes. This is probably not the best choice; how to select σ might be a good problem to work on.

5 Conclusion

In this paper, we start from a simple intuition and provide a new clustering method, peak-searching clustering, which does not require knowing the number of clusters. We explain the idea from the perspective of random walks and introduce the concept of persistency in the peak-searching process. In the experiments, PSC does a good job on clustering problems.

References

[1] U. von Luxburg. A tutorial on spectral clustering. Tech. Rep. 149, Max Planck Institute for Biological Cybernetics, August 2006.

[2] B. Kulis and M. I. Jordan. Revisiting k-means: New algorithms via Bayesian nonparametrics. In ICML, 2012.