arxiv: v4 [cs.ds] 7 Mar 2014

Analysis of Agglomerative Clustering

Marcel R. Ackermann, Johannes Blömer, Daniel Kuntze, Christian Sohler

Abstract

The diameter $k$-clustering problem is the problem of partitioning a finite subset of $\mathbb{R}^d$ into $k$ subsets called clusters such that the maximum diameter of the clusters is minimized. One early clustering algorithm that computes a hierarchy of approximate solutions to this problem (for all values of $k$) is the agglomerative clustering algorithm with the complete linkage strategy. For decades, this algorithm has been widely used by practitioners. However, it is not well studied theoretically. In this paper, we analyze the agglomerative complete linkage clustering algorithm. Assuming that the dimension $d$ is a constant, we show that for any $k$ the solution computed by this algorithm is an $O(\log k)$-approximation to the diameter $k$-clustering problem. Our analysis does not only hold for the Euclidean distance but for any metric that is based on a norm. Furthermore, we analyze the closely related $k$-center and discrete $k$-center problem. For the corresponding agglomerative algorithms, we deduce an approximation factor of $O(\log k)$ as well.

Keywords: agglomerative clustering, hierarchical clustering, complete linkage, approximation guarantees

1 Introduction

Clustering is the process of partitioning a set of objects into subsets (called clusters) such that each subset contains similar objects and objects in different subsets are dissimilar. There are many applications for clustering, including data compression [1], analysis of gene expression data [5], anomaly detection [11], and structuring results of search engines []. For every application, a proper objective function is used to measure the quality of a clustering. One particular objective function is the largest diameter of the clusters. If the desired number of clusters $k$ is given, we call the problem of minimizing this objective function the diameter $k$-clustering problem.
One of the earliest and most widely used clustering strategies is agglomerative clustering. The history of agglomerative clustering goes back at least to the 1950s (see for example [7, 1]). Later, biological taxonomy became one of the driving forces of cluster analysis. In [15] the authors, who were the first biologists using computers to classify organisms, discuss several agglomerative clustering methods.

Agglomerative clustering is a bottom-up clustering process. At the beginning, every input object forms its own cluster. In each subsequent step, the two closest clusters are merged until only one cluster remains. This clustering process creates a hierarchy of clusters, such that for any two different clusters $A$ and $B$ from possibly different levels of the hierarchy we either have $A \cap B = \emptyset$, $A \subset B$, or $B \subset A$. Such a hierarchy is useful in many applications, for example, when one is interested in hereditary properties of the clusters (as in some bioinformatics applications) or if the exact number of clusters is a priori unknown.

In order to define the agglomerative strategy properly, we have to specify a distance measure between clusters. Given a distance function between data objects, the following distance measures between clusters are frequently used. In the single linkage strategy, the distance between two clusters is defined as the distance between their closest pair of data objects. Using this strategy is equivalent to computing a minimum spanning tree of the graph induced by the distance function using Kruskal's algorithm. In case of the complete linkage strategy, the distance between two clusters is defined as the distance between their farthest pair of data objects. In the average linkage strategy the distance is defined as the average distance between data objects from the two clusters.

Footnotes:
A preliminary version of this article appeared in Proceedings of the 28th International Symposium on Theoretical Aspects of Computer Science (STACS'11), March 2011. This article also appeared in Algorithmica. The final publication is available at
Marcel R. Ackermann: Schloss Dagstuhl, Leibniz Center for Informatics, Wadern, Germany, mra@bis.uni-trier.de, work done while at Department of Computer Science, University of Paderborn, Germany.
Johannes Blömer, Daniel Kuntze: Department of Computer Science, University of Paderborn, 33098 Paderborn, Germany, {bloemer,kuntze}@upb.de
Christian Sohler: Department of Computer Science, TU Dortmund, 44221 Dortmund, Germany, christian.sohler@tu-dortmund.de
For all four authors this research was supported by the German Research Foundation (DFG), grants BL 1/6- and SO 51/-.

1.1 Related Work

In this paper, we study the agglomerative clustering algorithm using the complete linkage strategy to find a hierarchical clustering of $n$ points from $\mathbb{R}^d$. The running time is obviously polynomial in the description length of the input. Therefore, our only goal in this paper is to give an approximation guarantee for the diameter $k$-clustering problem. The approximation guarantee is given by a factor $\alpha$ such that the cost of the $k$-clustering computed by the algorithm is at most $\alpha$ times the cost of an optimal $k$-clustering.

Although the agglomerative complete linkage clustering algorithm is widely used, there are only few theoretical results considering the quality of the clustering computed by this algorithm. It is known that there exists a certain metric distance function such that this algorithm computes a $k$-clustering with an approximation factor of $\Omega(\log k)$ []. However, prior to the analysis we present in this paper, no non-trivial upper bound for the approximation guarantee of the classical complete linkage agglomerative clustering algorithm was known, and deriving such a bound has been discussed as one of the open problems in [].

The diameter $k$-clustering problem is closely related to the $k$-center problem. In this problem, we are searching for $k$ centers and the objective is to minimize the maximum distance of any input point to the nearest center. When the centers are restricted to come from the set of the input points, the problem is called the discrete $k$-center problem.
It is known that for metric distance functions the costs of optimal solutions to all three problems are within a factor of 2 from each other. For the Euclidean case, we know that for fixed $k$, i.e. when we are not interested in a hierarchy of clusterings, the diameter $k$-clustering problem and the $k$-center problem are NP-hard. In fact, it is already NP-hard to approximate both problems with an approximation factor below 1.96 and 1.82 respectively [6].

Furthermore, there exist provably good approximation algorithms in this case. For the discrete $k$-center problem, a simple 2-approximation algorithm is known for metric spaces [9], which immediately yields a 4-approximation algorithm for the diameter $k$-clustering problem. For the $k$-center problem, a variety of results is known. For example, for the Euclidean metric in [1] a $(1+\epsilon)$-approximation algorithm with running time $2^{O(k \log k / \epsilon^2)}\, dn$ is shown. This implies a $(2+\epsilon)$-approximation algorithm with the same running time for the diameter $k$-clustering problem. Also, for metric spaces a hierarchical clustering strategy with an approximation guarantee of 8 for the discrete $k$-center problem is known []. This implies an algorithm with an approximation guarantee of 16 for the diameter $k$-clustering problem.

This paper as well as all of the above mentioned work is about static clustering, i.e. in the problem definition we are given the whole set of input points at once. An alternative model of the input data is to consider sequences of points that are given one after another. In [], the authors discuss clustering in a so-called incremental clustering model. They give an algorithm with constant approximation factor that maintains a hierarchical clustering while new points are added to the input set. Furthermore, they show a lower bound of $\Omega(\log k)$ for the agglomerative complete linkage algorithm and the diameter $k$-clustering problem. However, since their model differs from ours, their results have no bearing on our results.

1.2 Our contribution

In this paper, we study the agglomerative complete linkage clustering algorithm and related algorithms for input sets $X \subset \mathbb{R}^d$. To measure the distance between data points, we use a metric that is based on a norm, e.g., the Euclidean metric. We prove that the agglomerative solution to the diameter $k$-clustering problem is an $O(\log k)$-approximation. Here, the $O$-notation hides a constant that is doubly exponential in $d$. This approximation guarantee holds for every level of the hierarchy computed by the algorithm. That is, we compare each computed $k$-clustering with an optimal solution for that particular value of $k$. These optimal $k$-clusterings do not necessarily form a hierarchy. In fact, there are simple examples where optimal solutions have no hierarchical structure.

Our analysis also yields that if we allow $2k$ instead of $k$ clusters and compare the cost of the computed $2k$-clustering to an optimal solution with $k$ clusters, the approximation factor is independent of $k$ and depends only on $d$. Moreover, the techniques of our analysis can be applied to prove stronger results for the $k$-center problem and the discrete $k$-center problem. For the $k$-center problem, we derive an approximation guarantee that is logarithmic in $k$ and only singly exponential in $d$. For the discrete $k$-center problem, we derive an approximation guarantee that is logarithmic in $k$ and the dependence on $d$ is only linear and additive.

Furthermore, we give almost matching upper and lower bounds for the one-dimensional case. These bounds are independent of $k$. For $d \ge 2$ and the metric based on the $\ell_\infty$-norm, we provide a lower bound that exceeds the upper bound for $d = 1$. For $d \ge 3$, we give a lower bound for the Euclidean case which is larger than the lower bound for $d = 1$. Finally, we construct instances providing lower bounds for any metric based on an $\ell_p$-norm with $1 \le p \le \infty$. However, the construction of these instances needs the dimension to depend on $k$.

2 Preliminaries and problem definitions

Throughout this paper, we consider input sets that are finite subsets of $\mathbb{R}^d$. Our results hold for arbitrary metrics that are based on a norm, i.e., the distance $\|x - y\|$ between two points $x, y \in \mathbb{R}^d$ is measured using an arbitrary norm $\|\cdot\|$. Readers who are not familiar with arbitrary metrics or are only interested in the Euclidean case may assume that $\|\cdot\|_2$ is used, i.e. $\|x - y\| = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$. For $r \in \mathbb{R}$ and $y \in \mathbb{R}^d$, we denote the closed $d$-dimensional ball of radius $r$ centered at $y$ by $B_r^d(y) := \{x \mid \|x - y\| \le r\}$.

Given $k \in \mathbb{N}$ and a finite set $X \subset \mathbb{R}^d$ with $k \le |X|$, we say that $\mathcal{C}_k = \{C_1, \ldots, C_k\}$ is a $k$-clustering of $X$ if the sets $C_1, \ldots, C_k$ (called clusters) form a partition of $X$ into $k$ non-empty subsets. We call a collection of $k$-clusterings of the same finite set $X$ but for different values of $k$ hierarchical if it fulfills the following two properties.
First, for any $1 \le k \le |X|$ the collection contains at most one $k$-clustering. Second, for any two of its clusterings $\mathcal{C}_i, \mathcal{C}_j$ with $|\mathcal{C}_i| = i < j = |\mathcal{C}_j|$, every cluster in $\mathcal{C}_i$ is the union of one or more clusters from $\mathcal{C}_j$. A hierarchical collection of clusterings is called a hierarchical clustering.

We define the diameter of a finite and non-empty set $C \subset \mathbb{R}^d$ to be $\mathrm{diam}(C) := \max_{x,y \in C} \|x - y\|$. Furthermore, we define the diameter cost of a $k$-clustering $\mathcal{C}_k$ as its largest diameter, i.e. $\mathrm{cost}_{diam}(\mathcal{C}_k) := \max_{C \in \mathcal{C}_k} \mathrm{diam}(C)$. The radius of $C$ is defined as $\mathrm{rad}(C) := \min_{y \in \mathbb{R}^d} \max_{x \in C} \|x - y\|$ and the radius cost of a $k$-clustering $\mathcal{C}_k$ is defined as its largest radius, i.e. $\mathrm{cost}_{rad}(\mathcal{C}_k) := \max_{C \in \mathcal{C}_k} \mathrm{rad}(C)$. Finally, we define the discrete radius of $C$ to be $\mathrm{drad}(C) := \min_{y \in C} \max_{x \in C} \|x - y\|$ and the discrete radius cost of a $k$-clustering $\mathcal{C}_k$ is defined as its largest discrete radius, i.e. $\mathrm{cost}_{drad}(\mathcal{C}_k) := \max_{C \in \mathcal{C}_k} \mathrm{drad}(C)$.

Problem 1 (discrete $k$-center). Given $k \in \mathbb{N}$ and a finite set $X \subset \mathbb{R}^d$ with $|X| \ge k$, find a $k$-clustering $\mathcal{C}_k$ of $X$ with minimal discrete radius cost.

Problem 2 ($k$-center). Given $k \in \mathbb{N}$ and a finite set $X \subset \mathbb{R}^d$ with $|X| \ge k$, find a $k$-clustering $\mathcal{C}_k$ of $X$ with minimal radius cost.

Problem 3 (diameter $k$-clustering). Given $k \in \mathbb{N}$ and a finite set $X \subset \mathbb{R}^d$ with $|X| \ge k$, find a $k$-clustering $\mathcal{C}_k$ of $X$ with minimal diameter cost.

For our analysis of agglomerative clustering, we repeatedly use the volume argument stated in Lemma 5. This argument provides an upper bound on the minimum distance between two points from a finite set of points lying inside the union of finitely many balls. For the application of this argument, the following definition is crucial.

Definition 4. Let $k \in \mathbb{N}$ and $r \in \mathbb{R}$. A set $X \subset \mathbb{R}^d$ is called $(k,r)$-coverable if there exist $y_1, \ldots, y_k \in \mathbb{R}^d$ with $X \subseteq \bigcup_{i=1}^{k} B_r^d(y_i)$.

Lemma 5. Let $k \in \mathbb{N}$, $r \in \mathbb{R}$ and $P \subset \mathbb{R}^d$ be finite and $(k,r)$-coverable with $|P| > k$. Then, there exist distinct $p, q \in P$ such that
$$\|p - q\| \le 4r \sqrt[d]{\frac{k}{|P|}}.$$

Proof. Let $Z \subset \mathbb{R}^d$ with $|Z| = k$ and $P \subseteq \bigcup_{z \in Z} B_r^d(z)$. We define $\delta$ to be the minimum distance between two points of $P$, i.e. $\delta := \min_{p,q \in P,\, p \ne q} \|p - q\|$. We assume for contradiction that $u := 4r \sqrt[d]{k/|P|} < \delta$. Since $|P| > k$ there exists $z \in Z$ with $|B_r^d(z) \cap P| \ge 2$. It follows that $\delta \le 2r$ and hence, $u < 2r$. Note that for any $y \in \mathbb{R}^d$, $R \in \mathbb{R}$, and any norm $\|\cdot\|$, we have $\mathrm{vol}\big(B_R^d(y)\big) = R^d \cdot V_d$, where $V_d$ is the volume of the $d$-dimensional unit ball $B_1^d(0)$ (see [16], Corollary 6..15). Therefore, we deduce
$$\mathrm{vol}\Big(\bigcup_{z \in Z} B_{r+u/2}^d(z)\Big) \le \sum_{z \in Z} \mathrm{vol}\big(B_{r+u/2}^d(z)\big) < k\,(2r)^d\, V_d.$$
Furthermore, since any $p \in P$ is contained in a ball $B_r^d(z)$ for some $z \in Z$, we conclude that any ball $B_{u/2}^d(p)$ for $p \in P$ is contained in a ball $B_{r+u/2}^d(z)$ for some $z \in Z$ (see Figure 1). Thus, we deduce
$$\mathrm{vol}\Big(\bigcup_{p \in P} B_{u/2}^d(p)\Big) < k\,(2r)^d\, V_d. \qquad (1)$$
However, since $u < \delta$, for any distinct $p, q \in P$, we have $B_{u/2}^d(p) \cap B_{u/2}^d(q) = \emptyset$. Therefore, the total volume of the $|P|$ balls $B_{u/2}^d(p)$ is given by
$$\mathrm{vol}\Big(\bigcup_{p \in P} B_{u/2}^d(p)\Big) = |P| \left(\frac{u}{2}\right)^d V_d = k\,(2r)^d\, V_d,$$
using the definition of $u$. This contradicts (1). We obtain $\delta \le u$, which proves the lemma.

[Figure 1: The volume argument.]

3 Analysis

In this section we analyze the agglomerative clustering algorithms for the (discrete) $k$-center problem and the diameter $k$-clustering problem. As mentioned in the introduction, an agglomerative algorithm takes a bottom-up approach. It starts with the $|X|$-clustering that contains one cluster for each input point and then successively merges two of the remaining clusters such that the cost of the resulting clustering is minimized. That is, in each merge step the agglomerative algorithms for Problem 1, Problem 2 and Problem 3 minimize the discrete radius, the radius and the diameter of the resulting cluster, respectively.

Our main objective is the agglomerative complete linkage clustering algorithm, which minimizes the diameter in every step. Nevertheless, we start with the analysis of the agglomerative algorithm for the discrete $k$-center problem since it is the simplest one of the three. Then we adapt our analysis to the $k$-center problem and finally to the diameter $k$-clustering problem. In each case we need to introduce further techniques to deal with the increased complexity of the given problem.
We show that all three algorithms compute an $O(\log k)$-approximation for the particular corresponding clustering problem. However, the dependency on the dimension which is hidden in the $O$-notation ranges from only linear and additive in case of the discrete $k$-center problem to a factor that is doubly exponential in case of the diameter $k$-clustering problem.
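The bottom-up process described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `agglomerate`, `diam`, `drad` and `cost` are hypothetical, the Euclidean metric stands in for an arbitrary norm, input points are assumed to be distinct tuples, and the brute-force search over all pairs is chosen for clarity rather than speed. The radius objective of Problem 2 is left out because it would require a minimum enclosing ball computation.

```python
from itertools import combinations
import math


def dist(x, y):
    """Euclidean distance (assumption: any norm-based metric could be used)."""
    return math.dist(x, y)


def diam(cluster):
    """Diameter: largest pairwise distance (0 for a single point)."""
    return max((dist(x, y) for x, y in combinations(cluster, 2)), default=0.0)


def drad(cluster):
    """Discrete radius: best center is restricted to points of the cluster."""
    return min(max(dist(c, x) for x in cluster) for c in cluster)


def cost(clustering, objective):
    """Cost of a clustering: largest objective value over its clusters."""
    return max(objective(c) for c in clustering)


def agglomerate(points, objective):
    """Return a dict mapping k to the computed k-clustering, k = |X|, ..., 1."""
    clustering = [frozenset([p]) for p in points]
    hierarchy = {len(clustering): list(clustering)}
    while len(clustering) > 1:
        # Greedy step: merge the pair whose union has the smallest objective.
        a, b = min(combinations(clustering, 2),
                   key=lambda pair: objective(pair[0] | pair[1]))
        clustering = [c for c in clustering if c is not a and c is not b]
        clustering.append(a | b)
        hierarchy[len(clustering)] = list(clustering)
    return hierarchy


points = [(0.0,), (1.0,), (3.0,), (7.0,)]
hierarchy = agglomerate(points, diam)   # complete linkage
print(cost(hierarchy[2], diam))         # 3.0
```

Passing `drad` instead of `diam` gives the greedy rule for the discrete $k$-center problem; on the same four points the 2-clustering then has discrete radius cost 2.0.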

AgglomerativeDiscreteRadius(X):
    X: finite set of input points from $\mathbb{R}^d$
    1: $\mathcal{C}_{|X|} := \{\{x\} \mid x \in X\}$
    2: for $i = |X| - 1, \ldots, 1$ do
    3:     find distinct clusters $A, B \in \mathcal{C}_{i+1}$ minimizing $\mathrm{drad}(A \cup B)$
    4:     $\mathcal{C}_i := (\mathcal{C}_{i+1} \setminus \{A, B\}) \cup \{A \cup B\}$
    5: end for
    6: return $\mathcal{C}_1, \ldots, \mathcal{C}_{|X|}$
Algorithm 1: The agglomerative algorithm for the discrete $k$-center problem.

As mentioned in the introduction, the costs of optimal solutions to the three problems are within a factor of 2 from each other. That is, each algorithm computes an $O(\log k)$-approximation for all three problems. However, we will analyze the proper agglomerative algorithm for each problem.

3.1 Discrete k-center clustering

The agglomerative algorithm for the discrete $k$-center problem is stated as Algorithm 1. Given a finite set $X \subset \mathbb{R}^d$ of input points, the algorithm computes hierarchical $k$-clusterings for all values of $k$ between 1 and $|X|$. We denote them by $\mathcal{C}_1, \ldots, \mathcal{C}_{|X|}$. Throughout this section, cost always means discrete radius cost. $\mathrm{opt}_k$ refers to the cost of an optimal discrete $k$-center clustering of $X \subset \mathbb{R}^d$ where $k \in \mathbb{N}$ with $k \le |X|$, i.e. the cost of an optimal solution to Problem 1. The following theorem states our result for the discrete $k$-center problem.

Theorem 6. Let $X \subset \mathbb{R}^d$ be a finite set of points. Then, for all $k \in \mathbb{N}$ with $k \le |X|$, the partition $\mathcal{C}_k$ of $X$ into $k$ clusters as computed by Algorithm 1 satisfies
$$\mathrm{cost}_{drad}(\mathcal{C}_k) < (20d + 2\log_2(k) + 2) \cdot \mathrm{opt}_k,$$
where $\mathrm{opt}_k$ denotes the cost of an optimal solution to Problem 1.

Since any cluster $C$ is contained in a ball of radius $\mathrm{drad}(C)$, we have that $X$ is $(k, \mathrm{cost}_{drad}(\mathcal{C}_k))$-coverable for any $k$-clustering $\mathcal{C}_k$ of $X$. It follows that $X$ is $(k, \mathrm{opt}_k)$-coverable. This fact, as well as the following observation about the greedy strategy of Algorithm 1, will be used frequently in our analysis.

Observation 7. The cost of all computed clusterings is equal to the discrete radius of the cluster created last. Furthermore, the discrete radius of the union of any two clusters is always an upper bound for the cost of the clustering to be computed next.

We prove Theorem 6 in two steps.
First, Proposition 8 in Section 3.1.1 provides an upper bound to the cost of the intermediate $2k$-clustering. This upper bound is independent of $k$ and $|X|$, only linear in $d$, and may be of independent interest. In its proof, we use Lemma 5 to bound the distance between the centers of pairs of remaining clusters. The cost of merging such a pair gives an upper bound to the cost of the next merge step. Therefore, we can bound the discrete radius of the created cluster by the sum of the larger of the two clusters' discrete radii and the distance between their centers. Second, in Section 3.1.2, we analyze the remaining $k$ merge steps of Algorithm 1 down to the computation of the $k$-clustering. There, we no longer need to apply the volume argument from Lemma 5 to bound the distance between two cluster centers. It will be replaced by a very simple bound that is already sufficient. Analogously to the first step, this leads to a bound for the cost of merging a pair of clusters.

3.1.1 Analysis of the 2k-clustering

Proposition 8. Let $X \subset \mathbb{R}^d$ be finite. Then, for all $k \in \mathbb{N}$ with $2k \le |X|$, the partition $\mathcal{C}_{2k}$ of $X$ into $2k$ clusters as computed by Algorithm 1 satisfies
$$\mathrm{cost}_{drad}(\mathcal{C}_{2k}) < 20d \cdot \mathrm{opt}_k,$$
where $\mathrm{opt}_k$ denotes the cost of an optimal solution to Problem 1.

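[Figure 2: $\mathrm{drad}(C_1 \cup C_2) < \mathrm{cost}_{drad}(\mathcal{C}_m) + \|p_{C_1} - p_{C_2}\|$.]

To prove Proposition 8, we divide the merge steps of Algorithm 1 into phases, each reducing the number of remaining clusters by one fourth. The following lemma bounds the increase of the cost during a single phase by an additive term.

Lemma 9. Let $m \in \mathbb{N}$ with $2k < m \le |X|$. Then,
$$\mathrm{cost}_{drad}\big(\mathcal{C}_{\lfloor 3m/4 \rfloor}\big) < \mathrm{cost}_{drad}(\mathcal{C}_m) + 4\sqrt[d]{\frac{2k}{m}} \cdot \mathrm{opt}_k.$$

Proof. Let $R := \mathrm{cost}_{drad}(\mathcal{C}_m)$. From every cluster $C \in \mathcal{C}_m$, we fix a center $p_C \in C$ with $C \subseteq B_R^d(p_C)$. Let $t := \lfloor 3m/4 \rfloor$. Then, $\mathcal{C}_m \cap \mathcal{C}_{t+1}$ is the set of clusters from $\mathcal{C}_m$ that still exist $\lceil m/4 \rceil - 1$ merge steps after the computation of $\mathcal{C}_m$. In each iteration of its loop, the algorithm can merge at most two clusters from $\mathcal{C}_m$. Thus, $|\mathcal{C}_m \cap \mathcal{C}_{t+1}| > m/2$. Let $P := \{p_C \mid C \in \mathcal{C}_m \cap \mathcal{C}_{t+1}\}$. Then, $|P| = |\mathcal{C}_m \cap \mathcal{C}_{t+1}| > m/2 > k$. Since $X$ is $(k, \mathrm{opt}_k)$-coverable, so is $P \subseteq X$. Therefore, by Lemma 5, there exist distinct $C_1, C_2 \in \mathcal{C}_m \cap \mathcal{C}_{t+1}$ such that
$$\|p_{C_1} - p_{C_2}\| \le 4\sqrt[d]{\frac{k}{|P|}} \cdot \mathrm{opt}_k < 4\sqrt[d]{\frac{2k}{m}} \cdot \mathrm{opt}_k.$$
Then, the distance from $p_{C_1}$ to any $q \in C_2$ is less than $4\sqrt[d]{2k/m} \cdot \mathrm{opt}_k + R$. We conclude that merging $C_1$ and $C_2$ would result in a cluster whose discrete radius can be upper bounded by
$$\mathrm{drad}(C_1 \cup C_2) < \mathrm{cost}_{drad}(\mathcal{C}_m) + 4\sqrt[d]{\frac{2k}{m}} \cdot \mathrm{opt}_k$$
(see Figure 2). The result follows using $C_1, C_2 \in \mathcal{C}_{t+1}$ and Observation 7.

To prove Proposition 8, we apply Lemma 9 for $\lceil \log_{4/3} \frac{|X|}{2k} \rceil$ consecutive phases.

Proof of Proposition 8. Let $u := \lceil \log_{4/3} \frac{|X|}{2k} \rceil$ and define $m_i := \lceil (3/4)^i |X| \rceil$ for all $i = 0, \ldots, u$. Then, $m_u \le 2k$ and $m_i > 2k$ for all $i = 0, \ldots, u-1$. Since
$$\left\lfloor \frac{3 m_i}{4} \right\rfloor = \left\lfloor \frac{3}{4} \left\lceil \left(\frac{3}{4}\right)^i |X| \right\rceil \right\rfloor \le \left\lfloor \left(\frac{3}{4}\right)^{i+1} |X| + \frac{3}{4} \right\rfloor \le \left\lceil \left(\frac{3}{4}\right)^{i+1} |X| \right\rceil = m_{i+1}$$
and Algorithm 1 uses a greedy strategy, we get $\mathrm{cost}_{drad}(\mathcal{C}_{m_{i+1}}) \le \mathrm{cost}_{drad}(\mathcal{C}_{\lfloor 3m_i/4 \rfloor})$ for all $i = 0, \ldots, u-1$. Combining this with Lemma 9 (applied to $m = m_i$), we obtain
$$\mathrm{cost}_{drad}(\mathcal{C}_{m_{i+1}}) < \mathrm{cost}_{drad}(\mathcal{C}_{m_i}) + 4\sqrt[d]{\frac{2k}{m_i}} \cdot \mathrm{opt}_k.$$
By repeatedly applying this inequality for $i = 0, \ldots, u-1$ and using $\mathrm{cost}_{drad}(\mathcal{C}_{2k}) \le \mathrm{cost}_{drad}(\mathcal{C}_{m_u})$ and $\mathrm{cost}_{drad}(\mathcal{C}_{m_0}) = 0$, we deduce
$$\mathrm{cost}_{drad}(\mathcal{C}_{2k}) < \left(\sum_{i=0}^{u-1} 4\sqrt[d]{\frac{2k}{m_i}}\right) \mathrm{opt}_k \le 4\sqrt[d]{\frac{2k}{|X|}} \cdot \sum_{i=0}^{u-1} \left(\sqrt[d]{\frac{4}{3}}\right)^{i} \cdot \mathrm{opt}_k.$$
Solving the geometric series and using $u - 1 < \log_{4/3} \frac{|X|}{2k}$ leads to
$$\mathrm{cost}_{drad}(\mathcal{C}_{2k}) < 4\sqrt[d]{\frac{2k}{|X|}} \cdot \frac{\left(\sqrt[d]{4/3}\right)^{u} - 1}{\sqrt[d]{4/3} - 1} \cdot \mathrm{opt}_k < \frac{4\sqrt[d]{4/3}}{\sqrt[d]{4/3} - 1} \cdot \mathrm{opt}_k \le \frac{16}{3} \cdot \frac{1}{\sqrt[d]{4/3} - 1} \cdot \mathrm{opt}_k. \qquad (2)$$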

By taking only the first two terms of the series expansion of the exponential function, we get $\sqrt[d]{4/3} = e^{\ln(4/3)/d} > 1 + \frac{\ln(4/3)}{d}$. Substituting this bound into (2) gives
$$\mathrm{cost}_{drad}(\mathcal{C}_{2k}) < \frac{16\,d}{3 \ln(4/3)} \cdot \mathrm{opt}_k < 20d \cdot \mathrm{opt}_k.$$

3.1.2 Analysis of the remaining merge steps

The analysis of the remaining merge steps introduces the $O(\log k)$ term to the approximation factor of our result. It is similar to the analysis used in the proof of Proposition 8. Again, we divide the merge steps into phases. However, this time one phase consists of one half of the remaining merge steps. Furthermore, we are able to replace the volume argument from Lemma 5 by a simpler bound. More precisely, as long as there are more than $k$ clusters left, we are able to find a pair of clusters whose centers lie in the same cluster of an optimal $k$-clustering. That is, the distance between the centers is at most two times the discrete radius of the common cluster in the optimal clustering. The following lemma bounds the increase of the cost during a single phase.

Lemma 10. Let $m \in \mathbb{N}$ with $k < m \le |X|$. Then,
$$\mathrm{cost}_{drad}\big(\mathcal{C}_{k + \lfloor (m-k)/2 \rfloor}\big) < \mathrm{cost}_{drad}(\mathcal{C}_m) + 2\,\mathrm{opt}_k.$$

Proof. Let $R := \mathrm{cost}_{drad}(\mathcal{C}_m)$. From every cluster $C \in \mathcal{C}_m$, we fix a center $p_C \in C$ with $C \subseteq B_R^d(p_C)$. Let $t := k + \lfloor (m-k)/2 \rfloor$. Then, $\mathcal{C}_m \cap \mathcal{C}_{t+1}$ is the set of clusters from $\mathcal{C}_m$ that still exist $\lceil (m-k)/2 \rceil - 1$ merge steps after the computation of $\mathcal{C}_m$. In each iteration of its loop, the algorithm can merge at most two clusters from $\mathcal{C}_m$. Thus, $|\mathcal{C}_m \cap \mathcal{C}_{t+1}| > k$. Let $P := \{p_C \mid C \in \mathcal{C}_m \cap \mathcal{C}_{t+1}\}$. Since $X$ is $(k, \mathrm{opt}_k)$-coverable, so is $P \subseteq X$. Therefore, using $|P| > k$ it follows that there exist distinct $C_1, C_2 \in \mathcal{C}_m \cap \mathcal{C}_{t+1}$ such that $p_{C_1}$ and $p_{C_2}$ are contained in the same ball of radius $\mathrm{opt}_k$, i.e. $\|p_{C_1} - p_{C_2}\| \le 2\,\mathrm{opt}_k$. Then, the distance from $p_{C_1}$ to any $q \in C_2$ is at most $2\,\mathrm{opt}_k + R$. We conclude that merging $C_1$ and $C_2$ would result in a cluster whose discrete radius can be upper bounded by $\mathrm{drad}(C_1 \cup C_2) < \mathrm{cost}_{drad}(\mathcal{C}_m) + 2\,\mathrm{opt}_k$ (see Figure 2). The result follows using $C_1, C_2 \in \mathcal{C}_{t+1}$ and Observation 7.

To prove Theorem 6, we apply Lemma 10 for about $\log k$ consecutive phases.

Proof of Theorem 6.
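Let $\varepsilon > 0$ and $u := \lceil \log_2(k) + \varepsilon \rceil$ such that $\log_2 k < u \le \log_2(k) + 1$. Furthermore, define $m_i := k + \lfloor (1/2)^i k \rfloor$ for all $i = 0, \ldots, u$. Then, $m_u = k$ and $m_i > k$ for all $i = 0, \ldots, u-1$. Since
$$k + \left\lfloor \frac{m_i - k}{2} \right\rfloor = k + \left\lfloor \frac{1}{2} \left\lfloor \left(\frac{1}{2}\right)^i k \right\rfloor \right\rfloor \le k + \left\lfloor \left(\frac{1}{2}\right)^{i+1} k \right\rfloor = m_{i+1}$$
and Algorithm 1 uses a greedy strategy, we deduce $\mathrm{cost}_{drad}(\mathcal{C}_{m_{i+1}}) \le \mathrm{cost}_{drad}\big(\mathcal{C}_{k + \lfloor (m_i - k)/2 \rfloor}\big)$ for all $i = 0, \ldots, u-1$. Combining this with Lemma 10 (applied to $m = m_i$), we obtain
$$\mathrm{cost}_{drad}(\mathcal{C}_{m_{i+1}}) < \mathrm{cost}_{drad}(\mathcal{C}_{m_i}) + 2\,\mathrm{opt}_k.$$
By repeatedly applying this inequality for $i = 0, \ldots, u-1$ and using $m_0 = 2k$, we get
$$\mathrm{cost}_{drad}(\mathcal{C}_k) < \mathrm{cost}_{drad}(\mathcal{C}_{2k}) + \sum_{i=0}^{u-1} 2\,\mathrm{opt}_k \le \mathrm{cost}_{drad}(\mathcal{C}_{2k}) + (2\log_2(k) + 2) \cdot \mathrm{opt}_k.$$
Hence, the result follows using Proposition 8.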
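The phase schedule in this proof is easy to check numerically. The sketch below uses the hypothetical helper names `phase_schedule` and `theorem6_factor`, and assumes the schedule $m_i = k + \lfloor k/2^i \rfloor$ and the factor $20d + 2\log_2 k + 2$ as read from the statements above. It illustrates the arithmetic only and is not part of the proof.

```python
import math


def phase_schedule(k):
    """Return [m_0, ..., m_u] with m_i = k + floor(k / 2**i), stopping at m_u = k."""
    schedule = []
    i = 0
    while True:
        m = k + k // 2 ** i      # integer floor of k / 2^i
        schedule.append(m)
        if m == k:               # no clusters beyond k remain: last phase reached
            return schedule
        i += 1


def theorem6_factor(d, k):
    """Approximation factor 20d + 2*log2(k) + 2 (assumed reading of Theorem 6)."""
    return 20 * d + 2 * math.log2(k) + 2


print(phase_schedule(5))        # [10, 7, 6, 5]
print(theorem6_factor(2, 8))    # 48.0
```

For $k = 5$ the schedule has $u = 3$ phases, which matches $\log_2 5 < 3 \le \log_2 5 + 1$. Each phase adds at most $2\,\mathrm{opt}_k$ to the cost, so the phases contribute at most $2u \cdot \mathrm{opt}_k$ in total.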

AgglomerativeRadius(X):
    X: finite set of input points from $\mathbb{R}^d$
    1: $\mathcal{C}_{|X|} := \{\{x\} \mid x \in X\}$
    2: for $i = |X| - 1, \ldots, 1$ do
    3:     find distinct clusters $A, B \in \mathcal{C}_{i+1}$ minimizing $\mathrm{rad}(A \cup B)$
    4:     $\mathcal{C}_i := (\mathcal{C}_{i+1} \setminus \{A, B\}) \cup \{A \cup B\}$
    5: end for
    6: return $\mathcal{C}_1, \ldots, \mathcal{C}_{|X|}$
Algorithm 2: The agglomerative algorithm for the $k$-center problem.

3.2 k-center clustering

The agglomerative algorithm for the $k$-center problem is stated as Algorithm 2. The only difference to Algorithm 1 is the minimization of the radius instead of the discrete radius in Step 3. In the following, cost always means radius cost and $\mathrm{opt}_k$ refers to the cost of an optimal $k$-center clustering of $X \subset \mathbb{R}^d$ where $k \in \mathbb{N}$ with $k \le |X|$.

Observation 11 (analogous to Observation 7). The cost of all computed clusterings is equal to the radius of the cluster created last. Furthermore, the radius of the union of any two clusters is always an upper bound for the cost of the clustering to be computed next.

The following theorem states our result for the $k$-center problem.

Theorem 12. Let $X \subset \mathbb{R}^d$ be a finite set of points. Then, for all $k \in \mathbb{N}$ with $k \le |X|$, the partition $\mathcal{C}_k$ of $X$ into $k$ clusters as computed by Algorithm 2 satisfies
$$\mathrm{cost}_{rad}(\mathcal{C}_k) = O(\log k) \cdot \mathrm{opt}_k,$$
where $\mathrm{opt}_k$ denotes the cost of an optimal solution to Problem 2, and the constant hidden in the $O$-notation is singly exponential in the dimension $d$.

Theorem 12 holds for any particular tie-breaking strategy. However, to keep the analysis simple, we assume that there are no ties. That is, we assume that for any input set $X$ the clusterings computed by Algorithm 2 are uniquely determined.

As in the proof of Theorem 6, we first show a bound for the cost of the intermediate $2k$-clustering. However, we have to apply a different analysis. As a consequence, the dependency on the dimension increases from linear and additive to a singly exponential factor.

3.2.1 Analysis of the 2k-clustering

Proposition 13. Let $X \subset \mathbb{R}^d$ be finite.
Then, for all $k \in \mathbb{N}$ with $2k \le |X|$, the partition $\mathcal{C}_{2k}$ of $X$ into $2k$ clusters as computed by Algorithm 2 satisfies
$$\mathrm{cost}_{rad}(\mathcal{C}_{2k}) < 24d \cdot e^{24d} \cdot \mathrm{opt}_k,$$
where $\mathrm{opt}_k$ denotes the cost of an optimal solution to Problem 2.

Just as in the analysis of Algorithm 1, we divide the merge steps of Algorithm 2 into phases, such that in each phase the number of remaining clusters is reduced by one fourth. Like in the discrete case, the input points are $(k, \mathrm{opt}_k)$-coverable. However, centers corresponding to an intermediate solution computed by Algorithm 2 need not be covered by the $k$ balls induced by an optimal solution. As a consequence, we are no longer able to apply Lemma 5 on the centers as in the discrete case.

To bound the increase of the cost during a single phase, we cover the remaining clusters at the beginning of a phase by a set of overlapping balls. Each of the clusters is completely contained in one of these balls, which all have the same radius. Furthermore, the number of remaining clusters will be at least twice the number of these balls. It follows that there are many pairs of clusters that are contained in the same ball. Then, as long as the existence of at least one such pair can be guaranteed, the radius of the cluster created next can be bounded by the radius of the balls. The following lemma will be used to bound the increase of the cost during a single phase.

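[Figure 3: Intermediate centers.]

Lemma 14. Let $m \in \mathbb{N}$ with $2k < m \le |X|$. Then,
$$\mathrm{cost}_{rad}\big(\mathcal{C}_{\lfloor 3m/4 \rfloor}\big) < \left(1 + 6\sqrt[d]{\frac{2k}{m}}\right) \mathrm{cost}_{rad}(\mathcal{C}_m) + 6\sqrt[d]{\frac{2k}{m}} \cdot \mathrm{opt}_k.$$

Proof. Let $\mathcal{P} = \{P_1, \ldots, P_k\}$ be an optimal $k$-clustering of $X$. We fix $y_1, \ldots, y_k \in \mathbb{R}^d$ such that $P_i \subseteq B_{\mathrm{opt}_k}^d(y_i)$ for $i = 1, \ldots, k$. For any $C \in \mathcal{C}_m$ let $z_C \in \mathbb{R}^d$ such that $C \subseteq B_{\mathrm{cost}_{rad}(\mathcal{C}_m)}^d(z_C)$. It follows that each $z_C$ is contained in at least one of the balls $B_{\mathrm{opt}_k + \mathrm{cost}_{rad}(\mathcal{C}_m)}^d(y_i)$ for $i = 1, \ldots, k$ (see Figure 3).

For $\lambda \in \mathbb{R}$ with $\lambda > 0$ a ball of radius $\mathrm{opt}_k + \mathrm{cost}_{rad}(\mathcal{C}_m)$ can be covered by $(3/\lambda)^d$ balls of radius $\lambda(\mathrm{opt}_k + \mathrm{cost}_{rad}(\mathcal{C}_m))$ (see [1]). Choosing $\lambda = 3 \big/ \sqrt[d]{\lfloor m/2k \rfloor}$, we get that each of the balls $B_{\mathrm{opt}_k + \mathrm{cost}_{rad}(\mathcal{C}_m)}^d(y_i)$ for $i = 1, \ldots, k$ can be covered by $l := \lfloor m/2k \rfloor \le m/2k$ balls of radius $\varepsilon = \frac{3}{\sqrt[d]{\lfloor m/2k \rfloor}} (\mathrm{opt}_k + \mathrm{cost}_{rad}(\mathcal{C}_m))$. Therefore, there exist $k \cdot l \le m/2$ balls $B_1, \ldots, B_{kl}$ of radius $\varepsilon$ such that each $z_C$ for $C \in \mathcal{C}_m$ is contained in at least one of these balls. For $i = 1, \ldots, kl$ let $a_i \in \mathbb{R}^d$ such that $B_i = B_\varepsilon^d(a_i)$. Then, any cluster $C \in \mathcal{C}_m$ is contained in at least one of the balls $A_1, \ldots, A_{kl}$ with $A_i = B_{\mathrm{cost}_{rad}(\mathcal{C}_m) + \varepsilon}^d(a_i)$ for $i = 1, \ldots, kl$ (see Figure 4).

[Figure 4: Covering centers and clusters.]

Let $t := \lfloor 3m/4 \rfloor$ and let $\mathcal{C}_m \cap \mathcal{C}_{t+1}$ be the set of clusters from $\mathcal{C}_m$ that still exist $\lceil m/4 \rceil - 1$ merge steps after the computation of $\mathcal{C}_m$. In each iteration of its loop, the algorithm can merge at most two clusters from $\mathcal{C}_m$. Thus, $|\mathcal{C}_m \cap \mathcal{C}_{t+1}| > m/2$. Since $kl \le m/2$, there exist two clusters $C_1, C_2 \in \mathcal{C}_m \cap \mathcal{C}_{t+1}$ that are contained in the same ball $A_i$ with $i \in \{1, \ldots, kl\}$. Therefore, merging clusters $C_1$ and $C_2$ would result in a cluster whose radius can be upper bounded by $\mathrm{rad}(C_1 \cup C_2) \le \mathrm{cost}_{rad}(\mathcal{C}_m) + \varepsilon$. Using Observation 11 and the fact that $C_1$ and $C_2$ are part of the clustering $\mathcal{C}_{t+1}$, we can upper bound the cost of $\mathcal{C}_t$ by $\mathrm{cost}_{rad}(\mathcal{C}_t) \le \mathrm{cost}_{rad}(\mathcal{C}_m) + \varepsilon$.

It remains to show $\varepsilon < 6\sqrt[d]{2k/m}\,\big(\mathrm{opt}_k + \mathrm{cost}_{rad}(\mathcal{C}_m)\big)$. Since $\frac{m}{2k} > 1$, we have $\frac{m}{2k} < 2\lfloor \frac{m}{2k} \rfloor$. Thus,
$$\frac{1}{\sqrt[d]{\lfloor m/2k \rfloor}} < \sqrt[d]{\frac{4k}{m}} \le 2\sqrt[d]{\frac{2k}{m}} \quad\text{and}\quad \frac{3}{\sqrt[d]{\lfloor m/2k \rfloor}} < 6\sqrt[d]{\frac{2k}{m}}.$$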
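The rounding step at the end of this proof can be sanity-checked numerically. The sketch below assumes the choice $l = \lfloor m/2k \rfloor$ and $\lambda = 3/\sqrt[d]{l}$; the function names `covering_parameters` and `rounding_bound` are hypothetical. It only checks the arithmetic of the two inequalities $k \cdot l \le m/2$ and $\lambda < 6\sqrt[d]{2k/m}$, not the covering fact itself.

```python
def covering_parameters(m, k, d):
    """Return (l, lam): l small balls per optimal ball, each of relative radius lam.

    Assumes 2k < m and d >= 1, with l = floor(m / (2k)) and (3 / lam)**d = l.
    """
    if not (m > 2 * k and d >= 1):
        raise ValueError("requires 2k < m and d >= 1")
    l = m // (2 * k)
    lam = 3 / l ** (1 / d)
    return l, lam


def rounding_bound(m, k, d):
    """Right-hand side 6 * (2k / m)**(1/d) that lam is compared against."""
    return 6 * (2 * k / m) ** (1 / d)


l, lam = covering_parameters(m=50, k=4, d=2)
print(l)                                  # 6
print(lam < rounding_bound(50, 4, 2))     # True
```

With $m = 50$ and $k = 4$ there are $k \cdot l = 24 \le 25 = m/2$ covering balls, so more than $m/2$ surviving clusters cannot all lie in different balls.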

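To prove Proposition 13, we apply Lemma 14 for $\lceil \log_{4/3} \frac{|X|}{2k} \rceil$ consecutive phases.

Proof of Proposition 13. Let $u := \lceil \log_{4/3} \frac{|X|}{2k} \rceil$ and define $m_i := \lceil (3/4)^i |X| \rceil$ for all $i = 0, \ldots, u$. Then, $m_u \le 2k$ and $m_i > 2k$ for all $i = 0, \ldots, u-1$. Analogously to the proof of Proposition 8, we get $\lfloor 3m_i/4 \rfloor \le m_{i+1}$ and, using Lemma 14, we deduce
$$\mathrm{cost}_{rad}(\mathcal{C}_{m_{i+1}}) < \left(1 + 6\sqrt[d]{\frac{2k}{m_i}}\right) \mathrm{cost}_{rad}(\mathcal{C}_{m_i}) + 6\sqrt[d]{\frac{2k}{m_i}} \cdot \mathrm{opt}_k$$
for all $i = 0, \ldots, u-1$. By repeatedly applying this inequality and using $\mathrm{cost}_{rad}(\mathcal{C}_{2k}) \le \mathrm{cost}_{rad}(\mathcal{C}_{m_u})$ and $\mathrm{cost}_{rad}(\mathcal{C}_{m_0}) = 0$, we get
$$\mathrm{cost}_{rad}(\mathcal{C}_{2k}) < \sum_{i=0}^{u-1} 6\sqrt[d]{\frac{2k}{m_i}} \prod_{j=i+1}^{u-1} \left(1 + 6\sqrt[d]{\frac{2k}{m_j}}\right) \mathrm{opt}_k$$
$$\le \sum_{i=0}^{u-1} 6\sqrt[d]{\frac{2k}{|X|}\left(\frac{4}{3}\right)^{i}} \prod_{j=i+1}^{u-1} \left(1 + 6\sqrt[d]{\frac{2k}{|X|}\left(\frac{4}{3}\right)^{j}}\right) \mathrm{opt}_k \qquad (3)$$
$$= \sum_{i=0}^{u-1} 6\sqrt[d]{\frac{2k}{|X|}\left(\frac{4}{3}\right)^{u-1-i}} \prod_{j=u-i}^{u-1} \left(1 + 6\sqrt[d]{\frac{2k}{|X|}\left(\frac{4}{3}\right)^{j}}\right) \mathrm{opt}_k. \qquad (4)$$
Here, we obtain (3) using $m_i \ge (3/4)^i |X|$ and we obtain (4) by substituting $u - 1 - i$ for $i$. Using $u - 1 < \log_{4/3} \frac{|X|}{2k}$, we deduce
$$\mathrm{cost}_{rad}(\mathcal{C}_{2k}) < \sum_{i=0}^{u-1} 6\left(\sqrt[d]{\frac{3}{4}}\right)^{i} \prod_{j=0}^{i-1} \left(1 + 6\left(\sqrt[d]{\frac{3}{4}}\right)^{j}\right) \mathrm{opt}_k. \qquad (5)$$
By taking only the first two terms of the series expansion of the exponential function, we get $1 + 6\big(\sqrt[d]{3/4}\big)^{j} < e^{6(\sqrt[d]{3/4})^{j}}$ and therefore
$$\prod_{j=0}^{i-1} \left(1 + 6\left(\sqrt[d]{\frac{3}{4}}\right)^{j}\right) < \prod_{j=0}^{i-1} e^{6(\sqrt[d]{3/4})^{j}} = e^{6 \sum_{j=0}^{i-1} (\sqrt[d]{3/4})^{j}}. \qquad (6)$$
The sum in the exponent can be bounded by the infinite geometric series
$$\sum_{j=0}^{i-1} \left(\sqrt[d]{\frac{3}{4}}\right)^{j} < \sum_{j=0}^{\infty} \left(\sqrt[d]{\frac{3}{4}}\right)^{j} = \frac{1}{1 - \sqrt[d]{3/4}} \le 4d, \qquad (7)$$
where the last inequality follows by upper bounding the convex function $f(x) = (3/4)^x$ in the interval $[0,1]$ by the line through $f(0)$ and $f(1)$. Putting Inequalities (5), (6) and (7) together then gives
$$\mathrm{cost}_{rad}(\mathcal{C}_{2k}) < \sum_{i=0}^{u-1} 6\left(\sqrt[d]{\frac{3}{4}}\right)^{i} \cdot e^{24d} \cdot \mathrm{opt}_k < 24d \cdot e^{24d} \cdot \mathrm{opt}_k,$$
where the last inequality follows by using Inequality (7) again.

3.2.2 Connected instances

The analysis of the remaining merge steps from the discrete $k$-center case (cf. Section 3.1.2) is not transferable to the $k$-center case. Again, as in the proof of Proposition 13, we are no longer able to derive a simple additive bound on the increase of the cost when merging two clusters. In order to preserve the logarithmic dependency of the approximation factor on $k$, we show that it is sufficient to analyze Algorithm 2 on a subset $Y \subseteq X$ satisfying a certain connectivity property.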
Using this property, we are able to apply a combinatorial approach that relies on the number of merge steps left. We start by defining the connectivity property that will be used to relate clusters to an optimal $k$-clustering.

Definition 15. Let $Z \subset \mathbb{R}^d$ and $r \in \mathbb{R}$. Two sets $A, B \subset \mathbb{R}^d$ are called $(Z,r)$-connected if there exists a $z \in Z$ with $B_r^d(z) \cap A \ne \emptyset$ and $B_r^d(z) \cap B \ne \emptyset$.

Note that for any two $(Z,r)$-connected clusters $A, B$, we have
$$\mathrm{rad}(A \cup B) \le \mathrm{rad}(A) + \mathrm{rad}(B) + r. \qquad (8)$$

Next, we show that for any input set $X$ we can bound the cost of the $k$-clustering computed by Algorithm 2 by the cost of the $l$-clustering computed by the algorithm on a connected subset $Y \subseteq X$ for a proper $l \le k$. Recall that by our convention from the beginning of Section 3.2, the clusterings computed by Algorithm 2 on a particular input set are uniquely determined.

Lemma 16. Let $X \subset \mathbb{R}^d$ be finite and $k \in \mathbb{N}$ with $k \le |X|$. Then, there exists a subset $Y \subseteq X$, a number $l \in \mathbb{N}$ with $l \le k$ and $l \le |Y|$, and a set $Z \subset \mathbb{R}^d$ with $|Z| = l$ such that:
1. $Y$ is $(l, \mathrm{opt}_k)$-coverable;
2. $\mathrm{cost}_{rad}(\mathcal{C}_k) \le \mathrm{cost}_{rad}(\mathcal{P}_l)$;
3. For all $n \in \mathbb{N}$ with $l + 1 \le n \le |Y|$, every cluster in $\mathcal{P}_n$ is $(Z, \mathrm{opt}_k)$-connected to another cluster in $\mathcal{P}_n$.
Here, the collection $\mathcal{P}_1, \ldots, \mathcal{P}_{|Y|}$ denotes the hierarchical clustering computed by Algorithm 2 on input $Y$.

Proof. To define $Y$, $Z$, and $l$ we consider the $(k+1)$-clustering computed by Algorithm 2 on input $X$. We know that $X = \bigcup_{A \in \mathcal{C}_{k+1}} A$ is $(k, \mathrm{opt}_k)$-coverable. Let $E \subseteq \mathcal{C}_{k+1}$ be a minimal subset such that $\bigcup_{A \in E} A$ is $(|E| - 1, \mathrm{opt}_k)$-coverable, i.e., for all sets $F \subseteq \mathcal{C}_{k+1}$ with $|F| < |E|$ the union $\bigcup_{A \in F} A$ is not $(|F| - 1, \mathrm{opt}_k)$-coverable. Since a set $F$ of size 1 cannot be $(|F| - 1, \mathrm{opt}_k)$-coverable, we get $|E| \ge 2$. Let $Y := \bigcup_{A \in E} A$ and $l := |E| - 1$. Then, $l \le k$ and $Y$ is $(l, \mathrm{opt}_k)$-coverable. This establishes property 1.

It follows that there exists a set $Z \subset \mathbb{R}^d$ with $|Z| = l$ and $Y \subseteq \bigcup_{z \in Z} B_{\mathrm{opt}_k}^d(z)$. Furthermore, we let $\mathcal{P}_1, \ldots, \mathcal{P}_{|Y|}$ be the hierarchical clustering computed by Algorithm 2 on input $Y$.

Since $Y$ is the union of the clusters from $E \subseteq \mathcal{C}_{k+1}$, each merge step between the computation of $\mathcal{C}_{|X|}$ and $\mathcal{C}_{k+1}$ merges either two clusters $A, B \subseteq Y$ or two clusters $A, B \subseteq X \setminus Y$. The merge steps inside $X \setminus Y$ have no influence on the clusters inside $Y$. Furthermore, the merge steps inside $Y$ would be the same in the absence of the clusters inside $X \setminus Y$.
Therefore, on input $Y$, Algorithm 2 computes the $(l+1)$-clustering $\mathcal{P}_{l+1} = E = \mathcal{C}_{k+1} \cap 2^{Y}$. Thus, $\mathcal{P}_{l+1} \subseteq \mathcal{C}_{k+1}$. To compute $\mathcal{P}_l$, on input $Y$, Algorithm 2 merges two clusters from $\mathcal{P}_{l+1}$ that minimize the radius of the resulting cluster. Analogously, on input $X$, Algorithm 2 merges two clusters from $\mathcal{C}_{k+1}$ to compute $\mathcal{C}_k$. Since $\mathcal{P}_{l+1} \subseteq \mathcal{C}_{k+1}$, Observation 11 implies $\mathrm{cost}_{\mathrm{rad}}(\mathcal{C}_k) \le \mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_l)$, thus proving property 2.

It remains to show that for all $n \in \mathbb{N}$ with $l+1 \le n \le |Y|$ it holds that every cluster in $\mathcal{P}_n$ is $(Z,\mathrm{opt}_k)$-connected to another cluster in $\mathcal{P}_n$ (property 3). By the definition of $Z$, every cluster in $\mathcal{P}_n$ intersects at least one ball $B_{\mathrm{opt}_k}(z)$ for $z \in Z$. Therefore, it is enough to show that each ball $B_{\mathrm{opt}_k}(z)$ intersects at least two clusters from $\mathcal{P}_n$. We first show this property for $n = l+1$. For $l = 1$ this follows from the fact that $B_{\mathrm{opt}_k}(z)$ with $Z = \{z\}$ has to contain both clusters from $\mathcal{P}_2$. For $l > 1$, we are otherwise able to remove one cluster from $\mathcal{P}_{l+1}$ and get $l$ clusters whose union is $(l-1,\mathrm{opt}_k)$-coverable. This contradicts the definition of $E = \mathcal{P}_{l+1}$ as a minimal subset with this property.

To show property 3 for general $n$, let $C_1 \in \mathcal{P}_n$ and $z \in Z$ with $B_{\mathrm{opt}_k}(z) \cap C_1 \neq \emptyset$. There exists a unique cluster $\tilde{C}_1 \in \mathcal{P}_{l+1}$ with $C_1 \subseteq \tilde{C}_1$. Then, we have $B_{\mathrm{opt}_k}(z) \cap \tilde{C}_1 \neq \emptyset$. However, we have just shown that $B_{\mathrm{opt}_k}(z)$ has to intersect at least two clusters from $\mathcal{P}_{l+1}$. Thus, there exists another cluster $\tilde{C}_2 \in \mathcal{P}_{l+1}$ with $B_{\mathrm{opt}_k}(z) \cap \tilde{C}_2 \neq \emptyset$. Since every cluster from $\mathcal{P}_{l+1}$ is a union of clusters from $\mathcal{P}_n$, there exists at least one cluster $C_2 \in \mathcal{P}_n$ with $C_2 \subseteq \tilde{C}_2$ and $B_{\mathrm{opt}_k}(z) \cap C_2 \neq \emptyset$.

3.2.2 Analysis of the remaining merge steps

Let $Y$, $Z$, $l$, and $\mathcal{P}_1, \dots, \mathcal{P}_{|Y|}$ be as given by Lemma 16. Then, Proposition 13 can be used to obtain an upper bound for the cost of $\mathcal{P}_{2l}$. In the following, we analyze the merge steps leading from $\mathcal{P}_{2l}$ to $\mathcal{P}_{l+1}$ and show how to obtain an upper bound for the cost of $\mathcal{P}_{l+1}$. As in Section 3.2.1, we analyze the merge steps in phases. The following lemma is used to bound the increase of the cost during a single phase.
Note that $\mathrm{opt}_k$ still refers to the cost of an optimal solution on input $X$, not $Y$.
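The connectivity notion of Definition 15 is easy to test on concrete point sets. The following Python sketch is only an illustration: it assumes the Euclidean norm, closed balls $B_r(z)$, and clusters given as lists of coordinate tuples. The function names `intersects_ball` and `is_connected` are hypothetical and do not appear in the paper.

```python
from math import dist


def intersects_ball(cluster, center, r):
    """True if some point of the cluster lies in the closed ball B_r(center).

    Assumption of this sketch: Euclidean norm and closed balls.
    """
    return any(dist(p, center) <= r for p in cluster)


def is_connected(a, b, centers, r):
    """(Z, r)-connectivity: some z in `centers` has B_r(z) meeting both a and b."""
    return any(
        intersects_ball(a, z, r) and intersects_ball(b, z, r) for z in centers
    )


# Two clusters on a line and a single candidate center between them.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (4.0, 0.0)]
Z = [(2.0, 0.0)]

print(is_connected(A, B, Z, 1.0))  # the ball of radius 1 around (2, 0) touches both
print(is_connected(A, B, Z, 0.5))  # the ball of radius 0.5 touches neither
```

The check is quadratic in the worst case, which is enough for experimenting with small examples of the covering argument.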

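Figure 5: Merging $(Z,\mathrm{opt}_k)$-connected clusters.

Lemma 17. Let $m, n \in \mathbb{N}$ with $n \le 2l$ and $l < m \le n \le |Y|$. If there are no two $(Z,\mathrm{opt}_k)$-connected clusters in $\mathcal{P}_m \cap \mathcal{P}_n$, it holds

$$\mathrm{cost}_{\mathrm{rad}}\bigl(\mathcal{P}_{\lfloor (m+l)/2 \rfloor}\bigr) \le \mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_m) + 2\,\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_n) + 2\,\mathrm{opt}_k.$$

Proof. We show that there exist at least $m-l$ disjoint pairs of clusters from $\mathcal{P}_m$ such that the radius of their union can be upper bounded by $\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_m) + 2\,\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_n) + 2\,\mathrm{opt}_k$. By Observation 11, this upper bounds the cost of the computed clusterings as long as such a pair of clusters remains. Then, the lemma follows from the fact that in each iteration of its loop the algorithm can destroy at most two of these pairs.

To bound the number of these pairs of clusters, we start with a structural observation. $\mathcal{P}_m \cap \mathcal{P}_n$ is the set of clusters from $\mathcal{P}_n$ that still exist in $\mathcal{P}_m$. By our definition of $Y$, $Z$, and $l$, we conclude that any cluster $A \in \mathcal{P}_m \cap \mathcal{P}_n$ is $(Z,\mathrm{opt}_k)$-connected to another cluster $B \in \mathcal{P}_m$. If we assume that there are no two $(Z,\mathrm{opt}_k)$-connected clusters in $\mathcal{P}_m \cap \mathcal{P}_n$, this implies $B \in \mathcal{P}_m \setminus \mathcal{P}_n$ (see Figure 5). Thus, using $A \in \mathcal{P}_n$, $B \in \mathcal{P}_m$, and Inequality (8), the radius of $A \cup B$ can be bounded by

$$\mathrm{rad}(A \cup B) \le \mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_m) + 2\,\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_n) + 2\,\mathrm{opt}_k. \tag{9}$$

Moreover, using a similar argument, we derive the same bound for two clusters $A_1, A_2 \in \mathcal{P}_m \cap \mathcal{P}_n$ that are $(Z,\mathrm{opt}_k)$-connected to the same cluster $B \in \mathcal{P}_m \setminus \mathcal{P}_n$. That is,

$$\mathrm{rad}(A_1 \cup A_2) \le \mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_m) + 2\,\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_n) + 2\,\mathrm{opt}_k. \tag{10}$$

Next, we show that there exist at least $\lceil |\mathcal{P}_m \cap \mathcal{P}_n| / 2 \rceil$ disjoint pairs of clusters from $\mathcal{P}_m$ such that the radius of their union can be bounded either by Inequality (9) or by Inequality (10). To do so, we first consider the pairs of clusters from $\mathcal{P}_m \cap \mathcal{P}_n$ that are $(Z,\mathrm{opt}_k)$-connected to the same cluster from $\mathcal{P}_m \setminus \mathcal{P}_n$ until no candidates are left. For these pairs, we can bound the radius of their union by Inequality (10). Then, each cluster from $\mathcal{P}_m \setminus \mathcal{P}_n$ is $(Z,\mathrm{opt}_k)$-connected to at most one of the remaining clusters from $\mathcal{P}_m \cap \mathcal{P}_n$.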
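Thus, each remaining cluster $A \in \mathcal{P}_m \cap \mathcal{P}_n$ can be paired with a different cluster $B \in \mathcal{P}_m \setminus \mathcal{P}_n$ such that $A$ and $B$ are $(Z,\mathrm{opt}_k)$-connected. For these pairs, we can bound the radius of their union by Inequality (9). Since for all pairs either one or both of the clusters come from the set $\mathcal{P}_m \cap \mathcal{P}_n$, we can lower bound the number of pairs by $\lceil |\mathcal{P}_m \cap \mathcal{P}_n| / 2 \rceil$.

To complete the proof, we show that $\lceil |\mathcal{P}_m \cap \mathcal{P}_n| / 2 \rceil \ge m - l$. In each iteration of its loop, the algorithm can merge at most two clusters from $\mathcal{P}_n$. Therefore, there are at least $\lceil (n - |\mathcal{P}_m \cap \mathcal{P}_n|)/2 \rceil$ merge steps between the computations of $\mathcal{P}_n$ and $\mathcal{P}_m$. Hence,

$$m \le n - \left\lceil \frac{n - |\mathcal{P}_m \cap \mathcal{P}_n|}{2} \right\rceil \le \frac{n + |\mathcal{P}_m \cap \mathcal{P}_n|}{2}.$$

Using $n \le 2l$, we get $m - l \le |\mathcal{P}_m \cap \mathcal{P}_n| / 2 \le \lceil |\mathcal{P}_m \cap \mathcal{P}_n| / 2 \rceil$.

Lemma 18. Let $n \in \mathbb{N}$ with $n \le 2l$ and $l < n \le |Y|$. Then,

$$\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_{l+1}) < 3\,(\log_2(l) + 2)\,\bigl(\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_n) + \mathrm{opt}_k\bigr).$$

Proof. For $n = l+1$ there is nothing to show. Hence, assume $n > l+1$. Then, by definition of $Z$, there exist two $(Z,\mathrm{opt}_k)$-connected clusters in $\mathcal{P}_n$. Now let $\tilde{n} \in \mathbb{N}$ with $\tilde{n} < n$ be maximal such that no two $(Z,\mathrm{opt}_k)$-connected clusters exist in $\mathcal{P}_{\tilde{n}} \cap \mathcal{P}_n$. The number $\tilde{n}$ is well-defined since $|\mathcal{P}_1| = 1$ implies $\tilde{n} \ge 1$. It follows that the same holds for all $m \in \mathbb{N}$ with $m \le \tilde{n}$. We conclude that Lemma 17 is applicable for all $m \in \mathbb{N}$ with $l < m \le \tilde{n}$.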

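By the definition of $\tilde{n}$ there still exist at least two $(Z,\mathrm{opt}_k)$-connected clusters in $\mathcal{P}_{\tilde{n}+1} \cap \mathcal{P}_n$. Then, Observation 11 and Inequality (8) imply

$$\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_{\tilde{n}}) \le 3\,\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_n) + 2\,\mathrm{opt}_k. \tag{11}$$

If $\tilde{n} \le l+1$ then Inequality (11) proves the lemma. For $\tilde{n} > l+1$ let $u := \lceil \log_2(\tilde{n} - l) \rceil$ and define

$$m_i := \left\lceil \left(\tfrac{1}{2}\right)^{i} (\tilde{n} - l) \right\rceil + l > l$$

for all $i = 0, \dots, u$. Then, $m_0 = \tilde{n}$ and $m_u = l+1$. Furthermore, we obtain

$$\left\lfloor \frac{m_i + l}{2} \right\rfloor = \left\lfloor \tfrac{1}{2} \left( \left\lceil \left(\tfrac{1}{2}\right)^{i} (\tilde{n} - l) \right\rceil + l + l \right) \right\rfloor \le \left\lfloor \left(\tfrac{1}{2}\right)^{i+1} (\tilde{n} - l) + l + \tfrac{1}{2} \right\rfloor \le \left\lceil \left(\tfrac{1}{2}\right)^{i+1} (\tilde{n} - l) \right\rceil + l = m_{i+1}.$$

Since Algorithm 2 uses a greedy strategy, we deduce $\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_{m_{i+1}}) \le \mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_{\lfloor (m_i + l)/2 \rfloor})$ for all $i = 0, \dots, u-1$. Combining this with Lemma 17 (applied to $m = m_i$), we obtain

$$\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_{m_{i+1}}) \le \mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_{m_i}) + 2\,\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_n) + 2\,\mathrm{opt}_k.$$

By repeatedly applying this inequality for $i = 0, \dots, u-1$ and summing up the costs, we get

$$\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_{m_u}) \le \mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_{\tilde{n}}) + 2u\,\bigl(\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_n) + \mathrm{opt}_k\bigr) \overset{(11)}{<} 3\,(u+1)\,\bigl(\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_n) + \mathrm{opt}_k\bigr).$$

Since $\tilde{n} < 2l$, we get $u < \log_2(l) + 1$ and the lemma follows using $m_u = l+1$.

The following lemma finishes the analysis except for the last merge step.

Lemma 19. Let $Y \subset \mathbb{R}^d$ be finite and $l \le |Y|$ such that $Y$ is $(l,\mathrm{opt}_k)$-coverable. Furthermore, let $Z \subset \mathbb{R}^d$ with $|Z| = l$ such that for all $n \in \mathbb{N}$ with $l+1 \le n \le |Y|$ every cluster in $\mathcal{P}_n$ is $(Z,\mathrm{opt}_k)$-connected to another cluster in $\mathcal{P}_n$, where $\mathcal{P}_1, \dots, \mathcal{P}_{|Y|}$ denotes the hierarchical clustering computed by Algorithm 2 on input $Y$. Then,

$$\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_{l+1}) < 3\,(\log_2(l) + 2)\,\bigl(24d \cdot e^{24d} + 1\bigr)\,\mathrm{opt}_k.$$

Proof. Let $n := \min(|Y|, 2l)$. Then, using Proposition 13, we get $\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_n) < 24d \cdot e^{24d} \cdot \mathrm{opt}_k$. The lemma follows by using this bound in combination with Lemma 18.

3.2.3 Proof of Theorem 12

Using Lemma 16, we know that there is a subset $Y \subseteq X$, a number $l \le k$, and a hierarchical clustering $\mathcal{P}_1, \dots, \mathcal{P}_{|Y|}$ of $Y$ with $\mathrm{cost}_{\mathrm{rad}}(\mathcal{C}_k) \le \mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_l)$. Furthermore, there is a set $Z \subset \mathbb{R}^d$ such that every cluster from $\mathcal{P}_{l+1}$ is $(Z,\mathrm{opt}_k)$-connected to another cluster in $\mathcal{P}_{l+1}$. Thus, $\mathcal{P}_{l+1}$ contains two clusters $A, B$ that intersect with the same ball of radius $\mathrm{opt}_k$. Hence

$$\mathrm{cost}_{\mathrm{rad}}(\mathcal{C}_k) \le \mathrm{rad}(A \cup B) \le 3\,\mathrm{cost}_{\mathrm{rad}}(\mathcal{P}_{l+1}) + 2\,\mathrm{opt}_k.$$

The theorem follows using Lemma 19 and $l \le k$.

3.3 Diameter k-clustering

In this section, we analyze the agglomerative complete linkage clustering algorithm for Problem 3 stated as Algorithm 3. Again, the only difference to Algorithm 1 and 2 is the minimization of the diameter in Step 3. As in the analysis of Algorithm 2, we may assume that for any input set $X$ the clusterings computed by Algorithm 3 are uniquely determined, i.e. the minimum in Step 3 is always unambiguous. Note that in this section cost always means diameter cost and $\mathrm{opt}_k$ refers to the cost of an optimal diameter $k$-clustering of $X \subset \mathbb{R}^d$ where $k \in \mathbb{N}$ with $k \le |X|$. Analogously to the (discrete) radius case, any cluster $C$ is contained in a ball of radius $\mathrm{diam}(C)$ and thus the set $X$ is $(k,\mathrm{opt}_k)$-coverable.

Observation 20 (analogous to Observation 7 and 11). The cost of all computed clusterings is equal to the diameter of the cluster created last. Furthermore, the diameter of the union of any two clusters is always an upper bound for the cost of the clustering to be computed next.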
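As an illustration of the procedure analyzed in this section, the following Python sketch merges, in each step, the two clusters whose union has the smallest diameter and records every intermediate clustering. It is a naive implementation meant for small inputs only. It assumes the Euclidean distance and pairwise distinct input points given as tuples; the names `diam`, `cost_diam`, and `complete_linkage` are chosen for this sketch and are not taken from the paper.

```python
from itertools import combinations
from math import dist


def diam(cluster):
    """Diameter of a cluster: largest pairwise distance (0 for a singleton)."""
    return max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def cost_diam(clustering):
    """Diameter cost of a clustering: the largest cluster diameter."""
    return max(diam(c) for c in clustering)


def complete_linkage(points):
    """Return a dict mapping i to the computed clustering with i clusters.

    Assumption of this sketch: points are distinct tuples of floats.
    The cluster created last is always stored at the end of the list.
    """
    clustering = [frozenset([p]) for p in points]
    hierarchy = {len(clustering): list(clustering)}
    while len(clustering) > 1:
        # Greedy step: pick the pair whose union has minimum diameter.
        a, b = min(
            combinations(clustering, 2),
            key=lambda pair: diam(pair[0] | pair[1]),
        )
        clustering = [c for c in clustering if c is not a and c is not b]
        clustering.append(a | b)
        hierarchy[len(clustering)] = list(clustering)
    return hierarchy


points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
hierarchy = complete_linkage(points)
for i in sorted(hierarchy, reverse=True):
    print(i, cost_diam(hierarchy[i]))
```

On these five points on a line, the costs printed for decreasing numbers of clusters never decrease, and in every step the cost coincides with the diameter of the cluster created last, in line with the observation above.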

AgglomerativeCompleteLinkage($X$):
  $X$: finite set of input points from $\mathbb{R}^d$
  1: $\mathcal{C}_{|X|} := \{\{x\} \mid x \in X\}$
  2: for $i = |X|-1, \dots, 1$ do
  3:   find distinct clusters $A, B \in \mathcal{C}_{i+1}$ minimizing $\mathrm{diam}(A \cup B)$
  4:   $\mathcal{C}_i := (\mathcal{C}_{i+1} \setminus \{A, B\}) \cup \{A \cup B\}$
  5: end for
  6: return $\mathcal{C}_1, \dots, \mathcal{C}_{|X|}$

Algorithm 3: The agglomerative complete linkage clustering algorithm.

The following theorem states our main result.

Theorem 21. Let $X \subset \mathbb{R}^d$ be a finite set of points. Then, for all $k \in \mathbb{N}$ with $k \le |X|$, the partition $\mathcal{C}_k$ of $X$ into $k$ clusters as computed by Algorithm 3 satisfies

$$\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_k) = O(\log k) \cdot \mathrm{opt}_k,$$

where $\mathrm{opt}_k$ denotes the cost of an optimal solution to Problem 3, and the constant hidden in the $O$-notation is doubly exponential in the dimension $d$.

As in the proof of Theorem 6 and 12, we first show a bound for the cost of the intermediate $2k$-clustering. However, we have to apply a different analysis again. This time, the new analysis results in a bound that depends doubly exponentially on the dimension.

3.3.1 Analysis of the 2k-clustering

Proposition 22. Let $X \subset \mathbb{R}^d$ be finite. Then, for all $k \in \mathbb{N}$ with $2k \le |X|$, the partition $\mathcal{C}_{2k}$ of $X$ into $2k$ clusters as computed by Algorithm 3 satisfies

$$\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{2k}) < 2^{3\sigma}\,(28d + 6) \cdot \mathrm{opt}_k,$$

where $\sigma = (42d)^d$ and $\mathrm{opt}_k$ denotes the cost of an optimal solution to Problem 3.

In our analysis of the $k$-center problem, we made use of the fact that merging two clusters lying inside a ball of some radius $r$ results in a new cluster of radius at most $r$. This is no longer true for the diameter $k$-clustering problem. We are not able to derive a bound for the diameter of the new cluster that is significantly less than $2r$. The additional factor of 2 makes our analysis from Section 3.2.1 useless for the diameter case.

To prove Proposition 22, we divide the merge steps of Algorithm 3 into two stages. The first stage consists of the merge steps down to a $2^{2^{O(d \log d)}} k$-clustering. The analysis of the first stage is based on the following notion of similarity. Two clusters are called similar if one cluster can be translated such that every point of the translated cluster is near a point of the second cluster.
Then, by merging similar clusters, the diameter essentially increases by the length of the translation vector. During the first stage, we guarantee that there is a sufficiently large number of similar clusters left. The cost of the intermediate $2^{2^{O(d \log d)}} k$-clustering can be upper bounded by $O(d) \cdot \mathrm{opt}_k$.

The second stage consists of the steps reducing the number of remaining clusters from $2^{2^{O(d \log d)}} k$ to only $2k$. In this stage, we are no longer able to guarantee that a sufficiently large number of similar clusters exists. Therefore, we analyze the merge steps of the second stage using a weaker argument, very similar to the one used in the second step of the analysis in the discrete $k$-center case (cf. Section 3.1.2). As long as there are more than $2k$ clusters left, we are able to find sufficiently many pairs of clusters that intersect with the same cluster of an optimal $k$-clustering. Therefore, we can bound the cost of merging such a pair by the sum of the diameters of the two clusters plus the diameter of the optimal cluster. We find that the cost of the intermediate $2k$-clustering is upper bounded by $2^{2^{O(d \log d)}} \cdot \mathrm{opt}_k$. Let us remark that we do not obtain our main result if we already use this argument for the first stage.

Both stages are again subdivided into phases, such that in each phase the number of remaining clusters is reduced by one fourth.
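The phase structure can be made concrete with a few lines of Python. The sketch below assumes that the phases are indexed by $m_i = \lceil (3/4)^i n \rceil$, starting from $n$ clusters and stopping at the first index $u$ with $m_u \le k$ for a target number $k$; this indexing and the function name `quarter_schedule` are assumptions of the illustration. Exact integer arithmetic is used to avoid rounding issues.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers, without floating point."""
    return -(-a // b)


def quarter_schedule(n, k):
    """Cluster counts m_0, ..., m_u with m_i = ceil((3/4)**i * n).

    The list ends at the first index u with m_u <= k, so every phase
    removes roughly one fourth of the remaining clusters.
    """
    schedule = []
    i = 0
    while True:
        m = ceil_div(3**i * n, 4**i)
        schedule.append(m)
        if m <= k:
            return schedule
        i += 1


schedule = quarter_schedule(1000, 7)
print(len(schedule) - 1)   # number of phases u
print(schedule[:4])        # first few cluster counts
```

For consecutive entries one always has $\lfloor 3 m_i / 4 \rfloor \le m_{i+1}$, so a bound for the clustering with $\lfloor 3 m_i / 4 \rfloor$ clusters carries over to the one with $m_{i+1}$ clusters by the greedy nature of the algorithm.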

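Figure 6: Congruent configurations.

3.3.2 Stage one

The following lemma will be used to bound the increase of the cost during a single phase.

Lemma 23. Let $\lambda \in \mathbb{R}$ with $0 < \lambda < 1$ and $\rho := \lceil (3/\lambda)^d \rceil$. Furthermore, let $m \in \mathbb{N}$ with $2^{\rho+1} k < m \le |X|$. Then,

$$\mathrm{cost}_{\mathrm{diam}}\bigl(\mathcal{C}_{\lfloor 3m/4 \rfloor}\bigr) < (1 + 2\lambda)\,\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_m) + 4 \sqrt[d]{\frac{2^{\rho+1} k}{m}}\;\mathrm{opt}_k. \tag{12}$$

Proof. From every cluster $C \in \mathcal{C}_m$, we fix an arbitrary point and denote it by $p_C$. Let $R := \mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_m)$. Then, the distance from $p_C$ to any $q \in C$ is at most $R$ and we get $C - p_C \subset B_R(0)$. A ball of radius $R$ can be covered by $\rho$ balls of radius $\lambda R$ (see [1]). Hence, there exist $y_1, \dots, y_\rho \in \mathbb{R}^d$ with $B_R(0) \subseteq \bigcup_{i=1}^{\rho} B_{\lambda R}(y_i)$. For $C \in \mathcal{C}_m$, we call the set

$$\mathrm{Conf}(C) := \{\, y_i \mid 1 \le i \le \rho \text{ and } B_{\lambda R}(y_i) \cap (C - p_C) \neq \emptyset \,\}$$

the configuration of $C$. That is, we identify each cluster $C \in \mathcal{C}_m$ with the subset of the balls $B_{\lambda R}(y_1), \dots, B_{\lambda R}(y_\rho)$ that intersect with $C - p_C$. Note that no cluster $C \in \mathcal{C}_m$ has an empty configuration. The number of possible configurations is upper bounded by $2^\rho$.

Let $t := \lfloor 3m/4 \rfloor$ and let $\mathcal{C}_m \cap \mathcal{C}_{t+1}$ be the set of clusters from $\mathcal{C}_m$ that still exist $\lceil m/4 \rceil - 1$ merge steps after the computation of $\mathcal{C}_m$. In each iteration of its loop, the algorithm can merge at most two clusters from $\mathcal{C}_m$. Thus, $|\mathcal{C}_m \cap \mathcal{C}_{t+1}| > m/2$. It follows that there exist $j > m / 2^{\rho+1}$ distinct clusters $C_1, \dots, C_j \in \mathcal{C}_m \cap \mathcal{C}_{t+1}$ with the same configuration. Using $m > 2^{\rho+1} k$, we deduce $j > k$.

Let $P := \{p_{C_1}, \dots, p_{C_j}\}$. Since $X$ is $(k,\mathrm{opt}_k)$-coverable, so is $P \subseteq X$. Therefore, by Lemma 5, there exist distinct $a, b \in \{1, \dots, j\}$ such that

$$\|p_{C_a} - p_{C_b}\| < 4 \sqrt[d]{\frac{2^{\rho+1} k}{m}}\;\mathrm{opt}_k.$$

Next, we want to bound the diameter of the union of the corresponding clusters $C_a$ and $C_b$. The distance between any two points $u, v \in C_a$ or $u, v \in C_b$ is at most the cost of $\mathcal{C}_m$. Now let $u \in C_a$ and $v \in C_b$. Using the triangle inequality, for any $w \in \mathbb{R}^d$, we obtain

$$\|u - v\| \le \|p_{C_a} - p_{C_b}\| + \|u + p_{C_b} - p_{C_a} - w\| + \|w - v\|$$

(see Figure 6). For $\|p_{C_a} - p_{C_b}\|$, we just derived an upper bound. To bound $\|u + p_{C_b} - p_{C_a} - w\|$, we let $y \in \mathrm{Conf}(C_a) = \mathrm{Conf}(C_b)$ such that $u - p_{C_a} \in B_{\lambda R}(y)$. Furthermore, we fix $w \in C_b$ with $w - p_{C_b} \in B_{\lambda R}(y)$.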
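Hence, $\|u + p_{C_b} - p_{C_a} - w\| = \|u - p_{C_a} - (w - p_{C_b})\|$ can be upper bounded by $2\lambda R = 2\lambda\,\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_m)$. For $w \in C_b$ the distance $\|w - v\|$ is bounded by $\mathrm{diam}(C_b) \le \mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_m)$. We conclude that merging clusters $C_a$ and $C_b$ results in a cluster whose diameter can be upper bounded by

$$\mathrm{diam}(C_a \cup C_b) < (1 + 2\lambda)\,\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_m) + 4 \sqrt[d]{\frac{2^{\rho+1} k}{m}}\;\mathrm{opt}_k.$$

Using Observation 20 and the fact that $C_a$ and $C_b$ are part of the clustering $\mathcal{C}_{t+1}$, we can upper bound the cost of $\mathcal{C}_t$ by $\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_t) \le \mathrm{diam}(C_a \cup C_b)$.

Note that the parameter $\lambda$ from Lemma 23 establishes a trade-off between the two terms on the right-hand side of Inequality (12). To complete the analysis of the first stage, we have to carefully choose $\lambda$. In the proof of the following lemma, we use $\lambda = \ln(4/3)/(4d)$ and apply Lemma 23 for $\lceil \log_{4/3}(|X| / (2^{\sigma+1} k)) \rceil$ consecutive phases, where $\sigma = (42d)^d$. Then, we are able to upper bound the total increase of the cost by a term that is linear in $d$ and $r$ and independent of $|X|$ and $k$. The number of remaining clusters is independent of the number of input points $|X|$ and only depends on the dimension $d$ and the desired number of clusters $k$.

Lemma 24. Let $2^{\sigma+1} k < |X|$ for $\sigma = (42d)^d$. Then, on input $X$, Algorithm 3 computes a clustering $\mathcal{C}_{2^{\sigma+1} k}$ with

$$\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{2^{\sigma+1} k}) < (28d + 4) \cdot \mathrm{opt}_k.$$

Proof. Let $u := \lceil \log_{4/3}(|X| / (2^{\sigma+1} k)) \rceil$ and define $m_i := \lceil (3/4)^i |X| \rceil$ for all $i = 0, \dots, u$. Furthermore, let $\lambda = \ln(4/3)/(4d)$. This implies $\rho \le \sigma$ for the parameter $\rho$ of Lemma 23. Then, $m_u \le 2^{\sigma+1} k$ and $m_i > 2^{\sigma+1} k \ge 2^{\rho+1} k$ for all $i = 0, \dots, u-1$. Since

$$\left\lfloor \frac{3 m_i}{4} \right\rfloor = \left\lfloor \tfrac{3}{4} \left\lceil \left(\tfrac{3}{4}\right)^{i} |X| \right\rceil \right\rfloor \le \left\lfloor \left(\tfrac{3}{4}\right)^{i+1} |X| + \tfrac{3}{4} \right\rfloor \le \left\lceil \left(\tfrac{3}{4}\right)^{i+1} |X| \right\rceil = m_{i+1}$$

and Algorithm 3 uses a greedy strategy, we deduce $\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{m_{i+1}}) \le \mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{\lfloor 3 m_i / 4 \rfloor})$ for all $i = 0, \dots, u-1$. Combining this with Lemma 23 (applied to $m = m_i$), we obtain

$$\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{m_{i+1}}) < (1 + 2\lambda)\,\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{m_i}) + 4 \sqrt[d]{\frac{2^{\rho+1} k}{m_i}}\;\mathrm{opt}_k.$$

By repeatedly applying this inequality for $i = 0, \dots, u-1$ and using $\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{2^{\sigma+1} k}) \le \mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{m_u})$ and $\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{m_0}) = 0$, we get

$$\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{2^{\sigma+1} k}) < \sum_{i=0}^{u-1} (1 + 2\lambda)^{i} \cdot 4 \sqrt[d]{\frac{2^{\sigma+1} k}{(3/4)^{u-1-i}\,|X|}}\;\mathrm{opt}_k.$$

Using $u - 1 < \log_{4/3}(|X| / (2^{\sigma+1} k))$, we have $2^{\sigma+1} k / |X| < (3/4)^{u-1}$ and deduce

$$\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{2^{\sigma+1} k}) < 4\,\mathrm{opt}_k \sum_{i=0}^{u-1} \left( (1 + 2\lambda) \sqrt[d]{\tfrac{3}{4}} \right)^{i}. \tag{13}$$

By taking only the first two terms of the series expansion of the exponential function, we get $1 + 2\lambda = 1 + \frac{\ln(4/3)}{2d} < e^{\ln(4/3)/(2d)} = \sqrt[2d]{4/3}$. Substituting this bound into Inequality (13) and extending the sum gives

$$\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{2^{\sigma+1} k}) < 4\,\mathrm{opt}_k \sum_{i=0}^{u-1} \left( \sqrt[2d]{\tfrac{3}{4}} \right)^{i} < 4\,\mathrm{opt}_k \sum_{i=0}^{\infty} \left( \sqrt[2d]{\tfrac{3}{4}} \right)^{i}.$$

Solving the geometric series and using $\sqrt[2d]{3/4} = e^{-2\lambda}$ leads to

$$\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{2^{\sigma+1} k}) < \frac{4}{1 - e^{-2\lambda}}\;\mathrm{opt}_k < 4 \left( \frac{1}{2\lambda} + 1 \right) \mathrm{opt}_k < (28d + 4) \cdot \mathrm{opt}_k.$$

3.3.3 Stage two

The second stage covers the remaining merge steps until Algorithm 3 computes the clustering $\mathcal{C}_{2k}$. However, compared to stage one, the analysis of a single phase yields a weaker bound. The following lemma provides an analysis of a single phase of the second stage. It is very similar to Lemma 9 and Lemma 10 in the analysis of the discrete $k$-center problem.

Lemma 25. Let $m \in \mathbb{N}$ with $2k < m \le |X|$. Then,

$$\mathrm{cost}_{\mathrm{diam}}\bigl(\mathcal{C}_{\lfloor 3m/4 \rfloor}\bigr) < 2\,\bigl(\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_m) + \mathrm{opt}_k\bigr).$$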

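Figure 7: Merging two clusters intersecting with a ball of radius $r$.

Proof. Let $t := \lfloor 3m/4 \rfloor$. Then, $\mathcal{C}_m \cap \mathcal{C}_{t+1}$ is the set of clusters from $\mathcal{C}_m$ which still exist $\lceil m/4 \rceil - 1 < m/4$ merge steps after the computation of $\mathcal{C}_m$. In each iteration of its loop the algorithm can merge at most two clusters from $\mathcal{C}_m$. Thus, $|\mathcal{C}_m \cap \mathcal{C}_{t+1}| > m/2 > k$. Since $X$ is $(k,\mathrm{opt}_k)$-coverable there exists a point $y \in \mathbb{R}^d$ such that $B_{\mathrm{opt}_k}(y)$ intersects with two clusters $A, B \in \mathcal{C}_m \cap \mathcal{C}_{t+1}$. We conclude that merging $A$ and $B$ would result in a cluster whose diameter can be upper bounded by $\mathrm{diam}(A \cup B) < 2\,\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_m) + 2\,\mathrm{opt}_k$ (cf. Figure 7). The result follows using $A, B \in \mathcal{C}_{t+1}$ and Observation 20.

Lemma 26. Let $n \in \mathbb{N}$ with $n \le 2^{\sigma+1} k$ and $2k < n \le |X|$ for $\sigma = (42d)^d$. Then, on input $X$, Algorithm 3 computes a clustering $\mathcal{C}_{2k}$ with

$$\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{2k}) < 2^{3\sigma}\,\bigl(\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_n) + 2\,\mathrm{opt}_k\bigr).$$

Proof. Let $u := \lceil \log_{4/3}(n / (2k)) \rceil$ and define $m_i := \lceil (3/4)^i n \rceil$ for all $i = 0, \dots, u$. Then, $m_u \le 2k$ and $m_i > 2k$ for all $i = 0, \dots, u-1$. Analogously to the proof of Lemma 24, we get $\lfloor 3 m_i / 4 \rfloor \le m_{i+1}$ and, using Lemma 25, we deduce

$$\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{m_{i+1}}) < 2\,\bigl(\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{m_i}) + \mathrm{opt}_k\bigr)$$

for all $i = 0, \dots, u-1$. By repeatedly applying this inequality and using $\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{2k}) \le \mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{m_u})$, we get

$$\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_{2k}) < 2^{u}\,\bigl(\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_n) + 2\,\mathrm{opt}_k\bigr).$$

Hence, using $u \le \lceil \log_{4/3} 2^{\sigma} \rceil < 3\sigma$, the result follows.

Proposition 22 follows immediately by combining Lemma 24 and Lemma 26.

3.3.4 Analysis of the remaining merge steps

We analyze the remaining merge steps analogously to the $k$-center problem. Therefore, in this section we only discuss the differences, most of which are slightly modified bounds for the cost of merging two clusters (cf. Figure 7). The connectivity property from Section 3.2.2 remains the same. However, for any two $(Z,r)$-connected clusters $A, B$, we use

$$\mathrm{diam}(A \cup B) \le \mathrm{diam}(A) + \mathrm{diam}(B) + 2r \tag{14}$$

as a replacement for Inequality (8). Furthermore, Lemma 16 also holds for the diameter $k$-clustering problem, i.e. with $\mathrm{cost}_{\mathrm{diam}}(\mathcal{C}_k) \le \mathrm{cost}_{\mathrm{diam}}(\mathcal{P}_l)$.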
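Using Inequality (14) in the proof of Lemma 17, we get

$$\mathrm{diam}(A \cup B) \le \mathrm{cost}_{\mathrm{diam}}(\mathcal{P}_m) + \mathrm{cost}_{\mathrm{diam}}(\mathcal{P}_n) + 2\,\mathrm{opt}_k$$

as a replacement for Inequality (9) while Inequality (10) can be replaced by

$$\mathrm{diam}(A_1 \cup A_2) \le \mathrm{cost}_{\mathrm{diam}}(\mathcal{P}_m) + 2\,\bigl(\mathrm{cost}_{\mathrm{diam}}(\mathcal{P}_n) + 2\,\mathrm{opt}_k\bigr).$$

That is, for the diameter $k$-clustering problem the two upper bounds are different. However, the second one is larger than the first one. Using it in both cases, the inequality from Lemma 17 changes slightly to

$$\mathrm{cost}_{\mathrm{diam}}\bigl(\mathcal{P}_{\lfloor (m+l)/2 \rfloor}\bigr) \le \mathrm{cost}_{\mathrm{diam}}(\mathcal{P}_m) + 2\,\bigl(\mathrm{cost}_{\mathrm{diam}}(\mathcal{P}_n) + 2\,\mathrm{opt}_k\bigr).$$

Together with $\mathrm{cost}_{\mathrm{diam}}(\mathcal{P}_{\tilde{n}}) \le 2\,\mathrm{cost}_{\mathrm{diam}}(\mathcal{P}_n) + 2\,\mathrm{opt}_k$ as a replacement for Inequality (11), the bound from Lemma 18 becomes

$$\mathrm{cost}_{\mathrm{diam}}(\mathcal{P}_{l+1}) < 2\,(\log_2(l) + 2)\,\bigl(\mathrm{cost}_{\mathrm{diam}}(\mathcal{P}_n) + 2\,\mathrm{opt}_k\bigr).$$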
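The bound from Lemma 18 is obtained by applying the single-phase lemma along a sequence of cluster counts that halves the distance to $l$ in every phase. As a small sanity check, the following Python sketch computes such a sequence, assuming the indexing $m_i = \lceil (\tilde{n} - l)/2^i \rceil + l$ for $i = 0, \dots, u$ with $u = \lceil \log_2(\tilde{n} - l) \rceil$. The function name `halving_schedule` and the sample values are hypothetical.

```python
def halving_schedule(n_tilde, l):
    """Cluster counts m_0, ..., m_u with m_i = ceil((n_tilde - l) / 2**i) + l.

    Assumes n_tilde > l. The number of phases is u = ceil(log2(n_tilde - l)),
    computed exactly via the bit length of (n_tilde - l - 1).
    """
    gap = n_tilde - l
    u = (gap - 1).bit_length()
    return [-(-gap // 2**i) + l for i in range(u + 1)]


l = 20
schedule = halving_schedule(37, l)
print(schedule)  # starts at n_tilde and ends at l + 1

# Each phase is covered by the single-phase bound:
# floor((m_i + l) / 2) <= m_{i+1} for consecutive entries.
print(all((a + l) // 2 <= b for a, b in zip(schedule, schedule[1:])))
```

Since the schedule has $u + 1$ entries and $\tilde{n} - l < l$, the number of phases is below $\log_2(l) + 1$, which is where the logarithmic factor in the bound comes from.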
