CLUSTERING UNDER PERTURBATION RESILIENCE

MARIA FLORINA BALCAN AND YINGYU LIANG

Abstract. Motivated by the fact that distances between data points in many real-world clustering instances are often based on heuristic measures, Bilu and Linial [13] proposed analyzing objective based clustering problems under the assumption that the optimum clustering to the objective is preserved under small multiplicative perturbations to distances between points. The hope is that by exploiting the structure in such instances, one can overcome worst case hardness results. In this paper, we provide several results within this framework. For center-based objectives, we present an algorithm that can optimally cluster instances resilient to perturbations of factor (1 + √2), solving an open problem of Awasthi et al. [3]. For k-median, a center-based objective of special interest, we additionally give algorithms for a more relaxed assumption in which we allow the optimal solution to change in a small ɛ fraction of the points after perturbation. We give the first bounds known for k-median under this more realistic and more general assumption. We also provide positive results for min-sum clustering, which is typically a harder objective than center-based objectives from the approximability standpoint. Our algorithms are based on new linkage criteria that may be of independent interest. Additionally, we give sublinear-time algorithms, showing algorithms that can return an implicit clustering from only access to a small random sample.

Key words. clustering, perturbation resilience, k-median clustering, min-sum clustering

AMS subject classifications. 68Q25, 68Q32, 68T05, 68W25, 68W40

1. Introduction. Problems of clustering data from pairwise distance information are ubiquitous in science. A common approach for solving such problems is to view the data points as nodes in a weighted graph (with the weights based on the given pairwise information), and then to design algorithms to optimize various objective functions such as k-median or min-sum. For example, in the k-median clustering problem the goal is to partition the data into k clusters C_i, giving each a center c_i, in order to minimize the sum of the distances of all data points to the centers of their cluster. In the min-sum clustering approach the goal is to find k clusters C_i that minimize the sum of all intra-cluster pairwise distances. Yet unfortunately, for most natural clustering objectives, finding the optimal solution to the objective function is NP-hard. As a consequence, there has been substantial work on approximation algorithms [18, 14, 9, 15, 1] with both upper and lower bounds on the approximability of these objective functions on worst case instances.

Recently, Bilu and Linial [13] suggested an exciting, alternative approach aimed at understanding the complexity of clustering instances which arise in practice. Motivated by the fact that distances between data points in clustering instances are often based on a heuristic measure, they argue that interesting instances should be resilient to small perturbations in these distances. In particular, if small perturbations can cause the optimum clustering for a given objective to change drastically, then that probably is not a meaningful objective to be optimizing. Bilu and Linial [13] specifically define an instance to be α-perturbation resilient¹ for an objective Φ if perturbing pairwise distances by multiplicative factors in the range [1, α] does not change the optimum clustering under Φ.

Part of the results in this article appeared under the title "Clustering under Perturbation Resilience" in the Proceedings of the Thirty-Ninth International Colloquium on Automata, Languages and Programming.
Carnegie Mellon University, Pittsburgh, PA (ninamf@cs.cmu.edu).
Princeton University, Princeton, NJ (yingyul@cs.princeton.edu).
¹ Bilu and Linial [13] refer to such instances as perturbation stable instances.

They consider in detail the case of Max-Cut clustering and give an efficient algorithm to recover the optimum when the instance is resilient to perturbations on the order of α ≥ min{n/2, √(Δn)}, where Δ is the maximal degree of the graph. They also give an efficient algorithm for unweighted Max-Cut instances that are resilient to perturbations on the order of α ≥ √(4n/δ), where δ is the minimal degree of the graph. Two important questions raised by the work of Bilu and Linial [13] are: (1) the degree of resilience needed for their algorithm to succeed is quite high: can one develop algorithms for important clustering objectives that require much less resilience? (2) the resilience definition requires the optimum solution to remain exactly the same after perturbation: can one succeed under weaker conditions? In the context of center-based clustering objectives such as k-median and k-center, Awasthi et al. [3] partially address the first of these questions and show that an algorithm based on the single-linkage heuristic can be used to find the optimal clustering for α-perturbation-resilient instances for α = 3. They also conjecture it to be NP-hard to beat 3, and prove that beating 3 is NP-hard for a related but weaker notion (see the α-center proximity property in Definition 3.1).

In this work, we address both questions raised by [13] and additionally improve over [3]. First, for center-based objectives we design a polynomial time algorithm for finding the optimum solution for instances resilient to perturbations of value α = 1 + √2, thus beating the previously best known factor of 3 of Awasthi et al. [3]. Second, for k-median (which is a specific center-based objective), we consider a weaker, relaxed, and more realistic notion of perturbation resilience where we allow the optimal clustering of the perturbed instance to differ from the optimal clustering of the original in a small ɛ fraction of the points. Compared to the original perturbation resilience assumption, this is arguably a more natural though also more difficult condition to deal with. We give positive results for this case as well, showing for somewhat larger values of α that we can still achieve a near-optimal clustering on the given instance (see Section 1.1 below for precise results). We additionally give positive results for min-sum clustering, which is typically a harder objective than center-based objectives from the approximability standpoint. For example, the best known guarantee for min-sum clustering on worst-case instances is an O(υ⁻¹ log^(1+υ) n)-approximation algorithm that runs in time n^O(1/υ) for any υ > 0, due to Bartal et al. [9]; by contrast, the best guarantee known for k-median is a factor of 1 + √3 + ɛ [20] for any ɛ > 0.

Our results are achieved by carefully deriving structural properties of perturbation resilience. At a high level, all the algorithms we introduce work by first running appropriate linkage procedures to produce a hierarchical clustering, and then running dynamic programming to retrieve the best k-clustering present in the tree. To ensure that (under perturbation resilient instances) the hierarchy output in the first step has a pruning of low cost, we derive new linkage procedures (closure linkage and robust average linkage) which are of independent interest. While the overall analysis is quite involved, the clustering algorithms we devise are simple and robust. This simplicity and robustness allow us to show how our algorithms can be made sublinear-time by returning an implicit clustering from only a small random sample of the input. From a learning theory perspective, the resilience parameter α can also be seen as an analog of a margin for clustering.
In supervised learning, the margin of a data point is the distance, after scaling, between the data point and the decision boundary of its classifier, and many algorithms have stronger guarantees when the smallest margin over the entire data set is sufficiently large [27, 28].

The α parameter similarly controls the magnitude of the perturbation the data can withstand before being clustered differently, which is, in essence, the data's distance to the decision boundary for the given clustering objective. Hence, perturbation resilience is also a natural and interesting assumption to study from a learning theory perspective.

Our Results. In this paper, we advance the line of work of [13] by solving several important problems of clustering perturbation-resilient instances under metric center-based and min-sum objectives.

In Section 3 we improve on the bounds of [3] for α-perturbation resilient instances for center-based objectives, giving an algorithm that efficiently² finds the optimum clustering for α = 1 + √2. Most of the frequently used center-based objectives, such as k-median, are NP-hard to even approximate, yet we can recover the exact solution for perturbation resilient instances. Our algorithm is based on a new linkage procedure using a new notion of distance (closure distance) between sets that may be of independent interest.

In Section 4 we consider the more challenging and more general notion of (α, ɛ)-perturbation resilience for k-median, where we allow the optimal solution after perturbation to be ɛ-close to the original. We provide an efficient algorithm which for α > 2 + √3 produces a (1 + O(ɛ/ρ))-approximation to the optimum, where ρ is the fraction of the points in the smallest cluster. The key structural property we derive and exploit is that, except for at most ɛn bad points, every point is α times closer to its own center than to any other center. To eliminate the noise introduced by the bad points, we carefully partition the points into a list of sufficiently large blobs, each of which contains only good points from one optimal cluster. This then allows us to construct a tree on the blobs with a low-cost pruning that is a good approximation to the optimum.

In Section 5 we provide the first efficient algorithm for optimally clustering α-perturbation resilient min-sum instances. We show that when α is on the order of the ratio between the sizes of the largest and smallest clusters, there exists an algorithm that can output the optimal clustering in polynomial time. Our algorithm is based on an appropriate modification of average linkage that exploits the structure of min-sum perturbation resilient instances.

In Section 6, we show that for (α, ɛ)-perturbation resilient min-sum instances with α on the order of the ratio between the sizes of the largest and smallest clusters and ɛ = Õ(ρ), there exists a polynomial time algorithm that outputs a clustering that is both a (1 + Õ(ɛ/ρ))-approximation and Õ(ɛ)-close to the optimal clustering. The key structural property is that, except for Õ(ɛn) bad points, every point is O(α) times closer to its own optimal cluster than to any other optimal cluster. Similarly to the case of k-median, we can partition the points into a list of sufficiently large blobs, each of which contains only good points from one optimal cluster. However, the properties of the good points are significantly weaker than those in the k-median case, and thus the linkage there does not guarantee a tree with a low-cost pruning. To utilize these properties, we introduce the notion of potentially good points, which can act as a proxy for the actual good points. We then design a robust average linkage algorithm based on the cost computed only on the potentially good points, which constructs a tree with a pruning that assigns all good points correctly. The pruning can be found efficiently, and after some processing it leads to a clustering that is both a good approximation and close to the optimal clustering.
² For clarity, in this paper "efficient" means polynomial in both n (the number of points) and k (the number of clusters).

We also provide sublinear-time algorithms for both the k-median and min-sum objectives (Sections 4.3 and 5.1), showing algorithms that can return an implicit clustering from only access to a small random sample.

Related Work. A subsequent work [12] of [13] by Bilu, Daniely, Linial and Saks studied the Max-Cut problem under perturbation resilience, and showed how to solve in polynomial time (1 + ɛ)-perturbation resilient instances of metric and dense Max-Cut, and Ω(√n)-perturbation resilient instances of general Max-Cut. The latter bound was further improved by Makarychev, Makarychev and Vijayaraghavan [22]. They proposed a polynomial time exact algorithm for Ω(√(log n) · log log n)-perturbation resilient Max-Cut instances based on semidefinite programming. They also proved that for Max k-Cut with k ≥ 3, there is no polynomial-time algorithm that solves ∞-perturbation resilient instances of Max k-Cut unless NP = RP; here an instance is ∞-perturbation resilient if it is α-perturbation resilient for every α. Finally, they also studied a notion called (γ, N)-weak stability for Max-Cut, which means that after perturbing the weights by a factor of at most γ, the optimal solution must be from the set N. When N is the set of solutions that differ from the optimal solution on at most a δ fraction of nodes, the notion is the same as the (γ, δ)-perturbation resilience studied in our work. They showed that when γ = Ω(√(log n) · log log n), there exists an efficient algorithm that can find a cut from N. In a recent work [23], the same authors further proposed a beyond worst-case analysis model for Balanced-Cut, which is a planted model with random edges from permutation-invariant distributions. They achieved a constant factor approximation with respect to the cost of the planted cut when the number of random edges is Ω(n polylog(n)).

Several recent papers have shown how to exploit the structure of perturbation resilient instances in order to obtain better approximation guarantees (than those possible on worst case instances) for other difficult optimization problems. These include the game theoretic problem of finding Nash equilibria [6, 21] and the classic traveling salesman problem [24]. In the context of objective based clustering, several recent papers have shown how to exploit other notions of stability for overcoming the existing hardness results on worst case instances. The ORSS stability notion of Ostrovsky, Rabani, Schulman and Swamy [26, 3] assumes that the cost of the optimal k-means solution is small compared to the cost of the optimal (k − 1)-means solution. The BBG (c, ɛ)-approximation stability condition of Balcan, Blum and Gupta [5] assumes that every c-approximation solution is close to the target clustering. We note that when the target clustering is the optimal clustering for the clustering objective, (c, ɛ)-approximation stability implies (c, ɛ)-perturbation resilience. Awasthi, Sheffet and Blum [2] proposed a stability condition called weak-deletion stability, and showed that it is implied by both the ORSS stability and the BBG stability. Kumar and Kannan [19] proposed a proximity condition which assumes that in the target clustering, most data points are closer to their center than to any other center by an additive factor on the order of the maximal standard deviation of their clusters in any direction. Their results were improved by Awasthi and Sheffet [4], who proposed a weaker version of the proximity condition called center separation, and designed algorithms achieving stronger guarantees under this weaker condition. These notions are not directly comparable to the perturbation resilience property.

2. Notation and Preliminaries.
In a clustering instance, we are given a set S of n points in a finite metric space, and we denote by d : S × S → R≥0 the distance function.

Φ denotes the objective function over a partition of S into k < n clusters which we want to optimize over the metric; that is, Φ assigns a score to every clustering. The optimal clustering with respect to Φ is denoted as C = {C_1, C_2, ..., C_k}, and its cost is denoted as OPT. The core concept we study in this paper is the perturbation resilience notion introduced by [13]. Formally:

Definition 2.1. A clustering instance (S, d) is α-perturbation resilient to a given objective Φ if for any function d′ : S × S → R≥0 such that ∀p, q ∈ S, d(p, q) ≤ d′(p, q) ≤ αd(p, q), there is a unique optimal clustering C′ for Φ under d′, and this clustering is equal to the optimal clustering C for Φ under d.

Note that in the definition, d′ need not be a metric. Also note that the definition depends on the objective. In this paper, we focus on the center-based and min-sum objectives. For the center-based objectives, we consider the separable center-based objectives defined by [3].

Definition 2.2. A clustering objective is center-based if the optimal solution can be defined by k points c_1, ..., c_k in the metric space, called centers, such that every data point is assigned to its nearest center. Such a clustering objective is separable if it furthermore satisfies the following two conditions: (1) The objective function value of a given clustering is either a (weighted) sum or the maximum of the individual cluster scores. (2) Given a proposed single cluster, its score can be computed in polynomial time.

One particular center-based objective is the k-median objective. We partition S into k disjoint subsets P = {P_1, P_2, ..., P_k} and assign a set of centers p = {p_1, p_2, ..., p_k} ⊆ S for the subsets. The objective is Φ(P, p) = Σ_{i=1}^k Σ_{p∈P_i} d(p, p_i). The centers in the optimal clustering are denoted as c = {c_1, ..., c_k}. Clearly, in an optimal solution, each point is assigned to its nearest center. In such cases, the objective is denoted as Φ(c). For the min-sum objective, we partition S into k disjoint subsets denoted as P = {P_1, P_2, ..., P_k}, and the goal is to minimize Φ(P) = Σ_{i=1}^k Σ_{p∈P_i} Σ_{q∈P_i} d(p, q). Note that we sometimes denote Φ as Φ_S in the case where the distinction is necessary, such as in Section 4.3.

In Section 4 we consider a generalization of perturbation resilience where we allow a small difference between the original optimum and the new optimum after perturbation. Formally:

Definition 2.3. Let C be the optimal k-clustering and C′ be another k-clustering of a set of n points. We say C′ is ɛ-close to C if min_{σ∈S_k} Σ_{i=1}^k |C_i \ C′_{σ(i)}| ≤ ɛn, where σ is a matching between indices of clusters of C′ and those of C.

Definition 2.4. A clustering instance (S, d) is (α, ɛ)-perturbation resilient to a given objective Φ if for any function d′ : S × S → R≥0 s.t. ∀p, q ∈ S, d(p, q) ≤ d′(p, q) ≤ αd(p, q), the optimal clustering C′ for Φ under d′ is ɛ-close to the optimal clustering C for Φ under d.

For simplicity, we assume ɛn is an integer and assume that min_i |C_i| is known (otherwise, we can simply search over the n possible different values). For A, B ⊆ S and a distance function d, we define d_s(A, B) := Σ_{p∈A, q∈B} d(p, q), d_s(p, B) := d_s({p}, B), and d_s(p, q) := d_s({p}, {q}). Also, we define d_a(A, B) := d_s(A, B)/(|A| |B|) and d_a(p, B) := d_a({p}, B) for nonempty A and B.

3. α-Perturbation Resilience for Center-based Objectives. In this section we show that, for α ≥ 1 + √2, if the clustering instance is α-perturbation resilient for center-based objectives, then we can in polynomial time find the optimal clustering.
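Before turning to the algorithm, the notation of Section 2 can be made concrete in code. The following Python sketch is ours (not from the paper); it assumes points are hashable and d(p, q) is the given metric:

def k_median_cost(partition, centers, d):
    # Phi(P, p) = sum_i sum_{q in P_i} d(q, p_i)
    return sum(d(q, centers[i]) for i, P_i in enumerate(partition) for q in P_i)

def min_sum_cost(partition, d):
    # Phi(P) = sum_i sum_{p in P_i} sum_{q in P_i} d(p, q)
    return sum(d(p, q) for P_i in partition for p in P_i for q in P_i)

def d_s(A, B, d):
    # d_s(A, B) := sum over p in A, q in B of d(p, q)
    return sum(d(p, q) for p in A for q in B)

def d_a(A, B, d):
    # d_a(A, B) := d_s(A, B) / (|A| |B|), for nonempty A and B
    return d_s(A, B, d) / (len(A) * len(B))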

This improves on the α ≥ 3 bound of [3] and stands in sharp contrast to the NP-hardness results on worst-case instances. Our algorithm succeeds for an even weaker property, the α-center proximity, introduced in [3].

Definition 3.1. A clustering instance (S, d) satisfies the α-center proximity property if for any optimal cluster C_i ∈ C with center c_i, any C_j ∈ C (j ≠ i) with center c_j, and any point p ∈ C_i, we have αd(p, c_i) < d(p, c_j).

Lemma 3.2. Any clustering instance that is α-perturbation resilient to center-based objectives also satisfies the α-center proximity.

The proof follows easily by constructing a specific perturbation that blows up all the pairwise distances within cluster C_i by a factor of α. By α-perturbation resilience, the optimal clustering remains the same after this perturbation. This then implies the desired result. The full proof appears in [3]. In the remainder of this section, we prove our results for α-center proximity, but because it is a weaker condition, our upper bounds also hold for α-perturbation resilience. We begin with some key properties of α-center proximity instances.

Lemma 3.3. For any points p ∈ C_i and q ∈ C_j (j ≠ i) in the optimal clustering of an α-center proximity instance, we have (1) d(c_i, q) > (α(α − 1)/(α + 1)) d(c_i, p), and (2) d(p, q) > (α − 1) max{d(p, c_i), d(q, c_j)}. Consequently, when α ≥ 1 + √2, we have (1) d(c_i, q) > d(c_i, p), and (2) d(p, q) > d(p, c_i).

Proof. (1) Lemma 3.2 gives us that d(q, c_i) > αd(q, c_j). By the triangle inequality, we have d(c_i, c_j) ≤ d(q, c_j) + d(q, c_i) < (1 + 1/α)d(q, c_i). On the other hand, d(p, c_j) > αd(p, c_i), and therefore d(c_i, c_j) ≥ d(p, c_j) − d(p, c_i) > (α − 1)d(p, c_i). Combining these inequalities, we get (1).

(2) The proof first appears in [3], and we include it for completeness. Without loss of generality, we can assume that d(p, c_i) ≥ d(q, c_j). By the triangle inequality we have d(p, q) ≥ d(p, c_j) − d(q, c_j). From Lemma 3.2 we have d(p, c_j) > αd(p, c_i). Hence d(p, q) > αd(p, c_i) − d(q, c_j) ≥ (α − 1)d(p, c_i) ≥ (α − 1)d(q, c_j).

Lemma 3.3 implies that for any optimal cluster C_i, the ball of radius max_{p∈C_i} d(c_i, p) around the center c_i contains only points from C_i, and moreover, points inside the ball are each closer to the center than to any point outside the ball. Inspired by this structural property, we define the notion of closure distance between two sets as the radius of the minimum ball that covers the sets and has some margin from points outside the ball. We show that any (strict) subset of an optimal cluster has smaller closure distance to another subset in the same cluster than to any subset of other clusters or to unions of other clusters. Using this, we will be able to define an appropriate linkage procedure that, when applied to the data, produces a tree on subsets that will all be laminar with respect to the clusters in the optimal solution. This will then allow us to extract the optimal solution using dynamic programming applied to the tree.

We now define the notion of closure distance and then present our algorithm for α-center proximity instances (Algorithm 1). Let B(p, r) := {q : d(q, p) ≤ r} denote the ball around p with radius r.

Definition 3.4. The closure distance d_S(A, A′) between two disjoint nonempty subsets A and A′ of the point set S is the minimum d ≥ 0 such that there is a point c ∈ A ∪ A′ satisfying the following requirements: (1) coverage: the ball B(c, d) covers A and A′, that is, A ∪ A′ ⊆ B(c, d);

(2) margin: points inside B(c, d) are closer to the center c than to points outside, that is, ∀p ∈ B(c, d) and ∀q ∉ B(c, d), we have d(c, p) < d(p, q).

Fig. 1: Illustration for the closure distance.

Note that d_S(A, A′) = d_S(A′, A) ≤ max_{p,q∈S} d(p, q) for any A and A′. Furthermore, it can be computed in polynomial time.
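To see why it is polynomial, here is a minimal Python sketch computing the closure distance directly from Definition 3.4 (our own rendering, not the paper's implementation): for each candidate center c ∈ A ∪ A′, only radii equal to distances from c to points of S need to be tried.

def closure_distance(A, A2, S, d):
    # d_S(A, A') per Definition 3.4: minimize over candidate centers
    # c in A ∪ A' and candidate radii covering A ∪ A'.
    best = float('inf')
    for c in list(A) + list(A2):
        r_cover = max(d(c, p) for p in list(A) + list(A2))  # coverage
        for r in sorted(set(d(c, x) for x in S)):
            if r < r_cover:
                continue
            ball = [p for p in S if d(c, p) <= r]
            outside = [q for q in S if d(c, q) > r]
            # margin: every point inside B(c, r) is closer to c than to
            # any point outside B(c, r)
            if all(d(c, p) < d(p, q) for p in ball for q in outside):
                best = min(best, r)
                break  # larger radii for this center cannot improve
    return best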

Algorithm 1 Center-based objectives, α-perturbation resilience
Input: Data set S, distance function d(·, ·) on S.
1: Begin with n singleton clusters.
2: Repeat till only one cluster remains: merge clusters C, C′ which minimize d_S(C, C′).
3: Let T be the tree with single points as leaves and internal nodes corresponding to the merges performed.
4: Run dynamic programming on T to get the minimum cost pruning C̃.
Output: Clustering C̃.

Theorem 3.5. For (1 + √2)-center proximity instances, Algorithm 1 outputs the optimal clustering in polynomial time.

The proof follows immediately from the following key property of Phase 1 of Algorithm 1. The details of the dynamic programming are presented in Appendix A, and an efficient implementation of the algorithm is presented in Appendix B.

Theorem 3.6. For (1 + √2)-center proximity instances, Algorithm 1 constructs a binary tree T such that the optimal clustering is a pruning of this tree.

Proof. We prove correctness by induction. In particular, assume that our current clustering is laminar with respect to the optimal clustering. That is, for each cluster A in our current clustering and each C in the optimal clustering, we have either A ⊆ C, or C ⊆ A, or A ∩ C = ∅. This is clearly true at the start. To prove that the merge steps keep the laminarity, we need to show the following: if A is a strict subset of an optimal cluster C_i, and A′ is a subset of another optimal cluster or the union of one or more other clusters, then there exists B from C_i \ A such that d_S(A, B) < d_S(A, A′).

We first prove that there is a cluster B ⊆ C_i \ A in the current cluster list such that d_S(A, B) ≤ d̃ := max_{p∈C_i} d(c_i, p). There are two cases. First, if c_i ∉ A, then define B to be the cluster in the current cluster list that contains c_i. By induction, B ⊆ C_i and thus B ⊆ C_i \ A. Then we have d_S(B, A) ≤ d̃, since there is c_i ∈ B, and (1) for any p ∈ A ∪ B, d(c_i, p) ≤ d̃; (2) for any p ∈ S satisfying d(c_i, p) ≤ d̃ and any q ∈ S satisfying d(c_i, q) > d̃, by Lemma 3.3 we know p ∈ C_i and q ∉ C_i, and thus d(c_i, p) < d(p, q). In the second case, when c_i ∈ A, we pick any B ⊆ C_i \ A, and a similar argument gives d_S(A, B) ≤ d̃.

Fig. 2: Comparing d̃ and d_S(A, A′) in closure linkage (case 1: c ∈ A; case 2: c ∈ A′).

As a second step, we need to show that d̃ < d̂ := d_S(A, A′). There are two cases: the center for d_S(A, A′) is in A or in A′. See Figure 2 for an illustration. In the first case, there is a point c ∈ A such that c and d̂ satisfy the requirements of the closure distance. Pick a point q ∈ A′, and define C_j to be the cluster in the optimal clustering that contains q. As d(c, q) ≤ d̂, and by Lemma 3.3 we have d(c_j, q) < d(c, q), we get d(c_j, c) ≤ d̂ (otherwise it violates the second requirement of the closure distance). Suppose p = arg max_{p′∈C_i} d(c_i, p′). Then we have d̃ = d(p, c_i) < d(p, c_j)/α ≤ (d̃ + d(c_i, c) + d(c, c_j))/α, where the first inequality comes from Lemma 3.2 and the second from the triangle inequality. Since d(c_i, c) < d(c, c_j)/α, we can combine the above inequalities and compare d̃ and d(c, c_j); when α ≥ 1 + √2 we have d̃ < d(c, c_j) ≤ d̂.

Now consider the second case, when there is a point c ∈ A′ such that c and d̂ satisfy the requirements in the definition of the closure distance. Select an arbitrary point q ∈ A. We have d̂ ≥ d(c, q) from the first requirement, and d(c, q) > d(c_i, q) by Lemma 3.3. Then from the second requirement of the closure distance, d(c_i, c) ≤ d̂. And since by Lemma 3.3 d̃ = d(c_i, p) < d(c_i, c), we have d̃ < d(c_i, c) ≤ d̂.

Note 3.1. Our factor of α = 1 + √2 beats the NP-hardness lower bound of α = 3 of [3] for center-proximity instances. The reason is that the lower bound of [3] requires the addition of Steiner points that can act as centers but are not part of the data to be clustered (though the upper bound of [3] does not allow such Steiner points). One can also show a lower bound for center-proximity instances without Steiner points. In particular, for any ɛ > 0, the problem of solving (2 − ɛ)-center proximity k-median instances is NP-hard [10]. There is also a lower bound for perturbation resilience. Balcan, Haghtalab and White [8] recently showed that there is no polynomial time algorithm for k-center instances under (2 − ɛ)-perturbation resilience, unless NP = RP. They also showed that closure linkage solves k-center instances under 2-perturbation resilience in polynomial time.
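For concreteness, the merge phase of Algorithm 1 can be sketched as follows in Python (a simple quadratic-pairs rendering of ours, using closure_distance from the sketch above; the efficient implementation is the one described in Appendix B):

import itertools

def closure_linkage_tree(S, d):
    # Start from singletons and repeatedly merge the pair of clusters
    # minimizing the closure distance, recording the merge tree.
    clusters = [frozenset([p]) for p in S]
    children = {c: None for c in clusters}  # node -> (left child, right child)
    while len(clusters) > 1:
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda pair: closure_distance(pair[0], pair[1], S, d))
        merged = a | b
        children[merged] = (a, b)
        clusters = [c for c in clusters if c != a and c != b] + [merged]
    return children  # the optimal clustering is a pruning of this tree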

Note 3.2. The first condition in our definition of closure distance is similar to the minimax linkage criterion [11]. More precisely, our closure distance definition has two conditions: a coverage condition and a margin condition. If the margin condition is removed from the definition, then the closure distance reduces to the minimax linkage distance. For our purposes, however, the margin condition is crucial; in particular, we can provably argue that when the center proximity condition is satisfied, Algorithm 1 produces a tree such that the optimal clustering is a pruning of the tree (Theorem 3.5).

4. (α, ɛ)-Perturbation Resilience for the k-Median Objective. In this section we consider a natural relaxation of the α-perturbation resilience, the (α, ɛ)-perturbation resilience property, which requires the optimum after perturbation of up to a multiplicative factor α to be ɛ-close to the original (one should think of ɛ as sub-constant). We show that if the instance is (α, ɛ)-perturbation resilient with α > 2 + √3, then we can in polynomial time output a clustering that provides a (1 + 5ɛ/ρ)-approximation to the optimum, where ρ is the fraction of the points in the smallest cluster. Thus this improves over the best worst-case approximation guarantees known [20] when ɛ < √3ρ/5, and also beats the lower bound of (1 + 1/e) on the best approximation achievable on worst case instances for the metric k-median objective [17, 18] when ɛ < ρ/(5e).

The key idea is to understand and leverage the structure implied by (α, ɛ)-perturbation resilience. We show that perturbation resilience implies that there exists only a small fraction of points that are bad in the sense that their distance to their own center is not α times smaller than their distance to any other center in the optimal solution. We then use this bounded number of bad points in our clustering algorithm.

4.1. Structure of (α, ɛ)-Perturbation Resilience. Throughout this section we will assume that |C_i| is sufficiently large compared to ɛn, since for interesting practical clustering instances one would expect that a large fraction of an optimal cluster will remain the same after a small perturbation. The exact bound will be stated explicitly in our main theorems. For now we can simply assume |C_i| > 2ɛn for all i.

To understand the structure of (α, ɛ)-perturbation resilience, we need to consider the difference between the optimal clustering C under d and the optimal clustering C′ under a perturbation d′, defined as min_{σ∈S_k} Σ_{i=1}^k |C_i \ C′_{σ(i)}|. Since Σ_{i=1}^k |C_i \ C′_{σ(i)}| ≤ ɛn by assumption, we clearly have separately for each i that |C_i \ C′_{σ(i)}| ≤ ɛn. Since |C_i| > 2ɛn, this implies that C′_{σ(i)} is the unique cluster in C′ such that |C_i ∩ C′_{σ(i)}| > (1/2)|C_i|. Without loss of generality, let us index C′ so that σ is the identity. We denote by c′_i the center of C′_i.

In the following we introduce the notions of bad points and good points, and then show that under perturbation resilience we do not have too many bad points.

Definition 4.1. Define the bad points for k-median to be those that are not α times closer to their own center than to any other center in the optimal clustering. That is, B := ∪_i B_i, where B_i := {p ∈ C_i : ∃j ≠ i, αd(c_i, p) ≥ d(c_j, p)}. The other points G := S \ B are called good points. Let G_i := G ∩ C_i denote the good points in cluster C_i.

Theorem 4.2. Suppose the clustering instance is (α, ɛ)-perturbation resilient and min_i |C_i| > 6((α + 1)/(α − 1))(ɛn + α + 1). Then |B| ≤ ɛn.
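For intuition, identifying the bad points of Definition 4.1 is a simple test against the optimal centers. Here is a minimal Python sketch (ours; it is useful only for intuition, since the optimal centers are of course unknown to the algorithm):

def bad_points(clusters, centers, d, alpha):
    # B_i = {p in C_i : exists j != i with alpha * d(c_i, p) >= d(c_j, p)}
    bad = set()
    for i, C_i in enumerate(clusters):
        for p in C_i:
            if any(alpha * d(centers[i], p) >= d(centers[j], p)
                   for j in range(len(centers)) if j != i):
                bad.add(p)
    return bad  # the good points are G = S \ B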

Intuition. Assume for contradiction that |B| > ɛn. The main idea is to select a subset of (ɛn + 1) bad points and then construct a specific perturbation so that in the new optimal clustering these (and only these) selected bad points move to new clusters, leading to a clustering that is ɛ-far from the original optimal clustering. This is contradictory to the (α, ɛ)-perturbation resilience property, and thus there are at most ɛn bad points.

The selected bad points and the perturbation are defined as follows. Select an arbitrary subset B̂ of (ɛn + 1) bad points from B, and let B̂_i = B̂ ∩ C_i denote the selected bad points in C_i. Let c(p) denote the second nearest center for p ∈ B̂ and the nearest center for p ∈ C_i \ B̂. That is, for any 1 ≤ i ≤ k and any p ∈ C_i, let

    c(p) = c_j where j = arg min_{j≠i} d(p, c_j), if p ∈ B̂;
    c(p) = c_i, if p ∈ C_i \ B̂.

The perturbation blows up all distances by a factor of α except for the distances between each p and c(p). Formally,

    d′(p, q) = d(p, q) if p = c(q) or q = c(p);
    d′(p, q) = αd(p, q) otherwise.

The key challenge in showing the contradiction is to show that c′_i = c_i for all i, that is, that the optimal centers do not change after the perturbation. Once this is shown, it is immediate that in the optimum clustering under d′ each point p is assigned to the center c(p), and thus the selected bad points B̂ will move from their original optimal clusters while all other points will not. So the distance between the new clustering and the original clustering is |B̂| > ɛn, which is contradictory to the (α, ɛ)-perturbation resilience property.
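The map c(·) and the perturbation d′ used in this argument can be written out as follows (a Python sketch of the proof construction; B_hat denotes the selected set of ɛn + 1 bad points):

def make_perturbation(clusters, centers, B_hat, d, alpha):
    c_of = {}
    for i, C_i in enumerate(clusters):
        for p in C_i:
            if p in B_hat:
                # second nearest center for selected bad points
                c_of[p] = min((centers[j] for j in range(len(centers)) if j != i),
                              key=lambda c: d(p, c))
            else:
                # own (nearest) center for all other points
                c_of[p] = centers[i]

    def d_prime(p, q):
        # keep d(p, q) on the pairs {p, c(p)}; blow up the rest by alpha
        if c_of.get(p) == q or c_of.get(q) == p:
            return d(p, q)
        return alpha * d(p, q)

    return d_prime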

Fig. 3: Different types of points. (a) A_i = C′_i \ C_i, M_i = C_i \ C′_i. (b) W_i = (C_i ∩ C′_i) \ B̂, V_i = (C_i ∩ C′_i) ∩ B̂. As a result, C_i = W_i ∪ V_i ∪ M_i and C′_i = W_i ∪ V_i ∪ A_i.

It will now be convenient to define a few quantities. Let A_i = C′_i \ C_i (the points added when switching from C_i to C′_i), M_i = C_i \ C′_i (the points removed), W_i = (C_i ∩ C′_i) \ B̂ (the common points excluding selected bad points), and V_i = (C_i ∩ C′_i) ∩ B̂ (the selected bad points in common). So C_i = W_i ∪ V_i ∪ M_i and C′_i = W_i ∪ V_i ∪ A_i. See Figure 3. Note that |A_i| ≤ ɛn, |M_i| ≤ ɛn, and |V_i| ≤ ɛn + 1, with the bulk of the points in W_i.

The intuition for the proof that c′_i = c_i is the following. Assume for contradiction that c′_i ≠ c_i. First, d(c_i, c′_i) cannot be too large compared to the average distance between c_i and W_i, or else by the triangle inequality d_s(c′_i, W_i) would also be large, violating the fact that d′_s(c′_i, C′_i) ≤ d′_s(c_i, C′_i); see Claim 4.3. On the other hand, if d(c_i, c′_i) is small, then by the triangle inequality d_s(c′_i, W_i) ≈ d_s(c_i, W_i). Since distances between c′_i and W_i are blown up by a factor of α in moving from d to d′ but distances between c_i and W_i are not, d′_s(c′_i, W_i) will be significantly larger than d′_s(c_i, W_i), which will also violate the fact that d′_s(c′_i, C′_i) ≤ d′_s(c_i, C′_i); see Claim 4.4.

Proof of Theorem 4.2. We now present the formal proof. Before proving the two key claims mentioned in the intuition, we begin with two convenient claims. The first convenient claim shows that c′_i ≠ c_j for j ≠ i. The second convenient claim shows the relation of d_s and d′_s on A_i.

Claim 4.1. If min_i |C_i| > (2/(α − 1) + 3)ɛn + 1, then c′_i ≠ c_j (∀j ≠ i).

Proof. Assume for contradiction that c′_i = c_j. We first need to show c′_j ≠ c_l (∀l). Clearly c′_j ≠ c_j, since otherwise moving all the points in C′_j to C′_i would not increase the cost, which violates the (α, ɛ)-perturbation resilience. We also know that c′_j ≠ c_l (l ≠ j), since otherwise there is p ∈ W_j with αd(c_l, p) = d′(c_l, p) ≤ d′(c_j, p) = d(c_j, p), which contradicts the fact that p ∈ C_j.

Now we can apply the intuition described above to show that c′_i = c_j and c′_j ≠ c_l (∀l) lead to a contradiction. Note that points in W_j ∪ V_j = C_j ∩ C′_j are closer to c′_j than to c′_i = c_j under d′. Then, back under d, for any p ∈ W_j, since c′_j ≠ c_l (∀l), αd(c′_j, p) = d′(c′_j, p) ≤ d′(c_j, p) = d(c_j, p), resulting in d_s(c′_j, W_j) ≤ d_s(c_j, W_j)/α. Similarly, for any p ∈ V_j, αd(c′_j, p) = d′(c′_j, p) ≤ d′(c_j, p) = αd(c_j, p), resulting in d_s(c′_j, V_j) ≤ d_s(c_j, V_j).

These facts have two consequences. First, since points in W_j are α times closer to c′_j than to c_j, the distance between c′_j and c_j is small:

(4.1) d(c′_j, c_j) ≤ d_s(c′_j, W_j)/|W_j| + d_s(c_j, W_j)/|W_j| ≤ (1 + 1/α) d_s(c_j, W_j)/|W_j|.

Second, since c_j is the optimal center for C_j = W_j ∪ V_j ∪ M_j, it should save a lot of cost on M_j compared to c′_j, which suggests that c_j and c′_j are far apart. Formally, d_s(c_j, C_j) = d_s(c_j, W_j ∪ V_j ∪ M_j) ≤ d_s(c′_j, C_j) = d_s(c′_j, W_j ∪ V_j ∪ M_j). Since d_s(c′_j, W_j) ≤ d_s(c_j, W_j)/α and d_s(c′_j, V_j) ≤ d_s(c_j, V_j), we have d_s(c′_j, M_j) − d_s(c_j, M_j) ≥ d_s(c_j, W_j) − (1/α)d_s(c_j, W_j), and hence

(4.2) |M_j| d(c′_j, c_j) ≥ (1 − 1/α) d_s(c_j, W_j).

When |C_j| > (2/(α − 1) + 3)ɛn + 1, we have (1 − 1/α)|W_j| > (1 + 1/α)|M_j|. Then inequalities (4.2) and (4.1) lead to d(c′_j, c_j) = 0. This means c′_j = c_j, which is a contradiction to the assumptions.

Claim 4.2. Suppose min_i |C_i| > (2/(α − 1) + 3)ɛn + 1. If c′_i ≠ c_i, then we have (1) d′_s(c′_i, A_i) ≥ αd_s(c′_i, A_i \ {c(c′_i)}), and (2) d′_s(c_i, A_i) ≤ αd_s(c_i, A_i \ {c(c′_i)}) + α(1 + α)d(c_i, c′_i).

Proof. These translations from d to d′ can be verified from the definition of d′. In most cases d′(·, ·) = αd(·, ·); the only exceptions are the distances between each p and c(p). The detailed verification is presented below.

(1) Since c′_i ≠ c_i and, by Claim 4.1, c′_i ≠ c_j (∀j ≠ i), the point c′_i is not an original center, so we only need to check whether c(c′_i) ∈ A_i. We have d′_s(c′_i, A_i) ≥ d′_s(c′_i, A_i \ {c(c′_i)}) = αd_s(c′_i, A_i \ {c(c′_i)}).

(2) If c(c′_i) ∉ A_i, then the inequality is trivial. If c(c′_i) ∈ A_i, then d′_s(c_i, A_i) = d′_s(c_i, A_i \ {c(c′_i)}) + d′(c_i, c(c′_i)) ≤ αd_s(c_i, A_i \ {c(c′_i)}) + αd(c_i, c(c′_i)). We have d(c_i, c(c′_i)) ≤ d(c_i, c′_i) + d(c′_i, c(c′_i)). If c′_i is a selected bad point, then d(c′_i, c(c′_i)) ≤ αd(c′_i, c_i). Otherwise, c(c′_i) is the nearest center for c′_i, so d(c′_i, c(c′_i)) ≤ d(c′_i, c_i). In either case, d(c_i, c(c′_i)) ≤ (1 + α)d(c_i, c′_i), and the inequality for d′_s(c_i, A_i) follows.

We are now ready to present the complete proofs of the two key claims.

Claim 4.3. For each i, d(c_i, c′_i) ≤ 3((α + 1)/α) · d_s(c_i, W_i)/|C_i|.

Proof. The key idea is that since d′_s(c′_i, C′_i) ≤ d′_s(c_i, C′_i) and C′_i \ W_i is small, it must be the case that d′_s(c′_i, W_i) is not too much larger than d′_s(c_i, W_i). Now, since distances between c_i and W_i remain the same in moving from d to d′ but distances between c′_i and W_i are blown up by a factor of α (except possibly for the distance between c′_i and c(c′_i)), this means that αd_s(c′_i, W_i) − d_s(c_i, W_i) must be small. This is then used together with the triangle inequality to get an upper bound on d(c_i, c′_i). We provide the formal proof below.

First, if c′_i = c_i the claim is trivially true, so assume c′_i ≠ c_i. We begin with the fact that d′_s(c′_i, C′_i) ≤ d′_s(c_i, C′_i) and then break C′_i into its three components W_i, V_i, and A_i. We move the W_i terms to one side and the rest of the terms to the other side, resulting in

(4.3) d′_s(c′_i, W_i) − d′_s(c_i, W_i) ≤ [d′_s(c_i, A_i) − d′_s(c′_i, A_i)] + [d′_s(c_i, V_i) − d′_s(c′_i, V_i)].

Beginning with the right-hand side of (4.3), by the triangle inequality we have d_s(c_i, V_i) ≤ d_s(c′_i, V_i) + |V_i| d(c_i, c′_i). Thus d′_s(c_i, V_i) ≤ d′_s(c′_i, V_i) + α|V_i| d(c_i, c′_i). Similarly, by Claim 4.2 we have d′_s(c_i, A_i) ≤ d′_s(c′_i, A_i) + α|A_i| d(c_i, c′_i) + α(α + 1)d(c_i, c′_i). So the right-hand side of (4.3) is at most α(|V_i| + |A_i| + α + 1)d(c_i, c′_i). Now, examining the left-hand side, this quantity is at least αd_s(c′_i, W_i \ {c(c′_i)}) − d_s(c_i, W_i). So we have

(4.4) αd_s(c′_i, W_i \ {c(c′_i)}) − d_s(c_i, W_i) ≤ α(|V_i| + |A_i| + α + 1)d(c_i, c′_i).

Using the fact that, by the triangle inequality, α(|W_i| − 1)d(c_i, c′_i) ≤ αd_s(c_i, W_i \ {c(c′_i)}) + αd_s(c′_i, W_i \ {c(c′_i)}), and subtracting (α + 1)d_s(c_i, W_i) from both sides, we get

(4.5) αd(c_i, c′_i)(|W_i| − 1) − (α + 1)d_s(c_i, W_i) ≤ αd_s(c′_i, W_i \ {c(c′_i)}) − d_s(c_i, W_i).

Combining (4.4) and (4.5) we have

αd(c_i, c′_i)(|W_i| − 1) − (α + 1)d_s(c_i, W_i) ≤ αd(c_i, c′_i)(|V_i| + |A_i| + α + 1),

which implies the desired result when |C_i| > 5ɛn + 2α + 6.

Claim 4.4. For each i, if c′_i ≠ c_i then d(c_i, c′_i) ≥ ((α − 1)/(2α)) · d_s(c_i, W_i)/(ɛn + α + 1).

Proof. Assume c′_i ≠ c_i and let d = d(c_i, c′_i). We will begin with the fact that d_s(c_i, C_i) ≤ d_s(c′_i, C_i), then proceed to compare d_s(c_i, C′_i) and d_s(c′_i, C′_i), and finally compare d′_s(c_i, C′_i) and d′_s(c′_i, C′_i), which will give the desired bound.

First, since d_s(c_i, C_i) ≤ d_s(c′_i, C_i) and the difference between C_i and C′_i is small, d_s(c_i, C′_i) cannot be much larger than d_s(c′_i, C′_i). Specifically, by (α, ɛ)-perturbation resilience, |A_i| ≤ ɛn and |M_i| ≤ ɛn. We have by the triangle inequality

(4.6) d_s(c′_i, M_i) ≤ d_s(c_i, M_i) + (ɛn)d,
(4.7) d_s(c′_i, A_i) ≥ d_s(c_i, A_i) − (ɛn)d.

So d_s(c_i, C′_i) ≤ d_s(c′_i, C′_i) + 2(ɛn)d. Now we turn to compare d′_s(c_i, C′_i) and d′_s(c′_i, C′_i). We begin with A_i. By Claim 4.2 we have

(4.8) d′_s(c_i, A_i) ≤ d′_s(c′_i, A_i) + α(ɛn)d + α(α + 1)d.

On C_i \ M_i, the cost of c_i is smaller than that of c′_i. Specifically, from (4.6) we have d_s(c_i, C_i \ M_i) ≤ d_s(c′_i, C_i \ M_i) + (ɛn)d, so

(4.9) αd_s(c_i, C_i \ M_i) ≤ αd_s(c′_i, C_i \ M_i) + α(ɛn)d ≤ d′_s(c′_i, C_i \ M_i) + (α − 1)d + α(ɛn)d,

where the second step is from the following fact: d′_s(c′_i, C_i \ M_i) = αd_s(c′_i, C_i \ M_i) if c(c′_i) ≠ c_i, else d′_s(c′_i, C_i \ M_i) = αd_s(c′_i, C_i \ M_i) − (α − 1)d.

Now, the left-hand side above equals d′_s(c_i, C_i \ M_i) + (α − 1)d_s(c_i, W_i), because distances between c_i and W_i are not blown up by a factor of α. Adding up (4.8) and (4.9), and using the optimality of c′_i under d′ (that is, d′_s(c′_i, C′_i) ≤ d′_s(c_i, C′_i)), we get a contradiction if (α − 1)d_s(c_i, W_i) > (2αɛn + α(α + 1) + α − 1)d; in other words, if our savings in using c_i as center are greater than our extra cost. Therefore, d ≥ ((α − 1)/(2α)) · d_s(c_i, W_i)/(ɛn + α + 1), as desired.

Combining the upper bound of Claim 4.3 with the lower bound of Claim 4.4 when c′_i ≠ c_i, we get a contradiction for sufficiently large |C_i| as given in the theorem statement, yielding c′_i = c_i.

Fig. 4: An example showing the optimality of the bound on the number of bad points.

Note 4.1. The bound in Theorem 4.2 is optimal in the sense that for any α > 1 and 0 < ɛ < 1/5, we can easily construct an (α, ɛ)-perturbation resilient 2-median instance which has ɛn bad points.

The instance is shown in Figure 4. It has 3 groups of points: G_1, G_2, and B. Both G_1 and G_2 have (1 − ɛ)n/2 points, and B has ɛn points.

Let M be a sufficiently large constant, say M > n²/ɛ. The distances within the same group are 1, while those between the points in G_1 and G_2 are M, those between the points in B and G_1 are M/(α + 1) + 1, and those between the points in B and G_2 are αM/(α + 1) − 1. The instance satisfies the triangle inequality, which can be verified by a case analysis. The optimal clustering before perturbation has one center in G_1 and the other in G_2. Then B are trivially bad points, and thus we have ɛn bad points in this instance.

Now we show that the instance is (α, ɛ)-perturbation resilient. To prove that the optimal clustering after perturbation C′ is ɛ-close to the original optimal clustering, it suffices to show that C′ has one center from G_1 ∪ B and the other center from G_2. Assume for contradiction that this is not true. If both centers come from G_2, the cost of the points in G_1 is ((1 − ɛ)n/2)M. On the other hand, the optimal cost before perturbation is at most (1 − ɛ)n + ɛn(M/(α + 1) + 1), so the optimal cost after perturbation is no more than α((1 − ɛ)n + ɛn(M/(α + 1) + 1)). But this is smaller than ((1 − ɛ)n/2)M, which is a contradiction. Similarly, we get a contradiction if both centers come from G_1 ∪ B.

4.2. Approximation Bound. Now we consider the problem of approximating the cost of the optimum clustering. We can see that after removing the bad points, the optimal clusters are far apart from each other. In order to get rid of the influence of the bad points, we generate a list of blobs, which form a partition of the data points, and each of which contains only good points from one optimal cluster. Then we construct a tree on the list of blobs with a pruning that assigns all good points correctly. We will show that this pruning has low cost, so the lowest cost pruning of the tree is a good approximation. The details are described in Algorithm 2.

A key step is to generate the list of almost pure blobs, which is described in Algorithm 3. Suppose that for any i and any good point p ∈ G_i, its γ|G_i| nearest neighbors contain no good points outside C_i. Also suppose the algorithm knows the value of γ. Informally, the algorithm maintains a threshold t. At each threshold, for each point p that has not been added to the list, the algorithm checks its γt nearest neighbors N_{γt}(p). It constructs a graph F_t by connecting any two points that have sufficiently many common neighbors. It then builds another graph H_t by connecting any two points that have sufficiently many common neighbors in F_t, and adds sufficiently large components in H_t to the list. Finally, for each remaining point p, it checks if most of p's neighbors are in the list and if there are blobs containing a significant number of p's neighbors. If so, it inserts p into such a blob with the smallest median distance. Then the threshold is increased and the above steps are repeated.

The intuition behind Algorithm 3 is as follows. As mentioned above, the algorithm works when for any i and any good point p ∈ G_i, the γ|G_i| nearest neighbors of p contain no good points outside C_i (γ = 1 for the k-median instances considered in this section, as shown in Lemma 4.3; γ = 4/5 for the min-sum instances considered in Section 6, as shown in Claim 6.3). Without loss of generality, assume |C_1| ≥ |C_2| ≥ ... ≥ |C_k|. When t ≤ |G_1|, good points in different clusters do not have most neighbors in common and thus are not connected in F_t. However, they may be connected by a path of bad points. So we further build the graph H_t to disconnect such paths, which ensures that the blobs added into the list contain only good points from one optimal cluster.
The final insert step (Step 6) makes sure that when t = |G_1|, all remaining good points in C_1 will be added to the list and will not affect the construction of blobs from other optimal clusters. We can show by induction that, at the end of the iteration t = |G_i|, all good points in C_j (j ≤ i) are added to the list. When t is large enough, any remaining bad points are inserted into the list, so the points are partitioned into a list of almost pure blobs. The formal guarantee for Algorithm 3 is stated in Lemma 4.4.
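As a concrete reading of Steps 3 and 4 of Algorithm 3 (stated below), the two graphs can be built as follows in Python (our sketch; a naive O(n²)-pairs construction):

from collections import defaultdict

def build_F(A_S, S, d, t, gamma, u_B):
    # Step 3: connect p, q if their gamma*t nearest-neighbor sets share
    # more than (2*gamma - 1)*t - 2*u_B points.
    k = int(gamma * t)
    nbrs = {p: set(sorted(S, key=lambda x: d(p, x))[:k]) for p in A_S}
    F = defaultdict(set)
    for p in A_S:
        for q in A_S:
            if p != q and len(nbrs[p] & nbrs[q]) > (2 * gamma - 1) * t - 2 * u_B:
                F[p].add(q)
    return F

def build_H(A_S, F, u_B):
    # Step 4: connect p, q if they share more than u_B neighbors in F_t.
    H = defaultdict(set)
    for p in A_S:
        for q in A_S:
            if p != q and len(F[p] & F[q]) > u_B:
                H[p].add(q)
    return H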

Algorithm 2 k-median, (α, ɛ)-perturbation resilience
Input: Data set S, distance function d(·, ·) on S, min_i |C_i|, ɛ > 0.
1: Run Algorithm 3 to generate a list L of blobs, with parameters u_B = ɛn, γ = 1.
2: Run the robust linkage procedure in [7] to get a cluster tree T.
3: Run dynamic programming on T to get the minimum cost pruning C̃ and its centers c̃.
Output: Clustering C̃ and its centers c̃.

Algorithm 3 Generating interesting blobs
Input: Data set S, distance function d(·, ·) on S, the size of the smallest optimal cluster min_i |C_i|, the upper bound u_B on the number of bad points, a parameter γ ∈ [4/5, 1].
1: Let N_r(p) denote the r nearest neighbors of p in S.
2: Let L = ∅, A_S = S. Let the initial threshold be t = min_i |C_i|.
3: Construct a graph F_t by connecting p, q ∈ A_S if |N_{γt}(p) ∩ N_{γt}(q)| > (2γ − 1)t − 2u_B.
4: Construct a graph H_t by connecting points p, q ∈ A_S if p, q share more than u_B neighbors in F_t.
5: Add to L all the components C of H_t with |C| ≥ (1/2) min_i |C_i|, and remove them from A_S.
6: For each point p ∈ A_S, check if most of N_{γt}(p) are in L and if there exists C ∈ L containing a significant number of points in N_{γt}(p). More precisely, check if (1) |N_{γt}(p) \ L| ≤ (1/2) min_i |C_i| + 2u_B, and (2) L_p ≠ ∅, where L_p = {C ∈ L : |C ∩ N_{γt}(p)| ≥ (γ − 3/5)|C|}. If so, assign p to the blob in L_p of smallest median distance, and remove p from A_S.
7: While |A_S| > 0, increase t by 1 and go to Step 3.
Output: The list L.

Another key step is to construct a tree on these blobs. Since good points are closer to good points in the same optimal cluster than to those in other clusters (Lemma 4.3), there exist algorithms that can build a tree with a pruning that assigns all good points correctly. In particular, we can use the robust linkage procedure in [7], which repeatedly merges the two blobs C, C′ with the maximum score(C, C′), defined as follows. For each p ∈ C, sort the other blobs in decreasing order of the median distance between p and the points in the blob, and let rank(p, C′) denote the rank of C′. Then define rank(C, C′) = median_{x∈C}[rank(x, C′)] and score(C, C′) = min[rank(C, C′), rank(C′, C)]. Intuitively, for any blobs A, A′ from the same optimal cluster and D from a different cluster, good points in A always rank A′ later than D in the sorted list, so rank(A, A′) > rank(A, D). Similarly, rank(A′, A) > rank(A′, D), and thus score(A, A′) > score(A, D). This means the algorithm will always merge blobs from the same cluster before merging them with blobs outside, so there is a pruning that assigns all good points correctly.

In the following, we prove that Algorithm 2 outputs a good approximation. We first prove a key property of the good points in (α, ɛ)-perturbation resilient instances in Lemma 4.3, show in Lemma 4.4 that this property ensures the success of Algorithm 3, and then prove a property of the bad points in Lemma 4.5. Finally, we use these lemmas to prove the approximation bound in Theorem 4.6.
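To make the merge criterion concrete, the rank and score computations just described can be sketched as follows (our paraphrase of the procedure from [7]; blobs denotes the current list of disjoint blobs):

import statistics

def median_dist(p, blob, d):
    return statistics.median(d(p, q) for q in blob)

def rank(p, target, blobs, d):
    # Sort the blobs other than p's own in decreasing order of their
    # median distance to p; rank(p, C') is the position of C' there.
    own = next(b for b in blobs if p in b)
    others = sorted((b for b in blobs if b is not own),
                    key=lambda b: -median_dist(p, b, d))
    return others.index(target) + 1

def score(C, C2, blobs, d):
    # rank(C, C') = median over x in C of rank(x, C');
    # score(C, C') = min(rank(C, C'), rank(C', C)).
    r1 = statistics.median(rank(p, C2, blobs, d) for p in C)
    r2 = statistics.median(rank(q, C, blobs, d) for q in C2)
    return min(r1, r2)

The robust linkage then repeatedly merges the pair of blobs of maximum score until one blob remains, recording the merge tree as in Algorithm 1.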

Lemma 4.3 (Theorem 8 in [10], Lemma 2.6 in [3]). When α > 2 + √3, for any good points p_1, p_2 ∈ G_i and q ∈ G_j (j ≠ i), we have d(p_1, p_2) < d(p_1, q). Consequently, for any good point p ∈ G_i, all of its |G_i| nearest neighbors belong to C_i ∪ B.

Proof. The following proof is implicit in [3] and we include it for completeness. We rephrase it slightly so that it is more intuitive. By the triangle inequality and the definition of good points,

d(p_2, p_1) + d(p_1, q) + d(q, c_j) ≥ d(p_2, c_j) > αd(p_2, c_i) ≥ α(d(p_1, p_2) − d(p_1, c_i)).

Rearranging terms leads to

(4.10) d(p_1, q) + d(q, c_j) + αd(p_1, c_i) > (α − 1)d(p_1, p_2).

Now, to compare d(p_1, q) and d(p_1, p_2), we need to get rid of the extra terms d(q, c_j) and d(p_1, c_i). By the same proof as in Lemma 3.3(2), d(p_1, q) > (α − 1)d(p_1, c_i) and d(p_1, q) > (α − 1)d(q, c_j). Plugging these into (4.10), we have

(1 + 1/(α − 1) + α/(α − 1)) d(p_1, q) > (α − 1)d(p_1, p_2).

So when α > 2 + √3, d(p_1, q) > d(p_1, p_2).

Lemma 4.4. Suppose the number of bad points is bounded by u_B, and for any i and any good point p ∈ G_i, all of its γ|G_i| nearest neighbors in S are from C_i ∪ B. If min_i |C_i| > 30u_B, then Algorithm 3 generates a list L of blobs, each of size at least (1/2) min_i |C_i|, such that: (1) the blobs in L form a partition of S; (2) each blob in L contains good points from only one optimal cluster.

Proof. Without loss of generality, assume |C_1| ≥ |C_2| ≥ ... ≥ |C_k|. We prove the following two claims by induction on i: (1) For any t ≤ |G_i|, any blob in the list L contains good points from only one optimal cluster, and all blobs have size at least (1/2) min_i |C_i|. (2) At the beginning of the iteration t = |G_i| + 1, any good point p ∈ G_j, j ≤ i, has already been assigned to a blob in the list that contains good points only from C_j.

These two claims imply that each blob in the list contains good points from only one optimal cluster. Moreover, at the beginning of the iteration t = |G_k| + 1, all good points have been assigned to one of the blobs in L, so there are only bad points left, the number of which is smaller than (1/2) min_i |C_i|. These remaining points will eventually be assigned to the blobs before γt > n, so the blobs form a partition of S.

The claims are clearly both true initially. We now show that as long as t ≤ |G_1|, the graphs F_t and H_t have the following properties.

First, no good point p_i in cluster C_i is connected in F_t to a good point p_j in a different cluster C_j. By assumption, p_i has no neighbors outside C_i ∪ B and p_j has no neighbors outside C_j ∪ B, so they share at most u_B < (2γ − 1)t − 2u_B neighbors.

Fig. 5: A high level illustration of the graph F_t.

Second, no point q is connected in F_t to both a good point p_i in C_i and a good point p_j in a different cluster C_j. If q is connected to p_i, then |N_{γt}(p_i) ∩ N_{γt}(q)| > (2γ − 1)t − 2u_B. Since p_i has no neighbors outside C_i ∪ B, N_{γt}(q) contains more than (2γ − 1)t − 3u_B ≥ γt/2 points from G_i. Similarly, if q is connected to p_j, then N_{γt}(q) contains more than γt/2 points from G_j, which is contradictory.

Third, all the components in H_t of size at least (1/2) min_i |C_i| contain good points from only one optimal cluster. As there are at most u_B bad points, any two points connected in H_t must be connected in F_t to at least one common good point. Then, by the above two properties, points on a path in H_t must be connected in F_t to good points in the same cluster, so there is no path connecting good points from different clusters.

We can use these three properties to argue the first claim: as long as t ≤ |G_1|, each blob in L contains good points from at most one optimal cluster. This is true at the beginning, and by the third property, for any t ≤ |G_1|, any time we insert a whole new blob into the list in Step 5, that blob must contain good points from at most one optimal cluster. We now argue that this property is never violated as we assign points to blobs already in the list in Step 6. Suppose a good point p ∈ C_i is inserted into C ∈ L. Then C ∈ L_p, which means |N_{γt}(p) ∩ C| ≥ (γ − 3/5)|C| ≥ |C|/5 > u_B. So N_{γt}(p) ∩ C contains at least one good point, which must be from C_i since N_{γt}(p) contains no good points outside C_i. Then by induction C must contain only good points from C_i, and thus adding p to C does not violate the first claim.

We now show the second claim: after the iteration t = |G_1|, all the good points in C_1 have already been assigned to a blob in the list that only contains good points from C_1. There are two cases. First, suppose that at the beginning of the iteration t = |G_1| there are still at least (1/2) min_i |C_i| points from the good point set G_1 that do not belong to blobs in the list. Any such good point has all of its γ|G_1| nearest neighbors in C_1 ∪ B. Then any two such good points share at least 2γ|G_1| − |C_1 ∪ B| ≥ (2γ − 1)|G_1| − |B| > (2γ − 1)t − 2u_B neighbors. So they will connect to each other in F_t and then in H_t, and thus we will add one blob to L containing all these points.

Second, it could be that at the beginning of the iteration t = |G_1|, all but fewer than (1/2) min_i |C_i| good points in G_1 have been assigned to a blob in the list. Denote the points that have not yet been assigned as E. Any point p ∈ E has no neighbors outside C_1 ∪ B. Then |N_{γt}(p) \ L| ≤ |E| + |B| ≤ (1/2) min_i |C_i| + 2u_B. Also, there exists a blob C containing good points from C_1 such that C ∈ L_p. Otherwise, N_{γt}(p) contains at most (γ − 3/5)|C_1 ∪ B| points in C_1 ∩ L, while it contains at most |E| good points in C_1 \ L and contains no points outside C_1 ∪ B; in total, N_{γt}(p) then has fewer than γt points, which is contradictory. So L_p ≠ ∅, and p will be added to the list.


More information

Finding Dense Subgraphs in G(n, 1/2)

Finding Dense Subgraphs in G(n, 1/2) Fndng Dense Subgraphs n Gn, 1/ Atsh Das Sarma 1, Amt Deshpande, and Rav Kannan 1 Georga Insttute of Technology,atsh@cc.gatech.edu Mcrosoft Research-Bangalore,amtdesh,annan@mcrosoft.com Abstract. Fndng

More information

Case A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k.

Case A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k. THE CELLULAR METHOD In ths lecture, we ntroduce the cellular method as an approach to ncdence geometry theorems lke the Szemeréd-Trotter theorem. The method was ntroduced n the paper Combnatoral complexty

More information

Calculation of time complexity (3%)

Calculation of time complexity (3%) Problem 1. (30%) Calculaton of tme complexty (3%) Gven n ctes, usng exhaust search to see every result takes O(n!). Calculaton of tme needed to solve the problem (2%) 40 ctes:40! dfferent tours 40 add

More information

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals Smultaneous Optmzaton of Berth Allocaton, Quay Crane Assgnment and Quay Crane Schedulng Problems n Contaner Termnals Necat Aras, Yavuz Türkoğulları, Z. Caner Taşkın, Kuban Altınel Abstract In ths work,

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

Foundations of Arithmetic

Foundations of Arithmetic Foundatons of Arthmetc Notaton We shall denote the sum and product of numbers n the usual notaton as a 2 + a 2 + a 3 + + a = a, a 1 a 2 a 3 a = a The notaton a b means a dvdes b,.e. ac = b where c s an

More information

VQ widely used in coding speech, image, and video

VQ widely used in coding speech, image, and video at Scalar quantzers are specal cases of vector quantzers (VQ): they are constraned to look at one sample at a tme (memoryless) VQ does not have such constrant better RD perfomance expected Source codng

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

Lecture 2: Gram-Schmidt Vectors and the LLL Algorithm

Lecture 2: Gram-Schmidt Vectors and the LLL Algorithm NYU, Fall 2016 Lattces Mn Course Lecture 2: Gram-Schmdt Vectors and the LLL Algorthm Lecturer: Noah Stephens-Davdowtz 2.1 The Shortest Vector Problem In our last lecture, we consdered short solutons to

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

n ). This is tight for all admissible values of t, k and n. k t + + n t

n ). This is tight for all admissible values of t, k and n. k t + + n t MAXIMIZING THE NUMBER OF NONNEGATIVE SUBSETS NOGA ALON, HAROUT AYDINIAN, AND HAO HUANG Abstract. Gven a set of n real numbers, f the sum of elements of every subset of sze larger than k s negatve, what

More information

Finding Primitive Roots Pseudo-Deterministically

Finding Primitive Roots Pseudo-Deterministically Electronc Colloquum on Computatonal Complexty, Report No 207 (205) Fndng Prmtve Roots Pseudo-Determnstcally Ofer Grossman December 22, 205 Abstract Pseudo-determnstc algorthms are randomzed search algorthms

More information

= z 20 z n. (k 20) + 4 z k = 4

= z 20 z n. (k 20) + 4 z k = 4 Problem Set #7 solutons 7.2.. (a Fnd the coeffcent of z k n (z + z 5 + z 6 + z 7 + 5, k 20. We use the known seres expanson ( n+l ( z l l z n below: (z + z 5 + z 6 + z 7 + 5 (z 5 ( + z + z 2 + z + 5 5

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

Generalized Linear Methods

Generalized Linear Methods Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set

More information

Volume 18 Figure 1. Notation 1. Notation 2. Observation 1. Remark 1. Remark 2. Remark 3. Remark 4. Remark 5. Remark 6. Theorem A [2]. Theorem B [2].

Volume 18 Figure 1. Notation 1. Notation 2. Observation 1. Remark 1. Remark 2. Remark 3. Remark 4. Remark 5. Remark 6. Theorem A [2]. Theorem B [2]. Bulletn of Mathematcal Scences and Applcatons Submtted: 016-04-07 ISSN: 78-9634, Vol. 18, pp 1-10 Revsed: 016-09-08 do:10.1805/www.scpress.com/bmsa.18.1 Accepted: 016-10-13 017 ScPress Ltd., Swtzerland

More information

Lecture 4: November 17, Part 1 Single Buffer Management

Lecture 4: November 17, Part 1 Single Buffer Management Lecturer: Ad Rosén Algorthms for the anagement of Networs Fall 2003-2004 Lecture 4: November 7, 2003 Scrbe: Guy Grebla Part Sngle Buffer anagement In the prevous lecture we taled about the Combned Input

More information

A new construction of 3-separable matrices via an improved decoding of Macula s construction

A new construction of 3-separable matrices via an improved decoding of Macula s construction Dscrete Optmzaton 5 008 700 704 Contents lsts avalable at ScenceDrect Dscrete Optmzaton journal homepage: wwwelsevercom/locate/dsopt A new constructon of 3-separable matrces va an mproved decodng of Macula

More information

Structure and Drive Paul A. Jensen Copyright July 20, 2003

Structure and Drive Paul A. Jensen Copyright July 20, 2003 Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.

More information

THE CHINESE REMAINDER THEOREM. We should thank the Chinese for their wonderful remainder theorem. Glenn Stevens

THE CHINESE REMAINDER THEOREM. We should thank the Chinese for their wonderful remainder theorem. Glenn Stevens THE CHINESE REMAINDER THEOREM KEITH CONRAD We should thank the Chnese for ther wonderful remander theorem. Glenn Stevens 1. Introducton The Chnese remander theorem says we can unquely solve any par of

More information

CS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016

CS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016 CS 29-128: Algorthms and Uncertanty Lecture 17 Date: October 26, 2016 Instructor: Nkhl Bansal Scrbe: Mchael Denns 1 Introducton In ths lecture we wll be lookng nto the secretary problem, and an nterestng

More information

NP-Completeness : Proofs

NP-Completeness : Proofs NP-Completeness : Proofs Proof Methods A method to show a decson problem Π NP-complete s as follows. (1) Show Π NP. (2) Choose an NP-complete problem Π. (3) Show Π Π. A method to show an optmzaton problem

More information

Clustering Affine Subspaces: Algorithms and Hardness

Clustering Affine Subspaces: Algorithms and Hardness Clusterng Affne Subspaces: Algorthms and Hardness Thess by Euwoong Lee In Partal Fulfllment of the Requrements for the Degree of Master of Scence Calforna Insttute of Technology Pasadena, Calforna 01 (Submtted

More information

HMMT February 2016 February 20, 2016

HMMT February 2016 February 20, 2016 HMMT February 016 February 0, 016 Combnatorcs 1. For postve ntegers n, let S n be the set of ntegers x such that n dstnct lnes, no three concurrent, can dvde a plane nto x regons (for example, S = {3,

More information

NUMERICAL DIFFERENTIATION

NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

Graph Reconstruction by Permutations

Graph Reconstruction by Permutations Graph Reconstructon by Permutatons Perre Ille and Wllam Kocay* Insttut de Mathémathques de Lumny CNRS UMR 6206 163 avenue de Lumny, Case 907 13288 Marselle Cedex 9, France e-mal: lle@ml.unv-mrs.fr Computer

More information

Genericity of Critical Types

Genericity of Critical Types Genercty of Crtcal Types Y-Chun Chen Alfredo D Tllo Eduardo Fangold Syang Xong September 2008 Abstract Ely and Pesk 2008 offers an nsghtful characterzaton of crtcal types: a type s crtcal f and only f

More information

Supplement: Proofs and Technical Details for The Solution Path of the Generalized Lasso

Supplement: Proofs and Technical Details for The Solution Path of the Generalized Lasso Supplement: Proofs and Techncal Detals for The Soluton Path of the Generalzed Lasso Ryan J. Tbshran Jonathan Taylor In ths document we gve supplementary detals to the paper The Soluton Path of the Generalzed

More information

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,

More information

CSC 411 / CSC D11 / CSC C11

CSC 411 / CSC D11 / CSC C11 18 Boostng s a general strategy for learnng classfers by combnng smpler ones. The dea of boostng s to take a weak classfer that s, any classfer that wll do at least slghtly better than chance and use t

More information

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 3: Large deviations bounds and applications Lecturer: Sanjeev Arora

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 3: Large deviations bounds and applications Lecturer: Sanjeev Arora prnceton unv. F 13 cos 521: Advanced Algorthm Desgn Lecture 3: Large devatons bounds and applcatons Lecturer: Sanjeev Arora Scrbe: Today s topc s devaton bounds: what s the probablty that a random varable

More information

Every planar graph is 4-colourable a proof without computer

Every planar graph is 4-colourable a proof without computer Peter Dörre Department of Informatcs and Natural Scences Fachhochschule Südwestfalen (Unversty of Appled Scences) Frauenstuhlweg 31, D-58644 Iserlohn, Germany Emal: doerre(at)fh-swf.de Mathematcs Subject

More information

The Second Anti-Mathima on Game Theory

The Second Anti-Mathima on Game Theory The Second Ant-Mathma on Game Theory Ath. Kehagas December 1 2006 1 Introducton In ths note we wll examne the noton of game equlbrum for three types of games 1. 2-player 2-acton zero-sum games 2. 2-player

More information

Clustering gene expression data & the EM algorithm

Clustering gene expression data & the EM algorithm CG, Fall 2011-12 Clusterng gene expresson data & the EM algorthm CG 08 Ron Shamr 1 How Gene Expresson Data Looks Entres of the Raw Data matrx: Rato values Absolute values Row = gene s expresson pattern

More information

Lecture 10 Support Vector Machines. Oct

Lecture 10 Support Vector Machines. Oct Lecture 10 Support Vector Machnes Oct - 20-2008 Lnear Separators Whch of the lnear separators s optmal? Concept of Margn Recall that n Perceptron, we learned that the convergence rate of the Perceptron

More information

Perfect Competition and the Nash Bargaining Solution

Perfect Competition and the Nash Bargaining Solution Perfect Competton and the Nash Barganng Soluton Renhard John Department of Economcs Unversty of Bonn Adenauerallee 24-42 53113 Bonn, Germany emal: rohn@un-bonn.de May 2005 Abstract For a lnear exchange

More information

Common loop optimizations. Example to improve locality. Why Dependence Analysis. Data Dependence in Loops. Goal is to find best schedule:

Common loop optimizations. Example to improve locality. Why Dependence Analysis. Data Dependence in Loops. Goal is to find best schedule: 15-745 Lecture 6 Data Dependence n Loops Copyrght Seth Goldsten, 2008 Based on sldes from Allen&Kennedy Lecture 6 15-745 2005-8 1 Common loop optmzatons Hostng of loop-nvarant computatons pre-compute before

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

Report on Image warping

Report on Image warping Report on Image warpng Xuan Ne, Dec. 20, 2004 Ths document summarzed the algorthms of our mage warpng soluton for further study, and there s a detaled descrpton about the mplementaton of these algorthms.

More information

COMPLEX NUMBERS AND QUADRATIC EQUATIONS

COMPLEX NUMBERS AND QUADRATIC EQUATIONS COMPLEX NUMBERS AND QUADRATIC EQUATIONS INTRODUCTION We know that x 0 for all x R e the square of a real number (whether postve, negatve or ero) s non-negatve Hence the equatons x, x, x + 7 0 etc are not

More information

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification E395 - Pattern Recognton Solutons to Introducton to Pattern Recognton, Chapter : Bayesan pattern classfcaton Preface Ths document s a soluton manual for selected exercses from Introducton to Pattern Recognton

More information

Vapnik-Chervonenkis theory

Vapnik-Chervonenkis theory Vapnk-Chervonenks theory Rs Kondor June 13, 2008 For the purposes of ths lecture, we restrct ourselves to the bnary supervsed batch learnng settng. We assume that we have an nput space X, and an unknown

More information

Numerical Heat and Mass Transfer

Numerical Heat and Mass Transfer Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information

ECE559VV Project Report

ECE559VV Project Report ECE559VV Project Report (Supplementary Notes Loc Xuan Bu I. MAX SUM-RATE SCHEDULING: THE UPLINK CASE We have seen (n the presentaton that, for downlnk (broadcast channels, the strategy maxmzng the sum-rate

More information

A note on almost sure behavior of randomly weighted sums of φ-mixing random variables with φ-mixing weights

A note on almost sure behavior of randomly weighted sums of φ-mixing random variables with φ-mixing weights ACTA ET COMMENTATIONES UNIVERSITATIS TARTUENSIS DE MATHEMATICA Volume 7, Number 2, December 203 Avalable onlne at http://acutm.math.ut.ee A note on almost sure behavor of randomly weghted sums of φ-mxng

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

Spectral Graph Theory and its Applications September 16, Lecture 5

Spectral Graph Theory and its Applications September 16, Lecture 5 Spectral Graph Theory and ts Applcatons September 16, 2004 Lecturer: Danel A. Spelman Lecture 5 5.1 Introducton In ths lecture, we wll prove the followng theorem: Theorem 5.1.1. Let G be a planar graph

More information

Transfer Functions. Convenient representation of a linear, dynamic model. A transfer function (TF) relates one input and one output: ( ) system

Transfer Functions. Convenient representation of a linear, dynamic model. A transfer function (TF) relates one input and one output: ( ) system Transfer Functons Convenent representaton of a lnear, dynamc model. A transfer functon (TF) relates one nput and one output: x t X s y t system Y s The followng termnology s used: x y nput output forcng

More information

CHAPTER 17 Amortized Analysis

CHAPTER 17 Amortized Analysis CHAPTER 7 Amortzed Analyss In an amortzed analyss, the tme requred to perform a sequence of data structure operatons s averaged over all the operatons performed. It can be used to show that the average

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Section 8.3 Polar Form of Complex Numbers

Section 8.3 Polar Form of Complex Numbers 80 Chapter 8 Secton 8 Polar Form of Complex Numbers From prevous classes, you may have encountered magnary numbers the square roots of negatve numbers and, more generally, complex numbers whch are the

More information

Stanford University CS254: Computational Complexity Notes 7 Luca Trevisan January 29, Notes for Lecture 7

Stanford University CS254: Computational Complexity Notes 7 Luca Trevisan January 29, Notes for Lecture 7 Stanford Unversty CS54: Computatonal Complexty Notes 7 Luca Trevsan January 9, 014 Notes for Lecture 7 1 Approxmate Countng wt an N oracle We complete te proof of te followng result: Teorem 1 For every

More information

Exercises. 18 Algorithms

Exercises. 18 Algorithms 18 Algorthms Exercses 0.1. In each of the followng stuatons, ndcate whether f = O(g), or f = Ω(g), or both (n whch case f = Θ(g)). f(n) g(n) (a) n 100 n 200 (b) n 1/2 n 2/3 (c) 100n + log n n + (log n)

More information

Lecture 10: May 6, 2013

Lecture 10: May 6, 2013 TTIC/CMSC 31150 Mathematcal Toolkt Sprng 013 Madhur Tulsan Lecture 10: May 6, 013 Scrbe: Wenje Luo In today s lecture, we manly talked about random walk on graphs and ntroduce the concept of graph expander,

More information

European Journal of Combinatorics

European Journal of Combinatorics European Journal of Combnatorcs 0 (009) 480 489 Contents lsts avalable at ScenceDrect European Journal of Combnatorcs journal homepage: www.elsever.com/locate/ejc Tlngs n Lee metrc P. Horak 1 Unversty

More information

Welfare Properties of General Equilibrium. What can be said about optimality properties of resource allocation implied by general equilibrium?

Welfare Properties of General Equilibrium. What can be said about optimality properties of resource allocation implied by general equilibrium? APPLIED WELFARE ECONOMICS AND POLICY ANALYSIS Welfare Propertes of General Equlbrum What can be sad about optmalty propertes of resource allocaton mpled by general equlbrum? Any crteron used to compare

More information

MA 323 Geometric Modelling Course Notes: Day 13 Bezier Curves & Bernstein Polynomials

MA 323 Geometric Modelling Course Notes: Day 13 Bezier Curves & Bernstein Polynomials MA 323 Geometrc Modellng Course Notes: Day 13 Bezer Curves & Bernsten Polynomals Davd L. Fnn Over the past few days, we have looked at de Casteljau s algorthm for generatng a polynomal curve, and we have

More information

Some modelling aspects for the Matlab implementation of MMA

Some modelling aspects for the Matlab implementation of MMA Some modellng aspects for the Matlab mplementaton of MMA Krster Svanberg krlle@math.kth.se Optmzaton and Systems Theory Department of Mathematcs KTH, SE 10044 Stockholm September 2004 1. Consdered optmzaton

More information

Boostrapaggregating (Bagging)

Boostrapaggregating (Bagging) Boostrapaggregatng (Baggng) An ensemble meta-algorthm desgned to mprove the stablty and accuracy of machne learnng algorthms Can be used n both regresson and classfcaton Reduces varance and helps to avod

More information

THE SUMMATION NOTATION Ʃ

THE SUMMATION NOTATION Ʃ Sngle Subscrpt otaton THE SUMMATIO OTATIO Ʃ Most of the calculatons we perform n statstcs are repettve operatons on lsts of numbers. For example, we compute the sum of a set of numbers, or the sum of the

More information

A 2D Bounded Linear Program (H,c) 2D Linear Programming

A 2D Bounded Linear Program (H,c) 2D Linear Programming A 2D Bounded Lnear Program (H,c) h 3 v h 8 h 5 c h 4 h h 6 h 7 h 2 2D Lnear Programmng C s a polygonal regon, the ntersecton of n halfplanes. (H, c) s nfeasble, as C s empty. Feasble regon C s unbounded

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

Remarks on the Properties of a Quasi-Fibonacci-like Polynomial Sequence

Remarks on the Properties of a Quasi-Fibonacci-like Polynomial Sequence Remarks on the Propertes of a Quas-Fbonacc-lke Polynomal Sequence Brce Merwne LIU Brooklyn Ilan Wenschelbaum Wesleyan Unversty Abstract Consder the Quas-Fbonacc-lke Polynomal Sequence gven by F 0 = 1,

More information

Introductory Cardinality Theory Alan Kaylor Cline

Introductory Cardinality Theory Alan Kaylor Cline Introductory Cardnalty Theory lan Kaylor Clne lthough by name the theory of set cardnalty may seem to be an offshoot of combnatorcs, the central nterest s actually nfnte sets. Combnatorcs deals wth fnte

More information

Tornado and Luby Transform Codes. Ashish Khisti Presentation October 22, 2003

Tornado and Luby Transform Codes. Ashish Khisti Presentation October 22, 2003 Tornado and Luby Transform Codes Ashsh Khst 6.454 Presentaton October 22, 2003 Background: Erasure Channel Elas[956] studed the Erasure Channel β x x β β x 2 m x 2 k? Capacty of Noseless Erasure Channel

More information

Convergence of random processes

Convergence of random processes DS-GA 12 Lecture notes 6 Fall 216 Convergence of random processes 1 Introducton In these notes we study convergence of dscrete random processes. Ths allows to characterze phenomena such as the law of large

More information

Appendix B. Criterion of Riemann-Stieltjes Integrability

Appendix B. Criterion of Riemann-Stieltjes Integrability Appendx B. Crteron of Remann-Steltes Integrablty Ths note s complementary to [R, Ch. 6] and [T, Sec. 3.5]. The man result of ths note s Theorem B.3, whch provdes the necessary and suffcent condtons for

More information

Lecture 4: Universal Hash Functions/Streaming Cont d

Lecture 4: Universal Hash Functions/Streaming Cont d CSE 5: Desgn and Analyss of Algorthms I Sprng 06 Lecture 4: Unversal Hash Functons/Streamng Cont d Lecturer: Shayan Oves Gharan Aprl 6th Scrbe: Jacob Schreber Dsclamer: These notes have not been subjected

More information

Math Review. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University

Math Review. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University Math Revew CptS 223 dvanced Data Structures Larry Holder School of Electrcal Engneerng and Computer Scence Washngton State Unversty 1 Why do we need math n a data structures course? nalyzng data structures

More information