COMS 4995: Unsupervised Learning (Summer 18)    May 24, 2018

Lecture 2: Clustering Part II

Instructor: Nakul Verma    Scribes: Jie Li, Yadi Rozov

Today, we will be talking about the hardness results for k-means. More specifically, we will develop tools and complete a proof that the 2-means problem is NP-hard, along the lines of [3].

1 k-means overview

1.1 k-means problem - definition I

The definition of k-means from the previous class:

- Input: A set of points $x_1, \ldots, x_n \in \mathbb{R}^d$ and a positive integer $k < n$.
- Output: $T \subset \mathbb{R}^d$ s.t. $|T| = k$.
- Goal: minimize the cost of $T$, where
$$\mathrm{cost}(T) := \sum_{i=1}^{n} \min_{\mu_j \in T} \|x_i - \mu_j\|^2.$$
Here $C_1, \ldots, C_k$ are the clusters (a specific partition of the points, with each $x_i$ assigned to its nearest center), and at an optimum each center is the mean of its cluster, $\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$.

1.2 k-means problem - definition II

An alternative definition of the problem that is more useful for today's proof:

- Input: A set of points $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$ and a positive integer $k < n$.
- Output:
  (a) $P_1, P_2, \ldots, P_k \subset X$, a partition s.t. $\bigcup_i P_i = X$ and $P_i \cap P_j = \emptyset$ for $i \neq j$
  (b) $\mu_1, \mu_2, \ldots, \mu_k$, the centroids
- Goal: minimize the cost of the partition, where the cost is defined as
$$\sum_{j=1}^{k} \sum_{i \in P_j} \|x_i - \mu_j\|^2 \qquad \text{[k-means cost]}$$
and $P_1, \ldots, P_k$ are the clusters (a specific partition of the points).

1.3 Observations

- The obvious way to find the optimal solution to k-means is exhaustive search, which is untenable because its running time is exponential. While there are only $O(n^k)$ possible choices of centroids (assuming only points of $X$ are admissible), there are $k^n$ possible assignments of points to clusters, which for $k = 10$ and $n = 100$ is $10^{100}$, more than the estimated number of atoms in the observable universe!
- The identity $E\|X - Y\|^2 = 2\,E\|X - EX\|^2$ (for i.i.d. $X$ and $Y$) implies that the cost function in the second definition above can be rewritten as:
$$\sum_{j=1}^{k} \sum_{i \in P_j} \|x_i - \mu_j\|^2 = \sum_{j=1}^{k} \frac{1}{2|P_j|} \sum_{i, i' \in P_j} \|x_i - x_{i'}\|^2.$$
- The first of these is true because (assuming $X$ and $Y$ to be i.i.d.):
$$E\|X - Y\|^2 = E_x E_y\big[\|X\|^2 + \|Y\|^2 - 2\,X \cdot Y\big] = E_x\|X\|^2 + E_y\|Y\|^2 - 2\,E_x E_y[X \cdot Y] = 2\big[E\|X\|^2 - \|EX\|^2\big] = 2\,E\|X - EX\|^2.$$
- The second of these is true by applying the first identity within each cluster: take $X$ and $Y$ independent and uniform over $P_j$, so that $\mu_j = EX = \frac{1}{|P_j|} \sum_{i \in P_j} x_i$. Then
$$\sum_{i \in P_j} \|x_i - \mu_j\|^2 = |P_j|\, E\|X - EX\|^2 = \frac{|P_j|}{2}\, E\|X - Y\|^2 = \frac{|P_j|}{2} \cdot \frac{1}{|P_j|^2} \sum_{i, i' \in P_j} \|x_i - x_{i'}\|^2 = \frac{1}{2|P_j|} \sum_{i, i' \in P_j} \|x_i - x_{i'}\|^2,$$
and summing over $j = 1, \ldots, k$ gives the claim.

1.4 Review of NP-hard problems

For a more complete review of complexity and hardness, please see reference [4], chapter 34.

- Problems that are NP-hard admit polynomial-time reductions from all problems in NP.
- To carry out a reduction that proves a problem (B below) is NP-hard, the following steps can be used (based on page 1052 of reference [4]):
  (a) Given an instance α of a problem A that has previously been proven to be NP-hard, use a polynomial-time reduction algorithm to transform it into an instance β of problem B.
  (b) Run a decision algorithm for B on instance β.
  (c) Use the answer for β as the answer for α.

1.5 2-means hardness - statement of main theorem and discussion of approach

Theorem 1. 2-means clustering is an NP-hard optimization problem.

The approach to the problem is based on Dasgupta (2008) [3]. To prove the theorem we start with the known NP-hard problem 3SAT and show a reduction from it to the NAE-3SAT* problem. From that problem we show a reduction to the Generalized 2-means problem, and finally a reduction from that to the 2-means problem. In each reduction we need to show how an instance of the known NP-hard problem is transformed in polynomial time into an instance of the problem we want to show is NP-hard, and back (that is, the reduction maps a yes instance of the known problem to a yes instance of the new problem, and a no instance of the known problem to a no instance of the new problem).
Note that since we are dealing with decision problems, the input of an instance of a problem must include the decision threshold for the problem. We begin by defining the various problems before proving hardness and properties of the reductions.
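As a quick numerical sanity check of the second observation in the k-means overview (the k-means cost rewritten in terms of pairwise distances within each cluster), here is a minimal sketch in plain Python. The function names `centroid_cost` and `pairwise_cost`, the random test data, and the fixed split into three clusters are illustrative assumptions, not part of the lecture.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as lists."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid_cost(clusters):
    """k-means cost: squared distance of each point to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        dim = len(cluster[0])
        mu = [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]
        total += sum(sq_dist(p, mu) for p in cluster)
    return total


def pairwise_cost(clusters):
    """Same cost via sum_j (1 / (2|P_j|)) * sum over ordered pairs in P_j."""
    total = 0.0
    for cluster in clusters:
        pair_sum = sum(sq_dist(p, q) for p in cluster for q in cluster)
        total += pair_sum / (2 * len(cluster))
    return total


# Hypothetical data: 12 random points in R^3, split arbitrarily into 3 clusters.
random.seed(0)
points = [[random.gauss(0, 1) for _ in range(3)] for _ in range(12)]
clusters = [points[0:5], points[5:9], points[9:12]]

cost_a = centroid_cost(clusters)
cost_b = pairwise_cost(clusters)
print(cost_a, cost_b)  # the two values should agree up to rounding error
```

The identity holds for any partition, not only an optimal one, which is why an arbitrary split suffices for the check. It is also what allows the cost to be stated purely in terms of a distance matrix later in the notes.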
We'll briefly review NP-completeness, only to the extent necessary to set the stage for this proof. A more thorough treatment can be found in a computational complexity course. As a consequence of the Cook-Levin Theorem, which identified the first NP-hard problem, we know that SAT and its variations, such as 3SAT and NAE 3-SAT, are NP-hard.

1.6 Definitions of various problems required for proving the main theorem

Definition 2 (3SAT). Input: A Boolean formula in 3CNF form: a formula of m clauses, each containing 3 literals, connected by the AND operator. Output: true if the formula is satisfiable, false if not.

Definition 3 (NAE 3-SAT). Not-all-equal 3SAT. A 3SAT formula, with the additional requirement that in each clause at least one literal is true and at least one literal is false. This removes the case where all three literals in a clause are true.

Definition 4 (NAE 3-SAT*). A Boolean formula φ over variables $x_1, \ldots, x_n$, with exactly 3 literals in each of its m clauses, interpreted as in NAE 3-SAT. Each pair of variables $x_i, x_j$ appears together in at most 2 clauses: at most once as $(x_i, x_j)$ or $(\bar{x}_i, \bar{x}_j)$, and at most once as $(\bar{x}_i, x_j)$ or $(x_i, \bar{x}_j)$.

Definition 5 (Generalized 2-means). Input: An $n \times n$ distance matrix with elements $D_{ii'}$ = distance between object $i$ and object $i'$. Output: A partition of the objects into $P_1$ and $P_2$. Goal: minimize
$$\mathrm{cost}(P_1, P_2) = \sum_{j=1}^{2} \frac{1}{2|P_j|} \sum_{i, i' \in P_j} D_{ii'}.$$

1.7 Hardness of NAE-3SAT*

Lemma 6. NAE-3SAT* is NP-hard. Proof: see [3].

1.8 Hardness of Generalized 2-means

For any instance $\varphi(x_1, \ldots, x_n)$ of NAE-3SAT* we construct a $2n \times 2n$ distance matrix $D_{\alpha,\beta}$ as below, where $\alpha, \beta \in \{x_1, \ldots, x_n, \bar{x}_1, \ldots, \bar{x}_n\}$. Note that because the definition of NAE-3SAT* requires that each pair of variables $x_i, x_j$ appears in at most 2 clauses, once as $(x_i, x_j)$ or $(\bar{x}_i, \bar{x}_j)$ and once as $(\bar{x}_i, x_j)$ or $(x_i, \bar{x}_j)$, the matrix is uniquely defined for a given φ.

Definition 7 (Distance matrix for Generalized 2-means - D(φ)).
$$D_{\alpha,\beta} = \begin{cases} 0 & \text{if } \alpha = \beta \\ 1 + \Delta & \text{if } \alpha = \bar{\beta} \\ 1 + \delta & \text{if } \alpha \sim \beta \\ 1 & \text{otherwise} \end{cases} \qquad (1)$$
where $\alpha \sim \beta$ means that either $\alpha$ and $\beta$ occur together in a clause, or $\bar{\alpha}$ and $\bar{\beta}$ occur together in a clause.

Where:
$$\Delta = \frac{5m}{5m + 2n} \qquad (2)$$
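And:
$$\delta = \frac{1}{5m + 2n} \qquad (3)$$
Note that the above implies $0 < \delta < \Delta < 1$, and by algebra we get:
$$4\delta m < \Delta \le 1 - 2\delta n \qquad (4)$$

Lemma 8. If φ is NAE-3SAT* satisfiable, then D(φ) admits a generalized 2-means clustering of cost $\mathrm{cost}(\varphi) = n - 1 + \frac{2\delta m}{n}$.

Proof. Given a satisfying NAE-3SAT* assignment of φ, partition the $2n$ objects of the matrix into two parts: $P_1$ holds all literals that are assigned true and $P_2$ holds all literals that are assigned false. Since each variable is represented twice (as $x_i$ and as $\bar{x}_i$), we have $|P_1| = |P_2| = n$. By the definition of NAE-3SAT*, each clause contributes one pair to $P_1$ and one pair to $P_2$: if two literals of a clause fall in the same part, their negations fall together in the other part. Consequently, distances between distinct points within a cluster can only be $1$ or $1 + \delta$, with $m$ instances of the latter per cluster, and the two clusters have identical costs. So we get
$$\mathrm{cost}(P_1, P_2) = \sum_{j=1}^{2} \frac{1}{2|P_j|} \sum_{i, i' \in P_j} D_{ii'} = 2 \cdot \frac{1}{2n} \left( 2\binom{n}{2} + 2m\delta \right) = n - 1 + \frac{2m\delta}{n}.$$

Lemma 9. For any partition $P_1$, $P_2$ in which (WLOG) $P_1$ contains a variable and its negation,
$$\mathrm{cost}(P_1, P_2) \ge n - 1 + \frac{\Delta}{2n} > n - 1 + \frac{2m\delta}{n} = \mathrm{cost}(\varphi).$$

Proof. Let $n' = |P_1|$. Since all distances between distinct points are at least 1 and $P_1$ contains a pair at distance $1 + \Delta$,
$$\mathrm{cost}(P_1, P_2) \ge \frac{1}{n'} \left( \binom{n'}{2} + \Delta \right) + \frac{1}{2n - n'} \binom{2n - n'}{2} = n - 1 + \frac{\Delta}{n'} \ge n - 1 + \frac{\Delta}{2n}.$$
The strict inequality in the statement follows from $\Delta > 4\delta m$ in (4).

Lemma 10. If D(φ) admits a generalized 2-means clustering of cost at most $\mathrm{cost}(\varphi) = n - 1 + \frac{2\delta m}{n}$, then φ is a satisfiable instance of NAE-3SAT*.

Proof. Let $P_1$ and $P_2$ be a partition with cost at most $n - 1 + \frac{2\delta m}{n}$. First note that, by Lemma 9, neither $P_1$ nor $P_2$ contains a variable and its negation, and hence $|P_1| = |P_2| = n$. The cost of clustering $P_1$ and $P_2$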
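$$= \frac{2}{n} \left( \binom{n}{2} + \delta \sum_{\text{clauses}} \begin{cases} 1 & \text{if the clause is split across } P_1 \text{ and } P_2 \\ 3 & \text{otherwise} \end{cases} \right)$$

Since the cost is at most $n - 1 + \frac{2\delta m}{n}$, it follows that all clauses are split between $P_1$ and $P_2$. That is, every clause has at least one literal in $P_1$ and one literal in $P_2$. Therefore, the assignment that sets all of $P_1$ to true and all of $P_2$ to false is a valid NAE-3SAT* assignment.

1.9 From Generalized 2-means to 2-means - Embedding of D(φ)

Fact 11. Any symmetric $N \times N$ distance matrix $D$ can be embedded in $\ell_2^2$ iff $u^T D u \le 0$ for all $u \in \mathbb{R}^N$ s.t. $\sum_i u_i = 0$.

Proof. Homework.

Fact 12. D(φ) can be embedded in $\ell_2^2$. Index the first $n$ coordinates of $u \in \mathbb{R}^{2n}$ by $x_1, \ldots, x_n$ and the remaining $n$ by $\bar{x}_1, \ldots, \bar{x}_n$, and let $\sum_\alpha u_\alpha = 0$. Then
$$\begin{aligned}
u^T D u &= \sum_{\alpha,\beta} u_\alpha u_\beta D_{\alpha\beta} \\
&= \sum_{\alpha,\beta} u_\alpha u_\beta \big( 1 - \mathbf{1}(\alpha = \beta) + \Delta\, \mathbf{1}(\alpha = \bar{\beta}) + \delta\, \mathbf{1}(\alpha \sim \beta) \big) \\
&= \sum_{\alpha,\beta} u_\alpha u_\beta - \sum_\alpha u_\alpha^2 + 2\Delta \sum_{i=1}^{n} u_i u_{n+i} + \delta \sum_{\alpha,\beta} u_\alpha u_\beta\, \mathbf{1}(\alpha \sim \beta) \\
&\le \Big( \sum_\alpha u_\alpha \Big)^2 - \|u\|^2 + 2\Delta \sum_{i=1}^{n} u_i u_{n+i} + \delta \sum_{\alpha,\beta} |u_\alpha|\,|u_\beta| \\
&\le -\|u\|^2 + \Delta \sum_{i=1}^{n} \big( u_i^2 + u_{n+i}^2 \big) + \delta \Big( \sum_\alpha |u_\alpha| \Big)^2 \\
&\le -(1 - \Delta)\, \|u\|^2 + 2\delta n\, \|u\|^2 .
\end{aligned}$$
The second-to-last step uses $\sum_\alpha u_\alpha = 0$ and $2ab \le a^2 + b^2$, and the last step uses the Cauchy-Schwarz inequality, $\big(\sum_\alpha |u_\alpha|\big)^2 \le 2n\, \|u\|^2$. The final expression is at most 0 since $(1 - \Delta) - 2\delta n \ge 0$ by (4).

1.10 Proof of Theorem 1

Proof. NAE-3SAT* is NP-hard by Lemma 6. From Definition 7 and Lemmas 8, 9 and 10, any instance $\varphi(x_1, \ldots, x_n)$ of NAE-3SAT* can be reduced to an instance of the (decision version of the) Generalized 2-means problem with distance matrix D(φ) and threshold cost(φ). The lemmas also show that, for these specific instances, φ is a yes instance of NAE-3SAT* if and only if D(φ) is a yes instance of Generalized 2-means. This, combined with the fact that the reduction steps take polynomial time in $n$ and with Fact 12 (D(φ) can be embedded into $\ell_2^2$), completes the proof for 2-means.

References

[1] Gonzalez, T. F. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38 (1985): 293-306.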
[2] Hartigan, John A. Clustering Algorithms. John Wiley & Sons (1977).

[3] Sanjoy Dasgupta. The hardness of k-means clustering. Department of Computer Science and Engineering, University of California, San Diego (2008): Technical Report CS2008-0916.

[4] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms, Third Edition. The MIT Press (2009).
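Worked example (sanity check of Definition 7 and Lemma 8). The following sketch builds D(φ) for a small NAE-3SAT* formula, evaluates the generalized 2-means cost of the partition induced by a satisfying assignment, and compares it with $n - 1 + 2\delta m / n$. It also brute-forces all 2-partitions to confirm that no partition does better. The encoding of literals as signed integers (+i for $x_i$, -i for its negation), the example formula, and all function names are assumptions made for illustration. Exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction
from itertools import combinations


def build_distance(clauses, n):
    """Build D(phi) as in Definition 7; literal +i is x_i, -i is its negation."""
    m = len(clauses)
    delta = Fraction(1, 5 * m + 2 * n)
    big_delta = Fraction(5 * m, 5 * m + 2 * n)
    # Unordered pairs of literals that occur together in some clause.
    together = {frozenset(p) for c in clauses for p in combinations(c, 2)}
    literals = list(range(1, n + 1)) + [-i for i in range(1, n + 1)]
    dist = {}
    for a in literals:
        for b in literals:
            if a == b:
                dist[a, b] = Fraction(0)
            elif a == -b:
                dist[a, b] = 1 + big_delta
            elif frozenset((a, b)) in together or frozenset((-a, -b)) in together:
                dist[a, b] = 1 + delta
            else:
                dist[a, b] = Fraction(1)
    return literals, dist, delta


def cost(parts, dist):
    """Generalized 2-means cost: sum_j (1 / (2|P_j|)) * sum_{a,b in P_j} D[a,b]."""
    return sum(
        sum(dist[a, b] for a in part for b in part) / (2 * len(part))
        for part in parts
    )


def min_cost(literals, dist):
    """Exhaustive minimum over all partitions into two non-empty parts."""
    best = None
    for mask in range(1, 2 ** len(literals) - 1):
        p1 = [lit for i, lit in enumerate(literals) if (mask >> i) & 1]
        p2 = [lit for i, lit in enumerate(literals) if not (mask >> i) & 1]
        c = cost([p1, p2], dist)
        if best is None or c < best:
            best = c
    return best


# Hypothetical instance: (x1 v x2 v x3) and (x1 v not-x2 v x4), so n = 4, m = 2.
n = 4
clauses = [(1, 2, 3), (1, -2, 4)]
literals, dist, delta = build_distance(clauses, n)

# An assignment under which every clause has a true and a false literal.
assignment = {1: True, 2: True, 3: False, 4: False}
true_part = [lit for lit in literals if (lit > 0) == assignment[abs(lit)]]
false_part = [lit for lit in literals if lit not in true_part]

achieved = cost([true_part, false_part], dist)
predicted = n - 1 + 2 * delta * len(clauses) / n
best_cost = min_cost(literals, dist)
print(achieved, predicted, best_cost)
```

For this instance $\delta = 1/18$, so the predicted cost is $3 + 1/18 = 55/18$. The exhaustive search is only feasible because there are $2^{2n} = 256$ subsets here; it is included to illustrate Lemmas 9 and 10, not as a practical algorithm.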