Lecture 2 Clustering Part II


COMS 4995: Unsupervised Learning (Summer 18), May 24, 2018
Instructor: Nakul Verma
Scribes: Jie Li, Yadin Rozov

Today we will discuss hardness results for k-means. More specifically, we will develop the tools for, and complete, a proof that the 2-means problem is NP-hard, along the lines of [3].

1 k-means overview

1.1 k-means problem - definition I

The definition of k-means from the previous class:
- Input: a set of points $x_1, \dots, x_n \in \mathbb{R}^d$ and a positive integer $k < n$.
- Output: $T \subset \mathbb{R}^d$ such that $|T| = k$.
- Goal: minimize the cost of $T$, where
$$\mathrm{cost}(T) := \sum_{i=1}^{n} \min_{\mu_j \in T} \|x_i - \mu_j\|^2 .$$
Here $C_1, \dots, C_k$ are the clusters (a specific partition of the points, each point being assigned to its nearest center), and at an optimal solution $\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$.

1.2 k-means problem - definition II

An alternative definition of the problem that is more useful for today's proof:
- Input: a set of points $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$ and a positive integer $k < n$.
- Output:
  (a) $P_1, P_2, \dots, P_k \subseteq X$ forming a partition, i.e. $\bigcup_i P_i = X$ and $P_i \cap P_j = \emptyset$ for $i \neq j$;
  (b) centroids $\mu_1, \mu_2, \dots, \mu_k$.
- Goal: minimize the cost of the partition, defined as
$$\sum_{j=1}^{k} \sum_{x_i \in P_j} \|x_i - \mu_j\|^2 ,$$
where $P_1, \dots, P_k$ are the clusters (a specific partition of the points). This is the k-means cost.

1.3 Observations

- The obvious way to find the optimal solution to k-means is exhaustive search, which is untenable because its running time is exponential. While there are only $O(n^k)$ possible choices of centroids (assuming only points of $X$ are admissible), there are $k^n$ possible assignments of points to clusters. For $k = 10$ and $n = 100$ this is $10^{100}$, more than the commonly cited estimate of roughly $10^{80}$ atoms in the observable universe.

- The identity $\mathbb{E}\|X - Y\|^2 = 2\,\mathbb{E}\|X - \mathbb{E}X\|^2$ (for $X$ and $Y$ i.i.d.) implies that the cost function in the second definition above can be rewritten as
$$\sum_{j=1}^{k} \sum_{x_i \in P_j} \|x_i - \mu_j\|^2 = \sum_{j=1}^{k} \frac{1}{2|P_j|} \sum_{x_i, x_{i'} \in P_j} \|x_i - x_{i'}\|^2 .$$
- The identity holds because, for $X$ and $Y$ i.i.d.,
$$\mathbb{E}\|X - Y\|^2 = \mathbb{E}_X \mathbb{E}_Y\big[\|X\|^2 + \|Y\|^2 - 2\langle X, Y\rangle\big] = \mathbb{E}\|X\|^2 + \mathbb{E}\|Y\|^2 - 2\langle \mathbb{E}X, \mathbb{E}Y\rangle = 2\big[\mathbb{E}\|X\|^2 - \|\mathbb{E}X\|^2\big] = 2\,\mathbb{E}\|X - \mathbb{E}X\|^2 .$$
- The rewritten cost follows by applying the identity with $X$ and $Y$ drawn independently and uniformly from $P_j$, so that $\mu_j = \mathbb{E}X = \frac{1}{|P_j|}\sum_{x_i \in P_j} x_i$:
$$\sum_{x_i \in P_j} \|x_i - \mu_j\|^2 = |P_j|\,\mathbb{E}\|X - \mathbb{E}X\|^2 = \frac{|P_j|}{2}\,\mathbb{E}\|X - Y\|^2 = \frac{1}{2|P_j|} \sum_{x_i, x_{i'} \in P_j} \|x_i - x_{i'}\|^2 .$$
Summing over $j = 1, \dots, k$ gives the claim.

1.4 Review of NP-hard problems

For a more complete review of complexity and hardness, see Chapter 34 of [4].
- A problem is NP-hard if every problem in NP admits a polynomial-time reduction to it.
- To prove that a problem B is NP-hard, it suffices to reduce a known NP-hard problem A to B. The reduction is used as follows (based on page 1052 of [4]):
  (a) Given an instance α of a problem A that has previously been proven NP-hard, use a polynomial-time reduction algorithm to transform it into an instance β of problem B.
  (b) Run a decision algorithm for B on the instance β.
  (c) Use the answer for β as the answer for α.

1.5 2-means hardness - statement of main theorem and discussion of approach

Theorem 1. 2-means clustering is an NP-hard optimization problem.

The approach is based on Dasgupta (2008) [3]. We start from the known NP-hard problem 3SAT and show a reduction from it to the NAE-3SAT* problem. From that problem we show a reduction to the Generalized 2-means problem, and finally a reduction from that to the 2-means problem. In each reduction we need to show how an instance of the known NP-hard problem is transformed in polynomial time into an instance of the problem we want to show is NP-hard, and that the transformation preserves answers in both directions: it maps every yes-instance of the known problem to a yes-instance of the new problem, and every no-instance of the known problem to a no-instance of the new problem.
Note that, since we are dealing with decision problems, an instance of each problem must include the decision threshold as part of its input. We begin by defining the various problems before proving hardness and the properties of the reductions.
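Before turning to the reductions, the two k-means cost expressions and the exhaustive-search observation can be sanity-checked on a toy instance. The following is a minimal sketch using only the Python standard library; the function names (`cost_centroid`, `cost_pairwise`, `brute_force_kmeans`) and the random test points are illustrative assumptions, not part of the lecture. It enumerates all $k^n$ label assignments, which is feasible only for very small $n$.

```python
import itertools
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    d = len(points[0])
    return tuple(sum(p[t] for p in points) / len(points) for t in range(d))


def cost_centroid(clusters):
    """k-means cost: sum of squared distances to each cluster's mean."""
    total = 0.0
    for c in clusters:
        mu = centroid(c)
        total += sum(sq_dist(x, mu) for x in c)
    return total


def cost_pairwise(clusters):
    """Rewritten cost: (1 / (2|P_j|)) * sum over ordered pairs within P_j."""
    total = 0.0
    for c in clusters:
        total += sum(sq_dist(x, y) for x in c for y in c) / (2 * len(c))
    return total


def brute_force_kmeans(points, k):
    """Try all k**n label assignments (hypothetical helper, tiny n only)."""
    best_cost, best_clusters = float("inf"), None
    for labels in itertools.product(range(k), repeat=len(points)):
        if len(set(labels)) < k:  # skip assignments with an empty cluster
            continue
        clusters = [[p for p, l in zip(points, labels) if l == j]
                    for j in range(k)]
        c = cost_centroid(clusters)
        if c < best_cost:
            best_cost, best_clusters = c, clusters
    return best_cost, best_clusters


random.seed(0)
points = [(random.random(), random.random()) for _ in range(8)]
best_cost, best_clusters = brute_force_kmeans(points, 2)

# The centroid form and the pairwise form agree on the optimal partition.
assert abs(best_cost - cost_pairwise(best_clusters)) < 1e-9
print(best_cost, [len(c) for c in best_clusters])
```

With 8 points and $k = 2$ the loop visits $2^8 = 256$ assignments; each additional point doubles that count, which is the exponential blow-up noted in the observations.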

We briefly review NP-completeness only to the extent necessary to set the stage for this proof; a more thorough treatment can be found in a computational complexity course. As a consequence of the Cook-Levin theorem, which established SAT as the first NP-complete problem, we know that SAT and variations such as 3SAT and NAE-3SAT are NP-hard.

1.6 Definitions of various problems required for proving the main theorem

Definition 2 (3SAT). Input: a Boolean formula in 3CNF form, i.e. a formula of $m$ clauses, each containing 3 literals, with the clauses connected by the AND operator. Output: true if the formula is satisfiable, false if not.

Definition 3 (NAE-3SAT). Not-all-equal 3SAT: a 3SAT formula with the additional requirement that in each clause at least one literal is true and at least one literal is false. Compared with 3SAT, this rules out the case where all three literals in a clause are true.

Definition 4 (NAE-3SAT*). A Boolean formula $\varphi$ over variables $x_1, \dots, x_n$ with $m$ clauses, each containing exactly 3 literals. Each pair of variables $x_i, x_j$ appears together in at most 2 clauses: once as $(x_i, x_j)$ or $(\bar{x}_i, \bar{x}_j)$, and once as $(\bar{x}_i, x_j)$ or $(x_i, \bar{x}_j)$.

Definition 5 (Generalized 2-means). Input: an $n \times n$ distance matrix $D$ with entries $D_{ii'}$ = distance between object $i$ and object $i'$. Output: a partition of the objects into $P_1$ and $P_2$. Goal: minimize
$$\mathrm{cost}(P_1, P_2) = \sum_{j=1}^{2} \frac{1}{2|P_j|} \sum_{i, i' \in P_j} D_{ii'} .$$

1.7 Hardness of NAE-3SAT*

Lemma 6. NAE-3SAT* is NP-hard.
Proof. See [3].

1.8 Hardness of Generalized 2-means

For any instance $\varphi$ of NAE-3SAT* over $x_1, \dots, x_n$ we construct a $2n \times 2n$ distance matrix $D(\varphi)$ with entries $D_{\alpha,\beta}$, where $\alpha, \beta \in \{x_1, \dots, x_n, \bar{x}_1, \dots, \bar{x}_n\}$. Note that because the definition of NAE-3SAT* requires that each pair of variables $x_i, x_j$ appears in at most 2 clauses, once as $(x_i, x_j)$ or $(\bar{x}_i, \bar{x}_j)$ and once as $(\bar{x}_i, x_j)$ or $(x_i, \bar{x}_j)$, the matrix is uniquely defined for a given $\varphi$.

Definition 7 (Distance matrix for Generalized 2-means, $D(\varphi)$).
$$D_{\alpha,\beta} = \begin{cases} 0 & \text{if } \alpha = \beta \\ 1 + \Delta & \text{if } \alpha = \bar{\beta} \\ 1 + \delta & \text{if } \alpha \sim \beta \\ 1 & \text{otherwise} \end{cases} \tag{1}$$
Here $\alpha \sim \beta$ means that either $\alpha$ and $\beta$ occur together in a clause, or $\bar{\alpha}$ and $\bar{\beta}$ occur together in a clause, where
$$\Delta = \frac{5m}{5m + 2n} \tag{2}$$

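and
$$\delta = \frac{1}{5m + 2n} . \tag{3}$$
Note that the above implies $0 < \delta < \Delta < 1$, and a little algebra gives
$$4\delta m < \Delta \le 1 - 2\delta n . \tag{4}$$
Indeed, $\Delta = 5\delta m > 4\delta m$ and $1 - 2\delta n = \frac{5m}{5m + 2n} = \Delta$.

Lemma 8. If $\varphi$ is a satisfiable instance of NAE-3SAT*, then $D(\varphi)$ admits a generalized 2-means clustering of cost $\mathrm{cost}(\varphi) = n - 1 + \frac{2\delta m}{n}$.

Proof. Take a satisfying assignment of $\varphi$ and partition the $2n$ objects (literals) indexing $D(\varphi)$ into two clusters: $P_1$ contains all literals that are assigned true and $P_2$ all literals that are assigned false. Since exactly one of $x_i$ and $\bar{x}_i$ is true for each $i$, we have $|P_1| = |P_2| = n$, and no cluster contains a variable together with its negation. By the definition of NAE-3SAT*, each clause contributes one $\sim$-related pair to $P_1$ and one to $P_2$. Hence every distance between distinct objects within a cluster is either $1$ or $1 + \delta$, with exactly $m$ unordered pairs of the latter kind in each cluster, and the two clusters have identical costs. So we get
$$\mathrm{cost}(P_1, P_2) = \sum_{j=1}^{2} \frac{1}{2|P_j|} \sum_{\alpha, \beta \in P_j} D_{\alpha,\beta} = \frac{1}{2n}\left(2\binom{n}{2} + 2m\delta\right) + \frac{1}{2n}\left(2\binom{n}{2} + 2m\delta\right) = n - 1 + \frac{2m\delta}{n} .$$

Lemma 9. Let $(P_1, P_2)$ be any partition in which, WLOG, $P_1$ contains a variable and its negation. Then $\mathrm{cost}(P_1, P_2) \ge n - 1 + \frac{\Delta}{2n} > n - 1 + \frac{2m\delta}{n} = \mathrm{cost}(\varphi)$.

Proof. Let $n' = |P_1|$. All distances between distinct objects are at least 1, and the variable and its negation in $P_1$ are at distance $1 + \Delta$, so
$$\mathrm{cost}(P_1, P_2) \ge \frac{1}{n'}\left(\binom{n'}{2} + \Delta\right) + \frac{1}{2n - n'}\binom{2n - n'}{2} = n - 1 + \frac{\Delta}{n'} \ge n - 1 + \frac{\Delta}{2n} .$$
The strict inequality $\frac{\Delta}{2n} > \frac{2m\delta}{n}$ follows from $4\delta m < \Delta$ in (4).

Lemma 10. If $D(\varphi)$ admits a generalized 2-means clustering of cost at most $\mathrm{cost}(\varphi) = n - 1 + \frac{2\delta m}{n}$, then $\varphi$ is a satisfiable instance of NAE-3SAT*.

Proof. Let $(P_1, P_2)$ be a partition with cost at most $n - 1 + \frac{2\delta m}{n}$. First note that, by Lemma 9, neither $P_1$ nor $P_2$ contains a variable and its negation, so $|P_1| = |P_2| = n$. The cost of this clustering is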

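$$\mathrm{cost}(P_1, P_2) = \frac{2}{n}\left(\binom{n}{2} + \delta \sum_{\text{clauses}} \begin{cases} 1 & \text{if the clause is split across } P_1 \text{ and } P_2 \\ 3 & \text{otherwise} \end{cases}\right) .$$
Since the cost is at most $n - 1 + \frac{2\delta m}{n}$, it follows that all clauses are split between $P_1$ and $P_2$. That is, every clause has at least one literal in $P_1$ and one literal in $P_2$. Therefore, the assignment that sets all literals in $P_1$ to true and all literals in $P_2$ to false is a valid NAE-3SAT* assignment.

1.9 From Generalized 2-means to 2-means - Embedding of D(φ)

Fact 11. A symmetric $n \times n$ matrix $D$ with zero diagonal can be embedded in $\ell_2^2$ (that is, realized as the squared Euclidean distances between $n$ points) if and only if $u^T D u \le 0$ for all $u \in \mathbb{R}^n$ such that $\sum_i u_i = 0$.

Proof. Homework.

Fact 12. $D(\varphi)$ can be embedded in $\ell_2^2$. Indeed, for any $u \in \mathbb{R}^{2n}$ with $\sum_\alpha u_\alpha = 0$,
$$\begin{aligned}
u^T D u &= \sum_{\alpha,\beta} u_\alpha u_\beta D_{\alpha,\beta} \\
&= \sum_{\alpha,\beta} u_\alpha u_\beta \big(1 - \mathbf{1}(\alpha = \beta) + \Delta\,\mathbf{1}(\alpha = \bar{\beta}) + \delta\,\mathbf{1}(\alpha \sim \beta)\big) \\
&= \Big(\sum_\alpha u_\alpha\Big)^2 - \sum_\alpha u_\alpha^2 + 2\Delta \sum_{i=1}^{n} u_{x_i} u_{\bar{x}_i} + \delta \sum_{\alpha,\beta} u_\alpha u_\beta\,\mathbf{1}(\alpha \sim \beta) \\
&\le \Big(\sum_\alpha u_\alpha\Big)^2 - \|u\|^2 + 2\Delta \sum_{i=1}^{n} u_{x_i} u_{\bar{x}_i} + \delta \sum_{\alpha,\beta} |u_\alpha|\,|u_\beta| \\
&\le -\|u\|^2 + \Delta \sum_{i=1}^{n} \big(u_{x_i}^2 + u_{\bar{x}_i}^2\big) + \delta \Big(\sum_\alpha |u_\alpha|\Big)^2 \\
&\le -(1 - \Delta)\|u\|^2 + 2\delta n \|u\|^2 \le 0 ,
\end{aligned}$$
where we used $\sum_\alpha u_\alpha = 0$, the inequality $2ab \le a^2 + b^2$, the Cauchy-Schwarz bound $\big(\sum_\alpha |u_\alpha|\big)^2 \le 2n\|u\|^2$, and, in the last step, $1 - \Delta \ge 2\delta n$ from (4).

1.10 Proof of Theorem 1

Proof. NAE-3SAT* is NP-hard by Lemma 6. By Definition 7 and Lemmas 8, 9 and 10, any instance $\varphi$ of NAE-3SAT* over $x_1, \dots, x_n$ can be reduced to an instance of the (decision version of the) Generalized 2-means problem with distance matrix $D(\varphi)$ and threshold $\mathrm{cost}(\varphi)$. The lemmas also show that, for these specific instances, $\varphi$ is satisfiable if and only if $D(\varphi)$ admits a clustering of cost at most $\mathrm{cost}(\varphi)$. This, combined with the fact that the reduction steps take time polynomial in $n$ and with Fact 12, that $D(\varphi)$ can be embedded into $\ell_2^2$, completes the proof for 2-means.

References

[1] Gonzalez, T. F. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38 (1985): 293-306.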

[2] Hartigan, J. A. Clustering Algorithms. John Wiley & Sons (1975).
[3] Dasgupta, S. The hardness of k-means clustering. Technical Report CS2008-0916, Department of Computer Science and Engineering, University of California, San Diego (2008).
[4] Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms, Third Edition. The MIT Press (2009).
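As a numerical companion to the construction of the distance matrix D(φ) in Definition 7 and the lemmas that follow it, the sketch below builds D(φ) for one small hand-picked NAE-3SAT* instance and checks the claims by brute force over all 2-clusterings. Everything specific here is an assumption made for illustration: the encoding of literals as signed integers (`+i` for $x_i$, `-i` for its negation), the example formula, and the function names. Exact rational arithmetic is used so that the comparison with the threshold cost(φ) is exact.

```python
from fractions import Fraction
from itertools import product
import random


def build_D(n, clauses):
    """Distance matrix of Definition 7 as a dict keyed by literal pairs."""
    m = len(clauses)
    delta = Fraction(1, 5 * m + 2 * n)
    Delta = 5 * m * delta
    lits = list(range(1, n + 1)) + [-i for i in range(1, n + 1)]
    related = set()  # the relation "~": co-occurring literals and their negations
    for clause in clauses:
        for a in clause:
            for b in clause:
                if a != b:
                    related.add((a, b))
                    related.add((-a, -b))
    D = {}
    for a in lits:
        for b in lits:
            if a == b:
                D[a, b] = Fraction(0)
            elif a == -b:
                D[a, b] = 1 + Delta
            elif (a, b) in related:
                D[a, b] = 1 + delta
            else:
                D[a, b] = Fraction(1)
    return lits, D, delta, Delta


def cost(D, clusters):
    """Generalized 2-means cost: sum_j (1 / (2|P_j|)) * sum_{a,b in P_j} D[a,b]."""
    return sum(sum(D[a, b] for a in c for b in c) / (2 * len(c))
               for c in clusters)


def nae_satisfied(clauses, true_lits):
    """Every clause has at least one true and at least one false literal."""
    return all(0 < sum(l in true_lits for l in clause) < 3
               for clause in clauses)


# Hypothetical instance: (x1, x2, x3) and (not x1, x2, x4), so n = 4, m = 2.
n, clauses = 4, [(1, 2, 3), (-1, 2, 4)]
m = len(clauses)
lits, D, delta, Delta = build_D(n, clauses)
target = n - 1 + 2 * delta * m / n  # cost(phi)

best = None
for mask in product([0, 1], repeat=2 * n):
    P1 = [l for l, bit in zip(lits, mask) if bit]
    P2 = [l for l, bit in zip(lits, mask) if not bit]
    if not P1 or not P2:
        continue
    c = cost(D, [P1, P2])
    if c <= target:
        # Cheap clusterings never pair a variable with its negation
        # and always encode a not-all-equal satisfying assignment.
        assert all(-l not in P1 for l in P1)
        assert nae_satisfied(clauses, set(P1))
    best = c if best is None else min(best, c)

# The instance is satisfiable, so the optimum equals cost(phi).
assert best == target

# Spot-check the embedding condition u^T D u <= 0 on the hyperplane sum(u) = 0.
random.seed(1)
for _ in range(100):
    u = [Fraction(random.randint(-5, 5)) for _ in range(2 * n - 1)]
    u.append(-sum(u))
    q = sum(ua * ub * D[a, b]
            for a, ua in zip(lits, u) for b, ub in zip(lits, u))
    assert q <= 0

print("cost(phi) =", target, "min cost =", best)
```

The brute-force loop visits all $2^{2n}$ splits of the literals, so it is only a check for tiny formulas; it says nothing about efficiency, which is the point of the hardness result.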