k-variates++: more pluses in the k-means++


1 k-variates++: more pluses in the k-means++ (Poster #2). Richard Nock, Raphaël Canyasse, Roksana Boreli, Frank Nielsen. Data61 (formerly NICTA), ANU, Technion, Ecole Polytechnique, UNSW, Sony CS Labs, Inc. www.data61.csiro.au

2-4 In this talk: k-variates++, a generalization of the popular k-means++ seeding. Two theorems on k-variates++, and more (see poster): a guarantee on the approximation of the global optimum, and a likelihood-ratio bound between neighbouring instances (see paper). Applications: reductions between clustering algorithms, approximation bounds for new clustering algorithms, privacy.

5 Motivation. k-means++ seeding is a gold standard in clustering: utterly simple to implement (iteratively pick centers with probability proportional to the squared distance to the previously chosen centers), and it comes with an assumption-free (expected) approximation guarantee w.r.t. the k-means global optimum: E_C[potential] <= 8 (2 + log k) opt (Arthur & Vassilvitskii, SODA 2007). It has inspired many variants: tensor clustering, distributed, data-stream, on-line and parallel clustering, clustering without centroids in closed form, clustering with more general potentials, etc.

6 Motivation. The approaches above are spawns of k-means++: they either modify the algorithm (e.g. distributed, on-line, streamed, no closed-form centroid, more general potentials) or use it as a building block. Our objective: put all of them in the same bag, i.e. a generalisation of k-means++ (k-variates++) from which such approaches would be just instantiations, via reductions. Because it is general, it also yields new applications.

7 k-means++ (Arthur & Vassilvitskii, SODA 2007)
Input: data A ⊆ R^d with |A| = m, k ∈ N*;
Step 1: initialise centers C ← ∅;
Step 2: for t = 1, 2, ..., k:
  2.1: randomly sample a ~ q_t over A, with q_1 = u_m (uniform) and, for t > 1,
       q_t(a) = D_t(a) / Σ_{a' ∈ A} D_t(a'), where D_t(a) = min_{x ∈ C} ||a − x||²_2;
  2.2: x ← a;
  2.3: C ← C ∪ {x};
Output: C;
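For concreteness, here is a minimal Python sketch of this seeding step (the function name and NumPy formulation are mine, not from the paper):

```python
import numpy as np

def kmeanspp_seeding(A, k, rng=None):
    """k-means++ seeding: pick the next center with probability
    proportional to the squared distance to the closest center so far."""
    rng = np.random.default_rng(rng)
    m = len(A)
    centers = [A[rng.integers(m)]]          # first center: uniform over A
    for _ in range(1, k):
        # D_t(a) = min_{x in C} ||a - x||^2 for every data point a
        dists = np.min([np.sum((A - c) ** 2, axis=1) for c in centers], axis=0)
        probs = dists / dists.sum()
        centers.append(A[rng.choice(m, p=probs)])
    return np.array(centers)
```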

8 k-variates++
Input: data A ⊆ R^d with |A| = m, k ∈ N*, random variables {X_a, a ∈ A}, probe functions ℘_t : A → R^d (t ≥ 1);
Step 1: initialise centers C ← ∅;
Step 2: for t = 1, 2, ..., k:
  2.1: randomly sample a ~ q_t over A, with q_1 = u_m (uniform) and, for t > 1,
       q_t(a) = D_t(a) / Σ_{a' ∈ A} D_t(a'), where D_t(a) = min_{x ∈ C} ||℘_t(a) − x||²_2;
  2.2: randomly sample x ~ X_a;
  2.3: C ← C ∪ {x};
Output: C;
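The generalisation keeps the same loop; here is a hedged Python sketch in which `probe` stands for the probe functions ℘_t and `sample_X` draws from the random variable X_a (both are left as user-supplied callables, since their choice is exactly what the framework leaves open):

```python
import numpy as np

def kvariatespp_seeding(A, k, probe, sample_X, rng=None):
    """k-variates++ seeding (sketch).
    probe(t, a): probe function for round t, maps a data point to R^d.
    sample_X(a, rng): draws one sample from the random variable X_a."""
    rng = np.random.default_rng(rng)
    m = len(A)
    a = A[rng.integers(m)]                   # round 1: uniform over A ...
    centers = [sample_X(a, rng)]             # ... then sample from X_a
    for t in range(2, k + 1):
        probed = np.array([probe(t, a_) for a_ in A])
        # D_t(a) = min_{x in C} ||probe_t(a) - x||^2
        dists = np.min([np.sum((probed - c) ** 2, axis=1) for c in centers], axis=0)
        probs = dists / dists.sum()
        a = A[rng.choice(m, p=probs)]
        centers.append(sample_X(a, rng))
    return np.array(centers)
```

With probe = lambda t, a: a and sample_X = lambda a, rng: a (identity probes, Dirac densities), this reduces to the k-means++ sketch above.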

9 Two theorems & applications

10-12 Theorem 1 (approximation of the global optimum)
k-means potential for C: φ(A; C) = Σ_{a ∈ A} ||a − c(a)||²_2, with c(a) = argmin_{c ∈ C} ||a − c||²_2, and
  opt  = Σ_{a ∈ A} ||a − c_opt(a)||²_2,
  bias = Σ_{a ∈ A} ||E[X_a] − c_opt(a)||²_2,
  var  = Σ_{a ∈ A} tr(cov[X_a]).
Suppose ℘ is γ-stretching (γ ≥ 0): for any optimal cluster Ā ⊆ A with size > 1 and any a' ∈ Ā,
  φ(Ā; C) − φ(Ā; {a'}) ≤ (1 + γ) (φ(℘_t(Ā); C) − φ(℘_t(Ā); {℘_t(a')})), ∀t.
Then
  E_{C ~ k-variates++}[φ(A; C)] ≤ (2 + log k) ψ, with ψ = (6 + 4γ) opt + 2 bias + 2 var.
k-means++ special case: probes = Id, X_a = Diracs, so bias = opt, var = 0 and γ = 0, hence ψ = 8 opt, recovering the Arthur-Vassilvitskii bound.

13 Remarks. The guarantee approaches the statistical lower bound (Fréchet-Cramér-Rao-Darmois). It can be better than the Arthur-Vassilvitskii bound, in particular if bias < opt; bias is the knob through which background / domain knowledge may improve the general bound.

14 Applications: reductions from k-variates++ ⇒ approximability ratios. Pick a clustering algorithm L; show that the expected output of L equals that of k-variates++ for particular choices of ℘_t and X (note: no computational constraint, we just need existence); then get an approximability ratio for L.

15-16 Summary (poster, paper)

  Setting      Algorithm L      Probe functions ℘_t        Densities X
  Batch        k-means++        Identity                   Diracs
  Distributed  d-k-means++      Identity                   Uniform, support = subsets
  Distributed  p+d-k-means++    Identity                   Non-uniform, compact support
  Streaming    s-k-means++      Synopses                   Diracs
  On-line      ol-k-means++     Point (batch not hit)      Diracs / closest center (batch hit)

17 Distributed clustering. Setting: data nodes (Forgy nodes) F_1, ..., F_n, each holding local data A_i with ∪_i A_i = A and performing uniform sampling, plus a special sampling node N with no data, which performs the non-uniform sampling (e.g. hybrid, server-assisted P2P networks; N may itself be a Forgy node). In total, only k data points are communicated among the Forgy nodes, and no data point is communicated to N.

18 Algorithm (d-k-means++) + Theorem
Algorithm: iterate for t = 1, 2, ..., k:
  - N chooses (non-uniformly, according to D_t) a Forgy node, say F_i;
  - F_i samples (uniformly) a point a_t ∈ A_i and sends it to every F_j;
  - every F_j computes and sends a scalar d_j ∈ R_+ to N, which updates D_t.
Theorem: E[φ(A; C)] ≤ (2 + log k) ψ, with ψ = 10 opt + 6 F_s and F_s = Σ_{i ∈ [n]} Σ_{a ∈ A_i} ||c(A_i) − a||²_2 the spread of the Forgy nodes.
Remarks: opt is the global optimum on the total data; the bound gets all the better as Forgy nodes aggregate local data.
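A hedged Python sketch of one way to simulate this protocol; the split of roles follows the description above, but the exact content of the scalars d_j and all names are my reading, not the paper's:

```python
import numpy as np

def d_kmeanspp_seeding(node_data, k, rng=None):
    """Sketch of the distributed seeding protocol over Forgy nodes.
    node_data[i] is the local dataset A_i of node F_i (an (m_i, d) array).
    The sampling node N only ever receives one scalar d_j per node per round."""
    rng = np.random.default_rng(rng)
    n = len(node_data)
    sizes = np.array([len(Ai) for Ai in node_data], dtype=float)

    # Round 1: pick a node proportionally to |A_i|, then a uniform local point,
    # so the first center is uniform over the union of the local datasets.
    i = rng.choice(n, p=sizes / sizes.sum())
    centers = [node_data[i][rng.integers(len(node_data[i]))]]

    for _ in range(1, k):
        # Each Forgy node F_j computes d_j = sum_{a in A_j} min_c ||a - c||^2
        # with respect to the centers broadcast so far, and sends it to N.
        d = np.array([
            np.min([np.sum((Aj - c) ** 2, axis=1) for c in centers], axis=0).sum()
            for Aj in node_data
        ])
        i = rng.choice(n, p=d / d.sum())                   # N picks a node ~ d_j
        a = node_data[i][rng.integers(len(node_data[i]))]  # F_i samples uniformly
        centers.append(a)                                  # a_t broadcast as new center
    return np.array(centers)
```

In distribution, one round of this sketch is a k-variates++ round with identity probes and X_a uniform over the local subset containing a, which matches the Distributed row of the summary table.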

19-21 Theorem 2 (likelihood-ratio bound for neighbour samples)
Assumption: the common support is 𝕏 = Ball(L_2, R), and all X_a satisfy (dP_{X_{a'}} / dP_{X_a})(x) ≤ ρ(R), ∀a, a' ∈ A, ∀x ∈ 𝕏 (see e.g. differential privacy).
Fix ℘_t = Id. For any neighbour A' of A (the two datasets differ in one point),
  P_{C ~ k-variates++}[C | A'] / P_{C ~ k-variates++}[C | A] ≤ ((1 + ε_w)^{k−1} + f(k) ε_w (1 + ε_s)^{k−1}) ρ(R),
where 0 < ε_w, ε_s ≤ 1 are spread and monotonicity parameters (formal definitions in the poster / paper). They can be estimated / computed from data, and in general they → 0 as m grows. Under which conditions does the first factor go to 1 and the second to 0?

22 Theorem 2 (likelihood-ratio bound for neighbour samples), continued
If the densities of all X_a lie in [ρ_m, ρ_M] with ρ_m > 0, then with high probability
  P[C | A'] / P[C | A] ≤ (1 + o(1)) ρ(2R),
as long as k is not too large (the exact condition, involving √m, ρ_M / ρ_m and the dimension d, is in the paper). No ε_w, ε_s appear in this bound (the proof exhibits small values w.h.p.; experiments display such values). Application in differential privacy (sublinear noise!).
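For intuition on these density assumptions, here is a hedged Python sketch of one X_a that fits them under mild conditions: a Gaussian centered at a, truncated to a fixed L2 ball of radius R, whose density on the ball is bounded above and below by positive constants depending on R and sigma. This is an illustrative choice of mine, not necessarily the one analysed in the paper.

```python
import numpy as np

def sample_truncated_gaussian_in_ball(a, sigma, R, rng, max_tries=1000):
    """Draw one sample from N(a, sigma^2 I) truncated to the L2 ball of
    radius R centered at the origin, via rejection sampling.
    Assumes ||a|| <= R, so the acceptance probability does not vanish."""
    for _ in range(max_tries):
        x = a + sigma * rng.standard_normal(a.shape)
        if np.linalg.norm(x) <= R:
            return x
    return a  # pathological fallback; tune sigma / R so it is never used

# Plugged into the k-variates++ sketch above, e.g.:
# sample_X = lambda a, rng: sample_truncated_gaussian_in_ball(a, 0.1, 1.0, rng)
```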

23 Experiments. k-variates++ (d-k-means++) vs k-means++ and k-means|| (Bahmani et al., 2012), on simulated data with d = 50. Peers are sampled until Σ_i |A_i| reaches the target total, with E[|A_i|] = 500. For each peer, (a) data is uniformly sampled in a hyperrectangle, then (b) p% of its points are given to a random peer (as p increases, the spread F_s increases and the problem becomes more difficult). [Plot: F_s(p) / F_s(0) as a function of p.]
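A hedged sketch of a data generator following this description; the overall shape (per-peer hyperrectangle, E[|A_i|] = 500, moving p% of points to a random peer) is from the slide, while every concrete constant (rectangle sides, Poisson-distributed sizes, etc.) is my own illustrative choice:

```python
import numpy as np

def make_peer_data(n_peers, d=50, mean_size=500, p=0.10, rng=None):
    """Simulated distributed-clustering data: each peer draws points uniformly
    in its own hyperrectangle, then gives a fraction p of them to a random
    other peer (larger p means larger spread F_s, hence a harder problem)."""
    rng = np.random.default_rng(rng)
    assert n_peers >= 2
    peers = []
    for _ in range(n_peers):
        size = max(1, rng.poisson(mean_size))           # E[|A_i|] ~ mean_size
        corner = rng.uniform(-10.0, 10.0, size=d)       # peer-specific rectangle
        lengths = rng.uniform(0.5, 2.0, size=d)
        peers.append(corner + lengths * rng.uniform(size=(size, d)))
    # Move a fraction p of each peer's points to a uniformly chosen other peer.
    received = [[] for _ in range(n_peers)]
    for i in range(n_peers):
        Ai = peers[i]
        idx = rng.choice(len(Ai), size=int(p * len(Ai)), replace=False)
        for x in Ai[idx]:
            j = (i + rng.integers(1, n_peers)) % n_peers  # any peer but i
            received[j].append(x)
        peers[i] = np.delete(Ai, idx, axis=0)
    return [np.vstack([peers[i]] + received[i]) if received[i] else peers[i]
            for i in range(n_peers)]
```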

24 Experiments. k-variates++ (d-k-means++) vs k-means++ and k-means|| (Bahmani et al., 2012), the latter two used with their best parameters. [Plots: potentials of (k-means++), (d-k-means++) and (k-means||) as functions of k and p.]

25 Conclusions. We provide a generalisation of k-means++ with a guaranteed approximation of the global optimum. k-variates++ can be used as is (e.g. privacy, k-means++) or to prove approximation properties of other algorithms via reductions between clustering algorithms. Come see the poster for more examples! Future work: use the Theorems to address stability, generalisation and smoothed analysis.

26 Thank you! Questions?

