The Informativeness of k-means for Learning Mixture Models


Vincent Y. F. Tan (joint work with Zhaoqiang Liu)
National University of Singapore
June 18

Gaussian distribution

For F dimensions, the Gaussian distribution of a vector $x \in \mathbb{R}^F$ is defined by
$$\mathcal{N}(x \mid u, \Sigma) = \frac{1}{(2\pi)^{F/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - u)^T \Sigma^{-1} (x - u) \right),$$
where $u$ is the mean vector and $\Sigma$ is the covariance matrix of the Gaussian.

Example (figure on slide): a two-dimensional Gaussian with a given mean vector $u$ and covariance matrix $\Sigma$.
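A minimal NumPy sketch of this density, for a quick numerical check; the mean and covariance used below are illustrative placeholders (the example values on the slide did not survive transcription):

```python
import numpy as np

def gaussian_density(x, u, Sigma):
    """Evaluate the F-dimensional Gaussian density N(x | u, Sigma) from the formula above."""
    F = u.shape[0]
    diff = x - u
    norm_const = (2 * np.pi) ** (F / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x - u)^T Sigma^{-1} (x - u)
    return np.exp(-0.5 * quad) / norm_const

# Illustrative 2-D example (placeholder values).
u = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(gaussian_density(np.array([0.5, -0.5]), u, Sigma))
```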

Gaussian mixture model (GMM)

$$P(x) = \sum_{k=1}^{K} w_k\, \mathcal{N}(x \mid u_k, \Sigma_k),$$

where $w_k$ is the mixing weight, $u_k$ is the component mean vector, and $\Sigma_k$ is the component covariance matrix; if $\Sigma_k = \sigma_k^2 I$, the GMM is said to be spherical.
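As a sketch of how such data can be generated, the following draws samples from a spherical GMM together with their true component labels; the component parameters are hypothetical and chosen only for illustration:

```python
import numpy as np

def sample_spherical_gmm(N, weights, means, sigmas, seed=0):
    """Draw N samples from a spherical GMM and return them with their true component labels."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)    # shape (K, F): one row per component mean
    sigmas = np.asarray(sigmas, dtype=float)  # component standard deviations
    labels = rng.choice(len(weights), size=N, p=weights)
    samples = means[labels] + sigmas[labels][:, None] * rng.standard_normal((N, means.shape[1]))
    return samples, labels

# A hypothetical 2-component spherical GMM in R^2.
samples, true_labels = sample_spherical_gmm(
    N=1000, weights=[0.4, 0.6], means=[[0.0, 0.0], [4.0, 4.0]], sigmas=[1.0, 0.5]
)
```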

Learning GMM

Data samples are independently generated from a GMM. The correct target clustering groups the samples according to which Gaussian component they are generated from.

Definition 1 (correct target clustering). Suppose $V := [v_1, v_2, \ldots, v_N]$ are samples independently generated from a $K$-component GMM. The correct target clustering $\mathcal{I} := \{I_1, I_2, \ldots, I_K\}$ satisfies $n \in I_k$ if and only if $v_n$ comes from the $k$-th component.

Recovering the correct target clustering thereby allows the important parameters of the GMM to be inferred.

Algorithms for learning GMM

(i) Expectation Maximization (EM): a local-search heuristic for maximum likelihood estimation in the presence of incomplete data; it cannot guarantee convergence to a global optimum.

Algorithms for learning GMM

(ii) Algorithms based on spectral decomposition and the method of moments.

Definition 2 (non-degeneracy condition). The mixture model is said to satisfy the non-degeneracy condition if the component mean vectors $u_1, \ldots, u_K$ span a $K$-dimensional subspace and the mixing weights satisfy $w_k > 0$ for all $k \in \{1, 2, \ldots, K\}$.

Algorithms for learning GMM

(iii) Algorithms proposed in theoretical computer science with provable guarantees; these need separability assumptions. Vempala and Wang [2002]: for any $i, j \in [K]$ with $i \neq j$,
$$\|u_i - u_j\|_2 > C \max\{\sigma_i, \sigma_j\}\, K^{1/4} \log^{1/4}\!\left( \frac{F}{w_{\min}} \right).$$
Under this assumption, a simple spectral algorithm with running time polynomial in both $F$ and $K$ correctly clusters the samples.

The k-means algorithm

There is a large number of algorithms for finding the (approximately) correct clustering of a GMM. Nevertheless, many practitioners stick with the k-means algorithm because of its simplicity and its successful application in various fields.

The objective function of k-means

Objective function: the so-called sum-of-squares distortion
$$D(V, \mathcal{I}) := \sum_{k=1}^{K} \sum_{n \in I_k} \|v_n - c_k\|_2^2,$$
where $I_k$ is the index set of the $k$-th cluster and $c_k := \frac{1}{|I_k|} \sum_{n \in I_k} v_n$ is the centroid of the $k$-th cluster.

Goal: find an optimal clustering $\mathcal{I}^{\mathrm{opt}}$ that satisfies
$$D(V, \mathcal{I}^{\mathrm{opt}}) = \min_{\mathcal{I}} D(V, \mathcal{I}) =: D^*(V).$$
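For concreteness, a short sketch of how the distortion $D(V, \mathcal{I})$ can be evaluated for a given clustering, assuming the $F \times N$ data layout used on these slides:

```python
import numpy as np

def distortion(V, clusters):
    """Sum-of-squares distortion D(V, I): V has shape (F, N), clusters is a list of index lists."""
    total = 0.0
    for idx in clusters:
        points = V[:, list(idx)]                       # columns belonging to the k-th cluster
        centroid = points.mean(axis=1, keepdims=True)  # c_k = (1/|I_k|) * sum of its points
        total += np.sum((points - centroid) ** 2)      # squared distances to the centroid
    return total
```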

The k-means algorithm (a sequence of slides illustrating the iterations: cluster assignments and centroids are updated alternately until convergence).
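A minimal sketch of the standard Lloyd-style k-means iteration that the slides illustrate; the random initialization used here is only one of many possible choices:

```python
import numpy as np

def kmeans(V, K, n_iter=100, seed=0):
    """Plain Lloyd's algorithm on data V of shape (F, N); returns labels and centroids."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    centroids = V[:, rng.choice(N, size=K, replace=False)].copy()  # random initial centroids
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        dists = ((V[:, :, None] - centroids[:, None, :]) ** 2).sum(axis=0)  # shape (N, K)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Update step: recompute each centroid as the mean of its cluster.
        for k in range(K):
            if np.any(labels == k):
                centroids[:, k] = V[:, labels == k].mean(axis=1)
    return labels, centroids
```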

Using k-means to learn GMM?

Can we simply use k-means to learn the correct clustering of a GMM? Yes! Kumar and Kannan [2010] showed that if the data points satisfy a proximity condition, i.e., when they are independently generated from a GMM obeying a certain separability assumption, then the k-means algorithm with a proper initialization correctly clusters nearly all data points with high probability.

Using k-means to learn GMM?

What is the key condition that must be satisfied for k-means to learn the parameters of a GMM? The correct target clustering should be close to any optimal clustering.

Main contributions

We prove that if the data points are generated from a $K$-component spherical GMM satisfying the non-degeneracy condition and a separability assumption, then the correct target clustering is close to any optimal clustering.

Main contributions

We also prove that if the data points are generated from a $K$-component spherical GMM satisfying the non-degeneracy condition and an even weaker separability assumption, and are then projected onto a low-dimensional space, the correct target clustering is close to any optimal clustering for the dimensionality-reduced dataset.

Advantages of dimensionality reduction

Significantly faster running time; reduced memory usage; weaker separability assumption; other advantages.

Lower bound of the distortion

Let $Z$ be the centralized data matrix of $V$ and denote $S = Z^T Z$. According to Ding and He [2004], for any $K$-clustering $\mathcal{I}$,
$$D(V, \mathcal{I}) \ge \underline{D}(V) := \mathrm{tr}(S) - \sum_{k=1}^{K-1} \lambda_k(S),$$
where $\lambda_1(S) \ge \lambda_2(S) \ge \cdots$ are the sorted eigenvalues of $S$.
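A sketch of how this lower bound can be evaluated numerically, using the fact that the nonzero eigenvalues of $S = Z^T Z$ are the squared singular values of $Z$:

```python
import numpy as np

def distortion_lower_bound(V, K):
    """Compute tr(S) minus the sum of the top (K - 1) eigenvalues of S = Z^T Z."""
    Z = V - V.mean(axis=1, keepdims=True)                     # centralized (F x N) data matrix
    eigvals = np.sort(np.linalg.svd(Z, compute_uv=False) ** 2)[::-1]
    return eigvals.sum() - eigvals[:K - 1].sum()
```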

Misclassification error (ME) distance

Definition 3 (ME distance). The misclassification error distance between any two $K$-clusterings $\mathcal{I}^1 := \{I_1^1, I_2^1, \ldots, I_K^1\}$ and $\mathcal{I}^2 := \{I_1^2, I_2^2, \ldots, I_K^2\}$ is defined as
$$d(\mathcal{I}^1, \mathcal{I}^2) := 1 - \frac{1}{N} \max_{\pi \in P_K} \sum_{k=1}^{K} \left| I_k^1 \cap I_{\pi(k)}^2 \right|,$$
where $P_K$ denotes the set of all permutations of the labels $\{1, 2, \ldots, K\}$, so the distance is minimized over all label permutations.

Meilă [2005]: the ME distance defined above is indeed a metric.
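A sketch of the ME distance for clusterings represented as label vectors in {0, ..., K-1}; the maximization over permutations is carried out by the Hungarian algorithm via scipy's linear_sum_assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def me_distance(labels1, labels2, K):
    """Misclassification error distance between two K-clusterings given as label vectors."""
    N = len(labels1)
    # Overlap matrix: overlap[i, j] = |I_i^1 intersect I_j^2|.
    overlap = np.zeros((K, K), dtype=int)
    for a, b in zip(labels1, labels2):
        overlap[a, b] += 1
    # The best label permutation maximizes the total overlap.
    rows, cols = linear_sum_assignment(-overlap)
    return 1.0 - overlap[rows, cols].sum() / N
```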

Important lemma

Lemma 1 (Meilă, 2006). Given a partition $\mathcal{I} := \{I_1, I_2, \ldots, I_K\}$ and a dataset $V$, let
$$p_{\max} := \max_k \frac{1}{N} |I_k|, \qquad p_{\min} := \min_k \frac{1}{N} |I_k|,$$
and
$$\delta := \frac{D(V, \mathcal{I}) - D^*(V)}{\lambda_{K-1}(S) - \lambda_K(S)}, \qquad \text{where } D^*(V) := \min_{\mathcal{I}} D(V, \mathcal{I}).$$
If $\delta \le \frac{K-1}{2}$ and
$$\tau(\delta) := 2\delta \left( 1 - \frac{\delta}{K-1} \right) \le p_{\min},$$
then $d(\mathcal{I}, \mathcal{I}^{\mathrm{opt}}) \le p_{\max}\, \tau(\delta)$.

Definitions

Define the increasing function
$$\zeta(p) := p - \frac{p^2}{K-1},$$
the average variance
$$\bar{\sigma}^2 := \sum_{k=1}^{K} w_k \sigma_k^2,$$
and the minimum eigenvalue
$$\lambda_{\min} := \lambda_{K-1}\!\left( \sum_{k=1}^{K} w_k (u_k - \bar{u})(u_k - \bar{u})^T \right),$$
where $\bar{u} := \sum_{k=1}^{K} w_k u_k$.

Theorem for original datasets

Theorem 1. Let $V \in \mathbb{R}^{F \times N}$ be a dataset consisting of samples generated from a $K$-component spherical GMM (with $N > F > K$) satisfying the non-degeneracy condition. Let
$$w_{\min} := \min_k w_k, \qquad w_{\max} := \max_k w_k,$$
and assume
$$\delta_0 := \frac{(K-1)\bar{\sigma}^2}{\lambda_{\min}} < \zeta(w_{\min}).$$
Then, for sufficiently large $N$, with high probability,
$$d(\text{correct}, \text{optimal}) \le \tau(\delta_0)\, w_{\max}.$$
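A sketch that computes the quantities $\bar{\sigma}^2$, $\lambda_{\min}$, and $\delta_0$ of Theorem 1 directly from (assumed known) GMM parameters:

```python
import numpy as np

def separability_quantities(weights, means, sigmas):
    """Compute sigma_bar^2, lambda_min, and delta_0 for a spherical GMM."""
    w = np.asarray(weights, dtype=float)
    U = np.asarray(means, dtype=float)          # shape (K, F)
    K = len(w)
    sigma_bar_sq = np.sum(w * np.asarray(sigmas, dtype=float) ** 2)
    u_bar = w @ U                               # overall mean, shape (F,)
    C = U - u_bar
    B = (C * w[:, None]).T @ C                  # sum_k w_k (u_k - u_bar)(u_k - u_bar)^T
    eigvals = np.sort(np.linalg.eigvalsh(B))[::-1]
    lam_min = eigvals[K - 2]                    # lambda_{K-1} of the matrix above
    delta_0 = (K - 1) * sigma_bar_sq / lam_min
    return sigma_bar_sq, lam_min, delta_0
```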

Remark on the separability assumption

Remark 1. The condition $\delta_0 < \zeta(w_{\min})$ can be considered a separability assumption. For example, $K = 2$ implies that $\lambda_{\min} = w_1 w_2 \|u_1 - u_2\|_2^2$, and the condition becomes
$$\|u_1 - u_2\|_2 > \frac{\bar{\sigma}}{\sqrt{w_1 w_2\, \zeta(w_{\min})}}.$$

Remark on the non-degeneracy condition

Remark 2. The non-degeneracy condition is used to ensure that $\lambda_{\min} > 0$. For $K = 2$, we have $\lambda_{\min} = w_1 w_2 \|u_1 - u_2\|_2^2$, so we only need the two component mean vectors to be distinct; we do not need them to be linearly independent.

Theorem for dimensionality-reduced datasets

Theorem 2. Let $V \in \mathbb{R}^{F \times N}$ be generated under the same conditions as in Theorem 1, with the separability assumption modified to
$$\delta_1 := \frac{(K-1)\bar{\sigma}^2}{\lambda_{\min} + \bar{\sigma}^2} < \zeta(w_{\min}),$$
and let $\tilde{V} \in \mathbb{R}^{(K-1) \times N}$ be the post-$(K-1)$-PCA dataset of $V$. Then, for sufficiently large $N$, with high probability,
$$d(\text{correct}, \text{optimal for } \tilde{V}) \le \tau(\delta_1)\, w_{\max}.$$
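One way to form the $(K-1)$-dimensional PCA projection, as a sketch; the exact construction of the post-$(K-1)$-PCA dataset on the slides (e.g., whether the data are centralized first) may differ in minor details:

```python
import numpy as np

def pca_project(V, K):
    """Project the (F x N) data matrix onto its top (K - 1) principal directions."""
    Z = V - V.mean(axis=1, keepdims=True)            # centralize the data
    U, _, _ = np.linalg.svd(Z, full_matrices=False)  # columns of U: principal directions
    return U[:, :K - 1].T @ Z                        # reduced dataset of shape (K - 1, N)
```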

Upper bound on the ME distance between optimal clusterings

Combining the results of Theorem 1 and Theorem 2 via the triangle inequality:

Corollary 1. Let $V \in \mathbb{R}^{F \times N}$ be generated under the same conditions as in Theorem 1, and let $\tilde{V}$ be the post-$(K-1)$-PCA dataset of $V$. Then, for sufficiently large $N$, with high probability,
$$d(\text{optimal for } V, \text{optimal for } \tilde{V}) \le \left( \tau(\delta_0) + \tau(\delta_1) \right) w_{\max}.$$

Parameter settings

$K = 2$; for all $k = 1, 2$, we set either
$$\sigma_k^2 = \frac{\lambda_{\min}\, \zeta(w_{\min} - \epsilon)}{4(K-1)}, \quad \text{corresponding to} \quad \frac{\delta_0}{\zeta(w_{\min})} \approx \frac{1}{4},$$
or
$$\sigma_k^2 = \frac{\lambda_{\min}\, \zeta(w_{\min} - \epsilon)}{K-1}, \quad \text{corresponding to} \quad \frac{\delta_0}{\zeta(w_{\min})} \approx 1,$$
where $\epsilon > 0$ is a small constant. The former corresponds to well-separated clusters; the latter corresponds to moderately well-separated clusters.
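An end-to-end sketch in the spirit of these experiments: sample from a two-component spherical GMM, run k-means, and report the ME distance between the correct clustering and the k-means clustering. The parameter values below are illustrative placeholders rather than the exact settings above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
K, F, N = 2, 10, 5000
weights = np.array([0.5, 0.5])
means = np.stack([np.zeros(F), 3.0 * np.ones(F) / np.sqrt(F)])  # separation ||u_1 - u_2|| = 3
sigmas = np.array([1.0, 1.0])                                   # placeholder variances

# Rows are samples here (scikit-learn convention), unlike the F x N layout on the slides.
labels_true = rng.choice(K, size=N, p=weights)
V = means[labels_true] + sigmas[labels_true][:, None] * rng.standard_normal((N, F))

labels_km = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(V)

# ME distance between the correct clustering and the k-means clustering.
overlap = np.zeros((K, K), dtype=int)
for a, b in zip(labels_true, labels_km):
    overlap[a, b] += 1
rows, cols = linear_sum_assignment(-overlap)
print("ME distance:", 1.0 - overlap[rows, cols].sum() / N)
```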

Visualization of post-2-SVD datasets

Figure: scatter plots of the two clusters and the component means for $\delta_0/\zeta(w_{\min}) \approx 1/4$ (left) and $\delta_0/\zeta(w_{\min}) \approx 1$ (right).

Original datasets

$d_{\mathrm{org}} := d(\mathcal{I}, \mathcal{I}^{\mathrm{opt}})$ is the true ME distance and $\bar{d}_{\mathrm{org}} := \tau(\delta_0)\, w_{\max}$ is its upper bound. The quantity $\delta_0^{\mathrm{emp}} := \frac{D(V, \mathcal{I}) - D^*(V)}{\lambda_{K-1}(S) - \lambda_K(S)}$ is an approximation of $\delta_0$, and $\bar{d}_{\mathrm{org}}^{\mathrm{emp}} := \tau(\delta_0^{\mathrm{emp}})\, p_{\max}$ is an approximation of $\bar{d}_{\mathrm{org}}$.

Figure: true distances and upper bounds for the original datasets (ME distance versus $N$ for both parameter settings).

Dimensionality-reduced datasets

Figure: true distances and upper bounds ($d_{\mathrm{pca}}$, $\bar{d}_{\mathrm{pca}}$, $\bar{d}_{\mathrm{pca}}^{\mathrm{emp}}$) for the post-PCA datasets (ME distance versus $N$ for both parameter settings).

Comparisons of running time

Figure: running time versus $N$ for the original and post-PCA datasets.

Further extensions

Randomized SVD instead of exact SVD; random projection; non-spherical Gaussians or even more general distributions, e.g., log-concave distributions.
