The Informativeness of k-means for Learning Mixture Models
Vincent Y. F. Tan (joint work with Zhaoqiang Liu)
National University of Singapore
June 18, 2018
Gaussian distribution
For $F$ dimensions, the Gaussian distribution of a vector $x \in \mathbb{R}^F$ is defined by
$\mathcal{N}(x \mid u, \Sigma) = \frac{1}{(2\pi)^{F/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - u)^T \Sigma^{-1} (x - u) \right),$
where $u$ is the mean vector and $\Sigma$ is the covariance matrix of the Gaussian.
Example: mean $u = [0, 0]^T$ and covariance matrix $\Sigma = \begin{bmatrix} 0.25 & 0.3 \\ 0.3 & 1.0 \end{bmatrix}$.
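As a quick numerical illustration of the density formula above, here is a minimal numpy sketch (not part of the talk); the mean and covariance are the example from this slide, while the evaluation point is an arbitrary illustrative choice.

```python
import numpy as np

def gaussian_pdf(x, u, Sigma):
    """Evaluate the F-dimensional Gaussian density N(x | u, Sigma)."""
    F = u.shape[0]
    diff = x - u
    norm_const = (2 * np.pi) ** (-F / 2) * np.linalg.det(Sigma) ** (-0.5)
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - u)^T Sigma^{-1} (x - u)
    return norm_const * np.exp(-0.5 * quad)

# Example from the slide: u = [0, 0]^T, Sigma = [[0.25, 0.3], [0.3, 1.0]]
u = np.array([0.0, 0.0])
Sigma = np.array([[0.25, 0.3], [0.3, 1.0]])
print(gaussian_pdf(np.array([0.1, -0.2]), u, Sigma))
```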
Gaussian mixture model (GMM)
$P(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid u_k, \Sigma_k)$
- $w_k$: mixing weight
- $u_k$: component mean vector
- $\Sigma_k$: component covariance matrix; if $\Sigma_k = \sigma_k^2 I$, the GMM is said to be spherical
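To make the model concrete, a small numpy sketch (an assumed setup, not from the talk) that draws samples from a spherical GMM; the specific weights, means, and variances below are illustrative only.

```python
import numpy as np

def sample_spherical_gmm(N, w, means, sigmas, seed=None):
    """Draw N samples from a spherical GMM; returns an F x N data matrix and true labels."""
    rng = np.random.default_rng(seed)
    K, F = means.shape
    labels = rng.choice(K, size=N, p=w)                        # component of each sample
    samples = means[labels] + rng.standard_normal((N, F)) * sigmas[labels, None]
    return samples.T, labels                                   # columns are data points

# Illustrative parameters: K = 2 spherical components in F = 3 dimensions
w = np.array([0.4, 0.6])
means = np.array([[0.0, 0.0, 0.0],
                  [3.0, 3.0, 3.0]])
sigmas = np.array([1.0, 0.5])
V, true_labels = sample_spherical_gmm(1000, w, means, sigmas, seed=0)
```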
Learning GMM
- Data samples are independently generated from a GMM.
- Goal: recover the correct target clustering of the samples according to which Gaussian component they are generated from.

Definition 1 (correct target clustering)
Suppose $V := [v_1, v_2, \ldots, v_N]$ are samples independently generated from a $K$-component GMM. The correct target clustering $\mathcal{I} := \{I_1, I_2, \ldots, I_K\}$ satisfies $n \in I_k$ iff $v_n$ comes from the $k$-th component.

Recovering this clustering allows us to infer the important parameters of the GMM.
Algorithms for learning GMM
i) Expectation Maximization (EM)
- A local-search heuristic for maximum likelihood estimation in the presence of incomplete data;
- Cannot guarantee convergence to a global optimum.
Algorithms for learning GMM
ii) Algorithms based on spectral decomposition and the method of moments.

Definition 2 (non-degeneracy condition)
The mixture model is said to satisfy the non-degeneracy condition if the component mean vectors $u_1, \ldots, u_K$ span a $K$-dimensional subspace and the mixing weights satisfy $w_k > 0$ for all $k \in \{1, 2, \ldots, K\}$.
Algorithms for learning GMM
iii) Algorithms proposed in theoretical computer science with guarantees; these require separability assumptions.
Vempala and Wang [2002]: for any $i, j \in [K]$, $i \neq j$,
$\| u_i - u_j \|_2 > C \max\{\sigma_i, \sigma_j\} \, K^{1/4} \log^{1/4}\!\left( \frac{F}{w_{\min}} \right).$
Under this assumption, a simple spectral algorithm with running time polynomial in both $F$ and $K$ correctly clusters the samples.
The k-means algorithm
There is a large number of algorithms for finding the (approximately) correct clustering of a GMM; nevertheless, many practitioners stick with the k-means algorithm because of its simplicity and its successful application in various fields.
The objective function of k-means
Objective function: the so-called sum-of-squares distortion
$D(V, \mathcal{I}) := \sum_{k=1}^{K} \sum_{n \in I_k} \| v_n - c_k \|_2^2,$
where $I_k$ is the index set of the $k$-th cluster and $c_k := \frac{1}{|I_k|} \sum_{n \in I_k} v_n$ is the centroid of the $k$-th cluster.
Goal: find an optimal clustering $\mathcal{I}^{\mathrm{opt}}$ that satisfies $D(V, \mathcal{I}^{\mathrm{opt}}) = \min_{\mathcal{I}} D(V, \mathcal{I}) =: D^*(V)$.
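A minimal numpy sketch of the distortion $D(V, \mathcal{I})$, assuming the data matrix is $F \times N$ and the clustering is given as a list of index sets (finding $\mathcal{I}^{\mathrm{opt}}$ itself is the hard part).

```python
import numpy as np

def distortion(V, clusters):
    """Sum-of-squares distortion D(V, I) for an F x N data matrix V
    and a clustering given as a list of index arrays I_1, ..., I_K."""
    total = 0.0
    for idx in clusters:
        c_k = V[:, idx].mean(axis=1, keepdims=True)   # centroid of the k-th cluster
        total += np.sum((V[:, idx] - c_k) ** 2)       # squared distances to the centroid
    return total
```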
k-means algorithm
[Sequence of figures illustrating the iterations of the k-means algorithm.]
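For reference, a compact numpy sketch of plain Lloyd-style k-means of the kind these figures illustrate; the random initialization is a simplification and is not the specific variant analyzed in the results that follow, and the data matrix is assumed to be $F \times N$.

```python
import numpy as np

def kmeans(V, K, n_iter=100, seed=None):
    """Plain Lloyd iterations; returns cluster labels and an F x K matrix of centroids."""
    rng = np.random.default_rng(seed)
    N = V.shape[1]
    centroids = V[:, rng.choice(N, size=K, replace=False)].astype(float)
    labels = np.full(N, -1)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        d2 = ((V[:, :, None] - centroids[:, None, :]) ** 2).sum(axis=0)   # N x K
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                      # assignments stabilized
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster.
        for k in range(K):
            if np.any(labels == k):                    # keep the old centroid if empty
                centroids[:, k] = V[:, labels == k].mean(axis=1)
    return labels, centroids
```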
Using k-means to learn GMM?
Can we simply use k-means to learn the correct clustering of a GMM?
Yes! Kumar and Kannan [2010] showed that if the data points satisfy a proximity condition, i.e., when they are independently generated from a GMM obeying a certain separability assumption, then the k-means algorithm with a proper initialization correctly clusters nearly all data points with high probability.
Using k-means to learn GMM?
What is the key condition that must be satisfied for k-means to learn the parameters of a GMM?
The correct clustering $\approx$ any optimal clustering.
Main contributions
We prove that if the data points are generated from a $K$-component spherical GMM satisfying the non-degeneracy condition and a separability assumption, then:
The correct clustering $\approx$ any optimal clustering.
Main contributions
We also prove that if the data points are generated from a $K$-component spherical GMM, projected onto a low-dimensional space, and satisfy the non-degeneracy condition and an even weaker separability assumption, then:
The correct clustering $\approx$ any optimal clustering for the dimensionality-reduced dataset.
Advantages of dimensionality reduction
- Significantly faster running time
- Reduced memory usage
- Weaker separability assumption
- Other advantages
Lower bound of distortion
Let $Z$ be the centralized data matrix of $V$ and denote $S = Z^T Z$. According to Ding and He [2004], for any $K$-clustering $\mathcal{I}$,
$D(V, \mathcal{I}) \geq \mathrm{tr}(S) - \sum_{k=1}^{K-1} \lambda_k(S),$
where $\lambda_1(S) \geq \lambda_2(S) \geq \cdots \geq 0$ are the sorted eigenvalues of $S$.
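A small numpy sketch of this lower bound. It uses the fact that the nonzero eigenvalues of $S = Z^T Z$ are the squared singular values of $Z$, so both $\mathrm{tr}(S)$ and the top $K-1$ eigenvalues can be read off an SVD of the centralized data matrix.

```python
import numpy as np

def distortion_lower_bound(V, K):
    """Ding-He lower bound tr(S) - sum of the top K-1 eigenvalues of S, with S = Z^T Z."""
    Z = V - V.mean(axis=1, keepdims=True)              # centralize the F x N data matrix
    sq_sv = np.linalg.svd(Z, compute_uv=False) ** 2    # nonzero eigenvalues of S, descending
    return sq_sv.sum() - sq_sv[:K - 1].sum()
```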
Misclassification error (ME) distance
Definition 3 (ME distance)
The misclassification error distance of any two $K$-clusterings $\mathcal{I}^1 := \{I^1_1, I^1_2, \ldots, I^1_K\}$ and $\mathcal{I}^2 := \{I^2_1, I^2_2, \ldots, I^2_K\}$ is defined as
$d(\mathcal{I}^1, \mathcal{I}^2) := 1 - \frac{1}{N} \max_{\pi \in P_K} \sum_{k=1}^{K} |I^1_k \cap I^2_{\pi(k)}|,$
where the maximization over $\pi \in P_K$ means that the distance is minimized over all permutations of the labels $\{1, 2, \ldots, K\}$.
Meilă [2005]: the ME distance defined above is indeed a metric.
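A direct numpy sketch of the ME distance for clusterings given as label vectors. It enumerates all $K!$ permutations, which is fine for the small $K$ used here; for larger $K$ the maximization could instead be solved as an assignment problem.

```python
import numpy as np
from itertools import permutations

def me_distance(labels1, labels2, K):
    """Misclassification error distance between two K-clusterings given as label vectors."""
    N = len(labels1)
    overlap = np.zeros((K, K), dtype=int)              # overlap[i, j] = |I^1_i ∩ I^2_j|
    for a, b in zip(labels1, labels2):
        overlap[a, b] += 1
    best = max(sum(overlap[k, pi[k]] for k in range(K)) for pi in permutations(range(K)))
    return 1.0 - best / N
```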
Important lemma
Lemma 1 (Meilă, 2006)
Given a partition $\mathcal{I} := \{I_1, I_2, \ldots, I_K\}$ and a dataset $V$, let
$p_{\max} := \max_k \frac{1}{N} |I_k|, \quad p_{\min} := \min_k \frac{1}{N} |I_k|,$
and
$\delta := \frac{D(V, \mathcal{I}) - D^*(V)}{\lambda_{K-1}(S) - \lambda_K(S)}, \quad \text{where } D^*(V) := \min_{\mathcal{I}'} D(V, \mathcal{I}').$
If $\delta \leq \frac{K-1}{2}$ and
$\tau(\delta) := 2\delta \left( 1 - \frac{\delta}{K-1} \right) \leq p_{\min},$
then $d(\mathcal{I}, \mathcal{I}^{\mathrm{opt}}) \leq p_{\max}\, \tau(\delta)$.
Definitions
Define the increasing function
$\zeta(p) := \frac{p}{1 + \sqrt{1 - 2p/(K-1)}},$
the average variance
$\bar{\sigma}^2 := \sum_{k=1}^{K} w_k \sigma_k^2,$
and the minimum eigenvalue
$\lambda_{\min} := \lambda_{K-1}\!\left( \sum_{k=1}^{K} w_k (u_k - \bar{u})(u_k - \bar{u})^T \right),$
where $\bar{u} := \sum_{k=1}^{K} w_k u_k$ is the overall mean.
Theorem for original datasets
Theorem 1
Let $V \in \mathbb{R}^{F \times N}$ be a dataset consisting of samples generated from a $K$-component spherical GMM ($N > F > K$) satisfying the non-degeneracy condition. Let
$w_{\min} := \min_k w_k, \quad w_{\max} := \max_k w_k,$
and assume
$\delta_0 := (K-1) \frac{\bar{\sigma}^2}{\lambda_{\min}} < \zeta(w_{\min}).$
Then, for sufficiently large $N$, w.h.p.,
$d(\text{correct}, \text{optimal}) \leq \tau(\delta_0)\, w_{\max}.$
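A numpy sketch that checks the separability assumption $\delta_0 < \zeta(w_{\min})$ and evaluates the resulting bound $\tau(\delta_0)\, w_{\max}$ for given mixture parameters (means as a $K \times F$ array); this is only a direct transcription of the formulas above, under the stated assumptions.

```python
import numpy as np

def zeta(p, K):
    """zeta(p) = p / (1 + sqrt(1 - 2p/(K-1)))."""
    return p / (1.0 + np.sqrt(1.0 - 2.0 * p / (K - 1)))

def tau(delta, K):
    """tau(delta) = 2*delta*(1 - delta/(K-1)) from Lemma 1."""
    return 2.0 * delta * (1.0 - delta / (K - 1))

def theorem1_bound(w, means, sigmas):
    """Compute delta_0 = (K-1)*sigma_bar^2/lambda_min and, if separable, tau(delta_0)*w_max."""
    K = len(w)
    sigma_bar2 = np.sum(w * sigmas ** 2)                         # average variance
    u_bar = w @ means                                            # overall mean
    M = sum(w[k] * np.outer(means[k] - u_bar, means[k] - u_bar) for k in range(K))
    lam_min = np.sort(np.linalg.eigvalsh(M))[::-1][K - 2]        # lambda_{K-1} of M
    delta0 = (K - 1) * sigma_bar2 / lam_min
    separable = delta0 < zeta(np.min(w), K)
    return delta0, separable, (tau(delta0, K) * np.max(w) if separable else None)
```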
Remark for separability assumption
Remark 1
The condition $\delta_0 < \zeta(w_{\min})$ can be considered as a separability assumption. For example, $K = 2$ implies that $\lambda_{\min} = w_1 w_2 \| u_1 - u_2 \|_2^2$, and the condition becomes
$\| u_1 - u_2 \|_2 > \frac{\bar{\sigma}}{\sqrt{w_1 w_2\, \zeta(w_{\min})}}.$
Remark for non-degeneracy condition
Remark 2
The non-degeneracy condition is used to ensure that $\lambda_{\min} > 0$. For $K = 2$, we have $\lambda_{\min} = w_1 w_2 \| u_1 - u_2 \|_2^2$, so we only need the two component mean vectors to be distinct; we do not need them to be linearly independent.
Theorem for dimensionality-reduced datasets
Theorem 2
Let $V \in \mathbb{R}^{F \times N}$ be generated under the same conditions as in Theorem 1, with the separability assumption modified to
$\delta_1 := (K-1) \frac{\bar{\sigma}^2}{\lambda_{\min} + \bar{\sigma}^2} < \zeta(w_{\min}),$
and let $\tilde{V} \in \mathbb{R}^{(K-1) \times N}$ be the post-$(K-1)$-PCA dataset of $V$. Then, for sufficiently large $N$, w.h.p.,
$d(\text{correct}, \text{optimal}) \leq \tau(\delta_1)\, w_{\max}.$
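A minimal numpy sketch of the $(K-1)$-PCA step: centralize the data and project onto the top $K-1$ left singular vectors; the exact PCA/SVD convention used in the talk may differ in details.

```python
import numpy as np

def pca_project(V, dim):
    """Project an F x N dataset onto its top `dim` principal directions."""
    Z = V - V.mean(axis=1, keepdims=True)             # centralize
    U, _, _ = np.linalg.svd(Z, full_matrices=False)   # left singular vectors of Z
    return U[:, :dim].T @ Z                           # dim x N dimensionality-reduced data

# For a K-component GMM, Theorem 2 uses dim = K - 1.
```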
Upper bound for ME distance between optimal clusterings
Combining the results of Theorem 1 and Theorem 2 via the triangle inequality:
Corollary 1
Let $V \in \mathbb{R}^{F \times N}$ be generated under the same conditions as in Theorem 1 and let $\tilde{V}$ be the post-$(K-1)$-PCA dataset of $V$. Then, for sufficiently large $N$, w.h.p.,
$d(\text{optimal for } V, \text{optimal for } \tilde{V}) \leq \left( \tau(\delta_0) + \tau(\delta_1) \right) w_{\max}.$
Parameter settings
$K = 2$; for all $k = 1, 2$, we set either
$\sigma_k^2 = \frac{\lambda_{\min}\, \zeta(w_{\min} - \varepsilon)}{4(K-1)}, \quad \text{corresponding to } \frac{\delta_0}{\zeta(w_{\min})} \approx \frac{1}{4},$
or
$\sigma_k^2 = \frac{\lambda_{\min}\, \zeta(w_{\min} - \varepsilon)}{K-1}, \quad \text{corresponding to } \frac{\delta_0}{\zeta(w_{\min})} \approx 1,$
where $\varepsilon = 10^{-6}$. The former corresponds to well-separated clusters; the latter corresponds to moderately well-separated clusters.
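A short numpy sketch of how such variances could be set in a synthetic experiment; the weights and means below are illustrative assumptions, not the values used in the talk.

```python
import numpy as np

K, eps = 2, 1e-6
w = np.array([0.4, 0.6])                               # assumed mixing weights
means = np.array([[0.0, 0.0, 0.0],
                  [3.0, 3.0, 3.0]])                    # assumed component means (K x F)

u_bar = w @ means
M = sum(w[k] * np.outer(means[k] - u_bar, means[k] - u_bar) for k in range(K))
lam_min = np.sort(np.linalg.eigvalsh(M))[::-1][K - 2]  # lambda_min from the definitions

zeta = lambda p: p / (1.0 + np.sqrt(1.0 - 2.0 * p / (K - 1)))
sigma2_well_sep = lam_min * zeta(w.min() - eps) / (4 * (K - 1))  # delta_0/zeta(w_min) ~ 1/4
sigma2_moderate = lam_min * zeta(w.min() - eps) / (K - 1)        # delta_0/zeta(w_min) ~ 1
```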
Visualization of post-2-svd datasets δ 0 /ζ (w min) 1/4 6 δ 0 /ζ (w min) 1 Cluster 1 Cluster 2 Component Mean 4 8 Cluster 1 Cluster 2 Component Mean 6 4 2 2 0 0 2 2 4 4 6 6 8 6 4 2 0 2 8 10 8 6 4 2 0 2 4 31/35
Original datasets
$d_{\mathrm{org}} := d(\mathcal{I}, \mathcal{I}^{\mathrm{opt}})$, with upper bound $\tau(\delta_0)\, w_{\max}$.
$\delta_0^{\mathrm{emp}} := \frac{D(V, \mathcal{I}) - D^*(V)}{\lambda_{K-1}(S) - \lambda_K(S)}$ is an approximation of $\delta_0$, and $d_{\mathrm{org}}^{\mathrm{emp}} := \tau(\delta_0^{\mathrm{emp}})\, p_{\max}$ is an approximation of this upper bound.
[Figure: $d_{\mathrm{org}}$, its upper bound, and $d_{\mathrm{org}}^{\mathrm{emp}}$ plotted against $N$; true distances and upper bounds for original datasets.]
Dimensionality-reduced datasets
[Figure: $d_{\mathrm{pca}}$, its upper bound, and $d_{\mathrm{pca}}^{\mathrm{emp}}$ plotted against $N$; true distances and upper bounds for post-PCA datasets.]
Comparisons of running time
[Figure: running time of k-means on the original and post-PCA datasets, plotted against $N$.]
Further extensions
- Randomized SVD instead of exact SVD;
- Random projections;
- Non-spherical Gaussians, or even more general distributions, e.g., log-concave distributions.