The Informativeness of k-means for Learning Mixture Models

The Informativeness of k-means for Learning Mixture Models
Vincent Y. F. Tan (joint work with Zhaoqiang Liu)
National University of Singapore
June 18, 2018

Gaussian distribution

For $F$ dimensions, the Gaussian distribution of a vector $x \in \mathbb{R}^F$ is defined by
$$\mathcal{N}(x \mid u, \Sigma) = \frac{1}{(2\pi)^{F/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-u)^T \Sigma^{-1}(x-u)\right),$$
where $u$ is the mean vector and $\Sigma$ is the covariance matrix of the Gaussian.

Example: mean $u = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$ and covariance matrix $\Sigma = \begin{bmatrix} 0.25 & 0.3 \\ 0.3 & 1.0 \end{bmatrix}$.
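As a quick sanity check (not from the slides), the example density can be evaluated numerically; the evaluation point $x = (0.5, -0.5)$ below is an arbitrary choice of ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Density of the example Gaussian: u = [0, 0], Sigma = [[0.25, 0.3], [0.3, 1.0]].
u = np.array([0.0, 0.0])
Sigma = np.array([[0.25, 0.3], [0.3, 1.0]])
density = multivariate_normal(mean=u, cov=Sigma)
print(density.pdf([0.5, -0.5]))   # N(x | u, Sigma) at x = (0.5, -0.5)
```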

Gaussian mixture model (GMM)

$$P(x) = \sum_{k=1}^{K} w_k\, \mathcal{N}(x \mid u_k, \Sigma_k).$$

- $w_k$: mixing weight
- $u_k$: component mean vector
- $\Sigma_k$: component covariance matrix; if $\Sigma_k = \sigma_k^2 I$, the GMM is said to be spherical
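To make the model concrete, here is a minimal sketch (our own, not from the slides) of sampling from a spherical GMM; the names `sample_spherical_gmm`, `weights`, `means`, and `sigmas` are ours.

```python
import numpy as np

def sample_spherical_gmm(N, weights, means, sigmas, rng=None):
    """Draw N samples from a spherical GMM with the given mixing weights,
    component means (K x F) and per-component standard deviations (K,).
    Returns the samples (N x F) and the true component labels (N,)."""
    rng = np.random.default_rng(rng)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    K, F = means.shape
    # Choose a component for each sample according to the mixing weights.
    labels = rng.choice(K, size=N, p=weights)
    # Spherical covariance sigma_k^2 * I: add isotropic Gaussian noise to the mean.
    samples = means[labels] + sigmas[labels, None] * rng.standard_normal((N, F))
    return samples, labels

# Example: two well-separated spherical components in R^2.
V, labels = sample_spherical_gmm(
    1000, weights=[0.4, 0.6], means=[[0.0, 0.0], [5.0, 5.0]], sigmas=[0.5, 1.0])
```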

Learning GMM

- Data samples are independently generated from a GMM.
- Correct target clustering: the clustering of the samples according to which Gaussian component they are generated from.

Definition 1 (correct target clustering)
Suppose $V := [v_1, v_2, \ldots, v_N]$ are samples independently generated from a $K$-component GMM. The correct target clustering $\mathcal{I} := \{I_1, I_2, \ldots, I_K\}$ satisfies $n \in I_k$ iff $v_n$ comes from the $k$-th component.

Recovering the correct target clustering thereby allows us to infer the important parameters of the GMM.

Algorithms for learning GMM

i) Expectation-Maximization (EM)
- A local-search heuristic approach for maximum-likelihood estimation in the presence of incomplete data;
- cannot guarantee convergence to a global optimum.

Algorithms for learning GMM

ii) Algorithms based on spectral decomposition and the method of moments.

Definition 2 (non-degeneracy condition)
The mixture model is said to satisfy the non-degeneracy condition if the component mean vectors $u_1, \ldots, u_K$ span a $K$-dimensional subspace and the mixing weights satisfy $w_k > 0$ for all $k \in \{1, 2, \ldots, K\}$.

Algorithms for learning GMM

iii) Algorithms proposed in theoretical computer science with provable guarantees; these need to assume separability conditions.

Vempala and Wang [2002]: for any $i, j \in [K]$ with $i \neq j$,
$$\|u_i - u_j\|_2 > C \max\{\sigma_i, \sigma_j\}\, K^{1/4} \log^{1/4}\!\left(\frac{F}{w_{\min}}\right).$$

Under this condition, a simple spectral algorithm with running time polynomial in both $F$ and $K$ correctly clusters the samples.

The k-means algorithm

- There is a large number of algorithms for finding the (approximately) correct clustering of a GMM.
- Nevertheless, many practitioners stick with the k-means algorithm because of its simplicity and its successful application in various fields.

The objective function of k-means

Objective function: the so-called sum-of-squares distortion
$$D(V, \mathcal{I}) := \sum_{k=1}^{K} \sum_{n \in I_k} \|v_n - c_k\|_2^2,$$
where $I_k$ is the index set of the $k$-th cluster and $c_k := \frac{1}{|I_k|} \sum_{n \in I_k} v_n$ is the centroid of the $k$-th cluster.

Goal: find an optimal clustering $\mathcal{I}_{\mathrm{opt}}$ that satisfies
$$D(V, \mathcal{I}_{\mathrm{opt}}) = \min_{\mathcal{I}} D(V, \mathcal{I}) =: D^*(V).$$
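A minimal sketch of the distortion computation (our own illustration; the clustering is represented as a list of index arrays, and samples are the columns of $V$):

```python
import numpy as np

def distortion(V, clusters):
    """Sum-of-squares distortion D(V, I) for a dataset V (F x N, columns are
    samples) and a clustering given as a list of index arrays I_1, ..., I_K."""
    total = 0.0
    for idx in clusters:
        points = V[:, idx]                          # samples in the k-th cluster
        centroid = points.mean(axis=1, keepdims=True)
        total += np.sum((points - centroid) ** 2)   # squared distances to the centroid
    return total
```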

k-means algorithm

[Graphical illustration of successive k-means iterations, shown over several slides.]
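For concreteness, a compact sketch of the standard Lloyd iterations that the illustration depicts (our own code; random initialization from the data points is an arbitrary choice):

```python
import numpy as np

def kmeans(V, K, n_iter=100, rng=None):
    """Lloyd's algorithm on a dataset V (F x N, columns are samples).
    Returns the centroids (F x K) and the cluster label of each sample."""
    rng = np.random.default_rng(rng)
    F, N = V.shape
    # Initialize centroids with K distinct data points chosen at random.
    centroids = V[:, rng.choice(N, size=K, replace=False)].astype(float)
    labels = np.zeros(N, dtype=int)
    for it in range(n_iter):
        # Assignment step: each sample goes to its nearest centroid.
        dists = np.linalg.norm(V[:, :, None] - centroids[:, None, :], axis=0)
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # assignments stable: converged
        labels = new_labels
        # Update step: recompute each centroid as the mean of its cluster.
        for k in range(K):
            if np.any(labels == k):
                centroids[:, k] = V[:, labels == k].mean(axis=1)
    return centroids, labels
```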

Using k-means to learn a GMM?

Can we simply use k-means to learn the correct clustering of a GMM? Yes!

Kumar and Kannan [2010] showed that if the data points satisfy a proximity condition, i.e., when they are independently generated from a GMM satisfying a certain separability assumption, then the k-means algorithm with a proper initialization correctly clusters nearly all data points with high probability.

Using k-means to learn a GMM?

What is the key condition to be satisfied for k-means to learn the parameters of a GMM?

The correct clustering ≈ any optimal clustering.

Main contributions

We prove that if
- the data points are generated from a $K$-component spherical GMM, and
- the non-degeneracy condition and a separability assumption hold,
then: the correct clustering ≈ any optimal clustering.

Main contributions

We also prove that if
- the data points are generated from a $K$-component spherical GMM,
- projected onto a low-dimensional space, and
- the non-degeneracy condition and an even weaker separability assumption hold,
then: the correct clustering ≈ any optimal clustering for the dimensionality-reduced dataset.

Advantages of dimensionality reduction

- Significantly faster running time
- Reduced memory usage
- Weaker separability assumption
- Other advantages

Lower bound of distortion

Let $Z$ be the centralized data matrix of $V$ and denote $S = Z^T Z$. According to Ding and He [2004], for any $K$-clustering $\mathcal{I}$,
$$D(V, \mathcal{I}) \;\geq\; \mathrm{tr}(S) - \sum_{k=1}^{K-1} \lambda_k(S),$$
where $\lambda_1(S) \geq \lambda_2(S) \geq \cdots \geq 0$ are the sorted eigenvalues of $S$.
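A small numerical sketch of this lower bound (our own code); note that forming the $N \times N$ matrix $S$ explicitly is only sensible for modest $N$:

```python
import numpy as np

def distortion_lower_bound(V, K):
    """Ding-He lower bound tr(S) - sum of the top (K-1) eigenvalues of S,
    where S = Z^T Z and Z is the column-centralized version of V (F x N)."""
    Z = V - V.mean(axis=1, keepdims=True)    # centralize the data matrix
    S = Z.T @ Z                              # N x N Gram matrix
    eigvals = np.linalg.eigvalsh(S)[::-1]    # eigenvalues, largest first
    return np.trace(S) - eigvals[:K - 1].sum()
```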

Misclassification error (ME) distance

Definition 3 (ME distance)
The misclassification error distance between any two $K$-clusterings $\mathcal{I}^1 := \{I_1^1, I_2^1, \ldots, I_K^1\}$ and $\mathcal{I}^2 := \{I_1^2, I_2^2, \ldots, I_K^2\}$ is defined as
$$d(\mathcal{I}^1, \mathcal{I}^2) := 1 - \frac{1}{N} \max_{\pi \in P_K} \sum_{k=1}^{K} \left|I_k^1 \cap I_{\pi(k)}^2\right|,$$
where $P_K$ is the set of all permutations of the labels $\{1, 2, \ldots, K\}$, so the distance is minimized over label permutations.

Meilă [2005]: the ME distance defined above is indeed a metric.
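The maximization over permutations is an assignment problem, so it can be solved with the Hungarian algorithm; a small sketch of ours using SciPy, with clusterings represented as label vectors:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def me_distance(labels1, labels2, K):
    """Misclassification error distance between two K-clusterings,
    each given as a length-N vector of labels in {0, ..., K-1}."""
    N = len(labels1)
    # overlap[i, j] = number of samples in cluster i of clustering 1
    # and cluster j of clustering 2.
    overlap = np.zeros((K, K), dtype=int)
    for a, b in zip(labels1, labels2):
        overlap[a, b] += 1
    # Maximize the total overlap over label permutations (assignment problem).
    row, col = linear_sum_assignment(-overlap)
    return 1.0 - overlap[row, col].sum() / N
```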

Important lemma

Lemma 1 (Meilă, 2006)
Given a partition $\mathcal{I} := \{I_1, I_2, \ldots, I_K\}$ and a dataset $V$, let
$$p_{\max} := \max_k \frac{1}{N}|I_k|, \qquad p_{\min} := \min_k \frac{1}{N}|I_k|,$$
and
$$\delta := \frac{D(V, \mathcal{I}) - D^*(V)}{\lambda_{K-1}(S) - \lambda_K(S)}, \quad \text{where } D^*(V) := \min_{\mathcal{I}'} D(V, \mathcal{I}').$$
If $\delta \leq \frac{K-1}{2}$ and
$$\tau(\delta) := 2\delta\left(1 - \frac{\delta}{K-1}\right) \leq p_{\min},$$
then $d(\mathcal{I}, \mathcal{I}_{\mathrm{opt}}) \leq p_{\max}\, \tau(\delta)$.

Definitions

Define the increasing function
$$\zeta(p) := \frac{p}{1 + \sqrt{1 - 2p/(K-1)}},$$
the average variance
$$\bar{\sigma}^2 := \sum_{k=1}^{K} w_k \sigma_k^2,$$
and the minimum eigenvalue
$$\lambda_{\min} := \lambda_{K-1}\left(\sum_{k=1}^{K} w_k (u_k - \bar{u})(u_k - \bar{u})^T\right),$$
where $\bar{u} := \sum_{k=1}^{K} w_k u_k$.

Theorem for original datasets

Theorem 1
Assume:
- the dataset $V \in \mathbb{R}^{F \times N}$ consists of samples generated from a $K$-component spherical GMM (with $N > F > K$);
- the non-degeneracy condition holds;
- with $w_{\min} := \min_k w_k$ and $w_{\max} := \max_k w_k$,
$$\delta_0 := \frac{(K-1)\,\bar{\sigma}^2}{\lambda_{\min}} < \zeta(w_{\min}).$$
Then, for sufficiently large $N$, w.h.p.,
$$d(\text{correct}, \text{optimal}) \leq \tau(\delta_0)\, w_{\max}.$$
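To make the quantities in Theorem 1 concrete, a self-contained sketch (our own code, on a toy 2-component GMM of our choosing) that computes $\bar{\sigma}^2$, $\lambda_{\min}$, $\delta_0$, and the bound $\tau(\delta_0)\,w_{\max}$:

```python
import numpy as np

def zeta(p, K):
    """The increasing function in the separability condition delta_0 < zeta(w_min)."""
    return p / (1.0 + np.sqrt(1.0 - 2.0 * p / (K - 1)))

def tau(delta, K):
    """tau(delta) = 2*delta*(1 - delta/(K-1)) from Lemma 1."""
    return 2.0 * delta * (1.0 - delta / (K - 1))

def theorem1_bound(weights, means, sigmas):
    """Return (delta_0, zeta(w_min), tau(delta_0) * w_max) for a spherical GMM
    whose component means are the rows of a K x F array."""
    w = np.asarray(weights, dtype=float)
    U = np.asarray(means, dtype=float)
    K = len(w)
    sigma_bar2 = np.sum(w * np.asarray(sigmas, dtype=float) ** 2)   # average variance
    u_bar = U.T @ w                                                 # weighted mean of the means
    centered = U - u_bar
    M = (centered.T * w) @ centered       # sum_k w_k (u_k - u_bar)(u_k - u_bar)^T
    lambda_min = np.sort(np.linalg.eigvalsh(M))[::-1][K - 2]        # (K-1)-th largest eigenvalue
    delta_0 = (K - 1) * sigma_bar2 / lambda_min
    return delta_0, zeta(w.min(), K), tau(delta_0, K) * w.max()

# Toy 2-component example: well-separated means, spherical noise.
delta_0, zeta_wmin, bound = theorem1_bound(
    weights=[0.4, 0.6], means=[[0.0, 0.0], [5.0, 5.0]], sigmas=[0.5, 1.0])
print(delta_0 < zeta_wmin, bound)   # separability holds; bound on the ME distance
```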

Remark for separability assumption

Remark 1
The condition $\delta_0 < \zeta(w_{\min})$ can be considered as a separability assumption. For example, $K = 2$ implies that
$$\lambda_{\min} = w_1 w_2 \|u_1 - u_2\|_2^2,$$
and the condition becomes
$$\|u_1 - u_2\|_2 > \frac{\bar{\sigma}}{\sqrt{w_1 w_2\, \zeta(w_{\min})}}.$$

Remark for non-degeneracy condition

Remark 2
The non-degeneracy condition is used to ensure that $\lambda_{\min} > 0$. For $K = 2$, we have $\lambda_{\min} = w_1 w_2 \|u_1 - u_2\|_2^2$, so we only need the two component mean vectors to be distinct; they need not be linearly independent.

Theorem for dimensionality-reduced datasets

Theorem 2
- $V \in \mathbb{R}^{F \times N}$: generated under the same conditions as in Theorem 1, with the separability assumption modified to
$$\delta_1 := \frac{(K-1)\,\bar{\sigma}^2}{\lambda_{\min} + \bar{\sigma}^2} < \zeta(w_{\min}).$$
- $\tilde{V} \in \mathbb{R}^{(K-1) \times N}$: the post-$(K-1)$-PCA dataset of $V$.
Then, for sufficiently large $N$, w.h.p.,
$$d(\text{correct}, \text{optimal}) \leq \tau(\delta_1)\, w_{\max},$$
where the optimal clustering is now that of the dimensionality-reduced dataset $\tilde{V}$.
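A minimal sketch of the post-$(K-1)$-PCA projection (our own code; we take the principal directions of the centralized data via an SVD):

```python
import numpy as np

def post_pca_dataset(V, K):
    """Project a dataset V (F x N, columns are samples) onto its top (K-1)
    principal directions, returning the (K-1) x N dimensionality-reduced dataset."""
    Z = V - V.mean(axis=1, keepdims=True)     # centralize the columns
    # Left singular vectors of Z give the principal directions of the data.
    U_svd, _, _ = np.linalg.svd(Z, full_matrices=False)
    W = U_svd[:, :K - 1]                      # top (K-1) principal directions (F x (K-1))
    return W.T @ Z                            # (K-1) x N post-PCA dataset
```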

Upper bound for ME distance between optimal clusterings

Combining the results of Theorem 1 and Theorem 2 via the triangle inequality:

Corollary 1
- $V \in \mathbb{R}^{F \times N}$: generated under the same conditions as in Theorem 1;
- $\tilde{V}$: the post-$(K-1)$-PCA dataset of $V$.
Then, for sufficiently large $N$, w.h.p., the ME distance between an optimal clustering of $V$ and an optimal clustering of $\tilde{V}$ satisfies
$$d(\text{optimal for } V,\ \text{optimal for } \tilde{V}) \leq \left(\tau(\delta_0) + \tau(\delta_1)\right) w_{\max}.$$

Parameter settings

$K = 2$; for all $k = 1, 2$, we set either
$$\sigma_k^2 = \frac{\lambda_{\min}\,\zeta(w_{\min} - \epsilon)}{4(K-1)}, \quad \text{corresponding to } \frac{\delta_0}{\zeta(w_{\min})} \approx \frac{1}{4},$$
or
$$\sigma_k^2 = \frac{\lambda_{\min}\,\zeta(w_{\min} - \epsilon)}{K-1}, \quad \text{corresponding to } \frac{\delta_0}{\zeta(w_{\min})} \approx 1,$$
where $\epsilon = 10^{-6}$.

The former corresponds to well-separated clusters; the latter corresponds to moderately well-separated clusters.

Visualization of post-2-SVD datasets

[Scatter plots of the post-2-SVD datasets for the settings $\delta_0/\zeta(w_{\min}) \approx 1/4$ (left) and $\delta_0/\zeta(w_{\min}) \approx 1$ (right), each showing Cluster 1, Cluster 2, and the component means.]

Original datasets

$$d_{\mathrm{org}} := d(\mathcal{I}, \mathcal{I}_{\mathrm{opt}}), \qquad \bar{d}_{\mathrm{org}} := \tau(\delta_0)\, w_{\max}.$$
$$\delta_0^{\mathrm{emp}} := \frac{D(V, \mathcal{I}) - D^*(V)}{\lambda_{K-1}(S) - \lambda_K(S)}$$
is an approximation of $\delta_0$, and $d_{\mathrm{org}}^{\mathrm{emp}} := \tau(\delta_0^{\mathrm{emp}})\, p_{\max}$ is an approximation of $\bar{d}_{\mathrm{org}}$.

Figure: True distances and upper bounds for original datasets. [Plots of the ME distance versus $N$ for the two separation settings, comparing $d_{\mathrm{org}}^{\mathrm{emp}}$, $d_{\mathrm{org}}$, and $\bar{d}_{\mathrm{org}}$.]

Dimensionality-reduced datasets

Figure: True distances and upper bounds for post-PCA datasets. [Plots of the ME distance versus $N$ for the two separation settings, comparing $d_{\mathrm{pca}}^{\mathrm{emp}}$, $d_{\mathrm{pca}}$, and $\bar{d}_{\mathrm{pca}}$.]

Comparisons of running time

[Plots of running time versus $N$ for the original and post-PCA datasets, for the two separation settings.]

Further extensions

- Randomized SVD instead of exact SVD;
- random projection;
- non-spherical Gaussians or even more general distributions, e.g., log-concave distributions.