The Informativeness of k-means for Learning Mixture Models


Vincent Y. F. Tan (joint work with Zhaoqiang Liu)
National University of Singapore
June 18

Gaussian distribution

For F dimensions, the Gaussian distribution of a vector $x \in \mathbb{R}^F$ is defined by
$$\mathcal{N}(x \mid u, \Sigma) = \frac{1}{(2\pi)^{F/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - u)^T \Sigma^{-1} (x - u) \right),$$
where $u$ is the mean vector and $\Sigma$ is the covariance matrix of the Gaussian.

Example (figure on slide): a two-dimensional Gaussian with a given mean vector $u$ and covariance matrix $\Sigma$.
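A minimal NumPy sketch of this density, for a quick numerical check; the mean and covariance used below are illustrative placeholders (the example values on the slide did not survive transcription):

```python
import numpy as np

def gaussian_density(x, u, Sigma):
    """Evaluate the F-dimensional Gaussian density N(x | u, Sigma) from the formula above."""
    F = u.shape[0]
    diff = x - u
    norm_const = (2 * np.pi) ** (F / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x - u)^T Sigma^{-1} (x - u)
    return np.exp(-0.5 * quad) / norm_const

# Illustrative 2-D example (placeholder values).
u = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(gaussian_density(np.array([0.5, -0.5]), u, Sigma))
```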

Gaussian mixture model (GMM)

$$P(x) = \sum_{k=1}^{K} w_k\, \mathcal{N}(x \mid u_k, \Sigma_k),$$

where $w_k$ is the mixing weight, $u_k$ is the component mean vector, and $\Sigma_k$ is the component covariance matrix; if $\Sigma_k = \sigma_k^2 I$, the GMM is said to be spherical.
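As a sketch of how such data can be generated, the following draws samples from a spherical GMM together with their true component labels; the component parameters are hypothetical and chosen only for illustration:

```python
import numpy as np

def sample_spherical_gmm(N, weights, means, sigmas, seed=0):
    """Draw N samples from a spherical GMM and return them with their true component labels."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)    # shape (K, F): one row per component mean
    sigmas = np.asarray(sigmas, dtype=float)  # component standard deviations
    labels = rng.choice(len(weights), size=N, p=weights)
    samples = means[labels] + sigmas[labels][:, None] * rng.standard_normal((N, means.shape[1]))
    return samples, labels

# A hypothetical 2-component spherical GMM in R^2.
samples, true_labels = sample_spherical_gmm(
    N=1000, weights=[0.4, 0.6], means=[[0.0, 0.0], [4.0, 4.0]], sigmas=[1.0, 0.5]
)
```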

Learning GMM

Data samples are independently generated from a GMM. The correct target clustering groups the samples according to which Gaussian component they are generated from.

Definition 1 (correct target clustering). Suppose $V := [v_1, v_2, \ldots, v_N]$ are samples independently generated from a $K$-component GMM. The correct target clustering $\mathcal{I} := \{I_1, I_2, \ldots, I_K\}$ satisfies $n \in I_k$ if and only if $v_n$ comes from the $k$-th component.

Recovering the correct target clustering thereby allows the important parameters of the GMM to be inferred.

Algorithms for learning GMM

(i) Expectation Maximization (EM): a local-search heuristic for maximum likelihood estimation in the presence of incomplete data; it cannot guarantee convergence to a global optimum.

Algorithms for learning GMM

(ii) Algorithms based on spectral decomposition and the method of moments.

Definition 2 (non-degeneracy condition). The mixture model is said to satisfy the non-degeneracy condition if the component mean vectors $u_1, \ldots, u_K$ span a $K$-dimensional subspace and the mixing weights satisfy $w_k > 0$ for all $k \in \{1, 2, \ldots, K\}$.

Algorithms for learning GMM

(iii) Algorithms proposed in theoretical computer science with provable guarantees; these need separability assumptions. Vempala and Wang [2002]: for any $i, j \in [K]$ with $i \neq j$,
$$\|u_i - u_j\|_2 > C \max\{\sigma_i, \sigma_j\}\, K^{1/4} \log^{1/4}\!\left( \frac{F}{w_{\min}} \right).$$
Under this assumption, a simple spectral algorithm with running time polynomial in both $F$ and $K$ correctly clusters the samples.

The k-means algorithm

There is a large number of algorithms for finding the (approximately) correct clustering of a GMM. Nevertheless, many practitioners stick with the k-means algorithm because of its simplicity and its successful application in various fields.

The objective function of k-means

Objective function: the so-called sum-of-squares distortion
$$D(V, \mathcal{I}) := \sum_{k=1}^{K} \sum_{n \in I_k} \|v_n - c_k\|_2^2,$$
where $I_k$ is the index set of the $k$-th cluster and $c_k := \frac{1}{|I_k|} \sum_{n \in I_k} v_n$ is the centroid of the $k$-th cluster.

Goal: find an optimal clustering $\mathcal{I}^{\mathrm{opt}}$ that satisfies
$$D(V, \mathcal{I}^{\mathrm{opt}}) = \min_{\mathcal{I}} D(V, \mathcal{I}) =: D^*(V).$$
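For concreteness, a short sketch of how the distortion $D(V, \mathcal{I})$ can be evaluated for a given clustering, assuming the $F \times N$ data layout used on these slides:

```python
import numpy as np

def distortion(V, clusters):
    """Sum-of-squares distortion D(V, I): V has shape (F, N), clusters is a list of index lists."""
    total = 0.0
    for idx in clusters:
        points = V[:, list(idx)]                       # columns belonging to the k-th cluster
        centroid = points.mean(axis=1, keepdims=True)  # c_k = (1/|I_k|) * sum of its points
        total += np.sum((points - centroid) ** 2)      # squared distances to the centroid
    return total
```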

The k-means algorithm (a sequence of slides illustrating the iterations: cluster assignments and centroids are updated alternately until convergence).
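A minimal sketch of the standard Lloyd-style k-means iteration that the slides illustrate; the random initialization used here is only one of many possible choices:

```python
import numpy as np

def kmeans(V, K, n_iter=100, seed=0):
    """Plain Lloyd's algorithm on data V of shape (F, N); returns labels and centroids."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    centroids = V[:, rng.choice(N, size=K, replace=False)].copy()  # random initial centroids
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        dists = ((V[:, :, None] - centroids[:, None, :]) ** 2).sum(axis=0)  # shape (N, K)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Update step: recompute each centroid as the mean of its cluster.
        for k in range(K):
            if np.any(labels == k):
                centroids[:, k] = V[:, labels == k].mean(axis=1)
    return labels, centroids
```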

Using k-means to learn GMM?

Can we simply use k-means to learn the correct clustering of a GMM? Yes! Kumar and Kannan [2010] showed that if the data points satisfy a proximity condition, i.e., when they are independently generated from a GMM obeying a certain separability assumption, then the k-means algorithm with a proper initialization correctly clusters nearly all data points with high probability.

Using k-means to learn GMM?

What is the key condition that must be satisfied for k-means to learn the parameters of a GMM? The correct target clustering should be close to any optimal clustering.

Main contributions

We prove that if the data points are generated from a $K$-component spherical GMM satisfying the non-degeneracy condition and a separability assumption, then the correct target clustering is close to any optimal clustering.

Main contributions

We also prove that if the data points are generated from a $K$-component spherical GMM satisfying the non-degeneracy condition and an even weaker separability assumption, and are then projected onto a low-dimensional space, the correct target clustering is close to any optimal clustering for the dimensionality-reduced dataset.

Advantages of dimensionality reduction

Significantly faster running time; reduced memory usage; weaker separability assumption; other advantages.

Lower bound of the distortion

Let $Z$ be the centralized data matrix of $V$ and denote $S = Z^T Z$. According to Ding and He [2004], for any $K$-clustering $\mathcal{I}$,
$$D(V, \mathcal{I}) \ge \underline{D}(V) := \mathrm{tr}(S) - \sum_{k=1}^{K-1} \lambda_k(S),$$
where $\lambda_1(S) \ge \lambda_2(S) \ge \cdots$ are the sorted eigenvalues of $S$.
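A sketch of how this lower bound can be evaluated numerically, using the fact that the nonzero eigenvalues of $S = Z^T Z$ are the squared singular values of $Z$:

```python
import numpy as np

def distortion_lower_bound(V, K):
    """Compute tr(S) minus the sum of the top (K - 1) eigenvalues of S = Z^T Z."""
    Z = V - V.mean(axis=1, keepdims=True)                     # centralized (F x N) data matrix
    eigvals = np.sort(np.linalg.svd(Z, compute_uv=False) ** 2)[::-1]
    return eigvals.sum() - eigvals[:K - 1].sum()
```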

Misclassification error (ME) distance

Definition 3 (ME distance). The misclassification error distance between any two $K$-clusterings $\mathcal{I}^1 := \{I_1^1, I_2^1, \ldots, I_K^1\}$ and $\mathcal{I}^2 := \{I_1^2, I_2^2, \ldots, I_K^2\}$ is defined as
$$d(\mathcal{I}^1, \mathcal{I}^2) := 1 - \frac{1}{N} \max_{\pi \in P_K} \sum_{k=1}^{K} \left| I_k^1 \cap I_{\pi(k)}^2 \right|,$$
where $P_K$ denotes the set of all permutations of the labels $\{1, 2, \ldots, K\}$, so the distance is minimized over all label permutations.

Meilă [2005]: the ME distance defined above is indeed a metric.
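A sketch of the ME distance for clusterings represented as label vectors in {0, ..., K-1}; the maximization over permutations is carried out by the Hungarian algorithm via scipy's linear_sum_assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def me_distance(labels1, labels2, K):
    """Misclassification error distance between two K-clusterings given as label vectors."""
    N = len(labels1)
    # Overlap matrix: overlap[i, j] = |I_i^1 intersect I_j^2|.
    overlap = np.zeros((K, K), dtype=int)
    for a, b in zip(labels1, labels2):
        overlap[a, b] += 1
    # The best label permutation maximizes the total overlap.
    rows, cols = linear_sum_assignment(-overlap)
    return 1.0 - overlap[rows, cols].sum() / N
```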

Important lemma

Lemma 1 (Meilă, 2006). Given a partition $\mathcal{I} := \{I_1, I_2, \ldots, I_K\}$ and a dataset $V$, let
$$p_{\max} := \max_k \frac{1}{N} |I_k|, \qquad p_{\min} := \min_k \frac{1}{N} |I_k|,$$
and
$$\delta := \frac{D(V, \mathcal{I}) - D^*(V)}{\lambda_{K-1}(S) - \lambda_K(S)}, \qquad \text{where } D^*(V) := \min_{\mathcal{I}} D(V, \mathcal{I}).$$
If $\delta \le \frac{K-1}{2}$ and
$$\tau(\delta) := 2\delta \left( 1 - \frac{\delta}{K-1} \right) \le p_{\min},$$
then $d(\mathcal{I}, \mathcal{I}^{\mathrm{opt}}) \le p_{\max}\, \tau(\delta)$.

Definitions

Define the increasing function
$$\zeta(p) := p - \frac{p^2}{K-1},$$
the average variance
$$\bar{\sigma}^2 := \sum_{k=1}^{K} w_k \sigma_k^2,$$
and the minimum eigenvalue
$$\lambda_{\min} := \lambda_{K-1}\!\left( \sum_{k=1}^{K} w_k (u_k - \bar{u})(u_k - \bar{u})^T \right),$$
where $\bar{u} := \sum_{k=1}^{K} w_k u_k$.

Theorem for original datasets

Theorem 1. Let $V \in \mathbb{R}^{F \times N}$ be a dataset consisting of samples generated from a $K$-component spherical GMM (with $N > F > K$) satisfying the non-degeneracy condition. Let
$$w_{\min} := \min_k w_k, \qquad w_{\max} := \max_k w_k,$$
and assume
$$\delta_0 := \frac{(K-1)\bar{\sigma}^2}{\lambda_{\min}} < \zeta(w_{\min}).$$
Then, for sufficiently large $N$, with high probability,
$$d(\text{correct}, \text{optimal}) \le \tau(\delta_0)\, w_{\max}.$$
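A sketch that computes the quantities $\bar{\sigma}^2$, $\lambda_{\min}$, and $\delta_0$ of Theorem 1 directly from (assumed known) GMM parameters:

```python
import numpy as np

def separability_quantities(weights, means, sigmas):
    """Compute sigma_bar^2, lambda_min, and delta_0 for a spherical GMM."""
    w = np.asarray(weights, dtype=float)
    U = np.asarray(means, dtype=float)          # shape (K, F)
    K = len(w)
    sigma_bar_sq = np.sum(w * np.asarray(sigmas, dtype=float) ** 2)
    u_bar = w @ U                               # overall mean, shape (F,)
    C = U - u_bar
    B = (C * w[:, None]).T @ C                  # sum_k w_k (u_k - u_bar)(u_k - u_bar)^T
    eigvals = np.sort(np.linalg.eigvalsh(B))[::-1]
    lam_min = eigvals[K - 2]                    # lambda_{K-1} of the matrix above
    delta_0 = (K - 1) * sigma_bar_sq / lam_min
    return sigma_bar_sq, lam_min, delta_0
```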

Remark on the separability assumption

Remark 1. The condition $\delta_0 < \zeta(w_{\min})$ can be considered a separability assumption. For example, $K = 2$ implies that $\lambda_{\min} = w_1 w_2 \|u_1 - u_2\|_2^2$, and the condition becomes
$$\|u_1 - u_2\|_2 > \frac{\bar{\sigma}}{\sqrt{w_1 w_2\, \zeta(w_{\min})}}.$$

Remark on the non-degeneracy condition

Remark 2. The non-degeneracy condition is used to ensure that $\lambda_{\min} > 0$. For $K = 2$, we have $\lambda_{\min} = w_1 w_2 \|u_1 - u_2\|_2^2$, so we only need the two component mean vectors to be distinct; we do not need them to be linearly independent.

Theorem for dimensionality-reduced datasets

Theorem 2. Let $V \in \mathbb{R}^{F \times N}$ be generated under the same conditions as in Theorem 1, with the separability assumption modified to
$$\delta_1 := \frac{(K-1)\bar{\sigma}^2}{\lambda_{\min} + \bar{\sigma}^2} < \zeta(w_{\min}),$$
and let $\tilde{V} \in \mathbb{R}^{(K-1) \times N}$ be the post-$(K-1)$-PCA dataset of $V$. Then, for sufficiently large $N$, with high probability,
$$d(\text{correct}, \text{optimal for } \tilde{V}) \le \tau(\delta_1)\, w_{\max}.$$
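One way to form the $(K-1)$-dimensional PCA projection, as a sketch; the exact construction of the post-$(K-1)$-PCA dataset on the slides (e.g., whether the data are centralized first) may differ in minor details:

```python
import numpy as np

def pca_project(V, K):
    """Project the (F x N) data matrix onto its top (K - 1) principal directions."""
    Z = V - V.mean(axis=1, keepdims=True)            # centralize the data
    U, _, _ = np.linalg.svd(Z, full_matrices=False)  # columns of U: principal directions
    return U[:, :K - 1].T @ Z                        # reduced dataset of shape (K - 1, N)
```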

Upper bound on the ME distance between optimal clusterings

Combining the results of Theorem 1 and Theorem 2 via the triangle inequality:

Corollary 1. Let $V \in \mathbb{R}^{F \times N}$ be generated under the same conditions as in Theorem 1, and let $\tilde{V}$ be the post-$(K-1)$-PCA dataset of $V$. Then, for sufficiently large $N$, with high probability,
$$d(\text{optimal for } V, \text{optimal for } \tilde{V}) \le \left( \tau(\delta_0) + \tau(\delta_1) \right) w_{\max}.$$

Parameter settings

$K = 2$; for all $k = 1, 2$, we set either
$$\sigma_k^2 = \frac{\lambda_{\min}\, \zeta(w_{\min} - \epsilon)}{4(K-1)}, \quad \text{corresponding to} \quad \frac{\delta_0}{\zeta(w_{\min})} \approx \frac{1}{4},$$
or
$$\sigma_k^2 = \frac{\lambda_{\min}\, \zeta(w_{\min} - \epsilon)}{K-1}, \quad \text{corresponding to} \quad \frac{\delta_0}{\zeta(w_{\min})} \approx 1,$$
where $\epsilon > 0$ is a small constant. The former corresponds to well-separated clusters; the latter corresponds to moderately well-separated clusters.
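An end-to-end sketch in the spirit of these experiments: sample from a two-component spherical GMM, run k-means, and report the ME distance between the correct clustering and the k-means clustering. The parameter values below are illustrative placeholders rather than the exact settings above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
K, F, N = 2, 10, 5000
weights = np.array([0.5, 0.5])
means = np.stack([np.zeros(F), 3.0 * np.ones(F) / np.sqrt(F)])  # separation ||u_1 - u_2|| = 3
sigmas = np.array([1.0, 1.0])                                   # placeholder variances

# Rows are samples here (scikit-learn convention), unlike the F x N layout on the slides.
labels_true = rng.choice(K, size=N, p=weights)
V = means[labels_true] + sigmas[labels_true][:, None] * rng.standard_normal((N, F))

labels_km = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(V)

# ME distance between the correct clustering and the k-means clustering.
overlap = np.zeros((K, K), dtype=int)
for a, b in zip(labels_true, labels_km):
    overlap[a, b] += 1
rows, cols = linear_sum_assignment(-overlap)
print("ME distance:", 1.0 - overlap[rows, cols].sum() / N)
```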

Visualization of post-2-SVD datasets

Figure: scatter plots of the two clusters and the component means for $\delta_0/\zeta(w_{\min}) \approx 1/4$ (left) and $\delta_0/\zeta(w_{\min}) \approx 1$ (right).

Original datasets

$d_{\mathrm{org}} := d(\mathcal{I}, \mathcal{I}^{\mathrm{opt}})$ is the true ME distance and $\bar{d}_{\mathrm{org}} := \tau(\delta_0)\, w_{\max}$ is its upper bound. The quantity $\delta_0^{\mathrm{emp}} := \frac{D(V, \mathcal{I}) - D^*(V)}{\lambda_{K-1}(S) - \lambda_K(S)}$ is an approximation of $\delta_0$, and $\bar{d}_{\mathrm{org}}^{\mathrm{emp}} := \tau(\delta_0^{\mathrm{emp}})\, p_{\max}$ is an approximation of $\bar{d}_{\mathrm{org}}$.

Figure: true distances and upper bounds for the original datasets (ME distance versus $N$ for both parameter settings).

Dimensionality-reduced datasets

Figure: true distances and upper bounds ($d_{\mathrm{pca}}$, $\bar{d}_{\mathrm{pca}}$, $\bar{d}_{\mathrm{pca}}^{\mathrm{emp}}$) for the post-PCA datasets (ME distance versus $N$ for both parameter settings).

Comparisons of running time

Figure: running time versus $N$ for the original and post-PCA datasets.

Further extensions

Randomized SVD instead of exact SVD; random projection; non-spherical Gaussians or even more general distributions, e.g., log-concave distributions.
