The Informativeness of k-means for Learning Mixture Models
1 The Informativeness of k-means for Learning Mixture Models. Vincent Y. F. Tan (joint work with Zhaoqiang Liu), National University of Singapore, June 18.
2 Gaussian distribution. For F dimensions, the Gaussian distribution of a vector x ∈ R^F is defined by
N(x | u, Σ) = (2π)^(−F/2) |Σ|^(−1/2) exp( −(1/2) (x − u)^T Σ^(−1) (x − u) ),
where u is the mean vector and Σ is the covariance matrix of the Gaussian. Example: mean u = [ ... ] and covariance matrix Σ = [ ... ] (numerical values shown in the slide figure).
3 Gaussian mixture model (GMM).
P(x) = Σ_{k=1}^K w_k N(x | u_k, Σ_k),
where w_k is the mixing weight, u_k the component mean vector, and Σ_k the component covariance matrix; if Σ_k = σ_k² I, the GMM is said to be spherical.
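To make the setup concrete, here is a minimal NumPy sketch (mine, not from the slides) that draws N samples from a K-component spherical GMM and records the correct target clustering; the function name and interface are illustrative only.

```python
import numpy as np

def sample_spherical_gmm(N, weights, means, sigmas, rng=None):
    """Draw N samples from a spherical GMM; return the F x N data matrix and true labels."""
    rng = np.random.default_rng(rng)
    weights = np.asarray(weights, dtype=float)   # mixing weights w_k, summing to 1
    means = np.asarray(means, dtype=float)       # K x F component mean vectors u_k
    sigmas = np.asarray(sigmas, dtype=float)     # per-component standard deviations sigma_k
    K, F = means.shape
    labels = rng.choice(K, size=N, p=weights)             # which component generated each sample
    noise = rng.standard_normal((N, F)) * sigmas[labels, None]
    samples = means[labels] + noise                       # v_n = u_k + sigma_k * z_n
    return samples.T, labels                              # V is F x N, as in the slides

# Example: K = 2 components in F = 2 dimensions.
V, labels = sample_spherical_gmm(
    N=1000, weights=[0.4, 0.6],
    means=[[0.0, 0.0], [5.0, 5.0]], sigmas=[1.0, 1.5], rng=0)
```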
4–7 Learning GMM. Data samples are independently generated from a GMM; the goal is the correct target clustering of the samples according to which Gaussian component generated them. Definition 1 (correct target clustering). Suppose V := [v_1, v_2, ..., v_N] are samples independently generated from a K-component GMM. The correct target clustering I := {I_1, I_2, ..., I_K} satisfies n ∈ I_k iff v_n comes from the k-th component. Recovering this clustering then allows the important parameters of the GMM to be inferred.
8 Algorithms for learning GMM. i) Expectation Maximization (EM): a local-search heuristic for maximum-likelihood estimation in the presence of incomplete data; convergence to the global optimum cannot be guaranteed.
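As an illustration of the EM approach (not part of the slides), scikit-learn's GaussianMixture runs EM for a spherical GMM; because EM only finds local optima, several restarts are used. The data below are a toy stand-in, assuming scikit-learn is available.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two spherical components in the plane (sklearn expects N x F).
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (400, 2)),
               rng.normal([5.0, 5.0], 1.5, (600, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="spherical",
                      n_init=5, random_state=0)   # several restarts: EM only reaches local optima
est_labels = gmm.fit_predict(X)
print(gmm.weights_)        # estimated mixing weights w_k
print(gmm.means_)          # estimated component means u_k
print(gmm.covariances_)    # estimated variances sigma_k^2 (one per component, spherical)
```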
9 Algorithms for learning GMM. ii) Algorithms based on spectral decomposition and the method of moments. Definition 2 (non-degeneracy condition). The mixture model is said to satisfy a non-degeneracy condition if the component mean vectors u_1, ..., u_K span a K-dimensional subspace and the mixing weights satisfy w_k > 0 for all k ∈ {1, 2, ..., K}.
10–11 Algorithms for learning GMM. iii) Algorithms proposed in theoretical computer science with guarantees; these need separability assumptions. Vempala and Wang [2002]: for any i, j ∈ [K] with i ≠ j,
||u_i − u_j||_2 > C max{σ_i, σ_j} K^(1/4) log^(1/4)(F / w_min).
Under this assumption, a simple spectral algorithm with running time polynomial in both F and K works well for correctly clustering the samples.
12–13 The k-means algorithm. There is a large number of algorithms for finding the (approximately) correct clustering of a GMM, but many practitioners stick with the k-means algorithm because of its simplicity and its successful applications in various fields.
14–15 The objective function of k-means. The objective function is the so-called sum-of-squares distortion
D(V, I) := Σ_{k=1}^K Σ_{n∈I_k} ||v_n − c_k||_2²,
where I_k is the index set of the k-th cluster and c_k := (1/|I_k|) Σ_{n∈I_k} v_n is the centroid of the k-th cluster. The aim is to find an optimal clustering I_opt that satisfies D(V, I_opt) = min_I D(V, I) =: D*(V).
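A direct NumPy evaluation of the distortion, as a sketch using the slide's notation (V is F x N and the clustering is passed as a label vector); the helper name is my own.

```python
import numpy as np

def distortion(V, labels, K):
    """Sum-of-squares distortion D(V, I) for an F x N data matrix and labels in {0, ..., K-1}."""
    D = 0.0
    for k in range(K):
        cluster = V[:, labels == k]                      # points in the k-th cluster
        if cluster.shape[1] == 0:
            continue
        centroid = cluster.mean(axis=1, keepdims=True)   # c_k
        D += np.sum((cluster - centroid) ** 2)           # sum of ||v_n - c_k||_2^2
    return D
```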
16–21 k-means algorithm. [Six figure slides illustrating the k-means algorithm step by step.]
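The figure slides walk through the iterations pictorially; the sketch below (my own, not the authors' code) implements the standard Lloyd iteration they depict: assign each point to its nearest centre, then recompute the centres, until the assignment stops changing.

```python
import numpy as np

def kmeans(V, K, n_iter=100, rng=None):
    """Lloyd's k-means on an F x N data matrix; returns labels and the F x K centres."""
    rng = np.random.default_rng(rng)
    N = V.shape[1]
    centres = V[:, rng.choice(N, K, replace=False)].astype(float)  # initialise with K data points
    labels = np.zeros(N, dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest centre in squared Euclidean distance.
        dists = ((V[:, :, None] - centres[:, None, :]) ** 2).sum(axis=0)   # N x K
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: recompute each centre as the mean of its cluster.
        for k in range(K):
            if np.any(labels == k):
                centres[:, k] = V[:, labels == k].mean(axis=1)
    return labels, centres
```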
22–24 Using k-means to learn GMM? Can we simply use k-means to learn the correct clustering of a GMM? Yes! Kumar and Kannan [2010] showed that if the data points satisfy a proximity condition, i.e., when they are independently generated from a GMM satisfying a certain separability assumption, then the k-means algorithm with a proper initialization correctly clusters nearly all data points with high probability.
25–26 Using k-means to learn GMM? What is the key condition to be satisfied for k-means to learn the parameters of a GMM? The correct clustering ≈ any optimal clustering.
27 Main contributions. We prove that if the data points are generated from a K-component spherical GMM satisfying the non-degeneracy condition and a separability assumption, then: the correct clustering ≈ any optimal clustering.
28 Main contributions. We also prove that if the data points are generated from a K-component spherical GMM, projected onto a low-dimensional space, and the non-degeneracy condition and an even weaker separability assumption hold, then: the correct clustering ≈ any optimal clustering for the dimensionality-reduced dataset.
29 Advantages of dimensionality reduction: significantly faster running time; reduced memory usage; a weaker separability assumption; other advantages.
30 Lower bound of distortion. Let Z be the centralized data matrix of V and denote S := Z^T Z. According to Ding and He [2004], for any K-clustering I,
D(V, I) ≥ tr(S) − Σ_{k=1}^{K−1} λ_k(S),
where λ_1(S) ≥ λ_2(S) ≥ ... are the sorted eigenvalues of S.
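A sketch of how this lower bound can be evaluated numerically; it assumes, as in the slide, that Z centres the columns of V, and it obtains the eigenvalues of S through the singular values of Z.

```python
import numpy as np

def distortion_lower_bound(V, K):
    """Ding-He lower bound: tr(S) minus the K-1 largest eigenvalues of S, with S = Z^T Z."""
    Z = V - V.mean(axis=1, keepdims=True)            # centre each feature across the N samples
    # Nonzero eigenvalues of S = Z^T Z equal the squared singular values of Z.
    sq_singular = np.linalg.svd(Z, compute_uv=False) ** 2
    trace_S = np.sum(Z ** 2)                         # tr(S) = sum of squared entries of Z
    return trace_S - np.sort(sq_singular)[::-1][:K - 1].sum()
```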
31 Misclassification error (ME) distance. Definition 3 (ME distance). The misclassification error distance between any two K-clusterings I¹ := {I¹_1, I¹_2, ..., I¹_K} and I² := {I²_1, I²_2, ..., I²_K} is defined as
d(I¹, I²) := 1 − (1/N) max_{π ∈ P_K} Σ_{k=1}^K |I¹_k ∩ I²_{π(k)}|,
where the maximization over π ∈ P_K means that the distance is minimized over all permutations of the labels {1, 2, ..., K}. Meilă [2005]: the ME distance defined above is indeed a metric.
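For small K the ME distance can be computed by brute force over label permutations, as in the sketch below (for large K one would use an assignment solver instead); the function is illustrative and not from the slides.

```python
import numpy as np
from itertools import permutations

def me_distance(labels1, labels2, K):
    """Misclassification-error distance between two K-clusterings given as label vectors."""
    N = len(labels1)
    # Confusion counts: overlap[j, k] = |I^1_j intersected with I^2_k|.
    overlap = np.zeros((K, K), dtype=int)
    for a, b in zip(labels1, labels2):
        overlap[a, b] += 1
    # Best total overlap over all relabellings of the second clustering.
    best = max(sum(overlap[k, perm[k]] for k in range(K))
               for perm in permutations(range(K)))
    return 1.0 - best / N
```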
32–34 Important lemma. Lemma 1 (Meilă, 2006). Given a partition I := {I_1, I_2, ..., I_K} and a dataset V, let
p_max := max_k |I_k| / N,  p_min := min_k |I_k| / N,  and  δ := (D(V, I) − D*(V)) / (λ_{K−1}(S) − λ_K(S)),
where D*(V) := min_I D(V, I). If δ ≤ (K − 1)/2 and
τ(δ) := 2δ (1 − δ/(K − 1)) ≤ p_min,
then d(I, I_opt) ≤ p_max τ(δ).
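A small helper (a sketch, not from the slides) that evaluates δ, τ(δ) and the resulting ME-distance bound from the lemma, given the two distortions and the eigenvalue gap; it returns None for the bound when the lemma's conditions are not met.

```python
def meila_bound(D_I, D_opt, eig_gap, p_min, p_max, K):
    """Evaluate delta, tau(delta) and the bound p_max * tau(delta) from Lemma 1."""
    delta = (D_I - D_opt) / eig_gap            # (D(V,I) - D*(V)) / (lambda_{K-1}(S) - lambda_K(S))
    tau = 2 * delta * (1 - delta / (K - 1))    # tau(delta)
    applicable = (delta <= (K - 1) / 2) and (tau <= p_min)
    return delta, tau, (p_max * tau if applicable else None)
```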
35 Definitions. Define the increasing function ζ(p) (an explicit increasing function of p that also depends on K), the average variance
σ̄² := Σ_{k=1}^K w_k σ_k²,
and the minimum eigenvalue
λ_min := λ_{K−1}( Σ_{k=1}^K w_k (u_k − ū)(u_k − ū)^T ).
36–39 Theorem for original datasets. Theorem 1. Let V ∈ R^{F×N} be a dataset consisting of samples generated from a K-component spherical GMM (N > F > K) satisfying the non-degeneracy condition. Let
w_min := min_k w_k  and  w_max := max_k w_k,
and assume
δ_0 := (K − 1) σ̄² / λ_min < ζ(w_min).
Then, for sufficiently large N, with high probability,
d(correct, optimal) ≤ τ(δ_0) w_max.
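The quantities in the theorem can be computed directly from the model parameters. The sketch below (mine; the comparison against ζ(w_min) is omitted, since ζ's explicit form is not reproduced here) evaluates σ̄², λ_min, δ_0 and the bound τ(δ_0) w_max for a given spherical GMM.

```python
import numpy as np

def theorem1_quantities(weights, means, sigmas):
    """Compute sigma_bar^2, lambda_min, delta_0 and tau(delta_0) * w_max for a spherical GMM."""
    w = np.asarray(weights, dtype=float)
    U = np.asarray(means, dtype=float)            # K x F component mean vectors u_k
    s2 = np.asarray(sigmas, dtype=float) ** 2     # component variances sigma_k^2
    K = len(w)
    sigma_bar2 = np.dot(w, s2)                                    # average variance
    u_bar = w @ U                                                 # overall mean u_bar
    B = ((U - u_bar).T * w) @ (U - u_bar)                         # sum_k w_k (u_k - u_bar)(u_k - u_bar)^T
    lam = np.sort(np.linalg.eigvalsh(B))[::-1]                    # eigenvalues, descending
    lam_min = lam[K - 2]                                          # lambda_{K-1} of the matrix above
    delta0 = (K - 1) * sigma_bar2 / lam_min
    tau = 2 * delta0 * (1 - delta0 / (K - 1))
    return sigma_bar2, lam_min, delta0, tau * w.max()
```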
40 Remark for separability assumption. Remark 1. The condition δ_0 < ζ(w_min) can be considered as a separability assumption. For example, K = 2 implies that λ_min = w_1 w_2 ||u_1 − u_2||_2², and the condition becomes
||u_1 − u_2||_2 > σ̄ / √(w_1 w_2 ζ(w_min)).
41 Remark for non-degeneracy condition. Remark 2. The non-degeneracy condition is used to ensure that λ_min > 0. For K = 2, we have λ_min = w_1 w_2 ||u_1 − u_2||_2², so we only need the two component mean vectors to be distinct; we do not need them to be linearly independent.
42–43 Theorem for dimensionality-reduced datasets. Theorem 2. Let V ∈ R^{F×N} be generated under the same conditions as in Theorem 1, with the separability assumption modified to
δ_1 := (K − 1) σ̄² / (λ_min + σ̄²) < ζ(w_min),
and let Ṽ ∈ R^{(K−1)×N} be the post-(K−1)-PCA dataset of V. Then, for sufficiently large N, with high probability,
d(correct, optimal) ≤ τ(δ_1) w_max.
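A sketch of the post-(K−1)-PCA projection via an SVD of the centred data (assuming, as elsewhere in the slides, that V is F x N; the function name is my own).

```python
import numpy as np

def post_pca_dataset(V, K):
    """Project the F x N dataset onto its top (K-1) principal directions."""
    Z = V - V.mean(axis=1, keepdims=True)            # centre the data
    U_svd, _, _ = np.linalg.svd(Z, full_matrices=False)
    top = U_svd[:, :K - 1]                           # top K-1 left singular vectors
    return top.T @ Z                                 # (K-1) x N dimensionality-reduced dataset
```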
44 Upper bound for the ME distance between optimal clusterings. Combining the results of Theorems 1 and 2 via the triangle inequality: Corollary 1. Let V ∈ R^{F×N} be generated under the same conditions as in Theorem 1, and let Ṽ be the post-(K−1)-PCA dataset of V. Then, for sufficiently large N, with high probability,
d(optimal for V, optimal for Ṽ) ≤ (τ(δ_0) + τ(δ_1)) w_max.
45–46 Parameter settings. K = 2; for all k = 1, 2, we set either
σ_k² = λ_min ζ(w_min − ε) / (4(K − 1)),  corresponding to  δ_0 / ζ(w_min) ≈ 1/4,
or
σ_k² = λ_min ζ(w_min − ε) / (K − 1),  corresponding to  δ_0 / ζ(w_min) ≈ 1,
where ε > 0 is a small constant. The former corresponds to well-separated clusters; the latter to moderately well-separated clusters.
47 Visualization of post-2-svd datasets δ 0 /ζ (w min) 1/4 6 δ 0 /ζ (w min) 1 Cluster 1 Cluster 2 Component Mean 4 8 Cluster 1 Cluster 2 Component Mean /35
48 Original datasets. Let d_org := d(I, I_opt) denote the true ME distance and d̄_org := τ(δ_0) w_max its upper bound. The quantity
δ_0^emp := (D(V, I) − D*(V)) / (λ_{K−1}(S) − λ_K(S))
is an approximation of δ_0, and d̄_org^emp := τ(δ_0^emp) p_max is an approximation of d̄_org. Figure: True distances and upper bounds (d_org, d̄_org, d̄_org^emp) versus N for the original datasets.
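The empirical quantities in this plot can be estimated from the data alone. The sketch below (my own, assuming scikit-learn is available) approximates D*(V) by the best distortion over several k-means restarts, so the resulting δ_0^emp and bound are only estimates.

```python
import numpy as np
from sklearn.cluster import KMeans

def empirical_bound(V, true_labels, K):
    """Estimate delta_0^emp and tau(delta_0^emp) * p_max from data and the correct clustering."""
    true_labels = np.asarray(true_labels)
    X = V.T                                              # sklearn expects N x F
    # D(V, I): distortion of the correct clustering.
    D_correct = sum(np.sum((X[true_labels == k] - X[true_labels == k].mean(axis=0)) ** 2)
                    for k in range(K))
    # D*(V) approximated by the best distortion over several k-means restarts.
    km = KMeans(n_clusters=K, n_init=20, random_state=0).fit(X)
    D_opt = km.inertia_
    # Eigenvalue gap lambda_{K-1}(S) - lambda_K(S) via the singular values of the centred data.
    Z = X - X.mean(axis=0)
    sq_sv = np.sort(np.linalg.svd(Z, compute_uv=False) ** 2)[::-1]
    gap = sq_sv[K - 2] - sq_sv[K - 1]
    delta_emp = (D_correct - D_opt) / gap
    p_max = max(np.mean(true_labels == k) for k in range(K))
    tau = 2 * delta_emp * (1 - delta_emp / (K - 1))
    return delta_emp, p_max * tau
```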
49 Dimensionality-reduced datasets. Figure: True distances and upper bounds (d_pca, d̄_pca, d̄_pca^emp) versus N for the post-PCA datasets.
50 Comparisons of running time. Figure: Running time versus N for the original and post-PCA datasets.
51 Further extensions. Randomized SVD instead of exact SVD; random projection; non-spherical Gaussians or even more general distributions, e.g., log-concave distributions.