Clustering, K-Means, EM Tutorial

Clustering, K-Means, EM Tutorial. Kamyar Ghasemipour. Parts taken from Shikhar Sharma, Wenjie Luo, and Boris Ivanovic's tutorial slides, as well as lecture notes.

Organization: Clustering Motivation; K-Means Review & Demo; Gaussian Mixture Models Review; EM Algorithm (time permitting); Free Energy Justification.

Clustering

Clustering: Motivation. An important assumption we make when doing any form of learning is that similar data-points have similar behaviour. E.g., in the context of supervised learning, similar inputs should lead to similar predictions.* (*Sometimes our trained models don't follow this assumption; cf. the literature on adversarial examples.)

Clustering: Examples Discretizing colours for compression using a codebook

Clustering: Examples. Doing a very basic form of boundary detection: discretize the colours, then draw boundaries between colour groups.

Clustering: Examples. Like all unsupervised learning algorithms, clustering can be incorporated into the pipeline for training a supervised model. We will go over an example of this very soon.

Clustering: Challenges. What is a good notion of similarity? Euclidean distance is a poor measure for images: the slide compares image pairs that look very different to a human yet have nearly identical sum-of-squared-differences (SSD ~728 vs. ~713).

Clustering: Challenges. The notion of similarity used can make the same algorithm behave in very different ways, and can in some cases be a motivation for developing new algorithms (not necessarily just clustering algorithms). Another question is how to compare different clustering algorithms. There may be specific methods for making these decisions based on the clustering algorithms used; performance on down-the-line tasks can also serve as a proxy when choosing between different setups.

Clustering: Some Specific Algorithms. Today we shall review K-Means and Gaussian Mixture Models; hopefully there will be some time to go over EM as well.

K-Means

K-Means: The Algorithm. 1. Initialize K centroids. 2. Iterate until convergence: a. Assign each data-point to its closest centroid. b. Move each centroid to the center of the data-points assigned to it.
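To make the loop concrete, here is a minimal NumPy sketch of the algorithm above (the function name `kmeans`, the random-sample initialization, and the convergence test are illustrative choices, not part of the slides):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means sketch: X is an (N, D) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # 1. Initialize K centroids by picking K random data-points
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # 2a. Assign each data-point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)  # (N, K)
        assign = dists.argmin(axis=1)
        # 2b. Move each centroid to the mean of the points assigned to it
        new_centroids = np.array([
            X[assign == k].mean(axis=0) if np.any(assign == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, assign
```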

K-Means: a look at how it can be used. << Slides from TAs past >>

Tomato sauce. A major tomato sauce company wants to tailor their brands of sauces to suit their customers. They run a market survey where the test subjects rate different sauces. After some processing they get the following data: each point represents the preferred sauce characteristics of a specific person.

Tomato sauce data (scatter plot; axes: More Sweet vs. More Garlic). This tells us how much different customers like different flavors.

Some natural questions: How many different sauces should the company make? How sweet/garlicky should these sauces be? Idea: we will segment the consumers into groups (in this case 3), and then find the best sauce for each group.

Approaching k-means. Say I give you 3 sauces whose garlickiness and sweetness are marked by X on the plot.

Approaching k-means. We will group each customer by the sauce that most closely matches their taste.

Approaching k-means. Given this grouping, can we choose sauces that would make each group happier on average?

Approaching k-means. Given this grouping, can we choose sauces that would make each group happier on average? Yes!

Approaching k-means. Given these new sauces, we can regroup the customers, and the process repeats until the groupings stop changing.


K-Means: Challenges. How to initialize? You saw k-means++ in the lecture slides; you can come up with other heuristics too. How do you choose K? You may come up with criteria for the value of K based on: restrictions on the magnitude of K (everyone can't have their own tomato sauce), or performance on some down-the-line task (if used for doing supervised learning later, you must choose K such that you do not under- or over-fit). One common heuristic is sketched below.
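As an illustration of the "how do you choose K" question, the sketch below fits K-means for several values of K on some hypothetical 2-D preference data and prints the within-cluster sum of squares (scikit-learn's `inertia_`). Looking for an "elbow" in this curve is one common heuristic, not something prescribed by the slides; the data and the range of K are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 2-D "sauce preference" points drawn around three centers
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0.0, 0.0), (2.0, 2.0), (0.0, 2.0)]])

# Fit K-means for a range of K and record the within-cluster sum of squares;
# an "elbow" in this curve is one heuristic for picking K.
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)
```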

K-Means: Challenges. The K-Means algorithm converges to a local minimum: you can try multiple random restarts, or other heuristics such as the splitting discussed in lecture. Questions about K-Means?

Gaussian Mixture Models

Generative Models. One important class of methods in machine learning. The goal is to define some parametric family of probability distributions and then maximize the likelihood of your data under this distribution by finding the best parameters.

Back to GMM. A Gaussian mixture distribution:
p(x) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k)
We had: z ∼ Categorical(π), where π_k ≥ 0 and Σ_k π_k = 1.
Joint distribution: p(x, z) = p(z) p(x | z)
Log-likelihood:
ℓ(π, µ, Σ) = ln p(X | π, µ, Σ) = Σ_{n=1}^{N} ln p(x^(n) | π, µ, Σ) = Σ_{n=1}^{N} ln Σ_{z^(n)=1}^{K} p(x^(n) | z^(n); µ, Σ) p(z^(n) | π)
Note: we have a hidden variable z^(n) for every observation. General problem: a sum inside the log. How can we optimize this?
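The "sum inside the log" can be made concrete with a small sketch that evaluates the GMM log-likelihood, using log-sum-exp for numerical stability (the function name and array conventions are mine, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, mu, Sigma):
    """l(pi, mu, Sigma) = sum_n ln sum_k pi_k N(x^(n) | mu_k, Sigma_k).
    X: (N, D); pi: (K,); mu: (K, D); Sigma: (K, D, D)."""
    K = len(pi)
    # log pi_k + log N(x^(n) | mu_k, Sigma_k) for every point and component
    log_terms = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k])
         for k in range(K)], axis=1)                 # shape (N, K)
    # the sum inside the log is done in log space to avoid underflow
    return float(logsumexp(log_terms, axis=1).sum())
```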

Expectation Maximization for GMM: Overview. EM is an elegant and powerful method for finding maximum likelihood solutions for models with latent variables.
1. E-step: In order to adjust the parameters, we must first solve the inference problem: which Gaussian generated each datapoint? We cannot be sure, so we compute a distribution over all possibilities:
γ_k^(n) = p(z^(n) = k | x^(n); π, µ, Σ)
2. M-step: Each Gaussian gets a certain amount of posterior probability for each datapoint. We fit each Gaussian to the weighted datapoints. We can derive closed-form updates for all parameters.
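Below is a minimal sketch of one E-step/M-step pair for a GMM, following the standard closed-form updates (e.g. Bishop, Chapter 9); the function name `em_step` and the shape conventions are illustrative choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One EM iteration for a GMM. X: (N, D); pi: (K,); mu: (K, D); Sigma: (K, D, D)."""
    N, D = X.shape
    K = len(pi)

    # E-step: responsibilities gamma[n, k] = p(z^(n) = k | x^(n); pi, mu, Sigma)
    gamma = np.stack(
        [pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)],
        axis=1)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M-step: re-fit each Gaussian to the responsibility-weighted data
    Nk = gamma.sum(axis=0)                     # effective number of points per component
    pi_new = Nk / N
    mu_new = (gamma.T @ X) / Nk[:, None]
    Sigma_new = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu_new[k]
        Sigma_new[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi_new, mu_new, Sigma_new
```

In practice you would iterate `em_step` until the log-likelihood (e.g. as computed by the earlier sketch) stops increasing.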

Gaussian Mixture Models: Connection to K-Means. You saw soft K-means in lecture. If you look at the update equations (and maybe do some back-of-the-envelope calculations) you will see that the update rule for soft k-means is the same as that for a GMM in which each Gaussian is spherical, with a fixed shared isotropic (identity-scaled) covariance matrix and equal mixing proportions.

Gaussian Mixture Models: Miscellany. You can try initializing the centers with the k-means algorithm. Your models will train a lot faster if you use diagonal covariance matrices (but that might not necessarily be a good idea).

EM

EM: Review. The update rule for GMMs is a special case of the EM algorithm.

EM: Free Energy Justification. Let's try doing this on the board.

General EM Algorithm.
1. Initialize Θ^old.
2. E-step: Evaluate p(Z | X, Θ^old) and compute
Q(Θ, Θ^old) = Σ_Z p(Z | X, Θ^old) ln p(X, Z | Θ)
3. M-step: Maximize
Θ^new = arg max_Θ Q(Θ, Θ^old)
4. Evaluate the log likelihood and check for convergence (of the log likelihood or of the parameters). If not converged, set Θ^old = Θ^new and go to step 2.
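For the GMM of the earlier slides, Q takes the standard form below (written out here for reference; maximizing it in the M-step yields the closed-form updates for π, µ, Σ):

```latex
Q(\Theta, \Theta^{\text{old}})
  = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_k^{(n)}
    \Big[ \ln \pi_k + \ln \mathcal{N}\big(x^{(n)} \mid \mu_k, \Sigma_k\big) \Big],
\qquad
\gamma_k^{(n)} = p\big(z^{(n)} = k \mid x^{(n)}; \Theta^{\text{old}}\big).
```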

EM: alternative approach.* Our goal is to maximize p(X | Θ) = Σ_Z p(X, Z | Θ). Typically optimizing p(X | Θ) is difficult, but optimizing p(X, Z | Θ) is easy. Let q(Z) be a distribution over the latent variables. For any distribution q(Z) we have
ln p(X | Θ) = L(q, Θ) + KL(q ‖ p(Z | X, Θ))
where
L(q, Θ) = Σ_Z q(Z) ln { p(X, Z | Θ) / q(Z) }
KL(q ‖ p) = − Σ_Z q(Z) ln { p(Z | X, Θ) / q(Z) }
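As a short sanity check (not on the original slide), the decomposition follows from p(X, Z | Θ) = p(Z | X, Θ) p(X | Θ) and the fact that q(Z) sums to one:

```latex
L(q, \Theta) + \mathrm{KL}\big(q \,\|\, p(Z \mid X, \Theta)\big)
  = \sum_Z q(Z) \ln \frac{p(X, Z \mid \Theta)}{q(Z)}
    - \sum_Z q(Z) \ln \frac{p(Z \mid X, \Theta)}{q(Z)}
  = \sum_Z q(Z) \ln \frac{p(X, Z \mid \Theta)}{p(Z \mid X, \Theta)}
  = \ln p(X \mid \Theta).
```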

E-step and M-step.* Recall ln p(X | Θ) = L(q, Θ) + KL(q ‖ p(Z | X, Θ)). In the E-step we maximize the lower bound L(q, Θ^old) with respect to q(Z); this is achieved when q(Z) = p(Z | X, Θ^old). The lower bound L is then
L(q, Θ) = Σ_Z p(Z | X, Θ^old) ln p(X, Z | Θ) − Σ_Z p(Z | X, Θ^old) ln p(Z | X, Θ^old) = Q(Θ, Θ^old) + const
where the constant is the entropy of the q distribution, which is independent of Θ. In the M-step the quantity to be maximized is therefore the expectation of the complete-data log-likelihood. Note that Θ appears only inside the logarithm, and optimizing the complete-data likelihood is easier.

Visualization of the E-step: the q distribution is set equal to the posterior distribution for the current parameter values Θ^old, causing the lower bound to move up to the same value as the log likelihood function, with the KL divergence vanishing.

Visualization of the M-step: the distribution q(Z) is held fixed and the lower bound L(q, Θ) is maximized with respect to the parameter vector Θ to give a revised value Θ^new. Because the KL divergence is nonnegative, this causes the log likelihood ln p(X | Θ) to increase by at least as much as the lower bound does.