COMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017


COMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University

SOFT CLUSTERING VS HARD CLUSTERING MODELS

HARD CLUSTERING MODELS
Review: K-means clustering algorithm
Given: Data $x_1, \dots, x_n$, where $x_i \in \mathbb{R}^d$.
Goal: Minimize $\mathcal{L} = \sum_{i=1}^n \sum_{k=1}^K \mathbb{1}\{c_i = k\}\, \|x_i - \mu_k\|^2$.
Iterate until the values no longer change:
1. Update $c$: For each $i$, set $c_i = \arg\min_k \|x_i - \mu_k\|^2$.
2. Update $\mu$: For each $k$, set $\mu_k = \big(\sum_i x_i \mathbb{1}\{c_i = k\}\big) \big/ \big(\sum_i \mathbb{1}\{c_i = k\}\big)$.
K-means is an example of a hard clustering algorithm because it assigns each observation to only one cluster. In other words, $c_i = k$ for some $k \in \{1, \dots, K\}$. There is no accounting for the boundary cases by hedging on the corresponding $c_i$.
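Below is a minimal NumPy sketch of the two-step K-means iteration just described; the function name, the random initialization from data points, and the fixed iteration count are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Hard-clustering K-means: alternate the c-update and mu-update described above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=K, replace=False)]  # initialize centers at random data points
    for _ in range(n_iters):
        # Update c: assign each x_i to its nearest center (squared Euclidean distance)
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # n x K
        c = dists.argmin(axis=1)
        # Update mu: mean of the points currently assigned to each cluster
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return c, mu
```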

SOFT CLUSTERING MODELS
A soft clustering algorithm breaks the data across clusters intelligently.
(Figure: (left) True cluster assignments of data from three Gaussians. (middle) The data as we see it. (right) A soft-clustering of the data accounting for borderline cases.)

WEIGHTED K-MEANS (SOFT CLUSTERING EXAMPLE)
Weighted K-means clustering algorithm
Given: Data $x_1, \dots, x_n$, where $x_i \in \mathbb{R}^d$.
Goal: Minimize $\mathcal{L} = \sum_{i=1}^n \sum_{k=1}^K \phi_i(k)\, \|x_i - \mu_k\|^2 - \beta \sum_{i=1}^n H(\phi_i)$ over $\phi_i$ and $\mu_k$.
Conditions: $\phi_i(k) > 0$, $\sum_{k=1}^K \phi_i(k) = 1$, $H(\phi_i)$ = entropy. Set $\beta > 0$.
Iterate the following:
1. Update $\phi$: For each $i$, update the cluster allocation weights
   $\phi_i(k) = \frac{\exp\{-\frac{1}{\beta}\|x_i - \mu_k\|^2\}}{\sum_j \exp\{-\frac{1}{\beta}\|x_i - \mu_j\|^2\}}$, for $k = 1, \dots, K$.
2. Update $\mu$: For each $k$, update $\mu_k$ with the weighted average
   $\mu_k = \frac{\sum_i x_i\, \phi_i(k)}{\sum_i \phi_i(k)}$.
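The following is a hedged NumPy sketch of the two weighted K-means updates above; computing $\phi$ as a numerically stabilized softmax of $-\frac{1}{\beta}$ times the squared distance, and the initialization scheme, are implementation choices assumed here.

```python
import numpy as np

def weighted_kmeans(X, K, beta=1.0, n_iters=100, seed=0):
    """Soft-clustering weighted K-means: phi-update, then weighted mu-update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=K, replace=False)]
    for _ in range(n_iters):
        # Update phi: softmax of -(1/beta) * squared distance to each center
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # n x K
        logits = -dists / beta
        logits -= logits.max(axis=1, keepdims=True)      # stabilize before exponentiating
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)            # rows sum to 1
        # Update mu: weighted average of the data under phi
        mu = (phi.T @ X) / phi.sum(axis=0)[:, None]
    return phi, mu
```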

SOFT CLUSTERING WITH WEIGHTED K-MEANS
(Figure: one point has $\phi_i = 0.75$ on the green cluster and $0.25$ on the blue cluster; another has $\phi_i = 1$ on the blue cluster. Labels: $\mu_1$, $x$, "$\beta$-defined region".)
When $\phi_i$ is binary, we get back the hard clustering model.

MIXTURE MODELS

PROBABILISTIC SOFT CLUSTERING MODELS
Probabilistic vs. non-probabilistic soft clustering
The weight vector $\phi_i$ is like a probability of $x_i$ being assigned to each cluster. A mixture model is a probabilistic model where $\phi_i$ actually is a probability distribution according to the model.
Mixture models work by defining:
1. A prior distribution on the cluster assignment indicator $c_i$
2. A likelihood distribution on the observation $x_i$ given the assignment $c_i$
Intuitively we can connect a mixture model to the Bayes classifier: the class prior corresponds to the cluster prior (this time, we don't know the label), and the class-conditional likelihood corresponds to the cluster-conditional likelihood.

MIXTURE MODELS
(Figure: (a) A probability distribution on $\mathbb{R}^2$. (b) Data sampled from this distribution.)
Before introducing the math, some key features of a mixture model are:
1. It is a generative model (it defines a probability distribution on the data).
2. It is a weighted combination of simpler distributions. Each simple distribution is in the same distribution family (e.g., a Gaussian). The weighting is defined by a discrete probability distribution.

MIXTURE MODELS
Generating data from a mixture model
Data: $x_1, \dots, x_n$, where each $x_i \in \mathcal{X}$ (this can be complicated, but think $\mathcal{X} = \mathbb{R}^d$).
Model parameters: A $K$-dimensional distribution $\pi$ and parameters $\theta_1, \dots, \theta_K$.
Generative process: For observation number $i = 1, \dots, n$,
1. Generate the cluster assignment: $c_i \stackrel{iid}{\sim} \text{Discrete}(\pi)$, so $\text{Prob}(c_i = k \mid \pi) = \pi_k$.
2. Generate the observation: $x_i \sim p(x \mid \theta_{c_i})$.
Some observations about this procedure: First, each $x_i$ is randomly assigned to a cluster using the distribution $\pi$. The assignment $c_i$ indexes the cluster of $x_i$; it picks out the index of the parameter $\theta$ used to generate $x_i$. If two $x$'s share a parameter, they are clustered together.
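As a concrete illustration of this generative process, here is a short NumPy sketch that samples $c_i$ from $\text{Discrete}(\pi)$ and then draws $x_i$ from the corresponding component; using Gaussian components for $p(x \mid \theta_k)$, and the particular $\pi$, means, and covariances in the usage example, are assumptions for demonstration only.

```python
import numpy as np

def sample_mixture(pi, means, covs, n, seed=0):
    """Draw n points: c_i ~ Discrete(pi), then x_i ~ N(mu_{c_i}, Sigma_{c_i})."""
    rng = np.random.default_rng(seed)
    K = len(pi)
    c = rng.choice(K, size=n, p=pi)                                   # cluster assignments
    X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in c])
    return X, c

# Usage example (illustrative parameters): three clusters in R^2 with uneven weights
pi = np.array([0.6, 0.3, 0.1])
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.3])]
X, c = sample_mixture(pi, means, covs, n=500)
```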

MIXTURE MODELS
(Figure: (a) Uniform mixing weights. (b) Data sampled from this distribution. (c) Uneven mixing weights. (d) Data sampled from this distribution.)

GAUSSIAN MIXTURE MODELS

ILLUSTRATION
Gaussian mixture models are mixture models where $p(x \mid \theta)$ is Gaussian.
Mixture of two Gaussians: the red line is the density function, with $\pi = [0.5, 0.5]$, $(\mu_1, \sigma_1^2) = (0, 1)$, $(\mu_2, \sigma_2^2) = (2, 0.5)$.
Influence of mixing weights: the red line is the density function, with $\pi = [0.8, 0.2]$, $(\mu_1, \sigma_1^2) = (0, 1)$, $(\mu_2, \sigma_2^2) = (2, 0.5)$.

GAUSSIAN MIXTURE MODELS (GMM)
The model
Parameters: Let $\pi$ be a $K$-dimensional probability distribution and $(\mu_k, \Sigma_k)$ be the mean and covariance of the $k$th Gaussian in $\mathbb{R}^d$.
Generate data: For the $i$th observation,
1. Assign the $i$th observation to a cluster, $c_i \sim \text{Discrete}(\pi)$.
2. Generate the value of the observation, $x_i \sim N(\mu_{c_i}, \Sigma_{c_i})$.
Definitions: $\mu = \{\mu_1, \dots, \mu_K\}$ and $\Sigma = \{\Sigma_1, \dots, \Sigma_K\}$.
Goal: We want to learn $\pi$, $\mu$ and $\Sigma$.

GAUSSIAN MIXTURE MODELS (GMM)
Maximum likelihood
Objective: Maximize the likelihood over the model parameters $\pi$, $\mu$ and $\Sigma$ by treating the $c_i$ as auxiliary data using the EM algorithm:
$p(x_1, \dots, x_n \mid \pi, \mu, \Sigma) = \prod_{i=1}^n p(x_i \mid \pi, \mu, \Sigma) = \prod_{i=1}^n \sum_{k=1}^K p(x_i, c_i = k \mid \pi, \mu, \Sigma)$.
The summation over the values of each $c_i$ integrates out this variable. We can't simply take derivatives with respect to $\pi$, $\mu_k$ and $\Sigma_k$ and set them to zero to maximize this, because there's no closed-form solution. We could use gradient methods, but EM is cleaner.
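To make the marginal likelihood concrete, here is a small sketch that evaluates $\sum_i \ln \sum_k \pi_k N(x_i \mid \mu_k, \Sigma_k)$ by summing out $c_i$; the use of scipy.stats.multivariate_normal and logsumexp for numerical stability is an implementation choice assumed here, not something specified in the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, means, covs):
    """Sum_i ln Sum_k pi_k N(x_i | mu_k, Sigma_k), computed stably in log space."""
    K = len(pi)
    # log pi_k + log N(x_i | mu_k, Sigma_k) for every (i, k) pair, shape (n, K)
    log_joint = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, means[k], covs[k])
        for k in range(K)
    ])
    return logsumexp(log_joint, axis=1).sum()  # marginalize out c_i, then sum over i
```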

EM ALGORITHM
Q: Why not instead just include each $c_i$ and maximize $\prod_{i=1}^n p(x_i, c_i \mid \pi, \mu, \Sigma)$, since (we can show) this is easy to do using coordinate ascent?
A: We would end up with a hard-clustering model where $c_i \in \{1, \dots, K\}$. Our goal here is to have soft clustering, which EM gives us.
EM and the GMM
We will not derive everything from scratch. However, we can treat $c_1, \dots, c_n$ as the auxiliary data that we integrate out. Therefore, we use EM to maximize $\sum_{i=1}^n \ln p(x_i \mid \pi, \mu, \Sigma)$ by working with $\sum_{i=1}^n \ln p(x_i, c_i \mid \pi, \mu, \Sigma)$.
Let's look at the outline of how to derive this.

THE EM ALGORITHM AND THE GMM
From the last lecture, the generic EM objective is
$\ln p(x \mid \theta_1) = \int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1)}{q(\theta_2)}\, d\theta_2 + \int q(\theta_2) \ln \frac{q(\theta_2)}{p(\theta_2 \mid x, \theta_1)}\, d\theta_2$.
The EM objective for the Gaussian mixture model is
$\sum_{i=1}^n \ln p(x_i \mid \pi, \mu, \Sigma) = \sum_{i=1}^n \sum_{k=1}^K q(c_i = k) \ln \frac{p(x_i, c_i = k \mid \pi, \mu, \Sigma)}{q(c_i = k)} + \sum_{i=1}^n \sum_{k=1}^K q(c_i = k) \ln \frac{q(c_i = k)}{p(c_i = k \mid x_i, \pi, \mu, \Sigma)}$.
Because $c_i$ is discrete, the integral becomes a sum.

EM SETUP (ONE ITERATION)
First: Set $q(c_i = k) = p(c_i = k \mid x_i, \pi, \mu, \Sigma)$ using Bayes rule:
$p(c_i = k \mid x_i, \pi, \mu, \Sigma) \propto p(c_i = k \mid \pi)\, p(x_i \mid c_i = k, \mu, \Sigma)$.
We can solve for the posterior of $c_i$ given $\pi$, $\mu$ and $\Sigma$:
$q(c_i = k) = \frac{\pi_k N(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j N(x_i \mid \mu_j, \Sigma_j)} = \phi_i(k)$.
E-step: Take the expectation using the updated $q$'s,
$\mathcal{L} = \sum_{i=1}^n \sum_{k=1}^K \phi_i(k) \ln p(x_i, c_i = k \mid \pi, \mu_k, \Sigma_k) + \text{constant w.r.t. } \pi, \mu, \Sigma$.
M-step: Maximize $\mathcal{L}$ with respect to $\pi$ and each $\mu_k$, $\Sigma_k$.
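A short sketch of the E-step computation of $\phi_i(k)$ via Bayes rule is given below; working in log space with a max-subtraction before normalizing is an assumed stabilization trick, and the helper name e_step is hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, means, covs):
    """phi_i(k) = pi_k N(x_i|mu_k,Sigma_k) / sum_j pi_j N(x_i|mu_j,Sigma_j)."""
    K = len(pi)
    log_joint = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, means[k], covs[k])
        for k in range(K)
    ])
    log_joint -= log_joint.max(axis=1, keepdims=True)  # stabilize before exponentiating
    phi = np.exp(log_joint)
    phi /= phi.sum(axis=1, keepdims=True)              # row i is the posterior q(c_i = .)
    return phi
```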

M-STEP CLOSE UP
Aside: How has EM made this easier?
Original objective function:
$\mathcal{L} = \sum_{i=1}^n \ln \sum_{k=1}^K p(x_i, c_i = k \mid \pi, \mu_k, \Sigma_k) = \sum_{i=1}^n \ln \sum_{k=1}^K \pi_k N(x_i \mid \mu_k, \Sigma_k)$.
The log-sum form makes optimizing $\pi$ and each $\mu_k$ and $\Sigma_k$ difficult.
Using EM here, we have the M-step objective:
$\mathcal{L} = \sum_{i=1}^n \sum_{k=1}^K \phi_i(k)\, \underbrace{\{\ln \pi_k + \ln N(x_i \mid \mu_k, \Sigma_k)\}}_{\ln p(x_i,\, c_i = k \mid \pi, \mu_k, \Sigma_k)} + \text{constant w.r.t. } \pi, \mu, \Sigma$.
The sum-log form is easier to optimize. We can take derivatives and solve.

EM FOR THE GMM
Algorithm: Maximum likelihood EM for the GMM
Given: $x_1, \dots, x_n$ where $x_i \in \mathbb{R}^d$.
Goal: Maximize $\mathcal{L} = \sum_{i=1}^n \ln p(x_i \mid \pi, \mu, \Sigma)$.
Iterate until the incremental improvement to $\mathcal{L}$ is small:
1. E-step: For $i = 1, \dots, n$, set
   $\phi_i(k) = \frac{\pi_k N(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j N(x_i \mid \mu_j, \Sigma_j)}$, for $k = 1, \dots, K$.
2. M-step: For $k = 1, \dots, K$, define $n_k = \sum_{i=1}^n \phi_i(k)$ and update the values
   $\pi_k = \frac{n_k}{n}$, $\quad \mu_k = \frac{1}{n_k} \sum_{i=1}^n \phi_i(k)\, x_i$, $\quad \Sigma_k = \frac{1}{n_k} \sum_{i=1}^n \phi_i(k)\, (x_i - \mu_k)(x_i - \mu_k)^T$.
Comment: The updated value for $\mu_k$ is used when updating $\Sigma_k$.
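Putting the two steps together, here is a hedged NumPy/SciPy sketch of the EM loop in the algorithm box above; the initialization (random data points as means, a shared empirical covariance) and the small ridge added to each covariance for numerical stability are assumptions, not part of the lecture's specification.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iters=100, seed=0):
    """Maximum likelihood EM for the GMM: alternate the E-step and M-step above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    means = X[rng.choice(n, size=K, replace=False)].copy()
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])  # shared empirical cov
    for _ in range(n_iters):
        # E-step: responsibilities phi_i(k), computed in log space for stability
        log_phi = np.column_stack([
            np.log(pi[k]) + multivariate_normal.logpdf(X, means[k], covs[k])
            for k in range(K)
        ])
        log_phi -= log_phi.max(axis=1, keepdims=True)
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)
        # M-step: update pi, means, covariances using the weights
        nk = phi.sum(axis=0)                          # n_k = sum_i phi_i(k)
        pi = nk / n
        means = (phi.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - means[k]                       # the updated mu_k is used here
            covs[k] = (phi[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
    return pi, means, covs, phi
```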

GAUSSIAN MIXTURE MODEL: EXAMPLE RUN
(Figure, panels a-f: (a) A random initialization. (b) Iteration 1 (E-step): assign data to clusters. (c) Iteration 1 (M-step), L = 1: update the Gaussians. (d) Iteration 2, L = 2: assign data to clusters and update the Gaussians. (e) Iteration 5 (skipping ahead), L = 5: assign data to clusters and update the Gaussians. (f) Iteration 20 (convergence), L = 20: assign data to clusters and update the Gaussians.)

GMM AND THE BAYES CLASSIFIER
The GMM feels a lot like a $K$-class Bayes classifier, where the label of $x_i$ is
$\text{label}(x_i) = \arg\max_k\, \pi_k N(x_i \mid \mu_k, \Sigma_k)$,
with $\pi_k$ the class prior and $N(\mu_k, \Sigma_k)$ the class-conditional density function. We learned $\pi$, $\mu$ and $\Sigma$ using maximum likelihood there too.
For the Bayes classifier, we could find $\pi$, $\mu$ and $\Sigma$ with a single equation because the class label was known. Compare with the GMM update:
$\pi_k = \frac{n_k}{n}$, $\quad \mu_k = \frac{1}{n_k} \sum_{i=1}^n \phi_i(k)\, x_i$, $\quad \Sigma_k = \frac{1}{n_k} \sum_{i=1}^n \phi_i(k)\, (x_i - \mu_k)(x_i - \mu_k)^T$.
They're almost identical, but since $\phi_i(k)$ is changing we have to iteratively update these values. With the Bayes classifier, $\phi_i$ encodes the label, so it was known.
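If hard labels are wanted from a fitted GMM, the arg-max rule above can be applied directly; the sketch below assumes the parameter format returned by the earlier gmm_em sketch, which is an assumption of this note rather than anything from the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_hard_labels(X, pi, means, covs):
    """label(x_i) = argmax_k pi_k N(x_i | mu_k, Sigma_k), the Bayes-classifier view."""
    K = len(pi)
    log_joint = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, means[k], covs[k])
        for k in range(K)
    ])
    return log_joint.argmax(axis=1)
```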

CHOOSING THE NUMBER OF CLUSTERS
Maximum likelihood for the Gaussian mixture model can overfit the data: it will learn as many Gaussians as it's given.
There is a set of techniques to address this based on the Dirichlet distribution. A Dirichlet prior is used on $\pi$, which encourages many Gaussians to disappear (i.e., not have any data assigned to them).

EM FOR A GENERIC MIXTURE MODEL
Algorithm: Maximum likelihood EM for mixture models
Given: Data $x_1, \dots, x_n$ where $x_i \in \mathcal{X}$.
Goal: Maximize $\mathcal{L} = \sum_{i=1}^n \ln p(x_i \mid \pi, \theta)$, where $p(x \mid \theta_k)$ is problem-specific.
Iterate until the incremental improvement to $\mathcal{L}$ is small:
1. E-step: For $i = 1, \dots, n$, set
   $\phi_i(k) = \frac{\pi_k\, p(x_i \mid \theta_k)}{\sum_j \pi_j\, p(x_i \mid \theta_j)}$, for $k = 1, \dots, K$.
2. M-step: For $k = 1, \dots, K$, define $n_k = \sum_{i=1}^n \phi_i(k)$ and set
   $\pi_k = \frac{n_k}{n}$, $\quad \theta_k = \arg\max_\theta \sum_{i=1}^n \phi_i(k) \ln p(x_i \mid \theta)$.
Comment: This is similar to the generalization of the Bayes classifier to any $p(x \mid \theta_k)$.
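To illustrate how this generic recipe plugs in a different $p(x \mid \theta)$, here is a hedged sketch of EM for a mixture of Poisson distributions, an example chosen here rather than one from the lecture; for the Poisson, the weighted maximum-likelihood M-step has the closed form $\lambda_k = \frac{1}{n_k}\sum_i \phi_i(k)\, x_i$.

```python
import numpy as np
from scipy.stats import poisson

def poisson_mixture_em(x, K, n_iters=100, seed=0):
    """Generic mixture EM with p(x|theta_k) = Poisson(lambda_k); x is a 1-D array of counts."""
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = np.full(K, 1.0 / K)
    lam = rng.uniform(x.min() + 0.5, x.max() + 0.5, size=K)  # illustrative initialization
    for _ in range(n_iters):
        # E-step: phi_i(k) proportional to pi_k * Poisson(x_i | lambda_k)
        phi = np.column_stack([pi[k] * poisson.pmf(x, lam[k]) for k in range(K)])
        phi /= phi.sum(axis=1, keepdims=True)
        # M-step: pi_k = n_k / n; lambda_k is the phi-weighted mean of the counts
        nk = phi.sum(axis=0)
        pi = nk / n
        lam = (phi * x[:, None]).sum(axis=0) / nk
    return pi, lam
```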