COMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University
SOFT CLUSTERING VS HARD CLUSTERING MODELS
HARD CLUSTERING MODELS

Review: K-means clustering algorithm
Given: Data x_1, ..., x_n, where x ∈ R^d.
Goal: Minimize L = Σ_{i=1}^n Σ_{k=1}^K 1{c_i = k} ||x_i − μ_k||².
Iterate until the values no longer change:
1. Update c: For each i, set c_i = arg min_k ||x_i − μ_k||².
2. Update μ: For each k, set μ_k = (Σ_i x_i 1{c_i = k}) / (Σ_i 1{c_i = k}).

K-means is an example of a hard clustering algorithm because it assigns each observation to exactly one cluster: c_i = k for some k ∈ {1, ..., K}. There is no accounting for boundary cases by hedging on the corresponding c_i.
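The two alternating updates above can be sketched in a few lines of NumPy (a minimal illustration; the function name and random initialization scheme are my own, not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Hard-clustering K-means: alternate between updating assignments c and means mu."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]  # initialize means at random data points
    for _ in range(n_iter):
        # Update c: assign each x_i to its nearest mean (squared Euclidean distance)
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = d.argmin(axis=1)
        # Update mu: mean of the points currently assigned to each cluster
        for k in range(K):
            if (c == k).any():
                mu[k] = X[c == k].mean(axis=0)
    return c, mu
```

Note that each row of the assignment is all-or-nothing: `c[i]` is a single integer, which is exactly the "hard" aspect the soft-clustering models below relax.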
SOFT CLUSTERING MODELS

A soft clustering algorithm breaks the data across clusters intelligently.
[Figure: (left) True cluster assignments of data from three Gaussians. (middle) The data as we see it. (right) A soft clustering of the data accounting for borderline cases.]
WEIGHTED K-MEANS (SOFT CLUSTERING EXAMPLE)

Weighted K-means clustering algorithm
Given: Data x_1, ..., x_n, where x ∈ R^d.
Goal: Minimize L = Σ_{i=1}^n Σ_{k=1}^K φ_i(k) ||x_i − μ_k||² − β Σ_i H(φ_i) over φ_i and μ_k.
Conditions: φ_i(k) > 0, Σ_{k=1}^K φ_i(k) = 1, H(φ_i) = entropy. Set β > 0.
Iterate the following:
1. Update φ: For each i, update the cluster allocation weights
   φ_i(k) = exp{−(1/β)||x_i − μ_k||²} / Σ_j exp{−(1/β)||x_i − μ_j||²}, for k = 1, ..., K.
2. Update μ: For each k, update μ_k with the weighted average
   μ_k = (Σ_i x_i φ_i(k)) / (Σ_i φ_i(k)).
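A minimal NumPy sketch of these two updates (function name and initialization are my own; the log-space shift is a standard numerical-stability trick, not part of the slides):

```python
import numpy as np

def weighted_kmeans(X, K, beta=1.0, n_iter=100, seed=0):
    """Soft-clustering weighted K-means; beta controls how much phi hedges."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # Update phi: softmax of -(1/beta) * squared distances
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        logits = -d / beta
        logits -= logits.max(axis=1, keepdims=True)  # stabilize the exponentials
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)        # each row sums to 1
        # Update mu: phi-weighted average of the data
        mu = (phi.T @ X) / phi.sum(axis=0)[:, None]
    return phi, mu
```

As β → 0 the rows of `phi` approach one-hot vectors and the updates reduce to ordinary K-means; larger β spreads weight across nearby clusters.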
SOFT CLUSTERING WITH WEIGHTED K-MEANS

[Figure: two clusters, each mean μ_k surrounded by a β-defined region. A borderline point x gets φ_i = 0.75 on the green cluster and 0.25 on the blue cluster; a point deep inside a cluster gets φ_i = 1 on the blue cluster.]
When φ_i is binary, we get back the hard clustering model.
MIXTURE MODELS
PROBABILISTIC SOFT CLUSTERING MODELS

Probabilistic vs. non-probabilistic soft clustering
The weight vector φ_i is like a probability of x_i being assigned to each cluster. A mixture model is a probabilistic model in which φ_i actually is a probability distribution according to the model.
Mixture models work by defining:
- A prior distribution on the cluster assignment indicator c_i
- A likelihood distribution on observation x_i given the assignment c_i
Intuitively, we can connect a mixture model to the Bayes classifier:
- Class prior ↔ cluster prior. This time, we don't know the label.
- Class-conditional likelihood ↔ cluster-conditional likelihood.
MIXTURE MODELS

[Figure: (a) A probability distribution on R². (b) Data sampled from this distribution.]
Before introducing the math, some key features of a mixture model are:
1. It is a generative model (it defines a probability distribution on the data).
2. It is a weighted combination of simpler distributions.
   - Each simple distribution is in the same distribution family (e.g., a Gaussian).
   - The weighting is defined by a discrete probability distribution.
MIXTURE MODELS

Generating data from a mixture model
Data: x_1, ..., x_n, where each x_i ∈ X (X can be complicated, but think X = R^d).
Model parameters: A K-dimensional distribution π and parameters θ_1, ..., θ_K.
Generative process: For observation number i = 1, ..., n,
1. Generate cluster assignment: c_i ~iid Discrete(π), so Prob(c_i = k | π) = π_k.
2. Generate observation: x_i ~ p(x | θ_{c_i}).
Some observations about this procedure:
- First, each x_i is randomly assigned to a cluster using the distribution π.
- c_i indexes the cluster assignment for x_i. This picks out the index of the parameter θ used to generate x_i.
- If two x's share a parameter, they are clustered together.
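The two-step generative process can be written directly as code (a sketch; the function name, the 1-D Gaussian component family, and the specific parameter values are illustrative choices, not from the slides):

```python
import numpy as np

def sample_mixture(n, pi, thetas, sample_component, seed=0):
    """Draw n points from a mixture: c_i ~ Discrete(pi), then x_i ~ p(x | theta_{c_i})."""
    rng = np.random.default_rng(seed)
    c = rng.choice(len(pi), size=n, p=pi)  # step 1: cluster assignments
    x = np.array([sample_component(rng, thetas[ci]) for ci in c])  # step 2: observations
    return c, x

# Illustrative example: two well-separated 1-D Gaussian components
pi = [0.5, 0.5]
thetas = [(0.0, 1.0), (10.0, 0.5)]  # (mean, std dev) pairs
c, x = sample_mixture(1000, pi, thetas,
                      lambda rng, th: rng.normal(th[0], th[1]))
```

Points that drew the same `c[i]` were generated by the same θ, which is exactly the sense in which they are "clustered together."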
MIXTURE MODELS

[Figure: (a) Uniform mixing weights. (b) Data sampled from this distribution. (c) Uneven mixing weights. (d) Data sampled from this distribution.]
GAUSSIAN MIXTURE MODELS
ILLUSTRATION

Gaussian mixture models are mixture models where p(x | θ) is Gaussian.
Mixture of two Gaussians
[Figure: the red line is the density function.]
π = [0.5, 0.5], (μ_1, σ_1²) = (0, 1), (μ_2, σ_2²) = (2, 0.5)
Influence of mixing weights
[Figure: the red line is the density function.]
π = [0.8, 0.2], (μ_1, σ_1²) = (0, 1), (μ_2, σ_2²) = (2, 0.5)
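The red density curves above are just the weighted sum of the component pdfs, which can be evaluated directly (a minimal NumPy sketch; the function name is mine):

```python
import numpy as np

def gmm_density(x, pi, params):
    """Density of a 1-D Gaussian mixture: sum_k pi_k * N(x | mu_k, sigma_k^2).
    params is a list of (mean, variance) pairs."""
    total = np.zeros_like(np.asarray(x, dtype=float))
    for w, (mu, var) in zip(pi, params):
        total += w * np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return total

# Parameters from the first plot above
density = gmm_density(np.linspace(-4.0, 6.0, 5), [0.5, 0.5], [(0.0, 1.0), (2.0, 0.5)])
```

Because the mixture weights sum to 1 and each component integrates to 1, the mixture density integrates to 1 as well.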
GAUSSIAN MIXTURE MODELS (GMM)

The model
Parameters: Let π be a K-dimensional probability distribution, and let (μ_k, Σ_k) be the mean and covariance of the kth Gaussian in R^d.
Generate data: For the ith observation,
1. Assign the ith observation to a cluster: c_i ~ Discrete(π).
2. Generate the value of the observation: x_i ~ N(μ_{c_i}, Σ_{c_i}).
Definitions: μ = {μ_1, ..., μ_K} and Σ = {Σ_1, ..., Σ_K}.
Goal: We want to learn π, μ and Σ.
GAUSSIAN MIXTURE MODELS (GMM)

Maximum likelihood
Objective: Maximize the likelihood over the model parameters π, μ and Σ by treating the c_i as auxiliary data using the EM algorithm:
p(x_1, ..., x_n | π, μ, Σ) = Π_{i=1}^n p(x_i | π, μ, Σ) = Π_{i=1}^n Σ_{k=1}^K p(x_i, c_i = k | π, μ, Σ).
The summation over the values of each c_i integrates out this variable.
We can't simply take derivatives with respect to π, μ_k and Σ_k and set them to zero to maximize this, because there is no closed-form solution. We could use gradient methods, but EM is cleaner.
EM ALGORITHM

Q: Why not instead just include each c_i and maximize Π_{i=1}^n p(x_i, c_i | π, μ, Σ), since (we can show) this is easy to do using coordinate ascent?
A: We would end up with a hard-clustering model where c_i ∈ {1, ..., K}. Our goal here is soft clustering, which EM gives us.
EM and the GMM
We will not derive everything from scratch. However, we can treat c_1, ..., c_n as the auxiliary data that we integrate out. Therefore, we use EM to maximize
Σ_{i=1}^n ln p(x_i | π, μ, Σ)
by working with
Σ_{i=1}^n ln p(x_i, c_i | π, μ, Σ).
Let's look at the outline of how to derive this.
THE EM ALGORITHM AND THE GMM

From the last lecture, the generic EM objective is
ln p(x | θ_1) = ∫ q(θ_2) ln [p(x, θ_2 | θ_1) / q(θ_2)] dθ_2 + ∫ q(θ_2) ln [q(θ_2) / p(θ_2 | x, θ_1)] dθ_2.
The EM objective for the Gaussian mixture model is
Σ_{i=1}^n ln p(x_i | π, μ, Σ) = Σ_{i=1}^n Σ_{k=1}^K q(c_i = k) ln [p(x_i, c_i = k | π, μ, Σ) / q(c_i = k)]
   + Σ_{i=1}^n Σ_{k=1}^K q(c_i = k) ln [q(c_i = k) / p(c_i = k | x_i, π, μ, Σ)].
Because c_i is discrete, the integral becomes a sum.
EM SETUP (ONE ITERATION)

First: Set q(c_i = k) = p(c_i = k | x_i, π, μ, Σ) using Bayes rule:
p(c_i = k | x_i, π, μ, Σ) ∝ p(c_i = k | π) p(x_i | c_i = k, μ, Σ).
We can solve for the posterior of c_i given π, μ and Σ:
q(c_i = k) = π_k N(x_i | μ_k, Σ_k) / Σ_j π_j N(x_i | μ_j, Σ_j) = φ_i(k).
E-step: Take the expectation using the updated q's:
L = Σ_{i=1}^n Σ_{k=1}^K φ_i(k) ln p(x_i, c_i = k | π, μ_k, Σ_k) + constant w.r.t. π, μ, Σ.
M-step: Maximize L with respect to π and each μ_k, Σ_k.
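The Bayes-rule posterior φ_i(k) can be computed in a vectorized way (a sketch; the function name is mine, and the explicit matrix inverse is used for clarity rather than efficiency):

```python
import numpy as np

def responsibilities(X, pi, mus, Sigmas):
    """phi_i(k) proportional to pi_k * N(x_i | mu_k, Sigma_k), normalized over k."""
    n, d = X.shape
    K = len(pi)
    phi = np.zeros((n, K))
    for k in range(K):
        diff = X - mus[k]
        Sinv = np.linalg.inv(Sigmas[k])
        # quad[i] = (x_i - mu_k)^T Sigma_k^{-1} (x_i - mu_k)
        quad = np.einsum('ij,jk,ik->i', diff, Sinv, diff)
        norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigmas[k]))
        phi[:, k] = pi[k] * np.exp(-0.5 * quad) / norm
    phi /= phi.sum(axis=1, keepdims=True)  # Bayes-rule normalization over clusters
    return phi
```

Each row of the returned matrix is a probability distribution over the K clusters, i.e., the soft assignment of one observation.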
M-STEP CLOSE-UP

Aside: How has EM made this easier?
Original objective function:
L = Σ_{i=1}^n ln Σ_{k=1}^K p(x_i, c_i = k | π, μ_k, Σ_k) = Σ_{i=1}^n ln Σ_{k=1}^K π_k N(x_i | μ_k, Σ_k).
The log-sum form makes optimizing π and each μ_k and Σ_k difficult.
Using EM here, we have the M-step objective:
L = Σ_{i=1}^n Σ_{k=1}^K φ_i(k) {ln π_k + ln N(x_i | μ_k, Σ_k)} + constant w.r.t. π, μ, Σ,
where the term in braces is ln p(x_i, c_i = k | π, μ_k, Σ_k).
The sum-log form is easier to optimize. We can take derivatives and solve.
EM FOR THE GMM

Algorithm: Maximum likelihood EM for the GMM
Given: x_1, ..., x_n, where x ∈ R^d.
Goal: Maximize L = Σ_{i=1}^n ln p(x_i | π, μ, Σ).
Iterate until the incremental improvement to L is small:
1. E-step: For i = 1, ..., n, set
   φ_i(k) = π_k N(x_i | μ_k, Σ_k) / Σ_j π_j N(x_i | μ_j, Σ_j), for k = 1, ..., K.
2. M-step: For k = 1, ..., K, define n_k = Σ_{i=1}^n φ_i(k) and update the values
   π_k = n_k / n,
   μ_k = (1 / n_k) Σ_{i=1}^n φ_i(k) x_i,
   Σ_k = (1 / n_k) Σ_{i=1}^n φ_i(k)(x_i − μ_k)(x_i − μ_k)^T.
Comment: The updated value for μ_k is used when updating Σ_k.
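Putting the E-step and M-step together gives a complete EM loop (a minimal sketch of the algorithm above; the function name, initialization choices, and the small ridge added to Σ_k for numerical safety are my own, and a real implementation would also monitor L for convergence rather than run a fixed number of iterations):

```python
import numpy as np

def em_gmm(X, K, n_iter=100, seed=0, init_mu=None):
    """Maximum-likelihood EM for a Gaussian mixture model."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = (X[rng.choice(n, K, replace=False)].copy() if init_mu is None
          else np.array(init_mu, dtype=float))
    Sigma = np.array([np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: phi_i(k) proportional to pi_k * N(x_i | mu_k, Sigma_k)
        phi = np.zeros((n, K))
        for k in range(K):
            diff = X - mu[k]
            Sinv = np.linalg.inv(Sigma[k])
            quad = np.einsum('ij,jk,ik->i', diff, Sinv, diff)
            norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma[k]))
            phi[:, k] = pi[k] * np.exp(-0.5 * quad) / norm
        phi /= phi.sum(axis=1, keepdims=True)
        # M-step: closed-form updates using n_k = sum_i phi_i(k)
        nk = phi.sum(axis=0)
        pi = nk / n
        mu = (phi.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mu[k]  # the updated mu_k is used here, as the comment above notes
            Sigma[k] = (phi[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, phi
```

On well-separated data the updates converge in a handful of iterations; in general, EM only guarantees a local maximum of L, so the initialization matters.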
GAUSSIAN MIXTURE MODEL: EXAMPLE RUN

[Figure, panels (a)-(f):
(a) A random initialization.
(b) Iteration 1 (E-step): assign data to clusters.
(c) Iteration 1 (M-step): update the Gaussians.
(d) Iteration 2: assign data to clusters and update the Gaussians.
(e) Iteration 5 (skipping ahead): assign data to clusters and update the Gaussians.
(f) Iteration 20 (convergence): assign data to clusters and update the Gaussians.]
GMM AND THE BAYES CLASSIFIER

The GMM feels a lot like a K-class Bayes classifier, where the label of x_i is
label(x_i) = arg max_k π_k N(x_i | μ_k, Σ_k),
with π_k the class prior and N(μ_k, Σ_k) the class-conditional density function. We learned π, μ and Σ using maximum likelihood there too.
For the Bayes classifier, we could find π, μ and Σ with a single equation because the class label was known. Compare with the GMM update:
π_k = n_k / n,
μ_k = (1 / n_k) Σ_{i=1}^n φ_i(k) x_i,
Σ_k = (1 / n_k) Σ_{i=1}^n φ_i(k)(x_i − μ_k)(x_i − μ_k)^T.
They're almost identical. But since φ_i(k) is changing, we have to iterate these updates. With the Bayes classifier, φ_i encoded the label, so it was known.
CHOOSING THE NUMBER OF CLUSTERS

Maximum likelihood for the Gaussian mixture model can overfit the data: it will use as many Gaussians as it is given.
There is a set of techniques for addressing this based on the Dirichlet distribution. A Dirichlet prior is placed on π, which encourages many Gaussians to "disappear" (i.e., not have any data assigned to them).
EM FOR A GENERIC MIXTURE MODEL

Algorithm: Maximum likelihood EM for mixture models
Given: Data x_1, ..., x_n, where x ∈ X.
Goal: Maximize L = Σ_{i=1}^n ln p(x_i | π, θ), where p(x | θ_k) is problem-specific.
Iterate until the incremental improvement to L is small:
1. E-step: For i = 1, ..., n, set
   φ_i(k) = π_k p(x_i | θ_k) / Σ_j π_j p(x_i | θ_j), for k = 1, ..., K.
2. M-step: For k = 1, ..., K, define n_k = Σ_{i=1}^n φ_i(k) and set
   π_k = n_k / n,   θ_k = arg max_θ Σ_{i=1}^n φ_i(k) ln p(x_i | θ).
Comment: This is similar to the generalization of the Bayes classifier to any p(x | θ_k).
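The generic algorithm only needs two problem-specific pieces: ln p(x | θ) and the weighted-MLE M-step for θ. A sketch with these supplied as plug-ins (the function names `log_pdf` and `mle_update` are illustrative, not from the slides), using a Poisson mixture as the problem-specific example, where the weighted MLE for the rate is just the φ-weighted mean of the data:

```python
import math
import numpy as np

def em_mixture(X, K, log_pdf, mle_update, init_thetas, n_iter=100):
    """Generic mixture EM. log_pdf(X, theta) returns ln p(x_i | theta) for all i;
    mle_update(X, w) solves argmax_theta sum_i w_i ln p(x_i | theta)."""
    n = len(X)
    pi = np.full(K, 1.0 / K)
    thetas = list(init_thetas)
    for _ in range(n_iter):
        # E-step: phi_i(k) proportional to pi_k * p(x_i | theta_k), in log space
        logp = np.stack([np.log(pi[k]) + log_pdf(X, thetas[k]) for k in range(K)], axis=1)
        logp -= logp.max(axis=1, keepdims=True)
        phi = np.exp(logp)
        phi /= phi.sum(axis=1, keepdims=True)
        # M-step: n_k = sum_i phi_i(k); pi_k = n_k / n; theta_k by weighted MLE
        nk = phi.sum(axis=0)
        pi = nk / n
        thetas = [mle_update(X, phi[:, k]) for k in range(K)]
    return pi, thetas, phi

# Example component family: Poisson(lambda)
def poisson_logpmf(x, lam):
    return x * np.log(lam) - lam - np.array([math.lgamma(xi + 1.0) for xi in x])

def poisson_mle(x, w):
    return (w @ x) / w.sum()  # phi-weighted mean of the counts
```

Swapping in a different `log_pdf` / `mle_update` pair (e.g., Gaussian, multinomial) recovers the corresponding mixture model without changing the EM loop.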