Clustering: Mixture Models

1 Clustering: Mixture Models. Machine Learning B, Seyoung Kim. Many of these slides are derived from Tom Mitchell, Ziv Bar-Joseph, and Eric Xing. Thanks!

2 Problem with K-means

3 Hard Assignment of Samples into Three Clusters. (Table: Individuals 1-10, each hard-assigned to exactly one of Cluster 1, Cluster 2, or Cluster 3.)

4 Probabilistic Soft-Clustering of Samples into Three Clusters. (Table: probability of each of Individuals 1-10 belonging to Cluster 1, Cluster 2, and Cluster 3.) Each sample can be assigned to more than one cluster with a certain probability. For each sample, the probabilities for all clusters should sum to 1 (i.e., each row should sum to 1). Each cluster is explained by a cluster center variable (i.e., cluster mean).
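A minimal sketch of such a soft-assignment table as a row-stochastic matrix (the probability values here are made up for illustration, not taken from the slides):

```python
import numpy as np

# Hypothetical soft-assignment probabilities for three individuals
# over three clusters; rows are samples, columns are clusters.
R = np.array([
    [0.80, 0.15, 0.05],  # mostly cluster 1
    [0.10, 0.70, 0.20],  # mostly cluster 2
    [0.33, 0.33, 0.34],  # ambiguous between all three
])

# Each row is a probability distribution over clusters and must sum to 1.
assert np.allclose(R.sum(axis=1), 1.0)
```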

5 Probability Model for Data P(X)?

6 Mixture Model. A density model p(x) may be multi-modal. Multi-modal: how do we model this? Unimodal: Gaussian.

7 Mixture Model. We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians). Each mode may correspond to a different sub-population (e.g., male and female).

8 Learning Mixture Models from Data. Given data generated from a multi-modal distribution, can we find a representation of the multi-modal distribution as a mixture of uni-modal distributions?

9 Gaussian Mixture Models (GMMs). Consider a mixture of K Gaussian components: $p(x) = \sum_k p(x \mid z = k)\, p(z = k) = \sum_k \pi_k\, N(x \mid \mu_k, \sigma_k)$, where $\pi_k$ is the mixture proportion and $N(x \mid \mu_k, \sigma_k)$ is the mixture component.

10 Gaussian Mixture Models (GMMs). Consider a mixture of K Gaussian components: $p(x) = \sum_k p(x \mid z = k)\, p(z = k) = \sum_k \pi_k\, N(x \mid \mu_k, \sigma_k)$. This probability model describes how each data point x can be generated. Step 1: Flip a K-sided die (with probability $\pi_k$ for the k-th side) to select a cluster c. Step 2: Generate the value of the data point from $N(\mu_c, \sigma_c)$.
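A minimal sketch of this two-step generative process in code (the parameter values are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])     # mixture proportions (the K-sided die)
mu = np.array([-4.0, 0.0, 5.0])    # cluster means
sigma = np.array([1.0, 0.5, 2.0])  # cluster standard deviations

def sample_gmm(n):
    # Step 1: flip the K-sided die to select a cluster for each point.
    z = rng.choice(len(pi), size=n, p=pi)
    # Step 2: generate each point's value from the selected cluster's Gaussian.
    x = rng.normal(mu[z], sigma[z])
    return x, z

x, z = sample_gmm(1000)
```

A histogram of `x` would show the three modes, one per sub-population.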

11 Gaussian Mixture Models (GMMs). Consider a mixture of K Gaussian components: $p(x) = \sum_k p(x \mid z = k)\, p(z = k) = \sum_k \pi_k\, N(x \mid \mu_k, \sigma_k)$. Parameters for K clusters: $\theta = \{\mu_k, \sigma_k, \pi_k : k = 1, \ldots, K\}$.

12 Learning mixture models. Latent variable model: data are only partially observed! $x_i$: observed sample data. $z_i = \{z_{i1}, \ldots, z_{iK}\}$: unobserved cluster labels (each element is 0 or 1, and only one of them is 1). MLE estimate: What if all data $(x_i, z_i)$ are observed? Maximize the data log likelihood for $(x_i, z_i)$ based on $p(x_i, z_i)$. Easy to optimize! In practice, only the $x_i$'s are observed: maximize the data log likelihood for $(x_i)$ based on $p(x_i)$. Difficult to optimize! Instead, maximize the expected data log likelihood for $(x_i, z_i)$ based on $p(x_i, z_i)$: the Expectation-Maximization (EM) algorithm.

13 Learning mixture models: fully observed data. In the fully observed i.i.d. setting, assuming the cluster labels $z_i$ were observed, the log likelihood decomposes into a sum of local terms: $\ell_c(\theta; D) = \sum_i \log p(x_i, z_i \mid \theta) = \sum_i \log p(z_i \mid \theta) + \sum_i \log p(x_i \mid z_i, \theta)$. The first term depends on $\pi_k$; the second depends on $\mu_k, \sigma_k$. The optimization problems for $\pi_k$ and for $\mu_k, \sigma_k$ are decoupled, and a closed-form solution for the MLE exists.

14 MLE for GMM with fully observed data! If we are doing MLE for completely observed data, the data log-likelihood is $\ell(\theta; D) = \sum_i \log p(z_i, x_i) = \sum_i \log p(z_i \mid \pi)\, p(x_i \mid z_i, \mu, \sigma) = \sum_i \sum_k z_{ik} \log \pi_k + \sum_i \sum_k z_{ik} \log N(x_i; \mu_k, \sigma) = \sum_i \sum_k z_{ik} \log \pi_k - \sum_i \sum_k z_{ik} \frac{1}{2\sigma^2}(x_i - \mu_k)^2 + C$. What if we do not know $z$?
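The closed-form solution referenced above is easy to write down. A minimal sketch, assuming 1-D data `x` with observed integer labels `z` (the function name is mine; it assumes every cluster has at least one sample):

```python
import numpy as np

def mle_fully_observed(x, z, K):
    # With z observed, the MLE decouples into per-cluster counts and moments.
    pi = np.array([np.mean(z == k) for k in range(K)])    # mixture proportions
    mu = np.array([x[z == k].mean() for k in range(K)])   # cluster means
    sigma = np.array([x[z == k].std() for k in range(K)]) # MLE std (divides by N_k)
    return pi, mu, sigma
```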

15 Learning mixture models. In the fully observed i.i.d. setting, assuming the cluster labels $z_i$ were observed, the log likelihood decomposes into a sum of local terms, $\ell_c(\theta; D) = \sum_i \log p(x_i, z_i \mid \theta)$, with one term depending on $\pi_k$ and the other on $\mu_k, \sigma_k$. With latent variables for the cluster labels, $\ell(\theta; D) = \sum_i \log p(x_i \mid \theta) = \sum_i \log \sum_z p(x_i, z \mid \theta) = \sum_i \log \sum_z p(z \mid \theta)\, p(x_i \mid z, \theta)$, and all the parameters become coupled together via marginalization. Are they equally difficult?

16 Theory underlying EM. Recall that according to MLE, we intend to learn the model parameters that would have maximized the likelihood of the data. But we do not observe $z$, so computing $\ell(\theta; D) = \sum_i \log \sum_z p(x_i, z \mid \theta) = \sum_i \log \sum_z p(z \mid \theta)\, p(x_i \mid z, \theta)$ is difficult! Optimizing the log-likelihood for MLE is difficult! What shall we do?

17 Complete vs. Expected Complete Log Likelihoods. The complete log likelihood: $\ell_c(\theta; D) = \sum_i \log p(x_i, z_i \mid \theta) = \sum_i \sum_k z_{ik} \left[\log \pi_k + \log N(x_i \mid \mu_k, \sigma_k)\right]$. The expected complete log likelihood: $\langle \ell_c(\theta; D) \rangle = \sum_i \sum_k \langle z_{ik} \rangle \left[\log \pi_k + \log N(x_i \mid \mu_k, \sigma_k)\right]$, where $\langle z_{ik} \rangle = p(z_{ik} = 1 \mid x_i, \theta)$. The first factor in each bracket depends on $\pi_k$; the second depends on $\mu_k, \sigma_k$.

18 Complete vs. Expected Complete Log Likelihoods. The complete log likelihood uses the observed $z_{ik}$; the expected complete log likelihood replaces each unobserved $z_{ik}$ with its posterior expectation $\langle z_{ik} \rangle$. EM optimizes the expected complete log likelihood.

19 EM Algorithm. Maximization (M)-step: find the mixture parameters. Expectation (E)-step: re-assign samples $x_i$ to clusters; impute the unobserved values $z_i$. Iterate until convergence.

20 K-Means Clustering Algorithm. Find the cluster means. Re-assign samples $x_i$ to clusters: $z_i = \arg\min_k \|x_i - \mu_k\|_2^2$. Iterate until convergence.
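A minimal K-means sketch implementing the two steps above (illustrative only; for simplicity it assumes 2-D input `X` of shape (n, d) and that no cluster ever goes empty):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # init means at random samples
    for _ in range(n_iter):
        # Assignment step: send each sample to its nearest mean (argmin distance).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = d2.argmin(axis=1)
        # Update step: recompute each mean from its assigned samples.
        mu = np.array([X[z == k].mean(axis=0) for k in range(K)])
    return mu, z
```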

21 The Expectation-Maximization (EM) Algorithm. Start: "Guess" the centroid $\mu_k$ and covariance $\Sigma_k$ of each of the K clusters. Loop:

22 The Expectation-Maximization (EM) Algorithm: a soft k-means. E step: compute the responsibilities $\tau_{ik}^{(t)} = p(z_{ik} = 1 \mid x_i, \theta^{(t)}) = \frac{\pi_k^{(t)} N(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_j \pi_j^{(t)} N(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})}$. M step: $\pi_k^{(t+1)} = \frac{\sum_i \tau_{ik}^{(t)}}{N}$, $\mu_k^{(t+1)} = \frac{\sum_i \tau_{ik}^{(t)} x_i}{\sum_i \tau_{ik}^{(t)}}$, $\Sigma_k^{(t+1)} = \frac{\sum_i \tau_{ik}^{(t)} (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^T}{\sum_i \tau_{ik}^{(t)}}$.
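A minimal 1-D EM sketch implementing the E and M steps above (illustrative; it assumes scipy is available and that no component's weight collapses to zero):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                   # uniform initial proportions
    mu = rng.choice(x, size=K, replace=False)  # initial means at random samples
    sigma = np.full(K, x.std())                # initial stds from the data
    for _ in range(n_iter):
        # E step: responsibilities tau[i, k] = p(z_ik = 1 | x_i, theta).
        tau = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)
        tau /= tau.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the soft counts.
        Nk = tau.sum(axis=0)
        pi = Nk / len(x)
        mu = (tau * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((tau * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return pi, mu, sigma
```

Replacing the soft `tau` with a one-hot argmax recovers exactly the K-means updates of the previous slide.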

23 Compare: K-means. The EM algorithm for mixtures of Gaussians is like a "soft version" of the K-means algorithm. In the K-means E-step we do hard assignment: $\tau_{ik} = 1$ if $k = \arg\min_j \|x_i - \mu_j\|_2^2$, and $\tau_{ik} = 0$ otherwise. In the K-means M-step we update the means as the weighted sum of the data, but now the weights are 0 or 1: $\mu_k = \frac{\sum_i \tau_{ik} x_i}{\sum_i \tau_{ik}}$.

24 Expected Complete Log Likelihood Lower-bounds the Log Likelihood. For any distribution $q(z)$, define the expected complete log likelihood: $\langle \ell_c(\theta) \rangle_q = \sum_i \sum_z q(z \mid x_i) \log p(x_i, z \mid \theta)$. Does maximizing this surrogate yield a maximizer of the likelihood? By Jensen's inequality, $\ell(\theta; D) = \sum_i \log \sum_z p(x_i, z \mid \theta) = \sum_i \log \sum_z q(z \mid x_i) \frac{p(x_i, z \mid \theta)}{q(z \mid x_i)} \ge \sum_i \sum_z q(z \mid x_i) \log \frac{p(x_i, z \mid \theta)}{q(z \mid x_i)}$.
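One step the slide leaves implicit: the bound becomes an equality when q is the exact posterior, which is precisely what the E-step computes. Written out in LaTeX:

```latex
% Choosing q(z) = p(z \mid x_i, \theta) makes the ratio constant in z:
\sum_z p(z \mid x_i, \theta) \log \frac{p(x_i, z \mid \theta)}{p(z \mid x_i, \theta)}
  = \sum_z p(z \mid x_i, \theta) \log p(x_i \mid \theta)
  = \log p(x_i \mid \theta),
% so with this q the surrogate is a tight lower bound on the log likelihood.
```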

25 Closing notes: convergence, seed choice, quality of clusters, how many clusters.

26 Convergence. Why should the K-means algorithm ever reach a fixed point, i.e., a state in which clusters don't change? K-means is a special case of a general procedure, the Expectation-Maximization (EM) algorithm. Both are known to converge. The number of iterations could be large.

27 Seed Choice. Results can vary based on random seed selection. Some seeds can result in convergence to sub-optimal clusterings. Select good seeds using a heuristic (e.g., the doc least similar to any existing mean); try out multiple starting points (very important!); initialize with the results of another method.
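A minimal sketch of the "least similar to any existing mean" heuristic mentioned above (farthest-first traversal; the function name and details are my own illustration):

```python
import numpy as np

def farthest_first_seeds(X, K, seed=0):
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]  # first seed: a uniformly random sample
    for _ in range(K - 1):
        # Distance from every point to its nearest existing seed...
        d2 = np.min([((X - s) ** 2).sum(axis=1) for s in seeds], axis=0)
        # ...and the next seed is the point farthest from all of them.
        seeds.append(X[d2.argmax()])
    return np.array(seeds)
```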

28 What Is a Good Clustering? Internal criterion: a good clustering will produce high-quality clusters in which the intra-class (that is, intra-cluster) similarity is high and the inter-class similarity is low. The measured quality of a clustering depends on both the object representation and the similarity measure used. External criteria for clustering quality: quality is measured by the ability to discover some or all of the hidden patterns or latent classes in gold-standard data; this assesses a clustering with respect to ground truth.

29 How Many Clusters? Number of clusters K is given: partition docs into a predetermined number of clusters. Finding the right number of clusters is part of the problem: given objects, partition them into an appropriate number of subsets. E.g., for query results, the ideal value of K is not known up front, though the UI may impose limits. Tradeoff between having more clusters (better focus within each cluster) and having too many clusters. Nonparametric Bayesian inference.

30 Cross validation. We can also use cross validation to determine the correct number of classes. Recall that a GMM is a generative model. We can compute the likelihood of the held-out data to determine which model (number of clusters) is more accurate.
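A minimal sketch of this model-selection procedure using scikit-learn's GaussianMixture (assumed available here; `score` returns the mean held-out log-likelihood per sample):

```python
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

def choose_num_clusters(X, k_values=range(1, 11)):
    X_train, X_val = train_test_split(X, test_size=0.3, random_state=0)
    scores = {}
    for k in k_values:
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X_train)
        scores[k] = gmm.score(X_val)  # mean log-likelihood of the held-out data
    return max(scores, key=scores.get), scores
```

Held-out likelihood typically rises and then flattens or falls as K grows past the true number of components, which is what makes it usable for selecting K.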

31 Cross validation

32 Gaussian mixture clustering

33 Clustering methods: Comparison. Hierarchical: running time naively O(N^3); requires a similarity/distance measure; no input parameters; clusters are subjective (only a tree is returned). K-means: fastest (each iteration is linear); strong assumptions; input parameter K (number of clusters); exactly K clusters. GMM: fast (each iteration is linear); strongest assumptions; input parameter K (number of clusters); exactly K clusters.

34 What you should know about Mixture Models. Gaussian mixture models: a probabilistic extension of K-means for soft-clustering. EM algorithm for learning by assuming data are only partially observed; cluster labels are treated as the unobserved part of the data. EM algorithm for learning from partly unobserved data: MLE of $\theta$: $\hat{\theta} = \arg\max_\theta \log p(X, Z \mid \theta)$; EM estimate: $\hat{\theta} = \arg\max_\theta E_{Z \mid X, \theta}[\log p(X, Z \mid \theta)]$, where $X$ is the observed part of the data and $Z$ is unobserved.
