Machine Learning for Data Science (CS4786) Lecture 12

Maurice Farmer
6 years ago

1 Machine Learning for Data Science (CS4786) Lecture 12 Gaussian Mixture Models Course Webpage :

2 Back to K-means: Single link is sensitive to outliers. We need a good clustering algorithm after spectral embedding: K-means?

16 Back to K-means: K-means looks for spherical clusters, of the same size, and with roughly equal numbers of points.
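As a reminder of why K-means has this bias, here is a minimal sketch of Lloyd's iterations in NumPy. The function name `kmeans`, its arguments, and the toy data are illustrative assumptions, not taken from the lecture. Each point is assigned purely by Euclidean distance to the nearest center, so every cluster is the set of points closest to one center, which favors round clusters of similar extent.

```python
import numpy as np

def kmeans(X, init_centers, n_iters=100):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iters):
        # Squared Euclidean distance from every point to every center: shape (n, K).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # assignments can no longer change
        centers = new_centers
    return centers, labels

# Toy data (made up for illustration): two tight groups of two points each.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
centers_demo, labels_demo = kmeans(X_demo, [[0.0, 0.0], [10.0, 0.0]])
```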

20 No Free Lunch: When averaged across all possible situations, all algorithms perform equally well or badly. No assumptions => no method. Let's model our assumptions in a more principled way.

21 How do we model the following?

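24 Multivariate Gaussian
Two parameters: mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma$ of size $d \times d$.
$$p(x; \mu, \Sigma) = (2\pi)^{-d/2} \det(\Sigma)^{-1/2} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$
[Plot: $\exp(-(10x^2 + y^2)/2)$]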
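The density can be evaluated directly from its two parameters. The following is a small NumPy sketch; the function name `gaussian_pdf` is an assumption made for illustration. The plotted example $\exp(-(10x^2 + y^2)/2)$ corresponds, up to the normalizing constant, to $\mu = 0$ and $\Sigma = \mathrm{diag}(1/10, 1)$.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at a single point x (all array-like)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    d = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), via a linear solve
    # rather than an explicit matrix inverse.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = (2.0 * np.pi) ** (-d / 2.0) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * quad)

# Covariance matching the plotted example: Sigma^{-1} = diag(10, 1).
Sigma_plot = np.diag([0.1, 1.0])
p_origin = gaussian_pdf([0.0, 0.0], [0.0, 0.0], Sigma_plot)
```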

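25 Gaussian Mixture Models: EXAMPLES
Each $\theta \in \Theta$ is a model. In a Gaussian mixture model, each $\theta$ consists of a mixture distribution $\pi = (\pi_1, \ldots, \pi_K)$, means $\mu_1, \ldots, \mu_K \in \mathbb{R}^d$, and covariance matrices $\Sigma_1, \ldots, \Sigma_K$.
For each $t$, independently, we generate a new point as follows:
$$c_t \sim \pi, \qquad x_t \sim N(\mu_{c_t}, \Sigma_{c_t})$$
[Figure: example mixture with three components, $\pi_1 = 0.5$, $\pi_2 = 0.25$, $\pi_3 = 0.25$]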
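The generative process can be sketched in a few lines of NumPy. The mixture weights below are the ones shown in the figure; the means, covariances, seed, and the function name `sample_gmm` are made-up values for illustration only.

```python
import numpy as np

def sample_gmm(n, pi, mus, Sigmas, rng):
    """Draw n points: first a component c_t ~ pi, then x_t ~ N(mu_{c_t}, Sigma_{c_t})."""
    pi = np.asarray(pi, dtype=float)
    K = len(pi)
    c = rng.choice(K, size=n, p=pi)  # latent cluster assignments
    x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in c])
    return c, x

pi_demo = [0.5, 0.25, 0.25]                          # weights from the figure
mus_demo = [[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]]      # assumed means
Sigmas_demo = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])]  # assumed covariances
c_demo, x_demo = sample_gmm(1000, pi_demo, mus_demo, Sigmas_demo,
                            np.random.default_rng(0))
```

In the clustering problem only `x_demo` would be observed; `c_demo` plays the role of the hidden assignments.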

26 PROBABILISTIC MODELS
More generally:
$\Theta$ consists of the set of possible parameters.
We have a distribution $P_\theta$ over the data induced by each $\theta \in \Theta$.
Data is generated by one of the $\theta \in \Theta$.
Learning: estimate a value or distribution for $\theta^* \in \Theta$ given data.

27 MAXIMUM LIKELIHOOD PRINCIPLE
Pick the $\theta \in \Theta$ that maximizes the probability of the observations:
$$\theta_{MLE} = \operatorname{argmax}_{\theta \in \Theta} \log \underbrace{P_\theta(x_1, \ldots, x_n)}_{\text{Likelihood}}$$

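28 EXAMPLE: GAUSSIAN MIXTURE MODEL
MLE: $\theta = (\mu_1, \ldots, \mu_K), \pi, \Sigma$
$$P_\theta(x_1, \ldots, x_n) = \prod_{t=1}^{n} \sum_{i=1}^{K} \pi_i \, \frac{1}{\sqrt{(2\pi)^d \det(\Sigma_i)}} \exp\left(-\tfrac{1}{2}(x_t - \mu_i)^\top \Sigma_i^{-1} (x_t - \mu_i)\right)$$
Find the $\theta$ that maximizes $\log P_\theta(x_1, \ldots, x_n)$.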
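Evaluating this objective for a given $\theta$ is straightforward, even though maximizing it is not. Below is a hedged NumPy sketch; the function name `gmm_log_likelihood` and the use of a log-sum-exp step for numerical stability are implementation choices made here, not part of the slides.

```python
import numpy as np

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """log P_theta(x_1..x_n) for a GMM; X has shape (n, d)."""
    X = np.asarray(X, dtype=float)
    pi = np.asarray(pi, dtype=float)
    mus = np.asarray(mus, dtype=float)
    Sigmas = np.asarray(Sigmas, dtype=float)
    n, d = X.shape
    K = len(pi)
    log_terms = np.empty((n, K))
    for k in range(K):
        diff = X - mus[k]
        # Row-wise quadratic form (x_t - mu_k)^T Sigma_k^{-1} (x_t - mu_k).
        quad = np.einsum("td,td->t", diff @ np.linalg.inv(Sigmas[k]), diff)
        _, logdet = np.linalg.slogdet(Sigmas[k])
        # log( pi_k * N(x_t; mu_k, Sigma_k) )
        log_terms[:, k] = np.log(pi[k]) - 0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)
    # Sum over components inside the log (log-sum-exp), then sum over points.
    m = log_terms.max(axis=1, keepdims=True)
    per_point = m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))
    return float(per_point.sum())
```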

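29 MLE FOR GMM
Let us consider the one-dimensional case, and assume the variances are 1 and $\pi$ is uniform:
$$\log P_\theta(x_{1,\ldots,n}) = \sum_{t=1}^{n} \log\left(\frac{1}{K} \sum_{i=1}^{K} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x_t - \mu_i)^2}{2}\right)\right)$$
Now consider the partial derivative w.r.t. $\mu_1$; we have
$$\frac{\partial \log P_\theta(x_{1,\ldots,n})}{\partial \mu_1} = \sum_{t=1}^{n} \frac{(x_t - \mu_1) \exp\left(-\frac{(x_t - \mu_1)^2}{2}\right)}{\sum_{i=1}^{K} \exp\left(-\frac{(x_t - \mu_i)^2}{2}\right)}$$
Given all other parameters, optimizing w.r.t. even just $\mu_1$ is hard!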
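To make the difficulty concrete, this sketch evaluates the one-dimensional, unit-variance, uniform-weight log-likelihood and its partial derivative with respect to one mean. The names `log_lik` and `grad_mu` and the sample numbers are assumptions for illustration. Note that the denominator involves every mean, so setting the derivative to zero does not give a closed-form solution for $\mu_1$.

```python
import numpy as np

def log_lik(x, mu):
    """1-d GMM log-likelihood with unit variances and uniform weights."""
    sq = (x[:, None] - mu[None, :]) ** 2            # (n, K) squared distances
    # (1/K) * sum_i (1/sqrt(2*pi)) * exp(-sq/2), written as a mean over components.
    per_point = np.mean(np.exp(-sq / 2.0), axis=1) / np.sqrt(2.0 * np.pi)
    return float(np.sum(np.log(per_point)))

def grad_mu(x, mu, j):
    """Partial derivative of log_lik with respect to mu[j]."""
    w = np.exp(-((x[:, None] - mu[None, :]) ** 2) / 2.0)   # (n, K)
    # The denominator couples mu[j] to all other means.
    return float(np.sum((x - mu[j]) * w[:, j] / w.sum(axis=1)))

x_demo = np.array([-2.0, -1.5, 0.3, 1.0, 2.5])   # made-up observations
mu_demo = np.array([-1.0, 1.0])                   # made-up current means
g0 = grad_mu(x_demo, mu_demo, 0)
```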

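30 MLE FOR GMM
Say by some magic you knew the cluster assignments; then (in one dimension, with variances $\sigma_k^2$)
$$\log P_\theta((x_t, c_t)_{1,\ldots,n}) = \sum_{t=1}^{n} \log\left(\pi_{c_t} \frac{1}{\sqrt{2\pi\sigma_{c_t}^2}} \exp\left(-\frac{(x_t - \mu_{c_t})^2}{2\sigma_{c_t}^2}\right)\right)$$
$$= \sum_{t=1}^{n} \left(\log(\pi_{c_t}) - \log\left(\sqrt{2\pi}\,\sigma_{c_t}\right) - \frac{(x_t - \mu_{c_t})^2}{2\sigma_{c_t}^2}\right)$$
Now consider the partial derivative w.r.t. $\mu_i$; we have
$$\frac{\partial \log P_\theta((x_t, c_t)_{1,\ldots,n})}{\partial \mu_i} = -\frac{\partial}{\partial \mu_i} \sum_{t=1}^{n} \frac{(x_t - \mu_{c_t})^2}{2\sigma_{c_t}^2} = \frac{1}{\sigma_i^2} \sum_{t : c_t = i} (x_t - \mu_i)$$
How would you compute parameters?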
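Setting the derivative above to zero gives $\mu_i$ as the average of the points with $c_t = i$. A minimal sketch of the resulting closed-form estimates follows; the function name `mle_given_assignments` is hypothetical, and the formulas for $\pi_i$ (cluster fraction) and $\sigma_i^2$ (mean squared deviation) are the standard maximum likelihood estimates, which the slide does not derive. The sketch assumes every cluster has at least one point.

```python
import numpy as np

def mle_given_assignments(x, c, K):
    """Closed-form MLE for a 1-d GMM when the assignments c_t are known."""
    x = np.asarray(x, dtype=float)
    c = np.asarray(c)
    n = len(x)
    pi = np.zeros(K)
    mu = np.zeros(K)
    var = np.zeros(K)
    for i in range(K):
        members = x[c == i]                  # points with c_t = i (assumed non-empty)
        pi[i] = len(members) / n             # fraction of points in cluster i
        mu[i] = members.mean()               # zero of the derivative w.r.t. mu_i
        var[i] = ((members - mu[i]) ** 2).mean()
    return pi, mu, var

pi_hat, mu_hat, var_hat = mle_given_assignments(
    [0.0, 2.0, 10.0, 12.0, 14.0], [0, 0, 1, 1, 1], K=2)
```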

32 LATENT VARIABLES
We only observe $x_1, \ldots, x_n$; the cluster assignments $c_1, \ldots, c_n$ are not observed.
Finding the $\theta$ (even for a 1-d GMM) that directly maximizes the likelihood or the a posteriori probability given $x_1, \ldots, x_n$ is hard!
Given the latent variables $c_1, \ldots, c_n$, the problem of maximizing the likelihood (or a posteriori) becomes easy.
Can we use latent variables to devise an algorithm?

33 TOWARDS EM ALGORITHM
Latent variables can help, but we have a chicken-and-egg problem:
Given all variables, including the latent variables, finding the optimal parameters is easy.
Given the model parameters, finding the distribution over the latent variables is easy.

34 GMM: POWER OF WISHFUL THINKING
1 Initialize model parameters $\pi^{(0)}$, $\mu_1^{(0)}, \ldots, \mu_K^{(0)}$ and $\Sigma_1^{(0)}, \ldots, \Sigma_K^{(0)}$
2 For i = 1 until convergence or bored
  1 Under current model parameters $\theta^{(i-1)}$, compute probability $Q_t^{(i)}(k)$ of each point $x_t$ belonging to cluster $k$
  2 Given probabilities of each point belonging to the various clusters, compute optimal parameters $\theta^{(i)}$
3 End For

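35 EM ALGORITHM FOR GMM
1 Initialize model parameters $\pi^{(0)}$, $\mu_1^{(0)}, \ldots, \mu_K^{(0)}$ and $\Sigma_1^{(0)}, \ldots, \Sigma_K^{(0)}$
2 For i = 1 until convergence or bored
  1 For every $t$ and $k$: $Q_t^{(i)}(k) \propto p(x_t; \mu_k^{(i-1)}, \Sigma_k^{(i-1)}) \, \pi_k^{(i-1)}$
  2 For every $k \in [K]$:
$$\mu_k^{(i)} = \frac{\sum_{t=1}^{n} Q_t^{(i)}(k) \, x_t}{\sum_{t=1}^{n} Q_t^{(i)}(k)}, \qquad \Sigma_k^{(i)} = \frac{\sum_{t=1}^{n} Q_t^{(i)}(k) \, (x_t - \mu_k^{(i)})(x_t - \mu_k^{(i)})^\top}{\sum_{t=1}^{n} Q_t^{(i)}(k)}, \qquad \pi_k^{(i)} = \frac{\sum_{t=1}^{n} Q_t^{(i)}(k)}{n}$$
3 End For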
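The updates above translate almost line by line into NumPy. This is a sketch under stated assumptions rather than a reference implementation: the names `em_gmm` and `gaussian_pdf_rows` are invented here, the loop runs a fixed number of iterations instead of testing convergence, and a small ridge `reg` is added to each covariance to keep it invertible (neither is on the slide). It also works with raw probabilities, so it can underflow for points very far from every mean.

```python
import numpy as np

def gaussian_pdf_rows(X, mu, Sigma):
    """N(x_t; mu, Sigma) for every row x_t of X."""
    d = X.shape[1]
    diff = X - mu
    quad = np.einsum("td,td->t", diff @ np.linalg.inv(Sigma), diff)
    norm = (2.0 * np.pi) ** (-d / 2.0) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * quad)

def em_gmm(X, pi, mus, Sigmas, n_iters=50, reg=1e-6):
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    pi = np.array(pi, dtype=float)
    mus = np.array(mus, dtype=float)
    Sigmas = np.array(Sigmas, dtype=float)
    K = len(pi)
    for _ in range(n_iters):
        # Step 1: Q_t(k) proportional to p(x_t; mu_k, Sigma_k) * pi_k, normalized over k.
        Q = np.column_stack(
            [pi[k] * gaussian_pdf_rows(X, mus[k], Sigmas[k]) for k in range(K)])
        Q /= Q.sum(axis=1, keepdims=True)
        # Step 2: weighted means, covariances, and mixture weights.
        Nk = Q.sum(axis=0)                       # sum_t Q_t(k)
        mus = (Q.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (Q[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(d)
        pi = Nk / n
    return pi, mus, Sigmas, Q
```

The returned `Q` holds the soft assignments from the last pass of step 1; taking `Q.argmax(axis=1)` would turn them into hard cluster labels.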

36 Demo
