Clustering. CM226: Machine Learning for Bioinformatics. Fall 2016. Sriram Sankararaman. Acknowledgments: Fei Sha, Ameet Talwalkar.



Administration
HW 1 due on Monday. Post on CCLE if you have questions.
No class on October 19. OH moved to Monday 1pm.

Supervised versus Unsupervised Learning
Supervised: learning from labeled observations. Labels teach the algorithm to learn a mapping from observations to labels. Examples: classification, regression. Topic of the last three lectures.
Unsupervised: learning from unlabeled observations. The learning algorithm must find latent structure from the features alone. This can be a goal in itself (discover hidden patterns, exploratory analysis) or a means to an end (preprocessing for a supervised task). Examples: clustering; dimensionality reduction (transform an initial feature representation into a more concise representation); modeling a complex distribution from simple distributions.

Outline
Clustering: K-means; Gaussian mixture models
EM Algorithm
Model selection

Clustering Setup
Given D = {x_n}_{n=1}^N and K, we want to output
{µ_k}_{k=1}^K: the prototypes of the clusters
A(x_n) ∈ {1, 2, ..., K}: the cluster membership, i.e., the cluster ID assigned to x_n
Toy example: cluster the data into two clusters. [Figure: scatter plots (a) unclustered and (i) clustered into two groups.]
Definition: group data points so that points within a group are more similar than points across groups.

K-means example
[Figure: panels (a)-(i) showing successive K-means iterations on a 2-D toy dataset, alternating cluster assignments and prototype updates until convergence.]

K-means clustering
Intuition: data points assigned to cluster k should be close to µ_k, the prototype.
Distortion measure (clustering objective function, cost function):
J = Σ_{n=1}^N Σ_{k=1}^K r_nk ||x_n − µ_k||_2^2
where r_nk ∈ {0, 1} is an indicator variable: r_nk = 1 if and only if A(x_n) = k.
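To make the objective concrete, here is a minimal numpy sketch (function and variable names are my own, not from the lecture) that evaluates the distortion J for a given hard assignment and set of prototypes.

```python
import numpy as np

def distortion(X, assignments, mu):
    """J = sum_n ||x_n - mu_{A(x_n)}||_2^2 for hard assignments."""
    # X: (N, d) data; assignments: (N,) cluster IDs in {0, ..., K-1}; mu: (K, d) prototypes
    diffs = X - mu[assignments]          # each point minus its own prototype
    return float(np.sum(diffs ** 2))

# toy check: two well-separated 1-D clusters
X = np.array([[0.1], [0.2], [4.0], [4.1]])
mu = np.array([[0.15], [4.05]])
print(distortion(X, np.array([0, 0, 1, 1]), mu))   # small J for the natural assignment
print(distortion(X, np.array([1, 1, 0, 0]), mu))   # much larger J for the swapped one
```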

Algorithm
Minimize the distortion measure by alternating optimization between {r_nk} and {µ_k}.
Step 0: Initialize {µ_k} to some values.
Step 1: Keeping the current value of {µ_k} fixed, minimize J over {r_nk}, which leads to the cluster assignment rule
r_nk = 1 if k = argmin_j ||x_n − µ_j||_2^2, and r_nk = 0 otherwise.
Step 2: Keeping the current value of {r_nk} fixed, minimize J over {µ_k}, which leads to the following rule for updating the prototypes of the clusters:
µ_k = Σ_n r_nk x_n / Σ_n r_nk
Step 3: Determine whether to stop or return to Step 1.
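A compact numpy sketch of the alternating procedure (my own names; initialization, ties, and empty clusters are handled in the simplest possible way):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate Step 1 (assignment) and Step 2 (prototype update) until assignments stop changing."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # Step 0: init prototypes at random data points
    assignments = np.full(len(X), -1)
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest prototype
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        new_assignments = d2.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):           # Step 3: stop when nothing changes
            break
        assignments = new_assignments
        # Step 2: recompute each prototype as the mean of its assigned points
        for k in range(K):
            members = X[assignments == k]
            if len(members) > 0:
                mu[k] = members.mean(axis=0)
    return mu, assignments
```

For example, kmeans(np.array([[0.1], [0.2], [4.0], [4.1]]), 2) converges in a couple of iterations to prototypes near 0.15 and 4.05.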

Remarks
The prototype µ_k is the mean of the data points assigned to cluster k, hence "K-means".
The procedure reduces J in both Step 1 and Step 2, and thus makes improvements on each iteration.
There is no guarantee we find the global solution; the quality of the local optimum depends on the initial values chosen at Step 0 (k-means++ is a neat approximation algorithm).

Probabilistic interpretation of clustering?
We can impose a probabilistic interpretation on our intuition that points stay close to their cluster centers. How can we model p(x) to reflect this?
[Figure (b): 2-D scatter plot of the data.]
The data points seem to form 3 clusters.
We cannot model p(x) with simple and known distributions. E.g., the data is not a Gaussian because we have 3 distinct concentrated regions.

Gaussian mixture models: intuition
[Figure (a): the same data, colored by region.]
We can model each region with a distinct distribution. It is common to use Gaussians, i.e., Gaussian mixture models (GMMs) or mixtures of Gaussians (MoGs).
We don't know the cluster assignments (labels), the parameters of the Gaussians, or the mixture components! We need to learn them all from our unlabeled data D = {x_n}_{n=1}^N.

Gaussian mixture models: formal definition
A Gaussian mixture model has the following density function for x:
p(x) = Σ_{k=1}^K π_k N(x | µ_k, Σ_k)
K: the number of Gaussians; they are called (mixture) components.
µ_k and Σ_k: mean and covariance matrix of the k-th component.
π_k: mixture weights; they represent how much each component contributes to the final distribution (priors). They satisfy two properties: ∀k, π_k > 0, and Σ_k π_k = 1. These properties ensure p(x) is a properly normalized probability density function.
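As a sanity check on the definition, the density can be evaluated directly; a minimal scipy sketch with illustrative (made-up) parameter values:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, covs):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for pi, mu, cov in zip(pis, mus, covs))

# a two-component mixture in 2-D (parameters chosen only for illustration)
pis = [0.3, 0.7]
mus = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([0.0, 0.0]), pis, mus, covs))
```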

GMM as the marginal distribution of a joint distribution
Consider the following joint distribution
p(x, z) = p(z) p(x | z)
where z is a discrete random variable taking values between 1 and K, i.e., a multinomial random variable.
Observed variables: x. Hidden or latent variables: z.
Denote π_k = p(z = k), and assume the conditional distributions are Gaussian:
p(x | z = k) = N(x | µ_k, Σ_k)
Then the marginal distribution of x is
p(x) = Σ_{k=1}^K p(z = k) p(x | z = k) = Σ_{k=1}^K π_k N(x | µ_k, Σ_k)
namely, the Gaussian mixture model.
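The latent-variable view suggests ancestral sampling: draw z from the mixing weights, then x from the chosen Gaussian. A small sketch with the same illustrative parameters as above:

```python
import numpy as np

rng = np.random.default_rng(0)
pis = np.array([0.3, 0.7])
mus = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]

def sample_gmm(n):
    """Ancestral sampling: z ~ Categorical(pi), then x | z = k ~ N(mu_k, Sigma_k)."""
    zs = rng.choice(len(pis), size=n, p=pis)                             # hidden component labels
    xs = np.array([rng.multivariate_normal(mus[k], covs[k]) for k in zs])
    return xs, zs

X, Z = sample_gmm(5)
print(Z)   # latent assignments (unobserved in the clustering setting)
print(X)   # observed data
```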

GMMs: example
[Figures (a), (b): the data colored by region, and the corresponding mixture density.]
The conditional distributions of x given z (representing color) are
p(x | z = red) = N(x | µ_1, Σ_1)
p(x | z = blue) = N(x | µ_2, Σ_2)
p(x | z = green) = N(x | µ_3, Σ_3)
The marginal distribution is thus
p(x) = p(red) N(x | µ_1, Σ_1) + p(blue) N(x | µ_2, Σ_2) + p(green) N(x | µ_3, Σ_3)

Parameter estimation for Gaussian mixture models
The parameters in GMMs are θ = {π_k, µ_k, Σ_k}_{k=1}^K. To estimate them, consider the simple (and unrealistic) case first.
We have labels z: if we assume z_n is observed for every x_n, then our estimation problem is easier to solve. In fact, it is a supervised learning problem. Our training data is augmented: {x_n, z_n}_{n=1}^N instead of {x_n}_{n=1}^N, where z_n denotes the region that x_n comes from. The former is the complete data and the latter the incomplete data.
How can we learn our parameters? The maximum likelihood estimate is obtained by maximizing the complete log-likelihood
θ* = argmax_θ LL_c(θ) = argmax_θ Σ_n log P(x_n, z_n)

Parameter estimation for GMMs: complete data
The complete likelihood is decomposable:
LL_c(θ) = Σ_n log P(x_n, z_n) = Σ_n log P(z_n) p(x_n | z_n)
Introduce a binary variable z_nk ∈ {0, 1} to indicate whether z_n = k. Then
P(z_n) = Π_{k=1}^K P(z_n = k)^{1{z_n = k}} = Π_{k=1}^K P(z_n = k)^{z_nk}
P(x_n | z_n) = Π_{k=1}^K P(x_n | z_n = k)^{1{z_n = k}} = Π_{k=1}^K P(x_n | z_n = k)^{z_nk}
We then have
LL_c(θ) = Σ_n log Π_k [P(z_n = k) p(x_n | z_n = k)]^{z_nk}

Parameter estimation for GMMs: complete data
LL_c(θ) = Σ_n log Π_k [P(z_n = k) p(x_n | z_n = k)]^{z_nk}
= Σ_n Σ_k z_nk log [P(z_n = k) p(x_n | z_n = k)]
= Σ_n Σ_k z_nk [log P(z_n = k) + log p(x_n | z_n = k)]
We use the dummy variables z_nk to denote all the possible cluster assignment values for x_n; the complete data specifies these values.

Parameter estimation for GMMs: complete data
From our previous discussion, we have
LL_c(θ) = Σ_n Σ_k z_nk [log π_k + log N(x_n | µ_k, Σ_k)]
Regrouping, we have
LL_c(θ) = Σ_k Σ_n z_nk log π_k + Σ_k {Σ_n z_nk log N(x_n | µ_k, Σ_k)}
The term inside the braces depends only on the k-th component's parameters. It is now easy to show (left as an exercise) that the MLE is:
π_k = Σ_n z_nk / Σ_{k'} Σ_n z_nk'
µ_k = (1 / Σ_n z_nk) Σ_n z_nk x_n
Σ_k = (1 / Σ_n z_nk) Σ_n z_nk (x_n − µ_k)(x_n − µ_k)^T
What's the intuition?
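When the labels are observed, these MLEs are just per-component counts, means, and covariances; a minimal numpy sketch (names are mine, not from the lecture):

```python
import numpy as np

def complete_data_mle(X, z, K):
    """MLE of (pi_k, mu_k, Sigma_k) when the component label z_n of every x_n is observed."""
    N = len(X)
    pis, mus, covs = [], [], []
    for k in range(K):
        Xk = X[z == k]                        # all points whose label is k
        pis.append(len(Xk) / N)               # fraction of points in component k
        mus.append(Xk.mean(axis=0))           # their mean
        covs.append(np.cov(Xk.T, bias=True))  # their covariance (bias=True gives the 1/N_k MLE)
    return np.array(pis), np.array(mus), np.array(covs)
```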

Intuition
Since z_nk is binary, the previous solution is nothing but:
For π_k: count the number of data points whose z_n is k and divide by the total number of data points (note that Σ_n Σ_k z_nk = N).
For µ_k: take all the data points whose z_n is k and compute their mean.
For Σ_k: take all the data points whose z_n is k and compute their covariance matrix.
This intuition is going to help us develop an algorithm for estimating θ when we do not know z_n (incomplete data).

Parameter estimation for GMMs: incomplete data
When z_n is not given, we can guess it via the posterior probability
p(z_n = k | x_n) = p(x_n | z_n = k) p(z_n = k) / p(x_n) = p(x_n | z_n = k) p(z_n = k) / Σ_{k'=1}^K p(x_n | z_n = k') p(z_n = k')
To compute the posterior probability, we need to know the parameters θ!
Let's pretend we know the value of the parameters, so we can compute the posterior probability. How is that going to help us?
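Given (pretend-known) parameters, the posterior is a normalized product of prior weight and Gaussian likelihood; a short sketch computing all r_nk at once:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, covs):
    """r_nk = pi_k N(x_n | mu_k, Sigma_k) / sum_k' pi_k' N(x_n | mu_k', Sigma_k')."""
    K = len(pis)
    # unnormalized posterior: prior weight times Gaussian likelihood, one column per component
    R = np.column_stack([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=covs[k])
                         for k in range(K)])
    return R / R.sum(axis=1, keepdims=True)   # normalize each row over the K components
```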

Estimation with soft r_nk
We define r_nk = p(z_n = k | x_n).
Recall that z_nk should be binary; r_nk is a soft assignment of x_n to the k-th component. Each x_n is assigned to a component fractionally, according to r_nk.
Use the soft assignments in the complete log-likelihood:
l(θ) = Σ_k Σ_n r_nk log π_k + Σ_k {Σ_n r_nk log N(x_n | µ_k, Σ_k)}
Using the soft r_nk in LL_c, we get the same expressions for the MLE!
π_k = Σ_n r_nk / Σ_{k'} Σ_n r_nk'
µ_k = (1 / Σ_n r_nk) Σ_n r_nk x_n
Σ_k = (1 / Σ_n r_nk) Σ_n r_nk (x_n − µ_k)(x_n − µ_k)^T
But remember, we're cheating by using θ to compute r_nk!
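With soft r_nk in place of the binary z_nk, the updates become weighted counts, weighted means, and weighted covariances; a sketch that consumes the responsibility matrix from the previous snippet:

```python
import numpy as np

def soft_mle(X, R):
    """Weighted MLE, with R[n, k] = r_nk the soft responsibilities."""
    N, d = X.shape
    Nk = R.sum(axis=0)                                      # effective number of points per component
    pis = Nk / N                                            # pi_k = sum_n r_nk / N
    mus = (R.T @ X) / Nk[:, None]                           # mu_k = (1/N_k) sum_n r_nk x_n
    covs = np.empty((R.shape[1], d, d))
    for k in range(R.shape[1]):
        diff = X - mus[k]
        covs[k] = (R[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
    return pis, mus, covs
```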

Iterative procedure
We can alternate between estimating r_nk and using the estimated r_nk to compute the parameters (same idea as with K-means!):
Step 0: initialize θ with some values (random or otherwise)
Step 1: compute r_nk using the current θ
Step 2: update θ using the just-computed r_nk
Step 3: go back to Step 1
Questions: Is this procedure reasonable, i.e., are we optimizing a sensible criterion? Will this procedure converge?
The answers lie in the EM algorithm, a powerful procedure for model estimation with unknown (hidden) data.

GMMs and K-means
GMMs provide a probabilistic interpretation for K-means. GMMs reduce to K-means under the following assumptions (in which case EM for GMM parameter estimation simplifies to K-means):
Assume all Gaussians have σ²I covariance matrices.
Further assume σ → 0, so we only need to estimate the µ_k, i.e., the means.
K-means is often called hard GMM, or GMMs are called soft K-means.
The posterior γ_nk (the r_nk above) provides a probabilistic assignment of x_n to cluster k.
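To see the K-means limit numerically, shrink a shared spherical variance σ² and watch the responsibilities collapse onto the nearest mean; a tiny 1-D sketch with made-up numbers:

```python
import numpy as np

x = 1.0                                    # a point lying between two component means
mus = np.array([0.0, 3.0])
for sigma in [2.0, 0.5, 0.1, 0.01]:
    # equal-weight spherical Gaussians: r_k is proportional to exp(-(x - mu_k)^2 / (2 sigma^2))
    logits = -(x - mus) ** 2 / (2 * sigma ** 2)
    r = np.exp(logits - logits.max())      # subtract the max for numerical stability
    r /= r.sum()
    print(sigma, np.round(r, 4))           # tends to the hard assignment [1, 0] as sigma -> 0
```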

Outline
Clustering: K-means; Gaussian mixture models
EM Algorithm
Model selection

EM algorithm: motivation and setup
As a general procedure, EM is used to estimate parameters for probabilistic models with hidden/latent variables.
Suppose the model is given by a joint distribution with marginal p(x | θ) = Σ_z p(x, z | θ), where x is the observed random variable and z is hidden.
We are given data containing only the observed variable, D = {x_n}, where the corresponding hidden variable values z_n are not included. Our goal is to obtain the maximum likelihood estimate of θ. Namely, we choose
θ* = argmax_θ LL(θ) = argmax_θ Σ_n log p(x_n | θ) = argmax_θ Σ_n log Σ_{z_n} p(x_n, z_n | θ)
The objective function LL(θ) is called the incomplete log-likelihood.

Expected (complete) log-likelihood
θ* = argmax_θ Σ_n log Σ_{z_n} p(x_n, z_n | θ)
The difficulty with the incomplete log-likelihood is that it needs to sum over all possible values that z_n can take and then take a logarithm:
LL(θ) = Σ_n log Σ_{z_n} p(x_n, z_n | θ)
This log-sum format makes the computation intractable.

Expected (complete) log-likelihood
If we knew the z_n (complete data setting), optimizing the likelihood would be easy. Instead, the EM algorithm uses a clever trick to change this into a sum-log form:
Q_q(θ) = Σ_n E_{z_n ∼ q(z_n)} log P(x_n, z_n | θ) = Σ_n Σ_{z_n} q(z_n) log P(x_n, z_n | θ)
which is called the expected (complete) log-likelihood (with respect to q(z)). Here q(z) is a distribution over z. Note that Q_q(θ) takes the sum-log form, which turns out to be tractable.

Examples
Consider the previous model where x could come from 3 regions. We can choose q(z) to be any valid distribution; different choices lead to different Q_q(θ). Note that z here represents the different colors.
q(z = k) = 1/3 for any of the 3 colors. This gives rise to
Q_q(θ) = Σ_n (1/3) [log P(x_n, red | θ) + log P(x_n, blue | θ) + log P(x_n, green | θ)]
q(z = k) = 1/2 for red and blue, 0 for green. This gives rise to
Q_q(θ) = Σ_n (1/2) [log P(x_n, red | θ) + log P(x_n, blue | θ)]

Which q(z) to choose?
We will choose a special q(z) = p(z | x; θ'), i.e., the posterior probability of z. We define
Q(θ; θ') = Q_{q = p(z | x; θ')}(θ) = Σ_n E_{z_n ∼ p(z_n | x_n; θ')} [log P(x_n, z_n | θ)]

EM algorithm
We alternate between estimating r_nk and using the estimated r_nk to compute the parameters (same idea as with K-means!):
Step 0: Initialize t ← 0 and θ^(0) with some values (random or otherwise).
Repeat:
Step 1 (E-step): Compute r_nk^(t) = p(z_n = k | x_n, θ^(t)) using the current θ^(t).
Step 2 (M-step): θ^(t+1) ← argmax_θ Q(θ; θ^(t)), i.e., update θ using the just-computed r_nk^(t).
Step 3: t ← t + 1. Go back to Step 1.
until converged.

EM for GMM
Recall the complete log-likelihood
LL_c(θ) = Σ_n Σ_k z_nk [log π_k + log N(x_n | µ_k, Σ_k)]
The Q function for the GMM is
Q(θ; θ^(t)) = E_{θ^(t)} [Σ_n Σ_k z_nk {log π_k + log N(x_n | µ_k, Σ_k)}]
= Σ_n Σ_k E_{θ^(t)}[z_nk] {log π_k + log N(x_n | µ_k, Σ_k)}
= Σ_n Σ_k r_nk^(t) {log π_k + log N(x_n | µ_k, Σ_k)}

EM for GMM
Regrouping,
Q(θ; θ^(t)) = Σ_k Σ_n r_nk^(t) log π_k + Σ_k {Σ_n r_nk^(t) log N(x_n | µ_k, Σ_k)}
We have recovered the parameter estimation algorithm for GMMs discussed previously! Maximizing Q gives the M-step updates to the parameters:
π_k^(t+1) = Σ_n r_nk^(t) / Σ_{k'} Σ_n r_nk'^(t)
µ_k^(t+1) = (1 / Σ_n r_nk^(t)) Σ_n r_nk^(t) x_n
Σ_k^(t+1) = (1 / Σ_n r_nk^(t)) Σ_n r_nk^(t) (x_n − µ_k^(t+1))(x_n − µ_k^(t+1))^T
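Putting the E- and M-steps together gives the full EM loop for a GMM. A compact sketch (my own function, with deliberately simple initialization and only a small ridge as a safeguard against singular covariances):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=200, tol=1e-6, seed=0):
    """EM for a Gaussian mixture: alternate responsibilities (E-step) and weighted MLE (M-step)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].astype(float)   # init means at random data points
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: r_nk proportional to pi_k N(x_n | mu_k, Sigma_k)
        R = np.column_stack([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=covs[k])
                             for k in range(K)])
        ll = np.log(R.sum(axis=1)).sum()          # incomplete log-likelihood at the current parameters
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted counts, means, and covariances
        Nk = R.sum(axis=0)
        pis = Nk / N
        mus = (R.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            covs[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pis, mus, covs, ll
```

The log-likelihood computed at the top of each iteration is non-decreasing across iterations, which is the EM guarantee discussed later in the lecture.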

Why does EM work?
EM: construct a lower bound on LL(θ) (E-step) and optimize it (M-step).
If we define q(z) as a distribution over z, then
LL(θ) = Σ_n log Σ_{z_n} p(x_n, z_n | θ) = Σ_n log Σ_{z_n} q(z_n) [p(x_n, z_n | θ) / q(z_n)]
≥ Σ_n Σ_{z_n} q(z_n) log [p(x_n, z_n | θ) / q(z_n)]
The last step follows from Jensen's inequality, i.e., f(EX) ≥ E f(X) for a concave function f.

Which q(z) to choose?
LL(θ) = Σ_n log Σ_{z_n} p(x_n, z_n | θ) = Σ_n log Σ_{z_n} q(z_n) [p(x_n, z_n | θ) / q(z_n)] ≥ Σ_n Σ_{z_n} q(z_n) log [p(x_n, z_n | θ) / q(z_n)]
The lower bound we derived for LL(θ) holds for all choices of q(·).
We want a tight lower bound, and given some current estimate θ_t, we will pick q(·) such that our lower bound holds with equality at θ_t.
Choose q(z_n) ∝ p(x_n, z_n | θ_t)! Since q(·) is a distribution, we have
q(z_n) = p(x_n, z_n | θ_t) / Σ_k p(x_n, z_n = k | θ_t) = p(x_n, z_n | θ_t) / p(x_n | θ_t) = p(z_n | x_n; θ_t)
This is the posterior distribution of z_n given x_n and θ_t.
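A quick numerical check of the tightness claim on a tiny 1-D, two-component mixture with made-up parameters: with q(z_n) set to the posterior at θ_t, the lower bound equals LL(θ_t) exactly.

```python
import numpy as np
from scipy.stats import norm

# made-up 1-D, two-component mixture playing the role of theta_t
pis = np.array([0.4, 0.6])
mus = np.array([0.0, 4.0])
sds = np.array([1.0, 1.5])
x = np.array([-0.5, 0.3, 3.8, 5.1])                        # a few observed points

joint = pis * norm.pdf(x[:, None], loc=mus, scale=sds)     # p(x_n, z_n = k | theta_t), shape (N, K)
LL = np.log(joint.sum(axis=1)).sum()                       # incomplete log-likelihood

q = joint / joint.sum(axis=1, keepdims=True)               # posterior p(z_n = k | x_n; theta_t)
bound = (q * np.log(joint / q)).sum()                      # sum_n sum_z q(z) log[p(x, z)/q(z)]

print(LL, bound)   # equal up to floating point: the bound is tight at theta_t
```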

E and M Steps
Our simplified expression:
LL(θ_t) = Σ_n Σ_{z_n} p(z_n | x_n; θ_t) log [p(x_n, z_n | θ_t) / p(z_n | x_n; θ_t)]
E-Step: For all n, compute q(z_n) = p(z_n | x_n; θ_t).
Why is this called the E-Step? Because we can view it as computing the expected (complete) log-likelihood:
Q(θ | θ_t) = Σ_n Σ_{z_n} p(z_n | x_n; θ_t) log p(x_n, z_n | θ) = Σ_n E_q log p(x_n, z_n | θ)
M-Step: Maximize Q(θ | θ_t), i.e., θ_{t+1} = argmax_θ Q(θ | θ_t).

Iterative and monotonic improvement
We can show that LL(θ_{t+1}) ≥ LL(θ_t). Recall that we chose q(·) in the E-step such that
LL(θ_t) = Σ_n Σ_{z_n} q(z_n) log [p(x_n, z_n | θ_t) / q(z_n)]
In the M-step, θ_{t+1} is chosen to maximize the right-hand side as a function of θ. Since the lower bound holds for every θ,
LL(θ_{t+1}) ≥ Σ_n Σ_{z_n} q(z_n) log [p(x_n, z_n | θ_{t+1}) / q(z_n)] ≥ Σ_n Σ_{z_n} q(z_n) log [p(x_n, z_n | θ_t) / q(z_n)] = LL(θ_t)
which proves the desired result.
Note: the EM procedure converges, but only to a local optimum. Run the algorithm with random initializations and pick the best solution.

Bayesian viewpoint
We can impose a prior on θ. This reduces over-fitting. We can then compute the maximum a posteriori (MAP) estimate, which is equivalent to adding a regularizer:
θ̂ = argmax_θ log P(D | θ) + log P(θ)
EM can be extended to this setting.

Outline
Clustering: K-means; Gaussian mixture models
EM Algorithm
Model selection

Model selection: choosing K
How well does the model fit the data for different K?
Problem: the model will fit better with larger K.
We can instead evaluate the fit on independent data not used in training.

Model selection: choosing K
Bayesian solution: the prior controls model complexity:
K* = argmax_K P(D | K) = argmax_K ∫ P(D | θ) P(θ | K) dθ
It is challenging to evaluate this integral.
Other approximate methods to control model complexity:
Akaike Information Criterion (AIC): K* = argmax_K 2 LL(θ̂) − 2k
Bayesian Information Criterion (BIC): K* = argmax_K 2 LL(θ̂) − k log(n)
θ̂: the MLE; k: the number of parameters; n: the number of data points.
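If scikit-learn is available, its GaussianMixture exposes AIC and BIC directly (in the lower-is-better convention, equivalent to maximizing 2 LL − 2k and 2 LL − k log n as above); a small sketch on synthetic data where the criteria typically favor K = 3:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy data: three well-separated 2-D blobs
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0.0, 0.0], [4.0, 0.0], [0.0, 4.0])])

for K in range(1, 7):
    gm = GaussianMixture(n_components=K, random_state=0).fit(X)
    # sklearn's aic/bic are reported in the "lower is better" convention
    print(K, round(gm.aic(X), 1), round(gm.bic(X), 1))
```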

Model selection: choosing K
Bayesian nonparametrics: put a prior on K! More details in future lectures.

Summary
Unsupervised learning: finding structure in data without labels.
Clustering: K-means; GMMs as a probabilistic model for K-means.
Inference on hidden or latent variables: more challenging than supervised learning.
EM algorithm: a principled approach to inference in these models.
