16 EXPECTATION MAXIMIZATION
"A hen is only an egg's way of making another egg." – Samuel Butler

Suppose you were building a naive Bayes model for a text categorization problem. After you were done, your boss told you that it became prohibitively expensive to obtain labeled data. You now have a probabilistic model that assumes access to labels, but you don't have any labels! Can you still do something? Amazingly, you can. You can treat the labels as hidden variables, and attempt to learn them at the same time as you learn the parameters of your model. A very broad family of algorithms for solving problems just like this is the expectation maximization family. In this chapter, you will derive expectation maximization (EM) algorithms for clustering and dimensionality reduction, and then see why EM works.

Learning Objectives:
- Explain the relationship between parameters and hidden variables.
- Construct generative stories for clustering and dimensionality reduction.
- Draw a graph explaining how EM works by constructing convex lower bounds.
- Implement EM for clustering with mixtures of Gaussians, contrasting it with k-means.
- Evaluate the differences between EM and gradient descent for hidden variable models.

Dependencies:

16.1 Grading an Exam without an Answer Key

Alice's machine learning professor Carlos gives out an exam that consists of 50 true/false questions. Alice's class of 100 students takes the exam, and Carlos goes to grade their solutions. If Carlos made an answer key, this would be easy: he would just count the fraction of correctly answered questions each student got, and that would be their score. But, like many professors, Carlos was really busy and didn't have time to make an answer key. Can he still grade the exam?

There are two insights that suggest that he might be able to. Suppose he knows ahead of time that Alice is an awesome student, and is basically guaranteed to get 100% on the exam. In that case, Carlos can simply use Alice's answers as the ground truth.
More generally, if Carlos assumes that on average students are better than random guessing, he can hope that the majority answer for each question is likely to be correct. Combining this with the previous insight, when doing the voting, he might want to pay more attention to the answers of the better students. To be a bit more pedantic, suppose there are N = 100 students and M = 50 questions. Each student n has a score s_n, between 0 and 1, that denotes how well they do on the exam. The score is what we really want to compute. For each question m and each student n, the student has provided an answer a_{n,m}, which is either zero or one. There is also an unknown ground truth answer for each question m, which we'll call t_m, also either zero or one.

As a starting point, let's consider a simple heuristic and then complexify it. The heuristic is the majority vote heuristic, and works as follows. First, we estimate t_m as the most common answer for question m: t_m = argmax_t ∑_n 1[a_{n,m} = t]. Once we have a guess for each true answer, we estimate each student's score as the fraction of answers they produced that match this guessed key: s_n = (1/M) ∑_m 1[a_{n,m} = t_m]. Once we have these scores, however, we might want to trust some of the students more than others. In particular, answers from students with high scores are perhaps more likely to be correct, so we can recompute the ground truth according to weighted votes. The weight of each vote will be precisely the score of the corresponding student:

t_m = argmax_t ∑_n s_n 1[a_{n,m} = t]   (16.1)

You can recognize this as a chicken and egg problem. If you knew the students' scores, you could estimate an answer key. If you had an answer key, you could compute student scores. A very common strategy in computer science for dealing with such chicken and egg problems is to iterate: take a guess at the first, compute the second, recompute the first, and so on.

In order to develop this idea formally, we have to cast the problem in terms of a probabilistic model with a generative story. The generative story we'll use is:

1. For each question m, choose a true answer t_m ~ Ber(0.5).
2. For each student n, choose a score s_n ~ Uni(0, 1).
3. For each question m and each student n, choose an answer a_{n,m} ~ Ber(s_n)^{t_m} Ber(1 − s_n)^{1 − t_m}.

In the first step, we generate the true answers independently by flipping a fair coin. In the second step, each student's overall score is determined to be a uniform random number between zero and one. The tricky step is step three, where each student's answer is generated for each question.
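The iterate strategy for the majority-vote heuristic can be sketched directly in code. This is a toy sketch, not from the book: the function name and the tie-breaking rule (ties go to answer 1) are choices made here, and the first pass uses uniform weights, so it reduces to plain majority vote before switching to score-weighted voting.

```python
def grade_exam(answers, iterations=10):
    """Iterate the chicken-and-egg heuristic: guess an answer key by
    (score-weighted) majority vote, then re-score students against it.

    answers[n][m] is student n's 0/1 answer to question m."""
    N, M = len(answers), len(answers[0])
    scores = [1.0] * N  # uniform weights at first: plain majority vote
    for _ in range(iterations):
        # Eq (16.1): t_m = argmax_t sum_n s_n 1[a_{n,m} = t]  (ties -> 1)
        key = []
        for m in range(M):
            vote1 = sum(scores[n] for n in range(N) if answers[n][m] == 1)
            vote0 = sum(scores[n] for n in range(N) if answers[n][m] == 0)
            key.append(1 if vote1 >= vote0 else 0)
        # re-score: fraction of each student's answers matching the key
        scores = [sum(1 for m in range(M) if answers[n][m] == key[m]) / M
                  for n in range(N)]
    return key, scores
```

On a small class this settles after a round or two: students who agree with the consensus end up with high scores, and their votes then carry more weight in the next round.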
Consider student n answering question m, and suppose that s_n = 0.9. If t_m = 1, then a_{n,m} should be 1 (i.e., correct) 90% of the time; this can be accomplished by drawing the answer from Ber(0.9). On the other hand, if t_m = 0, then a_{n,m} should be 1 (i.e., incorrect) 10% of the time; this can be accomplished by drawing the answer from Ber(0.1). The exponent in step 3 selects which of the two Bernoulli distributions to draw from, and thereby implements this rule. This can be translated into the following likelihood:

p(a, t, s) = [∏_m 0.5^{t_m} 0.5^{1 − t_m}] ∏_n ∏_m s_n^{a_{n,m} t_m} (1 − s_n)^{(1 − a_{n,m}) t_m} s_n^{(1 − a_{n,m})(1 − t_m)} (1 − s_n)^{a_{n,m} (1 − t_m)}   (16.2)
           = 0.5^M ∏_n ∏_m s_n^{a_{n,m} t_m} (1 − s_n)^{(1 − a_{n,m}) t_m} s_n^{(1 − a_{n,m})(1 − t_m)} (1 − s_n)^{a_{n,m} (1 − t_m)}   (16.3)

Suppose we knew the true labels t. We could take the log of this likelihood and differentiate it with respect to the score s_n of some student n (note: we can drop the 0.5^M term because it is just a constant):

log p(a, t, s) = ∑_n ∑_m [ a_{n,m} t_m log s_n + (1 − a_{n,m})(1 − t_m) log s_n + (1 − a_{n,m}) t_m log(1 − s_n) + a_{n,m} (1 − t_m) log(1 − s_n) ]   (16.4)

∂ log p(a, t, s) / ∂s_n = ∑_m [ (a_{n,m} t_m + (1 − a_{n,m})(1 − t_m)) / s_n − ((1 − a_{n,m}) t_m + a_{n,m} (1 − t_m)) / (1 − s_n) ]   (16.5)

The derivative has the form A/s_n − B/(1 − s_n). If we set this equal to zero and solve for s_n, we get an optimum of s_n = A/(A + B). In this case:

A = ∑_m [ a_{n,m} t_m + (1 − a_{n,m})(1 − t_m) ]   (16.6)
B = ∑_m [ (1 − a_{n,m}) t_m + a_{n,m} (1 − t_m) ]   (16.7)
A + B = ∑_m 1 = M   (16.8)

Putting this together, we get:

s_n = (1/M) ∑_m [ a_{n,m} t_m + (1 − a_{n,m})(1 − t_m) ]   (16.9)

In the case of known t's, this matches exactly what we had in the heuristic. However, we do not know t, so instead of using the true values of t, we're going to use their expectations. In particular, we will compute s_n by maximizing its likelihood under the expected values of t, hence the name "expectation maximization." If we are going to compute expectations of t, we have to say: expectations according to which probability distribution? We will use the distribution p(t | a, s). Let t̃_m denote E_{t_m ∼ p(t_m | a, s)}[t_m]. Because t_m is a binary variable, its expectation is equal to its probability; namely, t̃_m = p(t_m = 1 | a, s).

How can we compute this? We will compute C = p(t_m = 1, a | s) and D = p(t_m = 0, a | s) and then compute t̃_m = C/(C + D). The computation is straightforward:

C = 0.5 ∏_n s_n^{a_{n,m}} (1 − s_n)^{1 − a_{n,m}} = 0.5 ∏_{n : a_{n,m} = 1} s_n ∏_{n : a_{n,m} = 0} (1 − s_n)   (16.10)
D = 0.5 ∏_n s_n^{1 − a_{n,m}} (1 − s_n)^{a_{n,m}} = 0.5 ∏_{n : a_{n,m} = 1} (1 − s_n) ∏_{n : a_{n,m} = 0} s_n   (16.11)

If you inspect the value of C, it is basically voting (in a product form, not a sum form) the scores of those students who agree that the answer is 1 with one-minus-the-scores of those students who do not. The value of D does the reverse. This is a form of multiplicative voting, which has the effect that if a given student has a perfect score of 1.0, their vote carries completely.

We now have a way to:
1. Compute expected ground truth values t̃_m, given scores.
2. Optimize scores s_n, given expected ground truth values.

The full solution is then to alternate between these two. You can start by initializing the ground truth values at the majority vote (this seems like a safe initialization). Given those, compute new scores. Given those new scores, compute new ground truth values. And repeat until tired.

In the next two sections, we will consider a more complex unsupervised learning model for clustering, and then a generic mathematical framework for expectation maximization, which will answer questions like: will this process converge, and, if so, to what?

16.2 Clustering with a Mixture of Gaussians

In Chapter 9, you learned about probabilistic models for classification based on density estimation. Let's start with a fairly simple classification model that assumes we have labeled data. We will shortly remove this assumption. Our model will state that we have K classes, and data from class k is drawn from a Gaussian with mean µ_k and variance σ²_k.
The choice of classes is parameterized by θ. The generative story for this model is:
1. For each example n = 1 ... N:
   (a) Choose a label y_n ~ Disc(θ).
   (b) Choose example x_n ~ Nor(µ_{y_n}, σ²_{y_n}).

This generative story can be directly translated into a likelihood as before:

p(D) = ∏_n Mult(y_n | θ) Nor(x_n | µ_{y_n}, σ²_{y_n})   (16.12)
     = ∏_n θ_{y_n} (2π σ²_{y_n})^{−D/2} exp( −(1/(2σ²_{y_n})) ||x_n − µ_{y_n}||² )   (16.13)

where, within each example's factor, the θ_{y_n} term chooses the label and the Gaussian term chooses the feature values. If you had access to labels, this would be all well and good, and you could obtain closed form solutions for the maximum likelihood estimates of all parameters by taking a log and then taking gradients of the log likelihood:

θ_k = fraction of training examples in class k = (1/N) ∑_n 1[y_n = k]   (16.14)
µ_k = mean of training examples in class k = ∑_n 1[y_n = k] x_n / ∑_n 1[y_n = k]   (16.15)
σ²_k = variance of training examples in class k = ∑_n 1[y_n = k] ||x_n − µ_k||² / ∑_n 1[y_n = k]   (16.16)

? You should be able to derive the maximum likelihood solution results formally by now.

Suppose that you don't have labels. Analogously to the K-means algorithm, one potential solution is to iterate. You can start off with guesses for the values of the unknown variables, and then iteratively improve them over time. In K-means, the approach was to assign examples to labels (or clusters). This time, instead of making hard assignments ("example 10 belongs to cluster 4"), we'll make soft assignments ("example 10 belongs half to cluster 4, a quarter to cluster 2, and a quarter to cluster 5"). So as not to confuse ourselves too much, we'll introduce a new variable, z_n = ⟨z_{n,1}, ..., z_{n,K}⟩ (that sums to one), to denote a fractional assignment of examples to clusters.

This notion of soft assignments is visualized in Figure 16.1. Here, we've depicted each example as a pie chart, and its coloring denotes the degree to which it's been assigned to each (of three) clusters. The sizes of the pie pieces correspond to the z_n values.

[Figure 16.1: soft assignments of examples to three clusters, depicted as pie charts]
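The closed-form estimates (16.14)-(16.16) amount to per-class counts, means, and variances. A minimal sketch (the helper name is chosen here; x_n are scalars for brevity, although the chapter allows D-dimensional examples; it assumes every class appears at least once in the labels):

```python
def ml_estimates(xs, ys, K):
    """Closed-form maximum likelihood estimates for the labeled model:
    theta_k = fraction of examples in class k, mu_k = their mean,
    sigma2_k = their variance. Assumes each class k occurs in ys."""
    N = len(xs)
    theta, mu, sigma2 = [], [], []
    for k in range(K):
        members = [x for x, y in zip(xs, ys) if y == k]  # class k's examples
        theta.append(len(members) / N)
        m = sum(members) / len(members)
        mu.append(m)
        sigma2.append(sum((x - m) ** 2 for x in members) / len(members))
    return theta, mu, sigma2
```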
Formally, z_{n,k} denotes the probability that example n is assigned to cluster k:

z_{n,k} = p(y_n = k | x_n)   (16.17)
        = p(y_n = k, x_n) / p(x_n)   (16.18)
        = (1/Z_n) Mult(k | θ) Nor(x_n | µ_k, σ²_k)   (16.19)

Here, the normalizer Z_n ensures that z_n sums to one. Given a set of parameters (the θs, µs and σ²s), the fractional assignments z_{n,k} are easy to compute. Now, akin to K-means, given fractional assignments, you need to recompute estimates of the model parameters. In analogy to the maximum likelihood solution (Eqs (16.14)-(16.16)), you can do this by counting fractional points rather than full points. This gives the following re-estimation updates:

θ_k = fraction of training examples in class k = (1/N) ∑_n z_{n,k}   (16.20)
µ_k = mean of fractional examples in class k = ∑_n z_{n,k} x_n / ∑_n z_{n,k}   (16.21)
σ²_k = variance of fractional examples in class k = ∑_n z_{n,k} ||x_n − µ_k||² / ∑_n z_{n,k}   (16.22)

All that has happened here is that the hard assignments 1[y_n = k] have been replaced with soft assignments z_{n,k}. As a bit of foreshadowing of what is to come, what we've done is essentially replace known labels with expected labels, hence the name "expectation maximization."

Putting this together yields Algorithm 38. This is the GMM ("Gaussian Mixture Models") algorithm, because the probabilistic model being learned describes a dataset as being drawn from a mixture distribution, where each component of this distribution is a Gaussian. Just as in the K-means algorithm, this approach is susceptible to local optima and to the quality of initialization. The heuristics for computing better initializers for K-means are also useful here.

? Aside from the fact that GMMs use soft assignments and K-means uses hard assignments, there are other differences between the two approaches. What are they?

Algorithm 38 GMM(X, K)
1: for k = 1 to K do
2:   µ_k ← some random location   // randomly initialize mean for kth cluster
3:   σ²_k ← 1   // initialize variances
4:   θ_k ← 1/K   // each cluster equally likely a priori
5: end for
6: repeat
7:   for n = 1 to N do
8:     for k = 1 to K do
9:       z_{n,k} ← θ_k (2π σ²_k)^{−D/2} exp( −(1/(2σ²_k)) ||x_n − µ_k||² )   // compute (unnormalized) fractional assignments
10:    end for
11:    z_n ← z_n / ∑_k z_{n,k}   // normalize fractional assignments
12:  end for
13:  for k = 1 to K do
14:    θ_k ← (1/N) ∑_n z_{n,k}   // re-estimate prior probability of cluster k
15:    µ_k ← ∑_n z_{n,k} x_n / ∑_n z_{n,k}   // re-estimate mean of cluster k
16:    σ²_k ← ∑_n z_{n,k} ||x_n − µ_k||² / ∑_n z_{n,k}   // re-estimate variance of cluster k
17:  end for
18: until converged
19: return z   // return cluster assignments

16.3 The Expectation Maximization Framework

At this point, you've seen a method for learning in a particular probabilistic model with hidden variables. Two questions remain: (1) can you apply this idea more generally, and (2) why is it even a reasonable thing to do?

Expectation maximization is a family of algorithms for performing maximum likelihood estimation in probabilistic models with hidden variables. The general flavor of how we will proceed is as follows. We want to maximize the log likelihood L, but this will turn out to be difficult to do directly. Instead, we'll pick a surrogate function L̃ that's a lower bound on L (i.e., L̃ ≤ L everywhere) that's (hopefully) easier to maximize. We'll construct the surrogate in such a way that increasing it will force the true likelihood to also go up. After maximizing L̃, we'll construct a new lower bound and optimize that. This process is shown pictorially in Figure 16.2.

To proceed, consider an arbitrary probabilistic model p(x, y | θ), where x denotes the observed data, y denotes the hidden data, and θ denotes the parameters. In the case of Gaussian Mixture Models, x was the data points, y was the (unknown) labels, and θ included the cluster prior probabilities, the cluster means, and the cluster variances. Now, given access only to a number of examples x_1, ..., x_N, you would like to estimate the parameters (θ) of the model. Probabilistically, this means that some of the variables are unknown, and therefore you need to marginalize (or sum) over their possible values.
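Algorithm 38 can be rendered as a short program. The following is a minimal sketch with several simplifying assumptions made here: one-dimensional data (so the (2πσ²_k)^{−D/2} constant has D = 1), evenly spaced initial means instead of random locations, a fixed iteration count instead of a convergence test, and a small floor on the variances to keep them from collapsing to zero.

```python
import math

def gmm_em(xs, K, iters=50):
    """EM for a 1-D mixture of Gaussians (Algorithm 38, simplified):
    alternate soft assignments (E step) with parameter re-estimation (M step)."""
    N = len(xs)
    lo, hi = min(xs), max(xs)
    mu = [lo + (k + 0.5) * (hi - lo) / K for k in range(K)]  # spread initial means
    sigma2 = [1.0] * K       # initial variances
    theta = [1.0 / K] * K    # uniform cluster priors
    for _ in range(iters):
        # E step: unnormalized z[n][k] = theta_k * Nor(x_n | mu_k, sigma2_k)
        z = []
        for x in xs:
            row = [theta[k] / math.sqrt(2 * math.pi * sigma2[k])
                   * math.exp(-(x - mu[k]) ** 2 / (2 * sigma2[k]))
                   for k in range(K)]
            s = sum(row)
            z.append([r / s for r in row])  # normalize so z_n sums to one
        # M step: re-estimate parameters from fractional counts
        for k in range(K):
            nk = sum(z[n][k] for n in range(N))  # fractional size of cluster k
            theta[k] = nk / N
            mu[k] = sum(z[n][k] * xs[n] for n in range(N)) / nk
            sigma2[k] = sum(z[n][k] * (xs[n] - mu[k]) ** 2 for n in range(N)) / nk
            sigma2[k] = max(sigma2[k], 1e-6)  # guard against variance collapse
    return z, theta, mu, sigma2
```

For example, on two well-separated clumps such as `[0.0, 0.2, -0.2, 10.0, 10.2, 9.8]` with K = 2, the means settle near 0 and 10 and the priors near 0.5 each; with badly chosen initial means it can instead settle at a local optimum, just as the text warns.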
[Figure 16.2: a figure showing successive lower bounds]

Now, your data consists only of X = ⟨x_1, x_2, ..., x_N⟩,
not the (x_n, y_n) pairs in D. You can then write the likelihood as:

p(X | θ) = ∑_{y_1} ∑_{y_2} ... ∑_{y_N} p(X, y_1, y_2, ..., y_N | θ)   marginalization   (16.23)
         = ∑_{y_1} ∑_{y_2} ... ∑_{y_N} ∏_n p(x_n, y_n | θ)   examples are independent   (16.24)
         = ∏_n ∑_{y_n} p(x_n, y_n | θ)   algebra   (16.25)

At this point, the natural thing to do is to take logs and then start taking gradients. However, once you start taking logs, you run into a problem: the log cannot eat the sum!

L(X | θ) = ∑_n log ∑_{y_n} p(x_n, y_n | θ)   (16.26)

Namely, the log gets stuck outside the sum and cannot move in to decompose the rest of the likelihood term!

The next step is to apply the somewhat strange, but strangely useful, trick of multiplying by 1. In particular, let q_n(·) be an arbitrary probability distribution. We will multiply the p(...) term above by q_n(y_n)/q_n(y_n), a valid step so long as q_n is never zero. This leads to:

L(X | θ) = ∑_n log ∑_{y_n} q_n(y_n) p(x_n, y_n | θ) / q_n(y_n)   (16.27)

We will now construct a lower bound using Jensen's inequality. This is a very useful (and easy to prove!) result that states that f(∑_i λ_i x_i) ≥ ∑_i λ_i f(x_i), so long as (a) λ_i ≥ 0 for all i, (b) ∑_i λ_i = 1, and (c) f is concave. If this looks familiar, that's just because it's a direct result of the definition of concavity. Recall that f is concave if f(ax + by) ≥ a f(x) + b f(y) whenever a + b = 1.

? Prove Jensen's inequality using the definition of concavity and induction.

You can now apply Jensen's inequality to the log likelihood by identifying the list of q_n(y_n)s as the λs, log as f (which is, indeed, concave), and each p/q term as an x. This yields:

L(X | θ) = ∑_n log ∑_{y_n} q_n(y_n) p(x_n, y_n | θ) / q_n(y_n)   (16.28)
         ≥ ∑_n ∑_{y_n} q_n(y_n) log [ p(x_n, y_n | θ) / q_n(y_n) ]   (16.29)
         = ∑_n ∑_{y_n} [ q_n(y_n) log p(x_n, y_n | θ) − q_n(y_n) log q_n(y_n) ] ≜ L̃(X | θ)   (16.30)

Note that this inequality holds for any choice of functions q_n, so long as they are non-negative and sum to one. In particular, it needn't even be the
same function q_n for each n. We will need to take advantage of both of these properties.

We have succeeded in our first goal: constructing a lower bound on L. When you go to optimize this lower bound for θ, the only part that matters is the first term. The second term, q log q, drops out as a function of θ. This means that the maximization you need to be able to compute, for fixed q_ns, is:

θ^(new) ← argmax_θ ∑_n ∑_{y_n} q_n(y_n) log p(x_n, y_n | θ)   (16.31)

This is exactly the sort of maximization done for Gaussian mixture models when we recomputed new means, variances, and cluster prior probabilities.

The second question is: what should q_n(·) actually be? Any reasonable q will lead to a lower bound, so in order to choose one q over another, we need another criterion. Recall that we are hoping to maximize L by instead maximizing a lower bound. In order to ensure that an increase in the lower bound implies an increase in L, we need to ensure that L̃(X | θ) = L(X | θ). In words: L̃ should be a lower bound on L that makes contact at the current point, θ.

16.4 Further Reading

TODO further reading
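As a closing sanity check on the chapter's central bound (a toy constructed here, not an example from the text): for a tiny, made-up joint distribution over a single example x with a binary hidden variable y, any valid q gives L̃ ≤ L as in (16.29), and choosing q(y) = p(y | x) makes the bound tight.

```python
import math

def log_likelihood(pxy):
    """L = log sum_y p(x, y); pxy maps each y to the joint probability p(x, y)."""
    return math.log(sum(pxy.values()))

def lower_bound(pxy, q):
    """L~ = sum_y q(y) log(p(x, y) / q(y)), the Jensen bound of Eq (16.29)."""
    return sum(q[y] * math.log(pxy[y] / q[y]) for y in pxy if q[y] > 0)

# a made-up joint distribution: p(x, y) for y in {0, 1}, so p(x) = 0.4
pxy = {0: 0.12, 1: 0.28}
L = log_likelihood(pxy)

# any valid q gives a lower bound ...
assert lower_bound(pxy, {0: 0.5, 1: 0.5}) <= L
# ... and the posterior q(y) = p(y | x) makes the bound touch L
post = {y: p / sum(pxy.values()) for y, p in pxy.items()}
assert abs(lower_bound(pxy, post) - L) < 1e-9
```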
More informationLecture 12: September 27
36-705: Itermediate Statistics Fall 207 Lecturer: Siva Balakrisha Lecture 2: September 27 Today we will discuss sufficiecy i more detail ad the begi to discuss some geeral strategies for costructig estimators.
More informationHomework 5 Solutions
Homework 5 Solutios p329 # 12 No. To estimate the chace you eed the expected value ad stadard error. To do get the expected value you eed the average of the box ad to get the stadard error you eed the
More informationHypothesis Testing. Evaluation of Performance of Learned h. Issues. Trade-off Between Bias and Variance
Hypothesis Testig Empirically evaluatig accuracy of hypotheses: importat activity i ML. Three questios: Give observed accuracy over a sample set, how well does this estimate apply over additioal samples?
More informationMachine Learning Brett Bernstein
Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio
More information6.867 Machine learning, lecture 13 (Jaakkola)
Lecture topics: Boostig, argi, ad gradiet descet copleity of classifiers, geeralizatio Boostig Last tie we arrived at a boostig algorith for sequetially creatig a eseble of base classifiers. Our base classifiers
More informationPARTIAL DIFFERENTIAL EQUATIONS SEPARATION OF VARIABLES
Diola Bagayoko (0 PARTAL DFFERENTAL EQUATONS SEPARATON OF ARABLES. troductio As discussed i previous lectures, partial differetial equatios arise whe the depedet variale, i.e., the fuctio, varies with
More informationRegression with quadratic loss
Regressio with quadratic loss Maxim Ragisky October 13, 2015 Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X,Y, where, as before,
More informationAdvanced Stochastic Processes.
Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.
More informationProblem Set 2 Solutions
CS271 Radomess & Computatio, Sprig 2018 Problem Set 2 Solutios Poit totals are i the margi; the maximum total umber of poits was 52. 1. Probabilistic method for domiatig sets 6pts Pick a radom subset S
More informationSimulation. Two Rule For Inverting A Distribution Function
Simulatio Two Rule For Ivertig A Distributio Fuctio Rule 1. If F(x) = u is costat o a iterval [x 1, x 2 ), the the uiform value u is mapped oto x 2 through the iversio process. Rule 2. If there is a jump
More information4.3 Growth Rates of Solutions to Recurrences
4.3. GROWTH RATES OF SOLUTIONS TO RECURRENCES 81 4.3 Growth Rates of Solutios to Recurreces 4.3.1 Divide ad Coquer Algorithms Oe of the most basic ad powerful algorithmic techiques is divide ad coquer.
More information7.1 Convergence of sequences of random variables
Chapter 7 Limit theorems Throughout this sectio we will assume a probability space (Ω, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite
More informationLecture 6: Integration and the Mean Value Theorem. slope =
Math 8 Istructor: Padraic Bartlett Lecture 6: Itegratio ad the Mea Value Theorem Week 6 Caltech 202 The Mea Value Theorem The Mea Value Theorem abbreviated MVT is the followig result: Theorem. Suppose
More informationLecture 10 October Minimaxity and least favorable prior sequences
STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least
More informationDiscrete Mathematics and Probability Theory Summer 2014 James Cook Note 15
CS 70 Discrete Mathematics ad Probability Theory Summer 2014 James Cook Note 15 Some Importat Distributios I this ote we will itroduce three importat probability distributios that are widely used to model
More information1 Approximating Integrals using Taylor Polynomials
Seughee Ye Ma 8: Week 7 Nov Week 7 Summary This week, we will lear how we ca approximate itegrals usig Taylor series ad umerical methods. Topics Page Approximatig Itegrals usig Taylor Polyomials. Defiitios................................................
More informationChapter 6 Principles of Data Reduction
Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a
More informationFall 2013 MTH431/531 Real analysis Section Notes
Fall 013 MTH431/531 Real aalysis Sectio 8.1-8. Notes Yi Su 013.11.1 1. Defiitio of uiform covergece. We look at a sequece of fuctios f (x) ad study the coverget property. Notice we have two parameters
More informationDiscrete Mathematics: Lectures 8 and 9 Principle of Inclusion and Exclusion Instructor: Arijit Bishnu Date: August 11 and 13, 2009
Discrete Matheatics: Lectures 8 ad 9 Priciple of Iclusio ad Exclusio Istructor: Arijit Bishu Date: August ad 3, 009 As you ca observe by ow, we ca cout i various ways. Oe such ethod is the age-old priciple
More informationOptimally Sparse SVMs
A. Proof of Lemma 3. We here prove a lower boud o the umber of support vectors to achieve geeralizatio bouds of the form which we cosider. Importatly, this result holds ot oly for liear classifiers, but
More informationUnderstanding Samples
1 Will Moroe CS 109 Samplig ad Bootstrappig Lecture Notes #17 August 2, 2017 Based o a hadout by Chris Piech I this chapter we are goig to talk about statistics calculated o samples from a populatio. We
More informationIntro to Learning Theory
Lecture 1, October 18, 2016 Itro to Learig Theory Ruth Urer 1 Machie Learig ad Learig Theory Comig soo 2 Formal Framework 21 Basic otios I our formal model for machie learig, the istaces to be classified
More informationBernoulli Polynomials Talks given at LSBU, October and November 2015 Tony Forbes
Beroulli Polyoials Tals give at LSBU, October ad Noveber 5 Toy Forbes Beroulli Polyoials The Beroulli polyoials B (x) are defied by B (x), Thus B (x) B (x) ad B (x) x, B (x) x x + 6, B (x) dx,. () B 3
More informationSECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES
SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES Read Sectio 1.5 (pages 5 9) Overview I Sectio 1.5 we lear to work with summatio otatio ad formulas. We will also itroduce a brief overview of sequeces,
More informationFirst Year Quantitative Comp Exam Spring, Part I - 203A. f X (x) = 0 otherwise
First Year Quatitative Comp Exam Sprig, 2012 Istructio: There are three parts. Aswer every questio i every part. Questio I-1 Part I - 203A A radom variable X is distributed with the margial desity: >
More informationMA131 - Analysis 1. Workbook 2 Sequences I
MA3 - Aalysis Workbook 2 Sequeces I Autum 203 Cotets 2 Sequeces I 2. Itroductio.............................. 2.2 Icreasig ad Decreasig Sequeces................ 2 2.3 Bouded Sequeces..........................
More informationBinomial transform of products
Jauary 02 207 Bioial trasfor of products Khristo N Boyadzhiev Departet of Matheatics ad Statistics Ohio Norther Uiversity Ada OH 4580 USA -boyadzhiev@ouedu Abstract Give the bioial trasfors { b } ad {
More informationName Period ALGEBRA II Chapter 1B and 2A Notes Solving Inequalities and Absolute Value / Numbers and Functions
Nae Period ALGEBRA II Chapter B ad A Notes Solvig Iequalities ad Absolute Value / Nubers ad Fuctios SECTION.6 Itroductio to Solvig Equatios Objectives: Write ad solve a liear equatio i oe variable. Solve
More information1 Introduction to reducing variance in Monte Carlo simulations
Copyright c 010 by Karl Sigma 1 Itroductio to reducig variace i Mote Carlo simulatios 11 Review of cofidece itervals for estimatig a mea I statistics, we estimate a ukow mea µ = E(X) of a distributio by
More informationSTAT Homework 1 - Solutions
STAT-36700 Homework 1 - Solutios Fall 018 September 11, 018 This cotais solutios for Homework 1. Please ote that we have icluded several additioal commets ad approaches to the problems to give you better
More informationLecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting
Lecture 6 Chi Square Distributio (χ ) ad Least Squares Fittig Chi Square Distributio (χ ) Suppose: We have a set of measuremets {x 1, x, x }. We kow the true value of each x i (x t1, x t, x t ). We would
More informationRandomized Algorithms I, Spring 2018, Department of Computer Science, University of Helsinki Homework 1: Solutions (Discussed January 25, 2018)
Radomized Algorithms I, Sprig 08, Departmet of Computer Sciece, Uiversity of Helsiki Homework : Solutios Discussed Jauary 5, 08). Exercise.: Cosider the followig balls-ad-bi game. We start with oe black
More informationIntegrals of Functions of Several Variables
Itegrals of Fuctios of Several Variables We ofte resort to itegratios i order to deterie the exact value I of soe quatity which we are uable to evaluate by perforig a fiite uber of additio or ultiplicatio
More informationProbability Theory. Exercise Sheet 4. ETH Zurich HS 2017
ETH Zurich HS 2017 D-MATH, D-PHYS Prof. A.-S. Szita Coordiator Yili Wag Probability Theory Exercise Sheet 4 Exercise 4.1 Let X ) N be a sequece of i.i.d. rado variables i a probability space Ω, A, P ).
More informationElement sampling: Part 2
Chapter 4 Elemet samplig: Part 2 4.1 Itroductio We ow cosider uequal probability samplig desigs which is very popular i practice. I the uequal probability samplig, we ca improve the efficiecy of the resultig
More information