Semi-Supervised Learning

Consider the problem of Prepositional Phrase Attachment:
"buy car with money" vs. "buy car with wheel"
There are several ways to generate features. Given the limited representation, we can assume that all possible conjunctions of the 4 attributes are used (15 features in each example).
Assume we will use naïve Bayes for learning to decide between [n,v].
Examples are: (x1, x2, ..., xn, [n,v])

EM CS446 Spring 17
Using Naïve Bayes

To use naïve Bayes, we need to use the data to estimate:
P(n), P(v)
P(x1|n), P(x1|v)
P(x2|n), P(x2|v)
...
P(xn|n), P(xn|v)
Then, given an example (x1, x2, ..., xn, ?), compare:
P(n|x) ~ P(n) P(x1|n) P(x2|n) ... P(xn|n)
and
P(v|x) ~ P(v) P(x1|v) P(x2|v) ... P(xn|v)
Using Naïve Bayes

After seeing 10 examples, we have:
P(n) = 0.5; P(v) = 0.5
P(x1|n) = 0.75; P(x2|n) = 0.5; P(x3|n) = 0.5; P(x4|n) = 0.5
P(x1|v) = 0.25; P(x2|v) = 0.25; P(x3|v) = 0.75; P(x4|v) = 0.5
Then, given an example x = (1000), we have:
Pn(x) = 0.5 * 0.75 * 0.5 * 0.5 * 0.5 = 3/64
Pv(x) = 0.5 * 0.25 * 0.75 * 0.25 * 0.5 = 3/256
Now, assume that in addition to the 10 labeled examples, we also have 100 unlabeled examples. Will that help?
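The two scores above can be checked with a short sketch (plain Python; the names are illustrative, the parameter values are the estimates from the slide):

```python
# Naïve Bayes scores for the PP-attachment example, using the
# parameter estimates obtained from the 10 labeled examples.
P_n, P_v = 0.5, 0.5
P_x_given_n = [0.75, 0.5, 0.5, 0.5]    # P(x_i = 1 | n)
P_x_given_v = [0.25, 0.25, 0.75, 0.5]  # P(x_i = 1 | v)

def score(prior, cond, x):
    """prior * prod_i P(x_i | label); P(x_i=0|label) = 1 - P(x_i=1|label)."""
    s = prior
    for p, xi in zip(cond, x):
        s *= p if xi == 1 else 1 - p
    return s

x = (1, 0, 0, 0)
print(score(P_n, P_x_given_n, x))  # 3/64
print(score(P_v, P_x_given_v, x))  # 3/256
```

Since Pn(x) > Pv(x), the example would be labeled n.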
Using Naïve Bayes

For example, what can be done with the example (1000)?
We have an estimate for its label. But can we use it to improve the classifier (that is, the estimation of the probabilities that we will use in the future)?
Option 1: We can make predictions, and believe them -- or some of them (based on what?)
Option 2: We can assume the example x = (1000) is
  - an n-labeled example with probability Pn(x)/(Pn(x) + Pv(x))
  - a v-labeled example with probability Pv(x)/(Pn(x) + Pv(x))
Estimation of probabilities does not require working with integers!
Using Unlabeled Data

The discussion suggests several algorithms:
1. Use a threshold. Choose examples labeled with high confidence. Label them [n,v]. Retrain.
2. Use fractional examples. Label the examples with fractional labels [p of n, (1-p) of v]. Retrain.
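Algorithm 2 (fractional examples) can be sketched for the binary naïve Bayes setting above. Everything here -- the function names, the Laplace smoothing, and the toy data -- is an illustrative assumption, not part of the lecture:

```python
def train_nb(examples):
    """examples: list of (x, w), where w is the fractional weight of class 1.
    Returns (prior1, cond1, cond0) with Laplace-smoothed feature estimates."""
    n_feats = len(examples[0][0])
    w1 = sum(w for _, w in examples)
    w0 = sum(1 - w for _, w in examples)
    prior1 = w1 / (w1 + w0)
    cond1 = [(sum(w * x[i] for x, w in examples) + 1) / (w1 + 2)
             for i in range(n_feats)]
    cond0 = [(sum((1 - w) * x[i] for x, w in examples) + 1) / (w0 + 2)
             for i in range(n_feats)]
    return prior1, cond1, cond0

def posterior1(model, x):
    """P(class 1 | x) under the naïve Bayes model."""
    prior1, cond1, cond0 = model
    def lik(prior, cond):
        s = prior
        for pf, xi in zip(cond, x):
            s *= pf if xi else 1 - pf
        return s
    s1, s0 = lik(prior1, cond1), lik(1 - prior1, cond0)
    return s1 / (s1 + s0)

# Toy data: two labeled examples, two unlabeled ones.
labeled = [((1, 0, 0, 0), 1.0), ((0, 1, 1, 0), 0.0)]
unlabeled = [(1, 0, 1, 0), (0, 1, 0, 0)]

model = train_nb(labeled)
for _ in range(5):
    # Label the unlabeled examples fractionally, then retrain.
    fractional = [(x, posterior1(model, x)) for x in unlabeled]
    model = train_nb(labeled + fractional)
```

The loop is exactly the "label fractionally, retrain" iteration; the smoothing is added only so that the toy data does not produce zero probabilities.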
Comments on Unlabeled Data

Both algorithms suggested can be used iteratively.
Both algorithms can be used with other classifiers, not only naïve Bayes. The only requirement is a robust confidence measure in the classification.
There are other approaches to semi-supervised learning: see the included papers (co-training; Yarowsky's decision list/bootstrapping algorithm; graph-based algorithms that assume that similar examples have similar labels; etc.)
What happens if instead of 10 labeled examples we start with 0 labeled examples?
Make a guess; continue as above; a version of EM.
EM

EM is a class of algorithms that is used to estimate a probability distribution in the presence of missing attributes.
Using it requires an assumption on the underlying probability distribution.
The algorithm can be very sensitive to this assumption and to the starting point (that is, the initial guess of parameters).
In general, it is known to converge to a local maximum of the maximum likelihood function.
Three-Coin Example

We observe a series of coin tosses generated in the following way:
A person has three coins.
Coin 0: probability of Head is a
Coin 1: probability of Head is p
Coin 2: probability of Head is q
Consider the following coin-tossing scenarios:
Estimation Problems

Scenario I: Toss one of the coins four times, observing HHTH.
Question: Which coin is more likely to produce this sequence?

Scenario II: Toss coin 0. If Head, toss coin 1; otherwise, toss coin 2.
Observing the sequence HHHHT, THTHT, HHHHT, HHTTH produced by Coin 0, Coin 1 and Coin 2.
Question: Estimate the most likely values for p, q (the probability of H in each coin) and the probability a of using each of the coins.

Scenario III: Toss coin 0. If Head, toss coin 1; otherwise, toss coin 2.
Observing the sequence HHHT, HTHT, HHHT, HTTH produced by Coin 1 and/or Coin 2, with the Coin 0 outcomes hidden.
Question: Estimate the most likely values for p, q and a.
There is no known analytical solution to this problem (general setting). That is, it is not known how to compute the values of the parameters so as to maximize the likelihood of the data.
[Figure: the generative process over the 1st toss, 2nd toss, ..., nth toss]
Key Intuition (1)

If we knew which of the data points (HHHT), (HTHT), (HTTH) came from Coin 1 and which from Coin 2, there would be no problem.
Recall that the simple estimation is the ML estimation:
Assume that you toss a (p, 1-p) coin m times and get k Heads and m-k Tails.
log P(D|p) = log [ p^k (1-p)^(m-k) ] = k log p + (m-k) log(1-p)
To maximize, set the derivative w.r.t. p equal to 0:
d log P(D|p)/dp = k/p - (m-k)/(1-p) = 0
Solving this for p gives: p = k/m
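The closed form p = k/m can be verified numerically with a small grid search (the data point, the HHTH sequence, is taken from Scenario I above):

```python
import math

# Numeric check of the closed form above: with k heads in m tosses,
# p = k/m maximizes  k*log(p) + (m-k)*log(1-p).
def log_lik(p, k, m):
    return k * math.log(p) + (m - k) * math.log(1 - p)

k, m = 3, 4  # e.g. the sequence HHTH
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda p: log_lik(p, k, m))
print(best)  # 0.75 = k/m
```

Because the log-likelihood is concave in p, the grid maximum coincides with the analytic solution.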
Key Intuition (2)

If we knew which of the data points (HHHT), (HTHT), (HTTH) came from Coin 1 and which from Coin 2, there would be no problem.
Instead, use an iterative approach for estimating the parameters:
Guess the probability that a given data point came from Coin 1 or Coin 2; generate fictional labels, weighted according to this probability.
Now, compute the most likely value of the parameters. [recall the NB example]
Compute the likelihood of the data given this model.
Re-estimate the initial parameter setting: set the parameters to maximize the likelihood of the data.
(Labels --> Model Parameters --> Likelihood of the data)
This process can be iterated and can be shown to converge to a local maximum of the likelihood function.
EM Algorithm (Coins) - I

We will assume (for a minute) that we know the parameters p, q, a and use this to estimate which coin it is (Problem 1).
Then, we will use this label estimation of the observed tosses to estimate the most likely parameters, and so on...
Notation: n data points; in each one, m tosses with h_i heads.
What is the probability that the i-th data point came from Coin 1?
STEP 1 (Expectation Step):
P_i^1 = P(Coin 1 | D_i) = P(D_i | Coin 1) P(Coin 1) / P(D_i)
      = a p^h_i (1-p)^(m-h_i) / [ a p^h_i (1-p)^(m-h_i) + (1-a) q^h_i (1-q)^(m-h_i) ]
EM Algorithm (Coins) - II

Now, we would like to compute the likelihood of the data, and find the parameters that maximize it. We will maximize the log likelihood of the data (n data points):
LL = sum_{i=1..n} log P(D_i | p,q,a)
But one of the variables -- the coin's name -- is hidden. We can marginalize:
LL = sum_{i=1..n} log sum_{y=0,1} P(D_i, y | p,q,a)
   = sum_{i=1..n} log sum_{y=0,1} P(D_i | p,q,a) P(y | D_i, p,q,a)
   = sum_{i=1..n} log E_y P(D_i | p,q,a)
   >= sum_{i=1..n} E_y log P(D_i | p,q,a)
where the inequality is due to Jensen's inequality: we maximize a lower bound on the likelihood. In the exact expression, the sum is inside the log, making the ML solution difficult.
Since the latent variable y is not observed, we cannot use the complete-data log likelihood. Instead, we use the expectation of the complete-data log likelihood under the posterior distribution of the latent variable to approximate log P(D | p,q,a).
We think of the likelihood log P(D_i | p,q,a) as a random variable that depends on the value y of the coin in the i-th toss. Therefore, instead of maximizing the LL, we will maximize the expectation of this random variable (over the coin's name). [Justified using Jensen's inequality; later & above]
EM Algorithm (Coins) - III

We maximize the expectation of this random variable (over the coin name):
E[LL] = E[ sum_{i=1..n} log P(D_i | p,q,a) ] = sum_{i=1..n} E[ log P(D_i | p,q,a) ]
      = sum_{i=1..n} [ P_i^1 log P(D_i, 1 | p,q,a) + (1 - P_i^1) log P(D_i, 0 | p,q,a)
                       - P_i^1 log P_i^1 - (1 - P_i^1) log(1 - P_i^1) ]
(The entropy terms do not matter when we maximize over the parameters.)
This is due to the linearity of the expectation and the random variable definition:
log P(D_i, y | p,q,a) = log P(D_i, 1 | p,q,a) with probability P_i^1
                        log P(D_i, 0 | p,q,a) with probability (1 - P_i^1)
EM Algorithm (Coins) - IV

Explicitly, we get:
E[ log P(D | p,q,a) ]
  = sum_i [ P_i^1 log P(1, D_i | p,q,a) + (1 - P_i^1) log P(0, D_i | p,q,a) ]
  = sum_i [ P_i^1 log( a p^h_i (1-p)^(m-h_i) ) + (1 - P_i^1) log( (1-a) q^h_i (1-q)^(m-h_i) ) ]
  = sum_i [ P_i^1 ( log a + h_i log p + (m-h_i) log(1-p) )
            + (1 - P_i^1) ( log(1-a) + h_i log q + (m-h_i) log(1-q) ) ]
EM Algorithm (Coins) - V

Finally, to find the most likely parameters a, p, q, we set the derivatives of E to zero:
STEP 2: Maximization Step
(Sanity check: think of the weighted fictional points.)
dE/da = sum_i [ P_i^1 / a - (1 - P_i^1)/(1-a) ] = 0
        =>  a = sum_i P_i^1 / n
dE/dp = sum_i P_i^1 ( h_i / p - (m - h_i)/(1-p) ) = 0
        =>  p = sum_i P_i^1 h_i / ( m sum_i P_i^1 )
dE/dq = sum_i (1 - P_i^1) ( h_i / q - (m - h_i)/(1-q) ) = 0
        =>  q = sum_i (1 - P_i^1) h_i / ( m sum_i (1 - P_i^1) )
When computing the derivatives, notice that P_i^1 here is a constant; it was computed using the current parameters in the E step.
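The two steps can be combined into a short sketch for Scenario III; the data encoding (heads counts of HHHT, HTHT, HHHT, HTTH) follows the slides, while the starting point is an arbitrary illustration:

```python
# EM for the three-coin problem: each data point is h_i heads out of m tosses.
h = [3, 2, 3, 2]          # HHHT, HTHT, HHHT, HTTH
m, n = 4, len(h)

a, p, q = 0.6, 0.7, 0.4   # arbitrary initial guess for (a, p, q)
for _ in range(100):
    # E-step: P1[i] = P(Coin 1 | D_i) under the current parameters.
    P1 = []
    for hi in h:
        num = a * p**hi * (1 - p)**(m - hi)
        den = num + (1 - a) * q**hi * (1 - q)**(m - hi)
        P1.append(num / den)
    # M-step: the closed-form updates derived above.
    a = sum(P1) / n
    p = sum(P1[i] * h[i] for i in range(n)) / (m * sum(P1))
    q = sum((1 - P1[i]) * h[i] for i in range(n)) / (m * (n - sum(P1)))
```

One sanity check: after every M-step the mixture mean a*p + (1-a)*q equals the empirical fraction of heads, here 10/16.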
Models with Hidden Variables
EM: General Setting

The EM algorithm is a general-purpose algorithm for finding the maximum likelihood estimate in latent variable models.
In the E-step, we fill in the latent variables using the posterior, and in the M-step, we maximize the expected complete log likelihood with respect to the complete posterior distribution.
Let D = (x_1, ..., x_N) be the observed data, and let z denote the hidden random variables. (We are not committing to any particular model.)
Let µ be the model parameters. Then:
µ* = argmax_µ p(x | µ) = argmax_µ sum_z p(x, z | µ) = argmax_µ sum_z p(z | µ) p(x | z, µ)
This expression is called the complete log likelihood.
EM: General Setting (2)

To derive the EM objective function, we re-write the complete log likelihood function by multiplying it by q(z)/q(z), where q(z) is an arbitrary distribution for the random variable z:
log p(x | µ) = log sum_z p(x, z | µ)
             = log sum_z p(z | µ) p(x | z, µ) q(z)/q(z)
             = log E_q [ p(z | µ) p(x | z, µ) / q(z) ]
             >= E_q log [ p(z | µ) p(x | z, µ) / q(z) ]
where the inequality is due to Jensen's inequality applied to the concave function log.
(Jensen's inequality for a convex function f: E[f(x)] >= f(E[x]). But log is concave, so E[log(x)] <= log(E[x]).)
We get the objective:
L(µ, q) = E_q [log p(z | µ)] + E_q [log p(x | z, µ)] - E_q [log q(z)]
The last component is an entropy component; it is also possible to write the objective so that it includes a KL divergence (a distance function between distributions) between q(z) and p(z | x, µ).
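The bound can be checked numerically on a single data point of the coin model from the previous slides (the parameter values here are arbitrary illustrations):

```python
import math

# Numeric check of the Jensen bound: for any q(z),
#   E_q log[ p(z|mu) p(x|z,mu) / q(z) ]  <=  log p(x|mu),
# with equality at q(z) = p(z|x,mu).
a, p, q2 = 0.3, 0.8, 0.4   # alpha and the two head probabilities
m, h = 4, 3                # one data point: 3 heads in 4 tosses
joint = [a * p**h * (1 - p)**(m - h),          # p(z=1) p(x|z=1)
         (1 - a) * q2**h * (1 - q2)**(m - h)]  # p(z=2) p(x|z=2)
log_px = math.log(sum(joint))

def lower_bound(q1):
    """L(mu, q) for the two-point distribution q = (q1, 1-q1)."""
    qs = [q1, 1 - q1]
    return sum(qz * math.log(jz / qz) for qz, jz in zip(qs, joint))

posterior = joint[0] / sum(joint)
```

Evaluating `lower_bound` on any q1 in (0,1) stays below `log_px`, and plugging in the posterior attains it exactly, which is why the E-step chooses q = p(z | x, µ).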
EM: General Setting (3)

EM now continues iteratively, as a gradient ascent algorithm, where we choose q = p(z | x, µ). At the t-th step, we have q(t) and µ(t).
E-step: update the posterior q, while holding µ(t) fixed:
q(t+1) = argmax_q L(q, µ(t)) = p(z | x, µ(t))
M-step: update the model parameters to maximize the expected complete log-likelihood function:
µ(t+1) = argmax_µ L(q(t+1), µ)
Other q's can be chosen [Samdani & Roth '12] to give other EM algorithms. Specifically, you can choose a q that picks the most likely z in the E-step, and then continues to estimate the parameters (called Truncated EM, or Hard EM). (Think back to the semi-supervised case.)
To wrap it up, with the right q:
L(µ, q) = E_q log [ p(z | µ) p(x | z, µ) / q(z) ]
        = sum_z p(z | x, µ) log [ p(x, z | µ) / p(z | x, µ) ]
        = sum_z p(z | x, µ) log [ p(x, z | µ) p(x | µ) / p(z, x | µ) ]
        = sum_z p(z | x, µ) log p(x | µ)
        = log p(x | µ) sum_z p(z | x, µ) = log p(x | µ)
So, by maximizing the objective function, we are also maximizing the log likelihood function.
The General EM Procedure

[Figure: the iteration between the E-step and the M-step]
EM Summary (so far)

EM is a general procedure for learning in the presence of unobserved variables.
We have shown how to use it in order to estimate the most likely density function for a mixture of (Bernoulli) distributions.
EM is an iterative algorithm that can be shown to converge to a local maximum of the likelihood function.
It depends on assuming a family of probability distributions. In this sense, it is a family of algorithms; the update rules you derive depend on the model assumed.
It has been shown to be quite useful in practice, when the assumptions made on the probability distribution are correct, but can fail otherwise.
EM Summary (so far)

EM is a general procedure for learning in the presence of unobserved variables.
The (family of) probability distributions is known; the problem is to estimate its parameters.
In the presence of hidden variables, we can often think about it as a problem of a mixture of distributions -- the participating distributions are known, and we need to estimate:
  - the parameters of the distributions
  - the mixture policy
Our previous example: a mixture of Bernoulli distributions.
Example: K-Means Algorithm

K-means is a clustering algorithm. We are given data points, known to be sampled independently from a mixture of k Normal distributions, with means µ_i, i = 1,...,k, and the same standard deviation s.
[Figure: p(x) as a mixture of k Gaussian bumps over x]
Example: K-Means Algorithm

First, notice that if we knew that all the data points were taken from a normal distribution with mean µ, finding its most likely value would be easy:
p(x | µ) ~ exp[ -(x - µ)^2 / 2s^2 ]
We get many data points, D = {x_1, ..., x_m}:
ln L(D | µ) = ln p(D | µ) = c - sum_i (x_i - µ)^2 / 2s^2
Maximizing the log-likelihood is equivalent to minimizing the sum of squares:
µ_ML = argmin_µ sum_i (x_i - µ)^2
Calculating the derivative with respect to µ, we get that the minimal point -- that is, the most likely mean -- is:
µ = (1/m) sum_i x_i
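A quick numeric sanity check of this claim (the data values are arbitrary):

```python
# The sample mean minimizes the sum of squared distances, and hence
# maximizes the Gaussian likelihood with a fixed standard deviation.
data = [2.0, 3.5, 1.5, 4.0]
mean = sum(data) / len(data)

def sse(mu):
    return sum((x - mu)**2 for x in data)

grid = [i / 100 for i in range(0, 601)]
best = min(grid, key=sse)
print(best, mean)  # both 2.75
```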
A Mixture of Distributions

As in the coin example, the problem is that the data is sampled from a mixture of k different normal distributions, and we do not know, for a given data point x_i, where it was sampled from.
Assume that we observe data point x_i; what is the probability that it was sampled from the distribution with mean µ_j?
P_ij = P(µ_j | x_i) = P(x_i | µ_j) P(µ_j) / sum_{n=1..k} P(x_i | µ_n) P(µ_n)
     = exp[ -(x_i - µ_j)^2 / 2s^2 ] / sum_{n=1..k} exp[ -(x_i - µ_n)^2 / 2s^2 ]
(the equal priors P(µ_j) = 1/k cancel)
A Mixture of Distributions

As in the coin example, the problem is that the data is sampled from a mixture of k different normal distributions, and we do not know, for each data point x_i, where it was sampled from.
For a data point x_i, define k binary hidden variables, z_i1, z_i2, ..., z_ik, s.t. z_ij = 1 iff x_i is sampled from the j-th distribution:
E[z_ij] = 1 * P(x_i was sampled from µ_j) + 0 * P(x_i was not sampled from µ_j) = P_ij
(Recall: E[Y] = sum_y y P(Y = y), and E[X + Y] = E[X] + E[Y].)
Example: K-Means Algorithm

Expectation: (here, h = µ_1, µ_2, ..., µ_k)
p(y_i | h) = p(x_i, z_i1, ..., z_ik | h) ~ exp[ - sum_j z_ij (x_i - µ_j)^2 / 2s^2 ]
Computing the likelihood given the observed data D = {x_1, ..., x_m} and the hypothesis h (w/o the constant coefficient):
ln P(Y | h) = - sum_{i=1..m} sum_j z_ij (x_i - µ_j)^2 / 2s^2
E[ln P(Y | h)] = E[ - sum_{i=1..m} sum_j z_ij (x_i - µ_j)^2 / 2s^2 ]
              = - sum_{i=1..m} sum_j E[z_ij] (x_i - µ_j)^2 / 2s^2
Example: K-Means Algorithm

Maximization: maximizing
Q(h | h') = - sum_{i=1..m} sum_j E[z_ij] (x_i - µ_j)^2 / 2s^2
with respect to µ_j, we get:
dQ/dµ_j = C sum_{i=1..m} E[z_ij] (x_i - µ_j) = 0
which yields:
µ_j = sum_{i=1..m} E[z_ij] x_i / sum_{i=1..m} E[z_ij]
Summary: K-Means Algorithm

Given a set D = {x_1, ..., x_m} of data points, guess initial parameters µ_1, µ_2, ..., µ_k.
Compute (for all i, j):
p_ij = E[z_ij] = exp[ -(x_i - µ_j)^2 / 2s^2 ] / sum_{n=1..k} exp[ -(x_i - µ_n)^2 / 2s^2 ]
and a new set of means:
µ_j = sum_{i=1..m} E[z_ij] x_i / sum_{i=1..m} E[z_ij]
Repeat to convergence.
Notice that this algorithm will find the best k means in the sense of minimizing the sum of squared distances.
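The loop above, as a short self-contained sketch (the toy data, k = 2, and the initial guesses are illustrative assumptions):

```python
import math

# Soft k-means: responsibilities E[z_ij] via the Gaussian posterior
# (shared sigma), then means updated as weighted averages.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
mus = [0.0, 1.0]   # initial guesses for k = 2 means
sigma = 1.0

for _ in range(50):
    # E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
    resp = []
    for x in data:
        w = [math.exp(-(x - mu)**2 / (2 * sigma**2)) for mu in mus]
        s = sum(w)
        resp.append([wj / s for wj in w])
    # M-step: mu_j = weighted mean of the data under E[z_ij]
    mus = [sum(resp[i][j] * data[i] for i in range(len(data))) /
           sum(resp[i][j] for i in range(len(data)))
           for j in range(len(mus))]
```

On this data the means converge near the two cluster centers, roughly 1.0 and 5.07.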
Summary: EM

EM is a general procedure for learning in the presence of unobserved variables.
We have shown how to use it in order to estimate the most likely density function for a mixture of probability distributions.
EM is an iterative algorithm that can be shown to converge to a local maximum of the likelihood function. Thus, it might require many restarts.
It depends on assuming a family of probability distributions.
It has been shown to be quite useful in practice, when the assumptions made on the probability distribution are correct, but can fail otherwise.
As an example, we have derived an important clustering algorithm, the k-means algorithm, and have shown how to use it in order to estimate the most likely density function for a mixture of probability distributions.
More Thoughts about EM

Training: a sample of data points, (x_0, x_1, ..., x_n) in {0,1}^(n+1).
Task: predict the value of x_0, given assignments to all n other variables.
More Thoughts about EM

Assume that a set of data points x in {0,1}^(n+1) is generated as follows:
Postulate a hidden variable Z with k values, 1 <= z <= k, where Z = z with probability a_z, and sum_{z=1..k} a_z = 1.
Having randomly chosen a value z for the hidden variable, we choose the value x_i for each observable X_i to be 1 with probability p_i^z and 0 otherwise [i = 0, 1, 2, ..., n].
Training: a sample of data points, (x_0, x_1, ..., x_n) in {0,1}^(n+1).
Task: predict the value of x_0, given assignments to all n other variables.
More Thoughts about EM

Two options:
Parametric: estimate the model using EM. Once a model is known, use it to make predictions.
  Problem: we cannot use EM directly without an additional assumption on the way data is generated.
Non-Parametric: learn x_0 directly as a function of the other variables.
  Problem: which function to try and learn?
It turns out that x_0 is a linear function of the other variables when k = 2 (what does that mean?).
When k is known, the EM approach performs well; if an incorrect value is assumed, the estimation fails, and the linear methods (e.g., Perceptron) perform better. [Grove & Roth 2001]
Another important distinction to attend to is the fact that, once you have estimated all the parameters with EM, you can answer many prediction problems, e.g., p(x_0, x_7, ..., x_8 | x_1, x_2, ..., x_n), while with Perceptron (say) you need to learn separate models for each prediction problem.
EM
The EM Algorithm

Algorithm:
Guess initial values for the hypothesis h = µ_1, µ_2, ..., µ_k.
Expectation: calculate Q(h', h) = E[ log P(Y | h') | h, X ] using the current hypothesis h and the observed data X.
Maximization: replace the current hypothesis h by the h' that maximizes the Q function (the likelihood function):
  set h = h', such that Q(h', h) is maximal.
Repeat: estimate the Expectation again.