Note on EM-training of IBM-model 1

INF58 Language Technological Applications, Fall

The slides on this subject (inf58 6.pdf), including the example, seem insufficient to give a good grasp of what is going on. Hence here are some supplementary notes with more details. Hopefully they make things clearer.

The main idea

There are two main items involved:
- Translation probabilities
- Word alignments

The translation probabilities are assigned to the bilingual lexicon: for a pair of words (e, f) in the lexicon, how probable is it that e gets translated as f? This is expressed by t(f|e). Beware, this is calculated from the whole corpus; we do not consider these probabilities for a single sentence.

A word alignment is assigned to a pair of sentences (e, f). (We are using boldface to indicate that e is a string (array) of words e_1, e_2, ..., e_k, etc.) When we have a parallel corpus where the sentences are sentence-aligned, which may be expressed by (e_1, f_1), (e_2, f_2), ..., (e_m, f_m), we consider the alignment of each sentence pair individually. Ideally, we are looking for the best alignment of each sentence. But as we do not know it, we will instead consider the probabilities of the various alignments of the sentence. For each sentence, the probabilities of the various alignments must add up to 1.

The EM training then goes as follows.

1. Initializing
a. We start with initializing t. When we don't have other information, we initialize t uniformly. That is, t(f|e) = 1/s, where s is the number of F-words in the lexicon.
b. For each sentence in the corpus, we estimate the probability distribution over the various alignments of the sentence. This is done on the basis of t, and should reflect t: for example, if t(f_i|e_k) is some factor larger than t(f_j|e_k), then alignments which align f_i to e_k should be that same factor more probable than those which align f_j to e_k. (Well, actually, in round 1 this is trivial, since all alignments are equally likely when we start with a uniform t.)

2. Next round
a. We count how many times a word e is translated as f, on the basis of the probability distributions for the sentences. This is a fractional count. Given a sentence pair (e_j, f_j): if e occurs in e_j and f occurs in f_j, we consider the alignments which align f to e. Given such an alignment a, we consider its probability P(a), and from this alignment we count that e is translated as f P(a) many times. That is, if P(a) is some value p, we add p to the count of how many times e is translated as f. After we have done this for all alignments of all the sentences, we can recalculate t.

The notation in Koehn's book for the different counts and measures is not stellar, but as we adopted the same notation in the slides, we will stick to it to make the similarities transparent. Koehn uses the notation c(f|e) for the fractional count of the pair (e, f) in a particular sentence. To make it clear that it is the count in the specific sentence pair (e, f), he also uses the notation c(f|e; e, f). To indicate the fractional count of the word (type) pair (e, f) over the whole corpus, he uses Σ_{(e,f)} c(f|e; e, f), i.e. we add the fractional counts for all the sentences. An alternative notation for the same would have been Σ_{i=1}^{m} c(f|e; e_i, f_i), given there are m sentences in the corpus. We introduced the notation tc, for total count, for this on the slides:

tc(f|e) = Σ_{(e,f)} c(f|e; e, f)

The re-estimated translation probability can then be calculated from this:

t(f|e) = Σ_{(e,f)} c(f|e; e, f) / Σ_{f'} Σ_{(e,f)} c(f'|e; e, f)

Here f' varies over all the F-words in the lexicon.

b. With these new translation probabilities, we may return to the alignments, and for each sentence estimate the best probability distribution over the possible alignments. This time there is no simple way as there was in round 1. For each alignment, we calculate a probability on the basis of t, and normalize to make sure that the probabilities for each sentence add up to 1.

3. Next round
a. We go about exactly as in step 2a. On the basis of the alignment probabilities estimated in step 2b, we may now calculate new translation probabilities t,
b. and on the basis of the translation probabilities estimate new alignment probabilities.

And so we may repeat the two steps as long as we like.

Properties

What is nice with this algorithm is:
- We can prove that the result gets better (or stays the same) after each round. It never deteriorates.
- The result converges towards a local optimum.
- For IBM model 1 (but not in general) this local optimum is also a global optimum.

The fast way

We have described here the underlying idea of the algorithm. The description above is probably the best for understanding what is going on. There is a problem when applying it: there are so many (too many) different alignments. We therefore derived a modified algorithm where we do not calculate the probabilities of the actual alignments. Instead we calculate the translation probabilities in step 2a directly from the translation probabilities from step 1a, and the translation probabilities in step 3a directly from the translation probabilities in 2a, without actually calculating the intermediate alignment probabilities (the b steps). The counts are still renormalized the same way in each round:

t(f|e) = tc(f|e) / Σ_{f'} tc(f'|e)
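
Spelled out as code, the fast way amounts to only a few lines. The following is a minimal Python sketch (not the implementation from the lecture or from Koehn's book), run on the two example sentence pairs used in the Examples section below; the names t and tc mirror the notation above, and NULL stands for the empty English word.

from collections import defaultdict

# Toy parallel corpus; "NULL" is the empty English word an F-word may align to.
corpus = [
    (["NULL", "dog", "barked"], ["hund", "bjeffet"]),
    (["NULL", "dog", "bit", "dog"], ["hund", "bet", "hund"]),
]
f_vocab = {f for _, fs in corpus for f in fs}
e_vocab = {e for es, _ in corpus for e in es}

# Step 1a: uniform initialization, t(f|e) = 1/s with s = number of F-words.
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

for round_no in range(1, 4):
    # Step "a": collect the fractional counts tc(f|e) directly from the current t.
    tc = defaultdict(float)
    total = defaultdict(float)                    # sum of tc(f'|e) over all f'
    for es, fs in corpus:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)     # sum_i t(f|e_i) in this sentence
            for e in es:
                tc[(f, e)] += t[(f, e)] / norm    # this token pair's fractional count
                total[e] += t[(f, e)] / norm
    # Re-estimate: t(f|e) = tc(f|e) / sum_f' tc(f'|e).
    t = {(f, e): tc[(f, e)] / total[e] for f in f_vocab for e in e_vocab}
    print(round_no, round(t[("hund", "dog")], 6), round(t[("bjeffet", "barked")], 6))

After the first pass the printout shows t(hund|dog) = 0.615385 and t(bjeffet|barked) = 0.5, the same values that are derived by hand in the Examples below.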

Examples

There is a very simple example in Jurafsky and Martin which illustrates the calculation with the original algorithm. You should consult this first. In the example in the lecture, we followed the modified algorithm where we sidestep the actual alignments. Let us now see how the example from the lecture would go with the full algorithm first (similarly to the Jurafsky-Martin example), before we compare it to the example from the lecture with some more details filled in. We number the sentences so that the simplest example comes first:

Sentence 1:
- e1: dog barked
- f1: hund bjeffet

Sentence 2:
- e2: dog bit dog
- f2: hund bet hund

The theoretically sound, but computationally intractable way

Step 1a - Initialization. Since there are 3 Norwegian words, all t(f|e) are set to 1/3.

t(hund|dog) = 1/3      t(bet|dog) = 1/3      t(bjeffet|dog) = 1/3
t(hund|bit) = 1/3      t(bet|bit) = 1/3      t(bjeffet|bit) = 1/3
t(hund|barked) = 1/3   t(bet|barked) = 1/3   t(bjeffet|barked) = 1/3
t(hund|NULL) = 1/3     t(bet|NULL) = 1/3     t(bjeffet|NULL) = 1/3

Step 1b - Alignments

We must also include the empty word NULL in the E-sentence (in position 0) to indicate that a word in the F-sentence may be aligned to nothing. Each of the 2 words in the sentence f1 may come from one of 3 different words in sentence e1. Hence there are 9 different alignments:

<0,0>, <0,1>, <0,2>, <1,0>, <1,1>, <1,2>, <2,0>, <2,1>, <2,2>,

where <i,j> means that hund is aligned to the word in position i of e1 and bjeffet to the word in position j (0 = NULL, 1 = dog, 2 = barked). Since all translation probabilities are equally likely, each alignment will have the same probability. Since there are 9 different alignments, each of them will have the probability 1/9. Writing a1 for the alignment probabilities of the first sentence, we have a1(<0,0>) = a1(<0,1>) = ... = a1(<2,2>) = 1/9.

For sentence 2, there are 3 words in f2. Each of them may be aligned to any of 4 different words in e2 (including NULL). Hence there are 4*4*4 = 64 different alignments, ranging from <0,0,0> to <3,3,3>. We could take the easy way out and say that each of them is equally likely, hence a2(<0,0,0>) = a2(<0,0,1>) = ... = a2(<3,3,3>) = 1/64. But to prepare our understanding for later rounds, let us see what happens if we follow the recipe. To calculate the probability of one particular alignment, we multiply together the involved translation probabilities, e.g.

P2(<1,2,0>) = t(hund|dog)*t(bet|bit)*t(hund|NULL) = 1/27.

In this round, we get exactly the same result for all the alignments, 1/27. But that isn't the same as 1/64. Has anything gone wrong here? No. The score 1/27 is not the probability of the alignment. To get at the probability we must normalize. First we sum together the scores for all the alignments, which yields 64/27. Then, to get the probability for each alignment, we divide its score by this sum. Hence the probability for each alignment is (1/27)/(64/27), which is a complicated way to write 1/64.
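
For a small example the alignments can simply be enumerated. The sketch below is only an illustration (the helper name alignment_distribution is mine, not from the note): it lists every alignment vector for a sentence pair, scores it as the product of the involved t(f|e), and normalizes. With the uniform t it confirms the numbers above: 9 alignments of probability 1/9 for sentence 1, and 64 alignments of probability 1/64 for sentence 2.

from itertools import product

def alignment_distribution(es, fs, t):
    """All alignment vectors <a_1,...,a_m> (a_j = position in es, 0 = NULL) with
    normalized probabilities; one alignment's score is the product of t(f_j|e_{a_j})."""
    scores = {}
    for a in product(range(len(es)), repeat=len(fs)):
        score = 1.0
        for j, i in enumerate(a):
            score *= t[(fs[j], es[i])]
        scores[a] = score
    z = sum(scores.values())                      # normalization constant
    return {a: s / z for a, s in scores.items()}

t_uniform = {(f, e): 1.0 / 3
             for f in ["hund", "bjeffet", "bet"]
             for e in ["NULL", "dog", "barked", "bit"]}

a1 = alignment_distribution(["NULL", "dog", "barked"], ["hund", "bjeffet"], t_uniform)
a2 = alignment_distribution(["NULL", "dog", "bit", "dog"], ["hund", "bet", "hund"], t_uniform)
print(len(a1), a1[(0, 0)])     # 9 alignments, each 1/9
print(len(a2), a2[(1, 2, 0)])  # 64 alignments, each 1/64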

Step 2a - Maximize the translation probabilities

Then the show may start. We first calculate the fractional counts for the word pairs in the lexicon, and we do this sentence by sentence, starting with sentence 1. To take one example, what is the fractional count of (dog, hund) in sentence 1? We must see which alignments align the two words. There are 3: <1,0>, <1,1>, <1,2>. (A good piece of advice at this point is to draw the alignments while you read.) To get the fractional count we must add the probabilities of these alignments, i.e.

c(hund|dog; e1, f1) = a1(<1,0>) + a1(<1,1>) + a1(<1,2>) = 3*(1/9) = 1/3.

We can repeat for the pair (hund, barked) and get

c(hund|barked; e1, f1) = a1(<2,0>) + a1(<2,1>) + a1(<2,2>) = 3*(1/9) = 1/3,

and so on. We see we get the same for all word pairs in this sentence:

c(hund|dog; e1, f1) = 1/3       c(bjeffet|dog; e1, f1) = 1/3
c(hund|barked; e1, f1) = 1/3    c(bjeffet|barked; e1, f1) = 1/3
c(hund|NULL; e1, f1) = 1/3      c(bjeffet|NULL; e1, f1) = 1/3

(There is a typo in the lecture slides and in the first version of these notes, writing t instead of c in the right column. The same for sentence 2.)

Sentence 2 is more exciting. Consider first the pair (bet, bit). They get aligned by all alignments of the form <x,2,y>, where x and y are any of 0, 1, 2, 3. There are 16 such alignments. (We don't bother to write them out.) Each alignment has probability 1/64. Hence

c(bet|bit; e2, f2) = 16/64 = 1/4.

Similarly we get c(bet|NULL; e2, f2) = 16/64 = 1/4. To count the pair (dog, bet), they are aligned by all alignments of the form <x,1,y> and all alignments of the form <x,3,y>, hence

c(bet|dog; e2, f2) = 2*16/64 = 1/2.

To count the pair (bit, hund), we must consider both alignments of the form <2,x,y> and of the form <x,y,2>. (Observe that <2,x,2> should be counted twice since two occurrences of hund are aligned to bit.) And to count the pair (dog, hund), we must consider all alignments <1,x,y>, <3,x,y>, <x,y,1> and <x,y,3>. We get the following counts for sentence 2:

c(hund|dog; e2, f2) = 1      c(bet|dog; e2, f2) = 1/2
c(hund|bit; e2, f2) = 1/2    c(bet|bit; e2, f2) = 1/4
c(hund|NULL; e2, f2) = 1/2   c(bet|NULL; e2, f2) = 1/4
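
The fractional counts can be read off the same enumeration: every alignment contributes its (normalized) probability to each word pair it links, and contributes twice to a pair it links twice (as with <2,x,2> above). Again a self-contained, purely illustrative sketch, not code from the note; it reproduces the sentence-2 counts 1, 1/4 and 1/2.

from collections import defaultdict
from itertools import product

def fractional_counts(es, fs, t):
    """c(f|e; e, f) for one sentence pair: every alignment contributes its
    normalized probability to each word pair it links; a pair linked twice
    in one alignment is counted twice."""
    scores = {}
    for a in product(range(len(es)), repeat=len(fs)):
        p = 1.0
        for j, i in enumerate(a):
            p *= t[(fs[j], es[i])]
        scores[a] = p
    z = sum(scores.values())
    counts = defaultdict(float)
    for a, p in scores.items():
        for j, i in enumerate(a):
            counts[(fs[j], es[i])] += p / z
    return counts

t_uniform = {(f, e): 1.0 / 3
             for f in ["hund", "bjeffet", "bet"]
             for e in ["NULL", "dog", "barked", "bit"]}

c_s2 = fractional_counts(["NULL", "dog", "bit", "dog"], ["hund", "bet", "hund"], t_uniform)
print(c_s2[("hund", "dog")], c_s2[("bet", "bit")], c_s2[("hund", "NULL")])  # ~1.0 0.25 0.5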

We get the total counts (tc) by adding the fractional counts for all the sentences in the corpus, resulting in

tc(hund|dog) = 1 + 1/3 = 4/3      tc(bet|dog) = 1/2      tc(bjeffet|dog) = 1/3        t*(dog) = 4/3 + 1/2 + 1/3 = 13/6
tc(hund|bit) = 1/2                tc(bet|bit) = 1/4      tc(bjeffet|bit) = 0          t*(bit) = 3/4
tc(hund|barked) = 1/3             tc(bet|barked) = 0     tc(bjeffet|barked) = 1/3     t*(barked) = 2/3
tc(hund|NULL) = 1/2 + 1/3 = 5/6   tc(bet|NULL) = 1/4     tc(bjeffet|NULL) = 1/3       t*(NULL) = 5/6 + 1/4 + 1/3 = 17/12

In the last column we have added all the total counts for one E-word, e.g. t*(dog) = Σ_f tc(f|dog).

We can then finally calculate the new translation probabilities:

e        f         t(f|e) exact               decimal
NULL     hund      (5/6)/(17/12) = 10/17      0.588235
NULL     bet       (1/4)/(17/12) = 3/17       0.176471
NULL     bjeffet   (1/3)/(17/12) = 4/17       0.235294
dog      hund      (4/3)/(13/6)  = 8/13       0.615385
dog      bet       (1/2)/(13/6)  = 3/13       0.230769
dog      bjeffet   (1/3)/(13/6)  = 2/13       0.153846
bit      hund      (1/2)/(3/4)   = 2/3        0.666667
bit      bet       (1/4)/(3/4)   = 1/3        0.333333
barked   hund      (1/3)/(2/3)   = 1/2        0.5
barked   bjeffet   (1/3)/(2/3)   = 1/2        0.5
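
The additions and divisions above are easy to verify with exact fractions. Here is a small throwaway check (mine, not part of the note) using Python's fractions module and the per-sentence counts just derived.

from fractions import Fraction as F

# Fractional counts from sentence 1 and sentence 2, as derived above.
counts_s1 = {("hund", "dog"): F(1, 3), ("bjeffet", "dog"): F(1, 3),
             ("hund", "barked"): F(1, 3), ("bjeffet", "barked"): F(1, 3),
             ("hund", "NULL"): F(1, 3), ("bjeffet", "NULL"): F(1, 3)}
counts_s2 = {("hund", "dog"): F(1), ("bet", "dog"): F(1, 2),
             ("hund", "bit"): F(1, 2), ("bet", "bit"): F(1, 4),
             ("hund", "NULL"): F(1, 2), ("bet", "NULL"): F(1, 4)}

tc = {}                                   # total counts over the corpus
for counts in (counts_s1, counts_s2):
    for pair, value in counts.items():
        tc[pair] = tc.get(pair, F(0)) + value

t_star = {}                               # t*(e) = sum_f tc(f|e)
for (f, e), value in tc.items():
    t_star[e] = t_star.get(e, F(0)) + value

t_new = {(f, e): value / t_star[e] for (f, e), value in tc.items()}
print(t_new[("hund", "dog")], float(t_new[("hund", "dog")]))    # 8/13  0.615384...
print(t_new[("hund", "NULL")], float(t_new[("hund", "NULL")]))  # 10/17 0.588235...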

Step 2b - Estimate alignment probabilities

It is time to estimate the alignment probabilities again. Remember this is done sentence by sentence, starting with sentence 1. There are 9 different alignments to consider. For each of them we may calculate an initial unnormalized probability, call it P1, on the basis of the last translation probabilities.

P1(<0,0>) = t(hund|NULL)*t(bjeffet|NULL)     = (10/17)*(4/17) = 0.138408    a1(<0,0>) = 0.091373
P1(<0,1>) = t(hund|NULL)*t(bjeffet|dog)      = (10/17)*(2/13) = 0.090498    a1(<0,1>) = 0.059744
P1(<0,2>) = t(hund|NULL)*t(bjeffet|barked)   = (10/17)*(1/2)  = 0.294118    a1(<0,2>) = 0.194168
P1(<1,0>) = t(hund|dog)*t(bjeffet|NULL)      = (8/13)*(4/17)  = 0.144796    a1(<1,0>) = 0.095590
P1(<1,1>) = t(hund|dog)*t(bjeffet|dog)       = (8/13)*(2/13)  = 0.094675    a1(<1,1>) = 0.062501
P1(<1,2>) = t(hund|dog)*t(bjeffet|barked)    = (8/13)*(1/2)   = 0.307692    a1(<1,2>) = 0.203130
P1(<2,0>) = t(hund|barked)*t(bjeffet|NULL)   = (1/2)*(4/17)   = 0.117647    a1(<2,0>) = 0.077667
P1(<2,1>) = t(hund|barked)*t(bjeffet|dog)    = (1/2)*(2/13)   = 0.076923    a1(<2,1>) = 0.050782
P1(<2,2>) = t(hund|barked)*t(bjeffet|barked) = (1/2)*(1/2)    = 0.25        a1(<2,2>) = 0.165043
Sum of the P1's                                               = 1.514757

We sum the P1 scores (last line) and normalize them (last column, a1 = P1/1.514757) to get the probability distribution over the alignments. We may do the same for sentence 2. But because there are 64 different alignments, we refrain from carrying out the details.

Step 3a - Maximize the translation probabilities

We proceed exactly as in step 2a. We first collect the fractional counts sentence by sentence, starting with sentence 1. For example, we get

c(hund|barked; e1, f1) = a1(<2,0>) + a1(<2,1>) + a1(<2,2>) = 0.077667 + 0.050782 + 0.165043 = 0.293492,

and similarly for the other fractional counts in sentence 1. Since we have not calculated the alignments for sentence 2, we stop here. Hopefully the idea is clear by now.

The fast lane

Manually we refrain from calculating 64 alignments, but it wouldn't have been a problem for a machine. However, the number of alignments grows exponentially with the length of the sentences, and for sentences of realistic length the machines must soon give in too. Let us repeat the calculations from the slides from the lecture. The point is that we skip the alignments and pass directly from step 1a to step 2a and then to step 3a, etc. The key is the formula

c(f|e; e, f) = t(f|e) / (Σ_{i=0}^{l} t(f|e_i)) * Σ_{j=1}^{m} δ(f, f_j) * Σ_{i=0}^{l} δ(e, e_i)

(here e_0, ..., e_l are the words of the E-sentence, with e_0 = NULL, and f_1, ..., f_m the words of the F-sentence), which lets us calculate the fractional counts directly from the (last round of) translation probabilities.
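
Both this table and the step 3a count can be reproduced by brute force from the re-estimated t values. The sketch below (illustrative only, not code from the lecture) scores the 9 alignments of sentence 1, normalizes, and then sums the three alignments that align hund to barked; the formula above, which the next section explains in detail, gives the same numbers without ever listing the alignments.

from itertools import product

# Translation probabilities after round 1 (the table in step 2a above).
t = {("hund", "NULL"): 10/17, ("bet", "NULL"): 3/17, ("bjeffet", "NULL"): 4/17,
     ("hund", "dog"): 8/13,   ("bet", "dog"): 3/13,  ("bjeffet", "dog"): 2/13,
     ("hund", "bit"): 2/3,    ("bet", "bit"): 1/3,
     ("hund", "barked"): 1/2, ("bjeffet", "barked"): 1/2}

es, fs = ["NULL", "dog", "barked"], ["hund", "bjeffet"]

# Score each of the 9 alignments as the product of the two relevant t values.
scores = {a: t[(fs[0], es[a[0]])] * t[(fs[1], es[a[1]])]
          for a in product(range(len(es)), repeat=len(fs))}
z = sum(scores.values())
print(round(z, 6))                          # sum of the P1 scores: 1.514757
for a, p in sorted(scores.items()):
    print(a, round(p, 6), round(p / z, 6))  # the P1 and a1 columns

# Step 3a: c(hund|barked) = a1(<2,0>) + a1(<2,1>) + a1(<2,2>).
print(round(sum(p for a, p in scores.items() if a[0] == 2) / z, 6))  # ~0.2935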

Step 2a - Maximize the translation probabilities

To understand the formula, f_j refers to the word at position j in sentence f. Thus in sentence 1, if f is hund, then δ(f, f_j) = 1 for j = 1, while δ(f, f_j) = 0 for j = 2. Similarly, e_i refers to the word in position i in the English string. Hence

c(hund|barked; e1, f1)
= t(hund|barked)/(Σ_i t(hund|e_i)) * Σ_j δ(hund, f_j) * Σ_i δ(barked, e_i)
= (1/3)/(1/3 + 1/3 + 1/3) * (δ(hund, hund) + δ(hund, bjeffet)) * (δ(barked, NULL) + δ(barked, dog) + δ(barked, barked))
= (1/3) * 1 * 1 = 1/3,

and similarly for the other word pairs. We get the same fractional counts for sentence 1 as when we used explicit alignments.

Then sentence 2. To take two examples:

c(bet|bit; e2, f2) = t(bet|bit)/(Σ_i t(bet|e_i)) * Σ_j δ(bet, f_j) * Σ_i δ(bit, e_i)
= (1/3)/(1/3 + 1/3 + 1/3 + 1/3) * 1 * 1 = 1/4

c(hund|dog; e2, f2) = t(hund|dog)/(Σ_i t(hund|e_i)) * Σ_j δ(hund, f_j) * Σ_i δ(dog, e_i)
= (1/3)/(1/3 + 1/3 + 1/3 + 1/3) * 2 * 2 = 1

Hurray, we get the same fractional counts as with the explicit use of alignments. And we may proceed as we did there, calculating first the total fractional counts, tc, and then the translation probabilities, t.

Step 3a - Maximize the translation probabilities

We can harvest the reward when we come to the next round and want to calculate the fractional counts. Take an example from sentence 1:

c(hund|barked; e1, f1) = t(hund|barked)/(Σ_i t(hund|e_i)) * Σ_j δ(hund, f_j) * Σ_i δ(barked, e_i)
= (0.5 * 1 * 1)/(0.588235 + 0.615385 + 0.5) = 0.5/1.70362 = 0.293493,

which is close enough to the result we got by taking the long route (given that we use a calculator and round off in each round). The miracle is that this works equally well on sentence 2, for example:

c(hund|dog; e2, f2) = t(hund|dog)/(Σ_i t(hund|e_i)) * Σ_j δ(hund, f_j) * Σ_i δ(dog, e_i)
= (0.615385 * 2 * 2)/(0.588235 + 0.615385 + 0.666667 + 0.615385) = ?
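
Written out in code, the key formula is a one-liner. The sketch below is only an illustration (the helper name count is mine, not from the note); it reproduces the round-1 counts 1/3, 1/4 and 1, the round-2 count 0.293493 for (hund, barked), and also evaluates the sentence-2 example left open above.

def count(f, e, es, fs, t):
    """The key formula: c(f|e; e, f) =
    t(f|e) / sum_i t(f|e_i)  *  sum_j delta(f, f_j)  *  sum_i delta(e, e_i)."""
    return t[(f, e)] / sum(t[(f, ei)] for ei in es) * fs.count(f) * es.count(e)

e1, f1 = ["NULL", "dog", "barked"], ["hund", "bjeffet"]
e2, f2 = ["NULL", "dog", "bit", "dog"], ["hund", "bet", "hund"]

# Round 1: uniform t, as in step 1a.
t0 = {(f, e): 1/3 for f in ["hund", "bjeffet", "bet"]
      for e in ["NULL", "dog", "barked", "bit"]}
print(count("hund", "barked", e1, f1, t0))             # 1/3
print(count("bet", "bit", e2, f2, t0))                 # 1/4
print(count("hund", "dog", e2, f2, t0))                # 1

# Round 2: the re-estimated t values for hund, from the table in step 2a.
t1 = {("hund", "NULL"): 10/17, ("hund", "dog"): 8/13,
      ("hund", "barked"): 1/2, ("hund", "bit"): 2/3}
print(round(count("hund", "barked", e1, f1, t1), 6))   # 0.293493
print(round(count("hund", "dog", e2, f2, t1), 6))      # the sentence-2 example: 0.990291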

Summing up

This concludes the examples. Hopefully it is now possible to better see:
- the motivation behind the original approach, where we explicitly calculate alignments,
- that the faster algorithm yields the same results as the original algorithm, at least on the example we explicitly calculated, and that even though it may be hard to see step by step that the two algorithms produce the same results in general, we may open up to the idea,
- and that the fast algorithm is computationally tractable.