Machine Translation: Word Alignment Problem


Machine Translation: Word Alignment Problem
Marcello Federico, FBK, Trento - Italy

Example of Parallel Corpus

German: Darum liegt die Verantwortung für das Erreichen des Effizienzzieles und der damit einhergehenden CO2-Reduzierung bei der Gemeinschaft, die nämlich dann tätig wird, wenn das Ziel besser durch gemeinschaftliche Massnahmen erreicht werden kann. Und genaugenommen steht hier die Glaubwürdigkeit der EU auf dem Spiel.

English: That is why the responsibility for achieving the efficiency target and at the same time reducing CO2 lies with the Community, which in fact takes action when an objective can be achieved more effectively by Community measures. Strictly speaking, it is the credibility of the EU that is at stake here.

Notice the different positions of the corresponding verb groups: MT has to take word re-ordering into account!

Outline

Word alignments
Word alignment models
Alignment search
Alignment estimation
EM algorithm
Model 2
Fertility alignment models
HMM alignment models

This part contains advanced material (marked with *) suited to students interested in the mathematical details of the presented models.

Word Alignments

Let us consider the possible alignments a between the words in f and the words in e.

Example:
serata di domani soffierà un freddo vento orientale
since tomorrow evening an eastern chilly wind will blow

Word Alignments

Let us consider the possible alignments a between the words in f and the words in e. Typically, alignments are restricted to maps between positions of f and positions of e. Some words might not be aligned at all: they are virtually aligned with the empty word NULL. These and even more general alignments are machine learnable.

Example:
serata di domani soffierà un freddo vento orientale
NULL since tomorrow evening an eastern chilly wind will blow

Word Alignments

Notice also that alignments induce word re-ordering between the two sentences.

Word Alignment: Matrix Representation

[Figure: alignment matrix for the sentence pair "serata di domani soffierà un freddo vento orientale" / "NULL since tomorrow evening an eastern chilly wind will blow"; the Italian words index the columns, the English words (including NULL) the rows, and each alignment link is a point in the matrix.]

Word Alignment: Direct Alignment

A direct alignment maps positions of f to positions of e:

A : {1,...,m} -> {1,...,l}

Example: il programma è stato messo in pratica -- the program has been implemented

We allow only one link (point) in each column. Some columns may be empty.

Word Alignment: Inverted Alignment

An inverted alignment maps positions of e to positions of f:

A : {1,...,l} -> {1,...,m}

Example: il territorio degli autoctoni -- the territory of the aboriginal people

You can get a direct alignment by swapping the two sentences.

Word Alignment Model

In SMT we model the translation probability Pr(f | e) by summing the probabilities of all possible (l+1)^m hidden alignments a between the two strings:

Pr(f | e) = Σ_a Pr(f, a | e) ≈ Σ_a p_θ(f, a | e)     (1)

Hence we consider statistical word alignment models p_θ(f, a | e), defined by specific sets of parameters θ. The art of statistical modelling consists in designing models which capture the relevant properties of the considered phenomenon, in our case the relationship between a string in one language and a string in the other. There are 5 models of increasing complexity (number of parameters).

Alignment Variable

Modelling the alignment as an arbitrary relation between the positions of the two strings is very general but computationally unfeasible: there are 2^(l·m) possible alignments! A generally applied restriction is to let each word of f be assigned to exactly one word of e. Hence, the alignment is a map

A : {1,...,m} -> {0,...,l}

The alignment variable a = a_1,...,a_m consists of associations j -> i = a_j, from position j of f to position i = a_j of e. We may include null-word alignments, that is a_j = 0, to account for words not aligned to any word of e. Hence, there are only (l+1)^m possible alignments.

Word Alignment Models

In order to find automatic methods to learn word alignments from data, we use mathematical models that explain how translations are generated. The way these models explain translations may appear very naive, if not silly; indeed they are very simplistic. However, simple explanations often work better than complex ones! We need to be a little bit formal here, just to give names to the ingredients we will use in our recipes to learn word alignments:

the English sentence e is a sequence of l words
the French sentence f is a sequence of m words
the word alignment a is a map from the m French positions to the l+1 English positions

We will have to relax a bit our conception of sentence: it is just a sequence of words, which might or might not make sense at all.
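To make the alignment variable concrete, here is a minimal sketch (not part of the original slides) that represents a as a plain list of positions into e, with index 0 reserved for the NULL word; the alignment values are illustrative only.

```python
# Minimal sketch: the alignment a: {1,...,m} -> {0,...,l}, with 0 = NULL word.
f = "serata di domani soffierà un freddo vento orientale".split()                   # m words
e = ["NULL"] + "since tomorrow evening an eastern chilly wind will blow".split()    # l words + NULL at index 0

# a[j] = index into e of the word aligned to f[j]  (illustrative values, not from a model)
a = [3, 0, 2, 9, 4, 6, 7, 5]   # serata->evening, di->NULL, domani->tomorrow, soffierà->blow, ...

for j, i in enumerate(a):
    print(f"{f[j]} -> {e[i]}")
```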

Word Alignment Models

There are five models, of increasing complexity, that explain how a translation and an alignment can be generated from a source sentence. Complexity refers to the number of parameters that define the model. We start from the simplest model, called Model 1.

On Probability Factorization

Chain rule: the probability of a sequence of events e = e_1, e_2, e_3,..., e_l can be factorized as:

Pr(e_1, e_2, e_3,..., e_l) = Pr(e_1) Pr(e_2 | e_1) Pr(e_3 | e_1, e_2) ... Pr(e_l | e_1,..., e_{l-1})

The joint probability is factorized over single-event probabilities. The factors, however, introduce dependencies of increasing complexity: the last factor has the same complexity as the complete joint probability! There are two basic approximations for sequential models which eliminate dependencies in the conditional part of the chain factors. Notice that for non-sequential events we may change the order of the factors, e.g.:

Pr(f, a | e) = Pr(a, f | e) = Pr(a | e) Pr(f | e, a)

Basic Sequential Models

Bag-of-words model: we assume that each event is independent of the others:

Pr(e_1, e_2, e_3,..., e_l) = Pr(e_1) Pr(e_2) Pr(e_3) ... Pr(e_l)

Markov chain model: we assume that each event only depends on the previous one:

Pr(e_1, e_2, e_3,..., e_l) = Pr(e_1) Pr(e_2 | e_1) Pr(e_3 | e_2) ... Pr(e_l | e_{l-1})

We reduce complexity by removing dependencies: the event space becomes smaller and the probabilities are easier to estimate. This simplification might reduce the accuracy of the model.

Model 1

Model 1 generates the translation and the alignment as follows (a toy sampling sketch follows below):

1. guess the length m of f on the basis of the length l of e
2. for each position j in f repeat the following two steps:
   (a) randomly pick a corresponding position i in e
   (b) generate word j of f by picking a translation of word i in e

Step 1 is executed by using a translation length predictor. Step 2(a) is performed by throwing a die with l+1 faces (we want to include the null word). Step 2(b) is carried out by using a word translation table.
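As a toy illustration of this generative story, the following sketch (not from the slides) samples a translation with Model 1; length_table and t_table are hypothetical toy tables standing in for the length predictor p(m|l) and the translation table p(f|e).

```python
import random

length_table = {2: [2, 3]}                          # hypothetical p(m|l): plausible lengths m given l=2
t_table = {"NULL": {"ja": 0.5, "doch": 0.5},        # hypothetical p(f|e) rows
           "the": {"das": 0.8, "ein": 0.2},
           "house": {"Haus": 0.9, "Buch": 0.1}}

def sample_translation(e_words):
    e = ["NULL"] + e_words                          # position 0 is the null word
    l = len(e) - 1
    m = random.choice(length_table[l])              # step 1: choose the length m given l
    f, a = [], []
    for _ in range(m):                              # step 2: for each position of f
        i = random.randint(0, l)                    #   (a) throw an (l+1)-faced die
        probs = t_table[e[i]]
        f.append(random.choices(list(probs), weights=list(probs.values()))[0])  # (b) translate e[i]
        a.append(i)
    return f, a

print(sample_translation(["the", "house"]))         # e.g. (['das', 'Haus'], [1, 2])
```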

Word Alignment Model Factorization

One of the many ways to exactly decompose Pr(f_1^m, a_1^m | e_1^l) is:

Pr(f_1^m, a_1^m | e_1^l) = Pr(m | e_1^l) Π_{j=1}^m Pr(a_j | f_1^{j-1}, a_1^{j-1}, m, e_1^l) Pr(f_j | f_1^{j-1}, a_1^j, m, e_1^l)

It looks dense, but it is just a plain application of the chain rule. Let's make it look simpler:

Pr(f_1^m, a_1^m | e_1^l) = Pr(m | e_1^l) Π_{j=1}^m Pr(a_j | ...) Pr(f_j | a_j, ...)

Generative stochastic process:
1. choose the length m of the French string, given knowledge of the English string e_1^l
2. cover one English position for each French position j, given ...
3. choose a French word for each position j, given the covered English position ...

Remark: the process works in the "wrong" direction: it generates f from e. In fact, it is used to calculate Pr(f, a | e) Pr(e) in the search problem. Though, it can work in both directions by exchanging f and e.

Model 1

Given the alignment factorization above, we simplify all interactions by means of pairwise dependencies:

Pr(m | e_1^l)      = p(m | l)           length probability
Pr(a_j | ...)      = 1 / (l + 1)        alignment probability
Pr(f_j | a_j, ...) = p(f_j | e_{a_j})   translation probability

Hence, we get the following translation model:

Pr(f_1^m | e_1^l) = Σ_{a_1^m} Pr(f_1^m, a_1^m | e_1^l)
                  = p(m | l) / (l+1)^m  Σ_{a_1^m} Π_{j=1}^m p(f_j | e_{a_j})
                  = p(m | l) / (l+1)^m  Π_{j=1}^m Σ_{i=0}^l p(f_j | e_i)      (nice complexity reduction!)

Model 1 has a very simple stochastic generative process:
1. Choose a length m for f according to p(m | l)
2. For each j = 1,...,m, choose a_j in {0, 1,...,l} at random
3. For each j = 1,...,m, choose the French word f_j according to p(f_j | e_{a_j})

Properties:
Model 1 is very naive but is a good starting point for better models
Parameters are the probabilities p(f_j | e_{a_j})
Computation of Pr(f | e) can be very efficient (see the sketch below)
Search of the most probable alignment is straightforward
Estimation is trivial given a parallel corpus with alignments
Estimation is efficient given a parallel corpus without alignments
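The product-of-sums rearrangement above is what makes Pr(f | e) cheap to compute. A minimal sketch (not from the slides), assuming hypothetical t_prob and len_prob lookup tables:

```python
import math

# Model 1 log-probability of f given e, computed in O(m*(l+1)) instead of (l+1)^m.
# t_prob[(f_word, e_word)] = p(f|e), len_prob[(m, l)] = p(m|l): hypothetical toy tables.
def model1_logprob(f, e, t_prob, len_prob):
    e = ["NULL"] + e
    l, m = len(e) - 1, len(f)
    logp = math.log(len_prob[(m, l)]) - m * math.log(l + 1)          # p(m|l) / (l+1)^m
    for fj in f:                                                     # product over positions of f
        logp += math.log(sum(t_prob.get((fj, ei), 1e-12) for ei in e))  # sum over positions of e
    return logp
```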

Model 1: Generative Process

[Figure: from the English sentence "the_1 program_2 has_3 been_4 implemented_5" (length l = 5), m = 7 positions are created, each one is randomly aligned to an English position, and the Italian words "e'_1 stato_2 messo_3 in_4 pratica_5 il_6 programma_7" are chosen through a translation probability table.]

Model 1 only relies on word-to-word translation probabilities!

Model 1: Translation Table

Assume very simple German and English languages of just a few words each. Model 1 needs a table with one row per English word and one column per German word: each row shows the translation probabilities of that English word (e.g. the row for "the" over das, ein, Haus, Buch), and the probabilities in each row sum up to one. Of course, the majority of cells should ideally be equal to zero. Learning Model 1 basically means filling the table with some good values. (Let's forget about the null word here.)

Model 1

Let us see how we can implement Model 1 and look at its complexity:
1. length predictor of the translation: this is not difficult to build; we look for instance at many English-French translations and study how sentence lengths are related (few parameters)
2. die with l+1 faces: very simple to simulate on a computer (no parameters)
3. translation table of words: this is the tricky part. We need a big table that tells us, for each French word f and English word e, whether e is a good or bad translation of f (fair amount of parameters)

Model 1: Learning

Let us assume that we have a parallel corpus with alignments:

serata di domani soffierà un freddo vento orientale -- since tomorrow evening an eastern chilly wind will blow
un vento freddo da est interessa le Alpi -- an eastern cool breeze affects the Alps

We can estimate translation probabilities by counting aligned word pairs. The maximum likelihood estimate for a discrete distribution is the relative frequency:

p(e | f) = count(e, f) / count(f) = count(e, f) / Σ_e count(e, f)

For the word pair chilly-freddo we count how often they are aligned together:

p(chilly | freddo) = count(chilly, freddo) / count(freddo)

We end up with reliable probabilities by using a very large parallel corpus!
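If the alignments are given, this counting step is only a few lines of code. A minimal sketch (not from the slides), estimating the table in the p(f|e) direction used by the model:

```python
from collections import Counter

# MLE of the translation table from a parallel corpus WITH word alignments:
# relative frequency of aligned word pairs.
def mle_translation_table(corpus):
    # corpus: list of (f_words, e_words, alignment) with alignment[j] = i (0 = NULL)
    pair_count, e_count = Counter(), Counter()
    for f, e, a in corpus:
        e = ["NULL"] + e
        for j, i in enumerate(a):
            pair_count[(f[j], e[i])] += 1
            e_count[e[i]] += 1
    return {(fw, ew): c / e_count[ew] for (fw, ew), c in pair_count.items()}   # p(f|e)
```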

Model 1: Aligning

Let us assume that we have probabilities p(f | e) for all word pairs. Given a parallel corpus without alignments:

serata di domani soffierà un freddo vento orientale -- since tomorrow evening an eastern chilly wind will blow
un vento freddo da est interessa le Alpi -- an eastern cool breeze affects the Alps

the most probable (or Viterbi) alignment of each sentence pair,

a* = argmax_a Pr(a | f, e) ∝ Π_{j=1}^m p(f_j | e_{a_j})

can be computed by finding, independently for each position j, the most probable translation source:

a*_j = argmax_{i=0,1,...,l} p(f_j | e_i)

The time complexity of the Viterbi search for Model 1 is just O(m · l).

MLE of a Discrete Distribution

Let x = x_1,...,x_S be a random sample of outcomes of a die X ~ p_θ(X), with parameters θ = {θ(ω) : ω ∈ Ω}, θ(ω) ≥ 0, Σ_ω θ(ω) = 1, where Ω = {1, 2,...,6}. Assume the outcomes in x are independent and identically distributed (iid). Maximum likelihood estimation looks for the θ that maximizes the sample likelihood:

L(θ) = Π_{i=1}^S p_θ(X = x_i) = Π_{ω ∈ Ω} p_θ(X = ω)^{c(ω)} = Π_{ω ∈ Ω} θ(ω)^{c(ω)}

where c(·) is the sample count. We apply a monotonic map to get something equivalent but easier to maximize:

log L(θ) = log Π_{ω ∈ Ω} θ(ω)^{c(ω)} = Σ_{ω ∈ Ω} c(ω) log θ(ω)

Then we can apply Lagrange multipliers to get the closed-form solution:

θ̂(ω) = c(ω) / S

which is the well-known relative frequency!

Model 1: Best Alignment Search

Let us assume that we have translation probabilities p(f | e). Given a parallel corpus without alignments (as above), we can compute the most probable alignment of each sentence pair as follows: for each word f in the target sentence we pick the most probable word e in the source sentence according to the available probabilities.

[Exercise: given translation probabilities of the word freddo against cold, chilly, cool and wind, what alignments will be generated for this word?]

Training of Word Alignment Models

Let p_θ(f | e) be a translation model with unknown parameters θ that we want to estimate from a sample of iid translations {(f_s, e_s) : s = 1,...,S} by maximizing:

L(θ) = Σ_{s=1}^S log p_θ(f_s | e_s) = Σ_{f,e} c(f, e) log p_θ(f | e)     (2)

where c(·) is the sample count and p_θ(f | e) is the marginal probability of an alignment model:

p_θ(f | e) = Σ_a p_θ(f, a | e)     (3)

where the hidden variable a is not observed in the training sample. Unfortunately, there is no closed-form solution for maximizing L(θ). There is, however, an iterative algorithm which is proven to converge at least to a local maximum of L(θ).
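Returning to the Viterbi search for Model 1 described above: because positions are independent, the search is a per-position argmax. A minimal sketch (not from the slides) with a hypothetical toy table t_prob:

```python
def model1_viterbi(f, e, t_prob):
    # t_prob[(f_word, e_word)] = p(f|e); unseen pairs default to 0, so they fall back to NULL.
    e = ["NULL"] + e
    return [max(range(len(e)), key=lambda i: t_prob.get((fj, e[i]), 0.0)) for fj in f]

# Example usage with a toy table:
t_prob = {("freddo", "chilly"): 0.4, ("freddo", "cold"): 0.3, ("vento", "wind"): 0.9}
print(model1_viterbi(["un", "vento", "freddo"], ["a", "chilly", "wind"], t_prob))
# -> [0, 3, 2]   ("un" falls back to NULL, "vento"->wind, "freddo"->chilly)
```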

Estimation of Word Alignment Models

How do we train alignment models from parallel data?

data + word alignments  => model parameters
data + model parameters => word alignments

Idea to solve this chicken-and-egg problem: start from some initial parameters, estimate word alignments over the bilingual corpus, use them to improve the parameters, and loop until convergence.

Model 1: Estimation

Let's go back to our simplified English and German languages. The ingredients of the Expectation Maximization algorithm are:

Initial parameters: a translation table with uniform probabilities over das, ein, Haus, Buch for each of the, a, house, book
Bilingual corpus: a collection of human translations:

the house - das Haus
the book - das Buch
a book - ein Buch

Let us now see how to improve our probabilities.

Model 1: Estimation

We start from the first sentence pair of the bilingual corpus, the house - das Haus, and apply the following two steps:

1. We weight each word co-occurrence with its probability in the table:
   co(the, das) = 1 x 0.25      co(the, Haus) = 1 x 0.25
   co(house, das) = 1 x 0.25    co(house, Haus) = 1 x 0.25

2. We transform the weighted co-occurrences into conditional probabilities:
   Pr(das | the) = 0.25 / (0.25 + 0.25)      Pr(Haus | the) = 0.25 / (0.25 + 0.25)
   Pr(das | house) = 0.25 / (0.25 + 0.25)    Pr(Haus | house) = 0.25 / (0.25 + 0.25)

Notice: in this sentence, "the" can only be linked either to das or to Haus. We apply the same two steps to the other sentence pairs of the bilingual corpus, the book - das Buch and a book - ein Buch.

Model 1: Estimation

We then sum up all the sentence-level probabilities in a single co-occurrence table over das, ein, Haus, Buch for each of the, a, house, book, and compute updated word translation probabilities from these counts. With the updated probability table we start a second iteration: again we accumulate all probabilities in a co-occurrence table and compute updated translation probabilities from the counts. We iterate this procedure several times, until the probabilities reach stable values.

This procedure is called the Expectation Maximization (EM) algorithm. Here, EM could only learn the translations of "the" and "book"! We need more data to learn more translations, and better models, too. A minimal implementation is sketched below.
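A minimal sketch (assumptions: no NULL word, uniform initialization, a fixed number of iterations) of the EM procedure just described, run on the same toy corpus:

```python
from collections import defaultdict

corpus = [("das Haus", "the house"), ("das Buch", "the book"), ("ein Buch", "a book")]
corpus = [(f.split(), e.split()) for f, e in corpus]

f_vocab = {w for f, _ in corpus for w in f}
t = defaultdict(lambda: 1.0 / len(f_vocab))         # p(f|e), uniform initialization

for _ in range(10):                                  # EM iterations
    count = defaultdict(float); total = defaultdict(float)
    for f, e in corpus:                              # E-step: expected counts
        for fw in f:
            norm = sum(t[(fw, ew)] for ew in e)      # posterior normalizer for fw in this pair
            for ew in e:
                c = t[(fw, ew)] / norm
                count[(fw, ew)] += c
                total[ew] += c
    for (fw, ew), c in count.items():                # M-step: relative frequencies
        t[(fw, ew)] = c / total[ew]

print(t[("das", "the")], t[("Buch", "book")])        # these grow towards 1 over the iterations
```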

Expectation Maximization Algorithm

We introduce an auxiliary function Q(θ, θ̄) which has two properties:

Q(θ̄, θ̄) = 0     (4)
Q(θ, θ̄) > 0  =>  L(θ) > L(θ̄)     (5)

Hence, with this iterative procedure we can find a (local) maximum point of L(θ):

0. initialize θ̂
1. do
2.   θ̄ := θ̂
3.   θ̂ := argmax_θ Q(θ, θ̄)
4. while L(θ̂) > L(θ̄)

Property: Q(θ̄, θ̄) = 0 implies max_θ Q(θ, θ̄) ≥ 0. The definition of Q is in the appendix.

For alignment models, step 3 amounts to:
3.1 generate all possible alignments for the training data
3.2 accumulate observed counts weighted by the alignment model
3.3 compute relative frequencies θ̂ from the expected counts c_θ̄(ω)

Indeed, for our alignment models the solution of max_θ Q(θ, θ̄) is:

θ̂(ω) = c_θ̄(ω) / Σ_{ω ∈ Ω_µ} c_θ̄(ω)     with     c_θ̄(ω) = Σ_{s=1}^S Σ_a p_θ̄(a | f_s, e_s) c(ω; a, f_s, e_s)

where θ(ω) is one element of θ, i.e. the probability of ω; θ̄ are the old parameter values and θ̂ the new ones; ω is any elementary event of the model, to be normalized over a sub-space Ω_µ. Thus θ̂(ω) is a relative-frequency estimator based on the expected count c_θ̄(ω), which is obtained by generating all possible alignments a with the old probabilities.

EM Algorithm of Model 1

The parameters of Model 1 are the probabilities p(f | e) that word f is aligned with word e:

p̂(f | e) = c_θ̄(f | e) / Σ_f c_θ̄(f | e)     with     c_θ̄(f | e) = Σ_{s=1}^S Σ_a p_θ̄(a | f_s, e_s) c(f | e; a, f_s, e_s)

where c(f | e; a, f_s, e_s) counts how many times f is aligned with e in the triple (a, f_s, e_s). With some manipulation (see the appendix) we get:

c_θ̄(f | e) = Σ_{s=1}^S Σ_{k=1}^m Σ_{i=0}^l [ p(f_k | e_i) / Σ_{a=0}^l p(f_k | e_a) ] δ(e, e_i) δ(f, f_k)

EM Algorithm of Model 1

EM-Model1(F, E, S)
  Init-Params(P)                          // P[f,e] = uniform
  do
    Reset-Expected-Counts                 // p[.] = 0; ptot[.] = 0
    for s := 1 to S                       // loop over training data
      do Expected-Counts(F[s], Length(F[s]), E[s], Length(E[s]))
    for f in F
      do for e in E                       // new parameters
        do P[f,e] := p[f,e] / ptot[e]
  until convergence

Expected-Counts(F, m, E, l)
  // update counters p[.], ptot[.] using the current parameters P[.]
  for j := 1 to m
    do t := 0
      for i := 0 to l
        do f := F[j]; e := E[i]
          t := t + P[f,e]
      for i := 0 to l
        do f := F[j]; e := E[i]
          p[f,e]  := p[f,e]  + P[f,e] / t
          ptot[e] := ptot[e] + P[f,e] / t

Model 2

Model 2 replaces the uniform alignment probability of Model 1 with:

Pr(a_j | ...) = p(a_j | j, l, m)

Properties:
Model 1 does not care where words appear in the two strings!
Model 2 introduces alignment probabilities, i.e. a table of size (L · M)^2
Training of Models 1-2 is easy from a bilingual corpus with given alignments: both models are products of discrete distributions, their likelihood function is a product of multinomial distributions, and MLE just needs relative counts of events (sufficient statistics)
Efficient computation of Pr(f | e), as for Model 1

Problems and limitations of Model 1 and Model 2:
they do not model the number of foreign words to be connected to each English word
the alignment probability of Model 2 is complex and shallow

Example of Alignment with Model 1/2

German: ah ja , dann geht das wohl nicht mehr .
English: oh well , then , I guess , that will not work anymore .

Problem: three words (the commas) are mapped to the same word! In fact, the words are aligned independently of each other.
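For completeness, a minimal sketch (not from the slides) of the Viterbi search under Model 2, where each candidate link is additionally weighted by the alignment probability; t_prob and a_prob are hypothetical tables:

```python
# t_prob[(f_word, e_word)] = p(f|e), a_prob[(i, j, l, m)] = p(i|j,l,m) with 1-based j.
def model2_viterbi(f, e, t_prob, a_prob):
    e = ["NULL"] + e
    l, m = len(e) - 1, len(f)
    return [max(range(l + 1),
                key=lambda i: t_prob.get((f[j], e[i]), 0.0) * a_prob.get((i, j + 1, l, m), 0.0))
            for j in range(m)]
```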

Example: Alignment with Fertility Models

German: ah ja , dann geht das wohl nicht mehr .
English: oh well , then , I guess , that will not work anymore .

Fertility models explicitly consider the number of words covered by each English word: e.g. if the comma has fertility 1, then only one word can be aligned to it.

Fertility Models

The number of French words covered by an English word e is a random variable φ_e, namely the fertility of e. Models 1-2 do not explicitly model fertilities; Models 3, 4, and 5 parameterize fertilities directly. Fertility models imply a different generative process of f and a given e:

1. For i = 1,...,l, 0, choose a fertility value φ_i ≥ 0 for word e_i
2. For i = 1,...,l, 0, choose a tablet τ_i of φ_i French words to translate e_i
3. Choose a permutation π over the tableau (τ_1,..., τ_l, τ_0) to generate f
4. IF any position was chosen more than once THEN return FAILURE
5. ELSE return the (a, f) corresponding to (τ, π)

Notice: for correct pairs (τ, π) there is a many-to-one mapping to (f, a); the notion of fertility is embedded into τ and π.

Model 3

Model 3 generates the translation and the alignment as follows:
1. for each word i of e it generates a fertility value φ_i
2. for each word i of e it applies the following steps:
   (a) generate φ_i translations of word i
   (b) pick one position for each of the φ_i words

Step 1 implicitly defines the length m of the translation. Steps 1-2 all rely on specific probability tables. This model is significantly more complex than Model 1! Estimation of Model 3 follows the principle used for Model 1; it is just more tricky.

Model 3: Generative Process

[Figure: null_0 the_1 program_2 has_3 been_4 implemented_5 -> fertility -> tablet -> permutation -> e'_1 stato_2 messo_3 in_4 pratica_5 il_6 programma_7]

Model 3: Generative Process

[Figure: the same example, null_0 the_1 program_2 has_3 been_4 implemented_5 generating e'_1 stato_2 messo_3 in_4 pratica_5 il_6 programma_7 via fertility, tablet and permutation. The permutation must be checked before generating (a, f)!]

Incremental Training Procedure

[Figure: the bilingual corpus feeds a chain of EM algorithms: the parameters estimated for Model 1 initialize Model 2, whose parameters initialize Model 3, and so on.]

Use the previous model to initialize some of the parameters of the next model!

Model 3: Training

Problem: there is no way to efficiently compute the summation over alignments. Trick: limit the summation to a neighborhood of the best alignment from Model 2:

c_θ̄(ω) ≈ Σ_{s=1}^S Σ_{a ∈ N(a*)} p_θ̄(a | f_s, e_s) c(ω; a, f_s, e_s)

Problem: there is no efficient way to compute Viterbi alignments with Model 3 either. Trick: do hill-climbing in alignment space, starting from the best alignment given by Model 2, a* = V(f | e; M2), with the hill-climbing operator

b(a*) = argmax_{a ∈ N(a*)} p(a | e, f; M3)     (6)

The neighborhood N(a) contains the alignments differing from a by one move or one swap:
move operator m_[j,i](a): set a_j := i
swap operator s_[j1,j2](a): exchange a_{j1} with a_{j2}

HMM Alignment Model

Another alignment model which follows from the general alignment factorization:

Pr(f_1^m, a_1^m | e_1^l) = Pr(m | e_1^l) Π_{j=1}^m Pr(a_j | f_1^{j-1}, a_1^{j-1}, m, e_1^l) Pr(f_j | f_1^{j-1}, a_1^j, m, e_1^l)

Let us define the following parameters:

Pr(m | e_1^l)      = p(m | l)             string length probabilities
Pr(a_j | ...)      = p(a_j | a_{j-1}, l)  alignment probabilities
Pr(f_j | a_j, ...) = p(f_j | e_{a_j})     translation probabilities

Hence, we get the following translation model:

Pr(f_1^m | e_1^l) = Σ_{a_1^m} Pr(f_1^m, a_1^m | e_1^l) = p(m | l) Σ_{a_1^m} Π_{j=1}^m p(a_j | a_{j-1}, l) p(f_j | e_{a_j})

EM training can be carried out efficiently through dynamic programming.
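A minimal sketch (not from the slides) of the forward recursion that computes Pr(f | e) for the HMM model by dynamic programming; the emission, transition and initial tables are assumed to be given as NumPy arrays (a backward pass, not shown, would additionally be needed for the EM expected counts):

```python
import numpy as np

# emit[j, i] = p(f_j | e_i), trans[i_prev, i] = p(i | i_prev, l),
# init[i] = probability of the first alignment position, len_p = p(m|l).
def hmm_alignment_prob(emit, trans, init, len_p):
    m, _ = emit.shape
    alpha = init * emit[0]                  # alpha_1(i) = p(a_1 = i) p(f_1 | e_i)
    for j in range(1, m):
        alpha = (alpha @ trans) * emit[j]   # alpha_j(i) = sum_i' alpha_{j-1}(i') p(i|i',l) p(f_j|e_i)
    return len_p * alpha.sum()
```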

Combinations of Word Alignments

Given parallel sentences we can train an alignment model and then align them. We have different options:

direct alignment: we learn alignments in one translation direction
inverted alignment: we learn alignments in the opposite direction

We can get better alignments by combining direct and inverted alignments:

union: greedy collection of alignment points, higher coverage
intersection: selective collection of alignment points, higher precision
grow-diagonal: take the best of the two

Properties:
direct/inverted alignments are maps between two sets of positions
the union alignment is a many-to-many partial alignment
the intersection is a 1-1 partial alignment

Union and Intersection Alignments

[Figure: alignment matrices showing a direct alignment, an inverted alignment, their intersection and their union.]

Grow Diagonal Word Alignment

[Figure: starting from the intersection of the direct and inverted alignments, the grow-diagonal heuristic progressively adds neighbouring points taken from the union.]

How to Measure the Quality of Word Alignments

Automatic alignments A are compared against manually annotated sure alignments S and possible alignments P. The Alignment Error Rate is:

AER = 1 - ( |A ∩ S| + |A ∩ P| ) / ( |A| + |S| )
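A minimal sketch (not from the slides) of combining two alignments represented as sets of (i, j) links and of scoring an automatic alignment with AER:

```python
def combine(direct_links, inverted_links):
    # intersection (higher precision) and union (higher coverage) of two link sets
    return direct_links & inverted_links, direct_links | inverted_links

def aer(A, S, P):
    # Alignment Error Rate: A = automatic links, S = sure links, P = possible links (S is a subset of P)
    return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))

# Example usage with toy link sets:
direct = {(1, 1), (2, 2), (3, 2)}
inverted = {(1, 1), (2, 2), (2, 3)}
inter, union = combine(direct, inverted)    # {(1,1),(2,2)} and all four links
```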

Use of Word Alignments: Bilingual Concordance

[Figure: screenshot of a bilingual concordancer searching for the English string "rabbit" in an English-Chinese corpus of Alice in Wonderland; all sentences containing the search string are shown next to their Chinese translations.]

Last Words on Word Alignments

Given a parallel corpus we can automatically learn alignments to:
discover interesting lexical relationships
generate a probabilistic translation lexicon
extract phrase pairs

Alignments have limitations in terms of allowed word mappings. Better alignments can be obtained by:
estimating alignments in both translation directions
computing a suitable combination of the two alignments

Appendix

Often Used Symbols

l, m              lengths of the English and French sentences
f = f_1,...,f_m   French sentence
e = e_1,...,e_l   English sentence
i, j              English and French positions
e_i, f_j          English and French words
e_0               empty word (of the English sentence)
F                 French language dictionary
E                 English language dictionary
i ∈ {0, 1,...,l}  English positions, including the empty word
L, M              maximum lengths of the English and French sentences

Auxiliary Function and Theorem

Given parameters θ̄, we search for better values θ through the auxiliary function:

Q(θ, θ̄) = Σ_{f,e} c(f, e) Σ_a p_θ̄(a | f, e) log [ p_θ(f, a | e) / p_θ̄(f, a | e) ]     (7)

where

p_θ̄(a | f, e) = p_θ̄(f, a | e) / p_θ̄(f | e) = p_θ̄(f, a | e) / Σ_{a'} p_θ̄(f, a' | e)     (8)

and

Q(θ̄, θ̄) = 0     (9)

Q is only similar to an entropy formula and is explained by the following property.

EM Theorem

Given parameter values θ and θ̄ of an alignment model, it holds that:

if Q(θ, θ̄) > 0 then L(θ) > L(θ̄)     (10)

Proof. We can show that Q is related to the likelihood function L by:

L(θ) ≥ L(θ̄) + Q(θ, θ̄)     (11)

which is equivalent to the theorem's statement. The proof of the inequality is based on the simple geometric property log x ≤ x - 1 (with equality for x = 1).

EM Theorem (cont'd)

For any e and f we have that:

Σ_a p_θ̄(a | f, e) log [ p_θ(f, a | e) / p_θ̄(f, a | e) ]
  = Σ_a p_θ̄(a | f, e) log [ (p_θ(a | f, e) p_θ(f | e)) / (p_θ̄(a | f, e) p_θ̄(f | e)) ]     (12)
  = Σ_a p_θ̄(a | f, e) log [ p_θ(a | f, e) / p_θ̄(a | f, e) ] + log [ p_θ(f | e) / p_θ̄(f | e) ] Σ_a p_θ̄(a | f, e)     (13)
  ≤ Σ_a p_θ̄(a | f, e) [ p_θ(a | f, e) / p_θ̄(a | f, e) - 1 ] + log [ p_θ(f | e) / p_θ̄(f | e) ]     (14)
  = 0 + log [ p_θ(f | e) / p_θ̄(f | e) ]     (15)

In step (14) we applied the inequality log x ≤ x - 1 with x = p_θ(a | f, e) / p_θ̄(a | f, e).

EM Theorem (cont'd)

By summing up over all (f, e) we get the desired inequality:

Q(θ, θ̄) = Σ_{(f,e)} c(f, e) Σ_a p_θ̄(a | f, e) log [ p_θ(f, a | e) / p_θ̄(f, a | e) ]
        ≤ Σ_{(f,e)} c(f, e) log [ p_θ(f | e) / p_θ̄(f | e) ] = L(θ) - L(θ̄)

that is, L(θ) ≥ L(θ̄) + Q(θ, θ̄). End of the proof.

The role of Q(θ, θ̄) is now clear: θ̄ are the current parameters and θ the new, unknown parameters. If we find θ such that Q(θ, θ̄) > 0 then we have better parameters. We need some parameters to start with, but the good news is that we can start from any setting (uniform, random, ...).

EM with Alignment Models

All our word alignment models have the general exponential form:

p_θ(f, a | e) = Π_{ω ∈ Ω} θ(ω)^{c(ω; a, f, e)}     (16)

where the parameters satisfy multiple normalization constraints:

Σ_{ω ∈ Ω_µ} θ(ω) = 1,     µ = 1, 2,...     (17)

and the subsets Ω_µ, µ = 1, 2,..., form a partition of Ω. The partition corresponds to all the conditional probabilities in the model: each conditional probability to be estimated has indeed to sum up to 1. Models 1-5 are products of discrete distributions defined over different events within (a, f, e).

The constraints lead to the system of equations:

∂/∂θ(ω) [ Q(θ, θ̄) + Σ_µ λ_µ (1 - Σ_{ω ∈ Ω_µ} θ(ω)) ] = 0,     ω ∈ Ω_µ, µ = 1, 2,...
∂/∂λ_µ  [ Q(θ, θ̄) + λ_µ (1 - Σ_{ω ∈ Ω_µ} θ(ω)) ] = 0,     µ = 1, 2,...     (18)

For ω ∈ Ω_µ, by applying Lagrange multipliers we get the re-estimation formula:

θ̂(ω) = c_θ̄(ω) / Σ_{ω ∈ Ω_µ} c_θ̄(ω)     (19)
c_θ̄(ω) = Σ_{f,e} c(f, e) Σ_a p_θ̄(a | f, e) c(ω; a, f, e)     (20)

The parameter update is based on expected counts, which are collected by averaging over all possible alignments a with the posterior p_θ̄(a | f, e) of the current value θ̄. Problem: the above formula requires summing over (l+1)^m alignments!

Training Model 2

For Model 2, Ω = {(i, j, l, m)} ∪ {(e, f)}, which is partitioned as follows:

Ω_{j,l,m} = {(i | j, l, m) : 0 ≤ i ≤ l},   0 ≤ j, m ≤ M, 0 ≤ l ≤ L     (21)
Ω_e       = {(f | e) : f ∈ F},             e ∈ E

with counts

c(i | j, l, m; a, f, e) = δ(i, a_j)     (22)
c(f | e; a, f, e)       = Σ_{j=1}^m δ(e, e_{a_j}) δ(f, f_j)     (23)

We can now directly derive the iterative re-estimation formulae from:

c_θ̄(ω; f, e) = Σ_a p_θ̄(a | f, e) c(ω; a, f, e)     (24)

Training Model 2: Useful Formulas

Model 2 permits to efficiently calculate the sum over alignments:

p_θ(f | e) = Σ_a p_θ(f, a | e)
           = p(m | l) Σ_{a_1=0}^l ... Σ_{a_m=0}^l Π_{j=1}^m p(f_j | e_{a_j}) p(a_j | j, l, m)
           = p(m | l) Π_{j=1}^m Σ_{i=0}^l p(f_j | e_i) p(i | j, l, m)     (25)

Proof. Let m = 3 and l = 1, and let x_{j,a_j} = p(f_j | e_{a_j}) p(a_j | j, l, m). It is routine to verify that:

Σ_{a_1} Σ_{a_2} Σ_{a_3} x_{1,a_1} x_{2,a_2} x_{3,a_3} = (x_{10} + x_{11})(x_{20} + x_{21})(x_{30} + x_{31})

Hence we can write:

p_θ(a | f, e) = p_θ(f, a | e) / Σ_a p_θ(f, a | e)
             = Π_{j=1}^m [ p(f_j | e_{a_j}) p(a_j | j, l, m) / Σ_{i=0}^l p(f_j | e_i) p(i | j, l, m) ]
             = Π_{j=1}^m p_θ(a_j | j, f, e)     (26)

Important: we need only 2 · m · (l+1) operations!
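A minimal sketch (not from the slides) of equations (25)-(26) for one sentence pair, assuming the matrix T[j, i] = p(f_j | e_i) · p(i | j, l, m) has already been filled in:

```python
import numpy as np

def model2_sentence_quantities(T, len_p):
    # T has shape (m, l+1); len_p = p(m|l).
    row_sums = T.sum(axis=1, keepdims=True)    # sum_i p(f_j|e_i) p(i|j,l,m), one value per j
    prob_f_given_e = len_p * row_sums.prod()   # eq. (25): p(m|l) times the product over j
    posteriors = T / row_sums                  # eq. (26): p(a_j = i | j, f, e), row-normalized
    return prob_f_given_e, posteriors
```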

Training Model 2

For the translation probabilities, the expected counts become:

c_θ̄(f | e; f, e) = Σ_a p_θ̄(a | f, e) c(f | e; a, f, e)
                 = Σ_{a_1=0}^l ... Σ_{a_m=0}^l [ Π_{j=1}^m p_θ̄(a_j | j, f, e) ] Σ_{k=1}^m δ(e, e_{a_k}) δ(f, f_k)
                 = Σ_{k=1}^m Σ_{a_k=0}^l p_θ̄(a_k | k, f, e) δ(e, e_{a_k}) δ(f, f_k)     (27)
                 = Σ_{k=1}^m Σ_{i=0}^l [ p(f_k | e_i) p(i | k, l, m) / Σ_{a=0}^l p(f_k | e_a) p(a | k, l, m) ] δ(e, e_i) δ(f, f_k)     (28)

For the alignment probabilities:

c_θ̄(i | j, l, m; f, e) = Σ_a p_θ̄(a | f, e) c(i | j, l, m; a, f, e)
                       = Σ_{a_1=0}^l ... Σ_{a_m=0}^l [ Π_{k=1}^m p_θ̄(a_k | k, f, e) ] δ(i, a_j)
                       = Σ_{a_j=0}^l p_θ̄(a_j | j, f, e) δ(i, a_j)     (29)
                       = p_θ̄(i | j, f, e) = p(f_j | e_i) p(i | j, l, m) / Σ_{i'=0}^l p(f_j | e_{i'}) p(i' | j, l, m)     (30)

Model 2: Training Algorithm

EM-Model2(F, E, S)
  Init-Params(P, Q)                        // P[f,e] = uniform; Q[i,j,l,m] = uniform
  do
    Reset-Expected-Counts                  // p[.] = 0; ptot[.] = 0; q[.] = 0; qtot[.] = 0
    for s := 1 to S                        // loop over training data
      do Expected-Counts(F[s], Length(F[s]), E[s], Length(E[s]))
    for m := 1 to M                        // maximum target length
      do for l := 1 to L                   // maximum source length
        do for j := 1 to m
          do for i := 0 to l               // new alignment parameters
            do Q[i,j,l,m] := q[i,j,l,m] / qtot[j,l,m]
    for f in F
      do for e in E                        // new translation parameters
        do P[f,e] := p[f,e] / ptot[e]
  until convergence

Expected-Counts(F, m, E, l)
  // update counters p[.], q[.], ptot[.], qtot[.] using the current parameters P[.], Q[.]
  for j := 1 to m
    do t := 0
      for i := 0 to l
        do f := F[j]; e := E[i]
          t := t + P[f,e] * Q[i,j,l,m]
      for i := 0 to l
        do f := F[j]; e := E[i]
          c := P[f,e] * Q[i,j,l,m] / t
          q[i,j,l,m]  := q[i,j,l,m] + c
          qtot[j,l,m] := qtot[j,l,m] + c
          p[f,e]      := p[f,e] + c
          ptot[e]     := ptot[e] + c


Pattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes

Pattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lesson 1 5 October 2016 Learning and Evaluation of Pattern Recognition Processes Outline Notation...2 1. The

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Hidden Markov Models Barnabás Póczos & Aarti Singh Slides courtesy: Eric Xing i.i.d to sequential data So far we assumed independent, identically distributed

More information

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing Hidden Markov Models By Parisa Abedi Slides courtesy: Eric Xing i.i.d to sequential data So far we assumed independent, identically distributed data Sequential (non i.i.d.) data Time-series data E.g. Speech

More information

MATH MW Elementary Probability Course Notes Part I: Models and Counting

MATH MW Elementary Probability Course Notes Part I: Models and Counting MATH 2030 3.00MW Elementary Probability Course Notes Part I: Models and Counting Tom Salisbury salt@yorku.ca York University Winter 2010 Introduction [Jan 5] Probability: the mathematics used for Statistics

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training Baum-Welch algorithm

More information

Basic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology

Basic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology Basic Text Analysis Hidden Markov Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakimnivre@lingfiluuse Basic Text Analysis 1(33) Hidden Markov Models Markov models are

More information

Basic Probability and Statistics

Basic Probability and Statistics Basic Probability and Statistics Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [based on slides from Jerry Zhu, Mark Craven] slide 1 Reasoning with Uncertainty

More information

N-gram Language Modeling

N-gram Language Modeling N-gram Language Modeling Outline: Statistical Language Model (LM) Intro General N-gram models Basic (non-parametric) n-grams Class LMs Mixtures Part I: Statistical Language Model (LM) Intro What is a statistical

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a

More information

Probability Review Lecturer: Ji Liu Thank Jerry Zhu for sharing his slides

Probability Review Lecturer: Ji Liu Thank Jerry Zhu for sharing his slides Probability Review Lecturer: Ji Liu Thank Jerry Zhu for sharing his slides slide 1 Inference with Bayes rule: Example In a bag there are two envelopes one has a red ball (worth $100) and a black ball one

More information

Lecture 21: Spectral Learning for Graphical Models

Lecture 21: Spectral Learning for Graphical Models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation

More information

1. Markov models. 1.1 Markov-chain

1. Markov models. 1.1 Markov-chain 1. Markov models 1.1 Markov-chain Let X be a random variable X = (X 1,..., X t ) taking values in some set S = {s 1,..., s N }. The sequence is Markov chain if it has the following properties: 1. Limited

More information

Give students a few minutes to reflect on Exercise 1. Then ask students to share their initial reactions and thoughts in answering the questions.

Give students a few minutes to reflect on Exercise 1. Then ask students to share their initial reactions and thoughts in answering the questions. Student Outcomes Students understand that an equation is a statement of equality between two expressions. When values are substituted for the variables in an equation, the equation is either true or false.

More information

Recap: HMM. ANLP Lecture 9: Algorithms for HMMs. More general notation. Recap: HMM. Elements of HMM: Sharon Goldwater 4 Oct 2018.

Recap: HMM. ANLP Lecture 9: Algorithms for HMMs. More general notation. Recap: HMM. Elements of HMM: Sharon Goldwater 4 Oct 2018. Recap: HMM ANLP Lecture 9: Algorithms for HMMs Sharon Goldwater 4 Oct 2018 Elements of HMM: Set of states (tags) Output alphabet (word types) Start state (beginning of sentence) State transition probabilities

More information

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk The POS Tagging Problem 2 England NNP s POS fencers

More information

Review of Basic Probability Theory

Review of Basic Probability Theory Review of Basic Probability Theory James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) 1 / 35 Review of Basic Probability Theory

More information

EM (cont.) November 26 th, Carlos Guestrin 1

EM (cont.) November 26 th, Carlos Guestrin 1 EM (cont.) Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University November 26 th, 2007 1 Silly Example Let events be grades in a class w 1 = Gets an A P(A) = ½ w 2 = Gets a B P(B) = µ

More information

Unit 1: Sequence Models

Unit 1: Sequence Models CS 562: Empirical Methods in Natural Language Processing Unit 1: Sequence Models Lecture 5: Probabilities and Estimations Lecture 6: Weighted Finite-State Machines Week 3 -- Sep 8 & 10, 2009 Liang Huang

More information

Formalizing Probability. Choosing the Sample Space. Probability Measures

Formalizing Probability. Choosing the Sample Space. Probability Measures Formalizing Probability Choosing the Sample Space What do we assign probability to? Intuitively, we assign them to possible events (things that might happen, outcomes of an experiment) Formally, we take

More information