Machine Translation. CL1: Jordan Boyd-Graber. University of Maryland. November 11, 2013

Machine Translation
CL1: Jordan Boyd-Graber, University of Maryland
November 11, 2013
Adapted from material by Philipp Koehn

Roadmap
- Introduction to MT
- Components of an MT system
- Word-based models
- Beyond word-based models

Outline
1 Introduction
2 Word Based Translation Systems
3 Learning the Models
4 Everything Else

What unlocks translations?
Humans need parallel text to understand new languages when no speakers are around
Rosetta Stone: allowed us to understand Egyptian
Computers need the same information
Where do we get them?
- Some governments require translations (Canada, EU, Hong Kong)
- Newspapers
- Internet

Pieces of a Machine Translation System

Terminology
Source language: f (foreign)
Target language: e (English)

Outline
1 Introduction
2 Word Based Translation Systems
3 Learning the Models
4 Everything Else

Collect Statistics
Look at a parallel corpus (German text along with English translation)

Translation of Haus   Count
house                 8,000
building              1,600
home                    200
household               150
shell                    50

Estimate Translation Probabilities
Maximum likelihood estimation

p_f(e) = 0.8    if e = house,
         0.16   if e = building,
         0.02   if e = home,
         0.015  if e = household,
         0.005  if e = shell.
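
These maximum likelihood estimates are just relative frequencies of the counts on the previous slide. A minimal sketch for the German word Haus (the variable names are illustrative, not from the lecture):

    # MLE translation probabilities from the Haus counts above
    counts = {"house": 8000, "building": 1600, "home": 200,
              "household": 150, "shell": 50}
    total = sum(counts.values())
    t_haus = {e: c / total for e, c in counts.items()}
    print(t_haus["house"])  # 0.8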

Alignment
In a parallel text (or when we translate), we align words in one language with the words in the other

  1    2     3    4
  das  Haus  ist  klein

  the  house  is  small
  1    2      3   4

Word positions are numbered 1-4

Alignment Function
Formalizing alignment with an alignment function
Mapping an English target word at position i to a German source word at position j with a function a : i → j
Example: a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}
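
In code, an alignment function is simply a mapping from English positions to German positions. A small sketch (the dictionary representation is my own choice, not part of the lecture):

    # das Haus ist klein -> the house is small
    german = ["das", "Haus", "ist", "klein"]
    english = ["the", "house", "is", "small"]
    a = {1: 1, 2: 2, 3: 3, 4: 4}  # English position -> German position
    for j, e_word in enumerate(english, start=1):
        print(e_word, "aligned to", german[a[j] - 1])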

Reordering
Words may be reordered during translation

  1      2    3    4
  klein  ist  das  Haus

  the  house  is  small
  1    2      3   4

a : {1 → 3, 2 → 4, 3 → 2, 4 → 1}

One-to-Many Translation
A source word may translate into multiple target words

  1    2     3    4
  das  Haus  ist  klitzeklein

  the  house  is  very  small
  1    2      3   4     5

a : {1 → 1, 2 → 2, 3 → 3, 4 → 4, 5 → 4}

Dropping Words
Words may be dropped when translated (the German article das is dropped)

  1    2     3    4
  das  Haus  ist  klein

  house  is  small
  1      2   3

a : {1 → 2, 2 → 3, 3 → 4}

Inserting Words
Words may be added during translation
- The English just does not have an equivalent in German
- We still need to map it to something: special NULL token

  0     1    2     3    4
  NULL  das  Haus  ist  klein

  the  house  is  just  small
  1    2      3   4     5

a : {1 → 1, 2 → 2, 3 → 3, 4 → 0, 5 → 4}

A Family of Lexical Translation Models
Uncreatively named: Model 1, Model 2, ...
Foundation of all modern translation algorithms
First up: Model 1

IBM Model 1
Generative model: break up the translation process into smaller steps
- IBM Model 1 only uses lexical translation
Translation probability
- for a foreign sentence f = (f_1, ..., f_{l_f}) of length l_f
- to an English sentence e = (e_1, ..., e_{l_e}) of length l_e
- with an alignment of each English word e_j to a foreign word f_i according to the alignment function a : j → i

p(e, a | f) = ε / (l_f + 1)^{l_e} · ∏_{j=1}^{l_e} t(e_j | f_{a(j)})

- the parameter ε is a normalization constant
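
The Model 1 probability is easy to compute directly from a translation table. A minimal sketch of the formula above (function and variable names are my own, with ε set to 1):

    # p(e, a | f) = eps / (l_f + 1)**l_e * prod_j t(e_j | f_{a(j)})
    def model1_prob(english, foreign, a, t, eps=1.0):
        l_e, l_f = len(english), len(foreign)
        p = eps / (l_f + 1) ** l_e
        for j, e_word in enumerate(english, start=1):
            f_word = "NULL" if a[j] == 0 else foreign[a[j] - 1]
            p *= t.get((e_word, f_word), 0.0)
        return p

    # e.g. model1_prob(["the", "house", "is", "small"],
    #                  ["das", "Haus", "ist", "klein"],
    #                  {1: 1, 2: 2, 3: 3, 4: 4}, t_table)   # t_table is hypothetical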

Example

das              Haus               ist               klein
e       t(e|f)   e         t(e|f)   e        t(e|f)   e        t(e|f)
the     0.7      house     0.8      is       0.8      small    0.4
that    0.15     building  0.16     's       0.16     little   0.4
which   0.075    home      0.02     exists   0.02     short    0.1
who     0.05     family    0.015    has      0.015    minor    0.06
this    0.025    shell     0.005    are      0.005    petty    0.04

p(e, a | f) = ε/4³ · t(the | das) · t(house | Haus) · t(is | ist) · t(small | klein)
            = ε/4³ · 0.7 · 0.8 · 0.8 · 0.4
            = 0.0028 ε
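
A quick arithmetic check of the product on this slide (with ε left as a symbolic factor):

    # product of the four lexical translation probabilities, divided by 4**3
    p = 0.7 * 0.8 * 0.8 * 0.4 / 4 ** 3
    print(round(p, 4))  # 0.0028 (times epsilon)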

Outline
1 Introduction
2 Word Based Translation Systems
3 Learning the Models
4 Everything Else

Learning Lexical Translation Models
We would like to estimate the lexical translation probabilities t(e|f) from a parallel corpus
... but we do not have the alignments
Chicken-and-egg problem
- if we had the alignments → we could estimate the parameters of our generative model
- if we had the parameters → we could estimate the alignments

EM Algorithm
Incomplete data
- if we had complete data, we could estimate the model
- if we had the model, we could fill in the gaps in the data
Expectation Maximization (EM) in a nutshell
1 initialize model parameters (e.g. uniform)
2 assign probabilities to the missing data
3 estimate model parameters from completed data
4 iterate steps 2-3 until convergence

EM Algorithm
... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...
Initial step: all alignments equally likely
Model learns that, e.g., la is often aligned with the

EM Algorithm
... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...
After one iteration
Alignments, e.g., between la and the are more likely

EM Algorithm
... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...
After another iteration
It becomes apparent that alignments, e.g., between fleur and flower are more likely (pigeonhole principle)

EM Algorithm
... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...
Convergence
Inherent hidden structure revealed by EM

EM Algorithm
... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...

p(la | the) = 0.453
p(le | the) = 0.334
p(maison | house) = 0.876
p(bleu | blue) = 0.563
...

Parameter estimation from the aligned corpus

IBM Model 1 and EM
The EM Algorithm consists of two steps
Expectation-Step: apply the model to the data
- parts of the model are hidden (here: alignments)
- using the model, assign probabilities to possible values
Maximization-Step: estimate the model from the data
- take the assigned values as fact
- collect counts (weighted by probabilities)
- estimate the model from the counts
Iterate these steps until convergence

IBM Model 1 and EM
We need to be able to compute:
- Expectation-Step: probability of alignments
- Maximization-Step: count collection

IBM Model 1 and EM

Probabilities
p(the | la) = 0.7        p(house | la) = 0.05
p(the | maison) = 0.1    p(house | maison) = 0.8

Alignments for the sentence pair la maison / the house
the→la, house→maison:      p(e,a | f) = 0.56     p(a | e,f) = 0.824
the→la, house→la:          p(e,a | f) = 0.035    p(a | e,f) = 0.052
the→maison, house→maison:  p(e,a | f) = 0.08     p(a | e,f) = 0.118
the→maison, house→la:      p(e,a | f) = 0.005    p(a | e,f) = 0.007

Counts
c(the | la) = 0.824 + 0.052        c(house | la) = 0.052 + 0.007
c(the | maison) = 0.118 + 0.007    c(house | maison) = 0.824 + 0.118
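
These numbers are easy to reproduce: the alignment probabilities are products of lexical probabilities, the posteriors are those products renormalized, and the counts are posterior-weighted tallies. A minimal sketch (assuming, as the slide appears to, that the ε/(l_f+1)^{l_e} factor cancels and NULL is ignored):

    from itertools import product

    t = {("the", "la"): 0.7, ("house", "la"): 0.05,
         ("the", "maison"): 0.1, ("house", "maison"): 0.8}
    french = ["la", "maison"]

    # joint probability of each alignment (one French word per English word)
    joint = {a: t[("the", a[0])] * t[("house", a[1])]
             for a in product(french, repeat=2)}
    z = sum(joint.values())
    posterior = {a: p / z for a, p in joint.items()}   # p(a | e, f)

    # posterior-weighted counts c(e | f)
    counts = {}
    for (f_the, f_house), p in posterior.items():
        counts[("the", f_the)] = counts.get(("the", f_the), 0.0) + p
        counts[("house", f_house)] = counts.get(("house", f_house), 0.0) + p
    print({k: round(v, 3) for k, v in posterior.items()})
    print({k: round(v, 3) for k, v in counts.items()})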

IBM Model 1 and EM: Expectation Step
We need to compute p(a | e, f)
Applying the chain rule:

p(a | e, f) = p(e, a | f) / p(e | f)

We already have the formula for p(e, a | f) (definition of Model 1)

IBM Model 1 and EM: Expectation Step
We need to compute p(e | f)

p(e | f) = Σ_a p(e, a | f)
         = Σ_{a(1)=0}^{l_f} ... Σ_{a(l_e)=0}^{l_f} p(e, a | f)
         = Σ_{a(1)=0}^{l_f} ... Σ_{a(l_e)=0}^{l_f} ε/(l_f + 1)^{l_e} ∏_{j=1}^{l_e} t(e_j | f_{a(j)})

IBM Model 1 and EM: Expectation Step

p(e | f) = Σ_{a(1)=0}^{l_f} ... Σ_{a(l_e)=0}^{l_f} ε/(l_f + 1)^{l_e} ∏_{j=1}^{l_e} t(e_j | f_{a(j)})
         = ε/(l_f + 1)^{l_e} ∏_{j=1}^{l_e} Σ_{i=0}^{l_f} t(e_j | f_i)

Note the algebra trick in the last line
- removes the need for an exponential number of products
- this makes IBM Model 1 estimation tractable

The Trick (case l_e = l_f = 2)

Σ_{a(1)=0}^{2} Σ_{a(2)=0}^{2} ε/3² ∏_{j=1}^{2} t(e_j | f_{a(j)})
= ε/3² [ t(e_1|f_0) t(e_2|f_0) + t(e_1|f_0) t(e_2|f_1) + t(e_1|f_0) t(e_2|f_2)
       + t(e_1|f_1) t(e_2|f_0) + t(e_1|f_1) t(e_2|f_1) + t(e_1|f_1) t(e_2|f_2)
       + t(e_1|f_2) t(e_2|f_0) + t(e_1|f_2) t(e_2|f_1) + t(e_1|f_2) t(e_2|f_2) ]
= ε/3² [ t(e_1|f_0) (t(e_2|f_0) + t(e_2|f_1) + t(e_2|f_2))
       + t(e_1|f_1) (t(e_2|f_0) + t(e_2|f_1) + t(e_2|f_2))
       + t(e_1|f_2) (t(e_2|f_0) + t(e_2|f_1) + t(e_2|f_2)) ]
= ε/3² (t(e_1|f_0) + t(e_1|f_1) + t(e_1|f_2)) · (t(e_2|f_0) + t(e_2|f_1) + t(e_2|f_2))
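
The identity is just distributivity, and it is easy to spot-check numerically. A small sketch with made-up probabilities (the values and variable names are illustrative only):

    from itertools import product

    # made-up t(e_j | f_i) values for l_e = l_f = 2 (f_0 is the NULL word)
    t = [[0.2, 0.5, 0.3],   # t(e_1 | f_0), t(e_1 | f_1), t(e_1 | f_2)
         [0.1, 0.6, 0.3]]   # t(e_2 | f_0), t(e_2 | f_1), t(e_2 | f_2)

    # left-hand side: sum over all 3**2 alignments of the product
    lhs = sum(t[0][a1] * t[1][a2] for a1, a2 in product(range(3), repeat=2))
    # right-hand side: product over positions of the per-position sums
    rhs = sum(t[0]) * sum(t[1])
    print(lhs, rhs)  # equal for any choice of t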

IBM Model 1 and EM: Expectation Step
Combine what we have:

p(a | e, f) = p(e, a | f) / p(e | f)
            = [ ε/(l_f + 1)^{l_e} ∏_{j=1}^{l_e} t(e_j | f_{a(j)}) ] / [ ε/(l_f + 1)^{l_e} ∏_{j=1}^{l_e} Σ_{i=0}^{l_f} t(e_j | f_i) ]
            = ∏_{j=1}^{l_e} t(e_j | f_{a(j)}) / Σ_{i=0}^{l_f} t(e_j | f_i)

IBM Model 1 and EM: Maximization Step
Now we have to collect counts
Evidence from a sentence pair (e, f) that word e is a translation of word f:

c(e | f; e, f) = Σ_a p(a | e, f) Σ_{j=1}^{l_e} δ(e, e_j) δ(f, f_{a(j)})

With the same simplification as before:

c(e | f; e, f) = t(e | f) / Σ_{i=0}^{l_f} t(e | f_i) · Σ_{j=1}^{l_e} δ(e, e_j) · Σ_{i=0}^{l_f} δ(f, f_i)

IBM Model 1 and EM: Maximization Step
After collecting these counts over a corpus, we can estimate the model:

t(e | f) = Σ_{(e,f)} c(e | f; e, f) / Σ_e Σ_{(e,f)} c(e | f; e, f)

IBM Model 1 and EM: Pseudocode

 1: initialize t(e|f) uniformly
 2: while not converged do
 3:   {initialize}
 4:   count(e|f) = 0 for all e, f
 5:   total(f) = 0 for all f
 6:   for all sentence pairs (e, f) do
 7:     {compute normalization}
 8:     for all words e in e do
 9:       s-total(e) = 0
10:       for all words f in f do
11:         s-total(e) += t(e|f)
12:     {collect counts}
13:     for all words e in e do
14:       for all words f in f do
15:         count(e|f) += t(e|f) / s-total(e)
16:         total(f) += t(e|f) / s-total(e)
17:   {estimate probabilities}
18:   for all foreign words f do
19:     for all English words e do
20:       t(e|f) = count(e|f) / total(f)
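
For reference, a compact executable version of this pseudocode in Python. It is a sketch, not the course's official implementation: the variable names, the convergence test (a fixed number of iterations), and the omission of the NULL word are my own choices.

    from collections import defaultdict

    def ibm_model1(sentence_pairs, iterations=10):
        """EM training of IBM Model 1 lexical translation probabilities t(e|f)."""
        f_vocab = {f for _, fs in sentence_pairs for f in fs}
        t = defaultdict(lambda: 1.0 / len(f_vocab))   # initialize t(e|f) uniformly
        for _ in range(iterations):                   # "while not converged"
            count = defaultdict(float)                # count(e|f) = 0
            total = defaultdict(float)                # total(f) = 0
            for es, fs in sentence_pairs:
                # compute normalization s-total(e)
                s_total = {e: sum(t[(e, f)] for f in fs) for e in es}
                # collect counts
                for e in es:
                    for f in fs:
                        count[(e, f)] += t[(e, f)] / s_total[e]
                        total[f] += t[(e, f)] / s_total[e]
            # estimate probabilities
            for (e, f), c in count.items():
                t[(e, f)] = c / total[f]
        return t

    # toy corpus from the convergence slide below
    corpus = [("the house".split(), "das Haus".split()),
              ("the book".split(), "das Buch".split()),
              ("a book".split(), "ein Buch".split())]
    t = ibm_model1(corpus)
    # both approach 1, as in the convergence table
    print(round(t[("the", "das")], 4), round(t[("book", "Buch")], 4))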

Convergence

das Haus ↔ the house
das Buch ↔ the book
ein Buch ↔ a book

e      f      initial  1st it.  2nd it.  ...  final
the    das     0.25     0.5      0.6364   ...   1
book   das     0.25     0.25     0.1818   ...   0
house  das     0.25     0.25     0.1818   ...   0
the    Buch    0.25     0.25     0.1818   ...   0
book   Buch    0.25     0.5      0.6364   ...   1
a      Buch    0.25     0.25     0.1818   ...   0
book   ein     0.25     0.5      0.4286   ...   0
a      ein     0.25     0.5      0.5714   ...   1
the    Haus    0.25     0.5      0.4286   ...   0
house  Haus    0.25     0.5      0.5714   ...   1

Ensuring Fluent Output
Our translation model cannot decide between small and little
Sometimes one is preferred over the other:
- small step: 2,070,000 occurrences in the Google index
- little step: 257,000 occurrences in the Google index
Language model
- estimates how likely a string is to be English
- based on n-gram statistics

p(e) = p(e_1, e_2, ..., e_n)
     = p(e_1) p(e_2 | e_1) ... p(e_n | e_1, e_2, ..., e_{n-1})
     ≈ p(e_1) p(e_2 | e_1) ... p(e_n | e_{n-2}, e_{n-1})
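
A minimal sketch of the trigram approximation in the last line, with maximum likelihood estimates from counts (the counting scheme, the toy corpus, and the lack of smoothing or sentence boundaries are simplifications of my own):

    from collections import Counter

    def trigram_lm(corpus):
        """MLE trigram model: p(w3 | w1, w2) = c(w1, w2, w3) / c(w1, w2)."""
        tri, bi = Counter(), Counter()
        for sent in corpus:
            words = sent.split()
            for i in range(len(words) - 2):
                tri[tuple(words[i:i + 3])] += 1
                bi[tuple(words[i:i + 2])] += 1
        return lambda w1, w2, w3: tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

    p = trigram_lm(["the house is small", "the house is big"])
    print(p("house", "is", "small"))  # 0.5 under this toy corpus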

Noisy Channel Model
We would like to integrate a language model
Bayes rule:

argmax_e p(e | f) = argmax_e p(f | e) p(e) / p(f)
                  = argmax_e p(f | e) p(e)
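
In other words, each candidate translation is scored by translation model times language model. A toy illustration of the decision rule over an explicit candidate list (the candidates and scores below are invented; real decoders search a far larger space):

    import math

    def noisy_channel_best(f, candidates, log_p_f_given_e, log_p_e):
        """argmax_e  log p(f|e) + log p(e)  over a fixed candidate list."""
        return max(candidates, key=lambda e: log_p_f_given_e(f, e) + log_p_e(e))

    # invented toy scores, for illustration only
    tm = {("das Haus ist klein", "the house is small"): math.log(0.0028),
          ("das Haus ist klein", "the house is little"): math.log(0.0028)}
    lm = {"the house is small": math.log(1e-4),
          "the house is little": math.log(1e-5)}
    best = noisy_channel_best("das Haus ist klein", list(lm),
                              lambda f, e: tm[(f, e)], lambda e: lm[e])
    print(best)  # "the house is small" wins on the language model score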

Noisy Channel Model
Applying Bayes rule is also called the noisy channel model
- we observe a distorted message R (here: a foreign string f)
- we have a model of how the message is distorted (here: the translation model)
- we have a model of which messages are probable (here: the language model)
- we want to recover the original message S (here: an English string e)

Outline
1 Introduction
2 Word Based Translation Systems
3 Learning the Models
4 Everything Else

Higher IBM Models
IBM Model 1: lexical translation
IBM Model 2: adds absolute reordering model
IBM Model 3: adds fertility model
IBM Model 4: relative reordering model
IBM Model 5: fixes deficiency
Only IBM Model 1 has a global maximum
- training of a higher IBM model builds on the previous model
Computationally, the biggest change is in Model 3
- the trick to simplify estimation does not work anymore
  → exhaustive count collection becomes computationally too expensive
- sampling over high-probability alignments is used instead

Legacy
IBM Models were the pioneering models in statistical machine translation
Introduced important concepts
- generative model
- EM training
- reordering models
Only used for niche applications as a translation model
... but still in common use for word alignment (e.g., the GIZA++ toolkit)

Word Alignment
Given a sentence pair, which words correspond to each other?

michael geht davon aus, dass er im haus bleibt
michael assumes that he will stay in the house

Word Alignment?

john wohnt hier nicht
john does not live here

Is the English word does aligned to the German wohnt (verb) or nicht (negation) or neither?

Word Alignment?

john biss ins grass
john kicked the bucket

How do the idioms kicked the bucket and biss ins grass match up?
Outside this exceptional context, bucket is never a good translation for grass

Summary
- Lexical translation
- Alignment
- Expectation Maximization (EM) Algorithm
- Noisy Channel Model
- IBM Models
- Word Alignment
Much more in CL2!