Statistical Machine Translation

Gwen Stone
6 years ago

1 Statistical Machine Translation
LECTURE 5 HIGHER IBM MODELS APRIL 6 200

2 Brief Outline
- IBM Model 2
- IBM Model 3
- IBM Model 4
- IBM Model 5
Ref: Peter F. Brown et al., "The Mathematics of Statistical Machine Translation: Parameter Estimation", Computational Linguistics, Vol. 19, No. 2

3 IBM Model 2

4 IBM Model 2
Model 1 takes no notice of where the words appear in the translation. E.g. for
questa casa è bella naturalmente
the translations
Of course this house is beautiful
This house beautiful is of course
are equally probable under Model 1. Model 2 takes care of this.

5 Alignment Model: IBM Model 2
The assumption is that the translation of $f_j$ to $e_i$ depends upon the alignment probability $p_a(j \mid i, m, l)$.
Q1. What does it mean?
Q2. How to compute it?

6 IBM Model 2
Thus translation is a two-step process:
- Lexical Translation Step: modeled by $t(e_p \mid f_q)$
- Alignment Step: modeled by $p_a(j \mid i, m, l)$
e.g.
Questa casa è bella naturalmente
(Translation Step)
this house is beautiful naturally
(Alignment Step)
Naturally this house is beautiful

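7 IBM Model 2
Under this model we have:
$$p(e, a \mid f) = c \prod_{i=1}^{m} t(e_i \mid f_{a(i)})\, p_a(a(i) \mid i, m, l)$$
Hence:
$$p(e \mid f) = \sum_{a} p(e, a \mid f) = c \sum_{a(1)=0}^{l} \cdots \sum_{a(m)=0}^{l} \prod_{i=1}^{m} t(e_i \mid f_{a(i)})\, p_a(a(i) \mid i, m, l) = c \prod_{i=1}^{m} \sum_{j=0}^{l} t(e_i \mid f_j)\, p_a(j \mid i, m, l)$$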
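The last equality above (a sum over all alignments collapsing into a product of per-position sums) can be checked numerically. The following is a minimal sketch, assuming toy values for $t$ and a uniform $p_a$; the function names, the dictionaries, and the choice $c = 1$ are illustrative assumptions, not part of the lecture.

```python
from itertools import product

def model2_brute_force(e, f, t, p_a, c=1.0):
    """Sum p(e, a | f) over all (l+1)^m alignments a."""
    m, l = len(e), len(f) - 1          # f[0] is the NULL word
    total = 0.0
    for a in product(range(l + 1), repeat=m):
        term = c
        for i in range(1, m + 1):
            j = a[i - 1]               # a(i): source position linked to e_i
            term *= t[(e[i - 1], f[j])] * p_a[(j, i, m, l)]
        total += term
    return total

def model2_factored(e, f, t, p_a, c=1.0):
    """Same quantity written as a product over i of sums over j."""
    m, l = len(e), len(f) - 1
    result = c
    for i in range(1, m + 1):
        result *= sum(t[(e[i - 1], f[j])] * p_a[(j, i, m, l)]
                      for j in range(l + 1))
    return result

# Toy parameters (hypothetical values, for illustration only)
f = ["NULL", "questa", "casa"]
e = ["this", "house"]
t = {("this", "NULL"): 0.5, ("house", "NULL"): 0.5,
     ("this", "questa"): 0.9, ("house", "questa"): 0.1,
     ("this", "casa"): 0.2, ("house", "casa"): 0.8}
p_a = {(j, i, 2, 2): 1 / 3 for i in (1, 2) for j in (0, 1, 2)}

brute = model2_brute_force(e, f, t, p_a)
fact = model2_factored(e, f, t, p_a)
print(brute, fact)   # the two values agree up to rounding
```

The brute-force version visits $(l+1)^m$ alignments, while the factored version needs only $m(l+1)$ terms, which is why the factored form is the one used in practice.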

8 IBM Model 2
We need to maximize $p(e \mid f)$ subject to the following constraints:
1. $\sum_{e} t(e \mid f) = 1$ for every source word $f$
2. $\sum_{j=0}^{l} p_a(j \mid i, m, l) = 1$ for $i = 1, \dots, m$
Thus we have a larger set of Lagrange multipliers.

9 IBM Model 2
The auxiliary function becomes:
$$h(t, p_a, \lambda, \mu) = c \sum_{a(1)=0}^{l} \cdots \sum_{a(m)=0}^{l} \prod_{i=1}^{m} t(e_i \mid f_{a(i)})\, p_a(a(i) \mid i, m, l) - \sum_{f} \lambda_f \Big( \sum_{e} t(e \mid f) - 1 \Big) - \sum_{i} \mu_{iml} \Big( \sum_{j} p_a(j \mid i, m, l) - 1 \Big)$$
To find the extremum we need to differentiate.

10 IBM Model 2
We now need a new count, $\mathrm{count}(j \mid i, m, l; f, e)$: the expected number of times the word in position $i$ of the TL string e is connected to the word in position $j$ of the SL string f, given that their lengths are $m$ and $l$ respectively.
$$\mathrm{count}(j \mid i, m, l; f, e) = \sum_{a} p(a \mid e, f)\, \delta(j, a(i))$$

11 IBM Model 2
So here we shall look at groups of sentence pairs that satisfy the m and l criterion. Then we look at the alignment probabilities.
Example: m = 4, l = 3
Source (f)            Target (e)                  Alignment i>a(i)
aami bari jachchhi    I am going home             1>1 2>0 3>3 4>2
tumi ki khachchho     What are you eating         1>2 2>0 3>1 4>3
tomar naam ki         What is your name           1>3 2>0 3>1 4>2
kaal kothay chhile    Where were you yesterday    1>2 2>3 3>0 4>1
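To make the grouping by (m, l) concrete, the four alignments of the example can be tallied directly. This is only a sketch: it treats the listed alignments as observed (hard counts), whereas the model itself uses expected counts; the variable names are illustrative assumptions.

```python
from collections import Counter, defaultdict

# alignments[s][i-1] = a(i) for English position i in sentence pair s
alignments = [
    [1, 0, 3, 2],  # I am going home           / aami bari jachchhi
    [2, 0, 1, 3],  # What are you eating       / tumi ki khachchho
    [3, 0, 1, 2],  # What is your name         / tomar naam ki
    [2, 3, 0, 1],  # Where were you yesterday  / kaal kothay chhile
]
m, l = 4, 3

# counts[i][j] = number of sentence pairs in which position i links to j
counts = defaultdict(Counter)
for a in alignments:
    for i, j in enumerate(a, start=1):
        counts[i][j] += 1

# Relative-frequency estimate of p_a(j | i, m=4, l=3)
p_a = {(j, i): counts[i][j] / len(alignments)
       for i in range(1, m + 1) for j in range(l + 1)}

for i in range(1, m + 1):
    print(i, [p_a[(j, i)] for j in range(l + 1)])
```

For instance, the second English word is linked to NULL (position 0) in three of the four pairs, so this estimate puts most of the mass of $p_a(\cdot \mid 2, 4, 3)$ on $j = 0$.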

12 IBM Model 2
Then by analogy with Model 1 we get:
$$p_a(j \mid i, m, l) = \mu_{iml}^{-1}\, \mathrm{count}(j \mid i, m, l; f, e)$$
for a single translation, and
$$p_a(j \mid i, m, l) = \mu_{iml}^{-1} \sum_{s=1}^{S} \mathrm{count}(j \mid i, m, l; f^{(s)}, e^{(s)})$$
for a set of translations.

13 IBM Model 2
Although the expression for count is apparently complicated, we can make it simple as in the case of Model 1:
$$\mathrm{count}(j \mid i, m, l; f, e) = \frac{t(e_i \mid f_j)\, p_a(j \mid i, m, l)}{t(e_i \mid f_0)\, p_a(0 \mid i, m, l) + \dots + t(e_i \mid f_l)\, p_a(l \mid i, m, l)}$$

14 IBM Model 2
One can now design an algorithm for Expectation Maximization as in the case of Model 1.
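One possible shape for such an EM loop is sketched below. It is a sketch under stated assumptions, not the lecture's algorithm: the function name `train_model2`, the uniform initialisation, and the toy corpus are all hypothetical, and in practice Model 2 is usually started from trained Model 1 parameters rather than from uniform values.

```python
from collections import defaultdict

def train_model2(corpus, iterations=10):
    """corpus: list of (e_words, f_words); NULL is prepended to f here."""
    pairs = [(e, ["NULL"] + f) for e, f in corpus]
    e_vocab = {w for e, _ in pairs for w in e}
    t = defaultdict(lambda: 1.0 / len(e_vocab))   # t[(e_word, f_word)]
    p_a = {}                                      # p_a[(j, i, m, l)]
    for e, f in pairs:
        m, l = len(e), len(f) - 1
        for i in range(1, m + 1):
            for j in range(l + 1):
                p_a[(j, i, m, l)] = 1.0 / (l + 1)

    for _ in range(iterations):
        t_count, t_total = defaultdict(float), defaultdict(float)
        a_count, a_total = defaultdict(float), defaultdict(float)
        # E-step: expected counts, using the simplified count formula
        for e, f in pairs:
            m, l = len(e), len(f) - 1
            for i in range(1, m + 1):
                ew = e[i - 1]
                denom = sum(t[(ew, f[j])] * p_a[(j, i, m, l)]
                            for j in range(l + 1))
                for j in range(l + 1):
                    post = t[(ew, f[j])] * p_a[(j, i, m, l)] / denom
                    t_count[(ew, f[j])] += post
                    t_total[f[j]] += post
                    a_count[(j, i, m, l)] += post
                    a_total[(i, m, l)] += post
        # M-step: renormalise so that both constraints hold
        t = defaultdict(float,
                        {k: v / t_total[k[1]] for k, v in t_count.items()})
        p_a = {k: v / a_total[k[1:]] for k, v in a_count.items()}
    return t, p_a

# Toy corpus (hypothetical)
corpus = [(["this", "house"], ["questa", "casa"]),
          (["this"], ["questa"]),
          (["house"], ["casa"])]
t, p_a = train_model2(corpus)
print(t[("this", "questa")], t[("house", "questa")])
```

The division by `t_total` and `a_total` plays the role of the Lagrange multipliers: it enforces that $t(\cdot \mid f)$ and $p_a(\cdot \mid i, m, l)$ each sum to one.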

15 IBM Model 3-5

16 Intermodel Interlude
Models 1 and 2 have been created on the basis of the following generalized principle:
$$p(e, a \mid f) = p(m \mid f) \prod_{i=1}^{m} p\big(a(i) \mid a(1), \dots, a(i-1), e_1, \dots, e_{i-1}, m, f\big)\; p\big(e_i \mid a(1), \dots, a(i), e_1, \dots, e_{i-1}, m, f\big)$$
Note that each $a(i)$ takes a value between 0 and $l$.
It works on the following:

17 Intermodel Interlude
It represents the joint likelihood of e and a as a product of conditional probabilities. Each product corresponds to a generative process for developing e and a from f.
- Choose the length of the translation e.
- Decide which position in f corresponds to $e_1$ and what the identity of $e_1$ is.
- Do the same for positions 2 to m.

18 IBM Model 3-5
Here we consider the fertility of a word in conjunction with the word model. How many e-words a single f-word will produce is NOT deterministic.
E.g. Nonostante (It) >> despite, even though, in spite of (En)
Consequently, corresponding to each word f of the source we get a random variable $\phi_f$ which gives the fertility of f, including 0.
In all Models 3-5 fertility is explicitly modeled.

19 IBM Model 3-5
Example: Dovete andare il giorno dopo (It)
may have many possible English translations:
- You must go next day
- You have to go the following day
- You ought to be there on the following day
- You will have to go there on the following day
- I ask you to go there on the following day
Can we now get the alignment? Look at the fertility of the source words.

20 IBM Model 3-5
Fertility:
1) We can assume that the fertility of each word is governed by a probability distribution $p(n \mid f)$.
2) It deals explicitly with dropping of input words by putting $n = 0$.
3) Similarly we can control words in the TL sentence that have NO equivalent in f, calling them NULL words.
Thus Models 3-5 are generative processes: given an f-string, we first decide the fertility of each word and a list of e-words to connect to it. This list is called a Tablet.

21 IBM Model 3-5
Definitions:
Tablet: Given an f word, the list of words that may connect to it.
Tableau: A collection of Tablets.
Notation:
$T$: tableau for f.
$T_j$: tablet for the $j$-th f-word.
$T_{jk}$: $k$-th e-word in the $j$-th tablet $T_j$.

22 IBM Model 3-5
Example: Come ti chiami (It)
Tableau:
Come                   ti                      chiami
Tablet 1 ($T_1$)       Tablet 2 ($T_2$)        Tablet 3 ($T_3$)
$T_{11}$ = Like        $T_{21}$ = you          $T_{31}$ = call
$T_{12}$ = What        $T_{22}$ = yourself     $T_{32}$ = Address
$T_{13}$ = As          $T_{23}$ = thyself
$T_{14}$ = How

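23 IBM Model 3-5
The generation is as per the following formula, where $x_1^{k}$ abbreviates $x_1, \dots, x_k$:
$$p(\tau, \pi \mid f) = \prod_{j=1}^{l} p(\phi_j \mid \phi_1^{j-1}, f)\; p(\phi_0 \mid \phi_1^{l}, f) \times \prod_{j=0}^{l} \prod_{k=1}^{\phi_j} p(\tau_{jk} \mid \tau_{j1}^{k-1}, \tau_0^{j-1}, \phi_0^{l}, f) \times \prod_{j=1}^{l} \prod_{k=1}^{\phi_j} p(\pi_{jk} \mid \pi_{j1}^{k-1}, \pi_1^{j-1}, \tau_0^{l}, \phi_0^{l}, f) \times \prod_{k=1}^{\phi_0} p(\pi_{0k} \mid \pi_{01}^{k-1}, \pi_1^{l}, \tau_0^{l}, \phi_0^{l}, f)$$

24 IBM Model 3-5
After choosing the Tableau, the words are permuted to generate e. This permutation is a random variable $\pi$. The position in e of the $k$-th word of the $j$-th Tablet is called $\pi_{jk}$.
In these models the generative process is expressed as a joint likelihood for a tableau $\tau$ and a permutation $\pi$ in the following way: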

25 IBM Model 3-5
First Step: Compute
$$\prod_{j=1}^{l} p(\phi_j \mid \phi_1^{j-1}, f)\; p(\phi_0 \mid \phi_1^{l}, f)$$
- Determine the number of tokens that $f_j$ will produce.
- This will depend upon the no. of words produced by $f_1, \dots, f_{j-1}$.
- Determine $\phi_0$, the number of words generated out of NULL.

26 IBM Model 3-5
Second Step: Compute
$$\prod_{j=0}^{l} \prod_{k=1}^{\phi_j} p(\tau_{jk} \mid \tau_{j1}^{k-1}, \tau_0^{j-1}, \phi_0^{l}, f)$$
- Determine $\tau_{jk}$, the $k$-th word produced by $f_j$.
- This depends on all the words produced by $f_1, \dots, f_{j-1}$ and all the words produced so far by $f_j$.

27 IBM Model 3-5
Third Step: Compute
$$\prod_{j=1}^{l} \prod_{k=1}^{\phi_j} p(\pi_{jk} \mid \pi_{j1}^{k-1}, \pi_1^{j-1}, \tau_0^{l}, \phi_0^{l}, f)$$
- Determine $\pi_{jk}$, the position in e of the $k$-th word produced by $f_j$.
- This depends on the positions of all the words produced so far.

28 IBM Model 3-5
Fourth Step: Compute
$$\prod_{k=1}^{\phi_0} p(\pi_{0k} \mid \pi_{01}^{k-1}, \pi_1^{l}, \tau_0^{l}, \phi_0^{l}, f)$$
- Determine $\pi_{0k}$, the position in e of the $k$-th word produced by NULL.
- This depends on the positions of all the words produced so far.
Thus the final expression is a product of 4 expressions:

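29 IBM Model 3-5
$$p(\tau, \pi \mid f) = \prod_{j=1}^{l} p(\phi_j \mid \phi_1^{j-1}, f)\; p(\phi_0 \mid \phi_1^{l}, f) \times \prod_{j=0}^{l} \prod_{k=1}^{\phi_j} p(\tau_{jk} \mid \tau_{j1}^{k-1}, \tau_0^{j-1}, \phi_0^{l}, f) \times \prod_{j=1}^{l} \prod_{k=1}^{\phi_j} p(\pi_{jk} \mid \pi_{j1}^{k-1}, \pi_1^{j-1}, \tau_0^{l}, \phi_0^{l}, f) \times \prod_{k=1}^{\phi_0} p(\pi_{0k} \mid \pi_{01}^{k-1}, \pi_1^{l}, \tau_0^{l}, \phi_0^{l}, f)$$

30 IBM Model 3-5
This obviously is very difficult to manipulate. Hence concessions are made for different models. The concessions come in the form of assumptions.
Let us first look at IBM Model 3.

31 IBM Model 3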

32 IBM Model 3
Assumptions:
1. For $j$ between 1 and $l$, $p(\phi_j \mid \phi_1^{j-1}, f)$ depends only on $\phi_j$ and $f_j$.
2. For $j$ between 0 and $l$, $p(\tau_{jk} \mid \tau_{j1}^{k-1}, \tau_0^{j-1}, \phi_0^{l}, f)$ depends only on $\tau_{jk}$ and $f_j$.
3. For $j$ between 1 and $l$, $p(\pi_{jk} \mid \pi_{j1}^{k-1}, \pi_1^{j-1}, \tau_0^{l}, \phi_0^{l}, f)$ depends only on $\pi_{jk}$, $j$, $m$ and $l$.
This reduces the number of variables.

33 IBM Model 3
Thus the parameters for Model 3 are:
1. A set of fertility probabilities $\eta(\phi \mid f_j)$, equal to $p(\phi_j = \phi \mid \phi_1^{j-1}, f)$
2. A set of translation probabilities $t(e \mid f_j)$, equal to $p(\tau_{jk} = e \mid \tau_{j1}^{k-1}, \tau_0^{j-1}, \phi_0^{l}, f)$
3. A set of distortion probabilities $d(i \mid j, m, l)$, equal to $p(\pi_{jk} = i \mid \pi_{j1}^{k-1}, \pi_1^{j-1}, \tau_0^{l}, \phi_0^{l}, f)$

34 IBM Model 3
The distortion and fertility probabilities for $f_0$ (NULL) are treated in a different way:
These are meant for handling the words in the TL sentence which cannot be accounted for otherwise. Obviously they are plugged in once all the words are placed for $j = 1, \dots, l$.
So $\phi_1 + \phi_2 + \dots + \phi_l = m - \phi_0$.
We have to estimate these probabilities.

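35 IBM Model 3
It is assumed that each of the Tableau words can produce at most one NULL word.
Assume each Tableau word produces a NULL word with prob. $p_1$ and does not produce one with prob. $p_0$.
Hence
$$p(\phi_0 \mid \phi_1^{l}, f) = \binom{\phi_1 + \phi_2 + \dots + \phi_l}{\phi_0}\, p_0^{\phi_1 + \phi_2 + \dots + \phi_l - \phi_0}\, p_1^{\phi_0} = \binom{m - \phi_0}{\phi_0}\, p_0^{m - 2\phi_0}\, p_1^{\phi_0}$$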
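The NULL fertility term is a binomial probability, which is easy to evaluate directly. A minimal sketch follows; the function name and the value $p_1 = 0.2$ are assumptions chosen for illustration.

```python
from math import comb

def null_fertility_prob(phi0, m, p1):
    """p(phi_0 | phi_1..phi_l), using phi_1 + ... + phi_l = m - phi_0."""
    p0 = 1.0 - p1
    n = m - phi0                 # words produced by the real source words
    if phi0 < 0 or phi0 > n:     # more NULL words than real words: impossible
        return 0.0
    # n - phi0 equals m - 2*phi0, the exponent of p_0 in the formula
    return comb(n, phi0) * p0 ** (n - phi0) * p1 ** phi0

# With n = 3 real words fixed, phi_0 can be 0..3 and m = n + phi_0
probs = [null_fertility_prob(phi0, 3 + phi0, 0.2) for phi0 in range(4)]
print(probs, sum(probs))
```

For a fixed number of real words the probabilities over all admissible $\phi_0$ sum to one, as expected of a binomial distribution.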

36 IBM Model 3
As with Models 1 and 2, an alignment of (e | f) is determined by specifying $a(i)$ for each position of the TL string.
The fertilities $\phi_j$, $j = 0, \dots, l$, are functions of the $a(i)$'s: $\phi_j$ is equal to the number of $i$'s such that $a(i) = j$.
Hence $P(e \mid f)$ can be obtained by summing over all the alignments:
$$P(e \mid f) = \sum_{a(1)=0}^{l} \cdots \sum_{a(m)=0}^{l} p(e, a \mid f)$$
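Reading the fertilities off an alignment amounts to counting how often each source position occurs. A small sketch, with a hypothetical alignment chosen for illustration:

```python
from collections import Counter

def fertilities(a, l):
    """phi[j] = number of target positions i with a(i) = j, for j = 0..l."""
    counts = Counter(a)
    return [counts[j] for j in range(l + 1)]

# Hypothetical alignment for m = 4, l = 3: a(1)=2, a(2)=2, a(3)=0, a(4)=1
a = [2, 2, 0, 1]
phi = fertilities(a, 3)
print(phi)   # fertility of NULL, f_1, f_2, f_3
```

Here the second source word has fertility 2, the third source word is dropped (fertility 0), and one target word comes from NULL; the fertilities always add up to $m$.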

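37 IBM Model 3
$$P(e \mid f) = \sum_{a(1)=0}^{l} \cdots \sum_{a(m)=0}^{l} \binom{m - \phi_0}{\phi_0}\, p_0^{m - 2\phi_0}\, p_1^{\phi_0} \prod_{j=1}^{l} \phi_j!\; \eta(\phi_j \mid f_j) \times \prod_{i=1}^{m} t(e_i \mid f_{a(i)})\, d(i \mid a(i), m, l)$$
with the following constraints:
1. $\sum_{e} t(e \mid f) = 1$
2. $\sum_{i} d(i \mid j, m, l) = 1$
3. $\sum_{\phi} \eta(\phi \mid f) = 1$
4. $p_0 + p_1 = 1$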
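The term inside the sum can be evaluated for a single alignment as sketched below. All parameter tables and values here are hypothetical. One assumption goes beyond the formula as written: words aligned to NULL contribute no distortion factor (their $d$ is taken as 1), since NULL positions are handled separately in this model.

```python
from collections import Counter
from math import comb, factorial

def model3_joint(e, f, a, t, eta, d, p1):
    """p(e, a | f) for one alignment; f[0] is NULL and a[i-1] = a(i)."""
    m, l = len(e), len(f) - 1
    phi = Counter(a)                     # fertilities read off the alignment
    phi0 = phi[0]
    if 2 * phi0 > m:                     # more NULL words than real words
        return 0.0
    p0 = 1.0 - p1
    prob = comb(m - phi0, phi0) * p0 ** (m - 2 * phi0) * p1 ** phi0
    for j in range(1, l + 1):            # fertility terms
        prob *= factorial(phi[j]) * eta[(phi[j], f[j])]
    for i in range(1, m + 1):            # translation and distortion terms
        j = a[i - 1]
        prob *= t[(e[i - 1], f[j])]
        if j != 0:                       # assumption: no distortion for NULL
            prob *= d[(i, j, m, l)]
    return prob

# Toy parameters (hypothetical values)
f = ["NULL", "questa", "casa"]
e = ["this", "house"]
a = [1, 2]
t = {("this", "questa"): 0.9, ("house", "casa"): 0.8}
eta = {(1, "questa"): 0.7, (1, "casa"): 0.6}
d = {(1, 1, 2, 2): 0.75, (2, 2, 2, 2): 0.5}

value = model3_joint(e, f, a, t, eta, d, p1=0.1)
print(value)
```

Summing this quantity over all $(l+1)^m$ alignments gives $P(e \mid f)$; unlike Model 2, the sum does not factorise, because the fertilities couple the alignment positions.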

38 IBM Model 3
Remarks:
1. Here also we have an exponential number of alignments.
2. The cost of count collection is too high even for sentences of moderate length.
3. Sampling is used from the space of possible alignments.
4. Sampling should be such that the most probable ones are included.

39 IBM Model 3
Remarks:
5. Still, this is much harder than for Model 1.
6. Hence hill-climbing type heuristics are used.
7. Typically they start from the Model 1 solution.
8. From there they go to neighboring alignments, where the distance between two alignments is measured by the no. of points at which they differ.
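The hill-climbing idea can be sketched as follows. This is a simplified illustration: neighbours are restricted to alignments at distance 1 (one link changed), and `toy_score` is a hypothetical stand-in for the model score $p(e, a \mid f)$, not a real model.

```python
def distance(a, b):
    """Number of positions i at which two alignments differ."""
    return sum(x != y for x, y in zip(a, b))

def neighbours(a, l):
    """All alignments at distance 1 from a: re-link one target position."""
    for i in range(len(a)):
        for j in range(l + 1):
            if j != a[i]:
                yield a[:i] + (j,) + a[i + 1:]

def hill_climb(start, l, score):
    """Move to the best neighbour until no neighbour scores higher."""
    current = tuple(start)
    best = score(current)
    while True:
        candidate = max(neighbours(current, l), key=score)
        if score(candidate) <= best:
            return current
        current, best = candidate, score(candidate)

# Hypothetical score: closeness to a fixed "best" alignment
target = (1, 0, 3, 2)
toy_score = lambda a: -distance(a, target)

result = hill_climb((0, 0, 0, 0), l=3, score=toy_score)
print(result)
```

With a real model score the climb may stop at a local maximum, which is why the starting alignment matters; counts are then collected from the alignments visited around the end point rather than from the whole space.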

40 IBM Model 4

41 IBM Model 4
Model 3 has been found to be a very powerful one. It takes care of all the major aspects:
- word translation
- reordering
- insertion of words
- dropping of words
- one-to-many translation
But it has one major shortcoming: the formulation of the distortion probabilities $d(i \mid j, m, l)$.

42 IBM Model 4
Model 3 does not take into account the following fact: often a group of words is translated together, and therefore when they move they move together.
E.g. Riding a bicycle >> in sella a una bicicletta
Coming here riding a bicycle is dangerous >> venire qui in sella a una bicicletta è pericoloso
Trying many sentences with the phrase "riding a bicycle", one can notice that the phrase "in sella a una bicicletta" will remain together. But Model 3 considers the distortion probabilities in isolation.

43 IBM Model 4
Model 4 introduces the concept of Relative Distortion.
It assumes that the placement of the translation of an input word is based on the placement of the preceding input word.
It is however difficult to conceptualize, as words are being added, dropped, and converted from one to many.
Model 4 is based around the concept of a cept.

44 IBM Model 4
Definition: each input word that is aligned to at least one output word is called a cept. Typically represented by $[j]$ or $\pi_j$.
Definition: the ceiling of the average of the output positions of a cept's words is called the center of the cept. We shall denote it as $C_j$.
For each output word the Relative Distortion is defined with the help of cepts.
Let us first see an example:

45 IBM Model 4
Consider the following Bengali-English pair:
φ lambaa moto chhele cycle-e chore aaschhe
A tall boy is coming riding a bicycle
Note: The leading article does not align with anything! "moto", an ornamentation, is not a cept.

46 IBM Model 4
cept                     π1       π2         π3          π4       π5
Foreign word position    1        3          4           5        6
Foreign word             lambaa   chheleta   cycle-e     chore    aaschhe
English word             tall     boy        a bicycle   riding   is coming
English word position    2        3          7, 8        6        4, 5
Center of cept           2        3          8           6        5

47 IBM Model 4
Relative Distortion:
- Words generated by φ (NULL) are uniformly distributed.
- The position of the first word of a cept is defined w.r.t. the centre of the previous cept: $d_1(i - C_{j-1})$, where $i$ is the English position of the first word of cept $j$.
Consider for example the word "riding": it is generated by cept 4 (π4) and its English position is 6. The centre of the preceding cept is 8. Thus there is a distortion of -2. This shows a forward movement of the word. Normally the distortion will be +1.

48 IBM Model 4
Relative Distortion:
- For subsequent words of a cept the position is defined w.r.t. the position of the previous word of the same cept: $d_{>1}(i - \pi_{j,k-1})$, where $\pi_{j,k}$ refers to the position of the $k$-th word of the $j$-th cept.
For example, in "a bicycle" and "is coming" the distortion probability of the second word is calculated in relation to the previous word.
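The centres and relative distortions for the Bengali-English example can be computed mechanically. The sketch below assumes the English positions implied by the sentence "A tall boy is coming riding a bicycle", and it assumes a centre of 0 before the first cept; both the data layout and the function names are illustrative.

```python
from math import ceil

# Cepts in source order, each with the English positions of its words
cepts = [
    ("lambaa",   [2]),       # tall
    ("chheleta", [3]),       # boy
    ("cycle-e",  [7, 8]),    # a bicycle
    ("chore",    [6]),       # riding
    ("aaschhe",  [4, 5]),    # is coming
]

def center(positions):
    """Ceiling of the average English position of a cept's words."""
    return ceil(sum(positions) / len(positions))

def relative_distortions(cepts):
    """List of (source word, English position, distribution, offset)."""
    out = []
    prev_center = 0          # assumption: centre before the first cept is 0
    for word, positions in cepts:
        # first word of the cept: relative to the previous cept's centre
        out.append((word, positions[0], "d_1", positions[0] - prev_center))
        # later words: relative to the previous word of the same cept
        for prev, cur in zip(positions, positions[1:]):
            out.append((word, cur, "d_>1", cur - prev))
        prev_center = center(positions)
    return out

centers = [center(p) for _, p in cepts]
distortions = relative_distortions(cepts)
for row in distortions:
    print(row)
```

The row for "chore" (riding) carries the offset 6 - 8 = -2 discussed above, and the second words of "a bicycle" and "is coming" both get the offset +1.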

49 IBM Model 5

50 IBM Model 5
The key term for Model 5 is Deficiency.
Models 3 & 4 do not take care of whether two words are being put in the same place. Thus they put positive probabilities on some impossible translations.
In Model 5 the distortion probabilities are calculated by considering cepts (as before) plus vacancies.
Also it takes care of the problem of multiple tableaus.
This makes it a better word-based model.

51 IBM Model 5
Model 5 keeps track of the vacancies in the m-word-long e sentence. Let
- $v_{max}$ be the maximum no. of vacancies possible.
- $v_i$ be the no. of vacancies available in the sentence e in the positions $[1, i]$.
Hence the distortion probabilities are functions of 3 quantities:
$d_1(v_i, C_{j-1}, v_{max})$
Similarly the relative distortions of the subsequent words in the cept are:
$d_{>1}(v_i, v_{\pi_{j,k-1}}, v_{max})$
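Vacancy counts are simple running totals over the output positions. A minimal sketch, assuming the 8-word English sentence of the Model 4 example with the words for "tall", "boy" and "a bicycle" already placed; taking $v_{max}$ as the number of vacancies in the whole sentence is an assumption made for this illustration.

```python
def vacancies(m, filled):
    """v[i] = number of still-empty positions among 1..i (v[0] = 0)."""
    v = [0] * (m + 1)
    for i in range(1, m + 1):
        v[i] = v[i - 1] + (0 if i in filled else 1)
    return v

m = 8
filled = {2, 3, 7, 8}        # positions already taken by earlier cepts
v = vacancies(m, filled)
v_max = v[m]                 # assumption: all vacancies left in the sentence
print(v, v_max)
```

Because the distortion is expressed over vacancies rather than raw positions, a word can only be placed in a position that is still empty, which removes the deficiency of Models 3 and 4.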

52 Conclusion
Still we go by word-based translation. Can we do better? Because looking at translations word by word is not the best thing.
E.g.
The train is in.
The train is in motion.
The train is in station.
The train is in danger.
Proper translation demands that we see the word along with its context.
This gives us the concept of Phrase-based Translation.

53 Thank You
