Bitext Alignment for Statistical Machine Translation
1 Bitext Alignment for Statistical Machine Translation
Yonggang Deng
Advisor: Prof. William Byrne
Thesis Committee: Prof. William Byrne, Prof. Trac Tran, Prof. Jerry Prince, and Prof. Gerard Meyer
Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD
December 2005
Y. Deng (Johns Hopkins) Bitext Alignment for SMT 1 / 42
2 Bitext and Bitext Alignment
Bitext: a collection of text in two languages.
Bitext alignment: finding translation equivalence within bitext.
Example (the Chinese source text is garbled in this transcription), English side: "It is necessary to resolutely remove obstacles in rivers and lakes. 4. It is necessary to strengthen monitoring and forecast work and scientifically dispatch people and materials. It is necessary to take effective measures and try by every possible means to provide precision forecast. Before the flood season comes, it is necessary to seize the time to formulate plans for forecasting floods and to carry out work with clear..."
5 Why automatic bitext alignment?
Critical and beneficial in many multilingual NLP tasks: it provides the basic ingredients for building a machine translation system, and hand alignment is expensive for large corpora.
Desired properties:
- language independent: Chinese, Arabic, Spanish, French, ...
- no linguistic knowledge: from scratch, unsupervised, statistical
- huge amounts of data: effectiveness and efficiency
6 Statistical Machine Translation (SMT)
Source-channel model: a target-language sentence E passes through a noisy channel P(F|E) to produce the observed foreign sentence F; decoding inverts the channel:
Ê = argmax_E P(E) P(F|E)
The translation model P(F|E) needs BITEXTs. (The Chinese/English example on this slide is garbled in the transcription.)
7 Outline
1. Bitext Chunk Alignment: Chunk Alignment, Chunking Algorithms, Sentence Alignment Results
2. Bitext Word Alignment: Introduction/Motivation, Word-to-Phrase HMM Model, Parameter Estimation, Word Alignment Results
3. Bitext Phrase Alignment: Inducing from Word Alignments, Model-based Phrase Pair Posterior, Translation Results
4. Conclusions
8 Outline (section marker): Bitext Chunk Alignment, Model
13 Chunk Alignment: Bitext Chunk Alignment Model
Problem: sentences are not translated 1-to-1 in sequence; 1-to-n, n-to-1, m-to-n, and order changes make real data a challenge.
A statistical generative chunk alignment model (Deng et al., 04):
- introduce a hidden chunk alignment variable
- document generating: fill in the blank
- two alignment algorithms are derived in a straightforward manner
Generative story: given an English document e = e_1^m with marked sentence boundaries (in the example, 5 sentences spanning words 1-8, 9-20, 21-30, 31-38, 39-50), choose the number of foreign sentences n with probability α(n|m), the number of chunks K with β(K|m,n) (K = 3 in the example, with chunk tuples a_1 = (1,2,1,1), a_2 = (3,4,2,3), a_3 = (5,5,4,4)), the chunk alignment a_1^K with P(a_1^K|m,n,K), and translate each aligned chunk:
P(f_1^n, a_1^K | e_1^m) = α(n|m) β(K|m,n) P(a_1^K|m,n,K) ∏_{k=1}^K P(f(a_k) | e(a_k))
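The factored likelihood above is just a product of table lookups once the component distributions are estimated; a minimal sketch, with all tables and chunk labels invented for illustration:

```python
def chunk_alignment_likelihood(m, n, alignment, alpha, beta, p_align, chunk_prob):
    """P(f_1^n, a_1^K | e_1^m) = alpha(n|m) * beta(K|m,n) * P(a_1^K|m,n,K)
    * prod_k P(f(a_k) | e(a_k)). Every table here is a hypothetical
    stand-in for the estimated distributions."""
    K = len(alignment)
    p = alpha.get((n, m), 0.0) * beta.get((K, m, n), 0.0)
    p *= p_align.get(tuple(alignment), 0.0)
    for chunk_pair in alignment:            # P(f(a_k) | e(a_k)) per chunk
        p *= chunk_prob.get(chunk_pair, 0.0)
    return p

# Toy example: 3 English sentences, 2 foreign sentences, K = 2 chunk pairs.
alignment = [("e1", "f1"), ("e2 e3", "f2")]
p = chunk_alignment_likelihood(
    m=3, n=2, alignment=alignment,
    alpha={(2, 3): 0.5},                    # alpha(n=2 | m=3)
    beta={(2, 3, 2): 0.4},                  # beta(K=2 | m=3, n=2)
    p_align={tuple(alignment): 0.5},        # P(a_1^K | m, n, K)
    chunk_prob={("e1", "f1"): 0.6, ("e2 e3", "f2"): 0.5})
```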
14 Outline (section marker): Bitext Chunk Alignment, Algorithms
15 Bitext Chunk Alignment Algorithms: Dynamic Programming (DP)
Monotone chunk alignment; guarantees the global optimum. Each DP step scores a candidate chunk pair over sentences e1..e5 and f1..f4, e.g. p(1,2)p(f1|e1,e2), p(2,2)p(f2,f3|e3,e4), p(1,1)p(f5|e5).
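The monotone DP above can be sketched in a few lines; the chunk-pair scoring function and the maximum chunk sizes are hypothetical placeholders for the model's chunk translation probabilities.

```python
def dp_align(m, n, score, max_x=2, max_y=2):
    """Monotone chunk alignment by dynamic programming.
    score(x, y, i, j) = log-prob of aligning the x source sentences ending
    at i to the y target sentences ending at j; moves range over
    (x, y) in {0..max_x} x {0..max_y}, excluding (0, 0).
    Returns the globally optimal log-score and the (x, y) move sequence."""
    NEG = float("-inf")
    best = [[NEG] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if best[i][j] == NEG:
                continue
            for x in range(max_x + 1):
                for y in range(max_y + 1):
                    if (x, y) == (0, 0) or i + x > m or j + y > n:
                        continue
                    s = best[i][j] + score(x, y, i + x, j + y)
                    if s > best[i + x][j + y]:
                        best[i + x][j + y] = s
                        back[i + x][j + y] = (x, y)
    moves, i, j = [], m, n                 # backtrace the optimal path
    while (i, j) != (0, 0):
        x, y = back[i][j]
        moves.append((x, y))
        i, j = i - x, j - y
    return best[m][n], moves[::-1]

# Toy scores: 1-1 chunk pairs are free, everything else costs 1.
def toy_score(x, y, i, j):
    return 0.0 if (x, y) == (1, 1) else -1.0

value, moves = dp_align(3, 3, toy_score)
```

Under the toy scores the optimum is three 1-1 moves, mirroring how the real DP trades 1-1 links against merges and deletions via the chunk-pair probabilities.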
16 Bitext Chunk Alignment Algorithms: Divisive Clustering (DC)
Divide and conquer: iterative binary parallel splitting, with reorder. (The Chinese side of this example is garbled in the transcription.) English side: Since the Korean Peninsula was split into two countries, the Republic of Korea has, while leaning its back on the " big tree " of the United States for security, carefully and consistently sought advanced weapons from the United States in a bid to confront the Democratic People 's Republic of Korea. An informed source in Seoul revealed to the Washington Post that the United States had secretly agreed to the request of South Korea earlier this year to " extend its existing missile range " to strike Pyongyang direct. This should have elated South Korea. But since the situation surrounding the peninsula has changed dramatically and the two heads of state of the two Koreas have met with each other and signed a joint statement, what should South Korea do now? It has no choice but spit back the " greasy meat " from its mouth and put the " missile expansion plan " on the back burner. A knowledgeable South Korean speaks the truth : " Because of the summit meeting, we have shelved our own missile plan. If we go ahead with it, it will spoil the excellent situation opened up by the summit meeting. "
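The binary splitting with reorder described above can be sketched as a recursion: try every split-point pair in monotone and reversed order, split when it beats keeping the chunk pair whole, and recurse. The scoring function below is a hypothetical stand-in for the model likelihood, and the toy tokens are invented.

```python
def divisive_cluster(e, f, score, min_len=1):
    """Iterative binary parallel splitting with reorder. `score` scores a
    chunk pair (hypothetical stand-in for the chunk translation model)."""
    best_score, best_split = score(e, f), None
    for i in range(min_len, len(e) - min_len + 1):
        for j in range(min_len, len(f) - min_len + 1):
            mono = score(e[:i], f[:j]) + score(e[i:], f[j:])
            if mono > best_score:
                best_score, best_split = mono, (i, j, False)
            swapped = score(e[:i], f[j:]) + score(e[i:], f[:j])  # reorder
            if swapped > best_score:
                best_score, best_split = swapped, (i, j, True)
    if best_split is None:
        return [(e, f)]                    # no split improves: stop here
    i, j, rev = best_split
    left_f, right_f = (f[j:], f[:j]) if rev else (f[:j], f[j:])
    return (divisive_cluster(e[:i], left_f, score) +
            divisive_cluster(e[i:], right_f, score))

# Toy score: matching (lowercase, UPPERCASE) token pairs reward,
# all other cross-chunk pairs penalize.
def toy_score(ec, fc):
    hits = sum(1 for x in ec for y in fc if y == x.upper())
    return hits - (len(ec) * len(fc) - hits)

# The foreign side is in reversed order, so the reordering split wins.
chunks = divisive_cluster(["a", "b"], ["B", "A"], toy_score)
```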
20 Bitext Chunk Alignment Algorithms: A hierarchical chunking scheme (DP+DC)
DP at the sentence level followed by DC at the sub-sentence level: from coarse to fine, deriving short chunk pairs.
Advantages:
- significantly reduces machine training time: 21 hrs vs. 8 hrs
- makes most of the bitext usable for machine training: 78% vs. 98%
- "There is no data like more data" (Robert Mercer, 1988)
- improves system performance through higher coverage
21 Outline (section marker): Bitext Chunk Alignment, Results
22 Bitext Chunk Alignment Results: Unsupervised Sentence Alignment
122 Chinese/English document pairs selected from the FBIS corpus, sentence-aligned by humans: 2,200 sentence pairs. Unsupervised, from scratch; measured by precision/recall.
(Figure: precision and recall at each iteration for systems A: DP w/ uniform P(x,y); B: DP w/ biased P(x,y); C: DC; D: DP+DC w/ uniform P(x,y); E: DP+DC w/ biased P(x,y); F: DP+DC w/ uniform P(x,y) seeded from E Ite. 0; G: DC seeded from E (iteration garbled in transcription). Curve values not recoverable.)
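The precision/recall evaluation against the human sentence alignment reduces to set intersection over hypothesized links; a minimal sketch with invented sentence-pair IDs:

```python
def precision_recall(hyp_pairs, ref_pairs):
    """Precision/recall of hypothesized sentence-pair links against a
    human reference, both given as sets of (src_id, tgt_id) tuples."""
    hyp, ref = set(hyp_pairs), set(ref_pairs)
    correct = len(hyp & ref)
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall

# Two of three hypothesized links agree with the reference.
p, r = precision_recall({(0, 0), (1, 1), (2, 3)},
                        {(0, 0), (1, 1), (2, 2)})
```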
23 Outline (section marker): Bitext Word Alignment, Introduction/Motivation
24 Bitext Word Alignment Introduction/Motivation: Word Alignment
A fundamental problem in machine translation, and the basis for phrase/syntax models.
Model relations from source s = s_1^I to target t = t_1^J.
Word alignment a = a_1^J: s_{a_j} → t_j, j = 1, 2, ..., J (hidden random variable).
Conditional likelihood P(t, a | s): complete data.
Sentence translation P(t | s) = Σ_a P(t, a | s): incomplete data.
Example: s_0 = NULL plus four Chinese source words (garbled in the transcription) aligned to "china s accession to the world trade organization at an early date" (t_1 ... t_12).
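The complete/incomplete data distinction above can be made concrete by brute force: sum P(t, a | s) over every alignment. The sketch below uses a toy Model-1-style setup (uniform alignment prior, invented t-table entries), not the thesis model.

```python
from itertools import product

def joint(t, s, a, ttable):
    """Complete data P(t, a | s): uniform 1/I alignment prior per target
    word times word translation probabilities (toy tables)."""
    p = 1.0
    for j, i in enumerate(a):
        p *= ttable.get((s[i], t[j]), 0.0) / len(s)
    return p

def marginal(t, s, ttable):
    """Incomplete data P(t | s) = sum over all alignments a of P(t, a | s)."""
    return sum(joint(t, s, a, ttable)
               for a in product(range(len(s)), repeat=len(t)))

s = ["NULL", "zhongguo"]                   # s_0 = NULL plus one source word
t = ["china"]
ttable = {("zhongguo", "china"): 0.8, ("NULL", "china"): 0.1}
p_t = marginal(t, s, ttable)               # (0.1 + 0.8) / 2
```

Enumerating alignments is exponential in J in general; tractable models (Model-1, the HMMs below) factor the sum so it can be computed by dynamic programming instead.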
25 Bitext Word Alignment Introduction/Motivation: State of the Art
IBM Model-4 alignments generated by the GIZA++ toolkit (Och & Ney, 03): the state of the art, especially on large bitexts. But:
- exact EM is intractable, so sub-optimal estimation algorithms are used
- statistics are difficult to compute under the model
- applications are limited to the word alignments alone
GOAL: improve word alignments of bitexts for better translation
- comparable performance to Model-4
- fast, efficient training with controlled memory usage
- use the model, not just the alignments
26 Bitext Word Alignment Introduction/Motivation: IBM Model-4 Word Alignments (Brown et al., 93)
Generative story (source words s_0 = NULL, s_1 ... s_4; the Chinese words are garbled in the transcription):
1. Create a tablet for each source word.
2. Table lookup to decide fertility: the number of target words connected to each source word.
3. Sample target words from the translation table i.i.d. (in the example: "the", "china s", "at an early date", "accession to", "world trade organization").
4. A distortion model arranges them into the target order: "china s accession to the world trade organization at an early date" (t_1 ... t_12).
What makes the model powerful also makes computation complex. Typical training procedure: Model-1, HMM, Model-4. Can we do something to the HMM?
31 Bitext Word Alignment Introduction/Motivation: HMM Word-to-Word Model (Vogel et al., 96; Och & Ney, 03)
Source words s_1 ... s_4 are the HMM states; state sequences are word-to-word alignments. Words are generated one by one: each transition emits one target word (t_1 ... t_12: "china s accession to the world trade organization at an early date").
34 Bitext Word Alignment Introduction/Motivation: Make the HMM More Powerful in Generating Observations
Target phrases rather than words are emitted after jumping into a state; state sequences are word-to-phrase alignments. This is the Word-to-Phrase (WtoP) HMM (Deng & Byrne, 05).
36 Outline (section marker): Bitext Word Alignment, Word-to-Phrase HMM Model
37 Bitext Word Alignment WtoP HMM Model: Word-to-Phrase HMM Alignment Models
The target sentence is segmented into K phrases: phrase length sequence φ_1^K, t = v_1^K; phrase alignment sequence a_1^K.
NULL: h_1^K is a Bernoulli process with d(h_k = 1) = 1 - p_0 and d(h_k = 0) = p_0; when h_k = 1, s_{a_k} emits v_k, and when h_k = 0, NULL emits v_k.
Hidden random variable: word-to-phrase alignment a = (K, a_1^K, φ_1^K, h_1^K).
P(t, a | s) = P(v_1^K, K, a_1^K, h_1^K, φ_1^K | s)
= P(K | J, s) P(a_1^K, φ_1^K, h_1^K | K, J, s) P(v_1^K | a_1^K, h_1^K, φ_1^K, K, J, s)
where:
P(K | J, I) ∝ η^K (phrase count)
P(a_1^K, φ_1^K, h_1^K | K, J, s) = ∏_{k=1}^K p(a_k | a_{k-1}, h_k; I) d(h_k) n(φ_k; h_k s_{a_k}) (Markov transition, NULL process, phrase length)
P(v_1^K | a_1^K, h_1^K, φ_1^K, K, J, s) = ∏_{k=1}^K P(v_k | h_k s_{a_k}, φ_k) (word-to-phrase translation)
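For a single fixed word-to-phrase alignment, the WtoP decomposition above is a product of table lookups; a minimal sketch in which every parameter table (names and numbers) is a hypothetical stand-in for the estimated distributions:

```python
def wtop_joint_prob(phrases, states, hs, params):
    """P(t, a | s) for one word-to-phrase alignment a = (K, a_1^K, phi_1^K, h_1^K):
    eta^K * prod_k p(a_k | a_{k-1}, h_k) * d(h_k) * n(phi_k; s_{a_k})
                  * P(v_k | h_k s_{a_k}, phi_k)."""
    p = params["eta"] ** len(phrases)              # P(K | J, I) ∝ eta^K
    prev = None
    for v, a, h in zip(phrases, states, hs):
        p *= params["init"].get(a, 1.0) if prev is None \
            else params["trans"].get((prev, a), 0.0)       # Markov transition
        p *= (1.0 - params["p0"]) if h == 1 else params["p0"]  # d(h_k)
        p *= params["plen"].get((a, len(v)), 0.0)          # n(phi_k; s_{a_k})
        p *= params["emit"].get((a, tuple(v)), 0.0)        # P(v_k | h_k s_{a_k})
        prev = a
    return p

params = {"eta": 0.5, "p0": 0.2,
          "init": {1: 1.0}, "trans": {(1, 2): 0.9},
          "plen": {(1, 2): 0.6, (2, 1): 0.7},
          "emit": {(1, ("china", "s")): 0.3, (2, ("accession",)): 0.4}}
# Two phrases, both emitted from real source states (h_k = 1).
p = wtop_joint_prob([("china", "s"), ("accession",)], [1, 2], [1, 1], params)
```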
40 Bitext Word Alignment WtoP HMM Model: Word-to-Phrase Translation Probabilities
Replace the weak i.i.d. word-for-word translation. For a Chinese source word f (garbled in the transcription) emitting a phrase of length 3:
i.i.d.: P(world trade organization | f; 3) = t(world | f) t(trade | f) t(organization | f)
bigram: P(world trade organization | f; 3) = t(world | f) t_2(trade | world, f) t_2(organization | trade, f)
The bigram model assigns higher probability to the correct translation than i.i.d., and incorporates context without losing algorithmic efficiency: DP still applies. It uses the same estimation techniques as bigram LMs; data sparseness is handled with Witten-Bell smoothing.
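The two phrase emission probabilities above can be compared directly; the table entries below are invented toy numbers chosen only to show the bigram model rewarding the in-order phrase:

```python
def iid_phrase_prob(phrase, t1):
    """i.i.d. word-for-word emission: prod_j t(t_j | f)."""
    p = 1.0
    for w in phrase:
        p *= t1.get(w, 0.0)
    return p

def bigram_phrase_prob(phrase, t1, t2):
    """Bigram t-table emission: t(t_1 | f) * prod_j t2(t_j | t_{j-1}, f)."""
    p = t1.get(phrase[0], 0.0)
    for prev, w in zip(phrase, phrase[1:]):
        p *= t2.get((prev, w), 0.0)
    return p

# Hypothetical tables for one source word f.
t1 = {"world": 0.3, "trade": 0.3, "organization": 0.3}
t2 = {("world", "trade"): 0.9, ("trade", "organization"): 0.9}
phrase = ["world", "trade", "organization"]
p_iid = iid_phrase_prob(phrase, t1)        # 0.3^3
p_bi = bigram_phrase_prob(phrase, t1, t2)  # 0.3 * 0.9 * 0.9
```

Because the bigram factors condition on the previous target word, a scrambled phrase like "trade world organization" would score near zero here, while the i.i.d. product is order-blind.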
41 Bitext Word Alignment WtoP HMM Model: Comparing the Word-to-Phrase HMM to...
- Segmental hidden Markov models (Ostendorf et al., 96): states emit observation sequences
- WtoW HMM (Vogel et al., 96; Och & Ney, 03): the special case N = 1
- Extensions to the WtoW HMM (Toutanova et al., 02): P(stay | s) vs. P(stay = φ | s) in modeling state durations
- IBM Model-4: fertility vs. phrase length
42 Outline (section marker): Bitext Word Alignment, Parameter Estimation
43 Bitext Word Alignment Parameter Estimation: Forward-Backward Algorithm
State space S = {(i, φ, h) : 1 ≤ i ≤ I, 1 ≤ φ ≤ N, h = 0 or 1}: a grid of 2NI × J.
α_j(i, φ, h) = [ Σ_{i', φ', h'} α_{j-φ}(i', φ', h') p(i | i', h; I) ] η n(φ; h s_i) P(t_{j-φ+1}^j | h s_i, φ)
β_j(i, φ, h) = Σ_{i', φ', h'} β_{j+φ'}(i', φ', h') p(i' | i, h'; I) η n(φ'; h' s_{i'}) P(t_{j+1}^{j+φ'} | h' s_{i'}, φ')
γ_j(i, φ, h) = P(h s_i → v = t_{j-φ+1}^j | θ, s, t) = α_j(i, φ, h) β_j(i, φ, h) / Σ_{i', φ', h'} α_J(i', φ', h')
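The α recursion above can be sketched as a forward pass over the (state, phrase length) grid. To keep the sketch short it drops the NULL dimension (i.e. it assumes p_0 = 0), and all parameter tables are hypothetical stand-ins:

```python
def wtop_forward(t, I, N, init, trans, plen, emit, eta):
    """Forward pass for a simplified WtoP HMM (NULL states omitted).
    alpha[j][i][phi] = prob of generating t[:j] with the last phrase,
    of length phi, emitted from source state i. Returns P(t | s)."""
    J = len(t)
    alpha = [[[0.0] * (N + 1) for _ in range(I)] for _ in range(J + 1)]
    for j in range(1, J + 1):
        for i in range(I):
            for phi in range(1, min(N, j) + 1):
                v = tuple(t[j - phi:j])
                e = eta * plen.get((i, phi), 0.0) * emit.get((i, v), 0.0)
                if e == 0.0:
                    continue
                if j == phi:                       # first phrase in the sentence
                    alpha[j][i][phi] = init.get(i, 0.0) * e
                else:                              # sum over predecessor states
                    alpha[j][i][phi] = e * sum(
                        alpha[j - phi][ip][pp] * trans.get((ip, i), 0.0)
                        for ip in range(I) for pp in range(1, N + 1))
    return sum(alpha[J][i][phi] for i in range(I) for phi in range(1, N + 1))

# Toy instance: state 0 emits a length-2 phrase, then state 1 a length-1 phrase.
p = wtop_forward(["china", "s", "accession"], I=2, N=2,
                 init={0: 1.0}, trans={(0, 1): 1.0},
                 plen={(0, 2): 1.0, (1, 1): 1.0},
                 emit={(0, ("china", "s")): 0.5, (1, ("accession",)): 0.4},
                 eta=1.0)
```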
44 Bitext Word Alignment Parameter Estimation: Embedded Estimation of the Word-to-Phrase HMM
Unsupervised training from scratch:
- Model-1, 10 iterations (initial t-table)
- Model-2, 5 iterations (better t-table)
- WtoW HMM, 5 iterations (initial Markov model)
- WtoP HMM, N = 2, 3, ..., each 5 iterations (Markov model, phrase length); experience from ASR
- WtoP HMM with bigram t-table, 5 iterations (bigram t-table)
Parallel implementation: partition the training bitext; the E-step collects counts from each partition in parallel; the M-step merges the counts to update model parameters. Memory efficient, with virtually no limit on training bitext size.
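The parallel E-step/M-step split described above amounts to count accumulation per partition followed by a merge-and-renormalize; a minimal sketch with toy fractional counts (in the real system the posteriors come from the forward-backward pass):

```python
from collections import Counter

def e_step(partition):
    """Toy E-step on one bitext partition: accumulate fractional counts
    c(s_word, t_word) from per-link posteriors."""
    counts = Counter()
    for s_word, t_word, posterior in partition:
        counts[(s_word, t_word)] += posterior
    return counts

def m_step(partial_counts):
    """M-step: merge counts from all partitions, then renormalize per
    source word to get updated translation probabilities."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    z = Counter()
    for (s_word, _), v in total.items():
        z[s_word] += v
    return {pair: v / z[pair[0]] for pair, v in total.items()}

part1 = [("A", "x", 0.5), ("A", "y", 0.5)]   # counts from partition 1
part2 = [("A", "x", 1.0)]                    # counts from partition 2
ttable = m_step([e_step(part1), e_step(part2)])
```

Because partitions only exchange count tables, each worker holds just its slice of the bitext, which is what keeps memory usage controlled as the bitext grows.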
46 Outline (section marker): Bitext Word Alignment, Word Alignment Results
47 Bitext Word Alignment Results
Test: NIST 2001 MT-eval set, 124 sentence pairs with manual word alignments.
- Comparable performance to Model-4 on the FBIS training bitext
- Increasing the max phrase length N improves quality in the C→E direction
- The bigram translation probability improves word-to-phrase links
- A good balance between 1-1 and 1-N link distributions can be achieved (table of hypothesized link counts and AER by η not recoverable from the transcription)
- Comparable performance when extending to large-scale bitexts
48 Outline (section marker): Bitext Phrase Alignment, Inducing from Word Alignments
49 Bitext Phrase Alignment Inducing from Word Alignments: Statistical Phrase Translation Models
Phrase-based SMT performs better than word-based SMT. The Phrase Pair Inventory (PPI) is extracted from word-aligned bitext (Och et al., 99).
50 Bitext Phrase Alignment Inducing from Word Alignments: But word alignments are imperfect...
Example: "There is no gang and money linked politics in hong kong and there will not be such politics in future either"
- Relying on the one-best word alignment may exclude some valid phrase pairs
- Goal: define a probability distribution over phrase pairs
- Allows more control over the generation of phrase pairs
53 Outline (section marker): Bitext Phrase Alignment, Model-based Phrase Pair Posterior
54 Bitext Phrase Alignment: Model-based Phrase Pair Posterior
Doesn't rely on a single alignment. For a source span [i_1, i_2] and a target span [j_1, j_2] (example sentence: "There is no gang and money linked politics in hong kong and there will not be such politics in future either"):
Define the set of alignments that align words to words within the phrases:
A(i_1, i_2; j_1, j_2) = {a_1^m : a_j ∈ [i_1, i_2] iff j ∈ [j_1, j_2]}
Calculate the likelihood of the source phrase producing the target phrase:
P(t, A(i_1, i_2; j_1, j_2) | s) = Σ_{a : a_1^m ∈ A(i_1, i_2; j_1, j_2)} P(t, a | s)
Obtain the phrase pair posterior:
P(A(i_1, i_2; j_1, j_2) | t, s) = P(t, A(i_1, i_2; j_1, j_2) | s) / P(t | s)
Efficient DP-based implementation for the WtoP HMM; difficult for Model-4.
Bitext Phrase Alignment: Augmented PPI for Better Coverage
Baseline PPI extracted from 1-best alignments using established techniques (Och et al., 99).
GOAL: add phrase pairs to improve test set coverage.
For each foreign phrase v in the test set not covered by the baseline, and for each sentence pair containing v, find the English phrase u that maximizes the phrase pair posterior:
  f(i_1, i_2) = P_{F|E}( A(i_1, i_2; j_1, j_2) | e_1^l, f_1^m )
  b(i_1, i_2) = P_{E|F}( A(i_1, i_2; j_1, j_2) | e_1^l, f_1^m )
  g(i_1, i_2) = f(i_1, i_2) · b(i_1, i_2)
  (î_1, î_2) = argmax_{1 ≤ i_1, i_2 ≤ l} g(i_1, i_2), and set u = e_{î_1}^{î_2}
Add (u, v) to the baseline PPI if the posterior exceeds a threshold value.
Y. Deng (Johns Hopkins) Bitext Alignment for SMT 34 / 42
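The span search itself can be sketched as follows. The two directional posteriors are stubbed out as plain callables (in the thesis they come from source-to-target and target-to-source WtoP HMMs), and the threshold value is illustrative.

```python
def best_english_span(l, f_post, b_post, threshold=0.1):
    """Find the English span maximizing g = f * b for one foreign phrase.

    f_post(i1, i2): source-to-target phrase pair posterior (stub here)
    b_post(i1, i2): target-to-source phrase pair posterior (stub here)
    Returns ((i1, i2), g) if the best g clears the threshold, else None.
    """
    best, best_g = None, 0.0
    for i1 in range(l):
        for i2 in range(i1, l):
            g = f_post(i1, i2) * b_post(i1, i2)
            if g > best_g:
                best, best_g = (i1, i2), g
    return (best, best_g) if best_g >= threshold else None
```

With toy posteriors peaked at one span, the search returns that span; if no span clears the threshold, nothing is added, which is how the method balances coverage against phrase quality.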
Outline (current subsection: Bitext Phrase Alignment, Translation Results)
1 Bitext Chunk Alignment: Chunk Alignment; Chunking Algorithms; Sentence Alignment Results
2 Bitext Word Alignment: Introduction/Motivation; Word-to-Phrase HMM Model; Parameter Estimation; Word Alignment Results
3 Bitext Phrase Alignment: Inducing from Word Alignments; Model-based Phrase Pair Posterior; Translation Results
4 Conclusions
Y. Deng (Johns Hopkins) Bitext Alignment for SMT 35 / 42
Translation Results: the Translation Template Model (TTM) (Kumar et al., 05)
TTM decoder: a WFST implementation with monotone phrase order. Component models: Source Phrase Segmentation, Phrase Insertion, Phrase Translation, Target Phrase Segmentation. The generative cascade:
  Source Language Sentence:                     GRAIN EXPORTS ARE PROJECTED TO FALL BY 25 %
  Source Phrases:                               GRAIN EXPORTS ARE_PROJECTED_TO FALL BY_25_%
  Placement of Target Phrase Insertion Markers: 1 GRAINS 1 EXPORTS ARE_PROJECTED_TO FALL BY_25_%
  Target Phrases:                               DE GRAINS LES EXPORTATIONS DOIVENT FLECHIR DE_25_%
  Target Language Sentence:                     DE GRAINS LES EXPORTATIONS DOIVENT FLECHIR DE 25 %
Y. Deng (Johns Hopkins) Bitext Alignment for SMT 36 / 42
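The monotone cascade can be illustrated, well short of the actual WFST composition, with a greedy longest-match segmenter over a toy phrase table built from the slide's example; the phrase table entries and all names here are assumptions for illustration only.

```python
def monotone_phrase_translate(words, phrase_table, max_phrase=3):
    """Segment the source greedily into known phrases, translate each
    phrase in order, and concatenate: a toy monotone pipeline."""
    out, i = [], 0
    while i < len(words):
        # Try the longest source phrase starting at position i first.
        for k in range(min(max_phrase, len(words) - i), 0, -1):
            src = tuple(words[i:i + k])
            if src in phrase_table:
                out.extend(phrase_table[src])
                i += k
                break
        else:
            out.append(words[i])  # pass unknown words through unchanged
            i += 1
    return out

# Hypothetical phrase pairs reconstructed from the slide's example.
TABLE = {
    ("GRAIN",): ["DE", "GRAINS"],
    ("EXPORTS",): ["LES", "EXPORTATIONS"],
    ("ARE", "PROJECTED", "TO"): ["DOIVENT"],
    ("FALL",): ["FLECHIR"],
    ("BY", "25", "%"): ["DE", "25", "%"],
}
```

Running the slide's source sentence through this sketch yields the slide's target sentence; the real decoder instead composes transducers for segmentation, insertion, translation, and target segmentation, and searches over all segmentations.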
Translation Results: Automatic Machine Translation Evaluation
A hard problem! BLEU (Papineni et al., 01) is an automatic MT metric that correlates well with human judgements: the geometric mean of n-gram precisions, weighted by a brevity penalty.
  Reference:  mr. speaker , in absolutely no way .
  Hypothesis: in absolutely no way , mr. chairman .
Matching sub-strings of the hypothesis against the reference gives n-gram precisions 7/8 (1-gram), 3/7 (2-gram), 2/6 (3-gram), and 1/5 (4-gram), so
  BLEU = (7/8 · 3/7 · 2/6 · 1/5)^{1/4}
Y. Deng (Johns Hopkins) Bitext Alignment for SMT 37 / 42
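The slide's worked example can be reproduced directly. This sketch assumes the tokenization splits punctuation off (which is what makes the 1-gram precision 7/8); the brevity penalty is 1 here since hypothesis and reference have equal length.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, max_n=4):
    """Sentence BLEU as on the slide: the geometric mean of clipped
    n-gram precisions, scaled by a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        clipped = sum(min(c, r[g]) for g, c in h.items())  # clip by ref counts
        precisions.append(clipped / max(sum(h.values()), 1))
    if min(precisions) == 0:
        return 0.0, precisions  # geometric mean is zero
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    score = bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
    return score, precisions
```

On the slide's sentence pair this returns exactly the four precisions shown and BLEU = (7/8 · 3/7 · 2/6 · 1/5)^{1/4}.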
Translation Results: Small Systems
[Charts: test-set coverage and BLEU on eval02/eval03/eval04 for Chinese-English (BLEU around 23.4) and Arabic-English (BLEU around 36.5) as the augmentation threshold is varied]
Relaxing the threshold in PPI augmentation improves coverage and BLEU score.
Balance coverage against phrase translation quality.
The WtoP model can even be applied to augment a Model-4 PPI.
Y. Deng (Johns Hopkins) Bitext Alignment for SMT 38 / 42
Translation Results: Large Systems
[Charts: BLEU on eval02/eval03/eval04(news) for three systems, Model-4 Baseline, WtoP HMM Baseline, and WtoP HMM Augmented, on Chinese-English and Arabic-English]
Used all parallel corpora available from the LDC:
C-E: 200M English words (FBIS, Xinhua, HK News, ..., all UN bitexts)
A-E: 130M English words (news, all UN bitexts)
Y. Deng (Johns Hopkins) Bitext Alignment for SMT 39 / 42
Conclusions
A hierarchical bitext chunking approach: language independent, no linguistic knowledge required; derives short chunk pairs and retains more of the available bitext.
The word-to-phrase HMM alignment model: produces good-quality word alignments over very large bitexts; has an efficient training algorithm with a parallel implementation; a powerful framework.
Model-based phrase pair distribution: enables an improved phrase pair extraction strategy with a controlled balance of coverage vs. quality; the WtoP HMM performs better than IBM Model-4 on large systems.
Y. Deng (Johns Hopkins) Bitext Alignment for SMT 40 / 42
Machine Translation Toolkit (MTTK)
Solutions for MT training; used for the JHU-CU 2005 NIST MT Eval systems.
[Flowchart: document alignments and length statistics feed bitext chunking, producing chunk alignments that are filtered for high-quality pairs; these feed a training cascade of Model 1 {t(f|e)}, Model 2 {t(f|e), a(i|j;l,m)}, word-to-word HMM {t(f|e), P(i|i';l)}, word-to-phrase HMM with N=2,3,... adding {n(phi;e)}, and WtP HMM with a bigram t-table adding {t2(f|f',e)}; each stage emits word or phrase alignments via the corresponding tools (AlignM1, AlignM2, AlignHmm, AlignSHmm, PPEM, PPEHmm)]
Y. Deng (Johns Hopkins) Bitext Alignment for SMT 41 / 42
Conclusions
Y. Deng (Johns Hopkins) Bitext Alignment for SMT 42 / 42
Constructive Decision Theory Joe Halpern Cornell University Joint work with Larry Blume and David Easley Economics Cornell Constructive Decision Theory p. 1/2 Savage s Approach Savage s approach to decision
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationNatural Language Processing
Natural Language Processing Word vectors Many slides borrowed from Richard Socher and Chris Manning Lecture plan Word representations Word vectors (embeddings) skip-gram algorithm Relation to matrix factorization
More informationCovariance Matrix Enhancement Approach to Train Robust Gaussian Mixture Models of Speech Data
Covariance Matrix Enhancement Approach to Train Robust Gaussian Mixture Models of Speech Data Jan Vaněk, Lukáš Machlica, Josef V. Psutka, Josef Psutka University of West Bohemia in Pilsen, Univerzitní
More information