Out of GIZA: Efficient Word Alignment Models for SMT
Yanjun Ma
National Centre for Language Technology, School of Computing, Dublin City University
NCLT Seminar Series, March 4, 2009
Y. Ma (DCU) Out of Giza 1 / 28
Outline
1 Contexts
2 HMM and IBM Model 4
3 Improved HMM Alignment Models
4 Simultaneous Word Alignment and Phrase Extraction
Word Alignment and SMT
All SMT systems rely on word alignment:
Word-Based SMT
Phrase-Based SMT
Hiero, hierarchical SMT
Syntax-Based SMT, e.g. tree-to-string, string-to-tree, tree-to-tree
The GIZA implementation of IBM Model 4 is dominant
The Viterbi alignment from IBM Model 4 is used
Efficient Model: HMM [Vogel et al., 1996]
HMM emission (translation) model: p(t_j | s_{a_j})
HMM transition (alignment) model: p(a_j | a_{j-1})

p(t, a | s) = ∏_j p(a_j | a_{j-1}) · p(t_j | s_{a_j})  (1)
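Eq. (1) can be sketched directly: given one alignment a, the joint probability is a product of one transition term and one emission term per target position. The probability tables below are toy assumptions for illustration, not trained values.

```python
# Sketch of Eq. (1): p(t, a | s) = prod_j p(a_j | a_{j-1}) * p(t_j | s_{a_j}).
# trans maps (prev_src_pos, src_pos) -> prob; emit maps (src_word, tgt_word) -> prob.

def hmm_joint_prob(src, tgt, align, trans, emit):
    """p(t, a | s) for one alignment a, where align[j] indexes into src."""
    prob = 1.0
    prev = -1  # conventional start state before the first target word
    for j, t_word in enumerate(tgt):
        i = align[j]
        prob *= trans[(prev, i)] * emit[(src[i], t_word)]
        prev = i
    return prob
```

Real implementations work in log space to avoid underflow on long sentences; the plain product is kept here only for readability.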
Deficient Models: IBM Model 3 and 4
Model 3: zero-order distortion model
Model 4: first-order distortion model
Derivation

P(t_1^J, a_1^J | s_1^I) = P(t_1^J, B_0^I | s_1^I)  (2)
= P(B_0 | B_1^I) · ∏_{i=1}^I P(B_i | B_1^{i-1}, s_1^I) · P(t_1^J | B_0^I, s_1^I)  (3)
= P(B_0 | B_1^I) · ∏_{i=1}^I p(B_i | B_{i-1}, s_i)  [fertility-distortion]  (4)
  · ∏_{i=0}^I ∏_{j ∈ B_i} p(t_j | s_i)  [translation]  (5)
Model 3 fertility and distortion

p(B_i | B_{i-1}, s_i) = p(φ_i | s_i)  [fertility] · φ_i! · ∏_{j ∈ B_i} p(j | i, J)  [distortion]  (6)

Model 4 fertility and distortion

p(B_i | B_{i-1}, s_i) = p(φ_i | s_i)  [fertility] · p_{=1}(B_{i1} | B_{ρ(i)})  [first word] · ∏_{k=2}^{φ_i} p_{>1}(B_{ik} | B_{i,k-1})  [remaining words]  (7)
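The Model 3 factor of Eq. (6) is easy to evaluate once the fertility and distortion tables are given. A minimal sketch, with toy tables standing in for trained parameters:

```python
# Sketch of Eq. (6): p(B_i | B_{i-1}, s_i) = p(phi_i | s_i) * phi_i! *
# prod_{j in B_i} p(j | i, J). fert_prob maps fertility -> prob;
# dist_prob maps (tgt_pos, src_pos, tgt_len) -> prob. Toy values only.

from math import factorial

def model3_fert_dist(B_i, i, J, fert_prob, dist_prob):
    """B_i is the list of target positions aligned to source position i."""
    phi = len(B_i)  # fertility of source word i
    prob = fert_prob[phi] * factorial(phi)
    for j in B_i:
        prob *= dist_prob[(j, i, J)]
    return prob
```

The φ_i! arises because the positions in B_i form a set: the zero-order distortion model generates them in any order.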
Decoding
HMM
Viterbi decoding: â = argmax_a P(a | s, t)
Posterior decoding: align point (i, j) iff p(a_j = i | s, t) ≥ δ
IBM Model 3 and 4
No efficient decoding algorithm available
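Viterbi decoding for the HMM model is standard dynamic programming over source positions. A minimal sketch, with the same toy table conventions as before (real systems use log probabilities):

```python
# Viterbi decoding sketch: a_hat = argmax_a p(a, t | s) for the HMM model.
# trans maps (prev_src_pos, src_pos) -> prob (with -1 as the start state);
# emit maps (src_word, tgt_word) -> prob.

def viterbi_align(src, tgt, trans, emit):
    I, J = len(src), len(tgt)
    # delta[j][i]: best score of aligning tgt[0..j] with a_j = i
    delta = [[0.0] * I for _ in range(J)]
    back = [[0] * I for _ in range(J)]
    for i in range(I):
        delta[0][i] = trans[(-1, i)] * emit[(src[i], tgt[0])]
    for j in range(1, J):
        for i in range(I):
            best_k = max(range(I), key=lambda k: delta[j-1][k] * trans[(k, i)])
            delta[j][i] = delta[j-1][best_k] * trans[(best_k, i)] * emit[(src[i], tgt[j])]
            back[j][i] = best_k
    # backtrace from the best final state
    a = [max(range(I), key=lambda i: delta[J-1][i])]
    for j in range(J - 1, 0, -1):
        a.append(back[j][a[-1]])
    return list(reversed(a))
```

This runs in O(J · I²), which is what makes the HMM model efficient compared with Model 3/4, where no comparable algorithm exists.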
Advantages of HMM models
Efficient parameter estimation: the forward-backward (Baum-Welch) algorithm
Figure: Eric B. Baum (son of Leonard E. Baum, inventor of the algorithm) and Lloyd R. Welch
The resulting posterior probabilities are useful
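The posterior probabilities come from the same forward-backward recursions used in Baum-Welch. A minimal sketch computing the link posteriors p(a_j = i | s, t), again over toy tables:

```python
# Forward-backward sketch for the HMM alignment model: computes the
# posterior link probabilities p(a_j = i | s, t) that drive both
# posterior decoding and Baum-Welch re-estimation. Toy tables only.

def link_posteriors(src, tgt, trans, emit):
    I, J = len(src), len(tgt)
    fwd = [[0.0] * I for _ in range(J)]
    bwd = [[1.0] * I for _ in range(J)]
    for i in range(I):
        fwd[0][i] = trans[(-1, i)] * emit[(src[i], tgt[0])]
    for j in range(1, J):
        for i in range(I):
            fwd[j][i] = sum(fwd[j-1][k] * trans[(k, i)] for k in range(I)) \
                        * emit[(src[i], tgt[j])]
    for j in range(J - 2, -1, -1):
        for i in range(I):
            bwd[j][i] = sum(trans[(i, k)] * emit[(src[k], tgt[j+1])] * bwd[j+1][k]
                            for k in range(I))
    z = sum(fwd[J-1][i] for i in range(I))  # z = p(t | s)
    return [[fwd[j][i] * bwd[j][i] / z for i in range(I)] for j in range(J)]
```

Each row of the result sums to one: every target word's alignment posterior is a proper distribution over source positions.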
Disadvantages of standard HMM models
The objective is maximising the likelihood
There is no guarantee that the optimised parameters correspond to more accurate alignments
Complicating the model (sometimes!) does help, e.g. IBM Model 4
Improved HMM models
Two more sophisticated HMM models:
Segmental HMM: the word-to-phrase alignment model
Constrained HMM: the agreement-guided alignment model
HMM Word-to-Phrase Alignment [Deng and Byrne, 2008]
Introducing a segmentation model: the segmental HMM
P(t, a | s) = P(v_1^K, K, a_1^K, h_1^K, φ_1^K | s)  (8)
= P(K | J, s)  [segmentation]  (9)
  · P(a_1^K, φ_1^K, h_1^K | K, J, s)  [alignment-fertility]  (10)
  · P(v_1^K | a_1^K, φ_1^K, h_1^K, K, J, s)  [translation]  (11)
HMM Word-to-Phrase Alignment

P(a_1^K, φ_1^K, h_1^K | K, J, s) = ∏_{k=1}^K P(a_k, h_k, φ_k | a_{k-1}, φ_{k-1}, h_{k-1}, K, J, s)  (12)
= ∏_{k=1}^K p(a_k | a_{k-1}, h_k; I)  [alignment] · d(h_k)  [null alignment] · n(φ_k; s_{a_k})  [fertility]  (13)
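The factored chain of Eq. (13) can be sketched as a product over phrases, one alignment, null, and fertility term each. The table layouts below (dict keys, the start state -1) are illustrative assumptions, not the MTTK data structures.

```python
# Sketch of Eq. (13): prod_k p(a_k | a_{k-1}, h_k; I) * d(h_k) * n(phi_k; s_{a_k}).
# trans maps (prev_align, align, h) -> prob; null_d maps h -> prob;
# fert maps (phrase_length, src_word) -> prob. Toy values only.

def word_to_phrase_score(aligns, halts, phis, src, trans, null_d, fert):
    """Score one segmentation/alignment: lists aligns, halts, phis have length K."""
    prob = 1.0
    prev = -1  # conventional start state before the first phrase
    for a_k, h_k, phi_k in zip(aligns, halts, phis):
        prob *= trans[(prev, a_k, h_k)] * null_d[h_k] * fert[(phi_k, src[a_k])]
        prev = a_k
    return prob
```

Because each factor depends only on the previous phrase, the model stays a (segmental) HMM and keeps the efficient forward-backward machinery.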
Performance of HMM Word-to-Phrase Alignment
MTTK implementation
Used by the Cambridge University Engineering Department Arabic-English NIST 2008 system (6th out of 16; third-best university participant, behind LIUM and ISI)
Consistent performance on Chinese-English across corpora of different sizes
Parallelised to handle large amounts of data (e.g. 10M sentence pairs)
Agreement Constrained HMM Alignment [Ganchev et al., 2008]
Objective

argmin_{q(a) ∈ Q} KL(q(a) ‖ p_θ(a | s, t))  s.t.  E_q[f(s, t, a)] ≤ b  (14)

Figure: the two directional models →p_θ(a | s, t), ←p_θ(a | s, t) and their projections →q(a), ←q(a)
Agreement Constrained HMM Alignment [Ganchev et al., 2008]
Constrained E-step in EM
Performance of Agreement Constrained HMM
PostCAT implementation
Evaluation
Six language pairs, from 100,000 to 1M sentence pairs
Outperforms IBM Model 4 (16 out of 18 times)
However... results get slightly worse when the training data exceeds 1M sentence pairs
Algorithm 1 Agreement Constrained HMM Alignment
1: λ_ij ← 0 ∀ i, j
2: for T iterations do
3:   →θ_t(t_j | s_i) ← θ_t(t_j | s_i) · e^{λ_ij} ∀ i, j
4:   ←θ_t(s_i | t_j) ← θ_t(s_i | t_j) · e^{λ_ij} ∀ i, j
5:   →q ← forward-backward(→θ_t, →θ_a)
6:   ←q ← forward-backward(←θ_t, ←θ_a)
7:   update λ_ij by projected gradient on the agreement constraint
8: end for
9: return (→q, ←q)
Phrase Pair Extraction
State of the art: using the Viterbi alignment only
Using all possible alignments:

A(i_1, i_2; j_1, j_2) = {a = a_1^J : a_j ∈ [i_1, i_2] iff j ∈ [j_1, j_2]}  (15)
Phrase Pair Extraction via Model-Based Posteriors
Derivation

P(t, A(i_1, i_2; j_1, j_2) | s; θ) = Σ_{a ∈ A(i_1, i_2; j_1, j_2)} P(t, a | s; θ)  (16)

P(A(i_1, i_2; j_1, j_2) | s, t; θ) = P(t, A(i_1, i_2; j_1, j_2) | s; θ) / P(t | s; θ)  (17)
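Eqs. (15)-(17) can be verified by brute force on a tiny example: sum p(t, a | s) over alignments consistent with the phrase pair, then normalise by the sum over all alignments. This enumeration is exponential in J, so it is purely illustrative; the point of the model-based approach is that HMM posteriors give the same quantity efficiently.

```python
# Brute-force sketch of Eqs. (15)-(17). joint_prob(a) stands in for
# p(t, a | s; theta) under any alignment model; here it is an arbitrary
# callback, since only the sum-and-normalise structure is being shown.

from itertools import product

def phrase_pair_posterior(src, tgt, span, joint_prob):
    """span = (i1, i2, j1, j2), inclusive source/target phrase boundaries."""
    i1, i2, j1, j2 = span
    total = consistent = 0.0
    for a in product(range(len(src)), repeat=len(tgt)):
        p = joint_prob(a)
        total += p  # accumulates p(t | s)
        # Eq. (15): a_j lands inside [i1, i2] exactly when j is inside [j1, j2]
        if all((i1 <= a[j] <= i2) == (j1 <= j <= j2) for j in range(len(tgt))):
            consistent += p
    return consistent / total
```

With a uniform joint_prob, the posterior reduces to the fraction of alignments consistent with the phrase pair, which makes the normalisation in Eq. (17) easy to sanity-check.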
Evaluation
Significant gains when used to augment the original phrase extraction strategy
References
Deng, Y. and Byrne, W. (2008). HMM word and phrase alignment for statistical machine translation. IEEE Transactions on Audio, Speech, and Language Processing, 16(3):494-507.
Ganchev, K., Graca, J., and Taskar, B. (2008). Better alignments = better translations? In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, Columbus, OH.
Vogel, S., Ney, H., and Tillmann, C. (1996). HMM-based word alignment in statistical translation. In Proceedings of the 16th International Conference on Computational Linguistics, pages 836-841, Copenhagen, Denmark.