Out of GIZA: Efficient Word Alignment Models for SMT


Yanjun Ma
National Centre for Language Technology, School of Computing, Dublin City University
NCLT Seminar Series, March 4, 2009

Outline
1. Contexts
2. HMM and IBM Model 4
3. Improved HMM Alignment Models
4. Simultaneous Word Alignment and Phrase Extraction

Word Alignment and SMT
All SMT systems rely on word alignment:
- Word-Based SMT
- Phrase-Based SMT
- Hiero, hierarchical SMT
- Syntax-Based SMT, e.g. tree-to-string, string-to-tree, tree-to-tree
The GIZA++ implementation of IBM Model 4 is dominant, and the Viterbi alignment from IBM Model 4 is what gets used.


Efficient Model: HMM [Vogel et al., 1996]
HMM emission (translation) model: $p(t_j \mid s_{a_j})$

HMM transition (alignment) model: $p(a_j \mid a_{j-1})$, giving

$$p(t, a \mid s) = \prod_j p(a_j \mid a_{j-1}) \, p(t_j \mid s_{a_j}) \quad (1)$$
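To make equation (1) concrete, here is a minimal Python sketch (not from the talk) that scores a single alignment under the HMM model; the toy tables `trans` and `jump` and the start-at-position-0 convention are assumptions for illustration.

```python
# Minimal sketch of equation (1): scoring one alignment a under the
# HMM alignment model. Toy tables and the start convention are assumed.

def hmm_joint_prob(src, tgt, align, trans, jump):
    """p(t, a | s) = prod_j p(a_j | a_{j-1}) * p(t_j | s_{a_j}).

    align[j] is the source position generating target word j (0-indexed),
    trans[(t, s)] the emission table, jump[d] the probability of a jump
    of d source positions."""
    prob = 1.0
    prev = 0  # assumed initial source position
    for j, t_word in enumerate(tgt):
        i = align[j]
        prob *= jump.get(i - prev, 0.0) * trans.get((t_word, src[i]), 0.0)
        prev = i
    return prob

src, tgt = ["la", "maison"], ["the", "house"]
trans = {("the", "la"): 0.8, ("house", "maison"): 0.7}
jump = {-1: 0.1, 0: 0.5, 1: 0.4}
print(hmm_joint_prob(src, tgt, [0, 1], trans, jump))  # 0.5*0.8 * 0.4*0.7 = 0.112
```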

Deficient Models: IBM Models 3 and 4
Model 3: zero-order distortion model

Model 4: first-order distortion model

Derivation

$$P(t_1^J, a_1^J \mid s_1^I) = P(t_1^J, B_0^I \mid s_1^I) \quad (2)$$
$$= P(B_0 \mid B_1^I) \prod_{i=1}^{I} P(B_i \mid B_1^{i-1}, s_1^I) \cdot P(t_1^J \mid B_0^I, s_1^I) \quad (3)$$
$$= P(B_0 \mid B_1^I) \prod_{i=1}^{I} \underbrace{p(B_i \mid B_{i-1}, s_i)}_{\text{fertility-distortion}} \quad (4)$$
$$\cdot \prod_{i=0}^{I} \prod_{j \in B_i} \underbrace{p(t_j \mid s_i)}_{\text{translation}} \quad (5)$$

Model 3 fertility and distortion:

$$p(B_i \mid B_{i-1}, s_i) = \underbrace{p(\phi_i \mid s_i)\,\phi_i!}_{\text{fertility}} \prod_{j \in B_i} \underbrace{p(j \mid i, J)}_{\text{distortion}} \quad (6)$$

Model 4 fertility and distortion:

$$p(B_i \mid B_{i-1}, s_i) = \underbrace{p(\phi_i \mid s_i)}_{\text{fertility}} \underbrace{p_{=1}(B_{i1} \mid B_{\rho(i)})}_{\text{first word}} \prod_{k=2}^{\phi_i} \underbrace{p_{>1}(B_{ik} \mid B_{i,k-1})}_{\text{remaining words}} \quad (7)$$

Decoding
HMM:
- Viterbi decoding: $\hat{a} = \arg\max_a P(a \mid s, t)$
- Posterior decoding: align $(i, j)$ iff $p(a_j = i \mid s, t) \geq \delta$
IBM Models 3 and 4: no efficient algorithm available.
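A minimal Viterbi sketch for the HMM above; the uniform initial distribution is an assumption, and real implementations also handle NULL alignment and condition the jump on the source length:

```python
import numpy as np

def viterbi_align(emis, trans):
    """Most probable alignment under the HMM: a-hat = argmax_a p(a | s, t).

    emis[j, i]  = p(t_j | s_i)              (J target positions x I states)
    trans[i, k] = p(a_j = k | a_{j-1} = i)  (I x I jump matrix)
    """
    J, I = emis.shape
    delta = np.zeros((J, I))          # best score ending in state i at step j
    back = np.zeros((J, I), dtype=int)
    delta[0] = emis[0] / I            # uniform initial distribution (assumed)
    for j in range(1, J):
        scores = delta[j - 1][:, None] * trans   # (from, to) score matrix
        back[j] = scores.argmax(axis=0)
        delta[j] = scores.max(axis=0) * emis[j]
    a = [int(delta[-1].argmax())]
    for j in range(J - 1, 0, -1):     # follow backpointers
        a.append(int(back[j, a[-1]]))
    return a[::-1]                    # a[j] = aligned source position
```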

Advantages of HMM models
Efficient parameter estimation algorithm: the forward-backward (Baum-Welch) algorithm.
[Figure: Eric B. Baum (son of Leonard E. Baum, the inventor of the algorithm) and Lloyd R. Welch]
The resulting posterior probabilities are useful.
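The forward-backward counterpart to the Viterbi sketch yields exactly the posteriors used for posterior decoding; again a sketch, with the same assumed uniform initial distribution:

```python
import numpy as np

def posterior_links(emis, trans, delta=0.5):
    """Forward-backward posteriors p(a_j = i | s, t); keep the links
    whose posterior clears the threshold delta (posterior decoding)."""
    J, I = emis.shape
    fwd, bwd = np.zeros((J, I)), np.zeros((J, I))
    fwd[0] = emis[0] / I                       # uniform start (assumed)
    for j in range(1, J):
        fwd[j] = (fwd[j - 1] @ trans) * emis[j]
    bwd[-1] = 1.0
    for j in range(J - 2, -1, -1):
        bwd[j] = trans @ (emis[j + 1] * bwd[j + 1])
    post = fwd * bwd
    post /= post.sum(axis=1, keepdims=True)    # normalise per target word
    return {(i, j) for j in range(J) for i in range(I) if post[j, i] >= delta}
```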

Disadvantages of standard HMM models
The objective is maximising the likelihood.
There is no guarantee that the optimised parameters correspond to more accurate alignments.
Complicating things (sometimes!) does help, e.g. IBM Model 4.


Improved HMM models
Two more sophisticated HMM models:
- Segmental HMM: the word-to-phrase alignment model
- Constrained HMM: the agreement-guided alignment model

HMM Word-to-Phrase Alignment [Deng and Byrne, 2008]
Introduces a segmentation model: a segmental HMM.

$$P(t, a \mid s) = P(v_1^K, K, a_1^K, h_1^K, \phi_1^K \mid s) \quad (8)$$
$$= \underbrace{P(K \mid J, s)}_{\text{segmentation}} \quad (9)$$
$$\cdot \underbrace{P(a_1^K, \phi_1^K, h_1^K \mid K, J, s)}_{\text{alignment-fertility}} \quad (10)$$
$$\cdot \underbrace{P(v_1^K \mid a_1^K, \phi_1^K, h_1^K, K, J, s)}_{\text{translation}} \quad (11)$$

HMM Word-to-Phrase Alignment

$$P(a_1^K, \phi_1^K, h_1^K \mid K, J, s) = \prod_{k=1}^{K} P(a_k, h_k, \phi_k \mid a_{k-1}, \phi_{k-1}, h_{k-1}, K, J, s) \quad (12)$$
$$= \prod_{k=1}^{K} \underbrace{p(a_k \mid a_{k-1}, h_k; I)}_{\text{alignment}} \, \underbrace{d(h_k)}_{\text{null alignment}} \, \underbrace{n(\phi_k; s_{a_k})}_{\text{fertility}} \quad (13)$$

Performance of HMM Word-to-Phrase Alignment
MTTK implementation:
- Used by Cambridge University Engineering Department for Arabic-English at NIST 2008 (6th out of 16; third-best university participant, behind LIUM and ISI)
- Consistent performance for Chinese-English across differently sized corpora
- Parallelised to handle large amounts of data (e.g. 10M sentence pairs)

Agreement-Constrained HMM Alignment [Ganchev et al., 2008]
Objective:

$$\arg\min_{q(a) \in Q} \mathrm{KL}\big(q(a) \,\|\, p_\theta(a \mid s, t)\big) \quad \text{s.t.} \quad E_q[f(s, t, a)] \leq b \quad (14)$$

The projection has the standard exponentiated form $q^*(a) \propto p_\theta(a \mid s, t)\, e^{-\lambda^\top f(s,t,a)}$, which is what the $e^{\lambda}$ reweighting in Algorithm 1 below implements.
[Figure: the two directional posteriors $\vec{p}_\theta(a \mid s, t)$, $\overleftarrow{p}_\theta(a \mid s, t)$ and the projected distributions $\vec{q}(a)$, $\overleftarrow{q}(a)$]

Agreement-Constrained HMM Alignment [Ganchev et al., 2008]
Constrained E-step (in EM).

Performance of Agreement-Constrained HMM
PostCAT implementation. Evaluation:
- Six language pairs, from 100,000 to 1M sentence pairs
- Outperforms IBM Model 4 (16 out of 18 times)
- However, results get slightly worse when the training data exceeds 1M sentence pairs

Algorithm 1: Agreement-Constrained HMM Alignment
1: $\lambda_{ij} \leftarrow 0 \;\; \forall i, j$
2: for $T$ iterations do
3:   $\tilde{\theta}_t(t_j \mid s_i) \leftarrow \theta_t(t_j \mid s_i)\, e^{\lambda_{ij}} \;\; \forall i, j$
4:   $\tilde{\theta}_t(s_i \mid t_j) \leftarrow \theta_t(s_i \mid t_j)\, e^{-\lambda_{ij}} \;\; \forall i, j$
5:   $\vec{q} \leftarrow \text{forward-backward}(\tilde{\theta}_t, \vec{\theta}_a)$
6:   $\overleftarrow{q} \leftarrow \text{forward-backward}(\tilde{\theta}_t, \overleftarrow{\theta}_a)$
7: end for
8: return $(\vec{q}, \overleftarrow{q})$
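A schematic Python rendering of Algorithm 1 (a sketch, not the PostCAT implementation): `fb_fwd` and `fb_bwd` stand for forward-backward routines like `posterior_links` above that return link posteriors, and the gradient step on `lam` is an assumption, since the slide does not show the multiplier update.

```python
import numpy as np

def agreement_estep(theta_st, theta_ts, fb_fwd, fb_bwd, T=10, step=0.1):
    """Schematic rendering of Algorithm 1 (assumptions marked above).

    theta_st[i, j] = p(t_j | s_i), theta_ts[j, i] = p(s_i | t_j).
    fb_fwd / fb_bwd run forward-backward for each direction and return
    link posteriors on a common (I, J) grid."""
    lam = np.zeros(theta_st.shape)         # one multiplier per link (i, j)
    for _ in range(T):
        th_f = theta_st * np.exp(lam)      # line 3: reweight s->t table
        th_b = theta_ts * np.exp(-lam.T)   # line 4: reweight t->s table
        q_f = fb_fwd(th_f)                 # line 5: s->t link posteriors
        q_b = fb_bwd(th_b)                 # line 6: t->s link posteriors
        lam -= step * (q_f - q_b)          # assumed step pushing agreement
    return q_f, q_b
```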


Phrase Pair Extraction
State of the art: using the Viterbi alignment only.
Using all possible alignments:

$$A(i_1, i_2; j_1, j_2) = \{a = a_1^J : a_j \in [i_1, i_2] \text{ iff } j \in [j_1, j_2]\} \quad (15)$$
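Equation (15) is the usual phrase-pair consistency condition; a small sketch of the membership test (0-indexed, with `None` marking unaligned target words as an assumed convention):

```python
def in_A(align, i1, i2, j1, j2):
    """Test a = a_1^J against equation (15): a target position links
    inside [i1, i2] iff it lies inside [j1, j2]."""
    for j, i in enumerate(align):
        if i is None:            # unaligned target word: no constraint
            continue
        if (i1 <= i <= i2) != (j1 <= j <= j2):
            return False
    return True

# Alignment t0-s0, t1-s2, t2-s1:
print(in_A([0, 2, 1], 1, 2, 1, 2))  # True: target [1,2] pairs with source [1,2]
print(in_A([0, 2, 1], 1, 2, 1, 1))  # False: s1 is linked from t2, outside [1,1]
```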

Phrase Pair Extraction via Model-Based Posteriors
Derivation:

$$P(t, A(i_1, i_2; j_1, j_2) \mid s; \theta) = \sum_{a \in A(i_1, i_2; j_1, j_2)} P(t, a \mid s; \theta) \quad (16)$$
$$P(A(i_1, i_2; j_1, j_2) \mid s, t; \theta) = \frac{P(t, A(i_1, i_2; j_1, j_2) \mid s; \theta)}{\sum_a P(t, a \mid s; \theta)} \quad (17)$$
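On a toy sentence pair these sums can be checked by brute force: enumerate every alignment, score it with a joint model such as `hmm_joint_prob` from the earlier sketch, and take the ratio of equation (17). A minimal sketch:

```python
from itertools import product

def phrase_pair_posterior(src, tgt, joint_prob, i1, i2, j1, j2):
    """Equations (16)-(17) by enumeration: the mass of alignments in
    A(i1, i2; j1, j2) divided by the total mass P(t | s; theta).
    joint_prob(align) scores one alignment, e.g. the HMM sketch above.
    Exponential in len(tgt): for illustration on toy inputs only."""
    total = consistent = 0.0
    for align in product(range(len(src)), repeat=len(tgt)):
        p = joint_prob(list(align))
        total += p
        if all((i1 <= i <= i2) == (j1 <= j <= j2)
               for j, i in enumerate(align)):
            consistent += p               # alignment falls in A(...)
    return consistent / total             # equation (17)
```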

Evaluation
Significant gains when used to augment the original phrase extraction strategy.

References
Deng, Y. and Byrne, W. (2008). HMM word and phrase alignment for statistical machine translation. IEEE Transactions on Audio, Speech, and Language Processing, 16(3):494–507.
Ganchev, K., Graca, J., and Taskar, B. (2008). Better alignments = better translations? In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, Columbus, OH.
Vogel, S., Ney, H., and Tillmann, C. (1996). HMM-based word alignment in statistical translation. In Proceedings of the 16th International Conference on Computational Linguistics, pages 836–841, Copenhagen, Denmark.