Natural Language Processing (CSEP 517): Machine Translation

Natural Language Processing (CSEP 517): Machine Translation
Noah Smith
© 2017 University of Washington
nasmith@cs.washington.edu
May 5, 2017

To-Do List: Online quiz due Sunday (Jurafsky and Martin, 2008, ch. 25; Collins, 2011, 2013). A5 due May 28 (Sunday).

Evaluation
Intuition: good translations are fluent in the target language and faithful to the original meaning.
Bleu score (Papineni et al., 2002):
- Compare to a human-generated reference translation; or, better, multiple references.
- Weighted average of n-gram precision (across different n).
There are some alternatives; most papers that use them report Bleu, too.
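To make the n-gram precision idea concrete, here is a minimal Python sketch of Bleu for one candidate against one reference (real evaluations use multiple references, corpus-level counts, and smoothing; the sentences below are toy examples):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    # Clipped n-gram precision against a single reference.
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, reference, max_n=4):
    # Geometric mean of 1..max_n gram precisions, times a brevity penalty.
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # real implementations smooth instead of returning 0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_avg)

candidate = "the cat is on the mat".split()
reference = "there is a cat on the mat".split()
print(round(bleu(candidate, reference, max_n=2), 3))  # bigram Bleu on toy data
```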

Warren Weaver to Norbert Wiener, 1947
"One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."

Noisy Channel Models: Review
A pattern for modeling a pair of random variables, X and Y: source → Y → channel → X.
- Y is the plaintext, the true message, the missing information, the output.
- X is the ciphertext, the garbled message, the observable evidence, the input.
Decoding: select y given X = x.
    y* = argmax_y p(y | x)
       = argmax_y p(x | y) p(y) / p(x)
       = argmax_y p(x | y) p(y)
where p(x | y) is the channel model and p(y) is the source model.
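The decoding rule above is just an argmax over candidates of channel score plus source score. A tiny Python illustration, with made-up probabilities standing in for trained models:

```python
import math

channel = {  # p(x | y): toy channel model
    ("le chien", "the dog"): 0.7,
    ("le chien", "the cat"): 0.1,
}
source = {"the dog": 0.02, "the cat": 0.03}  # p(y): toy source (language) model

def decode(x, candidates):
    # argmax_y log p(x | y) + log p(y)
    return max(candidates,
               key=lambda y: math.log(channel.get((x, y), 1e-12))
                             + math.log(source.get(y, 1e-12)))

print(decode("le chien", ["the dog", "the cat"]))  # -> "the dog"
```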

Bitext/Parallel Text
Let f and e be two sequences over the French vocabulary and the English vocabulary, respectively. We're going to define p(f | e), the probability over French translations of English sentence e. In a noisy channel machine translation system, we could use this together with a source/language model p(e) to decode f into an English translation.
Where does the data to estimate this come from?

IBM Model 1 (Brown et al., 1993)
Let l and m be the (known) lengths of e and f.
Latent variable a = ⟨a_1, ..., a_m⟩, each a_i ranging over {0, ..., l} (positions in e). E.g., a_4 = 3 means that f_4 is aligned to e_3; a_6 = 0 means that f_6 is aligned to a special null symbol, e_0.
    p(f | e, m) = Σ_{a_1=0}^{l} Σ_{a_2=0}^{l} ... Σ_{a_m=0}^{l} p(f, a | e, m) = Σ_{a ∈ {0,...,l}^m} p(f, a | e, m)
    p(f, a | e, m) = ∏_{i=1}^{m} p(a_i | i, l, m) · p(f_i | e_{a_i}) = ∏_{i=1}^{m} (1 / (l + 1)) · θ_{f_i | e_{a_i}} = (1 / (l + 1))^m ∏_{i=1}^{m} θ_{f_i | e_{a_i}}
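As a concrete reading of the formula, here is a small Python sketch that scores one alignment under IBM Model 1; the θ table, sentences, and alignment are invented for illustration:

```python
def model1_joint(f, e, a, theta, null="<null>"):
    # p(f, a | e, m) = (1/(l+1))^m * prod_i theta[f_i | e_{a_i}], with e_0 = null.
    e_padded = [null] + e
    l = len(e)
    p = 1.0
    for i, f_word in enumerate(f):
        p *= (1.0 / (l + 1)) * theta.get((f_word, e_padded[a[i]]), 1e-12)
    return p

theta = {("Noahs", "Noah's"): 0.9, ("Arche", "ark"): 0.8, ("war", "was"): 0.7}
e = ["Noah's", "ark", "was"]
f = ["Noahs", "Arche", "war"]
a = [1, 2, 3]  # f_i is aligned to e_{a_i}; 0 would mean the null word
print(model1_joint(f, e, a, theta))  # (1/4)^3 * 0.9 * 0.8 * 0.7
```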

Example: f is German
e = Mr President, Noah's ark was filled not with production factors, but with living creatures. (l = 17)
f = Noahs Arche war nicht voller Produktionsfaktoren, sondern Geschöpfe.
Building the alignment word by word, a = ⟨4, 5, 6, 8, 7, ?, ...⟩:
    p(f, a | e, m) = 1/(17+1) θ_{Noahs | Noah's} · 1/(17+1) θ_{Arche | ark} · 1/(17+1) θ_{war | was} · 1/(17+1) θ_{nicht | not} · 1/(17+1) θ_{voller | filled} · 1/(17+1) θ_{Produktionsfaktoren | ?} · ...
Problem: this alignment isn't possible with IBM Model 1! Each f_i is aligned to at most one e_{a_i}, but Produktionsfaktoren corresponds to both "production" and "factors".

Example: f is English
e = Noahs Arche war nicht voller Produktionsfaktoren, sondern Geschöpfe. (l = 10)
f = Mr President, Noah's ark was filled not with production factors, but with living creatures.
Building the alignment word by word, a = ⟨0, 0, 0, 1, 2, 3, 5, 4, ...⟩:
    p(f, a | e, m) = 1/(10+1) θ_{Mr | null} · 1/(10+1) θ_{President | null} · 1/(10+1) θ_{, | null} · 1/(10+1) θ_{Noah's | Noahs} · 1/(10+1) θ_{ark | Arche} · 1/(10+1) θ_{was | war} · 1/(10+1) θ_{filled | voller} · 1/(10+1) θ_{not | nicht} · ...

How to Estimate Translation Distributions?
This is a problem of incomplete data: at training time, we see e and f, but not a.
The classical solution is to alternate:
- Given a parameter estimate for θ, align the words.
- Given aligned words, re-estimate θ.
The traditional approach uses soft alignment.

Complete-Data IBM Model 1
Let the training data consist of N word-aligned sentence pairs: ⟨e^(1), f^(1), a^(1)⟩, ..., ⟨e^(N), f^(N), a^(N)⟩.
Define:
    ι(k, i, j) = 1 if a_i^(k) = j, and 0 otherwise.
Maximum likelihood estimate for θ_{f|e}:
    θ_{f|e} = c(e, f) / c(e) = [ Σ_{k=1}^{N} Σ_{i : f_i^(k) = f} Σ_{j : e_j^(k) = e} ι(k, i, j) ] / [ Σ_{k=1}^{N} Σ_{i=1}^{m^(k)} Σ_{j : e_j^(k) = e} ι(k, i, j) ]
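A minimal Python sketch of this count-and-normalize estimate, assuming toy word-aligned pairs (alignment position 0 is reserved for the null word):

```python
from collections import defaultdict

def mle_from_alignments(pairs, null="<null>"):
    # pairs: list of (e, f, a), where a[i] is the position in e aligned to f[i] (0 = null).
    count_ef = defaultdict(float)  # c(e, f)
    count_e = defaultdict(float)   # c(e)
    for e, f, a in pairs:
        e_padded = [null] + e
        for i, f_word in enumerate(f):
            e_word = e_padded[a[i]]
            count_ef[(f_word, e_word)] += 1.0
            count_e[e_word] += 1.0
    # theta[(f, e)] = c(e, f) / c(e), i.e., theta_{f|e}
    return {(f_word, e_word): c / count_e[e_word]
            for (f_word, e_word), c in count_ef.items()}

pairs = [(["the", "house"], ["das", "Haus"], [1, 2]),
         (["the", "book"], ["das", "Buch"], [1, 2])]
theta = mle_from_alignments(pairs)
print(theta[("das", "the")])  # 1.0, since "the" was only ever aligned to "das"
```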

MLE with Soft Counts for IBM Model 1
Let the training data consist of N softly aligned sentence pairs, ⟨e^(1), f^(1)⟩, ..., ⟨e^(N), f^(N)⟩. Now let ι(k, i, j) be soft, interpreted as:
    ι(k, i, j) = p(a_i^(k) = j)
The maximum likelihood estimate for θ_{f|e} has the same form as before:
    θ_{f|e} = [ Σ_{k=1}^{N} Σ_{i : f_i^(k) = f} Σ_{j : e_j^(k) = e} ι(k, i, j) ] / [ Σ_{k=1}^{N} Σ_{i=1}^{m^(k)} Σ_{j : e_j^(k) = e} ι(k, i, j) ]

Expectation Maximization Algorithm for IBM Model 1
1. Initialize θ to some arbitrary values.
2. E step: use the current θ to estimate expected ("soft") counts:
    ι(k, i, j) ← θ_{f_i^(k) | e_j^(k)} / Σ_{j'=0}^{l^(k)} θ_{f_i^(k) | e_{j'}^(k)}
3. M step: carry out soft MLE:
    θ_{f|e} ← [ Σ_{k=1}^{N} Σ_{i : f_i^(k) = f} Σ_{j : e_j^(k) = e} ι(k, i, j) ] / [ Σ_{k=1}^{N} Σ_{i=1}^{m^(k)} Σ_{j : e_j^(k) = e} ι(k, i, j) ]
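A compact Python sketch of the two steps on toy data; initialization is uniform over co-occurring word pairs rather than arbitrary, and the number of iterations is chosen for illustration:

```python
from collections import defaultdict

def em_ibm_model1(pairs, iterations=10, null="<null>"):
    # pairs: list of (e, f) token-list pairs; returns theta[(f_word, e_word)] ~ theta_{f|e}.
    theta = defaultdict(float)
    for e, f in pairs:  # start uniform-ish: every co-occurring pair gets the same value
        for e_word in [null] + e:
            for f_word in f:
                theta[(f_word, e_word)] = 1.0
    for _ in range(iterations):
        count_fe = defaultdict(float)
        count_e = defaultdict(float)
        for e, f in pairs:
            e_padded = [null] + e
            for f_word in f:
                # E step: posterior probability that each e_j generated f_i.
                z = sum(theta[(f_word, e_word)] for e_word in e_padded)
                for e_word in e_padded:
                    soft = theta[(f_word, e_word)] / z
                    count_fe[(f_word, e_word)] += soft
                    count_e[e_word] += soft
        # M step: soft MLE, i.e., normalize the expected counts.
        for (f_word, e_word) in count_fe:
            theta[(f_word, e_word)] = count_fe[(f_word, e_word)] / count_e[e_word]
    return theta

pairs = [(["the", "house"], ["das", "Haus"]),
         (["the", "book"], ["das", "Buch"]),
         (["a", "book"], ["ein", "Buch"])]
theta = em_ibm_model1(pairs)
print(round(theta[("Haus", "house")], 2), round(theta[("Buch", "book")], 2))
```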

Expectation Maximization
- Originally introduced in the 1960s for estimating HMMs when the states really are hidden.
- Can be applied to any generative model with hidden variables.
- Greedily attempts to maximize the probability of the observable data, marginalizing over latent variables. For IBM Model 1, that means:
    max_θ ∏_{k=1}^{N} p_θ(f^(k) | e^(k)) = max_θ ∏_{k=1}^{N} Σ_a p_θ(f^(k), a | e^(k))
- Usually converges only to a local optimum of the above, which is in general not convex. Strangely, for IBM Model 1 (and very few other models), it is convex!

IBM Model 2 (Brown et al., 1993)
Let l and m be the (known) lengths of e and f.
Latent variable a = ⟨a_1, ..., a_m⟩, each a_i ranging over {0, ..., l} (positions in e). E.g., a_4 = 3 means that f_4 is aligned to e_3.
    p(f | e, m) = Σ_{a ∈ {0,...,l}^m} p(f, a | e, m)
    p(f, a | e, m) = ∏_{i=1}^{m} p(a_i | i, l, m) · p(f_i | e_{a_i}) = ∏_{i=1}^{m} δ_{a_i | i, l, m} · θ_{f_i | e_{a_i}}

IBM Models 1 and 2, Depicted
(Figure: a hidden Markov model, with states y_1, y_2, y_3, y_4 emitting x_1, x_2, x_3, x_4, shown alongside IBM Models 1 and 2, where alignment variables a_1, a_2, a_3, a_4 select words of e that emit f_1, f_2, f_3, f_4.)

Variations
- Dyer et al. (2013) introduced a new parameterization:
    δ_{j | i, l, m} ∝ exp( −λ | i/m − j/l | )
  (This is called fast_align.)
- IBM Models 3–5 (Brown et al., 1993) introduced increasingly more powerful ideas, such as fertility and distortion.
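A small Python sketch of that diagonal-favoring parameterization (λ is a free parameter; fast_align's separate treatment of the null word is omitted here):

```python
import math

def fast_align_delta(i, l, m, lam=4.0):
    # delta_{j | i, l, m} proportional to exp(-lam * |i/m - j/l|), normalized over j = 1..l.
    scores = [math.exp(-lam * abs(i / m - j / l)) for j in range(1, l + 1)]
    z = sum(scores)
    return [s / z for s in scores]

# For the 3rd French word (of m = 5) against an English sentence of length l = 4,
# most of the mass sits near the "diagonal" position j close to i * l / m.
print([round(p, 2) for p in fast_align_delta(i=3, l=4, m=5)])
```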

From Alignment to (Phrase-Based) Translation
Obtaining word alignments in a parallel corpus is a common first step in building a machine translation system:
1. Align the words.
2. Extract and score phrase pairs.
3. Estimate a global scoring function to optimize (a proxy for) translation quality.
4. Decode French sentences into English ones.
(We'll discuss 2–4.)
The noisy channel pattern isn't taken quite so seriously when we build real systems, but language models are really, really important nonetheless.

Phrases? Phrase-based translation uses automatically induced phrases... not the ones given by a phrase-structure parser.

Examples of Phrases
Courtesy of Chris Dyer.
    German        English       p(f̄ | ē)
    das Thema     the issue     0.41
                  the point     0.72
                  the subject   0.47
                  the thema     0.99
    es gibt       there is      0.96
                  there are     0.72
    morgen        tomorrow      0.90
    fliege ich    will I fly    0.63
                  will fly      0.17
                  I will fly    0.13

Phrase-Based Translation Model
Originated by Koehn et al. (2003). The random variable A captures the segmentation of the sentences into phrases, the alignment between them, and the reordering.
(Example: f = "Morgen fliege ich nach Pittsburgh zur Konferenz", e = "Tomorrow I will fly to the conference in Pittsburgh", with phrases aligned and reordered by a.)
    p(f, a | e) = p(a | e) ∏_{i=1}^{|a|} p(f̄_i | ē_i)
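Scoring one particular segmentation/alignment is then just a sum of log phrase-translation probabilities plus the log segmentation/alignment prior. A toy Python sketch, with invented phrase-table values:

```python
import math

phrase_table = {  # toy p(fbar | ebar) values
    ("morgen", "tomorrow"): 0.90,
    ("fliege ich", "i will fly"): 0.13,
}

def log_phrase_model_score(phrase_pairs, log_p_a_given_e=0.0):
    # log p(f, a | e) = log p(a | e) + sum_i log p(fbar_i | ebar_i)
    score = log_p_a_given_e
    for f_phrase, e_phrase in phrase_pairs:
        score += math.log(phrase_table.get((f_phrase, e_phrase), 1e-12))
    return score

print(round(log_phrase_model_score([("morgen", "tomorrow"),
                                    ("fliege ich", "i will fly")]), 3))
```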

Extracting Phrases
After inferring word alignments, apply heuristics. (The original slides step through an example here, growing consistent phrase pairs from a word-alignment grid; the figures are not reproduced in this transcription. A sketch of the standard consistency heuristic follows below.)
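The slides do not spell the heuristics out, but the most common one extracts every phrase pair that is consistent with the word alignment (no alignment link crosses the pair's boundary). A Python sketch on a toy alignment, glossing over details such as unaligned boundary words:

```python
def extract_phrases(f_tokens, e_tokens, alignment, max_len=4):
    # alignment: set of (i, j) pairs meaning f_tokens[i] is aligned to e_tokens[j].
    pairs = set()
    for f_start in range(len(f_tokens)):
        for f_end in range(f_start, min(f_start + max_len, len(f_tokens))):
            e_links = [j for (i, j) in alignment if f_start <= i <= f_end]
            if not e_links:
                continue
            e_start, e_end = min(e_links), max(e_links)
            # Consistency: no link inside the English span may point outside the French span.
            consistent = all(f_start <= i <= f_end
                             for (i, j) in alignment if e_start <= j <= e_end)
            if consistent and e_end - e_start < max_len:
                pairs.add((" ".join(f_tokens[f_start:f_end + 1]),
                           " ".join(e_tokens[e_start:e_end + 1])))
    return pairs

f = ["morgen", "fliege", "ich"]
e = ["tomorrow", "i", "will", "fly"]
alignment = {(0, 0), (1, 2), (1, 3), (2, 1)}  # morgen-tomorrow, fliege-will/fly, ich-i
for pair in sorted(extract_phrases(f, e, alignment)):
    print(pair)
```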

Scoring Whole Translations
    s(e, a; f) = β_lm · log p(e) + β_tm · log p(f, a | e) + β_rtm · log p(e, a | f)
where p(e) is the language model, p(f, a | e) is the translation model, and p(e, a | f) is the reverse translation model.
Remarks:
- Segmentation, alignment, and reordering are all predicted as well (not marginalized). This does not factor nicely.
- I am simplifying! The reverse translation model is typically included, as shown.
- Each log-probability is treated as a feature, and the weights β are optimized for Bleu performance.
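A minimal Python sketch of that linear scorer; the feature values and weights are placeholders, and in a real system the weights would be tuned (e.g., by MERT) to maximize Bleu:

```python
def score(features, weights):
    # s(e, a; f) = sum over features of weight * log-probability
    return sum(weights[name] * value for name, value in features.items())

features = {"lm": -12.3,   # log p(e)
            "tm": -8.7,    # log p(f, a | e)
            "rtm": -9.1}   # log p(e, a | f)
weights = {"lm": 1.0, "tm": 0.8, "rtm": 0.6}
print(score(features, weights))
```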

Decoding: Example
Input (Spanish): Maria no dio una bofetada a la bruja verde
(The slide shows a table of candidate translations for each source span, e.g. Maria → Mary; no → not, did not, no, did not give; dio una bofetada → slap; a la → to the, by; bruja → witch, hag; verde → green; la bruja → the witch; bruja verde → green witch. The following slides highlight different entries of this table.)

Decoding
Adapted from Koehn et al. (2006). Typically accomplished with beam search.
- Initial state: ⟨all |f| words uncovered, empty output⟩, with score 0.
- Goal state: ⟨all |f| words covered, e⟩, with (approximately) the highest score.
- Reaching a new state: find an uncovered span f̄ of the input f for which a phrasal translation ⟨f̄, ē⟩ exists. The new state appends ē to the output and covers f̄. The score of the new state adds language model and translation model components to the global score.
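Below is a toy stack/beam decoder in Python in the spirit of this description. The phrase table, the bigram "language model", and the beam width are all invented for illustration; real decoders add reordering limits, future-cost estimation, and hypothesis recombination:

```python
import math
from collections import defaultdict

phrase_table = {  # toy phrase translations with p_tm values
    ("maria",): [("mary", 0.9)],
    ("no",): [("not", 0.4), ("did not", 0.5)],
    ("dio", "una", "bofetada"): [("slap", 0.8)],
}
lm_bigram = defaultdict(lambda: 1e-4, {
    ("<s>", "mary"): 0.4, ("mary", "did"): 0.3, ("did", "not"): 0.5,
    ("not", "slap"): 0.2, ("mary", "not"): 0.05,
})

def lm_logprob(words, prev):
    score = 0.0
    for w in words:
        score += math.log(lm_bigram[(prev, w)])
        prev = w
    return score

def decode(f_tokens, beam_size=5):
    # A hypothesis is (covered source positions, output words, score);
    # stacks[n] holds hypotheses covering exactly n source words.
    stacks = [[] for _ in range(len(f_tokens) + 1)]
    stacks[0].append((frozenset(), (), 0.0))
    for n in range(len(f_tokens)):
        stacks[n].sort(key=lambda h: -h[2])
        for covered, output, score in stacks[n][:beam_size]:
            for start in range(len(f_tokens)):
                for end in range(start + 1, len(f_tokens) + 1):
                    span = set(range(start, end))
                    if covered & span:
                        continue  # span overlaps already-covered words
                    for e_phrase, p_tm in phrase_table.get(tuple(f_tokens[start:end]), []):
                        new_words = tuple(e_phrase.split())
                        prev = output[-1] if output else "<s>"
                        new_hyp = (covered | span, output + new_words,
                                   score + math.log(p_tm) + lm_logprob(new_words, prev))
                        stacks[len(new_hyp[0])].append(new_hyp)
    finished = stacks[len(f_tokens)]
    return max(finished, key=lambda h: h[2])[1] if finished else ()

print(" ".join(decode("maria no dio una bofetada".split())))  # mary did not slap
```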

Decoding Example
Successive partial hypotheses, each pairing a coverage of the source sentence "Maria no dio una bofetada a la bruja verde" with an output prefix and a score:
- ⟨nothing covered, "", 0⟩
- ⟨"Maria" covered, "Mary", log p_lm(Mary) + log p_tm(Maria | Mary)⟩
- ⟨"Maria no" covered, "Mary did not", log p_lm(Mary did not) + log p_tm(Maria | Mary) + log p_tm(no | did not)⟩
- ⟨"Maria no dio una bofetada" covered, "Mary did not slap", log p_lm(Mary did not slap) + log p_tm(Maria | Mary) + log p_tm(no | did not) + log p_tm(dio una bofetada | slap)⟩

Machine Translation: Remarks
- Sometimes phrases are organized hierarchically (Chiang, 2007).
- Extensive research on syntax-based machine translation (Galley et al., 2004), but it requires considerable engineering to match phrase-based systems.
- Recent work on semantics-based machine translation (Jones et al., 2012); remains to be seen!
- Some good pre-neural overviews: Lopez (2008); Koehn (2009).

References I
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1993.
David Chiang. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228, 2007.
Michael Collins. Statistical machine translation: IBM Models 1 and 2, 2011. URL http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/ibm12.pdf.
Michael Collins. Phrase-based translation models, 2013. URL http://www.cs.columbia.edu/~mcollins/pb.pdf.
Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of IBM Model 2. In Proc. of NAACL, 2013.
Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. What's in a translation rule? In Proc. of NAACL, 2004.
Bevan Jones, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, and Kevin Knight. Semantics-based machine translation with hyperedge replacement grammars. In Proc. of COLING, 2012.
Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, second edition, 2008.
Philipp Koehn. Statistical Machine Translation. Cambridge University Press, 2009.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Proc. of NAACL, 2003.

References II
Philipp Koehn, Marcello Federico, Wade Shen, Nicola Bertoldi, Ondrej Bojar, Chris Callison-Burch, Brooke Cowan, Chris Dyer, Hieu Hoang, and Richard Zens. Open source toolkit for statistical machine translation: Factored translation models and confusion network decoding, 2006. Final report of the 2006 JHU summer workshop.
Adam Lopez. Statistical machine translation. ACM Computing Surveys, 40(3):8, 2008.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proc. of ACL, 2002.