Machine Translation. CL1: Jordan Boyd-Graber. University of Maryland. November 11, 2013

Machine Translation
CL1: Jordan Boyd-Graber, University of Maryland
November 11, 2013
Adapted from material by Philipp Koehn

Roadmap
- Introduction to MT
- Components of an MT system
- Word-based models
- Beyond word-based models

Outline
1 Introduction
2 Word Based Translation Systems
3 Learning the Models
4 Everything Else

What unlocks translations?
Humans need parallel text to understand new languages when no speakers are around
Rosetta Stone: allowed us to understand Egyptian
Computers need the same information
Where do we get them?
- Some governments require translations (Canada, EU, Hong Kong)
- Newspapers
- Internet

Pieces of a Machine Translation System

Terminology
Source language: f (foreign)
Target language: e (English)

Outline
1 Introduction
2 Word Based Translation Systems
3 Learning the Models
4 Everything Else

Collect Statistics
Look at a parallel corpus (German text along with English translation)

Translation of Haus   Count
house                 8,000
building              1,600
home                    200
household               150
shell                    50

Estimate Translation Probabilities
Maximum likelihood estimation

p_f(e) = 0.8    if e = house,
         0.16   if e = building,
         0.02   if e = home,
         0.015  if e = household,
         0.005  if e = shell.
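
These maximum likelihood estimates are just relative frequencies of the counts on the previous slide. A minimal sketch for the German word Haus (the variable names are illustrative, not from the lecture):

    # MLE translation probabilities from the Haus counts above
    counts = {"house": 8000, "building": 1600, "home": 200,
              "household": 150, "shell": 50}
    total = sum(counts.values())
    t_haus = {e: c / total for e, c in counts.items()}
    print(t_haus["house"])  # 0.8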

Alignment
In a parallel text (or when we translate), we align words in one language with the words in the other

  1    2     3    4
  das  Haus  ist  klein

  the  house  is  small
  1    2      3   4

Word positions are numbered 1-4

Alignment Function
Formalizing alignment with an alignment function
Mapping an English target word at position i to a German source word at position j with a function a : i → j
Example: a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}
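
In code, an alignment function is simply a mapping from English positions to German positions. A small sketch (the dictionary representation is my own choice, not part of the lecture):

    # das Haus ist klein -> the house is small
    german = ["das", "Haus", "ist", "klein"]
    english = ["the", "house", "is", "small"]
    a = {1: 1, 2: 2, 3: 3, 4: 4}  # English position -> German position
    for j, e_word in enumerate(english, start=1):
        print(e_word, "aligned to", german[a[j] - 1])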

Reordering
Words may be reordered during translation

  1      2    3    4
  klein  ist  das  Haus

  the  house  is  small
  1    2      3   4

a : {1 → 3, 2 → 4, 3 → 2, 4 → 1}

One-to-Many Translation
A source word may translate into multiple target words

  1    2     3    4
  das  Haus  ist  klitzeklein

  the  house  is  very  small
  1    2      3   4     5

a : {1 → 1, 2 → 2, 3 → 3, 4 → 4, 5 → 4}

Dropping Words
Words may be dropped when translated (the German article das is dropped)

  1    2     3    4
  das  Haus  ist  klein

  house  is  small
  1      2   3

a : {1 → 2, 2 → 3, 3 → 4}

Inserting Words
Words may be added during translation
- The English just does not have an equivalent in German
- We still need to map it to something: special NULL token

  0     1    2     3    4
  NULL  das  Haus  ist  klein

  the  house  is  just  small
  1    2      3   4     5

a : {1 → 1, 2 → 2, 3 → 3, 4 → 0, 5 → 4}

A Family of Lexical Translation Models
Uncreatively named: Model 1, Model 2, ...
Foundation of all modern translation algorithms
First up: Model 1

IBM Model 1
Generative model: break up the translation process into smaller steps
- IBM Model 1 only uses lexical translation
Translation probability
- for a foreign sentence f = (f_1, ..., f_{l_f}) of length l_f
- to an English sentence e = (e_1, ..., e_{l_e}) of length l_e
- with an alignment of each English word e_j to a foreign word f_i according to the alignment function a : j → i

p(e, a | f) = ε / (l_f + 1)^{l_e} · ∏_{j=1}^{l_e} t(e_j | f_{a(j)})

- the parameter ε is a normalization constant
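
The Model 1 probability is easy to compute directly from a translation table. A minimal sketch of the formula above (function and variable names are my own, with ε set to 1):

    # p(e, a | f) = eps / (l_f + 1)**l_e * prod_j t(e_j | f_{a(j)})
    def model1_prob(english, foreign, a, t, eps=1.0):
        l_e, l_f = len(english), len(foreign)
        p = eps / (l_f + 1) ** l_e
        for j, e_word in enumerate(english, start=1):
            f_word = "NULL" if a[j] == 0 else foreign[a[j] - 1]
            p *= t.get((e_word, f_word), 0.0)
        return p

    # e.g. model1_prob(["the", "house", "is", "small"],
    #                  ["das", "Haus", "ist", "klein"],
    #                  {1: 1, 2: 2, 3: 3, 4: 4}, t_table)   # t_table is hypothetical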

Example

das              Haus               ist               klein
e       t(e|f)   e         t(e|f)   e        t(e|f)   e        t(e|f)
the     0.7      house     0.8      is       0.8      small    0.4
that    0.15     building  0.16     's       0.16     little   0.4
which   0.075    home      0.02     exists   0.02     short    0.1
who     0.05     family    0.015    has      0.015    minor    0.06
this    0.025    shell     0.005    are      0.005    petty    0.04

p(e, a | f) = ε/4³ · t(the | das) · t(house | Haus) · t(is | ist) · t(small | klein)
            = ε/4³ · 0.7 · 0.8 · 0.8 · 0.4
            = 0.0028 ε
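
A quick arithmetic check of the product on this slide (with ε left as a symbolic factor):

    # product of the four lexical translation probabilities, divided by 4**3
    p = 0.7 * 0.8 * 0.8 * 0.4 / 4 ** 3
    print(round(p, 4))  # 0.0028 (times epsilon)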

Outline
1 Introduction
2 Word Based Translation Systems
3 Learning the Models
4 Everything Else

Learning Lexical Translation Models
We would like to estimate the lexical translation probabilities t(e|f) from a parallel corpus
... but we do not have the alignments
Chicken-and-egg problem
- if we had the alignments → we could estimate the parameters of our generative model
- if we had the parameters → we could estimate the alignments

EM Algorithm
Incomplete data
- if we had complete data, we could estimate the model
- if we had the model, we could fill in the gaps in the data
Expectation Maximization (EM) in a nutshell
1 initialize model parameters (e.g. uniform)
2 assign probabilities to the missing data
3 estimate model parameters from completed data
4 iterate steps 2-3 until convergence

EM Algorithm
... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...
Initial step: all alignments equally likely
Model learns that, e.g., la is often aligned with the

EM Algorithm
... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...
After one iteration
Alignments, e.g., between la and the are more likely

EM Algorithm
... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...
After another iteration
It becomes apparent that alignments, e.g., between fleur and flower are more likely (pigeonhole principle)

EM Algorithm
... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...
Convergence
Inherent hidden structure revealed by EM

EM Algorithm
... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...

p(la | the) = 0.453
p(le | the) = 0.334
p(maison | house) = 0.876
p(bleu | blue) = 0.563
...

Parameter estimation from the aligned corpus

IBM Model 1 and EM
The EM Algorithm consists of two steps
Expectation-Step: apply the model to the data
- parts of the model are hidden (here: alignments)
- using the model, assign probabilities to possible values
Maximization-Step: estimate the model from the data
- take the assigned values as fact
- collect counts (weighted by probabilities)
- estimate the model from the counts
Iterate these steps until convergence

IBM Model 1 and EM
We need to be able to compute:
- Expectation-Step: probability of alignments
- Maximization-Step: count collection

IBM Model 1 and EM

Probabilities
p(the | la) = 0.7        p(house | la) = 0.05
p(the | maison) = 0.1    p(house | maison) = 0.8

Alignments for the sentence pair la maison / the house
the→la, house→maison:      p(e,a | f) = 0.56     p(a | e,f) = 0.824
the→la, house→la:          p(e,a | f) = 0.035    p(a | e,f) = 0.052
the→maison, house→maison:  p(e,a | f) = 0.08     p(a | e,f) = 0.118
the→maison, house→la:      p(e,a | f) = 0.005    p(a | e,f) = 0.007

Counts
c(the | la) = 0.824 + 0.052        c(house | la) = 0.052 + 0.007
c(the | maison) = 0.118 + 0.007    c(house | maison) = 0.824 + 0.118
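
These numbers are easy to reproduce: the alignment probabilities are products of lexical probabilities, the posteriors are those products renormalized, and the counts are posterior-weighted tallies. A minimal sketch (assuming, as the slide appears to, that the ε/(l_f+1)^{l_e} factor cancels and NULL is ignored):

    from itertools import product

    t = {("the", "la"): 0.7, ("house", "la"): 0.05,
         ("the", "maison"): 0.1, ("house", "maison"): 0.8}
    french = ["la", "maison"]

    # joint probability of each alignment (one French word per English word)
    joint = {a: t[("the", a[0])] * t[("house", a[1])]
             for a in product(french, repeat=2)}
    z = sum(joint.values())
    posterior = {a: p / z for a, p in joint.items()}   # p(a | e, f)

    # posterior-weighted counts c(e | f)
    counts = {}
    for (f_the, f_house), p in posterior.items():
        counts[("the", f_the)] = counts.get(("the", f_the), 0.0) + p
        counts[("house", f_house)] = counts.get(("house", f_house), 0.0) + p
    print({k: round(v, 3) for k, v in posterior.items()})
    print({k: round(v, 3) for k, v in counts.items()})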

IBM Model 1 and EM: Expectation Step
We need to compute p(a | e, f)
Applying the chain rule:

p(a | e, f) = p(e, a | f) / p(e | f)

We already have the formula for p(e, a | f) (definition of Model 1)

IBM Model 1 and EM: Expectation Step
We need to compute p(e | f)

p(e | f) = Σ_a p(e, a | f)
         = Σ_{a(1)=0}^{l_f} ... Σ_{a(l_e)=0}^{l_f} p(e, a | f)
         = Σ_{a(1)=0}^{l_f} ... Σ_{a(l_e)=0}^{l_f} ε/(l_f + 1)^{l_e} ∏_{j=1}^{l_e} t(e_j | f_{a(j)})

IBM Model 1 and EM: Expectation Step

p(e | f) = Σ_{a(1)=0}^{l_f} ... Σ_{a(l_e)=0}^{l_f} ε/(l_f + 1)^{l_e} ∏_{j=1}^{l_e} t(e_j | f_{a(j)})
         = ε/(l_f + 1)^{l_e} ∏_{j=1}^{l_e} Σ_{i=0}^{l_f} t(e_j | f_i)

Note the algebra trick in the last line
- removes the need for an exponential number of products
- this makes IBM Model 1 estimation tractable

The Trick (case l_e = l_f = 2)

Σ_{a(1)=0}^{2} Σ_{a(2)=0}^{2} ε/3² ∏_{j=1}^{2} t(e_j | f_{a(j)})
= ε/3² [ t(e_1|f_0) t(e_2|f_0) + t(e_1|f_0) t(e_2|f_1) + t(e_1|f_0) t(e_2|f_2)
       + t(e_1|f_1) t(e_2|f_0) + t(e_1|f_1) t(e_2|f_1) + t(e_1|f_1) t(e_2|f_2)
       + t(e_1|f_2) t(e_2|f_0) + t(e_1|f_2) t(e_2|f_1) + t(e_1|f_2) t(e_2|f_2) ]
= ε/3² [ t(e_1|f_0) (t(e_2|f_0) + t(e_2|f_1) + t(e_2|f_2))
       + t(e_1|f_1) (t(e_2|f_0) + t(e_2|f_1) + t(e_2|f_2))
       + t(e_1|f_2) (t(e_2|f_0) + t(e_2|f_1) + t(e_2|f_2)) ]
= ε/3² (t(e_1|f_0) + t(e_1|f_1) + t(e_1|f_2)) · (t(e_2|f_0) + t(e_2|f_1) + t(e_2|f_2))
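
The identity is just distributivity, and it is easy to spot-check numerically. A small sketch with made-up probabilities (the values and variable names are illustrative only):

    from itertools import product

    # made-up t(e_j | f_i) values for l_e = l_f = 2 (f_0 is the NULL word)
    t = [[0.2, 0.5, 0.3],   # t(e_1 | f_0), t(e_1 | f_1), t(e_1 | f_2)
         [0.1, 0.6, 0.3]]   # t(e_2 | f_0), t(e_2 | f_1), t(e_2 | f_2)

    # left-hand side: sum over all 3**2 alignments of the product
    lhs = sum(t[0][a1] * t[1][a2] for a1, a2 in product(range(3), repeat=2))
    # right-hand side: product over positions of the per-position sums
    rhs = sum(t[0]) * sum(t[1])
    print(lhs, rhs)  # equal for any choice of t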

IBM Model 1 and EM: Expectation Step
Combine what we have:

p(a | e, f) = p(e, a | f) / p(e | f)
            = [ ε/(l_f + 1)^{l_e} ∏_{j=1}^{l_e} t(e_j | f_{a(j)}) ] / [ ε/(l_f + 1)^{l_e} ∏_{j=1}^{l_e} Σ_{i=0}^{l_f} t(e_j | f_i) ]
            = ∏_{j=1}^{l_e} t(e_j | f_{a(j)}) / Σ_{i=0}^{l_f} t(e_j | f_i)

IBM Model 1 and EM: Maximization Step
Now we have to collect counts
Evidence from a sentence pair (e, f) that word e is a translation of word f:

c(e | f; e, f) = Σ_a p(a | e, f) Σ_{j=1}^{l_e} δ(e, e_j) δ(f, f_{a(j)})

With the same simplification as before:

c(e | f; e, f) = t(e | f) / Σ_{i=0}^{l_f} t(e | f_i) · Σ_{j=1}^{l_e} δ(e, e_j) · Σ_{i=0}^{l_f} δ(f, f_i)

IBM Model 1 and EM: Maximization Step
After collecting these counts over a corpus, we can estimate the model:

t(e | f) = Σ_{(e,f)} c(e | f; e, f) / Σ_e Σ_{(e,f)} c(e | f; e, f)

IBM Model 1 and EM: Pseudocode

 1: initialize t(e|f) uniformly
 2: while not converged do
 3:   {initialize}
 4:   count(e|f) = 0 for all e, f
 5:   total(f) = 0 for all f
 6:   for all sentence pairs (e, f) do
 7:     {compute normalization}
 8:     for all words e in e do
 9:       s-total(e) = 0
10:       for all words f in f do
11:         s-total(e) += t(e|f)
12:     {collect counts}
13:     for all words e in e do
14:       for all words f in f do
15:         count(e|f) += t(e|f) / s-total(e)
16:         total(f) += t(e|f) / s-total(e)
17:   {estimate probabilities}
18:   for all foreign words f do
19:     for all English words e do
20:       t(e|f) = count(e|f) / total(f)
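
For reference, a compact executable version of this pseudocode in Python. It is a sketch, not the course's official implementation: the variable names, the convergence test (a fixed number of iterations), and the omission of the NULL word are my own choices.

    from collections import defaultdict

    def ibm_model1(sentence_pairs, iterations=10):
        """EM training of IBM Model 1 lexical translation probabilities t(e|f)."""
        f_vocab = {f for _, fs in sentence_pairs for f in fs}
        t = defaultdict(lambda: 1.0 / len(f_vocab))   # initialize t(e|f) uniformly
        for _ in range(iterations):                   # "while not converged"
            count = defaultdict(float)                # count(e|f) = 0
            total = defaultdict(float)                # total(f) = 0
            for es, fs in sentence_pairs:
                # compute normalization s-total(e)
                s_total = {e: sum(t[(e, f)] for f in fs) for e in es}
                # collect counts
                for e in es:
                    for f in fs:
                        count[(e, f)] += t[(e, f)] / s_total[e]
                        total[f] += t[(e, f)] / s_total[e]
            # estimate probabilities
            for (e, f), c in count.items():
                t[(e, f)] = c / total[f]
        return t

    # toy corpus from the convergence slide below
    corpus = [("the house".split(), "das Haus".split()),
              ("the book".split(), "das Buch".split()),
              ("a book".split(), "ein Buch".split())]
    t = ibm_model1(corpus)
    # both approach 1, as in the convergence table
    print(round(t[("the", "das")], 4), round(t[("book", "Buch")], 4))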

Convergence

das Haus ↔ the house
das Buch ↔ the book
ein Buch ↔ a book

e      f      initial  1st it.  2nd it.  ...  final
the    das     0.25     0.5      0.6364   ...   1
book   das     0.25     0.25     0.1818   ...   0
house  das     0.25     0.25     0.1818   ...   0
the    Buch    0.25     0.25     0.1818   ...   0
book   Buch    0.25     0.5      0.6364   ...   1
a      Buch    0.25     0.25     0.1818   ...   0
book   ein     0.25     0.5      0.4286   ...   0
a      ein     0.25     0.5      0.5714   ...   1
the    Haus    0.25     0.5      0.4286   ...   0
house  Haus    0.25     0.5      0.5714   ...   1

Ensuring Fluent Output
Our translation model cannot decide between small and little
Sometimes one is preferred over the other:
- small step: 2,070,000 occurrences in the Google index
- little step: 257,000 occurrences in the Google index
Language model
- estimates how likely a string is to be English
- based on n-gram statistics

p(e) = p(e_1, e_2, ..., e_n)
     = p(e_1) p(e_2 | e_1) ... p(e_n | e_1, e_2, ..., e_{n-1})
     ≈ p(e_1) p(e_2 | e_1) ... p(e_n | e_{n-2}, e_{n-1})
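
A minimal sketch of the trigram approximation in the last line, with maximum likelihood estimates from counts (the counting scheme, the toy corpus, and the lack of smoothing or sentence boundaries are simplifications of my own):

    from collections import Counter

    def trigram_lm(corpus):
        """MLE trigram model: p(w3 | w1, w2) = c(w1, w2, w3) / c(w1, w2)."""
        tri, bi = Counter(), Counter()
        for sent in corpus:
            words = sent.split()
            for i in range(len(words) - 2):
                tri[tuple(words[i:i + 3])] += 1
                bi[tuple(words[i:i + 2])] += 1
        return lambda w1, w2, w3: tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

    p = trigram_lm(["the house is small", "the house is big"])
    print(p("house", "is", "small"))  # 0.5 under this toy corpus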

Noisy Channel Model
We would like to integrate a language model
Bayes rule:

argmax_e p(e | f) = argmax_e p(f | e) p(e) / p(f)
                  = argmax_e p(f | e) p(e)
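
In other words, each candidate translation is scored by translation model times language model. A toy illustration of the decision rule over an explicit candidate list (the candidates and scores below are invented; real decoders search a far larger space):

    import math

    def noisy_channel_best(f, candidates, log_p_f_given_e, log_p_e):
        """argmax_e  log p(f|e) + log p(e)  over a fixed candidate list."""
        return max(candidates, key=lambda e: log_p_f_given_e(f, e) + log_p_e(e))

    # invented toy scores, for illustration only
    tm = {("das Haus ist klein", "the house is small"): math.log(0.0028),
          ("das Haus ist klein", "the house is little"): math.log(0.0028)}
    lm = {"the house is small": math.log(1e-4),
          "the house is little": math.log(1e-5)}
    best = noisy_channel_best("das Haus ist klein", list(lm),
                              lambda f, e: tm[(f, e)], lambda e: lm[e])
    print(best)  # "the house is small" wins on the language model score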

Noisy Channel Model
Applying Bayes rule is also called the noisy channel model
- we observe a distorted message R (here: a foreign string f)
- we have a model of how the message is distorted (here: the translation model)
- we have a model of which messages are probable (here: the language model)
- we want to recover the original message S (here: an English string e)

Outline
1 Introduction
2 Word Based Translation Systems
3 Learning the Models
4 Everything Else

Higher IBM Models
IBM Model 1: lexical translation
IBM Model 2: adds absolute reordering model
IBM Model 3: adds fertility model
IBM Model 4: relative reordering model
IBM Model 5: fixes deficiency
Only IBM Model 1 has a global maximum
- training of a higher IBM model builds on the previous model
Computationally, the biggest change is in Model 3
- the trick to simplify estimation does not work anymore
  → exhaustive count collection becomes computationally too expensive
- sampling over high-probability alignments is used instead

Legacy
IBM Models were the pioneering models in statistical machine translation
Introduced important concepts
- generative model
- EM training
- reordering models
Only used for niche applications as a translation model
... but still in common use for word alignment (e.g., the GIZA++ toolkit)

Word Alignment
Given a sentence pair, which words correspond to each other?

michael geht davon aus, dass er im haus bleibt
michael assumes that he will stay in the house

Word Alignment?

john wohnt hier nicht
john does not live here

Is the English word does aligned to the German wohnt (verb) or nicht (negation) or neither?

Word Alignment?

john biss ins grass
john kicked the bucket

How do the idioms kicked the bucket and biss ins grass match up?
Outside this exceptional context, bucket is never a good translation for grass

Summary
- Lexical translation
- Alignment
- Expectation Maximization (EM) Algorithm
- Noisy Channel Model
- IBM Models
- Word Alignment
Much more in CL2!