Lexical Translation Models II


Lexical Translation Models II
Machine Translation Lecture 5
Instructor: Chris Callison-Burch
TAs: Mitchell Stern, Justin Chiu
Website: mt-class.org/penn

Last Time...

p(\text{Translation}) = \sum_{\text{Alignment}} p(\text{Alignment}, \text{Translation}) = \sum_{\text{Alignment}} p(\text{Alignment}) \, p(\text{Translation} \mid \text{Alignment})

p(e \mid f, m) = \sum_{a \in [0,n]^m} p(a \mid f, m) \prod_{i=1}^{m} p(e_i \mid f_{a_i})

p(e \mid f, m) = \sum_{a \in [0,n]^m} p(a \mid f, m) \prod_{i=1}^{m} p(e_i \mid f_{a_i})

Alternate ways of defining the translation probability:
p(e_i \mid f_{a_i}, f_{a_{i-1}})
p(e_i \mid f_{a_i}, f_{a_i - 1})
p(e_i \mid f_{a_i}, e_{i-1})
p(e_i, e_{i+1} \mid f_{a_i})

What is the problem here?

p(e \mid f, m) = \sum_{a \in [0,n]^m} p(a \mid f, m) \prod_{i=1}^{m} p(e_i \mid f_{a_i})
             = \sum_{a \in [0,n]^m} \underbrace{\frac{1}{(1+n)^m}}_{p(a \mid f, m)} \prod_{i=1}^{m} p(e_i \mid f_{a_i})

Can we do something better than the uniform \frac{1}{(1+n)^m}?

             = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i) \, p(e_i \mid f_{a_i})
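Because the uniform alignment prior factorizes across target positions, the sum over all alignments collapses into a product of per-word sums. A minimal sketch of that Model 1 computation, assuming a hypothetical nested dict `t[e_word][f_word]` holding the lexical translation probabilities:

```python
import math

def model1_log_likelihood(e, f, t):
    """log p(e | f, m) under IBM Model 1 (uniform alignment over f plus NULL).

    e: target sentence (list of words), length m
    f: source sentence (list of words), length n
    t: hypothetical nested dict, t[e_word][f_word] = p(e_word | f_word)
    """
    f_with_null = ["<NULL>"] + f          # position 0 is the NULL word
    n = len(f)
    log_p = 0.0
    for e_i in e:
        # sum_a prod_i 1/(1+n) t(e_i|f_{a_i}) = prod_i 1/(1+n) sum_j t(e_i|f_j)
        s = sum(t.get(e_i, {}).get(f_j, 1e-12) for f_j in f_with_null)
        log_p += math.log(s / (1 + n))
    return log_p
```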

[Alignment diagrams: source "das Haus ist klein" (with NULL at position 0) aligned to "the house is small", and the same source aligned to the reordered target "house is small the".]

p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i) \, p(e_i \mid f_{a_i})

Model 2:

p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid i, m, n) \, p(e_i \mid f_{a_i})

Model 2:

p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid i, m, n) \, p(e_i \mid f_{a_i})

Model alignment with an absolute position distribution: p(a_i \mid i, m, n) is the probability of translating the foreign word at position a_i to generate the word at position i (with target length m and source length n).

EM training of this model is almost the same as with Model 1 (the same conditional independencies hold).
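Concretely, the E-step looks just like Model 1's: the posterior over each a_i is the alignment prior times the lexical probability, renormalized per position. A hedged sketch, assuming hypothetical tables `t[e][f]` and `a_prob[(j, i, m, n)]`:

```python
def model2_posterior(i, e_i, f, m, t, a_prob):
    """Posterior p(a_i = j | e, f) over source positions j = 0..n (0 = NULL).

    Model 2 keeps Model 1's conditional independencies, so the posterior
    factorizes per target position: prior * lexical probability, renormalized.
    """
    f_with_null = ["<NULL>"] + f
    n = len(f)
    scores = [a_prob.get((j, i, m, n), 1e-12) * t.get(e_i, {}).get(f_j, 1e-12)
              for j, f_j in enumerate(f_with_null)]
    z = sum(scores)
    return [s / z for s in scores]
```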

Model 2:

p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid i, m, n) \, p(e_i \mid f_{a_i})

[Alignment example: "natürlich ist das haus klein" / "of course the house is small".]

Model 2:

p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid i, m, n) \, p(e_i \mid f_{a_i})

Pros:
- Non-uniform alignment model
- Fast EM training / marginal inference

Cons:
- Absolute position is very naive
- How many parameters are needed to model p(a_i \mid i, m, n)?

Model 2:

p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid i, m, n) \, p(e_i \mid f_{a_i})

How much do we know when we only know the source & target lengths and the current position? How many parameters do we actually need to model this?

[Diagram: alignment grid between source positions j' = 1..5 (plus NULL) and target positions i = 1..6, i.e. n = 5, m = 6.]

Model 2:

p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid i, m, n) \, p(e_i \mid f_{a_i})

h(j, i, m, n) = -\left| \frac{i}{m} - \frac{j}{n} \right|   (i: position in target, j: position in source; m: target length, n: source length)

b(j \mid i, m, n) = \frac{\exp h(j, i, m, n)}{\sum_{j'} \exp h(j', i, m, n)}

p(a_i \mid i, m, n) = \begin{cases} p_0 & \text{if } a_i = 0 \\ (1 - p_0)\, b(a_i \mid i, m, n) & \text{otherwise} \end{cases}

[Diagram: the same m = 6 by n = 5 alignment grid as on the previous slide.]
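A sketch of this reparameterization (assumptions: h is the negative absolute difference of relative positions, as in fast_align, plus a sharpness parameter lam that the slide's formula omits; the p0 and lam values are illustrative):

```python
import math

def h(j, i, m, n):
    # Prefer source positions j whose relative offset matches the target's.
    return -abs(i / m - j / n)

def b(j, i, m, n, lam=1.0):
    # Softmax over the non-NULL source positions j' = 1..n.
    z = sum(math.exp(lam * h(jp, i, m, n)) for jp in range(1, n + 1))
    return math.exp(lam * h(j, i, m, n)) / z

def alignment_prior(j, i, m, n, p0=0.2, lam=1.0):
    # p(a_i = j | i, m, n): reserve p0 for NULL (j = 0), spread the rest via b().
    return p0 if j == 0 else (1 - p0) * b(j, i, m, n, lam)
```

With this form the whole distribution is governed by a couple of scalar parameters instead of a separate entry for every (j, i, m, n) tuple.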

Words reorder in groups. Model this!

p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i) \, p(e_i \mid f_{a_i})

Model 2:
p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid i, m, n) \, p(e_i \mid f_{a_i})

HMM:
p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid a_{i-1}) \, p(e_i \mid f_{a_i})

HMM:
p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid a_{i-1}) \, p(e_i \mid f_{a_i})

Insight: words translate in groups, so condition on the previous alignment position. p(a_i \mid a_{i-1}) is the probability of translating the foreign word at position a_i given that the previous position translated was a_{i-1}.

EM training of this model uses the forward-backward algorithm (dynamic programming).

HMM:
p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid a_{i-1}) \, p(e_i \mid f_{a_i})

Improvement: model jumps through the source sentence:

p(a_i \mid a_{i-1}) = j(a_i - a_{i-1})

Example jump distribution:

\Delta   -4       -3       -2     -1     0        1     2      3      4
j(\Delta) 0.0008  0.0015   0.08   0.18   0.0881   0.4   0.16   0.064  0.0256

This is a relative position model rather than an absolute position model.

HMM:
p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid a_{i-1}) \, p(e_i \mid f_{a_i})

Be careful! NULLs must be handled carefully. Here is one option (due to Och):

p(a_i \mid a_{n_i}) = \begin{cases} p_0 & \text{if } a_i = 0 \\ (1 - p_0)\, j(a_i - a_{n_i}) & \text{otherwise} \end{cases}

where n_i is the index of the first non-NULL-aligned word in the alignment to the left of i.
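A minimal forward-algorithm sketch for the HMM alignment model, assuming a hypothetical jump table `jump[d]` for d = a_i - a_{i-1} and lexical table `t[e][f]`; NULL alignments and the Och treatment above are omitted for brevity:

```python
def hmm_likelihood(e, f, t, jump):
    """p(e | f) = sum over alignments of prod_i p(a_i | a_{i-1}) t(e_i | f_{a_i}).

    Dynamic program: alpha[j] = p(e_1..e_i, a_i = j), updated left to right.
    The first word's alignment is taken to be uniform (an assumption).
    """
    n = len(f)
    alpha = [(1.0 / n) * t.get(e[0], {}).get(f[j], 1e-12) for j in range(n)]
    for e_i in e[1:]:
        new_alpha = []
        for j in range(n):
            # Sum over the previous alignment position, weighted by the jump prob.
            trans = sum(alpha[jp] * jump.get(j - jp, 1e-12) for jp in range(n))
            new_alpha.append(trans * t.get(e_i, {}).get(f[j], 1e-12))
        alpha = new_alpha
    return sum(alpha)
```

Adding the corresponding backward pass gives the posterior alignment marginals needed for the EM updates.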

HMM:
p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid a_{i-1}) \, p(e_i \mid f_{a_i})

Other extensions (certain word types are more likely to be reordered):
- Condition the jump probability on the previous word translated: j(\Delta \mid f), or on its word class: j(\Delta \mid C(f))
- Condition the jump probability on the previous word translated and how it was translated: j(\Delta \mid f, e), or with word classes: j(\Delta \mid A(f), B(e))

Fertility Models

The models we have considered so far have been efficient, but this efficiency has come at a modeling cost: what is to stop the model from translating a word 0, 1, 2, or 100 times? We introduce fertility models to deal with this.

IBM Model 3

Fertility

Fertility: the number of English words generated by a foreign word, modeled by a categorical distribution n(\phi \mid f).

Examples:

φ    Unabhaengigkeitserklaerung    zum (= zu + dem)    Haus
0    0.00008                       0.01                0.01
1    0.1                           0                   0.92
2    0.0002                        0.9                 0.07
3    0.8                           0.0009              0
4    0.009                         0.0001              0
5    0                             0                   0

Fertility

p(e \mid f, m) = \sum_{a \in [0,n]^m} p(a \mid f, m) \prod_{i=1}^{m} p(e_i \mid f_{a_i})

Fertility models mean that we can no longer exploit conditional independencies to write p(a \mid f, m) as a series of local alignment decisions. How do we compute the statistics required for EM training?

EM Recipe reminder

If alignment points were visible, training fertility models would be easy; we would simply count:

n(\phi = 3 \mid f = \text{Unabhaengigkeitserklaerung}) = \frac{\text{count}(3, \text{Unabhaengigkeitserklaerung})}{\text{count}(\text{Unabhaengigkeitserklaerung})}

But alignments are not visible, so we use expected counts:

n(\phi = 3 \mid f = \text{Unabhaengigkeitserklaerung}) = \frac{E[\text{count}(3, \text{Unabhaengigkeitserklaerung})]}{E[\text{count}(\text{Unabhaengigkeitserklaerung})]}
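If we could enumerate (or sample) alignments together with posterior weights, the expected counts would accumulate as below. This is a sketch under that assumption; `weighted_alignments` and the 1-based source indexing are illustrative names, not part of the slide:

```python
from collections import defaultdict

def expected_fertility_counts(weighted_alignments, f):
    """Accumulate E[count(phi, f_word)] and E[count(f_word)].

    weighted_alignments: iterable of (a, w) pairs, where a[i] is the source
    index (0 = NULL) generating target position i, and w is that alignment's
    (approximate) posterior probability.
    """
    counts = defaultdict(float)   # (phi, f_word) -> expected count
    totals = defaultdict(float)   # f_word -> expected count
    for a, w in weighted_alignments:
        fert = defaultdict(int)
        for j in a:
            fert[j] += 1          # how many target words each source index got
        for j, f_word in enumerate(f, start=1):
            counts[(fert[j], f_word)] += w
            totals[f_word] += w
    return counts, totals
```

The fertility parameters are then n(\phi \mid f) ≈ counts[(\phi, f)] / totals[f]; the hard part, as the next slides discuss, is obtaining the weights w when the posterior no longer factorizes.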

Expectation & Fertility

We need to compute expected counts under p(a \mid f, e, m). Unfortunately, p(a \mid f, e, m) doesn't factorize nicely. :( Can we sum exhaustively? How many different a's are there? What to do?

Sampling Alignments

Monte Carlo methods:
- Gibbs sampling
- Importance sampling
- Particle filtering

For historical reasons:
- Use the Model 2 alignment to start (easy!)
- Take a weighted sum over all alignment configurations that are close to this alignment configuration

Is this correct? No! Does it work? Sort of.

Lexical Translation

IBM Models 1-5 [Brown et al., 1993] (widely used Giza++ toolkit):
- Model 1: lexical translation, uniform alignment
- Model 2: absolute position model
- Model 3: fertility
- Model 4: relative position model (jumps in target string)
- Model 5: non-deficient model

HMM translation model [Vogel et al., 1996]:
- Relative position model (jumps in source string)

The latent variables are more useful these days than the translations.

Pitfalls of Conditional Models

[Figure: an IBM Model 4 alignment compared with our model's alignment.]

A few tricks...

p(f \mid e) vs. p(e \mid f)
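The slide contrasts the two conditional directions. One common trick (an assumption here, though it matches the HW suggestion below about merging one-to-many and many-to-one alignments) is to train a model in each direction and then combine the two alignments, e.g. by intersection or union:

```python
def symmetrize(e2f, f2e, method="intersection"):
    """Combine alignments obtained from the two translation directions.

    e2f, f2e: sets of (i, j) links (target position i, source position j);
    f2e is assumed to have been flipped into the same (i, j) orientation.
    """
    if method == "intersection":
        return e2f & f2e      # fewer links, higher precision
    if method == "union":
        return e2f | f2e      # more links, higher recall
    raise ValueError(f"unknown method: {method}")
```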


Suggestions for HW1

- Matching the baseline will get you a B
- Implement IBM Model 2 in addition to IBM Model 1
- Try the heuristics for merging the many-to-one and one-to-many alignments
- Try to reduce sparse counts by preprocessing your training data
- Other ideas?

Reading

Read Chapter 4 from the textbook (today we covered 4.4 through 4.6).

Announcements

HW1 leaderboard submissions are due by Tuesday at 11:59pm. HW1 write-ups and code are due 24 hours later.