IBM Model 1 for Machine Translation

Micha Elsner
March 28, 2014

2 Machine translation

A key area of computational linguistics. Bar-Hillel points out that human-like translation requires understanding of the text, for instance to disambiguate words. As with generation, MT research follows two paths: shallow text-to-text methods, and deeper methods using an interlingua / semantics.

[Figure: the MT pyramid, from word-based text-to-text transfer, through syntactic parsing and generation, up to a knowledge-based interlingua ("language of thought", semantics), connecting the source language to the target language.]

3 Shallow methods

We'll study an older method from the shallow word-based school. It is still used as an ingredient in more sophisticated translation systems, and bears some similarity to lost-language decipherment methods (e.g. the Rosetta Stone, Linear B...).

4 Notation

Given:
an English sentence E_{1:n}
a French (or "foreign") sentence F_{1:m}
The letters are conventional regardless of the actual languages involved.

We want to model P(E | F), and then search for:
max_E P(E | F)

We'll assume we have bilingual parallel text for training:
D : (E_i, F_i), the same documents in English and French (e.g. the Canadian Hansards).

5 The noisy channel

We've seen this before (discussing speech recognition). Studied in information theory (Shannon)... but really just a specific use of Bayes' rule:

max_E P(E | F) = max_E P(E) P(F | E)

(dropping the constant P(F) in the denominator)

P(E) is the source model: a language model for English.
P(F | E) is the channel or noise model.

"When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode." (Warren Weaver)
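To make the decomposition concrete, here is a minimal sketch of noisy-channel decoding over an explicit candidate list; the candidate set and the lm_logprob / tm_logprob scoring functions are hypothetical placeholders, not something defined in these slides:

```python
def noisy_channel_decode(f_sentence, candidates, lm_logprob, tm_logprob):
    """Return the English candidate E maximizing P(E) * P(F | E),
    scored in log space as log P(E) + log P(F | E)."""
    return max(candidates,
               key=lambda e: lm_logprob(e) + tm_logprob(f_sentence, e))
```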

6 Why noisy channels?

max_E P(E | F) = max_E P(E) P(F | E)

Seems like we haven't made any progress. But: we're good at language modeling, and the LM can also use monolingual data. The LM can take over enforcing grammaticality in English, so the translation model P(F | E) doesn't have to know as much grammar... it is mainly focused on content.

7 The channel model

We need to build a model of P(F | E). Basic intuition: the model is an En-Fr dictionary.
Translation: pick an En word, look it up, and take a Fr word. This word-to-translation correspondence is called an alignment; this alignment idea is what makes it an IBM model (Brown et al. '90).
Add the Fr word to the sentence: we don't know any syntax, so just throw it in wherever.
Result: each Fr word corresponds to a single English word. There is also a NULL English word to model things French people just say.

[Alignment example:]
En: He is going by train (+ NULL)
Ger: Er fährt mit dem Zug

8 Bad cases for alignment

[Alignment examples from the slide:]
En: I know you are a cat
Fr: Je sais que tu es un chat (the French "que" has no corresponding English word)
En: the long distance train station
Ger: das Fernbahnhof (a single German compound corresponds to several English words)

9 Formalization

Introduce A_{1:m}, an alignment variable for each French word. a_i ∈ [0...n] is a pointer to the English word which is the source of this French word.
The probability of a French word depends on a single English word (selected by a_i):

P(F_i | E_{1:n}, a_i) = P_tr(F_i | E_{a_i})

P_tr is the dictionary:

P(f | the):  le 0.52,  de 0.162,  la 0.151,  les 0.107
P(f | know): sais 0.267,  que 0.254,  savons 0.112,  savoir 0.11
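A dictionary of dictionaries is one natural way to hold P_tr; the sketch below just hard-codes the example probabilities from the table above (an illustration, not a full model):

```python
# P_tr[e][f] = P_tr(f | e), using the slide's example entries
P_tr = {
    "the":  {"le": 0.52, "de": 0.162, "la": 0.151, "les": 0.107},
    "know": {"sais": 0.267, "que": 0.254, "savons": 0.112, "savoir": 0.11},
}

def p_translate(f_word, e_word):
    """Look up P_tr(f | e), returning 0.0 for unseen pairs."""
    return P_tr.get(e_word, {}).get(f_word, 0.0)
```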

10 Probability of all the French words

A single French word: P(F_i | E_{1:n}, A_i) = P_tr(F_i | E_{A_i})
All French words are independent given E, A:

P(F_{1:m} | E_{1:n}, A_{1:m}) = prod_{i=1}^m P_tr(F_i | E_{A_i})

This would be all we need... if we knew A. One can imagine paying students to annotate it, but this would be slow and expensive. Instead we will use unsupervised learning.
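Given a known alignment, the product above is just a chain of dictionary lookups. A minimal sketch, assuming an English sentence that includes NULL at position 0 and the p_translate helper from the previous sketch:

```python
def p_french_given_alignment(F, E, A, p_translate):
    """P(F_1:m | E_1:n, A_1:m) = product over i of P_tr(F_i | E_{A_i}).
    F: list of French words; E: list of English words (E[0] = 'NULL');
    A: list of English positions, one per French word."""
    prob = 1.0
    for f_word, a_i in zip(F, A):
        prob *= p_translate(f_word, E[a_i])
    return prob
```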

11 Setting up an EM algorithm

Probability of a sentence:

P(F_{1:m} | E_{1:n}, A_{1:m}) = prod_{i=1}^m P_tr(F_i | E_{A_i})

In hard EM, we would alternate between predicting A and learning P_tr. However, soft EM is known to work well here; for more complex IBM models, hard EM is not as good (Och and Ney 97). We will calculate the probability of various A and expected counts.

Maximizing the probability of the sentence, marginalizing over A:

P(F_{1:m} | E_{1:n}) = sum_A [ prod_{i=1}^m P_tr(F_i | E_{A_i}) ] P(A_{1:m})

12 Computing the responsibilities

Soft EM: we need partial counts to estimate P_tr:

P̂_tr(f | e) = #(F_i = f, A_i = j, E_j = e) / #(E_j = e)

We need to know the probability that French word f is aligned to English word e (applying Bayes' rule):

P(A_i = j | F_{1:m}, E_{1:n}) = P(F_{1:m} | A_i = j, E_{1:n}) P(A_i = j) / P(F_{1:m})

To do this, we will need to know more about P(A).

13 Probabilities of alignment

What is P(A)? It tells us where in the English sentence we should look for the source of each French word. This depends on syntax: Fr and En are both SVO, so go in order; German is SOV, so look for verbs at the end.
In IBM model 1, we make the simplest possible assumption: look anywhere! (uniform)

IBM model 1 assumption:
P(A_{1:m} | E_{1:n}) = prod_i P(A_i | E_{1:n}) = prod_i 1/n

14 Simplifying

We can use the assumption to simplify our objective, so that we can compute responsibilities for EM:

P(A_{1:m} | E_{1:n}) = prod_i P(A_i | E_{1:n}) = prod_i 1/n

P(F_{1:m} | E_{1:n}) = sum_A [ prod_{i=1}^m P_tr(F_i | E_{A_i}) ] P(A_{1:m})
                     = sum_A [ prod_{i=1}^m P_tr(F_i | E_{A_i}) ] prod_i 1/n
                     = sum_A prod_{i=1}^m P_tr(F_i | E_{A_i}) (1/n)
                     = prod_{i=1}^m sum_{A_i ∈ [1...n]} P_tr(F_i | E_{A_i}) (1/n)
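The last step, pushing the sum inside the product, is worth checking numerically. The sketch below computes P(F | E) both by brute-force enumeration of all n^m alignments and by the factored product of per-word sums; the two agree (positions here range over the English words actually listed, matching the slide's sum over A_i in 1...n):

```python
from itertools import product as all_alignments

def p_french_brute_force(F, E, p_translate):
    """Sum over every alignment A of prod_i P_tr(F_i | E_{A_i}) * (1/n)."""
    n = len(E)
    total = 0.0
    for A in all_alignments(range(n), repeat=len(F)):
        term = 1.0
        for f_word, a_i in zip(F, A):
            term *= p_translate(f_word, E[a_i]) * (1.0 / n)
        total += term
    return total

def p_french_factored(F, E, p_translate):
    """Equivalent form: prod_i sum_j P_tr(F_i | E_j) * (1/n)."""
    n = len(E)
    prob = 1.0
    for f_word in F:
        prob *= sum(p_translate(f_word, e_word) / n for e_word in E)
    return prob
```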

15 Computing the responsibilities

P(F_{1:m} | E_{1:n}) = prod_{i=1}^m sum_{A_i ∈ [1...n]} P_tr(F_i | E_{A_i}) (1/n)

For each French word, we have:
P(F_i | E_{1:n}) = sum_{A_i ∈ [1...n]} P_tr(F_i | E_{A_i}) (1/n)

We need:
P(A_i = j | F_{1:m}, E_{1:n}) = P(F_{1:m} | A_i = j, E_{1:n}) P(A_i = j) / P(F_{1:m})

Independence assumptions: drop the dependence on all French words but F_i:
P(A_i = j | F_i, E_{1:n}) = P(F_i | A_i = j, E_{1:n}) P(A_i = j) / P(F_i)

16 Continued...

P(A_i = j | F_i, E_{1:n}) = P(F_i | A_i = j, E_{1:n}) P(A_i = j) / P(F_i)

Numerator: use the fact that A_i = j and P(A) is uniform:
P(A_i = j | F_i, E_{1:n}) = P_tr(F_i | E_j) (1/n) / P(F_i)

Denominator: derived on the previous slide (or sum over the numerators):
P(F_i) = sum_{A_i ∈ [1...n]} P_tr(F_i | E_{A_i}) (1/n)

Thus (note that the 1/n divides out):
P(A_i = j | F_i, E_{1:n}) = P_tr(F_i | E_j) (1/n) / [ sum_{A_i ∈ [1...n]} P_tr(F_i | E_{A_i}) (1/n) ]
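In code, the responsibility of English position j for a French word is just P_tr renormalized over the English sentence, since the uniform 1/n prior cancels; a sketch, again using the p_translate helper from above:

```python
def alignment_posteriors(f_word, E, p_translate):
    """P(A_i = j | F_i, E_1:n) for every position j in the English sentence."""
    scores = [p_translate(f_word, e_word) for e_word in E]
    total = sum(scores)
    if total == 0.0:
        return [1.0 / len(E)] * len(E)  # back off to uniform if every score is zero
    return [s / total for s in scores]
```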

17 Example

Running EM: similar to Rosetta Stone decipherment problems.

Ich habe eine Katze              I have a cat
Ich habe eine Matte              I have a mat
Die Katze sitzt auf der Matte    The cat sits on the mat

Begin with P_tr uniform. In E-step 1, calculate P(A_i = j) for all German words. Since we know nothing, these are uniform over the English words in the corresponding sentence: in sentence 1, Katze is either I, have, a, cat or NULL (each P = 1/5).

Counts:
#(I, Katze)     1/5        (from sentence 1)
#(have, Katze)  1/5        (from sentence 1)
#(sits, Katze)  1/7        (from sentence 3)
#(cat, Katze)   1/7 + 1/5  (from sentences 1 and 3)
#(the, Katze)   2/7        (from sentence 3)
...
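A sketch of the first E-step on this toy corpus (words lowercased so that "The" and "the" count together, and NULL added to each English sentence). With uniform posteriors it reproduces the fractional counts above, e.g. 1/5 + 1/7 for (cat, Katze) and 2/7 for (the, Katze):

```python
from collections import defaultdict

corpus = [
    ("ich habe eine katze".split(),           "NULL i have a cat".split()),
    ("ich habe eine matte".split(),           "NULL i have a mat".split()),
    ("die katze sitzt auf der matte".split(), "NULL the cat sits on the mat".split()),
]

def e_step(corpus, posteriors):
    """Accumulate expected counts #(e, f) and #(e) from alignment posteriors."""
    counts = defaultdict(float)   # counts[(e_word, f_word)]
    totals = defaultdict(float)   # totals[e_word]
    for F, E in corpus:
        for f_word in F:
            for e_word, p in zip(E, posteriors(f_word, E)):
                counts[(e_word, f_word)] += p
                totals[e_word] += p
    return counts, totals

# E-step 1: we know nothing, so posteriors are uniform over each English sentence
uniform = lambda f_word, E: [1.0 / len(E)] * len(E)
counts, totals = e_step(corpus, uniform)
print(counts[("cat", "katze")])   # 1/5 + 1/7 = 0.342857...
print(counts[("the", "katze")])   # 2/7 = 0.285714... ("the" appears twice in sentence 3)
```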

18 M-step

We re-estimate:

P̂_tr(Katze | cat) = (1/7 + 1/5) / (sum of counts for everything aligned to cat)

Alignments to cat will probably increase; alignments of other words to cat will decrease, since we now suspect cat translates as Katze. This will help us deal with other words: habe is less likely to be cat... so more likely to be have.
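The M-step divides the expected counts by their English-word totals, and alternating the two steps gives a complete (toy) Model 1 trainer. A sketch building on the corpus, e_step, and alignment_posteriors functions defined in the earlier sketches:

```python
from collections import defaultdict

def m_step(counts, totals):
    """Re-estimate P_tr(f | e) = #(e, f) / #(e)."""
    table = defaultdict(dict)
    for (e_word, f_word), c in counts.items():
        table[e_word][f_word] = c / totals[e_word]
    return table

def train_model1(corpus, iterations=10):
    # Iteration 1: uniform posteriors, since P_tr starts uniform
    posteriors = lambda f_word, E: [1.0 / len(E)] * len(E)
    table = {}
    for _ in range(iterations):
        counts, totals = e_step(corpus, posteriors)
        table = m_step(counts, totals)
        lookup = lambda f, e, t=table: t.get(e, {}).get(f, 0.0)
        posteriors = lambda f_word, E, pt=lookup: alignment_posteriors(f_word, E, pt)
    return table

table = train_model1(corpus)
print(table["cat"])   # probability mass should shift toward "katze" over iterations
```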

19 Conclusion

IBM: shallow (word-to-word) translation.
Based on the noisy channel (LM and channel model).
The channel model is based on word-to-word alignments and a dictionary.
IBM 1: alignments are uniform (no syntax).
Use EM to learn the alignments/dictionary.