6.864 (Fall 2007) Machine Translation Part III


(Thanks to Philipp Koehn for giving me slides from his EACL 2006 tutorial)

Roadmap for the Next Few Lectures

Lecture 1 (last time): IBM Models 1 and 2
Lecture 2 (today): phrase-based models
Lecture 3: syntax in statistical machine translation

Overview

Learning phrases from alignments
A phrase-based model
Decoding in phrase-based models

Phrase-Based Models

The first stage in training a phrase-based model is the extraction of a phrase-based (PB) lexicon. A PB lexicon pairs strings in one language with strings in another language, e.g.,

    nach Kanada      in Canada
    zur Konferenz    to the conference
    Morgen           tomorrow
    fliege           will fly
    ...
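As a rough illustration of the data structure, the sketch below represents such a lexicon (often called a phrase table) as a plain Python dictionary; the entries and probabilities are made up for illustration and are not taken from the lecture.

    # A toy phrase-based (PB) lexicon / phrase table.  Entries and probabilities
    # are invented for illustration only.
    phrase_table = {
        "nach Kanada":   [("in Canada", 0.85), ("to Canada", 0.15)],
        "zur Konferenz": [("to the conference", 0.90), ("at the conference", 0.10)],
        "Morgen":        [("tomorrow", 0.95), ("morning", 0.05)],
        "fliege":        [("will fly", 0.60), ("fly", 0.40)],
    }

    def translations(foreign_phrase):
        """Candidate English phrases for a foreign phrase, best first."""
        return sorted(phrase_table.get(foreign_phrase, []),
                      key=lambda pair: pair[1], reverse=True)

    print(translations("zur Konferenz"))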

An Example (from the tutorial by Koehn and Knight)

A training example (Spanish/English sentence pair):

    Spanish: Maria no daba una bofetada a la bruja verde
    English: Mary did not slap the green witch

Some (not all) phrase pairs extracted from this example:

    (Maria, Mary), (bruja, witch), (verde, green), (no, did not),
    (no daba una bofetada, did not slap), (daba una bofetada a la, slap the)

We'll see how to do this using alignments from the IBM models (e.g., from IBM model 2).

Recap: IBM Model 2

IBM model 2 defines a distribution P(a, f | e) where f is the foreign (French) sentence, e is an English sentence, and a is an alignment.

A useful by-product: once we've trained the model, for any (f, e) pair we can calculate

    a* = argmax_a P(a | f, e) = argmax_a P(a, f | e)

under the model. a* is the most likely alignment. (A small code sketch of this computation is given at the end of this section.)

Representation as Alignment Matrix

[Alignment matrix for the example: the Spanish words label one axis and the English words Mary, did, not, slap, the, green, witch the other, with a mark at each alignment point. Note: bof = bofetada.]

In IBM model 2, each foreign (Spanish) word is aligned to exactly one English word. The matrix shows these alignments.

Finding Alignment Matrices

Step 1: train IBM model 2 for P(f | e), and come up with the most likely alignment for each (e, f) pair.
Step 2: train IBM model 4 for P(e | f), and come up with the most likely alignment for each (e, f) pair.

We now have two alignments: take the intersection of the two alignments as a starting point.
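Since Model 2 factorizes across foreign positions, the argmax over alignments decomposes into an independent choice for each foreign word. The sketch below is my own illustration of that computation: the parameter dictionaries t and q are assumed to come from EM training and are hypothetical names, not part of the lecture.

    def best_alignment(e_words, f_words, t, q):
        """Most likely IBM Model 2 alignment a* = argmax_a P(a, f | e).
        t[(f_word, e_word)] and q[(i, j, l, m)] are translation and alignment
        parameters from EM training (assumed dictionaries).  Under Model 2 the
        probability factorizes over foreign positions, so each a_j can be
        chosen independently."""
        l, m = len(e_words), len(f_words)
        alignment = []
        for j, f in enumerate(f_words, start=1):
            # Position 0 is the NULL English word, as in the IBM models.
            scores = [(q.get((i, j, l, m), 0.0) * t.get((f, e), 0.0), i)
                      for i, e in enumerate(["NULL"] + e_words)]
            best_score, best_i = max(scores)
            alignment.append(best_i)   # a_j: index of the English word f_j aligns to
        return alignment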

[Alignment matrices for the example: the alignment from the P(f | e) model, and the alignment from the P(e | f) model.]

[Alignment matrix showing the intersection of the two alignments.]

The intersection of the two alignments has been found to be a very reliable starting point.

Heuristics for Growing Alignments

Only explore alignment points in the union of the P(f | e) and P(e | f) alignments.
Add one alignment point at a time.
Only add alignment points which align a word that currently has no alignment.
At first, restrict ourselves to alignment points that are neighbors (adjacent or diagonal) of current alignment points.
Later, consider other alignment points.
(A code sketch of this growing procedure is given at the end of this section.)

[Alignment matrix showing the final alignment, created by taking the intersection of the two alignments, then adding new points using the growing heuristics.]

Note that the final alignment is no longer many-to-one: potentially multiple Spanish words can be aligned to a single English word, and vice versa.
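Here is a minimal sketch of a growing procedure along these lines. It is my own simplified rendering of the heuristics above, not the exact algorithm from the lecture; alignments are represented as sets of (e_index, f_index) points.

    def grow_alignment(intersection, union):
        """Start from the intersection of the two alignments and repeatedly add
        neighboring points from their union, as long as the new point aligns a
        word that currently has no alignment."""
        alignment = set(intersection)
        neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                     (0, 1), (1, -1), (1, 0), (1, 1)]
        added = True
        while added:
            added = False
            for (i, j) in sorted(alignment):          # snapshot of current points
                for di, dj in neighbors:
                    ni, nj = i + di, j + dj
                    if (ni, nj) not in union or (ni, nj) in alignment:
                        continue
                    e_unaligned = all(p != ni for p, _ in alignment)
                    f_unaligned = all(q != nj for _, q in alignment)
                    if e_unaligned or f_unaligned:    # one side still unaligned
                        alignment.add((ni, nj))
                        added = True
        return alignment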

Extracting Phrase Pairs from the Alignment Matrix

[Alignment matrix for the Mary / Maria example, with extracted phrase pairs marked.]

A phrase pair consists of a sequence of English words, e, paired with a sequence of foreign words, f.

A phrase pair (e, f) is consistent if there are no words in f aligned to words outside e, and there are no words in e aligned to words outside f. E.g., (Mary did not, Maria no) is consistent; (Mary did, Maria no) is not consistent: no is aligned to not, which is not in the string Mary did.

We extract all consistent phrase pairs from the training example. See Koehn, EACL 2006 tutorial, pages 103-108 for an illustration. (A small code sketch of this extraction, together with the probability estimates below, appears at the end of this section.)

Probabilities for Phrase Pairs

For any phrase pair (f, e) extracted from the training data, we can calculate

    P(f | e) = Count(f, e) / Count(e)

e.g.,

    P(daba una bofetada | slap) = Count(daba una bofetada, slap) / Count(slap)

An Example Phrase Translation Table

An example from the Koehn, EACL 2006 tutorial. (Note that we have P(e | f), not P(f | e), in this example.)

Phrase translations for den Vorschlag:

    English              P(e | f)    English              P(e | f)
    the proposal         0.6227      the suggestions      0.0114
    's proposal          0.1068      the proposed         0.0114
    a proposal           0.0341      the motion           0.0091
    the idea             0.0250      the idea of          0.0091
    this proposal        0.0227      the proposal ,       0.0068
    proposal             0.0205      its proposal         0.0068
    of the proposal      0.0159      it                   0.0068
    the proposals        0.0159      ...                  ...

Overview

Learning phrases from alignments
A phrase-based model
Decoding in phrase-based models
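The following is a small sketch of consistent phrase-pair extraction and of the relative-frequency estimates above. It is my own rendering, not the tutorial's implementation: `alignment` is a set of 0-indexed (e_index, f_index) points, `max_len` is a practical phrase-length limit, and the requirement that a pair contain at least one alignment point is a standard extra condition not stated on the slide.

    from collections import defaultdict

    def consistent(alignment, e_start, e_end, f_start, f_end):
        """No alignment point may link a word inside one side of the phrase
        pair to a word outside the other side; at least one point required."""
        points = [(i, j) for (i, j) in alignment
                  if e_start <= i < e_end and f_start <= j < f_end]
        if not points:
            return False
        for (i, j) in alignment:
            inside_e = e_start <= i < e_end
            inside_f = f_start <= j < f_end
            if inside_e != inside_f:      # a link crossing the phrase boundary
                return False
        return True

    def extract_phrase_pairs(e_words, f_words, alignment, max_len=7):
        pairs = []
        for e_start in range(len(e_words)):
            for e_end in range(e_start + 1, min(e_start + max_len, len(e_words)) + 1):
                for f_start in range(len(f_words)):
                    for f_end in range(f_start + 1, min(f_start + max_len, len(f_words)) + 1):
                        if consistent(alignment, e_start, e_end, f_start, f_end):
                            pairs.append((" ".join(e_words[e_start:e_end]),
                                          " ".join(f_words[f_start:f_end])))
        return pairs

    def phrase_probabilities(all_pairs):
        """Relative-frequency estimates P(f | e) = Count(f, e) / Count(e)."""
        pair_counts, e_counts = defaultdict(int), defaultdict(int)
        for (e, f) in all_pairs:
            pair_counts[(e, f)] += 1
            e_counts[e] += 1
        return {(e, f): c / e_counts[e] for (e, f), c in pair_counts.items()}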

Phrase-Based Systems: A Sketch

    Heute werden wir über die Wiedereröffnung des Mont-Blanc-Tunnels diskutieren

Phrase-Based Systems: A Sketch

    Heute werden wir über die Wiedereröffnung des Mont-Blanc-Tunnels diskutieren

    Score = log P(Today | START)        Language model
          + log P(Heute | Today)        Phrase model
          + log P(1-1 | 1-1)            Distortion model

Phrase-Based Systems: A Sketch

    Heute werden wir über die Wiedereröffnung des Mont-Blanc-Tunnels diskutieren

    Score = log P(we shall be | Today)  Language model

Phrase-Based Systems: A Sketch

    Heute werden wir über die Wiedereröffnung des Mont-Blanc-Tunnels diskutieren

          + log P(werden wir | we will be)   Phrase model
          + log P(2-3 | 2-4)                 Distortion model

Phrase-Based Systems: A Sketch

    Heute werden wir über die Wiedereröffnung des Mont-Blanc-Tunnels diskutieren

Phrase-Based Systems: Formal Definitions

(following the notation in Jurafsky and Martin, chapter 25)

We'd like to translate a French string f.

E is a sequence of l English phrases, e_1, e_2, ..., e_l. For example, e_1 = Mary, e_2 = did not, e_3 = slap, e_4 = the, e_5 = green witch. E defines a possible translation, in this case e_1 e_2 ... e_5 = Mary did not slap the green witch.

F is a sequence of l foreign phrases, f_1, f_2, ..., f_l. For example, f_1 = Maria, f_2 = no, f_3 = dio una bofetada, f_4 = a la, f_5 = bruja verde.

a_i for i = 1 ... l is the position of the first word of f_i in f.
b_i for i = 1 ... l is the position of the last word of f_i in f.

Phrase-Based Systems: Formal Definitions

We then have

    Cost(E, F) = P(E) ∏_{i=1}^{l} P(f_i | e_i) d(a_i - b_{i-1})

P(E) is the language model score for the string defined by E.
P(f_i | e_i) is the phrase-table probability for the i-th phrase pair.
d(a_i - b_{i-1}) is some probability/penalty for the distance between the i-th phrase and the (i-1)-th phrase. Usually, we define d(a_i - b_{i-1}) = α^|a_i - b_{i-1} - 1| for some α < 1.

Note that this is not a coherent probability model.

An Example

    Position   1       2         3                  4      5
    English    Mary    did not   slap               the    green witch
    Spanish    Maria   no        dio una bofetada   a la   bruja verde

In this case,

    Cost(E, F) = P_L(Mary did not slap the green witch)
                 * P(Maria | Mary) d(1)
                 * P(no | did not) d(1)
                 * P(dio una bofetada | slap) d(1)
                 * P(a la | the) d(1)
                 * P(bruja verde | green witch) d(1)

P_L is the score from a language model.
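To make the cost concrete, here is a minimal sketch that scores a candidate (E, F) pair in log space with the distortion penalty d(x) = α^|x - 1|. It is my own illustration: ALPHA, the helper names, and the model callbacks are assumptions, not part of the lecture.

    import math

    ALPHA = 0.5  # distortion parameter, assumed value for illustration

    def distortion(a_i, b_prev):
        """d(a_i - b_{i-1}) = ALPHA ** |a_i - b_{i-1} - 1|."""
        return ALPHA ** abs(a_i - b_prev - 1)

    def log_cost(phrases, lm_logprob, phrase_logprob):
        """phrases: list of (english_phrase, foreign_phrase, a_i, b_i) tuples in
        English order, where a_i/b_i are the source positions of the foreign
        phrase's first/last word.  lm_logprob scores the whole English string;
        phrase_logprob returns log P(f | e) for a phrase pair."""
        english = " ".join(e for e, _, _, _ in phrases)
        total = lm_logprob(english)          # log P(E)
        b_prev = 0                           # position before the first source word
        for e, f, a_i, b_i in phrases:
            total += phrase_logprob(f, e)    # log P(f_i | e_i)
            total += math.log(distortion(a_i, b_prev))
            b_prev = b_i
        return total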

Another Example

    Position   1       2         3                  4      5       6
    English    Mary    did not   slap               the    green   witch
    Spanish    Maria   no        dio una bofetada   a la   verde   bruja

The original Spanish string was Maria no dio una bofetada a la bruja verde, so notice that the last two phrase pairs involve reordering.

In this case,

    Cost(E, F) = P_L(Mary did not slap the green witch)
                 * P(Maria | Mary) d(1)
                 * P(no | did not) d(1)
                 * P(dio una bofetada | slap) d(1)
                 * P(a la | the) d(1)
                 * P(verde | green) d(2)
                 * P(bruja | witch) d(-1)

Overview

Learning phrases from alignments
A phrase-based model
Decoding in phrase-based models

The Decoding Problem

For a given foreign string f, the decoding problem is to find

    argmax_{(E, F)} Cost(E, F)

where the argmax is over all (E, F) pairs that are consistent with f.

See the Koehn tutorial, EACL 2006, slides 29-57.
See Jurafsky and Martin, Chapter 25, Figure 25.30.
See Jurafsky and Martin, Chapter 25, section 25.8.
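Solving this argmax exactly is intractable in general, and real systems use beam search over partial hypotheses (see the Koehn tutorial slides cited above). Purely as an illustration of the search space, the sketch below, which is my own construction and only feasible for very short sentences, enumerates every segmentation, ordering, and phrase choice consistent with f and keeps the best-scoring candidate. The `score` callback could be the log_cost sketch above with its model functions bound, e.g. via functools.partial.

    from itertools import permutations

    def decode_brute_force(f_words, phrase_table, score):
        """Brute-force argmax over all (E, F) pairs consistent with the source
        sentence f_words.  phrase_table maps a foreign phrase to an iterable of
        candidate English phrases; score takes a list of
        (english_phrase, foreign_phrase, a_i, b_i) tuples and returns a number.
        Exponential in sentence length; an illustration of the search space only."""

        def segmentations(i):
            # All ways of covering f_words[i:] with phrases from the table,
            # recording 1-indexed source positions (a_i, b_i).
            if i == len(f_words):
                yield []
                return
            for j in range(i + 1, len(f_words) + 1):
                f_phrase = " ".join(f_words[i:j])
                if f_phrase in phrase_table:
                    for rest in segmentations(j):
                        yield [(f_phrase, i + 1, j)] + rest

        def expand(remaining, chosen):
            # Choose an English translation for each foreign phrase in turn.
            if not remaining:
                yield chosen
                return
            (f_phrase, a, b), rest = remaining[0], remaining[1:]
            for e_phrase in phrase_table[f_phrase]:
                yield from expand(rest, chosen + [(e_phrase, f_phrase, a, b)])

        def candidates():
            for seg in segmentations(0):
                for ordered in permutations(seg):   # allow phrase reordering
                    yield from expand(list(ordered), [])

        return max(candidates(), key=score, default=None)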