Decoding Revisited: Easy-Part-First & MERT. February 26, 2015



Translating the Easy Part First?
Input: the tourism initiative addresses this for the first time
- Hypothesis 1 builds "die touristische initiative" word by word:
  the -> die (tm: -0.19, lm: -0.4, d: 0, all: -0.65)
  tourism -> touristische (tm: -1.16, lm: -2.93, d: 0, all: -4.09)
  initiative -> initiative (tm: -1.21, lm: -4.67, d: 0, all: -5.88)
- Hypothesis 2 builds "das erste mal" in one step:
  the first time -> das erste mal (tm: -0.56, lm: -2.81, d: -0.74, all: -4.11)
- Both hypotheses translate 3 words, but the worse hypothesis has the better score.

Estimating Future Cost
- Future cost estimate: how expensive is translation of the rest of the sentence?
- Optimistic: choose the cheapest translation options
- Cost for each translation option:
  translation model: cost known
  language model: output words known, but not their context -> estimate without context
  reordering model: unknown, ignored for future cost estimation

Cost Estimates from Translation Options
[Figure: cost of the cheapest translation options for each input span of "the tourism initiative addresses this for the first time" (log-probabilities). Single-word options: the -1.0, tourism -2.0, initiative -1.5, addresses -2.4, this -1.4, for -1.0, the -1.0, first -1.9, time -1.6; cheaper multi-word options exist for some longer spans.]

Cost Estimates for all Spans
Compute cost estimates for all contiguous spans by combining the cheapest options.

Future cost estimate for n words (from the first word):
word          1     2     3     4     5     6     7     8     9
the         -1.0  -3.0  -4.5  -6.9  -8.3  -9.3  -9.6 -10.6 -10.6
tourism     -2.0  -3.5  -5.9  -7.3  -8.3  -8.6  -9.6  -9.6
initiative  -1.5  -3.9  -5.3  -6.3  -6.6  -7.6  -7.6
addresses   -2.4  -3.8  -4.8  -5.1  -6.1  -6.1
this        -1.4  -2.4  -2.7  -3.7  -3.7
for         -1.0  -1.3  -2.3  -2.3
the         -1.0  -2.2  -2.3
first       -1.9  -2.4
time        -1.6

- Function words are cheaper (the: -1.0) than content words (tourism: -2.0)
- Common phrases are cheaper (for the first time: -2.3) than unusual ones (tourism initiative addresses: -5.9)
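The table above is a small dynamic program over spans. Below is a minimal sketch of that computation, assuming the cheapest translation-option cost per span has already been collected into a dictionary; the function name and the toy per-word costs are illustrative, not from any particular toolkit.

# Sketch of the future-cost table: for each source span, take the cheapest
# translation option, then combine adjacent spans and keep the best split.
def future_cost_table(n, option_cost):
    """n: number of source words.
    option_cost: dict (i, j) -> best log-prob of any translation option
    covering source words i..j-1 (assumed precomputed)."""
    NEG_INF = float("-inf")
    # cost[i][j] = estimated best log-prob for translating span [i, j)
    cost = [[NEG_INF] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][i] = 0.0                      # empty span costs nothing
    for length in range(1, n + 1):            # shorter spans first
        for i in range(0, n - length + 1):
            j = i + length
            best = option_cost.get((i, j), NEG_INF)
            for k in range(i + 1, j):         # best split into two parts
                best = max(best, cost[i][k] + cost[k][j])
            cost[i][j] = best
    return cost

# Toy usage with the single-word costs from the slide; with only unigram
# options the whole-sentence estimate is lower than the table above, which
# also uses cheaper multi-word options (e.g., "for the first time").
words = "the tourism initiative addresses this for the first time".split()
unigram = [-1.0, -2.0, -1.5, -2.4, -1.4, -1.0, -1.0, -1.9, -1.6]
opts = {(i, i + 1): c for i, c in enumerate(unigram)}
table = future_cost_table(len(words), opts)
print(table[0][len(words)])   # estimate for the whole sentence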

Combining Score and Future Cost
Hypothesis score and future cost estimate are combined for pruning:
- Left hypothesis starts with the hard part: "the tourism initiative" -> "die touristische initiative"
  score: -5.88, future cost: -6.1 -> total cost -11.98
- Middle hypothesis starts with the easiest part: "the first time" -> "das erste mal"
  score: -4.11, future cost: -9.3 -> total cost -13.41
- Right hypothesis picks easy parts: "this", "for ... time" -> "für diese zeit"
  score: -4.86, future cost: -9.1 -> total cost -13.96

[Figure: stack decoding of f = "Maria no dio una bofetada a la bruja verde". Stacks Q[0], Q[1], Q[2], ... hold partial hypotheses (e.g., "Mary", "Maria", "Not") with their coverage vectors, language-model context, partial probability p, and future cost estimate fc.]
Future costs make these hypotheses comparable.

Other Decoding Algorithms
- A* search
- Greedy hill-climbing
- Using finite state transducers (standard toolkits)

A* Search
[Figure: search space plotting probability + heuristic estimate against number of words covered; cheapest score; depth-first expansion to a completed path.]
- Uses an admissible future cost heuristic: never overestimates cost
- Translation agenda: expand the hypothesis with the lowest score + heuristic cost (sketched below)
- Done when a complete hypothesis is created
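For illustration, here is a schematic agenda loop for A* decoding; it is a sketch of the control flow only, assuming hypothetical helpers expand() and is_complete() and hypothesis objects carrying a partial score and an admissible heuristic (future cost) estimate.

# Schematic A* agenda loop (not a full decoder).
import heapq

def a_star_decode(start, expand, is_complete):
    # heapq is a min-heap, so push the negated (score + heuristic)
    agenda = [(-(start.score + start.heuristic), 0, start)]
    tie = 1
    while agenda:
        _, _, hyp = heapq.heappop(agenda)   # best score + heuristic first
        if is_complete(hyp):
            return hyp                       # admissible heuristic: first goal popped is optimal
        for succ in expand(hyp):
            heapq.heappush(agenda, (-(succ.score + succ.heuristic), tie, succ))
            tie += 1
    return None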

Greedy Hill-Climbing
- Create one complete hypothesis with depth-first search (or other means)
- Search for better hypotheses by applying change operators (see the sketch below):
  change the translation of a word or phrase
  combine the translations of two words into a phrase
  split up the translation of a phrase into two smaller phrase translations
  move parts of the output into a different position
  swap parts of the output with the output at a different part of the sentence
- Terminates when no operator application produces a better translation
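A minimal sketch of that loop, with the change operators and the model score assumed to be supplied as functions; neighbors() and score() are illustrative names, not part of any toolkit.

# Hill-climbing control flow: keep applying improving changes until none helps.
def hill_climb(initial_translation, neighbors, score):
    """neighbors(t) yields translations reachable by one change operator;
    score(t) returns the model score (higher is better)."""
    current = initial_translation
    current_score = score(current)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(current):
            s = score(candidate)
            if s > current_score:           # keep the first improving change
                current, current_score = candidate, s
                improved = True
                break
    return current                          # no operator improves it any further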

Decoding algorithm
Translation as a search problem. A partial hypothesis keeps track of:
- which source words have been translated (coverage vector)
- the n-1 most recent words of English (for the LM!)
- a back-pointer list to the previous hypothesis + the (e,f) phrase pair used
- the (partial) translation probability
- the estimated probability of translating the remaining words (precomputed, a function of the coverage vector)
Start state: no translated words, E = <s>, bp = nil
Goal state: all words translated

Decoding algorithm
Q[0] <- start state
for i = 0 to |f| - 1:
    keep the b best hypotheses in Q[i]
    for each hypothesis h in Q[i]:
        for each untranslated span in h.c for which there is a translation <e,f> in the phrase table:
            h' = h extended by <e,f>
            is there an item in Q[|h'.c|] with the same LM state?
                yes: update that item's back-pointer list and probability
                no:  add h' to Q[|h'.c|]
Find the best hypothesis in Q[|f|]; reconstruct the translation by following back pointers.
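Below is a compact Python sketch of this stack decoder under stated assumptions: phrase_table maps source spans (i, j) to (target_phrase, logprob) options, lm_score(state, words) returns (logprob, new_state), and future_cost[(i, j)] is the precomputed estimate for span [i, j). For simplicity recombination keeps only the better of two hypotheses with the same coverage and LM state (no back-pointer list is merged), and no distortion limit or reordering cost is modeled.

# Sketch of stack decoding with future-cost pruning and recombination.
from collections import namedtuple

Hypothesis = namedtuple("Hypothesis",
                        "coverage lm_state logprob future back phrase")

def uncovered_spans(coverage, n):
    """Yield maximal uncovered spans (i, j) given a frozenset of covered indices."""
    i = 0
    while i < n:
        if i in coverage:
            i += 1
            continue
        j = i
        while j < n and j not in coverage:
            j += 1
        yield i, j
        i = j

def decode(n, phrase_table, lm_score, future_cost, beam=10):
    start = Hypothesis(frozenset(), "<s>", 0.0, future_cost[(0, n)], None, None)
    stacks = [dict() for _ in range(n + 1)]            # recombination: key -> hyp
    stacks[0][(start.coverage, start.lm_state)] = start
    for i in range(n):
        # histogram pruning: keep the `beam` best by score + future cost
        best = sorted(stacks[i].values(),
                      key=lambda h: h.logprob + h.future, reverse=True)[:beam]
        for h in best:
            for a, b in uncovered_spans(h.coverage, n):
                for s in range(a, b):
                    for t in range(s + 1, b + 1):
                        for target, tm_lp in phrase_table.get((s, t), []):
                            lm_lp, new_state = lm_score(h.lm_state, target)
                            cov = h.coverage | frozenset(range(s, t))
                            fut = sum(future_cost[(x, y)]
                                      for x, y in uncovered_spans(cov, n))
                            new = Hypothesis(cov, new_state,
                                             h.logprob + tm_lp + lm_lp,
                                             fut, h, target)
                            key = (cov, new_state)
                            stack = stacks[len(cov)]
                            if key not in stack or new.logprob > stack[key].logprob:
                                stack[key] = new        # recombine / replace
    goal = max(stacks[n].values(), key=lambda h: h.logprob)
    out = []
    while goal.back is not None:                        # follow back pointers
        out.append(goal.phrase)
        goal = goal.back
    return " ".join(reversed(out))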

Parameter Learning: Review

K-Best List Example
[Figure: the k-best hypotheses #1-#10 plotted by their feature values h_1 and h_2, shaded by metric score bins (0.0 <= score < 0.2 up to 0.8 <= score < 1.0), with the current weight vector w drawn as a direction.]

Fit a linear model
[Figure: a linear model fit to the same points in (h_1, h_2) space; the fitted direction gives a new weight vector w.]

K-Best List Example
[Figure: the same k-best hypotheses shown with the updated weight vector w.]

Limitations
- We can't optimize corpus-level metrics, like BLEU, on a test set
- These don't decompose by sentence!
- We turn now to a kind of direct cost minimization

MERT
Minimum Error Rate Training
- Directly target an automatic evaluation metric
- BLEU is defined at the corpus level; MERT optimizes at the corpus level
- Downsides: does not deal well with > ~20 features

MERT
Given a weight vector w, any hypothesis <e, a> will have a (scalar) score

    m = w^T h(g, e, a)

Now pick a search direction v and consider how the score of this hypothesis changes as we take a step of size γ along it, w_new = w + γ v:

    m(γ) = (w + γ v)^T h(g, e, a)
         = w^T h(g, e, a) + γ · v^T h(g, e, a)
         = b + a γ

with intercept b = w^T h(g, e, a) and slope a = v^T h(g, e, a): a linear function in 2D!
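Computing the slope and intercept of every hypothesis' score line is just a pair of dot products. A small sketch follows; the feature values in the example are made up for illustration.

# Along direction v, each hypothesis' score is a line in the step size gamma,
# with intercept b = w.h and slope a = v.h (h = the hypothesis' feature vector).
import numpy as np

def score_lines(w, v, feature_vectors):
    """feature_vectors: (K, D) array, one row per k-best hypothesis.
    Returns (slopes a, intercepts b) so that score_i(gamma) = a_i*gamma + b_i."""
    H = np.asarray(feature_vectors)
    return H @ np.asarray(v), H @ np.asarray(w)

# Example: 3 hypotheses with 2 features each
H = np.array([[0.5, -1.0], [1.2, -0.3], [0.1, 0.4]])
a, b = score_lines(w=np.array([1.0, 0.5]), v=np.array([0.0, 1.0]), feature_vectors=H)
print(a, b)   # the decoder at step gamma picks argmax_i a_i*gamma + b_i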

MERT
[Figures: recall our k-best set {<e_i, a_i>}, i = 1..K. Each hypothesis' score m is drawn as a line over the step size γ; the upper envelope of these lines shows which hypothesis (e.g., <e_162, a_162>, <e_28, a_28>, <e_73, a_73>) the decoder would select at each γ. Each envelope segment is then labeled with its error count, and the γ whose segment has the fewest errors is chosen.]
Let w_new = w + γ v
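A brute-force sketch of this one-dimensional line search for a single sentence: instead of the efficient upper-envelope sweep suggested by the figures, it probes every interval between pairwise intersection points, which is simpler but O(K^2). The per-hypothesis error counts and the search bounds are assumed inputs; in practice the statistics are summed over all sentences before choosing γ.

# Probe each interval of gamma between line intersections, find the winning
# hypothesis there, and charge its error count; keep the best interval.
import numpy as np

def line_search(a, b, errors, lo=-5.0, hi=5.0):
    """a, b: per-hypothesis slopes/intercepts; errors[i]: error count of
    hypothesis i against the reference. Returns (best_gamma, best_error)."""
    xs = []
    K = len(a)
    for i in range(K):
        for j in range(i + 1, K):
            if a[i] != a[j]:
                x = (b[j] - b[i]) / (a[i] - a[j])   # intersection of lines i, j
                if lo < x < hi:
                    xs.append(x)
    xs = sorted(set(xs)) + [hi]
    best_gamma, best_error, prev = None, float("inf"), lo
    for x in xs:
        gamma = 0.5 * (prev + x)                    # midpoint of the interval
        winner = int(np.argmax(a * gamma + b))      # hypothesis on the envelope
        if errors[winner] < best_error:
            best_gamma, best_error = gamma, errors[winner]
        prev = x
    return best_gamma, best_error

# With slopes/intercepts from the previous sketch and per-hypothesis errors:
gamma, err = line_search(np.array([0.5, 1.2, 0.1]),
                         np.array([1.0, 0.9, 0.3]),
                         errors=[3, 1, 2])
# then w_new = w + gamma * v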

MERT
- In practice, "errors" are sufficient statistics for evaluation metrics (e.g., BLEU)
- Can maximize or minimize
- How do you pick the search direction?
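To make the first point concrete, here is a sketch of sufficient statistics for BLEU with a single reference: per-sentence clipped n-gram matches, candidate n-gram totals, and lengths are additive, so corpus BLEU can be recomputed from element-wise sums of these vectors. The function names and the exact layout of the statistics vector are illustrative.

# Per-sentence BLEU sufficient statistics and a corpus-level combination.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_stats(hyp, ref, max_n=4):
    stats = [len(hyp), len(ref)]                 # hypothesis and reference lengths
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        match = sum(min(c, r[g]) for g, c in h.items())   # clipped matches
        stats += [match, max(len(hyp) - n + 1, 0)]
    return stats

def bleu_from_stats(stats, max_n=4):
    hyp_len, ref_len = stats[0], stats[1]
    log_prec = 0.0
    for n in range(max_n):
        match, total = stats[2 + 2 * n], stats[3 + 2 * n]
        if match == 0 or total == 0:
            return 0.0
        log_prec += math.log(match / total) / max_n
    bp = min(0.0, 1.0 - ref_len / hyp_len)       # brevity penalty (log-space)
    return math.exp(bp + log_prec)

# Corpus BLEU: sum the per-sentence stats element-wise, then call bleu_from_stats.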

Dynamic Programming MERT

Other Algorithms
Given a hypergraph translation space, the Viterbi (Inside) algorithm uses two operations:
- Multiplication (extend a path)
- Maximization (choose between paths)
Semirings generalize these to compute other quantities.

Semirings

Inside Algorithm
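As a concrete illustration, here is a generic sketch of the inside algorithm parameterized by a semiring (plus = choose between paths, times = extend a path), instantiated with the Viterbi (max, +) semiring over log-probabilities. The hypergraph encoding, with incoming edges stored as (weight, tails) pairs and nodes given in topological order, is an assumption made for this sketch.

# Generic inside algorithm over an acyclic hypergraph, for any semiring.
import math

def inside(nodes, incoming, plus, times, one, zero):
    value = {}
    for node in nodes:                       # topological order: tails first
        edges = incoming.get(node, [])
        if not edges:
            value[node] = one                # source/leaf node
            continue
        total = zero
        for weight, tails in edges:
            prod = weight
            for t in tails:                  # extend the path through each tail
                prod = times(prod, value[t])
            total = plus(total, prod)        # combine alternative derivations
        value[node] = total
    return value

# Viterbi semiring (max, +) over log-probabilities:
nodes = ["a", "b", "c", "goal"]
incoming = {"c": [(-0.5, ["a"]), (-1.2, ["b"])],
            "goal": [(-0.1, ["c"])]}
best = inside(nodes, incoming, plus=max, times=lambda x, y: x + y,
              one=0.0, zero=-math.inf)
print(best["goal"])   # best (Viterbi) score of any derivation reaching "goal"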

Point-Line Duality
- Represent a set of lines as a set of points (and vice versa): y = mx + b => (m, -b)
- The slope between two dual points is the x-coordinate of the intersection of the corresponding pair of lines
- An upper envelope is dual to a lower convex hull
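A small sketch of that duality: map each line to its dual point and take the lower convex hull (here with a monotone-chain scan); the surviving points correspond to the lines that appear on the upper envelope. The example lines are made up.

# Lines -> dual points -> lower convex hull -> lines on the upper envelope.
def lower_hull(points):
    pts = sorted(set(points))                # sort by m, then -b
    hull = []
    for p in pts:
        # pop while the last turn is not a strict left (counter-clockwise) turn
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

lines = [(0.5, 1.0), (1.2, 0.9), (0.1, 0.3)]          # (slope m, intercept b)
duals = [(m, -b) for m, b in lines]
envelope_lines = [(m, -nb) for m, nb in lower_hull(duals)]
print(envelope_lines)   # lines on the upper envelope, left to right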

Primal Dual

Convex Hull Semiring

Theorem 2: The Inside algorithm with the convex hull semiring computes the convex hull dual to the MERT upper envelope generated from the -best list of derivations.

Summary
- Evaluation metrics
  Figure out how well we're doing
  Figure out if a feature helps
  Train your system
- What's a great way to improve translation? Improve evaluation!