Decoding Revisited: Easy-Part-First & MERT
February 26, 2015
Translating the Easy Part First?

Input: the tourism initiative addresses this for the first time

Left hypothesis, expanded word by word:
- the → die (tm: -0.19, lm: -0.4, d: 0, all: -0.65)
- tourism → touristische (cumulative tm: -1.16, lm: -2.93, d: 0, all: -4.09)
- initiative → initiative (cumulative tm: -1.21, lm: -4.67, d: 0, all: -5.88)

Right hypothesis:
- the first time → das erste mal (tm: -0.56, lm: -2.81, d: -0.74, all: -4.11)

Both hypotheses translate 3 words, but the worse hypothesis has the better score.

Chapter 6: Decoding
Estimating Future Cost

Future cost estimate: how expensive is translation of the rest of the sentence?
Optimistic: choose the cheapest translation options.
Cost for each translation option:
- translation model: cost known
- language model: output words known, but not their context! → estimate without context
- reordering model: unknown, ignored for future cost estimation
Cost Estimates from Translation Options

Input: the tourism initiative addresses this for the first time
[Figure: cost of the cheapest translation options for each input span (log-probabilities)]
Cost Estimates for all Spans

Compute cost estimates for all contiguous spans by combining the cheapest options.

Future cost estimate for n words, starting from the word shown:

word         1    2    3    4    5    6    7     8     9
the         -1.0 -3.0 -4.5 -6.9 -8.3 -9.3 -9.6 -10.6 -10.6
tourism     -2.0 -3.5 -5.9 -7.3 -8.3 -8.6 -9.6  -9.6
initiative  -1.5 -3.9 -5.3 -6.3 -6.6 -7.6 -7.6
addresses   -2.4 -3.8 -4.8 -5.1 -6.1 -6.1
this        -1.4 -2.4 -2.7 -3.7 -3.7
for         -1.0 -1.3 -2.3 -2.3
the         -1.0 -2.2 -2.3
first       -1.9 -2.4
time        -1.6

Function words are cheaper (the: -1.0) than content words (tourism: -2.0).
Common phrases are cheaper (for the first time: -2.3) than unusual ones (tourism initiative addresses: -5.9).
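The span table above can be computed with a small dynamic program. A minimal sketch, assuming a dict `option_cost` mapping each half-open span `(i, j)` to the log-probability of its cheapest translation option; the function name and data layout are mine, not from any toolkit:

```python
def future_cost_table(option_cost, n):
    """fc[(i, j)]: optimistic cost (log-prob) of covering span [i, j)."""
    fc = {}
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # either a single translation option covers the whole span ...
            best = option_cost.get((i, j), float("-inf"))
            # ... or we combine the cheapest costs of two sub-spans
            for k in range(i + 1, j):
                best = max(best, fc[(i, k)] + fc[(k, j)])
            fc[(i, j)] = best
    return fc
```

With the single-word costs from the table (the: -1.0, tourism: -2.0, initiative: -1.5), the estimate for "the tourism" is -3.0 and for "the tourism initiative" -4.5, matching the first row.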
Combining Score and Future Cost

Hypothesis score and future cost estimate are combined for pruning:
- left hypothesis starts with the hard part: the tourism initiative → die touristische initiative (tm: -1.21, lm: -4.67, d: 0); score -5.88, future cost -6.1 → total -11.98
- middle hypothesis starts with the easiest part: the first time → das erste mal (tm: -0.56, lm: -2.81, d: -0.74); score -4.11, future cost -9.3 → total -13.41
- right hypothesis picks easy parts: this for... time → für diese zeit (tm: -0.82, lm: -2.98, d: -1.06); score -4.86, future cost -9.1 → total -13.96

Chapter 6: Decoding
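The pruning comparison on this slide in code, using the slide's numbers (the list layout is mine):

```python
# (hypothesis, partial score, future cost estimate) as on the slide
hyps = [
    ("the tourism initiative", -5.88, -6.1),   # hard part first
    ("the first time",         -4.11, -9.3),   # easiest part first
    ("this ... time",          -4.86, -9.1),   # easy parts picked
]

# rank by score + future cost (higher log-prob total is better)
ranked = sorted(hyps, key=lambda h: h[1] + h[2], reverse=True)
```

The hypothesis that tackled the hard part first wins the comparison at -11.98, even though its partial score alone looked worst.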
f: Maria no dio una bofetada a la bruja verde

[Figure: stacks Q[0], Q[1], Q[2], ... holding partial hypotheses, e.g. "Mary" (coverage *--------, p: 0.9, fc: 8.6e-9) and "Not" (coverage -*-------, p: 0.4, fc: 1.0e-9), each with LM state and coverage vector.]

Future costs make these hypotheses comparable.
Other Decoding Algorithms
- A* search
- Greedy hill-climbing
- Using finite state transducers (standard toolkits)
A* Search

[Figure: probability + heuristic estimate vs. number of words covered; depth-first expansion to a completed path; cheapest score.]
- Uses an admissible future cost heuristic: never overestimates cost
- Translation agenda: create the hypothesis with the lowest score + heuristic cost
- Done when a complete hypothesis is created
Greedy Hill-Climbing
- Create one complete hypothesis with depth-first search (or other means)
- Search for better hypotheses by applying change operators:
  - change the translation of a word or phrase
  - combine the translation of two words into a phrase
  - split up the translation of a phrase into two smaller phrase translations
  - move parts of the output into a different position
  - swap parts of the output with the output at a different part of the sentence
- Terminates if no operator application produces a better translation
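The loop above can be sketched generically; `neighbors` and `score` are placeholders standing in for the change operators and the model score:

```python
def hill_climb(initial, neighbors, score):
    """Greedy hill-climbing: apply the best operator until nothing improves."""
    current = initial
    while True:
        candidates = neighbors(current)     # apply all change operators
        if not candidates:
            return current
        best = max(candidates, key=score)
        if score(best) <= score(current):
            return current                  # no operator improves the translation
        current = best
```

Note the stopping condition mirrors the slide: the search terminates as soon as no operator application produces a better translation, so it can get stuck in a local optimum.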
Decoding Algorithm

Translation as a search problem. A partial hypothesis keeps track of:
- which source words have been translated (coverage vector)
- the n-1 most recent words of English (for the LM!)
- a back pointer list to the previous hypothesis + the (e, f) phrase pair used
- the (partial) translation probability
- the estimated probability of translating the remaining words (precomputed, a function of the coverage vector)

Start state: no translated words, E = <s>, bp = nil
Goal state: all words translated
Decoding Algorithm

Q[0] ← start state
for i = 0 to |f| − 1:
    keep the b best hypotheses at Q[i]
    for each hypothesis h in Q[i]:
        for each untranslated span in h.c for which there is a translation <e, f> in the phrase table:
            h' = h extended by <e, f>
            is there an item in Q[|h'.c|] with an equal LM state?
                yes: update that item's bp list and probability
                no: add h' to Q[|h'.c|]
Find the best hypothesis in Q[|f|]; reconstruct the translation by following back pointers.
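A compact sketch of this stack-decoding loop, under simplifying assumptions: coverage as a frozenset, a bigram-sized LM state, and stand-in `lm_score` / `future_cost` callables. All names are illustrative, not the API of Moses or any other toolkit:

```python
import heapq
from collections import namedtuple

# hypothesis: coverage set, LM state (last n-1 words), log-prob, back pointer
Hyp = namedtuple("Hyp", "coverage lm_state logprob back")

def decode(n_src, phrase_table, lm_score, future_cost, beam=10):
    stacks = [dict() for _ in range(n_src + 1)]          # Q[0..|f|]
    start = Hyp(frozenset(), ("<s>",), 0.0, None)
    stacks[0][(start.coverage, start.lm_state)] = start
    for i in range(n_src):
        # keep the beam best hypotheses, ranked by score + future cost
        best = heapq.nlargest(beam, stacks[i].values(),
                              key=lambda h: h.logprob + future_cost(h.coverage))
        for h in best:
            for (a, b), options in phrase_table.items():
                if any(k in h.coverage for k in range(a, b)):
                    continue                              # span already covered
                for english, lp in options:
                    cov = h.coverage | frozenset(range(a, b))
                    state = (h.lm_state + tuple(english.split()))[-1:]
                    new = Hyp(cov, state,
                              h.logprob + lp + lm_score(h.lm_state, english),
                              (h, english))
                    old = stacks[len(cov)].get((cov, state))
                    if old is None or new.logprob > old.logprob:
                        stacks[len(cov)][(cov, state)] = new   # recombination
    goal = max(stacks[n_src].values(), key=lambda h: h.logprob)
    words, h = [], goal                                   # follow back pointers
    while h.back is not None:
        h, e = h.back
        words.append(e)
    return " ".join(reversed(words))
```

The dict keyed on (coverage, LM state) implements recombination: two hypotheses that agree on both can only differ in score, so only the better one is kept (a fuller version would merge back-pointer lists to preserve k-best paths).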
Parameter Learning: Review
K-Best List Example

[Figure: k-best hypotheses #1-#10 plotted in feature space (h1, h2), binned by score: 0.8 ≤ score < 1.0, 0.6 ≤ score < 0.8, 0.4 ≤ score < 0.6, 0.2 ≤ score < 0.4, 0.0 ≤ score < 0.2.]

Fit a linear model

[Figure: a linear model with weight vector w is fit in (h1, h2) space, separating high-scoring from low-scoring hypotheses.]
Limitations

We can't optimize corpus-level metrics, like BLEU, on a test set. These don't decompose by sentence! We turn now to a kind of direct cost minimization.
MERT

Minimum Error Rate Training
- Directly targets an automatic evaluation metric
- BLEU is defined at the corpus level; MERT optimizes at the corpus level

Downsides
- Does not deal well with more than ~20 features
MERT

Given a weight vector w, any hypothesis <e, a> will have a (scalar) score

    m = wᵀ h(g, e, a)

Now pick a search direction v, and consider how the score of this hypothesis changes along w_new = w + γv:

    m(γ) = (w + γv)ᵀ h(g, e, a)
         = wᵀ h(g, e, a) + γ · vᵀ h(g, e, a)
         = b + aγ

with intercept b = wᵀ h(g, e, a) and slope a = vᵀ h(g, e, a). Linear function in 2D!
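The decomposition in code: for a fixed hypothesis, the score is linear in the step size γ. The vectors below are illustrative, not from the slides:

```python
w = [0.5, -1.0, 2.0]   # current weights
v = [1.0, 0.0, -0.5]   # search direction
h = [0.2, 0.3, 0.1]    # feature vector h(g, e, a) of one hypothesis

dot = lambda x, y: sum(xi * yi for xi, yi in zip(x, y))

b, a = dot(w, h), dot(v, h)      # intercept and slope of the line

def m(gamma):
    """Score of the hypothesis at weights w + gamma * v."""
    return dot([wi + gamma * vi for wi, vi in zip(w, v)], h)

# m(gamma) == b + a * gamma for every gamma
```

Because every hypothesis traces a straight line in (γ, m), comparing the whole k-best list along a search direction reduces to comparing lines, which is what the next slides exploit.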
MERT

Recall our k-best set {<e_i, a_i>}, i = 1..K.

[Figure sequence: each k-best hypothesis — e.g. <e_162, a_162>, <e_28, a_28>, <e_73, a_73> — is a line m(γ). The upper envelope of these lines shows which hypothesis is 1-best in each interval of γ. Mapping each interval to its error count gives a piecewise-constant error function of γ; choose γ in the interval with the fewest errors.]

Let w_new = w + γv.
MERT

In practice, errors are sufficient statistics for evaluation metrics (e.g., BLEU). You can maximize or minimize. How do you pick the search direction?
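A minimal sketch of the per-sentence line search this implies: each k-best hypothesis is a line b_i + a_i·γ, and we probe one γ inside each interval between line intersections to find the interval whose 1-best hypothesis has the fewest errors. The data are illustrative, and the sketch assumes at least two non-parallel lines:

```python
def line_search(lines, errors):
    """lines: (slope a, intercept b) per hypothesis; errors: error count per hypothesis."""
    # candidate boundaries: x-coordinates where pairs of lines intersect
    cuts = sorted({(b2 - b1) / (a1 - a2)
                   for i, (a1, b1) in enumerate(lines)
                   for (a2, b2) in lines[i + 1:] if a1 != a2})
    # probe the midpoint of each interval, plus the two unbounded ends
    probes = ([cuts[0] - 1.0]
              + [(x + y) / 2 for x, y in zip(cuts, cuts[1:])]
              + [cuts[-1] + 1.0])

    def err(g):
        one_best = max(range(len(lines)), key=lambda i: lines[i][1] + lines[i][0] * g)
        return errors[one_best]

    return min(probes, key=err)
```

Real MERT computes the upper envelope with a sweep instead of probing all O(K²) intersections, and accumulates error statistics over the whole corpus before choosing γ; this sketch only shows the interval structure for one sentence.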
Dynamic Programming MERT
Other Algorithms

Given a hypergraph translation space. In the Viterbi (Inside) algorithm, there are two operations:
- multiplication (extend path)
- maximization (choose between paths)

Semirings generalize these to compute other quantities.
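The generalization in code: an inside algorithm over a toy hypergraph, parameterized by the semiring's "plus" (choose between paths) and "times" (extend a path). The data layout is mine:

```python
def inside(nodes, edges, plus, times, one):
    """Generic inside algorithm.

    nodes: in topological order; edges[v]: list of (tail_nodes, weight).
    plus combines alternative incoming paths; times extends a path.
    """
    value = {}
    for v in nodes:
        if not edges[v]:                     # leaf/source node
            value[v] = one
            continue
        total = None
        for tails, w in edges[v]:
            path = w
            for t in tails:                  # extend path with tail values
                path = times(path, value[t])
            total = path if total is None else plus(total, path)
        value[v] = total
    return value
```

With plus = max and times = + over log-probs this is Viterbi; with plus = + and times = × over probabilities it computes total (inside) probabilities; swapping in other semirings yields other quantities, which is the idea the next slides build on.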
Semirings
Inside Algorithm
Point-Line Duality
- Represent a set of lines as a set of points (and vice versa): y = mx + b ⇒ (m, -b)
- The slope between two dual points is the x-coordinate of the intersection of the corresponding pair of lines
- An upper envelope is dual to a lower convex hull
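A numeric check of the duality fact above; the two lines are illustrative:

```python
def dual(m, b):
    """Line y = m*x + b  ->  dual point (m, -b)."""
    return (m, -b)

def slope(p, q):
    """Slope of the segment between two points."""
    return (q[1] - p[1]) / (q[0] - p[0])

line1, line2 = (2.0, 1.0), (-1.0, 4.0)                    # (m, b) pairs
x_cross = (line2[1] - line1[1]) / (line1[0] - line2[0])   # solve m1*x+b1 = m2*x+b2

# slope(dual(*line1), dual(*line2)) equals x_cross
```

This is why the upper envelope of MERT's lines can be recovered from the lower convex hull of their dual points: hull edges in the dual correspond exactly to the intersection points that bound envelope segments in the primal.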
Primal / Dual

[Figure: a set of lines in the primal and the corresponding set of points in the dual.]
Convex Hull Semiring
Theorem 2: The Inside algorithm with the convex hull semiring computes the convex hull dual to the MERT upper envelope generated from the ∞-best list of derivations.
Summary

Evaluation metrics:
- figure out how well we're doing
- figure out if a feature helps
- train your system

What's a great way to improve translation? Improve evaluation!