Decoding Revisited: Easy-Part-First & MERT. February 26, 2015

Piers Cooper

1 Decoding Revisited: Easy-Part-First & MERT February 26, 2015

2 Translating the Easy Part First?
Input: the tourism initiative addresses this for the first time
Hypothesis 1, extended one word at a time:
- the -> die: tm:-0.19, lm:-0.4, d:0, all:-0.65
- tourism -> touristische: tm:-1.16, lm:-2.93, d:0, all:-4.09
- initiative -> initiative: tm:-1.21, lm:-4.67, d:0, all:-5.88
Hypothesis 2, a single phrase:
- the first time -> das erste mal: tm:-0.56, lm:-2.81, d: [missing], all:-4.11
Both hypotheses translate 3 words.
The worse hypothesis has the better score.
Chapter 6: Decoding 25

3 Estimating Future Cost
Future cost estimate: how expensive is translation of the rest of the sentence?
Optimistic: choose the cheapest translation options
Cost for each translation option:
- translation model: cost known
- language model: output words known, but not context! Estimate without context
- reordering model: unknown, ignored for future cost estimation

4 Cost Estimates from Translation Options
Input: the tourism initiative addresses this for the first time
[Figure: cost of the cheapest translation options for each input span (log-probabilities)]

5 Cost Estimates for all Spans
Compute cost estimate for all contiguous spans by combining cheapest options
[Table: future cost estimate for n words (columns), counted from each first word (rows): the, tourism, initiative, addresses, this, for, the, first, time; surviving cell: -1.6]
- Function words are cheaper (the: -1.0) than content words (tourism: -2.0)
- Common phrases are cheaper (for the first time: -2.3) than unusual ones (tourism initiative addresses: -5.9)
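The span table described on this slide can be filled with a small dynamic program. The following is a minimal sketch, not the implementation behind the slides: the function name `future_cost_table`, the half-open span convention `(i, j)` and all option costs for the three-word toy input are assumptions made for illustration (only `the: -1.0` and `tourism: -2.0` echo values quoted on the slide).

```python
from typing import Dict, Tuple

Span = Tuple[int, int]  # half-open source span [i, j)


def future_cost_table(n: int, option_cost: Dict[Span, float]) -> Dict[Span, float]:
    """Best (highest) estimated log-probability for every contiguous span.

    option_cost maps a span to the log-probability of its cheapest
    translation option; spans without an entry have no direct option.
    """
    cost: Dict[Span, float] = {}
    for length in range(1, n + 1):            # shorter spans first
        for i in range(0, n - length + 1):
            j = i + length
            best = option_cost.get((i, j), float("-inf"))
            for k in range(i + 1, j):         # or combine two smaller spans
                best = max(best, cost[(i, k)] + cost[(k, j)])
            cost[(i, j)] = best
    return cost


# Toy input "the tourism initiative" with hypothetical option costs.
options = {
    (0, 1): -1.0,   # the
    (1, 2): -2.0,   # tourism
    (2, 3): -1.5,   # initiative (hypothetical)
    (1, 3): -2.5,   # tourism initiative as one phrase (hypothetical)
}
table = future_cost_table(3, options)
```

In this toy table the two-word phrase option beats translating its words separately, so the estimate for the whole input uses it.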

6 Combining Score and Future Cost
Input: the tourism initiative addresses this for the first time
- the tourism initiative -> die touristische initiative: tm:-1.21, lm:-4.67, d:0, all:-5.88
- the first time -> das erste mal: tm:-0.56, lm:-2.81, d: [missing], all:-4.11
- this for ... time -> für diese zeit: tm:-0.82, lm: [missing], d: [missing], all:-4.86
Hypothesis score and future cost estimate are combined (added) for pruning:
- left hypothesis starts with the hard part: the tourism initiative. score: -5.88, future cost: -6.1 -> total cost -11.98
- middle hypothesis starts with the easiest part: the first time. score: -4.11, future cost: -9.3 -> total cost -13.41
- right hypothesis picks easy parts: this for ... time. score: -4.86, future cost: -9.1 -> total cost -13.96
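To see the effect on pruning, the three score and future cost pairs from this slide can be added and ranked. This is only a sketch: the dictionary layout and variable names are assumptions, and the sum of the two log-scores is taken as the combination rule.

```python
# (hypothesis score, future cost estimate) as given on the slide
hypotheses = {
    "the tourism initiative": (-5.88, -6.1),
    "the first time": (-4.11, -9.3),
    "this for ... time": (-4.86, -9.1),
}

# Total cost used for pruning: score plus future cost (both log-scores).
totals = {name: score + future for name, (score, future) in hypotheses.items()}

# Ranking by score alone favours the hypothesis that took the easy part.
ranking_by_score = sorted(hypotheses, key=lambda k: hypotheses[k][0], reverse=True)

# Ranking by total cost favours the hypothesis that already did the hard part.
ranking_by_total = sorted(totals, key=totals.get, reverse=True)
```

Ranked by score alone, "the first time" comes first; ranked by total cost, "the tourism initiative" does.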

7 f: Maria no dio una bofetada a la bruja verde
[Figure: hypothesis stacks Q[0], Q[1], Q[2], ...; each hypothesis box lists e (English words so far: <s>, <s> Mary, <s> Maria, <s> Not), c (coverage vector), p (probability) and fc (future cost, with values such as 8.6e-9, 1.5e-9 and 1.0e-9)]
Future costs make these hypotheses comparable.

8 Other Decoding Algorithms
- A* search
- Greedy hill-climbing
- Using finite state transducers (standard toolkits)

9 A* Search
[Figure: search graph plotted as probability + heuristic estimate against number of words covered; labels: cheapest score, depth-first expansion to completed path]
- Uses an admissible future cost heuristic: never overestimates cost
- Translation agenda: create the hypothesis with the lowest score + heuristic cost
- Done when a complete hypothesis is created

10 Greedy Hill-Climbing
- Create one complete hypothesis with depth-first search (or other means)
- Search for better hypotheses by applying change operators:
  - change the translation of a word or phrase
  - combine the translation of two words into a phrase
  - split up the translation of a phrase into two smaller phrase translations
  - move parts of the output into a different position
  - swap parts of the output with the output at a different part of the sentence
- Terminates if no operator application produces a better translation
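The loop on this slide can be sketched with a single change operator. The sketch below is deliberately reduced and makes several assumptions: it translates word by word, implements only the "change the translation of a word" operator, and uses a made-up scoring function (`toy_score`, counting bigrams from a hypothetical `GOOD_BIGRAMS` set) in place of a real model.

```python
def hill_climb(options, score):
    """Greedy hill-climbing over word-by-word translations.

    options: one list of candidate translations per source word.
    score:   callable mapping a list of output words to a number.
    """
    current = [opts[0] for opts in options]   # one complete initial hypothesis
    best = score(current)
    improved = True
    while improved:                           # stop when no change helps
        improved = False
        for pos, opts in enumerate(options):
            for cand in opts:
                # Operator: change the translation of one word.
                neighbor = current[:pos] + [cand] + current[pos + 1:]
                s = score(neighbor)
                if s > best:
                    current, best, improved = neighbor, s, True
    return current, best


# Hypothetical toy model: reward output bigrams from a small "good" set.
GOOD_BIGRAMS = {("the", "house"), ("house", "is"), ("is", "small")}


def toy_score(words):
    return sum(1 for pair in zip(words, words[1:]) if pair in GOOD_BIGRAMS)


options = [["that", "the"], ["home", "house"], ["is"], ["little", "small"]]
result, result_score = hill_climb(options, toy_score)
```

Because the score must strictly increase at every accepted change and the search space is finite, the loop always terminates, possibly in a local optimum.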

11 Decoding algorithm
Translation as a search problem
A partial hypothesis keeps track of:
- which source words have been translated (coverage vector)
- the n-1 most recent words of English (for the LM!)
- a back pointer list to the previous hypothesis + the (e,f) phrase pair used
- the (partial) translation probability
- the estimated probability of translating the remaining words (precomputed, a function of the coverage vector)
Start state: no translated words, E=<s>, bp=nil
Goal state: all words translated

12 Decoding algorithm
Q[0] <- Start state
for i = 0 to |f|-1
    Keep b best hypotheses at Q[i]
    for each hypothesis h in Q[i]
        for each untranslated span in h.c for which there is a translation <e,f> in the phrase table
            h' = h extended by <e,f>
            Is there an item in Q[|h'.c|] with the same LM state?
                yes: update the item's bp list and probability
                no: Q[|h'.c|] <- h'
Find the best hypothesis in Q[|f|], reconstruct the translation by following back pointers
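The pseudocode above can be turned into a small runnable sketch. Several simplifications are assumptions of this sketch rather than part of the slides: recombination keys on coverage plus LM state and keeps only the better-scoring item (no back pointer list), future cost and distortion are left out, and the phrase table, bigram scores and the names `decode`, `Hyp` and `toy_lm` are invented for the example.

```python
import math
from collections import namedtuple

Hyp = namedtuple("Hyp", "coverage lm_state logprob backptr phrase")


def decode(source, phrase_table, lm_logprob, beam=10, n=2):
    """Stack decoding; stacks[i] holds hypotheses covering i source words."""
    size = len(source)
    stacks = [dict() for _ in range(size + 1)]   # (coverage, lm_state) -> Hyp
    start = Hyp(frozenset(), ("<s>",), 0.0, None, None)
    stacks[0][(start.coverage, start.lm_state)] = start
    for i in range(size):
        # Keep the b best hypotheses of this stack.
        kept = sorted(stacks[i].values(), key=lambda h: h.logprob, reverse=True)[:beam]
        for h in kept:
            for s in range(size):
                for t in range(s + 1, size + 1):
                    span = range(s, t)
                    if any(k in h.coverage for k in span):
                        continue                 # span overlaps translated words
                    for tgt, tm in phrase_table.get(tuple(source[s:t]), []):
                        lp, state = h.logprob + tm, h.lm_state
                        for w in tgt:            # score each new target word
                            lp += lm_logprob(state, w)
                            state = (state + (w,))[-(n - 1):]
                        new = Hyp(h.coverage | set(span), state, lp, h, tgt)
                        key = (new.coverage, new.lm_state)
                        stack = stacks[len(new.coverage)]
                        # Recombination: keep the better of two equivalent items.
                        if key not in stack or stack[key].logprob < lp:
                            stack[key] = new
    if not stacks[size]:
        return None
    h = max(stacks[size].values(), key=lambda x: x.logprob)
    phrases = []
    while h.backptr is not None:                 # follow back pointers
        phrases.append(h.phrase)
        h = h.backptr
    return [w for phrase in reversed(phrases) for w in phrase]


# Hypothetical toy models.
PHRASES = {
    ("Maria",): [(("Mary",), math.log(0.9)), (("Maria",), math.log(0.3))],
    ("no",): [(("not",), math.log(0.4)), (("did", "not"), math.log(0.6))],
}
BIGRAMS = {("<s>", "Mary"): -0.5, ("Mary", "did"): -0.5,
           ("did", "not"): -0.3, ("Mary", "not"): -2.0}


def toy_lm(state, word):
    return BIGRAMS.get((state[-1], word), -5.0)  # unseen bigrams get -5.0


best = decode(["Maria", "no"], PHRASES, toy_lm)
```

With these toy scores the decoder prefers "Mary did not" over "Mary not" and over any reordered output, because the bigram model penalizes the alternatives.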

13 Parameter Learning: Review

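14 K-Best List Example
[Figure: k-best hypotheses #1 to #10 plotted in the feature plane (h_1, h_2) with weight vector w; the legend groups them into bands with boundaries 0.8, 0.6 and 0.2]

15 Fit a linear model
[Figure: feature plane with axes h_1 and h_2]

16 Fit a linear model
[Figure: feature plane with axes h_1 and h_2 and weight vector w]

17 K-Best List Example
[Figure: k-best hypotheses #1 to #10 in the feature plane (h_1, h_2) with weight vector w; the legend groups them into bands with boundaries 1.0, 0.8, 0.6 and 0.2]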

18 Limitations
- We can't optimize corpus-level metrics, like BLEU, on a test set
- These don't decompose by sentence!
- We turn now to a kind of direct cost minimization

19 MERT
Minimum Error Rate Training
- Directly target an automatic evaluation metric
- BLEU is defined at the corpus level
- MERT optimizes at the corpus level
Downsides:
- Does not deal well with > ~20 features

20 MERT
Given weight vector $w$, any hypothesis $\langle e, a \rangle$ will have a (scalar) score $m = w^\top h(g, e, a)$
Now pick a search vector $v$ and a scalar step size $\gamma$, and consider how the score of this hypothesis will change:
$w_{new} = w + \gamma v$
$m = (w + \gamma v)^\top h(g, e, a) = \underbrace{w^\top h(g, e, a)}_{b} + \gamma \underbrace{v^\top h(g, e, a)}_{a}$
$m = a\gamma + b$

24 MERT
With $b = w^\top h(g, e, a)$ and $a = v^\top h(g, e, a)$, the score of hypothesis $\langle e, a \rangle$ under $w_{new} = w + \gamma v$ is $m = a\gamma + b$
Linear function in 2D!
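A quick numeric check makes the "linear function" claim concrete. In this sketch the weights, search direction and feature vector are hypothetical values, and `gamma` names the scalar step along the search vector (a notational assumption where the extracted slide text shows no symbol).

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


w = [1.0, 0.5]     # hypothetical current weights
v = [0.0, 1.0]     # hypothetical search vector
h = [-2.0, -3.0]   # hypothetical feature vector h(g, e, a)

b = dot(w, h)      # intercept: score at gamma = 0
a = dot(v, h)      # slope: how fast the score changes along v


def score_at(gamma):
    """Score of the hypothesis under w_new = w + gamma * v."""
    w_new = [wi + gamma * vi for wi, vi in zip(w, v)]
    return dot(w_new, h)
```

For every step size, `score_at(gamma)` agrees with `a * gamma + b`, so each hypothesis traces a straight line as the weights move along `v`.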

25 MERT
[Figure: plot with vertical axis m]

26 MERT
Recall our k-best set $\{\langle e_i, a_i \rangle\}_{i=1}^{K}$
[Figure: plot with vertical axis m]

29 MERT
[Figure: plot with vertical axis m; lines labeled $\langle e_{162}, a_{162} \rangle$, $\langle e_{28}, a_{28} \rangle$ and $\langle e_{73}, a_{73} \rangle$]

31 MERT
[Figure: plot with vertical axis m and lines labeled $\langle e_{162}, a_{162} \rangle$, $\langle e_{28}, a_{28} \rangle$ and $\langle e_{73}, a_{73} \rangle$; second panel labeled errors]

33 MERT
[Figure: panels labeled m and errors]

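35 [Figure: panel labeled errors]
Let $w_{new} = \gamma^{*} v + w$, where $\gamma^{*}$ is the step size along $v$ with the best error count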

36 MERT
- In practice "errors" are sufficient statistics for evaluation metrics (e.g., BLEU)
- Can maximize or minimize
- How do you pick the search direction?
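The line search pictured on the preceding slides can be sketched as follows. This is an illustration under stated assumptions, not the tuning code of any toolkit: each hypothesis is a hypothetical triple `(a, b, err)` of slope, intercept and an additive error count (for BLEU one would accumulate n-gram statistics instead), errors are minimized, and the function names are invented.

```python
def upper_envelope(lines):
    """Return [(start_gamma, line)] for the lines that are maximal somewhere.

    lines: list of (a, b, err); line i has score a * gamma + b.
    """
    ordered = sorted(lines, key=lambda l: (l[0], l[1]))
    dedup = []
    for l in ordered:
        if dedup and dedup[-1][0] == l[0]:
            dedup[-1] = l                 # equal slopes: higher intercept wins
        else:
            dedup.append(l)
    hull = []
    for l in dedup:                       # increasing slope
        start = float("-inf")
        while hull:
            prev_start, top = hull[-1]
            x = (top[1] - l[1]) / (l[0] - top[0])   # where l overtakes top
            if x <= prev_start:
                hull.pop()                # top is never the best line
            else:
                start = x
                break
        hull.append((start, l))
    return hull


def line_search(corpus):
    """corpus: list of k-best lists. Returns (gamma, total errors) with fewest errors."""
    events, total = [], 0
    for kbest in corpus:
        hull = upper_envelope(kbest)
        total += hull[0][1][2]            # errors as gamma -> -infinity
        for (_, prev), (x, cur) in zip(hull, hull[1:]):
            events.append((x, cur[2] - prev[2]))
    events.sort()
    if not events:
        return 0.0, total
    best_gamma, best_err = events[0][0] - 1.0, total
    i = 0
    while i < len(events):
        x = events[i][0]
        while i < len(events) and events[i][0] == x:
            total += events[i][1]         # apply all changes at this gamma
            i += 1
        right = events[i][0] if i < len(events) else x + 2.0
        if total < best_err:
            best_gamma, best_err = (x + right) / 2.0, total
    return best_gamma, best_err


# One sentence, four hypothetical hypotheses (slope, intercept, errors).
kbest = [(-1.0, 0.0, 3), (0.0, 1.0, 1), (1.0, 0.0, 2), (0.0, -5.0, 0)]
gamma_star, errors = line_search([kbest])
```

The hypothesis with zero errors never reaches the top of the envelope, so the search settles for the interval in which the one-error hypothesis scores highest; the new weights would then be `w + gamma_star * v`.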

37 Dynamic Programming MERT

38 Other Algorithms
Given a hypergraph translation space
In the Viterbi (Inside) algorithm, there are two operations:
- Multiplication (extend path)
- Maximization (choose between paths)
Semirings generalize these to compute other quantities

39 Semirings

40 Inside Algorithm
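The two operations named on the "Other Algorithms" slide can be swapped out by passing a semiring to one generic routine. The sketch below assumes a hypergraph given as a topologically ordered node list plus incoming hyperedges `(tail_nodes, weight)`; the `Semiring` tuple, the function `inside` and the toy hypergraph are illustrative inventions, and the convex hull semiring itself is not implemented here.

```python
from collections import namedtuple

# plus chooses between or merges paths; times extends a path.
Semiring = namedtuple("Semiring", "plus times zero one")

VITERBI = Semiring(plus=max, times=lambda x, y: x * y, zero=0.0, one=1.0)
SUM_PRODUCT = Semiring(plus=lambda x, y: x + y, times=lambda x, y: x * y,
                       zero=0.0, one=1.0)


def inside(nodes, in_edges, sr):
    """Inside value of every node.

    nodes:    node names in topological order (tails before heads).
    in_edges: node -> list of (tail_nodes, weight) hyperedges.
    sr:       the semiring supplying plus, times and zero.
    """
    beta = {}
    for v in nodes:
        total = sr.zero
        for tails, weight in in_edges[v]:
            prod = weight
            for u in tails:               # extend path through each tail
                prod = sr.times(prod, beta[u])
            total = sr.plus(total, prod)  # combine alternative derivations
        beta[v] = total
    return beta


# Hypothetical toy hypergraph: two leaves and a goal node.
nodes = ["A", "B", "GOAL"]
in_edges = {
    "A": [((), 0.5), ((), 0.3)],
    "B": [((), 0.4)],
    "GOAL": [(("A", "B"), 1.0), (("A",), 0.2)],
}
best_score = inside(nodes, in_edges, VITERBI)["GOAL"]
total_mass = inside(nodes, in_edges, SUM_PRODUCT)["GOAL"]
```

With `VITERBI` the routine returns the score of the single best derivation; with `SUM_PRODUCT` the same code sums over all derivations.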

41 Point-Line Duality
- Represent a set of lines as a set of points (and vice versa): $y = mx + b \Rightarrow (m, -b)$
- The slope between two dual points is the x-coordinate of the intersection of the corresponding pair of lines
- An upper envelope is dual to a lower convex hull
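The slope claim on this slide follows from a short calculation. The subscripted symbols $m_1, b_1, m_2, b_2$ and $x^{*}$ are introduced here for two arbitrary non-parallel lines and are not notation from the slides.

```latex
% Two lines with different slopes, m_1 \neq m_2:
%   y = m_1 x + b_1   and   y = m_2 x + b_2
% They intersect where m_1 x + b_1 = m_2 x + b_2, i.e. at
\[
  x^{*} = \frac{b_2 - b_1}{m_1 - m_2}.
\]
% Their dual points are (m_1, -b_1) and (m_2, -b_2).
% The slope of the segment joining the dual points is
\[
  \frac{(-b_2) - (-b_1)}{m_2 - m_1}
  = \frac{b_1 - b_2}{m_2 - m_1}
  = \frac{b_2 - b_1}{m_1 - m_2}
  = x^{*}.
\]
```

So reading slopes off consecutive hull points in the dual plane recovers the positions at which the best-scoring line changes in the primal plane.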

42 Primal Dual

43 Convex Hull Semiring

44 Theorem 2
The Inside algorithm with the convex hull semiring computes the convex hull dual to the MERT upper envelope generated from the k-best list of derivations

45 Summary
Evaluation metrics:
- Figure out how well we're doing
- Figure out if a feature helps
- Train your system
What's a great way to improve translation? Improve evaluation!
