Computing Lattice BLEU Oracle Scores for Machine Translation


1 Computing Lattice BLEU Oracle Scores for Machine Translation

Artem Sokolov & Guillaume Wisniewski & François Yvon
LIMSI, Orsay, France

Outline:
1 Introduction
2 Oracle Decoding Task
3 Proposed Oracle Decoders
4 Experiments
5 Conclusion & Future Work

2 Model & Inference in SMT

Usual translation of a sentence f:
best translation: $e^* = \arg\max_{e \in E(f)} \mathrm{score}(e|f)$
e.g. $\mathrm{score}(e|f) = \lambda \cdot h(e, f)$, with features $h(e, f)$ and parameters $\lambda$;
$E(f)$ is the search space -- this is what we are interested in.

Example
source: Vénus est la jumelle infernale de la Terre
search space options:
- list of n-best hypotheses:
  Venus - the twin of hell of the Earth
  Venus is the infernal binocular of the Earth
  Venus is the hellish twin of the Earth
- word lattice
  (figure: a word lattice over phrase translations such as Vénus:Venus, est:is, la_jumelle:twin, infernale:hellish/infernal/of_hell, jumelle:twin/binocular, de_la_terre:of_the_earth)
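To make the scoring rule concrete, here is a minimal sketch (not from the slides; the feature names and weights are invented for illustration) of scoring an n-best list with $\mathrm{score}(e|f) = \lambda \cdot h(e, f)$ and taking the argmax:

```python
# Toy linear-model scoring of an n-best list; features/weights are made up.
def score(features, weights):
    """Dot product of a sparse feature dict h(e, f) with weights lambda."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

nbest = [
    ("Venus - the twin of hell of the Earth",        {"tm": -1.9, "lm": -3.1}),
    ("Venus is the infernal binocular of the Earth", {"tm": -2.4, "lm": -2.7}),
    ("Venus is the hellish twin of the Earth",       {"tm": -2.0, "lm": -2.2}),
]
weights = {"tm": 1.0, "lm": 0.8}

best_hyp, _ = max(nbest, key=lambda pair: score(pair[1], weights))
print(best_hyp)  # the model's argmax, not necessarily the oracle
```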

3 Motivation

In reality, SMT systems perform poorly, and knowing why is difficult because of system complexity. Knowing the best possible hypotheses is useful: it can help failure analysis.

Usefulness of (oracle) translation
- Finding bottlenecks:
  if the oracle is good, why couldn't the system return it?
  if the oracle is bad, why doesn't E contain better ones?
- Learning: update towards the best hypothesis.

In this work we are interested in finding the best hypotheses contained in lattices: two new methods, compared with two existing methods.

4 Oracle Decoding Task

By "best hypothesis in E" we mean the one with the highest BLEU:

$n\text{-BLEU}(\{e_i\}, \{r_i\}) = \mathrm{BrevityPenalty} \cdot \Big( \prod_{m=1}^{n} p_m \Big)^{1/n}$

with hypotheses $e_i$, references $r_i$, and clipped precisions $p_m$. BLEU is a corpus-level metric, so we need some sentence-level approximation.

Exact oracle finding:
- with sentence-BLEU in an n-best list: straightforward
- in a word lattice: NP-hard even for 2-BLEU [Leusch et al., 2008]
So approximations are needed -- the focus of this presentation.

Oracle decoding on lattice L for reference $r_f$:
$\pi^*(f) = \arg\max_{\pi \in \{paths\ on\ L\}} \mathrm{BLEU}(e_\pi, r_f)$
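As a concrete reference point, here is a minimal sentence-BLEU sketch (a standard smoothed variant, assumed here for illustration rather than taken from the authors' implementation) with clipped n-gram precisions and the brevity penalty:

```python
# Minimal smoothed sentence-BLEU: clipped precisions + brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4):
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        # clipping: each hypothesis n-gram matches at most its reference count
        clipped = sum(min(c, r[g]) for g, c in h.items())
        total = max(sum(h.values()), 1)
        log_prec += math.log(max(clipped, 1e-9) / total)  # crude smoothing
    bp = min(1.0, math.exp(1.0 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)

print(sentence_bleu("Venus is the hellish twin of the Earth",
                    "Venus is the hellish twin of the Earth"))  # -> 1.0
```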

5 Earlier Approaches

Language Model Oracle [Li & Khudanpur, 2009] (LM)
- build an n-gram language model $A_{LM}$ on the reference r
- find the most probable hypothesis: $\pi^*_{LM}(f) = \mathrm{ShortestPath}(L \circ A_{LM}(r_f))$

Partial BLEU Oracle [Dreyer et al., 2007] (PB/PBl)
- for a partial path $\pi$ let its weight be: $\log \mathrm{BLEU}(\pi) = \frac{1}{4} \sum_{m=1..4} \log p_m$
- hypothesis $h_1$ is preferred to $h_2$ if $\log \mathrm{BLEU}(h_1) > \log \mathrm{BLEU}(h_2)$ and
  - same 3-word history (PB), or
  - same 3-word history and same length (PBl)
- find the shortest path: $\pi^*_{PB}(f) = \mathrm{ShortestPath}(L)$
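As an illustration of the LM-oracle idea, here is a toy sketch (an assumption-laden stand-in, not Li & Khudanpur's hypergraph algorithm): estimate a smoothed bigram model on the reference alone and rescore an n-best list with it, so hypotheses sharing more material with the reference score higher:

```python
# Toy "reference LM" oracle over an n-best list (add-one smoothed bigrams).
import math
from collections import Counter

def bigram_lm(reference):
    tokens = ["<s>"] + reference.split() + ["</s>"]
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = len(unigrams)
    def logprob(sentence):
        toks = ["<s>"] + sentence.split() + ["</s>"]
        # add-one smoothed bigram log-probability under the reference LM
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(toks, toks[1:]))
    return logprob

ref = "Venus is the hellish twin of the Earth"
lm = bigram_lm(ref)
nbest = ["Venus - the twin of hell of the Earth",
         "Venus is the infernal binocular of the Earth",
         "Venus is the hellish twin of the Earth"]
print(max(nbest, key=lm))  # the hypothesis most probable under the reference LM
```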

6 Problems

Problems with existing approaches:
1 LM: the LM score has only a distant relation to BLEU, so quality is low
2 PB: approximate search (recombination heuristics); can be resource-greedy (keeps many hypotheses until the final state)
3 Both LM & PB ignore:
  - precision clipping
  - brevity penalty (except PBl)
  - the corpus nature of BLEU

7 Our Contribution

New oracles:
- partially take clipping, brevity and the corpus nature of BLEU into account
- produce better hypotheses, and often faster

Take-away message: although true BLEU oracles are NP-hard... we can find approximate oracles that are very easy to calculate and reach very high BLEU.

8 Linear Approximation of Corpus-BLEU (LB)

Linear approximation of BLEU [Tromble et al., 2008]:

$\mathrm{BLEU}_{lin}(\pi) = \theta_0 |e_\pi| + \sum_{n=1}^{4} \theta_n \sum_{u \in \Sigma^n} c_u(e_\pi)\, \delta_u(r)$

where $c_u(e)$ is the number of occurrences of u in e and $\delta_u(r)$ indicates the presence of u in r.
Parameters: $\theta_0 = -1$ (brevity), $\theta_n = 1/(4 p r^{n-1})$ (n-gram discounts).

de Bruijn transducers (denoted $\Psi_n$ here): each $\Psi_n$ contains all n-grams as output symbols, with arcs weighted by $\delta_u(r)$.
(figure: de Bruijn automata $\Psi_n$ for the alphabet {0, 1} and n = 2, 3)
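A minimal sketch of $\mathrm{BLEU}_{lin}$ on plain strings (assuming Tromble et al.'s parametrization with illustrative values p = 0.85 and r = 0.7):

```python
# Toy linear corpus-BLEU score for a single hypothesis/reference pair.
from collections import Counter

def bleu_lin(hyp, ref, p=0.85, r=0.7, max_n=4):
    hyp, ref = hyp.split(), ref.split()
    theta0 = -1.0                      # brevity term
    score = theta0 * len(hyp)
    for n in range(1, max_n + 1):
        theta_n = 1.0 / (4 * p * r ** (n - 1))   # n-gram discount
        ref_ngrams = {tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)}
        counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        # c_u(e) * delta_u(r): count hypothesis n-grams present in the reference
        score += theta_n * sum(c for u, c in counts.items() if u in ref_ngrams)
    return score

print(bleu_lin("Venus is the hellish twin of the Earth",
               "Venus is the hellish twin of the Earth"))
```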

9 Linear Approximation of Corpus-BLEU (LB)

Composition effect: $L \circ \Psi_n$ produces an n-gram lattice, with each n-gram arc weighted by $\theta_n$ if the n-gram occurs in the reference.

Finding the LB oracle translation for 4-BLEU:
- weight the lattice L with $\theta_0$, each $\Psi_i$ with $\theta_i$
- find the shortest path: $\pi^*_{LB} = \mathrm{ShortestPath}(L \circ \Psi_1 \circ \cdots \circ \Psi_4)$

So far, all oracles transformed and/or reweighted the lattice L and called ShortestPath. This makes it difficult to account for clipping (it depends on the resulting path).
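The workhorse here is the shortest-path computation on an acyclic lattice. A minimal dynamic-programming sketch (a stand-in for OpenFST's ShortestPath, maximizing the linear score over a toy adjacency-list lattice; the states are assumed topologically numbered):

```python
# Best-scoring path in an acyclic lattice by DP in topological order.
def best_path(arcs, n_states, init=0):
    """arcs: dict state -> list of (next_state, word, weight); states are
    assumed numbered in topological order, with n_states-1 the final state."""
    best = {init: (0.0, [])}
    for q in range(n_states):
        if q not in best:
            continue  # unreachable state
        score, words = best[q]
        for nxt, word, w in arcs.get(q, []):
            cand = (score + w, words + [word])
            if nxt not in best or cand[0] > best[nxt][0]:
                best[nxt] = cand
    return best[n_states - 1]  # (score, word sequence) of the best full path

arcs = {0: [(1, "Venus", 0.0)],
        1: [(2, "is", 0.2), (2, "-", -0.3)],
        2: [(3, "the_twin", 0.5)]}
print(best_path(arcs, 4))  # -> (0.7, ['Venus', 'is', 'the_twin'])
```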

10 Integer Linear Programming Oracle (ILP)

How to take clipping into account? We need to fine-tune path weights. Reformulate the 1-BLEU shortest-path problem as an ILP problem:

$\arg\max_{\xi} \sum_{i \in \{arcs\}} \xi_i \big( \Theta_1 1_{i \in r} - \Theta_2 1_{i \notin r} \big)$

subject to:
$\sum_{\xi \in out(q_{init})} \xi = 1$ and $\sum_{\xi \in in(q_{fin})} \xi = 1$ (the solution should be a path)
$\sum_{\xi \in in(q)} \xi - \sum_{\xi \in out(q)} \xi = 0$ for all $q \in \{nodes\} \setminus \{q_{init}, q_{fin}\}$

with arc indicators $\xi_i = 1$ if arc i is active in the solution, 0 otherwise. The objective maximizes 1-BLEU: $\Theta_1$ rewards arcs whose word occurs in the reference, $\Theta_2$ penalizes the others.

11 Taking Account of Clipping in ILP

New variable: clipped word counter
$\gamma_w = \min\Big\{ \sum_{\xi \in \{edges\ with\ w\}} \xi,\ \#\ \mathrm{of}\ w\ \mathrm{in}\ r \Big\}$

ILP problem with clipping:

$\arg\max_{\xi, \gamma} \ \Theta_1 \sum_{w \in \{words\}} \gamma_w \ - \ \Theta_2 \sum_{i \in \{arcs\}} \xi_i$

subject to:
$0 \le \gamma_w \le c_w(r)$ and $\gamma_w \le \sum_{\xi \in \{edges\ carrying\ w\}} \xi$ for all $w \in \{words\}$
$\sum_{\xi \in out(q_{init})} \xi = 1$, $\sum_{\xi \in in(q_{fin})} \xi = 1$
$\sum_{\xi \in in(q)} \xi - \sum_{\xi \in out(q)} \xi = 0$ for all $q \in \{nodes\} \setminus \{q_{init}, q_{fin}\}$

Many ILP solvers are available; we used Gurobi.
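A toy sketch of this ILP (the tiny lattice, the $\Theta$ values, and the open-source PuLP/CBC solver standing in for Gurobi are all illustrative assumptions):

```python
# Toy lattice-oracle ILP with clipping, solved with PuLP's default CBC solver.
import pulp
from collections import defaultdict

# arcs: (source state, target state, word); state 0 is initial, state 3 final
arcs = [(0, 1, "Venus"), (1, 2, "is"), (1, 2, "-"), (2, 3, "twin")]
ref_counts = {"Venus": 1, "is": 1, "twin": 1}  # c_w(r)
THETA1, THETA2 = 1.0, 0.5                      # illustrative constants

prob = pulp.LpProblem("lattice_oracle", pulp.LpMaximize)
xi = [pulp.LpVariable(f"xi_{i}", cat="Binary") for i in range(len(arcs))]
gamma = {w: pulp.LpVariable(f"gamma_{w}", lowBound=0, upBound=c)
         for w, c in ref_counts.items()}       # 0 <= gamma_w <= c_w(r)

# objective: Theta1 * sum_w gamma_w - Theta2 * sum_i xi_i
prob += THETA1 * pulp.lpSum(gamma.values()) - THETA2 * pulp.lpSum(xi)

# clipping: gamma_w <= number of active arcs carrying w
arcs_with = defaultdict(list)
for i, (_, _, w) in enumerate(arcs):
    arcs_with[w].append(xi[i])
for w, var in gamma.items():
    prob += var <= pulp.lpSum(arcs_with[w])

# flow constraints: exactly one path from state 0 to state 3
out_arcs, in_arcs = defaultdict(list), defaultdict(list)
for i, (src, dst, _) in enumerate(arcs):
    out_arcs[src].append(xi[i])
    in_arcs[dst].append(xi[i])
prob += pulp.lpSum(out_arcs[0]) == 1
prob += pulp.lpSum(in_arcs[3]) == 1
for q in (1, 2):
    prob += pulp.lpSum(in_arcs[q]) == pulp.lpSum(out_arcs[q])

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([arcs[i][2] for i in range(len(arcs)) if xi[i].value() > 0.5])
# -> ['Venus', 'is', 'twin']
```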

12 Lagrangian Relaxation-based Oracle Decoder (RLX)

Can be done without 3rd-party solvers: use Lagrangian relaxation to deal with clipping.

Primal:
$\arg\min_{\xi \in \{paths\}} \sum_i \xi_i \theta_i$ subject to $\sum_{\xi\ with\ w} \xi \le c_w(r)$ for all $w \in r$

Dual:
$\arg\max_{\lambda} L(\lambda)$ with $\lambda_w \ge 0$ for all $w \in r$,
where $L(\lambda) = \min_{\xi} \sum_{i \in \{arcs\}} \xi_i \theta_i + \sum_{w \in r} \lambda_w \Big( \sum_{\xi\ with\ w} \xi - c_w(r) \Big)$

13 Lagrangian Relaxation-based Oracle Decoder (RLX)

$L(\lambda) = \min_{\xi} \sum_{i \in \{arcs\}} \xi_i \big( \theta_i + \lambda_{w(\xi_i)} \big) - \sum_{w \in r} \lambda_w c_w(r)$

Evaluating $\arg\min_\xi L(\lambda)$ is just a shortest-path computation in a reweighted lattice; $\lambda$ is then updated towards $\max_\lambda L(\lambda)$.

RLX: incremental constraint enforcing, cf. [Rush et al., 2010]
  for all w: $\lambda_w^{(0)} \leftarrow 0$
  for t = 1..T do
    $\xi^{(t)} = \arg\min_\xi \sum_i \xi_i \big( \theta_i + \lambda_{w(\xi_i)}^{(t)} \big)$
    if all clipping constraints are enforced then
      optimal solution found
    else
      for $w \in r$ do
        $n_w \leftarrow$ number of occurrences of w in $\xi^{(t)}$
        $\lambda_w^{(t)} \leftarrow \max\big( 0,\ \lambda_w^{(t)} + \alpha^{(t)} (n_w - c_w(r)) \big)$

This was for 1-BLEU; for n-BLEU, run on $L \circ \Psi_n$.
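A toy sketch of the RLX subgradient loop (the shortest_path callback, the constant step size alpha, and the iteration cap T are illustrative assumptions; a real implementation would run the FST/DP shortest path with lambda-penalized arc weights):

```python
# Toy RLX subgradient loop for the clipping constraints.
from collections import Counter

def rlx(shortest_path, ref_counts, T=100, alpha=0.1):
    lam = Counter()  # lambda_w, implicitly 0 for unseen words
    for t in range(T):
        path = shortest_path(lam)  # argmin_xi sum_i xi_i (theta_i + lam[w(i)])
        counts = Counter(path)
        # check clipping constraints: n_w <= c_w(r) for all reference words
        if all(counts[w] <= c for w, c in ref_counts.items()):
            return path  # all constraints enforced: optimal solution found
        for w, c in ref_counts.items():
            # subgradient step, projected back onto lambda_w >= 0
            lam[w] = max(0.0, lam[w] + alpha * (counts[w] - c))
    return path  # best effort after T iterations

# toy usage: two candidate "paths" with base weights; lambda penalizes
# over-using reference words beyond their clipped counts
cands = {("the", "the", "cat"): 1.0, ("the", "cat", "sat"): 1.2}
def sp(lam):
    return min(cands, key=lambda p: cands[p] + sum(lam[w] for w in p))
print(rlx(sp, {"the": 1, "cat": 1, "sat": 1}))  # -> ('the', 'cat', 'sat')
```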

14 Experiments: Performance for two decoders (N-code and Moses)

(figure: six panels plotting average oracle BLEU against average decoding time per sentence for the decoders RLX, ILP, LB4, LB2, PB, PBl, LM4, LM2 and n-best oracles:
(a) N-code fr-en (test: 27.88), (b) N-code de-en (test: 22.05), (c) N-code en-de (test: 15.83),
(d) Moses fr-en (test: 27.68), (e) Moses de-en (test: 21.85), (f) Moses en-de (test: 15.89))

15 Conclusion & Future Work

Results
- new oracle decoders find better hypotheses... and do it faster
- (clipping/brevity/corpus) optimization for 2-BLEU is sufficient to obtain good 4-BLEU

Main Conclusion
- lattices contain excellent translations compared to n-best list oracles
- +20 BLEU points above the results of state-of-the-art models
- SMT has serious scoring problems

Future Work
- linear models not flexible enough? -> non-linear score functions
- more features needed? -> rich features
- training flawed? -> discriminative training with lattice oracles

16 Thank you for your attention! Questions?

17 Bibliography

Dreyer, M., Hall, K. B., & Khudanpur, S. P. (2007). Comparing reordering constraints for SMT using efficient BLEU oracle computation. In Proc. of the Workshop on Syntax and Structure in Statistical Translation. Morristown, NJ, USA.

Leusch, G., Matusov, E., & Ney, H. (2008). Complexity of finding the BLEU-optimal hypothesis in a confusion network. In Proc. of the 2008 Conf. on EMNLP. Honolulu, Hawaii.

Li, Z. & Khudanpur, S. (2009). Efficient extraction of oracle-best translations from hypergraphs. In Proc. of Human Language Technologies: The 2009 Annual Conf. of the North American Chapter of the ACL, Companion Volume: Short Papers (pp. 9-12). Morristown, NJ, USA.

Rush, A. M., Sontag, D., Collins, M., & Jaakkola, T. (2010). On dual decomposition and linear programming relaxations for natural language processing. In Proc. of the 2010 Conf. on EMNLP (pp. 1-11). Stroudsburg, PA, USA.

Tromble, R. W., Kumar, S., Och, F., & Macherey, W. (2008). Lattice minimum Bayes-risk decoding for statistical machine translation. In Proc. of the Conf. on EMNLP. Stroudsburg, PA, USA.

18 Implementation

- OpenFST toolbox: works with any semiring (PB required a semiring of sets); proven and well-optimized ShortestPath algorithms
- Gurobi ILP solver
