Minimum Error Rate Training Semiring

Size: px

Start display at page:

Download "Minimum Error Rate Training Semiring"

Lily Williamson
5 years ago
Views:

1 Minimum Error Rate Training Semiring Artem Sokolov & François Yvon LIMSI-CNRS & LIMSI-CNRS/Univ. Paris Sud EAMT May 2011 Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

2 Talk Plan 1 Introduction Phrase-based statistical machine translation Minimum Error Rate Training Contribution 2 Semirings Lattice MERT MERT Semiring 3 Implementation 4 Experiments Setup Results 5 Future Work Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

3 Introduction Phrase-based statistical machine translation Probability model and inference in SMT system Probability of translation e given source sentence f: p(e f) = Z(f) 1 exp( λ h(e, f)) h(e, f) feature vector (various compatibility measures of e and f) λ parameter vector, λ i regulates importance of the feature h i (e, f) Translating by MAP-inference: ẽ f ( λ) = arg max e E p(e f) = arg max λ h(e, f) e E E reachable translations (search space), can be approximated by: list of n-best hypotheses word lattice Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

4 Introduction Tuning SMT system with MERT Minimum Error Rate Training Given: development set {(f, r f )} (source f & reference r f pairs) Solve: λ = arg max BLEU({ẽ f ( λ, E( λ)), r f }) λ BLEU is non-convex and not differentiable, hence heuristics (MERT). Search space approximation depends on λ, so iterative tuning: {(f, r f )} Configuration: λ t Decoder Approximation of E Updater Scorer Updated λ t Tuning: MERT Features Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

5 Introduction Minimum Error Rate Training MERT proceeds in series of optimizations along directions r: Optimal translation: ẽ f (γ) = arg max e E λ = λ 0 + γ r λ h(e, f) = arg max e E λ 0 h(e, f) }{{} intercept +γ r h(e, f) }{{} slope each translation hypothesis is associated with a line, upper envelope: dominating lines when λ is moved along r λ h γ Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

6 Introduction Minimum Error Rate Training γ-projections of intersections give intervals of constant optimal hypothesis optimal γ found by merging intervals for f F and scoring each update λ = λ 0 + γ i r i, where i is the index of the direction yielding the highest BLEU λ h γ Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

7 Introduction Minimum Error Rate Training MERT problems very slow, because of: overall number of iterations folklore: number of iterations number of dimensions slowness of each iteration (dominated by decoding time) non-monotonicity/instability of the training process sensitivity of the resulting solutions to initial conditions Ways to tackle the problems improve optimization other target function approximations changes into optimization algorithms improve search space processing this presentation use lattices (better approximation of the complete search space) reduce search to standard operations (facilitates implementation) reduce number of iteration this presentation Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

8 Contribution Introduction Contribution Recast Lattice MERT algorithm of [Macherey et al., 2008] in a semiring framework has already been hinted to in [Dyer et al., 2010] but was never formally described lack of implementation details Reimplement MERT using this reformulation and general-purpose FST toolbox OpenFST Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

9 Semirings Lattice MERT Semirings Semiring K = K,,, 0, 1 : K,, 0 is a commutative monoid with identity element 0: a (b c) = (a b) c a b = b a a 0 = 0 a = a K,, 1 is a monoid with identity element 1 distributes over a (b c) = (a b) (a c) (b c) a = (b a) (c a) element 0 annihilates K a 0 = 0 a = 0. Examples R, +,, 0, 1 real semiring S,,,, i S i semiring of sets Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

10 Semirings Lattice MERT Lattice MERT [Macherey et al., 2008] source fr: Vénus est la jumelle infernale de la Terre target en: Venus is Earth s hellish twin Vénus:Venus h_02 2 est:is h_23 la_jumelle:twin h_35 0 Vénus_est:Venus h_01 1 NULL:- h_13 3 la:the h_34 jumelle:twin h_45 5 infernale:of_hell h_58 4 infernale:hellish h_47 infernale:infernal h_46 7 jumelle:twin h_78 jumelle:twin h_68 8 de_la_terre:of_the_earth h_ Decomposability of h(e, f) into a sum of local features h 01,h Envelopes are distributed over nodes in the lattice Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

11 MERT Semiring Semirings MERT Semiring Host set: a line: d y + d s x (hypothesis) D = D,,, 0, 1 set of lines d i : d = {d i,y + d i,s x} (set of hypotheses) set of sets d k of lines: D = { {d k i,y + d k i,s x}} Operations and : for d 1, d 2 D d 1 d 2 = env(d 1 d 2 ) d 1 d 2 = env({(di.y 1 + d j.y 2 ) + (d i.s 1 + d j.s 2 ) x d 1 i d 1, dj 2 d 2 }) Unities: 0 = 1 = {0 + 0 x} Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

12 d 1 d 2 = env(d 1 d 2 ) Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18 Semirings Semiring Operations Illustration MERT Semiring -example d 1 d 2 = env({(d 1 i.y + d 2 j.y ) + (d 1 i.s + d 2 j.s) x d 1 i d 1, d 2 j d 2 }) -example

13 Semirings MERT Semiring Shortest Paths for MERT Semiring Each arc in the FST carries: target word a vector h(a, f) of local features associated with a singleton set containing line d with slope d s = ( r h(a, f)) y-intercept d y = ( λ 0 h(a, f)) Weight of a candidate translation path e = e 1... e l : w(e) = l w(e i ) = { λ 0 i=1 l h(e i, f) + ( r i=1 Upper envelope of all the lines (hypotheses): env( e w(e)) = e w(e) = e l(e) l h(e i, f)) x} i=1 w(e i ). Generic shortest distance algorithms over acyclic graphs calculate this. Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18 i=1

14 Implementation Implementation Basics: OpenFST toolbox works with any semiring proven and well optimized ShortestPath algorithms other useful algorithms: Union, Determinize, etc. Lattice minimization: Union of lattices between decoder runs Determinize+Minimize to eliminate duplicate hypotheses won t work MERT semiring is not divisible circumvent by performing Union+Determinize over (min, +) semiring All directions simultaneously weights as arrays of envelopes random direction BLEU Random restarts help only for the first iteration Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

15 Experiments Setup Experiments Data: NewsCommentary (dev: 2051) & WMT10 (dev: 1026), common test French to English FST MERT tuning: OpenFST-based multi-threaded implementation zero restart points axes and additional random directions Baseline MERT tuning: MERT implementation included in MOSES toolkit 100-best list, 20 restart points Koehn s coordinate descend (only axis directions) Decoder: n-gram phrase-based SMT system N-code 1, 11 features 1 Demo on Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

16 Experiments Results Experiments newsco, dev newsco, test BLEU n-best MERT, dev 15.5 fst MERT (r=0), dev fst MERT (r=20), dev WMT10, dev n-best MERT, test 15.5 fst MERT (r=0), test 15.0 fst MERT (r=20), test WMT10, test BLEU n-best MERT, dev fst MERT (r=0), dev fst MERT (r=20), dev fst MERT (r=50), dev n-best MERT, test fst MERT (r=0), test fst MERT (r=20), test fst MERT (r=50), test iteration iteration Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

17 Future Work Conclusion & Future Work Conclusion Semiring formalization allows using generic FST toolkits to do MERT Convergence in less iterations Future Work Better stopping criteria to detect saturation Faster should be most helpful for speed up Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

18 Future Work Thank you for your attention! Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 18

19 Bibliography Dyer, C., Lopez, A., Ganitkevitch, J., Weese, J., Ture, F., Blunsom, P., Setiawan, H., Eidelman, V., & Resnik, P. (2010). cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proc. of the ACL (pp. 7 12). Macherey, W., Och, F. J., Thayer, I., & Uszkoreit, J. (2008). Lattice-based minimum error rate training for statistical machine translation. In Proc. of the Conf. on EMNLP (pp ). Artem Sokolov & François Yvon (LIMSI) Minimum Error Rate Training Semiring EAMT / 1

Computing Lattice BLEU Oracle Scores for Machine Translation

Computing Lattice BLEU Oracle Scores for Machine Translation Computing Lattice Oracle Scores for Machine Translation Artem Sokolov & Guillaume Wisniewski & François Yvon {firstname.lastname}@limsi.fr LIMSI, Orsay, France 1 Introduction 2 Oracle Decoding Task 3 Proposed