Lecture 9: Decoding
Andreas Maletti
Statistical Machine Translation, Stuttgart, January 20, 2012 (SMT VIII)
Lecture 9

Last time
- Synchronous grammars (tree transducers)
- Rule extraction
- Weight training (EM)

This time
- Inside and outside weights (again)
- n-best derivations
- Cube pruning
- Factorization
Contents
1. Inside and Outside Weights
2. n-best Derivations
3. Cube Pruning
4. Factorization
State Metrics of a Weighted Tree Grammar

Given a weighted tree grammar M = (Q, Σ, I, R); D_M^q denotes the set of derivations in M starting from state q.

Definition
Inside weight of q ∈ Q: in(q) = ∑_{d ∈ D_M^q} wt(d)
Outside weight of q ∈ Q: out(q) = ∑_{t ∈ C_Σ({q})} ∑_{d ∈ D_M(t)} wt(d)
Inside Weight

Question: Why did nobody complain about the infinite sum?

in(q) = ∑_{d ∈ D_M^q} wt(d)
Computing Inside Weights

Theorem
in(q) = ∑_{q →_c σ(q1, …, qk) ∈ R} c · in(q1) ⋯ in(qk)

Problem: this is not a well-founded recursion!

Turn it into a system of equations, one per state:
0 = −in(q) + ∑_{q →_c σ(q1, …, qk) ∈ R} c · in(q1) ⋯ in(qk)
0 = −in(p) + …
⋮
|Q| variables and |Q| equations
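In practice the least solution of this system can be approximated by simply iterating the recursion starting from in(q) = 0 for all states. A minimal Python sketch, assuming a toy grammar with a single state q and made-up rule weights (q → σ(q, q) with weight 0.4 and q → α with weight 0.6), not an example from the lecture:

```python
# Fixed-point iteration for the inside-weight equations
#   in(q) = sum over rules q ->_c sigma(q1,...,qk) of c * in(q1) * ... * in(qk)
# Grammar and weights are illustrative assumptions.
rules = [
    ("q", 0.4, ("q", "q")),  # q -> sigma(q, q) with weight 0.4
    ("q", 0.6, ()),          # q -> alpha with weight 0.6
]

def inside(rules, states, iterations=100):
    val = {q: 0.0 for q in states}  # start below the least solution
    for _ in range(iterations):
        new = {q: 0.0 for q in states}
        for lhs, c, children in rules:
            prod = c
            for ch in children:
                prod *= val[ch]  # multiply in the children's inside weights
            new[lhs] += prod     # sum over all rules with left-hand side lhs
        val = new
    return val

print(inside(rules, {"q"}))  # converges towards in(q) = 1.0
```

Here in(q) solves in(q) = 0.4 · in(q)² + 0.6, whose least solution is 1; plain iteration converges only linearly, which is why the slides turn to faster zero-point methods such as Newton-Raphson.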
Solving a System of Equations

|Q| variables and |Q| equations:
0 = −in(q) + ∑_{q →_c σ(q1, …, qk) ∈ R} c · in(q1) ⋯ in(qk)
0 = −in(p) + …
⋮

Question: Is it a linear system of equations? NO!

Approximation: use any method for approximating zero-points of functions:
- Regula falsi
- Tangent method (Newton-Raphson)
- Secant method
- …
Newton-Raphson Method

Algorithm
1. Select an initial point p
2. Determine the tangent to the curve at p
3. Determine the zero-point p′ of the tangent
4. Set p := p′ and go back to 2
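The tangent-step loop above fits in a few lines of Python; the function f(x) = x² − 2 and the starting point are illustrative assumptions, not taken from the lecture:

```python
# One-dimensional Newton-Raphson: repeatedly jump to the zero of the tangent.
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:   # close enough to a zero-point
            return x
        x = x - fx / fprime(x)  # zero of the tangent at x
    return x

# Example: the square root of 2 as the positive zero of f(x) = x^2 - 2.
root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)
```

With a good starting point the number of correct digits roughly doubles per step, matching the convergence-speed slide below; a bad starting point can make the iteration diverge.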
Newton-Raphson Iteration (sequence of figures illustrating successive tangent steps)
Convergence Speed

Given a good initial point, Newton-Raphson converges quadratically to the solution (the number of correct decimal digits doubles with each iteration).

But
- Convergence is not guaranteed
- The initial point needs to be reasonably close to the zero-point
Convergence (figure)
Contents
1. Inside and Outside Weights
2. n-best Derivations
3. Cube Pruning
4. Factorization
n-best Derivations

Problem
Given a WTG M = (Q, Σ, I, R) and q ∈ Q, compute the n highest-scoring derivations of D_M^q.

Caveats
The semantics is given by wt(t) = ∑_{q ∈ I} ∑_{d ∈ D_M^q(t)} wt(d)
- We disregard the actual input — use a product construction to restrict to the actual input
- We disregard the summation — no fix!
Viterbi Algorithm

Algorithm
1. Sort the states topologically
2. Select the least unprocessed state q
3. Determine its best derivation (based on the previously processed states)
4. Mark q as processed and return to 2

Requirement
Optimal substructure property (dynamic programming):
a_i ≤ b_i implies f(a_1, …, a_k) ≤ f(b_1, …, b_k)
in our case: a_i ≤ b_i must imply ∏_{i=1}^k a_i ≤ ∏_{i=1}^k b_i
Viterbi Algorithm — Example

Rules: 3 → σ(1, 2), 1 → α (0.7), 1 → β (0.3), 2 → α (0.2), 2 → β (0.8)

Best derivations:
- state 1: α with weight 0.7
- state 2: β with weight 0.8
- state 3: σ(α, β)
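The pass over the example can be sketched in Python as follows; the grammar mirrors the toy example above, and the weight 1.0 for the σ-rule is an assumption, since the transcription does not show it:

```python
# Viterbi best derivation for an acyclic WTG: process states in topological
# order, so every child state already carries its best derivation.
# The sigma-rule weight 1.0 is an assumption for illustration.
rules = [
    (1, "alpha", 0.7, ()),
    (1, "beta", 0.3, ()),
    (2, "alpha", 0.2, ()),
    (2, "beta", 0.8, ()),
    (3, "sigma", 1.0, (1, 2)),
]

def viterbi(rules, topo_order):
    best = {}  # state -> (weight, derivation tree)
    for q in topo_order:
        for lhs, sym, c, children in rules:
            if lhs != q:
                continue
            w = c
            subtrees = []
            for ch in children:
                w *= best[ch][0]          # optimal substructure: best children
                subtrees.append(best[ch][1])
            cand = (w, (sym, tuple(subtrees)))
            if q not in best or cand[0] > best[q][0]:
                best[q] = cand
    return best

best = viterbi(rules, [1, 2, 3])
print(best[3])  # sigma(alpha, beta) with weight 0.7 * 0.8 = 0.56
```

The topological order guarantees that step 3 of the algorithm only ever looks up states whose best derivation is already final.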
Viterbi Algorithm — Exercise (taken from Kaeslin: A Gentle Introduction to Dynamic Programming)
Viterbi Algorithm — Suitable Weight Structures

- (ℝ≥0, +, ·, 0, 1) — YES
- (ℝ, +, ·, 0, 1) — NO
- ({0, 1}, max, min, 0, 1) — YES
- (ℕ, +, ·, 0, 1) — YES
- ([0, 1], max, ·, 0, 1) — YES

Extension
Can we handle cyclic WTG? YES, under the additional requirement
f(a_1, …, a_k) ≤ a_i, or in our case ∏_{i=1}^k a_i ≤ a_i
n-best Algorithm

Observations
- The best subderivations yield the best derivation
- How do we obtain the 2nd-best derivation?
n-best Algorithm — Optimization

Lazy computation of values yields O(|E| + |D_max| · n · log n)
- |E|: number of transitions
- |D_max|: size of the longest derivation in the top n
- n: desired number of derivations
3-best Algorithm (same example grammar as above)

1st derivation: σ(α, β) with weight 0.7 · 0.8 = 0.56

2nd derivation: best of
- β with weight 0.3
- σ with the 2nd-best derivation from state 1 and the best from state 2 (weight 0.3 · 0.8 = 0.24)
- σ with the best derivation from state 1 and the 2nd-best from state 2 (weight 0.7 · 0.2 = 0.14)
⇒ 2nd derivation: β with weight 0.3

3rd derivation: σ(β, β) with weight 0.3 · 0.8 = 0.24
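The candidate bookkeeping above can be sketched with a naive (non-lazy) k-best pass in Python. The grammar mirrors the example; the σ-rule weight 1.0 and the direct β-rule for state 3 with weight 0.3 are assumptions reconstructed from the listed derivation weights. A lazy variant of the same idea gives the O(|E| + |D_max| · n · log n) bound mentioned earlier:

```python
import heapq
from itertools import product

# Naive k-best per state for an acyclic WTG: for every rule, combine the
# k-best lists of the child states and keep the k best results overall.
# Grammar reconstructed for illustration (sigma-rule weight 1.0 and the
# direct beta-rule for state 3 are assumptions).
rules = [
    (1, "alpha", 0.7, ()),
    (1, "beta", 0.3, ()),
    (2, "alpha", 0.2, ()),
    (2, "beta", 0.8, ()),
    (3, "sigma", 1.0, (1, 2)),
    (3, "beta", 0.3, ()),
]

def k_best(rules, topo_order, k):
    best = {}  # state -> list of (weight, tree), best first
    for q in topo_order:
        cands = []
        for lhs, sym, c, children in rules:
            if lhs != q:
                continue
            # all combinations of the children's k-best derivations
            for combo in product(*(best[ch] for ch in children)):
                w = c
                for cw, _ in combo:
                    w *= cw
                cands.append((w, (sym, tuple(t for _, t in combo))))
        best[q] = heapq.nlargest(k, cands, key=lambda x: x[0])
    return best

for w, tree in k_best(rules, [1, 2, 3], 3)[3]:
    print(round(w, 2), tree)
```

On this grammar the pass reproduces the slide's answer: σ(α, β) at 0.56, β at 0.3, σ(β, β) at 0.24. The naive version enumerates all k² child combinations per rule; the lazy version pops them best-first from a heap instead.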
5-best Algorithm — Exercise
Contents
1. Inside and Outside Weights
2. n-best Derivations
3. Cube Pruning
4. Factorization
Cube Pruning

Cube pruning = n-best + language model
Cube Pruning — Syntax-based Systems

Pipeline: Input → Parser → Machine translation system → Language model → Output

Notes
- We can compute n-best lists individually for each component
- But that leads to large search errors
- Slightly better: a 10^(n+2)-best list combined with a 10^n-best list

Alternatives
- Full integration (intersection) — too expensive
- Pruning and beam search — too bad
- Cube pruning
Cube Pruning — Algorithm

Proceed as for n-best derivations, but multiply in the language model score at the end.
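One way to sketch this in Python: explore the grid (the "cube") of index pairs over two sorted child k-best lists best-first by their LM-free scores, and multiply in a language-model factor only when a combination is popped. The lists and the toy lm function are illustrative assumptions:

```python
import heapq

# Cube-pruning-style combination of two sorted k-best lists. The frontier is
# ordered by the LM-free score; the (here: toy) language-model factor is
# multiplied in only when a combination is popped, as on the slide.
def cube(list_a, list_b, lm, k):
    heap = [(-(list_a[0][0] * list_b[0][0]), 0, 0)]  # negated score, i, j
    seen = {(0, 0)}
    out = []
    while heap and len(out) < k:
        _, i, j = heapq.heappop(heap)
        wa, ta = list_a[i]
        wb, tb = list_b[j]
        out.append((wa * wb * lm(ta, tb), (ta, tb)))  # LM applied on pop
        for ni, nj in ((i + 1, j), (i, j + 1)):       # expand grid neighbours
            if ni < len(list_a) and nj < len(list_b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-(list_a[ni][0] * list_b[nj][0]), ni, nj))
    out.sort(key=lambda x: -x[0])  # popped order is not the true order
    return out

a = [(0.7, "alpha"), (0.3, "beta")]
b = [(0.8, "beta"), (0.2, "alpha")]
lm = lambda x, y: 0.5 if x == y else 1.0  # toy "language model" penalty
print(cube(a, b, lm, 3))
```

Because the LM factor is only applied on pop, the popped order can disagree with the true order; this is exactly why the output list is not sorted and the method comes with no guarantees, as the next slide notes.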
Cube Pruning — Observations
- retains the efficiency of the n-best derivation algorithm
- the output list is not sorted
- no guarantees
Contents
1. Inside and Outside Weights
2. n-best Derivations
3. Cube Pruning
4. Factorization
Factorization

Motivation
rk(M) is an essential part of the complexity of XTT operations (like input restriction):
O(|R| · len(M) · |P|^(2·rk(M)+5))

Factorization
- reduce rk(M) by decomposing rules
- maximal decomposition preferred
Factorization — References
- Zhang, Huang, Gildea, Knight: Synchronous binarization for machine translation. HLT-NAACL 2006 [for SCFG]
- Gildea, Satta, Zhang: Factoring synchronous grammars by sorting. COLING/ACL 2006 [for SCFG]
- Nesson, Satta, Shieber: Optimal k-arization of synchronous tree-adjoining grammar. ACL 2008 [for STAG]
- Gildea: Optimal parsing strategies for linear context-free rewriting systems. HLT-NAACL 2010 [for LCFRS]
Maximal Factorization

Common advertisement slogans: "optimal k-arization", "optimal parsing strategy"

But
- factorization is just one way to reduce rk(M)
- the obtained "optimal" rank is not optimal for the given transformation
- rk(M) is just one parameter of the parsing complexity

Conclusion: both "optimal k-arization" and "optimal parsing strategy" boil down to maximal factorization.
Construction — Example

A rule whose left- and right-hand sides are trees over γ and σ with states q, q1, q2, q3 is decomposed into constructed rules of smaller rank (tree diagrams omitted).
Construction — Notes
- runs in linear time
- returns a maximal (meaningful) factorization
- the returned XTOP is rank-optimal (among all factorizations)

More Notes
- Factorization is strongly related to rule extraction
- Rule extraction always yields a rank-optimal XTOP
References
- Huang, Chiang: Better k-best parsing. IWPT 2005
- Chiang: Hierarchical phrase-based translation. Computational Linguistics 33, 2007
- May, Knight: Tiburon — a weighted tree automata toolkit. CIAA 2006

Thank you for your attention!

Questions?
More informationTo make a grammar probabilistic, we need to assign a probability to each context-free rewrite
Notes on the Inside-Outside Algorithm To make a grammar probabilistic, we need to assign a probability to each context-free rewrite rule. But how should these probabilities be chosen? It is natural to
More informationParsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford)
Parsing Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) S N VP V NP D N John hit the ball Levels of analysis Level Morphology/Lexical POS (morpho-synactic), WSD Elements
More informationSynchronous Grammars
ynchronous Grammars ynchronous grammars are a way of simultaneously generating pairs of recursively related strings ynchronous grammar w wʹ ynchronous grammars were originally invented for programming
More informationTHE LANGUAGE OF FUNCTIONS *
PRECALCULUS THE LANGUAGE OF FUNCTIONS * In algebra you began a process of learning a basic vocabulary, grammar, and syntax forming the foundation of the language of mathematics. For example, consider the
More informationLecture Notes on Inductive Definitions
Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 September 2, 2004 These supplementary notes review the notion of an inductive definition and
More informationNON-LINEAR ALGEBRAIC EQUATIONS Lec. 5.1: Nonlinear Equation in Single Variable
NON-LINEAR ALGEBRAIC EQUATIONS Lec. 5.1: Nonlinear Equation in Single Variable Dr. Niket Kaisare Department of Chemical Engineering IIT Madras NPTEL Course: MATLAB Programming for Numerical Computations
More informationSOLUTION OF ALGEBRAIC AND TRANSCENDENTAL EQUATIONS BISECTION METHOD
BISECTION METHOD If a function f(x) is continuous between a and b, and f(a) and f(b) are of opposite signs, then there exists at least one root between a and b. It is shown graphically as, Let f a be negative
More informationOut of GIZA Efficient Word Alignment Models for SMT
Out of GIZA Efficient Word Alignment Models for SMT Yanjun Ma National Centre for Language Technology School of Computing Dublin City University NCLT Seminar Series March 4, 2009 Y. Ma (DCU) Out of Giza
More informationRoots of Equations. ITCS 4133/5133: Introduction to Numerical Methods 1 Roots of Equations
Roots of Equations Direct Search, Bisection Methods Regula Falsi, Secant Methods Newton-Raphson Method Zeros of Polynomials (Horner s, Muller s methods) EigenValue Analysis ITCS 4133/5133: Introduction
More informationAspects of Tree-Based Statistical Machine Translation
Aspects of Tree-Based tatistical Machine Translation Marcello Federico (based on slides by Gabriele Musillo) Human Language Technology FBK-irst 2011 Outline Tree-based translation models: ynchronous context
More informationFeature-Rich Translation by Quasi-Synchronous Lattice Parsing
Feature-Rich Translation by Quasi-Synchronous Lattice Parsing Kevin Gimpel and Noah A. Smith Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213,
More informationHence a root lies between 1 and 2. Since f a is negative and f(x 0 ) is positive The root lies between a and x 0 i.e. 1 and 1.
The Bisection method or BOLZANO s method or Interval halving method: Find the positive root of x 3 x = 1 correct to four decimal places by bisection method Let f x = x 3 x 1 Here f 0 = 1 = ve, f 1 = ve,
More informationImproving Sequence-to-Sequence Constituency Parsing
Improving Sequence-to-Sequence Constituency Parsing Lemao Liu, Muhua Zhu and Shuming Shi Tencent AI Lab, Shenzhen, China {redmondliu,muhuazhu, shumingshi}@tencent.com Abstract Sequence-to-sequence constituency
More informationNumerical Methods. Root Finding
Numerical Methods Solving Non Linear 1-Dimensional Equations Root Finding Given a real valued function f of one variable (say ), the idea is to find an such that: f() 0 1 Root Finding Eamples Find real
More informationComposition Closure of ε-free Linear Extended Top-down Tree Transducers
Composition Closure of ε-free Linear Extended Top-down Tree Transducers Zoltán Fülöp 1, and Andreas Maletti 2, 1 Department of Foundations of Computer Science, University of Szeged Árpád tér 2, H-6720
More informationComparative Analysis of Convergence of Various Numerical Methods
Journal of Computer and Mathematical Sciences, Vol.6(6),290-297, June 2015 (An International Research Journal), www.compmath-journal.org ISSN 0976-5727 (Print) ISSN 2319-8133 (Online) Comparative Analysis
More informationp 1 p 0 (p 1, f(p 1 )) (p 0, f(p 0 )) The geometric construction of p 2 for the se- cant method.
80 CHAP. 2 SOLUTION OF NONLINEAR EQUATIONS f (x) = 0 y y = f(x) (p, 0) p 2 p 1 p 0 x (p 1, f(p 1 )) (p 0, f(p 0 )) The geometric construction of p 2 for the se- Figure 2.16 cant method. Secant Method The
More informationCross-Entropy and Estimation of Probabilistic Context-Free Grammars
Cross-Entropy and Estimation of Probabilistic Context-Free Grammars Anna Corazza Department of Physics University Federico II via Cinthia I-8026 Napoli, Italy corazza@na.infn.it Giorgio Satta Department
More informationThe Power of Weighted Regularity-Preserving Multi Bottom-up Tree Transducers
International Journal of Foundations of Computer cience c World cientific Publishing Company The Power of Weighted Regularity-Preserving Multi Bottom-up Tree Transducers ANDREA MALETTI Universität Leipzig,
More informationComputing Lattice BLEU Oracle Scores for Machine Translation
Computing Lattice Oracle Scores for Machine Translation Artem Sokolov & Guillaume Wisniewski & François Yvon {firstname.lastname}@limsi.fr LIMSI, Orsay, France 1 Introduction 2 Oracle Decoding Task 3 Proposed
More informationContents. Preface to the Third Edition (2007) Preface to the Second Edition (1992) Preface to the First Edition (1985) License and Legal Information
Contents Preface to the Third Edition (2007) Preface to the Second Edition (1992) Preface to the First Edition (1985) License and Legal Information xi xiv xvii xix 1 Preliminaries 1 1.0 Introduction.............................
More informationLecture 4: Types of errors. Bayesian regression models. Logistic regression
Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture
More informationChapter 14 (Partially) Unsupervised Parsing
Chapter 14 (Partially) Unsupervised Parsing The linguistically-motivated tree transformations we discussed previously are very effective, but when we move to a new language, we may have to come up with
More informationConsensus Training for Consensus Decoding in Machine Translation
Consensus Training for Consensus Decoding in Machine Translation Adam Pauls, John DeNero and Dan Klein Computer Science Division University of California at Berkeley {adpauls,denero,klein}@cs.berkeley.edu
More informationOnline Learning in Tensor Space
Online Learning in Tensor Space Yuan Cao Sanjeev Khudanpur Center for Language & Speech Processing and Human Language Technology Center of Excellence The Johns Hopkins University Baltimore, MD, USA, 21218
More informationParsing Linear Context-Free Rewriting Systems with Fast Matrix Multiplication
Parsing Linear Context-Free Rewriting Systems with Fast Matrix Multiplication Shay B. Cohen University of Edinburgh Daniel Gildea University of Rochester We describe a recognition algorithm for a subset
More informationCHAPTER-II ROOTS OF EQUATIONS
CHAPTER-II ROOTS OF EQUATIONS 2.1 Introduction The roots or zeros of equations can be simply defined as the values of x that makes f(x) =0. There are many ways to solve for roots of equations. For some
More informationTranslation as Weighted Deduction
Translation as Weighted Deduction Adam Lopez University of Edinburgh 10 Crichton Street Edinburgh, EH8 9AB United Kingdom alopez@inf.ed.ac.uk Abstract We present a unified view of many translation algorithms
More informationFinite-State Transducers
Finite-State Transducers - Seminar on Natural Language Processing - Michael Pradel July 6, 2007 Finite-state transducers play an important role in natural language processing. They provide a model for
More informationExtended Multi Bottom-Up Tree Transducers
Extended Multi Bottom-Up Tree Transducers Joost Engelfriet 1, Eric Lilin 2, and Andreas Maletti 3;? 1 Leiden Instite of Advanced Comper Science Leiden University, P.O. Box 9512, 2300 RA Leiden, The Netherlands
More informationStatistical Machine Translation
Statistical Machine Translation -tree-based models (cont.)- Artem Sokolov Computerlinguistik Universität Heidelberg Sommersemester 2015 material from P. Koehn, S. Riezler, D. Altshuler Bottom-Up Decoding
More informationSolving nonlinear equations (See online notes and lecture notes for full details) 1.3: Newton s Method
Solving nonlinear equations (See online notes and lecture notes for full details) 1.3: Newton s Method MA385 Numerical Analysis September 2018 (1/16) Sir Isaac Newton, 1643-1727, England. Easily one of
More informationMath 4329: Numerical Analysis Chapter 03: Fixed Point Iteration and Ill behaving problems. Natasha S. Sharma, PhD
Why another root finding technique? iteration gives us the freedom to design our own root finding algorithm. The design of such algorithms is motivated by the need to improve the speed and accuracy of
More informationTranslation as Weighted Deduction
Translation as Weighted Deduction Adam Lopez University of Edinburgh 10 Crichton Street Edinburgh, EH8 9AB United Kingdom alopez@inf.ed.ac.uk Abstract We present a unified view of many translation algorithms
More informationNatural Language Processing (CSEP 517): Machine Translation
Natural Language Processing (CSEP 57): Machine Translation Noah Smith c 207 University of Washington nasmith@cs.washington.edu May 5, 207 / 59 To-Do List Online quiz: due Sunday (Jurafsky and Martin, 2008,
More informationMargin-based Decomposed Amortized Inference
Margin-based Decomposed Amortized Inference Gourab Kundu and Vivek Srikumar and Dan Roth University of Illinois, Urbana-Champaign Urbana, IL. 61801 {kundu2, vsrikum2, danr}@illinois.edu Abstract Given
More informationSyntax-Based Decoding
Syntax-Based Decoding Philipp Koehn 9 November 2017 1 syntax-based models Synchronous Context Free Grammar Rules 2 Nonterminal rules NP DET 1 2 JJ 3 DET 1 JJ 3 2 Terminal rules N maison house NP la maison
More informationZeroes of Transcendental and Polynomial Equations. Bisection method, Regula-falsi method and Newton-Raphson method
Zeroes of Transcendental and Polynomial Equations Bisection method, Regula-falsi method and Newton-Raphson method PRELIMINARIES Solution of equation f (x) = 0 A number (real or complex) is a root of the
More informationBayesian Learning of Non-compositional Phrases with Synchronous Parsing
Bayesian Learning of Non-compositional Phrases with Synchronous Parsing Hao Zhang Computer Science Department University of Rochester Rochester, NY 14627 zhanghao@cs.rochester.edu Chris Quirk Microsoft
More informationCMSC 330: Organization of Programming Languages. Pushdown Automata Parsing
CMSC 330: Organization of Programming Languages Pushdown Automata Parsing Chomsky Hierarchy Categorization of various languages and grammars Each is strictly more restrictive than the previous First described
More informationAutomata Theory. CS F-10 Non-Context-Free Langauges Closure Properties of Context-Free Languages. David Galles
Automata Theory CS411-2015F-10 Non-Context-Free Langauges Closure Properties of Context-Free Languages David Galles Department of Computer Science University of San Francisco 10-0: Fun with CFGs Create
More informationHidden Markov Models and Gaussian Mixture Models
Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian
More informationStatistical Machine Translation. Part III: Search Problem. Complexity issues. DP beam-search: with single and multi-stacks
Statistical Machine Translation Marcello Federico FBK-irst Trento, Italy Galileo Galilei PhD School - University of Pisa Pisa, 7-19 May 008 Part III: Search Problem 1 Complexity issues A search: with single
More information2. Tree-to-String [6] (Phrase Based Machine Translation, PBMT)[1] [2] [7] (Targeted Self-Training) [7] Tree-to-String [3] Tree-to-String [4]
1,a) 1,b) Graham Nubig 1,c) 1,d) 1,) 1. (Phras Basd Machin Translation, PBMT)[1] [2] Tr-to-String [3] Tr-to-String [4] [5] 1 Graduat School of Information Scinc, Nara Institut of Scinc and Tchnology a)
More informationLog-Linear Models, MEMMs, and CRFs
Log-Linear Models, MEMMs, and CRFs Michael Collins 1 Notation Throughout this note I ll use underline to denote vectors. For example, w R d will be a vector with components w 1, w 2,... w d. We use expx
More information