Statistical Machine Translation of Natural Languages


Weighted Tree Automata and Weighted Tree Transducers can help in Statistical Machine Translation of Natural Languages

Heiko Vogler, Technische Universität Dresden, Germany
Graduiertenkolleg Quantitative Logics and Automata, Dresden, November 2012

Outline of the talk:
- Statistical machine translation (no survey!)
- Modeling with wta and wtt
- Using automata-theoretic results to improve modeling
- Summary

Given: source language SL, target language TL.
Find: a machine translation h : SL → TL.

E.g. SL = English, TL = German:
s = "I saw the man with the telescope"
h(s) = "Ich sah den Mann durch das Tel."

machine translation ⊇ statistical machine translation: modeling - training - evaluation

Modeling turns assumptions into a hypothesis space H.
Assumptions: mental work, experience, no data.
Hypothesis space: H ⊆ { h | h : SL → TL }.

Log-linear modeling:
set Y of correspondence structures [Liang et al. 06], with projections π_SL : Y → SL and π_TL : Y → TL.

Hypothesis space:
H = { h_{λ,Φ} | λ ∈ ℝ_{≥0}^m, Φ : Y → ℝ^m }
h_{λ,Φ} : SL → TL, s ↦ π_TL( argmax_{y ∈ Y : π_SL(y) = s} λ · Φ(y) )
where λ · Φ(y) = λ_1 · Φ(y)_1 + … + λ_m · Φ(y)_m.

Here: m = 2 and Φ(y) = (h_TM(y), h_LM(y)), where h_TM is the translation model and h_LM is the language model.
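The decision rule above can be sketched with a toy example for m = 2; all correspondence structures, sentences, and feature values below are invented for illustration:

```python
import math

# Hypothetical correspondence structures: each y pairs a source and a
# target sentence with two scores (translation model, language model).
Y = [
    {"src": "I saw the man", "tgt": "Ich sah den Mann", "tm": 0.6, "lm": 0.5},
    {"src": "I saw the man", "tgt": "Ich sah der Mann", "tm": 0.6, "lm": 0.1},
    {"src": "the telescope", "tgt": "das Teleskop",     "tm": 0.9, "lm": 0.8},
]

def phi(y):
    # Feature vector Phi(y) = (h_TM(y), h_LM(y)) as log-probabilities.
    return (math.log(y["tm"]), math.log(y["lm"]))

def h(s, lam=(1.0, 1.0)):
    # h_{lambda,Phi}(s): project the best-scoring y with pi_SL(y) = s to TL.
    candidates = [y for y in Y if y["src"] == s]
    best = max(candidates, key=lambda y: sum(l * f for l, f in zip(lam, phi(y))))
    return best["tgt"]

print(h("I saw the man"))  # the candidate with the better language-model score
```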


First assumption: SL and TL are the yields of weighted recognizable tree languages.

A weighted tree language is a mapping L : T_Σ → ℝ.
L is recognizable if there is a wta A which recognizes (computes) L.

Weighted tree automaton (wta): A = (Q, Σ, δ, F) where
- Q is a finite set (states),
- Σ is a ranked alphabet (input symbols),
- δ = (δ_σ | σ ∈ Σ) with δ_σ : Q^k × Q → ℝ, i.e. δ_σ(q_1 ⋯ q_k, q) ∈ ℝ for σ of rank k,
- F ⊆ Q (final states).

A run of A on ξ ∈ T_Σ is a mapping r : pos(ξ) → Q; the set of runs on ξ is R_A(ξ).

Weight of a run r:
wt(r) = ∏_{w ∈ pos(ξ)} δ_σ(r(w1) ⋯ r(wk), r(w)),
where σ is the label of ξ at w and k is the rank of σ.

Weighted tree language recognized by A:
L_A : T_Σ → ℝ, L_A(ξ) = max_{r ∈ R_A(ξ)} wt(r).
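This max-over-runs semantics can be sketched bottom-up, keeping for each state the best weight of any run on the subtree; the automaton below (states, transitions, weights) is an invented toy, and the root is assumed to be restricted to a final state:

```python
from itertools import product

# Trees are tuples (symbol, child, ..., child).
Q = {"qa", "qs"}
F = {"qs"}
# delta maps (symbol, (q1, ..., qk), q) -> weight; missing entries count as 0.
delta = {
    ("a", (), "qa"): 1.0,
    ("s", ("qa", "qa"), "qs"): 0.5,
    ("s", ("qa", "qs"), "qs"): 0.8,
}

def best(tree):
    """For each state q, the best weight of a run on `tree` with q at the root."""
    sym, children = tree[0], tree[1:]
    sub = [best(c) for c in children]
    table = {}
    for qs in product(Q, repeat=len(children)):
        w_children = 1.0
        for b, q in zip(sub, qs):
            w_children *= b.get(q, 0.0)
        for q in Q:
            w = delta.get((sym, qs, q), 0.0) * w_children
            if w > table.get(q, 0.0):
                table[q] = w
    return table

def L_A(tree):
    # L_A(xi) = best weight over runs ending in a final state at the root.
    t = best(tree)
    return max((t.get(q, 0.0) for q in F), default=0.0)

xi = ("s", ("a",), ("s", ("a",), ("a",)))
print(L_A(xi))  # 0.8 * 0.5 = 0.4
```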

Second assumption: the translation from SL to TL is specified by a weighted tree transducer [Yamada, Knight 01] (their setting: translation from English to Japanese).

Weighted tree transducer (wtt): M = (Q, Σ, q_0, R) where
- Q and Σ are as for a wta; Σ provides both input and output symbols,
- q_0 ∈ Q (initial state),
- R is a finite set of particular term rewrite rules with weights (linear and nondeleting in x_1, …, x_k).

A (leftmost) derivation is a sequence d = ρ_1 ⋯ ρ_n of rule applications; the weight of a derivation d is wt(d) = ∏_{i=1}^n wt(ρ_i).

Weighted tree transformation computed by M:
τ_M : T_Σ × T_Σ → ℝ, τ_M(ξ_1, ξ_2) = max_{d ∈ D_M : π(d) = (ξ_1, ξ_2)} wt(d),
where D_M is the set of all derivations.
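The max-over-derivations semantics can be sketched with a very small transducer; the rules and weights below are invented, and the rule encoding is a simplification of the slide's term rewrite rules (one state, rules keyed by input symbol):

```python
from itertools import product

# Trees are tuples (symbol, child, ..., child).
# Rule format: (state, input_symbol, weight, builder), where the builder
# assembles the output tree from the recursively translated subtrees.
rules = [
    ("q", "S", 0.9, lambda xs: ("S", xs[0], xs[1])),  # keep child order
    ("q", "S", 0.1, lambda xs: ("S", xs[1], xs[0])),  # swap child order
    ("q", "a", 1.0, lambda xs: ("b",)),               # relabel leaf a -> b
]

def translations(state, tree):
    """All (output_tree, weight) pairs derivable from state(tree)."""
    sym, children = tree[0], tree[1:]
    results = []
    for q, s, w, build in rules:
        if q == state and s == sym:
            # Translate each subtree independently (linear, nondeleting).
            subs = [translations(q, c) for c in children]
            for combo in product(*subs):
                weight = w
                for _, cw in combo:
                    weight *= cw
                results.append((build([t for t, _ in combo]), weight))
    return results

def tau(tree, out):
    # tau_M(xi1, xi2): maximum weight over all derivations producing out.
    return max((w for t, w in translations("q", tree) if t == out), default=0.0)

xi = ("S", ("a",), ("a",))
print(tau(xi, ("S", ("b",), ("b",))))  # two derivations (0.9 and 0.1); max is 0.9
```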

Log-linear modeling with wtt and wta:
set Y of correspondence structures [Liang et al. 06]:
- wtt M as translation model; derivations d ∈ D_M
- wta A as language model; runs r ∈ R_A

Y = { (d, r) ∈ D_M × R_A | r ∈ R_A(last(d)) }
π_SL(d, r) = yield(first(d)), π_TL(d, r) = yield(last(d))

Here: m = 2 and Φ(d, r) = ( log wt_M(d), log wt_A(r) ).

H = { h_{λ,M,A} | λ ∈ ℝ_{≥0}^2, M a wtt, A a wta }
h_{λ,M,A} : SL → TL, s ↦ π_TL( argmax_{(d,r) ∈ Y : π_SL(d,r) = s} wt_M(d)^{λ_1} · wt_A(r)^{λ_2} )

Using automata-theoretic results to improve modeling, in two steps:
- Weight exponentiation
- Output product

h_{λ,M,A} : SL → TL
s ↦ π_TL( argmax_{(d,r) ∈ Y : π_SL(d,r) = s} wt_M(d)^{λ_1} · wt_A(r)^{λ_2} )
= π_TL( argmax_{(d,r) ∈ Y : π_SL(d,r) = s} wt_M(d) · wt_A(r)^{λ_2/λ_1} )

Lemma: Let A be a wta and λ ∈ ℝ_{≥0}. There is a wta A′ such that Q_{A′} = Q_A and wt_{A′}(r) = wt_A(r)^λ for every r.

Applying the lemma with λ = λ_2/λ_1:
= π_TL( argmax_{(d,r) ∈ Y : π_SL(d,r) = s} wt_M(d) · wt_{A′}(r) )
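The construction behind the lemma is simply to raise every transition weight to the power λ: since a run's weight is a product of transition weights, every run weight is raised to the power λ as well. A minimal sketch with an invented transition table:

```python
# Toy transition table of a wta: (symbol, (q1, ..., qk), q) -> weight.
delta = {
    ("a", (), "qa"): 0.5,
    ("s", ("qa", "qa"), "qs"): 0.8,
}

def exponentiate(delta, lam):
    # The wta A' of the lemma: same states, each weight raised to lam.
    return {key: w ** lam for key, w in delta.items()}

def run_weight(delta, transitions):
    # A run is identified here by the list of transitions it uses.
    w = 1.0
    for key in transitions:
        w *= delta[key]
    return w

run = [("a", (), "qa"), ("a", (), "qa"), ("s", ("qa", "qa"), "qs")]
lam = 2.0
d2 = exponentiate(delta, lam)
print(run_weight(d2, run), run_weight(delta, run) ** lam)  # agree up to rounding
```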


Let τ : T_Σ × T_Σ → ℝ and L : T_Σ → ℝ.

Output product of τ and L: τ ⊳ L : T_Σ × T_Σ → ℝ, (τ ⊳ L)(ξ, ζ) = τ(ξ, ζ) · L(ζ)
Input product of L and τ: L ⊲ τ : T_Σ × T_Σ → ℝ, (L ⊲ τ)(ξ, ζ) = L(ξ) · τ(ξ, ζ)
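With τ and L represented as finite dictionaries over a toy set of trees (invented values), the two products are pointwise multiplications:

```python
# tau maps (input_tree, output_tree) -> weight; L maps tree -> weight.
tau = {(("a",), ("b",)): 0.5, (("a",), ("c",)): 0.4}
L = {("b",): 0.2, ("c",): 0.9}

def output_product(tau, L):
    # (tau |> L)(xi, zeta) = tau(xi, zeta) * L(zeta)
    return {(xi, zeta): w * L.get(zeta, 0.0) for (xi, zeta), w in tau.items()}

def input_product(L, tau):
    # (L <| tau)(xi, zeta) = L(xi) * tau(xi, zeta)
    return {(xi, zeta): L.get(xi, 0.0) * w for (xi, zeta), w in tau.items()}

print(output_product(tau, L)[(("a",), ("c",))])  # ≈ 0.36, i.e. 0.4 * 0.9
```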

Theorem [Maletti 06]: Let M be a wtt and A a wta. There is a wtt M ⊳ A such that τ_{M ⊳ A} = τ_M ⊳ L_A.

Proof idea [Baker 79; Engelfriet, Fülöp, V. 02]: from each rule of M and states p, p_1, p_2 of A, construct a rule of M ⊳ A (the rule diagrams from the slides are not reproduced here).

(Construction illustrated on three slides; figures omitted.)

ϕ : D_{M ⊳ A} → { (d, p̃) | d ∈ D_M, p̃ ∈ R_A^partial(d) } is a bijection, and
wt(d′) = wt(d) · max_{r ∈ completion(p̃)} wt(r) for d′ = ϕ^{-1}(d, p̃).

Recall: h_{λ,M,A} : SL → TL,
s ↦ π_TL( argmax_{(d,r) ∈ Y : π_SL(d,r) = s} wt_M(d) · wt_A(r) )
= π_TL( argmax_{d ∈ D_M : π_SL(d) = s} wt_M(d) · max_{r ∈ R_A(last(d))} wt_A(r) )
= π_TL( argmax_{d ∈ D_M : π_SL(d) = s} wt_M(d) · max_{p̃ ∈ R_A^partial(d)} max_{r ∈ compl.(p̃)} wt_A(r) )
= π_TL( argmax_{d ∈ D_M, p̃ ∈ R_A^partial(d) : π_SL(d) = s} wt_M(d) · max_{r ∈ compl.(p̃)} wt_A(r) )
= π_TL( argmax_{d ∈ D_{M ⊳ A} : π_SL(d) = s} wt_{M ⊳ A}(d) )

Recall the theorem [Maletti 06]: Let M be a wtt and A a wta. There is a wtt M ⊳ A such that τ_{M ⊳ A} = τ_M ⊳ L_A.

Generalizations to mildly context-sensitive languages:

Theorem [Büchse, Nederhof, V. 11]: Let M be a synchronous tree-adjoining grammar (STAG) and A a wta. There is an STAG M ⊳ A such that τ_{M ⊳ A} = τ_M ⊳ L_A.

Theorem [Nederhof, V. 12]: Let M be a synchronous context-free tree grammar (SCFTG) and A a wta. There is an SCFTG M ⊳ A such that τ_{M ⊳ A} = τ_M ⊳ L_A.


Model for translation from SL to TL: wtt M. Model for TL: wta A. Sentence s of SL.

h_{λ,M,A}(s) = π_TL( argmax_{(d,r) ∈ Y : π_SL(d,r) = s} wt_M(d)^{λ_1} · wt_A(r)^{λ_2} )
(weight exponentiation) = π_TL( argmax_{(d,r) ∈ Y : π_SL(d,r) = s} wt_M(d) · wt_A(r) )
(output product) = π_TL( argmax_{d ∈ D_{M ⊳ A} : π_SL(d) = s} wt_{M ⊳ A}(d) )
(Bar-Hillel, Shamir, Perles; input product) = π_TL( argmax_{d ∈ D_{A_s ⊲ (M ⊳ A)}} wt_{A_s ⊲ (M ⊳ A)}(d) )
= π_TL( Knuth(A_s ⊲ (M ⊳ A)) )
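The last step solves the argmax with Knuth's generalization of Dijkstra's algorithm [Knuth 77]. A minimal sketch on an invented weighted grammar (viewed as a hypergraph), assuming max-product weights in [0, 1] so that the weight function is monotone, as Knuth's algorithm requires:

```python
import heapq

# Each rule: (head, [tail nonterminals], weight). All values are invented.
rules = [
    ("A", [], 0.9),
    ("B", ["A"], 0.5),
    ("S", ["A", "B"], 0.8),
    ("S", ["B"], 0.3),
]

def knuth_best(rules):
    """Best derivation weight per nonterminal, Dijkstra-style."""
    best = {}
    # Max-heap via negated weights; seed with nullary rules.
    agenda = [(-w, head) for head, tail, w in rules if not tail]
    heapq.heapify(agenda)
    while agenda:
        negw, x = heapq.heappop(agenda)
        if x in best:
            continue                      # already settled with a better weight
        best[x] = -negw
        for head, tail, w in rules:
            if head not in best and all(t in best for t in tail):
                total = w
                for t in tail:
                    total *= best[t]
                heapq.heappush(agenda, (-total, head))
    return best

print(knuth_best(rules)["S"])  # best derivation: 0.8 * 0.9 * (0.5 * 0.9)
```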

Impression: SMT is playing with formulas. But: SMT is an engineering task!

Weighted tree automata and weighted tree transducers can help in modeling statistical machine translation of natural languages.

References:
[Baker 79] Composition of top-down and bottom-up tree transductions
[Bar-Hillel, Shamir, Perles 61] On formal properties of simple phrase structure grammars
[Büchse, Nederhof, V. 11] Tree Parsing with Synchronous Tree-Adjoining Grammars
[Engelfriet, Fülöp, V. 02] Bottom-up and Top-down Tree Series Transformations
[Fülöp, V. 09] Weighted tree automata and tree transducers
[Knight et al.] (see below)
[Knuth 77] A generalization of Dijkstra's algorithm
[Lopez 08] Statistical Machine Translation
[Liang, Bouchard-Côté, Klein, Taskar 06] An End-to-End Discriminative Approach to Machine Translation
[Maletti 06] Compositions of Tree Series Transformations
[Maletti, Satta 09] Parsing Algorithms based on Tree Automata
[Nederhof, V. 12] Synchronous Context-Free Tree Grammars
[Thatcher 67] Characterizing derivation trees of context-free grammars through a generalization of finite automata theory
[Yamada, Knight 01] A syntax-based statistical translation model

[Knight et al.]:
[Charniak, Knight, Yamada 03] Syntax-based language models for statistical machine translation
[Galley, Hopkins, Knight, Marcu 04] What's in a translation rule?
[Graehl, Knight 04] Training tree transducers
[Graehl, Knight 05] An overview of probabilistic tree transducers for natural language processing
[Maletti 11] Survey: Weighted Extended Top-down Tree Transducers, Part III: Applications in Machine Translation

Thank you!


More information

Computability and Complexity

Computability and Complexity Computability and Complexity Lecture 5 Reductions Undecidable problems from language theory Linear bounded automata given by Jiri Srba Lecture 5 Computability and Complexity 1/14 Reduction Informal Definition

More information

Bringing machine learning & compositional semantics together: central concepts

Bringing machine learning & compositional semantics together: central concepts Bringing machine learning & compositional semantics together: central concepts https://githubcom/cgpotts/annualreview-complearning Chris Potts Stanford Linguistics CS 244U: Natural language understanding

More information

Theory of Languages and Automata

Theory of Languages and Automata Theory of Languages and Automata Chapter 1- Regular Languages & Finite State Automaton Sharif University of Technology Finite State Automaton We begin with the simplest model of Computation, called finite

More information

COM364 Automata Theory Lecture Note 2 - Nondeterminism

COM364 Automata Theory Lecture Note 2 - Nondeterminism COM364 Automata Theory Lecture Note 2 - Nondeterminism Kurtuluş Küllü March 2018 The FA we saw until now were deterministic FA (DFA) in the sense that for each state and input symbol there was exactly

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other

More information

Fuzzy Tree Transducers

Fuzzy Tree Transducers Advances in Fuzzy Mathematics. ISSN 0973-533X Volume 11, Number 2 (2016), pp. 99-112 Research India Publications http://www.ripublication.com Fuzzy Tree Transducers S. R. Chaudhari and M. N. Joshi Department

More information

Spectral Learning for Non-Deterministic Dependency Parsing

Spectral Learning for Non-Deterministic Dependency Parsing Spectral Learning for Non-Deterministic Dependency Parsing Franco M. Luque 1 Ariadna Quattoni 2 Borja Balle 2 Xavier Carreras 2 1 Universidad Nacional de Córdoba 2 Universitat Politècnica de Catalunya

More information

Roger Levy Probabilistic Models in the Study of Language draft, October 2,

Roger Levy Probabilistic Models in the Study of Language draft, October 2, Roger Levy Probabilistic Models in the Study of Language draft, October 2, 2012 224 Chapter 10 Probabilistic Grammars 10.1 Outline HMMs PCFGs ptsgs and ptags Highlight: Zuidema et al., 2008, CogSci; Cohn

More information

Expectation Maximization (EM)

Expectation Maximization (EM) Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted

More information

Tuning as Linear Regression

Tuning as Linear Regression Tuning as Linear Regression Marzieh Bazrafshan, Tagyoung Chung and Daniel Gildea Department of Computer Science University of Rochester Rochester, NY 14627 Abstract We propose a tuning method for statistical

More information

Introduction and Motivation. Introduction and Motivation. Introduction to Computability. Introduction and Motivation. Theory. Lecture5: Context Free

Introduction and Motivation. Introduction and Motivation. Introduction to Computability. Introduction and Motivation. Theory. Lecture5: Context Free ntroduction to Computability Theory Lecture5: Context Free Languages Prof. Amos sraeli 1 ntroduction and Motivation n our study of RL-s we Covered: 1. Motivation and definition of regular languages. 2.

More information

This lecture covers Chapter 5 of HMU: Context-free Grammars

This lecture covers Chapter 5 of HMU: Context-free Grammars This lecture covers Chapter 5 of HMU: Context-free rammars (Context-free) rammars (Leftmost and Rightmost) Derivations Parse Trees An quivalence between Derivations and Parse Trees Ambiguity in rammars

More information

Structured Prediction Models via the Matrix-Tree Theorem

Structured Prediction Models via the Matrix-Tree Theorem Structured Prediction Models via the Matrix-Tree Theorem Terry Koo Amir Globerson Xavier Carreras Michael Collins maestro@csail.mit.edu gamir@csail.mit.edu carreras@csail.mit.edu mcollins@csail.mit.edu

More information

The Geometry of Statistical Machine Translation

The Geometry of Statistical Machine Translation The Geometry of Statistical Machine Translation Presented by Rory Waite 16th of December 2015 ntroduction Linear Models Convex Geometry The Minkowski Sum Projected MERT Conclusions ntroduction We provide

More information

Monadic Second Order Logic and Automata on Infinite Words: Büchi s Theorem

Monadic Second Order Logic and Automata on Infinite Words: Büchi s Theorem Monadic Second Order Logic and Automata on Infinite Words: Büchi s Theorem R. Dustin Wehr December 18, 2007 Büchi s theorem establishes the equivalence of the satisfiability relation for monadic second-order

More information

Exam: Synchronous Grammars

Exam: Synchronous Grammars Exam: ynchronous Grammars Duration: 3 hours Written documents are allowed. The numbers in front of questions are indicative of hardness or duration. ynchronous grammars consist of pairs of grammars whose

More information

CISC4090: Theory of Computation

CISC4090: Theory of Computation CISC4090: Theory of Computation Chapter 2 Context-Free Languages Courtesy of Prof. Arthur G. Werschulz Fordham University Department of Computer and Information Sciences Spring, 2014 Overview In Chapter

More information

Tree transformations and dependencies

Tree transformations and dependencies Tree transformations and deendencies Andreas Maletti Universität Stuttgart Institute for Natural Language Processing Azenbergstraße 12, 70174 Stuttgart, Germany andreasmaletti@imsuni-stuttgartde Abstract

More information

CS243, Logic and Computation Nondeterministic finite automata

CS243, Logic and Computation Nondeterministic finite automata CS243, Prof. Alvarez NONDETERMINISTIC FINITE AUTOMATA (NFA) Prof. Sergio A. Alvarez http://www.cs.bc.edu/ alvarez/ Maloney Hall, room 569 alvarez@cs.bc.edu Computer Science Department voice: (67) 552-4333

More information

Languages. Languages. An Example Grammar. Grammars. Suppose we have an alphabet V. Then we can write:

Languages. Languages. An Example Grammar. Grammars. Suppose we have an alphabet V. Then we can write: Languages A language is a set (usually infinite) of strings, also known as sentences Each string consists of a sequence of symbols taken from some alphabet An alphabet, V, is a finite set of symbols, e.g.

More information

Chapter 14 (Partially) Unsupervised Parsing

Chapter 14 (Partially) Unsupervised Parsing Chapter 14 (Partially) Unsupervised Parsing The linguistically-motivated tree transformations we discussed previously are very effective, but when we move to a new language, we may have to come up with

More information

Weighted Tree-Walking Automata

Weighted Tree-Walking Automata Acta Cybernetica 19 (2009) 275 293. Weighted Tree-Walking Automata Zoltán Fülöp and Loránd Muzamel Abstract We define weighted tree-walking automata. We show that the class of tree series recognizable

More information

Context-free Grammars and Languages

Context-free Grammars and Languages Context-free Grammars and Languages COMP 455 002, Spring 2019 Jim Anderson (modified by Nathan Otterness) 1 Context-free Grammars Context-free grammars provide another way to specify languages. Example:

More information

Natural Language Processing : Probabilistic Context Free Grammars. Updated 5/09

Natural Language Processing : Probabilistic Context Free Grammars. Updated 5/09 Natural Language Processing : Probabilistic Context Free Grammars Updated 5/09 Motivation N-gram models and HMM Tagging only allowed us to process sentences linearly. However, even simple sentences require

More information

Statistical Machine Translation

Statistical Machine Translation Statistical Machine Translation -tree-based models (cont.)- Artem Sokolov Computerlinguistik Universität Heidelberg Sommersemester 2015 material from P. Koehn, S. Riezler, D. Altshuler Bottom-Up Decoding

More information

Tree Transducers in Machine Translation

Tree Transducers in Machine Translation Tree Transducers in Machine Translation Andreas Maletti Universitat Rovira i Virgili Tarragona, pain andreas.maletti@urv.cat Jena August 23, 2010 Tree Transducers in MT A. Maletti 1 Machine translation

More information

Quasi-Synchronous Phrase Dependency Grammars for Machine Translation. lti

Quasi-Synchronous Phrase Dependency Grammars for Machine Translation. lti Quasi-Synchronous Phrase Dependency Grammars for Machine Translation Kevin Gimpel Noah A. Smith 1 Introduction MT using dependency grammars on phrases Phrases capture local reordering and idiomatic translations

More information

Clarifications from last time. This Lecture. Last Lecture. CMSC 330: Organization of Programming Languages. Finite Automata.

Clarifications from last time. This Lecture. Last Lecture. CMSC 330: Organization of Programming Languages. Finite Automata. CMSC 330: Organization of Programming Languages Last Lecture Languages Sets of strings Operations on languages Finite Automata Regular expressions Constants Operators Precedence CMSC 330 2 Clarifications

More information

Chapter 1. Formal Definition and View. Lecture Formal Pushdown Automata on the 28th April 2009

Chapter 1. Formal Definition and View. Lecture Formal Pushdown Automata on the 28th April 2009 Chapter 1 Formal and View Lecture on the 28th April 2009 Formal of PA Faculty of Information Technology Brno University of Technology 1.1 Aim of the Lecture 1 Define pushdown automaton in a formal way

More information

Probabilistic Parsing Strategies

Probabilistic Parsing Strategies Probabilistic Parsing Strategies Mark-Jan Nederhof Faculty of Arts University of Groningen P.O. Box 716 NL-9700 AS Groningen The Netherlands markjan@let.rug.nl Giorgio Satta Dept. of Information Engineering

More information

Mildly Context-Sensitive Grammar Formalisms: Embedded Push-Down Automata

Mildly Context-Sensitive Grammar Formalisms: Embedded Push-Down Automata Mildly Context-Sensitive Grammar Formalisms: Embedded Push-Down Automata Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Sommersemester 2011 Intuition (1) For a language L, there is a TAG G with

More information

Myhill-Nerode Theorem for Recognizable Tree Series Revisited?

Myhill-Nerode Theorem for Recognizable Tree Series Revisited? Myhill-Nerode Theorem for Recognizable Tree Series Revisited? Andreas Maletti?? International Comper Science Instite 1947 Center Street, Suite 600, Berkeley, CA 94704, USA Abstract. In this contribion

More information

A Polynomial Time Algorithm for Parsing with the Bounded Order Lambek Calculus

A Polynomial Time Algorithm for Parsing with the Bounded Order Lambek Calculus A Polynomial Time Algorithm for Parsing with the Bounded Order Lambek Calculus Timothy A. D. Fowler Department of Computer Science University of Toronto 10 King s College Rd., Toronto, ON, M5S 3G4, Canada

More information

Shift-Reduce Word Reordering for Machine Translation

Shift-Reduce Word Reordering for Machine Translation Shift-Reduce Word Reordering for Machine Translation Katsuhiko Hayashi, Katsuhito Sudoh, Hajime Tsukada, Jun Suzuki, Masaaki Nagata NTT Communication Science Laboratories, NTT Corporation 2-4 Hikaridai,

More information

Introduction to Computers & Programming

Introduction to Computers & Programming 16.070 Introduction to Computers & Programming Theory of computation: What is a computer? FSM, Automata Prof. Kristina Lundqvist Dept. of Aero/Astro, MIT Models of Computation What is a computer? If you

More information

Nondeterministic Finite Automata

Nondeterministic Finite Automata Nondeterministic Finite Automata Mahesh Viswanathan Introducing Nondeterminism Consider the machine shown in Figure. Like a DFA it has finitely many states and transitions labeled by symbols from an input

More information

C2.1 Regular Grammars

C2.1 Regular Grammars Theory of Computer Science March 22, 27 C2. Regular Languages: Finite Automata Theory of Computer Science C2. Regular Languages: Finite Automata Malte Helmert University of Basel March 22, 27 C2. Regular

More information

Automata Theory - Quiz II (Solutions)

Automata Theory - Quiz II (Solutions) Automata Theory - Quiz II (Solutions) K. Subramani LCSEE, West Virginia University, Morgantown, WV {ksmani@csee.wvu.edu} 1 Problems 1. Induction: Let L denote the language of balanced strings over Σ =

More information

SCHEME FOR INTERNAL ASSESSMENT TEST 3

SCHEME FOR INTERNAL ASSESSMENT TEST 3 SCHEME FOR INTERNAL ASSESSMENT TEST 3 Max Marks: 40 Subject& Code: Automata Theory & Computability (15CS54) Sem: V ISE (A & B) Note: Answer any FIVE full questions, choosing one full question from each

More information

Introduction to Automata

Introduction to Automata Introduction to Automata Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 /

More information

CS:4330 Theory of Computation Spring Regular Languages. Finite Automata and Regular Expressions. Haniel Barbosa

CS:4330 Theory of Computation Spring Regular Languages. Finite Automata and Regular Expressions. Haniel Barbosa CS:4330 Theory of Computation Spring 2018 Regular Languages Finite Automata and Regular Expressions Haniel Barbosa Readings for this lecture Chapter 1 of [Sipser 1996], 3rd edition. Sections 1.1 and 1.3.

More information

CSE 105 THEORY OF COMPUTATION

CSE 105 THEORY OF COMPUTATION CSE 105 THEORY OF COMPUTATION Fall 2016 http://cseweb.ucsd.edu/classes/fa16/cse105-abc/ Today's learning goals Sipser Ch 3 Trace the computation of a Turing machine using its transition function and configurations.

More information

DM17. Beregnelighed. Jacob Aae Mikkelsen

DM17. Beregnelighed. Jacob Aae Mikkelsen DM17 Beregnelighed Jacob Aae Mikkelsen January 12, 2007 CONTENTS Contents 1 Introduction 2 1.1 Operations with languages...................... 2 2 Finite Automata 3 2.1 Regular expressions/languages....................

More information

Grammar formalisms Tree Adjoining Grammar: Formal Properties, Parsing. Part I. Formal Properties of TAG. Outline: Formal Properties of TAG

Grammar formalisms Tree Adjoining Grammar: Formal Properties, Parsing. Part I. Formal Properties of TAG. Outline: Formal Properties of TAG Grammar formalisms Tree Adjoining Grammar: Formal Properties, Parsing Laura Kallmeyer, Timm Lichte, Wolfgang Maier Universität Tübingen Part I Formal Properties of TAG 16.05.2007 und 21.05.2007 TAG Parsing

More information

Real-Time Systems. Lecture 15: The Universality Problem for TBA Dr. Bernd Westphal. Albert-Ludwigs-Universität Freiburg, Germany

Real-Time Systems. Lecture 15: The Universality Problem for TBA Dr. Bernd Westphal. Albert-Ludwigs-Universität Freiburg, Germany Real-Time Systems Lecture 15: The Universality Problem for TBA 2013-06-26 15 2013-06-26 main Dr. Bernd Westphal Albert-Ludwigs-Universität Freiburg, Germany Contents & Goals Last Lecture: Extended Timed

More information

C2.1 Regular Grammars

C2.1 Regular Grammars Theory of Computer Science March 6, 26 C2. Regular Languages: Finite Automata Theory of Computer Science C2. Regular Languages: Finite Automata Malte Helmert University of Basel March 6, 26 C2. Regular

More information

IBM Model 1 for Machine Translation

IBM Model 1 for Machine Translation IBM Model 1 for Machine Translation Micha Elsner March 28, 2014 2 Machine translation A key area of computational linguistics Bar-Hillel points out that human-like translation requires understanding of

More information

FORMAL LANGUAGES, AUTOMATA AND COMPUTATION

FORMAL LANGUAGES, AUTOMATA AND COMPUTATION FORMAL LANGUAGES, AUTOMATA AND COMPUTATION DECIDABILITY ( LECTURE 15) SLIDES FOR 15-453 SPRING 2011 1 / 34 TURING MACHINES-SYNOPSIS The most general model of computation Computations of a TM are described

More information

MANFRED DROSTE AND WERNER KUICH

MANFRED DROSTE AND WERNER KUICH Logical Methods in Computer Science Vol. 14(1:21)2018, pp. 1 14 https://lmcs.episciences.org/ Submitted Jan. 31, 2017 Published Mar. 06, 2018 WEIGHTED ω-restricted ONE-COUNTER AUTOMATA Universität Leipzig,

More information

Computational Models - Lecture 1 1

Computational Models - Lecture 1 1 Computational Models - Lecture 1 1 Handout Mode Ronitt Rubinfeld and Iftach Haitner. Tel Aviv University. February 29/ March 02, 2016 1 Based on frames by Benny Chor, Tel Aviv University, modifying frames

More information

CS21 Decidability and Tractability

CS21 Decidability and Tractability CS21 Decidability and Tractability Lecture 8 January 24, 2018 Outline Turing Machines and variants multitape TMs nondeterministic TMs Church-Turing Thesis So far several models of computation finite automata

More information

Lecture Notes on Inductive Definitions

Lecture Notes on Inductive Definitions Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 September 2, 2004 These supplementary notes review the notion of an inductive definition and

More information

Theory of Computation

Theory of Computation Theory of Computation (Feodor F. Dragan) Department of Computer Science Kent State University Spring, 2018 Theory of Computation, Feodor F. Dragan, Kent State University 1 Before we go into details, what

More information

Decidable and undecidable languages

Decidable and undecidable languages The Chinese University of Hong Kong Fall 2011 CSCI 3130: Formal languages and automata theory Decidable and undecidable languages Andrej Bogdanov http://www.cse.cuhk.edu.hk/~andrejb/csc3130 Problems about

More information