Statistical Machine Translation of Natural Languages


1 1/37 Statistical Machine Translation of Natural Languages: Rule Extraction and Training Probabilities
Matthias Büchse, Toni Dietze, Johannes Osterholzer, Torsten Stüber, Heiko Vogler
Technische Universität Dresden, Germany
Graduiertenkolleg Quantitative Logics and Automata, Gohrisch, March 2013

2–3 2–3/37 outline of the talk:
Statistical machine translation (recall...)
Modeling with wta and wtt (recall...)
Training
  Rule extraction
  Training probabilities

4–6 4/37 given: source language SL, target language TL
find: machine translation h : SL → TL
e.g. SL = English, TL = German
s = I saw the man with the telescope
h(s) = Ich sah den Mann durch das Tel.
machine translation → statistical machine translation (SMT)

7–9 5/37 assumptions → modeling → hypothesis space H
hypothesis space: H ⊆ {h | h : SL → TL}
H and training data → training → ĥ ∈ H
[Lopez 08]: "By examining many samples of human-produced translations, SMT algorithms automatically learn how to translate."
ĥ and test data → evaluation → score

10 6/37 outline of the talk:
Statistical machine translation (recall...)
Modeling with wta and wtt (recall...)
Training
  Rule extraction
  Training probabilities

11 7/37 weighted tree transducer (wtt) M = (Q, Σ, Δ, q₀, R)
Q finite set (states)
Σ, Δ finite sets (input / output symbols)
q₀ ∈ Q (initial state)
R finite set of particular term rewrite rules with weights, linear and nondeleting in x₁, ..., x_k

12 8/37 (figure only; no text on this slide)

13 9/37 extended left-hand sides, one-symbol look-ahead (figure)

14–15 10/37 recall:
assumptions → modeling → hypothesis space H
H = {h_{λ,M,A} | λ ∈ ℝ_{≥0}², wtt M, wta A}
h_{λ,M,A}(s) = π_TL( argmax_{(d,r) ∈ Y : π_SL(d,r) = s} wt_M(d)^{λ₁} · wt_A(r)^{λ₂} )
H and training data → training → ĥ ∈ H
ĥ and test data → evaluation → score

16 11/37 outline of the talk:
Statistical machine translation (recall...)
Modeling with wta and wtt (recall...)
Training
  Rule extraction
  Training probabilities

17–18 12/37 assumptions → modeling → hypothesis space H
H = {h_{λ,M,A} | λ ∈ ℝ_{≥0}², wtt M, wta A}
h_{λ,M,A}(s) = π_TL( argmax_{(d,r) ∈ Y : π_SL(d,r) = s} wt_M(d)^{λ₁} · wt_A(r)^{λ₂} )
H and training data → training → ĥ ∈ H
training data (corpora):
  Hansards of the Parliament of Canada, English–French, 2 × 1,278,000 sentences
  EUROPARL: Danish, German, English, French, ..., 1,500,000 sentence pairs / language pair, 50,000,000 words / language
  TIGER (v2.1): German, 50,000 sentences

19–22 12/37 assumptions → modeling → hypothesis space H
H = {h_{λ,M,A} | λ ∈ ℝ_{≥0}², wtt M, wta A}
h_{λ,M,A}(s) = π_TL( argmax_{(d,r) ∈ Y : π_SL(d,r) = s} wt_M(d)^{λ₁} · wt_A(r)^{λ₂} )
H and training data → training → ĥ ∈ H
restricted hypothesis space:
H ⊇ H′ = {h_M | wtt M}
       = {h_{(N,p)} | tt N, probability assignment p}
       = {h_{(N,p)} | probability assignment p}   (for a fixed tt N)
rule extraction: tt N
training probabilities: probability assignment p : R → [0, 1]

23–27 13/37 Rule extraction: [Galley, Hopkins, Knight, Marcu 04]
parallel corpus C = {(e, f), ...}
→ Berkeley parser → parsed corpus C_ξ = {(ξ_e, f), ...}
→ word alignment → aligned sentence pairs {(e, alignment, f), ...}
→ glue → alignment graphs G = {(ξ_e, alignment, f), ...}
→ rule extraction → yxtt M, tt N
example sentence pair (shown as a figure):
e : Garcia has a company also.
f : Garcia tambien tiene una empresa.

28–29 14–15/37 alignment graph (figure):
e : Garcia has a company also.
f : Garcia tambien tiene una empresa.
extracted rule for the tt N (tree-to-string):
q( VP(x₁:VBZ, NP(x₂:NP, x₃:ADVP)) ) → q(x₃) q(x₁) q(x₂)
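For concreteness, here is a small, hypothetical sketch (not the authors' implementation) of how such an extracted tree-to-string rule could be represented as a data structure; all class and field names are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical encoding of an extracted tree-to-string rule.
# Left-hand side: a state applied to a tree pattern with typed variables;
# right-hand side: the reordered variables (and, in general, terminal strings).

@dataclass(frozen=True)
class Var:
    index: int      # x1, x2, ...
    state: str      # state required for the subtree, e.g. "q"
    label: str      # nonterminal the variable must match, e.g. "VBZ"

@dataclass(frozen=True)
class Pattern:
    label: str      # root label of the pattern, e.g. "VP"
    children: tuple # sub-patterns or Vars

@dataclass(frozen=True)
class Rule:
    state: str      # state on the left-hand side
    lhs: Pattern    # input tree pattern
    rhs: tuple      # output side: variables and/or terminal strings

# q( VP(x1:VBZ, NP(x2:NP, x3:ADVP)) ) -> q(x3) q(x1) q(x2)
rule = Rule(
    state="q",
    lhs=Pattern("VP", (Var(1, "q", "VBZ"),
                       Pattern("NP", (Var(2, "q", "NP"), Var(3, "q", "ADVP"))))),
    rhs=(Var(3, "q", "ADVP"), Var(1, "q", "VBZ"), Var(2, "q", "NP")),
)
print(rule)
```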

30–31 16/37 outline of the talk:
Statistical machine translation (recall...)
Modeling with wta and wtt (recall...)
Training
  Rule extraction
  Training probabilities
    random experiments
    corpora and maximum-likelihood estimation
    Expectation-Maximization algorithm
    instantiation to training probability assignment for tt

32–35 17/37 random experiment (Y, p):
finite set Y (events), distribution p over Y : p : Y → [0, 1], Σ_y p(y) = 1
set of all distributions over Y : B(Y)
probability model over Y : B ⊆ B(Y)
unrestricted probability model: B(Y)
Let (Y₁, p₁), (Y₂, p₂) be two random experiments.
independent product: (Y₁ × Y₂, p₁ p₂) with (p₁ p₂)(y₁, y₂) = p₁(y₁) · p₂(y₂)
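These definitions translate directly into a few lines of Python; the following sketch (my own, not from the talk) represents a distribution over a finite event set as a dict and builds the independent product of two random experiments.

```python
# A distribution over a finite event set Y as a dict, plus the
# independent product of two random experiments.

def is_distribution(p, tol=1e-9):
    """p : Y -> [0, 1] with values summing to 1."""
    return all(0.0 <= v <= 1.0 for v in p.values()) and abs(sum(p.values()) - 1.0) < tol

def product(p1, p2):
    """Independent product: (p1 p2)(y1, y2) = p1(y1) * p2(y2)."""
    return {(y1, y2): p1[y1] * p2[y2] for y1 in p1 for y2 in p2}

coin1 = {"h": 0.4, "t": 0.6}
coin2 = {"h": 0.5, "t": 0.5}
two_coins = product(coin1, coin2)
assert is_distribution(coin1) and is_distribution(two_coins)
print(two_coins)  # {('h', 'h'): 0.2, ('h', 't'): 0.2, ('t', 'h'): 0.3, ('t', 't'): 0.3}
```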

36–39 18/37 Examples:
tossing a coin: Y = {h, t}, p(h) = 0.4, p(t) = 0.6
tossing two coins: Y = {h, t} × {h, t}
probability model B = {p₁ p₂ ∈ B(Y) | p₁, p₂ ∈ B({h, t})}
consider the distribution p ∈ B({h, t} × {h, t}) with p(h, h) = p(t, t) = 0, p(h, t) = p(t, h) = 0.5
observe: p ∉ B
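A quick numerical check of this observation (my own, not from the slides): if the anti-correlated p were a product p₁ p₂, then p₁ and p₂ would have to be its marginals, but the product of those marginals differs from p.

```python
# The "perfectly anti-correlated" distribution is not a product of two
# single-coin distributions.

p = {("h", "h"): 0.0, ("h", "t"): 0.5, ("t", "h"): 0.5, ("t", "t"): 0.0}

p1 = {y1: sum(v for (a, b), v in p.items() if a == y1) for y1 in ("h", "t")}  # marginal of coin 1
p2 = {y2: sum(v for (a, b), v in p.items() if b == y2) for y2 in ("h", "t")}  # marginal of coin 2
marginal_product = {(y1, y2): p1[y1] * p2[y2] for y1 in p1 for y2 in p2}

print(p1, p2)            # both uniform: {'h': 0.5, 't': 0.5}
print(marginal_product)  # every outcome gets 0.25 -- not equal to p, so p is not in B
```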

40 19/37 outline of the talk:
Statistical machine translation (recall...)
Modeling with wta and wtt (recall...)
Training
  Rule extraction
  Training probabilities
    random experiments
    corpora and maximum-likelihood estimation
    Expectation-Maximization algorithm
    instantiation to training probability assignment for tt

41–46 20/37 corpus: c : Y → ℝ_{≥0}
frequency of y: c(y); supp(c) finite!
size of c: |c| = Σ_y c(y)
Let p ∈ B(Y) and c : Y → ℝ_{≥0} be a corpus.
likelihood of c under p: L(c, p) = Π_y p(y)^{c(y)}
B ⊆ B(Y) and corpus c → training → p̂ ∈ B
maximum-likelihood estimation of c in B: p̂ = argmax_{p ∈ B} L(c, p) = mle(c, B)
note: mle(c, B) ∈ B
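As a small aside (not from the slides), the likelihood of a corpus can be computed in log space to avoid underflow; the coin corpus below is the one used on the next slide.

```python
import math

# Likelihood of a corpus under a distribution, as defined above.

def log_likelihood(c, p):
    """log L(c, p) = sum_y c(y) * log p(y); -inf if some observed y has p(y) = 0."""
    total = 0.0
    for y, freq in c.items():
        if freq > 0.0:
            if p.get(y, 0.0) == 0.0:
                return float("-inf")
            total += freq * math.log(p[y])
    return total

c = {"h": 8, "t": 12}
print(math.exp(log_likelihood(c, {"h": 0.4, "t": 0.6})))  # likelihood of the coin corpus
print(math.exp(log_likelihood(c, {"h": 0.5, "t": 0.5})))  # smaller: 0.4 / 0.6 fits better
```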

47–51 21/37 Theorem. Let c : Y → ℝ_{≥0} be a corpus. Then
mle(c, B(Y)) = rfe(c),
and for every B ⊆ B(Y) with rfe(c) ∉ B: mle(c, B) ≠ rfe(c).
relative frequency estimation of c (empirical distribution): rfe(c) : Y → [0, 1], rfe(c)(y) = c(y) / |c|
Example: tossing one coin
        h    t
c:      8    12
rfe(c): 0.4  0.6
rfe(c) ∈ B(Y)
but...
Example: tossing two coins
        (h, h)  (h, t)  (t, h)  (t, t)
c:      0       5       5       0
rfe(c): 0       1/2     1/2     0
rfe(c) ∈ B(Y), but rfe(c) ∉ {p₁ p₂ | p₁, p₂ ∈ B({h, t})}
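Relative frequency estimation is a one-liner, and the first part of the theorem can be sanity-checked numerically; the grid search below (my own illustration, not a proof) confirms that, over the unrestricted model, the likelihood of the one-coin corpus is maximised at rfe(c).

```python
import math

# rfe and a brute-force check (on a grid) that it maximises the likelihood
# over the unrestricted model B(Y).

def rfe(c):
    size = sum(c.values())
    return {y: freq / size for y, freq in c.items()}

def likelihood(c, p):
    return math.prod(p[y] ** c[y] for y in c)

c = {"h": 8, "t": 12}
print(rfe(c))  # {'h': 0.4, 't': 0.6}

best = max((likelihood(c, {"h": q, "t": 1 - q}), q) for q in
           (i / 1000 for i in range(1001)))
print(best)    # maximum at q = 0.4, i.e. at rfe(c)
```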

52–57 22/37 Observation. Let c : Y → ℝ_{≥0} be a corpus, Y = Y₁ × Y₂, and B = {p₁ p₂ | p₁ ∈ B(Y₁), p₂ ∈ B(Y₂)}.
Then: mle(c, B) = rfe(c₁) rfe(c₂), where c₁(y₁) = Σ_{y₂} c(y₁, y₂) and c₂(y₂) = Σ_{y₁} c(y₁, y₂).
Example: tossing two coins (Y₁ = Y₂ = {h, t})
corpus:     (h, h)  (h, t)  (t, h)  (t, t)
c:          0       5       5       0
for i ∈ {1, 2}:
            h    t
c_i:        5    5
rfe(c_i):   1/2  1/2
            (h, h)  (h, t)  (t, h)  (t, t)
mle(c, B):  1/4     1/4     1/4     1/4
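The observation gives a closed-form M-step for product models; the sketch below (my own, not the authors' code) recomputes the example table from the marginal corpora.

```python
# Maximum-likelihood estimation in the product model via the marginal
# corpora c1 and c2, as stated in the observation above.

def rfe(c):
    size = sum(c.values())
    return {y: freq / size for y, freq in c.items()}

c = {("h", "h"): 0, ("h", "t"): 5, ("t", "h"): 5, ("t", "t"): 0}

c1 = {y1: sum(v for (a, _), v in c.items() if a == y1) for y1 in ("h", "t")}
c2 = {y2: sum(v for (_, b), v in c.items() if b == y2) for y2 in ("h", "t")}

mle_in_B = {(y1, y2): rfe(c1)[y1] * rfe(c2)[y2] for y1 in c1 for y2 in c2}
print(c1, c2)    # {'h': 5, 't': 5} {'h': 5, 't': 5}
print(mle_in_B)  # 1/4 for every outcome, as in the table above
```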

58 23/37 outline of the talk:
Statistical machine translation (recall...)
Modeling with wta and wtt (recall...)
Training
  Rule extraction
  Training probabilities
    random experiments
    corpora and maximum-likelihood estimation
    Expectation-Maximization algorithm
    instantiation to training probability assignment for tt

59–61 24/37 Example: tossing two coins and hiding information
player A tosses two coins several times, each time records only the number of heads, and eventually forms the corresponding corpus c : {0, 1, 2} → ℝ_{≥0}.
Example: c(0) = 4, c(1) = 9, c(2) = 2
player B gets such a corpus c : {0, 1, 2} → ℝ_{≥0} of observations and has to estimate the distributions p₁ and p₂ of the two coins.
c : {0, 1, 2} → ℝ_{≥0} is incomplete data

62–67 25/37 set of events: Y
set of observations: X
observation mapping: π : Y → X
Example: tossing two coins and hiding information
set of events: Y = {(h, h), (h, t), (t, h), (t, t)}
set of observations: X = {0, 1, 2}
observation mapping:
y:     (h, h)  (h, t)  (t, h)  (t, t)
π(y):  2       1       1       0
B ⊆ B(Y) and corpus c : X → ℝ_{≥0} → training → p̂ ∈ B
likelihood of c under p ∈ B: L(c, p) = Π_x ( Σ_{y : π(y) = x} p(y) )^{c(x)}
maximum-likelihood estimation: p̂ = argmax_{p ∈ B} L(c, p) = mle(c, B)
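A small sketch (my own) of this incomplete-data likelihood for the two-coin example; the candidate parameters p1h and p2h are arbitrary illustration values.

```python
# Incomplete-data likelihood L(c, p) = prod_x ( sum_{y: pi(y) = x} p(y) )^{c(x)}.

def heads(y):  # observation mapping pi : Y -> X
    return sum(1 for coin in y if coin == "h")

def incomplete_likelihood(c, p, pi):
    result = 1.0
    for x, freq in c.items():
        result *= sum(v for y, v in p.items() if pi(y) == x) ** freq
    return result

c = {0: 4, 1: 9, 2: 2}          # observed corpus over X = {0, 1, 2}
p1h, p2h = 0.4, 0.4             # a candidate product distribution p1 p2
p = {(a, b): (p1h if a == "h" else 1 - p1h) * (p2h if b == "h" else 1 - p2h)
     for a in ("h", "t") for b in ("h", "t")}
print(incomplete_likelihood(c, p, heads))
```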

68–70 26/37 [Dempster, Laird, Rubin 77] Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 39:1–38.
359 pages, ...
E-step: Q(ψ; ψ^(k)) = E_{ψ^(k)}{ log L(c, ψ) }
M-step: ψ^(k+1) = argmax_ψ Q(ψ; ψ^(k))

71–72 27/37 Expectation-Maximization algorithm [Prescher 05]
input: corpus c : X → ℝ_{≥0}; probability model B ⊆ B(Y); starting distribution p₀ ∈ B; observation mapping π : Y → X
output: sequence p₁, p₂, p₃, ... ∈ B
for each i = 1, 2, 3, ...
  E-step: compute the corpus c_{p_{i-1}} : Y → ℝ_{≥0} expected by p_{i-1}:
    c_{p_{i-1}}(y) := c(π(y)) · p_{i-1}(y) / Σ_{y' : π(y') = π(y)} p_{i-1}(y')
  M-step: compute the maximum-likelihood estimate: p_i := mle(c_{p_{i-1}}, B)
  print p_i
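The algorithm above translates into a short generic routine. The following sketch is my paraphrase (not Prescher's or the authors' code); the probability model B enters only implicitly, through a caller-supplied mle function, and the trivial demo uses the unrestricted model, where mle is relative frequency estimation.

```python
# Generic EM over a finite event set, following the E-step / M-step above.

def em(c_obs, pi, p0, mle, iterations=20):
    """c_obs: corpus over X, pi: Y -> X, p0: starting distribution over Y,
    mle: maps a complete-data corpus over Y to a distribution in the model B."""
    p = dict(p0)
    for _ in range(iterations):
        # E-step: complete-data corpus expected by the current p
        expected = {y: 0.0 for y in p}
        for x, freq in c_obs.items():
            z = sum(p[y] for y in p if pi(y) == x)   # normalizer for observation x
            for y in p:
                if pi(y) == x and z > 0.0:
                    expected[y] += freq * p[y] / z
        # M-step: maximum-likelihood estimate within the model B
        p = mle(expected)
    return p

# demo with the unrestricted model B(Y): mle is relative frequency estimation
def rfe(c):
    size = sum(c.values())
    return {y: v / size for y, v in c.items()}

p_hat = em({0: 4, 1: 9, 2: 2},
           pi=lambda y: sum(ch == "h" for ch in y),
           p0={yy: 0.25 for yy in [("h", "h"), ("h", "t"), ("t", "h"), ("t", "t")]},
           mle=rfe)
print(p_hat)
```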

73–74 28/37 Theorem [Dempster, Laird, Rubin 77]
Let c : X → ℝ_{≥0} be a corpus, B ⊆ B(Y), p₀ ∈ B, and π : Y → X an observation mapping.
If p₁, p₂, p₃, ... is generated by the EM algorithm, then
L(c, p₀) ≤ L(c, p₁) ≤ L(c, p₂) ≤ ... ≤ L(c, mle(c, B)).

75 29/37 Example: tossing two coins and hiding information, c(0) = 4, c(1) = 9, c(2) = 2
                     run 1           run 2           run 3           run 4
(p₁^0(h), p₂^0(h))   (0.200, 0.500)  (0.900, 0.600)  (0.000, 1.000)  (0.400, 0.400)
(p₁^1(h), p₂^1(h))   (0.253, 0.613)  (0.648, 0.219)  (0.133, 0.733)  (0.433, 0.433)
(p₁^2(h), p₂^2(h))   (0.239, 0.628)  (0.654, 0.213)  (0.165, 0.687)  (0.433, 0.433)
(p₁^3(h), p₂^3(h))   (0.228, 0.639)  (0.658, 0.208)  (0.180, 0.679)  (0.433, 0.433)
(p₁^4(h), p₂^4(h))   (0.219, 0.648)  (0.661, 0.205)  (0.188, 0.674)  (0.433, 0.433)
(p₁^5(h), p₂^5(h))   (0.213, 0.654)  (0.663, 0.204)  (0.193, 0.671)  (0.433, 0.433)
(p₁^20(h), p₂^20(h)) (0.200, 0.667)  (0.667, 0.200)  (0.200, 0.667)  (0.433, 0.433)
The EM algorithm converges to one of three distributions:
      h      t            h      t            h      t
p₁    1/5    4/5     p₁   2/3    1/3     p₁   13/30  17/30
p₂    2/3    1/3     p₂   1/5    4/5     p₂   13/30  17/30
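The limit points in the table can be reproduced with a self-contained instantiation of EM for this example. The sketch below is my own re-implementation (not the code behind the slide), restricted to the three interior starting points, and it uses the closed-form M-step for the product model; intermediate values may differ slightly from the table, but the limits agree.

```python
# EM for the two-coins-with-hidden-number-of-heads example.  The model B is the
# product model, so the M-step is relative frequency estimation on the marginals
# of the expected complete-data corpus.

C = {0: 4.0, 1: 9.0, 2: 2.0}                  # observed corpus over X = {0, 1, 2}
EVENTS = [("h", "h"), ("h", "t"), ("t", "h"), ("t", "t")]

def em(p1h, p2h, iterations=20):
    for _ in range(iterations):
        # current product distribution p1 p2 over Y
        p = {(a, b): (p1h if a == "h" else 1 - p1h) * (p2h if b == "h" else 1 - p2h)
             for (a, b) in EVENTS}
        # E-step: expected complete-data corpus
        expected = {y: 0.0 for y in EVENTS}
        for x, freq in C.items():
            compatible = [y for y in EVENTS if y.count("h") == x]
            z = sum(p[y] for y in compatible)
            for y in compatible:
                if z > 0.0:
                    expected[y] += freq * p[y] / z
        # M-step: mle in the product model = rfe of the two marginal corpora
        total = sum(expected.values())
        p1h = (expected[("h", "h")] + expected[("h", "t")]) / total
        p2h = (expected[("h", "h")] + expected[("t", "h")]) / total
    return round(p1h, 3), round(p2h, 3)

for start in [(0.2, 0.5), (0.9, 0.6), (0.4, 0.4)]:
    print(start, "->", em(*start))
# approximately (0.2, 0.667), (0.667, 0.2), (0.433, 0.433):
# i.e. (1/5, 2/3), (2/3, 1/5), (13/30, 13/30), the three limit points above
```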

76 30/37 (figure: surface plot of a function f(a, b) over the two coin parameters a and b)

77 31/37 outline of the talk:
Statistical machine translation (recall...)
Modeling with wta and wtt (recall...)
Training
  Rule extraction
  Training probabilities
    random experiments
    corpora and maximum-likelihood estimation
    Expectation-Maximization algorithm
    instantiation to training probability assignment for tt

78–83 32/37 Let N = (Q, Σ, Δ, q₀, R) be an extended tree transducer.
set of events: Y = D_N (set of derivations of N)
set of observations: X = T_Σ × T_Δ
observation mapping: π : D_N → T_Σ × T_Δ, π(d) = (ξ, ζ): retrieve ξ, ζ from d
probability assignment: p : Q → B(LHS × RHS); notation: p(q)(l, r), written p(q(l) → r)
Let p be a probability assignment. Extend p to p′ ∈ B(D_N):
p′( (q₁(l₁) → r₁) ... (q_n(l_n) → r_n) ) = Π_{i=1}^{n} p(q_i(l_i) → r_i)
probability model: B_N = {p′ ∈ B(D_N) | p a probability assignment}
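A minimal sketch (my own, with invented rule names) of how a probability assignment extends multiplicatively to derivations, viewed here simply as sequences of applied rules.

```python
import math

# Extending a probability assignment p to derivations: the probability of a
# derivation is the product of the probabilities of the rules it applies.
# The rule identifiers below are hypothetical.

p = {                                   # p(q(l) -> r) for a toy rule set
    ("q", "rule_VP_reorder"): 0.6,
    ("q", "rule_VP_keep"):    0.4,
    ("q", "rule_NP_lex"):     1.0,
}

def derivation_probability(derivation):
    return math.prod(p[rule] for rule in derivation)

d = [("q", "rule_VP_reorder"), ("q", "rule_NP_lex"), ("q", "rule_NP_lex")]
print(derivation_probability(d))        # 0.6 * 1.0 * 1.0 = 0.6
```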

84–85 33/37 Expectation-Maximization algorithm, instantiated
input: corpus c : T_Σ × T_Δ → ℝ_{≥0}; probability model B_N ⊆ B(D_N); starting distribution p₀ ∈ B_N; observation mapping π : D_N → T_Σ × T_Δ
output: sequence p₁, p₂, p₃, ... ∈ B_N
for each i = 1, 2, 3, ...
  E-step: compute the corpus c_{p_{i-1}} : D_N → ℝ_{≥0} expected by p_{i-1}:
    c_{p_{i-1}}(d) := c(π(d)) · p_{i-1}(d) / Σ_{d' : π(d') = π(d)} p_{i-1}(d')
  M-step: compute the maximum-likelihood estimate: p_i := mle(c_{p_{i-1}}, B_N)
  print p_i

86 34/37 some more calculations... yield...

87–89 35/37 [Graehl, Knight 04] (VANDA)
Require: corpus c : T_Σ × T_Δ → ℝ_{≥0} with supp(c) = {(ξ₁, ζ₁), ..., (ξ_n, ζ_n)},
  some initial probability assignment p₀ : R → ℝ_{≥0}
Ensure: approximation of a local maximum or saddle point of L(c, ·) : ℝ_{≥0}^R → ℝ_{≥0}
  with L(c, p) = Π_{(ξ,ζ) ∈ supp(c)} P(ξ, ζ)^{c(ξ,ζ)}
Variables: p : R → ℝ_{≥0}, count : R → ℝ_{≥0}, γ ∈ ℝ_{≥0}
Approach: compute a sequence of probability assignments p₀, p₁, p₂, ... such that L(c, p_i) ≤ L(c, p_{i+1}).
 1: for all (ξ, ζ) ∈ supp(c) do
 2:   construct the RTG Prod(ξ, N, ζ)
 3: for all i = 1, 2, 3, ... do
 4:   count(ρ) := 0 for every ρ ∈ R
 5:   p := p_{i-1}
 6:   for all (ξ, ζ) ∈ supp(c) do
 7:     let G = (Prod(ξ, N, ζ), p) and Prod(ξ, N, ζ) = (N′, R, S, R′)
 8:     compute out_G and in_G
 9:     for all (A → τ) ∈ R′ do
10:       γ := out_G(A) · p(τ(ε)) · in_G(τ)
11:       count(τ(ε)) := count(τ(ε)) + c(ξ, ζ) · γ / in_G(S)
12:   for all ρ = (q(l) → r) ∈ R do
13:     p_i(ρ) := count(ρ) / ( Σ_{ρ′ ∈ R_q} count(ρ′) )
14:   output(p_i)
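Of the algorithm above, the M-step in lines 12–13 is easy to isolate; the sketch below (my own, with hypothetical rule identifiers) normalises expected rule counts per left-hand-side state, so that the rules of each state form a distribution.

```python
from collections import defaultdict

# Normalisation step (lines 12-13 above): expected rule counts -> probability
# assignment, normalised per left-hand-side state q.

def normalize_per_state(count):
    """count: dict mapping (state, rule_id) -> expected count."""
    totals = defaultdict(float)
    for (state, _), value in count.items():
        totals[state] += value
    return {rule: (value / totals[rule[0]] if totals[rule[0]] > 0.0 else 0.0)
            for rule, value in count.items()}

count = {("q", "r1"): 3.0, ("q", "r2"): 1.0, ("q0", "r3"): 2.5}
print(normalize_per_state(count))  # {('q','r1'): 0.75, ('q','r2'): 0.25, ('q0','r3'): 1.0}
```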

90–91 36/37 References:
[Arnold, Dauchet 82] Morphismes et bimorphismes d'arbres, Theoretical Computer Science, 20(1).
[Dempster, Laird, Rubin 77] Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 39:1–38.
[Fülöp, Maletti, Vogler 11] Weighted extended tree transducers, Fundamenta Informaticae, 111(2).
[Galley, Hopkins, Knight, Marcu 04] What's in a translation rule? Proc. HLT-NAACL, Association for Computational Linguistics.
[Graehl, Knight 04] Training tree transducers, Proc. HLT-NAACL, Association for Computational Linguistics.
[Lopez 08] Statistical Machine Translation, ACM Computing Surveys, 40(3):8:1–8:49.
[Liang, Bouchard-Côté, Klein, Taskar 06] An End-to-End Discriminative Approach to Machine Translation, Proc. 21st Int. Conf. on Computational Linguistics and 44th Ann. Meeting of the Assoc. for Comput. Linguistics.
[Prescher 05] A Tutorial on the Expectation-Maximization Algorithm ..., arXiv:cs/0412015v2.
[Stüber 12] personal communication.

thanks

92 37/37 set of all distributions over Y : B(Y)
probability model over Y : B ⊆ B(Y)
corpus: c : Y → ℝ_{≥0}; size of c: |c| = Σ_y c(y)
relative frequency estimation: rfe(c)(y) = c(y) / |c|
likelihood of a Y-corpus c under p ∈ B(Y): L(c, p) = Π_y p(y)^{c(y)}
maximum-likelihood estimation of a Y-corpus c in B ⊆ B(Y): mle(c, B) = argmax_{p ∈ B} L(c, p)
likelihood of an X-corpus c under p ∈ B(Y): L(c, p) = Π_x ( Σ_{y : π(y) = x} p(y) )^{c(x)}
maximum-likelihood estimation of an X-corpus c in B ⊆ B(Y): mle(c, B) = argmax_{p ∈ B} L(c, p)
