Statistical Machine Translation of Natural Languages


1 1/37 Statistical Machine Translation of Natural Languages: Rule Extraction and Training Probabilities
Matthias Büchse, Toni Dietze, Johannes Osterholzer, Torsten Stüber, Heiko Vogler
Technische Universität Dresden, Germany
Graduiertenkolleg Quantitative Logics and Automata, Gohrisch, March 2013

2–3 2–3/37 outline of the talk:
Statistical machine translation (recall...)
Modeling with wta and wtt (recall...)
Training
  Rule extraction
  Training probabilities

4–6 4/37 given: source language SL, target language TL
find: machine translation h : SL → TL
e.g. SL = English, TL = German
s = I saw the man with the telescope
h(s) = Ich sah den Mann durch das Tel.
machine translation → statistical machine translation (SMT)

7–9 5/37 assumptions → modeling → hypothesis space H
hypothesis space: H ⊆ {h | h : SL → TL}
H and training data → training → ĥ ∈ H
[Lopez 08]: "By examining many samples of human-produced translations, SMT algorithms automatically learn how to translate."
ĥ and test data → evaluation → score

10 6/37 outline of the talk:
Statistical machine translation (recall...)
Modeling with wta and wtt (recall...)
Training
  Rule extraction
  Training probabilities

11 7/37 weighted tree transducer (wtt) M = (Q, Σ, Δ, q₀, R)
Q finite set (states)
Σ, Δ finite sets (input / output symbols)
q₀ ∈ Q (initial state)
R finite set of particular term rewrite rules with weights, linear and nondeleting in x₁, ..., x_k

12 8/37 (figure only; no text on this slide)

13 9/37 extended left-hand sides, one-symbol look-ahead (figure)

14–15 10/37 recall:
assumptions → modeling → hypothesis space H
H = {h_{λ,M,A} | λ ∈ ℝ_{≥0}², wtt M, wta A}
h_{λ,M,A}(s) = π_TL( argmax_{(d,r) ∈ Y : π_SL(d,r) = s} wt_M(d)^{λ₁} · wt_A(r)^{λ₂} )
H and training data → training → ĥ ∈ H
ĥ and test data → evaluation → score

16 11/37 outline of the talk:
Statistical machine translation (recall...)
Modeling with wta and wtt (recall...)
Training
  Rule extraction
  Training probabilities

17–18 12/37 assumptions → modeling → hypothesis space H
H = {h_{λ,M,A} | λ ∈ ℝ_{≥0}², wtt M, wta A}
h_{λ,M,A}(s) = π_TL( argmax_{(d,r) ∈ Y : π_SL(d,r) = s} wt_M(d)^{λ₁} · wt_A(r)^{λ₂} )
H and training data → training → ĥ ∈ H
training data (corpora):
  Hansards of the Parliament of Canada, English–French, 2 × 1,278,000 sentences
  EUROPARL: Danish, German, English, French, ..., 1,500,000 sentence pairs / language pair, 50,000,000 words / language
  TIGER (v2.1): German, 50,000 sentences

19–22 12/37 assumptions → modeling → hypothesis space H
H = {h_{λ,M,A} | λ ∈ ℝ_{≥0}², wtt M, wta A}
h_{λ,M,A}(s) = π_TL( argmax_{(d,r) ∈ Y : π_SL(d,r) = s} wt_M(d)^{λ₁} · wt_A(r)^{λ₂} )
H and training data → training → ĥ ∈ H
restricted hypothesis space:
H ⊇ H′ = {h_M | wtt M}
       = {h_{(N,p)} | tt N, probability assignment p}
       = {h_{(N,p)} | probability assignment p}   (for a fixed tt N)
rule extraction: tt N
training probabilities: probability assignment p : R → [0, 1]

23–27 13/37 Rule extraction: [Galley, Hopkins, Knight, Marcu 04]
parallel corpus C = {(e, f), ...}
→ Berkeley parser → parsed corpus C_ξ = {(ξ_e, f), ...}
→ word alignment → aligned sentence pairs {(e, alignment, f), ...}
→ glue → alignment graphs G = {(ξ_e, alignment, f), ...}
→ rule extraction → yxtt M, tt N
example sentence pair (shown as a figure):
e : Garcia has a company also.
f : Garcia tambien tiene una empresa.

28–29 14–15/37 alignment graph (figure):
e : Garcia has a company also.
f : Garcia tambien tiene una empresa.
extracted rule for the tt N (tree-to-string):
q( VP(x₁:VBZ, NP(x₂:NP, x₃:ADVP)) ) → q(x₃) q(x₁) q(x₂)
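For concreteness, here is a small, hypothetical sketch (not the authors' implementation) of how such an extracted tree-to-string rule could be represented as a data structure; all class and field names are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical encoding of an extracted tree-to-string rule.
# Left-hand side: a state applied to a tree pattern with typed variables;
# right-hand side: the reordered variables (and, in general, terminal strings).

@dataclass(frozen=True)
class Var:
    index: int      # x1, x2, ...
    state: str      # state required for the subtree, e.g. "q"
    label: str      # nonterminal the variable must match, e.g. "VBZ"

@dataclass(frozen=True)
class Pattern:
    label: str      # root label of the pattern, e.g. "VP"
    children: tuple # sub-patterns or Vars

@dataclass(frozen=True)
class Rule:
    state: str      # state on the left-hand side
    lhs: Pattern    # input tree pattern
    rhs: tuple      # output side: variables and/or terminal strings

# q( VP(x1:VBZ, NP(x2:NP, x3:ADVP)) ) -> q(x3) q(x1) q(x2)
rule = Rule(
    state="q",
    lhs=Pattern("VP", (Var(1, "q", "VBZ"),
                       Pattern("NP", (Var(2, "q", "NP"), Var(3, "q", "ADVP"))))),
    rhs=(Var(3, "q", "ADVP"), Var(1, "q", "VBZ"), Var(2, "q", "NP")),
)
print(rule)
```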

30–31 16/37 outline of the talk:
Statistical machine translation (recall...)
Modeling with wta and wtt (recall...)
Training
  Rule extraction
  Training probabilities
    random experiments
    corpora and maximum-likelihood estimation
    Expectation-Maximization algorithm
    instantiation to training probability assignment for tt

32–35 17/37 random experiment (Y, p):
finite set Y (events), distribution p over Y : p : Y → [0, 1], Σ_y p(y) = 1
set of all distributions over Y : B(Y)
probability model over Y : B ⊆ B(Y)
unrestricted probability model: B(Y)
Let (Y₁, p₁), (Y₂, p₂) be two random experiments.
independent product: (Y₁ × Y₂, p₁ p₂) with (p₁ p₂)(y₁, y₂) = p₁(y₁) · p₂(y₂)
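These definitions translate directly into a few lines of Python; the following sketch (my own, not from the talk) represents a distribution over a finite event set as a dict and builds the independent product of two random experiments.

```python
# A distribution over a finite event set Y as a dict, plus the
# independent product of two random experiments.

def is_distribution(p, tol=1e-9):
    """p : Y -> [0, 1] with values summing to 1."""
    return all(0.0 <= v <= 1.0 for v in p.values()) and abs(sum(p.values()) - 1.0) < tol

def product(p1, p2):
    """Independent product: (p1 p2)(y1, y2) = p1(y1) * p2(y2)."""
    return {(y1, y2): p1[y1] * p2[y2] for y1 in p1 for y2 in p2}

coin1 = {"h": 0.4, "t": 0.6}
coin2 = {"h": 0.5, "t": 0.5}
two_coins = product(coin1, coin2)
assert is_distribution(coin1) and is_distribution(two_coins)
print(two_coins)  # {('h', 'h'): 0.2, ('h', 't'): 0.2, ('t', 'h'): 0.3, ('t', 't'): 0.3}
```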

36–39 18/37 Examples:
tossing a coin: Y = {h, t}, p(h) = 0.4, p(t) = 0.6
tossing two coins: Y = {h, t} × {h, t}
probability model B = {p₁ p₂ ∈ B(Y) | p₁, p₂ ∈ B({h, t})}
consider the distribution p ∈ B({h, t} × {h, t}) with p(h, h) = p(t, t) = 0, p(h, t) = p(t, h) = 0.5
observe: p ∉ B
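A quick numerical check of this observation (my own, not from the slides): if the anti-correlated p were a product p₁ p₂, then p₁ and p₂ would have to be its marginals, but the product of those marginals differs from p.

```python
# The "perfectly anti-correlated" distribution is not a product of two
# single-coin distributions.

p = {("h", "h"): 0.0, ("h", "t"): 0.5, ("t", "h"): 0.5, ("t", "t"): 0.0}

p1 = {y1: sum(v for (a, b), v in p.items() if a == y1) for y1 in ("h", "t")}  # marginal of coin 1
p2 = {y2: sum(v for (a, b), v in p.items() if b == y2) for y2 in ("h", "t")}  # marginal of coin 2
marginal_product = {(y1, y2): p1[y1] * p2[y2] for y1 in p1 for y2 in p2}

print(p1, p2)            # both uniform: {'h': 0.5, 't': 0.5}
print(marginal_product)  # every outcome gets 0.25 -- not equal to p, so p is not in B
```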

40 19/37 outline of the talk:
Statistical machine translation (recall...)
Modeling with wta and wtt (recall...)
Training
  Rule extraction
  Training probabilities
    random experiments
    corpora and maximum-likelihood estimation
    Expectation-Maximization algorithm
    instantiation to training probability assignment for tt

41–46 20/37 corpus: c : Y → ℝ_{≥0}
frequency of y: c(y); supp(c) finite!
size of c: |c| = Σ_y c(y)
Let p ∈ B(Y) and c : Y → ℝ_{≥0} be a corpus.
likelihood of c under p: L(c, p) = Π_y p(y)^{c(y)}
B ⊆ B(Y) and corpus c → training → p̂ ∈ B
maximum-likelihood estimation of c in B: p̂ = argmax_{p ∈ B} L(c, p) = mle(c, B)
note: mle(c, B) ∈ B
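As a small aside (not from the slides), the likelihood of a corpus can be computed in log space to avoid underflow; the coin corpus below is the one used on the next slide.

```python
import math

# Likelihood of a corpus under a distribution, as defined above.

def log_likelihood(c, p):
    """log L(c, p) = sum_y c(y) * log p(y); -inf if some observed y has p(y) = 0."""
    total = 0.0
    for y, freq in c.items():
        if freq > 0.0:
            if p.get(y, 0.0) == 0.0:
                return float("-inf")
            total += freq * math.log(p[y])
    return total

c = {"h": 8, "t": 12}
print(math.exp(log_likelihood(c, {"h": 0.4, "t": 0.6})))  # likelihood of the coin corpus
print(math.exp(log_likelihood(c, {"h": 0.5, "t": 0.5})))  # smaller: 0.4 / 0.6 fits better
```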

47–51 21/37 Theorem. Let c : Y → ℝ_{≥0} be a corpus. Then
mle(c, B(Y)) = rfe(c),
and for every B ⊆ B(Y) with rfe(c) ∉ B: mle(c, B) ≠ rfe(c).
relative frequency estimation of c (empirical distribution): rfe(c) : Y → [0, 1], rfe(c)(y) = c(y) / |c|
Example: tossing one coin
        h    t
c:      8    12
rfe(c): 0.4  0.6
rfe(c) ∈ B(Y)
but...
Example: tossing two coins
        (h, h)  (h, t)  (t, h)  (t, t)
c:      0       5       5       0
rfe(c): 0       1/2     1/2     0
rfe(c) ∈ B(Y), but rfe(c) ∉ {p₁ p₂ | p₁, p₂ ∈ B({h, t})}
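Relative frequency estimation is a one-liner, and the first part of the theorem can be sanity-checked numerically; the grid search below (my own illustration, not a proof) confirms that, over the unrestricted model, the likelihood of the one-coin corpus is maximised at rfe(c).

```python
import math

# rfe and a brute-force check (on a grid) that it maximises the likelihood
# over the unrestricted model B(Y).

def rfe(c):
    size = sum(c.values())
    return {y: freq / size for y, freq in c.items()}

def likelihood(c, p):
    return math.prod(p[y] ** c[y] for y in c)

c = {"h": 8, "t": 12}
print(rfe(c))  # {'h': 0.4, 't': 0.6}

best = max((likelihood(c, {"h": q, "t": 1 - q}), q) for q in
           (i / 1000 for i in range(1001)))
print(best)    # maximum at q = 0.4, i.e. at rfe(c)
```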

52–57 22/37 Observation. Let c : Y → ℝ_{≥0} be a corpus, Y = Y₁ × Y₂, and B = {p₁ p₂ | p₁ ∈ B(Y₁), p₂ ∈ B(Y₂)}.
Then: mle(c, B) = rfe(c₁) rfe(c₂), where c₁(y₁) = Σ_{y₂} c(y₁, y₂) and c₂(y₂) = Σ_{y₁} c(y₁, y₂).
Example: tossing two coins (Y₁ = Y₂ = {h, t})
corpus:     (h, h)  (h, t)  (t, h)  (t, t)
c:          0       5       5       0
for i ∈ {1, 2}:
            h    t
c_i:        5    5
rfe(c_i):   1/2  1/2
            (h, h)  (h, t)  (t, h)  (t, t)
mle(c, B):  1/4     1/4     1/4     1/4
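The observation gives a closed-form M-step for product models; the sketch below (my own, not the authors' code) recomputes the example table from the marginal corpora.

```python
# Maximum-likelihood estimation in the product model via the marginal
# corpora c1 and c2, as stated in the observation above.

def rfe(c):
    size = sum(c.values())
    return {y: freq / size for y, freq in c.items()}

c = {("h", "h"): 0, ("h", "t"): 5, ("t", "h"): 5, ("t", "t"): 0}

c1 = {y1: sum(v for (a, _), v in c.items() if a == y1) for y1 in ("h", "t")}
c2 = {y2: sum(v for (_, b), v in c.items() if b == y2) for y2 in ("h", "t")}

mle_in_B = {(y1, y2): rfe(c1)[y1] * rfe(c2)[y2] for y1 in c1 for y2 in c2}
print(c1, c2)    # {'h': 5, 't': 5} {'h': 5, 't': 5}
print(mle_in_B)  # 1/4 for every outcome, as in the table above
```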

58 23/37 outline of the talk:
Statistical machine translation (recall...)
Modeling with wta and wtt (recall...)
Training
  Rule extraction
  Training probabilities
    random experiments
    corpora and maximum-likelihood estimation
    Expectation-Maximization algorithm
    instantiation to training probability assignment for tt

59–61 24/37 Example: tossing two coins and hiding information
player A tosses two coins several times, each time records only the number of heads, and eventually forms the corresponding corpus c : {0, 1, 2} → ℝ_{≥0}.
Example: c(0) = 4, c(1) = 9, c(2) = 2
player B gets such a corpus c : {0, 1, 2} → ℝ_{≥0} of observations and has to estimate the distributions p₁ and p₂ of the two coins.
c : {0, 1, 2} → ℝ_{≥0} is incomplete data

62–67 25/37 set of events: Y
set of observations: X
observation mapping: π : Y → X
Example: tossing two coins and hiding information
set of events: Y = {(h, h), (h, t), (t, h), (t, t)}
set of observations: X = {0, 1, 2}
observation mapping:
y:     (h, h)  (h, t)  (t, h)  (t, t)
π(y):  2       1       1       0
B ⊆ B(Y) and corpus c : X → ℝ_{≥0} → training → p̂ ∈ B
likelihood of c under p ∈ B: L(c, p) = Π_x ( Σ_{y : π(y) = x} p(y) )^{c(x)}
maximum-likelihood estimation: p̂ = argmax_{p ∈ B} L(c, p) = mle(c, B)
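A small sketch (my own) of this incomplete-data likelihood for the two-coin example; the candidate parameters p1h and p2h are arbitrary illustration values.

```python
# Incomplete-data likelihood L(c, p) = prod_x ( sum_{y: pi(y) = x} p(y) )^{c(x)}.

def heads(y):  # observation mapping pi : Y -> X
    return sum(1 for coin in y if coin == "h")

def incomplete_likelihood(c, p, pi):
    result = 1.0
    for x, freq in c.items():
        result *= sum(v for y, v in p.items() if pi(y) == x) ** freq
    return result

c = {0: 4, 1: 9, 2: 2}          # observed corpus over X = {0, 1, 2}
p1h, p2h = 0.4, 0.4             # a candidate product distribution p1 p2
p = {(a, b): (p1h if a == "h" else 1 - p1h) * (p2h if b == "h" else 1 - p2h)
     for a in ("h", "t") for b in ("h", "t")}
print(incomplete_likelihood(c, p, heads))
```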

68–70 26/37 [Dempster, Laird, Rubin 77] Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 39:1–38.
359 pages, ...
E-step: Q(ψ; ψ^(k)) = E_{ψ^(k)}{ log L(c, ψ) }
M-step: ψ^(k+1) = argmax_ψ Q(ψ; ψ^(k))

71–72 27/37 Expectation-Maximization algorithm [Prescher 05]
input: corpus c : X → ℝ_{≥0}; probability model B ⊆ B(Y); starting distribution p₀ ∈ B; observation mapping π : Y → X
output: sequence p₁, p₂, p₃, ... ∈ B
for each i = 1, 2, 3, ...
  E-step: compute the corpus c_{p_{i-1}} : Y → ℝ_{≥0} expected by p_{i-1}:
    c_{p_{i-1}}(y) := c(π(y)) · p_{i-1}(y) / Σ_{y' : π(y') = π(y)} p_{i-1}(y')
  M-step: compute the maximum-likelihood estimate: p_i := mle(c_{p_{i-1}}, B)
  print p_i
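The algorithm above translates into a short generic routine. The following sketch is my paraphrase (not Prescher's or the authors' code); the probability model B enters only implicitly, through a caller-supplied mle function, and the trivial demo uses the unrestricted model, where mle is relative frequency estimation.

```python
# Generic EM over a finite event set, following the E-step / M-step above.

def em(c_obs, pi, p0, mle, iterations=20):
    """c_obs: corpus over X, pi: Y -> X, p0: starting distribution over Y,
    mle: maps a complete-data corpus over Y to a distribution in the model B."""
    p = dict(p0)
    for _ in range(iterations):
        # E-step: complete-data corpus expected by the current p
        expected = {y: 0.0 for y in p}
        for x, freq in c_obs.items():
            z = sum(p[y] for y in p if pi(y) == x)   # normalizer for observation x
            for y in p:
                if pi(y) == x and z > 0.0:
                    expected[y] += freq * p[y] / z
        # M-step: maximum-likelihood estimate within the model B
        p = mle(expected)
    return p

# demo with the unrestricted model B(Y): mle is relative frequency estimation
def rfe(c):
    size = sum(c.values())
    return {y: v / size for y, v in c.items()}

p_hat = em({0: 4, 1: 9, 2: 2},
           pi=lambda y: sum(ch == "h" for ch in y),
           p0={yy: 0.25 for yy in [("h", "h"), ("h", "t"), ("t", "h"), ("t", "t")]},
           mle=rfe)
print(p_hat)
```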

73–74 28/37 Theorem [Dempster, Laird, Rubin 77]
Let c : X → ℝ_{≥0} be a corpus, B ⊆ B(Y), p₀ ∈ B, and π : Y → X an observation mapping.
If p₁, p₂, p₃, ... is generated by the EM algorithm, then
L(c, p₀) ≤ L(c, p₁) ≤ L(c, p₂) ≤ ... ≤ L(c, mle(c, B)).

75 29/37 Example: tossing two coins and hiding information, c(0) = 4, c(1) = 9, c(2) = 2
                     run 1           run 2           run 3           run 4
(p₁^0(h), p₂^0(h))   (0.200, 0.500)  (0.900, 0.600)  (0.000, 1.000)  (0.400, 0.400)
(p₁^1(h), p₂^1(h))   (0.253, 0.613)  (0.648, 0.219)  (0.133, 0.733)  (0.433, 0.433)
(p₁^2(h), p₂^2(h))   (0.239, 0.628)  (0.654, 0.213)  (0.165, 0.687)  (0.433, 0.433)
(p₁^3(h), p₂^3(h))   (0.228, 0.639)  (0.658, 0.208)  (0.180, 0.679)  (0.433, 0.433)
(p₁^4(h), p₂^4(h))   (0.219, 0.648)  (0.661, 0.205)  (0.188, 0.674)  (0.433, 0.433)
(p₁^5(h), p₂^5(h))   (0.213, 0.654)  (0.663, 0.204)  (0.193, 0.671)  (0.433, 0.433)
(p₁^20(h), p₂^20(h)) (0.200, 0.667)  (0.667, 0.200)  (0.200, 0.667)  (0.433, 0.433)
The EM algorithm converges to one of three distributions:
      h      t            h      t            h      t
p₁    1/5    4/5     p₁   2/3    1/3     p₁   13/30  17/30
p₂    2/3    1/3     p₂   1/5    4/5     p₂   13/30  17/30
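The limit points in the table can be reproduced with a self-contained instantiation of EM for this example. The sketch below is my own re-implementation (not the code behind the slide), restricted to the three interior starting points, and it uses the closed-form M-step for the product model; intermediate values may differ slightly from the table, but the limits agree.

```python
# EM for the two-coins-with-hidden-number-of-heads example.  The model B is the
# product model, so the M-step is relative frequency estimation on the marginals
# of the expected complete-data corpus.

C = {0: 4.0, 1: 9.0, 2: 2.0}                  # observed corpus over X = {0, 1, 2}
EVENTS = [("h", "h"), ("h", "t"), ("t", "h"), ("t", "t")]

def em(p1h, p2h, iterations=20):
    for _ in range(iterations):
        # current product distribution p1 p2 over Y
        p = {(a, b): (p1h if a == "h" else 1 - p1h) * (p2h if b == "h" else 1 - p2h)
             for (a, b) in EVENTS}
        # E-step: expected complete-data corpus
        expected = {y: 0.0 for y in EVENTS}
        for x, freq in C.items():
            compatible = [y for y in EVENTS if y.count("h") == x]
            z = sum(p[y] for y in compatible)
            for y in compatible:
                if z > 0.0:
                    expected[y] += freq * p[y] / z
        # M-step: mle in the product model = rfe of the two marginal corpora
        total = sum(expected.values())
        p1h = (expected[("h", "h")] + expected[("h", "t")]) / total
        p2h = (expected[("h", "h")] + expected[("t", "h")]) / total
    return round(p1h, 3), round(p2h, 3)

for start in [(0.2, 0.5), (0.9, 0.6), (0.4, 0.4)]:
    print(start, "->", em(*start))
# approximately (0.2, 0.667), (0.667, 0.2), (0.433, 0.433):
# i.e. (1/5, 2/3), (2/3, 1/5), (13/30, 13/30), the three limit points above
```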

76 30/37 (figure: surface plot of a function f(a, b) over the two coin parameters a and b)

77 31/37 outline of the talk:
Statistical machine translation (recall...)
Modeling with wta and wtt (recall...)
Training
  Rule extraction
  Training probabilities
    random experiments
    corpora and maximum-likelihood estimation
    Expectation-Maximization algorithm
    instantiation to training probability assignment for tt

78–83 32/37 Let N = (Q, Σ, Δ, q₀, R) be an extended tree transducer.
set of events: Y = D_N (set of derivations of N)
set of observations: X = T_Σ × T_Δ
observation mapping: π : D_N → T_Σ × T_Δ, π(d) = (ξ, ζ): retrieve ξ, ζ from d
probability assignment: p : Q → B(LHS × RHS); notation: p(q)(l, r), written p(q(l) → r)
Let p be a probability assignment. Extend p to p′ ∈ B(D_N):
p′( (q₁(l₁) → r₁) ... (q_n(l_n) → r_n) ) = Π_{i=1}^{n} p(q_i(l_i) → r_i)
probability model: B_N = {p′ ∈ B(D_N) | p a probability assignment}
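A minimal sketch (my own, with invented rule names) of how a probability assignment extends multiplicatively to derivations, viewed here simply as sequences of applied rules.

```python
import math

# Extending a probability assignment p to derivations: the probability of a
# derivation is the product of the probabilities of the rules it applies.
# The rule identifiers below are hypothetical.

p = {                                   # p(q(l) -> r) for a toy rule set
    ("q", "rule_VP_reorder"): 0.6,
    ("q", "rule_VP_keep"):    0.4,
    ("q", "rule_NP_lex"):     1.0,
}

def derivation_probability(derivation):
    return math.prod(p[rule] for rule in derivation)

d = [("q", "rule_VP_reorder"), ("q", "rule_NP_lex"), ("q", "rule_NP_lex")]
print(derivation_probability(d))        # 0.6 * 1.0 * 1.0 = 0.6
```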

84–85 33/37 Expectation-Maximization algorithm, instantiated
input: corpus c : T_Σ × T_Δ → ℝ_{≥0}; probability model B_N ⊆ B(D_N); starting distribution p₀ ∈ B_N; observation mapping π : D_N → T_Σ × T_Δ
output: sequence p₁, p₂, p₃, ... ∈ B_N
for each i = 1, 2, 3, ...
  E-step: compute the corpus c_{p_{i-1}} : D_N → ℝ_{≥0} expected by p_{i-1}:
    c_{p_{i-1}}(d) := c(π(d)) · p_{i-1}(d) / Σ_{d' : π(d') = π(d)} p_{i-1}(d')
  M-step: compute the maximum-likelihood estimate: p_i := mle(c_{p_{i-1}}, B_N)
  print p_i

86 34/37 some more calculations... yield...

87–89 35/37 [Graehl, Knight 04] (VANDA)
Require: corpus c : T_Σ × T_Δ → ℝ_{≥0} with supp(c) = {(ξ₁, ζ₁), ..., (ξ_n, ζ_n)},
  some initial probability assignment p₀ : R → ℝ_{≥0}
Ensure: approximation of a local maximum or saddle point of L(c, ·) : ℝ_{≥0}^R → ℝ_{≥0}
  with L(c, p) = Π_{(ξ,ζ) ∈ supp(c)} P(ξ, ζ)^{c(ξ,ζ)}
Variables: p : R → ℝ_{≥0}, count : R → ℝ_{≥0}, γ ∈ ℝ_{≥0}
Approach: compute a sequence of probability assignments p₀, p₁, p₂, ... such that L(c, p_i) ≤ L(c, p_{i+1}).
 1: for all (ξ, ζ) ∈ supp(c) do
 2:   construct the RTG Prod(ξ, N, ζ)
 3: for all i = 1, 2, 3, ... do
 4:   count(ρ) := 0 for every ρ ∈ R
 5:   p := p_{i-1}
 6:   for all (ξ, ζ) ∈ supp(c) do
 7:     let G = (Prod(ξ, N, ζ), p) and Prod(ξ, N, ζ) = (N′, R, S, R′)
 8:     compute out_G and in_G
 9:     for all (A → τ) ∈ R′ do
10:       γ := out_G(A) · p(τ(ε)) · in_G(τ)
11:       count(τ(ε)) := count(τ(ε)) + c(ξ, ζ) · γ / in_G(S)
12:   for all ρ = (q(l) → r) ∈ R do
13:     p_i(ρ) := count(ρ) / ( Σ_{ρ′ ∈ R_q} count(ρ′) )
14:   output(p_i)
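Of the algorithm above, the M-step in lines 12–13 is easy to isolate; the sketch below (my own, with hypothetical rule identifiers) normalises expected rule counts per left-hand-side state, so that the rules of each state form a distribution.

```python
from collections import defaultdict

# Normalisation step (lines 12-13 above): expected rule counts -> probability
# assignment, normalised per left-hand-side state q.

def normalize_per_state(count):
    """count: dict mapping (state, rule_id) -> expected count."""
    totals = defaultdict(float)
    for (state, _), value in count.items():
        totals[state] += value
    return {rule: (value / totals[rule[0]] if totals[rule[0]] > 0.0 else 0.0)
            for rule, value in count.items()}

count = {("q", "r1"): 3.0, ("q", "r2"): 1.0, ("q0", "r3"): 2.5}
print(normalize_per_state(count))  # {('q','r1'): 0.75, ('q','r2'): 0.25, ('q0','r3'): 1.0}
```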

90–91 36/37 References:
[Arnold, Dauchet 82] Morphismes et bimorphismes d'arbres, Theoretical Computer Science, 20(1).
[Dempster, Laird, Rubin 77] Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 39:1–38.
[Fülöp, Maletti, Vogler 11] Weighted extended tree transducers, Fundamenta Informaticae, 111(2).
[Galley, Hopkins, Knight, Marcu 04] What's in a translation rule? Proc. HLT-NAACL, Association for Computational Linguistics.
[Graehl, Knight 04] Training tree transducers, Proc. HLT-NAACL, Association for Computational Linguistics.
[Lopez 08] Statistical Machine Translation, ACM Computing Surveys, 40(3):8:1–8:49.
[Liang, Bouchard-Côté, Klein, Taskar 06] An End-to-End Discriminative Approach to Machine Translation, Proc. 21st Int. Conf. on Computational Linguistics and 44th Ann. Meeting of the Assoc. for Comput. Linguistics.
[Prescher 05] A Tutorial on the Expectation-Maximization Algorithm ..., arXiv:cs/0412015v2.
[Stüber 12] personal communication.

thanks

92 37/37 set of all distributions over Y : B(Y)
probability model over Y : B ⊆ B(Y)
corpus: c : Y → ℝ_{≥0}; size of c: |c| = Σ_y c(y)
relative frequency estimation: rfe(c)(y) = c(y) / |c|
likelihood of a Y-corpus c under p ∈ B(Y): L(c, p) = Π_y p(y)^{c(y)}
maximum-likelihood estimation of a Y-corpus c in B ⊆ B(Y): mle(c, B) = argmax_{p ∈ B} L(c, p)
likelihood of an X-corpus c under p ∈ B(Y): L(c, p) = Π_x ( Σ_{y : π(y) = x} p(y) )^{c(x)}
maximum-likelihood estimation of an X-corpus c in B ⊆ B(Y): mle(c, B) = argmax_{p ∈ B} L(c, p)
