Statistical Machine Translation of Natural Languages
Statistical Machine Translation of Natural Languages: Rule Extraction and Training Probabilities
Matthias Büchse, Toni Dietze, Johannes Osterholzer, Torsten Stüber, Heiko Vogler
Technische Universität Dresden, Germany
Graduiertenkolleg Quantitative Logics and Automata, Gohrisch, March 2013
outline of the talk:
- Statistical machine translation (recall...)
- Modeling with wta and wtt (recall...)
- Training
  - Rule extraction
  - Training probabilities
given: source language SL, target language TL
find: machine translation h : SL → TL
e.g. SL = English, TL = German
s = "I saw the man with the telescope"
h(s) = "Ich sah den Mann durch das Teleskop"
from machine translation to statistical machine translation (SMT)
assumptions --modeling--> hypothesis space H
hypothesis space: H ⊆ {h | h : SL → TL}
H and training data --training--> ĥ ∈ H
[Lopez 08]: "By examining many samples of human-produced translations, SMT algorithms automatically learn how to translate."
ĥ and test data --evaluation--> score
weighted tree transducer (wtt): M = (Q, Σ, Δ, q0, R)
Q finite set (states); Σ, Δ finite sets (input / output symbols); q0 ∈ Q (initial state); R finite set of particular term-rewrite rules with weights, linear and nondeleting in x1, ..., xk
extended left-hand sides, one-symbol look-ahead
recall:
assumptions --modeling--> hypothesis space H
H = {h_{λ,M,A} | λ ∈ (R≥0)², wtt M, wta A}
h_{λ,M,A}(s) = π_TL( argmax_{(d,r) ∈ Y : π_SL(d,r) = s} wt_M(d)^{λ1} · wt_A(r)^{λ2} )
H and training data --training--> ĥ ∈ H
ĥ and test data --evaluation--> score
assumptions --modeling--> hypothesis space H (as before)
H and training data --training--> ĥ ∈ H
training data (corpora):
- Hansards of the Parliament of Canada, English-French, 2 × 1,278,000
- EUROPARL: Danish, German, English, French, ..., 1,500,000 / language pair, 50,000,000 words / language
- TIGER (v2.1), German, 50,000 sentences
restrict the hypothesis space: H' ⊆ H with
H' = {h_M | wtt M} = {h_(N,p) | tt N, probability assignment p}
fixing the tt N: {h_(N,p) | probability assignment p}
rule extraction --> tt N
training probabilities --> probability assignment p : R → [0, 1]
Rule extraction [Galley, Hopkins, Knight, Marcu 04]:
parallel corpus C = {(e, f), ...}, e.g.
e: Garcia has a company also.
f: Garcia tambien tiene una empresa.
Berkeley parser: parse each English side e into a tree ξ, yielding C^ξ = {(ξ, f), ...}
alignment: word-align e and f
glue: combine parse tree and alignment into the alignment graph G = {(ξ ~ f), ...}
rule extraction: G --> yxtt M --> tt N
alignment graph:
e: Garcia has a company also.
f: Garcia tambien tiene una empresa.
extracted rule for the tt N (tree-to-string):
q(VP(x1:VBZ, NP(x2:NP, x3:ADVP))) → q(x3) q(x1) q(x2)
outline of the talk:
- Statistical machine translation (recall...)
- Modeling with wta and wtt (recall...)
- Training
  - Rule extraction
  - Training probabilities
    - random experiments
    - corpora and maximum-likelihood estimation
    - Expectation-Maximization algorithm
    - instantiation to training the probability assignment for a tt
random experiment (Y, p): finite set Y (events), distribution p over Y:
p : Y → [0, 1] with Σ_y p(y) = 1
set of all distributions over Y: B(Y)
probability model over Y: B ⊆ B(Y); unrestricted probability model: B(Y)
Let (Y1, p1), (Y2, p2) be two random experiments.
independent product: (Y1 × Y2, p1 ⊗ p2) with (p1 ⊗ p2)(y1, y2) = p1(y1) · p2(y2)
Examples:
tossing a coin: Y = {h, t}, p(h) = 0.4, p(t) = 0.6
tossing two coins: Y = {h, t} × {h, t}
probability model B = {p1 ⊗ p2 ∈ B(Y) | p1, p2 ∈ B({h, t})}
consider the distribution p ∈ B({h, t} × {h, t}) with p(h, h) = p(t, t) = 0 and p(h, t) = p(t, h) = 0.5
observe: p ∉ B
corpus: c : Y → R≥0; frequency of y: c(y); supp(c) finite; size of c: |c| = Σ_y c(y)
Let p ∈ B(Y) and c : Y → R≥0 a corpus.
likelihood of c under p: L(c, p) = Π_y p(y)^{c(y)}
B ⊆ B(Y) and corpus c --training--> p̂ ∈ B
maximum-likelihood estimation of c in B: p̂ = argmax_{p ∈ B} L(c, p) = mle(c, B)
note: mle(c, B) ∈ B
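The two quantities above translate directly into code; the following sketch (function names are my own) represents corpora and distributions as dictionaries.

```python
def rfe(corpus):
    # relative-frequency estimate: rfe(c)(y) = c(y) / |c|
    size = sum(corpus.values())
    return {y: cy / size for y, cy in corpus.items()}

def likelihood(corpus, p):
    # L(c, p) = product over y of p(y)^c(y)
    result = 1.0
    for y, cy in corpus.items():
        result *= p.get(y, 0.0) ** cy
    return result

# a small coin corpus: c(h) = 8, c(t) = 12
c = {"h": 8, "t": 12}
print(rfe(c))  # {'h': 0.4, 't': 0.6}
```

In the unrestricted model, no distribution beats rfe(c): the likelihood of c under rfe(c) is at least as large as under any other distribution.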
Theorem. Let c : Y → R≥0 be a corpus. Then
mle(c, B(Y)) = rfe(c), and
for every B ⊆ B(Y) \ {rfe(c)}: mle(c, B) ≠ rfe(c).
relative frequency estimation of c (the empirical distribution): rfe(c) : Y → [0, 1] with rfe(c)(y) = c(y) / |c|; note rfe(c) ∈ B(Y)
Example (tossing one coin): c(h) = 8, c(t) = 12, hence rfe(c)(h) = 8/20 = 0.4 and rfe(c)(t) = 12/20 = 0.6
Example (tossing two coins): a corpus c with rfe(c)(h, h) = rfe(c)(t, t) = 0 and rfe(c)(h, t) = rfe(c)(t, h) = 1/2; here rfe(c) ∈ B(Y), but rfe(c) ∉ {p1 ⊗ p2 | p1, p2 ∈ B({h, t})}
but...
Observation. Let c : Y → R≥0 be a corpus, Y = Y1 × Y2, and B = {p1 ⊗ p2 | p1 ∈ B(Y1), p2 ∈ B(Y2)}. Then
mle(c, B) = rfe(c1) ⊗ rfe(c2)
where c1(y1) = Σ_{y2} c(y1, y2) and c2(y2) = Σ_{y1} c(y1, y2).
Example (tossing two coins, Y1 = Y2 = {h, t}): for a corpus c whose marginal corpora satisfy c_i(h) = c_i(t) = 5 for i ∈ {1, 2}, we get rfe(c_i)(h) = rfe(c_i)(t) = 1/2, and hence mle(c, B)(y1, y2) = 1/4 for every pair (y1, y2).
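Under this observation, maximum-likelihood estimation in the product model reduces to estimating the two marginals separately; a sketch (function name hypothetical):

```python
def product_mle(corpus):
    # mle in the product model is rfe(c1) ⊗ rfe(c2),
    # where c1, c2 are the marginal corpora of a corpus over pairs
    c1, c2 = {}, {}
    for (y1, y2), n in corpus.items():
        c1[y1] = c1.get(y1, 0.0) + n
        c2[y2] = c2.get(y2, 0.0) + n
    size = sum(corpus.values())
    return {(a, b): (c1[a] / size) * (c2[b] / size) for a in c1 for b in c2}

# perfectly anti-correlated tosses: rfe puts mass 1/2 on (h,t) and (t,h),
# but the best product distribution is uniform (1/4 on every pair)
c = {("h", "t"): 5, ("t", "h"): 5}
print(product_mle(c))
```

This makes the "but..." concrete: when rfe(c) itself is not a product distribution, the product-model mle cannot reach it.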
Example: tossing two coins and hiding information
player A tosses two coins several times; each time she records only the number of heads, and eventually forms the corresponding corpus c : {0, 1, 2} → R≥0. Example: c(0) = 4, c(1) = 9, c(2) = 2.
player B gets a corpus c : {0, 1, 2} → R≥0 of observations and has to estimate the distributions p1 and p2 of the coins.
c : {0, 1, 2} → R≥0 is incomplete data.
set of events: Y; set of observations: X; observation mapping: π : Y → X
Example (tossing two coins and hiding information): Y = {(h, h), (h, t), (t, h), (t, t)}, X = {0, 1, 2}, π(y) = number of heads in y, i.e. π(h, h) = 2, π(h, t) = π(t, h) = 1, π(t, t) = 0.
B ⊆ B(Y) and corpus c : X → R≥0 --training--> p̂ ∈ B
likelihood of c under p ∈ B: L(c, p) = Π_x ( Σ_{y : π(y)=x} p(y) )^{c(x)}
maximum-likelihood estimation: p̂ = argmax_{p ∈ B} L(c, p) = mle(c, B)
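The incomplete-data likelihood can be sketched directly from the formula (function name hypothetical):

```python
def incomplete_likelihood(obs_corpus, p, pi):
    # L(c, p) = product over x of ( sum of p(y) over y with pi(y) = x )^c(x)
    result = 1.0
    for x, cx in obs_corpus.items():
        result *= sum(py for y, py in p.items() if pi(y) == x) ** cx
    return result

# two fair coins, observing only the number of heads
p = {(y1, y2): 0.25 for y1 in "ht" for y2 in "ht"}
pi = lambda y: y.count("h")
print(incomplete_likelihood({0: 1, 1: 1, 2: 1}, p, pi))  # 0.03125
```

The inner sum marginalizes over all hidden events that explain the observation x, which is exactly what makes direct maximization hard and motivates EM.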
[Dempster, Laird, Rubin 77] Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 39:1-38.
E-step: Q(ψ; ψ^(k)) = E_{ψ^(k)}[ log L(c, ψ) ]
M-step: ψ^(k+1) = argmax_ψ Q(ψ; ψ^(k))
Expectation-Maximization algorithm [Prescher 05]
input: corpus c : X → R≥0; probability model B ⊆ B(Y); starting distribution p0 ∈ B; observation mapping π : Y → X
output: sequence p1, p2, p3, ... ∈ B
for each i = 1, 2, 3, ...
  E-step: compute the corpus c_{p_{i-1}} : Y → R≥0 expected by p_{i-1}:
    c_{p_{i-1}}(y) := c(π(y)) · p_{i-1}(y) / Σ_{y' : π(y')=π(y)} p_{i-1}(y')
  M-step: compute the maximum-likelihood estimate p_i := mle(c_{p_{i-1}}, B)
  print p_i
Theorem [Dempster, Laird, Rubin 77]. Let c : X → R≥0 be a corpus, B ⊆ B(Y), p0 ∈ B, and π : Y → X an observation mapping. If p1, p2, p3, ... is generated by the EM algorithm, then
L(c, p0) ≤ L(c, p1) ≤ L(c, p2) ≤ ... ≤ L(c, mle(c, B)).
Example: tossing two coins and hiding information, c(0) = 4, c(1) = 9, c(2) = 2

(p1^i(h), p2^i(h))   run 1            run 2            run 3            run 4
i = 0                (0.200, 0.500)   (0.900, 0.600)   (0.000, 1.000)   (0.400, 0.400)
i = 1                (0.253, 0.613)   (0.648, 0.219)   (0.133, 0.733)   (0.433, 0.433)
i = 2                (0.239, 0.628)   (0.654, 0.213)   (0.165, 0.687)   (0.433, 0.433)
i = 3                (0.228, 0.639)   (0.658, 0.208)   (0.180, 0.679)   (0.433, 0.433)
i = 4                (0.219, 0.648)   (0.661, 0.205)   (0.188, 0.674)   (0.433, 0.433)
i = 5                (0.213, 0.654)   (0.663, 0.204)   (0.193, 0.671)   (0.433, 0.433)
i = 20               (0.200, 0.667)   (0.667, 0.200)   (0.200, 0.667)   (0.433, 0.433)

The EM algorithm converges to one of three distributions:
p1: h 1/5, t 4/5;   p2: h 2/3, t 1/3
p1: h 2/3, t 1/3;   p2: h 1/5, t 4/5
p1: h 13/30, t 17/30;   p2: h 13/30, t 17/30
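The runs above can be reproduced with a direct implementation of the E- and M-steps for this two-coin model (a sketch under the stated setup; names are my own):

```python
def em_step(p1h, p2h, obs_corpus):
    # one EM iteration for two hidden coins observed via the number of heads;
    # p1h, p2h are the current heads-probabilities of the two coins
    prob = {(a, b): (p1h if a else 1 - p1h) * (p2h if b else 1 - p2h)
            for a in (0, 1) for b in (0, 1)}   # a, b = 1 means "heads"
    # E-step: distribute each observed count over the events that explain it
    count = {y: 0.0 for y in prob}
    for x, cx in obs_corpus.items():
        z = sum(p for y, p in prob.items() if sum(y) == x)
        for y, p in prob.items():
            if sum(y) == x:
                count[y] += cx * p / z
    # M-step: mle in the product model = relative frequencies of the marginals
    n = sum(count.values())
    return ((count[1, 1] + count[1, 0]) / n, (count[1, 1] + count[0, 1]) / n)

c = {0: 4, 1: 9, 2: 2}
print(em_step(0.2, 0.5, c))  # first step of run 1: (0.253..., 0.613...)
```

Iterating from (0.2, 0.5) reproduces run 1 and approaches (1/5, 2/3); the symmetric start (0.4, 0.4) reaches the fixed point (13/30, 13/30) after a single step.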
[figure: plot of the function f(a, b) over the two parameters a and b]
Let N = (Q, Σ, Δ, q0, R) be an extended tree transducer.
set of events: Y = D_N (set of derivations of N)
set of observations: X = T_Σ × T_Δ
observation mapping: π : D_N → T_Σ × T_Δ with π(d) = (ξ, ζ): retrieve ξ and ζ from d
probability assignment: p : Q → B(LHS × RHS); we write p(q(l) → r) for p(q)(l, r)
Let p be a probability assignment. Extend p to p' ∈ B(D_N):
p'( (q1(l1) → r1) ... (qn(ln) → rn) ) = Π_{i=1}^{n} p(qi(li) → ri)
probability model: B_N = {p' ∈ B(D_N) | probability assignment p}
Expectation-Maximization algorithm, instantiated
input: corpus c : T_Σ × T_Δ → R≥0; probability model B_N ⊆ B(D_N); starting distribution p0 ∈ B_N; observation mapping π : D_N → T_Σ × T_Δ
output: sequence p1, p2, p3, ... ∈ B_N
for each i = 1, 2, 3, ...
  E-step: compute the corpus c_{p_{i-1}} : D_N → R≥0 expected by p_{i-1}:
    c_{p_{i-1}}(d) := c(π(d)) · p_{i-1}(d) / Σ_{d' : π(d')=π(d)} p_{i-1}(d')
  M-step: compute the maximum-likelihood estimate p_i := mle(c_{p_{i-1}}, B_N)
  print p_i
some more calculations yield...
[Graehl, Knight 04], implemented in VANDA:

Require: corpus c : T_Σ × T_Δ → R≥0 with supp(c) = {(ξ1, ζ1), ..., (ξn, ζn)}; some initial probability assignment p0 : R → R≥0
Ensure: approximation of a local maximum or saddle point of the likelihood L(c, ·), where L(c, p) = Π_{(ξ,ζ) ∈ supp(c)} P(ξ, ζ)^{c(ξ,ζ)}
Variables: p : R → R≥0, count : R → R≥0, γ ∈ R≥0
Approach: compute a sequence of probability assignments p0, p1, p2, ... such that L(c, p_i) ≤ L(c, p_{i+1}).
1: for all (ξ, ζ) ∈ supp(c) do
2:   construct the RTG Prod(ξ, N, ζ)
3: for all i = 1, 2, 3, ... do
4:   count(ρ) := 0 for every ρ ∈ R
5:   p := p_{i-1}
6:   for all (ξ, ζ) ∈ supp(c) do
7:     let G = (Prod(ξ, N, ζ), p) and Prod(ξ, N, ζ) = (N', R, S, R')
8:     compute out_G and in_G
9:     for all (A → τ) ∈ R' do
10:      γ := out_G(A) · p(τ(ε)) · in_G(τ)
11:      count(τ(ε)) := count(τ(ε)) + c(ξ, ζ) · γ / in_G(S)
12:  for all ρ = (q(l) → r) ∈ R do
13:    p_i(ρ) := count(ρ) / Σ_{ρ' ∈ R_q} count(ρ')
14:  output(p_i)
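Line 13 renormalizes the expected rule counts per state; in isolation that step looks like this (a sketch with hypothetical names; rules are keyed by their left-hand-side state):

```python
def normalize_per_state(count, state_of):
    # M-step of line 13: p_i(rho) = count(rho) / sum of count(rho')
    # over all rules rho' in R_q with the same left-hand-side state q
    totals = {}
    for rule, c in count.items():
        q = state_of(rule)
        totals[q] = totals.get(q, 0.0) + c
    return {rule: c / totals[state_of(rule)] for rule, c in count.items()}

# expected counts for two rules of state "q" and one rule of state "p"
count = {("q", "rule1"): 2.0, ("q", "rule2"): 6.0, ("p", "rule3"): 4.0}
probs = normalize_per_state(count, lambda rule: rule[0])
print(probs)  # {('q', 'rule1'): 0.25, ('q', 'rule2'): 0.75, ('p', 'rule3'): 1.0}
```

Normalizing within each state (rather than globally) is what makes the result a probability assignment p : Q → B(LHS × RHS), i.e. a distribution over rules for every state.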
References:
[Arnold, Dauchet 82] Morphismes et bimorphismes d'arbres, Theoretical Computer Science, 20(1).
[Dempster, Laird, Rubin 77] Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 39:1-38.
[Fülöp, Maletti, Vogler 11] Weighted extended tree transducers, Fundamenta Informaticae, 111(2).
[Galley, Hopkins, Knight, Marcu 04] What's in a translation rule? Proc. HLT-NAACL, Association for Computational Linguistics.
[Graehl, Knight 04] Training tree transducers, Proc. HLT-NAACL, Association for Computational Linguistics.
[Lopez 08] Statistical Machine Translation, ACM Computing Surveys, 40(3):8:1-8:9.
[Liang, Bouchard-Côté, Klein, Taskar 06] An End-to-End Discriminative Approach to Machine Translation, Proc. 21st Int. Conf. Computational Linguistics and 44th Ann. Meeting of the Assoc. Comput. Ling.
[Prescher 05] A Tutorial on the Expectation-Maximization Algorithm..., arXiv:cs/ v2
[Stüber 12] personal communication

thanks
set of all distributions over Y: B(Y); probability model over Y: B ⊆ B(Y)
corpus: c : Y → R≥0; size of c: |c| = Σ_y c(y)
relative frequency estimation: rfe(c)(y) = c(y) / |c|
likelihood of a Y-corpus c under p ∈ B(Y): L(c, p) = Π_y p(y)^{c(y)}
maximum-likelihood estimation of a Y-corpus c in B ⊆ B(Y): mle(c, B) = argmax_{p ∈ B} L(c, p)
likelihood of an X-corpus c under p ∈ B(Y): L(c, p) = Π_x ( Σ_{y : π(y)=x} p(y) )^{c(x)}
maximum-likelihood estimation of an X-corpus c in B ⊆ B(Y): mle(c, B) = argmax_{p ∈ B} L(c, p)
More informationLinking Theorems for Tree Transducers
Linking Theorems for Tree Transducers Andreas Maletti maletti@ims.uni-stuttgart.de peyer October 1, 2015 Andreas Maletti Linking Theorems for MBOT Theorietag 2015 1 / 32 tatistical Machine Translation
More informationTo make a grammar probabilistic, we need to assign a probability to each context-free rewrite
Notes on the Inside-Outside Algorithm To make a grammar probabilistic, we need to assign a probability to each context-free rewrite rule. But how should these probabilities be chosen? It is natural to
More informationThe PAC Learning Framework -II
The PAC Learning Framework -II Prof. Dan A. Simovici UMB 1 / 1 Outline 1 Finite Hypothesis Space - The Inconsistent Case 2 Deterministic versus stochastic scenario 3 Bayes Error and Noise 2 / 1 Outline
More informationWord Alignment by Thresholded Two-Dimensional Normalization
Word Alignment by Thresholded Two-Dimensional Normalization Hamidreza Kobdani, Alexander Fraser, Hinrich Schütze Institute for Natural Language Processing University of Stuttgart Germany {kobdani,fraser}@ims.uni-stuttgart.de
More informationThe Noisy Channel Model and Markov Models
1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle
More informationNatural Language Processing CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Natural Language Processing CS 6840 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Statistical Parsing Define a probabilistic model of syntax P(T S):
More informationProbabilistic Context Free Grammars. Many slides from Michael Collins and Chris Manning
Probabilistic Context Free Grammars Many slides from Michael Collins and Chris Manning Overview I Probabilistic Context-Free Grammars (PCFGs) I The CKY Algorithm for parsing with PCFGs A Probabilistic
More informationLECTURER: BURCU CAN Spring
LECTURER: BURCU CAN 2017-2018 Spring Regular Language Hidden Markov Model (HMM) Context Free Language Context Sensitive Language Probabilistic Context Free Grammar (PCFG) Unrestricted Language PCFGs can
More informationPrenominal Modifier Ordering via MSA. Alignment
Introduction Prenominal Modifier Ordering via Multiple Sequence Alignment Aaron Dunlop Margaret Mitchell 2 Brian Roark Oregon Health & Science University Portland, OR 2 University of Aberdeen Aberdeen,
More informationTnT Part of Speech Tagger
TnT Part of Speech Tagger By Thorsten Brants Presented By Arghya Roy Chaudhuri Kevin Patel Satyam July 29, 2014 1 / 31 Outline 1 Why Then? Why Now? 2 Underlying Model Other technicalities 3 Evaluation
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationarxiv:cs/ v2 [cs.cl] 11 Mar 2005
arxiv:cs/0412015v2 [cs.cl] 11 Mar 2005 A Tutorial on the Expectation-Maximization Algorithm Including Maximum-Likelihood Estimation and EM Training of Probabilistic Context-Free Grammars 1 Introduction
More informationLecture 7: Introduction to syntax-based MT
Lecture 7: Introduction to syntax-based MT Andreas Maletti Statistical Machine Translation Stuttgart December 16, 2011 SMT VII A. Maletti 1 Lecture 7 Goals Overview Tree substitution grammars (tree automata)
More informationOverview (Fall 2007) Machine Translation Part III. Roadmap for the Next Few Lectures. Phrase-Based Models. Learning phrases from alignments
Overview Learning phrases from alignments 6.864 (Fall 2007) Machine Translation Part III A phrase-based model Decoding in phrase-based models (Thanks to Philipp Koehn for giving me slides from his EACL
More informationPart of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015
Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about
More informationLatent Variable Models in NLP
Latent Variable Models in NLP Aria Haghighi with Slav Petrov, John DeNero, and Dan Klein UC Berkeley, CS Division Latent Variable Models Latent Variable Models Latent Variable Models Observed Latent Variable
More informationMaschinelle Sprachverarbeitung
Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other
More informationMaschinelle Sprachverarbeitung
Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other
More informationIBM Model 1 for Machine Translation
IBM Model 1 for Machine Translation Micha Elsner March 28, 2014 2 Machine translation A key area of computational linguistics Bar-Hillel points out that human-like translation requires understanding of
More informationClassification & Information Theory Lecture #8
Classification & Information Theory Lecture #8 Introduction to Natural Language Processing CMPSCI 585, Fall 2007 University of Massachusetts Amherst Andrew McCallum Today s Main Points Automatically categorizing
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationOut of GIZA Efficient Word Alignment Models for SMT
Out of GIZA Efficient Word Alignment Models for SMT Yanjun Ma National Centre for Language Technology School of Computing Dublin City University NCLT Seminar Series March 4, 2009 Y. Ma (DCU) Out of Giza
More informationAlgorithms for Syntax-Aware Statistical Machine Translation
Algorithms for Syntax-Aware Statistical Machine Translation I. Dan Melamed, Wei Wang and Ben Wellington ew York University Syntax-Aware Statistical MT Statistical involves machine learning (ML) seems crucial
More informationCompositions of Top-down Tree Transducers with ε-rules
Compositions of Top-down Tree Transducers with ε-rules Andreas Maletti 1, and Heiko Vogler 2 1 Universitat Rovira i Virgili, Departament de Filologies Romàniques Avinguda de Catalunya 35, 43002 Tarragona,
More informationDiscriminative Training. March 4, 2014
Discriminative Training March 4, 2014 Noisy Channels Again p(e) source English Noisy Channels Again p(e) p(g e) source English German Noisy Channels Again p(e) p(g e) source English German decoder e =
More informationTheory of Alignment Generators and Applications to Statistical Machine Translation
Theory of Alignment Generators and Applications to Statistical Machine Translation Raghavendra Udupa U Hemanta K Mai IBM India Research Laboratory, New Delhi {uraghave, hemantkm}@inibmcom Abstract Viterbi
More informationText Mining. March 3, March 3, / 49
Text Mining March 3, 2017 March 3, 2017 1 / 49 Outline Language Identification Tokenisation Part-Of-Speech (POS) tagging Hidden Markov Models - Sequential Taggers Viterbi Algorithm March 3, 2017 2 / 49
More informationCross-Lingual Language Modeling for Automatic Speech Recogntion
GBO Presentation Cross-Lingual Language Modeling for Automatic Speech Recogntion November 14, 2003 Woosung Kim woosung@cs.jhu.edu Center for Language and Speech Processing Dept. of Computer Science The
More informationstatistical machine translation
statistical machine translation P A R T 3 : D E C O D I N G & E V A L U A T I O N CSC401/2511 Natural Language Computing Spring 2019 Lecture 6 Frank Rudzicz and Chloé Pou-Prom 1 University of Toronto Statistical
More informationCS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning
CS 446 Machine Learning Fall 206 Nov 0, 206 Bayesian Learning Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes Overview Bayesian Learning Naive Bayes Logistic Regression Bayesian Learning So far, we
More informationMixtures of Rasch Models
Mixtures of Rasch Models Hannah Frick, Friedrich Leisch, Achim Zeileis, Carolin Strobl http://www.uibk.ac.at/statistics/ Introduction Rasch model for measuring latent traits Model assumption: Item parameters
More informationKey words. tree transducer, natural language processing, hierarchy, composition closure
THE POWER OF EXTENDED TOP-DOWN TREE TRANSDUCERS ANDREAS MALETTI, JONATHAN GRAEHL, MARK HOPKINS, AND KEVIN KNIGHT Abstract. Extended top-down tree transducers (transducteurs généralisés descendants [Arnold,
More informationQuasi-Synchronous Phrase Dependency Grammars for Machine Translation. lti
Quasi-Synchronous Phrase Dependency Grammars for Machine Translation Kevin Gimpel Noah A. Smith 1 Introduction MT using dependency grammars on phrases Phrases capture local reordering and idiomatic translations
More informationACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging
ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk The POS Tagging Problem 2 England NNP s POS fencers
More informationPROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS
PROBABILITY AND INFORMATION THEORY Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Probability space Rules of probability
More informationUnit 1: Sequence Models
CS 562: Empirical Methods in Natural Language Processing Unit 1: Sequence Models Lecture 5: Probabilities and Estimations Lecture 6: Weighted Finite-State Machines Week 3 -- Sep 8 & 10, 2009 Liang Huang
More informationMulti-Task Word Alignment Triangulation for Low-Resource Languages
Multi-Task Word Alignment Triangulation for Low-Resource Languages Tomer Levinboim and David Chiang Department of Computer Science and Engineering University of Notre Dame {levinboim.1,dchiang}@nd.edu
More informationStatistical Methods for NLP
Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured
More informationMachine Translation. CL1: Jordan Boyd-Graber. University of Maryland. November 11, 2013
Machine Translation CL1: Jordan Boyd-Graber University of Maryland November 11, 2013 Adapted from material by Philipp Koehn CL1: Jordan Boyd-Graber (UMD) Machine Translation November 11, 2013 1 / 48 Roadmap
More informationTuning as Linear Regression
Tuning as Linear Regression Marzieh Bazrafshan, Tagyoung Chung and Daniel Gildea Department of Computer Science University of Rochester Rochester, NY 14627 Abstract We propose a tuning method for statistical
More informationExpectation Maximization (EM)
Expectation Maximization (EM) The Expectation Maximization (EM) algorithm is one approach to unsupervised, semi-supervised, or lightly supervised learning. In this kind of learning either no labels are
More informationBayesian decision making
Bayesian decision making Václav Hlaváč Czech Technical University in Prague Czech Institute of Informatics, Robotics and Cybernetics 166 36 Prague 6, Jugoslávských partyzánů 1580/3, Czech Republic http://people.ciirc.cvut.cz/hlavac,
More informationImproved Decipherment of Homophonic Ciphers
Improved Decipherment of Homophonic Ciphers Malte Nuhn and Julian Schamper and Hermann Ney Human Language Technology and Pattern Recognition Computer Science Department, RWTH Aachen University, Aachen,
More informationProcessing/Speech, NLP and the Web
CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 25 Probabilistic Parsing) Pushpak Bhattacharyya CSE Dept., IIT Bombay 14 th March, 2011 Bracketed Structure: Treebank Corpus [ S1[
More informationFeatures of Statistical Parsers
Features of tatistical Parsers Preliminary results Mark Johnson Brown University TTI, October 2003 Joint work with Michael Collins (MIT) upported by NF grants LI 9720368 and II0095940 1 Talk outline tatistical
More informationHidden Markov Model. Ying Wu. Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208
Hidden Markov Model Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/19 Outline Example: Hidden Coin Tossing Hidden
More informationBrief Introduction of Machine Learning Techniques for Content Analysis
1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview
More informationThe Geometry of Statistical Machine Translation
The Geometry of Statistical Machine Translation Presented by Rory Waite 16th of December 2015 ntroduction Linear Models Convex Geometry The Minkowski Sum Projected MERT Conclusions ntroduction We provide
More informationAdvanced Natural Language Processing Syntactic Parsing
Advanced Natural Language Processing Syntactic Parsing Alicia Ageno ageno@cs.upc.edu Universitat Politècnica de Catalunya NLP statistical parsing 1 Parsing Review Statistical Parsing SCFG Inside Algorithm
More informationAnalysing Soft Syntax Features and Heuristics for Hierarchical Phrase Based Machine Translation
Analysing Soft Syntax Features and Heuristics for Hierarchical Phrase Based Machine Translation David Vilar, Daniel Stein, Hermann Ney IWSLT 2008, Honolulu, Hawaii 20. October 2008 Human Language Technology
More informationMultiword Expression Identification with Tree Substitution Grammars
Multiword Expression Identification with Tree Substitution Grammars Spence Green, Marie-Catherine de Marneffe, John Bauer, and Christopher D. Manning Stanford University EMNLP 2011 Main Idea Use syntactic
More informationBackward and Forward Bisimulation Minimisation of Tree Automata
Backward and Forward Bisimulation Minimisation of Tree Automata Johanna Högberg 1, Andreas Maletti 2, and Jonathan May 3 1 Dept. of Computing Science, Umeå University, S 90187 Umeå, Sweden johanna@cs.umu.se
More informationBasic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology
Basic Text Analysis Hidden Markov Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakimnivre@lingfiluuse Basic Text Analysis 1(33) Hidden Markov Models Markov models are
More informationorder is number of previous outputs
Markov Models Lecture : Markov and Hidden Markov Models PSfrag Use past replacements as state. Next output depends on previous output(s): y t = f[y t, y t,...] order is number of previous outputs y t y
More informationWeighted Finite-State Transducers in Computational Biology
Weighted Finite-State Transducers in Computational Biology Mehryar Mohri Courant Institute of Mathematical Sciences mohri@cims.nyu.edu Joint work with Corinna Cortes (Google Research). 1 This Tutorial
More informationMulti-Source Neural Translation
Multi-Source Neural Translation Barret Zoph and Kevin Knight Information Sciences Institute Department of Computer Science University of Southern California {zoph,knight}@isi.edu Abstract We build a multi-source
More informationTriplet Lexicon Models for Statistical Machine Translation
Triplet Lexicon Models for Statistical Machine Translation Saša Hasan, Juri Ganitkevitch, Hermann Ney and Jesús Andrés Ferrer lastname@cs.rwth-aachen.de CLSP Student Seminar February 6, 2009 Human Language
More informationLab 12: Structured Prediction
December 4, 2014 Lecture plan structured perceptron application: confused messages application: dependency parsing structured SVM Class review: from modelization to classification What does learning mean?
More informationLing 289 Contingency Table Statistics
Ling 289 Contingency Table Statistics Roger Levy and Christopher Manning This is a summary of the material that we ve covered on contingency tables. Contingency tables: introduction Odds ratios Counting,
More informationLanguage as a Stochastic Process
CS769 Spring 2010 Advanced Natural Language Processing Language as a Stochastic Process Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Basic Statistics for NLP Pick an arbitrary letter x at random from any
More informationSequences and Information
Sequences and Information Rahul Siddharthan The Institute of Mathematical Sciences, Chennai, India http://www.imsc.res.in/ rsidd/ Facets 16, 04/07/2016 This box says something By looking at the symbols
More informationO 3 O 4 O 5. q 3. q 4. Transition
Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in
More informationCOMS 4705, Fall Machine Translation Part III
COMS 4705, Fall 2011 Machine Translation Part III 1 Roadmap for the Next Few Lectures Lecture 1 (last time): IBM Models 1 and 2 Lecture 2 (today): phrase-based models Lecture 3: Syntax in statistical machine
More informationP R + RQ P Q: Transliteration Mining Using Bridge Language
P R + RQ P Q: Transliteration Mining Using Bridge Language Mitesh M. Khapra Raghavendra Udupa n Institute of Technology Microsoft Research, Bombay, Bangalore, Powai, Mumbai 400076 raghavu@microsoft.com
More informationNatural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur
Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Maximum Entropy Models I Welcome back for the 3rd module
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Machine Translation. Uwe Dick
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Machine Translation Uwe Dick Google Translate Rosetta Stone Hieroglyphs Demotic Greek Machine Translation Automatically translate
More informationStatistical NLP Spring Corpus-Based MT
Statistical NLP Spring 2010 Lecture 17: Word / Phrase MT Dan Klein UC Berkeley Corpus-Based MT Modeling correspondences between languages Sentence-aligned parallel corpus: Yo lo haré mañana I will do it
More informationCorpus-Based MT. Statistical NLP Spring Unsupervised Word Alignment. Alignment Error Rate. IBM Models 1/2. Problems with Model 1
Statistical NLP Spring 2010 Corpus-Based MT Modeling correspondences between languages Sentence-aligned parallel corpus: Yo lo haré mañana I will do it tomorrow Hasta pronto See you soon Hasta pronto See
More informationComparison between conditional and marginal maximum likelihood for a class of item response models
(1/24) Comparison between conditional and marginal maximum likelihood for a class of item response models Francesco Bartolucci, University of Perugia (IT) Silvia Bacci, University of Perugia (IT) Claudia
More informationBisimulation Minimisation for Weighted Tree Automata
Bisimulation Minimisation for Weighted Tree Automata Johanna Högberg 1, Andreas Maletti 2, and Jonathan May 3 1 Department of Computing Science, Umeå University S 90187 Umeå, Sweden johanna@cs.umu.se 2
More information