Word Alignment III: Fertility Models & CRFs. February 3, 2015

Alan Stephens

1 Word Alignment III: Fertility Models & CRFs February 3, 2015

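2 Last Time... $p(\text{Translation}) = \sum_{\text{Alignment}} p(\text{Alignment}, \text{Translation}) = \sum_{\text{Alignment}} p(\text{Alignment}) \times p(\text{Translation} \mid \text{Alignment})$, which for word alignment reads $p(e \mid f, m) = \sum_{a \in [0,n]^m} p(a \mid f, m) \times \prod_{i=1}^{m} p(e_i \mid f_{a_i})$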
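As a sanity check on the sum over alignments above, here is a minimal Python sketch. It assumes a uniform alignment prior $p(a \mid f, m) = 1/(n+1)^m$ (the Model 1 case), and the translation table `t` holds made-up numbers for illustration only. Under that assumption the sum over all alignments can be pushed inside the product, and the two functions should agree.

```python
import itertools
import math

def likelihood_brute_force(e, f, t):
    """Sum over every alignment a in [0, n]^m with a uniform p(a | f, m)."""
    src = ["<NULL>"] + f            # position 0 is the NULL source word
    n, m = len(f), len(e)
    p_align = 1.0 / (n + 1) ** m    # uniform prior: an assumption (Model 1)
    total = 0.0
    for a in itertools.product(range(n + 1), repeat=m):
        total += p_align * math.prod(t[(e[i], src[a[i]])] for i in range(m))
    return total

def likelihood_factored(e, f, t):
    """Same quantity after moving the sum inside the product."""
    src = ["<NULL>"] + f
    return math.prod(sum(t[(e_i, f_j)] for f_j in src) / len(src) for e_i in e)

# Hypothetical translation probabilities t[(english, foreign)].
t = {
    ("the", "<NULL>"): 0.3, ("the", "das"): 0.6, ("the", "Haus"): 0.1,
    ("house", "<NULL>"): 0.1, ("house", "das"): 0.1, ("house", "Haus"): 0.8,
}
e, f = ["the", "house"], ["das", "Haus"]
brute = likelihood_brute_force(e, f, t)
fast = likelihood_factored(e, f, t)
```

The brute-force version enumerates $(n+1)^m$ alignments, while the factored version needs only $m(n+1)$ table lookups.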

3 Fertility Models. The models we have considered so far have been efficient. This efficiency has come at a modeling cost: what is to stop the model from translating a word 0, 1, 2, or 100 times? We introduce fertility models to deal with this.

4 IBM Model 3

5 Fertility. Fertility: the number of English words generated by a foreign word. Modeled by a categorical distribution $n(\phi \mid f)$. Examples: Unabhaengigkeitserklaerung; zum = (zu + dem); Haus

6 Fertility. $p(e \mid f, m) = \sum_{a \in [0,n]^m} p(a \mid f, m) \times \prod_{i=1}^{m} p(e_i \mid f_{a_i})$ Fertility models mean that we can no longer exploit conditional independencies to write $p(a \mid f, m)$ as a series of local alignment decisions. How do we compute the statistics required for EM training?

7 EM Recipe reminder. If alignment points were visible, training fertility models would be easy. We would count and divide: $n(\phi = 3 \mid f = \text{Unabhaengigkeitserklaerung}) = \frac{\mathrm{count}(3, \text{Unabhaengigkeitserklaerung})}{\mathrm{count}(\text{Unabhaengigkeitserklaerung})}$. But alignments are not visible, so we use expected counts instead: $n(\phi = 3 \mid f = \text{Unabhaengigkeitserklaerung}) = \frac{E[\mathrm{count}(3, \text{Unabhaengigkeitserklaerung})]}{E[\mathrm{count}(\text{Unabhaengigkeitserklaerung})]}$
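The visible-alignment case can be sketched in a few lines of Python. This is an illustration under assumed conventions, not the lecture's code: each alignment is a list whose i-th entry is the 1-based source position that generated English word i (0 would mean NULL), and the toy corpus is invented.

```python
from collections import Counter

def fertility_mle(corpus):
    """Relative-frequency estimate of n(phi | f) from observed alignments.

    corpus: iterable of (f_words, alignment) pairs, where alignment[i] is the
    1-based source position (0 = NULL) that generated English word i.
    """
    joint = Counter()      # count(phi, f)
    marginal = Counter()   # count(f)
    for f_words, alignment in corpus:
        links = Counter(alignment)          # links[j] = fertility of position j
        for j, f_word in enumerate(f_words, start=1):
            joint[(links[j], f_word)] += 1
            marginal[f_word] += 1
    return {(phi, f): c / marginal[f] for (phi, f), c in joint.items()}

# Invented toy data: the long compound has fertility 3, 3, and 2.
word = "Unabhaengigkeitserklaerung"
corpus = [
    ([word], [1, 1, 1]),
    (["die", word], [1, 2, 2, 2]),
    ([word], [1, 1]),
]
n = fertility_mle(corpus)
```

With hidden alignments, the same ratio is taken over expected counts, which is where the difficulty discussed on the next slide comes from.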

8 Expectation & Fertility. We need to compute expected counts under $p(a \mid f, e, m)$. Unfortunately, $p(a \mid f, e, m)$ doesn't factorize nicely. :( Can we sum exhaustively? How many different a's are there? What to do?
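To see why exhaustive summation is hopeless, note that each of the m English positions independently picks one of the n source words or NULL, so $a \in [0,n]^m$ has $(n+1)^m$ values. A tiny sketch (the function name is ours, chosen for illustration):

```python
def num_alignments(n, m):
    """Number of alignment vectors a in [0, n]^m."""
    return (n + 1) ** m

small = num_alignments(2, 3)     # a 2-word source, 3-word target
large = num_alignments(20, 20)   # two 20-word sentences
```

For two 20-word sentences the count is already on the order of 10^26, so some form of approximation is needed.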

9 Sample Alignments. Monte-Carlo methods: Gibbs sampling, importance sampling, particle filtering. For historical reasons: use the Model 2 alignment to start (easy!), then take a weighted sum over all alignment configurations that are close to this alignment configuration. Is this correct? No! Does it work? Sort of.

11 Pitfalls of Conditional Models [Figure: IBM Model 4 alignment compared with our model's alignment]

12 Lexical Translation. IBM Models 1-5 [Brown et al., 1993]: Model 1: lexical translation, uniform alignment; Model 2: absolute position model; Model 3: fertility; Model 4: relative position model (jumps in target string); Model 5: non-deficient model. Widely used Giza++ toolkit. HMM translation model [Vogel et al., 1996]: relative position model (jumps in source string). Latent variables are more useful these days than the translations.

13-15 A few tricks... $p(f \mid e)$ and $p(e \mid f)$

16 Alignment Tool: fast_align

17 Another View. With this model: $p(e \mid f, m) = \sum_{a \in [0,n]^m} p(a \mid f, m) \times \prod_{i=1}^{m} p(e_i \mid f_{a_i})$ The problem of word alignment is as follows: $a^* = \arg\max_{a \in [0,n]^m} p(a \mid e, f, m)$ Can we model this distribution directly?

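18 Markov Random Fields (MRFs) [Directed graph over A, B, C, X, Y, Z] $p(A, B, C, X, Y, Z) = p(A)\, p(B \mid A)\, p(C \mid B)\, p(X \mid A)\, p(Y \mid B)\, p(Z \mid C)$ [Undirected graph over A, B, C, X, Y, Z] $p(A, B, C, X, Y, Z) = \frac{1}{Z}\, \phi_1(A, B)\, \phi_2(B, C)\, \phi_3(C, D)\, \phi_4(X)\, \phi_5(Y)\, \phi_6(Z)$ The $\phi_i$ are the factors.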

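19 Computing Z [Graph: X, Y] $Z = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{X}} \phi_1(x, y)\, \phi_2(x)\, \phi_3(y)$, with $\mathcal{X} = \{a, b, c\}$, $X \in \mathcal{X}$, $Y \in \mathcal{X}$. When the graph has certain structures (e.g., chains), you can factor to get polytime DP algorithms: $Z = \sum_{x \in \mathcal{X}} \phi_2(x) \sum_{y \in \mathcal{X}} \phi_1(x, y)\, \phi_3(y)$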
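The two expressions for Z can be checked numerically. The sketch below uses hypothetical potential values (any positive numbers would do) over the label set {a, b, c} from the slide.

```python
import itertools
import math

LABELS = ["a", "b", "c"]

# Hypothetical potentials, chosen only for illustration.
phi1 = {(x, y): 1.0 + (x == y) for x, y in itertools.product(LABELS, LABELS)}
phi2 = {"a": 0.5, "b": 2.0, "c": 1.0}    # unary factor on x
phi3 = {"a": 1.5, "b": 1.0, "c": 0.25}   # unary factor on y

def z_naive():
    """Sum the full product over every joint assignment (x, y)."""
    return sum(phi1[x, y] * phi2[x] * phi3[y] for x in LABELS for y in LABELS)

def z_factored():
    """Pull phi2(x) out of the inner sum over y."""
    return sum(phi2[x] * sum(phi1[x, y] * phi3[y] for y in LABELS)
               for x in LABELS)

z1, z2 = z_naive(), z_factored()
```

With only two variables the saving is a handful of multiplications. The same rearrangement applied along a chain of m variables is what turns an exponential sum into a dynamic program.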

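20 Log-linear models [Graph over A, B, C, X, Y, Z] $p(A, B, C, X, Y, Z) = \frac{1}{Z}\, \phi_1(A, B)\, \phi_2(B, C)\, \phi_3(C, D)\, \phi_4(X)\, \phi_5(Y)\, \phi_6(Z)$ with $\phi_{1,2,3}(x, y) = \exp \sum_k w_k f_k(x, y)$, where the $w_k$ are weights (learned) and the $f_k$ are feature functions (specified).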

21 Random Fields. Benefits: potential functions can be defined with respect to arbitrary features (functions) of the variables; great way to incorporate knowledge. Drawbacks: likelihood involves computing Z; maximizing likelihood usually requires computing Z (often over and over again!).

22 Conditional Random Fields. Use MRFs to parameterize a conditional distribution. Very easy: let feature functions look at anything they want in the input. $p(y \mid x) = \frac{1}{Z_w(x)} \exp \sum_{F \in G_y} \sum_k w_k f_k(F, x, y)$, where $F$ ranges over all factors in the graph of $y$.

23 Parameter Learning. CRFs are trained to maximize conditional likelihood: $\hat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{(x_i, y_i) \in D} p(y_i \mid x_i; w)$ Recall we want to directly model $p(a \mid e, f)$. The likelihood of what alignments? Gold reference alignments!

24 CRF for Alignment. One of many possibilities, due to Blunsom & Cohn (2006): $p(a \mid e, f) = \frac{1}{Z_w(e, f)} \exp \sum_{i=1}^{|e|} \sum_k w_k f_k(a_i, a_{i-1}, i, e, f)$ Here $a$ has the same form as in the lexical translation models (still make a one-to-many assumption), the $w_k$ are the model parameters, and the $f_k$ are the feature functions. Cost: $O(n^2 m) \approx O(n^3)$
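Because each feature looks only at a_i and a_{i-1}, the normalizer of this model can be computed with a forward recursion over the chain in O(n^2 m) time. The sketch below is a generic illustration, not Blunsom & Cohn's implementation: `score` stands in for the weighted feature sum at one position, `toy_score` is an invented log-potential, and positions are 0-based in the code.

```python
import itertools
import math

def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def log_z_forward(m, n, score):
    """log Z for a first-order chain a_0..a_{m-1}, each a_i in 0..n.

    score(i, prev, cur) plays the role of sum_k w_k f_k(cur, prev, i, e, f);
    prev is None at the first position.
    """
    states = range(n + 1)
    alpha = [score(0, None, cur) for cur in states]
    for i in range(1, m):           # m positions ...
        alpha = [logsumexp([alpha[prev] + score(i, prev, cur)
                            for prev in states])   # ... times (n+1)^2 pairs
                 for cur in states]
    return logsumexp(alpha)

def log_z_brute_force(m, n, score):
    """Reference value: enumerate all (n+1)^m alignments."""
    totals = []
    for a in itertools.product(range(n + 1), repeat=m):
        s = score(0, None, a[0])
        s += sum(score(i, a[i - 1], a[i]) for i in range(1, m))
        totals.append(s)
    return logsumexp(totals)

def toy_score(i, prev, cur):
    # Hypothetical log-potential: prefer the diagonal and monotone unit jumps.
    diagonal = -abs(cur - (i + 1))
    jump = 0.0 if prev is None else -0.5 * abs(cur - prev - 1)
    return diagonal + jump

fast = log_z_forward(4, 3, toy_score)
slow = log_z_brute_force(4, 3, toy_score)
```

Replacing `logsumexp` with `max` in the recursion gives the Viterbi score of the best alignment at the same cost.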

25 Model. Labels (one per target word) index the source sentence. Train models on (e, f) and (f, e) [inverting the reference alignments].

26 Experiments

27 Identical word: pervez musharrafs langer abschied / pervez musharraf's long goodbye. Features so far: identical word.

28 Matching prefix: pervez musharrafs langer abschied / pervez musharraf's long goodbye. Features so far: identical word, matching prefix.

29 Matching suffix: pervez musharrafs langer abschied / pervez musharraf's long goodbye. Features so far: identical word, matching prefix, matching suffix.

30 Orthographic similarity: pervez musharrafs langer abschied / pervez musharraf's long goodbye. Features so far: identical word, matching prefix, matching suffix, orthographic similarity.

31 In dictionary: pervez musharrafs langer abschied / pervez musharraf's long goodbye. Features so far: identical word, matching prefix, matching suffix, orthographic similarity, in dictionary...

32 Lexical Features. Word-word indicator features. Various word-word co-occurrence scores: IBM Model 1 probabilities (t | s, s | t); geometric mean of Model 1 probabilities; Dice's coefficient [binned]; products of the above.
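As one concrete co-occurrence score, Dice's coefficient for a word pair is 2 C(e, f) / (C(e) + C(f)). The sketch below assumes sentence-level counts (a word counts once per sentence pair) and an invented three-sentence bitext. The slide does not say how the scores were counted or binned, so those details are left out.

```python
from collections import Counter

def dice_scores(bitext):
    """Dice(e, f) = 2 * C(e, f) / (C(e) + C(f)) with sentence-level counts.

    bitext: iterable of (e_words, f_words) sentence pairs.
    """
    c_e, c_f, c_ef = Counter(), Counter(), Counter()
    for e_words, f_words in bitext:
        e_set, f_set = set(e_words), set(f_words)
        c_e.update(e_set)
        c_f.update(f_set)
        c_ef.update((e, f) for e in e_set for f in f_set)
    return {(e, f): 2 * c / (c_e[e] + c_f[f]) for (e, f), c in c_ef.items()}

# Invented toy bitext.
bitext = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a house".split(), "ein Haus".split()),
]
dice = dice_scores(bitext)
```

Pairs that always co-occur, such as (house, Haus) here, score 1.0, while pairs that merely share a sentence once score lower.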

33 Lexical Features. Word class-word class indicator: NN translates as NN (NN_NN=1); NN does not translate as MD (NN_MD=1). Identical word feature: 2010 = 2010 (IdentWord=1 IdentNum=1). Identical prefix feature: Obama ~ Obamu (IdentPrefix=1). Orthographic similarity measure [binned]: Al-Qaeda ~ Al-Kaida (OrthoSim050_080=1).
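A rough sketch of how such surface features might be extracted for one word pair. The feature names follow the slide, but several choices here are assumptions: the 4-character prefix length, the bin edges other than 050_080, firing the prefix feature only for non-identical words, and the use of Python's `difflib.SequenceMatcher` ratio as the similarity measure (the slide does not say which measure was used).

```python
from difflib import SequenceMatcher

def orthographic_features(src, tgt, prefix_len=4):
    """Return a dict of indicator features for one candidate word pair."""
    feats = {}
    if src == tgt:
        feats["IdentWord"] = 1
        if src.isdigit():
            feats["IdentNum"] = 1
    elif len(src) >= prefix_len and src[:prefix_len] == tgt[:prefix_len]:
        feats["IdentPrefix"] = 1

    # Stand-in similarity in [0, 1]; an assumption, not the original measure.
    sim = SequenceMatcher(None, src.lower(), tgt.lower()).ratio()
    pct = sim * 100
    # Assumed bins, in percent; the last bin also absorbs pct == 100.
    for lo, hi in ((0, 50), (50, 80), (80, 100)):
        if pct < hi or hi == 100:
            feats[f"OrthoSim{lo:03d}_{hi:03d}"] = 1
            break
    return feats

year = orthographic_features("2010", "2010")
name = orthographic_features("Obama", "Obamu")
org = orthographic_features("Al-Qaeda", "Al-Kaida")
```

Binning turns a real-valued score into a few indicator features, which lets a log-linear model learn a separate weight for each similarity range.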

34 Other Features. Compute features from large amounts of unlabeled text: Does the Model 4 alignment contain this alignment point? What is the Model 1 posterior probability of this alignment point?

35 CRF Results
System               AER   P     R
French English       9%    97%   86%
French English       9%    98%   83%
French English       7%    96%   90%
French English+M4    7%    98%   88%
French English+M4    7%    98%   87%
French English+M4    5%    98%   91%
IBM Model 4          9%    87%   95%
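The slide does not define the three columns. Assuming the commonly used definitions with sure (S) and possible (P) gold links, where every sure link is also possible, precision is |A ∩ P| / |A|, recall is |A ∩ S| / |S|, and AER is 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). A minimal sketch with invented links:

```python
def alignment_scores(predicted, sure, possible):
    """Return (AER, precision, recall) for sets of (i, j) alignment links."""
    possible = possible | sure               # sure links count as possible
    a_and_s = len(predicted & sure)
    a_and_p = len(predicted & possible)
    precision = a_and_p / len(predicted)
    recall = a_and_s / len(sure)
    aer = 1.0 - (a_and_s + a_and_p) / (len(predicted) + len(sure))
    return aer, precision, recall

# Invented example: one predicted link hits a possible-only gold link.
predicted = {(0, 0), (1, 1), (2, 3)}
sure = {(0, 0), (1, 1), (2, 2)}
possible = {(2, 3)}
aer, precision, recall = alignment_scores(predicted, sure, possible)
```

Lower AER is better, while higher precision and recall are better, which is how the rows of the table should be read.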

36 Summary Unfortunately, you need gold alignments!

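37 CRF Autoencoders [Diagram: the input X ($x_{i-1}, x_i, x_{i+1}$) is encoded into the latent Y ($y_{i-1}, y_i, y_{i+1}$), from which the reconstruction $\hat{X}$ ($\hat{x}_{i-1}, \hat{x}_i, \hat{x}_{i+1}$) is generated] $\arg\max_{\lambda, \theta} \sum_x \log \sum_{y \in \mathcal{Y}(x)} p_\lambda(y \mid x)\, p_\theta(\hat{x} \mid y)$, where $p_\lambda$ is the CRF encoder and $p_\theta$ is the reconstruction model. Ammar, Dyer, Smith. (2014) Conditional Random Field Autoencoders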

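38 CRF Autoencoders. Ammar, Dyer, Smith. (2014) Conditional Random Field Autoencoders. General objective: $\arg\max_{\lambda, \theta} \sum_x \log \sum_{y \in \mathcal{Y}(x)} p_\lambda(y \mid x)\, p_\theta(\hat{x} \mid y)$ For alignment: $\arg\max_{\lambda, \theta} \sum_{(e, f)} \log \sum_{a \in \mathcal{A}(e, f)} p_\lambda(a \mid e, f) \prod_{i=1}^{m} p_\theta(e_i \mid f_{a_i})$, where $p_\lambda(a \mid e, f)$ is the CRF aligner and the $p_\theta(e_i \mid f_{a_i})$ are lexical translation probabilities.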

39 CRF Autoencoders. Ammar, Dyer, Smith. (2014) Conditional Random Field Autoencoders
System           AER   P     R
Czech English    28%   71%   73%
Czech English    21%   80%   77%
Czech English    19%   81%   81%
IBM Model 4      22%   75%   80%

40 Summary. This is it for word alignment. Questions? Next time: evaluation. Keep working on HW1.
