Word Alignment III: Fertility Models & CRFs February 3, 2015
Last Time...

$$p(\text{Translation}) = \sum_{\text{Alignment}} p(\text{Alignment}, \text{Translation}) = \sum_{\text{Alignment}} p(\text{Alignment})\, p(\text{Translation} \mid \text{Alignment})$$

$$p(\mathbf{e} \mid \mathbf{f}, m) = \sum_{\mathbf{a} \in [0,n]^m} \underbrace{p(\mathbf{a} \mid \mathbf{f}, m)}_{p(\text{Alignment})}\; \underbrace{\prod_{i=1}^{m} p(e_i \mid f_{a_i})}_{p(\text{Translation} \mid \text{Alignment})}$$
Fertility Models

The models we have considered so far have been efficient, but this efficiency has come at a modeling cost: what is to stop the model from translating a word 0, 1, 2, or 100 times? We introduce fertility models to deal with this.
IBM Model 3
Fertility

Fertility $\phi$: the number of English words generated by a foreign word. Modeled by a categorical distribution $n(\phi \mid f)$. Examples:

phi   n(phi | Unabhaengigkeitserklaerung)   n(phi | zum) [zum = zu + dem]   n(phi | Haus)
 0    0.00008                               0.01                            0.01
 1    0.1                                   0                               0.92
 2    0.0002                                0.9                             0.07
 3    0.8                                   0.0009                          0
 4    0.009                                 0.0001                          0
 5    0                                     0                               0
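Read each column as one categorical distribution per foreign word; a minimal sketch of the same table as data (values copied from above):

```python
# n(phi | f): one categorical distribution per foreign word.
# Unabhaengigkeitserklaerung ("declaration of independence") usually
# produces ~3 English words; zum (= zu + dem) usually produces 2; Haus 1.
fertility = {
    "Unabhaengigkeitserklaerung": {0: 0.00008, 1: 0.1, 2: 0.0002, 3: 0.8, 4: 0.009, 5: 0.0},
    "zum":  {0: 0.01, 1: 0.0,  2: 0.9,  3: 0.0009, 4: 0.0001, 5: 0.0},
    "Haus": {0: 0.01, 1: 0.92, 2: 0.07, 3: 0.0,    4: 0.0,    5: 0.0},
}

def n(phi, f):
    """Probability that foreign word f generates phi English words."""
    return fertility[f].get(phi, 0.0)

print(n(3, "Unabhaengigkeitserklaerung"))  # 0.8
```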
Fertility

$$p(\mathbf{e} \mid \mathbf{f}, m) = \sum_{\mathbf{a} \in [0,n]^m} p(\mathbf{a} \mid \mathbf{f}, m) \prod_{i=1}^{m} p(e_i \mid f_{a_i})$$

Fertility models mean that we can no longer exploit conditional independencies to write $p(\mathbf{a} \mid \mathbf{f}, m)$ as a series of local alignment decisions: the fertility $\phi_j = |\{i : a_i = j\}|$ of source word $j$ depends on all of $\mathbf{a}$ at once. How do we compute the statistics required for EM training?
EM Recipe Reminder

If alignment points were visible, training fertility models would be easy: count and normalize.

$$n(\phi = 3 \mid f = \text{Unabhaengigkeitserklaerung}) = \frac{\text{count}(3, \text{Unabhaengigkeitserklaerung})}{\text{count}(\text{Unabhaengigkeitserklaerung})}$$

But alignments are not visible, so we use expected counts:

$$n(\phi = 3 \mid f = \text{Unabhaengigkeitserklaerung}) = \frac{\mathbb{E}[\text{count}(3, \text{Unabhaengigkeitserklaerung})]}{\mathbb{E}[\text{count}(\text{Unabhaengigkeitserklaerung})]}$$
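A minimal sketch of this count-and-normalize recipe as an M-step (the `expected_counts` input is hypothetical; in practice it would be accumulated during the E-step under $p(\mathbf{a} \mid \mathbf{e}, \mathbf{f}, m)$):

```python
from collections import defaultdict

def mstep_fertility(expected_counts):
    """Renormalize expected fertility counts into distributions n(phi | f).

    expected_counts: dict mapping (phi, f) -> E[count(phi, f)],
    accumulated over alignments during the E-step."""
    totals = defaultdict(float)
    for (phi, f), c in expected_counts.items():
        totals[f] += c
    return {(phi, f): c / totals[f] for (phi, f), c in expected_counts.items()}

# E.g. if the E-step gave Unabhaengigkeitserklaerung expected counts
# {(3, w): 8.0, (2, w): 2.0}, the M-step yields n(3|w)=0.8, n(2|w)=0.2.
w = "Unabhaengigkeitserklaerung"
print(mstep_fertility({(3, w): 8.0, (2, w): 2.0}))
```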
Expectation & Fertility

We need to compute expected counts under $p(\mathbf{a} \mid \mathbf{f}, \mathbf{e}, m)$. Unfortunately, with fertility, $p(\mathbf{a} \mid \mathbf{f}, \mathbf{e}, m)$ doesn't factorize nicely. :(

Can we sum exhaustively? There are $(n+1)^m$ different $\mathbf{a}$'s (for $n = m = 20$, that is $21^{20} \approx 2.8 \times 10^{26}$). What to do?
Sample Alignments

Monte Carlo methods:
- Gibbs sampling
- Importance sampling
- Particle filtering

For historical reasons, the usual approximation is instead:
- Use the Model 2 alignment to start (easy!)
- Take a weighted sum over all alignment configurations that are close to this alignment configuration (one concrete notion of "close" is sketched below)

Is this correct? No! Does it work? Sort of.
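The slide does not pin down "close", but the classic choice (Brown et al., 1993) is the move/swap neighborhood; a minimal sketch:

```python
def neighbors(a, n):
    """All alignments one 'move' (re-point a single a_i) or one 'swap'
    (exchange a_i and a_j) away from a, where each a_i is a source
    position in [0, n]. The neighborhood has O(m*n + m^2) members --
    tiny compared to the (n+1)^m alignments in the full space."""
    m = len(a)
    out = []
    for i in range(m):                     # moves
        for j in range(n + 1):
            if j != a[i]:
                out.append(a[:i] + [j] + a[i + 1:])
    for i in range(m):                     # swaps
        for j in range(i + 1, m):
            if a[i] != a[j]:
                b = a[:]
                b[i], b[j] = b[j], b[i]
                out.append(b)
    return out

print(len(neighbors([0, 1, 2], n=3)))  # 9 moves + 3 swaps = 12
```

Expected counts are then accumulated over this neighborhood (weighted by each alignment's probability) instead of over all $(n+1)^m$ alignments.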
Pitfalls of Conditional Models

[Figure: IBM Model 4 alignment vs. our model's alignment]
Lexical Translation

IBM Models 1-5 [Brown et al., 1993]:
- Model 1: lexical translation, uniform alignment
- Model 2: absolute position model
- Model 3: fertility
- Model 4: relative position model (jumps in target string)
- Model 5: non-deficient model
- Widely used: the Giza++ toolkit

HMM translation model [Vogel et al., 1996]:
- Relative position model (jumps in source string)

The latent variables (alignments) are more useful these days than the translations.
A few tricks...

[Figure: alignments produced by p(f | e) vs. p(e | f)]
Alignment Tool: fast_align, a fast log-linear reparameterization of IBM Model 2 [Dyer et al., 2013]
Another View

With this model:

$$p(\mathbf{e} \mid \mathbf{f}, m) = \sum_{\mathbf{a} \in [0,n]^m} p(\mathbf{a} \mid \mathbf{f}, m) \prod_{i=1}^{m} p(e_i \mid f_{a_i})$$

the problem of word alignment is:

$$\mathbf{a}^* = \arg\max_{\mathbf{a} \in [0,n]^m} p(\mathbf{a} \mid \mathbf{e}, \mathbf{f}, m)$$

Can we model this distribution directly?
Markov Random Fields (MRFs)

Directed factorization (Bayesian network) over variables A, B, C, X, Y, Z:

$$p(A,B,C,X,Y,Z) = p(A)\, p(B \mid A)\, p(C \mid B)\, p(X \mid A)\, p(Y \mid B)\, p(Z \mid C)$$

Undirected factorization (MRF) over the same graph, with one factor per edge:

$$p(A,B,C,X,Y,Z) = \frac{1}{Z}\, \psi_1(A,B)\, \psi_2(B,C)\, \psi_3(A,X)\, \psi_4(B,Y)\, \psi_5(C,Z)$$

The $\psi$'s are the factors.
Computing Z

For two variables $X, Y \in \mathcal{X}$ with $\mathcal{X} = \{a, b, c\}$:

$$Z = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{X}} \psi_1(x, y)\, \psi_2(x)\, \psi_3(y)$$

When the graph has certain structures (e.g., chains), you can factor the sum to get polytime dynamic-programming algorithms:

$$Z = \sum_{x \in \mathcal{X}} \psi_2(x) \sum_{y \in \mathcal{X}} \psi_1(x, y)\, \psi_3(y)$$
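A quick numeric check of the refactoring (the $\psi$ tables here are made-up values, purely for illustration):

```python
import itertools
import random

X = ["a", "b", "c"]
random.seed(0)
psi1 = {(x, y): random.uniform(0.1, 2.0) for x in X for y in X}  # pairwise factor
psi2 = {x: random.uniform(0.1, 2.0) for x in X}                  # unary factor on x
psi3 = {y: random.uniform(0.1, 2.0) for y in X}                  # unary factor on y

# Brute force: enumerate all |X|^2 joint assignments.
Z_brute = sum(psi1[x, y] * psi2[x] * psi3[y]
              for x, y in itertools.product(X, X))

# Factored: push the sum over y inside the sum over x. This
# "distribute multiplication over addition" step is exactly what
# chain-structured DP algorithms exploit to stay polytime.
Z_factored = sum(psi2[x] * sum(psi1[x, y] * psi3[y] for y in X) for x in X)

assert abs(Z_brute - Z_factored) < 1e-9
```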
Log-linear Models

$$p(A,B,C,X,Y,Z) = \frac{1}{Z}\, \psi_1(A,B)\, \psi_2(B,C)\, \psi_3(A,X)\, \psi_4(B,Y)\, \psi_5(C,Z)$$

Parameterize each factor log-linearly:

$$\psi_j(x, y) = \exp \sum_k w_k f_k(x, y)$$

- $w_k$: weights (learned)
- $f_k$: feature functions (specified)
Random Fields

Benefits:
- Potential functions can be defined with respect to arbitrary features (functions) of the variables
- A great way to incorporate knowledge

Drawbacks:
- The likelihood involves computing Z
- Maximizing likelihood usually requires computing Z (often over and over again!)
Conditional Random Fields

Use MRFs to parameterize a conditional distribution. Very easy: let the feature functions look at anything they want in the input.

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{w}}(\mathbf{x})} \exp \sum_{F \in G_{\mathbf{y}}} \sum_k w_k f_k(F, \mathbf{x}, \mathbf{y})$$

where $G_{\mathbf{y}}$ is the set of all factors in the graph of $\mathbf{y}$.
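The slide leaves the normalizer implicit; written out (a standard identity, not on the slide), it sums the same exponentiated score over every possible output:

$$Z_{\mathbf{w}}(\mathbf{x}) = \sum_{\mathbf{y}'} \exp \sum_{F \in G_{\mathbf{y}'}} \sum_k w_k f_k(F, \mathbf{x}, \mathbf{y}')$$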
Parameter Learning

CRFs are trained to maximize conditional likelihood:

$$\hat{\mathbf{w}}_{\text{MLE}} = \arg\max_{\mathbf{w}} \prod_{(\mathbf{x}_i, \mathbf{y}_i) \in \mathcal{D}} p(\mathbf{y}_i \mid \mathbf{x}_i; \mathbf{w})$$

Recall we want to directly model p(a | e, f). The likelihood of what alignments? Gold reference alignments!
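Maximizing this objective is usually done with gradient-based methods; the gradient of the conditional log-likelihood has the textbook "observed minus expected features" form (a standard CRF result, not stated on the slide):

$$\frac{\partial}{\partial w_k} \log p(\mathbf{y} \mid \mathbf{x}; \mathbf{w}) = f_k(\mathbf{x}, \mathbf{y}) - \mathbb{E}_{p(\mathbf{y}' \mid \mathbf{x}; \mathbf{w})}\!\left[ f_k(\mathbf{x}, \mathbf{y}') \right]$$

(writing $f_k(\mathbf{x}, \mathbf{y})$ for the feature summed over all factors). The expectation term requires $Z_{\mathbf{w}}(\mathbf{x})$, which is exactly the drawback noted two slides back.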
CRF for Alignment

One of many possibilities, due to Blunsom & Cohn (2006):

$$p(\mathbf{a} \mid \mathbf{e}, \mathbf{f}) = \frac{1}{Z_{\mathbf{w}}(\mathbf{e}, \mathbf{f})} \exp \sum_{i=1}^{|\mathbf{e}|} \sum_k w_k f_k(a_i, a_{i-1}, i, \mathbf{e}, \mathbf{f})$$

- $\mathbf{a}$ has the same form as in the lexical translation models (we still make a one-to-many assumption)
- $w_k$ are the model parameters
- $f_k$ are the feature functions
- The first-order dependence on $a_{i-1}$ keeps inference tractable: $O(n^2 m)$ with dynamic programming, i.e., $O(n^3)$ when $m \approx n$
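To see where $O(n^2 m)$ comes from, consider the forward algorithm over this chain, with one label per target word ranging over the $n+1$ source positions. A minimal sketch (the `score` callback stands in for $\sum_k w_k f_k(a_i, a_{i-1}, i, \mathbf{e}, \mathbf{f})$):

```python
import math

def log_partition(n, m, score):
    """Forward algorithm for the chain CRF above.

    score(a_i, a_prev, i) stands in for sum_k w_k * f_k(...); labels are
    source positions 0..n (0 = NULL), m = target length. Each of the m
    steps considers (n+1)^2 transitions, hence O(n^2 m) total."""
    alpha = [score(a, None, 0) for a in range(n + 1)]  # first target word
    for i in range(1, m):
        alpha = [
            math.log(sum(math.exp(alpha[prev] + score(a, prev, i))
                         for prev in range(n + 1)))
            for a in range(n + 1)
        ]
    return math.log(sum(math.exp(v) for v in alpha))

# Sanity check with uniform scores: Z = (n+1)^m, so log Z = m * log(n+1).
print(log_partition(n=3, m=5, score=lambda a, prev, i: 0.0))  # 5 * log(4)
```

(A real implementation would work in log-space throughout, e.g. with a logsumexp, for numerical stability.)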
Model

- Labels (one per target word) index positions in the source sentence
- Train models for (e, f) and (f, e) [inverting the reference alignments]
Experiments
Features on the example "pervez musharrafs langer abschied" / "pervez musharraf 's long goodbye":

- Identical word
- Matching prefix
- Matching suffix
- Orthographic similarity
- In dictionary
- ...
Lexical Features

- Word-word indicator features
- Various word-word co-occurrence scores
- IBM Model 1 probabilities (t | s, s | t)
- Geometric mean of Model 1 probabilities
- Dice's coefficient [binned]
- Products of the above
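Dice's coefficient here is presumably the standard definition over co-occurrence counts in the parallel corpus:

$$\mathrm{Dice}(e, f) = \frac{2\, C(e, f)}{C(e) + C(f)}$$

where $C(e, f)$ counts sentence pairs in which $e$ and $f$ co-occur, and $C(e)$, $C(f)$ are their individual occurrence counts.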
Lexical Features

- Word class to word class indicators, e.g. NN translates as NN (NN_NN=1); NN does not translate as MD (NN_MD=1)
- Identical word feature: 2010 = 2010 (IdentWord=1, IdentNum=1)
- Identical prefix feature: Obama ~ Obamu (IdentPrefix=1)
- Orthographic similarity measure [binned]: Al-Qaeda ~ Al-Kaida (OrthoSim050_080=1)
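A sketch of how a few of the features from these two slides might be computed for a candidate word pair (the count tables and bin edges are hypothetical inputs; the exact feature inventory is the paper's, not this code's):

```python
def lexical_features(e, f, cooc, count_e, count_f):
    """Indicator features for a candidate link between target word e and
    source word f. cooc / count_e / count_f are co-occurrence and marginal
    counts gathered from parallel data (hypothetical inputs)."""
    feats = {}
    if e == f:
        feats["IdentWord"] = 1.0                      # e.g. 2010 = 2010
    if len(e) >= 4 and len(f) >= 4 and e[:4] == f[:4]:
        feats["IdentPrefix"] = 1.0                    # e.g. Obama ~ Obamu
    if len(e) >= 3 and len(f) >= 3 and e[-3:] == f[-3:]:
        feats["IdentSuffix"] = 1.0
    # Dice's coefficient, binned (the bin edge is chosen arbitrarily here).
    dice = 2.0 * cooc.get((e, f), 0) / (count_e.get(e, 1) + count_f.get(f, 1))
    if dice > 0.5:
        feats["Dice>0.5"] = 1.0
    return feats

print(lexical_features("2010", "2010", {}, {}, {}))
# {'IdentWord': 1.0, 'IdentPrefix': 1.0, 'IdentSuffix': 1.0}
```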
Other Features

Compute features from large amounts of unlabeled text:
- Does the Model 4 alignment contain this alignment point?
- What is the Model 1 posterior probability of this alignment point?
CRF Results

System                AER    P     R
French-English         9%   97%   86%
French-English         9%   98%   83%
French-English         7%   96%   90%
French-English + M4    7%   98%   88%
French-English + M4    7%   98%   87%
French-English + M4    5%   98%   91%
IBM Model 4            9%   87%   95%
Summary Unfortunately, you need gold alignments!
CRF Autoencoders [Ammar, Dyer, Smith (2014), "Conditional Random Field Autoencoders"]

[Figure: inputs x_{i-1}, x_i, x_{i+1} are encoded into latent labels y_{i-1}, y_i, y_{i+1}, which then reconstruct x'_{i-1}, x'_i, x'_{i+1}]

$$\arg\max_{\lambda, \theta} \sum_{\mathbf{x}} \log \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} \underbrace{p_\lambda(\mathbf{y} \mid \mathbf{x})}_{\text{CRF encoder}}\, \underbrace{p_\theta(\mathbf{x}' \mid \mathbf{y})}_{\text{reconstruction model}}$$
CRF Autoencoders for Word Alignment [Ammar, Dyer, Smith (2014)]

The general objective,

$$\arg\max_{\lambda, \theta} \sum_{\mathbf{x}} \log \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} p_\lambda(\mathbf{y} \mid \mathbf{x})\, p_\theta(\mathbf{x}' \mid \mathbf{y})$$

specialized to alignment:

$$\arg\max_{\lambda, \theta} \sum_{(\mathbf{e}, \mathbf{f})} \log \sum_{\mathbf{a} \in \mathcal{A}(\mathbf{e}, \mathbf{f})} \underbrace{p_\lambda(\mathbf{a} \mid \mathbf{e}, \mathbf{f})}_{\text{CRF aligner}} \prod_{i=1}^{m} \underbrace{p_\theta(e_i \mid f_{a_i})}_{\text{lexical translation probabilities}}$$
CRF Autoencoder Results [Ammar, Dyer, Smith (2014)]

System          AER    P     R
Czech-English   28%   71%   73%
Czech-English   21%   80%   77%
Czech-English   19%   81%   81%
IBM Model 4     22%   75%   80%
Summary

This is it for word alignment. Questions?

Next time: evaluation. Keep working on HW1!