Word Alignment III: Fertility Models & CRFs February 3, 2015
Last Time...

$$p(\text{Translation}) = \sum_{\text{Alignment}} p(\text{Alignment}, \text{Translation}) = \sum_{\text{Alignment}} p(\text{Alignment})\, p(\text{Translation} \mid \text{Alignment})$$

$$p(\mathbf{e} \mid \mathbf{f}, m) = \sum_{\mathbf{a} \in [0,n]^m} \underbrace{p(\mathbf{a} \mid \mathbf{f}, m)}_{p(\text{Alignment})}\; \underbrace{\prod_{i=1}^{m} p(e_i \mid f_{a_i})}_{p(\text{Translation} \mid \text{Alignment})}$$
Fertility Models

The models we have considered so far have been efficient, but this efficiency has come at a modeling cost: what is to stop the model from translating a word 0, 1, 2, or 100 times? We introduce fertility models to deal with this.
IBM Model 3
Fertility

Fertility $\phi$: the number of English words generated by a foreign word. Modeled by a categorical distribution $n(\phi \mid f)$. Examples:

phi   n(phi | Unabhaengigkeitserklaerung)   n(phi | zum) [zum = zu + dem]   n(phi | Haus)
 0    0.00008                               0.01                            0.01
 1    0.1                                   0                               0.92
 2    0.0002                                0.9                             0.07
 3    0.8                                   0.0009                          0
 4    0.009                                 0.0001                          0
 5    0                                     0                               0
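Read each column as one categorical distribution per foreign word; a minimal sketch of the same table as data (values copied from above):

```python
# n(phi | f): one categorical distribution per foreign word.
# Unabhaengigkeitserklaerung ("declaration of independence") usually
# produces ~3 English words; zum (= zu + dem) usually produces 2; Haus 1.
fertility = {
    "Unabhaengigkeitserklaerung": {0: 0.00008, 1: 0.1, 2: 0.0002, 3: 0.8, 4: 0.009, 5: 0.0},
    "zum":  {0: 0.01, 1: 0.0,  2: 0.9,  3: 0.0009, 4: 0.0001, 5: 0.0},
    "Haus": {0: 0.01, 1: 0.92, 2: 0.07, 3: 0.0,    4: 0.0,    5: 0.0},
}

def n(phi, f):
    """Probability that foreign word f generates phi English words."""
    return fertility[f].get(phi, 0.0)

print(n(3, "Unabhaengigkeitserklaerung"))  # 0.8
```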
Fertility

$$p(\mathbf{e} \mid \mathbf{f}, m) = \sum_{\mathbf{a} \in [0,n]^m} p(\mathbf{a} \mid \mathbf{f}, m) \prod_{i=1}^{m} p(e_i \mid f_{a_i})$$

Fertility models mean that we can no longer exploit conditional independencies to write $p(\mathbf{a} \mid \mathbf{f}, m)$ as a series of local alignment decisions: the fertility $\phi_j = |\{i : a_i = j\}|$ of source word $j$ depends on all of $\mathbf{a}$ at once. How do we compute the statistics required for EM training?
EM Recipe Reminder

If alignment points were visible, training fertility models would be easy: count and normalize.

$$n(\phi = 3 \mid f = \text{Unabhaengigkeitserklaerung}) = \frac{\text{count}(3, \text{Unabhaengigkeitserklaerung})}{\text{count}(\text{Unabhaengigkeitserklaerung})}$$

But alignments are not visible, so we use expected counts:

$$n(\phi = 3 \mid f = \text{Unabhaengigkeitserklaerung}) = \frac{\mathbb{E}[\text{count}(3, \text{Unabhaengigkeitserklaerung})]}{\mathbb{E}[\text{count}(\text{Unabhaengigkeitserklaerung})]}$$
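A minimal sketch of this count-and-normalize recipe as an M-step (the `expected_counts` input is hypothetical; in practice it would be accumulated during the E-step under $p(\mathbf{a} \mid \mathbf{e}, \mathbf{f}, m)$):

```python
from collections import defaultdict

def mstep_fertility(expected_counts):
    """Renormalize expected fertility counts into distributions n(phi | f).

    expected_counts: dict mapping (phi, f) -> E[count(phi, f)],
    accumulated over alignments during the E-step."""
    totals = defaultdict(float)
    for (phi, f), c in expected_counts.items():
        totals[f] += c
    return {(phi, f): c / totals[f] for (phi, f), c in expected_counts.items()}

# E.g. if the E-step gave Unabhaengigkeitserklaerung expected counts
# {(3, w): 8.0, (2, w): 2.0}, the M-step yields n(3|w)=0.8, n(2|w)=0.2.
w = "Unabhaengigkeitserklaerung"
print(mstep_fertility({(3, w): 8.0, (2, w): 2.0}))
```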
Expectation & Fertility

We need to compute expected counts under $p(\mathbf{a} \mid \mathbf{f}, \mathbf{e}, m)$. Unfortunately, with fertility, $p(\mathbf{a} \mid \mathbf{f}, \mathbf{e}, m)$ doesn't factorize nicely. :(

Can we sum exhaustively? There are $(n+1)^m$ different $\mathbf{a}$'s (for $n = m = 20$, that is $21^{20} \approx 2.8 \times 10^{26}$). What to do?
Sample Alignments

Monte Carlo methods:
- Gibbs sampling
- Importance sampling
- Particle filtering

For historical reasons, the usual approximation is instead:
- Use the Model 2 alignment to start (easy!)
- Take a weighted sum over all alignment configurations that are close to this alignment configuration (one concrete notion of "close" is sketched below)

Is this correct? No! Does it work? Sort of.
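The slide does not pin down "close", but the classic choice (Brown et al., 1993) is the move/swap neighborhood; a minimal sketch:

```python
def neighbors(a, n):
    """All alignments one 'move' (re-point a single a_i) or one 'swap'
    (exchange a_i and a_j) away from a, where each a_i is a source
    position in [0, n]. The neighborhood has O(m*n + m^2) members --
    tiny compared to the (n+1)^m alignments in the full space."""
    m = len(a)
    out = []
    for i in range(m):                     # moves
        for j in range(n + 1):
            if j != a[i]:
                out.append(a[:i] + [j] + a[i + 1:])
    for i in range(m):                     # swaps
        for j in range(i + 1, m):
            if a[i] != a[j]:
                b = a[:]
                b[i], b[j] = b[j], b[i]
                out.append(b)
    return out

print(len(neighbors([0, 1, 2], n=3)))  # 9 moves + 3 swaps = 12
```

Expected counts are then accumulated over this neighborhood (weighted by each alignment's probability) instead of over all $(n+1)^m$ alignments.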
Pitfalls of Conditional Models

[Figure: IBM Model 4 alignment vs. our model's alignment]
Lexical Translation

IBM Models 1-5 [Brown et al., 1993]:
- Model 1: lexical translation, uniform alignment
- Model 2: absolute position model
- Model 3: fertility
- Model 4: relative position model (jumps in target string)
- Model 5: non-deficient model
- Widely used: the Giza++ toolkit

HMM translation model [Vogel et al., 1996]:
- Relative position model (jumps in source string)

The latent variables (alignments) are more useful these days than the translations.
A few tricks...

[Figure: alignments produced by p(f | e) vs. p(e | f)]
Alignment Tool: fast_align, a fast log-linear reparameterization of IBM Model 2 [Dyer et al., 2013]
Another View

With this model:

$$p(\mathbf{e} \mid \mathbf{f}, m) = \sum_{\mathbf{a} \in [0,n]^m} p(\mathbf{a} \mid \mathbf{f}, m) \prod_{i=1}^{m} p(e_i \mid f_{a_i})$$

the problem of word alignment is:

$$\mathbf{a}^* = \arg\max_{\mathbf{a} \in [0,n]^m} p(\mathbf{a} \mid \mathbf{e}, \mathbf{f}, m)$$

Can we model this distribution directly?
Markov Random Fields (MRFs)

Directed factorization (Bayesian network) over variables A, B, C, X, Y, Z:

$$p(A,B,C,X,Y,Z) = p(A)\, p(B \mid A)\, p(C \mid B)\, p(X \mid A)\, p(Y \mid B)\, p(Z \mid C)$$

Undirected factorization (MRF) over the same graph, with one factor per edge:

$$p(A,B,C,X,Y,Z) = \frac{1}{Z}\, \psi_1(A,B)\, \psi_2(B,C)\, \psi_3(A,X)\, \psi_4(B,Y)\, \psi_5(C,Z)$$

The $\psi$'s are the factors.
Computing Z

For two variables $X, Y \in \mathcal{X}$ with $\mathcal{X} = \{a, b, c\}$:

$$Z = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{X}} \psi_1(x, y)\, \psi_2(x)\, \psi_3(y)$$

When the graph has certain structures (e.g., chains), you can factor the sum to get polytime dynamic-programming algorithms:

$$Z = \sum_{x \in \mathcal{X}} \psi_2(x) \sum_{y \in \mathcal{X}} \psi_1(x, y)\, \psi_3(y)$$
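A quick numeric check of the refactoring (the $\psi$ tables here are made-up values, purely for illustration):

```python
import itertools
import random

X = ["a", "b", "c"]
random.seed(0)
psi1 = {(x, y): random.uniform(0.1, 2.0) for x in X for y in X}  # pairwise factor
psi2 = {x: random.uniform(0.1, 2.0) for x in X}                  # unary factor on x
psi3 = {y: random.uniform(0.1, 2.0) for y in X}                  # unary factor on y

# Brute force: enumerate all |X|^2 joint assignments.
Z_brute = sum(psi1[x, y] * psi2[x] * psi3[y]
              for x, y in itertools.product(X, X))

# Factored: push the sum over y inside the sum over x. This
# "distribute multiplication over addition" step is exactly what
# chain-structured DP algorithms exploit to stay polytime.
Z_factored = sum(psi2[x] * sum(psi1[x, y] * psi3[y] for y in X) for x in X)

assert abs(Z_brute - Z_factored) < 1e-9
```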
Log-linear Models

$$p(A,B,C,X,Y,Z) = \frac{1}{Z}\, \psi_1(A,B)\, \psi_2(B,C)\, \psi_3(A,X)\, \psi_4(B,Y)\, \psi_5(C,Z)$$

Parameterize each factor log-linearly:

$$\psi_j(x, y) = \exp \sum_k w_k f_k(x, y)$$

- $w_k$: weights (learned)
- $f_k$: feature functions (specified)
Random Fields

Benefits:
- Potential functions can be defined with respect to arbitrary features (functions) of the variables
- A great way to incorporate knowledge

Drawbacks:
- The likelihood involves computing Z
- Maximizing likelihood usually requires computing Z (often over and over again!)
Conditional Random Fields

Use MRFs to parameterize a conditional distribution. Very easy: let the feature functions look at anything they want in the input.

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{w}}(\mathbf{x})} \exp \sum_{F \in G_{\mathbf{y}}} \sum_k w_k f_k(F, \mathbf{x}, \mathbf{y})$$

where $G_{\mathbf{y}}$ is the set of all factors in the graph of $\mathbf{y}$.
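The slide leaves the normalizer implicit; written out (a standard identity, not on the slide), it sums the same exponentiated score over every possible output:

$$Z_{\mathbf{w}}(\mathbf{x}) = \sum_{\mathbf{y}'} \exp \sum_{F \in G_{\mathbf{y}'}} \sum_k w_k f_k(F, \mathbf{x}, \mathbf{y}')$$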
Parameter Learning

CRFs are trained to maximize conditional likelihood:

$$\hat{\mathbf{w}}_{\text{MLE}} = \arg\max_{\mathbf{w}} \prod_{(\mathbf{x}_i, \mathbf{y}_i) \in \mathcal{D}} p(\mathbf{y}_i \mid \mathbf{x}_i; \mathbf{w})$$

Recall we want to directly model p(a | e, f). The likelihood of what alignments? Gold reference alignments!
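Maximizing this objective is usually done with gradient-based methods; the gradient of the conditional log-likelihood has the textbook "observed minus expected features" form (a standard CRF result, not stated on the slide):

$$\frac{\partial}{\partial w_k} \log p(\mathbf{y} \mid \mathbf{x}; \mathbf{w}) = f_k(\mathbf{x}, \mathbf{y}) - \mathbb{E}_{p(\mathbf{y}' \mid \mathbf{x}; \mathbf{w})}\!\left[ f_k(\mathbf{x}, \mathbf{y}') \right]$$

(writing $f_k(\mathbf{x}, \mathbf{y})$ for the feature summed over all factors). The expectation term requires $Z_{\mathbf{w}}(\mathbf{x})$, which is exactly the drawback noted two slides back.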
CRF for Alignment

One of many possibilities, due to Blunsom & Cohn (2006):

$$p(\mathbf{a} \mid \mathbf{e}, \mathbf{f}) = \frac{1}{Z_{\mathbf{w}}(\mathbf{e}, \mathbf{f})} \exp \sum_{i=1}^{|\mathbf{e}|} \sum_k w_k f_k(a_i, a_{i-1}, i, \mathbf{e}, \mathbf{f})$$

- $\mathbf{a}$ has the same form as in the lexical translation models (we still make a one-to-many assumption)
- $w_k$ are the model parameters
- $f_k$ are the feature functions
- The first-order dependence on $a_{i-1}$ keeps inference tractable: $O(n^2 m)$ with dynamic programming, i.e., $O(n^3)$ when $m \approx n$
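To see where $O(n^2 m)$ comes from, consider the forward algorithm over this chain, with one label per target word ranging over the $n+1$ source positions. A minimal sketch (the `score` callback stands in for $\sum_k w_k f_k(a_i, a_{i-1}, i, \mathbf{e}, \mathbf{f})$):

```python
import math

def log_partition(n, m, score):
    """Forward algorithm for the chain CRF above.

    score(a_i, a_prev, i) stands in for sum_k w_k * f_k(...); labels are
    source positions 0..n (0 = NULL), m = target length. Each of the m
    steps considers (n+1)^2 transitions, hence O(n^2 m) total."""
    alpha = [score(a, None, 0) for a in range(n + 1)]  # first target word
    for i in range(1, m):
        alpha = [
            math.log(sum(math.exp(alpha[prev] + score(a, prev, i))
                         for prev in range(n + 1)))
            for a in range(n + 1)
        ]
    return math.log(sum(math.exp(v) for v in alpha))

# Sanity check with uniform scores: Z = (n+1)^m, so log Z = m * log(n+1).
print(log_partition(n=3, m=5, score=lambda a, prev, i: 0.0))  # 5 * log(4)
```

(A real implementation would work in log-space throughout, e.g. with a logsumexp, for numerical stability.)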
Model

- Labels (one per target word) index positions in the source sentence
- Train models for (e, f) and (f, e) [inverting the reference alignments]
Experiments
Features on the example "pervez musharrafs langer abschied" / "pervez musharraf 's long goodbye":

- Identical word
- Matching prefix
- Matching suffix
- Orthographic similarity
- In dictionary
- ...
Lexical Features

- Word-word indicator features
- Various word-word co-occurrence scores
- IBM Model 1 probabilities (t | s, s | t)
- Geometric mean of Model 1 probabilities
- Dice's coefficient [binned]
- Products of the above
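Dice's coefficient here is presumably the standard definition over co-occurrence counts in the parallel corpus:

$$\mathrm{Dice}(e, f) = \frac{2\, C(e, f)}{C(e) + C(f)}$$

where $C(e, f)$ counts sentence pairs in which $e$ and $f$ co-occur, and $C(e)$, $C(f)$ are their individual occurrence counts.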
Lexical Features

- Word class to word class indicators, e.g. NN translates as NN (NN_NN=1); NN does not translate as MD (NN_MD=1)
- Identical word feature: 2010 = 2010 (IdentWord=1, IdentNum=1)
- Identical prefix feature: Obama ~ Obamu (IdentPrefix=1)
- Orthographic similarity measure [binned]: Al-Qaeda ~ Al-Kaida (OrthoSim050_080=1)
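A sketch of how a few of the features from these two slides might be computed for a candidate word pair (the count tables and bin edges are hypothetical inputs; the exact feature inventory is the paper's, not this code's):

```python
def lexical_features(e, f, cooc, count_e, count_f):
    """Indicator features for a candidate link between target word e and
    source word f. cooc / count_e / count_f are co-occurrence and marginal
    counts gathered from parallel data (hypothetical inputs)."""
    feats = {}
    if e == f:
        feats["IdentWord"] = 1.0                      # e.g. 2010 = 2010
    if len(e) >= 4 and len(f) >= 4 and e[:4] == f[:4]:
        feats["IdentPrefix"] = 1.0                    # e.g. Obama ~ Obamu
    if len(e) >= 3 and len(f) >= 3 and e[-3:] == f[-3:]:
        feats["IdentSuffix"] = 1.0
    # Dice's coefficient, binned (the bin edge is chosen arbitrarily here).
    dice = 2.0 * cooc.get((e, f), 0) / (count_e.get(e, 1) + count_f.get(f, 1))
    if dice > 0.5:
        feats["Dice>0.5"] = 1.0
    return feats

print(lexical_features("2010", "2010", {}, {}, {}))
# {'IdentWord': 1.0, 'IdentPrefix': 1.0, 'IdentSuffix': 1.0}
```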
Other Features

Compute features from large amounts of unlabeled text:
- Does the Model 4 alignment contain this alignment point?
- What is the Model 1 posterior probability of this alignment point?
CRF Results

System                AER    P     R
French-English         9%   97%   86%
French-English         9%   98%   83%
French-English         7%   96%   90%
French-English + M4    7%   98%   88%
French-English + M4    7%   98%   87%
French-English + M4    5%   98%   91%
IBM Model 4            9%   87%   95%
Summary Unfortunately, you need gold alignments!
CRF Autoencoders [Ammar, Dyer, Smith (2014), "Conditional Random Field Autoencoders"]

[Figure: inputs x_{i-1}, x_i, x_{i+1} are encoded into latent labels y_{i-1}, y_i, y_{i+1}, which then reconstruct x'_{i-1}, x'_i, x'_{i+1}]

$$\arg\max_{\lambda, \theta} \sum_{\mathbf{x}} \log \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} \underbrace{p_\lambda(\mathbf{y} \mid \mathbf{x})}_{\text{CRF encoder}}\, \underbrace{p_\theta(\mathbf{x}' \mid \mathbf{y})}_{\text{reconstruction model}}$$
CRF Autoencoders for Word Alignment [Ammar, Dyer, Smith (2014)]

The general objective,

$$\arg\max_{\lambda, \theta} \sum_{\mathbf{x}} \log \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} p_\lambda(\mathbf{y} \mid \mathbf{x})\, p_\theta(\mathbf{x}' \mid \mathbf{y})$$

specialized to alignment:

$$\arg\max_{\lambda, \theta} \sum_{(\mathbf{e}, \mathbf{f})} \log \sum_{\mathbf{a} \in \mathcal{A}(\mathbf{e}, \mathbf{f})} \underbrace{p_\lambda(\mathbf{a} \mid \mathbf{e}, \mathbf{f})}_{\text{CRF aligner}} \prod_{i=1}^{m} \underbrace{p_\theta(e_i \mid f_{a_i})}_{\text{lexical translation probabilities}}$$
CRF Autoencoder Results [Ammar, Dyer, Smith (2014)]

System          AER    P     R
Czech-English   28%   71%   73%
Czech-English   21%   80%   77%
Czech-English   19%   81%   81%
IBM Model 4     22%   75%   80%
Summary

This is it for word alignment. Questions?

Next time: evaluation. Keep working on HW1!