CRF Word Alignment & Noisy Channel Translation

Size: px

Start display at page:

Download "CRF Word Alignment & Noisy Channel Translation"

Phillip Parrish
5 years ago
Views:

1 CRF Word Alignment & Noisy Channel Translation January 31, 2013

2 Last Time... X p( Translation)= p(, Translation) Alignment Alignment

3 Last Time... X p( Translation)= p(, Translation) Alignment X Alignment = p( p( Alignment) Translation Alignment) Alignment

4 Last Time... X p( Translation)= p(, Translation) Alignment = X Alignment Alignment p( p( Alignment) Translation Alignment) {z } {z } X my z } { z } { p(e f,m)= a2[0,n] m p(a f,m) i=1 p(e i f ai )

5 MAP alignment IBM Model 4 alignment Our model's alignmen

6 MAP alignment IBM Model 4 alignment Our model's alignmen

7 A few tricks... p(f e) p(e f)

8 A few tricks... p(f e) p(e f)

9 A few tricks... p(f e) p(e f)

10 Another View With this model: X my p(e f,m)= a2[0,n] m p(a f,m) i=1 p(e i f ai ) The problem of word alignment is as: a = arg max p(a e, f,m) a2[0,n] m

11 Another View With this model: X my p(e f,m)= a2[0,n] m p(a f,m) i=1 p(e i f ai ) The problem of word alignment is as: a = arg max p(a e, f,m) a2[0,n] m Can we model this distribution directly?

12 Markov Random Fields (MRFs) A B C X Y Z p(a, B, C, X, Y, Z) = p(a) p(b A) p(c B) p(x A)p(Y B)p(Z C)

13 Markov Random Fields (MRFs) A B C X Y Z p(a, B, C, X, Y, Z) = p(a) p(b A) p(c B) p(x A)p(Y B)p(Z C) A B C X Y Z p(a, B, C, X, Y, Z) = 1 Z 1(A, B) 2(B,C) 3(C, D) 4(X) 5(Y ) 6(Z)

14 Markov Random Fields (MRFs) A B C X Y Z p(a, B, C, X, Y, Z) = p(a) p(b A) p(c B) p(x A)p(Y B)p(Z C) A B C X Y Z p(a, B, C, X, Y, Z) = 1 Z 1(A, B) 2(B,C) 3(C, D) 4(X) 5(Y ) 6(Z) Factors

15 Computing Z X X Y Z = X x2x 1(x, y) 2(x) 3(y) y2x X = {a, b, c} X 2 X Y 2 X When the graph has certain structures (e.g., chains), you can factor to get polytime DP algorithms. Z = X x2x 2(x) X y2x 1(x, y) 3(y)

16 Log-linear models A B C X Y Z p(a, B, C, X, Y, Z) = 1 Z 1(A, B) 2(B,C) 3(C, D) 4(X) 5(Y ) 6(Z) 1,2,3(x, y) =exp X k w k f k (x, y)

17 Log-linear models A B C X Y Z p(a, B, C, X, Y, Z) = 1 Z 1(A, B) 2(B,C) 3(C, D) 4(X) 5(Y ) 6(Z) 1,2,3(x, y) =exp X k w k f k (x, y) Weights (learned)

18 1,2,3(x, y) =exp X k Log-linear models A B C p(a, B, C, X, Y, Z) = 1 Z 1(A, B) 2(B,C) 3(C, D) X Y Z 4(X) 5(Y ) 6(Z) w k f k (x, y) Weights (learned) Feature functions (specified)

19 Random Fields Benefits Potential functions can be defined with respect to arbitrary features (functions) of the variables Great way to incorporate knowledge Drawbacks Likelihood involves computing Z Maximizing likelihood usually requires computing Z (often over and over again!)

20 Conditional Random Fields Use MRFs to parameterize a conditional distribution. Very easy: let feature functions look at anything they want in the input

21 Conditional Random Fields Use MRFs to parameterize a conditional distribution. Very easy: let feature functions look at anything they want in the input p(y x) = 1 Z w (y) exp X F 2G X k w k f k (F, x)

22 Conditional Random Fields Use MRFs to parameterize a conditional distribution. Very easy: let feature functions look at anything they want in the input p(y x) = 1 Z w (y) exp X F 2G X k w k f k (F, x) All factors in the graph of y

23 Parameter Learning CRFs are trained to maximize conditional likelihood Y ŵ MLE = arg max p(y i x i ; w) w Recall we want to directly model p(a e, f) (x i,y i )2D The likelihood of what alignments?

24 Parameter Learning CRFs are trained to maximize conditional likelihood Y ŵ MLE = arg max p(y i x i ; w) w Recall we want to directly model p(a e, f) (x i,y i )2D The likelihood of what alignments? Gold reference alignments!

25 CRF for Alignment One of many possibilities, due to Blunsom & Cohn (2006) p(a e, f) = 1 Z w (e, f) exp e X i=1 X a has the same form as in the lexical translation models (still make a one-to-many assumption) w k are the model parameters f k are the feature functions k w k f(a i,a i 1,i,e, f)

26 CRF for Alignment One of many possibilities, due to Blunsom & Cohn (2006) p(a e, f) = 1 Z w (e, f) exp e X i=1 X a has the same form as in the lexical translation models (still make a one-to-many assumption) w k are the model parameters f k are the feature functions k w k f(a i,a i 1,i,e, f) O(n 2 m) O(n 3 )

27 Model Labels (one per target word) index source sentence Train model (e,f) and (f,e) [inverting the reference alignments]

28 Experiments

29 pervez musharrafs langer abschied Identical word pervez musharraf s long goodbye Identical word 17

30 pervez musharrafs langer abschied Matching prefix pervez musharraf s long goodbye Identical word Matching prefix 18

31 pervez musharrafs langer abschied Matching suffix pervez musharraf s long goodbye Identical word Matching prefix Matching suffix 19

32 pervez musharrafs langer abschied Orthographic similarity pervez musharraf s long goodbye Identical word Matching prefix Matching suffix Orthographic similarity 20

33 pervez musharrafs langer abschied In dictionary pervez musharraf s long goodbye Identical word Matching prefix Matching suffix Orthographic similarity In dictionary... 21

34 pervez musharrafs langer abschied pervez musharraf s long goodbye Identical word Matching prefix Matching suffix Orthographic similarity In dictionary... 21

35 Lexical Features Word word indicator features Various word word co-occurrence scores IBM Model 1 probabilities (t s, s t) Geometric mean of Model 1 probabilities Dice s coefficient [binned] Products of the above

36 Lexical Features Word class word class indicator NN translates as NN NN does not translate as MD Identical word feature 2010 = 2010 Identical prefix feature Obama ~ Obamu (NN_NN=1) (NN_MD=1) (IdentWord=1 IdentNum=1) (IdentPrefix=1) Orthographic similarity measure [binned] Al-Qaeda ~ Al-Kaida (OrthoSim050_080=1)

37 Other Features Compute features from large amounts of unlabeled text Does the Model 4 alignment contain this alignment point? What is the Model 1 posterior probability of this alignment point?

38 Results

39 Summary

40 Summary Unfortunately, you need gold alignments!

41 Putting the pieces together p(e) p(e f,m) p(e, a f,m) p(a e, f)

42 Putting the pieces together We have seen how to model the following: p(e) p(e f,m) p(e, a f,m) p(a e, f)

43 Putting the pieces together We have seen how to model the following: p(e) p(e f,m) p(e, a f,m) p(a e, f)

44 Putting the pieces together We have seen how to model the following: p(e) p(e f,m) p(e, a f,m) p(a e, f) Goal: a better model of p(e) p(e f,m) that knows about

45 One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode. Warren Weaver to Norbert Wiener, March, 1947

46 M Message Encoder Y X M 0 Sent Noisy channel Received Decoder Recovered message Claude Shannon. A Mathematical Theory of Communication 1948.

47 M Message Encoder Y X M 0 Sent p(y) Noisy channel p(x y) Received p(x) Decoder Recovered message Claude Shannon. A Mathematical Theory of Communication 1948.

48 M Message Encoder Y X M 0 Sent p(y) Noisy channel p(x y) Received p(x) Decoder Recovered message Claude Shannon. A Mathematical Theory of Communication 1948.

49 M Message Encoder Y X M 0 Sent p(y) Noisy channel p(x y) Received Decoder Recovered message Shannon s theory tells us: 1) how much data you can send 2) the limits of compression 3) why your download is so slow 4) how to translate Claude Shannon. A Mathematical Theory of Communication 1948.

50 Y X M 0 Sent p(y) Noisy channel p(x y) Received Decoder Y 0 Recovered message

51 Y X M 0 Sent p(y) Noisy channel p(x y) Received Decoder Y 0 Recovered message

52 Y X M 0 Sent p(y) Noisy channel p(x y) Received Decoder Y 0 Recovered message

53 Y X M 0 Sent p(y) Noisy channel p(x y) Received Decoder Y 0 Recovered message y 0 = arg max y = arg max y = arg max y p(y x) p(x y)p(y) p(x) p(x y)p(y)

54 Y X M 0 Sent p(y) Noisy channel p(x y) Received 6= Decoder Y 0 Recovered message y 0 = arg max y = arg max y = arg max y p(y x) p(x y)p(y) p(x) p(x y)p(y)

55 Y X M 0 Sent p(y) Noisy channel p(x y) Received 6= Decoder Y 0 Recovered message y 0 = arg max y p(y x) = arg max y I can help. p(x y)p(y) p(x) = arg max y p(x y)p(y)

56 Y X M 0 Sent Noisy channel Received Decoder Y 0 Recovered message y 0 = arg max y = arg max y = arg max y p(y x) p(x y)p(y) p(x) p(x y)p(y)

57 Y X M 0 Sent Noisy channel Received Decoder Y 0 Recovered message y 0 = arg max y = arg max y p(y x) p(x y)p(y) p(x) = arg max y p(x y)p(y) Denominator doesn t depend on y.

58 Y X M 0 Sent Noisy channel Received Decoder Y 0 Recovered message y 0 = arg max y = arg max y = arg max y p(y x) p(x y)p(y) p(x) p(x y)p(y)

59 Y X M 0 Sent Noisy channel Received Decoder Y 0 Recovered message y 0 = arg max y p(x y)p(y)

60 Y X M 0 Sent Noisy channel Received Decoder Y 0 Recovered message English French English y 0 = arg max y p(x y)p(y) e 0 = arg max e p(f e)p(e)

61 Y X M 0 Sent Noisy channel Received Decoder Y 0 Recovered message English French English y 0 = arg max y p(x y)p(y) e 0 = arg max e p(f e)p(e) translation model

62 Y X M 0 Sent Noisy channel Received Decoder Y 0 Recovered message English French English y 0 = arg max y p(x y)p(y) e 0 = arg max e p(f e)p(e) translation model language model

63 Y X M 0 Sent Noisy channel Received Decoder Y 0 Recovered message English French English y 0 = arg max y p(x y)p(y) e 0 = arg max e p(f e)p(e) translation model language model Other noisy channel applications: OCR, speech recognition, spelling correction...

64 Division of labor Translation model probability of translation back into the source ensures adequacy of translation Language model is a translation hypothesis good English? ensures fluency of translation

65 p(e) English

66 p(e) English p(f e)

67 p(e) English p(f e) e = arg max e = arg max e p(e f) p(f e) p(e)

68 Announcements Upcoming language-in-10 Tuesday: Jon/Austin - Русский Leaderboard is functional

Word Alignment III: Fertility Models & CRFs. February 3, 2015

Word Alignment III: Fertility Models & CRFs February 3, 2015 Last Time... X p( Translation)= p(, Translation) Alignment = X Alignment Alignment p( p( Alignment) Translation Alignment) {z } {z } X z } {