Word Alignment. Chris Dyer, Carnegie Mellon University

Size: px

Start display at page:

Download "Word Alignment. Chris Dyer, Carnegie Mellon University"

Isabella Stewart
6 years ago
Views:

1 Word Alignment Chris Dyer, Carnegie Mellon University

2 John ate an apple John hat einen Apfel gegessen

3 John ate an apple John hat einen Apfel gegessen

4 Outline Modeling translation with probabilistic models Using EM to learn parameters

5 Lexical Translation How do we translate a word? Look it up in the dictionary Haus : house, home, shell, household Multiple translations Different word senses, different registers, different inflections (?) house, home are common shell is specialized (the Haus of a snail is a shell)

6 How common is each? Translation Count house 5,000 home 2,000 shell 100 household 80

7 MLE ˆp MLE (e Haus) = if e = house >< if e = home if e = shell if e = household >: 0 otherwise

8 Lexical Translation Goal: a model p(e f,m) where e and f are complete English and Foreign sentences Lexical translation makes the following assumptions: Each word in e i in eis generated from exactly one word in f e = he 1,e 2,...,e m i Thus, we have an alignment a i that indicates which word came from, specifically it came from. e i f = hf 1,f 2,...,f n i f ai Given the alignments a, translation decisions are conditionally independent of each other and depend only on the aligned source word.

9 Lexical Translation Goal: a model p(e f,m) where e and f are complete English and Foreign sentences Lexical translation makes the following assumptions: Each word in e i in e is generated from exactly one word in f Thus, we have an alignment a i that indicates which word came from, specifically it came from. e i f ai Given the alignments a, translation decisions are conditionally independent of each other and depend only on the aligned source word. f ai

10 Lexical Translation Putting our assumptions together, we have: p(e f,m)= X a2[0,n] m p(a f,m) my p(e i f ) ai i=1 Alignment Translation Alignment

11 Lexical Translation p(e i f ai ) p(house Haus) p(shell Haus) p(declaration Unabhaenigkeitserkaerung) This is just a bigram model!

12 Lexical Translation Putting our assumptions together, we have: p(e f,m)= X a2[0,n] m p(a f,m) my p(e i f ) ai i=1 Alignment Translation Alignment

13 Alignment p(a f,m) Most of the action for the first 10 years of MT was here. Words weren t the problem, word order was hard.

14 Alignment Alignments can be visualized in by drawing links between two sentences, and they are represented as vectors of positions: a =(1, 2, 3, 4) >

15 Reordering Words may be reordered during translation. a =(3, 4, 2, 1) >

16 Word Dropping A source word may not be translated at all a =(2, 3, 4) >

17 Word Insertion Words may be inserted during translation English just does not have an equivalent But it must be explained - we typically assume every source sentence contains a NULL token a =(1, 2, 3, 0, 4) >

18 One-to-many Translation A source word may translate into more than one target word a =(1, 2, 3, 4, 4) >

19 Many-to-one Translation More than one source word may not translate as a unit in lexical translation das Haus brach zusammen the house collapsed a =??? a =(1, 2, (3, 4) > ) >?

20 IBM Model 1 Simplest possible lexical translation model Additional assumptions The m alignment decisions are independent The alignment distribution for each a i over all source words and NULL for each i 2 [1, 2,...,m] a i Uniform(0, 1, 2,...,n) e i Categorical( fai ) is uniform

21 IBM Model 1 for each i 2 [1, 2,...,m] a i Uniform(0, 1, 2,...,n) e i Categorical( fai ) p(e, a f,m)= my i=1

22 IBM Model 1 for each i 2 [1, 2,...,m] a i Uniform(0, 1, 2,...,n) e i Categorical( fai ) p(e, a f,m)= my i=1 1 1+n

23 IBM Model 1 for each i 2 [1, 2,...,m] a i Uniform(0, 1, 2,...,n) e i Categorical( fai ) p(e, a f,m)= my i=1 1 1+n p(e i f ai )

24 IBM Model 1 for each i 2 [1, 2,...,m] a i Uniform(0, 1, 2,...,n) e i Categorical( fai ) p(e, a f,m)= my i=1 1 1+n p(e i f ai ) p(e i,a i f,m)= 1 1+n p(e i f ) ai p(e, a f,m)= my p(e i,a i f,m) i=1

25 Marginal probability p(e i,a i f,m)= 1 p(e i f,m)= 1+n p(e i f ) ai nx 1 1+n p(e i f ) ai a i =0 Recall our independence assumption: all alignment decisions are independent of each other, and given alignments all translation decisions are independent of each other, so all translation decisions are independent of each other. p(a, b, c, d) =p(a)p(b)p(c)p(d) my p(e f,m)= p(e i f,m) i=1

26 Marginal probability p(e i,a i f,m)= 1 1+n p(e i f ) ai p(e i f,m)= p(e f,m)= nx 1 1+n p(e i f ) ai my p(e i f,m) a i =0 i=1

27 Marginal probability p(e i,a i f,m)= 1 1+n p(e i f ) ai p(e i f,m)= p(e f,m)= = = nx 1 1+n p(e i f ) ai my p(e i f,m) a i =0 i=1 my i=1 a i =0 nx 1 1 (1 + n) m 1+n p(e i f ai ) my nx p(e i f ) ai i=1 a i =0

28 Marginal probability p(e i,a i f,m)= 1 1+n p(e i f ) ai p(e i f,m)= p(e f,m)= = = nx 1 1+n p(e i f ) ai my p(e i f,m) a i =0 i=1 my i=1 a i =0 nx 1 1 (1 + n) m 1+n p(e i f ai ) my nx p(e i f ) ai i=1 a i =0

29 Example 0 NULL das Haus ist klein Start with a foreign sentence and a target length.

30 Example 0 NULL das Haus ist klein the house is small

31 Example 0 NULL das Haus ist klein house is small the

32 Finding the Viterbi Alignment a = arg max p(a e, f) a2[0,1,...,n] m p(e, a f) = arg max P a2[0,1,...,n] m a 0 p(e, a 0 f) = arg max p(e, a f) a2[0,1,...,n] m a i = arg n max a i =0 1 1+n p(e i f ai ) = arg n max a i =0 p(e i f ai )

33 Finding the Viterbi Alignment 0 NULL das Haus ist klein the home is little

34 Finding the Viterbi Alignment 0 NULL das Haus ist klein the home is little

35 Finding the Viterbi Alignment 0 NULL das Haus ist klein the home is little

36 Finding the Viterbi Alignment 0 NULL das Haus ist klein the home is little

alignments, we could estimate the parameters (MLE) If we

37 Learning Lexical Translation Models How do we learn the parameters p(e f) Chicken and egg problem If we had the alignments, we could estimate the parameters (MLE) If we had parameters, we could find the most likely alignments

38 EM Algorithm pick some random (or uniform) parameters Repeat until you get bored (~ 5 iterations for lexical translation models) using your current parameters, compute expected alignments for every target word token in the training data p(a i e, f) (on board) keep track of the expected number of times f translates into e throughout the whole corpus keep track of the expected number of times that f is used as the source of any translation use these expected counts as if they were real counts in the standard MLE equation

39 EM das Haus das Buch ein Buch the house the book a book

40 EM: Model (init) das Buch Haus ein the book house a

41 EM: E Step das Buch Haus ein the book house a training instance: das the Buch book

42 EM: E Step das Buch Haus ein the book house a training instance: das the Buch book

43 EM: E Step das Buch Haus ein the book house a training instance: das the Buch book

44 EM: E Step das Buch Haus ein the book house a training instance: das the Buch book C(das) = 1 C(the das) = 0.5 C(book das) = 0.5

45 EM: E Step das Buch Haus ein the book house a training instance: das the Haus house

46 EM: E Step das Buch Haus ein the book house a training instance: das the Haus house C(das) = 1+1 C(the das) = C(book das) = 0.5 C(house das) = 0.5

47 EM: E Step das Buch Haus ein the book house a training instance: das the Haus house C(das) = 2 C(the das) = 1 C(book das) = 0.5 C(house das) = 0.5

48 EM: M-Step das Buch Haus ein the book house a C(das) = 2 C(the das) = 1 C(book das) = 0.5 C(house das) = 0.5

49 EM: M-Step das Buch Haus ein the book house a C(das) = 2 C(the das) = 1 C(book das) = 0.5 C(house das) = 0.5

50 EM: Model (step 2) das Buch Haus ein the book house a training instance: das the Buch book

51 EM: Model (step 2) das Buch Haus ein the book house a training instance: das the Buch book _ 2 1_ 3 3 C(das) = 1 C(the das) = C(book das) =

52 EM: Model das Buch Haus ein the book house a

53 EM: Model (final) das Buch Haus ein the book house a

54 Evaluation

55 Evaluation Since we have a probabilistic model, we can evaluate perplexity. PPL = 2 1 P(e,f)2D e log Q (e,f)2d p(e f) Iter 1 Iter 2 Iter 3 Iter 4... Iter -log likelihood perplexity

$)= P \ A A Recall(A, S) = S \ A S AER(A, P, S) =1 S \ A + P \ A S +$

56 Alignment Error Rate Possible links P Sure links S Precision(A, P )= P \ A A Recall(A, S) = S \ A S AER(A, P, S) =1 S \ A + P \ A S + A

57 Questions?

EM with Features. Nov. 19, Sunday, November 24, 13

EM with Features. Nov. 19, Sunday, November 24, 13 EM with Features Nov. 19, 2013 Word Alignment das Haus ein Buch das Buch the house a book the book Lexical Translation Goal: a model p(e f,m) where e and f are complete English and Foreign sentences Lexical