EM with Features (Nov. 19, 2013)


1 EM with Features Nov. 19, 2013

2 Word Alignment: das Haus → the house; ein Buch → a book; das Buch → the book

3-6 Lexical Translation

Goal: a model $p(e \mid f, m)$ where $e$ and $f$ are complete English and Foreign sentences, with $e = \langle e_1, e_2, \ldots, e_m \rangle$ and $f = \langle f_1, f_2, \ldots, f_n \rangle$.

Lexical translation makes the following assumptions: Each word $e_i$ in $e$ is generated from exactly one word in $f$. Thus, we have an alignment $a_i$ that indicates which word $e_i$ came from; specifically, it came from $f_{a_i}$. Given the alignments $a$, translation decisions are conditionally independent of each other and depend only on the aligned source word $f_{a_i}$.

7 IBM Model 1

Simplest possible lexical translation model. Additional assumptions: the $m$ alignment decisions are independent, and the alignment distribution for each $a_i$ is uniform over all source words and NULL.

For each $i \in \{1, 2, \ldots, m\}$:
$a_i \sim \mathrm{Uniform}(0, 1, 2, \ldots, n)$
$e_i \sim \mathrm{Categorical}(\theta_{f_{a_i}})$

8-12 IBM Model 1

For each $i \in \{1, 2, \ldots, m\}$:
$a_i \sim \mathrm{Uniform}(0, 1, 2, \ldots, n)$
$e_i \sim \mathrm{Categorical}(\theta_{f_{a_i}})$

$p(e_i, a_i \mid f, m) = \frac{1}{1+n}\, p(e_i \mid f_{a_i})$

$p(e, a \mid f, m) = \prod_{i=1}^{m} p(e_i, a_i \mid f, m) = \prod_{i=1}^{m} \frac{1}{1+n}\, p(e_i \mid f_{a_i})$
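
To make this concrete, here is a minimal Python sketch (not from the slides) that scores a sentence pair and alignment under Model 1; the translation table t is a hypothetical stand-in for the Categorical parameters $p(e_i \mid f_{a_i})$.

# Minimal sketch of the IBM Model 1 joint probability p(e, a | f, m).
# `t` maps (f_word, e_word) -> probability; it is an illustrative placeholder.

def model1_joint(e, f, a, t):
    """p(e, a | f, m) = prod_i 1/(1+n) * t[f[a_i], e_i], with f[0] = NULL."""
    f = ["NULL"] + f                  # prepend the NULL source word
    n = len(f) - 1                    # number of real source words
    p = 1.0
    for e_i, a_i in zip(e, a):
        p *= (1.0 / (1 + n)) * t.get((f[a_i], e_i), 0.0)
    return p

# Example with toy probabilities (assumed for illustration only):
t = {("das", "the"): 0.7, ("Haus", "house"): 0.8, ("NULL", "the"): 0.1}
print(model1_joint(["the", "house"], ["das", "Haus"], [1, 2], t))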

13-19 Marginal probability

$p(e_i, a_i \mid f, m) = \frac{1}{1+n}\, p(e_i \mid f_{a_i})$

$p(e_i \mid f, m) = \sum_{a_i=0}^{n} \frac{1}{1+n}\, p(e_i \mid f_{a_i})$

Recall our independence assumptions: all alignment decisions are independent of each other, and given the alignments all translation decisions are independent of each other, so the joint probability factors across positions, just as $p(a, b, c, d) = p(a)\,p(b)\,p(c)\,p(d)$.

$p(e \mid f, m) = \prod_{i=1}^{m} p(e_i \mid f, m) = \prod_{i=1}^{m} \sum_{a_i=0}^{n} \frac{1}{1+n}\, p(e_i \mid f_{a_i}) = \frac{1}{(1+n)^m} \prod_{i=1}^{m} \sum_{a_i=0}^{n} p(e_i \mid f_{a_i})$
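
Because the marginal factors position by position, it can be computed without enumerating the $(1+n)^m$ alignments. A small sketch under the same assumptions as the previous one (hypothetical table t):

# Sketch of the Model 1 marginal p(e | f, m), using the factorization
# p(e | f, m) = 1/(1+n)^m * prod_i sum_{a_i=0..n} t[f[a_i], e_i].

def model1_marginal(e, f, t):
    f = ["NULL"] + f
    n = len(f) - 1
    p = 1.0
    for e_i in e:
        p *= sum(t.get((f_j, e_i), 0.0) for f_j in f) / (1 + n)
    return p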

20-29 Example

0 NULL das Haus ist klein

Start with a foreign sentence (with NULL prepended) and a target length. The target sentence is then generated one word at a time, each from its aligned source word: the; the house; the house is; the house is small.

30-38 Example

0 NULL das Haus ist klein

A second derivation of the same generative process, with the target words produced in a different order: the; house the; house is the; house is small the.

39 freq(Buch, book) = ?   freq(das, book) = ?   freq(ein, book) = ?

$\mathrm{freq}(\mathrm{Buch}, \mathrm{book}) = \mathbb{E}_{p_{w^{(1)}}(a \mid f = \text{das Buch},\, e = \text{the book})}\left[\sum_i \mathbb{I}[e_i = \mathrm{book},\, f_{a_i} = \mathrm{Buch}]\right]$
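
One way to compute these expected counts: under Model 1 the posterior over each $a_i$ factors, so the expected count of a word pair is a sum of per-position posterior probabilities. A hedged sketch of one full EM iteration (the corpus format and the dictionary table t are illustrative assumptions, not the slides' implementation; t must assign nonzero probability to every co-occurring pair, e.g. by uniform initialization):

from collections import defaultdict

def model1_em_iteration(corpus, t):
    """One EM iteration for Model 1. corpus: list of (f_words, e_words) pairs.
    t: dict (f_word, e_word) -> prob. Returns a re-estimated translation table."""
    count = defaultdict(float)    # expected freq(f_word, e_word)
    total = defaultdict(float)    # expected freq(f_word)
    for f, e in corpus:
        f = ["NULL"] + f
        for e_i in e:
            # posterior p(a_i = j | e_i, f) is proportional to t[f_j, e_i]
            z = sum(t[(f_j, e_i)] for f_j in f)
            for f_j in f:
                p = t[(f_j, e_i)] / z
                count[(f_j, e_i)] += p
                total[f_j] += p
    # M step: renormalize expected counts into a new translation table
    return {(f_j, e_i): c / total[f_j] for (f_j, e_i), c in count.items()}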

40 Convergence

41 Evaluation

Since we have a probabilistic model, we can evaluate perplexity:

$\mathrm{PPL} = 2^{-\frac{1}{\sum_{(e,f) \in D} |e|} \log_2 \prod_{(e,f) \in D} p(e \mid f)}$

[Table: negative log likelihood and perplexity at each EM iteration (Iter 1, Iter 2, Iter 3, Iter 4, ...).]
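
A small helper that matches one reading of the formula above (per-target-word perplexity over the corpus D); the argument names are illustrative:

import math

def perplexity(corpus_probs, corpus_lengths):
    """corpus_probs: p(e|f) for each pair in D; corpus_lengths: |e| for each pair.
    PPL = 2 ** ( -(1 / sum |e|) * log2 prod p(e|f) )."""
    log2_likelihood = sum(math.log2(p) for p in corpus_probs)
    return 2 ** (-log2_likelihood / sum(corpus_lengths))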

42 Hidden Markov Models

GMMs, the aggregate bigram model, and Model 1 don't have conditional dependencies between random variables. Let's consider an example of a model where this is not the case:

$p(x) = \sum_{y \in \mathcal{Y}_x} \gamma(y_{|x|} \to \mathrm{stop}) \prod_{i=1}^{|x|} \gamma(y_{i-1} \to y_i)\, \gamma(y_i \downarrow x_i)$
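
A sketch of the joint probability of one labelled sequence under this HMM, i.e. the term inside the sum above; trans and emit are hypothetical dictionaries standing in for the transition and emission parameters, with distinguished start and stop states:

def hmm_joint(x, y, trans, emit):
    """p(x, y) = prod_i trans[(y_{i-1}, y_i)] * emit[(y_i, x_i)] * trans[(y_last, 'stop')],
    where the chain begins in a distinguished 'start' state."""
    p = 1.0
    prev = "start"
    for x_i, y_i in zip(x, y):
        p *= trans[(prev, y_i)] * emit[(y_i, x_i)]
        prev = y_i
    return p * trans[(prev, "stop")]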

43-44 EM for HMMs

What statistics are sufficient to determine the parameter values?
freq(q ↓ x): How often does q emit x?
freq(q → r): How often does q transition to r?
freq(q): How often do we visit q?

And of course... $\mathrm{freq}(q) = \sum_{r \in Q} \mathrm{freq}(q \to r)$
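
Given these expected sufficient statistics, the multinomial M step is just normalization; a sketch assuming the expectations are stored in dictionaries keyed by (state, next-state) and (state, symbol):

def m_step(exp_trans, exp_emit):
    """exp_trans[(q, r)] = E[freq(q -> r)], exp_emit[(q, x)] = E[freq(q emits x)].
    Returns re-estimated transition and emission tables. Both are normalized by
    E[freq(q)] = sum_r E[freq(q -> r)] (transitions to 'stop' included)."""
    state_total = {}
    for (q, r), c in exp_trans.items():
        state_total[q] = state_total.get(q, 0.0) + c      # E[freq(q)]
    trans = {(q, r): c / state_total[q] for (q, r), c in exp_trans.items()}
    emit = {(q, x): c / state_total[q] for (q, x), c in exp_emit.items()}
    return trans, emit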

45-48 [Trellis figure: observation sequence x = a b b c b at positions i = 1, ..., 5, with forward and backward scores $\alpha_2(s_2)$ and $\beta_3(s_2)$ marked.]

$p(y_2 = q, y_3 = r \mid x) \propto p(y_2 = q, y_3 = r, x) = \alpha_2(q)\, \gamma(q \to r)\, \gamma(r \downarrow x_3)\, \beta_3(r)$

$p(y_2 = q, y_3 = r \mid x) = \frac{\alpha_2(q)\, \gamma(q \to r)\, \gamma(r \downarrow x_3)\, \beta_3(r)}{\sum_{q', r' \in Q} \alpha_2(q')\, \gamma(q' \to r')\, \gamma(r' \downarrow x_3)\, \beta_3(r')} = \frac{\alpha_2(q)\, \gamma(q \to r)\, \gamma(r \downarrow x_3)\, \beta_3(r)}{p(x)}$

where $p(x) = \alpha_{|x|+1}(\mathrm{stop})$, the total probability of the observation sequence.

49-51 (with the posterior $p(y_i = q, y_{i+1} = r \mid x)$ computed as above)

The expectation over the full structure is then

$E[\mathrm{freq}(q \to r)] = \sum_{i=1}^{|x|} p(y_i = q, y_{i+1} = r \mid x)$

The expectation over state occupancy is

$E[\mathrm{freq}(q)] = \sum_{r \in Q} E[\mathrm{freq}(q \to r)]$

What is $E[\mathrm{freq}(q \downarrow x)]$?
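
A sketch of the forward-backward computation that produces these expectations for one sequence; the state set and the trans/emit dictionaries are assumed as in the earlier sketch:

def expected_transition_counts(x, states, trans, emit):
    """Forward-backward for one sequence x. Returns (E[freq(q -> r)], p(x)).
    trans[(q, r)] and emit[(q, symbol)] are the HMM parameters, with special
    'start' and 'stop' states."""
    T = len(x)
    # forward: alpha[i][q] = p(x_1..x_{i+1}, y_{i+1} = q)  (0-based list index)
    alpha = [{q: trans[("start", q)] * emit[(q, x[0])] for q in states}]
    for i in range(1, T):
        alpha.append({r: sum(alpha[i - 1][q] * trans[(q, r)] for q in states)
                         * emit[(r, x[i])] for r in states})
    # backward: beta[i][q] = p(x_{i+2}..x_T, stop | y_{i+1} = q)
    beta = [None] * T
    beta[T - 1] = {q: trans[(q, "stop")] for q in states}
    for i in range(T - 2, -1, -1):
        beta[i] = {q: sum(trans[(q, r)] * emit[(r, x[i + 1])] * beta[i + 1][r]
                          for r in states) for q in states}
    px = sum(alpha[T - 1][q] * trans[(q, "stop")] for q in states)
    # E[freq(q -> r)] = sum_i p(y_i = q, y_{i+1} = r | x)
    exp_counts = {}
    for i in range(T - 1):
        for q in states:
            for r in states:
                p = alpha[i][q] * trans[(q, r)] * emit[(r, x[i + 1])] * beta[i + 1][r] / px
                exp_counts[(q, r)] = exp_counts.get((q, r), 0.0) + p
    return exp_counts, px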

52 Random Restarts

Non-convex optimization only finds a local solution. Several strategies: random restarts, simulated annealing.
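
A tiny sketch of the random-restart strategy wrapped around any EM routine; run_em and init_params are hypothetical stand-ins for a model-specific EM loop and random initializer:

import random

def em_with_restarts(data, num_restarts, run_em, init_params, seed=0):
    """Run EM from several random initializations and keep the best local optimum.
    run_em(data, params) is assumed to return (final_params, log_likelihood)."""
    best_params, best_ll = None, float("-inf")
    for k in range(num_restarts):
        rng = random.Random(seed + k)
        params, ll = run_em(data, init_params(rng))
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params, best_ll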

53 Decipherment

54 Grammar Induction

55 Inductive Bias

A model can learn nothing without inductive bias... whence inductive bias? Model structure; priors (next week); posterior regularization (Google it). Features provide a very flexible means to bias a model.

56-57 EM with Features

Let's replace the multinomials with log-linear distributions:

$\gamma(q \to r) = \theta_{q,r} = \frac{\exp\, w^\top f(q, r)}{\sum_{r' \in Q} \exp\, w^\top f(q, r')}$

How will the likelihood of this model compare to the likelihood of the previous model?
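
A sketch of this featurized transition distribution as a softmax over the next state; the sparse feature function f and weight dictionary w are illustrative names, not part of the slides:

import math

def loglinear_transition(q, states, w, f):
    """Return the distribution over next states r:
    gamma(q -> r) = exp(w . f(q, r)) / sum_{r'} exp(w . f(q, r'))."""
    scores = {r: sum(w.get(k, 0.0) * v for k, v in f(q, r).items()) for r in states}
    z = sum(math.exp(s) for s in scores.values())
    return {r: math.exp(s) / z for r, s in scores.items()}

# f(q, r) returns a sparse feature vector as a dict, e.g.
# f = lambda q, r: {("pair", q, r): 1.0, ("dest", r): 1.0}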

58 Learning Algorithm 1

E step: given the model parameters, compute the posterior distribution over transitions (states, etc.), and from it compute $\sum_{q,r} E_{q(y)}[\mathrm{freq}(q \to r)]\, f(q, r)$. These are your empirical expectations.

59 Learning Algorithm 1

M step: the gradient of the expected log likelihood of $x, y$ under $q(y)$ is

$\nabla_w E_{q(y)}[\log p(x, y)] = \sum_{q,r} E_q[\mathrm{freq}(q \to r)]\, f(q, r) - \sum_{q,r} E_q[\mathrm{freq}(q)]\; p(r \mid q; w)\, f(q, r)$

Use L-BFGS or gradient descent to solve.
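
A sketch of this gradient for the transition parameters: expected feature counts from the E step minus model-expected feature counts under the current weights. exp_trans and exp_state hold the E[freq(q -> r)] and E[freq(q)] values computed earlier; f and w are as in the previous sketch:

import math

def expected_loglik_gradient(states, exp_trans, exp_state, w, f):
    """grad_w E_q[log p(x, y)] (transition part):
    sum_{q,r} E[freq(q->r)] f(q,r)  -  sum_q E[freq(q)] sum_r p(r|q; w) f(q,r)."""
    grad = {}
    for q in states:
        # model distribution p(r | q; w): softmax over next states
        scores = {r: sum(w.get(k, 0.0) * v for k, v in f(q, r).items()) for r in states}
        z = sum(math.exp(s) for s in scores.values())
        p_r = {r: math.exp(s) / z for r, s in scores.items()}
        for r in states:
            coeff = exp_trans.get((q, r), 0.0) - exp_state.get(q, 0.0) * p_r[r]
            for k, v in f(q, r).items():
                grad[k] = grad.get(k, 0.0) + coeff * v
    return grad

# Take gradient steps (or hand the negated gradient to L-BFGS) to update w,
# then re-run the E step with the new weights.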
