EM with Features Nov. 19, 2013
Word Alignment

das Haus → the house
ein Buch → a book
das Buch → the book
Lexical Translation

Goal: a model $p(e \mid f, m)$ where $e$ and $f$ are complete English and foreign sentences, $e = \langle e_1, e_2, \ldots, e_m \rangle$ and $f = \langle f_1, f_2, \ldots, f_n \rangle$.

Lexical translation makes the following assumptions:
- Each word $e_i$ in $e$ is generated from exactly one word in $f$. Thus, we have an alignment $a_i$ that indicates which word $e_i$ came from; specifically, it came from $f_{a_i}$.
- Given the alignments $a$, translation decisions are conditionally independent of each other and depend only on the aligned source word $f_{a_i}$.
IBM Model 1

Simplest possible lexical translation model. Additional assumptions:
- The $m$ alignment decisions are independent.
- The alignment distribution for each $a_i$ is uniform over all source words and NULL.

For each $i \in \{1, 2, \ldots, m\}$:
  $a_i \sim \mathrm{Uniform}(0, 1, 2, \ldots, n)$
  $e_i \sim \mathrm{Categorical}(\theta_{f_{a_i}})$
IBM Model 1

For each $i \in \{1, 2, \ldots, m\}$:
  $a_i \sim \mathrm{Uniform}(0, 1, 2, \ldots, n)$
  $e_i \sim \mathrm{Categorical}(\theta_{f_{a_i}})$

$p(e_i, a_i \mid f, m) = \frac{1}{1+n}\, p(e_i \mid f_{a_i})$

$p(e, a \mid f, m) = \prod_{i=1}^{m} p(e_i, a_i \mid f, m) = \prod_{i=1}^{m} \frac{1}{1+n}\, p(e_i \mid f_{a_i})$
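To make the factorization concrete, here is a minimal Python sketch, assuming a hypothetical translation table `t` (a nested dict with `t[f_word][e_word] = p(e_word | f_word)`, including a `"NULL"` row); it is not code from the lecture.

```python
# A sketch of Model 1's joint probability; `t` is a hypothetical nested
# dict, t[f_word][e_word] = p(e_word | f_word), with a "NULL" entry.
def model1_joint(e, a, f, t):
    """p(e, a | f, m) = prod_i 1/(1+n) * p(e_i | f_{a_i}); a_i = 0 means NULL."""
    n = len(f)
    src = ["NULL"] + list(f)  # source positions 0..n, with NULL at 0
    p = 1.0
    for e_i, a_i in zip(e, a):
        p *= (1.0 / (1 + n)) * t[src[a_i]].get(e_i, 0.0)
    return p
```

For example, `model1_joint(["the", "house"], [1, 2], ["das", "Haus"], t)` scores the pair with each English word aligned to its German counterpart.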
Marginal probability

$p(e_i, a_i \mid f, m) = \frac{1}{1+n}\, p(e_i \mid f_{a_i})$

$p(e_i \mid f, m) = \sum_{a_i=0}^{n} \frac{1}{1+n}\, p(e_i \mid f_{a_i})$

Recall our independence assumptions: all alignment decisions are independent of each other, and given the alignments all translation decisions are independent of each other, so all translation decisions are independent of each other:

$p(a, b, c, d) = p(a)\, p(b)\, p(c)\, p(d)$

$p(e \mid f, m) = \prod_{i=1}^{m} p(e_i \mid f, m) = \prod_{i=1}^{m} \sum_{a_i=0}^{n} \frac{1}{1+n}\, p(e_i \mid f_{a_i}) = \frac{1}{(1+n)^m} \prod_{i=1}^{m} \sum_{a_i=0}^{n} p(e_i \mid f_{a_i})$
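A matching sketch of the marginal, reusing the hypothetical table `t` from above. The key point is that the sum over alignments factors per target position, so no enumeration of whole alignment vectors is needed.

```python
# A sketch of the Model 1 marginal p(e | f, m): alignments sum out
# independently at each target position.
def model1_marginal(e, f, t):
    n = len(f)
    src = ["NULL"] + list(f)
    p = 1.0
    for e_i in e:
        p *= sum(t[f_j].get(e_i, 0.0) for f_j in src) / (1 + n)
    return p
```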
Example

Source sentence, with NULL at position 0: 0 NULL, 1 das, 2 Haus, 3 ist, 4 klein.

Start with a foreign sentence and a target length, then generate the target one word at a time:

the
the house
the house is
the house is small
Example

Source sentence, with NULL at position 0: 0 NULL, 1 das, 2 Haus, 3 ist, 4 klein.

A different set of alignments generates the same words in a different order:

the
house the
house is the
house is small the

Model 1 is insensitive to word order: since the alignment distribution is uniform, every alignment is a priori equally probable.
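The generation process in these examples can be sketched as sampling, again over the hypothetical table `t`; `random.choices` plays the role of the categorical draw.

```python
# A sketch of Model 1's generative story for a fixed target length m:
# pick a uniform alignment, then draw the target word from the aligned
# source word's categorical distribution.
import random

def model1_generate(f, m, t):
    src = ["NULL"] + list(f)
    e = []
    for _ in range(m):
        a_i = random.randrange(len(src))            # a_i ~ Uniform(0, ..., n)
        words, probs = zip(*t[src[a_i]].items())    # e_i ~ Categorical(theta_{f_{a_i}})
        e.append(random.choices(words, weights=probs)[0])
    return e
```

Calling `model1_generate(["das", "Haus", "ist", "klein"], 4, t)` can produce either derivation shown above.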
Expected counts

freq(Buch, book) = ?   freq(das, book) = ?   freq(ein, book) = ?

If the alignments were observed, we would simply count:

$\text{freq}(\text{Buch}, \text{book}) = \sum_i \mathbb{I}[e_i = \text{book}, f_{a_i} = \text{Buch}]$

Since they are not, we take the expected count under the posterior over alignments:

$\mathbb{E}_{p_{w^{(1)}}(a \mid f=\text{das Buch},\, e=\text{the book})} \sum_i \mathbb{I}[e_i = \text{book}, f_{a_i} = \text{Buch}]$
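A sketch of these expected counts over a whole corpus, under the same hypothetical table `t`; `bitext` is an assumed list of `(f, e)` sentence pairs. Because Model 1's posterior factors per position, `p(a_i = j | e, f)` is just `t[f_j][e_i]` renormalized over `j`.

```python
# A sketch of the Model 1 E-step: expected (f_word, e_word) co-occurrence
# counts under the posterior over alignments.
from collections import defaultdict

def expected_counts(bitext, t):
    counts = defaultdict(float)                    # E[freq(f_word, e_word)]
    for f, e in bitext:
        src = ["NULL"] + list(f)
        for e_i in e:
            z = sum(t[f_j].get(e_i, 0.0) for f_j in src)
            for f_j in src:
                counts[(f_j, e_i)] += t[f_j].get(e_i, 0.0) / z
    return counts
```

The M step then just renormalizes these counts per source word to get the new translation table.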
Convergence
Evaluation

Since we have a probabilistic model, we can evaluate perplexity:

$\mathrm{PPL} = 2^{-\frac{1}{\sum_{(e,f) \in \mathcal{D}} |e|} \log_2 \prod_{(e,f) \in \mathcal{D}} p(e \mid f)}$

                   Iter 1   Iter 2   Iter 3   Iter 4   ...   Iter ...
-log likelihood       -      7.66     7.21     6.84    ...      6
perplexity            -      2.42     2.30     2.21    ...      2
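A sketch of the perplexity formula in code, reusing the earlier `model1_marginal` sketch: exponentiate the per-word average negative log2-likelihood over the parallel corpus.

```python
# A sketch of corpus perplexity under Model 1, per the formula above.
import math

def perplexity(bitext, t):
    log2_prob, n_words = 0.0, 0
    for f, e in bitext:
        log2_prob += math.log2(model1_marginal(e, f, t))
        n_words += len(e)
    return 2.0 ** (-log2_prob / n_words)
```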
Hidden Markov Models

GMMs, the aggregate bigram model, and Model 1 don't have conditional dependencies between random variables. Let's consider an example of a model where this is not the case:

$p(x) = \sum_{y \in \mathcal{Y}_x} \gamma(y_{|x|} \to \textit{stop}) \prod_{i=1}^{|x|} \gamma(y_{i-1} \to y_i)\, \omega(y_i \downarrow x_i)$
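For one fixed labeling $y$, the joint probability is a straightforward product. Here is a sketch with hypothetical `gamma`/`omega` parameter dicts and assumed `"start"`/`"stop"` states; none of these names come from the lecture.

```python
# A sketch of the HMM joint probability for a single labeling y:
# p(x, y) = gamma(y_n -> stop) * prod_i gamma(y_{i-1} -> y_i) * omega(y_i emits x_i)
def hmm_joint(x, y, gamma, omega):
    p = gamma.get((y[-1], "stop"), 0.0)
    prev = "start"                                 # y_0 is the start state
    for y_i, x_i in zip(y, x):
        p *= gamma.get((prev, y_i), 0.0) * omega.get((y_i, x_i), 0.0)
        prev = y_i
    return p
```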
EM for HMMs

What statistics are sufficient to determine the parameter values?
- $\text{freq}(q \downarrow x)$: How often does $q$ emit $x$?
- $\text{freq}(q \to r)$: How often does $q$ transition to $r$?
- $\text{freq}(q)$: How often do we visit $q$?

And of course:

$\text{freq}(q) = \sum_{r \in Q} \text{freq}(q \to r)$
[Trellis figure: observations a b b c b at positions $i = 1, \ldots, 5$, highlighting $\alpha_2(s_2)$ and $\beta_3(s_2)$.]

$p(y_2 = q, y_3 = r \mid x) \propto p(y_2 = q, y_3 = r, x)$

$= \frac{\alpha_2(q)\, \beta_3(r)\, \gamma(q \to r)\, \omega(r \downarrow x_3)}{\sum_{q', r' \in Q} \alpha_2(q')\, \beta_3(r')\, \gamma(q' \to r')\, \omega(r' \downarrow x_3)}$

$= \frac{\alpha_2(q)\, \beta_3(r)\, \gamma(q \to r)\, \omega(r \downarrow x_3)}{p(x)}, \quad \text{where } p(x) = \alpha_{|x|}(\textit{stop})$
The expectation over the full structure is then

$\mathbb{E}[\text{freq}(q \to r)] = \sum_{i=1}^{|x|} p(y_i = q, y_{i+1} = r \mid x)$

The expectation over state occupancy is

$\mathbb{E}[\text{freq}(q)] = \sum_{r \in Q} \mathbb{E}[\text{freq}(q \to r)]$

What is $\mathbb{E}[\text{freq}(q \downarrow x)]$?
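Putting the pieces together, a forward-backward sketch that computes these expected transition counts; `states`, `gamma`, and `omega` are the same hypothetical structures as above, with `gamma` assumed to include `("start", q)` and `(q, "stop")` entries.

```python
# A sketch of forward-backward expected transition counts for the HMM.
from collections import defaultdict

def expected_transition_counts(x, states, gamma, omega):
    T = len(x)
    # Forward: alpha[i][q] = p(x_1..x_i, y_i = q)
    alpha = [defaultdict(float) for _ in range(T + 1)]
    alpha[0]["start"] = 1.0
    for i in range(1, T + 1):
        for r in states:
            alpha[i][r] = omega.get((r, x[i - 1]), 0.0) * sum(
                alpha[i - 1][q] * gamma.get((q, r), 0.0) for q in alpha[i - 1])
    # Backward: beta[i][q] = p(x_{i+1}..x_T, stop | y_i = q)
    beta = [defaultdict(float) for _ in range(T + 1)]
    for q in states:
        beta[T][q] = gamma.get((q, "stop"), 0.0)
    for i in range(T - 1, 0, -1):
        for q in states:
            beta[i][q] = sum(gamma.get((q, r), 0.0) * omega.get((r, x[i]), 0.0)
                             * beta[i + 1][r] for r in states)
    px = sum(alpha[T][q] * beta[T][q] for q in states)   # p(x)
    # E[freq(q -> r)] = sum_i p(y_i = q, y_{i+1} = r | x)
    counts = defaultdict(float)
    for i in range(1, T):
        for q in states:
            for r in states:
                counts[(q, r)] += (alpha[i][q] * gamma.get((q, r), 0.0)
                                   * omega.get((r, x[i]), 0.0) * beta[i + 1][r]) / px
    return counts
```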
Random Restarts

Non-convex optimization only finds a local solution. Several strategies:
- Random restarts
- Simulated annealing
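As a sketch, random restarts just wrap the local optimizer; `run_em` and `log_likelihood` here are hypothetical stand-ins for whatever model is being trained.

```python
# A sketch of the random-restart strategy: run EM from several random
# initializations and keep the parameters with the best likelihood.
def em_with_restarts(data, run_em, log_likelihood, n_restarts=10):
    best_params, best_ll = None, float("-inf")
    for seed in range(n_restarts):
        params = run_em(data, seed=seed)           # random init per restart
        ll = log_likelihood(data, params)
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params
```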
Decipherment
Grammar Induction
Inductive Bias

A model can learn nothing without inductive bias... whence inductive bias?
- Model structure
- Priors (next week)
- Posterior regularization (Google it)
- Features provide a very flexible means to bias a model
EM with Features

Let's replace the multinomials with log-linear distributions:

$\gamma(q \to r) = \theta_{q,r} = \frac{\exp w^\top f(q, r)}{\sum_{r' \in Q} \exp w^\top f(q, r')}$

How will the likelihood of this model compare to the likelihood of the previous model?
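A sketch of this log-linear parameterization, with an assumed sparse feature function that returns the names of firing features and a weight dict `w`; the feature templates are illustrative, not the lecture's.

```python
# A sketch of the feature-based transition distribution: a softmax over
# successor states r for a fixed predecessor q.
import math

def transition_prob(q, r, states, w, features):
    """gamma(q -> r) = exp(w . f(q, r)) / sum_{r' in Q} exp(w . f(q, r'))."""
    def score(rp):
        return sum(w.get(feat, 0.0) for feat in features(q, rp))
    z = sum(math.exp(score(rp)) for rp in states)
    return math.exp(score(r)) / z

def features(q, r):
    """Example templates: one indicator per pair, plus a shared self-loop feature."""
    feats = ["trans:%s->%s" % (q, r)]
    if q == r:
        feats.append("self-loop")
    return feats
```

A feature shared across $(q, r)$ pairs, like the self-loop indicator, ties parameters across transitions; that tying is exactly the inductive bias features buy us.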
Learning Algorithm I

E step: given model parameters, compute the posterior distribution over transitions (states, etc.), then compute the expected feature counts

$\sum_{q,r} \mathbb{E}_{q(y)}[\text{freq}(q \to r)]\, f(q, r)$

These are your empirical expectations.
Learning Algorithm I

M step: the gradient of the expected log likelihood of $(x, y)$ under $q(y)$ is

$\nabla \mathbb{E}_{q(y)} \log p(x, y) = \mathbb{E}_q\!\left[\sum_{q,r} f(q, r)\right] - \sum_{q,r} \mathbb{E}_q[\text{freq}(q)]\, \mathbb{E}_{p(r \mid q; w)}[f(q, r)]$

Use L-BFGS or gradient descent to solve.
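A sketch of this gradient, consuming the expected transition counts from the forward-backward sketch and the `transition_prob`/`features` helpers above; as before, the data structures are assumptions, not the lecture's own code.

```python
# A sketch of the M-step gradient: empirical expected feature counts minus
# model-expected feature counts, weighted by state occupancy E[freq(q)].
from collections import defaultdict

def m_step_gradient(expected_trans, states, w, features):
    grad = defaultdict(float)
    # Empirical term: sum_{q,r} E[freq(q -> r)] f(q, r)
    for (q, r), count in expected_trans.items():
        for feat in features(q, r):
            grad[feat] += count
    # Model term: sum_{q,r} E[freq(q)] * p(r | q; w) * f(q, r)
    occupancy = defaultdict(float)
    for (q, r), count in expected_trans.items():
        occupancy[q] += count                      # E[freq(q)] = sum_r E[freq(q -> r)]
    for q in states:
        for r in states:
            p = transition_prob(q, r, states, w, features)
            for feat in features(q, r):
                grad[feat] -= occupancy[q] * p
    return grad
```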