EM with Features Nov. 19, 2013
Word Alignment

das Haus → the house
ein Buch → a book
das Buch → the book
Lexical Translation

Goal: a model $p(e \mid f, m)$ where $e$ and $f$ are complete English and foreign sentences, $e = \langle e_1, e_2, \ldots, e_m \rangle$ and $f = \langle f_1, f_2, \ldots, f_n \rangle$.

Lexical translation makes the following assumptions:
- Each word $e_i$ in $e$ is generated from exactly one word in $f$. Thus, we have an alignment $a_i$ that indicates which word $e_i$ came from; specifically, it came from $f_{a_i}$.
- Given the alignments $a$, translation decisions are conditionally independent of each other and depend only on the aligned source word $f_{a_i}$.
IBM Model 1

Simplest possible lexical translation model. Additional assumptions:
- The $m$ alignment decisions are independent.
- The alignment distribution for each $a_i$ is uniform over all source words and NULL.

For each $i \in \{1, 2, \ldots, m\}$:
  $a_i \sim \mathrm{Uniform}(0, 1, 2, \ldots, n)$
  $e_i \sim \mathrm{Categorical}(\theta_{f_{a_i}})$
IBM Model 1

For each $i \in \{1, 2, \ldots, m\}$:
  $a_i \sim \mathrm{Uniform}(0, 1, 2, \ldots, n)$
  $e_i \sim \mathrm{Categorical}(\theta_{f_{a_i}})$

$p(e_i, a_i \mid f, m) = \frac{1}{1+n}\, p(e_i \mid f_{a_i})$

$p(e, a \mid f, m) = \prod_{i=1}^{m} p(e_i, a_i \mid f, m) = \prod_{i=1}^{m} \frac{1}{1+n}\, p(e_i \mid f_{a_i})$
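To make the factorization concrete, here is a minimal Python sketch, assuming a hypothetical translation table `t` (a nested dict with `t[f_word][e_word] = p(e_word | f_word)`, including a `"NULL"` row); it is not code from the lecture.

```python
# A sketch of Model 1's joint probability; `t` is a hypothetical nested
# dict, t[f_word][e_word] = p(e_word | f_word), with a "NULL" entry.
def model1_joint(e, a, f, t):
    """p(e, a | f, m) = prod_i 1/(1+n) * p(e_i | f_{a_i}); a_i = 0 means NULL."""
    n = len(f)
    src = ["NULL"] + list(f)  # source positions 0..n, with NULL at 0
    p = 1.0
    for e_i, a_i in zip(e, a):
        p *= (1.0 / (1 + n)) * t[src[a_i]].get(e_i, 0.0)
    return p
```

For example, `model1_joint(["the", "house"], [1, 2], ["das", "Haus"], t)` scores the pair with each English word aligned to its German counterpart.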
Marginal probability

$p(e_i, a_i \mid f, m) = \frac{1}{1+n}\, p(e_i \mid f_{a_i})$

$p(e_i \mid f, m) = \sum_{a_i=0}^{n} \frac{1}{1+n}\, p(e_i \mid f_{a_i})$

Recall our independence assumptions: all alignment decisions are independent of each other, and given the alignments all translation decisions are independent of each other, so all translation decisions are independent of each other:

$p(a, b, c, d) = p(a)\, p(b)\, p(c)\, p(d)$

$p(e \mid f, m) = \prod_{i=1}^{m} p(e_i \mid f, m) = \prod_{i=1}^{m} \sum_{a_i=0}^{n} \frac{1}{1+n}\, p(e_i \mid f_{a_i}) = \frac{1}{(1+n)^m} \prod_{i=1}^{m} \sum_{a_i=0}^{n} p(e_i \mid f_{a_i})$
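A matching sketch of the marginal, reusing the hypothetical table `t` from above. The key point is that the sum over alignments factors per target position, so no enumeration of whole alignment vectors is needed.

```python
# A sketch of the Model 1 marginal p(e | f, m): alignments sum out
# independently at each target position.
def model1_marginal(e, f, t):
    n = len(f)
    src = ["NULL"] + list(f)
    p = 1.0
    for e_i in e:
        p *= sum(t[f_j].get(e_i, 0.0) for f_j in src) / (1 + n)
    return p
```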
Example

Source sentence, with NULL at position 0: 0 NULL, 1 das, 2 Haus, 3 ist, 4 klein.

Start with a foreign sentence and a target length, then generate the target one word at a time:

the
the house
the house is
the house is small
Example

Source sentence, with NULL at position 0: 0 NULL, 1 das, 2 Haus, 3 ist, 4 klein.

A different set of alignments generates the same words in a different order:

the
house the
house is the
house is small the

Model 1 is insensitive to word order: since the alignment distribution is uniform, every alignment is a priori equally probable.
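The generation process in these examples can be sketched as sampling, again over the hypothetical table `t`; `random.choices` plays the role of the categorical draw.

```python
# A sketch of Model 1's generative story for a fixed target length m:
# pick a uniform alignment, then draw the target word from the aligned
# source word's categorical distribution.
import random

def model1_generate(f, m, t):
    src = ["NULL"] + list(f)
    e = []
    for _ in range(m):
        a_i = random.randrange(len(src))            # a_i ~ Uniform(0, ..., n)
        words, probs = zip(*t[src[a_i]].items())    # e_i ~ Categorical(theta_{f_{a_i}})
        e.append(random.choices(words, weights=probs)[0])
    return e
```

Calling `model1_generate(["das", "Haus", "ist", "klein"], 4, t)` can produce either derivation shown above.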
Expected counts

freq(Buch, book) = ?   freq(das, book) = ?   freq(ein, book) = ?

If the alignments were observed, we would simply count:

$\text{freq}(\text{Buch}, \text{book}) = \sum_i \mathbb{I}[e_i = \text{book}, f_{a_i} = \text{Buch}]$

Since they are not, we take the expected count under the posterior over alignments:

$\mathbb{E}_{p_{w^{(1)}}(a \mid f=\text{das Buch},\, e=\text{the book})} \sum_i \mathbb{I}[e_i = \text{book}, f_{a_i} = \text{Buch}]$
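A sketch of these expected counts over a whole corpus, under the same hypothetical table `t`; `bitext` is an assumed list of `(f, e)` sentence pairs. Because Model 1's posterior factors per position, `p(a_i = j | e, f)` is just `t[f_j][e_i]` renormalized over `j`.

```python
# A sketch of the Model 1 E-step: expected (f_word, e_word) co-occurrence
# counts under the posterior over alignments.
from collections import defaultdict

def expected_counts(bitext, t):
    counts = defaultdict(float)                    # E[freq(f_word, e_word)]
    for f, e in bitext:
        src = ["NULL"] + list(f)
        for e_i in e:
            z = sum(t[f_j].get(e_i, 0.0) for f_j in src)
            for f_j in src:
                counts[(f_j, e_i)] += t[f_j].get(e_i, 0.0) / z
    return counts
```

The M step then just renormalizes these counts per source word to get the new translation table.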
Convergence
Evaluation

Since we have a probabilistic model, we can evaluate perplexity:

$\mathrm{PPL} = 2^{-\frac{1}{\sum_{(e,f) \in \mathcal{D}} |e|} \log_2 \prod_{(e,f) \in \mathcal{D}} p(e \mid f)}$

                   Iter 1   Iter 2   Iter 3   Iter 4   ...   Iter ...
-log likelihood       -      7.66     7.21     6.84    ...      6
perplexity            -      2.42     2.30     2.21    ...      2
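A sketch of the perplexity formula in code, reusing the earlier `model1_marginal` sketch: exponentiate the per-word average negative log2-likelihood over the parallel corpus.

```python
# A sketch of corpus perplexity under Model 1, per the formula above.
import math

def perplexity(bitext, t):
    log2_prob, n_words = 0.0, 0
    for f, e in bitext:
        log2_prob += math.log2(model1_marginal(e, f, t))
        n_words += len(e)
    return 2.0 ** (-log2_prob / n_words)
```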
Hidden Markov Models

GMMs, the aggregate bigram model, and Model 1 don't have conditional dependencies between random variables. Let's consider an example of a model where this is not the case:

$p(x) = \sum_{y \in \mathcal{Y}_x} \gamma(y_{|x|} \to \textit{stop}) \prod_{i=1}^{|x|} \gamma(y_{i-1} \to y_i)\, \omega(y_i \downarrow x_i)$
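For one fixed labeling $y$, the joint probability is a straightforward product. Here is a sketch with hypothetical `gamma`/`omega` parameter dicts and assumed `"start"`/`"stop"` states; none of these names come from the lecture.

```python
# A sketch of the HMM joint probability for a single labeling y:
# p(x, y) = gamma(y_n -> stop) * prod_i gamma(y_{i-1} -> y_i) * omega(y_i emits x_i)
def hmm_joint(x, y, gamma, omega):
    p = gamma.get((y[-1], "stop"), 0.0)
    prev = "start"                                 # y_0 is the start state
    for y_i, x_i in zip(y, x):
        p *= gamma.get((prev, y_i), 0.0) * omega.get((y_i, x_i), 0.0)
        prev = y_i
    return p
```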
EM for HMMs

What statistics are sufficient to determine the parameter values?
- $\text{freq}(q \downarrow x)$: How often does $q$ emit $x$?
- $\text{freq}(q \to r)$: How often does $q$ transition to $r$?
- $\text{freq}(q)$: How often do we visit $q$?

And of course:

$\text{freq}(q) = \sum_{r \in Q} \text{freq}(q \to r)$
[Trellis figure: observations a b b c b at positions $i = 1, \ldots, 5$, highlighting $\alpha_2(s_2)$ and $\beta_3(s_2)$.]

$p(y_2 = q, y_3 = r \mid x) \propto p(y_2 = q, y_3 = r, x)$

$= \frac{\alpha_2(q)\, \beta_3(r)\, \gamma(q \to r)\, \omega(r \downarrow x_3)}{\sum_{q', r' \in Q} \alpha_2(q')\, \beta_3(r')\, \gamma(q' \to r')\, \omega(r' \downarrow x_3)}$

$= \frac{\alpha_2(q)\, \beta_3(r)\, \gamma(q \to r)\, \omega(r \downarrow x_3)}{p(x)}, \quad \text{where } p(x) = \alpha_{|x|}(\textit{stop})$
The expectation over the full structure is then

$\mathbb{E}[\text{freq}(q \to r)] = \sum_{i=1}^{|x|} p(y_i = q, y_{i+1} = r \mid x)$

The expectation over state occupancy is

$\mathbb{E}[\text{freq}(q)] = \sum_{r \in Q} \mathbb{E}[\text{freq}(q \to r)]$

What is $\mathbb{E}[\text{freq}(q \downarrow x)]$?
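Putting the pieces together, a forward-backward sketch that computes these expected transition counts; `states`, `gamma`, and `omega` are the same hypothetical structures as above, with `gamma` assumed to include `("start", q)` and `(q, "stop")` entries.

```python
# A sketch of forward-backward expected transition counts for the HMM.
from collections import defaultdict

def expected_transition_counts(x, states, gamma, omega):
    T = len(x)
    # Forward: alpha[i][q] = p(x_1..x_i, y_i = q)
    alpha = [defaultdict(float) for _ in range(T + 1)]
    alpha[0]["start"] = 1.0
    for i in range(1, T + 1):
        for r in states:
            alpha[i][r] = omega.get((r, x[i - 1]), 0.0) * sum(
                alpha[i - 1][q] * gamma.get((q, r), 0.0) for q in alpha[i - 1])
    # Backward: beta[i][q] = p(x_{i+1}..x_T, stop | y_i = q)
    beta = [defaultdict(float) for _ in range(T + 1)]
    for q in states:
        beta[T][q] = gamma.get((q, "stop"), 0.0)
    for i in range(T - 1, 0, -1):
        for q in states:
            beta[i][q] = sum(gamma.get((q, r), 0.0) * omega.get((r, x[i]), 0.0)
                             * beta[i + 1][r] for r in states)
    px = sum(alpha[T][q] * beta[T][q] for q in states)   # p(x)
    # E[freq(q -> r)] = sum_i p(y_i = q, y_{i+1} = r | x)
    counts = defaultdict(float)
    for i in range(1, T):
        for q in states:
            for r in states:
                counts[(q, r)] += (alpha[i][q] * gamma.get((q, r), 0.0)
                                   * omega.get((r, x[i]), 0.0) * beta[i + 1][r]) / px
    return counts
```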
Random Restarts

Non-convex optimization only finds a local solution. Several strategies:
- Random restarts
- Simulated annealing
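As a sketch, random restarts just wrap the local optimizer; `run_em` and `log_likelihood` here are hypothetical stand-ins for whatever model is being trained.

```python
# A sketch of the random-restart strategy: run EM from several random
# initializations and keep the parameters with the best likelihood.
def em_with_restarts(data, run_em, log_likelihood, n_restarts=10):
    best_params, best_ll = None, float("-inf")
    for seed in range(n_restarts):
        params = run_em(data, seed=seed)           # random init per restart
        ll = log_likelihood(data, params)
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params
```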
Decipherment
Grammar Induction
Inductive Bias

A model can learn nothing without inductive bias... whence inductive bias?
- Model structure
- Priors (next week)
- Posterior regularization (Google it)
- Features provide a very flexible means to bias a model
EM with Features

Let's replace the multinomials with log-linear distributions:

$\gamma(q \to r) = \theta_{q,r} = \frac{\exp w^\top f(q, r)}{\sum_{r' \in Q} \exp w^\top f(q, r')}$

How will the likelihood of this model compare to the likelihood of the previous model?
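A sketch of this log-linear parameterization, with an assumed sparse feature function that returns the names of firing features and a weight dict `w`; the feature templates are illustrative, not the lecture's.

```python
# A sketch of the feature-based transition distribution: a softmax over
# successor states r for a fixed predecessor q.
import math

def transition_prob(q, r, states, w, features):
    """gamma(q -> r) = exp(w . f(q, r)) / sum_{r' in Q} exp(w . f(q, r'))."""
    def score(rp):
        return sum(w.get(feat, 0.0) for feat in features(q, rp))
    z = sum(math.exp(score(rp)) for rp in states)
    return math.exp(score(r)) / z

def features(q, r):
    """Example templates: one indicator per pair, plus a shared self-loop feature."""
    feats = ["trans:%s->%s" % (q, r)]
    if q == r:
        feats.append("self-loop")
    return feats
```

A feature shared across $(q, r)$ pairs, like the self-loop indicator, ties parameters across transitions; that tying is exactly the inductive bias features buy us.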
Learning Algorithm I

E step: given model parameters, compute the posterior distribution over transitions (states, etc.), then compute the expected feature counts

$\sum_{q,r} \mathbb{E}_{q(y)}[\text{freq}(q \to r)]\, f(q, r)$

These are your empirical expectations.
Learning Algorithm I

M step: the gradient of the expected log likelihood of $(x, y)$ under $q(y)$ is

$\nabla \mathbb{E}_{q(y)} \log p(x, y) = \mathbb{E}_q\!\left[\sum_{q,r} f(q, r)\right] - \sum_{q,r} \mathbb{E}_q[\text{freq}(q)]\, \mathbb{E}_{p(r \mid q; w)}[f(q, r)]$

Use L-BFGS or gradient descent to solve.
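A sketch of this gradient, consuming the expected transition counts from the forward-backward sketch and the `transition_prob`/`features` helpers above; as before, the data structures are assumptions, not the lecture's own code.

```python
# A sketch of the M-step gradient: empirical expected feature counts minus
# model-expected feature counts, weighted by state occupancy E[freq(q)].
from collections import defaultdict

def m_step_gradient(expected_trans, states, w, features):
    grad = defaultdict(float)
    # Empirical term: sum_{q,r} E[freq(q -> r)] f(q, r)
    for (q, r), count in expected_trans.items():
        for feat in features(q, r):
            grad[feat] += count
    # Model term: sum_{q,r} E[freq(q)] * p(r | q; w) * f(q, r)
    occupancy = defaultdict(float)
    for (q, r), count in expected_trans.items():
        occupancy[q] += count                      # E[freq(q)] = sum_r E[freq(q -> r)]
    for q in states:
        for r in states:
            p = transition_prob(q, r, states, w, features)
            for feat in features(q, r):
                grad[feat] -= occupancy[q] * p
    return grad
```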