Lecture 12: EM Algorithm
Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/nlp16
CS6501 Natural Language Processing
Three basic problems for HMMs
• Likelihood of the input: how likely is it that the sentence "I love cat" occurs?
  • Forward algorithm
• Decoding (tagging) the input: what are the POS tags of "I love cat"?
  • Viterbi algorithm
• Estimation (learning): how do we learn the model?
  • Find the best model parameters
  • Case 1: supervised, tags are annotated
    • Maximum likelihood estimation (MLE)
  • Case 2: unsupervised, only unannotated text
    • Forward-backward algorithm
EM algorithm
• POS induction: can we tag POS without annotated data?
• An old idea
• Good mathematical intuition
• Tutorial papers:
  • ftp://ftp.icsi.berkeley.edu/pub/techreports/1997/tr-97-021.pdf
  • http://people.csail.mit.edu/regina/6864/em_notes_mike.pdf
Hard EM (Intuition)
• We don't know the hidden states (i.e., POS tags)
• If we know the model, we can tag the input; if we know the tags, we can estimate the model
Recap: Learning from Labeled Data
• If we know the hidden states (labels),
• we count how often we see t_{i-1} t_i and w_i t_i, then normalize.
[Example: two labeled observation sequences, 2 3 2 and 2 3 1 2 3 2, with their tags shown]
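The count-and-normalize step can be written down directly. Below is a minimal Python sketch, assuming labeled data given as lists of (word, tag) pairs and a bigram transition model with a hypothetical "<s>" start symbol (the lecture's later formulas use a trigram model, but the idea is the same):

```python
from collections import Counter, defaultdict

def normalize(counts):
    """Turn a dict of Counters into conditional probability tables."""
    return {cond: {x: c / sum(ctr.values()) for x, c in ctr.items()}
            for cond, ctr in counts.items()}

def mle_estimate(tagged_sentences):
    """MLE for an HMM tagger: count t_{i-1} -> t_i and t_i -> w_i, then normalize."""
    trans = defaultdict(Counter)   # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)    # emit[tag][word]      = count
    for sent in tagged_sentences:
        prev = "<s>"               # start-of-sentence symbol (an assumption of this sketch)
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return normalize(trans), normalize(emit)

# e.g. P_trans, P_emit = mle_estimate([[("I", "PRON"), ("love", "VERB"), ("cats", "NOUN")]])
```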
Recap: Tagging the input
• If we know the model, we can find the best tag sequence
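For the tagging step, here is a corresponding Viterbi sketch under the same bigram assumptions; the prob helper and the "<s>" start symbol are conventions of this sketch, not the lecture's exact implementation:

```python
def prob(table, cond, x):
    """Look up P(x | cond) in a dict-of-dicts table, defaulting to 0."""
    return table.get(cond, {}).get(x, 0.0)

def viterbi(words, tags, P_trans, P_emit):
    """Most probable tag sequence for one sentence under a bigram HMM."""
    scores = [{t: prob(P_trans, "<s>", t) * prob(P_emit, t, words[0]) for t in tags}]
    backptr = []
    for w in words[1:]:
        row, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: scores[-1][p] * prob(P_trans, p, t))
            row[t] = scores[-1][best_prev] * prob(P_trans, best_prev, t) * prob(P_emit, t, w)
            ptr[t] = best_prev
        scores.append(row)
        backptr.append(ptr)
    best = max(tags, key=lambda t: scores[-1][t])
    path = [best]
    for ptr in reversed(backptr):      # follow back-pointers to recover the sequence
        path.append(ptr[path[-1]])
    return list(reversed(path))
```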
Hard EM (Intuition)
• We don't know the hidden states (i.e., POS tags)
1. Let's guess!
2. Then, we have labels; we can estimate the model
3. Check if the model is consistent with the labels we guessed; if not, go to Step 1
Let's make a guess
Model (two hidden states A and B; unknown entries marked ?):

            P(·|A)   P(·|B)   P(·|Start)
  P(1|·)      ?        0         -
  P(2|·)      ?        ?         -
  P(3|·)      0        ?         -
  P(A|·)     0.8      0.2       0.5
  P(B|·)     0.2      0.8       0.5

Tag the observation sequences (all tags unknown so far):
  ? ? ?          over  2 3 2
  ? ? ? ? ? ?    over  2 3 1 2 3 2
These are obvious
Since P(1|B) = 0 and P(3|A) = 0, every "3" must be tagged B and the "1" must be tagged A; the remaining tokens are still unknown.
Guess more
Guess tags for some of the remaining tokens as well.
Guess all of them
Every token in 2 3 2 and 2 3 1 2 3 2 now carries a guessed tag. Now we can estimate the model by MLE.
Is our guess consistent with the model?
Model estimated from the guessed tags:

            P(·|A)   P(·|B)   P(·|Start)
  P(1|·)     0.5       0         -
  P(2|·)     0.5     0.625       -
  P(3|·)      0      0.375       -
  P(A|·)     0.8      0.2       0.5
  P(B|·)     0.2      0.8       0.5
How to find the latent states based on our model? Viterbi!
Re-tag 2 3 2 and 2 3 1 2 3 2 with the Viterbi algorithm under the estimated model (tags unknown again: ? ? ? and ? ? ? ? ? ?).
Something is wrong
The tag sequences from Viterbi do not match the tags we guessed.
It's fine. Let's do it again
Re-estimate the model from the Viterbi tags:

            P(·|A)   P(·|B)   P(·|Start)
  P(1|·)      1        0         -
  P(2|·)      0       0.7        -
  P(3|·)      0       0.3        -
  P(A|·)     0.8      0.2       0.5
  P(B|·)     0.2      0.8       0.5
This time it is consistent
Running Viterbi with the new model returns the same tag sequences we used to estimate it, so we stop.
Only one solution? No! EM is sensitive to initialization
A different initial guess can converge to a different model, e.g.:

            P(·|A)   P(·|B)   P(·|Start)
  P(1|·)     0.22      0         -
  P(2|·)     0.77      0         -
  P(3|·)      0        1         -
  P(A|·)     0.8      0.2       0.5
  P(B|·)     0.2      0.8       0.5
How about this?
What if the transition probabilities are unknown too?

            P(·|A)   P(·|B)   P(·|Start)
  P(1|·)      ?        0         -
  P(2|·)      ?        ?         -
  P(3|·)      0        ?         -
  P(A|·)      ?        ?        0.5
  P(B|·)      ?        ?        0.5

Tags for 2 3 2 and 2 3 1 2 3 2 are also unknown.
Hard EM
• We don't know the hidden states (i.e., POS tags)
1. Let's guess based on our model!
   • Find the best sequence using the Viterbi algorithm
2. Then, we have labels; we can estimate the model
   • Maximum likelihood estimation
3. Check if the model is consistent with the labels we guessed; if not, go to Step 1
Soft EM
• We don't know the hidden states (i.e., POS tags)
1. Let's guess based on our model!
   • Find the best sequence using the Viterbi algorithm
2. Then, we have labels; we can estimate the model → Let's use expected counts instead!
   • Maximum likelihood estimation
3. Check if the model is consistent with the labels we guessed; if not, go to Step 1
Expected Counts
Model:

            P(·|A)   P(·|B)   P(·|Start)
  P(1|·)     0.8       0         -
  P(2|·)     0.2      0.2        -
  P(3|·)      0       0.8        -
  P(A|·)     0.8      0.2       0.5
  P(B|·)     0.2      0.8       0.5

A three-token observation sequence with unknown tags: ? ? ?
Expected Counts
Some tag sequences are more likely to occur than others (model as on the previous slide).
Expected Counts
The four possible tag sequences for the three-token example have probabilities 0.01024, 0.00256, 0.00064, and 0.00256.
Expected Counts
Assume we draw 100,000 random samples; then the four tag sequences occur about 1024, 256, 64, and 256 times.
Expected Counts
Let's update the model using these counts: 1024, 256, 64, 256.
Expected Counts
Let's update the model.
How many A→A transitions? 1024*2 + 256 = 2304
Expected Counts
How many A→A transitions? 1024*2 + 256 = 2304
How many A's? 1024*3 + 256*2 + 64*2 + 256 = 3968
P(A|A)? 2304/3968 ≈ 0.58
Expected Counts
P(A|A)? 2304/3968 ≈ 0.58
In the model table, P(A|A) is updated from 0.8 to 0.58. Do this for all the other entries!
Are we done yet?
• What if we have 45 tags?
• What if our sentences have 20 tokens...?
• We need an efficient algorithm again!
Expected Counts
P(A|A)? 2304/3968 ≈ 0.58
P(1|A)? (1024+256+256+64)/3968 = 0.403
(The same computation written with the sequence probabilities 0.01024, 0.00064, 0.00256, 0.00256 instead of the sample counts.)
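The arithmetic on these slides can be checked with a few lines of Python. The tag sequences below (AAA, AAB, ABA, ABB) are one assignment consistent with the slide's counts; the original slides showed them pictorially, so the exact sequences are an assumption of this sketch:

```python
# Sample counts for the four possible taggings of the three-token example
samples = {"AAA": 1024, "AAB": 256, "ABA": 64, "ABB": 256}

count_AA = sum(n * sum(1 for x, y in zip(seq, seq[1:]) if x + y == "AA")
               for seq, n in samples.items())                     # A -> A transitions
count_A = sum(n * seq.count("A") for seq, n in samples.items())   # tokens tagged A

print(count_AA, count_A, round(count_AA / count_A, 3))   # 2304 3968 0.581
print(round(sum(samples.values()) / count_A, 3))          # 0.403: the "1" is tagged A once per sample
```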
In general
Same model as before; we now want the expected counts for an arbitrary observation sequence w_1 ... w_n.
In general
Let's say #words = n. We want P(w_{1..n}, t_k = q): the probability of the whole sequence with tag q at position i = k.
In general
P(w_{1..n}, t_k = q) = P(w_{1..k}, t_k = q) · P(w_{k+1..n} | t_k = q)
• P(w_{1..k}, t_k = q): probability of w_1 ... w_k and tag q at position k
• P(w_{k+1..n} | t_k = q): probability of w_{k+1} ... w_n given tag q at position k
In general
P(w_{1..n}, t_k = q) = P(w_{1..k}, t_k = q) · P(w_{k+1..n} | t_k = q)
• P(w_{1..k}, t_k = q) = Σ_{t_{1..k-1}} P(w_{1..k}, t_{1..k-1}, t_k = q): can be computed by the forward algorithm
• P(w_{k+1..n} | t_k = q) = Σ_{t_{k+1..n}} P(w_{k+1..n}, t_{k+1..n} | t_k = q): can be computed by the backward algorithm
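In the forward/backward notation used on the following slides, the same identities read as follows (a LaTeX restatement; the expression of the likelihood as a sum over the final forward scores follows from the forward algorithm mentioned earlier):

```latex
\begin{align*}
\alpha_k(q) &= P(w_{1..k},\, t_k = q) && \text{(forward probability)}\\
\beta_k(q)  &= P(w_{k+1..n} \mid t_k = q) && \text{(backward probability)}\\
P(w_{1..n},\, t_k = q) &= \alpha_k(q)\,\beta_k(q), \qquad
P(w_{1..n} \mid \lambda) = \sum_{q} \alpha_n(q)
\end{align*}
```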
Forward algorithm
Induction: α_k(q) = P(w_k | q) · Σ_{q'} α_{k-1}(q') · P(q | q')
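A direct implementation of this recursion, under the same bigram sketch as before (probability space, no log-space yet; the prob helper and "<s>" start symbol come from the earlier sketches):

```python
def forward(words, tags, P_trans, P_emit):
    """alpha[k][q] = P(w_1 .. w_{k+1}, t_{k+1} = q), with 0-based positions."""
    alpha = [{q: prob(P_trans, "<s>", q) * prob(P_emit, q, words[0]) for q in tags}]
    for w in words[1:]:
        alpha.append({q: prob(P_emit, q, w) *
                         sum(alpha[-1][p] * prob(P_trans, p, q) for p in tags)
                      for q in tags})
    return alpha   # sum(alpha[-1].values()) = P(w | lambda)
```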
Backward algorithm
• P(w_{k+1..n} | t_k = q') = Σ_q P(w_{k+2..n} | t_{k+1} = q) · P(q | q') · P(w_{k+1} | q)
• β_k(q') = Σ_q β_{k+1}(q) · P(q | q') · P(w_{k+1} | q)
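And the matching backward recursion under the same assumptions (beta at the last position is 1):

```python
def backward(words, tags, P_trans, P_emit):
    """beta[k][q] = P(w_{k+2} .. w_n | t_{k+1} = q), with 0-based positions."""
    n = len(words)
    beta = [dict() for _ in range(n)]
    beta[n - 1] = {q: 1.0 for q in tags}          # base case: empty suffix
    for k in range(n - 2, -1, -1):
        beta[k] = {qp: sum(beta[k + 1][q] * prob(P_trans, qp, q) * prob(P_emit, q, words[k + 1])
                           for q in tags)
                   for qp in tags}
    return beta
```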
In general
P(w_{1..n}, t_k = q) = P(w_{1..k}, t_k = q) · P(w_{k+1..n} | t_k = q)
• P(w_{1..k}, t_k = q) = Σ_{t_{1..k-1}} P(w_{1..k}, t_{1..k-1}, t_k = q): can be computed by the forward algorithm
• P(w_{k+1..n} | t_k = q) = Σ_{t_{k+1..n}} P(w_{k+1..n}, t_{k+1..n} | t_k = q): can be computed by the backward algorithm
Emission Counts
Expected count of (2, q) over the expected count of q:
P(2 | q) = Σ_i P(w_i = 2, t_i = q, w_{1..n}) / Σ_i P(t_i = q, w_{1..n})
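With α and β in hand, the posterior P(t_k = q | w) is α_k(q)·β_k(q)/P(w), and the expected emission counts are these posteriors summed per (tag, word) pair. A sketch building on the forward/backward functions and imports from the earlier sketches:

```python
def expected_emission_counts(words, tags, P_trans, P_emit):
    """Accumulate E[count(tag q emits word w)] for one sentence."""
    alpha = forward(words, tags, P_trans, P_emit)
    beta = backward(words, tags, P_trans, P_emit)
    Z = sum(alpha[-1].values())                    # P(w | lambda)
    counts = defaultdict(Counter)
    for k, w in enumerate(words):
        for q in tags:
            counts[q][w] += alpha[k][q] * beta[k][q] / Z   # P(t_k = q | w)
    return counts    # normalize each counts[q] to re-estimate P(w | q)
```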
How about the transition counts?
P(w_{1..n}, t_k = q, t_{k+1} = q')
  = P(w_{1..k}, t_k = q) · P(w_{k+2..n} | t_{k+1} = q') · P(q' | q) · P(w_{k+1} | q')
  = α_k(q) · β_{k+1}(q') · P(q' | q) · P(w_{k+1} | q')
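Dividing this joint by P(w) gives the posterior P(t_k = q, t_{k+1} = q' | w), which is what gets accumulated as an expected transition count. A sketch in the same style as above:

```python
def expected_transition_counts(words, tags, P_trans, P_emit):
    """Accumulate E[count(q -> q')] for one sentence."""
    alpha = forward(words, tags, P_trans, P_emit)
    beta = backward(words, tags, P_trans, P_emit)
    Z = sum(alpha[-1].values())                    # P(w | lambda)
    counts = defaultdict(Counter)
    for k in range(len(words) - 1):
        for q in tags:
            for qp in tags:
                joint = (alpha[k][q] * prob(P_trans, q, qp)
                         * prob(P_emit, qp, words[k + 1]) * beta[k + 1][qp])
                counts[q][qp] += joint / Z         # P(t_k = q, t_{k+1} = q' | w)
    return counts    # normalize each counts[q] to re-estimate P(q' | q)
```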
Three basic problems for HMMs
• Likelihood of the input: how likely is it that the sentence "I love cat" occurs?
  • Forward algorithm
• Decoding (tagging) the input: what are the POS tags of "I love cat"?
  • Viterbi algorithm
• Estimation (learning): how do we learn the model?
  • Find the best model parameters
  • Case 1: supervised, tags are annotated
    • Maximum likelihood estimation (MLE)
  • Case 2: unsupervised, only unannotated text
    • Forward-backward algorithm
Trick: computing everything in log space
• Homework:
  • Write the forward, backward, and Viterbi algorithms in log space
  • Hint: you need a function to compute log(a+b)
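The log(a+b) helper mentioned in the hint is usually written with the standard log-sum trick; the sketch below is one common way to do it, not the assigned solution:

```python
import math

def log_add(log_a, log_b):
    """Return log(a + b) given log a and log b, staying in log space."""
    if log_a == -math.inf:
        return log_b
    if log_b == -math.inf:
        return log_a
    m = max(log_a, log_b)                              # factor out the larger term
    return m + math.log1p(math.exp(min(log_a, log_b) - m))
```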
Behind the scenes
• What is EM optimizing?
• The log-likelihood of the input!
• log P(w | λ)
• log P(w | λ) = log Σ_t P(w, t | λ) = log Σ_t Π_{i=1..n} P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i)
  This is hard: log Σ Π is hard.
• In contrast, in the supervised situation we optimize log P(w, t | λ):
  log P(w, t | λ) = log Π_{i=1..n} P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i) = Σ_i (log P(t_i | t_{i-1}, t_{i-2}) + log P(w_i | t_i))
  log Π = Σ log is easy.
Intuition of EM (from the optimization perspective)
f(λ) = log P(w | λ) = log Σ_t P(w, t | λ)
[Figure: the curve f(λ) with lower bounds g_m and g_{m+1}, and iterates λ^{(m)}, λ^{(m+1)}, λ^{(m+2)}]
Key idea:
1. Define g_m(λ) such that f(λ) ≥ g_m(λ) for all λ, and f(λ^{(m)}) = g_m(λ^{(m)})
2. Optimize g_m(λ)
Hard EM and soft EM define different g_m(λ).
g_m(λ) for soft EM
• log Σ_t P(w, t | λ) = log Σ_t P(t | w, λ^{(m)}) · P(w, t | λ) / P(t | w, λ^{(m)})
• Jensen's inequality: if Σ_x p(x) = 1, then log Σ_x f(x) p(x) ≥ Σ_x p(x) log f(x)
• Therefore log P(w | λ) ≥ Σ_t P(t | w, λ^{(m)}) log [ P(w, t | λ) / P(t | w, λ^{(m)}) ] = g_m(λ)
Does g_m(λ^{(m)}) = f(λ^{(m)})?
• f(λ^{(m)}) = log Σ_t P(w, t | λ^{(m)}) = log P(w | λ^{(m)})
• g_m(λ^{(m)}) = Σ_t P(t | w, λ^{(m)}) log [ P(w, t | λ^{(m)}) / P(t | w, λ^{(m)}) ]
  = Σ_t P(t | w, λ^{(m)}) log P(w | λ^{(m)})     (since P(w, t | λ^{(m)}) / P(t | w, λ^{(m)}) = P(w | λ^{(m)}))
  = log P(w | λ^{(m)}) · Σ_t P(t | w, λ^{(m)}) = log P(w | λ^{(m)})
So the bound is tight at λ^{(m)}.
Intuition of EM (from the optimization perspective)
f(λ) = log P(w | λ) = log Σ_t P(w, t | λ)
Key idea:
1. Define g_m(λ) such that f(λ) ≥ g_m(λ) for all λ, and f(λ^{(m)}) = g_m(λ^{(m)})
2. Optimize g_m(λ)
Soft EM defines g_m(λ) = Σ_t P(t | w, λ^{(m)}) log [ P(w, t | λ) / P(t | w, λ^{(m)}) ]
Optimizing g_m(λ)
g_m(λ) = Σ_t P(t | w, λ^{(m)}) log [ P(w, t | λ) / P(t | w, λ^{(m)}) ]
       = Σ_t P(t | w, λ^{(m)}) (log P(w, t | λ) - log P(t | w, λ^{(m)}))
The second term does not depend on λ, so
max_λ g_m(λ) = max_λ Σ_t P(t | w, λ^{(m)}) log P(w, t | λ)
             = max_λ Σ_t P(t | w, λ^{(m)}) Σ_i (log P(t_i | t_{i-1}, t_{i-2}) + log P(w_i | t_i))
We know how to solve this! As in the supervised learning case:
log P(w, t | λ) = log Π_{i=1..n} P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i) = Σ_i (log P(t_i | t_{i-1}, t_{i-2}) + log P(w_i | t_i))
log Π = Σ log is easy.