Lecture 12: EM Algorithm

1 Lecture 12: EM Algorithm
Kai-Wei Chang, University of Virginia, kw@kwchang.net
CS6501 Natural Language Processing

2 Three basic problems for HMMs
- Likelihood of the input: how likely is it that the sentence "I love cat" occurs?
  - Forward algorithm
- Decoding (tagging) the input: what are the POS tags of "I love cat"?
  - Viterbi algorithm
- Estimation (learning): how do we learn the model?
  - Find the best model parameters
  - Case 1: supervised, tags are annotated: maximum likelihood estimation (MLE)
  - Case 2: unsupervised, only unannotated text: forward-backward algorithm

3 EM algorithm
- POS induction: can we tag POS without annotated data?
- An old idea
- Good mathematical intuition
- Tutorial paper: ftp://ftp.icsi.berkeley.edu/pub/techreports/1997/t r pdf
- s_mike.pdf

4 Hard EM (Intuition)
- We don't know the hidden states (i.e., POS tags)
- If we know the model

5 Recap: Learning from Labeled Data
- If we know the hidden states (labels), we count how often we see the tag pair $t_{i-1}\, t_i$ and the word-tag pair $w_i\, t_i$, then normalize
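The count-and-normalize recipe can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name `mle_estimate`, the `<START>` symbol, and the toy tagged corpus are all assumptions made for the example.

```python
from collections import Counter

def mle_estimate(tagged_sentences):
    """Count tag bigrams and (tag, word) pairs, then normalize into probabilities."""
    trans_counts = Counter()   # (previous tag, tag) -> count
    emit_counts = Counter()    # (tag, word) -> count
    for sentence in tagged_sentences:
        prev = "<START>"       # hypothetical start symbol
        for word, tag in sentence:
            trans_counts[(prev, tag)] += 1
            emit_counts[(tag, word)] += 1
            prev = tag
    # Normalize: divide each count by the total count of its conditioning tag.
    prev_totals = Counter()
    for (prev, _), c in trans_counts.items():
        prev_totals[prev] += c
    tag_totals = Counter()
    for (tag, _), c in emit_counts.items():
        tag_totals[tag] += c
    trans = {k: c / prev_totals[k[0]] for k, c in trans_counts.items()}
    emit = {k: c / tag_totals[k[0]] for k, c in emit_counts.items()}
    return trans, emit

# Toy corpus (made up for illustration).
corpus = [
    [("I", "PRON"), ("love", "VERB"), ("cat", "NOUN")],
    [("I", "PRON"), ("love", "VERB"), ("dogs", "NOUN")],
]
trans, emit = mle_estimate(corpus)
```

Here `trans[(a, b)]` estimates P(b | a) and `emit[(t, w)]` estimates P(w | t); unseen pairs are simply absent, so a real tagger would probably add smoothing.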

6 Recap: Tagging the input
- If we know the model, we can find the best tag sequence

7 Hard EM (Intuition)
- We don't know the hidden states (i.e., POS tags)
1. Let's guess!
2. Then we have labels, so we can estimate the model
3. Check whether the model is consistent with the labels we guessed; if not, go back to Step 1

8 Let's make a guess
[Table: columns P(· | first hidden state), P(· | second hidden state), P(· | Start); rows are the emissions 1, 2, 3 and the transitions into each hidden state. Known entries: the second state never emits 1 and the first state never emits 3; all other entries are "?". Observation sequence: ... 2 3 2 ..., with every tag still "?".]

9 These are obvious
[Table: same layout; some of the tags around 2 3 2 are filled in, the rest are still "?".]

10 Guess more
[Table: same layout; only two tags remain "?".]

11 Guess all of them
Now we can estimate the model by MLE.
[Table: same layout; every tag has been guessed.]

12 Is our guess consistent with the model?
[Table: model probabilities; values not recoverable]

13 How do we find the latent states based on our model? Viterbi!
[Table: model probabilities; the tags around the observations 2 3 2 are "?" again]

14 Something wrong
[Table: model probabilities; two tag rows labeled "From Viterbi"]

15 It's fine. Let's do it again
[Table: model probabilities; values not recoverable]

16 This time it is consistent
[Table: model probabilities; two tag rows labeled "From Viterbi"]

17 Only one solution? No! EM is sensitive to initialization
[Table: model probabilities; values not recoverable]

18 How about this?
[Table: a different initialization. The second hidden state never emits 1, the first never emits 3, both transitions from Start are 0.5, and all other entries are "?". Every tag around the observations 2 3 2 is "?".]

19 Hard EM
- We don't know the hidden states (i.e., POS tags)
1. Let's guess based on our model!
   - Find the best sequence using the Viterbi algorithm
2. Then we have labels, so we can estimate the model
   - Maximum likelihood estimation
3. Check whether the model is consistent with the labels we guessed; if not, go back to Step 1
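The three steps can be sketched as a loop in Python. This is only a sketch under stated assumptions: the state names `A` and `B`, the initial probabilities, the observation sequence, and the additive smoothing constant in `reestimate` are all invented for illustration (smoothing keeps every probability positive, so it is not the pure MLE of the slide).

```python
import math
from collections import Counter

def _log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs, states, start, trans, emit):
    """Return the most probable state sequence for obs under the model."""
    # Each column maps state -> (best log-score of a path ending there, backpointer).
    cols = [{q: (_log(start[q]) + _log(emit[q][obs[0]]), None) for q in states}]
    for w in obs[1:]:
        col = {}
        for q in states:
            prev = max(states, key=lambda r: cols[-1][r][0] + _log(trans[r][q]))
            col[q] = (cols[-1][prev][0] + _log(trans[prev][q]) + _log(emit[q][w]), prev)
        cols.append(col)
    q = max(states, key=lambda r: cols[-1][r][0])
    path = [q]
    for col in reversed(cols[1:]):   # follow backpointers to the first position
        q = col[q][1]
        path.append(q)
    return path[::-1]

def reestimate(obs, tags, states, vocab, smooth=0.1):
    """Count and normalize on the guessed tags (with assumed additive smoothing)."""
    start = {q: (smooth + (tags[0] == q)) / (1 + smooth * len(states)) for q in states}
    trans, emit = {}, {}
    for q in states:
        t_counts = Counter(b for a, b in zip(tags, tags[1:]) if a == q)
        e_counts = Counter(w for w, t in zip(obs, tags) if t == q)
        t_total = sum(t_counts.values()) + smooth * len(states)
        e_total = sum(e_counts.values()) + smooth * len(vocab)
        trans[q] = {r: (t_counts[r] + smooth) / t_total for r in states}
        emit[q] = {w: (e_counts[w] + smooth) / e_total for w in vocab}
    return start, trans, emit

def hard_em(obs, states, vocab, start, trans, emit, max_iters=50):
    tags = None
    for _ in range(max_iters):
        new_tags = viterbi(obs, states, start, trans, emit)        # Step 1: guess labels
        if new_tags == tags:                                       # Step 3: consistent, stop
            break
        tags = new_tags
        start, trans, emit = reestimate(obs, tags, states, vocab)  # Step 2: estimate model
    return tags, (start, trans, emit)

# Hypothetical two-state model and observation sequence.
states, vocab = ["A", "B"], [1, 2, 3]
obs = [2, 3, 2, 1, 1, 3, 2]
start = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.6, "B": 0.4}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {1: 0.5, 2: 0.4, 3: 0.1}, "B": {1: 0.1, 2: 0.4, 3: 0.5}}
tags, model = hard_em(obs, states, vocab, start, trans, emit)
```

Starting from a different initial `emit` table can lead to a different final tagging, which is the sensitivity to initialization noted on slide 17.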

20 Soft EM
- We don't know the hidden states (i.e., POS tags)
1. Let's guess based on our model!
   - (Hard EM: find the single best sequence using the Viterbi algorithm)
2. Instead of treating one guessed tag sequence as the labels, let's use expected counts to estimate the model
   - Maximum likelihood estimation
3. Check whether the model is consistent with our guess; if not, go back to Step 1

21 Expected Counts
[Table: model probabilities; the tags of the observations are "???"]

22 Expected Counts
Some sequences are more likely to occur than others.
[Table: model probabilities; values not recoverable]

23 Expected Counts
[Table: model probabilities; values not recoverable]

24 Expected Counts
Assume we draw 100,000 random samples.
[Table: model probabilities; values not recoverable]

25 Expected Counts
Let's update the model.
[Table: model probabilities; values not recoverable]

26 Expected Counts
Let's update the model.
How many times does the tag pair occur in the samples? 1024*2 + 256 = 2304

27 Expected Counts
How many times does the tag pair occur? 1024*2 + 256 = 2304
How many times does the conditioning tag occur? 1024*3 + 256*2 + 64*2 + 256 = 3968
Estimated transition probability: 2304/3968 ≈ 0.58

28 Expected Counts
Estimated transition probability: 2304/3968 ≈ 0.58
Do this for all the other entries!

29 Are we done yet?
- What if we have 45 tags?
- What if our sentence has 20 tokens?
- We need an efficient algorithm again!

30 Expected Counts
Estimated transition probability: 2304/3968 ≈ 0.58
The emission probability of observation 1 is estimated the same way, with the same denominator 3968.
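A brute-force version of this computation can be sketched in Python: enumerate every tag sequence, weight it by its posterior probability (instead of drawing samples), and accumulate fractional transition counts. The state names and model values below are assumptions for illustration, not the numbers from the slides, and the denominator here counts only transitions out of a state.

```python
from itertools import product

def joint(obs, tags, start, trans, emit):
    """P(w, t) for one complete tag sequence."""
    p = start[tags[0]] * emit[tags[0]][obs[0]]
    for k in range(1, len(obs)):
        p *= trans[tags[k - 1]][tags[k]] * emit[tags[k]][obs[k]]
    return p

def expected_transitions(obs, states, start, trans, emit):
    seqs = list(product(states, repeat=len(obs)))   # |states|^n sequences
    weights = [joint(obs, s, start, trans, emit) for s in seqs]
    total = sum(weights)                            # this is P(w)
    pair = {(a, b): 0.0 for a in states for b in states}
    for s, wgt in zip(seqs, weights):
        for a, b in zip(s, s[1:]):
            pair[(a, b)] += wgt / total             # fractional (expected) count
    new_trans = {}
    for a in states:
        denom = sum(pair[(a, b)] for b in states)
        new_trans[a] = {b: pair[(a, b)] / denom for b in states}
    return pair, new_trans

# Hypothetical two-state model; the observation sequence 2 3 2 is from the slides.
states = ["A", "B"]
obs = [2, 3, 2]
start = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.6, "B": 0.4}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {1: 0.5, 2: 0.4, 3: 0.1}, "B": {1: 0.1, 2: 0.4, 3: 0.5}}
pair, new_trans = expected_transitions(obs, states, start, trans, emit)
```

With 45 tags and 20 tokens the list `seqs` would hold 45^20 sequences, which is why an efficient algorithm is needed.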

32 In general
[Table: model probabilities; values not recoverable]

34 In general
Let's say #words = n. We want $P(w_{1..n}, t_k = q)$ for a position $k$ and a tag $q$.

35 In general
$$P(w_{1..n}, t_k = q) = P(w_{1..k}, t_k = q)\, P(w_{k+1..n} \mid t_k = q)$$
- $P(w_{1..k}, t_k = q)$: the probability of $w_1 \dots w_k$ with tag $k$ being $q$
- $P(w_{k+1..n} \mid t_k = q)$: the probability of $w_{k+1} \dots w_n$ given that tag $k$ is $q$

36 In general
$$P(w_{1..n}, t_k = q) = P(w_{1..k}, t_k = q)\, P(w_{k+1..n} \mid t_k = q)$$
- Can be computed by the forward algorithm: $P(w_{1..k}, t_k = q) = \sum_{t_{1..k-1}} P(w_{1..k}, t_{1..k-1}, t_k = q)$
- Can be computed by the backward algorithm: $P(w_{k+1..n} \mid t_k = q) = \sum_{t_{k+1..n}} P(w_{k+1..n}, t_{k+1..n} \mid t_k = q)$

37 Forward algorithm
Induction: $\alpha_k(q) = P(w_k \mid q) \sum_{q'} \alpha_{k-1}(q')\, P(q \mid q')$

38 Backward algorithm
- $P(w_{k+1..n} \mid t_k = q) = \sum_{q'} P(w_{k+2..n} \mid t_{k+1} = q')\, P(q' \mid q)\, P(w_{k+1} \mid q')$
- $\beta_k(q) = \sum_{q'} \beta_{k+1}(q')\, P(q' \mid q)\, P(w_{k+1} \mid q')$

39 In general
$$P(w_{1..n}, t_k = q) = P(w_{1..k}, t_k = q)\, P(w_{k+1..n} \mid t_k = q)$$
- Can be computed by the forward algorithm: $P(w_{1..k}, t_k = q) = \sum_{t_{1..k-1}} P(w_{1..k}, t_{1..k-1}, t_k = q)$
- Can be computed by the backward algorithm: $P(w_{k+1..n} \mid t_k = q) = \sum_{t_{k+1..n}} P(w_{k+1..n}, t_{k+1..n} \mid t_k = q)$
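The two recursions can be sketched in plain Python. The dictionary-based model layout, the state names, and the probability values are assumptions for illustration; there is no explicit end-of-sentence state here, so the backward values at the last position are set to 1.

```python
def forward(obs, states, start, trans, emit):
    """alpha[k][q] = P(w_1..w_k, t_k = q), with positions indexed from 0."""
    alpha = [{q: start[q] * emit[q][obs[0]] for q in states}]
    for w in obs[1:]:
        prev = alpha[-1]
        alpha.append({q: emit[q][w] * sum(prev[r] * trans[r][q] for r in states)
                      for q in states})
    return alpha

def backward(obs, states, trans, emit):
    """beta[k][q] = P(w_{k+1}..w_n | t_k = q); beta at the last position is 1."""
    beta = [{q: 1.0 for q in states}]
    for w in reversed(obs[1:]):          # w plays the role of w_{k+1}
        nxt = beta[0]
        beta.insert(0, {q: sum(nxt[r] * trans[q][r] * emit[r][w] for r in states)
                        for q in states})
    return beta

# Hypothetical two-state model.
states = ["A", "B"]
obs = [2, 3, 2, 1]
start = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.6, "B": 0.4}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {1: 0.5, 2: 0.4, 3: 0.1}, "B": {1: 0.1, 2: 0.4, 3: 0.5}}
alpha = forward(obs, states, start, trans, emit)
beta = backward(obs, states, trans, emit)
p_w = sum(alpha[-1].values())            # likelihood of the whole input
```

A useful sanity check is that summing `alpha[k][q] * beta[k][q]` over `q` gives the same value `p_w` at every position `k`, since each product is P(w_1..w_n, t_k = q).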

40 Emission Counts
$$P(2 \mid q) = \frac{\sum_i P(w_i = 2,\ t_i = q,\ w_{1..n})}{\sum_i P(t_i = q,\ w_{1..n})}$$
The numerator gives the expected counts of (2, q); the denominator gives the expected counts of q.

41 How about the transition counts?
$$P(w_{1..n}, t_k = q, t_{k+1} = q') = P(w_{1..k}, t_k = q)\, P(w_{k+2..n} \mid t_{k+1} = q')\, P(q' \mid q)\, P(w_{k+1} \mid q') = \alpha_k(q)\, \beta_{k+1}(q')\, P(q' \mid q)\, P(w_{k+1} \mid q')$$
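Putting the emission and transition formulas together, one soft EM update can be sketched as below. This is an illustrative sketch on a single observation sequence: the function names, the two-state model, and the initial values are assumptions, and the start distribution is re-estimated from the posterior at the first position.

```python
import math

def forward_backward(obs, states, start, trans, emit):
    """alpha[k][q] = P(w_1..w_k, t_k=q); beta[k][q] = P(w_{k+1}..w_n | t_k=q)."""
    n = len(obs)
    alpha = [{q: start[q] * emit[q][obs[0]] for q in states}]
    for k in range(1, n):
        alpha.append({q: emit[q][obs[k]] * sum(alpha[k - 1][r] * trans[r][q] for r in states)
                      for q in states})
    beta = [None] * n
    beta[n - 1] = {q: 1.0 for q in states}
    for k in range(n - 2, -1, -1):
        beta[k] = {q: sum(beta[k + 1][r] * trans[q][r] * emit[r][obs[k + 1]] for r in states)
                   for q in states}
    return alpha, beta

def soft_em_step(obs, states, vocab, start, trans, emit):
    """One E-step (expected counts) and M-step (normalize); also returns log P(w)."""
    n = len(obs)
    alpha, beta = forward_backward(obs, states, start, trans, emit)
    z = sum(alpha[n - 1].values())                       # P(w) under the current model
    # Expected state occupancy at each position.
    gamma = [{q: alpha[k][q] * beta[k][q] / z for q in states} for k in range(n)]
    # Expected transition counts: alpha_k(q) * P(r|q) * P(w_{k+1}|r) * beta_{k+1}(r).
    xi = {(q, r): sum(alpha[k][q] * trans[q][r] * emit[r][obs[k + 1]] * beta[k + 1][r]
                      for k in range(n - 1)) / z
          for q in states for r in states}
    new_start = {q: gamma[0][q] for q in states}
    new_trans = {q: {r: xi[(q, r)] / sum(xi[(q, s)] for s in states) for r in states}
                 for q in states}
    new_emit = {q: {w: sum(gamma[k][q] for k in range(n) if obs[k] == w)
                       / sum(gamma[k][q] for k in range(n)) for w in vocab}
                for q in states}
    return new_start, new_trans, new_emit, math.log(z)

# Hypothetical two-state model and observation sequence.
states, vocab = ["A", "B"], [1, 2, 3]
obs = [2, 3, 2, 1, 1, 3, 2]
start = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.6, "B": 0.4}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {1: 0.5, 2: 0.4, 3: 0.1}, "B": {1: 0.1, 2: 0.4, 3: 0.5}}
log_likelihoods = []
for _ in range(5):
    start, trans, emit, ll = soft_em_step(obs, states, vocab, start, trans, emit)
    log_likelihoods.append(ll)
```

Each entry of `log_likelihoods` is the log-likelihood of the model before that update, so the list should never decrease from one iteration to the next.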

42 Three basic problems for HMMs
- Likelihood of the input: how likely is it that the sentence "I love cat" occurs?
  - Forward algorithm
- Decoding (tagging) the input: what are the POS tags of "I love cat"?
  - Viterbi algorithm
- Estimation (learning): how do we learn the model?
  - Find the best model parameters
  - Case 1: supervised, tags are annotated: maximum likelihood estimation (MLE)
  - Case 2: unsupervised, only unannotated text: forward-backward algorithm

43 Trick: computing everything in log space
- Homework: write the forward, backward and Viterbi algorithms in log space
- Hint: you need a function to compute log(a+b)
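One common way to write the hinted function, given only log a and log b, is to factor out the larger term so the exponential never overflows. The name `log_add` is an assumption for this sketch.

```python
import math

def log_add(log_a, log_b):
    """Return log(a + b) given log(a) and log(b), staying in log space."""
    neg_inf = float("-inf")
    if log_a == neg_inf:      # a == 0, so a + b == b
        return log_b
    if log_b == neg_inf:      # b == 0, so a + b == a
        return log_a
    hi, lo = max(log_a, log_b), min(log_a, log_b)
    # log(a + b) = hi + log(1 + exp(lo - hi)), and exp(lo - hi) <= 1
    return hi + math.log1p(math.exp(lo - hi))

total = log_add(math.log(0.2), math.log(0.3))   # log(0.5)
tiny = log_add(-1000.0, -1000.0)                # exp(-1000) would underflow to 0.0
```

In the forward and backward recursions, each sum over previous or next states becomes a repeated `log_add`, and each product becomes an ordinary addition of logs.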

44 Behind the scenes
- What does EM optimize?
- The log-likelihood of the input, $\log P(w \mid \lambda)$:
$$\log P(w \mid \lambda) = \log \sum_t P(w, t \mid \lambda) = \log \sum_t \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})\, P(w_i \mid t_i)$$
This is hard.
In contrast, in the supervised situation we are optimizing $\log P(w, t \mid \lambda)$:
$$\log P(w, t \mid \lambda) = \log \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})\, P(w_i \mid t_i) = \sum_i \left( \log P(t_i \mid t_{i-1}, t_{i-2}) + \log P(w_i \mid t_i) \right)$$
$\log \sum \prod$ is hard; $\log \prod = \sum \log$ is easy.

45 Intuition of EM (from the optimization perspective)
[Figure: the objective $f(\lambda)$ and the surrogates $g_m$, $g_{m+1}$, with the iterates $\lambda^{(m)}$, $\lambda^{(m+1)}$, $\lambda^{(m+2)}$ marked on the $\lambda$ axis]
$$f(\lambda) = \log P(w \mid \lambda) = \log \sum_t P(w, t \mid \lambda)$$
Key idea ($m$ indexes the EM iteration):
1. Define $g_m(\lambda)$ such that $f(\lambda) \ge g_m(\lambda)$ for all $\lambda$ and $f(\lambda^{(m)}) = g_m(\lambda^{(m)})$
2. Optimize $g_m(\lambda)$

46 Intuition of EM (from the optimization perspective)
Hard EM and Soft EM define different $g_m(\lambda)$.

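47 $g_m(\lambda)$ for soft EM ($m$ indexes the EM iteration)
$$\log \sum_t P(w, t \mid \lambda) = \log \sum_t P(t \mid w, \lambda^{(m)})\, \frac{P(w, t \mid \lambda)}{P(t \mid w, \lambda^{(m)})}$$
Jensen's inequality: let $\sum_x p(x) = 1$; then $\log \sum_x p(x) f(x) \ge \sum_x p(x) \log f(x)$. Hence
$$\log \sum_t P(w, t \mid \lambda) \ge \sum_t P(t \mid w, \lambda^{(m)}) \log \frac{P(w, t \mid \lambda)}{P(t \mid w, \lambda^{(m)})}$$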

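48 $g_m(\lambda^{(m)}) = f(\lambda^{(m)})$? ($m$ indexes the EM iteration)
Recall the bound
$$\log \sum_t P(w, t \mid \lambda) \ge \sum_t P(t \mid w, \lambda^{(m)}) \log \frac{P(w, t \mid \lambda)}{P(t \mid w, \lambda^{(m)})}$$
On the left, $f(\lambda^{(m)}) = \log \sum_t P(w, t \mid \lambda^{(m)}) = \log P(w \mid \lambda^{(m)})$.
On the right,
$$g_m(\lambda^{(m)}) = \sum_t P(t \mid w, \lambda^{(m)}) \log \frac{P(w, t \mid \lambda^{(m)})}{P(t \mid w, \lambda^{(m)})} = \sum_t P(t \mid w, \lambda^{(m)}) \log P(w \mid \lambda^{(m)}) = \left( \log P(w \mid \lambda^{(m)}) \right) \sum_t P(t \mid w, \lambda^{(m)}) = \log P(w \mid \lambda^{(m)})$$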
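The bound and its tightness can be checked numerically on a tiny model by enumerating all tag sequences. This is a sanity-check sketch, not part of the lecture: the bigram two-state model, the names `f` and `g`, and both parameter settings are assumptions chosen so that every probability is positive.

```python
import math
from itertools import product

def joint(obs, tags, model):
    """P(w, t | model) for one tag sequence; model = (start, trans, emit)."""
    start, trans, emit = model
    p = start[tags[0]] * emit[tags[0]][obs[0]]
    for k in range(1, len(obs)):
        p *= trans[tags[k - 1]][tags[k]] * emit[tags[k]][obs[k]]
    return p

def f(obs, states, model):
    """Log-likelihood of the input: log sum_t P(w, t | model)."""
    return math.log(sum(joint(obs, s, model) for s in product(states, repeat=len(obs))))

def g(obs, states, model, ref_model):
    """Lower bound built from the posterior P(t | w, ref_model)."""
    seqs = list(product(states, repeat=len(obs)))
    ref = [joint(obs, s, ref_model) for s in seqs]
    z = sum(ref)
    total = 0.0
    for s, p_ref in zip(seqs, ref):
        post = p_ref / z                                  # P(t | w, ref_model)
        total += post * (math.log(joint(obs, s, model)) - math.log(post))
    return total

states, obs = ["A", "B"], [2, 3, 2]
ref_model = ({"A": 0.5, "B": 0.5},
             {"A": {"A": 0.6, "B": 0.4}, "B": {"A": 0.4, "B": 0.6}},
             {"A": {1: 0.5, 2: 0.4, 3: 0.1}, "B": {1: 0.1, 2: 0.4, 3: 0.5}})
other_model = ({"A": 0.3, "B": 0.7},
               {"A": {"A": 0.2, "B": 0.8}, "B": {"A": 0.7, "B": 0.3}},
               {"A": {1: 0.2, 2: 0.6, 3: 0.2}, "B": {1: 0.3, 2: 0.3, 3: 0.4}})
gap_at_ref = f(obs, states, ref_model) - g(obs, states, ref_model, ref_model)
gap_elsewhere = f(obs, states, other_model) - g(obs, states, other_model, ref_model)
```

`gap_at_ref` should be zero up to rounding (the bound touches the objective at the reference parameters), while `gap_elsewhere` should be non-negative.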

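49 Intuition of EM (from the optimization perspective)
$$f(\lambda) = \log P(w \mid \lambda) = \log \sum_t P(w, t \mid \lambda)$$
Key idea ($m$ indexes the EM iteration):
1. Define $g_m(\lambda)$ such that $f(\lambda) \ge g_m(\lambda)$ for all $\lambda$ and $f(\lambda^{(m)}) = g_m(\lambda^{(m)})$
2. Optimize $g_m(\lambda)$
Soft EM defines
$$g_m(\lambda) = \sum_t P(t \mid w, \lambda^{(m)}) \log \frac{P(w, t \mid \lambda)}{P(t \mid w, \lambda^{(m)})}$$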

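50 Optimizing $g_m(\lambda)$ ($m$ indexes the EM iteration)
$$g_m(\lambda) = \sum_t P(t \mid w, \lambda^{(m)}) \log \frac{P(w, t \mid \lambda)}{P(t \mid w, \lambda^{(m)})} = \sum_t P(t \mid w, \lambda^{(m)}) \left( \log P(w, t \mid \lambda) - \log P(t \mid w, \lambda^{(m)}) \right)$$
The term $\log P(t \mid w, \lambda^{(m)})$ doesn't contain $\lambda$, so maximizing $g_m(\lambda)$ is equivalent to
$$\max_\lambda \sum_t P(t \mid w, \lambda^{(m)}) \log P(w, t \mid \lambda) = \max_\lambda \sum_t P(t \mid w, \lambda^{(m)}) \sum_i \left( \log P(t_i \mid t_{i-1}, t_{i-2}) + \log P(w_i \mid t_i) \right)$$
We know how to solve this! Compare with the supervised learning case:
$$\log P(w, t \mid \lambda) = \log \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})\, P(w_i \mid t_i) = \sum_i \left( \log P(t_i \mid t_{i-1}, t_{i-2}) + \log P(w_i \mid t_i) \right)$$
$\log \sum \prod$ is hard; $\log \prod = \sum \log$ is easy.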
