Lecture 11: Viterbi and Forward Algorithms

1 Lecture 11: Viterbi and Forward Algorithms
Kai-Wei Chang, University of Virginia
kw@kwchang.net
Course webpage:
CS6501 Natural Language Processing

2 Quiz 1
[Figure: histogram of Quiz 1 scores with bins [0-5], [6-10], [11-15], [16-20], [21-25]]
- Max: 24; Mean: 18.1; Median: 18; SD: 3.36

3 This lecture
- Two important algorithms for inference
  - Forward algorithm
  - Viterbi algorithm

5 Three basic problems for HMMs
- Likelihood of the input: how likely is it that the sentence "I love cat" occurs?
  - Forward algorithm
- Decoding (tagging) the input: what are the POS tags of "I love cat"?
  - Viterbi algorithm
- Estimation (learning): how do we learn the model?
  - Find the best model parameters
  - Case 1: supervised (tags are annotated)
    - Maximum likelihood estimation (MLE)
  - Case 2: unsupervised (only unannotated text)
    - Forward-backward algorithm

6 Likelihood of the input
- How likely is it that the sentence "I love cat" occurs?
- Compute $P(\mathbf{w} \mid \lambda)$ for the input $\mathbf{w}$ and HMM $\lambda$
- Remember, we model $P(\mathbf{t}, \mathbf{w} \mid \lambda)$
- $P(\mathbf{w} \mid \lambda) = \sum_{\mathbf{t}} P(\mathbf{t}, \mathbf{w} \mid \lambda)$
  - Marginal probability: sum over all possible tag sequences

7 Likelihood of the input
- How likely is it that the sentence "I love cat" occurs?
- $P(\mathbf{w} \mid \lambda) = \sum_{\mathbf{t}} P(\mathbf{t}, \mathbf{w} \mid \lambda) = \sum_{\mathbf{t}} \prod_{i=1}^{n} P(w_i \mid t_i) P(t_i \mid t_{i-1})$
- Assume we have 2 tags, N and V
- P("I love cat" | λ) = P("I love cat", NNN | λ) + P("I love cat", NNV | λ) + P("I love cat", NVN | λ) + P("I love cat", NVV | λ) + P("I love cat", VNN | λ) + P("I love cat", VNV | λ) + P("I love cat", VVN | λ) + P("I love cat", VVV | λ)
- Now, let's write down P("I love cat" | λ) with 45 tags
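The enumeration on this slide can be sketched directly in code, which also makes the cost visible: with T tags and n words there are T^n terms (8 for 2 tags and 3 words, 45^3 = 91,125 for 45 tags). The snippet below is a minimal illustration, not part of the lecture; the tag set, the parameter tables `INIT`, `TRANS`, `EMIT`, and every probability in them are made-up assumptions.

```python
from itertools import product

# Hypothetical toy HMM: all numbers below are invented for illustration only.
TAGS = ["N", "V"]
INIT = {"N": 0.7, "V": 0.3}                        # P(t_1 = q | <S>)
TRANS = {"N": {"N": 0.4, "V": 0.6},                # TRANS[prev][cur] = P(t_i = cur | t_{i-1} = prev)
         "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"I": 0.5, "love": 0.1, "cat": 0.4},  # EMIT[tag][word] = P(w_i = word | t_i = tag)
        "V": {"I": 0.1, "love": 0.8, "cat": 0.1}}


def joint(tags, words):
    """P(t, w) = prod_i P(w_i | t_i) * P(t_i | t_{i-1}) for one tag sequence."""
    p = 1.0
    prev = None
    for tag, word in zip(tags, words):
        transition = INIT[tag] if prev is None else TRANS[prev][tag]
        p *= transition * EMIT[tag][word]
        prev = tag
    return p


def brute_force_likelihood(words):
    """P(w): sum the joint probability over all len(TAGS) ** len(words) tag sequences."""
    return sum(joint(tags, words) for tags in product(TAGS, repeat=len(words)))


words = ["I", "love", "cat"]
print(len(list(product(TAGS, repeat=len(words)))))  # 8 tag sequences for 2 tags, 3 words
print(brute_force_likelihood(words))                # approximately 0.06236 for these toy numbers
```

This is only practical for tiny inputs; the exponential number of terms is what motivates the trellis and dynamic programming that follow.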

8 Trellis diagram
λ is the parameter set of the HMM. Let's ignore it in some slides for simplicity's sake.
- Goal: $P(\mathbf{w} \mid \lambda) = \sum_{\mathbf{t}} \prod_{i=1}^{n} P(w_i \mid t_i) P(t_i \mid t_{i-1})$
[Figure: trellis with columns i = 1, 2, 3, 4; example edge labels $P(t_2 = 2 \mid t_1 = 1)$ and $P(t_3 = 1 \mid t_2 = 1)$; example node label $P(w_3 \mid t_3 = 1)$]

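9 Trellis diagram
- P("I eat a fish", t) for one tag sequence t
[Figure: trellis with columns i = 1, 2, 3, 4 and one path highlighted; the path is labeled with the start transition P(· | <S>), three tag-to-tag transitions P(· | ·), and the emissions P(I | ·), P(eat | ·), P(a | ·), P(fish | ·)]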

10 Trellis diagram
- $\sum_{\mathbf{t}} \prod_{i=1}^{n} P(w_i \mid t_i) P(t_i \mid t_{i-1})$: sum over all paths
[Figure: trellis with columns i = 1, 2, 3, 4]

11 Dynamic programming
- Recursively decompose a problem into smaller sub-problems
- Similar to mathematical induction
  - Base step: initial values for i = 1
  - Inductive step: assume we know the values for i = k, let's compute i = k + 1

12 Forward algorithm
- Inductive step: from i = k − 1 to i = k
- $\mathbf{t}_k$: tag sequence with length k; $\mathbf{w}_k = w_1, w_2, \ldots, w_k$
- $\sum_{\mathbf{t}_k} P(\mathbf{t}_k, \mathbf{w}_k) = \sum_{q} \sum_{\mathbf{t}_{k-1}} P(\mathbf{t}_{k-1}, \mathbf{w}_k, t_k = q)$
  - The inner sum is $P(\mathbf{w}_k, t_k = q)$
[Figure: trellis with columns i = 1, 2, 3, 4; tag sequences ending at i = k]

13 Forward algorithm
- Inductive step: from i = k − 1 to i = k
- $\sum_{\mathbf{t}_k} P(\mathbf{t}_k, \mathbf{w}_k) = \sum_{q} P(\mathbf{w}_k, t_k = q)$
- $P(\mathbf{w}_k, t_k = q) = \sum_{q'} P(\mathbf{w}_k, t_{k-1} = q', t_k = q) = \sum_{q'} P(\mathbf{w}_{k-1}, t_{k-1} = q') P(t_k = q \mid t_{k-1} = q') P(w_k \mid t_k = q)$
[Figure: trellis columns i = k − 1 and i = k]

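14 Forward algorithm
- Inductive step: from i = k − 1 to i = k
- $\sum_{\mathbf{t}_k} P(\mathbf{t}_k, \mathbf{w}_k) = \sum_{q} P(\mathbf{w}_k, t_k = q)$
- $P(\mathbf{w}_k, t_k = q) = \sum_{q'} P(\mathbf{w}_k, t_{k-1} = q', t_k = q) = \sum_{q'} P(\mathbf{w}_{k-1}, t_{k-1} = q') P(t_k = q \mid t_{k-1} = q') P(w_k \mid t_k = q)$
  - Let's call the left-hand side $\alpha_k(q)$
  - The factor $P(\mathbf{w}_{k-1}, t_{k-1} = q')$ is $\alpha_{k-1}(q')$
[Figure: trellis columns i = k − 1 and i = k]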

15 Forward algorithm
- Inductive step: from i = k − 1 to i = k
- $\alpha_k(q) = \sum_{q'} \alpha_{k-1}(q') P(t_k = q \mid t_{k-1} = q') P(w_k \mid t_k = q)$
[Figure: trellis columns i = k − 1 and i = k]

16 Forward algorithm
- Inductive step: from i = k − 1 to i = k
- $\alpha_k(q) = \sum_{q'} \alpha_{k-1}(q') P(t_k = q \mid t_{k-1} = q') P(w_k \mid t_k = q) = P(w_k \mid t_k = q) \sum_{q'} \alpha_{k-1}(q') P(t_k = q \mid t_{k-1} = q')$
[Figure: trellis columns i = k − 1 and i = k]

17 Forward algorithm
- Base step: i = 1
- $\alpha_1(q) = P(w_1 \mid t_1 = q) P(t_1 = q \mid t_0)$
  - $P(t_1 = q \mid t_0)$ is the initial probability $p(t_1 = q)$

18 Implementation using an array
- Use an n × T table to keep $\alpha_k(q)$
[Figure: trellis table, from Julia Hockenmaier, Intro to NLP]

19 Implementation using an array
- Initial: Trellis[1][q] = $P(w_1 \mid t_1 = q) P(t_1 = q \mid t_0)$

20 Implementation using an array
- Induction: $\alpha_k(q) = P(w_k \mid t_k = q) \sum_{q'} \alpha_{k-1}(q') P(t_k = q \mid t_{k-1} = q')$

21 The forward algorithm (pseudocode)
[Figure: pseudocode listing for the forward algorithm]
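One way to turn the base and inductive steps into code is sketched below, using an n × T table as on the array slides. It is a minimal illustration under stated assumptions rather than the listing from the slide: the tag set, the tables `INIT`, `TRANS`, `EMIT`, and all probabilities are invented, rows are 0-indexed (row 0 holds $\alpha_1$), and the input is assumed to be non-empty.

```python
# Hypothetical toy HMM: all numbers below are invented for illustration only.
TAGS = ["N", "V"]
INIT = {"N": 0.7, "V": 0.3}                        # P(t_1 = q | t_0), the initial probability
TRANS = {"N": {"N": 0.4, "V": 0.6},                # TRANS[prev][cur] = P(t_k = cur | t_{k-1} = prev)
         "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"I": 0.5, "love": 0.1, "cat": 0.4},  # EMIT[tag][word] = P(w_k = word | t_k = tag)
        "V": {"I": 0.1, "love": 0.8, "cat": 0.1}}


def forward(words):
    """Return (P(w), trellis), where trellis[k][q] holds alpha for position k and tag TAGS[q]."""
    n, T = len(words), len(TAGS)
    trellis = [[0.0] * T for _ in range(n)]

    # Base step: alpha_1(q) = P(w_1 | t_1 = q) * P(t_1 = q | t_0)
    for q, tag in enumerate(TAGS):
        trellis[0][q] = EMIT[tag][words[0]] * INIT[tag]

    # Inductive step:
    # alpha_k(q) = P(w_k | t_k = q) * sum_{q'} alpha_{k-1}(q') * P(t_k = q | t_{k-1} = q')
    for k in range(1, n):
        for q, tag in enumerate(TAGS):
            total = sum(trellis[k - 1][qp] * TRANS[prev][tag]
                        for qp, prev in enumerate(TAGS))
            trellis[k][q] = EMIT[tag][words[k]] * total

    # P(w) is the sum of the last row over all tags.
    return sum(trellis[n - 1]), trellis


likelihood, table = forward(["I", "love", "cat"])
print(likelihood)  # approximately 0.06236 for these toy numbers
```

The two nested loops over tags inside the loop over positions give a cost proportional to n · T², compared with T^n terms for the explicit sum over tag sequences.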

22 Jason's ice cream
- #cones: P(1, 2, 1)?
[Table: columns p(· | C), p(· | H), p(· | START); rows p(1 | ·), p(2 | ·), p(3 | ·), p(C | ·), p(H | ·)]
[Figure: trellis with states C and H over three time steps]

23 Three basic problems for HMMs
- Likelihood of the input: how likely is it that the sentence "I love cat" occurs?
  - Forward algorithm
- Decoding (tagging) the input: what are the POS tags of "I love cat"?
  - Viterbi algorithm
- Estimation (learning): how do we learn the model?
  - Find the best model parameters
  - Case 1: supervised (tags are annotated)
    - Maximum likelihood estimation (MLE)
  - Case 2: unsupervised (only unannotated text)
    - Forward-backward algorithm

24 Prediction in generative model
- Inference: what is the most likely sequence of tags for the given sequence of words w?
- What are the latent states that most likely generate the sequence of words w?
[Figure: HMM graphical model, with the initial probability $p(t_1)$ marked]

25 Tagging the input
- Find the best tag sequence of "I love cat"
- Remember, we model $P(\mathbf{t}, \mathbf{w} \mid \lambda)$
- $\mathbf{t}^* = \arg\max_{\mathbf{t}} P(\mathbf{t}, \mathbf{w} \mid \lambda)$
  - Find the best one from all possible tag sequences

26 Tagging the input
- Assume we have 2 tags, N and V
- Which one is the best? P("I love cat", NNN | λ), P("I love cat", NNV | λ), P("I love cat", NVN | λ), P("I love cat", NVV | λ), P("I love cat", VNN | λ), P("I love cat", VNV | λ), P("I love cat", VVN | λ), P("I love cat", VVV | λ)
- Again! We need an efficient algorithm

27 Trellis diagram
- Goal: $\arg\max_{\mathbf{t}} \prod_{i=1}^{n} P(w_i \mid t_i) P(t_i \mid t_{i-1})$
[Figure: trellis with columns i = 1, 2, 3, 4; example edge labels $P(t_2 = 2 \mid t_1 = 1)$ and $P(t_3 = 1 \mid t_2 = 1)$; example node label $P(w_3 \mid t_3 = 1)$]

28 Trellis diagram
- Goal: $\arg\max_{\mathbf{t}} \prod_{i=1}^{n} P(w_i \mid t_i) P(t_i \mid t_{i-1})$
- Find the best path!
[Figure: trellis with columns i = 1, 2, 3, 4]

29 Dynamic programming again!
- Recursively decompose a problem into smaller sub-problems
- Similar to mathematical induction
  - Base step: initial values for i = 1
  - Inductive step: assume we know the values for i = k, let's compute i = k + 1

30 Viterbi algorithm
- Inductive step: from i = k − 1 to i = k
- $\mathbf{t}_k$: tag sequence with length k; $\mathbf{w}_k = w_1, w_2, \ldots, w_k$
- $\max_{\mathbf{t}_k} P(\mathbf{t}_k, \mathbf{w}_k) = \max_{q} \max_{\mathbf{t}_{k-1}} P(\mathbf{t}_{k-1}, t_k = q, \mathbf{w}_k)$
[Figure: trellis with columns i = 1, 2, 3, 4; tag sequences ending at i = k]

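31 Viterbi algorithm
- Inductive step: from i = k − 1 to i = k
- $\max_{\mathbf{t}_{k-1}} P(\mathbf{t}_{k-1}, t_k = q, \mathbf{w}_k) = \max_{q'} \max_{\mathbf{t}_{k-2}} P(\mathbf{t}_{k-2}, t_k = q, t_{k-1} = q', \mathbf{w}_k) = \max_{q'} \max_{\mathbf{t}_{k-2}} P(\mathbf{t}_{k-2}, t_{k-1} = q', \mathbf{w}_{k-1}) P(t_k = q \mid t_{k-1} = q') P(w_k \mid t_k = q)$
  - Let's call the left-hand side $\delta_k(q)$
  - The term $\max_{\mathbf{t}_{k-2}} P(\mathbf{t}_{k-2}, t_{k-1} = q', \mathbf{w}_{k-1})$ is $\delta_{k-1}(q')$
[Figure: trellis columns i = k − 1 and i = k]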

32 Viterbi algorithm
- Inductive step: from i = k − 1 to i = k
- $\delta_k(q) = \max_{q'} \delta_{k-1}(q') P(t_k = q \mid t_{k-1} = q') P(w_k \mid t_k = q)$
[Figure: trellis columns i = k − 1 and i = k]

33 Viterbi algorithm
- Inductive step: from i = k − 1 to i = k
- $\delta_k(q) = \max_{q'} \delta_{k-1}(q') P(t_k = q \mid t_{k-1} = q') P(w_k \mid t_k = q) = P(w_k \mid t_k = q) \max_{q'} \delta_{k-1}(q') P(t_k = q \mid t_{k-1} = q')$
[Figure: trellis columns i = k − 1 and i = k]

34 Viterbi algorithm
- Base step: i = 1
- $\delta_1(q) = P(w_1 \mid t_1 = q) P(t_1 = q \mid t_0)$
  - $P(t_1 = q \mid t_0)$ is the initial probability $p(t_1 = q)$

35 Implementation using an array
- Initial: Trellis[1][q] = $P(w_1 \mid t_1 = q) P(t_1 = q \mid t_0)$

36 Implementation using an array
- Induction: $\delta_k(q) = P(w_k \mid t_k = q) \max_{q'} \delta_{k-1}(q') P(t_k = q \mid t_{k-1} = q')$

37 Retrieving the best sequence
- Keep one backpointer for each trellis cell, pointing to the best previous tag

38 The Viterbi algorithm (pseudocode)
- Max instead of sum
[Figure: pseudocode listing for the Viterbi algorithm]
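The "max instead of sum" recursion, together with one backpointer per trellis cell, can be sketched as follows. This is a minimal illustration under stated assumptions, not the listing from the slide: the tag set, the tables `INIT`, `TRANS`, `EMIT`, and all probabilities are invented, rows are 0-indexed (row 0 holds $\delta_1$), and the input is assumed to be non-empty.

```python
# Hypothetical toy HMM: all numbers below are invented for illustration only.
TAGS = ["N", "V"]
INIT = {"N": 0.7, "V": 0.3}                        # P(t_1 = q | t_0), the initial probability
TRANS = {"N": {"N": 0.4, "V": 0.6},                # TRANS[prev][cur] = P(t_k = cur | t_{k-1} = prev)
         "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"I": 0.5, "love": 0.1, "cat": 0.4},  # EMIT[tag][word] = P(w_k = word | t_k = tag)
        "V": {"I": 0.1, "love": 0.8, "cat": 0.1}}


def viterbi(words):
    """Return (best tag sequence, its joint probability P(t, w))."""
    n, T = len(words), len(TAGS)
    delta = [[0.0] * T for _ in range(n)]  # delta[k][q]: best score of any path ending in tag q
    back = [[0] * T for _ in range(n)]     # back[k][q]: index of the best previous tag

    # Base step: delta_1(q) = P(w_1 | t_1 = q) * P(t_1 = q | t_0)
    for q, tag in enumerate(TAGS):
        delta[0][q] = EMIT[tag][words[0]] * INIT[tag]

    # Inductive step:
    # delta_k(q) = P(w_k | t_k = q) * max_{q'} delta_{k-1}(q') * P(t_k = q | t_{k-1} = q')
    for k in range(1, n):
        for q, tag in enumerate(TAGS):
            best_prev = max(range(T),
                            key=lambda qp: delta[k - 1][qp] * TRANS[TAGS[qp]][tag])
            back[k][q] = best_prev
            delta[k][q] = (EMIT[tag][words[k]]
                           * delta[k - 1][best_prev] * TRANS[TAGS[best_prev]][tag])

    # Pick the best final tag, then follow the backpointers from right to left.
    q = max(range(T), key=lambda j: delta[n - 1][j])
    best_prob = delta[n - 1][q]
    path = [q]
    for k in range(n - 1, 0, -1):
        q = back[k][q]
        path.append(q)
    path.reverse()
    return [TAGS[j] for j in path], best_prob


tags, prob = viterbi(["I", "love", "cat"])
print(tags, prob)  # ['N', 'V', 'N'] with probability approximately 0.05376 for these toy numbers
```

Ties between equally good previous tags are broken here by taking the first one in `TAGS`; that is an arbitrary choice of this sketch.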

40 Jason's ice cream
- #cones: best tag sequence for P(1, 2, 1)?
[Table: columns p(· | C), p(· | H), p(· | START); rows p(1 | ·), p(2 | ·), p(3 | ·), p(C | ·), p(H | ·)]
[Figure: trellis with states C and H over three time steps]

41 Trick: computing everything in log space
- Homework:
  - Write the forward and Viterbi algorithms in log space
  - Hint: you need a function to compute log(a + b)
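As a sketch of the hinted helper only (not a full solution to the homework), the function below computes log(a + b) from log a and log b without ever forming a or b, which matters because products of many probabilities underflow to 0.0 in floating point. The name `log_add` is a hypothetical choice for this example; the identity used is log(a + b) = m + log(1 + exp(l − m)), where m is the larger and l the smaller of the two log values.

```python
import math


def log_add(log_a, log_b):
    """Return log(a + b), given log_a = log(a) and log_b = log(b)."""
    # log(0) is represented as -inf; adding zero leaves the other term unchanged.
    if log_a == -math.inf:
        return log_b
    if log_b == -math.inf:
        return log_a
    hi, lo = max(log_a, log_b), min(log_a, log_b)
    # Factor out the larger term so the exponent is <= 0 and cannot overflow.
    return hi + math.log1p(math.exp(lo - hi))


# Ordinary case: log(0.2 + 0.3) equals log(0.5) up to rounding.
print(log_add(math.log(0.2), math.log(0.3)), math.log(0.5))

# Underflow case: exp(-1000) is 0.0 as a float, but the log-space sum stays finite.
print(math.exp(-1000.0))
print(log_add(-1000.0, -1001.0))
```

In log space the products in the forward and Viterbi recursions become sums of log probabilities; the max in Viterbi is unchanged because log is monotonic, and only the sum in the forward algorithm needs a helper like this.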

42 Three basic problems for HMMs
- Likelihood of the input: how likely is it that the sentence "I love cat" occurs?
  - Forward algorithm
- Decoding (tagging) the input: what are the POS tags of "I love cat"?
  - Viterbi algorithm
- Estimation (learning): how do we learn the model?
  - Find the best model parameters
  - Case 1: supervised (tags are annotated)
    - Maximum likelihood estimation (MLE)
  - Case 2: unsupervised (only unannotated text)
    - Forward-backward algorithm
