Lecture 9: Hidden Markov Model

1 Lecture 9: Hidden Markov Model
Kai-Wei Chang, University of Virginia, kw@kwchang.net
Course webpage:
CS6501 Natural Language Processing

2 This lecture
- Hidden Markov Model
- Different views of HMM
- HMM in supervised learning setting

3 Recap: Parts of Speech
- Traditional parts of speech
  - ~8 of them

4 Recap: Tagset
- Penn TreeBank tagset, 45 tags:
  - PRP$, WRB, WP$, VBG
  - Penn POS annotations: The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
- Universal Tag set, 12 tags:
  - NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., X

5 Recap: POS Tagging vs. Word clustering
- Words often have more than one POS: back
  - The back door = JJ
  - On my back = NN
  - Win the voters back = RB
  - Promised to back the bill = VB
- Syntax vs. Semantics (details later)
These examples are from Dekang Lin.

6 Recap: POS tag sequences
- Some tag sequences are more likely to occur than others
- POS n-gram view (e.g., _ADJ_ _NOUN_, _ADV_ _NOUN_, _ADV_ _VERB_)
Existing methods often model POS tagging as a sequence tagging problem.

7 Evaluation
- How many words in the unseen test data can be tagged correctly?
- Usually evaluated on Penn Treebank
- State of the art ~97%
- Trivial baseline (most likely tag) ~94%
- Human performance ~97%
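The "most likely tag" baseline and the accuracy metric can be sketched in a few lines. This is a minimal illustration, not the evaluation setup behind the slide's numbers: the tiny corpus, the function names, and the `default_tag="NN"` fallback for unseen words are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def accuracy(model, tagged_sentences, default_tag="NN"):
    """Fraction of test tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sentence in tagged_sentences:
        for word, gold in sentence:
            # Unseen words fall back to default_tag (an assumption of this sketch).
            correct += model.get(word, default_tag) == gold
            total += 1
    return correct / total


# Hypothetical toy data: "back" is seen twice as NN and once as VB.
train = [
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("my", "PRP$"), ("back", "NN"), ("hurts", "VBZ")],
    [("back", "VB"), ("the", "DT"), ("bill", "NN")],
]
test = [[("the", "DT"), ("back", "JJ"), ("door", "NN")]]

baseline = train_most_frequent_tag(train)
score = accuracy(baseline, test)
print(baseline["back"], score)
```

The baseline ignores context entirely, so it tags "back" as NN even in "the back door", which is exactly the kind of error a sequence model is meant to fix.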

8 Building a POS tagger
- Supervised learning
- Assume linguists have annotated several examples
Diagram: the tag set (DT, JJ, NN, VBD) and annotated text feed the POS tagger.
Annotated text: The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

9 POS induction
- Unsupervised learning
- Assume we only have an unannotated corpus
Diagram: the tag set (DT, JJ, NN, VBD) and unannotated text feed the POS tagger.
Unannotated text: The grand jury commented on a number of other topics.

10 TODAY: Hidden Markov Model
- We focus on the supervised learning setting
- What is the most likely sequence of tags for the given sequence of words w?
- We will talk about other ML models for this type of prediction task later.

11 Let's try
Don't worry! There is no problem with your eyes or computer.
- a/DT d6g/NN 0s/VBZ chas05g/VBG a/DT cat/NN ./.
- a/DT f6x/NN 0s/VBZ 9u5505g/VBG ./.
- a/DT b6y/NN 0s/VBZ s05g05g/VBG ./.
- a/DT ha77y/JJ b09d/NN
What is the POS tag sequence of the following sentence?
a ha77y cat was s05g05g.

12 Let's try
- a/DT d6g/NN 0s/VBZ chas05g/VBG a/DT cat/NN ./.
  a/DT dog/NN is/VBZ chasing/VBG a/DT cat/NN ./.
- a/DT f6x/NN 0s/VBZ 9u5505g/VBG ./.
  a/DT fox/NN is/VBZ running/VBG ./.
- a/DT b6y/NN 0s/VBZ s05g05g/VBG ./.
  a/DT boy/NN is/VBZ singing/VBG ./.
- a/DT ha77y/JJ b09d/NN
  a/DT happy/JJ bird/NN
- a ha77y cat was s05g05g.
  a happy cat was singing.

13 How do you predict the tags?
- Two types of information are useful
  - Relations between words and tags
  - Relations between tags and tags
    - DT NN, DT JJ NN

14 Statistical POS tagging
- What is the most likely sequence of tags for the given sequence of words w?
$P(\text{DT JJ NN} \mid \text{a smart dog}) = P(\text{DT JJ NN}, \text{a smart dog}) \,/\, P(\text{a smart dog})$
$P(\text{DT JJ NN}, \text{a smart dog}) = P(\text{DT JJ NN})\, P(\text{a smart dog} \mid \text{DT JJ NN})$
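One step worth spelling out: the denominator $P(w)$ does not depend on the tags, so picking the most likely tag sequence for a fixed sentence only requires the joint probability. A short derivation, writing $t$ for the tag sequence and $w$ for the word sequence ($\hat{t}$ is a name introduced here for the predicted sequence):

```latex
\hat{t} = \arg\max_{t} P(t \mid w)
        = \arg\max_{t} \frac{P(t, w)}{P(w)}
        = \arg\max_{t} P(t, w)
        = \arg\max_{t} P(t)\, P(w \mid t)
```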

15 Transition Probability
- Joint probability $P(t, w) = P(t)\, P(w \mid t)$
- $P(t) = P(t_1, t_2, \dots, t_n)$
  $= P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_2, t_1) \cdots P(t_n \mid t_1, \dots, t_{n-1})$
  $\approx P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_2) \cdots P(t_n \mid t_{n-1}) = \prod_{i=1}^{n} P(t_i \mid t_{i-1})$
  (Markov assumption, with $t_0$ the start symbol so that $P(t_1 \mid t_0) = P(t_1)$)
- Bigram model over POS tags! (Similarly, we can define an n-gram model over POS tags, usually called a higher-order HMM.)

16 Emission Probability
- Joint probability $P(t, w) = P(t)\, P(w \mid t)$
- Assume words only depend on their POS tag
- $P(w \mid t) \approx P(w_1 \mid t_1)\, P(w_2 \mid t_2) \cdots P(w_n \mid t_n) = \prod_{i=1}^{n} P(w_i \mid t_i)$ (independence assumption)
  i.e., $P(\text{a smart dog} \mid \text{DT JJ NN}) = P(\text{a} \mid \text{DT})\, P(\text{smart} \mid \text{JJ})\, P(\text{dog} \mid \text{NN})$

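17 Put them together
- Joint probability $P(t, w) = P(t)\, P(w \mid t)$
- $P(t, w) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_2) \cdots P(t_n \mid t_{n-1}) \cdot P(w_1 \mid t_1)\, P(w_2 \mid t_2) \cdots P(w_n \mid t_n)$
  $= \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$
  e.g., $P(\text{a smart dog}, \text{DT JJ NN}) = P(\text{a} \mid \text{DT})\, P(\text{smart} \mid \text{JJ})\, P(\text{dog} \mid \text{NN})\, P(\text{DT} \mid \text{start})\, P(\text{JJ} \mid \text{DT})\, P(\text{NN} \mid \text{JJ})$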
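The product above translates directly into a loop over (word, tag) pairs. The sketch below works in log space to avoid underflow on long sentences; the probability values, the `"<s>"` start symbol, and the function name are hypothetical and chosen only to make the "a smart dog" example concrete.

```python
import math

# Hypothetical toy parameters, for illustration only.
transition = {("<s>", "DT"): 0.5, ("DT", "JJ"): 0.3, ("JJ", "NN"): 0.6}  # P(tag | previous tag)
emission = {("DT", "a"): 0.4, ("JJ", "smart"): 0.01, ("NN", "dog"): 0.02}  # P(word | tag)


def joint_log_prob(words, tags, transition, emission, start="<s>"):
    """log P(t, w) = sum_i [ log P(t_i | t_{i-1}) + log P(w_i | t_i) ]."""
    logp = 0.0
    prev = start
    for word, tag in zip(words, tags):
        p = transition.get((prev, tag), 0.0) * emission.get((tag, word), 0.0)
        if p == 0.0:
            return float("-inf")  # an unseen transition or emission makes the path impossible
        logp += math.log(p)
        prev = tag
    return logp


p = math.exp(joint_log_prob(["a", "smart", "dog"], ["DT", "JJ", "NN"], transition, emission))
print(p)  # product of the six factors: 0.5 * 0.4 * 0.3 * 0.01 * 0.6 * 0.02
```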

18 Put them together
- Two independence assumptions
  - Approximate $P(t)$ by a bi(or N)-gram model
  - Assume each word depends only on its POS tag
[Figure label: initial probability $p(t_1)$]

19 HMMs as probabilistic FSA
[Figure credit: Julia Hockenmaier, Intro to NLP]

20 Table representation
Let $\lambda = \{A, B, \pi\}$ represent all parameters.

21 Hidden Markov Models (formal)
- States $T = t_1, t_2, \dots, t_N$
- Observations $W = w_1, w_2, \dots, w_N$
  - Each observation is a symbol from a vocabulary $V = \{v_1, v_2, \dots, v_{|V|}\}$
- Transition probabilities
  - Transition probability matrix $A = \{a_{ij}\}$, $a_{ij} = P(t_m = j \mid t_{m-1} = i)$, $1 \le i, j \le N$
- Observation likelihoods
  - Output probability matrix $B = \{b_i(k)\}$, $b_i(k) = P(w_m = v_k \mid t_m = i)$
- Special initial probability vector $\pi$, $\pi_i = P(t_1 = i)$, $1 \le i \le N$
Here $m$ indexes the position in the sequence, while $i$ and $j$ index states.
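One way to hold the parameters A, B, and π in code is a small container that also checks that every row is a probability distribution. This is a sketch under assumptions: the class name, the `is_valid` method, and the two-state hot/cold numbers are hypothetical, and no end state is modelled.

```python
import math
from dataclasses import dataclass


@dataclass
class HMM:
    states: list  # N state labels
    vocab: list   # vocabulary symbols
    A: list       # A[i][j] = P(next state is j | current state is i)
    B: list       # B[i][k] = P(observation is vocab[k] | state is i)
    pi: list      # pi[i]   = P(first state is i)

    def is_valid(self, tol=1e-9):
        """True if shapes match and pi and every row of A and B sum to 1."""
        n, v = len(self.states), len(self.vocab)
        shapes_ok = (
            len(self.pi) == n
            and len(self.A) == n
            and len(self.B) == n
            and all(len(row) == n for row in self.A)
            and all(len(row) == v for row in self.B)
        )
        if not shapes_ok:
            return False
        rows = [self.pi] + self.A + self.B
        return all(
            min(row) >= 0 and math.isclose(sum(row), 1.0, abs_tol=tol) for row in rows
        )


# Hypothetical parameters: states hot/cold, observations = number of ice creams.
hmm = HMM(
    states=["H", "C"],
    vocab=[1, 2, 3],
    A=[[0.7, 0.3], [0.4, 0.6]],
    B=[[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]],
    pi=[0.8, 0.2],
)
print(hmm.is_valid())
```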

22 How to build a second-order HMM?
- Second-order HMM
  - Trigram model over POS tags
  - $P(t) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})$
  - $P(w, t) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})\, P(w_i \mid t_i)$
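The second-order joint probability differs from the bigram case only in the transition lookup, which now conditions on two previous tags. In the sketch below the sequence is padded with two start symbols so that the first two positions have a full history; that padding convention, the function name, and all numbers are assumptions of the example.

```python
import math

# Hypothetical toy parameters; keys are (t_{i-2}, t_{i-1}, t_i) and (tag, word).
trans3 = {
    ("<s>", "<s>", "DT"): 0.5,
    ("<s>", "DT", "JJ"): 0.3,
    ("DT", "JJ", "NN"): 0.7,
}
emission = {("DT", "a"): 0.4, ("JJ", "smart"): 0.01, ("NN", "dog"): 0.02}


def trigram_joint_prob(words, tags, trans3, emission, start="<s>"):
    """P(w, t) = prod_i P(t_i | t_{i-1}, t_{i-2}) * P(w_i | t_i)."""
    prev2, prev1 = start, start  # two padding symbols before the sentence
    p = 1.0
    for word, tag in zip(words, tags):
        p *= trans3.get((prev2, prev1, tag), 0.0) * emission.get((tag, word), 0.0)
        prev2, prev1 = prev1, tag
    return p


p3 = trigram_joint_prob(["a", "smart", "dog"], ["DT", "JJ", "NN"], trans3, emission)
print(p3)
```

The cost of the richer history is a larger table: with N tags there are up to N^3 trigram transitions to estimate instead of N^2.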

23 Probabilistic FSA for second-order HMM
[Figure credit: Julia Hockenmaier, Intro to NLP]

24 Prediction in generative model
- Inference: What is the most likely sequence of tags for the given sequence of words w?
- What are the latent states that most likely generate the sequence of words w?
[Figure label: initial probability $p(t_1)$]

25 Example: The Verb race
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
- People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- How do we pick the right tag?

26 Disambiguating race

27 Disambiguating race
- P(NN | TO) =
- P(VB | TO) = .83
- P(race | NN) =
- P(race | VB) =
- P(NR | VB) = .0027
- P(NR | NN) = .0012
- P(VB | TO) P(NR | VB) P(race | VB) =
- P(NN | TO) P(NR | NN) P(race | NN) =
- So we (correctly) choose the verb reading.
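The comparison on this slide is just two three-factor products. In the sketch below, .83, .0027, and .0012 are the values shown on the slide; the other three (.00047, .00057, .00012) are assumptions, taken from the version of this example in Jurafsky and Martin's Speech and Language Processing, and should be checked against the original slide.

```python
# P(tag | previous tag) and P(next tag | tag)
p_nn_given_to = 0.00047  # assumed (Jurafsky and Martin)
p_vb_given_to = 0.83     # from the slide
p_nr_given_vb = 0.0027   # from the slide
p_nr_given_nn = 0.0012   # from the slide

# P(word | tag)
p_race_given_nn = 0.00057  # assumed (Jurafsky and Martin)
p_race_given_vb = 0.00012  # assumed (Jurafsky and Martin)

# Score each reading of "race" in "to race tomorrow".
score_vb = p_vb_given_to * p_nr_given_vb * p_race_given_vb
score_nn = p_nn_given_to * p_nr_given_nn * p_race_given_nn

best = "VB" if score_vb > score_nn else "NN"
print(best, score_vb, score_nn)
```

Under these values the verb reading wins by roughly three orders of magnitude, driven almost entirely by P(VB | TO) being far larger than P(NN | TO).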

28 Jason and his Ice Creams
- You are a climatologist in the year 2799
- Studying global warming
- You can't find any records of the weather in Baltimore, MD for summer of 2007
- But you find Jason Eisner's diary
- Which lists how many ice creams Jason ate every day that summer
- Our job: figure out how hot it was

29 (C)old day vs. (H)ot day
[Table: probabilities p(1 | ·), p(2 | ·), p(3 | ·) for the number of cones and p(C | ·), p(H | ·) for the weather, with columns conditioned on C, H, and START]
[Figure: "Weather States that Best Explain Ice Cream Consumption"; series: Ice Creams, p(H); x-axis: Diary Day]

30 Three basic problems for HMMs
- Likelihood of the input (how likely is the sentence "I love cat"?)
  - Compute $P(w \mid \lambda)$ for the input $w$ and HMM $\lambda$
- Decoding (tagging) the input (what are the POS tags of "I love cat"?)
  - Find the best tag sequence $\arg\max_{t} P(t \mid w, \lambda)$
- Estimation (learning) (how do we learn the model?)
  - Find the best model parameters
  - Case 1: supervised, tags are annotated
  - Case 2: unsupervised, only unannotated text

31 Three basic problems for HMMs
- Likelihood of the input (how likely is the sentence "I love cat"?)
  - Forward algorithm
- Decoding (tagging) the input (what are the POS tags of "I love cat"?)
  - Viterbi algorithm
- Estimation (learning) (how do we learn the model?)
  - Find the best model parameters
  - Case 1: supervised, tags are annotated
    - Maximum likelihood estimation (MLE)
  - Case 2: unsupervised, only unannotated text
    - Forward-backward algorithm
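As a rough illustration of the likelihood problem, the forward algorithm sums over all state sequences one position at a time instead of enumerating them. The sketch below is a minimal version under assumptions: the two-state hot/cold parameters are hypothetical, the dictionaries `pi`, `A`, `B` are one possible encoding of λ, the input is assumed non-empty, and no end state is modelled.

```python
def forward_likelihood(obs, states, pi, A, B):
    """P(obs | lambda), computed in O(len(obs) * N^2) time."""
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: pi[s] * B[s].get(obs[0], 0.0) for s in states}
    for o in obs[1:]:
        alpha = {
            s: B[s].get(o, 0.0) * sum(alpha[r] * A[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical parameters: hidden weather (H/C), observed ice creams per day.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

likelihood = forward_likelihood([3, 1, 3], states, pi, A, B)
print(likelihood)
```

Replacing the inner `sum` with `max` (and keeping backpointers) turns the same recurrence into Viterbi decoding.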

33 Learning from Labeled Data
- Let's play a game!
- We count how often we see $t_{i-1}\, t_i$ and $w_i\, t_i$, then normalize.
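Count-and-normalize can be sketched as follows. The training sentences are the decoded examples from the "Let's try" slides; the function name and the `"<s>"` start symbol are assumptions, and no smoothing is applied, so any pair not seen in training gets probability zero.

```python
from collections import Counter


def estimate_hmm(tagged_sentences, start="<s>"):
    """Maximum likelihood estimates of P(tag | previous tag) and P(word | tag)."""
    transition_counts = Counter()  # (previous tag, tag)
    emission_counts = Counter()    # (tag, word)
    context_counts = Counter()     # times a tag (or start) appears as the previous tag
    tag_counts = Counter()         # times a tag emits a word
    for sentence in tagged_sentences:
        prev = start
        for word, tag in sentence:
            transition_counts[(prev, tag)] += 1
            context_counts[prev] += 1
            emission_counts[(tag, word)] += 1
            tag_counts[tag] += 1
            prev = tag
    transition = {
        (p, t): c / context_counts[p] for (p, t), c in transition_counts.items()
    }
    emission = {(t, w): c / tag_counts[t] for (t, w), c in emission_counts.items()}
    return transition, emission


corpus = [
    [("a", "DT"), ("dog", "NN"), ("is", "VBZ"), ("chasing", "VBG"), ("a", "DT"), ("cat", "NN"), (".", ".")],
    [("a", "DT"), ("fox", "NN"), ("is", "VBZ"), ("running", "VBG"), (".", ".")],
    [("a", "DT"), ("boy", "NN"), ("is", "VBZ"), ("singing", "VBG"), (".", ".")],
    [("a", "DT"), ("happy", "JJ"), ("bird", "NN")],
]

transition, emission = estimate_hmm(corpus)
print(transition[("DT", "NN")], transition[("DT", "JJ")], emission[("NN", "cat")])
```

On this corpus DT is followed by NN four times out of five and by JJ once, which is why "a ha77y cat" is most plausibly tagged DT JJ NN.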

34 Three basic problems for HMMs
- Likelihood of the input (how likely is the sentence "I love cat"?)
  - Forward algorithm
- Decoding (tagging) the input (what are the POS tags of "I love cat"?)
  - Viterbi algorithm
- Estimation (learning) (how do we learn the model?)
  - Find the best model parameters
  - Case 1: supervised, tags are annotated
    - Maximum likelihood estimation (MLE)
  - Case 2: unsupervised, only unannotated text
    - Forward-backward algorithm
We need dynamic programming for the other problems.
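To make the dynamic programming concrete for the decoding problem, here is a minimal Viterbi sketch. It keeps, for each state, the probability of the best path ending there, plus backpointers to recover that path. The hot/cold parameters are hypothetical, the dictionary encoding of λ is an assumption, the input is assumed non-empty, and a practical tagger would work in log space.

```python
def viterbi(obs, states, pi, A, B):
    """Return (best state sequence, its joint probability with obs)."""
    # best[s] = probability of the best path that ends in state s at the current position
    best = {s: pi[s] * B[s].get(obs[0], 0.0) for s in states}
    backpointers = []
    for o in obs[1:]:
        prev_best, best, pointers = best, {}, {}
        for s in states:
            # Best predecessor of s, ignoring the emission (it is the same for all predecessors).
            r = max(states, key=lambda q: prev_best[q] * A[q][s])
            best[s] = prev_best[r] * A[r][s] * B[s].get(o, 0.0)
            pointers[s] = r
        backpointers.append(pointers)
    # Follow backpointers from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Hypothetical parameters: hidden weather (H/C), observed ice creams per day.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

path, prob = viterbi([3, 1, 3], states, pi, A, B)
print(path, prob)
```

With N states and a sentence of length n this takes O(n * N^2) time, compared with N^n candidate sequences for brute-force enumeration.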
