Lecture 9: Hidden Markov Model Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Course webpage: http://kwchang.net/teaching/nlp16 CS6501 Natural Language Processing 1
This lecture v Hidden Markov Model v Different views of HMM v HMM in supervised learning setting CS6501 Natural Language Processing 2
Recap: Parts of Speech v Traditional parts of speech v ~ 8 of them CS6501 Natural Language Processing 3
Recap: Tagset v Penn TreeBank tagset, 45 tags: v PRP$, WRB, WP$, VBG v Penn POS annotations: The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./. v Universal tag set, 12 tags v NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., X CS6501 Natural Language Processing 4
Recap: POS Tagging vs. Word Clustering v Words often have more than one POS: back v The back door = JJ v On my back = NN v Win the voters back = RB v Promised to back the bill = VB v Syntax vs. semantics (details later) These examples from Dekang Lin CS6501 Natural Language Processing 5
Recap: POS tag sequences v Some tag sequences are more likely to occur than others v POS n-gram view: https://books.google.com/ngrams/graph?content=_ADJ_+_NOUN_%2C_ADV_+_NOUN_%2C+_ADV_+_VERB_ Existing methods often model POS tagging as a sequence tagging problem CS6501 Natural Language Processing 6
Evaluation v How many words in the unseen test data can be tagged correctly? v Usually evaluated on Penn Treebank v State of the art ~97% v Trivial baseline (most likely tag) ~94% v Human performance ~97% CS6501 Natural Language Processing 7
Building a POS tagger v Supervised learning v Assume linguists have annotated several examples Tag set: DT, JJ, NN, VBD POS Tagger The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./. CS6501 Natural Language Processing 8
POS induction v Unsupervised learning v Assume we only have an unannotated corpus Tag set: DT, JJ, NN, VBD POS Tagger The grand jury commented on a number of other topics. CS6501 Natural Language Processing 9
TODAY: Hidden Markov Model v We focus on the supervised learning setting v What is the most likely sequence of tags for the given sequence of words w? v We will talk about other ML models for this type of prediction task later. CS6501 Natural Language Processing 10
Let's try Don't worry! There is no problem with your eyes or computer. a/DT d6g/NN 0s/VBZ chas05g/VBG a/DT cat/NN ./. a/DT f6x/NN 0s/VBZ 9u5505g/VBG ./. a/DT b6y/NN 0s/VBZ s05g05g/VBG ./. a/DT ha77y/JJ b09d/NN What is the POS tag sequence of the following sentence? a ha77y cat was s05g05g. CS6501 Natural Language Processing 11
Let's try v a/DT d6g/NN 0s/VBZ chas05g/VBG a/DT cat/NN ./. a/DT dog/NN is/VBZ chasing/VBG a/DT cat/NN ./. v a/DT f6x/NN 0s/VBZ 9u5505g/VBG ./. a/DT fox/NN is/VBZ running/VBG ./. v a/DT b6y/NN 0s/VBZ s05g05g/VBG ./. a/DT boy/NN is/VBZ singing/VBG ./. v a/DT ha77y/JJ b09d/NN a/DT happy/JJ bird/NN v a ha77y cat was s05g05g. a happy cat was singing. CS6501 Natural Language Processing 12
How do you predict the tags? v Two types of information are useful v Relations between words and tags v Relations between tags and tags v e.g., DT NN and DT JJ NN are common tag sequences CS6501 Natural Language Processing 13
Statistical POS tagging v What is the most likely sequence of tags for the given sequence of words w? $P(\text{DT JJ NN} \mid \text{a smart dog}) = P(\text{DT JJ NN, a smart dog}) \,/\, P(\text{a smart dog})$ $P(\text{DT JJ NN, a smart dog}) = P(\text{DT JJ NN}) \, P(\text{a smart dog} \mid \text{DT JJ NN})$ CS6501 Natural Language Processing 14
Transition Probability v Joint probability: $P(\mathbf{t}, \mathbf{w}) = P(\mathbf{t}) \, P(\mathbf{w} \mid \mathbf{t})$ v $P(\mathbf{t}) = P(t_1, t_2, \ldots, t_n) = P(t_1) P(t_2 \mid t_1) P(t_3 \mid t_2, t_1) \cdots P(t_n \mid t_1, \ldots, t_{n-1}) \approx P(t_1) P(t_2 \mid t_1) P(t_3 \mid t_2) \cdots P(t_n \mid t_{n-1}) = \prod_{i=1}^{n} P(t_i \mid t_{i-1})$ (Markov assumption) v Bigram model over POS tags! (Similarly, we can define an n-gram model over POS tags; this is usually called a higher-order HMM.) CS6501 Natural Language Processing 15
Emission Probability v Joint probability: $P(\mathbf{t}, \mathbf{w}) = P(\mathbf{t}) \, P(\mathbf{w} \mid \mathbf{t})$ v Assume words only depend on their POS tags: $P(\mathbf{w} \mid \mathbf{t}) \approx P(w_1 \mid t_1) P(w_2 \mid t_2) \cdots P(w_n \mid t_n) = \prod_{i=1}^{n} P(w_i \mid t_i)$ (independence assumption) i.e., $P(\text{a smart dog} \mid \text{DT JJ NN}) = P(\text{a} \mid \text{DT}) \, P(\text{smart} \mid \text{JJ}) \, P(\text{dog} \mid \text{NN})$ CS6501 Natural Language Processing 16
Put them together v Joint probability: $P(\mathbf{t}, \mathbf{w}) = P(\mathbf{t}) \, P(\mathbf{w} \mid \mathbf{t})$ v $P(\mathbf{t}, \mathbf{w}) = P(t_1) P(t_2 \mid t_1) P(t_3 \mid t_2) \cdots P(t_n \mid t_{n-1}) \, P(w_1 \mid t_1) P(w_2 \mid t_2) \cdots P(w_n \mid t_n) = \prod_{i=1}^{n} P(w_i \mid t_i) \, P(t_i \mid t_{i-1})$ e.g., $P(\text{a smart dog, DT JJ NN}) = P(\text{a} \mid \text{DT}) P(\text{smart} \mid \text{JJ}) P(\text{dog} \mid \text{NN}) \, P(\text{DT} \mid \text{start}) P(\text{JJ} \mid \text{DT}) P(\text{NN} \mid \text{JJ})$ CS6501 Natural Language Processing 17
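To make the factorization concrete, here is a minimal sketch (not from the slides) that scores a tag sequence for "a smart dog" under the bigram HMM above; all probability values are made-up illustrative numbers, not estimates from any corpus.

```python
# Toy bigram HMM: score P(t, w) = prod_i P(w_i | t_i) * P(t_i | t_{i-1}).
# All numbers below are invented for illustration only.
trans = {("<s>", "DT"): 0.5, ("DT", "JJ"): 0.3, ("DT", "NN"): 0.6, ("JJ", "NN"): 0.8}
emit = {("DT", "a"): 0.4, ("JJ", "smart"): 0.01, ("NN", "dog"): 0.05}

def joint_prob(tags, words):
    """P(t, w) under the bigram HMM factorization."""
    p = 1.0
    prev = "<s>"  # special start state, so P(t_1) = P(t_1 | <s>)
    for tag, word in zip(tags, words):
        p *= trans.get((prev, tag), 0.0) * emit.get((tag, word), 0.0)
        prev = tag
    return p

print(joint_prob(["DT", "JJ", "NN"], ["a", "smart", "dog"]))
# 0.5 * 0.4 * 0.3 * 0.01 * 0.8 * 0.05 = 2.4e-05
```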
Put them together v Two independence assumptions: v Approximate P(t) by a bigram (or n-gram) model, with initial probability $p(t_1)$ v Assume each word depends only on its POS tag CS6501 Natural Language Processing 18
HMMs as probabilistic FSA Julia Hockenmaier: Intro to NLP CS6501 Natural Language Processing 19
Table representation Let $\lambda = \{A, B, \pi\}$ represent all parameters CS6501 Natural Language Processing 20
Hidden Markov Models (formal) v States $T = t_1, t_2, \ldots, t_N$ v Observations $W = w_1, w_2, \ldots, w_N$ v Each observation is a symbol from a vocabulary $V = \{v_1, v_2, \ldots, v_{|V|}\}$ v Transition probabilities v Transition probability matrix $A = \{a_{ij}\}$, where $a_{ij} = P(t_m = j \mid t_{m-1} = i), \; 1 \le i, j \le N$ v Observation likelihoods v Output probability matrix $B = \{b_i(k)\}$, where $b_i(k) = P(w_m = v_k \mid t_m = i)$ v Special initial probability vector $\pi$, where $\pi_i = P(t_1 = i), \; 1 \le i \le N$ CS6501 Natural Language Processing 21
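As a concrete (hypothetical) encoding of $\lambda = \{A, B, \pi\}$, the sketch below stores the three parameter tables as NumPy arrays for a two-state toy tagger; the state names, vocabulary, and numbers are all illustrative, not learned from data.

```python
import numpy as np

# Hypothetical 2-state HMM (states: 0 = DT, 1 = NN) over a 3-word vocabulary.
states = ["DT", "NN"]
vocab = {"a": 0, "dog": 1, "cat": 2}

pi = np.array([0.9, 0.1])        # pi_i = P(t_1 = i)
A = np.array([[0.1, 0.9],        # a_ij = P(next tag = j | current tag = i)
              [0.6, 0.4]])
B = np.array([[0.8, 0.1, 0.1],   # b_i(k) = P(word = v_k | tag = i)
              [0.05, 0.5, 0.45]])

# Sanity checks: each distribution sums to one.
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```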
How to build a second-order HMM? v Second-order HMM v Trigram model over POS tags v $P(\mathbf{t}) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})$ v $P(\mathbf{w}, \mathbf{t}) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i)$ CS6501 Natural Language Processing 22
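A minimal sketch of the trigram factorization, assuming two padded start symbols so the first two tags are well-defined; the probability tables are hypothetical.

```python
# Second-order (trigram) HMM: prod_i P(t_i | t_{i-1}, t_{i-2}) * P(w_i | t_i).
# Tables are hypothetical illustrative numbers.
trans = {("<s>", "<s>", "DT"): 0.5, ("<s>", "DT", "JJ"): 0.3, ("DT", "JJ", "NN"): 0.8}
emit = {("DT", "a"): 0.4, ("JJ", "happy"): 0.02, ("NN", "bird"): 0.03}

def joint_prob_trigram(tags, words):
    p = 1.0
    prev2, prev1 = "<s>", "<s>"  # pad with two start symbols
    for tag, word in zip(tags, words):
        p *= trans.get((prev2, prev1, tag), 0.0) * emit.get((tag, word), 0.0)
        prev2, prev1 = prev1, tag
    return p

print(joint_prob_trigram(["DT", "JJ", "NN"], ["a", "happy", "bird"]))
```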
Probabilistic FSA for second-order HMM Julia Hockenmaier: Intro to NLP CS6501 Natural Language Processing 23
Prediction in generative model v Inference: What is the most likely sequence of tags for the given sequence of words w? $\hat{\mathbf{t}} = \arg\max_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w})$ v What are the latent states that most likely generate the sequence of words w? CS6501 Natural Language Processing 24
Example: The Verb race v Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR v People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN v How do we pick the right tag? CS6501 Natural Language Processing 25
Disambiguating race CS6501 Natural Language Processing 26
Disambiguating race v P(NN | TO) = .00047 v P(VB | TO) = .83 v P(race | NN) = .00057 v P(race | VB) = .00012 v P(NR | VB) = .0027 v P(NR | NN) = .0012 v P(VB | TO) P(NR | VB) P(race | VB) = .00000027 v P(NN | TO) P(NR | NN) P(race | NN) = .00000000032 v So we (correctly) choose the verb reading. CS6501 Natural Language Processing 27
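A quick check of the arithmetic above, using the slide's probabilities:

```python
# Compare the two readings of "race" after "to", using the slide's numbers.
p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"VB reading: {p_vb:.2e}, NN reading: {p_nn:.2e}")  # VB wins by ~800x
```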
Jason and his Ice Creams v You are a climatologist in the year 2799 v Studying global warming v You can't find any records of the weather in Baltimore, MD for the summer of 2007 v But you find Jason Eisner's diary v Which lists how many ice creams Jason ate every day that summer v Our job: figure out how hot it was http://videolectures.net/hltss2010_eisner_plm/ http://www.cs.jhu.edu/~jason/papers/eisner.hmm.xls CS6501 Natural Language Processing 28
(C)old day vs. (H)ot day

  #cones          p(... | C)   p(... | H)   p(... | START)
  p(1 | .)           0.7          0.1
  p(2 | .)           0.2          0.2
  p(3 | .)           0.1          0.7
  p(C | .)           0.8          0.1          0.5
  p(H | .)           0.1          0.8          0.5
  p(STOP | .)        0.1          0.1          0

[Figure: "Weather States that Best Explain Ice Cream Consumption" -- number of ice creams eaten and p(H) plotted over diary days 1-33] CS6501 Natural Language Processing 29
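The sketch below encodes the table as Python dictionaries and scores one possible (weather, cones) sequence; the parameter values come from the slide, while the example diary sequence is made up.

```python
# Eisner's ice cream HMM, parameters from the table above.
trans = {("START", "C"): 0.5, ("START", "H"): 0.5,
         ("C", "C"): 0.8, ("C", "H"): 0.1, ("C", "STOP"): 0.1,
         ("H", "C"): 0.1, ("H", "H"): 0.8, ("H", "STOP"): 0.1}
emit = {("C", 1): 0.7, ("C", 2): 0.2, ("C", 3): 0.1,
        ("H", 1): 0.1, ("H", 2): 0.2, ("H", 3): 0.7}

def joint_prob(weather, cones):
    """P(weather sequence, cone counts), including the STOP transition."""
    p = 1.0
    prev = "START"
    for state, c in zip(weather, cones):
        p *= trans[(prev, state)] * emit[(state, c)]
        prev = state
    return p * trans[(prev, "STOP")]

# Made-up three-day diary: hot, hot, cold with 3, 3, 1 cones.
print(joint_prob(["H", "H", "C"], [3, 3, 1]))
```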
Three basic problems for HMMs v Likelihood of the input: v Compute P(w | λ) for the input w and HMM λ (e.g., how likely is it that the sentence "I love cat" occurs?) v Decoding (tagging) the input: v Find the best tag sequence $\arg\max_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w}, \lambda)$ (e.g., the POS tags of "I love cat") v Estimation (learning): v Find the best model parameters (how to learn the model?) v Case 1: supervised -- tags are annotated v Case 2: unsupervised -- only unannotated text CS6501 Natural Language Processing 30
Three basic problems for HMMs v Likelihood of the input (how likely the sentence "I love cat" occurs): v Forward algorithm v Decoding (tagging) the input (the POS tags of "I love cat"): v Viterbi algorithm v Estimation (learning): v Find the best model parameters (how to learn the model?) v Case 1: supervised -- tags are annotated: maximum likelihood estimation (MLE) v Case 2: unsupervised -- only unannotated text: forward-backward algorithm CS6501 Natural Language Processing 31
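A minimal forward-algorithm sketch for the likelihood problem, reusing the ice cream parameters above; this is an illustrative implementation, not code from the course, and the STOP transition is omitted for brevity.

```python
# Forward algorithm: P(observations | model) by summing over all state paths.
trans = {("START", "C"): 0.5, ("START", "H"): 0.5,
         ("C", "C"): 0.8, ("C", "H"): 0.1,
         ("H", "C"): 0.1, ("H", "H"): 0.8}
emit = {("C", 1): 0.7, ("C", 2): 0.2, ("C", 3): 0.1,
        ("H", 1): 0.1, ("H", 2): 0.2, ("H", 3): 0.7}
states = ["C", "H"]

def forward(obs):
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: trans[("START", s)] * emit[(s, obs[0])] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans[(r, s)] for r in states) * emit[(s, o)]
                 for s in states}
    return sum(alpha.values())

print(forward([2, 3, 3]))  # likelihood of eating 2, 3, 3 cones
```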
Learning from Labeled Data v Let's play a game! v We count how often we see $t_{i-1} t_i$ and $w_i$ with $t_i$, then normalize. CS6501 Natural Language Processing 33
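A sketch of that count-and-normalize recipe (MLE for a supervised HMM); the tiny tagged corpus is made up for illustration, and smoothing is omitted.

```python
from collections import Counter

# Made-up labeled corpus: sentences as lists of (word, tag) pairs.
corpus = [[("a", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("a", "DT"), ("smart", "JJ"), ("dog", "NN")]]

trans_counts, emit_counts = Counter(), Counter()
tag_counts, prev_counts = Counter(), Counter()
for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        trans_counts[(prev, tag)] += 1   # count t_{i-1} t_i
        prev_counts[prev] += 1
        emit_counts[(tag, word)] += 1    # count w_i with t_i
        tag_counts[tag] += 1
        prev = tag

# Normalize counts into probabilities.
P_trans = {k: v / prev_counts[k[0]] for k, v in trans_counts.items()}
P_emit = {k: v / tag_counts[k[0]] for k, v in emit_counts.items()}
print(P_trans[("DT", "NN")])  # 0.5: DT is followed by NN once and JJ once
print(P_emit[("NN", "dog")])  # 1.0: NN always emits "dog" in this corpus
```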
Three basic problems for HMMs v Likelihood of the input: v Forward algorithm v Decoding (tagging) the input: v Viterbi algorithm v Estimation (learning): v Find the best model parameters v Case 1: supervised -- tags are annotated: maximum likelihood estimation (MLE) v Case 2: unsupervised -- only unannotated text: forward-backward algorithm We need dynamic programming for the other problems (see the sketch below). CS6501 Natural Language Processing 34
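To illustrate what that dynamic programming looks like for decoding, here is a minimal Viterbi sketch over the ice cream HMM; it is illustrative only, with the STOP transition again omitted.

```python
# Viterbi: best state sequence via dynamic programming.
trans = {("START", "C"): 0.5, ("START", "H"): 0.5,
         ("C", "C"): 0.8, ("C", "H"): 0.1,
         ("H", "C"): 0.1, ("H", "H"): 0.8}
emit = {("C", 1): 0.7, ("C", 2): 0.2, ("C", 3): 0.1,
        ("H", 1): 0.1, ("H", 2): 0.2, ("H", 3): 0.7}
states = ["C", "H"]

def viterbi(obs):
    # delta[s] = (best prob of any path ending in s, that path)
    delta = {s: (trans[("START", s)] * emit[(s, obs[0])], [s]) for s in states}
    for o in obs[1:]:
        delta = {s: max(((p * trans[(r, s)] * emit[(s, o)], path + [s])
                         for r, (p, path) in delta.items()), key=lambda x: x[0])
                 for s in states}
    return max(delta.values(), key=lambda x: x[0])

prob, path = viterbi([3, 3, 1])
print(path, prob)  # prints the best weather sequence and its joint probability
```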