Lecture 6: Hidden Markov Models and Maximum Entropy Models (CS 6320)
HMM Outline
- Markov Chains
- Hidden Markov Models
- Likelihood: Forward Algorithm
- Decoding: Viterbi Algorithm
- Maximum Entropy Models
Definitions
A weighted finite-state automaton adds probabilities to the arcs; the probabilities on the arcs leaving any state must sum to one.
A Markov chain is a special case of a WFSA in which the input sequence uniquely determines which states the automaton will go through.
Markov chains can't represent inherently ambiguous problems; they are useful for assigning probabilities to unambiguous sequences.
Markov Chain for Weather
Markov Chain for Words
Markov Chain Model
- A set of states $Q = q_1, q_2, \ldots, q_N$; the state at time $t$ is $q_t$.
- Transition probabilities: a set of probabilities $A = a_{01}, a_{02}, \ldots, a_{n1}, \ldots, a_{nn}$. Each $a_{ij}$ represents the probability of transitioning from state $i$ to state $j$. The set of these is the transition probability matrix $A$.
- Markov Assumption: the current state depends only on the previous state: $P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$
Markov Chain Model
The transition probabilities out of any state must sum to one: $\sum_{j=1}^{N} a_{ij} = 1$
Weather Example
Markov chains are useful when we need to compute the probability of a sequence of events that are observable.
Markov Chain for Weather
What is the probability of 4 consecutive warm days? The sequence is warm-warm-warm-warm, i.e., the state sequence is 3-3-3-3.
$P(3,3,3,3) = \pi_3 \, a_{33} \, a_{33} \, a_{33} = 0.2 \times 0.6^3 = 0.0432$
But what if the states are not observable?
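As a quick check, here is a minimal Python sketch of this computation; the state encoding and the values $\pi_3 = 0.2$ and $a_{33} = 0.6$ are taken from the example above.

```python
# Probability of a state sequence in a Markov chain:
# P(s_1, ..., s_T) = pi[s_1] * prod_t A[s_{t-1}][s_t]

def sequence_probability(states, pi, A):
    prob = pi[states[0]]
    for prev, curr in zip(states, states[1:]):
        prob *= A[prev][curr]
    return prob

pi = {3: 0.2}          # initial probability of the "warm" state (state 3)
A = {3: {3: 0.6}}      # self-transition probability for "warm"
print(sequence_probability([3, 3, 3, 3], pi, A))  # 0.0432
```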
HMM for Ice Cream
You are a climatologist in the year 2799 studying global warming. You can't find any records of the weather in Baltimore, MD for the summer of 2007, but you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer. Our job: figure out how hot it was.
Hidden Markov Model
For Markov chains, the output symbols are the same as the states: if we see hot weather, we're in state hot. But in part-of-speech tagging and other tasks, the output symbols are words while the hidden states are part-of-speech tags, so we need an extension. A Hidden Markov Model is an extension of a Markov chain in which the observed symbols are not the same as the states. This means we don't know which state we are in.
Hidden Markov Models
- States: $Q = q_1, q_2, \ldots, q_N$
- Observations: $O = o_1, o_2, \ldots, o_T$; each observation is a symbol from a vocabulary $V = \{v_1, v_2, \ldots, v_V\}$
- Transition probabilities: transition probability matrix $A = \{a_{ij}\}$, where $a_{ij} = P(q_t = j \mid q_{t-1} = i)$, $1 \le i, j \le N$
- Observation likelihoods: output probability matrix $B = \{b_i(k)\}$, where $b_i(k) = P(X_t = o_k \mid q_t = i)$
- Special initial probability vector $\pi$: $\pi_i = P(q_1 = i)$, $1 \le i \le N$
Eisner Task
Given: ice cream observation sequence 1,2,3,2,2,2,3
Produce: weather sequence H,C,H,H,H,C
HMM for Ice Cream
There are two hidden states: hot and cold. Observations are the number of ice cream events: O = {1, 2, 3}.
Transition Probabilities
Observation Likelihoods
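To make the model concrete, here is a sketch of the ice-cream HMM as plain Python data. The specific numbers below are illustrative placeholders in the spirit of Eisner's example, not necessarily the values shown on the transition and observation slides; the structure (states, $\pi$, $A$, $B$) is exactly the definition just given.

```python
states = ["hot", "cold"]
observations = [1, 2, 3]          # number of ice creams eaten

pi = {"hot": 0.8, "cold": 0.2}    # initial state distribution (illustrative)

A = {                             # transition probabilities a_ij (illustrative)
    "hot":  {"hot": 0.6, "cold": 0.4},
    "cold": {"hot": 0.5, "cold": 0.5},
}

B = {                             # observation likelihoods b_j(o) (illustrative)
    "hot":  {1: 0.2, 2: 0.4, 3: 0.4},
    "cold": {1: 0.5, 2: 0.4, 3: 0.1},
}
```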
HMM: Three Basic Problems
Likelihood Computation
Given an HMM $\lambda = (A, B)$ and an observation sequence $O$, determine the likelihood $P(O \mid \lambda)$.
Problem 1: compute the probability of eating 3 1 3 ice creams.
Problem 2: compute the probability of eating 3 1 3 ice creams when the hidden sequence is hot hot cold.
Likelihood Computation
For a particular hidden state sequence $Q$ and an observation sequence $O$, the likelihood of the observation sequence is
$P(O \mid Q) = \prod_{i=1}^{T} P(o_i \mid q_i)$
$P(3\ 1\ 3 \mid \text{hot hot cold}) = P(3 \mid \text{hot}) \times P(1 \mid \text{hot}) \times P(3 \mid \text{cold})$
Likelihood Computation
The joint probability of being in a weather state sequence $Q$ and a particular sequence $O$ of ice cream observations is:
$P(O, Q) = P(O \mid Q) \times P(Q) = \prod_{i=1}^{n} P(o_i \mid q_i) \times \prod_{i=1}^{n} P(q_i \mid q_{i-1})$
We can now compute the probability of a sequence of observations $O$ by summing the joint probabilities over all hidden state sequences:
$P(O) = \sum_Q P(O, Q) = \sum_Q P(O \mid Q)\, P(Q)$
$P(3\ 1\ 3) = P(3\ 1\ 3, \text{cold cold cold}) + P(3\ 1\ 3, \text{cold cold hot}) + \ldots + P(3\ 1\ 3, \text{hot hot hot})$
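A brute-force sketch of this sum, reusing the `states`, `pi`, `A`, `B` dictionaries from the earlier sketch. Enumerating all $N^T$ hidden sequences is exactly the exponential cost the forward algorithm avoids.

```python
from itertools import product

def likelihood_brute_force(obs, states, pi, A, B):
    """P(O) = sum over all state sequences Q of P(O | Q) P(Q)."""
    total = 0.0
    for Q in product(states, repeat=len(obs)):
        p = pi[Q[0]] * B[Q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][obs[t]]
        total += p
    return total

print(likelihood_brute_force([3, 1, 3], states, pi, A, B))
```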
Forward Algorithm
For $N$ hidden states and a sequence of $T$ observations, the forward algorithm uses $O(N^2 T)$ operations instead of $O(N^T)$.
$\alpha_t(j)$ is the probability of being in state $j$ after seeing the first $t$ observations:
$\alpha_t(j) = P(o_1, o_2, \ldots, o_t, q_t = j \mid \lambda)$
$\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$
Forward trellis for the ice cream example
Forward Algorithm
1. Initialization: $\alpha_1(j) = \pi_j\, b_j(o_1)$, $1 \le j \le N$
2. Recursion: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$; $1 \le j \le N$, $1 < t \le T$
3. Termination: $P(O \mid \lambda) = \alpha_T(q_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$
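A direct Python transcription of the three steps above, assuming the `states`, `pi`, `A`, `B` dictionaries from the earlier sketch. Since that model has no explicit final state $q_F$, termination here is a plain sum over $\alpha_T(i)$.

```python
def forward(obs, states, pi, A, B):
    """Forward algorithm: O(N^2 T) computation of P(O | lambda)."""
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for t in range(1, len(obs)):
        alpha.append({
            j: sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][obs[t]]
            for j in states
        })
    # Termination: P(O) = sum_i alpha_T(i)  (no explicit final state)
    return sum(alpha[-1].values())

print(forward([3, 1, 3], states, pi, A, B))  # matches the brute-force value
```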
Decoding
Decoding means finding the most probable hidden state sequence for an observation sequence. POS tagging is such a problem, and so is the weather problem. Recall that in the case of POS tagging we need to compute
$\hat{t}_1^n = \operatorname{argmax}_{t_1^n} P(t_1^n \mid w_1^n)$
We could just enumerate all paths given the input and use the model to assign probabilities to each. Not a good idea. Luckily, dynamic programming helps us here.
Viterbi Algorithm
The Viterbi algorithm computes a trellis using dynamic programming. The observation sequence is processed from left to right, filling out a trellis of states.
$v_t(j)$ is the probability that the HMM is in state $j$ after seeing the first $t$ observations and passing through the most probable state sequence $q_1, \ldots, q_{t-1}$:
$v_t(j) = \max_{q_1, \ldots, q_{t-1}} P(q_1 \ldots q_{t-1}, o_1, o_2, \ldots, o_t, q_t = j \mid \lambda)$
$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$
Viterbi trellis for the ice cream example
Viterbi Algorithm
1. Initialization: $v_1(j) = \pi_j\, b_j(o_1)$; $bt_1(j) = 0$
2. Recursion:
$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$; $1 \le j \le N$, $1 < t \le T$
$bt_t(j) = \operatorname{argmax}_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$; $1 \le j \le N$, $1 < t \le T$
3. Termination:
The best score: $P^* = v_T(q_F) = \max_{i=1}^{N} v_T(i)\, a_{iF}$
The start of backtrace: $q_T^* = bt_T(q_F) = \operatorname{argmax}_{i=1}^{N} v_T(i)\, a_{iF}$
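A sketch of these steps in Python, again assuming the earlier `states`, `pi`, `A`, `B` and no explicit final state, so termination is a plain max over $v_T(i)$. The backpointer table `backptr` records the argmax at each cell so the best path can be traced back.

```python
def viterbi(obs, states, pi, A, B):
    """Viterbi decoding: the most probable state sequence and its score."""
    # Initialization: v_1(j) = pi_j * b_j(o_1)
    v = [{j: pi[j] * B[j][obs[0]] for j in states}]
    backptr = [{}]
    # Recursion: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
    for t in range(1, len(obs)):
        v.append({})
        backptr.append({})
        for j in states:
            best_i = max(states, key=lambda i: v[t - 1][i] * A[i][j])
            v[t][j] = v[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
            backptr[t][j] = best_i
    # Termination: best final state, then trace the backpointers.
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path)), v[-1][last]

print(viterbi([3, 1, 3], states, pi, A, B))
```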
Viterbi Traceback
The Viterbi Algorithm
Viterbi Example
Viterbi Summary
Create an array with columns corresponding to inputs and rows corresponding to possible states. Sweep through the array in one pass, filling the columns left to right using our transition and observation probabilities. The dynamic programming key is that we need only store the MAX prob path to each cell, not all paths.
Evaluation
So once you have your POS tagger running, how do you evaluate it?
- Overall error rate with respect to a gold-standard test set
- Error rates on particular tags
- Error rates on particular words
- Tag confusions
Error Analysis
Look at a confusion matrix to see what errors are causing problems:
- Noun (NN) vs. ProperNoun (NNP) vs. Adjective (JJ)
- Past tense (VBD) vs. Participle (VBN) vs. Adjective (JJ)
Evaluation
The result is compared with a manually coded Gold Standard. Typically accuracy reaches 96-97%. This may be compared with the result for a baseline tagger (one that uses no context). Important: 100% is impossible even for human annotators.
Maximum Entropy Models
MEM Outline
- Maximum Entropy Models: background
- Maximum Entropy Model applied to NLP classification
- Maximum Entropy Markov Models
Maximum Entropy
Probabilistic machine learning for:
- sequence classification: POS tagging, speech recognition
- non-sequential classification: text classification, sentiment analysis
Maximum entropy extracts features from inputs, then combines them to classify inputs. It computes the probability of a class c given an observation x described by a vector of features.
Linear Regression
Problem: predict the price of a house based on vague adjectives used in the ads, e.g., fantastic, cute, charming.
Figure 6.7: Some made-up data on the number of vague adjectives (fantastic, cute, charming) in a real estate ad and the amount the house sold for over the asking price.
Figure 6.8: A plot of the made-up points in Fig. 6.7 (price vs. Num_Adjectives) and the regression line that best fits them, with the equation y = -4900x + 16550.
Multiple Linear Regression
In reality, the price of a house depends on several factors:
$price = w_0 + w_1 \cdot Num\_Adjectives + w_2 \cdot Mortgage\_Rate + w_3 \cdot Num\_Unsold\_Houses$
In general, linear regression combines $N$ features:
$y = w_0 + \sum_{i=1}^{N} w_i f_i$
dot product: $a \cdot b = a_1 b_1 + a_2 b_2 + \ldots + a_N b_N$, so with $f_0 = 1$ we can write $y = w \cdot f$.
Learning in Linear Regression
Problem: learn the weights.
$y_{pred} = \sum_{i=0}^{N} w_i f_i$
Minimize the cost function produced by the weights over all $M$ examples in the training set:
$cost(W) = \sum_{j=0}^{M} \left(y_{pred}^{(j)} - y_{obs}^{(j)}\right)^2$
The closed-form least-squares solution: $W = (X^T X)^{-1} X^T \vec{y}$
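This closed form is easy to verify in numpy. The six points below are made-up values consistent with Fig. 6.7; solving the normal equations recovers the fitted line y = -4900x + 16550 from Fig. 6.8.

```python
import numpy as np

# Design matrix X: a bias column of 1s plus the Num_Adjectives feature.
X = np.array([[1, 4], [1, 3], [1, 2], [1, 2], [1, 1], [1, 0]])
y = np.array([0, 1000, 1500, 6000, 14000, 18000])  # price over asking

# Normal equations: W = (X^T X)^{-1} X^T y
W = np.linalg.inv(X.T @ X) @ X.T @ y
print(W)  # [16550. -4900.] -> intercept and slope of the fitted line
```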
Logistic Regression
Linear regression predicts real-valued functions, but classification problems deal with discrete values, or classes. We calculate the probability that an observation is in a particular class, and pick the class with the highest probability.
Let observation x have feature vector f and class y. We cannot model $P(y = true \mid x)$ directly as $\sum_{i=0}^{N} w_i f_i$, since a linear function ranges over all reals rather than [0, 1]. Instead, we use the linear model to predict the odds of y being true:
$\frac{p(y = true \mid x)}{1 - p(y = true \mid x)}$
and take its log: $\ln \frac{p(y = true \mid x)}{1 - p(y = true \mid x)}$
Logit Function
$logit(p(x)) = \ln \frac{p(x)}{1 - p(x)}$
Solving $\ln \frac{p(y = true \mid x)}{1 - p(y = true \mid x)} = w \cdot f$ for the probability gives
$p(y = true \mid x) = \frac{e^{w \cdot f}}{1 + e^{w \cdot f}}$
$p(y = false \mid x) = \frac{1}{1 + e^{w \cdot f}}$
This is called the logistic function. Logistic regression is the model in which a linear function is used to estimate the logit of the probability.
Logistic Regression: Classification
Problem: given an observation x, decide if it belongs to class true or class false.
Choose true if $p(y = true \mid x) > p(y = false \mid x)$, i.e., if their ratio exceeds 1; substituting the logistic forms, this holds exactly when $e^{w \cdot f} > 1$, i.e., when
$\sum_{i=0}^{N} w_i f_i > 0$
$\sum_{i=0}^{N} w_i f_i = 0$ is the equation of a hyperplane, so logistic regression classifies by which side of the hyperplane the observation falls on.
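A minimal sketch of this decision rule; the weights and feature values are made-up numbers for illustration.

```python
import math

def p_true(w, f):
    score = sum(wi * fi for wi, fi in zip(w, f))  # w . f = logit(p)
    return 1.0 / (1.0 + math.exp(-score))         # logistic function

w = [0.5, -1.2, 2.0]   # illustrative learned weights (w[0] is the bias)
f = [1.0, 0.3, 0.8]    # feature vector with f[0] = 1 for the bias term
p = p_true(w, f)
print(p, "true" if p > 0.5 else "false")  # p > 0.5 exactly when w . f > 0
```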
Maximum Entropy Modeling
In NLP we need to classify problems with multiple classes, so we generalize logistic regression to a set of classes $C$:
$p(c \mid x) = \frac{1}{Z} \exp\left(\sum_{i=0}^{N} w_{ci} f_i\right)$, where $Z = \sum_{c' \in C} \exp\left(\sum_{i=0}^{N} w_{c'i} f_i\right)$
In MaxEnt, instead of indicator functions we use $f_i(c, x)$, meaning feature $i$ for a particular class $c$ for a given observation $x$:
$p(c \mid x) = \frac{\exp\left(\sum_{i=0}^{N} w_{ci} f_i(c, x)\right)}{\sum_{c' \in C} \exp\left(\sum_{i=0}^{N} w_{c'i} f_i(c', x)\right)}$
Maximum Entropy Modeling
Secretariat/NNP is/BEZ expected/VBN to/TO race/?? tomorrow/
$f_1(c, x) = 1$ if word$_i$ = "race" and c = NN; 0 otherwise
$f_2(c, x) = 1$ if $t_{i-1}$ = TO and c = VB; 0 otherwise
$f_3(c, x) = 1$ if suffix(word$_i$) = "ing" and c = VBG; 0 otherwise
$f_4(c, x) = 1$ if is_lower_case(word$_i$) and c = VB; 0 otherwise
$f_5(c, x) = 1$ if word$_i$ = "race" and c = VB; 0 otherwise
$f_6(c, x) = 1$ if $t_{i-1}$ = TO and c = NN; 0 otherwise
Maximum Entropy Modeling
With the features that fire for each class weighted as shown:
$P(NN \mid x) = \frac{e^{0.8}\, e^{-1.3}}{e^{0.8}\, e^{-1.3} + e^{0.8}\, e^{0.1}\, e^{0.01}} = 0.20$
$P(VB \mid x) = \frac{e^{0.8}\, e^{0.1}\, e^{0.01}}{e^{0.8}\, e^{-1.3} + e^{0.8}\, e^{0.1}\, e^{0.01}} = 0.80$
$\hat{c} = \operatorname{argmax}_{c \in C} P(c \mid x)$
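The arithmetic above is easy to verify; this sketch reproduces the computation, with the firing features taken from the weights in the two fractions (0.8 and -1.3 for NN; 0.8, 0.1, and 0.01 for VB).

```python
import math

# Scores are products of e^{weight} over the features that fire per class.
score = {"NN": math.exp(0.8) * math.exp(-1.3),
         "VB": math.exp(0.8) * math.exp(0.1) * math.exp(0.01)}
Z = sum(score.values())                      # normalization over classes
probs = {c: s / Z for c, s in score.items()}
print(probs)                                 # {'NN': ~0.20, 'VB': ~0.80}
print(max(probs, key=probs.get))             # VB, the most probable class
```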
Why call it Maximum Entropy?
Problem: assign a tag to the word zzfish
- without any prior information
- knowing that only four tags are possible
Entropy equation: $H(x) = -\sum_x P(x) \log_2 P(x)$
With only the constraint $P(NN) + P(JJ) + P(NNS) + P(VB) = 1$, the maximum entropy distribution is uniform. Adding the constraint $P(\text{word is zzfish and } (t = NN \text{ or } t = NNS)) = 8/10$, and then $P(VB) = 1/20$, pins the distribution down further while leaving the remaining mass spread as evenly as possible.
$p^* = \operatorname{argmax}_p H(p)$
The exponential model for multinomial logistic regression also finds the maximum entropy distribution subject to constraints from the feature functions.
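A quick sketch of the entropy computation for the candidate distributions over (NN, JJ, NNS, VB) described above:

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x), skipping zero-probability outcomes."""
    return -sum(px * math.log2(px) for px in p if px > 0)

# No constraints beyond summing to 1: the uniform distribution wins.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits
# With P(NN) + P(NNS) = 8/10, mass spreads evenly within each group.
print(entropy([0.4, 0.1, 0.4, 0.1]))      # ~1.72 bits, the max under that constraint
```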
Maximum Entropy Markov Models (MEMM)
HMM:
$\hat{T} = \operatorname{argmax}_T P(T \mid W) = \operatorname{argmax}_T P(W \mid T)\, P(T) = \operatorname{argmax}_T \prod_i P(word_i \mid tag_i) \prod_i P(tag_i \mid tag_{i-1})$
MEMM:
$\hat{T} = \operatorname{argmax}_T P(T \mid W) = \operatorname{argmax}_T \prod_i P(tag_i \mid word_i, tag_{i-1})$
Advantages of the MEMM:
1. We directly estimate the probability of each tag given the previous tag and the observed word.
2. We can condition on any useful feature of the input observation, which was not possible with the HMM.
MEMM
HMM: $P(Q, O) = \prod_i P(o_i \mid q_i) \prod_i P(q_i \mid q_{i-1})$
MEMM: $P(Q \mid O) = \prod_i P(q_i \mid q_{i-1}, o_i)$
Figure 6.20: The HMM (top) and MEMM (bottom) representation of the probability computation for the correct sequence of tags for the Secretariat sentence. Each arc would be associated with a probability; the HMM computes two separate probabilities for the observation likelihood and the prior, while the MEMM computes a single probability function at each state, conditioned on the previous state and current observation.
MEMM
Figure 6.21: An MEMM for part-of-speech tagging, augmenting the description in Fig. 6.20 by showing that an MEMM can condition on many features of the input, such as capitalization and morphology (ending in -s or -ed), as well as earlier words or tags. We have shown some potential additional features for the first three decisions, using different line styles for each class.
$P(q \mid q', o) = \frac{1}{Z(o, q')} \exp\left(\sum_i w_i f_i(o, q)\right)$
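A sketch of this local distribution in Python. The two feature functions and their weights are illustrative inventions modeled loosely on $f_2$ and $f_4$ from the race example, not values from the text; the point is that features may inspect both the previous tag and arbitrary properties of the observation.

```python
import math

def memm_prob(q, q_prev, obs, classes, weights, features):
    """P(q | q', o) = exp(sum_i w_i f_i(o, q', q)) / Z(o, q')."""
    def score(c):
        return math.exp(sum(w * f(obs, q_prev, c)
                            for w, f in zip(weights, features)))
    return score(q) / sum(score(c) for c in classes)

# Illustrative features: previous tag TO with candidate VB, and a
# lower-cased observation with candidate VB.
features = [lambda obs, qp, c: 1.0 if qp == "TO" and c == "VB" else 0.0,
            lambda obs, qp, c: 1.0 if obs.islower() and c == "VB" else 0.0]
weights = [0.8, 0.1]
print(memm_prob("VB", "TO", "race", ["VB", "NN"], weights, features))
```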