Lecture 6 Hidden Markov Models and Maximum Entropy Models


Lecture 6: Hidden Markov Models and Maximum Entropy Models. CS 6320

HMM Outline: Markov Chains; Hidden Markov Model; Likelihood: Forward Algorithm; Decoding: Viterbi Algorithm; Maximum Entropy Models.

Definitions. A weighted finite-state automaton adds probabilities to the arcs; the probabilities on the arcs leaving any state must sum to one. A Markov chain is a special case of a WFSA in which the input sequence uniquely determines which states the automaton will go through. Markov chains can't represent inherently ambiguous problems; they are useful for assigning probabilities to unambiguous sequences.

Markov Chain for Weather

Markov Chain for Words

Markov Chain Model. A set of states Q = q_1, q_2, ..., q_N; the state at time t is q_t. Transition probabilities: a set of probabilities A = a_01, a_02, ..., a_n1, ..., a_nn. Each a_ij represents the probability of transitioning from state i to state j; the set of these is the transition probability matrix A. Markov Assumption: the current state depends only on the previous state: P(q_i | q_1 ... q_{i-1}) = P(q_i | q_{i-1}).

Markov Chain Model. For every state i: Σ_{j=1}^{n} a_ij = 1.

Weather example. Markov chains are useful when we need to compute the probability of a sequence of events that are all observable.

Markov Chain for Weather. What is the probability of 4 consecutive warm days? The sequence is warm-warm-warm-warm, i.e. the state sequence is 3-3-3-3. P(3,3,3,3) = π_3 · a_33 · a_33 · a_33 = 0.2 × 0.6^3 = 0.0432. But what about states that are not observable?
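
The same computation, as a minimal Python sketch that scores a state sequence under a Markov chain. Only π_3 = 0.2 and a_33 = 0.6 come from the slide; the dict-of-dicts layout is an illustrative choice, not the lecture's code.

```python
# Probability of a state sequence under a Markov chain:
# P(s1, ..., sn) = pi[s1] * product of A[s_{t-1}][s_t].

def sequence_probability(states, pi, A):
    prob = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= A[prev][cur]
    return prob

pi = {3: 0.2}        # P(start in state 3 = warm), from the slide
A = {3: {3: 0.6}}    # P(warm -> warm), from the slide
print(sequence_probability([3, 3, 3, 3], pi, A))   # 0.2 * 0.6**3 = 0.0432
```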

HMM for Ice Cream. You are a climatologist in the year 2799 studying global warming. You can't find any records of the weather in Baltimore, MD for the summer of 2007, but you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer. Our job: figure out how hot it was.

Hidden Markov Model. For Markov chains, the output symbols are the same as the states: if we see hot weather, we're in state hot. But in part-of-speech tagging and other tasks, the output symbols are words while the hidden states are part-of-speech tags, so we need an extension. A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states. This means we don't know which state we are in.

Hidden Markov Models. States Q = q_1, q_2, ..., q_N. Observations O = o_1, o_2, ..., o_T; each observation is a symbol from a vocabulary V = {v_1, v_2, ..., v_|V|}. Transition probabilities: transition probability matrix A = {a_ij}, where a_ij = P(q_t = j | q_{t-1} = i), 1 ≤ i, j ≤ N. Observation likelihoods: output probability matrix B = {b_i(k)}, where b_i(k) = P(X_t = o_k | q_t = i). Special initial probability vector π, with π_i = P(q_1 = i), 1 ≤ i ≤ N.

Eisner Task. Given the ice cream observation sequence 1,2,3,2,2,2,3, ... produce the weather sequence H,C,H,H,H,C, ...

HMM for Ice Cream. There are two hidden states: hot and cold. Observations are the number of ice cream events: O = {1, 2, 3}.

Transition Probabilities

Observation Likelihoods
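
The transition-probability and observation-likelihood tables referenced above are figures in the original slides and are not reproduced in this transcription. Below is a minimal Python encoding of a two-state ice-cream HMM in that spirit; the numeric values are assumptions for illustration, not numbers read from the slides. The later sketches reuse these parameters.

```python
# Illustrative HMM parameters for the hot/cold ice-cream example.
# All numeric values below are assumed for demonstration only.

states = ["hot", "cold"]
observations = [1, 2, 3]          # number of ice creams eaten in a day

pi = {"hot": 0.8, "cold": 0.2}    # initial probabilities (assumed)

A = {                              # transition probabilities P(next | current)
    "hot":  {"hot": 0.6, "cold": 0.4},
    "cold": {"hot": 0.5, "cold": 0.5},
}

B = {                              # observation likelihoods P(ice creams | state)
    "hot":  {1: 0.2, 2: 0.4, 3: 0.4},
    "cold": {1: 0.5, 2: 0.4, 3: 0.1},
}
```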

HMM: Three Basic Problems

Likelihood Computation. Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ). Problem 1: compute the probability of eating 3 1 3 ice creams. Problem 2: compute the probability of eating 3 1 3 ice creams when the hidden sequence is hot hot cold.

Likelihood Computation. For a particular hidden state sequence Q and an observation sequence O, the likelihood of the observation sequence is P(O | Q) = ∏_{i=1}^{T} P(o_i | q_i). For example, P(3 1 3 | hot hot cold) = P(3 | hot) × P(1 | hot) × P(3 | cold).

Likelihood Computation. The joint probability of being in a weather state sequence Q and a particular sequence of observations O of ice cream events is: P(O, Q) = P(O | Q) × P(Q) = ∏_{i=1}^{n} P(o_i | q_i) × ∏_{i=1}^{n} P(q_i | q_{i-1}).

We can now compute the probability of a sequence of observations O using the joint probabilities: P(O) = Σ_Q P(O, Q) = Σ_Q P(O | Q) P(Q). For example: P(3 1 3) = P(3 1 3, cold cold cold) + P(3 1 3, cold cold hot) + ... + P(3 1 3, hot hot hot).
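
To make the summation concrete, here is a brute-force Python sketch that enumerates every hidden state sequence and sums the joint probabilities, exactly as written above. It reuses the assumed illustrative parameters from the earlier sketch (repeated so the snippet is self-contained); the enumeration is exponential in the sequence length, which is what motivates the Forward algorithm next.

```python
from itertools import product

# Brute-force P(O) = sum over all hidden sequences Q of P(O | Q) P(Q).

def brute_force_likelihood(obs, states, pi, A, B):
    total = 0.0
    for seq in product(states, repeat=len(obs)):    # every hidden sequence Q
        p = pi[seq[0]] * B[seq[0]][obs[0]]          # P(q1) P(o1 | q1)
        for t in range(1, len(obs)):
            p *= A[seq[t - 1]][seq[t]] * B[seq[t]][obs[t]]
        total += p                                  # add P(O, Q)
    return total

# Illustrative (assumed) parameters, as before.
pi = {"hot": 0.8, "cold": 0.2}
A = {"hot": {"hot": 0.6, "cold": 0.4}, "cold": {"hot": 0.5, "cold": 0.5}}
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4}, "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

print(brute_force_likelihood([3, 1, 3], ["hot", "cold"], pi, A, B))
```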

Forward Algorithm. For N hidden states and a sequence of T observations, the Forward Algorithm uses O(N²T) operations instead of O(N^T). α_t(j) is the probability of being in state j after seeing the first t observations: α_t(j) = P(o_1, o_2, ..., o_t, q_t = j | λ), computed recursively as α_t(j) = Σ_{i=1}^{N} α_{t-1}(i) · a_ij · b_j(o_t).

Forward trellis for the ice cream example

Forward Algorithm.
1. Initialization: α_1(j) = a_0j · b_j(o_1), 1 ≤ j ≤ N
2. Recursion: α_t(j) = Σ_{i=1}^{N} α_{t-1}(i) · a_ij · b_j(o_t); 1 ≤ j ≤ N, 1 < t ≤ T
3. Termination: P(O | λ) = α_T(q_F) = Σ_{i=1}^{N} α_T(i) · a_iF

Forward Algorithm

Forward Algorithm
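
A minimal Python sketch of the Forward algorithm as defined above (initialization, recursion, termination), using the same assumed ice-cream parameters. The special end state is omitted here, i.e. a_iF is treated as 1, so P(O) is just the sum over the final column of the trellis.

```python
# Forward algorithm: alpha[t][j] = P(o_1 .. o_t, q_t = j | lambda).

def forward(obs, states, pi, A, B):
    alpha = [{s: pi[s] * B[s][obs[0]] for s in states}]          # initialization
    for t in range(1, len(obs)):                                 # recursion
        alpha.append({
            j: sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][obs[t]]
            for j in states
        })
    return sum(alpha[-1][s] for s in states)                     # termination

# Illustrative (assumed) parameters, as before.
pi = {"hot": 0.8, "cold": 0.2}
A = {"hot": {"hot": 0.6, "cold": 0.4}, "cold": {"hot": 0.5, "cold": 0.5}}
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4}, "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

print(forward([3, 1, 3], ["hot", "cold"], pi, A, B))  # matches the brute-force sum
```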

Decoding. POS tagging is such a problem, and so is the weather problem. Recall that in the case of POS tagging we need to compute t̂_1^n = argmax_{t_1^n} P(t_1^n | w_1^n). We could just enumerate all paths given the input and use the model to assign probabilities to each. Not a good idea. Luckily, dynamic programming helps us here.

Viterbi Algorithm. The Viterbi algorithm computes a trellis using dynamic programming. Observations are processed from left to right, filling out a trellis of states. v_t(j) is the probability that the HMM is in state j after seeing the first t observations and passing through the most probable state sequence: v_t(j) = max_{q_0, q_1, ..., q_{t-1}} P(q_0, q_1, ..., q_{t-1}, o_1, o_2, ..., o_t, q_t = j | λ), computed recursively as v_t(j) = max_{i=1..N} v_{t-1}(i) · a_ij · b_j(o_t).

Viterbi trellis for the ice cream example

Viterbi Algorithm.
1. Initialization: v_1(j) = a_0j · b_j(o_1), 1 ≤ j ≤ N; bt_1(j) = 0
2. Recursion: v_t(j) = max_{i=1..N} v_{t-1}(i) · a_ij · b_j(o_t); 1 ≤ j ≤ N, 1 < t ≤ T
   bt_t(j) = argmax_{i=1..N} v_{t-1}(i) · a_ij · b_j(o_t); 1 ≤ j ≤ N, 1 < t ≤ T
3. Termination:
   The best score: P* = v_T(q_F) = max_{i=1..N} v_T(i) · a_{i,F}
   The start of backtrace: q_T* = bt_T(q_F) = argmax_{i=1..N} v_T(i) · a_{i,F}

Viterbi Traceback

The Viterbi Algorithm

Viterbi Example

Viterbi Summary. Create an array with columns corresponding to inputs and rows corresponding to possible states. Sweep through the array in one pass, filling the columns left to right using our transition probabilities and observation probabilities. The dynamic programming key is that we need only store the MAX probability path to each cell, not all paths.
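
A minimal Python sketch of the Viterbi algorithm just summarized: one left-to-right pass over the trellis, keeping only the max-probability path into each cell plus a backpointer, followed by a traceback. Same assumed ice-cream parameters as before.

```python
# Viterbi: v[t][j] = best-path probability ending in state j at time t,
# bp[t][j] = the predecessor state on that best path.

def viterbi(obs, states, pi, A, B):
    v = [{s: pi[s] * B[s][obs[0]] for s in states}]              # initialization
    bp = [{s: None for s in states}]
    for t in range(1, len(obs)):                                 # recursion
        v.append({}); bp.append({})
        for j in states:
            best_i = max(states, key=lambda i: v[t - 1][i] * A[i][j])
            v[t][j] = v[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
            bp[t][j] = best_i
    last = max(states, key=lambda s: v[-1][s])                   # termination
    path = [last]
    for t in range(len(obs) - 1, 0, -1):                         # traceback
        path.append(bp[t][path[-1]])
    return list(reversed(path)), v[-1][last]

# Illustrative (assumed) parameters, as before.
pi = {"hot": 0.8, "cold": 0.2}
A = {"hot": {"hot": 0.6, "cold": 0.4}, "cold": {"hot": 0.5, "cold": 0.5}}
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4}, "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

print(viterbi([3, 1, 3], ["hot", "cold"], pi, A, B))
```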

Evaluation. So once you have your POS tagger running, how do you evaluate it? Overall error rate with respect to a gold-standard test set. Error rates on particular tags. Error rates on particular words. Tag confusions...

Error Analysis. Look at a confusion matrix and see what errors are causing problems: Noun (NN) vs ProperNoun (NNP) vs Adjective (JJ); Past tense (VBD) vs Participle (VBN) vs Adjective (JJ).

Evaluation. The result is compared with a manually coded Gold Standard. Typically accuracy reaches 96-97%. This may be compared with the result for a baseline tagger, one that uses no context. Important: 100% is impossible even for human annotators.

Maximum Entropy Models

MEM Outline: Maximum Entropy Models background; Maximum Entropy Model applied to NLP classification; Maximum Entropy Markov Models.

Maximum Entropy. Probabilistic machine learning for sequence classification (POS tagging, speech recognition) and non-sequential classification (text classification, sentiment analysis). Maximum entropy extracts features from inputs, then combines them to classify inputs. It computes the probability of a class c given an observation x described by a vector of features.

Linear Regression. Problem: price a house based on vague adjectives used in the ads, e.g. fantastic, cute, charming. Figure 6.7: some made-up data on the number of vague adjectives (fantastic, cute, charming) in a real estate ad and the amount the house sold for over the asking price. Figure 6.8: a plot of the made-up points in Fig. 6.7 and the regression line that best fits them, with the equation y = -4900x + 16550.

Multiple Linear Regression. In reality, the price of a house depends on several factors:
price = w_0 + w_1 · Num_Adjectives + w_2 · Mortgage_Rate + w_3 · Num_Unsold_Houses
Linear regression in general: y = w_0 + Σ_{i=1}^{N} w_i · f_i
Dot product: a · b = Σ_{i=1}^{N} a_i b_i = a_1 b_1 + a_2 b_2 + ... + a_N b_N, so that y = w · f.

Learning in Linear Regression. Problem: learn the weights w. For example j, y_pred^(j) = Σ_{i=0}^{N} w_i · f_i^(j). Minimize the cost function produced by the weights over all M examples in the training set: cost(W) = Σ_{j=0}^{M} (y_pred^(j) − y_obs^(j))². In matrix form Y = X · W, with the closed-form solution W = (Xᵀ X)⁻¹ Xᵀ y.
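
A small numpy sketch of the closed-form solution W = (XᵀX)⁻¹Xᵀy stated above. The feature matrix and prices below are made-up illustrative numbers; they are not the Figure 6.7 data, which is not reproduced in this transcription.

```python
import numpy as np

# Closed-form least squares: W = (X^T X)^{-1} X^T y.
# Rows are training examples; columns are [bias, Num_Adjectives,
# Mortgage_Rate, Num_Unsold_Houses]. All values are made up.

X = np.array([
    [1.0, 4, 6.5, 120],
    [1.0, 3, 6.5, 110],
    [1.0, 2, 7.0, 100],
    [1.0, 2, 7.1, 105],
    [1.0, 1, 7.2,  90],
    [1.0, 0, 7.4,  80],
])
y = np.array([2000, 5000, 9000, 8500, 12000, 15000])  # price over asking (made up)

W = np.linalg.inv(X.T @ X) @ X.T @ y   # one weight per feature, w_0 is the bias
print(W)
print(X @ W)                           # predicted prices for the training rows
```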

Logistic Regression. Linear regression predicts real-valued functions; classification problems deal with discrete values or classes. We calculate the probability that an observation is in a particular class and pick the class with the highest probability. Let observation x have feature vector f and class y. A first attempt, P(y = true | x) = Σ_{i=0}^{N} w_i · f_i, is unbounded and so is not a proper probability. Instead, use the linear model to predict the odds of y being true, p(y=true | x) / (1 − p(y=true | x)), or rather its logarithm: ln[ p(y=true | x) / (1 − p(y=true | x)) ] = Σ_{i=0}^{N} w_i · f_i.

Logit Function. logit(p(x)) = ln[ p(x) / (1 − p(x)) ]. Setting ln[ p(y=true | x) / (1 − p(y=true | x)) ] = w · f and solving for the probability gives:
p(y=true | x) = e^{w·f} / (1 + e^{w·f})
p(y=false | x) = 1 / (1 + e^{w·f})
This is called the logistic function. Logistic Regression is the model in which a linear function is used to estimate the logit of a probability.

Logistic Regression -- Classification. Problem: given an observation x, decide whether it belongs to class true or class false. Choose true if p(y=true | x) > p(y=false | x), i.e. if p(y=true | x) / p(y=false | x) > 1. Substituting the logistic model, this holds exactly when e^{Σ_{i=0}^{N} w_i f_i} > 1, i.e. when Σ_{i=0}^{N} w_i f_i > 0. The equation Σ_{i=0}^{N} w_i f_i = 0 is the equation of a hyperplane: the decision boundary.
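
A minimal sketch of the decision rule just derived: compute w · f, apply the logistic function to get p(y=true | x), and classify by checking whether p > 0.5 (equivalently, whether w · f > 0). The weights and features here are hypothetical.

```python
import math

# Logistic regression decision rule. Weights and features are hypothetical.

def p_true(w, f):
    """p(y=true | x) = e^{w.f} / (1 + e^{w.f})."""
    z = sum(wi * fi for wi, fi in zip(w, f))
    return math.exp(z) / (1.0 + math.exp(z))

w = [0.5, -1.2, 2.0]   # hypothetical learned weights (w_0 is the bias)
f = [1.0, 0.3, 0.8]    # hypothetical feature vector (f_0 = 1 for the bias)

p = p_true(w, f)
label = "true" if p > 0.5 else "false"   # same test as w.f > 0 (the hyperplane)
print(p, label)
```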

Maximum Entropy Modeling. In NLP we need to classify problems with multiple classes:
p(c | x) = (1/Z) · exp( Σ_{i=0}^{N} w_ci · f_i )
p(c | x) = exp( Σ_{i=0}^{N} w_ci · f_i ) / Σ_{c'∈C} exp( Σ_{i=0}^{N} w_c'i · f_i )
In MaxEnt, instead of indicator functions we use f_i(c, x), meaning feature i for a particular class c for a given observation x:
p(c | x) = exp( Σ_{i=0}^{N} w_ci · f_i(c, x) ) / Σ_{c'∈C} exp( Σ_{i=0}^{N} w_c'i · f_i(c', x) )

Maximum Entropy Modeling. Secretariat/NNP is/BEZ expected/VBN to/TO race/?? tomorrow/
f_1(c, x) = 1 if word_i = "race" & c = NN; 0 otherwise
f_2(c, x) = 1 if t_{i-1} = TO & c = VB; 0 otherwise
f_3(c, x) = 1 if suffix(word_i) = "ing" & c = VBG; 0 otherwise
f_4(c, x) = 1 if is_lower_case(word_i) & c = VB; 0 otherwise
f_5(c, x) = 1 if word_i = "race" & c = VB; 0 otherwise
f_6(c, x) = 1 if t_{i-1} = TO & c = NN; 0 otherwise

Maximum Entropy Modeling.
P(NN | x) = (e^{0.8} e^{-1.3}) / (e^{0.8} e^{-1.3} + e^{0.8} e^{0.01} e^{0.1}) = 0.20
P(VB | x) = (e^{0.8} e^{0.01} e^{0.1}) / (e^{0.8} e^{-1.3} + e^{0.8} e^{0.01} e^{0.1}) = 0.80
ĉ = argmax_{c∈C} P(c | x)
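
A Python sketch of the MaxEnt computation above: each active feature contributes its weight to the score of its class, and the exponentiated scores are normalized over all classes. Only the features active in this two-class NN/VB comparison are included, and the weights (0.8, -1.3, 0.01, 0.1) are the ones recovered from the worked example above; treat them as illustrative, not verified training output.

```python
import math

# MaxEnt: p(c|x) = exp(sum_i w_ci f_i(c,x)) / sum_c' exp(sum_i w_c'i f_i(c',x)).

x = {"word": "race", "prev_tag": "TO", "lower": True}

def features(c, x):
    """(weight, active?) pairs for the binary features f_i(c, x)."""
    return [
        (0.8,  x["word"] == "race" and c == "NN"),    # f1
        (0.8,  x["prev_tag"] == "TO" and c == "VB"),  # f2
        (0.01, x["lower"] and c == "VB"),             # f4
        (0.1,  x["word"] == "race" and c == "VB"),    # f5
        (-1.3, x["prev_tag"] == "TO" and c == "NN"),  # f6
    ]

def p(c, x, classes=("NN", "VB")):
    score = lambda cls: math.exp(sum(w for w, active in features(cls, x) if active))
    return score(c) / sum(score(cls) for cls in classes)

print(p("NN", x), p("VB", x))   # ~0.20 and ~0.80, as in the slide
```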

Why call it Maximum Entropy? Problem: assign a tag to the word zzfish, (a) without any prior information, (b) knowing that only four tags are possible.

Entropy equation: H(X) = -Σ_x P(x) log_2 P(x). With no information beyond P(NN) + P(JJ) + P(NNS) + P(VB) = 1, maximizing H gives the uniform distribution over the possible tags. Additional constraints, such as P(t = NN or t = NNS) = 8/10 for the word zzfish, or P(VB) = 1/20, narrow the space of allowed distributions, and we pick p* = argmax_p H(p) among them. The exponential model for multinomial logistic regression also finds the maximum entropy distribution subject to constraints from the feature functions.
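
To make this concrete, a tiny sketch computing H for the unconstrained and constrained cases. The constrained solution (0.4, 0.4, 0.1, 0.1), which spreads the remaining mass evenly within each group, is the standard maximum-entropy result for the 8/10 constraint and is included here as an assumption rather than read from the slide.

```python
import math

# H(p) = -sum p(x) log2 p(x).

def H(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Uniform over {NN, JJ, NNS, VB}: the unconstrained maximum, 2.0 bits.
print(H([0.25, 0.25, 0.25, 0.25]))

# Max-entropy solution under P(NN) + P(NNS) = 8/10 (assumed illustrative values).
print(H([0.4, 0.4, 0.1, 0.1]))
```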

Maximum Entropy Markov Models (MEMM).
HMM: T̂ = argmax_T P(T | W) = argmax_T P(W | T) P(T) = argmax_T ∏_i P(word_i | tag_i) P(tag_i | tag_{i-1})
MEMM: T̂ = argmax_T P(T | W) = argmax_T ∏_i P(tag_i | word_i, tag_{i-1})
Advantages of MEMM: 1. We estimate directly the probability of each tag given the previous tag and the observed word. 2. We can condition on any useful feature of the input observation, which was not possible with an HMM.

MEMM.
HMM: P(Q | O) = ∏_{i=1}^{n} P(o_i | q_i) ∏_{i=1}^{n} P(q_i | q_{i-1})
MEMM: P(Q | O) = ∏_{i=1}^{n} P(q_i | q_{i-1}, o_i)
Figure 6.20: The HMM (top) and MEMM (bottom) representation of the probability computation for the correct sequence of tags for the Secretariat sentence. Each arc would be associated with a probability; the HMM computes two separate probabilities for the observation likelihood and the prior, while the MEMM computes a single probability function at each state, conditioned on the previous state and current observation.

MEMM. Figure 6.21: An MEMM for part-of-speech tagging, augmenting the description in Fig. 6.20 by showing that an MEMM can condition on many features of the input, such as capitalization or morphology (ending in -s or -ed), as well as earlier words or tags. We have shown some potential additional features for the first three decisions, using different line styles for each class.
P(q_i | q_{i-1}, o_i) = (1 / Z(o, q')) · exp( Σ_i w_i f_i(o_i, q_i) )
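
A minimal sketch of the MEMM local distribution P(q_i | q_{i-1}, o_i) defined above: a log-linear score over features of the observation and the previous tag, normalized over the tag set at each step. The feature templates and weights below are hypothetical illustrations, not values from the lecture; decoding over a whole sentence would run Viterbi over these local distributions rather than over separate transition and emission tables.

```python
import math

# MEMM local model: P(q | q_prev, o) = (1/Z) exp(sum_i w_i f_i(o, q)),
# normalized over tags at each step. Weights below are hypothetical.

WEIGHTS = {
    ("word=race", "VB"): 0.5,
    ("word=race", "NN"): 0.3,
    ("prev=TO",   "VB"): 1.0,
    ("prev=TO",   "NN"): -0.8,
}
TAGS = ["NN", "VB"]

def local_p(q, q_prev, word):
    """P(q | q_prev, word) under the log-linear local model."""
    def score(tag):
        s = WEIGHTS.get(("word=" + word, tag), 0.0)    # lexical feature
        s += WEIGHTS.get(("prev=" + q_prev, tag), 0.0) # previous-tag feature
        return math.exp(s)
    return score(q) / sum(score(t) for t in TAGS)      # per-state normalizer Z

# Conditioning on the previous tag TO and the word "race":
print(local_p("VB", "TO", "race"))   # ~0.88 with these hypothetical weights
print(local_p("NN", "TO", "race"))   # ~0.12
```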