Lecture 6 Hidden Markov Models and Maximum Entropy Models

Gladys Bailey
5 years ago

1 Lecture 6 Hidden Markov Models and Maximum Entropy Models CS

2 HMM Outline. Markov Chains; Hidden Markov Model; Likelihood: Forward Algorithm; Decoding: Viterbi Algorithm; Maximum Entropy Models.

3 Definitions. A weighted finite-state automaton (WFSA) adds probabilities to the arcs. The probabilities on the arcs leaving any state must sum to one. A Markov chain is a special case of a WFSA in which the input sequence uniquely determines which states the automaton will go through. Markov chains cannot represent inherently ambiguous problems; they are useful for assigning probabilities to unambiguous sequences.

4 Markov Chain for Weather

5 Markov Chain for Words

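6 Markov Chain Model. A set of states $Q = q_1, q_2, \dots, q_N$; the state at time $t$ is $q_t$. Transition probabilities: a set of probabilities $A = a_{01}, a_{02}, \dots, a_{n1}, \dots, a_{nn}$. Each $a_{ij}$ represents the probability of transitioning from state $i$ to state $j$. The set of these is the transition probability matrix $A$. Markov assumption: the current state depends only on the previous state:
$$P(q_i \mid q_1 \dots q_{i-1}) = P(q_i \mid q_{i-1})$$

7 Markov Chain Model. The transition probabilities leaving each state $i$ sum to one:
$$\sum_{j=1}^{n} a_{ij} = 1$$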

8 Weather example. Markov chains are useful when we need to compute the probabilities for a sequence of events that are observable.

9 Markov Chain for Weather. What is the probability of 4 consecutive warm days? The sequence is warm-warm-warm-warm, i.e., the state sequence is 3-3-3-3:
$$P(3,3,3,3) = \pi_3 \, a_{33} \, a_{33} \, a_{33} = 0.2 \times (a_{33})^3$$
But what about cases where the states are not observable?
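The sequence probability above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the transition matrix `A` and most of the initial vector `pi` are hypothetical values chosen for the example (only the initial warm probability 0.2 appears on the slide), and `sequence_probability` is a name introduced here.

```python
def sequence_probability(states, pi, A):
    """P(q_1..q_T) = pi[q_1] * product of A[q_{t-1}][q_t] for a first-order Markov chain."""
    prob = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= A[prev][cur]
    return prob

# Hypothetical parameters: only pi["warm"] = 0.2 comes from the slide.
pi = {"hot": 0.5, "cold": 0.3, "warm": 0.2}
A = {
    "hot":  {"hot": 0.5, "cold": 0.2, "warm": 0.3},
    "cold": {"hot": 0.2, "cold": 0.5, "warm": 0.3},
    "warm": {"hot": 0.2, "cold": 0.2, "warm": 0.6},  # assumed a_33 = 0.6
}

# Four consecutive warm days: pi_3 * a_33 * a_33 * a_33
p_warm4 = sequence_probability(["warm"] * 4, pi, A)
```

Each row of `A` sums to one, matching the constraint on the transition probabilities.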

10 HMM for Ice Cream. You are a climatologist in the year 2799 studying global warming. You can't find any records of the weather in Baltimore, MD for the summer of 2007. But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer. Our job: figure out how hot it was.

11 Hidden Markov Model. For Markov chains, the output symbols are the same as the states: see hot weather, and we're in state hot. But in part-of-speech tagging (and other tasks) the output symbols are words, while the hidden states are part-of-speech tags. So we need an extension! A Hidden Markov Model is an extension of a Markov chain in which the input symbols are not the same as the states. This means we don't know which state we are in.

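12 Hidden Markov Models. States $Q = q_1, q_2, \dots, q_N$. Observations $O = o_1, o_2, \dots, o_T$; each observation is a symbol from a vocabulary $V = \{v_1, v_2, \dots, v_V\}$. Transition probabilities: transition probability matrix $A = \{a_{ij}\}$ with
$$a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le N$$
Observation likelihoods: output probability matrix $B = \{b_i(k)\}$ with
$$b_i(k) = P(X_t = o_k \mid q_t = i)$$
Special initial probability vector $\pi$:
$$\pi_i = P(q_1 = i), \quad 1 \le i \le N$$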

13 Eisner Task. Given an ice cream observation sequence: 1, 2, 3, 2, 2, 2, 3, ... Produce the weather sequence: H, C, H, H, H, C, ...

14 HMM for Ice Cream. There are two hidden states (hot, cold). Observations are the number of ice cream events: $O = \{1, 2, 3\}$.

15 Transition Probabilities

16 Observation Likelihoods

17 HMM: Three Basic Problems

18 Likelihood Computation. Given an HMM $\lambda = (A, B)$ and an observation sequence $O$, determine the likelihood $P(O \mid \lambda)$. Problem 1: compute the probability of eating 3 1 3 ice creams. Problem 2: compute the probability of eating 3 1 3 ice creams when the hidden sequence is hot hot cold.

19 Likelihood Computation. For a particular hidden state sequence $Q$ and an observation sequence $O$, the likelihood of the observation sequence is
$$P(O \mid Q) = \prod_{i=1}^{T} P(o_i \mid q_i)$$
$$P(3\ 1\ 3 \mid \text{hot hot cold}) = P(3 \mid \text{hot}) \times P(1 \mid \text{hot}) \times P(3 \mid \text{cold})$$

20 Likelihood Computation. The joint probability of being in a weather state sequence $Q$ and observing a particular sequence $O$ of ice cream events is
$$P(O, Q) = P(O \mid Q) \times P(Q) = \prod_{i=1}^{n} P(o_i \mid q_i) \times \prod_{i=1}^{n} P(q_i \mid q_{i-1})$$

21 We can now compute the probability of a sequence of observations $O$ by summing the joint probabilities over all hidden state sequences:
$$P(O) = \sum_{Q} P(O, Q) = \sum_{Q} P(O \mid Q)\, P(Q)$$
$$P(3\ 1\ 3) = P(3\ 1\ 3, \text{cold cold cold}) + P(3\ 1\ 3, \text{cold cold hot}) + \dots + P(3\ 1\ 3, \text{hot hot hot})$$
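As a rough sketch of this sum over hidden sequences, the Python below enumerates every weather sequence for the observations 3 1 3. The transition and emission figures are not reproduced in this transcript, so `PI`, `A`, and `B` are assumed values for illustration only, and the function names are introduced here.

```python
from itertools import product

STATES = ("hot", "cold")
# Assumed parameters for illustration; the slide's figures are not available here.
PI = {"hot": 0.8, "cold": 0.2}
A = {"hot": {"hot": 0.7, "cold": 0.3}, "cold": {"hot": 0.4, "cold": 0.6}}
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4}, "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

def joint(obs, path):
    """P(O, Q) = P(O | Q) * P(Q) for one hidden path Q."""
    p = PI[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    return p

def likelihood_brute_force(obs):
    """P(O) as the sum of P(O, Q) over all N**T hidden paths."""
    return sum(joint(obs, path) for path in product(STATES, repeat=len(obs)))

p_hhc = joint([3, 1, 3], ("hot", "hot", "cold"))
p_313 = likelihood_brute_force([3, 1, 3])
```

The enumeration visits $N^T$ paths, which is why it is only practical for very short sequences.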

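22 Forward Algorithm. For $N$ hidden states and a sequence of $T$ observations, the forward algorithm uses $O(N^2 T)$ operations instead of $N^T$. $\alpha_t(j)$ is the probability of being in state $j$ after seeing the first $t$ observations:
$$\alpha_t(j) = P(o_1, o_2 \dots o_t, q_t = j \mid \lambda)$$
$$\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$$

23 Forward trellis for ice cream example

24 Forward Algorithm.
1. Initialization:
$$\alpha_1(j) = a_{0j}\, b_j(o_1), \quad 1 \le j \le N$$
2. Recursion:
$$\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t); \quad 1 \le j \le N,\ 1 < t \le T$$
3. Termination:
$$P(O \mid \lambda) = \alpha_T(q_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$$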
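The three steps of the forward algorithm can be sketched as follows. This is a simplified illustration under stated assumptions: the parameter values `PI`, `A`, and `B` are assumed (the transcript does not include the figures), and the model has no explicit final state, so termination simply sums $\alpha_T(i)$ rather than multiplying by $a_{iF}$.

```python
STATES = ("hot", "cold")
# Assumed parameters for illustration only.
PI = {"hot": 0.8, "cold": 0.2}   # plays the role of a_0j
A = {"hot": {"hot": 0.7, "cold": 0.3}, "cold": {"hot": 0.4, "cold": 0.6}}
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4}, "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(obs):
    """Return P(O | lambda) using O(N^2 T) operations."""
    # 1. Initialization: alpha_1(j) = a_0j * b_j(o_1)
    alpha = {j: PI[j] * B[j][obs[0]] for j in STATES}
    # 2. Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for o in obs[1:]:
        alpha = {
            j: sum(alpha[i] * A[i][j] for i in STATES) * B[j][o]
            for j in STATES
        }
    # 3. Termination (no final state assumed): sum over alpha_T(i)
    return sum(alpha.values())

p_313 = forward([3, 1, 3])
```

Only the previous trellis column is kept, since each $\alpha_t(j)$ depends on column $t-1$ alone.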

25 Forward Algorithm

26 Forward Algorithm

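27 Decoding. Decoding means finding the most probable hidden state sequence for a given observation sequence. POS tagging is such a problem, and so is the weather problem. Recall that in the case of POS tagging we need to compute
$$\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n)$$
We could just enumerate all paths given the input and use the model to assign probabilities to each. Not a good idea. Luckily, dynamic programming helps us here.

28 Viterbi Algorithm. The Viterbi algorithm computes a trellis using dynamic programming. The observations are processed from left to right, filling out a trellis of states. $v_t(j)$ is the probability that the HMM is in state $j$ after seeing the first $t$ observations and passing through the most probable state sequence:
$$v_t(j) = \max_{q_0, q_1, \dots, q_{t-1}} P(q_0, q_1 \dots q_{t-1}, o_1, o_2 \dots o_t, q_t = j \mid \lambda)$$
$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$$

29 Viterbi trellis for ice cream example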

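30 Viterbi Algorithm.
1. Initialization:
$$v_1(j) = a_{0j}\, b_j(o_1), \quad bt_1(j) = 0, \quad 1 \le j \le N$$
2. Recursion:
$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t); \quad 1 \le j \le N,\ 1 < t \le T$$
$$bt_t(j) = \arg\max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t); \quad 1 \le j \le N,\ 1 < t \le T$$
3. Termination:
The best score:
$$P^* = v_T(q_F) = \max_{i=1}^{N} v_T(i)\, a_{iF}$$
The start of the backtrace:
$$q_T^* = bt_T(q_F) = \arg\max_{i=1}^{N} v_T(i)\, a_{iF}$$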
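A compact sketch of Viterbi decoding with backpointers is given below. As with any example built on this transcript, the parameter values `PI`, `A`, and `B` are assumptions, and the sketch omits the explicit final state $q_F$, so termination takes the maximum over $v_T(i)$ directly.

```python
STATES = ("hot", "cold")
# Assumed parameters for illustration only.
PI = {"hot": 0.8, "cold": 0.2}
A = {"hot": {"hot": 0.7, "cold": 0.3}, "cold": {"hot": 0.4, "cold": 0.6}}
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4}, "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    """Return (best hidden path, probability of that path and the observations)."""
    # 1. Initialization: v_1(j) = a_0j * b_j(o_1)
    v = {j: PI[j] * B[j][obs[0]] for j in STATES}
    backpointers = []
    # 2. Recursion: keep only the best predecessor for each state
    for o in obs[1:]:
        new_v, bp = {}, {}
        for j in STATES:
            best_i = max(STATES, key=lambda i: v[i] * A[i][j])
            bp[j] = best_i
            new_v[j] = v[best_i] * A[best_i][j] * B[j][o]
        backpointers.append(bp)
        v = new_v
    # 3. Termination (no final state assumed) and backtrace
    last = max(STATES, key=lambda j: v[j])
    path = [last]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    path.reverse()
    return path, v[last]

best_path, best_prob = viterbi([3, 1, 3])
```

The only change from the forward recursion is that the sum over predecessors becomes a max, plus the bookkeeping needed to recover the path.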

31 Viterbi Traceback

32 The Viterbi Algorithm

33 Viterbi Example

34 Viterbi Summary. Create an array with columns corresponding to inputs and rows corresponding to possible states. Sweep through the array in one pass, filling the columns left to right using our transition probabilities and observation probabilities. The dynamic programming key is that we need only store the MAX probability path to each cell, not all paths.

35 Evaluation. So once you have your POS tagger running, how do you evaluate it? Overall error rate with respect to a gold-standard test set. Error rates on particular tags. Error rates on particular words. Tag confusions...

36 Error Analysis. Look at a confusion matrix. See what errors are causing problems: Noun (NN) vs ProperNoun (NNP) vs Adj (JJ); Past tense (VBD) vs Participle (VBN) vs Adjective (JJ).
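One way to tabulate the accuracy and tag confusions described above is sketched here. The gold and predicted tag lists are made-up toy data, and `evaluate` is a helper name introduced for this example.

```python
from collections import Counter

def evaluate(gold, predicted):
    """Return (accuracy, Counter of (gold_tag, predicted_tag) confusions)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    confusions = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return correct / len(gold), confusions

# Toy data, invented for illustration.
gold = ["NNP", "VBZ", "VBN", "TO", "VB", "NN"]
pred = ["NNP", "VBZ", "VBD", "TO", "NN", "NN"]
accuracy, confusions = evaluate(gold, pred)

# List confusions from most to least frequent.
for (g, p), count in confusions.most_common():
    print(f"gold={g} predicted={p} count={count}")
```

Sorting the off-diagonal cells by count points directly at the tag pairs that cost the most accuracy.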

37 Evaluation. The result is compared with a manually coded "Gold Standard". Typically accuracy reaches 96-97%. This may be compared with the result for a baseline tagger (one that uses no context). Important: 100% is impossible even for human annotators.

38 Maximum Entropy Models

39 MEM Outline. Maximum Entropy Models: Background; Maximum Entropy Model applied to NLP classification; Maximum Entropy Markov Models.

40 Maximum Entropy. Probabilistic machine learning for sequence classification (POS tagging, speech recognition) and non-sequential classification (text classification, sentiment analysis). Maximum entropy extracts features from inputs, then combines them to classify inputs. It computes the probability of a class $c$ given an observation $x$ described by a vector of features.

41 Linear Regression. Problem: price a house based on vague adjectives used in the ads. Ex: fantastic, cute, charming.
Figure 6.17 Some made-up data on the number of vague adjectives (fantastic, cute, charming) in a real estate ad and the amount the house sold for over the asking price.
[Plot: price (y-axis) against Num_Adjectives (x-axis)]
Figure 6.18 A plot of the made-up points in Fig. 6.17 and the regression line that best fits them, with the equation y = -4900x + ...

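42 Multiple Linear Regression. In reality, the price of a house depends on several factors:
$$\text{price} = w_0 + w_1 \cdot \text{Num\_Adjectives} + w_2 \cdot \text{Mortgage\_Rate} + w_3 \cdot \text{Num\_Unsold\_Houses}$$
$$\text{price} = w_0 + \sum_{i=1}^{N} w_i \times f_i$$
Linear regression:
$$y = \sum_{i=0}^{N} w_i \times f_i$$
Dot product:
$$a \cdot b = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \dots + a_n b_n$$
$$y = w \cdot f$$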

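43 Learning in Linear Regression. Problem: learn the weights $w_i$.
$$y_{pred}^{(j)} = \sum_{i=0}^{N} w_i \times f_i^{(j)}$$
Minimize the cost function produced by the weights $W$ over all $M$ examples in the training set:
$$cost(W) = \sum_{j=0}^{M} \left( y_{pred}^{(j)} - y_{obs}^{(j)} \right)^2$$
$$Y = XW$$
$$W = (X^T X)^{-1} X^T \vec{y}$$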
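For a single feature plus an intercept, the least-squares solution reduces to the familiar slope and intercept formulas, sketched below in plain Python. The data points are hypothetical (the slide's table is not reproduced in this transcript); they were picked so that the fitted slope matches the -4900 quoted in the Fig. 6.18 caption. `fit_line` is a name introduced here.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (one feature plus bias)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def cost(slope, intercept, xs, ys):
    """Sum of squared differences between predicted and observed values."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: number of vague adjectives vs. amount over asking price.
num_adjectives = [4, 3, 2, 2, 1, 0]
over_asking = [0, 1000, 1500, 6000, 14000, 18000]

slope, intercept = fit_line(num_adjectives, over_asking)
```

Any other line gives a larger value of `cost` on the same data, which is the sense in which the fit is "best".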

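44 Logistic Regression. Linear regression predicts real-valued functions. Classification problems deal with discrete values or classes. We calculate the probability that an observation is in a particular class, and pick the class with the highest probability. Let observation $x$ have feature vector $f$ and class $y$. A first attempt is
$$P(y = \text{true} \mid x) = \sum_{i=0}^{N} w_i \times f_i = w \cdot f$$
but a linear function is unbounded, so it cannot serve directly as a probability. Instead, use the model to predict the odds of $y$ being true, and then the log of the odds:
$$\frac{p(y = \text{true} \mid x)}{1 - p(y = \text{true} \mid x)} = w \cdot f$$
$$\ln\left( \frac{p(y = \text{true} \mid x)}{1 - p(y = \text{true} \mid x)} \right) = w \cdot f$$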

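45 Logit Function.
$$\text{logit}(p(x)) = \ln\left( \frac{p(x)}{1 - p(x)} \right)$$
Solving the logit equation for the probability:
$$\ln\left( \frac{p(y = \text{true} \mid x)}{1 - p(y = \text{true} \mid x)} \right) = w \cdot f$$
$$\frac{p(y = \text{true} \mid x)}{1 - p(y = \text{true} \mid x)} = e^{w \cdot f}$$
$$p(y = \text{true} \mid x) = \left( 1 - p(y = \text{true} \mid x) \right) e^{w \cdot f}$$
$$p(y = \text{true} \mid x) + p(y = \text{true} \mid x)\, e^{w \cdot f} = e^{w \cdot f}$$
$$p(y = \text{true} \mid x) = \frac{e^{w \cdot f}}{1 + e^{w \cdot f}}$$
$$p(y = \text{false} \mid x) = \frac{1}{1 + e^{w \cdot f}}$$
This is called the logistic function. Logistic regression is the model in which a linear function is used to estimate the logit of the probability.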

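46 Logistic Regression: Classification. Problem: given an observation $x$, decide whether it belongs to class true or class false. Choose true when
$$p(y = \text{true} \mid x) > p(y = \text{false} \mid x)$$
$$\frac{p(y = \text{true} \mid x)}{p(y = \text{false} \mid x)} > 1$$
$$\frac{p(y = \text{true} \mid x)}{1 - p(y = \text{true} \mid x)} > 1$$
$$e^{w \cdot f} > 1$$
$$w \cdot f > 0$$
$$\sum_{i=0}^{N} w_i f_i > 0$$
Here $\sum_{i=0}^{N} w_i f_i = 0$ is the equation of a hyperplane.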
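The logistic function and the hyperplane decision rule can be checked against each other with a short sketch. The weights and feature vectors below are arbitrary illustrative values, and the function names are introduced here.

```python
import math

def dot(w, f):
    """Dot product w . f of two equal-length sequences."""
    return sum(wi * fi for wi, fi in zip(w, f))

def p_true(w, f):
    """Logistic function: p(y = true | x) = e^(w.f) / (1 + e^(w.f))."""
    return 1.0 / (1.0 + math.exp(-dot(w, f)))

def classify(w, f):
    """Decide 'true' exactly when w . f > 0 (one side of the hyperplane)."""
    return dot(w, f) > 0

# Arbitrary illustrative weights; f[0] = 1 acts as the bias feature.
w = [-1.0, 2.0, 0.5]
examples = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.2, 3.0]]

for f in examples:
    print(f, round(p_true(w, f), 3), classify(w, f))
```

Because the logistic function is monotonic and equals 0.5 at zero, thresholding the probability at 0.5 and testing the sign of $w \cdot f$ give the same decision.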

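47 Maximum Entropy Modeling. In NLP we need to classify problems with multiple classes.
$$p(c \mid x) = \frac{1}{Z} \exp\left( \sum_{i=0}^{N} w_{ci} f_i \right)$$
$$Z = \sum_{c' \in C} \exp\left( \sum_{i=0}^{N} w_{c'i} f_i \right)$$
$$p(c \mid x) = \frac{\exp\left( \sum_{i=0}^{N} w_{ci} f_i \right)}{\sum_{c' \in C} \exp\left( \sum_{i=0}^{N} w_{c'i} f_i \right)}$$
In MaxEnt, instead of $f_i$ we use $f_i(c, x)$, an indicator function meaning feature $i$ for a particular class $c$ for a given observation $x$:
$$p(c \mid x) = \frac{\exp\left( \sum_{i=0}^{N} w_{ci} f_i(c, x) \right)}{\sum_{c' \in C} \exp\left( \sum_{i=0}^{N} w_{c'i} f_i(c', x) \right)}$$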

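48 Maximum Entropy Modeling. Secretariat/NNP is/BEZ expected/VBN to/TO race/?? tomorrow/
$f_1(c, x) = 1$ if $word_i$ = "race" & $c$ = NN; 0 otherwise
$f_2(c, x) = 1$ if $t_{i-1}$ = TO & $c$ = VB; 0 otherwise
$f_3(c, x) = 1$ if suffix($word_i$) = "ing" & $c$ = VBG; 0 otherwise
$f_4(c, x) = 1$ if is_lower_case($word_i$) & $c$ = VB; 0 otherwise
$f_5(c, x) = 1$ if $word_i$ = "race" & $c$ = VB; 0 otherwise
$f_6(c, x) = 1$ if $t_{i-1}$ = TO & $c$ = NN; 0 otherwise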

49 Maximum Entropy Modeling.
$$P(NN \mid x) = \frac{e^{.8}\, e^{-1.3}}{e^{.8}\, e^{-1.3} + e^{.8}\, e^{.01}\, e^{.1}} = .20$$
$$P(VB \mid x) = \frac{e^{.8}\, e^{.01}\, e^{.1}}{e^{.8}\, e^{-1.3} + e^{.8}\, e^{.01}\, e^{.1}} = .80$$
$$\hat{c} = \arg\max_{c \in C} P(c \mid x)$$
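The two class probabilities for "race" can be reproduced numerically. In this sketch the weights are the ones shown in the computation above (0.8 and -1.3 for the features active with NN; 0.8, 0.01, and 0.1 for those active with VB); the function name `maxent_probs` and the dictionary layout are choices made for this example.

```python
import math

def maxent_probs(active_weights):
    """Normalize exp(sum of active feature weights) over all classes."""
    scores = {c: math.exp(sum(ws)) for c, ws in active_weights.items()}
    z = sum(scores.values())  # normalization constant Z
    return {c: s / z for c, s in scores.items()}

# Weights of the features that fire for each candidate tag of "race".
active = {
    "NN": [0.8, -1.3],       # f1 and f6
    "VB": [0.8, 0.01, 0.1],  # f2, f4 and f5
}

probs = maxent_probs(active)
best_tag = max(probs, key=probs.get)  # the argmax over classes
```

Rounded to two decimals, `probs` gives 0.20 for NN and 0.80 for VB, so `best_tag` is VB.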

50 Why call it Maximum Entropy? Problem: assign a tag to the word zzfish. Without any prior information. Knowing that only four tags are possible.


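51 Entropy equation:
$$H(x) = -\sum_{x} P(x) \log_2 P(x)$$
Constraints:
$$P(NN) + P(JJ) + P(NNS) + P(VB) = 1$$
$$P(\text{word is zzfish and } t_i = NN \text{ or } t_i = NNS) = \frac{8}{10}$$
$$P(VB) = \frac{1}{20}$$
$$p^* = \arg\max_{p \in C} H(p)$$
The exponential model for multinomial logistic regression also finds the maximum entropy distribution subject to the constraints from the feature functions.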
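To make the constraint story concrete, the sketch below computes the entropy of three tag distributions for zzfish: the uniform one, and the most even distributions that satisfy the first and then both constraints. The specific distributions are worked out here from the constraints rather than quoted from the slide, and `entropy` is a helper introduced for the example.

```python
import math

def entropy(dist):
    """H = -sum p * log2(p), in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# No information beyond "four tags are possible": spread probability evenly.
uniform = {"NN": 0.25, "JJ": 0.25, "NNS": 0.25, "VB": 0.25}

# Constraint P(NN) + P(NNS) = 8/10: split 0.8 evenly, and the rest evenly.
one_constraint = {"NN": 0.4, "JJ": 0.1, "NNS": 0.4, "VB": 0.1}

# Additional constraint P(VB) = 1/20: JJ absorbs the remaining 0.15.
two_constraints = {"NN": 0.4, "JJ": 0.15, "NNS": 0.4, "VB": 0.05}

h_uniform = entropy(uniform)
h_one = entropy(one_constraint)
h_two = entropy(two_constraints)
```

Each added constraint shrinks the set of allowed distributions, so the maximum achievable entropy can only stay the same or fall.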

52 Maximum Entropy Markov Models (MEMM).
HMM:
$$\hat{T} = \arg\max_{T} P(T \mid W) = \arg\max_{T} P(W \mid T)\, P(T) = \arg\max_{T} \prod_{i} P(word_i \mid tag_i) \prod_{i} P(tag_i \mid tag_{i-1})$$
MEMM:
$$\hat{T} = \arg\max_{T} P(T \mid W) = \arg\max_{T} \prod_{i} P(tag_i \mid word_i, tag_{i-1})$$
Advantages of MEMM: 1. We directly estimate the probability of each tag given the previous tag and the observed word. 2. We can condition on any useful feature of the input observation, which was not possible with the HMM.

53 MEMM.
HMM:
$$P(Q \mid O) = \prod_{i=1}^{n} P(o_i \mid q_i) \times \prod_{i=1}^{n} P(q_i \mid q_{i-1})$$
MEMM:
$$P(Q \mid O) = \prod_{i=1}^{n} P(q_i \mid q_{i-1}, o_i)$$
Figure 6.20 The HMM (top) and MEMM (bottom) representation of the probability computation for the correct sequence of tags for the Secretariat sentence. Each arc would be associated with a probability; the HMM computes two separate probabilities for the observation likelihood and the prior, while the MEMM computes a single probability function at each state, conditioned on the previous state and current observation.

54 MEMM. Figure 6.21 An MEMM for part-of-speech tagging, augmenting the description in Fig. 6.20 by showing that an MEMM can condition on many features of the input, such as capitalization, morphology (ending in -s or -ed), as well as earlier words or tags. We have shown some potential additional features for the first three decisions, using different line styles for each class.
$$P(q \mid q', o) = \frac{1}{Z(o, q')} \exp\left( \sum_{i} w_i f_i(o, q) \right)$$
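The local MEMM distribution can be sketched as a normalized exponential over feature functions that see both the observation and the previous tag. Everything specific in this sketch is hypothetical: the two-tag set, the three feature functions, and the weights are invented for illustration and are not the lecture's model.

```python
import math

TAGS = ("NN", "VB")  # hypothetical reduced tag set

def features(obs, prev_tag, tag):
    """Hypothetical indicator features over the word, the previous tag, and the candidate tag."""
    return [
        1.0 if obs == "race" and tag == "NN" else 0.0,
        1.0 if prev_tag == "TO" and tag == "VB" else 0.0,
        1.0 if obs.islower() and tag == "VB" else 0.0,
    ]

WEIGHTS = [0.8, 0.8, 0.01]  # hypothetical weights, one per feature

def local_prob(tag, prev_tag, obs):
    """P(q | q', o) = exp(sum_i w_i f_i) / Z(o, q')."""
    def score(t):
        return math.exp(sum(w * f for w, f in zip(WEIGHTS, features(obs, prev_tag, t))))
    z = sum(score(t) for t in TAGS)  # normalizer depends on o and q'
    return score(tag) / z

p_vb_after_to = local_prob("VB", "TO", "race")
p_vb_after_dt = local_prob("VB", "DT", "race")
```

With these made-up weights, a preceding TO raises the probability of VB for "race", which is the kind of conditioning on the previous state that the single MEMM probability function allows.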
