Feature-Rich Sequence Models. Statistical NLP Spring 2010. MEMM Taggers. Decoding. Derivative for Maximum Entropy. Maximum Entropy II


Statistical NLP, Spring 2010. Lecture 7: POS / NER Tagging. Dan Klein, UC Berkeley.

Feature-Rich Sequence Models. Problem: HMMs make it hard to work with arbitrary features of a sentence. Example: named entity recognition (NER):

Tim/PER Boon/PER has/O signed/O a/O contract/O extension/O with/O Leicestershire/ORG which/O will/O keep/O him/O at/O Grace/LOC Road/LOC ./O

Local context around the decision for "Grace":

        Prev   Cur    Next
State   Other  ???    ???
Word    at     Grace  Road
Tag     IN     NNP    NNP
Sig     x      Xx     Xx

MEMM Taggers. Idea: make left-to-right local decisions, conditioning on the previous tags and also the entire input. Train P(t_i | w, t_{i-1}, t_{i-2}) as a normal maxent model, then use it to score sequences. This is referred to as an MEMM tagger [Ratnaparkhi 96]. Beam search is effective! (Why?) What about beam size 1? (A decoding sketch follows below.)

Decoding. Decoding MEMM taggers works just like decoding HMMs, only with different local scores: Viterbi, beam search, or posterior decoding. The Viterbi recurrences:

HMMs:    \delta_i(t) = \max_{t'} \delta_{i-1}(t') \, P(t \mid t') \, P(w_i \mid t)
MEMMs:   \delta_i(t) = \max_{t'} \delta_{i-1}(t') \, P(t \mid t', w)
General: \delta_i(t) = \max_{t'} \delta_{i-1}(t') \, \phi(w, i, t', t)

Maximum Entropy II. Remember the maximum entropy objective. Problem: lots of features allow a perfect fit to the training set. Regularization (compare to smoothing): big weights are bad, so penalize them in the objective.

Derivative for Maximum Entropy. The derivative with respect to each weight \lambda_n is the total count of feature n in the correct candidates minus the expected count of feature n in the predicted candidates:

\frac{\partial \log P(C \mid D, \lambda)}{\partial \lambda_n} = \sum_i f_n(c_i, d_i) - \sum_i \sum_c P(c \mid d_i) \, f_n(c, d_i)
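To make this concrete, here is a minimal sketch of that gradient for a maxent classifier, with an L2 penalty supplying the "big weights are bad" term. The dense feature function `features(x, y)` returning a NumPy vector and the Gaussian-prior variance `sigma2` are illustrative assumptions, not code from the lecture.

```python
import numpy as np

def maxent_gradient(weights, data, labels, features, sigma2=1.0):
    # Gradient of the L2-regularized log-likelihood:
    # (count of each feature in the correct candidates)
    #   - (its expected count in the predicted candidates)
    #   - weights / sigma^2 (the regularization term).
    grad = np.zeros_like(weights)
    for x, y in data:
        scores = np.array([weights @ features(x, c) for c in labels])
        probs = np.exp(scores - scores.max())    # softmax, numerically stable
        probs /= probs.sum()
        grad += features(x, y)                   # actual count (correct candidate)
        for c, p in zip(labels, probs):
            grad -= p * features(x, c)           # expected count under the model
    return grad - weights / sigma2               # derivative of -||w||^2 / (2 sigma^2)
```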
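And a minimal sketch of the beam decoding mentioned above for an MEMM-style tagger. The local scorer here is a hypothetical stand-in for a trained maxent model P(t_i | w, t_{i-1}, t_{i-2}); only the search procedure is the point, and beam_size=1 recovers greedy left-to-right decoding.

```python
TAGS = ["O", "PER", "LOC", "ORG"]

def local_log_score(tag, prev_tags, words, i):
    # Hypothetical stand-in for log P(t_i | w, t_{i-1}, t_{i-2});
    # a real MEMM would featurize (words, i, prev_tags) and normalize.
    score = 1.0 if (words[i][0].isupper() and tag != "O") else 0.0
    if prev_tags[-1] == tag:
        score += 0.5
    return score

def beam_decode(words, beam_size=5):
    # Each hypothesis is (total score, tag history); beam_size=1 is greedy.
    beam = [(0.0, ("<s>", "<s>"))]
    for i in range(len(words)):
        candidates = [
            (score + local_log_score(tag, tags[-2:], words, i), tags + (tag,))
            for score, tags in beam
            for tag in TAGS
        ]
        # keep only the top-k partial tag sequences
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return list(beam[0][1][2:])   # strip the two start symbols

print(beam_decode("Tim Boon has signed a contract".split(), beam_size=3))
```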

Example: NER Regularization. Because of the regularization term, the more common prefixes have larger weights even though entire-word features are more specific. Local context (as above): Prev = at (IN, sig x), Cur = Grace (NNP, sig Xx), Next = Road (NNP, sig Xx), previous state Other. The slide compared PERS vs. LOC weights for these features (the numeric weight columns were a figure):

Feature Type              Feature
Previous word             at
Current word              Grace
Beginning bigram          <G
Current POS tag           NNP
Prev and cur tags         IN NNP
Previous state            Other
Current signature         Xx
Prev state, cur sig       O-Xx
Prev-cur-next sig         x-Xx-Xx
P. state - p-cur sig      O-x-Xx

Perceptron Taggers [Collins 01]. Linear models that decompose along the sequence allow us to predict with the Viterbi algorithm, which means we can train with the perceptron algorithm (or related updates, like MIRA). (A sketch of the update follows below.)

Conditional Random Fields. Make a maxent model over entire taggings (a CRF) rather than a chain of local maxent decisions (an MEMM). Like any maxent model, the derivative has the same actual-minus-expected form as above, so all we need is to be able to compute the expectation of each feature: for example, the number of times the label pair DT-NN occurs, or the number of times NN-interest occurs. Critical quantity: counts of posterior marginals.

Computing Posterior Marginals. How many (expected) times is word w tagged with state s? How do we compute that marginal? (Figure: a tag trellis with states ^, N, V, J, D, $ at each position of the sentence "Fed raises interest rates", from START to END.)

TBL Tagger. [Brill 95] presents a transformation-based tagger. Label the training set with the most frequent tags:

The/DT can/MD was/VBD rusted/VBD ./.

Then add transformation rules which reduce training mistakes, e.g. MD -> NN after DT, and VBD -> VBN after VBD. Stop when no transformation does sufficient good. Does this remind anyone of anything? Probably the most widely used tagger (esp. outside NLP), but definitely not the most accurate: 96.6% / 82.0%.
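As a sketch of that perceptron training loop, assuming a sparse feature extractor `seq_features(words, tags)` returning a dict of counts and a given `viterbi_decode(weights, words)` routine (both hypothetical interfaces):

```python
def perceptron_epoch(weights, corpus, seq_features, viterbi_decode):
    # One pass of the structured perceptron [Collins 01]: decode with the
    # current weights, then push the weights toward the gold sequence's
    # features and away from the predicted sequence's features.
    for words, gold_tags in corpus:
        pred_tags = viterbi_decode(weights, words)
        if pred_tags != gold_tags:
            for f, v in seq_features(words, gold_tags).items():
                weights[f] = weights.get(f, 0.0) + v
            for f, v in seq_features(words, pred_tags).items():
                weights[f] = weights.get(f, 0.0) - v
    return weights
```

Averaging the weights across updates is the usual practical refinement of this loop.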
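And a tiny sketch of TBL-style rule application on the example above; the (from-tag, to-tag, previous-tag) trigger format is a simplification of Brill's full rule templates.

```python
def apply_tbl_rules(tags, rules):
    # Apply Brill-style transformations left to right; each rule
    # (from_tag, to_tag, prev_tag) fires when the previous tag matches.
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags

# "The can was rusted ." as mis-tagged by most-frequent-tag initialization:
initial = ["DT", "MD", "VBD", "VBD", "."]
rules = [("MD", "NN", "DT"), ("VBD", "VBN", "VBD")]
print(apply_tbl_rules(initial, rules))  # ['DT', 'NN', 'VBD', 'VBN', '.']
```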

TBL Tagger II. What gets learned? [from Brill 95] (Figure: examples of learned transformation rules.)

EngCG Tagger. The English constraint grammar tagger [Tapanainen and Voutilainen 94] is something else you should know about: it is hand-written and knowledge-driven. "Don't guess if you know" (a general point about modeling more structure!). Its tag set doesn't make all of the hard distinctions of the standard tag set (e.g. JJ/NN), and it gets stellar accuracies: 99% on its tag set. Linguistic representation matters, but it is easier to win when you make up the rules.

Domain Effects. Accuracies degrade outside of the training domain, with up to triple the error rate, and you usually make the most errors on the things you care about in the domain (e.g. protein names). Open questions: How do we effectively exploit unlabeled data from a new domain (what could we gain)? How do we best incorporate domain lexica in a principled way (e.g. the UMLS SPECIALIST lexicon, ontologies)?

Unsupervised Tagging? AKA part-of-speech induction. Task: raw sentences in, tagged sentences out. The obvious thing to do: start with a (mostly) uniform HMM, run EM, and inspect the results.

EM for HMMs: Process. Alternate between recomputing distributions over the hidden variables (the tags) and re-estimating the parameters. The crucial step: we want to tally up how many (fractional) counts of each kind of transition and emission we have under the current parameters.

EM for HMMs: Quantities. Total path values (which correspond to probabilities here) are the forward and backward scores

\alpha_i(s) = P(w_1 \ldots w_i, \; t_i = s), \qquad \beta_i(s) = P(w_{i+1} \ldots w_n \mid t_i = s)

— the same quantities we needed to train a CRF!
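A minimal sketch of computing these quantities (and the per-position posterior marginals used both for CRF training and for EM) over a toy trellis, assuming a uniform start distribution and NumPy probability tables `trans[s, s']` and `emit[s, o]` (names illustrative):

```python
import numpy as np

def forward_backward(trans, emit, obs):
    # Forward/backward scores for an HMM-style trellis.
    # trans[s, s'] = P(s' | s); emit[s, o] = P(o | s); obs = observation ids.
    n, S = len(obs), trans.shape[0]
    alpha = np.zeros((n, S))
    beta = np.zeros((n, S))
    alpha[0] = emit[:, obs[0]] / S                # uniform start distribution
    for i in range(1, n):
        alpha[i] = (alpha[i - 1] @ trans) * emit[:, obs[i]]
    beta[-1] = 1.0
    for i in range(n - 2, -1, -1):
        beta[i] = trans @ (emit[:, obs[i + 1]] * beta[i + 1])
    posterior = alpha * beta                      # row i: P(t_i = s, obs)
    posterior /= posterior.sum(axis=1, keepdims=True)
    return alpha, beta, posterior                 # posterior[i, s] = P(t_i = s | obs)
```

Summing `posterior[i, s]` over the positions i where word w occurs answers the slide's question: the expected number of times w is tagged with s.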

EM for HMMs: Process. From these quantities we can compute the expected transitions

E[\text{count}(s \to s')] = \frac{1}{P(\mathbf{w})} \sum_i \alpha_i(s) \, P(s' \mid s) \, P(w_{i+1} \mid s') \, \beta_{i+1}(s')

and the expected emissions

E[\text{count}(s, w)] = \frac{1}{P(\mathbf{w})} \sum_{i \,:\, w_i = w} \alpha_i(s) \, \beta_i(s)

Merialdo: Setup. Some (discouraging) experiments [Merialdo 94]. Setup: you know the set of allowable tags for each word. Fix k training examples to their true labels; learn P(w | t) and P(t | t_{-1}, t_{-2}) on these examples; then re-estimate with EM on n examples. Note: we know the allowed tags but not their frequencies.

Merialdo: Results. (Figure: results table.)

Distributional Clustering. (Figure: the sentence "the president said that the downturn was over", with context evidence such as "president" occurring after "the" and before "reported"/"of"/"that", near words like "appointed" and "sources", used to group words by the contexts they appear in.) [Finch and Chater 92, Schütze 93, many others]

Distributional Clustering / Nearest Neighbors. (Figure: nearest-neighbor lists.) Three main variants on the same idea: (1) pairwise similarities and heuristic clustering, e.g. [Finch and Chater 92], which produces dendrograms; (2) vector space methods, e.g. [Schütze 93], with models of ambiguity; (3) probabilistic methods in various formulations, e.g. [Lee and Pereira 99].
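Continuing the sketch: the expected-count tallies above, plus the re-estimation (M-)step, using the `alpha`/`beta` arrays from the `forward_backward` sketch earlier (still an illustrative toy, not the lecture's code).

```python
import numpy as np

def expected_counts(trans, emit, obs, alpha, beta):
    # E-step tallies: expected transition and emission counts.
    n, S = len(obs), trans.shape[0]
    z = alpha[-1].sum()                           # P(observations)
    trans_counts = np.zeros_like(trans)
    emit_counts = np.zeros_like(emit)
    for i in range(n - 1):
        # xi[s, s'] = P(t_i = s, t_{i+1} = s' | obs)
        xi = alpha[i][:, None] * trans * emit[:, obs[i + 1]] * beta[i + 1] / z
        trans_counts += xi
    gamma = alpha * beta / z                      # gamma[i, s] = P(t_i = s | obs)
    for i in range(n):
        emit_counts[:, obs[i]] += gamma[i]
    return trans_counts, emit_counts

def m_step(trans_counts, emit_counts):
    # Re-estimate parameters by normalizing the expected counts.
    return (trans_counts / trans_counts.sum(axis=1, keepdims=True),
            emit_counts / emit_counts.sum(axis=1, keepdims=True))
```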

Dendrograms. (Figure: example dendrograms from heuristic clustering.)

A Probabilistic Version? Model the sentence S and a class sequence C = c_1 ... c_8 jointly (e.g. over "the president said that the downturn was over"), either with independent classes that generate their local context,

P(S, C) = \prod_i P(c_i) \, P(w_i \mid c_i) \, P(w_{i-1}, w_{i+1} \mid c_i)

or with class-to-class transitions, i.e. an HMM over classes:

P(S, C) = \prod_i P(w_i \mid c_i) \, P(c_i \mid c_{i-1})

What Else? Various newer ideas: context distributional clustering [Clark 00]; morphology-driven models [Clark 03]; contrastive estimation [Smith and Eisner 05]; feature-rich induction [Haghighi and Klein 06]. Also: what about ambiguous words? Using wider context signatures has been used for learning synonyms (what is wrong with this approach?). These ideas can also be extended to grammar induction (later).
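A toy sketch in the heuristic vector-space spirit of [Finch and Chater 92] and [Schütze 93]: represent each word by counts of its neighboring words and rank candidate cluster-mates by cosine similarity. The data and window size are illustrative.

```python
from collections import Counter, defaultdict
import math

def context_vectors(sentences, window=1):
    # Represent each word by counts of words appearing within +/- window.
    vecs = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    vecs[w][sent[j]] += 1
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def nearest_neighbors(word, vecs, k=3):
    sims = [(cosine(vecs[word], vecs[w]), w) for w in vecs if w != word]
    return sorted(sims, reverse=True)[:k]

sents = [s.split() for s in
         ["the president said that the downturn was over",
          "the chairman said that the downturn was over"]]
print(nearest_neighbors("president", context_vectors(sents)))
```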
