Maxent Models & Deep Learning



1 Maxent Models & Deep Learning
1. Last bits of maxent (sequence) models
   1. MEMMs vs. CRFs
   2. Smoothing/regularization in maxent models
2. Deep Learning
   1. What is it? Why is it good? (Part 1)
   2. From logistic regression to neural networks
   3. Word vector representations

2 Maximum entropy sequence models
Maximum entropy Markov models (MEMMs), a.k.a. Conditional Markov models

3 Inference in Systems
[Diagram: at the sequence level, sequence data feed a sequence model, and inference is a search over it. At the local level, local data pass through feature extraction to give features and labels for a classifier type (here a maximum entropy model), which is trained by optimization with smoothing/regularization.]

4 CRFs [Lafferty, Pereira, and McCallum 2001]
Another sequence model: Conditional Random Fields (CRFs)
- A whole-sequence conditional model rather than a chaining of local models:
  $$P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c, d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c', d)}$$
- The space of c's is now the space of sequences.
- But if the features $f_i$ remain local, the conditional sequence likelihood can be calculated exactly using dynamic programming.
- Training is slower, but CRFs avoid causal-competition biases.
- These (or a variant using a max margin criterion) are seen as the state of the art these days, but in practice they usually work much the same as MEMMs.
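The dynamic-programming claim can be checked on a toy model. The sketch below is only an illustration under assumptions that are not in the slides: features are restricted to per-position scores (`unary`) and adjacent-label scores (`pairwise`), both hypothetical names holding already-weighted feature sums. It computes the log of the CRF denominator twice, once by enumerating every label sequence and once with the forward recursion.

```python
import itertools
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def sequence_score(labels, unary, pairwise):
    """Unnormalized log-score of one label sequence: sum of its local scores."""
    score = sum(unary[t][c] for t, c in enumerate(labels))
    score += sum(pairwise[a][b] for a, b in zip(labels, labels[1:]))
    return score

def log_partition_brute_force(unary, pairwise):
    """Log denominator by enumerating all sequences (exponential in length)."""
    n_labels = len(unary[0])
    all_sequences = itertools.product(range(n_labels), repeat=len(unary))
    return logsumexp([sequence_score(s, unary, pairwise) for s in all_sequences])

def log_partition_forward(unary, pairwise):
    """Same log denominator by the forward recursion (linear in length)."""
    n_labels = len(unary[0])
    # alpha[b]: log-sum of scores of all prefixes that end in label b
    alpha = list(unary[0])
    for t in range(1, len(unary)):
        alpha = [
            unary[t][b]
            + logsumexp([alpha[a] + pairwise[a][b] for a in range(n_labels)])
            for b in range(n_labels)
        ]
    return logsumexp(alpha)

def sequence_log_prob(labels, unary, pairwise):
    """log P(c | d, lambda) for a whole label sequence."""
    return sequence_score(labels, unary, pairwise) - log_partition_forward(unary, pairwise)

# Toy scores (made up for illustration): 3 positions, 2 labels.
unary = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.9]]
pairwise = [[0.2, -0.1], [-0.3, 0.4]]
```

Both routines agree on this toy input, and the probabilities of all eight label sequences sum to one; only the forward version stays tractable as sequences get longer.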

5 CoNLL 2003 NER shared task: results on English devset
[Bar chart: scores by category (Overall, Loc, Misc, Org, Person) for three systems: MEMM 1st, CRF, and MMMN. The axis values are not recoverable.]

6 Smoothing/Priors/Regularization for Maxent Models

7 Smoothing: Issues of Scale
- Lots of features:
  - NLP maxent models can have ten million features.
  - Even storing a single array of parameter values can have a substantial memory cost.
- Lots of sparsity:
  - Overfitting is very easy, so we need smoothing!
  - Many features seen in training will never occur again at test time.
- Optimization problems:
  - Feature weights can be infinite, and iterative solvers can take a long time to get to those infinities.

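8 Smoothing: Issues
Assume the following empirical distribution:
  Heads  Tails
  h      t
Features: {Heads}, {Tails}
We'll have the following softmax model distribution:
  $$p_{\text{HEADS}} = \frac{e^{\lambda_H}}{e^{\lambda_H} + e^{\lambda_T}} \qquad p_{\text{TAILS}} = \frac{e^{\lambda_T}}{e^{\lambda_H} + e^{\lambda_T}}$$
Logistic regression! Really, only one degree of freedom ($\lambda = \lambda_H - \lambda_T$):
  $$p_{\text{HEADS}} = \frac{e^{\lambda_H} e^{-\lambda_T}}{e^{\lambda_H} e^{-\lambda_T} + e^{\lambda_T} e^{-\lambda_T}} = \frac{e^{\lambda}}{e^{\lambda} + e^{0}} = \frac{e^{\lambda}}{e^{\lambda} + 1}$$
  $$p_{\text{TAILS}} = \frac{e^{0}}{e^{\lambda} + e^{0}} = \frac{1}{e^{\lambda} + 1}$$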
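The reduction to one degree of freedom is easy to confirm numerically. A minimal sketch, with function names chosen here for illustration: the two-weight softmax depends only on the difference of the weights, so it matches the logistic function of that difference.

```python
import math

def softmax_heads(lambda_h, lambda_t):
    """p_HEADS under the two-parameter softmax model."""
    return math.exp(lambda_h) / (math.exp(lambda_h) + math.exp(lambda_t))

def logistic(lam):
    """p_HEADS under the one-parameter form, lam = lambda_h - lambda_t."""
    return math.exp(lam) / (math.exp(lam) + 1.0)

# Arbitrary example weights; shifting both by a constant changes nothing.
p_two_param = softmax_heads(1.3, -0.4)
p_shifted = softmax_heads(1.3 + 5.0, -0.4 + 5.0)
p_one_param = logistic(1.3 - (-0.4))
```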

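9 Smoothing: Issues
The data likelihood in this model is:
  $$\log P(h, t \mid \lambda) = h \log p_{\text{HEADS}} + t \log p_{\text{TAILS}}$$
  $$\log P(h, t \mid \lambda) = h\lambda - (t + h) \log(1 + e^{\lambda})$$
[Three plots of log P against λ, one for each of three Heads/Tails count tables. The counts are not recoverable.]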
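To see how the shape of this likelihood depends on the counts, the closed form can be evaluated directly. This is a small sketch; the 3/1 counts are an assumed example, chosen because the maximum then sits at the finite value λ = log 3. With no tails at all, the function keeps increasing toward 0 and never reaches a maximum.

```python
import math

def log_likelihood(h, t, lam):
    """log P(h, t | lam) = h*lam - (t + h)*log(1 + e^lam)."""
    return h * lam - (t + h) * math.log1p(math.exp(lam))

# Assumed counts 3 heads / 1 tail: the peak is at lam = log(3/1).
peak = math.log(3.0)
at_peak = log_likelihood(3, 1, peak)
left_of_peak = log_likelihood(3, 1, peak - 0.1)
right_of_peak = log_likelihood(3, 1, peak + 0.1)

# Counts 4 heads / 0 tails: larger lam is always better, approaching 0 from below.
no_tails = [log_likelihood(4, 0, lam) for lam in (1.0, 5.0, 10.0)]
```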

10 Smoothing: Early Stopping
In the 4/0 case, there were two problems:
- The optimal value of λ was ∞, which is a long trip for an optimization procedure.
- The learned distribution is just as spiked as the empirical one: no smoothing.
One way to solve both issues is to just stop the optimization early, after a few iterations:
- The value of λ will be finite (but presumably big).
- The optimization won't take forever (clearly).
- Commonly used in early maxent work.
- Has seen a revival in deep learning.
[Figure: input counts Heads 4, Tails 0; output distribution Heads 1, Tails 0.]
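A rough sketch of early stopping on the 4/0 counts follows. Plain gradient ascent, the step size, and the starting point are assumptions made for illustration; the slides do not name an optimizer. The gradient of the log-likelihood is observed heads minus expected heads, which stays positive when there are no tails, so λ keeps growing and the number of steps decides how large it gets.

```python
import math

def gradient(h, t, lam):
    """d/dlam of h*lam - (h+t)*log(1+e^lam): observed minus expected heads."""
    p_heads = 1.0 / (1.0 + math.exp(-lam))
    return h - (h + t) * p_heads

def gradient_ascent(h, t, steps, learning_rate=0.5, lam=0.0):
    """Run a fixed number of ascent steps and return the final lam."""
    for _ in range(steps):
        lam += learning_rate * gradient(h, t, lam)
    return lam

lam_early = gradient_ascent(4, 0, steps=5)    # stopped after a few iterations
lam_late = gradient_ascent(4, 0, steps=500)   # still climbing, never converges
p_early = 1.0 / (1.0 + math.exp(-lam_early))  # smoothed estimate, below 1
```

Stopping early leaves λ finite and the predicted probability of heads strictly below 1, which is the smoothing effect described above.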

11 Smoothing: Priors (MAP)
What if we had a prior expectation that parameter values wouldn't be very large?
We could then balance evidence suggesting large (or infinite) parameters against our prior.
The evidence would never totally defeat the prior, and parameters would be smoothed (and kept finite!).
We can do this explicitly by changing the optimization objective to maximum posterior likelihood:
  $$\underbrace{\log P(C, \lambda \mid D)}_{\text{Posterior}} = \underbrace{\log P(\lambda)}_{\text{Prior}} + \underbrace{\log P(C \mid D, \lambda)}_{\text{Evidence}}$$

12 Smoothing: Priors
Gaussian, or quadratic, or L2 priors:
- Intuition: parameters shouldn't be large.
- Formalization: prior expectation that each parameter will be distributed according to a gaussian with mean µ and variance σ²:
  $$P(\lambda_i) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left(-\frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2}\right)$$
- Penalizes parameters for drifting too far from their mean prior value (usually µ = 0).
- 2σ² = 1 works surprisingly well.
[Plot of the prior density for several values of 2σ², including 1 and 10; one legend value is not recoverable. Cartoon caption: "They don't even capitalize my name anymore!"]
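Taking the log of this prior shows why it is also called a quadratic prior. The short derivation below assumes the parameters have independent priors, so that the joint log prior is a sum over i, and it collects everything that does not depend on λ into a constant k (a symbol introduced here for convenience).

```latex
\begin{aligned}
\log P(\lambda_i) &= -\frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2} - \log\left(\sigma_i \sqrt{2\pi}\right) \\
\log P(\lambda) &= \sum_i \log P(\lambda_i) = -\sum_i \frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2} + k \\
\frac{\partial \log P(\lambda)}{\partial \lambda_i} &= -\frac{\lambda_i - \mu_i}{\sigma_i^2}
\end{aligned}
```

The prior therefore adds a squared-distance penalty to the objective and a linear pull toward µ to the gradient.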

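13 Smoothing: Priors
If we use gaussian priors / L2 regularization:
- Trade off some expectation-matching for smaller parameters.
- When multiple features can be recruited to explain a data point, the more common ones generally receive more weight.
- Accuracy generally goes up!
Change the objective:
  $$\log P(C, \lambda \mid D) = \log P(C \mid D, \lambda) + \log P(\lambda)$$
  $$\log P(C, \lambda \mid D) = \sum_{(c,d) \in (C,D)} \log P(c \mid d, \lambda) - \sum_i \frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2} + k$$
Change the derivative:
  $$\frac{\partial \log P(C, \lambda \mid D)}{\partial \lambda_i} = \text{actual}(f_i, C) - \text{predicted}(f_i, \lambda) - \frac{\lambda_i - \mu_i}{\sigma_i^2}$$
[Plot legend: curves for several values of 2σ², including 1 and 10; one value is not recoverable.]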

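14 Smoothing: Priors
If we use gaussian priors / L2 regularization:
- Trade off some expectation-matching for smaller parameters.
- When multiple features can be recruited to explain a data point, the more common ones generally receive more weight.
- Accuracy generally goes up!
Change the objective (taking the prior mean as 0):
  $$\log P(C, \lambda \mid D) = \log P(C \mid D, \lambda) + \log P(\lambda)$$
  $$\log P(C, \lambda \mid D) = \sum_{(c,d) \in (C,D)} \log P(c \mid d, \lambda) - \sum_i \frac{\lambda_i^2}{2\sigma_i^2} + k$$
Change the derivative:
  $$\frac{\partial \log P(C, \lambda \mid D)}{\partial \lambda_i} = \text{actual}(f_i, C) - \text{predicted}(f_i, \lambda) - \frac{\lambda_i}{\sigma_i^2}$$
[Plot legend: curves for several values of 2σ², including 1 and 10; one value is not recoverable.]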
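Applied to the coin model with 4 heads and 0 tails, the zero-mean penalized derivative becomes observed heads minus expected heads minus λ/σ². The sketch below maximizes that objective by plain gradient ascent; the optimizer, step size, and iteration count are assumptions for illustration only. Unlike the unpenalized case, the optimum is finite, and a tighter prior (smaller σ²) gives a smaller λ.

```python
import math

def penalized_gradient(h, t, lam, sigma2):
    """actual - predicted - lam/sigma^2 for the one-parameter coin model."""
    p_heads = 1.0 / (1.0 + math.exp(-lam))
    return h - (h + t) * p_heads - lam / sigma2

def map_estimate(h, t, sigma2, steps=2000, learning_rate=0.1):
    """Gradient ascent on the log posterior with a zero-mean gaussian prior."""
    lam = 0.0
    for _ in range(steps):
        lam += learning_rate * penalized_gradient(h, t, lam, sigma2)
    return lam

lam_tight = map_estimate(4, 0, sigma2=0.5)  # 2*sigma^2 = 1
lam_loose = map_estimate(4, 0, sigma2=5.0)  # 2*sigma^2 = 10
```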

15 Example: NER Smoothing
Because of smoothing, the more common prefix and single-tag features have larger weights even though entire-word and tag-pair features are more specific.

Local Context
         Prev    Cur     Next
  State  Other   ???     ???
  Word   at      Grace   Road
  Tag    IN      NNP     NNP
  Sig    x       Xx      Xx

Feature Weights (the PERS and LOC weight values are missing)
  Feature Type            Feature
  Previous word           at
  Current word            Grace
  Beginning bigram        <G
  Current POS tag         NNP
  Prev and cur tags       IN NNP
  Previous state          Other
  Current signature       Xx
  Prev state, cur sig     O-Xx
  Prev-cur-next sig       x-Xx-Xx
  P. state - p-cur sig    O-x-Xx
  Total:

16 Example: Named Entity Feature Overlap
"Grace" is correlated with PERSON, but does not add much evidence on top of already knowing prefix features.

Local Context
         Prev    Cur     Next
  State  Other   ???     ???
  Word   at      Grace   Road
  Tag    IN      NNP     NNP
  Sig    x       Xx      Xx

Feature Weights (the PERS and LOC weight values are missing)
  Feature Type            Feature
  Previous word           at
  Current word            Grace
  Beginning bigram        <G
  Current POS tag         NNP
  Prev and cur tags       IN NNP
  Previous state          Other
  Current signature       Xx
  Prev state, cur sig     O-Xx
  Prev-cur-next sig       x-Xx-Xx
  P. state - p-cur sig    O-x-Xx
  Total:

17 Example: POS Tagging
From (Toutanova et al., 2003):
[Table: DevTest performance, with rows Overall Accuracy and Unknown Word Acc and columns Without Smoothing and With Smoothing. The values are missing.]
Smoothing helps:
- Softens distributions.
- Pushes weight onto more explanatory features.
- Allows many features to be dumped safely into the mix.
- Speeds up convergence (if both are allowed to converge)!

18 Smoothing / Regularization
- Talking of priors and MAP estimation is Bayesian language.
- In frequentist statistics, people will instead talk about using regularization, and in particular, a gaussian prior is L2 regularization.
- The choice of names makes no difference to the math.
- Recently, L1 regularization is also very popular:
  - Gives sparse solutions: most parameters become zero [Yay!]
  - Harder optimization problem (non-continuous derivative)
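Why L1 yields exact zeros while L2 only shrinks can be seen in a one-weight toy problem. This is a deliberate simplification and not the maxent objective: assume the data term is the quadratic ½(λ − a)², where a is the hypothetical unregularized optimum, and alpha is the penalty strength. The L2-penalized minimizer of ½(λ − a)² + (alpha/2)λ² is a/(1 + alpha), while the L1-penalized minimizer of ½(λ − a)² + alpha|λ| is the soft-threshold of a, which is exactly zero whenever |a| ≤ alpha.

```python
import math

def l2_shrink(a, alpha):
    """Minimizer of 0.5*(lam - a)**2 + 0.5*alpha*lam**2: scales a toward zero."""
    return a / (1.0 + alpha)

def l1_shrink(a, alpha):
    """Minimizer of 0.5*(lam - a)**2 + alpha*abs(lam): soft-thresholding."""
    magnitude = max(abs(a) - alpha, 0.0)
    return math.copysign(magnitude, a) if magnitude > 0.0 else 0.0

# Made-up unregularized weights for four hypothetical features.
raw_weights = [2.0, 0.3, -0.1, -1.5]
l1_weights = [l1_shrink(a, 0.5) for a in raw_weights]
l2_weights = [l2_shrink(a, 0.5) for a in raw_weights]
```

Here the two small weights are set exactly to zero under L1, while under L2 every weight stays nonzero, just smaller.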

19 Smoothing: Virtual Data
Another option: smooth the data, not the parameters.
Example:
  Heads  Tails        Heads  Tails
  4      0       →    5      1
- Equivalent to adding two extra data points.
- Similar to add-one smoothing for generative models.
- For feature-based models, it is hard to know what artificial data to create!
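For the one-parameter coin model the effect of virtual data can be computed in closed form, since the maximum-likelihood λ is the log-odds log(h/t). The sketch below (helper names are illustrative) compares the raw 4/0 counts with the smoothed 5/1 counts.

```python
import math

def mle_lambda(h, t):
    """Maximum-likelihood lam = log(h / t); infinite when one count is zero."""
    if t == 0:
        return math.inf
    if h == 0:
        return -math.inf
    return math.log(h / t)

def p_heads(lam):
    """Model probability of heads, e^lam / (e^lam + 1), safe at +/- infinity."""
    if lam == math.inf:
        return 1.0
    if lam == -math.inf:
        return 0.0
    return 1.0 / (1.0 + math.exp(-lam))

raw_lam = mle_lambda(4, 0)               # infinite weight, spiked distribution
smoothed_lam = mle_lambda(4 + 1, 0 + 1)  # one virtual head and one virtual tail
smoothed_p = p_heads(smoothed_lam)       # 5/6 instead of 1
```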

20 Smoothing: Count Cutoffs
In NLP, features with low empirical counts are often dropped.
- Very weak and indirect smoothing method.
- Equivalent to locking their weight to be zero.
- Equivalent to assigning them gaussian priors with mean zero and variance zero.
- Dropping low counts does remove the features which were most in need of smoothing...
- ... and speeds up the estimation by reducing model size...
- ... but count cutoffs generally hurt accuracy in the presence of proper smoothing.
Don't use count cutoffs unless necessary for memory usage reasons. Prefer L1 regularization for finding features to drop.
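Mechanically, a count cutoff is just a filter over feature counts applied before training. A minimal sketch, in which the feature strings and the threshold are made up for illustration:

```python
from collections import Counter

def apply_count_cutoff(feature_lists, min_count):
    """Drop every feature seen fewer than min_count times in the training data.

    feature_lists holds one list of active feature names per training example.
    Returns the filtered examples and the set of features that survive.
    """
    counts = Counter(f for feats in feature_lists for f in feats)
    kept = {f for f, c in counts.items() if c >= min_count}
    filtered = [[f for f in feats if f in kept] for feats in feature_lists]
    return filtered, kept

# Hypothetical training examples with word, prefix, and tag features.
examples = [
    ["word=Grace", "prefix=<G", "tag=NNP"],
    ["word=Green", "prefix=<G", "tag=NNP"],
    ["word=Road", "prefix=<R", "tag=NNP"],
]
filtered_examples, kept_features = apply_count_cutoff(examples, min_count=2)
```

On this toy data only the prefix `<G` and the tag feature survive; every entire-word feature is dropped, which amounts to fixing its weight at zero.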

21 Smoothing/Priors/Regularization for Maxent Models
