CSE 517 Natural Language Processing Winter 2015


1 CSE 517 Natural Language Processing Winter 2015 Feature Rich Models Sameer Singh Guest lecture for Yejin Choi - University of Washington [Slides from Jason Eisner, Dan Klein, Luke Zettlemoyer]

2 Outline POS Tagging MaxEnt MEMM CRFs Wrap-up Optional: Perceptron

3 Outline POS Tagging MaxEnt MEMM CRFs Wrap-up Optional: Perceptron

4 Parts-of-Speech (English) One basic kind of linguistic structure: syntactic word classes.
Open class (lexical) words: Nouns (Proper: IBM, Italy; Common: cat/cats, snow), Verbs (Main: see, registered; Modals: can, had), Adjectives (yellow), Adverbs (slowly), Numbers (122,312, one), ... more
Closed class (functional) words: Determiners (the, some), Conjunctions (and, or), Prepositions (to, with), Particles (off, up), Pronouns (he, its), ... more

5 Penn Treebank POS: 36 possible tags, 34 pages of tagging guidelines.
CC conjunction, coordinating: and both but either or
CD numeral, cardinal: mid-1890 nine-thirty 0.5 one
DT determiner: a all an every no that the
EX existential there: there
FW foreign word: gemeinschaft hund ich jeux
IN preposition or conjunction, subordinating: among whether out on by if
JJ adjective or numeral, ordinal: third ill-mannered regrettable
JJR adjective, comparative: braver cheaper taller
JJS adjective, superlative: bravest cheapest tallest
MD modal auxiliary: can may might will would
NN noun, common, singular or mass: cabbage thermostat investment subhumanity
NNP noun, proper, singular: Motown Cougar Yvette Liverpool
NNPS noun, proper, plural: Americans Materials States
NNS noun, common, plural: undergraduates bric-a-brac averages
POS genitive marker: ' 's
PRP pronoun, personal: hers himself it we them
PRP$ pronoun, possessive: her his mine my our ours their thy your
RB adverb: occasionally maddeningly adventurously
RBR adverb, comparative: further gloomier heavier less-perfectly
RBS adverb, superlative: best biggest nearest worst
RP particle: aboard away back by on open through
ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz

6 Penn Treebank POS (continued):
PRP pronoun, personal: hers himself it we them
PRP$ pronoun, possessive: her his mine my our ours their thy your
RB adverb: occasionally maddeningly adventurously
RBR adverb, comparative: further gloomier heavier less-perfectly
RBS adverb, superlative: best biggest nearest worst
RP particle: aboard away back by on open through
TO "to" as preposition or infinitive marker: to
UH interjection: huh howdy uh whammo shucks heck
VB verb, base form: ask bring fire see take
VBD verb, past tense: pleaded swiped registered saw
VBG verb, present participle or gerund: stirring focusing approaching erasing
VBN verb, past participle: dilapidated imitated reunifed unsettled
VBP verb, present tense, not 3rd person singular: twist appear comprise mold postpone
VBZ verb, present tense, 3rd person singular: bases reconstructs marks uses
WDT WH-determiner: that what whatever which whichever
WP WH-pronoun: that what whatever which who whom
WP$ WH-pronoun, possessive: whose
WRB WH-adverb: however whenever where why
ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz

7 Why POS Tagging? Useful in and of itself (more than you'd think): text-to-speech (record, lead); lemmatization (saw[v] → see, saw[n] → saw); quick-and-dirty NP-chunk detection: grep {JJ NN}* {NN NNS}. Useful as a pre-processing step for parsing: less tag ambiguity means fewer parses. However, some tag choices are better decided by parsers:
The Georgia branch had taken on loan commitments — DT NNP NN VBD VBN RP/IN NN NNS
The average of interbank offered rates plummeted — DT NN IN NN VBD/VBN NNS VBD
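To make the quick-and-dirty NP-chunk idea concrete, here is a minimal sketch (not from the slides) that applies the {JJ NN}* {NN NNS} pattern to a list of (word, tag) pairs; the function name and the greedy matching strategy are my own illustrative choices.

```python
def np_chunks(tagged):
    """Greedy, quick-and-dirty NP chunker over (word, tag) pairs: {JJ|NN}* {NN|NNS}."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        while j < n and tagged[j][1] in ("JJ", "NN"):  # (JJ|NN)* prefix
            j += 1
        if j < n and tagged[j][1] in ("NN", "NNS"):    # final NN or NNS after the prefix
            end = j + 1
        elif j > i and tagged[j - 1][1] == "NN":       # or let the last prefix NN be the head
            end = j
        else:
            i += 1
            continue
        chunks.append(" ".join(w for w, _ in tagged[i:end]))
        i = end
    return chunks

print(np_chunks([("The", "DT"), ("Georgia", "NNP"), ("branch", "NN"),
                 ("had", "VBD"), ("taken", "VBN"), ("on", "RP"),
                 ("loan", "NN"), ("commitments", "NNS")]))
# ['branch', 'loan commitments']
```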

8 Part-of-Speech Ambiguity Words can have multiple parts of speech Fed raises interest rates

9 Part-of-Speech Ambiguity Words can have multiple parts of speech Fed raises interest rates

10 Ambiguity in POS Tagging Particle (RP) vs. preposition (IN): He talked over the deal. / He talked over the telephone. Past tense (VBD) vs. past participle (VBN): The horse walked past the barn. / The horse walked past the barn fell. Noun vs. adjective? The executive decision. Noun vs. present participle: Fishing can be fun.

11 Ambiguity in POS Tagging Like can be a verb or a preposition: I like/VBP candy. Time flies like/IN an arrow. Around can be a preposition, particle, or adverb: I bought it at the shop around/IN the corner. I never got around/RP to getting a car. A new Prius costs around/RB $25K.

12 Baselines and Upper Bounds Choose the most common tag: 90.3% with a bad unknown word model, 93.7% with a good one. Noise in the data: many errors in the training and test corpora; probably about 2% guaranteed error from noise (on this data). For example, "chief executive officer" appears tagged as JJ JJ NN, NN JJ NN, JJ NN NN, and NN NN NN.
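As a concrete illustration of the most-common-tag baseline (my own sketch, not from the slides): learn a per-word tag lookup from training data and back off to the single most frequent tag overall for unknown words, which corresponds to the "bad unknown word model" above. All names and the toy corpus are illustrative.

```python
from collections import Counter, defaultdict

def train_most_frequent_tagger(tagged_sentences):
    """Learn each word's most frequent tag; back off to the overall most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    overall = Counter()
    for c in counts.values():
        overall.update(c)
    default = overall.most_common(1)[0][0]   # the "bad" unknown-word model
    return lambda words: [best.get(w, default) for w in words]

tagger = train_most_frequent_tagger(
    [[("Fed", "NNP"), ("raises", "VBZ"), ("interest", "NN"), ("rates", "NNS")]])
print(tagger(["Fed", "raises", "unseen-word"]))   # ['NNP', 'VBZ', 'NNP'] on this toy corpus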

13 POS Results (bar chart: known vs. unknown word accuracy for the Most Freq baseline and the upper bound)

14 Overview: Accuracies Roadmap of (known / unknown) accuracies:
Most freq tag: ~90% / ~50%
Trigram HMM: ~95% / ~55%
TnT (Brants, 2000), a carefully smoothed trigram tagger with suffix trees for emissions: 96.7% on WSJ text (SOA is ~97.5%); most errors are on unknown words
Upper bound: ~98%

15 POS Results (bar chart: known vs. unknown word accuracy for Most Freq, HMM, and the upper bound)

16 POS Results (bar chart: known vs. unknown word accuracy for Most Freq, HMM, HMM++, and the upper bound)

17 Outline POS Tagging MaxEnt MEMM CRFs Wrap-up Optional: Perceptron

18 What about better features? Choose the most common tag: 90.3% with a bad unknown word model, 93.7% with a good one. What about looking at a word and its environment, but with no sequence information?
Add in previous / next word (the __)
Previous / next word shapes (X __ X)
Occurrence pattern features [X: x X occurs]
Crude entity detection (.. (Inc.|Co.))
Phrasal verb in sentence? (put)
Conjunctions of these things
Uses lots of features: > 200K
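A sketch of what such a local feature function might look like (my own illustration, assuming features are string-valued indicators; the particular feature names are made up):

```python
import re

def word_shape(w):
    """Collapse a word to a coarse shape, e.g. 'Mrs.' -> 'Xx.', '35-year' -> 'd-x'."""
    shape = re.sub(r"[A-Z]+", "X", w)
    shape = re.sub(r"[a-z]+", "x", shape)
    return re.sub(r"[0-9]+", "d", shape)

def local_features(words, i):
    """Indicator features for tagging position i, with no sequence information."""
    prev_w = words[i - 1] if i > 0 else "<S>"
    next_w = words[i + 1] if i + 1 < len(words) else "</S>"
    w = words[i]
    return {
        f"word={w}",
        f"lower={w.lower()}",
        f"prev_word={prev_w}",
        f"next_word={next_w}",
        f"shape={word_shape(w)}",
        f"prev_shape={word_shape(prev_w)}+next_shape={word_shape(next_w)}",  # conjunction
        f"suffix3={w[-3:]}",
    }

print(sorted(local_features(["Fed", "raises", "interest", "rates"], 0)))
```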

19 Maximum Entropy (MaxEnt) Models Also known as log-linear models (linear if you take the log). The feature vector representation may include redundant and overlapping features.
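In symbols, using the notation that reappears on the CRF slides later, the conditional log-linear form is:

$$p(y \mid x; w) = \frac{\exp\big(w \cdot \phi(x, y)\big)}{\sum_{y'} \exp\big(w \cdot \phi(x, y')\big)}$$

where $\phi(x, y)$ is the feature vector and $w$ the weight vector; the log of the numerator is linear in $w$, hence "log-linear".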

20 Training MaxEnt Models Maximize the probability of what is known (the training data); make no assumptions about the rest ("maximum entropy").

21 Training MaxEnt Models Maximizing the likelihood of the training data incidentally maximizes the entropy (hence "maximum entropy"). In particular, we maximize the conditional log likelihood.

22 Training MaxEnt Models Maximizing the likelihood of the training data incidentally maximizes the entropy (hence "maximum entropy"). In particular, we maximize the conditional log likelihood.

23 Convex Optimization for Training The likelihood function is convex (we can get the global optimum). Many optimization algorithms/software packages are available: gradient ascent (descent), conjugate gradient, L-BFGS, etc. All we need are: (1) evaluate the function at the current w, (2) evaluate its derivative at the current w.

24 Convex Optimization for Training The likelihood function is convex (we can get the global optimum). Many optimization algorithms/software packages are available: gradient ascent (descent), conjugate gradient, L-BFGS, etc. All we need are: (1) evaluate the function at the current w, (2) evaluate its derivative at the current w.
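A minimal numpy sketch of batch gradient ascent on the conditional log-likelihood (unregularized), assuming each example comes with a dense feature vector φ(x_i, y) for every candidate label; the array layout, function name, and learning rate are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def maxent_gradient_ascent(phi, gold, num_iters=100, lr=0.1):
    """
    phi:  array of shape (n_examples, n_labels, n_features), phi[i, y] = feature vector for (x_i, y)
    gold: array of shape (n_examples,), gold[i] = index of the correct label for x_i
    Returns weights found by ascending the conditional log-likelihood.
    """
    n, L, d = phi.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        scores = phi @ w                                   # (n, L) unnormalized log-scores
        probs = np.exp(scores - scores.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)          # p(y | x_i; w)
        observed = phi[np.arange(n), gold].sum(axis=0)     # empirical feature counts
        expected = np.einsum("il,ilk->k", probs, phi)      # model expected feature counts
        w += lr * (observed - expected)                    # gradient of the log-likelihood
    return w
```

In practice one would hand this gradient to L-BFGS (as the slide suggests) and add a regularization term, but the gradient itself is just observed minus expected feature counts.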

25 Training MaxEnt Models

26 Training with Regularization

27 Training MaxEnt Models

28 Training with Regularization

29 Graphical Representation of MaxEnt Output: Y Input: x_1 x_2 ... x_n

30 Graphical Representation of MaxEnt

31 Graphical Representation of Naïve Bayes $P(X \mid Y) = \prod_{j=1}^{n} P(x_j \mid Y)$ Output: Y Input: x_1 x_2 ... x_n

32 Graphical Representation of Naïve Bayes $P(X \mid Y) = \prod_{j=1}^{n} P(x_j \mid Y)$

33 Naïve Bayes vs. MaxEnt (Y → x_1 x_2 ... x_n vs. Y ← x_1 x_2 ... x_n)
Naïve Bayes Classifier: a generative model, p(input | output); for instance, for text categorization, P(words | category). Unnecessary effort is spent on generating the input. Independence assumption among input variables: given the category, each word is generated independently of the other words (too strong an assumption in reality!). Cannot incorporate arbitrary / redundant / overlapping features.
Maximum Entropy Classifier: a discriminative model, p(output | input); for instance, for text categorization, P(category | words). Focuses directly on predicting the output. By conditioning on the entire input, we don't need to worry about the independence assumption among input variables. Can incorporate arbitrary features: redundant and overlapping features.

34 Overview: POS Tagging Accuracies Roadmap of (known / unknown) accuracies:
Most freq tag: ~90% / ~50%
Trigram HMM: ~95% / ~55%
TnT (HMM++): 96.2% / 86.0%
MaxEnt P(s_i | x): 96.8% / 86.8%
Upper bound: ~98%

35 POS Results (bar chart: known vs. unknown word accuracy for Most Freq, HMM, HMM++, MaxEnt, and the upper bound)

36 Outline POS Tagging MaxEnt MEMM CRFs Wrap-up Optional: Perceptron

37 Sequence Modeling Predicted POS of neighbors is important

38 MEMM Taggers One step up: also condition on previous tags

39 MEMM Taggers Conditioning on previous tags:

$$p(s_1 \dots s_m \mid x_1 \dots x_m) = \prod_{i=1}^{m} p(s_i \mid s_1 \dots s_{i-1}, x_1 \dots x_m) = \prod_{i=1}^{m} p(s_i \mid s_{i-1}, x_1 \dots x_m)$$

Train up $p(s_i \mid s_{i-1}, x_1 \dots x_m)$ as a discrete log-linear (maxent) model, then use it to score sequences:

$$p(s_i \mid s_{i-1}, x_1 \dots x_m) = \frac{\exp\big(w \cdot \phi(x_1 \dots x_m, i, s_{i-1}, s_i)\big)}{\sum_{s'} \exp\big(w \cdot \phi(x_1 \dots x_m, i, s_{i-1}, s')\big)}$$

This is referred to as an MEMM tagger [Ratnaparkhi 96]

40 HMM vs. MEMM (NNP VBZ VBN TO VB NR — Secretariat is expected to race tomorrow)
HMM: a generative model → joint probability p(words, tags) → generates the input (in addition to the tags), but we need to predict tags, not words! Probability of each slice = emission × transition = p(word_i | tag_i) × p(tag_i | tag_i-1). Cannot incorporate long-distance features.
MEMM: a discriminative or conditional model → conditional probability p(tags | words) → conditions on the input, focusing only on predicting tags. Probability of each slice = p(tag_i | tag_i-1, word_i) or p(tag_i | tag_i-1, all words). Can incorporate long-distance features.
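Writing the contrast above as formulas (my own summary of the slide, using the e(·) / q(·) notation of the trellis slides and ignoring the STOP transition):

HMM (generative, joint): $p(s_1 \dots s_m, x_1 \dots x_m) = \prod_{i=1}^{m} q(s_i \mid s_{i-1})\, e(x_i \mid s_i)$

MEMM (discriminative, conditional): $p(s_1 \dots s_m \mid x_1 \dots x_m) = \prod_{i=1}^{m} p(s_i \mid s_{i-1}, x_1 \dots x_m)$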

41 HMM vs. MEMM — graphical structures over NNP VBZ VBN TO VB NR / Secretariat is expected to race tomorrow.

42 The HMM State Lattice / Trellis (repeat slide) — states {^, N, V, J, D, $} at each position over START Fed raises interest rates STOP; arcs are scored by emissions and transitions, e.g. e(Fed | N), e(raises | V), e(interest | V), e(rates | J), q(V | V), e(STOP | V).

43 The MEMM State Lattice / Trellis — the same lattice over x = START Fed raises interest rates STOP, but each arc is scored by the local conditional, e.g. p(V | V, x).

44 Decoding

$$p(s_1 \dots s_m \mid x_1 \dots x_m) = \prod_{i=1}^{m} p(s_i \mid s_{i-1}, x_1 \dots x_m)$$

Decoding maxent taggers: just like decoding HMMs (Viterbi, beam search, posterior decoding). Viterbi algorithm (HMMs): define $\pi(i, s_i)$ to be the max score of a sequence of length $i$ ending in tag $s_i$:

$$\pi(i, s_i) = \max_{s_{i-1}} e(x_i \mid s_i)\, q(s_i \mid s_{i-1})\, \pi(i-1, s_{i-1})$$

Can use the same algorithm for MEMMs, we just need to redefine $\pi(i, s_i)$! Viterbi algorithm (MaxEnt):

$$\pi(i, s_i) = \max_{s_{i-1}} p(s_i \mid s_{i-1}, x_1 \dots x_m)\, \pi(i-1, s_{i-1})$$
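A minimal Python sketch of this recurrence (my own illustration): the local score is passed in as a function, so the same loop covers the HMM case (use log e·q), the MEMM case (use log p(s_i | s_{i-1}, x)), and the perceptron/CRF case (use w·φ) from the later slides.

```python
def viterbi(num_positions, tags, score, start="^"):
    """
    Generic Viterbi decoder.
      score(i, s_prev, s): local (log-)score for tag s at position i, given previous tag s_prev.
    Returns the highest-scoring tag sequence of length num_positions.
    """
    pi = {start: 0.0}                      # pi[s] = best score of a prefix ending in tag s
    backpointers = []
    for i in range(num_positions):
        new_pi, bp = {}, {}
        for s in tags:
            prev = max(pi, key=lambda p: pi[p] + score(i, p, s))
            new_pi[s] = pi[prev] + score(i, prev, s)
            bp[s] = prev
        pi = new_pi
        backpointers.append(bp)
    seq = [max(pi, key=pi.get)]            # best final tag
    for bp in reversed(backpointers[1:]):  # follow backpointers to recover the sequence
        seq.append(bp[seq[-1]])
    return list(reversed(seq))

# Illustrative call with a trivial score function (real scores come from the trained model):
# best = viterbi(4, ["N", "V", "J", "D"], score=lambda i, prev, s: 0.0)
```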

45 Overview: Accuracies Roadmap of (known / unknown) accuracies:
Most freq tag: ~90% / ~50%
Trigram HMM: ~95% / ~55%
TnT (HMM++): 96.2% / 86.0%
MaxEnt P(s_i | x): 96.8% / 86.8%
MEMM tagger: 96.9% / 86.9%
Upper bound: ~98%

46 POS Results (bar chart: known vs. unknown word accuracy for Most Freq, HMM, HMM++, MaxEnt, MEMM, and the upper bound)

47 Outline POS Tagging MaxEnt MEMM CRFs Wrap-up Optional: Perceptron

48 Global Sequence Modeling MEMM and MaxEnt are local classifiers (MaxEnt even more so than MEMM): they make decisions conditioned on local information, so there is not much flow of information across the sequence (e.g., John got around). Instead, make the prediction for the whole chain directly!

49 Global Discriminative Taggers Newer, higher-powered discriminative sequence models: CRFs (also perceptrons, M3Ns). They do not decompose training into independent local regions, so they can be slower to train: repeated inference over the training set. However, one issue worth knowing about in local models: label bias and other explaining-away effects. An MEMM tagger's local scores can be near one without having both good transitions and emissions, which means that evidence often doesn't flow properly. Also: in decoding, we condition on predicted, not gold, histories.

50 Graphical Models Y1 Y2 Y3 / X1 X2 X3. A conditional probability is defined for each node, e.g. p(Y3 | Y2, X3) for Y3, or p(X3) for X3. Conditional independence, e.g. p(Y3 | Y2, X3) = p(Y3 | Y1, Y2, X1, X2, X3). The joint probability of the entire graph = the product of the conditional probabilities of each node.

51 Undirected Graphical Model Basics Y1 Y2 Y3 / X1 X2 X3. Conditional independence, e.g. p(Y3 | all other nodes) = p(Y3 | Y3's neighbors). There is no conditional probability for each node; instead, a potential function is defined for each clique, e.g. φ(X1, X2, Y1) or φ(Y1, Y2). Typically, the potential functions are log-linear: φ(Y1, Y2) = exp(Σ_k w_k f_k(Y1, Y2)).

52 Undirected Graphical Model Basics Y1 Y2 Y3 / X1 X2 X3. Joint probability of the entire graph:

$$P(Y) = \frac{1}{Z} \prod_{\text{clique } C} \phi(y_C), \qquad Z = \sum_{Y} \prod_{\text{clique } C} \phi(y_C)$$

53 MEMM vs. CRF (Conditional Random Fields) — graphical structures over NNP VBZ VBN TO VB NR / Secretariat is expected to race tomorrow.

54 MEMM vs. CRF — graphical structures over NNP VBZ VBN TO VB NR / Secretariat is expected to race tomorrow.

55 MEMM vs. CRF (NNP VBZ VBN TO VB NR — Secretariat is expected to race tomorrow)
Both are discriminative or conditional models → conditional probability p(tags | words).
MEMM: a directed graphical model; a probability is defined for each slice = p(tag_i | tag_i-1, word_i) or p(tag_i | tag_i-1, all words).
CRF: an undirected graphical model; instead of a probability, a potential (energy function) is defined for each slice = φ(tag_i, tag_i-1) · φ(tag_i, word_i) or φ(tag_i, tag_i-1, all words) · φ(tag_i, all words). Can incorporate long-distance features.

56 Conditional Random Fields (CRFs) Maximum entropy (logistic regression) over whole sequences. Sentence: $x = x_1 \dots x_m$; tag sequence: $y = s_1 \dots s_m$.

$$p(y \mid x; w) = \frac{\exp\big(w \cdot \Phi(x, y)\big)}{\sum_{y'} \exp\big(w \cdot \Phi(x, y')\big)}$$

Learning: maximize the (log) conditional likelihood of the training data $\{(x_i, y_i)\}_{i=1}^{n}$:

$$\frac{\partial}{\partial w_j} L(w) = \sum_{i=1}^{n} \left( \Phi_j(x_i, y_i) - \sum_{y} p(y \mid x_i; w)\, \Phi_j(x_i, y) \right)$$

Computational challenges? Most likely tag sequence, normalization constant, gradient. [Lafferty, McCallum, Pereira 01]

57 Conditional Random Fields (CRFs) Maximum entropy (logistic regression):

$$p(y \mid x; w) = \frac{\exp\big(w \cdot \Phi(x, y)\big)}{\sum_{y'} \exp\big(w \cdot \Phi(x, y')\big)}$$

Learning: maximize the (log) conditional likelihood of the training data $\{(x_i, y_i)\}_{i=1}^{n}$:

$$\frac{\partial}{\partial w_j} L(w) = \sum_{i=1}^{n} \left( \Phi_j(x_i, y_i) - \sum_{y} p(y \mid x_i; w)\, \Phi_j(x_i, y) \right)$$

Computational challenges? Most likely tag sequence, normalization constant, gradient. [Lafferty, McCallum, Pereira 01]

58 CRFs Decoding: $s^* = \arg\max_s p(s \mid x; w)$. Features must be local: for $x = x_1 \dots x_m$ and $s = s_1 \dots s_m$,

$$\Phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j), \qquad p(s \mid x; w) = \frac{\exp\big(w \cdot \Phi(x, s)\big)}{\sum_{s'} \exp\big(w \cdot \Phi(x, s')\big)}$$

$$\arg\max_s \frac{\exp\big(w \cdot \Phi(x, s)\big)}{\sum_{s'} \exp\big(w \cdot \Phi(x, s')\big)} = \arg\max_s \exp\big(w \cdot \Phi(x, s)\big) = \arg\max_s w \cdot \Phi(x, s)$$

Same as the linear perceptron!

$$\pi(i, s_i) = \max_{s_{i-1}} w \cdot \phi(x, i, s_{i-1}, s_i) + \pi(i-1, s_{i-1})$$

59 CRFs Decoding: $s^* = \arg\max_s p(s \mid x; w)$. Features must be local: for $x = x_1 \dots x_m$ and $s = s_1 \dots s_m$,

$$\Phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j), \qquad p(s \mid x; w) = \frac{\exp\big(w \cdot \Phi(x, s)\big)}{\sum_{s'} \exp\big(w \cdot \Phi(x, s')\big)}$$

$$\arg\max_s \frac{\exp\big(w \cdot \Phi(x, s)\big)}{\sum_{s'} \exp\big(w \cdot \Phi(x, s')\big)} = \arg\max_s \exp\big(w \cdot \Phi(x, s)\big) = \arg\max_s w \cdot \Phi(x, s)$$

60 The MEMM State Lattice / Trellis (repeat) — lattice over x = START Fed raises interest rates STOP; the score of a path is the product of the local conditionals p(V | V, x) along its arcs.

61 CRF State Lattice / Trellis — the same lattice, but each arc carries a local score w·φ(x, j, s_{j-1}, s_j), e.g. w·φ(x, 3, V, V), and a path's score is the sum of its arc scores.

62 CRFs: Computing Normalization*

$$p(s \mid x; w) = \frac{\exp\big(w \cdot \Phi(x, s)\big)}{\sum_{s'} \exp\big(w \cdot \Phi(x, s')\big)}, \qquad \Phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$$

$$\sum_{s'} \exp\big(w \cdot \Phi(x, s')\big) = \sum_{s'} \exp\Big(\sum_{j=1}^{m} w \cdot \phi(x, j, s'_{j-1}, s'_j)\Big) = \sum_{s'} \prod_{j=1}^{m} \exp\big(w \cdot \phi(x, j, s'_{j-1}, s'_j)\big)$$

Define $\mathrm{norm}(i, s_i)$ to be the sum of scores for sequences ending in tag $s_i$ at position $i$:

$$\mathrm{norm}(i, s_i) = \sum_{s_{i-1}} \exp\big(w \cdot \phi(x, i, s_{i-1}, s_i)\big)\, \mathrm{norm}(i-1, s_{i-1})$$

Forward algorithm! Remember the HMM case:

$$\alpha(i, y_i) = \sum_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \alpha(i-1, y_{i-1})$$
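A minimal sketch of the norm(i, s_i) recurrence in log space (my own illustration, assuming the local score w·φ(x, i, s_prev, s) is supplied as a function; names are illustrative):

```python
from math import exp, log

def log_partition(num_positions, tags, score, start="^"):
    """
    Forward algorithm for a linear-chain CRF:
      norm(i, s) = sum_{s_prev} exp(score(i, s_prev, s)) * norm(i-1, s_prev)
    computed in log space; returns log Z, the log of the normalization constant.
    """
    def logsumexp(vals):
        m = max(vals)
        return m + log(sum(exp(v - m) for v in vals))

    log_norm = {start: 0.0}
    for i in range(num_positions):
        log_norm = {s: logsumexp([log_norm[p] + score(i, p, s) for p in log_norm])
                    for s in tags}
    return logsumexp(list(log_norm.values()))
```

The loop is identical in shape to the Viterbi sketch above, with max replaced by (log-)sum, which is exactly the forward-algorithm observation the slide makes.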

63 CRFs: Computing Gradient*

$$p(s \mid x; w) = \frac{\exp\big(w \cdot \Phi(x, s)\big)}{\sum_{s'} \exp\big(w \cdot \Phi(x, s')\big)}, \qquad \Phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$$

$$\frac{\partial}{\partial w_k} L(w) = \sum_{i=1}^{n} \left( \Phi_k(x_i, s_i) - \sum_{s} p(s \mid x_i; w)\, \Phi_k(x_i, s) \right)$$

The expected counts decompose over positions and tag pairs:

$$\sum_{s} p(s \mid x_i; w)\, \Phi_k(x_i, s) = \sum_{j=1}^{m} \sum_{a,b} \sum_{s:\, s_{j-1}=a,\, s_j=b} p(s \mid x_i; w)\, \phi_k(x_i, j, a, b)$$

Need forward and backward messages. See notes for full details!
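The forward and backward messages combine into the edge marginals needed for those expected counts; one standard way to write this (my own notation, with α the forward scores, β the analogous backward scores, and Z(x) the normalization constant from the previous slide) is:

$$p(s_{j-1} = a,\, s_j = b \mid x; w) = \frac{\alpha(j-1, a)\, \exp\big(w \cdot \phi(x, j, a, b)\big)\, \beta(j, b)}{Z(x)}$$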

64 Overview: Accuracies Roadmap of (known / unknown) accuracies:
Most freq tag: ~90% / ~50%
Trigram HMM: ~95% / ~55%
TnT (HMM++): 96.2% / 86.0%
MaxEnt P(s_i | x): 96.8% / 86.8%
MEMM tagger: 96.9% / 86.9%
CRF (untuned): 95.7% / 76.2%
Upper bound: ~98%

65 POS Results (bar chart: known vs. unknown word accuracy for Most Freq, HMM, HMM++, MaxEnt, MEMM, CRF, and the upper bound)

66 Cyclic Network [Toutanova et al 03] Train two MEMMs and multiply them together to score. And be very careful: tune regularization, try lots of different features; see the paper for full details. (Figure 1: dependency networks — (a) the (standard) left-to-right first-order CMM, (b) the (reversed) right-to-left CMM, (c) the bidirectional dependency network.)

67 Overview: Accuracies Roadmap of (known / unknown) accuracies:
Most freq tag: ~90% / ~50%
Trigram HMM: ~95% / ~55%
TnT (HMM++): 96.2% / 86.0%
MaxEnt P(s_i | x): 96.8% / 86.8%
MEMM tagger: 96.9% / 86.9%
CRF (untuned): 95.7% / 76.2%
Cyclic tagger: 97.2% / 89.0%
Upper bound: ~98%

68 POS Results (bar chart: known vs. unknown word accuracy for Most Freq, HMM, HMM++, MaxEnt, MEMM, CRF, Cyclic, and the upper bound)

69 Outline POS Tagging MaxEnt MEMM CRFs Wrap-up Optional: Perceptron

70 Summary Feature-rich models are important!

71 Outline POS Tagging MaxEnt MEMM CRFs Wrap-up Optional: Perceptron

72 Linear Models: Perceptron [Collins 02] The perceptron algorithm iteratively processes the training set, reacting to training errors; it can be thought of as trying to drive down training error. Sentence: x = x_1 ... x_m; tag sequence: y = s_1 ... s_m. The (online) perceptron algorithm:
Start with zero weights.
Visit training instances (x_i, y_i) one by one.
Make a prediction y* = arg max_y w · Φ(x_i, y).
If correct (y* == y_i): no change, go to the next example!
If wrong: adjust weights w = w + Φ(x_i, y_i) - Φ(x_i, y*).
Challenge: how do we compute the arg max efficiently?

73 Linear Models: Perceptron [Collins 02] The perceptron algorithm iteratively processes the training set, reacting to training errors; it can be thought of as trying to drive down training error. The (online) perceptron algorithm:
Start with zero weights.
Visit training instances (x_i, y_i) one by one.
Make a prediction y* = arg max_y w · Φ(x_i, y).
If correct (y* == y_i): no change, go to the next example!
If wrong: adjust weights w = w + Φ(x_i, y_i) - Φ(x_i, y*).
Challenge: how do we compute the arg max efficiently?
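A minimal sketch of this update loop (my own illustration, assuming a decode(x, w) function such as the Viterbi decoder from slide 44 and a feature map Phi(x, y) that returns a Counter of feature counts; weight averaging, a common practical refinement, is omitted):

```python
from collections import Counter

def perceptron_train(data, Phi, decode, num_epochs=5):
    """
    Structured (online) perceptron-style training loop.
      data:   list of (x, y_gold) pairs
      Phi:    Phi(x, y) -> Counter of feature counts for the whole sequence
      decode: decode(x, w) -> arg max_y w . Phi(x, y)  (e.g., Viterbi)
    """
    w = Counter()                       # start with zero weights
    for _ in range(num_epochs):
        for x, y_gold in data:
            y_hat = decode(x, w)        # current best prediction
            if y_hat != y_gold:         # on a mistake: move toward gold, away from prediction
                w.update(Phi(x, y_gold))
                w.subtract(Phi(x, y_hat))
    return w
```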

74 Linear Perceptron Decoding: $s^* = \arg\max_s w \cdot \Phi(x, s)$. Features must be local: for $x = x_1 \dots x_m$ and $s = s_1 \dots s_m$, $\Phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$.

75 The MEMM State Lattice / Trellis (repeat) — lattice over x = START Fed raises interest rates STOP; the score of a path is the product of the local conditionals p(V | V, x) along its arcs.

76 The Perceptron State Lattice / Trellis — the same lattice, but each arc carries a local score w·φ(x, j, s_{j-1}, s_j), e.g. w·φ(x, 3, V, V), and a path's score is the sum of its arc scores.

77 Linear Perceptron Decoding: $s^* = \arg\max_s w \cdot \Phi(x, s)$, with local features $\Phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$. Define $\pi(i, s_i)$ to be the max score of a sequence of length $i$ ending in tag $s_i$:

$$\pi(i, s_i) = \max_{s_{i-1}} w \cdot \phi(x, i, s_{i-1}, s_i) + \pi(i-1, s_{i-1})$$

Compare the Viterbi algorithm for HMMs:

$$\pi(i, s_i) = \max_{s_{i-1}} e(x_i \mid s_i)\, q(s_i \mid s_{i-1})\, \pi(i-1, s_{i-1})$$

and for MaxEnt (MEMM) taggers:

$$\pi(i, s_i) = \max_{s_{i-1}} p(s_i \mid s_{i-1}, x_1 \dots x_m)\, \pi(i-1, s_{i-1})$$

78 Overview: Accuracies Roadmap of (known / unknown) accuracies:
Most freq tag: ~90% / ~50%
Trigram HMM: ~95% / ~55%
TnT (HMM++): 96.2% / 86.0%
MaxEnt P(s_i | x): 96.8% / 86.8%
MEMM tagger: 96.9% / 86.9%
Perceptron: 96.7% / ??
Upper bound: ~98%

79 POS Results (bar chart: known vs. unknown word accuracy for Most Freq, HMM, HMM++, MaxEnt, MEMM, Perceptron, CRF, Cyclic, and the upper bound)
