Sequence Prediction and Part-of-speech Tagging


1 CS5740: Natural Language Processing Spring 2017 Sequence Prediction and Part-of-speech Tagging Instructor: Yoav Artzi Slides adapted from Dan Klein, Dan Jurafsky, Chris Manning, Michael Collins, Luke Zettlemoyer, Yejin Choi, and Slav Petrov

2 Overview POS Tagging: the problem Hidden Markov Models (HMM) Supervised Learning Inference The Viterbi algorithm Feature-rich models Maximum-entropy Markov Models Perceptron Conditional Random Fields

3 Parts of Speech Open class (lexical) words: Nouns (Proper: IBM, Italy; Common: cat/cats, snow), Verbs (Main: see, registered; Modals: can, had), Adjectives (old, older, oldest), Adverbs (slowly), Numbers (122,312, one), and more. Closed class (functional) words: Determiners (the, some), Prepositions (to, with), Conjunctions (and, or), Particles (off, up), Pronouns (he, its), Interjections (Ow, Eh), and more.

4 POS Tagging Words often have more than one POS: back. The back door = JJ; On my back = NN; Win the voters back = RB; Promised to back the bill = VB. The POS tagging problem is to determine the POS tag for a particular instance of a word.

5 POS Tagging Penn Treebank POS tags. Input: Plays well with others. Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS. Output: Plays/VBZ well/RB with/IN others/NNS. Uses: Text-to-speech (how do we pronounce lead?); can write regular expressions like (Det) Adj* N+ over the output for phrases, etc.; as input to or to speed up a full parser; if you know the tag, you can back off to it in other tasks.

6 Penn TreeBank Tagset Possible tags: 45 Tagging guidelines: 36 pages Newswire text

7 Penn Treebank tags (Main Tags):
CC conjunction, coordinating: and both but either or
CD numeral, cardinal: mid-1890 nine-thirty 0.5 one
DT determiner: a all an every no that the
EX existential there: there
FW foreign word: gemeinschaft hund ich jeux
IN preposition or conjunction, subordinating: among whether out on by if
JJ adjective or numeral, ordinal: third ill-mannered regrettable
JJR adjective, comparative: braver cheaper taller
JJS adjective, superlative: bravest cheapest tallest
MD modal auxiliary: can may might will would
NN noun, common, singular or mass: cabbage thermostat investment subhumanity
NNP noun, proper, singular: Motown Cougar Yvette Liverpool
NNPS noun, proper, plural: Americans Materials States
NNS noun, common, plural: undergraduates bric-a-brac averages
POS genitive marker: ' 's
PRP pronoun, personal: hers himself it we them
PRP$ pronoun, possessive: her his mine my our ours their thy your
RB adverb: occasionally maddeningly adventurously
RBR adverb, comparative: further gloomier heavier less-perfectly
RBS adverb, superlative: best biggest nearest worst
RP particle: aboard away back by on open through
TO "to" as preposition or infinitive marker: to
UH interjection: huh howdy uh whammo shucks heck
VB verb, base form: ask bring fire see take
VBD verb, past tense: pleaded swiped registered saw
VBG verb, present participle or gerund: stirring focusing approaching erasing
VBN verb, past participle: dilapidated imitated reunifed unsettled
VBP verb, present tense, not 3rd person singular: twist appear comprise mold postpone
VBZ verb, present tense, 3rd person singular: bases reconstructs marks uses
WDT WH-determiner: that what whatever which whichever
WP WH-pronoun: that what whatever which who whom
WP$ WH-pronoun, possessive: whose
WRB WH-adverb: however whenever where why

8 Penn TreeBank Tagset How accurate are taggers? (Tag accuracy) About 97% currently, but the baseline is already 90%. The baseline is the performance of the simplest possible method: tag every word with its most frequent tag, and tag unknown words as nouns. Partly easy because many words are unambiguous, and you get points for them (the, a, etc.) and for punctuation marks! Upper bound: probably ~2% annotation errors.
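The 90% baseline above is easy to reproduce. Below is a minimal sketch in Python (not from the slides; the tiny corpus and variable names are illustrative): tag each word with its most frequent training tag and fall back to NN for unknown words.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    # Count (word, tag) pairs and keep each word's most frequent tag.
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, most_frequent_tag, unknown_tag="NN"):
    # Unknown words default to the noun tag, as on the slide.
    return [most_frequent_tag.get(w, unknown_tag) for w in words]

# Tiny illustrative corpus (hypothetical).
train = [[("the", "DT"), ("back", "NN"), ("door", "NN")],
         [("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB")]]
model = train_baseline(train)
print(tag_baseline(["the", "back", "yard"], model))  # ['DT', 'NN', 'NN'] (yard is unknown)
```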

9 Hard Cases are Hard Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG; All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN; Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD

10 How Difficult is POS Tagging? About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech, but they tend to be very common words. E.g., that: I know that he is honest = IN; Yes, that play was nice = DT; You can't go that far = RB. 40% of the word tokens are ambiguous.

11 The Tagset Wait, do we really need all these tags? What about other languages? Each language has its own tagset

12 Tagsets in Different Languages [Petrov et al. 2012]

13 The Tagset Wait, do we really need all these tags? What about other languages? Each language has its own tagset But why is this bad? Differences in downstream tasks Harder to do language transfer

14 Alternative: The Universal Tagset 12 tags: NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., and X. Deterministic conversion from tagsets in 22 languages. Better unsupervised parsing results. Was used to transfer parsers. [Petrov et al. 2012]

15 Sources of Information What are the main sources of information for POS tagging? Knowledge of neighboring words: Bill saw that man yesterday (candidate tags: NNP NN DT NN NN / VB VB(D) IN VB NN). Knowledge of word probabilities: man is rarely used as a verb. The latter proves the most useful, but the former also helps.

16 Word-level Features Can do surprisingly well just looking at a word by itself: Word (the: the -> DT); Lowercased word (Importantly: importantly -> RB); Prefixes (unfathomable: un- -> JJ); Suffixes (Importantly: -ly -> RB); Capitalization (Meridian: CAP -> NNP); Word shapes (35-year: d-x -> JJ)
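As a concrete sketch of such word-level features (the feature names below are illustrative, not taken from any particular tagger), the function computes the lowercased word, prefix, suffix, capitalization, and a coarse word shape:

```python
import re

def word_features(word):
    # Word identity, affixes, capitalization, and a word shape that maps
    # digits to 'd', uppercase to 'X', and lowercase to 'x', as on the slide.
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return {
        "word=" + word.lower(): 1.0,
        "prefix3=" + word[:3].lower(): 1.0,
        "suffix3=" + word[-3:].lower(): 1.0,
        "is_capitalized": 1.0 if word[:1].isupper() else 0.0,
        "shape=" + shape: 1.0,
    }

print(word_features("Meridian"))  # is_capitalized -> evidence for NNP
print(word_features("35-year"))   # shape=dd-xxxx -> evidence for JJ
```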

17 Sequence-to-Sequence Consider the problem of jointly modeling a pair of strings. E.g.: part of speech tagging: DT NNP NN VBD VBN RP NN NNS / The Georgia branch had taken on loan commitments; DT NN IN NN VBD NNS VBD / The average of interbank offered rates plummeted. Q: How do we map each word in the input sentence onto the appropriate label? A: We can learn a joint distribution $p(x_1 \ldots x_n, y_1 \ldots y_n)$ and then compute the most likely assignment: $\arg\max_{y_1 \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$

18 Classic Solution: HMMs We want a model of sequences y and observations x [diagram: states $y_0, y_1, \ldots, y_n$ connected by transitions, each $y_i$ emitting $x_i$]: $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$, where $y_0 = \text{START}$, and we call $q(y_i \mid y_{i-1})$ the transition distribution and $e(x_i \mid y_i)$ the emission (or observation) distribution.

19 Model Assumptions $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. The tag/state sequence is generated by a Markov model. Words are chosen independently, conditioned only on the tag/state. These are totally broken assumptions for POS: why?

20 HMM for POS Tagging The Georgia branch had taken on loan commitments / DT NNP NN VBD VBN RP NN NNS. HMM Model: States Y = {DT, NNP, NN, ...} are the POS tags; observations X are words; the transition distribution $q(y_i \mid y_{i-1})$ models the tag sequences; the emission distribution $e(x_i \mid y_i)$ models words given their POS.

22 HMM Inference and Learning Learning: maximum likelihood for transitions q and emissions e in $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. Inference: Viterbi, $y^* = \arg\max_{y_1 \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$; forward-backward, $p(x_1 \ldots x_n, y_i) = \sum_{y_1 \ldots y_{i-1}} \sum_{y_{i+1} \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$.

23 Learning: Maximum Likelihood $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. Maximum likelihood methods for estimating transitions q and emissions e: $q_{ML}(y_i \mid y_{i-1}) = \frac{c(y_{i-1}, y_i)}{c(y_{i-1})}$ and $e_{ML}(x \mid y) = \frac{c(y, x)}{c(y)}$. Will these estimates be high quality? Which is likely to be more sparse, q or e? Smoothing?
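A minimal sketch of these count-based estimates (assuming the training data is a list of sentences of (word, tag) pairs; START and STOP are added explicitly):

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    # Count tag bigrams for q(y_i | y_{i-1}) and (tag, word) pairs for e(x | y).
    transition_counts, emission_counts, tag_counts = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        tags = ["START"] + [t for _, t in sentence] + ["STOP"]
        for prev, curr in zip(tags, tags[1:]):
            transition_counts[(prev, curr)] += 1
            tag_counts[prev] += 1
        for word, tag in sentence:
            emission_counts[(tag, word)] += 1
    q = {bigram: c / tag_counts[bigram[0]] for bigram, c in transition_counts.items()}
    e = {pair: c / tag_counts[pair[0]] for pair, c in emission_counts.items()}
    return q, e

q, e = estimate_hmm([[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]])
print(q[("DT", "NN")], e[("NN", "dog")])  # 1.0 1.0
```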

24 Learning: Low Frequency Words $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. Typically, for transitions: linear interpolation, $q(y_i \mid y_{i-1}) = \lambda_1\, q_{ML}(y_i \mid y_{i-1}) + \lambda_2\, q_{ML}(y_i)$ (see the sketch below). However, other approaches are used for emissions. Step 1: Split the vocabulary into frequent words (appear more than M, often 5, times) and low-frequency words (everything else). Step 2: Map each low-frequency word to one of a small, finite set of possibilities, for example based on prefixes, suffixes, etc. Step 3: Learn the model for this new space of possible word sequences.
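A small sketch of the linear interpolation above (the lambda values are placeholders; in practice they are tuned, e.g. on held-out data):

```python
def interpolated_q(prev_tag, tag, q_bigram, q_unigram, lambda1=0.8, lambda2=0.2):
    # q(y_i | y_{i-1}) = lambda1 * q_ML(y_i | y_{i-1}) + lambda2 * q_ML(y_i)
    return (lambda1 * q_bigram.get((prev_tag, tag), 0.0)
            + lambda2 * q_unigram.get(tag, 0.0))
```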

25 Another Example: Chunking Goal: Segment text into spans with certain properties, for example named entities: PER, ORG, and LOC. Germany 's representative to the European Union 's veterinary committee Werner Zwingman said on Wednesday consumers should / [Germany] LOC 's representative to the [European Union] ORG 's veterinary committee [Werner Zwingman] PER said on Wednesday consumers should. How is this a sequence tagging problem?

26 Named Entity Recognition Germany 's representative to the European Union 's veterinary committee Werner Zwingman said on Wednesday consumers should / [Germany] LOC 's representative to the [European Union] ORG 's veterinary committee [Werner Zwingman] PER said on Wednesday consumers should. HMM Model: States Y = {NA, BL, CL, BO, CO, BP, CP} represent beginnings (BL, BO, BP) and continuations (CL, CO, CP) of chunks, as well as other words (NA); observations X are words; the transition distribution $q(y_i \mid y_{i-1})$ models the tag sequences; the emission distribution $e(x_i \mid y_i)$ models words given their type.

27 Low Frequency Words: An Example Named Entity Recognition [Bikel et al., 1999] Used the following word classes for infrequent words (word class: example (intuition)):
twoDigitNum: 90 (two-digit year)
fourDigitNum: 1990 (four-digit year)
containsDigitAndAlpha: A8956-67 (product code)
containsDigitAndDash: 09-96 (date)
containsDigitAndSlash: 11/9/89 (date)
containsDigitAndComma: 23,000.00 (monetary amount)
containsDigitAndPeriod: 1.00 (monetary amount, percentage)
otherNum: 456789 (other number)
allCaps: BBN (organization)
capPeriod: M. (person name initial)
firstWord: first word of sentence (no useful capitalization information)
initCap: Sally (capitalized word)
lowercase: can (uncapitalized word)
other: , (punctuation marks, all other words)

28 Low Frequency Words: An Example Profits/NA soared/NA at/NA Boeing/SO Co./CO ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA. NA = No entity, SO = Start Organization, CO = Continue Organization, SL = Start Location, CL = Continue Location, SP = Start Person, CP = Continue Person

29 Low Frequency Words: An Example Profits/NA soared/NA at/NA Boeing/SO Co./CO ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA becomes, after mapping low-frequency words to word classes, firstWord/NA soared/NA at/NA initCap/SO Co./CO ,/NA easily/NA lowercase/NA forecasts/NA on/NA initCap/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP initCap/CP announced/NA first/NA quarter/NA results/NA ./NA. NA = No entity, SO = Start Organization, CO = Continue Organization, SL = Start Location, CL = Continue Location, SP = Start Person, CP = Continue Person
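A sketch of the Step 2 mapping, using classes in the spirit of the table on slide 27 (a simplified approximation, not the exact Bikel et al. rules):

```python
import re

def word_class(word, is_first_word=False):
    # Map an infrequent word to a coarse class, roughly following the table above.
    has_digit = bool(re.search(r"\d", word))
    if re.fullmatch(r"\d{2}", word): return "twoDigitNum"
    if re.fullmatch(r"\d{4}", word): return "fourDigitNum"
    if has_digit and re.search(r"[A-Za-z]", word): return "containsDigitAndAlpha"
    if has_digit and "-" in word: return "containsDigitAndDash"
    if has_digit and "/" in word: return "containsDigitAndSlash"
    if has_digit and "," in word: return "containsDigitAndComma"
    if has_digit and "." in word: return "containsDigitAndPeriod"
    if re.fullmatch(r"\d+", word): return "otherNum"
    if re.fullmatch(r"[A-Z]\.", word): return "capPeriod"
    if word.isupper(): return "allCaps"
    if is_first_word: return "firstWord"
    if word[:1].isupper(): return "initCap"
    if word.islower(): return "lowercase"
    return "other"

print(word_class("Profits", is_first_word=True))  # firstWord
print(word_class("Boeing"))                       # initCap
print(word_class("11/9/89"))                      # containsDigitAndSlash
```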

30 HMM Inference and Learning Learning: maximum likelihood for transitions q and emissions e in $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. Inference: Viterbi, $y^* = \arg\max_{y_1 \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$; forward-backward, $p(x_1 \ldots x_n, y_i) = \sum_{y_1 \ldots y_{i-1}} \sum_{y_{i+1} \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$.

31 Inference (Decoding) Problem: find the most likely (Viterbi) sequence under the model: $y^* = \arg\max_{y_1 \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$. Given model parameters, we can score any sequence pair, e.g. NNP VBZ NN NNS CD NN . / Fed raises interest rates 0.5 percent . scored as q(NNP | START) e(Fed | NNP) q(VBZ | NNP) e(raises | VBZ) q(NN | VBZ) ... In principle, we're done: list all possible tag sequences, score each one, pick the best one (the Viterbi state sequence), e.g. NNP VBZ NN NNS CD NN . with log p(x, y) = -23; NNP NNS NN NNS CD NN . with log p(x, y) = -29; NNP VBZ VB NNS CD NN . with log p(x, y) = -27. Any issue?
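In log space, scoring one candidate tag sequence under the HMM is just a sum. A self-contained sketch (the toy q and e probability dictionaries are illustrative; missing entries are treated as probability zero):

```python
import math

def log_(p):
    # log of a probability, with log 0 = -inf so impossible sequences score -inf.
    return math.log(p) if p > 0.0 else float("-inf")

def score_sequence(words, tags, q, e):
    # log p(x, y) = sum_i [log q(y_i | y_{i-1}) + log e(x_i | y_i)] + log q(STOP | y_n)
    total, prev = 0.0, "START"
    for word, tag in zip(words, tags):
        total += log_(q.get((prev, tag), 0.0)) + log_(e.get((tag, word), 0.0))
        prev = tag
    return total + log_(q.get((prev, "STOP"), 0.0))

q = {("START", "DT"): 1.0, ("DT", "NN"): 1.0, ("NN", "VBZ"): 1.0, ("VBZ", "STOP"): 1.0}
e = {("DT", "the"): 1.0, ("NN", "dog"): 1.0, ("VBZ", "barks"): 1.0}
print(score_sequence(["the", "dog", "barks"], ["DT", "NN", "VBZ"], q, e))  # 0.0
print(score_sequence(["the", "dog", "barks"], ["DT", "VBZ", "NN"], q, e))  # -inf
```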

32 Finding the Best Trajectory Too many trajectories (state sequences) to list. Option 1: Beam Search. A beam is a set of partial hypotheses. Start with just the single empty trajectory; at each derivation step, consider all continuations of previous hypotheses and discard most, keeping the top k (e.g. <> expands to Fed:N, Fed:V, Fed:J, which expand to Fed:N raises:N, Fed:N raises:V, Fed:V raises:N, Fed:V raises:V, ...). Beam search often works OK in practice, but sometimes you want the optimal answer, and there's usually a better option than naive beams.
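A minimal beam-search sketch (log-space; the `log_score(prev_tag, tag, word)` callback is an assumption of this sketch and would wrap log q + log e for an HMM). It keeps only the top k partial tag sequences at each position, so it is fast but not guaranteed to return the Viterbi sequence:

```python
import math

def beam_search(words, tags, log_score, k=3):
    # Each hypothesis is (log score, tag sequence so far), starting from START.
    beam = [(0.0, ["START"])]
    for word in words:
        candidates = []
        for logp, seq in beam:
            for tag in tags:
                candidates.append((logp + log_score(seq[-1], tag, word), seq + [tag]))
        # Consider all continuations, discard most, keep the top k.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    best_logp, best_seq = max(beam, key=lambda c: c[0])
    return best_seq[1:], best_logp

# Toy usage with a uniform scorer (hypothetical).
uniform = lambda prev, tag, word: math.log(0.5)
print(beam_search(["Fed", "raises"], ["N", "V"], uniform, k=2))
```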

33 The State Lattice / Trellis [Figure: a trellis with states {N, V, J, D} at each position between START and STOP over the sentence Fed raises interest rates; edges carry emission scores such as e(Fed | N) and e(rates | J), transition scores q(y_i | y_{i-1}), and a final STOP score]

34 Scoring a Sequence $y^* = \arg\max_{y_1 \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$, with $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. Define $\pi(i, y_i)$ to be the max score of a sequence of length i ending in tag $y_i$: $\pi(i, y_i) = \max_{y_1 \ldots y_{i-1}} p(x_1 \ldots x_i, y_1 \ldots y_i) = \max_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1}) \max_{y_1 \ldots y_{i-2}} p(x_1 \ldots x_{i-1}, y_1 \ldots y_{i-1}) = \max_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \pi(i-1, y_{i-1})$. We can now design an efficient algorithm. How?

35 The Viterbi Algorithm Dynamic program for computing (for all i) $\pi(i, y_i) = \max_{y_1 \ldots y_{i-1}} p(x_1 \ldots x_i, y_1 \ldots y_i)$. Iterative computation: $\pi(0, y_0) = 1$ if $y_0 = \text{START}$, 0 otherwise. For i = 1 ... n: store the score $\pi(i, y_i) = \max_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \pi(i-1, y_{i-1})$ and store a back-pointer $bp(i, y_i) = \arg\max_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \pi(i-1, y_{i-1})$. What is the back-pointer for?
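A direct transcription of this dynamic program into Python (a sketch working in log space to avoid underflow; q and e are probability dictionaries as in the earlier sketches, with missing entries treated as zero):

```python
import math

def log_(p):
    return math.log(p) if p > 0.0 else float("-inf")

def viterbi(words, tags, q, e):
    n = len(words)
    # pi[i][y]: best log score of a length-i prefix ending in tag y; bp stores back-pointers.
    pi = [{"START": 0.0}] + [{} for _ in range(n)]
    bp = [{} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for y in tags:
            emit = log_(e.get((y, words[i - 1]), 0.0))
            best_prev = max(pi[i - 1], key=lambda yp: pi[i - 1][yp] + log_(q.get((yp, y), 0.0)))
            pi[i][y] = pi[i - 1][best_prev] + log_(q.get((best_prev, y), 0.0)) + emit
            bp[i][y] = best_prev
    # Add the STOP transition, then follow back-pointers to recover the sequence.
    last = max(tags, key=lambda y: pi[n][y] + log_(q.get((y, "STOP"), 0.0)))
    sequence = [last]
    for i in range(n, 1, -1):
        sequence.append(bp[i][sequence[-1]])
    return list(reversed(sequence))

q = {("START", "DT"): 1.0, ("DT", "NN"): 0.9, ("DT", "VBZ"): 0.1, ("NN", "VBZ"): 0.8,
     ("NN", "STOP"): 0.2, ("VBZ", "NN"): 0.5, ("VBZ", "STOP"): 0.5}
e = {("DT", "the"): 1.0, ("NN", "dog"): 0.5, ("NN", "barks"): 0.01, ("VBZ", "barks"): 0.5}
print(viterbi(["the", "dog", "barks"], ["DT", "NN", "VBZ"], q, e))  # ['DT', 'NN', 'VBZ']
```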

36 The State Lattice / Trellis [Figure: a worked Viterbi example over START Fed raises interest STOP, with a transition table (from \ to) and an emission table; tie breaking: prefer the first]

37 The State Lattice / Trellis [Figure: the same worked example with the $\pi$ values and back-pointers filled in by the Viterbi recursion, starting from $\pi = 1$, bp = null at START; tie breaking: prefer the first]

38 The Viterbi Algorithm: Runtime In terms of sentence length n? Linear. In terms of the number of states K? Polynomial. $\pi(i, y_i) = \max_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \pi(i-1, y_{i-1})$. Specifically: O(nK) entries in $\pi(i, y_i)$, O(K) time to compute each $\pi(i, y_i)$, so the total runtime is O(nK^2). Q: Is this a practical algorithm? A: depends on K.

39 Tagsets in Different Languages [Figure: tagset sizes for different languages and the resulting Viterbi costs; Petrov et al. 2012]

40 HMM Inference and Learning Learning: maximum likelihood for transitions q and emissions e in $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. Inference: Viterbi, $y^* = \arg\max_{y_1 \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$; forward-backward, $p(x_1 \ldots x_n, y_i) = \sum_{y_1 \ldots y_{i-1}} \sum_{y_{i+1} \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$.

41 What about n-gram Taggers? States encode what is relevant about the past. Transitions $P(s_i \mid s_{i-1})$ encode well-formed tag sequences. In a bigram tagger, states = tags: $s_0 = \langle \diamond \rangle, s_1 = \langle y_1 \rangle, s_2 = \langle y_2 \rangle, \ldots, s_n = \langle y_n \rangle$, each state $s_i$ emitting $x_i$. In a trigram tagger, states = tag pairs: $s_0 = \langle \diamond, \diamond \rangle, s_1 = \langle \diamond, y_1 \rangle, s_2 = \langle y_1, y_2 \rangle, \ldots, s_n = \langle y_{n-1}, y_n \rangle$ (see the sketch below).
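One way to see this reduction concretely: a trigram tagger is a bigram tagger whose states are tag pairs, so the same Viterbi code applies once the transition table is re-keyed. A sketch (a dictionary q_trigram[(a, b, c)] holding q(c | a, b) is an assumption; STOP handling is omitted):

```python
from itertools import product

def pair_state_transitions(tags, q_trigram, start="*"):
    # States are tag pairs <a, b>; the edge <a, b> -> <b, c> gets weight q(c | a, b).
    states = [(start, start)] + [(start, t) for t in tags] + list(product(tags, repeat=2))
    return {((a, b), (b, c)): q_trigram.get((a, b, c), 0.0)
            for (a, b) in states for c in tags}
```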

42 The State Lattice / Trellis [Figure: the trigram trellis over START Fed raises interest, where states are tag pairs such as <N,V> or <V,D>; not all edges are allowed, since consecutive states must agree on the shared tag; columns are labeled with emissions e(Fed | ...), e(raises | ...), e(interest | ...)]

43 Tagsets in Different Languages [Figure: tagset sizes for different languages and the resulting Viterbi costs; Petrov et al. 2012]

44 Some Numbers Most errors are on unknown words. Rough accuracies (all words / unknown words): Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (Brants, 2000): 96.7% / 85.5% (a carefully smoothed trigram tagger with suffix trees for emissions); MaxEnt P(y|x): 93.7% / 82.6%; MEMM tagger 1: 96.7% / 84.5%; MEMM tagger 2: 96.8% / 86.9%; Perceptron: 97.1%; CRF++: 97.3%; Cyclic tagger: 97.2% / 89.0%; Upper bound: ~98%

45 Re-visit P(x | y) Reality check: What if we drop the sequence and re-visit P(x | y) on its own? Most frequent tag: 90.3% with a so-so unknown word model. Can we do better?

46 What about better features? Looking at a word and its environment: add in the previous / next word (the __); previous / next word shapes (X __ X); occurrence pattern features ([X: x X occurs]); crude entity detection (__ ..... (Inc. | Co.)); phrasal verb in sentence? (put ...... __); conjunctions of these things. Uses lots of features: > 200K

47 Some Numbers Rough accuracies (all words / unknown words): Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (Brants, 2000): 96.7% / 85.5%; MaxEnt P(y|x): 93.7% / 82.6%; MEMM tagger 1: 96.7% / 84.5%; MEMM tagger 2: 96.8% / 86.9%; Perceptron: 97.1%; CRF++: 97.3%; Cyclic tagger: 97.2% / 89.0%; Upper bound: ~98%. What does this tell us about sequence models? How do we add more features to our sequence models?

48 MEMM Taggers One step up: also condition on previous tags: $p(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_{i=1}^{n} p(y_i \mid y_1 \ldots y_{i-1}, x_1 \ldots x_n) = \prod_{i=1}^{n} p(y_i \mid y_{i-1}, x_1 \ldots x_n)$. Training: train $p(y_i \mid y_{i-1}, x_1 \ldots x_n)$ as a discrete log-linear (MaxEnt) model. Scoring: $p(y_i \mid y_{i-1}, x_1 \ldots x_n) = \frac{e^{w \cdot \phi(x_1 \ldots x_n, i, y_{i-1}, y_i)}}{\sum_{y'} e^{w \cdot \phi(x_1 \ldots x_n, i, y_{i-1}, y')}}$. This is referred to as an MEMM tagger [Ratnaparkhi 96].
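A sketch of the local MaxEnt scorer (the feature templates and weights here are hypothetical; in a real MEMM tagger the weights come from training the log-linear model):

```python
import math

def local_features(words, i, prev_tag, tag):
    # Hypothetical feature templates: current word + tag, previous tag + tag, suffix + tag.
    word = words[i]
    return ["word=%s,tag=%s" % (word.lower(), tag),
            "prev=%s,tag=%s" % (prev_tag, tag),
            "suffix3=%s,tag=%s" % (word[-3:].lower(), tag)]

def memm_local_prob(words, i, prev_tag, tag, tags, weights):
    # p(y_i | y_{i-1}, x) = exp(w . phi(x, i, y_{i-1}, y_i)) / sum_y' exp(w . phi(x, i, y_{i-1}, y'))
    def score(t):
        return sum(weights.get(f, 0.0) for f in local_features(words, i, prev_tag, t))
    denom = sum(math.exp(score(t)) for t in tags)
    return math.exp(score(tag)) / denom

weights = {"word=raises,tag=VBZ": 2.0, "prev=NNP,tag=VBZ": 1.0, "word=raises,tag=NNS": 0.5}
print(memm_local_prob(["Fed", "raises"], 1, "NNP", "VBZ", ["NNS", "VBZ"], weights))  # ~0.92
```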

49 HMM vs. MEMM The HMM models the joint distribution: $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. The MEMM models the conditional distribution: $p(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_{i=1}^{n} p(y_i \mid y_1 \ldots y_{i-1}, x_1 \ldots x_n)$.

50 Decoding MEMM Taggers Scoring: $p(y_i \mid y_{i-1}, x_1 \ldots x_n) = \frac{e^{w \cdot \phi(x_1 \ldots x_n, i, y_{i-1}, y_i)}}{\sum_{y'} e^{w \cdot \phi(x_1 \ldots x_n, i, y_{i-1}, y')}}$. Beam search is effective here. Why? Guarantees? Optimal? Can we do better?

51 The State Lattice / Trellis [Figure: the HMM trellis again, with states {N, V, J, D} over Fed raises interest rates between START and STOP, edges scored by emissions such as e(Fed | N) and transitions q(y_i | y_{i-1})]

52 The MEMM State Lattice / Trellis [Figure: the same trellis, but each edge is now scored by the local conditional p(y_i | y_{i-1}, x_1 ... x_n) rather than by separate transition and emission probabilities]

53 Decoding MEMM Taggers Decoding MaxEnt taggers: just like decoding HMMs (Viterbi, beam search). Viterbi algorithm (HMMs): define $\pi(i, y_i)$ to be the max score of a sequence of length i ending in tag $y_i$, $\pi(i, y_i) = \max_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \pi(i-1, y_{i-1})$. Viterbi algorithm (MaxEnt): we can use the same algorithm for MEMMs, just need to redefine $\pi(i, y_i)$: $\pi(i, y_i) = \max_{y_{i-1}} p(y_i \mid y_{i-1}, x_1 \ldots x_m)\, \pi(i-1, y_{i-1})$.

54 Some Numbers Rough accuracies (all words / unknown words): Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (Brants, 2000): 96.7% / 85.5%; MaxEnt P(y|x): 93.7% / 82.6%; MEMM tagger 1: 96.7% / 84.5%; MEMM tagger 2: 96.8% / 86.9%; Perceptron: 97.1%; CRF++: 97.3%; Cyclic tagger: 97.2% / 89.0%; Upper bound: ~98% [Ratnaparkhi 1996]

55 Feature Development Common errors: NN/JJ NN (official knowledge), VBD RP/IN DT NN (made up the story), RB VBD/VBN NNS (recently sold shares) [Toutanova and Manning 2000]

56 Some Numbers Rough accuracies (all words / unknown words): Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (Brants, 2000): 96.7% / 85.5%; MaxEnt P(y|x): 93.7% / 82.6%; MEMM tagger 1: 96.7% / 84.5%; MEMM tagger 2: 96.8% / 86.9%; Perceptron: 97.1%; CRF++: 97.3%; Cyclic tagger: 97.2% / 89.0%; Upper bound: ~98% [Toutanova and Manning 2000]

57 Locally Normalized Models So far: probabilities are the product of locally normalized probabilities. Is this bad? [Figure: a small three-state example with states A, B, C and a from \ to transition table; e.g. the sequence A A A scores 0.4 x 0.4 = 0.16 while A B B scores 0.2 x 1.0 = 0.2.] B -> B transitions are likely to take over even if rarely observed!

58 Locally Normalized Models So far: probabilities are the product of locally normalized probabilities. Is this bad? Label bias: an MEMM tagger's local scores can be near one without having both good transitions and emissions. This means that often evidence doesn't flow properly.

59 Global Discriminative Taggers Newer, higher-powered discriminative sequence models: CRFs (also Perceptrons). They do not decompose training into independent local regions, and can be deathly slow to train, requiring repeated inference on the training set (* relatively slow; neural models are much slower).

60 Linear Models: Perceptron The perceptron algorithm iteratively processes the data, reacting to training errors; it can be thought of as trying to drive down training error. The (online structured) perceptron algorithm: start with zero weights; visit training instances $(x_i, y_i)$ one by one; make a prediction $y^* = \arg\max_y w \cdot \phi(x_i, y)$; if correct ($y^* == y_i$): no change, go to the next example; if wrong: adjust the weights, $w = w + \phi(x_i, y_i) - \phi(x_i, y^*)$. Challenge: how to compute the argmax efficiently? Sentence: $x = x_1 \ldots x_n$; tag sequence: $y = y_1 \ldots y_m$.
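A sketch of the structured perceptron update (the decode function and the local feature function are assumed arguments; in practice decode is the Viterbi search over $w \cdot \phi$ described on the next slides):

```python
from collections import defaultdict

def global_features(words, tags, feature_fn):
    # phi(x, y) = sum_j phi(x, j, y_{j-1}, y_j), built from a local feature function.
    phi = defaultdict(float)
    prev = "START"
    for j, tag in enumerate(tags):
        for f in feature_fn(words, j, prev, tag):
            phi[f] += 1.0
        prev = tag
    return phi

def perceptron_train(data, feature_fn, decode, epochs=5):
    # data: list of (words, gold_tags); decode(words, w) returns the argmax tag sequence.
    w = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            pred = decode(words, w)
            if pred != gold:
                # If wrong, adjust: w = w + phi(x, y_gold) - phi(x, y_pred).
                for f, v in global_features(words, gold, feature_fn).items():
                    w[f] += v
                for f, v in global_features(words, pred, feature_fn).items():
                    w[f] -= v
    return w
```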

61 Linear Perceptron Decoding $y^* = \arg\max_y w \cdot \phi(x, y)$. Features must be local: for $x = x_1 \ldots x_n$ and $y = y_1 \ldots y_m$, $\phi(x, y) = \sum_{j=1}^{n} \phi(x, j, y_{j-1}, y_j)$.

62 The MEMM State Lattice / Trellis [Figure: the MEMM trellis again, with each edge scored by the local conditional p(y_i | y_{i-1}, x_1 ... x_n)]

63 The Perceptron State Lattice / Trellis [Figure: the same trellis, with each edge scored by a linear local score such as $w \cdot \phi(x, 3, y_2, y_3)$]

64 Linear Perceptron Decoding $y^* = \arg\max_y w \cdot \phi(x, y)$. Features must be local: for $x = x_1 \ldots x_m$ and $y = y_1 \ldots y_m$, $\phi(x, y) = \sum_{j=1}^{n} \phi(x, j, y_{j-1}, y_j)$. Define $\pi(i, y_i)$ to be the max score of a sequence of length i ending in tag $y_i$. Viterbi algorithm (HMMs): $\pi(i, s_i) = \max_{s_{i-1}} e(x_i \mid s_i)\, q(s_i \mid s_{i-1})\, \pi(i-1, s_{i-1})$. Viterbi algorithm (MaxEnt): $\pi(i, s_i) = \max_{s_{i-1}} p(s_i \mid s_{i-1}, x_1 \ldots x_m)\, \pi(i-1, s_{i-1})$. Viterbi algorithm (Perceptron): $\pi(i, y_i) = \max_{y_{i-1}} w \cdot \phi(x, i, y_{i-1}, y_i) + \pi(i-1, y_{i-1})$.

65 Some Numbers Rough accuracies (all words / unknown words): Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (Brants, 2000): 96.7% / 85.5%; MaxEnt P(y|x): 93.7% / 82.6%; MEMM tagger 1: 96.7% / 84.5%; MEMM tagger 2: 96.8% / 86.9%; Perceptron: 97.1%; CRF++: 97.3%; Cyclic tagger: 97.2% / 89.0%; Upper bound: ~98% [Collins 2002]

66 Conditional Random Fields (CRFs) What did we lose with the Perceptron? No probabilities. Let's try again with a probabilistic model.

67 CRFs Maximum entropy (logistic regression) over whole sequences. Sentence: $x = x_1 \ldots x_n$; tag sequence: $y = y_1 \ldots y_n$. $p(y \mid x; w) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y'} \exp(w \cdot \phi(x, y'))}$. Learning: maximize the (log) conditional likelihood of the training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$: $\frac{\partial}{\partial w_j} L(w) = \sum_{i=1}^{m} \left( \phi_j(x^{(i)}, y^{(i)}) - \sum_{y} p(y \mid x^{(i)}; w)\, \phi_j(x^{(i)}, y) \right)$. Computational challenges? Most likely tag sequence, normalization constant, gradient. [Lafferty et al. 2001]

68 CRFs Decoding $y^* = \arg\max_y p(y \mid x; w)$. Features must be local: for $x = x_1 \ldots x_n$ and $y = y_1 \ldots y_n$, $\phi(x, y) = \sum_{j=1}^{n} \phi(x, j, y_{j-1}, y_j)$. $\arg\max_y \frac{\exp(w \cdot \phi(x, y))}{\sum_{y'} \exp(w \cdot \phi(x, y'))} = \arg\max_y \exp(w \cdot \phi(x, y)) = \arg\max_y w \cdot \phi(x, y)$. Same as the linear Perceptron! $\pi(i, y_i) = \max_{y_{i-1}} w \cdot \phi(x, i, y_{i-1}, y_i) + \pi(i-1, y_{i-1})$.

69 CRFs: Computing Normalization $p(y \mid x; w) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y'} \exp(w \cdot \phi(x, y'))}$, with $\phi(x, y) = \sum_{j=1}^{n} \phi(x, j, y_{j-1}, y_j)$. $\sum_{y'} \exp(w \cdot \phi(x, y')) = \sum_{y'} \exp\left( \sum_{j=1}^{n} w \cdot \phi(x, j, y_{j-1}, y_j) \right) = \sum_{y'} \prod_{j} \exp(w \cdot \phi(x, j, y_{j-1}, y_j))$. Define norm(i, y_i) to be the sum of scores for sequences ending in tag $y_i$ at position i: $\text{norm}(i, y_i) = \sum_{y_{i-1}} \exp(w \cdot \phi(x, i, y_{i-1}, y_i))\, \text{norm}(i-1, y_{i-1})$. Forward algorithm! Remember the HMM case: $\pi(i, y_i) = \max_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \pi(i-1, y_{i-1})$.
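A sketch of this forward recursion in log space (the local score $w \cdot \phi(x, i, y_{i-1}, y_i)$ is abstracted as a local_score callback, which is an assumption of this sketch):

```python
import math

def log_sum_exp(values):
    # Numerically stable log(sum_i exp(values_i)).
    m = max(values)
    return m if m == float("-inf") else m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(n, tags, local_score):
    # Forward algorithm: alpha[y] holds log norm(i, y), the log of the summed score
    # of all length-i tag prefixes ending in tag y.
    alpha = {"START": 0.0}
    for i in range(1, n + 1):
        alpha = {y: log_sum_exp([a + local_score(i, y_prev, y) for y_prev, a in alpha.items()])
                 for y in tags}
    return log_sum_exp(list(alpha.values()))  # log Z(x)

# Toy check with a constant local score of 0: Z = K^n, so log Z = n * log K.
print(log_partition(3, ["N", "V"], lambda i, y_prev, y: 0.0))  # log 8 = 2.079...
```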

70 CRFs: Computing the Gradient $p(y \mid x; w) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y'} \exp(w \cdot \phi(x, y'))}$, with $\phi(x, y) = \sum_{j=1}^{n} \phi(x, j, y_{j-1}, y_j)$. $\frac{\partial}{\partial w_j} L(w) = \sum_{i=1}^{m} \left( \phi_j(x^{(i)}, y^{(i)}) - \sum_y p(y \mid x^{(i)}; w)\, \phi_j(x^{(i)}, y) \right)$, where $\sum_y p(y \mid x^{(i)}; w)\, \phi_j(x^{(i)}, y) = \sum_{k=1}^{n} \sum_{a,b} p(y_{k-1} = a, y_k = b \mid x^{(i)}; w)\, \phi_j(x^{(i)}, k, a, b)$. Can compute these pairwise marginals with the Forward-Backward algorithm. See the notes for full details!

71 Some Numbers Rough accuracies (all words / unknown words): Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (Brants, 2000): 96.7% / 85.5%; MaxEnt P(y|x): 93.7% / 82.6%; MEMM tagger 1: 96.7% / 84.5%; MEMM tagger 2: 96.8% / 86.9%; Perceptron: 97.1%; CRF++: 97.3%; Cyclic tagger: 97.2% / 89.0%; Upper bound: ~98% [Sun 2014]

72 Cyclic Network Train two MEMMs and combine their scores, and be very careful: tune regularization, try lots of different features; see the paper for full details. [Figure: (a) Left-to-Right CMM, (b) Right-to-Left CMM, (c) Bidirectional Dependency Network] [Toutanova et al. 2003]

73 Some Numbers Rough accuracies (all words / unknown words): Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (Brants, 2000): 96.7% / 85.5%; MaxEnt P(y|x): 93.7% / 82.6%; MEMM tagger 1: 96.7% / 84.5%; MEMM tagger 2: 96.8% / 86.9%; Perceptron: 97.1%; CRF++: 97.3%; Cyclic tagger: 97.2% / 89.0%; Upper bound: ~98% [Toutanova et al. 2003]

74 Summary For tagging, the change from a generative to a discriminative model does not by itself result in great improvement. One profits from models for specifying dependence on overlapping features of the observation, such as spelling, suffix analysis, etc. MEMMs allow integration of rich features of the observations. This additional power (of the MEMM, CRF, and Perceptron models) has been shown to result in improvements in accuracy. The higher accuracy of discriminative models comes at the price of much slower training.

75 Domain Effects Accuracies degrade outside of the domain, up to triple the error rate, and you usually make the most errors on the things you care about in the domain (e.g., protein names). Open questions: how to effectively exploit unlabeled data from a new domain (what could we gain?), and how to best incorporate domain lexica in a principled way (e.g., the UMLS specialist lexicon, ontologies).
