Sequence Prediction and Part-of-speech Tagging


1 CS5740: Natural Language Processing Spring 2017 Sequence Prediction and Part-of-speech Tagging Instructor: Yoav Artzi Slides adapted from Dan Klein, Dan Jurafsky, Chris Manning, Michael Collins, Luke Zettlemoyer, Yejin Choi, and Slav Petrov

2 Overview POS Tagging: the problem Hidden Markov Models (HMM) Supervised Learning Inference The Viterbi algorithm Feature-rich models Maximum-entropy Markov Models Perceptron Conditional Random Fields

3 Parts of Speech Open class (lexical) words: Nouns (Proper: IBM, Italy; Common: cat/cats, snow), Verbs (Main: see, registered; Modals: can, had), Adjectives (old, older, oldest), Adverbs (slowly), Numbers (122,312, one), and more. Closed class (functional) words: Determiners (the, some), Prepositions (to, with), Conjunctions (and, or), Particles (off, up), Pronouns (he, its), Interjections (Ow, Eh), and more.

4 POS Tagging Words often have more than one POS: back. The back door = JJ; On my back = NN; Win the voters back = RB; Promised to back the bill = VB. The POS tagging problem is to determine the POS tag for a particular instance of a word.

5 POS Tagging Penn Treebank POS tags. Input: Plays well with others. Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS. Output: Plays/VBZ well/RB with/IN others/NNS. Uses: Text-to-speech (how do we pronounce lead?); can write regular expressions like (Det) Adj* N+ over the output for phrases, etc.; as input to or to speed up a full parser; if you know the tag, you can back off to it in other tasks.

6 Penn TreeBank Tagset Possible tags: 45 Tagging guidelines: 36 pages Newswire text

7 Penn Treebank tags (Main Tags):
CC conjunction, coordinating: and both but either or
CD numeral, cardinal: mid-1890 nine-thirty 0.5 one
DT determiner: a all an every no that the
EX existential there: there
FW foreign word: gemeinschaft hund ich jeux
IN preposition or conjunction, subordinating: among whether out on by if
JJ adjective or numeral, ordinal: third ill-mannered regrettable
JJR adjective, comparative: braver cheaper taller
JJS adjective, superlative: bravest cheapest tallest
MD modal auxiliary: can may might will would
NN noun, common, singular or mass: cabbage thermostat investment subhumanity
NNP noun, proper, singular: Motown Cougar Yvette Liverpool
NNPS noun, proper, plural: Americans Materials States
NNS noun, common, plural: undergraduates bric-a-brac averages
POS genitive marker: ' 's
PRP pronoun, personal: hers himself it we them
PRP$ pronoun, possessive: her his mine my our ours their thy your
RB adverb: occasionally maddeningly adventurously
RBR adverb, comparative: further gloomier heavier less-perfectly
RBS adverb, superlative: best biggest nearest worst
RP particle: aboard away back by on open through
TO "to" as preposition or infinitive marker: to
UH interjection: huh howdy uh whammo shucks heck
VB verb, base form: ask bring fire see take
VBD verb, past tense: pleaded swiped registered saw
VBG verb, present participle or gerund: stirring focusing approaching erasing
VBN verb, past participle: dilapidated imitated reunifed unsettled
VBP verb, present tense, not 3rd person singular: twist appear comprise mold postpone
VBZ verb, present tense, 3rd person singular: bases reconstructs marks uses
WDT WH-determiner: that what whatever which whichever
WP WH-pronoun: that what whatever which who whom
WP$ WH-pronoun, possessive: whose
WRB WH-adverb: however whenever where why

8 Penn TreeBank Tagset How accurate are taggers? (Tag accuracy) About 97% currently, but the baseline is already 90%. The baseline is the performance of the simplest possible method: tag every word with its most frequent tag, and tag unknown words as nouns. Partly easy because many words are unambiguous, and you get points for them (the, a, etc.) and for punctuation marks! Upper bound: probably ~2% annotation errors.
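The 90% baseline above is easy to reproduce. Below is a minimal sketch in Python (not from the slides; the tiny corpus and variable names are illustrative): tag each word with its most frequent training tag and fall back to NN for unknown words.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    # Count (word, tag) pairs and keep each word's most frequent tag.
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, most_frequent_tag, unknown_tag="NN"):
    # Unknown words default to the noun tag, as on the slide.
    return [most_frequent_tag.get(w, unknown_tag) for w in words]

# Tiny illustrative corpus (hypothetical).
train = [[("the", "DT"), ("back", "NN"), ("door", "NN")],
         [("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB")]]
model = train_baseline(train)
print(tag_baseline(["the", "back", "yard"], model))  # ['DT', 'NN', 'NN'] (yard is unknown)
```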

9 Hard Cases are Hard Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG; All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN; Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD

10 How Difficult is POS Tagging? About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech, but they tend to be very common words. E.g., that: I know that he is honest = IN; Yes, that play was nice = DT; You can't go that far = RB. 40% of the word tokens are ambiguous.

11 The Tagset Wait, do we really need all these tags? What about other languages? Each language has its own tagset

12 Tagsets in Different Languages [Petrov et al. 2012]

13 The Tagset Wait, do we really need all these tags? What about other languages? Each language has its own tagset But why is this bad? Differences in downstream tasks Harder to do language transfer

14 Alternative: The Universal Tagset 12 tags: NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., and X. Deterministic conversion from tagsets in 22 languages. Better unsupervised parsing results. Was used to transfer parsers. [Petrov et al. 2012]

15 Sources of Information What are the main sources of information for POS tagging? Knowledge of neighboring words: Bill saw that man yesterday (candidate tags: NNP NN DT NN NN / VB VB(D) IN VB NN). Knowledge of word probabilities: man is rarely used as a verb. The latter proves the most useful, but the former also helps.

16 Word-level Features Can do surprisingly well just looking at a word by itself: Word (the: the -> DT); Lowercased word (Importantly: importantly -> RB); Prefixes (unfathomable: un- -> JJ); Suffixes (Importantly: -ly -> RB); Capitalization (Meridian: CAP -> NNP); Word shapes (35-year: d-x -> JJ)
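As a concrete sketch of such word-level features (the feature names below are illustrative, not taken from any particular tagger), the function computes the lowercased word, prefix, suffix, capitalization, and a coarse word shape:

```python
import re

def word_features(word):
    # Word identity, affixes, capitalization, and a word shape that maps
    # digits to 'd', uppercase to 'X', and lowercase to 'x', as on the slide.
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return {
        "word=" + word.lower(): 1.0,
        "prefix3=" + word[:3].lower(): 1.0,
        "suffix3=" + word[-3:].lower(): 1.0,
        "is_capitalized": 1.0 if word[:1].isupper() else 0.0,
        "shape=" + shape: 1.0,
    }

print(word_features("Meridian"))  # is_capitalized -> evidence for NNP
print(word_features("35-year"))   # shape=dd-xxxx -> evidence for JJ
```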

17 Sequence-to-Sequence Consider the problem of jointly modeling a pair of strings. E.g.: part of speech tagging: DT NNP NN VBD VBN RP NN NNS / The Georgia branch had taken on loan commitments; DT NN IN NN VBD NNS VBD / The average of interbank offered rates plummeted. Q: How do we map each word in the input sentence onto the appropriate label? A: We can learn a joint distribution $p(x_1 \ldots x_n, y_1 \ldots y_n)$ and then compute the most likely assignment: $\arg\max_{y_1 \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$

18 Classic Solution: HMMs We want a model of sequences y and observations x [diagram: states $y_0, y_1, \ldots, y_n$ connected by transitions, each $y_i$ emitting $x_i$]: $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$, where $y_0 = \text{START}$, and we call $q(y_i \mid y_{i-1})$ the transition distribution and $e(x_i \mid y_i)$ the emission (or observation) distribution.

19 Model Assumptions $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. The tag/state sequence is generated by a Markov model. Words are chosen independently, conditioned only on the tag/state. These are totally broken assumptions for POS: why?

20 HMM for POS Tagging The Georgia branch had taken on loan commitments / DT NNP NN VBD VBN RP NN NNS. HMM Model: States Y = {DT, NNP, NN, ...} are the POS tags; observations X are words; the transition distribution $q(y_i \mid y_{i-1})$ models the tag sequences; the emission distribution $e(x_i \mid y_i)$ models words given their POS.

22 HMM Inference and Learning Learning: maximum likelihood for transitions q and emissions e in $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. Inference: Viterbi, $y^* = \arg\max_{y_1 \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$; forward-backward, $p(x_1 \ldots x_n, y_i) = \sum_{y_1 \ldots y_{i-1}} \sum_{y_{i+1} \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$.

23 Learning: Maximum Likelihood $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. Maximum likelihood methods for estimating transitions q and emissions e: $q_{ML}(y_i \mid y_{i-1}) = \frac{c(y_{i-1}, y_i)}{c(y_{i-1})}$ and $e_{ML}(x \mid y) = \frac{c(y, x)}{c(y)}$. Will these estimates be high quality? Which is likely to be more sparse, q or e? Smoothing?
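A minimal sketch of these count-based estimates (assuming the training data is a list of sentences of (word, tag) pairs; START and STOP are added explicitly):

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    # Count tag bigrams for q(y_i | y_{i-1}) and (tag, word) pairs for e(x | y).
    transition_counts, emission_counts, tag_counts = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        tags = ["START"] + [t for _, t in sentence] + ["STOP"]
        for prev, curr in zip(tags, tags[1:]):
            transition_counts[(prev, curr)] += 1
            tag_counts[prev] += 1
        for word, tag in sentence:
            emission_counts[(tag, word)] += 1
    q = {bigram: c / tag_counts[bigram[0]] for bigram, c in transition_counts.items()}
    e = {pair: c / tag_counts[pair[0]] for pair, c in emission_counts.items()}
    return q, e

q, e = estimate_hmm([[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]])
print(q[("DT", "NN")], e[("NN", "dog")])  # 1.0 1.0
```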

24 Learning: Low Frequency Words $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. Typically, for transitions: linear interpolation, $q(y_i \mid y_{i-1}) = \lambda_1\, q_{ML}(y_i \mid y_{i-1}) + \lambda_2\, q_{ML}(y_i)$ (see the sketch below). However, other approaches are used for emissions. Step 1: Split the vocabulary into frequent words (appear more than M, often 5, times) and low-frequency words (everything else). Step 2: Map each low-frequency word to one of a small, finite set of possibilities, for example based on prefixes, suffixes, etc. Step 3: Learn the model for this new space of possible word sequences.
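A small sketch of the linear interpolation above (the lambda values are placeholders; in practice they are tuned, e.g. on held-out data):

```python
def interpolated_q(prev_tag, tag, q_bigram, q_unigram, lambda1=0.8, lambda2=0.2):
    # q(y_i | y_{i-1}) = lambda1 * q_ML(y_i | y_{i-1}) + lambda2 * q_ML(y_i)
    return (lambda1 * q_bigram.get((prev_tag, tag), 0.0)
            + lambda2 * q_unigram.get(tag, 0.0))
```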

25 Another Example: Chunking Goal: Segment text into spans with certain properties, for example named entities: PER, ORG, and LOC. Germany 's representative to the European Union 's veterinary committee Werner Zwingman said on Wednesday consumers should / [Germany] LOC 's representative to the [European Union] ORG 's veterinary committee [Werner Zwingman] PER said on Wednesday consumers should. How is this a sequence tagging problem?

26 Named Entity Recognition Germany 's representative to the European Union 's veterinary committee Werner Zwingman said on Wednesday consumers should / [Germany] LOC 's representative to the [European Union] ORG 's veterinary committee [Werner Zwingman] PER said on Wednesday consumers should. HMM Model: States Y = {NA, BL, CL, BO, CO, BP, CP} represent beginnings (BL, BO, BP) and continuations (CL, CO, CP) of chunks, as well as other words (NA); observations X are words; the transition distribution $q(y_i \mid y_{i-1})$ models the tag sequences; the emission distribution $e(x_i \mid y_i)$ models words given their type.

27 Low Frequency Words: An Example Named Entity Recognition [Bikel et al., 1999] Used the following word classes for infrequent words (word class: example (intuition)):
twoDigitNum: 90 (two-digit year)
fourDigitNum: 1990 (four-digit year)
containsDigitAndAlpha: A8956-67 (product code)
containsDigitAndDash: 09-96 (date)
containsDigitAndSlash: 11/9/89 (date)
containsDigitAndComma: 23,000.00 (monetary amount)
containsDigitAndPeriod: 1.00 (monetary amount, percentage)
otherNum: 456789 (other number)
allCaps: BBN (organization)
capPeriod: M. (person name initial)
firstWord: first word of sentence (no useful capitalization information)
initCap: Sally (capitalized word)
lowercase: can (uncapitalized word)
other: , (punctuation marks, all other words)

28 Low Frequency Words: An Example Profits/NA soared/NA at/NA Boeing/SO Co./CO ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA. NA = No entity, SO = Start Organization, CO = Continue Organization, SL = Start Location, CL = Continue Location, SP = Start Person, CP = Continue Person

29 Low Frequency Words: An Example Profits/NA soared/NA at/NA Boeing/SO Co./CO ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA becomes, after mapping low-frequency words to word classes, firstWord/NA soared/NA at/NA initCap/SO Co./CO ,/NA easily/NA lowercase/NA forecasts/NA on/NA initCap/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP initCap/CP announced/NA first/NA quarter/NA results/NA ./NA. NA = No entity, SO = Start Organization, CO = Continue Organization, SL = Start Location, CL = Continue Location, SP = Start Person, CP = Continue Person
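A sketch of the Step 2 mapping, using classes in the spirit of the table on slide 27 (a simplified approximation, not the exact Bikel et al. rules):

```python
import re

def word_class(word, is_first_word=False):
    # Map an infrequent word to a coarse class, roughly following the table above.
    has_digit = bool(re.search(r"\d", word))
    if re.fullmatch(r"\d{2}", word): return "twoDigitNum"
    if re.fullmatch(r"\d{4}", word): return "fourDigitNum"
    if has_digit and re.search(r"[A-Za-z]", word): return "containsDigitAndAlpha"
    if has_digit and "-" in word: return "containsDigitAndDash"
    if has_digit and "/" in word: return "containsDigitAndSlash"
    if has_digit and "," in word: return "containsDigitAndComma"
    if has_digit and "." in word: return "containsDigitAndPeriod"
    if re.fullmatch(r"\d+", word): return "otherNum"
    if re.fullmatch(r"[A-Z]\.", word): return "capPeriod"
    if word.isupper(): return "allCaps"
    if is_first_word: return "firstWord"
    if word[:1].isupper(): return "initCap"
    if word.islower(): return "lowercase"
    return "other"

print(word_class("Profits", is_first_word=True))  # firstWord
print(word_class("Boeing"))                       # initCap
print(word_class("11/9/89"))                      # containsDigitAndSlash
```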

30 HMM Inference and Learning Learning: maximum likelihood for transitions q and emissions e in $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. Inference: Viterbi, $y^* = \arg\max_{y_1 \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$; forward-backward, $p(x_1 \ldots x_n, y_i) = \sum_{y_1 \ldots y_{i-1}} \sum_{y_{i+1} \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$.

31 Inference (Decoding) Problem: find the most likely (Viterbi) sequence under the model: $y^* = \arg\max_{y_1 \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$. Given model parameters, we can score any sequence pair, e.g. NNP VBZ NN NNS CD NN . / Fed raises interest rates 0.5 percent . scored as q(NNP | START) e(Fed | NNP) q(VBZ | NNP) e(raises | VBZ) q(NN | VBZ) ... In principle, we're done: list all possible tag sequences, score each one, pick the best one (the Viterbi state sequence), e.g. NNP VBZ NN NNS CD NN . with log p(x, y) = -23; NNP NNS NN NNS CD NN . with log p(x, y) = -29; NNP VBZ VB NNS CD NN . with log p(x, y) = -27. Any issue?
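In log space, scoring one candidate tag sequence under the HMM is just a sum. A self-contained sketch (the toy q and e probability dictionaries are illustrative; missing entries are treated as probability zero):

```python
import math

def log_(p):
    # log of a probability, with log 0 = -inf so impossible sequences score -inf.
    return math.log(p) if p > 0.0 else float("-inf")

def score_sequence(words, tags, q, e):
    # log p(x, y) = sum_i [log q(y_i | y_{i-1}) + log e(x_i | y_i)] + log q(STOP | y_n)
    total, prev = 0.0, "START"
    for word, tag in zip(words, tags):
        total += log_(q.get((prev, tag), 0.0)) + log_(e.get((tag, word), 0.0))
        prev = tag
    return total + log_(q.get((prev, "STOP"), 0.0))

q = {("START", "DT"): 1.0, ("DT", "NN"): 1.0, ("NN", "VBZ"): 1.0, ("VBZ", "STOP"): 1.0}
e = {("DT", "the"): 1.0, ("NN", "dog"): 1.0, ("VBZ", "barks"): 1.0}
print(score_sequence(["the", "dog", "barks"], ["DT", "NN", "VBZ"], q, e))  # 0.0
print(score_sequence(["the", "dog", "barks"], ["DT", "VBZ", "NN"], q, e))  # -inf
```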

32 Finding the Best Trajectory Too many trajectories (state sequences) to list. Option 1: Beam Search. A beam is a set of partial hypotheses. Start with just the single empty trajectory; at each derivation step, consider all continuations of previous hypotheses and discard most, keeping the top k (e.g. <> expands to Fed:N, Fed:V, Fed:J, which expand to Fed:N raises:N, Fed:N raises:V, Fed:V raises:N, Fed:V raises:V, ...). Beam search often works OK in practice, but sometimes you want the optimal answer, and there's usually a better option than naive beams.
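A minimal beam-search sketch (log-space; the `log_score(prev_tag, tag, word)` callback is an assumption of this sketch and would wrap log q + log e for an HMM). It keeps only the top k partial tag sequences at each position, so it is fast but not guaranteed to return the Viterbi sequence:

```python
import math

def beam_search(words, tags, log_score, k=3):
    # Each hypothesis is (log score, tag sequence so far), starting from START.
    beam = [(0.0, ["START"])]
    for word in words:
        candidates = []
        for logp, seq in beam:
            for tag in tags:
                candidates.append((logp + log_score(seq[-1], tag, word), seq + [tag]))
        # Consider all continuations, discard most, keep the top k.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    best_logp, best_seq = max(beam, key=lambda c: c[0])
    return best_seq[1:], best_logp

# Toy usage with a uniform scorer (hypothetical).
uniform = lambda prev, tag, word: math.log(0.5)
print(beam_search(["Fed", "raises"], ["N", "V"], uniform, k=2))
```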

33 The State Lattice / Trellis [Figure: a trellis with states {N, V, J, D} at each position between START and STOP over the sentence Fed raises interest rates; edges carry emission scores such as e(Fed | N) and e(rates | J), transition scores q(y_i | y_{i-1}), and a final STOP score]

34 Scoring a Sequence $y^* = \arg\max_{y_1 \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$, with $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. Define $\pi(i, y_i)$ to be the max score of a sequence of length i ending in tag $y_i$: $\pi(i, y_i) = \max_{y_1 \ldots y_{i-1}} p(x_1 \ldots x_i, y_1 \ldots y_i) = \max_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1}) \max_{y_1 \ldots y_{i-2}} p(x_1 \ldots x_{i-1}, y_1 \ldots y_{i-1}) = \max_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \pi(i-1, y_{i-1})$. We can now design an efficient algorithm. How?

35 The Viterbi Algorithm Dynamic program for computing (for all i) $\pi(i, y_i) = \max_{y_1 \ldots y_{i-1}} p(x_1 \ldots x_i, y_1 \ldots y_i)$. Iterative computation: $\pi(0, y_0) = 1$ if $y_0 = \text{START}$, 0 otherwise. For i = 1 ... n: store the score $\pi(i, y_i) = \max_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \pi(i-1, y_{i-1})$ and store a back-pointer $bp(i, y_i) = \arg\max_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \pi(i-1, y_{i-1})$. What is the back-pointer for?
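A direct transcription of this dynamic program into Python (a sketch working in log space to avoid underflow; q and e are probability dictionaries as in the earlier sketches, with missing entries treated as zero):

```python
import math

def log_(p):
    return math.log(p) if p > 0.0 else float("-inf")

def viterbi(words, tags, q, e):
    n = len(words)
    # pi[i][y]: best log score of a length-i prefix ending in tag y; bp stores back-pointers.
    pi = [{"START": 0.0}] + [{} for _ in range(n)]
    bp = [{} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for y in tags:
            emit = log_(e.get((y, words[i - 1]), 0.0))
            best_prev = max(pi[i - 1], key=lambda yp: pi[i - 1][yp] + log_(q.get((yp, y), 0.0)))
            pi[i][y] = pi[i - 1][best_prev] + log_(q.get((best_prev, y), 0.0)) + emit
            bp[i][y] = best_prev
    # Add the STOP transition, then follow back-pointers to recover the sequence.
    last = max(tags, key=lambda y: pi[n][y] + log_(q.get((y, "STOP"), 0.0)))
    sequence = [last]
    for i in range(n, 1, -1):
        sequence.append(bp[i][sequence[-1]])
    return list(reversed(sequence))

q = {("START", "DT"): 1.0, ("DT", "NN"): 0.9, ("DT", "VBZ"): 0.1, ("NN", "VBZ"): 0.8,
     ("NN", "STOP"): 0.2, ("VBZ", "NN"): 0.5, ("VBZ", "STOP"): 0.5}
e = {("DT", "the"): 1.0, ("NN", "dog"): 0.5, ("NN", "barks"): 0.01, ("VBZ", "barks"): 0.5}
print(viterbi(["the", "dog", "barks"], ["DT", "NN", "VBZ"], q, e))  # ['DT', 'NN', 'VBZ']
```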

36 The State Lattice / Trellis [Figure: a worked Viterbi example over START Fed raises interest STOP, with a transition table (from \ to) and an emission table; tie breaking: prefer the first]

37 The State Lattice / Trellis [Figure: the same worked example with the $\pi$ values and back-pointers filled in by the Viterbi recursion, starting from $\pi = 1$, bp = null at START; tie breaking: prefer the first]

38 The Viterbi Algorithm: Runtime In terms of sentence length n? Linear. In terms of the number of states K? Polynomial. $\pi(i, y_i) = \max_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \pi(i-1, y_{i-1})$. Specifically: O(nK) entries in $\pi(i, y_i)$, O(K) time to compute each $\pi(i, y_i)$, so the total runtime is O(nK^2). Q: Is this a practical algorithm? A: depends on K.

39 Tagsets in Different Languages [Figure: tagset sizes for different languages and the resulting Viterbi costs; Petrov et al. 2012]

40 HMM Inference and Learning Learning: maximum likelihood for transitions q and emissions e in $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. Inference: Viterbi, $y^* = \arg\max_{y_1 \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$; forward-backward, $p(x_1 \ldots x_n, y_i) = \sum_{y_1 \ldots y_{i-1}} \sum_{y_{i+1} \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$.

41 What about n-gram Taggers? States encode what is relevant about the past. Transitions $P(s_i \mid s_{i-1})$ encode well-formed tag sequences. In a bigram tagger, states = tags: $s_0 = \langle \diamond \rangle, s_1 = \langle y_1 \rangle, s_2 = \langle y_2 \rangle, \ldots, s_n = \langle y_n \rangle$, each state $s_i$ emitting $x_i$. In a trigram tagger, states = tag pairs: $s_0 = \langle \diamond, \diamond \rangle, s_1 = \langle \diamond, y_1 \rangle, s_2 = \langle y_1, y_2 \rangle, \ldots, s_n = \langle y_{n-1}, y_n \rangle$ (see the sketch below).
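One way to see this reduction concretely: a trigram tagger is a bigram tagger whose states are tag pairs, so the same Viterbi code applies once the transition table is re-keyed. A sketch (a dictionary q_trigram[(a, b, c)] holding q(c | a, b) is an assumption; STOP handling is omitted):

```python
from itertools import product

def pair_state_transitions(tags, q_trigram, start="*"):
    # States are tag pairs <a, b>; the edge <a, b> -> <b, c> gets weight q(c | a, b).
    states = [(start, start)] + [(start, t) for t in tags] + list(product(tags, repeat=2))
    return {((a, b), (b, c)): q_trigram.get((a, b, c), 0.0)
            for (a, b) in states for c in tags}
```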

42 The State Lattice / Trellis [Figure: the trigram trellis over START Fed raises interest, where states are tag pairs such as <N,V> or <V,D>; not all edges are allowed, since consecutive states must agree on the shared tag; columns are labeled with emissions e(Fed | ...), e(raises | ...), e(interest | ...)]

43 Tagsets in Different Languages [Figure: tagset sizes for different languages and the resulting Viterbi costs; Petrov et al. 2012]

44 Some Numbers Most errors are on unknown words. Rough accuracies (all words / unknown words): Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (Brants, 2000): 96.7% / 85.5% (a carefully smoothed trigram tagger with suffix trees for emissions); MaxEnt P(y|x): 93.7% / 82.6%; MEMM tagger 1: 96.7% / 84.5%; MEMM tagger 2: 96.8% / 86.9%; Perceptron: 97.1%; CRF++: 97.3%; Cyclic tagger: 97.2% / 89.0%; Upper bound: ~98%

45 Re-visit P(x | y) Reality check: What if we drop the sequence and re-visit P(x | y) on its own? Most frequent tag: 90.3% with a so-so unknown word model. Can we do better?

46 What about better features? Looking at a word and its environment: add in the previous / next word (the __); previous / next word shapes (X __ X); occurrence pattern features ([X: x X occurs]); crude entity detection (__ ..... (Inc. | Co.)); phrasal verb in sentence? (put ...... __); conjunctions of these things. Uses lots of features: > 200K

47 Some Numbers Rough accuracies (all words / unknown words): Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (Brants, 2000): 96.7% / 85.5%; MaxEnt P(y|x): 93.7% / 82.6%; MEMM tagger 1: 96.7% / 84.5%; MEMM tagger 2: 96.8% / 86.9%; Perceptron: 97.1%; CRF++: 97.3%; Cyclic tagger: 97.2% / 89.0%; Upper bound: ~98%. What does this tell us about sequence models? How do we add more features to our sequence models?

48 MEMM Taggers One step up: also condition on previous tags: $p(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_{i=1}^{n} p(y_i \mid y_1 \ldots y_{i-1}, x_1 \ldots x_n) = \prod_{i=1}^{n} p(y_i \mid y_{i-1}, x_1 \ldots x_n)$. Training: train $p(y_i \mid y_{i-1}, x_1 \ldots x_n)$ as a discrete log-linear (MaxEnt) model. Scoring: $p(y_i \mid y_{i-1}, x_1 \ldots x_n) = \frac{e^{w \cdot \phi(x_1 \ldots x_n, i, y_{i-1}, y_i)}}{\sum_{y'} e^{w \cdot \phi(x_1 \ldots x_n, i, y_{i-1}, y')}}$. This is referred to as an MEMM tagger [Ratnaparkhi 96].
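A sketch of the local MaxEnt scorer (the feature templates and weights here are hypothetical; in a real MEMM tagger the weights come from training the log-linear model):

```python
import math

def local_features(words, i, prev_tag, tag):
    # Hypothetical feature templates: current word + tag, previous tag + tag, suffix + tag.
    word = words[i]
    return ["word=%s,tag=%s" % (word.lower(), tag),
            "prev=%s,tag=%s" % (prev_tag, tag),
            "suffix3=%s,tag=%s" % (word[-3:].lower(), tag)]

def memm_local_prob(words, i, prev_tag, tag, tags, weights):
    # p(y_i | y_{i-1}, x) = exp(w . phi(x, i, y_{i-1}, y_i)) / sum_y' exp(w . phi(x, i, y_{i-1}, y'))
    def score(t):
        return sum(weights.get(f, 0.0) for f in local_features(words, i, prev_tag, t))
    denom = sum(math.exp(score(t)) for t in tags)
    return math.exp(score(tag)) / denom

weights = {"word=raises,tag=VBZ": 2.0, "prev=NNP,tag=VBZ": 1.0, "word=raises,tag=NNS": 0.5}
print(memm_local_prob(["Fed", "raises"], 1, "NNP", "VBZ", ["NNS", "VBZ"], weights))  # ~0.92
```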

49 HMM vs. MEMM The HMM models the joint distribution: $p(x_1 \ldots x_n, y_1 \ldots y_n) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$. The MEMM models the conditional distribution: $p(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_{i=1}^{n} p(y_i \mid y_1 \ldots y_{i-1}, x_1 \ldots x_n)$.

50 Decoding MEMM Taggers Scoring: $p(y_i \mid y_{i-1}, x_1 \ldots x_n) = \frac{e^{w \cdot \phi(x_1 \ldots x_n, i, y_{i-1}, y_i)}}{\sum_{y'} e^{w \cdot \phi(x_1 \ldots x_n, i, y_{i-1}, y')}}$. Beam search is effective here. Why? Guarantees? Optimal? Can we do better?

51 The State Lattice / Trellis [Figure: the HMM trellis again, with states {N, V, J, D} over Fed raises interest rates between START and STOP, edges scored by emissions such as e(Fed | N) and transitions q(y_i | y_{i-1})]

52 The MEMM State Lattice / Trellis [Figure: the same trellis, but each edge is now scored by the local conditional p(y_i | y_{i-1}, x_1 ... x_n) rather than by separate transition and emission probabilities]

53 Decoding MEMM Taggers Decoding MaxEnt taggers: just like decoding HMMs (Viterbi, beam search). Viterbi algorithm (HMMs): define $\pi(i, y_i)$ to be the max score of a sequence of length i ending in tag $y_i$, $\pi(i, y_i) = \max_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \pi(i-1, y_{i-1})$. Viterbi algorithm (MaxEnt): we can use the same algorithm for MEMMs, just need to redefine $\pi(i, y_i)$: $\pi(i, y_i) = \max_{y_{i-1}} p(y_i \mid y_{i-1}, x_1 \ldots x_m)\, \pi(i-1, y_{i-1})$.

54 Some Numbers Rough accuracies (all words / unknown words): Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (Brants, 2000): 96.7% / 85.5%; MaxEnt P(y|x): 93.7% / 82.6%; MEMM tagger 1: 96.7% / 84.5%; MEMM tagger 2: 96.8% / 86.9%; Perceptron: 97.1%; CRF++: 97.3%; Cyclic tagger: 97.2% / 89.0%; Upper bound: ~98% [Ratnaparkhi 1996]

55 Feature Development Common errors: NN/JJ NN (official knowledge), VBD RP/IN DT NN (made up the story), RB VBD/VBN NNS (recently sold shares) [Toutanova and Manning 2000]

56 Some Numbers Rough accuracies (all words / unknown words): Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (Brants, 2000): 96.7% / 85.5%; MaxEnt P(y|x): 93.7% / 82.6%; MEMM tagger 1: 96.7% / 84.5%; MEMM tagger 2: 96.8% / 86.9%; Perceptron: 97.1%; CRF++: 97.3%; Cyclic tagger: 97.2% / 89.0%; Upper bound: ~98% [Toutanova and Manning 2000]

57 Locally Normalized Models So far: probabilities are the product of locally normalized probabilities. Is this bad? [Figure: a small three-state example with states A, B, C and a from \ to transition table; e.g. the sequence A A A scores 0.4 x 0.4 = 0.16 while A B B scores 0.2 x 1.0 = 0.2.] B -> B transitions are likely to take over even if rarely observed!

58 Locally Normalized Models So far: probabilities are the product of locally normalized probabilities. Is this bad? Label bias: an MEMM tagger's local scores can be near one without having both good transitions and emissions. This means that often evidence doesn't flow properly.

59 Global Discriminative Taggers Newer, higher-powered discriminative sequence models: CRFs (also Perceptrons). They do not decompose training into independent local regions, and can be deathly slow to train, requiring repeated inference on the training set (* relatively slow; neural models are much slower).

60 Linear Models: Perceptron The perceptron algorithm iteratively processes the data, reacting to training errors; it can be thought of as trying to drive down training error. The (online structured) perceptron algorithm: start with zero weights; visit training instances $(x_i, y_i)$ one by one; make a prediction $y^* = \arg\max_y w \cdot \phi(x_i, y)$; if correct ($y^* == y_i$): no change, go to the next example; if wrong: adjust the weights, $w = w + \phi(x_i, y_i) - \phi(x_i, y^*)$. Challenge: how to compute the argmax efficiently? Sentence: $x = x_1 \ldots x_n$; tag sequence: $y = y_1 \ldots y_m$.
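A sketch of the structured perceptron update (the decode function and the local feature function are assumed arguments; in practice decode is the Viterbi search over $w \cdot \phi$ described on the next slides):

```python
from collections import defaultdict

def global_features(words, tags, feature_fn):
    # phi(x, y) = sum_j phi(x, j, y_{j-1}, y_j), built from a local feature function.
    phi = defaultdict(float)
    prev = "START"
    for j, tag in enumerate(tags):
        for f in feature_fn(words, j, prev, tag):
            phi[f] += 1.0
        prev = tag
    return phi

def perceptron_train(data, feature_fn, decode, epochs=5):
    # data: list of (words, gold_tags); decode(words, w) returns the argmax tag sequence.
    w = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            pred = decode(words, w)
            if pred != gold:
                # If wrong, adjust: w = w + phi(x, y_gold) - phi(x, y_pred).
                for f, v in global_features(words, gold, feature_fn).items():
                    w[f] += v
                for f, v in global_features(words, pred, feature_fn).items():
                    w[f] -= v
    return w
```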

61 Linear Perceptron Decoding $y^* = \arg\max_y w \cdot \phi(x, y)$. Features must be local: for $x = x_1 \ldots x_n$ and $y = y_1 \ldots y_m$, $\phi(x, y) = \sum_{j=1}^{n} \phi(x, j, y_{j-1}, y_j)$.

62 The MEMM State Lattice / Trellis [Figure: the MEMM trellis again, with each edge scored by the local conditional p(y_i | y_{i-1}, x_1 ... x_n)]

63 The Perceptron State Lattice / Trellis [Figure: the same trellis, with each edge scored by a linear local score such as $w \cdot \phi(x, 3, y_2, y_3)$]

64 Linear Perceptron Decoding $y^* = \arg\max_y w \cdot \phi(x, y)$. Features must be local: for $x = x_1 \ldots x_m$ and $y = y_1 \ldots y_m$, $\phi(x, y) = \sum_{j=1}^{n} \phi(x, j, y_{j-1}, y_j)$. Define $\pi(i, y_i)$ to be the max score of a sequence of length i ending in tag $y_i$. Viterbi algorithm (HMMs): $\pi(i, s_i) = \max_{s_{i-1}} e(x_i \mid s_i)\, q(s_i \mid s_{i-1})\, \pi(i-1, s_{i-1})$. Viterbi algorithm (MaxEnt): $\pi(i, s_i) = \max_{s_{i-1}} p(s_i \mid s_{i-1}, x_1 \ldots x_m)\, \pi(i-1, s_{i-1})$. Viterbi algorithm (Perceptron): $\pi(i, y_i) = \max_{y_{i-1}} w \cdot \phi(x, i, y_{i-1}, y_i) + \pi(i-1, y_{i-1})$.

65 Some Numbers Rough accuracies (all words / unknown words): Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (Brants, 2000): 96.7% / 85.5%; MaxEnt P(y|x): 93.7% / 82.6%; MEMM tagger 1: 96.7% / 84.5%; MEMM tagger 2: 96.8% / 86.9%; Perceptron: 97.1%; CRF++: 97.3%; Cyclic tagger: 97.2% / 89.0%; Upper bound: ~98% [Collins 2002]

66 Conditional Random Fields (CRFs) What did we lose with the Perceptron? No probabilities. Let's try again with a probabilistic model.

67 CRFs Maximum entropy (logistic regression) over whole sequences. Sentence: $x = x_1 \ldots x_n$; tag sequence: $y = y_1 \ldots y_n$. $p(y \mid x; w) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y'} \exp(w \cdot \phi(x, y'))}$. Learning: maximize the (log) conditional likelihood of the training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$: $\frac{\partial}{\partial w_j} L(w) = \sum_{i=1}^{m} \left( \phi_j(x^{(i)}, y^{(i)}) - \sum_{y} p(y \mid x^{(i)}; w)\, \phi_j(x^{(i)}, y) \right)$. Computational challenges? Most likely tag sequence, normalization constant, gradient. [Lafferty et al. 2001]

68 CRFs Decoding $y^* = \arg\max_y p(y \mid x; w)$. Features must be local: for $x = x_1 \ldots x_n$ and $y = y_1 \ldots y_n$, $\phi(x, y) = \sum_{j=1}^{n} \phi(x, j, y_{j-1}, y_j)$. $\arg\max_y \frac{\exp(w \cdot \phi(x, y))}{\sum_{y'} \exp(w \cdot \phi(x, y'))} = \arg\max_y \exp(w \cdot \phi(x, y)) = \arg\max_y w \cdot \phi(x, y)$. Same as the linear Perceptron! $\pi(i, y_i) = \max_{y_{i-1}} w \cdot \phi(x, i, y_{i-1}, y_i) + \pi(i-1, y_{i-1})$.

69 CRFs: Computing Normalization $p(y \mid x; w) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y'} \exp(w \cdot \phi(x, y'))}$, with $\phi(x, y) = \sum_{j=1}^{n} \phi(x, j, y_{j-1}, y_j)$. $\sum_{y'} \exp(w \cdot \phi(x, y')) = \sum_{y'} \exp\left( \sum_{j=1}^{n} w \cdot \phi(x, j, y_{j-1}, y_j) \right) = \sum_{y'} \prod_{j} \exp(w \cdot \phi(x, j, y_{j-1}, y_j))$. Define norm(i, y_i) to be the sum of scores for sequences ending in tag $y_i$ at position i: $\text{norm}(i, y_i) = \sum_{y_{i-1}} \exp(w \cdot \phi(x, i, y_{i-1}, y_i))\, \text{norm}(i-1, y_{i-1})$. Forward algorithm! Remember the HMM case: $\pi(i, y_i) = \max_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \pi(i-1, y_{i-1})$.
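A sketch of this forward recursion in log space (the local score $w \cdot \phi(x, i, y_{i-1}, y_i)$ is abstracted as a local_score callback, which is an assumption of this sketch):

```python
import math

def log_sum_exp(values):
    # Numerically stable log(sum_i exp(values_i)).
    m = max(values)
    return m if m == float("-inf") else m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(n, tags, local_score):
    # Forward algorithm: alpha[y] holds log norm(i, y), the log of the summed score
    # of all length-i tag prefixes ending in tag y.
    alpha = {"START": 0.0}
    for i in range(1, n + 1):
        alpha = {y: log_sum_exp([a + local_score(i, y_prev, y) for y_prev, a in alpha.items()])
                 for y in tags}
    return log_sum_exp(list(alpha.values()))  # log Z(x)

# Toy check with a constant local score of 0: Z = K^n, so log Z = n * log K.
print(log_partition(3, ["N", "V"], lambda i, y_prev, y: 0.0))  # log 8 = 2.079...
```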

70 CRFs: Computing the Gradient $p(y \mid x; w) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y'} \exp(w \cdot \phi(x, y'))}$, with $\phi(x, y) = \sum_{j=1}^{n} \phi(x, j, y_{j-1}, y_j)$. $\frac{\partial}{\partial w_j} L(w) = \sum_{i=1}^{m} \left( \phi_j(x^{(i)}, y^{(i)}) - \sum_y p(y \mid x^{(i)}; w)\, \phi_j(x^{(i)}, y) \right)$, where $\sum_y p(y \mid x^{(i)}; w)\, \phi_j(x^{(i)}, y) = \sum_{k=1}^{n} \sum_{a,b} p(y_{k-1} = a, y_k = b \mid x^{(i)}; w)\, \phi_j(x^{(i)}, k, a, b)$. Can compute these pairwise marginals with the Forward-Backward algorithm. See the notes for full details!

71 Some Numbers Rough accuracies (all words / unknown words): Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (Brants, 2000): 96.7% / 85.5%; MaxEnt P(y|x): 93.7% / 82.6%; MEMM tagger 1: 96.7% / 84.5%; MEMM tagger 2: 96.8% / 86.9%; Perceptron: 97.1%; CRF++: 97.3%; Cyclic tagger: 97.2% / 89.0%; Upper bound: ~98% [Sun 2014]

72 Cyclic Network Train two MEMMs and combine their scores, and be very careful: tune regularization, try lots of different features; see the paper for full details. [Figure: (a) Left-to-Right CMM, (b) Right-to-Left CMM, (c) Bidirectional Dependency Network] [Toutanova et al. 2003]

73 Some Numbers Rough accuracies (all words / unknown words): Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (Brants, 2000): 96.7% / 85.5%; MaxEnt P(y|x): 93.7% / 82.6%; MEMM tagger 1: 96.7% / 84.5%; MEMM tagger 2: 96.8% / 86.9%; Perceptron: 97.1%; CRF++: 97.3%; Cyclic tagger: 97.2% / 89.0%; Upper bound: ~98% [Toutanova et al. 2003]

74 Summary For tagging, the change from a generative to a discriminative model does not by itself result in great improvement. One profits from models for specifying dependence on overlapping features of the observation, such as spelling, suffix analysis, etc. MEMMs allow integration of rich features of the observations. This additional power (of the MEMM, CRF, and Perceptron models) has been shown to result in improvements in accuracy. The higher accuracy of discriminative models comes at the price of much slower training.

75 Domain Effects Accuracies degrade outside of the domain, up to triple the error rate, and you usually make the most errors on the things you care about in the domain (e.g., protein names). Open questions: how to effectively exploit unlabeled data from a new domain (what could we gain?), and how to best incorporate domain lexica in a principled way (e.g., the UMLS specialist lexicon, ontologies).
