LECTURER: BURCU CAN 2018-2019 Spring

Open class (lexical) words:
- Nouns: proper (IBM, Italy), common (cat / cats, snow)
- Verbs: main (see, registered)
- Adjectives: yellow
- Adverbs: slowly
- Numbers: 122,312, one
- ... more

Closed class (functional) words:
- Modals: can, had
- Determiners: the, some
- Prepositions: to, with
- Conjunctions: and, or
- Particles: off, up
- Pronouns: he, its
- ... more

English PoS Tags (Penn Treebank):
CC    conjunction, coordinating: and both but either or
CD    numeral, cardinal: mid-1890 nine-thirty 0.5 one
DT    determiner: a all an every no that the
EX    existential there: there
FW    foreign word: gemeinschaft hund ich jeux
IN    preposition or conjunction, subordinating: among whether out on by if
JJ    adjective or numeral, ordinal: third ill-mannered regrettable
JJR   adjective, comparative: braver cheaper taller
JJS   adjective, superlative: bravest cheapest tallest
MD    modal auxiliary: can may might will would
NN    noun, common, singular or mass: cabbage thermostat investment subhumanity
NNP   noun, proper, singular: Motown Cougar Yvette Liverpool
NNPS  noun, proper, plural: Americans Materials States
NNS   noun, common, plural: undergraduates bric-a-brac averages
POS   genitive marker: ' 's
PRP   pronoun, personal: hers himself it we them
PRP$  pronoun, possessive: her his mine my our ours their thy your
RB    adverb: occasionally maddeningly adventurously
RBR   adverb, comparative: further gloomier heavier less-perfectly
RBS   adverb, superlative: best biggest nearest worst
RP    particle: aboard away back by on open through
TO    "to" as preposition or infinitive marker: to
UH    interjection: huh howdy uh whammo shucks heck
VB    verb, base form: ask bring fire see take
VBD   verb, past tense: pleaded swiped registered saw
VBG   verb, present participle or gerund: stirring focusing approaching erasing
VBN   verb, past participle: dilapidated imitated reunified unsettled
VBP   verb, present tense, not 3rd person singular: twist appear comprise mold postpone
VBZ   verb, present tense, 3rd person singular: bases reconstructs marks uses
WDT   WH-determiner: that what whatever which whichever
WP    WH-pronoun: that what whatever which who whom
WP$   WH-pronoun, possessive: whose
WRB   WH-adverb: however whenever where why

Turkish PoS Tags: see Disambiguating Main POS Tags for Turkish (Ehsani et al., 2012).

Input: the lead paint is unsafe
Output: the/det lead/n paint/n is/v unsafe/adj
Uses:
- text-to-speech (how do we pronounce lead?)
- can write regexps like (Det) Adj* N+ over the output
- if you know the tag, you can back off to it in other tasks

Useful:
- Text-to-speech: record, lead (in English); yüz, bodrum (in Turkish)
- Lemmatization: saw[v] → see, saw[n] → saw
- As a pre-processing step for parsing: less tag ambiguity means fewer parses; however, some tag choices are better decided by parsers

Supervised:
- Rule-based systems
- Statistical PoS tagging
- Neural networks (RNNs)
Unsupervised:
- Statistical PoS tagging
- Neural networks (RNNs)
Partly-supervised

Classification:
- Decision Trees
- Naïve Bayes
- Logistic Regression / Maximum Entropy (MaxEnt)
- Perceptron or Neural Networks
- Support Vector Machines
- Nearest-Neighbour

Sequence Labelling: labels of tokens are dependent on the labels of other tokens. Two standard models:
- Hidden Markov Models (HMMs)
- Conditional Random Fields (CRFs)

Input: the lead paint is unsafe
Output: the/det lead/n paint/n is/v unsafe/adj
How many tags are correct? About 97% currently. But the baseline is already 90%. The baseline is the stupidest possible method:
- Tag every word with its most frequent tag
- Tag unknown words as nouns
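As a concrete reference point, here is a minimal sketch of this baseline, assuming a toy corpus of (word, tag) pairs and the Penn tag NN as the noun fallback:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Count (word, tag) pairs and remember each word's most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def baseline_tag(words, most_frequent_tag, unknown_tag="NN"):
    """Tag every word with its most frequent tag; tag unknown words as nouns."""
    return [(w, most_frequent_tag.get(w, unknown_tag)) for w in words]

# Hypothetical one-sentence corpus, just to show the data format
corpus = [[("the", "DT"), ("lead", "NN"), ("paint", "NN"), ("is", "VBZ"), ("unsafe", "JJ")]]
model = train_baseline(corpus)
print(baseline_tag(["the", "lead", "paint", "is", "toxic"], model))
```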

Example: tag ambiguity in "Fed raises interest rates 0.5 percent", with candidate tags shown above each word:
VBD  VB      VBN       VBZ    VBP  VBZ
NNP  NNS     NN        NNS    CD   NN
Fed  raises  interest  rates  0.5  percent

What should we look at?
Correct tags:  PN  Verb  Det  Noun  Prep  Noun  Prep  Det  Noun
               Bill directed a cortege of autos through the dunes
Some possible tags for each word (maybe more): PN Adj Det Noun Prep Noun Prep Det Noun, plus Verb Verb Noun Verb Adj for some of the words.
Each unknown tag is constrained by its word and by the tags to its immediate left and right. But those tags are unknown too.


Introduce Hidden Markov Models (HMMs) for part-of-speech tagging / sequence classification. Cover the three fundamental questions of HMMs:
- How do we fit the model parameters of an HMM?
- Given an HMM, how do we efficiently calculate the likelihood of an observation w?
- Given an HMM and an observation w, how do we efficiently calculate the most likely state sequence for w?

Green circles are hidden states, dependent only on the previous state (bigram).

Purple nodes are observed states, dependent only on their corresponding hidden state.

[Figure: a chain of hidden states S, each emitting an observation K.] An HMM is specified by {S, K, Π, A, B}:
- S = {s_1 ... s_N} are the values for the hidden states
- K = {k_1 ... k_M} are the values for the observations

[Figure: hidden states linked by transitions A and emitting observations K through B.] {S, K, Π, A, B}:
- Π = {π_i} are the initial state probabilities
- A = {a_ij} are the state transition probabilities
- B = {b_ik} are the observation (emission) probabilities
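To make the notation concrete, here is a minimal sketch of how these components might be stored, using plain Python dictionaries and the transition/emission numbers of the fish/sleep toy example that appears later in the lecture (the layout itself is just an illustrative assumption):

```python
# Hidden state values S and observation values K
states = ["noun", "verb"]
observations = ["fish", "sleep"]

# Initial state probabilities Pi = {pi_i}, modelled as transitions out of a start state
initial = {"noun": 0.8, "verb": 0.2}

# State transition probabilities A = {a_ij}; "end" is a dedicated final state
transition = {
    "noun": {"noun": 0.1, "verb": 0.8, "end": 0.1},
    "verb": {"noun": 0.2, "verb": 0.1, "end": 0.7},
}

# Observation (emission) probabilities B = {b_ik}
emission = {
    "noun": {"fish": 0.8, "sleep": 0.2},
    "verb": {"fish": 0.5, "sleep": 0.5},
}
```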

[Figure: sample HMM for PoS tagging with states start, Det, PropNoun, Noun, Verb, stop; the arcs include start → PropNoun 0.4, PropNoun → Verb 0.8, Verb → Det 0.25, Det → Noun 0.95, with the remaining arcs labelled 0.9, 0.5, 0.5, 0.25, 0.05.]

[Same transition diagram as above.] P(PropNoun Verb Det Noun) = 0.4 * 0.8 * 0.25 * 0.95 * 0.1 = 0.0076 (the final 0.1 is the Noun → stop transition).

[Figure sequence: generating a sentence from the HMM. Each state has an emission distribution (Det: the, a, that; PropNoun: Tom, John, Mary, Alice, Jerry; Noun: cat, dog, car, pen, bed, apple; Verb: bit, ate, played, saw, hit, gave). Walking the path start → PropNoun → Verb → Det → Noun → stop and emitting one word at each state generates the sentence "John bit the apple".]

We want a model of sequences s and observations w:
s_0 → s_1 → s_2 → ... → s_n, emitting w_1, w_2, ..., w_n
Assumptions:
- States are tag n-grams
- Usually a dedicated start and end state / word
- The tag/state sequence is generated by a Markov model
- Words are chosen independently, conditioned only on the tag/state
These are totally broken assumptions: why?

Transitions P(s | s') encode well-formed tag sequences. In a bigram tagger, states = tags:
s_0 = <•>, s_1 = <t_1>, s_2 = <t_2>, ..., s_n = <t_n>, emitting w_1, w_2, ..., w_n

Use standard smoothing methods to estimate transitions:
$P(t_i \mid t_{i-1}, t_{i-2}) = \lambda_2 \hat{P}(t_i \mid t_{i-1}, t_{i-2}) + \lambda_1 \hat{P}(t_i \mid t_{i-1}) + (1 - \lambda_1 - \lambda_2) \hat{P}(t_i)$
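A minimal sketch of this interpolated estimate, assuming the maximum-likelihood trigram, bigram and unigram estimates are already collected in dictionaries; the fixed lambda values are placeholders (in practice they are tuned, e.g. on held-out data):

```python
def interpolated_transition(t, t1, t2, unigram, bigram, trigram, lam1=0.3, lam2=0.5):
    """Interpolated P(t | t1, t2): mix trigram, bigram and unigram MLE estimates.

    trigram[(t2, t1, t)], bigram[(t1, t)], unigram[t] hold the hat-P estimates.
    """
    p_tri = trigram.get((t2, t1, t), 0.0)
    p_bi = bigram.get((t1, t), 0.0)
    p_uni = unigram.get(t, 0.0)
    return lam2 * p_tri + lam1 * p_bi + (1.0 - lam1 - lam2) * p_uni
```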

Emissions are trickier:
- Words we've never seen before
- Words which occur with tags we've never seen
Issue: words aren't black boxes: 343,127.23  11-year  Minteria  reintroducibly

Can do surprisingly well just looking at a word by itself:
- Word: the → the: DT
- Lowercased word: Importantly → importantly: RB
- Prefixes: unfathomable → un-: JJ
- Suffixes: Importantly → -ly: RB
- Capitalization: Meridian → CAP: NNP
- Word shapes: 35-year → d-x: JJ
Then build a MaxEnt (or whatever) model to predict the tag. MaxEnt P(t | w): 93.7% / 82.6% (known / unknown word accuracy). (See the next class for MaxEnt models.)
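A rough sketch of this kind of word-internal feature extraction (word, lowercased word, prefix, suffix, capitalization, word shape); the feature names are illustrative assumptions, and the resulting features would feed a MaxEnt classifier:

```python
import re

def word_features(word):
    """Features computed from the word alone, useful for unknown words."""
    # Crude word shape: map uppercase to X, lowercase to x, digits to d
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return {
        "word=" + word: 1,
        "lower=" + word.lower(): 1,
        "prefix2=" + word[:2]: 1,
        "suffix2=" + word[-2:]: 1,
        "capitalized": int(word[:1].isupper()),
        "shape=" + shape: 1,
    }

print(word_features("Importantly"))   # e.g. suffix2=ly, capitalized=1
```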

Consider all possible state sequences, Q, of length T that the model could have traversed in generating the given observation sequence. Compute the probability of a given state sequence from A, and multiply it by the probabilities of generating each of the given observations in each of the corresponding states in this sequence, to get P(O, Q | λ) = P(O | Q, λ) P(Q | λ). Sum this over all possible state sequences to get P(O | λ). Computationally complex: O(T N^T).

Due to the Markov assumption, the probability of being in any state at any given time t only relies on the probability of being in each of the possible states at time t-1. Forward algorithm: uses dynamic programming to exploit this fact and efficiently compute the observation likelihood in O(T N^2) time. Compute a forward trellis that compactly and implicitly encodes information about all possible state paths.

[Trellis figure: states s_0, s_1, s_2, ..., s_N, s_F laid out over time steps t_1, t_2, t_3, ..., t_{T-1}, t_T.] Continue forward in time until reaching the final time point, and sum the probability of ending in the final state.

Let α_t(j) be the probability of being in state j after seeing the first t observations (by summing over all initial paths leading to j):
$\alpha_t(j) = P(o_1, o_2, \ldots, o_t, q_t = s_j \mid \lambda)$

[Figure: the forward values α_{t-1}(i) for states s_1, s_2, ..., s_N feed into state s_j through the transition probabilities a_{1j}, a_{2j}, ..., a_{Nj} to give α_t(j).] Consider all possible ways of getting to s_j at time t by coming from all possible states s_i, and determine the probability of each. Sum these to get the total probability of being in state s_j at time t while accounting for the first t-1 observations. Then multiply by the probability of actually observing o_t in s_j.

Initialization:
$\alpha_1(j) = a_{0j} \, b_j(o_1), \quad 1 \le j \le N$
Recursion:
$\alpha_t(j) = \Big[ \sum_{i=1}^{N} \alpha_{t-1}(i) \, a_{ij} \Big] b_j(o_t), \quad 1 \le j \le N, \; 1 < t \le T$
Termination:
$P(O \mid \lambda) = \alpha_{T+1}(s_F) = \sum_{i=1}^{N} \alpha_T(i) \, a_{iF}$

Requires only O(T N^2) time to compute the probability of an observed sequence given a model. Exploits the fact that all state sequences must merge into one of the N possible states at any point in time, and the Markov assumption that only the last state affects the next one.
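A minimal sketch of this forward pass, written against the toy dictionary model used earlier, with an explicit "end" state playing the role of the final state s_F; it mirrors the initialization / recursion / termination steps above:

```python
def forward(obs, states, initial, transition, emission):
    """Return P(O | lambda), summing over all state paths via the forward trellis."""
    # Initialization: alpha_1(j) = a_0j * b_j(o_1)
    alpha = [{s: initial[s] * emission[s][obs[0]] for s in states}]
    # Recursion: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * transition[i][j] for i in states) * emission[j][o]
                      for j in states})
    # Termination: P(O | lambda) = sum_i alpha_T(i) * a_iF, here the transition into "end"
    return sum(alpha[-1][i] * transition[i]["end"] for i in states)

states = ["noun", "verb"]
initial = {"noun": 0.8, "verb": 0.2}
transition = {"noun": {"noun": 0.1, "verb": 0.8, "end": 0.1},
              "verb": {"noun": 0.2, "verb": 0.1, "end": 0.7}}
emission = {"noun": {"fish": 0.8, "sleep": 0.2},
            "verb": {"fish": 0.5, "sleep": 0.5}}
print(forward(["fish", "sleep"], states, initial, transition, emission))
```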

What is the most likely sequence of tags t for the given sequence of words w?

Choosing the best tag sequence T = t_1, t_2, ..., t_n for a given word sequence W = w_1, w_2, ..., w_n (sentence). By Bayes' rule:
$\hat{T} = \arg\max_{T \in \tau} P(T \mid W) = \arg\max_{T \in \tau} \frac{P(W \mid T) \, P(T)}{P(W)}$
Since P(W) will be the same for each tag sequence:
$\hat{T} = \arg\max_{T \in \tau} P(W \mid T) \, P(T)$

If we assume a tagged corpus and a trigram language model, then P(T) can be approximated as:
$P(T) \approx P(t_1) \, P(t_2 \mid t_1) \prod_{i=3}^{n} P(t_i \mid t_{i-1}, t_{i-2})$
This formula is simple to evaluate: the probabilities come from simple counting over the tagged corpus (and smoothing).

To evaluate P(W | T), we make the simplifying assumption that each word depends only on its tag:
$P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$
So we want the tag sequence that maximizes:
$\Big[ P(t_1) \, P(t_2 \mid t_1) \prod_{i=3}^{n} P(t_i \mid t_{i-1}, t_{i-2}) \Big] \prod_{i=1}^{n} P(w_i \mid t_i)$
The best tag sequence can be found with the Viterbi algorithm.

Given these two multinomials, we can score any word / tag sequence pair. For
NNP  VBZ     NN        NNS    CD   NN  .
Fed  raises  interest  rates  0.5  percent  .
the trigram states are <•,•>, <•,NNP>, <NNP,VBZ>, <VBZ,NN>, <NN,NNS>, <NNS,CD>, <CD,NN>, <STOP>, and the score is
P(NNP | <•,•>) P(Fed | NNP) P(VBZ | <NNP,•>) P(raises | VBZ) P(NN | VBZ,NNP) ...
In principle, we're done: list all possible tag sequences, score each one, pick the best one (the Viterbi state sequence), e.g.
NNP VBZ NN NNS CD NN   logP = -23
NNP NNS NN NNS CD NN   logP = -29
NNP VBZ VB NNS CD NN   logP = -27

Too many trajectories (state sequences) to list. Option 1: beam search.
[Figure: the beam grows from the empty hypothesis <> to Fed:NNP, Fed:VBN, Fed:VBD, and then to Fed:NNP raises:NNS, Fed:NNP raises:VBZ, Fed:VBN raises:NNS, Fed:VBN raises:VBZ, ...]
A beam is a set of partial hypotheses:
- Start with just the single empty trajectory
- At each derivation step: consider all continuations of previous hypotheses
- Discard most; keep the top k, or those within a factor of the best (or some combination)
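A rough sketch of the beam idea for tagging, assuming a simple bigram score (transition times emission) over toy dictionaries like the ones above; it is meant to show the keep-top-k loop rather than a tuned decoder:

```python
import heapq
import math

def beam_tag(words, tags, transition, emission, k=3, floor=1e-6):
    """Beam search: keep only the k best partial tag sequences after each word."""
    beam = [(0.0, ("<s>",))]                      # (log-score, partial trajectory)
    for word in words:
        candidates = []
        for logp, seq in beam:                    # consider all continuations
            for tag in tags:
                p = (transition.get(seq[-1], {}).get(tag, floor)
                     * emission.get(tag, {}).get(word, floor))
                candidates.append((logp + math.log(p), seq + (tag,)))
        beam = heapq.nlargest(k, candidates)      # discard most, keep the top k
    return list(max(beam)[1][1:])                 # best hypothesis, minus the <s> marker
```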

We want a way of computing exactly the most likely state sequence given the model and some w. BUT: there are exponentially many possible state sequences (T^n). We can, however, calculate it efficiently using dynamic programming (DP). The DP algorithm used for efficient state-sequence inference uses a trellis of paths through state space; it is an instance of what in NLP is called the Viterbi algorithm.

Dynamic program for computing the score of a best path up to position i ending in state s:
$\delta_i(s) = \max_{s_0 \ldots s_{i-1}} P(s_0 \ldots s_{i-1} s, \; w_1 \ldots w_{i-1})$
Base case:
$\delta_0(s) = 1$ if $s = \langle \bullet, \bullet \rangle$, and $0$ otherwise
Recursion:
$\delta_i(s) = \max_{s'} P(s \mid s') \, P(w_{i-1} \mid s') \, \delta_{i-1}(s')$
Also store a back-trace, the most likely previous state for each state:
$\psi_i(s) = \arg\max_{s'} P(s \mid s') \, P(w_{i-1} \mid s') \, \delta_{i-1}(s')$
Iterate on i, storing partial results as you go.
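A minimal Viterbi sketch in the same spirit, using the toy dictionary model with bigram transitions and an explicit "end" state (so δ is indexed by a single previous tag rather than a tag pair); back-pointers are stored so the best sequence can be read off:

```python
def viterbi(obs, states, initial, transition, emission):
    """Return (probability, state sequence) of the single best path for obs."""
    # delta[t][s]: score of the best path ending in state s after t+1 observations
    delta = [{s: initial[s] * emission[s][obs[0]] for s in states}]
    backptr = []
    for o in obs[1:]:
        prev, d, b = delta[-1], {}, {}
        for s in states:
            # delta_i(s) = max_{s'} delta_{i-1}(s') * P(s | s') * P(o | s)
            best = max(states, key=lambda sp: prev[sp] * transition[sp][s])
            d[s] = prev[best] * transition[best][s] * emission[s][o]
            b[s] = best                       # back-trace: best previous state
        delta.append(d)
        backptr.append(b)
    # Termination: fold in the transition into the dedicated "end" state
    last = max(states, key=lambda s: delta[-1][s] * transition[s]["end"])
    best_prob = delta[-1][last] * transition[last]["end"]
    path = [last]
    for b in reversed(backptr):               # follow the back-trace
        path.append(b[path[-1]])
    return best_prob, path[::-1]
```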

Fish sleep.

[Figure: toy HMM with states start, noun, verb, end; the arcs include start → noun 0.8, start → verb 0.2, noun → verb 0.8, verb → noun 0.2, verb → end 0.7.]

A two-word language: fish and sleep. Suppose in our training corpus, fish appears 8 times as a noun and 5 times as a verb, and sleep appears 2 times as a noun and 5 times as a verb. Emission probabilities:
- Noun: P(fish | noun) = 0.8, P(sleep | noun) = 0.2
- Verb: P(fish | verb) = 0.5, P(sleep | verb) = 0.5

[Empty Viterbi trellis: rows start, verb, noun, end; columns for positions 0, 1, 2, 3.]

Filling the trellis column by column (rows start, verb, noun, end; columns 0-3):

Position 0 (initialization): start = 1, verb = 0, noun = 0, end = 0.

Token 1: fish
- verb = 0.2 * 0.5 = 0.1
- noun = 0.8 * 0.8 = 0.64

Token 2: sleep
- if fish is a verb: verb = 0.1 * 0.1 * 0.5 = 0.005, noun = 0.1 * 0.2 * 0.2 = 0.004
- if fish is a noun: verb = 0.64 * 0.8 * 0.5 = 0.256, noun = 0.64 * 0.1 * 0.2 = 0.0128
- take the maximum and set back pointers: verb = 0.256, noun = 0.0128

Token 3: end
- end = max(0.256 * 0.7, 0.0128 * 0.1) = 0.256 * 0.7 = 0.1792
- take the maximum and set the back pointer to verb

Decode by following the back pointers: fish = noun, sleep = verb.
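To double-check the arithmetic, this short sketch recomputes the trellis values above directly from the transition and emission numbers in the walkthrough (the same toy model assumed in the earlier sketches):

```python
p_fish = {"noun": 0.8, "verb": 0.5}
p_sleep = {"noun": 0.2, "verb": 0.5}
start = {"noun": 0.8, "verb": 0.2}
trans = {"noun": {"noun": 0.1, "verb": 0.8, "end": 0.1},
         "verb": {"noun": 0.2, "verb": 0.1, "end": 0.7}}

# Token 1: fish
v1 = {s: start[s] * p_fish[s] for s in ("noun", "verb")}
# Token 2: sleep, taking the max over the previous state (the Viterbi step)
v2 = {s: max(v1[p] * trans[p][s] for p in v1) * p_sleep[s] for s in ("noun", "verb")}
# Token 3: end
v3 = max(v2[s] * trans[s]["end"] for s in v2)

print(v1)   # {'noun': 0.64, 'verb': 0.1}
print(v2)   # {'noun': 0.0128..., 'verb': 0.256}
print(v3)   # 0.1792 -> decode via back pointers: fish = noun, sleep = verb
```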

Choose the most common tag: 90.3% with a bad unknown-word model, 93.7% with a good one.
TnT (Brants, 2000): a carefully smoothed trigram tagger with suffix trees for emissions; 96.7% on WSJ text.
Unsupervised approaches: A Fully Bayesian Approach to Unsupervised PoS Tagging (Goldwater, 2007); Painless Unsupervised Learning with Features (Berg-Kirkpatrick et al., 2010); Unsupervised PoS Tagging with Anchor HMMs (Stratos et al., 2016).
Noise in the data: many errors in the training and test corpora; probably about 2% guaranteed error from noise.

Roadmap of (known / unknown) accuracies:
- Most frequent tag: ~90% / ~50%
- Trigram HMM: ~95% / ~55%
- MaxEnt P(t | w): 93.7% / 82.6%
- TnT (HMM++): 96.2% / 86.0%
- MEMM tagger: 96.9% / 86.9%
- Bidirectional dependencies: 97.2% / 89.0%
Most errors are on unknown words. Upper bound: ~98% (human agreement).

Better features!
- "They left as soon as he arrived." (tags PRP VBD IN RB IN PRP VBD ., with RB and IN confusable for "as"): we could fix this with a feature that looks at the next word.
- "Intrinsic flaws remained undetected." (tags JJ NNS VBD VBN ., with NNP and JJ confusable for the capitalized "Intrinsic"): we could fix this by linking capitalized words to their lowercase versions.
More general solution: maximum-entropy Markov models.
Reality check: taggers are already pretty good on WSJ journal text. What the world needs is taggers that work on other text!

Slides: Dan Klein, Chris Manning, Jason Eisner; Heng Ji, POS Tagging and Syntactic Parsing; Raymond Mooney, POS Tagging and HMMs.