CSE 490U Natural Language Processing, Spring 2016. Parts of Speech. Yejin Choi [Slides adapted from Dan Klein, Luke Zettlemoyer]

Overview: POS tagging; feature-rich techniques; Maximum Entropy Markov Models (MEMMs); structured perceptron; Conditional Random Fields (CRFs).

Parts-of-Speech (English). One basic kind of linguistic structure: syntactic word classes.
Open class (lexical) words: Nouns (proper: IBM, Italy; common: cat/cats, snow), Verbs (main: see, registered), Adjectives (yellow), Adverbs (slowly), Numbers (122,312; one), and more.
Closed class (functional) words: Modals (can, had), Determiners (the, some), Prepositions (to, with), Conjunctions (and, or), Particles (off, up), Pronouns (he, its), and more.

Penn Treebank POS: 36 possible tags, 34 pages of tagging guidelines.
CC (conjunction, coordinating): and both but either or
CD (numeral, cardinal): mid-1890 nine-thirty 0.5 one
DT (determiner): a all an every no that the
EX (existential there): there
FW (foreign word): gemeinschaft hund ich jeux
IN (preposition or conjunction, subordinating): among whether out on by if
JJ (adjective or numeral, ordinal): third ill-mannered regrettable
JJR (adjective, comparative): braver cheaper taller
JJS (adjective, superlative): bravest cheapest tallest
MD (modal auxiliary): can may might will would
NN (noun, common, singular or mass): cabbage thermostat investment subhumanity
NNP (noun, proper, singular): Motown Cougar Yvette Liverpool
NNPS (noun, proper, plural): Americans Materials States
NNS (noun, common, plural): undergraduates bric-a-brac averages
POS (genitive marker): ' 's
PRP (pronoun, personal): hers himself it we them
PRP$ (pronoun, possessive): her his mine my our ours their thy your
RB (adverb): occasionally maddeningly adventurously
RBR (adverb, comparative): further gloomier heavier less-perfectly
RBS (adverb, superlative): best biggest nearest worst
RP (particle): aboard away back by on open through
ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz

Penn Treebank POS (continued):
TO ("to" as preposition or infinitive marker): to
UH (interjection): huh howdy uh whammo shucks heck
VB (verb, base form): ask bring fire see take
VBD (verb, past tense): pleaded swiped registered saw
VBG (verb, present participle or gerund): stirring focusing approaching erasing
VBN (verb, past participle): dilapidated imitated reunified unsettled
VBP (verb, present tense, not 3rd person singular): twist appear comprise mold postpone
VBZ (verb, present tense, 3rd person singular): bases reconstructs marks uses
WDT (WH-determiner): that what whatever which whichever
WP (WH-pronoun): that what whatever which who whom
WP$ (WH-pronoun, possessive): whose
WRB (WH-adverb): however whenever where why
ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz

Part-of-Speech Ambiguity. Words can have multiple parts of speech:
Fed raises interest rates 0.5 percent
("Fed" can be NNP, VBD, or VBN; "raises", "interest", and "rates" can each be a verb (VBZ/VBP) or a noun (NNS/NN); "0.5" is CD; "percent" is NN.)
Two basic sources of constraint: the grammatical environment and the identity of the current word. Many more possible features: suffixes, capitalization, name databases (gazetteers), etc.

Why POS Tagging? Useful in and of itself (more than you'd think): text-to-speech (record, lead), lemmatization (saw[v] → see, saw[n] → saw), and quick-and-dirty NP-chunk detection: grep {JJ | NN}* {NN | NNS} (see the sketch below). Also useful as a pre-processing step for parsing: less tag ambiguity means fewer parses. However, some tag choices are better decided by parsers:
DT NNP NN VBD VBN RP NN NNS: The Georgia branch had taken on loan commitments (is "on" a particle or a preposition?)
DT NN IN NN VBD NNS VBD: The average of interbank offered rates plummeted (is "offered" a past tense or a past participle?)
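As a toy illustration of the quick-and-dirty chunk pattern above, here is a minimal sketch (not from the slides) that applies the {JJ | NN}* {NN | NNS} idea as a regular expression over the tag sequence of an already-tagged sentence; the word/TAG string is made up for the example.

```python
import re

# Hypothetical tagged sentence in word/TAG format, made up for illustration.
tagged = "The/DT Georgia/NNP branch/NN had/VBD taken/VBN on/RP loan/NN commitments/NNS"
pairs = [tok.rsplit("/", 1) for tok in tagged.split()]

# Write the slide's pattern {JJ | NN}* {NN | NNS} as a regex over the tag sequence.
tag_str = " ".join(tag for _, tag in pairs) + " "
pattern = re.compile(r"(?:(?:JJ|NN) )*(?:NN|NNS) ")

for m in pattern.finditer(tag_str):
    start = tag_str[: m.start()].count(" ")   # index of the first matched token
    length = m.group().count(" ")             # number of matched tokens
    print(" ".join(w for w, _ in pairs[start : start + length]))
# Prints: "branch" and "loan commitments"
```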

Baselines and Upper Bounds. Baseline: choose the most common tag for each word; this gets 90.3% with a bad unknown-word model and 93.7% with a good one. Noise in the data: there are many errors in the training and test corpora, probably about 2% guaranteed error from noise (on this data); for example, "chief executive officer" appears in the corpus with several different, mutually inconsistent tag sequences.
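A minimal sketch (not from the slides) of the most-frequent-tag baseline: count (word, tag) pairs on training data and tag each test word with its most common training tag, backing off to the globally most common tag for unknown words (a deliberately crude "unknown word model"). The tiny corpus is made up for illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up training corpus of (word, tag) pairs.
train = [
    [("the", "DT"), ("Fed", "NNP"), ("raises", "VBZ"), ("rates", "NNS")],
    [("interest", "NN"), ("rates", "NNS"), ("rose", "VBD")],
]

word_tag_counts = defaultdict(Counter)
tag_counts = Counter()
for sentence in train:
    for word, tag in sentence:
        word_tag_counts[word][tag] += 1
        tag_counts[tag] += 1

default_tag = tag_counts.most_common(1)[0][0]  # crude back-off for unknown words

def most_frequent_tag_baseline(words):
    return [
        word_tag_counts[w].most_common(1)[0][0] if w in word_tag_counts else default_tag
        for w in words
    ]

print(most_frequent_tag_baseline(["the", "Fed", "raises", "unknownword"]))
# e.g. ['DT', 'NNP', 'VBZ', 'NNS']
```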

Ambiguity in POS Tagging.
Particle (RP) vs. preposition (IN): He talked over the deal. / He talked over the telephone.
Past tense (VBD) vs. past participle (VBN): The horse walked past the barn. / The horse walked past the barn fell.
Noun vs. adjective: The executive decision.
Noun vs. present participle: Fishing can be fun.

Ambiguity in POS Tagging.
"Like" can be a verb or a preposition: I like/VBP candy. / Time flies like/IN an arrow.
"Around" can be a preposition, particle, or adverb: I bought it at the shop around/IN the corner. / I never got around/RP to getting a car. / A new Prius costs around/RB 25K.

Overview: Accuracies. Roadmap of (known / unknown word) accuracies:
Most freq tag: ~90% / ~50%
Trigram HMM: ~95% / ~55%
TnT (Brants, 2000), a carefully smoothed trigram tagger with suffix trees for emissions: 96.7% on WSJ text (state of the art is ~97.5%), with most errors on unknown words
Upper bound: ~98%

Common Errors [from Toutanova & Manning 00]:
NN/JJ NN: official knowledge
VBD RP/IN DT NN: made up the story
RB VBD/VBN NNS: recently sold shares

What about better features? Choosing the most common tag gets 90.3% with a bad unknown-word model and 93.7% with a good one. What about looking at a word and its environment, but using no sequence information, i.e., predicting each tag s_i from the surrounding words (x_{i-1}, x_i, x_{i+1}) alone?
Add in the previous / next word (e.g., "__ the")
Previous / next word shapes (e.g., "X __ X")
Occurrence pattern features ([X: x X occurs])
Crude entity detection ("__ ..... (Inc. | Co.)")
Phrasal verb in sentence? ("put ...")
Conjunctions of these things
This uses lots of features: > 200K. (A feature-extraction sketch follows below.)
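A minimal sketch (not from the slides) of this kind of local, non-sequential feature extraction: given a sentence and a position, emit string-valued features for the word, its neighbors, crude word shapes, and a simple suffix. All feature names are made up for illustration.

```python
import re

def word_shape(w):
    # Collapse characters into a crude shape, e.g. "McDonald's" -> "XxXx'x".
    shape = re.sub(r"[A-Z]", "X", w)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)  # squeeze repeated shape characters

def local_features(words, i):
    w = words[i]
    prev_w = words[i - 1] if i > 0 else "<s>"
    next_w = words[i + 1] if i + 1 < len(words) else "</s>"
    feats = [
        f"word={w.lower()}",
        f"prev_word={prev_w.lower()}",
        f"next_word={next_w.lower()}",
        f"shape={word_shape(w)}",
        f"prev_next_shape={word_shape(prev_w)}_{word_shape(next_w)}",
        f"suffix3={w[-3:].lower()}",
        f"is_capitalized={w[:1].isupper()}",
        # A conjunction of simpler features, as the slide suggests.
        f"prev_word+word={prev_w.lower()}_{w.lower()}",
    ]
    return feats

print(local_features("The Georgia branch had taken on loan commitments".split(), 1))
```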

Overview: Accuracies. Roadmap of (known / unknown word) accuracies:
Most freq tag: ~90% / ~50%
Trigram HMM: ~95% / ~55%
TnT (HMM++): 96.2% / 86.0%
Maxent P(s_i | x): 96.8% / 86.8%
Upper bound: ~98%
Q: What does this say about sequence models? Q: How do we add more features to our sequence models?

MEMM Taggers. One step up: also condition on previous tags.
$$p(s_1 \ldots s_m \mid x_1 \ldots x_m) = \prod_{i=1}^{m} p(s_i \mid s_1 \ldots s_{i-1}, x_1 \ldots x_m) = \prod_{i=1}^{m} p(s_i \mid s_{i-1}, x_1 \ldots x_m)$$
Train up $p(s_i \mid s_{i-1}, x_1 \ldots x_m)$ as a discrete log-linear (maxent) model, then use it to score sequences:
$$p(s_i \mid s_{i-1}, x_1 \ldots x_m) = \frac{\exp\left(w \cdot \phi(x_1 \ldots x_m, i, s_{i-1}, s_i)\right)}{\sum_{s'} \exp\left(w \cdot \phi(x_1 \ldots x_m, i, s_{i-1}, s')\right)}$$
This is referred to as an MEMM tagger [Ratnaparkhi 96]. Beam search is effective! (Why?) What's the advantage of beam size 1?
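A minimal sketch (not from the slides) of left-to-right MEMM decoding with a beam, assuming a hypothetical local_log_prob(words, i, prev_tag, tag) that would return log p(s_i | s_{i-1}, x) from a trained maxent model; here it is stubbed with a uniform score so the sketch runs end to end. With beam_size=1 this reduces to the greedy tagger the slide asks about.

```python
import math

TAGS = ["DT", "NN", "NNP", "NNS", "VBD", "VBZ", "IN"]  # toy tag set for illustration

def local_log_prob(words, i, prev_tag, tag):
    # Placeholder for a trained maxent model's log p(s_i | s_{i-1}, x_1..x_m).
    return -math.log(len(TAGS))

def beam_decode(words, beam_size=3):
    # Each hypothesis is (log score, partial tag sequence); start with the empty sequence.
    beam = [(0.0, [])]
    for i in range(len(words)):
        candidates = []
        for score, seq in beam:
            prev_tag = seq[-1] if seq else "<START>"
            for tag in TAGS:
                candidates.append((score + local_log_prob(words, i, prev_tag, tag), seq + [tag]))
        # Keep only the best beam_size partial sequences.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beam[0][1]

print(beam_decode("Fed raises interest rates".split(), beam_size=1))  # greedy decoding
```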

The HMM State Lattice / Trellis (repeat slide): a lattice over "START Fed raises interest rates STOP", where each arc combines a transition score q(s_i | s_{i-1}) and an emission score e(x_i | s_i), ending with e(STOP | s_m).

The MEMM State Lattice / Trellis: the same lattice over x = "START Fed raises interest rates STOP", but each arc is scored directly with p(s_i | s_{i-1}, x).

Decoding. Decoding maxent taggers works just like decoding HMMs: Viterbi, beam search, or posterior decoding.
Viterbi algorithm (HMMs): define $\pi(i, s_i)$ to be the max score of a sequence of length $i$ ending in tag $s_i$:
$$\pi(i, s_i) = \max_{s_{i-1}} e(x_i \mid s_i)\, q(s_i \mid s_{i-1})\, \pi(i-1, s_{i-1})$$
Viterbi algorithm (Maxent): we can use the same algorithm for MEMMs, we just need to redefine $\pi(i, s_i)$:
$$\pi(i, s_i) = \max_{s_{i-1}} p(s_i \mid s_{i-1}, x_1 \ldots x_m)\, \pi(i-1, s_{i-1})$$
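A minimal Viterbi sketch (not from the slides) for the MEMM/local-score case: it takes any function local_log_prob(words, i, prev_tag, tag), for instance the stub from the MEMM sketch above. Plugging in log e(x_i | s_i) + log q(s_i | s_{i-1}) gives the HMM version, and w · φ(x, i, s_{i-1}, s_i) gives the perceptron/CRF version discussed later.

```python
def viterbi(words, tags, local_log_prob):
    # pi[i][s]   = best log score of a length-(i+1) tag sequence ending in tag s
    # back[i][s] = previous tag achieving that score (for back-pointer recovery)
    pi = [{} for _ in words]
    back = [{} for _ in words]
    for s in tags:
        pi[0][s] = local_log_prob(words, 0, "<START>", s)
        back[0][s] = None
    for i in range(1, len(words)):
        for s in tags:
            best_prev, best_score = None, float("-inf")
            for s_prev in tags:
                score = pi[i - 1][s_prev] + local_log_prob(words, i, s_prev, s)
                if score > best_score:
                    best_prev, best_score = s_prev, score
            pi[i][s], back[i][s] = best_score, best_prev
    # Follow back-pointers from the best final tag.
    s = max(pi[-1], key=pi[-1].get)
    seq = [s]
    for i in range(len(words) - 1, 0, -1):
        s = back[i][s]
        seq.append(s)
    return list(reversed(seq))
```

For example, viterbi("Fed raises interest rates".split(), TAGS, local_log_prob) with the toy definitions from the MEMM sketch returns a full tag sequence in O(m · |tags|^2) time.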

Overview: Accuracies. Roadmap of (known / unknown word) accuracies:
Most freq tag: ~90% / ~50%
Trigram HMM: ~95% / ~55%
TnT (HMM++): 96.2% / 86.0%
Maxent P(s_i | x): 96.8% / 86.8%
MEMM tagger: 96.9% / 86.9%
Upper bound: ~98%

Global Discriminative Taggers. Newer, higher-powered discriminative sequence models: CRFs (also perceptrons and M3Ns). They do not decompose training into independent local regions, and can be deathly slow to train, since they require repeated inference on the training set. The differences can vary in importance, depending on the task. However, one issue is worth knowing about in local models: label bias and other "explaining away" effects. An MEMM tagger's local scores can be near one without having both good transitions and emissions, which means that evidence often doesn't flow properly. Why isn't this a big deal for POS tagging? Also: in decoding, we condition on predicted, not gold, histories.

Linear Models: Perceptron [Collins 02]. The perceptron algorithm iteratively processes the training set, reacting to training errors; it can be thought of as trying to drive down training error. The (online) perceptron algorithm, for a sentence $x = x_1 \ldots x_m$ and tag sequence $y = s_1 \ldots s_m$:
Start with zero weights $w$.
Visit training instances $(x_i, y_i)$ one by one and make a prediction $y^* = \arg\max_y w \cdot \phi(x_i, y)$.
If correct ($y^* = y_i$): no change, go to the next example.
If wrong: adjust the weights, $w = w + \phi(x_i, y_i) - \phi(x_i, y^*)$.
Challenge: how do we compute the argmax efficiently?
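A minimal sketch (not from the slides) of the online structured perceptron update, assuming a hypothetical sentence-level feature function phi(x, y) that returns a dict of feature counts and a decode(x, w) routine computing argmax_y w · φ(x, y), e.g. built from the Viterbi sketch above with local scores w · φ(x, i, s_{i-1}, s_i).

```python
from collections import Counter

def perceptron_train(data, phi, decode, epochs=5):
    """data: list of (words, gold_tags) pairs; returns a weight dict."""
    w = Counter()  # start with zero weights
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = decode(x, w)          # argmax_y  w . phi(x, y)
            if y_pred != y_gold:           # on a mistake, move toward gold, away from prediction
                w.update(phi(x, y_gold))
                w.subtract(phi(x, y_pred))
    return w
```

The decoding challenge the slide raises is exactly why the features must decompose into local φ(x, j, s_{j-1}, s_j) terms, as the next slide spells out.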

Linear Perceptron Decoding.
$$s^* = \arg\max_s w \cdot \phi(x, s)$$
Features must be local: for $x = x_1 \ldots x_m$ and $s = s_1 \ldots s_m$,
$$\phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$$

The MEMM State Lattice / Trellis (repeat): arcs over x = "START Fed raises interest rates STOP" are scored by local probabilities p(s_i | s_{i-1}, x), which multiply along a path.

The Perceptron State Lattice / Trellis: the same lattice over x = "START Fed raises interest rates STOP", but arcs are scored by local linear scores $w \cdot \phi(x, i, s_{i-1}, s_i)$, which add along a path.

Linear Perceptron Decoding.
$$s^* = \arg\max_s w \cdot \phi(x, s), \qquad \phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$$
Define $\pi(i, s_i)$ to be the max score of a sequence of length $i$ ending in tag $s_i$:
$$\pi(i, s_i) = \max_{s_{i-1}} w \cdot \phi(x, i, s_{i-1}, s_i) + \pi(i-1, s_{i-1})$$
Compare with the Viterbi algorithm for HMMs:
$$\pi(i, s_i) = \max_{s_{i-1}} e(x_i \mid s_i)\, q(s_i \mid s_{i-1})\, \pi(i-1, s_{i-1})$$
and for maxent (MEMM) taggers:
$$\pi(i, s_i) = \max_{s_{i-1}} p(s_i \mid s_{i-1}, x_1 \ldots x_m)\, \pi(i-1, s_{i-1})$$

Overview: Accuracies. Roadmap of (known / unknown word) accuracies:
Most freq tag: ~90% / ~50%
Trigram HMM: ~95% / ~55%
TnT (HMM++): 96.2% / 86.0%
Maxent P(s_i | x): 96.8% / 86.8%
MEMM tagger: 96.9% / 86.9%
Perceptron: 96.7% / ??
Upper bound: ~98%

Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira 01]. Maximum entropy (logistic regression) over whole sequences: for a sentence $x = x_1 \ldots x_m$ and tag sequence $y = s_1 \ldots s_m$,
$$p(y \mid x; w) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y'} \exp(w \cdot \phi(x, y'))}$$
Learning: maximize the (log) conditional likelihood of the training data $\{(x_i, y_i)\}_{i=1}^{n}$:
$$\frac{\partial}{\partial w_j} L(w) = \sum_{i=1}^{n} \left( \phi_j(x_i, y_i) - \sum_{y} p(y \mid x_i; w)\, \phi_j(x_i, y) \right) - \lambda w_j$$
Computational challenges? The most likely tag sequence, the normalization constant, and the gradient.
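To make the definition concrete, here is a minimal sketch (not from the slides) that computes $p(y \mid x; w)$ by brute-force enumeration over all tag sequences for a tiny tag set. It is exponential in sentence length and only meant to illustrate the formula before the efficient forward-algorithm slides that follow; the feature function and weights are made-up toys.

```python
import itertools, math

TAGS = ["N", "V", "D"]  # toy tag set

def phi(x, y):
    # Toy local features: (tag, word) emissions and (prev_tag, tag) transitions.
    feats = {}
    prev = "<START>"
    for word, tag in zip(x, y):
        feats[("emit", tag, word)] = feats.get(("emit", tag, word), 0) + 1
        feats[("trans", prev, tag)] = feats.get(("trans", prev, tag), 0) + 1
        prev = tag
    return feats

def score(x, y, w):
    return sum(w.get(f, 0.0) * v for f, v in phi(x, y).items())

def crf_prob(x, y, w):
    # p(y | x; w) = exp(w . phi(x, y)) / sum_{y'} exp(w . phi(x, y'))
    log_z = math.log(sum(math.exp(score(x, yp, w))
                         for yp in itertools.product(TAGS, repeat=len(x))))
    return math.exp(score(x, y, w) - log_z)

w = {("emit", "N", "dog"): 1.0, ("trans", "D", "N"): 0.5}  # made-up weights
print(crf_prob(["the", "dog"], ("D", "N"), w))
```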

CRFs: Decoding.
$$s^* = \arg\max_s p(s \mid x; w), \qquad p(s \mid x; w) = \frac{\exp(w \cdot \phi(x, s))}{\sum_{s'} \exp(w \cdot \phi(x, s'))}$$
Features must be local: for $x = x_1 \ldots x_m$ and $s = s_1 \ldots s_m$, $\phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$.
$$\arg\max_s \frac{\exp(w \cdot \phi(x, s))}{\sum_{s'} \exp(w \cdot \phi(x, s'))} = \arg\max_s \exp(w \cdot \phi(x, s)) = \arg\max_s w \cdot \phi(x, s)$$
Same as the linear perceptron!
$$\pi(i, s_i) = \max_{s_{i-1}} w \cdot \phi(x, i, s_{i-1}, s_i) + \pi(i-1, s_{i-1})$$

CRFs: Computing the Normalization Constant*
$$p(s \mid x; w) = \frac{\exp(w \cdot \phi(x, s))}{\sum_{s'} \exp(w \cdot \phi(x, s'))}, \qquad \phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$$
$$\sum_{s'} \exp\left(w \cdot \phi(x, s')\right) = \sum_{s'} \exp\left( \sum_{j} w \cdot \phi(x, j, s'_{j-1}, s'_j) \right) = \sum_{s'} \prod_{j} \exp\left(w \cdot \phi(x, j, s'_{j-1}, s'_j)\right)$$
Define $\mathrm{norm}(i, s_i)$ to be the sum of scores for sequences ending in tag $s_i$ at position $i$:
$$\mathrm{norm}(i, s_i) = \sum_{s_{i-1}} \exp\left(w \cdot \phi(x, i, s_{i-1}, s_i)\right)\, \mathrm{norm}(i-1, s_{i-1})$$
This is the forward algorithm! Remember the HMM case:
$$\alpha(i, y_i) = \sum_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \alpha(i-1, y_{i-1})$$
Could we also use the backward algorithm?
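A minimal sketch (not from the slides) of this forward recursion for $\log Z(x)$, assuming a hypothetical local_score(x, i, prev_tag, tag) that returns $w \cdot \phi(x, i, s_{i-1}, s_i)$; it works in log space with a logsumexp helper for numerical stability.

```python
import math

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def log_partition(x, tags, local_score):
    # norm[s] holds the log of the summed scores of all prefixes ending in tag s.
    norm = {s: local_score(x, 0, "<START>", s) for s in tags}
    for i in range(1, len(x)):
        norm = {
            s: logsumexp([norm[s_prev] + local_score(x, i, s_prev, s) for s_prev in tags])
            for s in tags
        }
    return logsumexp(list(norm.values()))   # log Z(x)
```

Here exp(log_partition(x, tags, local_score)) is the denominator $\sum_{s'} \exp(w \cdot \phi(x, s'))$ above, and on tiny inputs it can be sanity-checked against the brute-force enumeration sketch from the CRF slide.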

CRFs: Computing the Gradient*
$$p(s \mid x; w) = \frac{\exp(w \cdot \phi(x, s))}{\sum_{s'} \exp(w \cdot \phi(x, s'))}, \qquad \phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$$
$$\frac{\partial}{\partial w_k} L(w) = \sum_{i=1}^{n} \left( \phi_k(x_i, s_i) - \sum_{s} p(s \mid x_i; w)\, \phi_k(x_i, s) \right)$$
The expected feature counts decompose over positions and adjacent tag pairs:
$$\sum_{s} p(s \mid x_i; w)\, \phi_k(x_i, s) = \sum_{j=1}^{m} \sum_{a,b} \left( \sum_{s:\, s_{j-1}=a,\, s_j=b} p(s \mid x_i; w) \right) \phi_k(x_i, j, a, b)$$
We need forward and backward messages to compute these pairwise marginals. See the notes for full details! (A sketch follows below.)
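A minimal sketch (not from the slides) of the forward and backward messages needed for the pairwise marginals p(s_{j-1}=a, s_j=b | x) in the expected feature counts. It assumes the same hypothetical local_score(x, i, prev_tag, tag) = w · φ(x, i, s_{i-1}, s_i) as the normalization sketch and repeats the small logsumexp helper so it is self-contained.

```python
import math

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def pairwise_marginals(x, tags, local_score):
    n = len(x)
    # Forward: alpha[i][s] = log-sum of scores of all prefixes ending in tag s at position i.
    alpha = [{} for _ in range(n)]
    for s in tags:
        alpha[0][s] = local_score(x, 0, "<START>", s)
    for i in range(1, n):
        for s in tags:
            alpha[i][s] = logsumexp([alpha[i - 1][a] + local_score(x, i, a, s) for a in tags])
    # Backward: beta[i][s] = log-sum of scores of all suffixes after position i, given s_i = s.
    beta = [{} for _ in range(n)]
    for s in tags:
        beta[n - 1][s] = 0.0
    for i in range(n - 2, -1, -1):
        for s in tags:
            beta[i][s] = logsumexp([local_score(x, i + 1, s, b) + beta[i + 1][b] for b in tags])
    log_z = logsumexp([alpha[n - 1][s] for s in tags])
    # Pairwise marginals p(s_{i-1} = a, s_i = b | x) for every interior position i.
    return {
        (i, a, b): math.exp(alpha[i - 1][a] + local_score(x, i, a, b) + beta[i][b] - log_z)
        for i in range(1, n)
        for a in tags
        for b in tags
    }
```

The i = 0 arc marginal is exp(alpha[0][s] + beta[0][s] - log_z), and weighting each local feature vector by these marginals gives the expected counts in the gradient above.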

Overview: Accuracies. Roadmap of (known / unknown word) accuracies:
Most freq tag: ~90% / ~50%
Trigram HMM: ~95% / ~55%
TnT (HMM++): 96.2% / 86.0%
Maxent P(s_i | x): 96.8% / 86.8%
MEMM tagger: 96.9% / 86.9%
Perceptron: 96.7% / ??
CRF (untuned): 95.7% / 76.2%
Upper bound: ~98%

Cyclic Network [Toutanova et al 03]. Train two MEMMs and multiply them together to score. And be very careful: tune the regularization and try lots of different features; see the paper for full details. (a) Left-to-Right CMM; (b) Right-to-Left CMM; (c) Bidirectional Dependency Network.

Overview: Accuracies. Roadmap of (known / unknown word) accuracies:
Most freq tag: ~90% / ~50%
Trigram HMM: ~95% / ~55%
TnT (HMM++): 96.2% / 86.0%
Maxent P(s_i | x): 96.8% / 86.8%
MEMM tagger: 96.9% / 86.9%
Perceptron: 96.7% / ??
CRF (untuned): 95.7% / 76.2%
Cyclic tagger: 97.2% / 89.0%
Upper bound: ~98%

Domain Effects. Accuracies degrade outside of the training domain, up to a tripled error rate, and taggers usually make the most errors on the things you care about in the new domain (e.g., protein names). Open questions: how to effectively exploit unlabeled data from a new domain (what could we gain?), and how to best incorporate domain lexica in a principled way (e.g., the UMLS SPECIALIST lexicon, ontologies).