Part-of-Speech Tagging + Neural Networks CS 287

Quiz Last class we focused on the hinge loss,

L_hinge = max{0, 1 − (ŷ_c − ŷ_c′)}

Consider now the squared hinge loss (also called the ℓ2-SVM loss),

L_hinge² = max{0, 1 − (ŷ_c − ŷ_c′)}²

(where c is the gold class and c′ is the highest-scoring other class). What effect does this have on the loss? How do the parameter gradients change?

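As a minimal illustrative sketch (pure Python; the names y_hat_c and y_hat_cprime for the gold-class and best-other-class scores are assumptions, not course code), the piecewise-constant hinge gradient becomes a gradient that scales with the size of the margin violation:

    def hinge(y_hat_c, y_hat_cprime):
        # standard hinge: max{0, 1 - (gold score - best other score)}
        return max(0.0, 1.0 - (y_hat_c - y_hat_cprime))

    def hinge_grad(y_hat_c, y_hat_cprime):
        # gradient w.r.t. y_hat_c: -1 inside the margin, 0 outside
        return -1.0 if hinge(y_hat_c, y_hat_cprime) > 0 else 0.0

    def sq_hinge(y_hat_c, y_hat_cprime):
        return hinge(y_hat_c, y_hat_cprime) ** 2

    def sq_hinge_grad(y_hat_c, y_hat_cprime):
        # gradient w.r.t. y_hat_c now scales with the violation
        return -2.0 * hinge(y_hat_c, y_hat_cprime)

    print(hinge_grad(0.9, 0.0), sq_hinge_grad(0.9, 0.0))  # -1.0 -0.2
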
Contents Part-of-Speech Data Part-of-Speech Models Bilinear Model Windowed Models

Penn Treebank (Marcus et al., 1993)

( (S
    (CC But)
    (SBAR-ADV (IN while)
      (S
        (NP-SBJ (DT the) (NNP New) (NNP York) (NNP Stock) (NNP Exchange))
        (VP (VBD did) (RB n't)
          (VP (VB fall)
            (ADVP-CLR (RB apart))
            (NP-TMP (NNP Friday))
            (SBAR-TMP (IN as)
              (S
                (NP-SBJ (DT the) (NNP Dow) (NNP Jones) (NNP Industrial) (NNP Average))
                (VP (VBD plunged)
                  (NP-EXT
                    (NP (CD 190.58) (NNS points))
                    (PRN (: -)
                      (NP
                        (NP (JJS most))
                        (PP (IN of) (NP (PRP it)))
                        (PP-TMP (IN in) (NP (DT the) (JJ final) (NN hour))))
                      (: -)))))))))
    (NP-SBJ-2 (PRP it))
    (ADVP (RB barely))
    (VP (VBD managed)
      (S
        (NP-SBJ (-NONE- *-2))
        (VP (TO to)
          (VP (VB stay)
            (NP-LOC-PRD
              (NP (DT this) (NN side))
              (PP (IN of) (NP (NN chaos))))))))
    (. .)))

Syntax [constituency parse-tree figure for the sentence "Battle-tested industrial managers here always buck up nervous newcomers with the tale of the first of their countrymen to visit Mexico, a boatload of warriors blown ashore 375 years ago."]

Tagging So what if Steinbach had struck just seven home runs in 130 regular-season games, and batted in the seventh position of the A's lineup.

Part-of-Speech Tags So/RB what/WP if/IN Steinbach/NNP had/VBD struck/VBN just/RB seven/CD home/NN runs/NNS in/IN 130/CD regular-season/JJ games/NNS ,/, and/CC batted/VBD in/IN the/DT seventh/JJ position/NN of/IN the/DT A/NNP 's/NNP lineup/NN ./.

Simplified English Tagset I
1. , Punctuation
2. CC Coordinating conjunction
3. CD Cardinal number
4. DT Determiner
5. EX Existential there
6. FW Foreign word
7. IN Preposition or subordinating conjunction
8. JJ Adjective
9. JJR Adjective, comparative
10. JJS Adjective, superlative
11. LS List item marker

Simplified English Tagset II
12. MD Modal
13. NN Noun, singular or mass
14. NNS Noun, plural
15. NNP Proper noun, singular
16. NNPS Proper noun, plural
17. PDT Predeterminer
18. POS Possessive ending
19. PRP Personal pronoun
20. PRP$ Possessive pronoun
21. RB Adverb
22. RBR Adverb, comparative

Simplified English Tagset III
23. RBS Adverb, superlative
24. RP Particle
25. SYM Symbol
26. TO to
27. UH Interjection
28. VB Verb, base form
29. VBD Verb, past tense
30. VBG Verb, gerund or present participle
31. VBN Verb, past participle
32. VBP Verb, non-3rd person singular present
33. VBZ Verb, 3rd person singular present

Simplified English Tagset IV
34. WDT Wh-determiner
35. WP Wh-pronoun
36. WP$ Possessive wh-pronoun
37. WRB Wh-adverb

NN or NNS Whether a noun is tagged singular or plural depends not on its semantic properties, but on whether it triggers singular or plural agreement on a verb. We illustrate this below for common nouns, but the same criterion also applies to proper nouns. Any noun that triggers singular agreement on a verb should be tagged as singular, even if it ends in final -s. EXAMPLE: Linguistics/NN is/*are a difficult field. If a noun is semantically plural or collective, but triggers singular agreement, it should be tagged as singular. EXAMPLES: The group/NN has/*have disbanded. The jury/NN is/*are deliberating.

Language Specific Which of these tags are English-only? Are there phenomena that these don't cover? Should our models be language-specific?

Universal Part-of-Speech Tags (Petrov et al., 2012)
1. VERB - verbs (all tenses and modes)
2. NOUN - nouns (common and proper)
3. PRON - pronouns
4. ADJ - adjectives
5. ADV - adverbs
6. ADP - adpositions (prepositions and postpositions)
7. CONJ - conjunctions
8. DET - determiners
9. NUM - cardinal numbers
10. PRT - particles or other function words
11. X - other: foreign words, typos, abbreviations
12. . - punctuation

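As a small illustration, a Python sketch of a PTB-to-universal mapping in the spirit of Petrov et al. (2012); this is only a subset of the published mapping, and the fallback to X is an assumption of the sketch, not part of the original tables.

    PTB_TO_UNIVERSAL = {
        "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
        "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
        "VBP": "VERB", "VBZ": "VERB", "MD": "VERB",
        "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
        "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
        "IN": "ADP", "DT": "DET", "CD": "NUM",
        "PRP": "PRON", "PRP$": "PRON",
        "CC": "CONJ", "RP": "PRT", "FW": "X", ".": ".",
    }

    def to_universal(ptb_tag):
        # unseen tags fall back to the catch-all class X in this toy version
        return PTB_TO_UNIVERSAL.get(ptb_tag, "X")

    print(to_universal("NNP"))  # NOUN
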
Why do tags matter? Interesting linguistic question. Used for many downstream NLP tasks. A benchmark linguistic NLP task. However, note: PTB tagging may be essentially solved (Manning, 2011), and there is some deep-learning skepticism.

Contents Part-of-Speech Data Part-of-Speech Models Bilinear Model Windowed Models

Strawman: Sparse Word-only Tagging Models Let F just be the set of word types, and C be the set of part-of-speech tags, with |C| ≈ 40. Proposal: Use a linear model, ŷ = f(xW + b)

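A minimal numpy sketch of this strawman (the toy vocabulary, tag set, and initialization are illustrative assumptions): each word type contributes exactly one row of weights, so xW + b reduces to a per-word score lookup.

    import numpy as np

    vocab = {"the": 0, "dog": 1, "runs": 2}   # toy word-type set F
    tags = {"DT": 0, "NN": 1, "VBZ": 2}       # toy tag set C
    W = np.random.randn(len(vocab), len(tags)) * 0.01
    b = np.zeros(len(tags))

    def predict_tag(word):
        x = np.zeros(len(vocab))
        x[vocab[word]] = 1.0                  # sparse one-hot word feature
        scores = x @ W + b                    # y-hat = f(xW + b)
        return max(tags, key=lambda t: scores[tags[t]])

    print(predict_tag("dog"))
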
Why is tagging hard?
1. Rare words: 3% of tokens in PTB dev are unseen. What can we even do with these?
2. Ambiguous words: around 50% of seen dev tokens are ambiguous in train. How can we decide between different tags for the same type?

Better Tag Features: Word Properties Representation can use specific aspects of text. F; prefixes, suffixes, hyphens, first capital, all-capital, has-digits, etc. x = Σ_i δ(f_i) Example: Rare word tagging in 130 regular-season/* games, x = δ(prefix:3:reg) + δ(prefix:2:re) + δ(prefix:1:r) + δ(has-hyphen) + δ(lower-case) + δ(suffix:3:son) + ...

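A short Python sketch of extracting such sparse word-property features (the exact feature inventory and naming scheme here are illustrative assumptions); x is then the sum of the corresponding one-hot vectors δ(f_i).

    def word_property_features(word):
        feats = []
        for k in (1, 2, 3):                              # prefix/suffix features up to length 3
            if len(word) >= k:
                feats.append("prefix:%d:%s" % (k, word[:k].lower()))
                feats.append("suffix:%d:%s" % (k, word[-k:].lower()))
        if "-" in word:
            feats.append("has-hyphen")
        if any(ch.isdigit() for ch in word):
            feats.append("has-digits")
        if word[:1].isupper():
            feats.append("first-capital")
        if word.isupper():
            feats.append("all-capital")
        if word.islower():
            feats.append("lower-case")
        return feats

    print(word_property_features("regular-season"))
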
Better Tag Features: Tag Sequence Representation can use specific aspects of text. F; prefixes, suffixes, hyphens, first capital, all-capital, has-digits, etc. Also include features on previous tags. Example: Rare word tagging with context in 130/CD regular-season/* games, x = δ(last:CD) + δ(prefix:3:reg) + δ(prefix:2:re) + δ(prefix:1:r) + δ(has-hyphen) + δ(lower-case) + δ(suffix:3:son) + ... However, this requires search: HMM-style sequence algorithms.

NLP (almost) From Scratch (Collobert et al., 2011) Exercise: What if we just used words and context? No word-specific features (mostly). No search over previous decisions. Over the next couple of classes, we will work our way up to this paper:
1. Dense word features
2. Contextual windowed representations
3. Neural network architectures
4. Semi-supervised training

Contents Part-of-Speech Data Part-of-Speech Models Bilinear Model Windowed Models

Motivation: Dense Features The strawman linear model learns one parameter for each word. Features allow us to share information between words. Can this sharing be learned?

Bilinear Model Bilinear model,

ŷ = f((x^0 W^0) W^1 + b)

x^0 ∈ R^{1 × d_0}; start with a one-hot vector.
W^0 ∈ R^{d_0 × d_in}, with d_0 = |F|
W^1 ∈ R^{d_in × d_out}, b ∈ R^{1 × d_out}; model parameters

Notes: bilinear parameter interaction; d_0 ≫ d_in, e.g. d_0 = 10000, d_in = 50.

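A minimal numpy sketch of this forward pass (dimensions and initialization are made-up for illustration); note that multiplying the one-hot x^0 by W^0 just selects one row of W^0.

    import numpy as np

    d0, d_in, d_out = 10000, 50, 40          # |F|, embedding size, |C|
    W0 = np.random.randn(d0, d_in) * 0.01    # embedding parameters
    W1 = np.random.randn(d_in, d_out) * 0.01
    b = np.zeros(d_out)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def forward(word_index):
        x0 = np.zeros(d0)
        x0[word_index] = 1.0                 # one-hot input x^0
        x = x0 @ W0                          # dense representation, equals W0[word_index]
        return softmax(x @ W1 + b)           # y-hat = f((x^0 W^0) W^1 + b)

    y_hat = forward(4242)
    print(y_hat.shape, round(float(y_hat.sum()), 3))   # (40,) 1.0
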
Bilinear Model: Intuition (x^0 W^0) W^1 + b
[matrix picture: the one-hot row vector x^0 = [0 ... 1 ... 0] selects row k of W^0 (entries w^0_{k,1} ... w^0_{k,d_in}), which is then multiplied by W^1 (entries w^1_{1,1} ... w^1_{d_in,d_out}).]

Embedding Layer x^0 W^0
[matrix picture: the one-hot x^0 = [0 ... 1 ... 0] picks out row k of W^0.]
Critical for natural language applications. Informal names for this idea: feature embeddings / word embeddings, lookup table, feature/representation learning. In Torch, nn.LookupTable (x^0 one-hot).

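A tiny numpy check (illustrative) that the one-hot matrix product and a direct row lookup give the same vector, which is why this layer is implemented as a lookup table rather than an actual matrix multiply.

    import numpy as np

    d0, d_in = 1000, 50
    W0 = np.random.randn(d0, d_in)

    k = 7
    x0 = np.zeros(d0)
    x0[k] = 1.0

    via_matmul = x0 @ W0    # multiply by the one-hot vector ...
    via_lookup = W0[k]      # ... or just read off row k

    print(np.allclose(via_matmul, via_lookup))  # True
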
Dense Features When dense features are implied we will write,

ŷ = f(x W^1 + b)

Example 1: single-word classification with embeddings
x = v(f_1; θ) = δ(f_1) W^0 = x^0 W^0
v : F → R^{1 × d_in}; parameterized embedding function

Example 2: bag-of-words classification with embeddings
x = Σ_{i=1}^{k} v(f_i; θ) = Σ_{i=1}^{k} δ(f_i) W^0

Log-Bilinear Model ŷ = softmax(x W^1 + b) Same form as multiclass logistic regression, but with dense features. However, the objective is now non-convex (no restrictions on W^0, W^1).

Log-Bilinear Model [figure: plot of log σ(xy) and log σ(−xy) + λ/2 ‖[x y]‖², illustrating the non-convexity of the objective in the parameters.]

Does it matter? We are going to use SGD; in theory this is quite bad. However, in practice it is not that much of an issue. Argument: in large parameter spaces, local optima are okay. Lots of questions here, beyond the scope of this class.

Embedding Gradients: Cross-Entropy I

Chain rule:
∂L(f(x))/∂x_i = Σ_{j=1}^{m} [∂L(f(x))/∂f(x)_j] × [∂f(x)_j/∂x_i]

ŷ = softmax(x W^1 + b), with z = x W^1 + b

Recall,
∂L(y, ŷ)/∂z_i = −(1 − ŷ_i) if i = c, and ŷ_i otherwise.

Therefore,
∂L/∂x_f = Σ_i W^1_{f,i} ∂L/∂z_i = −W^1_{f,c} (1 − ŷ_c) + Σ_{i ≠ c} W^1_{f,i} ŷ_i

Embedding Gradients: Cross-Entropy II

x = x^0 W^0

Update:
∂x_f / ∂W^0_{k,f′} = x^0_k 1(f = f′)

∂L/∂W^0_{k,f} = x^0_k ∂L/∂x_f = x^0_k ( −W^1_{f,c} (1 − ŷ_c) + Σ_{i ≠ c} W^1_{f,i} ŷ_i )

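A small numpy sketch (sizes and indices are made-up) that computes these gradients for a single example and checks the W^0 gradient against a finite-difference estimate.

    import numpy as np

    rng = np.random.default_rng(0)
    d0, d_in, d_out = 20, 5, 4
    W0 = rng.normal(size=(d0, d_in)) * 0.1
    W1 = rng.normal(size=(d_in, d_out)) * 0.1
    b = np.zeros(d_out)
    k, c = 3, 2                                   # input word index, gold class

    def loss(W0):
        x = W0[k]                                 # x = x^0 W^0 (row selection)
        z = x @ W1 + b
        y_hat = np.exp(z - z.max()); y_hat /= y_hat.sum()
        return -np.log(y_hat[c]), y_hat

    L, y_hat = loss(W0)
    dL_dz = y_hat.copy(); dL_dz[c] -= 1.0         # -(1 - y_hat_c) at c, y_hat_i elsewhere
    dL_dx = W1 @ dL_dz                            # sum_i W^1_{f,i} dL/dz_i
    dL_dW0_row_k = dL_dx                          # only row k of W^0 gets a gradient

    eps, f = 1e-6, 1                              # finite-difference check on one entry
    W0p = W0.copy(); W0p[k, f] += eps
    numeric = (loss(W0p)[0] - L) / eps
    print(np.allclose(numeric, dL_dW0_row_k[f], atol=1e-4))  # True
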
Contents Part-of-Speech Data Part-of-Speech Models Bilinear Model Windowed Models

Sentence Tagging w_1, ..., w_n; sentence words. t_1, ..., t_n; sentence tags. C; output classes, the set of tags.

Window Model Goal: predict t_5. Windowed word model:
w_1 w_2 [w_3 w_4 w_5 w_6 w_7] w_8
w_3, w_4; left context. w_5; word of interest. w_6, w_7; right context. d_win; size of window (d_win = 5).

Boundary Cases Goal: predict t_2.
[<s> <s> w_1 w_2 w_3] w_4 w_5 w_6 w_7 w_8
Goal: predict t_8.
w_1 w_2 w_3 w_4 w_5 [w_6 w_7 w_8 </s> </s>]
Here the symbols <s> and </s> represent boundary padding.

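A short Python sketch of extracting the d_win = 5 windows with boundary padding (the function name and padding defaults are assumptions of this sketch).

    def windows(words, d_win=5, bos="<s>", eos="</s>"):
        # yield the window of size d_win centered on each token
        half = d_win // 2
        padded = [bos] * half + list(words) + [eos] * half
        for i in range(len(words)):
            yield padded[i:i + d_win]

    sent = ["w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8"]
    print(next(iter(windows(sent))))        # ['<s>', '<s>', 'w1', 'w2', 'w3']
    print(list(windows(sent))[-1])          # ['w6', 'w7', 'w8', '</s>', '</s>']
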
Dense Windowed BoW Features f_1, ..., f_{d_win} are the words in the window. Input representation is the concatenation of embeddings:
x = [v(f_1) v(f_2) ... v(f_{d_win})]
Example: Tagging
w_1 w_2 [w_3 w_4 w_5 w_6 w_7] w_8
x = [v(w_3) v(w_4) v(w_5) v(w_6) v(w_7)]
(each of the five blocks of x has width d_in/5)
Rows of W^1 encode position-specific weights.

Dense Windowed Extended Features f_1, ..., f_{d_win} are words; g_1, ..., g_{d_win} are capitalization features.
x = [v(f_1) v(f_2) ... v(f_{d_win}) v_2(g_1) v_2(g_2) ... v_2(g_{d_win})]
Example: Tagging
w_1 w_2 [w_3 w_4 w_5 w_6 w_7] w_8
x = [v(w_3) v(w_4) v(w_5) v(w_6) v(w_7) v_2(w_3) v_2(w_4) v_2(w_5) v_2(w_6) v_2(w_7)]
Rows of W^1 encode position/feature-specific weights.

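A numpy sketch (the toy vocabulary, dimensions, and four-way capitalization scheme are illustrative assumptions) of building the concatenated input x for one window from a word embedding table v and a smaller capitalization embedding table v_2.

    import numpy as np

    rng = np.random.default_rng(1)
    vocab = {"<s>": 0, "</s>": 1, "in": 2, "130": 3, "regular-season": 4, "games": 5, ",": 6}
    d_word, d_cap = 50, 5
    W_word = rng.normal(size=(len(vocab), d_word)) * 0.01   # v(.)
    W_cap = rng.normal(size=(4, d_cap)) * 0.01              # v_2(.): lower / first-cap / all-cap / other

    def cap_id(w):
        if w.islower(): return 0
        if w.istitle(): return 1
        if w.isupper(): return 2
        return 3

    def window_input(window):
        # concatenate the word embeddings, then the capitalization embeddings
        word_part = [W_word[vocab[w]] for w in window]
        cap_part = [W_cap[cap_id(w)] for w in window]
        return np.concatenate(word_part + cap_part)

    x = window_input(["in", "130", "regular-season", "games", ","])
    print(x.shape)   # (275,) = 5*50 + 5*5
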
Tagging from Scratch (Collobert et al., 2011) Part 1 of the key model,