POS-Tagging. Fabian M. Suchanek

POS-Tagging Fabian M. Suchanek 100

Def: POS A Part-of-Speech (also: POS, POS-tag, word class, lexical class, lexical category) is a set of words with the same grammatical role. Alizée wrote a really great song by herself Proper Name Verb Determiner Adverb Adjective Noun Preposition Pronoun 2

Part of Speech A Part-of-Speech (also: POS, POS-tag, word class, lexical class, lexical category) is a set of words with the same grammatical role. Proper nouns (PN): Alizée, Elvis, Obama... Nouns (N): music, sand,... Adjectives (ADJ): fast, funny,... Adverbs (ADV): fast, well,... Verbs (V): run, jump, dance,... Pronouns (PRON): he, she, it, this,... (what can replace a noun) Determiners (DET): the, a, these, your,... (what goes before a noun) Prepositions (PREP): in, with, on,... (what goes before determiner + noun) Subordinators (SUB): who, whose, that, which, because... (what introduces a sub-sentence) 3

Def: POS-Tagging POS-Tagging is the process of assigning to each word in a sentence its POS. Alizée sings a wonderful song. PN V DET ADJ N Alizée, who has a wonderful voice, sang at the concert in Moscow. © Ronny Martin Junnilainen 4

Def: POS-Tagging POS-Tagging is the process of assigning to each word in a sentence its POS. Alizée sings a wonderful song. PN V DET ADJ N Alizée, who has a wonderful voice, PN SUB V DET ADJ N sang at the concert in Moscow. V PREP DET N PREP PN © Ronny Martin Junnilainen 5

Probabilistic POS-Tagging Probabilistic POS tagging is an algorithm for automatic POS tagging that works by introducing random variables for the words (visible) and for the POS-tags (hidden):
World  W1      W2     T1    T2   Probability
ω1     Alizée  sings  PN    V    P(ω1) = 0.2
ω2     Alizée  sings  ...   ...  P(ω2) = 0.1
ω3     Alizée  runs   Prep  PN   P(ω3) = 0.1
... 6

Probabilistic POS-Tagging Given a sentence w_1, ..., w_n we want to find argmax_{t_1,...,t_n} P(w_1, ..., w_n, t_1, ..., t_n):
World  W1      W2     T1    T2   Probability
ω1     Alizée  sings  PN    V    P(ω1) = 0.2
ω2     Alizée  sings  ...   ...  P(ω2) = 0.1
ω3     Alizée  runs   Prep  PN   P(ω3) = 0.1
... 7

Markov Assumption 1 Every word depends just on its tag: P(W_i | W_1, ..., W_{i-1}, T_1, ..., T_i) = P(W_i | T_i) 8

Markov Assumption 1 Every word depends just on its tag: P(W_i | W_1, ..., W_{i-1}, T_1, ..., T_i) = P(W_i | T_i) The probability that the 4th word is song depends just on the tag of that word: P(song | Elvis, sings, a, PN, V, D, N) = P(song | N) 9

Markov Assumption 1 Every word depends just on its tag: P(W_i | W_1, ..., W_{i-1}, T_1, ..., T_i) = P(W_i | T_i) The probability that the 4th word is song depends just on the tag of that word: P(song | Elvis, sings, a, PN, V, D, N) = P(song | N) Elvis sings a ? PN V Det Noun 10

Markov Assumption 1 Every word depends just on its tag: P(W_i | W_1, ..., W_{i-1}, T_1, ..., T_i) = P(W_i | T_i) The probability that the 4th word is song depends just on the tag of that word: P(song | Elvis, sings, a, PN, V, D, N) = P(song | N) Elvis sings a ? PN V Det Noun 11

Markov Assumption 2 Every tag depends just on its predecessor: P(T_i | T_1, ..., T_{i-1}) = P(T_i | T_{i-1}) 12

Markov Assumption 2 Every tag depends just on its predecessor: P(T_i | T_1, ..., T_{i-1}) = P(T_i | T_{i-1}) The probability that PN, V, D is followed by a noun is the same as the probability that D is followed by a noun: P(N | PN, V, D) = P(N | D) 13

Markov Assumption 2 Every tag depends just on its predecessor: P(T_i | T_1, ..., T_{i-1}) = P(T_i | T_{i-1}) The probability that PN, V, D is followed by a noun is the same as the probability that D is followed by a noun: P(N | PN, V, D) = P(N | D) Alizée sings a song PN V Det ? 14

Markov Assumption 2 Every tag depends just on its predecessor: P(T_i | T_1, ..., T_{i-1}) = P(T_i | T_{i-1}) The probability that PN, V, D is followed by a noun is the same as the probability that D is followed by a noun: P(N | PN, V, D) = P(N | D) Alizée sings a song PN V Det ? 15

Homogeneity Assumption 1 The tag probabilities are the same at all positions: P(T_i | T_{i-1}) = P(T_k | T_{k-1}) for all i, k 16

Homogeneity Assumption 1 The tag probabilities are the same at all positions: P(T_i | T_{i-1}) = P(T_k | T_{k-1}) for all i, k The probability that a Det is followed by a Noun is the same at positions 7 and 2: P(T_7 = Noun | T_6 = Det) = P(T_2 = Noun | T_1 = Det) 17

Homogeneity Assumption 1 The tag probabilities are the same at all positions: P(T_i | T_{i-1}) = P(T_k | T_{k-1}) for all i, k The probability that a Det is followed by a Noun is the same at positions 7 and 2: P(T_7 = Noun | T_6 = Det) = P(T_2 = Noun | T_1 = Det) Let's denote this probability by P(Noun | Det). Transition probability: P(s | t) := P(T_i = s | T_{i-1} = t) (for any i) 18

Homogeneity Assumption 2 The word probabilities are the same at all positions: P(W_i | T_i) = P(W_k | T_k) for all i, k 19

Homogeneity Assumption 2 The word probabilities are the same at all positions: P(W_i | T_i) = P(W_k | T_k) for all i, k The probability that a PN is Elvis is the same at positions 7 and 2: P(W_7 = Elvis | T_7 = PN) = P(W_2 = Elvis | T_2 = PN) = 0.8 20

Homogeneity Assumption 2 The word probabilities are the same at all positions: P(W_i | T_i) = P(W_k | T_k) for all i, k The probability that a PN is Elvis is the same at positions 7 and 2: P(W_7 = Elvis | T_7 = PN) = P(W_2 = Elvis | T_2 = PN) = 0.8 Let's denote this probability by P(Elvis | PN). Emission probability: P(w | t) := P(W_i = w | T_i = t) (for any i) 21

Def: HMM A (homogeneous) Hidden Markov Model (also: HMM) is a sequence of random variables, such that P(w_1, ..., w_n, t_1, ..., t_n) = ∏_i P(w_i | t_i) · P(t_i | t_{i-1}) with t_0 = Start, where the w_i are the words of the sentence, the t_i are the POS-tags, the P(w_i | t_i) are the emission probabilities, and the P(t_i | t_{i-1}) are the transition probabilities. 22
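
To make the factorized definition concrete, here is a minimal Python sketch that multiplies emission and transition probabilities along a tag sequence. The probability tables are hypothetical toy values for illustration, not the numbers used on the slides.

# Toy HMM parameters (hypothetical values, for illustration only).
transition = {  # P(tag | previous tag)
    ("Start", "PN"): 0.4, ("PN", "V"): 0.6, ("V", "DET"): 0.5,
    ("DET", "N"): 0.7, ("N", "End"): 0.3,
}
emission = {    # P(word | tag)
    ("PN", "Alizée"): 0.1, ("V", "sings"): 0.05,
    ("DET", "a"): 0.4, ("N", "song"): 0.01,
}

def joint_probability(words, tags):
    """P(w_1..w_n, t_1..t_n) = prod_i P(w_i | t_i) * P(t_i | t_{i-1}), with t_0 = Start."""
    p, prev = 1.0, "Start"
    for w, t in zip(words, tags):
        p *= emission.get((t, w), 0.0) * transition.get((prev, t), 0.0)
        prev = t
    return p

print(joint_probability(["Alizée", "sings", "a", "song"], ["PN", "V", "DET", "N"]))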

HMMs as graphs P(w_1, ..., w_n, t_1, ..., t_n) = ∏_i P(w_i | t_i) · P(t_i | t_{i-1}) [Figure: an HMM drawn as a graph with states Start, Verb, Noun, End; the edges are labelled with the transition probabilities, e.g. P(End | Noun) = 0.2] 23

HMMs as graphs P(w_1, ..., w_n, t_1, ..., t_n) = ∏_i P(w_i | t_i) · P(t_i | t_{i-1}) [Figure: the same HMM graph, now also showing the emission probabilities of each state, e.g. P(sounds | Noun), over the words sound, nice, sounds and "."] 24

HMMs as graphs P(w_1, ..., w_n, t_1, ..., t_n) = ∏_i P(w_i | t_i) · P(t_i | t_{i-1}) Example: P(nice, sounds, ., Verb, Noun, End) = 2.5%, the product of the emission and transition probabilities along the corresponding path through the graph. [Figure: the same HMM graph with transition and emission probabilities] 25

HMM Question What is the most likely tagging for "sound sounds ."? [Figure: the same HMM graph with transition and emission probabilities] 26

POS tagging with HMMs What is the most likely tagging for "sound sounds ."? Verb + Noun: 2.5% Noun + Verb: 1% Finding the most likely sequence of tags that generated a sentence is POS tagging (hooray!). 27
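
The naive way to find this most likely tagging is to enumerate every possible tag sequence and keep the best one. A short sketch, reusing the joint_probability function and the toy tables from the earlier sketch; this is only feasible for very short sentences, which is what motivates the trellis view below.

from itertools import product

def brute_force_tag(words, tags):
    """Enumerate all len(tags)**len(words) tag sequences, return the most probable one."""
    return max(product(tags, repeat=len(words)),
               key=lambda seq: joint_probability(words, seq))

# Example with the toy tables from above:
print(brute_force_tag(["Alizée", "sings"], ["PN", "V", "DET", "N"]))  # ('PN', 'V')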

Where do we get the HMM? Estimate probabilities from a manually annotated corpus: Elvis/PN sings/V Elvis/PN ./End Priscilla/PN laughs/V P(Verb | PN) = 2/3 P(End | PN) = 1/3 ... P(Elvis | PN) = 2/3 P(sings | Verb) = 1/2 ... >Viterbi 28
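
A sketch of how such counts turn into probability tables (maximum-likelihood estimation by relative frequencies), using the same tuple-keyed dictionaries as above; the tiny corpus mirrors the slide's example.

from collections import Counter

corpus = [
    [("Elvis", "PN"), ("sings", "V")],
    [("Elvis", "PN"), (".", "End")],
    [("Priscilla", "PN"), ("laughs", "V")],
]

trans_counts, prev_counts = Counter(), Counter()
emit_counts, tag_counts = Counter(), Counter()

for sentence in corpus:
    prev = "Start"
    for word, tag in sentence:
        trans_counts[(prev, tag)] += 1   # count tag bigrams (prev, tag)
        prev_counts[prev] += 1
        emit_counts[(tag, word)] += 1    # count (tag, word) pairs
        tag_counts[tag] += 1
        prev = tag

# Maximum-likelihood estimates: relative frequencies of the counts.
transition = {(p, t): c / prev_counts[p] for (p, t), c in trans_counts.items()}
emission = {(t, w): c / tag_counts[t] for (t, w), c in emit_counts.items()}

print(transition[("PN", "V")])    # 2/3, as on the slide
print(emission[("PN", "Elvis")])  # 2/3, as on the slide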

The HMM as a Trellis Task: compute the probability of every possible path from left to right, find the maximum. [Figure: the HMM unrolled as a trellis, with one column of states (Noun, Verb, End, ...) per token of "sound sounds ."] 29

The HMM as a Trellis Task: compute the probability of every possible path from left to right, find the maximum. [Figure: the same trellis; the probability of one particular path is the product of its transition and emission probabilities] 30

The HMM as a Trellis [Figure: the same trellis with one path highlighted] Complexity: O(T^S) for enumerating all paths, with T tags and a sentence of length S. 31

The HMM as a Trellis [Figure: the same trellis; the highlighted path shares its prefix computations with the previous one: "same as before!"] Complexity: O(T^S) for enumerating all paths, with T tags and a sentence of length S. 32

The HMM as a Trellis [Figure: the same trellis] Idea: Store at each node the probability of the maximal path that leads there. 33

Viterbi Algorithm [Figure: the trellis] For each word w: for each tag t: for each preceding tag t': compute P(t') · P(t | t') · P(w | t); store the maximal probability at (t, w). 34

Viterbi Algorithm Start left to right, compute and store the probability of arriving at each node. [Figure: the first trellis column, for "sound"; e.g. the N and A nodes each store 25%] 35

Viterbi Algorithm Find the preceding tag prev that maximizes P(prev) · P(t | prev) · P(w | t). [Figure: computing the value of the Noun node in the second trellis column ("sounds") by comparing the candidates prev = Noun, prev = Verb, prev = End, ...] 36

Viterbi Algorithm Find the preceding tag prev that maximizes P(prev) · P(t | prev) · P(w | t). [Figure: the maximal candidate, 13%, is stored at the Noun node of the second column] 37

Viterbi Algorithm [Figure: the completed trellis for "sound sounds ."] Read off the best path by following the arrows backwards. Complexity: O(S · T²), with T tags and a sentence of length S. 38
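
A compact Python sketch of the Viterbi algorithm over the tuple-keyed transition and emission dictionaries from the sketches above: it stores, at every trellis node, the probability of the best path reaching it plus a back-pointer, and runs in O(S · T²).

def viterbi(words, tags, transition, emission):
    """Return the most probable tag sequence for `words` under the HMM."""
    # best[i][t] = (probability of the best path ending in tag t at position i, predecessor tag)
    best = [dict() for _ in words]
    for t in tags:
        best[0][t] = (transition.get(("Start", t), 0.0) * emission.get((t, words[0]), 0.0), None)
    for i in range(1, len(words)):
        for t in tags:
            prob, prev = max(
                ((best[i - 1][p][0] * transition.get((p, t), 0.0) * emission.get((t, words[i]), 0.0), p)
                 for p in tags),
                key=lambda cand: cand[0])
            best[i][t] = (prob, prev)
    # Read off the best path by following the back-pointers from the best final node.
    last = max(best[-1], key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return list(reversed(path))

# With the corpus-estimated tables from the estimation sketch above:
print(viterbi(["Elvis", "sings"], ["PN", "V", "End"], transition, emission))  # ['PN', 'V']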

Def: Probabilistic POS Tagging Given a sentence and transition and emission probabilities, Probabilistic POS Tagging computes the sequence of tags that has maximal probability (in an HMM). X = Elvis sings P(Elvis, sings, PN, N) = 0.01 P(Elvis, sings, V, N) = 0.01 P(Elvis, sings, PN, V) = 0.1 ← winner ... >end 39

Probabilistic POS Tagging Probabilistic POS tagging uses Hidden Markov Models. General performance is very good (>95% accuracy). Several POS taggers are available: Stanford POS tagger, SpaCy.io (open source), NLTK (Python library), ... (HMMs and the Viterbi algorithm serve a wide variety of other tasks, in particular NLP at all levels of granularity, e.g., in speech processing) >end 40
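
For example, the off-the-shelf taggers can be called in a few lines; a sketch assuming NLTK and spaCy are installed and their English models downloaded, with the exact output tags depending on each tool's tagset.

import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
print(nltk.pos_tag(nltk.word_tokenize("Alizée sings a wonderful song.")))
# e.g. [('Alizée', 'NNP'), ('sings', 'VBZ'), ('a', 'DT'), ('wonderful', 'JJ'), ('song', 'NN'), ('.', '.')]

import spacy
nlp = spacy.load("en_core_web_sm")   # small English model, installed separately
print([(tok.text, tok.pos_) for tok in nlp("Alizée sings a wonderful song.")])
# e.g. [('Alizée', 'PROPN'), ('sings', 'VERB'), ('a', 'DET'), ('wonderful', 'ADJ'), ('song', 'NOUN'), ('.', 'PUNCT')]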

Research Questions How can we deal with evil cases? "the word blue has 4 letters." "pre- and post-secondary" "look it up" "The Duchess was entertaining last night." [Wikipedia/POS tagging] Unknown words? New languages? >end 41

POS Tagging helps pattern IE We can choose to match the placeholders with only nouns or proper nouns: X was born in Y Atkinson was born in Consett. wasbornin(atkinson, consett) Atkinson was born in the beautiful Consett. wasbornin(atkinson, the) ← wrong extraction, avoided if Y may match only nouns >end 42
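
A sketch of how a POS constraint filters the pattern matches; the tag names follow the lecture's convention, and the helper function and hand-tagged sentences are illustrative only.

def extract_was_born_in(tagged):
    """tagged: list of (word, tag) pairs. Return (X, Y) if the pattern matches with nominal placeholders."""
    for i in range(len(tagged) - 4):
        middle = [w for w, _ in tagged[i + 1:i + 4]]
        if middle == ["was", "born", "in"]:
            (x, x_tag), (y, y_tag) = tagged[i], tagged[i + 4]
            if x_tag in ("PN", "N") and y_tag in ("PN", "N"):   # placeholders must be (proper) nouns
                return (x, y)
    return None

sent1 = [("Atkinson", "PN"), ("was", "V"), ("born", "V"), ("in", "PREP"),
         ("Consett", "PN"), (".", ".")]
sent2 = [("Atkinson", "PN"), ("was", "V"), ("born", "V"), ("in", "PREP"),
         ("the", "DET"), ("beautiful", "ADJ"), ("Consett", "PN"), (".", ".")]
print(extract_was_born_in(sent1))  # ('Atkinson', 'Consett')
print(extract_was_born_in(sent2))  # None: 'the' is not a noun, so the wrong extraction is avoided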

POS-tags can generalize patterns X was born in the ADJ Y Atkinson was born in the beautiful Consett. match Atkinson was born in the cold Consett. match >end 43

Phrase structure is a problem X was born in the ADJ Y Atkinson was born in the very beautiful Consett. no match Atkinson, a famous actor, was born in Consett. no match 44

References Daniel Ramage: HMM Fundamentals, CS229 Section Notes ->dependency-parsing ->semantic-web ->ie-by-reasoning ->security 45