POS-Tagging
Fabian M. Suchanek
Def: POS
A Part-of-Speech (also: POS, POS-tag, word class, lexical class, lexical category) is a set of words with the same grammatical role.

Alizée      wrote a          really great     song by          herself
Proper Name Verb  Determiner Adverb Adjective Noun Preposition Pronoun
Part of Speech
A Part-of-Speech (also: POS, POS-tag, word class, lexical class, lexical category) is a set of words with the same grammatical role.

Proper nouns (PN): Alizée, Elvis, Obama, ...
Nouns (N): music, sand, ...
Adjectives (ADJ): fast, funny, ...
Adverbs (ADV): fast, well, ...
Verbs (V): run, jump, dance, ...
Pronouns (PRON): he, she, it, this, ... (what can replace a noun)
Determiners (DET): the, a, these, your, ... (what goes before a noun)
Prepositions (PREP): in, with, on, ... (what goes before determiner + noun)
Subordinators (SUB): who, whose, that, which, because, ... (what introduces a sub-sentence)
Def: POS-Tagging
POS-Tagging is the process of assigning to each word in a sentence its POS.

Alizée sings a   wonderful song.
PN     V     DET ADJ       N

Alizée, who has a   wonderful voice, sang at   the concert in   Moscow.
PN      SUB V   DET ADJ       N      V    PREP DET N       PREP PN

(Image: © Ronny Martin Junnilainen)
Probabilistic POS-Tagging
Probabilistic POS tagging is an algorithm for automatic POS tagging that works by introducing random variables for the words (visible) and for the POS-tags (hidden):

World   W1       W2      T1     T2    Probability
ω1      Alizée   sings   PN     V     P(ω1) = 0.2
ω2      Alizée   sings   V      V     P(ω2) = 0.1
ω3      Alizée   runs    Prep   PN    P(ω3) = 0.1
...
Probabilistic POS-Tagging
Given a sentence w_1, …, w_n, we want to find
argmax_{t_1, …, t_n} P(w_1, …, w_n, t_1, …, t_n)
(in the table of worlds above: the tagging with maximal probability for the given words).
Markov Assumption 1
Every word depends just on its tag:
P(W_i | W_1, …, W_{i-1}, T_1, …, T_i) = P(W_i | T_i)

The probability that the 4th word is "song" depends just on the tag of that word:
P(song | Elvis, sings, a, PN, V, D, N) = P(song | N)

Elvis sings a   ?
PN    V     Det Noun
Markov Assumption 2
Every tag depends just on its predecessor:
P(T_i | T_1, …, T_{i-1}) = P(T_i | T_{i-1})

The probability that PN, V, D is followed by a noun is the same as the probability that D is followed by a noun:
P(N | PN, V, D) = P(N | D)

Alizée sings a   song
PN     V     Det ?
Homogeneity Assumption 1
The tag probabilities are the same at all positions:
P(T_i | T_{i-1}) = P(T_k | T_{k-1})   ∀ i, k

The probability that a Det is followed by a Noun is the same at position 7 as at position 2:
P(T_7 = Noun | T_6 = Det) = P(T_2 = Noun | T_1 = Det)

Let's denote this probability by P(Noun | Det).
Transition probability: P(s | t) := P(T_i = s | T_{i-1} = t)   (for any i)
Homogeneity Assumption 2
The word probabilities are the same at all positions:
P(W_i | T_i) = P(W_k | T_k)   ∀ i, k

The probability that a PN is "Elvis" is the same at position 7 as at position 2:
P(W_7 = Elvis | T_7 = PN) = P(W_2 = Elvis | T_2 = PN)

Let's denote this probability by P(Elvis | PN).
Emission probability: P(w | t) := P(W_i = w | T_i = t)   (for any i)
Def: HMM
A (homogeneous) Hidden Markov Model (also: HMM) is a sequence of random variables such that
P(w_1, …, w_n, t_1, …, t_n) = ∏_i P(w_i | t_i) · P(t_i | t_{i-1})   with t_0 = Start
where w_1, …, w_n are the words of the sentence, t_1, …, t_n are the POS-tags, the P(w_i | t_i) are the emission probabilities, and the P(t_i | t_{i-1}) are the transition probabilities.
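This joint probability is just a product over the sentence, so it is easy to compute. Below is a minimal sketch, assuming the transition and emission probabilities are given as nested dictionaries; the dictionaries and their numbers are purely illustrative, not taken from these slides.

```python
# Illustrative toy probabilities (assumed, not estimated from data).
transitions = {"Start": {"PN": 0.5}, "PN": {"V": 0.6}, "V": {"End": 0.3}}
emissions = {"PN": {"Alizée": 0.1}, "V": {"sings": 0.2}}

def joint_probability(words, tags, transitions, emissions):
    """P(w_1..w_n, t_1..t_n) = prod_i P(w_i | t_i) * P(t_i | t_{i-1}), t_0 = Start."""
    prob, prev = 1.0, "Start"
    for word, tag in zip(words, tags):
        prob *= transitions.get(prev, {}).get(tag, 0.0) \
              * emissions.get(tag, {}).get(word, 0.0)
        prev = tag
    return prob

print(joint_probability(["Alizée", "sings"], ["PN", "V"], transitions, emissions))
# 0.5 * 0.1 * 0.6 * 0.2 = 0.006
```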
HMMs as graphs
P(w_1, …, w_n, t_1, …, t_n) = ∏_i P(w_i | t_i) · P(t_i | t_{i-1})

[Figure: an example HMM drawn as a graph. The states Start, Noun, Verb, Adj, and End are connected by edges labeled with transition probabilities (e.g., P(End | Noun) = 0.2), and each state emits words from "sound", "sounds", "nice", "." with its emission probabilities (e.g., P(sounds | Noun)).]

The probability of a tagged sentence is the product of the emission and transition probabilities along its path, e.g.
P(nice, sounds, ., Adj, Noun, End) = P(nice | Adj) · P(Adj | Start) · P(sounds | Noun) · P(Noun | Adj) · P(. | End) · P(End | Noun) = 2.5%
HMM Question
What is the most likely tagging for "sound sounds ."?

Adj + Noun (sound/Adj sounds/Noun ./End): product of the probabilities along this path = 2.5%
Noun + Verb (sound/Noun sounds/Verb ./End): product = 1%

Finding the most likely sequence of tags that generated a sentence is POS tagging (hooray!).
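In this toy setting, POS tagging can even be done by brute force: enumerate all possible tag sequences and keep the one with maximal joint probability. A minimal sketch, reusing the (illustrative) joint_probability function from above; it is exponential in the sentence length, which the Viterbi algorithm below avoids.

```python
import itertools

def best_tagging_bruteforce(words, tags, transitions, emissions):
    """Try all len(tags)**len(words) tag sequences, return the most probable one."""
    return max(itertools.product(tags, repeat=len(words)),
               key=lambda seq: joint_probability(words, list(seq),
                                                 transitions, emissions))
```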
Where do we get the HMM?
Estimate the probabilities from a manually annotated corpus:

Elvis/PN sings/Verb
Elvis/PN ./End
Priscilla/PN laughs/Verb

P(Verb | PN) = 2/3
P(End | PN) = 1/3
...
P(Elvis | PN) = 2/3
P(sings | Verb) = 1/2
...
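A minimal sketch of this counting step, assuming the annotated corpus is given as a list of (word, tag) sequences (the data format and function names are illustrative):

```python
from collections import Counter

corpus = [[("Elvis", "PN"), ("sings", "Verb")],
          [("Elvis", "PN"), (".", "End")],
          [("Priscilla", "PN"), ("laughs", "Verb")]]

trans, emit = Counter(), Counter()        # transition and emission counts
prev_counts, tag_counts = Counter(), Counter()
for sentence in corpus:
    prev = "Start"
    for word, tag in sentence:
        trans[(prev, tag)] += 1           # count transition prev -> tag
        emit[(tag, word)] += 1            # count emission tag -> word
        prev_counts[prev] += 1
        tag_counts[tag] += 1
        prev = tag

def p_trans(tag, prev):                   # P(tag | prev)
    return trans[(prev, tag)] / prev_counts[prev]

def p_emit(word, tag):                    # P(word | tag)
    return emit[(tag, word)] / tag_counts[tag]

print(p_trans("Verb", "PN"))              # 2/3, as on the slide
print(p_emit("Elvis", "PN"))              # 2/3
```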
The HMM as a Trellis
Task: compute the probability of every possible path from left to right, and find the maximum.

[Figure: the HMM unrolled into a trellis, with one column per word of "sound sounds ." and one node per tag (Noun, Verb, Adj, End) in each column; edges between adjacent columns carry the transition probabilities.]

The probability of one path is the product of its transition and emission probabilities, e.g.
P(Start → Noun → Noun → End) = P(Noun | Start) · P(sound | Noun) · P(Noun | Noun) · P(sounds | Noun) · P(End | Noun) · P(. | End)

Enumerating all paths costs O(T^S) for a sentence of S words and T tags, and different paths recompute the same prefixes ("same as before!").

Idea: store at each node the probability of the maximal path that leads there.
Viterbi Algorithm
For each word w (left to right):
    for each tag t:
        for each preceding tag t':
            compute P(t') · P(t | t') · P(w | t)
        store the maximal probability at the node (w, t)
Viterbi Algorithm
Walk through the trellis from left to right, computing and storing at every node the probability of the best path arriving there.

Position 1 ("sound"): e.g. the Noun node stores P(Noun | Start) · P(sound | Noun) = 25%; the Adj node also stores 25%.

Position 2 ("sounds"): for each tag t, find the prev that maximizes P(prev) · P(t | prev) · P(sounds | t). For t = Noun:
prev = Noun: 25% · P(Noun | Noun) · P(sounds | Noun)
prev = Verb: …
prev = End: …
The maximum, 13%, is stored at the Noun node.
Viterbi Algorithm
[Figure: the completed trellis for "sound sounds .", showing the stored maximum at each node (e.g. Noun: 25% and Adj: 25% in the first column, Noun: 13% in the second) and an arrow from each node back to its best predecessor.]

Read off the best path by following the arrows backwards from the most probable final node.
Complexity: O(S · T²) for a sentence of S words and T tags.
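A compact sketch of the Viterbi algorithm as described above, using the same dictionary representation as the earlier sketches (names and structure are illustrative):

```python
def viterbi(words, tags, transitions, emissions):
    best = [{} for _ in words]   # best[i][t]: prob. of best path ending in tag t at word i
    back = [{} for _ in words]   # back[i][t]: predecessor tag on that best path
    for i, word in enumerate(words):
        for t in tags:
            e = emissions.get(t, {}).get(word, 0.0)
            if i == 0:
                best[0][t] = transitions.get("Start", {}).get(t, 0.0) * e
                back[0][t] = "Start"
            else:
                prev = max(tags, key=lambda p: best[i - 1][p]
                           * transitions.get(p, {}).get(t, 0.0))
                best[i][t] = (best[i - 1][prev]
                              * transitions.get(prev, {}).get(t, 0.0) * e)
                back[i][t] = prev
    # Follow the back-pointers from the best final node (O(S * T^2) overall).
    t = max(tags, key=lambda t: best[-1][t])
    path = [t]
    for i in range(len(words) - 1, 0, -1):
        t = back[i][t]
        path.append(t)
    return list(reversed(path))
```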
Def: Probabilistic POS Tagging
Given a sentence and transition and emission probabilities, probabilistic POS tagging computes the sequence of tags that has maximal probability (in an HMM).

X = Elvis sings
P(Elvis, sings, PN, N) = 0.01
P(Elvis, sings, V, N) = 0.01
P(Elvis, sings, PN, V) = 0.1   ← winner
...
Probabilistic POS Tagging
Probabilistic POS tagging uses Hidden Markov Models.
General performance is very good (>95% accuracy).
Several POS taggers are available:
- Stanford POS tagger
- SpaCy.io (open source)
- NLTK (Python library)
- ...
(HMMs and the Viterbi algorithm serve a wide variety of other tasks, in particular NLP at all levels of granularity, e.g., in speech processing.)
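For example, tagging a sentence with NLTK takes two lines (assuming the tokenizer and tagger models have been fetched once via nltk.download(); note that NLTK uses the Penn Treebank tag names, not the ones from these slides):

```python
import nltk

tokens = nltk.word_tokenize("Alizée sings a wonderful song.")
print(nltk.pos_tag(tokens))
# e.g. [('Alizée', 'NNP'), ('sings', 'VBZ'), ('a', 'DT'),
#       ('wonderful', 'JJ'), ('song', 'NN'), ('.', '.')]
```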
Research Questions
How can we deal with evil cases?
- The word "blue" has 4 letters.
- pre- and post-secondary
- look it up
- The Duchess was entertaining last night. [Wikipedia/POS tagging]
How do we deal with unknown words? With new languages?
POS Tagging helps pattern IE
We can choose to match the placeholders with only nouns or proper nouns:

Pattern: X was born in Y
Atkinson was born in Consett. → wasbornin(atkinson, consett)
Atkinson was born in the beautiful Consett. → wasbornin(atkinson, the)   (wrong, if Y may match any word)
POS-tags can generalize patterns
Pattern: X was born in the ADJ Y
Atkinson was born in the beautiful Consett. → match
Atkinson was born in the cold Consett. → match
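One simple way to implement such POS-generalized patterns is to tag the sentence first and then match a regular expression over the word/TAG pairs. A minimal sketch, assuming a tagged sentence in the slides' tag set (the helper function and its pattern are illustrative):

```python
import re

def match_born_pattern(tagged):
    """Match "X was born in the ADJ Y" over a [(word, tag), ...] sentence."""
    text = " ".join(f"{w}/{t}" for w, t in tagged)
    m = re.search(r"(\S+)/PN was/\S+ born/\S+ in/PREP the/DET \S+/ADJ (\S+)/PN",
                  text)
    return m.groups() if m else None

tagged = [("Atkinson", "PN"), ("was", "V"), ("born", "V"), ("in", "PREP"),
          ("the", "DET"), ("beautiful", "ADJ"), ("Consett", "PN")]
print(match_born_pattern(tagged))   # ('Atkinson', 'Consett')
```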
Phrase structure is a problem
Pattern: X was born in the ADJ Y
Atkinson was born in the very beautiful Consett. → no match
Atkinson, a famous actor, was born in Consett. → no match