CS460/626: Natural Language Processing/Speech, NLP and the Web (Lecture 8: POS tagset)
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
17th Jan, 2012
HMM: Three Problems
Problem 1: Likelihood of a sequence (Forward Procedure, Backward Procedure)
Problem 2: Best state sequence (Viterbi Algorithm)
Problem 3: Re-estimation (Baum-Welch, i.e., the Forward-Backward Algorithm)
[Figure: the NLP Trinity: languages (Hindi, Marathi, English, French), tasks (Morph Analysis, POS Tagging, Parsing, Semantics), and algorithms (HMM, MEMM, CRF)]
Tagged Corpora
^_^ The_DT guys_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, said_VBD Mr._NNP Benton_NNP ._. $_$
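The word_TAG convention above is straightforward to parse mechanically; a minimal sketch (our own helper, not from the lecture, assuming whitespace tokenization):

```python
# Sketch: read a word_TAG line into (word, tag) pairs. rpartition splits on
# the LAST underscore, so words containing underscores are still handled.
def read_tagged(line):
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

print(read_tagged("The_DT guys_NNS that_WDT make_VBP hardware_NN"))
# [('The', 'DT'), ('guys', 'NNS'), ('that', 'WDT'), ('make', 'VBP'), ('hardware', 'NN')]
```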
For Hindi:
Rama achhaa gaataa hai. (hai is VAUX: auxiliary verb) ; 'Ram sings well'
Rama achhaa ladakaa hai. (hai is VCOP: copula verb) ; 'Ram is a good boy'
Example of difficulty in POS tagging
Tags
Content words: Noun, Adjective, Verb
  Noun: Proper Noun NNP (used for NER), Common Noun NN, NNS
  Verb: VBP, VBD, VBG, VBN
Function words: Preposition, Pronoun, Conjunction, Interjection
Difficulty in POS Tagging
Consider the following sentences:
राम अच्छा गाता है_VAUX (auxiliary verb)
Ram good sing is : 'Ram sings well'
GNPTAM (Gender, Number, Person, Tense, Aspect, Mood) for गाता alone: Male, Singular, ??, ??, ??, -
GNPTAM for गाता है: Male, Singular, 2nd or 3rd, Present, Default, Declarative
राम अच्छा लड़का है_VCOP (copular verb)
Ram good boy is : 'Ram is a good boy'
In general, VAUX, VM (main verb) and VCOP cannot be separated easily.
Difficulty in POS Tagging
To POS tag based on rules, one simple rule could be:
है preceded by a verb → VAUX
है preceded by a nominal → VCOP (facilitates co-reference, समानाधिकरण)
This is a High Precision, Low Recall rule: when it says Yes it is indeed a Yes, but a No may not actually be a No.
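A minimal sketch of this rule in Python (the function name and the abstention behavior are our own assumptions; the tag names follow the slides):

```python
# Sketch: decide the tag of "hai" from the tag of the preceding word.
# Abstaining when neither condition holds is what keeps precision high
# at the cost of recall.
def tag_hai(prev_tag):
    if prev_tag.startswith("V"):          # preceded by a verb -> auxiliary
        return "VAUX"
    if prev_tag in ("NN", "NNP", "JJ"):   # preceded by a nominal -> copula
        return "VCOP"
    return None                           # abstain when unsure (low recall)
```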
Exceptions to the previous rule
False negative for VAUX: particle insertion (particles: भी bhi, तो to, ही hi, नहीं nahi).
Consider the following sentences:
राम गाता तो अच्छा है, पर... ('Ram does sing well, but...')
राम अच्छा है_VCOP ('Ram is good')
राम तो गाता अच्छा है_VAUX ('As for Ram, he sings well')
The POS tags of है vary here even though the preceding word is an adjective in each case.
Evaluation of POS Tag Accuracy: Precision, Recall and F-Score
Let G be the set of tags our system returns and I the ideal set (the actual tags).
G ∩ I : agreement; G \ I : false positives; I \ G : false negatives
Precision P = |G ∩ I| / |G|
Recall R = |G ∩ I| / |I|
F-Score = 2PR / (P + R)
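As a sketch of these definitions (our own illustration, not from the slides), treating G and I as sets of (position, tag) decisions, so that a tagger which abstains on some positions, like the rule above, can have precision higher than its recall:

```python
# Sketch: P, R, F over tag decisions. G = (position, tag) pairs the system
# commits to (None = abstain), I = the gold pairs; G & I is the agreement.
def prf(system, gold):
    G = {(i, t) for i, t in enumerate(system) if t is not None}
    I = set(enumerate(gold))
    agree = len(G & I)
    P = agree / len(G)
    R = agree / len(I)
    F = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F

print(prf(["VAUX", None, "VCOP"], ["VAUX", "VCOP", "VCOP"]))
# (1.0, 0.666..., 0.8): perfect precision, lower recall
```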
POS tag computation (1/2)
Best tag sequence = T* = argmax_T P(T|W) = argmax_T P(T) P(W|T)   (by Bayes' theorem)
P(T) = P(t_0 = ^, t_1, t_2, ..., t_{n+1} = .)
     = P(t_0) P(t_1|t_0) P(t_2|t_1,t_0) P(t_3|t_2,t_1,t_0) ... P(t_{n+1}|t_n,...,t_0)
     = P(t_0) P(t_1|t_0) P(t_2|t_1) ... P(t_{n+1}|t_n)
     = ∏_{i=1}^{n+1} P(t_i | t_{i-1})   (Bigram Assumption; t_0 = ^ is fixed, so P(t_0) = 1)
POS tag computation (2/2)
P(W|T) = P(w_0 | t_0...t_{n+1}) P(w_1 | w_0, t_0...t_{n+1}) P(w_2 | w_1, w_0, t_0...t_{n+1}) ... P(w_{n+1} | w_0...w_n, t_0...t_{n+1})
Assumption: a word is determined completely by its tag; this is inspired by speech recognition.
     = P(w_0|t_0) P(w_1|t_1) ... P(w_{n+1}|t_{n+1})
     = ∏_{i=0}^{n+1} P(w_i | t_i)   (Lexical Probability Assumption)
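Putting the two assumptions together, the probability of one candidate tag sequence factorizes into bigram transition terms and lexical terms; a minimal sketch in Python (the dictionaries trans and lex are assumed estimates from a tagged corpus, as computed further below, not something given in the lecture):

```python
# Sketch: score one candidate tag sequence under the bigram and lexical
# probability assumptions. trans[(t_prev, t)] = P(t | t_prev),
# lex[(w, t)] = P(w | t).
def score(words, tags, trans, lex):
    full = ["^"] + list(tags) + ["."]       # t_0 = ^ and t_{n+1} = . as delimiters
    p = 1.0
    for prev, cur in zip(full, full[1:]):   # P(T): product of bigram probabilities
        p *= trans.get((prev, cur), 0.0)
    for w, t in zip(words, tags):           # P(W|T): product of lexical probabilities
        p *= lex.get((w, t), 0.0)
    return p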
Example
'People jump high.'
People: Noun/Verb; jump: Noun/Verb; high: Noun/Adjective
Given these possible tags, we can start working with the probabilities.
Trellis diagram
[Figure: trellis from ^ to $; columns: People {N, VM}, jump {N, VM}, high {N, JJ}]
8 POS tag sequences are possible, given these valid tags for each word taken from the dictionary.
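With only eight admissible paths through this trellis, the argmax can be found by brute force; a sketch reusing the score() function above, with the per-word tag options taken from this slide:

```python
# Sketch: enumerate all 2 x 2 x 2 = 8 tag sequences the trellis allows and
# keep the highest-scoring one. trans and lex are assumed estimates as before.
from itertools import product

words = ["People", "jump", "high"]
options = [["N", "VM"], ["N", "VM"], ["N", "JJ"]]

def best_sequence(trans, lex):
    candidates = list(product(*options))    # all 8 candidate tag sequences
    return max(candidates, key=lambda tags: score(words, tags, trans, lex))
```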
Calculation from actual data
Corpus:
^ Ram got many NLP books. He found them all very interesting.
POS tagged (each sentence delimited by ^ and .):
^ N V A N N . ^ N V N A R A .
Recording numbers (bigram counts: row = preceding tag, column = following tag)
      ^   N   V   A   R   .
  ^   0   2   0   0   0   0
  N   0   1   2   1   0   1
  V   0   1   0   1   0   0
  A   0   1   0   0   1   1
  R   0   0   0   1   0   0
  .   1   0   0   0   0   0
Probabilities (P(column tag | row tag), each row of counts normalized)
      ^   N    V    A    R    .
  ^   0   1    0    0    0    0
  N   0   1/5  2/5  1/5  0    1/5
  V   0   1/2  0    1/2  0    0
  A   0   1/3  0    0    1/3  1/3
  R   0   0    0    1    0    0
  .   1   0    0    0    0    0
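The probability table is obtained from the count table by normalizing each row, i.e., maximum-likelihood estimation; a sketch that reproduces it, assuming the ^/. delimited tag stream given above:

```python
# Sketch: MLE bigram transition probabilities,
# P(t | t_prev) = count(t_prev, t) / count(t_prev).
from collections import Counter
from fractions import Fraction

stream = "^ N V A N N . ^ N V N A R A .".split()
bigrams = Counter(zip(stream, stream[1:]))
totals = Counter(stream[:-1])               # the final '.' has no successor

trans = {(p, c): Fraction(n, totals[p]) for (p, c), n in bigrams.items()}
print(trans[("N", "V")])                    # 2/5, matching the table above
```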
Penn tagset (1/2)
Penn tagset (2/2)
Indian Language Tagset: Noun
Indian Language Tagset: Pronoun
Indian Language Tagset: Quantifier
Indian Language Tagset: Demonstrative
  Sl. No  Category       Label  Annotation  Examples
  3       Demonstrative  DM     DM          vaha, jo, yaha
  3.1     Deictic        DMD    DM DMD      vaha, yaha
  3.2     Relative       DMR    DM DMR      jo, jis
  3.3     Wh-word        DMQ    DM DMQ      kis, kaun
  3.4     Indefinite     DMI    DM DMI      koI, kis
Indian Language Tagset: Verb, Adjective, Adverb
Indian Language Tagset: Postposition, conjunction
Indian Language Tagset: Particle
Indian Language Tagset: Residuals