TnT Part of Speech Tagger
By Thorsten Brants
Presented by Arghya Roy Chaudhuri, Kevin Patel, Satyam
July 29, 2014
Outline
1. Why Then? Why Now?
2. Underlying Model; Other technicalities
3. Evaluation by the authors; Evaluation by others
4. Conclusion
Why Then?
- Published in 2000 [Bra00]
- One of the first to show that a tagger based on Markov models can yield state-of-the-art results
Why Now?
- Citation count: 305
- Tested across different languages, different domains, and so on
Trigrams'n'Tags
A second-order Hidden Markov Model, with careful decisions regarding:
- Handling of start- and end-of-sequence
- Smoothing
- Capitalization
- Handling of unknown words
- Improving tagging speed
Second Order Hidden Markov Model
Tri-gram Model
Given a word sequence w_1, w_2, \ldots, w_T, find the tag sequence t_1, t_2, \ldots, t_T (each t_i in the tag set) that maximizes

\operatorname*{argmax}_{t_1, \ldots, t_T} \left[ \prod_{i=1}^{T} P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i) \right] P(t_{T+1} \mid t_T)

where t_{-1}, t_0, and t_{T+1} denote beginning-of-sequence and end-of-sequence markers.
NB: If sentence boundaries are not marked in the input, TnT inserts these markers when it encounters one of [.!?;] as a token.
Tri-gram Model (contd.)
Define \hat{P} as the maximum likelihood estimate and N as the total number of tokens in the training corpus:
- Unigrams: \hat{P}(t_3) = f(t_3) / N
- Bigrams: \hat{P}(t_3 \mid t_2) = f(t_2, t_3) / f(t_2)
- Trigrams: \hat{P}(t_3 \mid t_1, t_2) = f(t_1, t_2, t_3) / f(t_1, t_2)
- Lexical: \hat{P}(w_3 \mid t_3) = f(w_3, t_3) / f(t_3)
for all t_1, t_2, t_3 in the tagset and w_3 in the lexicon.
Note: \hat{P} is defined to be 0 if the corresponding numerator and denominator are 0.
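A minimal sketch of these counts and estimates in Python (the function names and the corpus format, sentences as lists of (word, tag) pairs, are our own, not from the paper):

```python
from collections import Counter

def count_ngrams(tagged_sents):
    """Collect tag n-gram and word/tag counts from a corpus given as
    a list of [(word, tag), ...] sentences."""
    uni, bi, tri, lex = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sents:
        tags = [t for _, t in sent]
        for w, t in sent:
            uni[t] += 1
            lex[w, t] += 1
        bi.update(zip(tags, tags[1:]))
        tri.update(zip(tags, tags[1:], tags[2:]))
    return uni, bi, tri, lex, sum(uni.values())

def ml_estimates(uni, bi, tri, lex, n):
    """The four ML estimates from the slide; each is taken to be 0
    when its denominator is 0."""
    def p_uni(t3):
        return uni[t3] / n if n else 0.0
    def p_bi(t2, t3):
        return bi[t2, t3] / uni[t2] if uni[t2] else 0.0
    def p_tri(t1, t2, t3):
        return tri[t1, t2, t3] / bi[t1, t2] if bi[t1, t2] else 0.0
    def p_lex(w3, t3):
        return lex[w3, t3] / uni[t3] if uni[t3] else 0.0
    return p_uni, p_bi, p_tri, p_lex
```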
Other Intricate Technicalities
Smoothing

P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3 \mid t_2) + \lambda_3 \hat{P}(t_3 \mid t_1, t_2)

where 0 \le \lambda_i \le 1 for i \in \{1, 2, 3\}, such that \lambda_1 + \lambda_2 + \lambda_3 = 1.
The values of the \lambda_i are estimated by deleted interpolation.
Procedure to Calculate the \lambda_i
1: set \lambda_1 = \lambda_2 = \lambda_3 = 0
2: for each trigram (t_1, t_2, t_3) with f(t_1, t_2, t_3) > 0 do
3:   depending on which of the following values is the maximum:
       case (f(t_1, t_2, t_3) - 1) / (f(t_1, t_2) - 1): \lambda_3 += f(t_1, t_2, t_3)
       case (f(t_2, t_3) - 1) / (f(t_2) - 1): \lambda_2 += f(t_1, t_2, t_3)
       case (f(t_3) - 1) / (N - 1): \lambda_1 += f(t_1, t_2, t_3)
4: end for
5: normalize \lambda_1, \lambda_2, \lambda_3
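One way to implement this procedure, reusing the counters from the earlier sketch (our reading of the pseudocode; the slide does not say how to break ties, so we pick the lower-order model):

```python
def deleted_interpolation(uni, bi, tri, n):
    """Estimate (lambda_1, lambda_2, lambda_3) by deleted interpolation,
    following the pseudocode on the slide; counts come from the
    count_ngrams sketch above."""
    lam = [0.0, 0.0, 0.0]
    for (t1, t2, t3), f123 in tri.items():
        # Subtracting 1 removes ("deletes") the current trigram from
        # the counts before asking which model predicts t3 best.
        c3 = (f123 - 1) / (bi[t1, t2] - 1) if bi[t1, t2] > 1 else 0.0
        c2 = (bi[t2, t3] - 1) / (uni[t2] - 1) if uni[t2] > 1 else 0.0
        c1 = (uni[t3] - 1) / (n - 1) if n > 1 else 0.0
        best = max(range(3), key=lambda i: (c1, c2, c3)[i])
        lam[best] += f123  # credit the full trigram count to the winner
    total = sum(lam)
    return [x / total for x in lam] if total else lam
```

Subtracting 1 from numerator and denominator simulates removing the current trigram from the training data, which is what "deleted" interpolation refers to.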
Capitalization
- Capitalization plays a vital role (English: proper nouns; German: all nouns)
- The probability distribution of tags around capitalized words differs from that around the rest
- Define c_i = 1 if w_i is capitalized, 0 otherwise
- Use P(t_3, c_3 \mid t_1, c_1, t_2, c_2) instead of P(t_3 \mid t_1, t_2); the tri-gram model equations change accordingly
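In effect this doubles the tag set: each state becomes a (tag, flag) pair. A one-line illustration of the transformation (our sketch; TnT handles this internally):

```python
def with_cap_flags(tagged_sent):
    # Fold the capitalization flag into the tag, so the unchanged
    # trigram machinery operates on (tag, flag) states.
    return [(w, (t, w[:1].isupper())) for w, t in tagged_sent]
```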
Handling of Unknown Words
- Handled best by suffix analysis (proposed by Samuelsson in 1993), especially for inflected languages
- What is meant by "suffix"? The final sequence of characters of a word, not necessarily a linguistically meaningful suffix
- e.g., for "smoothing": g, ng, ing, hing, thing, othing, oothing, moothing, smoothing
Handling of Unknown Words (contd.)
For suffixes of length i = m, \ldots, 0, the lexical probability of an unknown word is obtained by Bayesian inversion:

P(l_{n-i+1}, \ldots, l_n \mid t) \propto P(t \mid l_{n-i+1}, \ldots, l_n) / P(t)

Define \hat{P} as the ML estimate obtained from frequencies in the lexicon:

\hat{P}(t \mid l_{n-i+1}, \ldots, l_n) = f(t, l_{n-i+1}, \ldots, l_n) / f(l_{n-i+1}, \ldots, l_n)

The suffix distributions are smoothed recursively (successive abstraction), starting from P(t) = \hat{P}(t):

P(t \mid l_{n-i+1}, \ldots, l_n) = \frac{\hat{P}(t \mid l_{n-i+1}, \ldots, l_n) + \theta_i \, P(t \mid l_{n-i+2}, \ldots, l_n)}{1 + \theta_i}

with weights set to the variance of the unconditioned tag probabilities:

\theta_i = \frac{1}{s-1} \sum_{j=1}^{s} \left( \hat{P}(t_j) - \bar{P} \right)^2, \quad \bar{P} = \frac{1}{s} \sum_{j=1}^{s} \hat{P}(t_j)

Note: here m = 10.
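A sketch of the suffix model in Python, assuming the same (word, tag) corpus format as before. This is simplified: the real TnT is more selective about which training words feed these counts, and the recursion direction follows our reconstruction of the slide's formula.

```python
from collections import Counter

def suffix_model(tagged_words, max_len=10):
    """Suffix-based tag distributions for unknown words, smoothed by
    successive abstraction as on the slide."""
    tag_freq = Counter(t for _, t in tagged_words)
    suf_freq, suf_tag_freq = Counter(), Counter()
    for w, t in tagged_words:
        for i in range(1, min(max_len, len(w)) + 1):
            suf = w[-i:]
            suf_freq[suf] += 1
            suf_tag_freq[suf, t] += 1

    # theta: variance of the unconditioned tag probabilities
    n = sum(tag_freq.values())
    s = len(tag_freq)
    p_hat = {t: f / n for t, f in tag_freq.items()}
    p_bar = sum(p_hat.values()) / s
    theta = (sum((p - p_bar) ** 2 for p in p_hat.values()) / (s - 1)
             if s > 1 else 0.0)

    def p_tag_given_suffix(t, word):
        p = p_hat.get(t, 0.0)  # initialization: P(t) = P_hat(t)
        for i in range(1, min(max_len, len(word)) + 1):
            suf = word[-i:]
            ml = (suf_tag_freq[suf, t] / suf_freq[suf]
                  if suf_freq[suf] else 0.0)
            p = (ml + theta * p) / (1 + theta)  # successive abstraction
        return p

    return p_tag_given_suffix
```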
Beam Search
- A faster, approximate version of the Viterbi algorithm
- Explores only states whose score is above a certain threshold
- Does not guarantee finding the highest-probability path, but performs well in practice
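A minimal sketch of beam-pruned Viterbi decoding, assuming callables log_trans(t1, t2, t3) and log_emit(w, t) that return smoothed log probabilities (for instance built from the estimates above). The beam width and the pruning rule, dropping states more than a fixed log-space margin below the best, are illustrative choices, not TnT's exact ones.

```python
import math

def beam_viterbi(words, tags, log_trans, log_emit, beam=5.0):
    """Viterbi decoding over (t_{i-1}, t_i) states with beam pruning.
    The end-of-sequence transition is omitted for brevity."""
    BOS = "<s>"
    states = {(BOS, BOS): (0.0, [])}  # state -> (log score, best path)
    for w in words:
        new_states = {}
        for (t1, t2), (score, path) in states.items():
            for t3 in tags:
                cand = score + log_trans(t1, t2, t3) + log_emit(w, t3)
                prev = new_states.get((t2, t3))
                if prev is None or cand > prev[0]:
                    new_states[t2, t3] = (cand, path + [t3])
        best = max(v[0] for v in new_states.values())
        # Beam pruning: explore only states close enough to the best.
        states = {k: v for k, v in new_states.items()
                  if v[0] >= best - beam}
    return max(states.values(), key=lambda v: v[0])[1]
```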
Evaluation Setting
Datasets:
- NEGRA corpus: German newspaper corpus
- Penn Treebank: the Wall Street Journal portion
Dataset splits:
- Contiguous
- Round-robin
Performance metrics:
- Tagging accuracy for known and, more importantly, unknown words
- Effect of training set size on accuracy
- Accuracy of reliable tag assignments
Handling of Unknown Words
[Results slide; figure/table not recoverable from the extraction]
Learning with respect to Dataset Size
[Two figure slides; learning-curve images not recoverable from the extraction]
Accuracy of Reliable Assignments
[Two figure slides; images not recoverable from the extraction]
Evaluation by Others
Different groups have evaluated TnT along different axes.
Different Languages
- Does not work well for morphologically complex languages (e.g., Icelandic)
- Solution: fill gaps in the lexicon using language-specific morphological analyzers [Lof07]
- Worked well for German, though
- Open question: what forms of morphological complexity cause trouble?
Different Domains
Works well for domain-specific POS tagging:
- if trained on a large domain-specific corpus [HW04]
- if trained on a large generic corpus plus a small additional domain-specific corpus [CPA+05]
The Thing about Accuracy
- Accuracies of over 97% ... are per-token accuracies
- What about sentence-level accuracy?
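A back-of-the-envelope calculation (our addition, assuming errors are roughly independent across tokens) shows why the distinction matters:

P(\text{20-token sentence fully correct}) \approx 0.97^{20} \approx 0.54

So a tagger that is nearly perfect per token still makes at least one error in roughly half of all 20-word sentences.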
The Thing about Accuracy (contd.)
Figure: Tagging accuracies on the WSJ development set [Man11]
Different POS Tagging Error Types
Figure: Frequency of different POS tagging error types [Man11]
Conclusion
- A significant milestone in the history of part-of-speech tagging
- A good point of entry into statistical NLP
References I
[Bra00] Thorsten Brants, TnT: A statistical part-of-speech tagger, Proceedings of the Sixth Conference on Applied Natural Language Processing (ANLC '00), Association for Computational Linguistics, Stroudsburg, PA, USA, 2000, pp. 224–231.
[CPA+05] Anni R. Coden, Serguei V. Pakhomov, Rie K. Ando, Patrick H. Duffy, and Christopher G. Chute, Domain-specific language models and lexicons for tagging, Journal of Biomedical Informatics 38 (2005), no. 6, 422–430.
[HW04] Udo Hahn and Joachim Wermter, High-performance tagging on medical texts, Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), Association for Computational Linguistics, Stroudsburg, PA, USA, 2004.
References II
[Lof07] Hrafn Loftsson, Tagging Icelandic text using a linguistic and a statistical tagger, Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume, Short Papers (NAACL-Short '07), Association for Computational Linguistics, Stroudsburg, PA, USA, 2007, pp. 105–108.
[Man11] Christopher D. Manning, Part-of-speech tagging from 97% to 100%: Is it time for some linguistics?, Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing '11), Volume Part I, Springer-Verlag, Berlin, Heidelberg, 2011, pp. 171–189.
Thank you!