An Improved Stemming Approach Using HMM for a Highly Inflectional Language


Navanath Saharia (1), Kishori M. Konwar (2), Utpal Sharma (1), and Jugal K. Kalita (3)

(1) Department of CSE, Tezpur University, India, {nava tu,utpal}@tezu.ernet.in
(2) Department of MI, University of British Columbia, Canada, kishori82@yahoo.com
(3) Department of CS, University of Colorado at Colorado Springs, USA, jkalita@uccs.edu

A. Gelbukh (Ed.): CICLing 2013, Part I, LNCS 7816, © Springer-Verlag Berlin Heidelberg 2013


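Notation: the input is a sequence of words $w_0, w_1, \ldots, w_{n-1}$. Each word is written $w_i = p_i s_i$, where $p_i$ is the stem and $s_i \in S$ is the suffix. The empty suffix is $\epsilon$; when $s = \epsilon$, $w = p$.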

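[Figure: state diagrams, panels (a) and (b), with node labels $(\epsilon)$, $(S_1)$, $(S_m)$.]

A word is either of the form $w = p s$ with $s \in S$ and $s \neq \epsilon$, or it carries the empty suffix $s = \epsilon$, so that $w = p$. The two cases are labelled $M$ and $N$. For a sentence of $l$ words $w_0, w_1, \ldots, w_{l-1}$, the label sequence is $q_0, q_1, \ldots, q_{l-1}$ with $q_i \in Q = \{N, M\}$.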

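Example, with words $w_0, \ldots, w_6$, stems $p$, suffixes $s$, and labels $q$:

w: nabinhatar ghar aamar gharar para man durat
p: nabin ghar aamar ghar para dur
s: ɛ ɛ ɛ
q: M N N M N M M

Suffixes fall into the classes $\{\epsilon\}$, $S_1$, and $S_m$, so the emitted symbol is one of $S = \{\epsilon, s_1, s_m\}$. Depending on whether $q_i = M$ or $q_i = N$, some emissions are ruled out, that is, $e_{q_i}(s) = 0$.

The transition parameters $a_{kl}$ and emission parameters $e_k(b)$ are estimated from the counts $A_{kl}$ and $E_k(b)$:

$$\hat{a}_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'} + \delta} \quad \text{and} \quad \hat{e}_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b') + \delta}$$

where $\delta$ is a constant.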
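The estimates above divide each count by its row total plus $\delta$. The sketch below applies that formula to transition counts taken from the example label sequence M N N M N M M. It is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, and $\delta = 1$ is an arbitrary value chosen for the example.

```python
from typing import Dict, Sequence

Counts = Dict[str, Dict[str, float]]


def transition_counts(labels: Sequence[str]) -> Counts:
    """Count A[k][l]: how often label k is immediately followed by label l."""
    counts: Counts = {}
    for k, l in zip(labels, labels[1:]):
        row = counts.setdefault(k, {})
        row[l] = row.get(l, 0) + 1
    return counts


def smoothed_estimates(counts: Counts, delta: float) -> Counts:
    """Return counts[k][x] / (sum over x' of counts[k][x'] + delta)."""
    estimates: Counts = {}
    for k, row in counts.items():
        denom = sum(row.values()) + delta  # row total plus delta
        estimates[k] = {x: c / denom for x, c in row.items()}
    return estimates


labels = list("MNNMNMM")       # label sequence from the example
A = transition_counts(labels)  # A["M"] == {"N": 2, "M": 1}
a_hat = smoothed_estimates(A, delta=1.0)  # delta = 1 is an assumed value
print(a_hat)
```

The same `smoothed_estimates` function applies unchanged to emission counts $E_k(b)$. With $\delta > 0$ each row sums to less than one, which follows directly from the formula as written.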

[Figure: HMM state diagram with node labels $(M_{s_m})$, $(M_{s_1})$, $(N_e)$, $(N_{s_1})$, $(S_1)$, $(S_m)$, $(\epsilon)$.]

[Tables: entries indexed by the states $S_0$, $M$, $N$ and the symbols $\epsilon$, $s_1$, $s_m$.]

