Pikachu, Domosaur, and other Monolexical Languages

Sarah Allen, Jesse Dodge, Domosaur

March 2014

Abstract

Many complicated techniques have been introduced to aid in computer processing of natural languages. While this is generally considered to be a difficult task, many approaches have ignored the prevalent class of monolexical languages, or languages that consist of a single word. Here we present some desirable properties of such languages and apply techniques for common NLP tasks.

1 Introduction

Current natural language processing techniques address many problems in human-centric languages, but the community as a whole has ignored the class of monolexical languages, of which there are many. Our goal is to highlight some salient properties of these languages in hopes of expanding the capabilities of modern NLP software. While traditional approaches have assumed high language complexity, we show in Section 2 that this class of languages is in fact easily recognizable by a computer using existing techniques. We also extend current techniques to include monolexical languages in Section 3. Finally, in Section 4, we illustrate the experimental results of some techniques applied to these languages.

1.1 Motivation

Most NLP models are needlessly complicated, thus creating a headache for those who implement them. While these overwrought techniques appear to yield good results for languages such as French, English, and Chinese, the community has largely ignored the equally important class of monolexical languages. Monolexical languages have been recognized in many natural settings. A few well-known examples include the languages spoken by Pokemon, Nyan Cat, and Timmy Burch of South Park, Colorado (see Figure 1). In addition to their prevalence, monolexical languages have many desirable properties, which we discuss in subsequent sections.

(a) Pikachu (b) Nyan Cat (c) TIMMAY!!!!!!

Figure 1: Examples of creatures with monolexical languages
1.2 Definition of a Monolexical Language

A monolexical language is made up of sentences all using a single word only, which we call the basis of the language. The sentences may contain punctuation characters, but we consider only terminal punctuation, which is used to delimit the sentences. Depending on the language, the basis may appear either capitalized or in all lowercase, but we also consider capitalization to be irrelevant, so all processing is done by changing all strings to lowercase. Therefore, the formal definition of the set of all valid sentences (without punctuation) is

⋃_{i=0}^{∞} { w( w)^i },

where w is the basis of the language.

Figure 2: Domosaur in his natural habitat

In many cases, the basis of the language is eponymous with the creature that speaks it. One such example is Domosaur, a language spoken by the creature Domosaur. Domosaur is a gentle creature who was hatched from an egg, eats predominantly beef and potato stew, and is always seen in a dinosaur costume. He is also known to become flatulent when he is nervous [1]. He currently resides in the Gates building in Pittsburgh, PA. A photograph of Domosaur is depicted in Figure 2. His website can be found at http://www.cs.cmu.edu/~srallen/domosaur.html.
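As an illustrative aside, the definition above admits a direct membership test: lowercase the input, strip at most one terminal punctuation mark, and check that every remaining token equals the basis word w. The sketch below is our own, not part of any standard NLP toolkit, and the function name is_valid_sentence is a hypothetical helper.

```python
import re

def is_valid_sentence(sentence: str, basis: str) -> bool:
    """Check whether `sentence` belongs to the monolexical language
    whose basis word is `basis` (capitalization is irrelevant)."""
    s = sentence.strip().lower()
    # Strip at most one terminal punctuation mark (., !, or ?).
    s = re.sub(r"[.!?]$", "", s)
    tokens = s.split(" ")
    # A valid sentence is w ( w)^i for some i >= 0:
    # one or more copies of the basis word.
    return len(tokens) >= 1 and all(t == basis.lower() for t in tokens)

print(is_valid_sentence("Nyan nyan nyan!", "nyan"))  # True
print(is_valid_sentence("Domo arigato.", "domo"))    # False
```

Note that the empty string is rejected, matching the definition: even for i = 0, a valid sentence contains one copy of the basis.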
2 Monolexical Languages are Regular

Many attempts have been made to characterize natural languages using formalisms such as context-free grammars. For most traditional languages, these attempts have been largely unsuccessful due to the complexity of the language. In this section, we demonstrate that monolexical languages are not only context free, but also regular.

From Section 1.2, it is clear that all text in a monolexical language with basis string b can be expressed using the regular expression

(b( b)*(.|!|?) )*(b( b)*(.|!|?))

Now it remains to show that {b} is in fact regular. To illustrate this, we construct a deterministic finite automaton for the basis of one example language, namely nyan. The DFA for {nyan} is shown in Figure 3. The proof of its correctness is left as an exercise to the reader. This construction can be trivially extended for other basis words.

Figure 3: DFA for the language {nyan}. An accepting path is shown in red.

3 Applying NLP Tools to Monolexical Languages

In this section, we cover the application of some common natural language processing (NLP) techniques to problems arising in monolexical languages.

3.1 Machine Translation

A great success of modern NLP has been machine translation, the automatic translation from one language to another. While previous work, such as Google Translate, has been popular, we have found translating both to and from a language to be unnecessarily complicated. With a simple relaxation of the problem, we have developed an algorithm for the holy grail of machine translation: universal translation.

3.1.1 One-Way Machine Translation

Our algorithm translates from any language to a given monolexical language. It follows a two-stage approach, outlined below.

1. Word alignment: All words in the source language translate to the basis word in the target. This generates a sentence in the monolexical language, S. Unfortunately, it is very difficult to translate from monolexical languages to more traditional languages, as many words in monolexical languages have ambiguous meanings.

2. Reordering: In monolexical languages, there is an observed phenomenon that all sentences are ordered lexicographically. Therefore, after generating S, we sort the tokens in S. Our approach is as follows:
(a) Generate all possible permutations of the set of strings in S.

(b) Score each generated permutation π_i with the percentage of words in π_i that appear in sorted order.

(c) Return arg max_{π_i ∈ Π} Score(π_i).

Thus it is possible to perform one-way translation with 100% accuracy in O(n!) time. Future work could include improvement of the running time for this algorithm.

3.2 Sentiment Lexicon

The task of sentiment analysis is both vital and difficult. This well-studied problem has been the subject of much research. For each monolexical language, and all possible sentiments, we present an algorithm to generate a complete sentiment lexicon.

1. For each sentiment possible, add an entry to the lexicon mapping from the basis word to the sentiment.

This generates a complete sentiment lexicon, which maps from each word in the language to its sentiment.

4 Experiments

4.1 Wordcloud

Wordclouds have become a popular way to represent the relative frequency of words in a set of strings. They have also become a growing topic in computer visualization research. Using a large corpus of available text, we have constructed a word cloud that represents the relative frequency of various words in the language Domosaur. The word cloud is depicted in Figure 4.

domosaur

Figure 4: Word Cloud for Domosaur
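Returning briefly to the one-way translation algorithm of Section 3.1.1, the two stages can be written out directly. The sketch below is our own illustration: the names translate and score are ours, and step (b) is interpreted here as the fraction of adjacent token pairs in sorted order, which is one reasonable reading of "percentage of words that appear in sorted order". The brute-force search over all permutations faithfully reproduces the O(n!) running time.

```python
from itertools import permutations

def score(perm):
    """Fraction of adjacent token pairs in sorted order (step (b))."""
    if len(perm) < 2:
        return 1.0
    return sum(a <= b for a, b in zip(perm, perm[1:])) / (len(perm) - 1)

def translate(source_sentence: str, basis: str) -> str:
    """Translate any sentence into the monolexical language `basis`."""
    # Stage 1: word alignment -- every source word maps to the basis word.
    s = [basis.lower()] * len(source_sentence.split())
    # Stage 2: reordering -- step (a) enumerate all permutations,
    # step (b) score each one, step (c) return the argmax.
    best = max(permutations(s), key=score, default=())
    return " ".join(best)

print(translate("the cat sat on the mat", "nyan"))
# nyan nyan nyan nyan nyan nyan
```

Since word alignment maps every token to the same basis word, every permutation is already lexicographically sorted, which is how the 100% accuracy claim is achieved.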
4.2 Topic Model

Topic models have been used for a number of purposes in NLP, from text classification to semantic analysis. In a typical setup, a topic model learns a set of topics, where a topic is a set of semantically related words. Documents can be modeled as being generated from one or more topics. For example, a document about Pokémon could be generated from a topic containing words such as catch and pokéball. Learning a topic model can be somewhat involved, and takes a non-trivial amount of time. When dealing with monolexical languages, however, a topic model becomes much simpler and loses much of its needless complexity. In this work, we learned a topic model for an example monolexical language, and present the topics in Figure 5.

TIMMY

Figure 5: A topic for the language TIMMY

5 Conclusion

In this paper we have described how an array of well-studied NLP tools can be adapted for this important class of languages. It is our hope that the techniques and results introduced in this work will create a foundation for many subsequent systems and improvements to existing NLP software. Future work for this field includes extensions to similar languages, such as oligolexical languages and empty languages.

References

[1] Wikipedia. Domo (NHK). https://en.wikipedia.org/wiki/Domo_(NHK), 2014.