Pikachu, Domosaur, and other Monolexical Languages


Sarah Allen, Jesse Dodge, Domosaur
March 2014

Abstract

Many complicated techniques have been introduced to aid in computer processing of natural languages. While this is generally considered to be a difficult task, many approaches have ignored the prevalent class of monolexical languages, or languages that consist of a single word. Here we present some desirable properties of such languages and apply techniques for common NLP tasks.

1 Introduction

Current natural language processing techniques address many problems in human-centric languages, but the community as a whole has ignored the class of monolexical languages, of which there are many. Our goal is to highlight some salient properties of these languages in hopes of expanding the capabilities of modern NLP software. While traditional approaches have assumed high language complexity, we show in Section 2 that this class of languages is in fact easily recognizable by a computer using existing techniques. We also extend current techniques to include monolexical languages in Section 3. Finally, in Section 4, we illustrate the experimental results of some techniques applied to these languages.

1.1 Motivation

Most NLP models are needlessly complicated, thus creating a headache for those who implement them. While these overwrought techniques appear to yield good results for languages such as French, English, and Chinese, the community has largely ignored the equally important class of monolexical languages. Monolexical languages have been recognized in many natural settings. A few well known examples include the languages spoken by Pokémon, Nyan Cat, and Timmy Burch of South Park, Colorado (see Figure 1). In addition to their prevalence, monolexical languages have many desirable properties which we discuss in subsequent sections.

Figure 1: Examples of creatures with monolexical languages: (a) Pikachu, (b) Nyan Cat, (c) TIMMAY!!!!!!

1.2 Definition of a Monolexical Language

A monolexical language is made up of sentences all using a single word only, which we call the basis of the language. The sentences may contain punctuation characters, but we consider only terminal punctuation, which is used to delimit the sentences. Depending on the language, the basis may appear either capitalized or in all lowercase, but we also consider capitalization to be irrelevant, so all processing is done by changing all strings to lowercase. Therefore, the formal definition of the set of all valid sentences (without punctuation) is $\bigcup_{i=0}^{\infty} \{\, w(\ w)^{i} \,\}$, where $w$ is the basis of the language and "\ " denotes a single space.

Figure 2: Domosaur in his natural habitat

In many cases, the basis of the language is eponymous with the creature that speaks it. One such example is Domosaur, a language spoken by the creature Domosaur. Domosaur is a gentle creature who was hatched from an egg, eats predominantly beef and potato stew, and is always seen in a dinosaur costume. He is also known to become flatulent when he is nervous [1]. He currently resides in the Gates building in Pittsburgh, PA. A photograph of Domosaur is depicted in Figure 2. His website can be found at http://www.cs.cmu.edu/~srallen/domosaur.html.
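The definition in Section 1.2 can be checked mechanically. The following is a minimal Python sketch, not the authors' implementation: the function name is_valid_sentence is hypothetical, and the choice to strip at most one terminal punctuation mark (., ! or ?) is an assumption made for illustration.

```python
import re


def is_valid_sentence(sentence: str, basis: str) -> bool:
    """Check membership in the set of valid sentences for a given basis.

    Hypothetical helper: the sentence is lowercased, at most one terminal
    punctuation mark is dropped, and the remainder must be the basis
    repeated one or more times, separated by single spaces.
    """
    text = sentence.lower()
    # Terminal punctuation only delimits sentences, so drop it if present.
    if text and text[-1] in ".!?":
        text = text[:-1]
    w = re.escape(basis.lower())
    # The pattern w( w)* covers w( w)^i for every i >= 0.
    return re.fullmatch(rf"{w}(?: {w})*", text) is not None
```

For instance, is_valid_sentence("Domosaur domosaur domosaur!", "domosaur") accepts the sentence, while any sentence containing a second word is rejected.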

2 Monolexical Languages are Regular

Many attempts have been made to characterize natural languages using formalisms such as context-free grammars. For most traditional languages, these attempts have been largely unsuccessful due to the complexity of the language. In this section, we demonstrate that monolexical languages are not only context free, but also regular. From Section 1.2, it is clear that all text in a monolexical language with basis string $b$ can be expressed using the regular expression

$(b(\ b)^{*}(.\mid !\mid ?)\ )^{*}\,(b(\ b)^{*}(.\mid !\mid ?))$

Now it remains to show that $\{b\}$ is in fact regular. To illustrate this, we construct a deterministic finite automaton for the basis of one example language, namely nyan. The DFA for {nyan} is shown in Figure 3. The proof of its correctness is left as an exercise to the reader. This construction can be trivially extended for other basis words.

Figure 3: DFA for the language {nyan}. An accepting path is shown in red.

3 Applying NLP Tools to Monolexical Languages

In this section, we cover the application of some common natural language processing (NLP) techniques to problems arising in monolexical languages.

3.1 Machine Translation

A great success of modern NLP has been machine translation, the automatic translation from one language to another. While previous work, such as Google Translate, has been popular, we have found translating both to and from a language to be unnecessarily complicated. With a simple relaxation of the problem, we have developed an algorithm for the holy grail of machine translation: universal translation.

3.1.1 One-Way Machine Translation

Our algorithm translates from any language to a given monolexical language. It follows a two-stage approach, outlined below.

1. Word alignment: All words in the source language translate to the basis word in the target. This generates a sentence in the monolexical language, S. Unfortunately, it is very difficult to translate from monolexical languages to more traditional languages, as many words in monolexical languages have ambiguous meanings.

2. Reordering: In monolexical languages, there is an observed phenomenon that all sentences are ordered lexicographically. Therefore, after generating S, we sort the tokens in S. Our approach is as follows:

(a) Generate all possible permutations of the set of strings in S.
(b) Score each generated permutation $\pi_i$ with the percentage of words in $\pi_i$ that appear in sorted order.
(c) Return $\arg\max_{\pi_i \in \Pi} \mathrm{Score}(\pi_i)$.

Thus it is possible to perform one-way translation with 100% accuracy in $O(n!)$ time. Future work could include improvement of the running time for this algorithm.

3.2 Sentiment Lexicon

The task of sentiment analysis is both vital and difficult. This well-studied problem has been the subject of much research. For each monolexical language, and all possible sentiments, we present an algorithm to generate a complete sentiment lexicon.

1. For each sentiment possible, add an entry to the lexicon mapping from the basis word to the sentiment.

This generates a complete sentiment lexicon, which maps from each word in the language to its sentiment.

4 Experiments

4.1 Wordcloud

Wordclouds have become a popular way to represent the relative frequency of words in a set of strings. They have also become a growing topic in computer visualization research. Using a large corpus of available text, we have constructed a word cloud that represents the relative frequency of various words in the language Domosaur. The word cloud is depicted in Figure 4.

Figure 4: Word Cloud for Domosaur (the cloud consists of the single word "domosaur")
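Returning to the two-stage procedure of Section 3.1.1, the steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name translate is hypothetical, source words are assumed to be whitespace-separated, and the score is assumed to be the fraction of positions at which a permutation agrees with the lexicographically sorted token list, which is one possible reading of "percentage of words that appear in sorted order".

```python
from itertools import permutations


def translate(source: str, basis: str) -> str:
    """One-way translation into the monolexical language with the given basis.

    Hypothetical sketch: exhaustive search over permutations, so the cost
    grows as O(n!) in the number of source words. Keep inputs short.
    """
    # Stage 1 (word alignment): every source word maps to the basis word.
    tokens = [basis.lower() for _ in source.split()]
    if not tokens:
        return ""

    target = sorted(tokens)

    def score(perm):
        # Assumed scoring: fraction of positions matching the sorted order.
        return sum(a == b for a, b in zip(perm, target)) / len(perm)

    # Stage 2 (reordering): score every permutation and keep the best one.
    best = max(permutations(tokens), key=score)
    return " ".join(best)
```

Because every token equals the basis, every permutation attains the maximum score, so the search always returns a correctly ordered sentence; the factorial cost is retained only to mirror the procedure as described.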

4.2 Topic Model

Topic models have been used for a number of purposes in NLP, from text classification to semantic analysis. In a typical setup, a topic model learns a set of topics, where a topic is a set of semantically related words. Documents can be modeled as being generated from one or more topics. For example, a document about Pokémon could be generated from a topic containing words such as catch and pokéball. Learning a topic model can be somewhat involved, and takes a non-trivial amount of time. When dealing with monolexical languages, however, a topic model becomes much simpler and loses much of its needless complexity. In this work, we learned a topic model for an example monolexical language, and present the topics in Figure 5.

Figure 5: A topic for the language TIMMY (the topic consists of the single word "TIMMY")

5 Conclusion

In this paper we have described how an array of well-studied NLP tools can be adapted for this important class of languages. It is our hope that the techniques and results introduced in this work will create a foundation for many subsequent systems and improvements to existing NLP software. Future work for this field includes extensions to similar languages, such as oligolexical languages and empty languages.

References

[1] Wikipedia. Domo (NHK). https://en.wikipedia.org/wiki/Domo_(NHK), 2014.