Applying Phonetic Matching Algorithm to Tongue Twister Retrieval in Japanese


Michiko Yasukawa 1 and Hidetoshi Yokoo 1
1 Gunma University

In this paper, we propose a Japanese phonetic matching algorithm for tongue twister retrieval. We first apply a standard English phonetic algorithm to Japanese texts written in Romaji. Because a text transliterated into Romaji does not retain the original Japanese phonetic sounds, some search results are poor from the viewpoint of approximate phonetic matching in Japanese. To solve this problem, we present a new approach to phonetic matching that is based on the Japanese syllabary and on n-gram based inverted index search. We also report a user evaluation of the search results obtained with our approach.

© 2011 Information Processing Society of Japan

1.
References cited in this section: 1), 2), 4), 5), 6), 7), 8), 9), 10), 11), 12) (wordplay), 13), 14), 15).

2. Phonetic matching algorithms
Soundex 16) and Metaphone 17) are phonetic matching algorithms. Soundex, patented by Russell in 1918 and 1922, indexes surnames by sound 16). For example, Soundex encodes both SMITH and SMYTH as S-530. Knuth 18) also describes Soundex.

Fig. 1 Basic coding rules and processing of the Soundex Indexing System (Step-1 to Step-4).

  Letters                   Code
  a, e, h, i, o, u, w, y    (not coded)
  b, f, p, v                1
  c, g, j, k, q, s, x, z    2
  d, t                      3
  l                         4
  m, n                      5
  r                         6

Letters with the same code that are separated only by h or w are coded once. The code is padded with 0 to three digits, which gives four characters including the initial letter. See also 18) and 19). Soundex codes can be checked with the Soundex Calculator 1.

Philips 17) proposed Metaphone and Double Metaphone as improvements on Soundex. Soundex uses six digit codes (123456), whereas Metaphone uses 16 symbols (0BFHJKLMNPRSTWXY), where 0 stands for th and X stands for sh or ch. Double Metaphone returns a primary and a secondary code for one word. For SMITH, Metaphone returns the single code SM0, while Double Metaphone returns SM0 as the primary code and XMT as the secondary code. Metaphone encodes AUTO as AT and OTTO as OT, so the two words do not match. Double Metaphone encodes OTTO as AT, so AUTO and OTTO match.

Fig. 2 Number of unique n-grams of Hiragana/Katakana, Romaji, metaphone in a Japanese-language dictionary 20).

1 http://www.eogn.com/soundex/
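The Soundex coding rules of Fig. 1 can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used in the paper: the function name `soundex` and the `S-530` output format (initial letter, hyphen, three digits) are choices made here for illustration, and the input is assumed to be an ASCII name.

```python
# Letter-to-digit table from Fig. 1; all other letters are not coded.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a Soundex key such as 'S-530' (hypothetical helper)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # code of the previously seen letter
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # an uncoded vowel resets prev to ""
    # Pad with zeros and keep exactly three digits.
    return f"{first.upper()}-{(''.join(digits) + '000')[:3]}"


print(soundex("SMITH"), soundex("SMYTH"))
```

Both names produce the same key, which is the behavior that makes Soundex usable for approximate surname lookup.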

Table 1 Approximate string matching in Japanese using metaphone.

Table 2 Encoding table to convert katakana readings to Romaji and Metaphone (1).

  Metaphone   Romaji
  K           ka, ki, ku, ke, ko
  K           ga, gi, gu, ge, go
  S           sa, su, se, so
  S           za, zu, ze, zo
  X           shi, chi
  J           ji
  T           ta, te, to
  T           da, di, du, de, do
  TS          tsu
  N           na, ni, nu, ne, no
  H           ha, he, ho
  B           ba, bi, bu, be, bo
              hi
  F           fu
  P           pa, pi, pu, pe, po
  M           ma, mi, mu, me, mo
  Y           ya, yu, yo
  R           ra, ri, ru, re, ro
  W           wa, wo

Table 3 Encoding table to convert katakana readings to Romaji and Metaphone (2). Codes listed: A (a), I (i), U (u), E (e), O (o), TS, Y, H.

Table 4 Classification of sounds behind the Japanese geminate consonant.
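To make the effect of Table 2 concrete, the following sketch maps Romaji syllables to the consonant codes listed there. It is only an illustration under stated assumptions: the names `ROWS`, `CODE_OF` and `encode` are hypothetical, the input is assumed to be already split into syllables, vowels and the rows of Table 3 are not handled, and the syllable hi is left out because its code is not legible in the table.

```python
# Rows of Table 2: Metaphone code -> Romaji syllables that receive it.
# The syllable "hi" is omitted because its code is not legible.
ROWS = {
    "K": "ka ki ku ke ko ga gi gu ge go",
    "S": "sa su se so za zu ze zo",
    "X": "shi chi",
    "J": "ji",
    "T": "ta te to da di du de do",
    "TS": "tsu",
    "N": "na ni nu ne no",
    "H": "ha he ho",
    "B": "ba bi bu be bo",
    "F": "fu",
    "P": "pa pi pu pe po",
    "M": "ma mi mu me mo",
    "Y": "ya yu yo",
    "R": "ra ri ru re ro",
    "W": "wa wo",
}
CODE_OF = {syl: code for code, syls in ROWS.items() for syl in syls.split()}


def encode(syllables):
    """Concatenate the Table 2 codes of pre-split Romaji syllables.

    Raises KeyError for syllables that Table 2 does not list.
    """
    return "".join(CODE_OF[s] for s in syllables)


# Voiced and unvoiced rows share one code, so distinct sounds collide.
print(encode(["ka", "ki"]), encode(["ga", "gi"]))
```

The collisions (ka and ga, sa and za, ta and da, shi and chi) show why codes designed for English lose distinctions that matter for Japanese phonetic matching.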

3.
This section applies Metaphone to entries of a Japanese-language dictionary 20). The tools named are KAKASI 1, which converts Japanese text to Romaji, and the metaphone function of the PostgreSQL 2 fuzzystrmatch module. See also 5).

Because Metaphone was designed for English, distinct Japanese readings can receive the same metaphone code. For example, kakyakuse and kagayakasu both receive the code KKYKS, because Metaphone reduces both kya and gaya to KY.

The contracted sounds sha, shu, sho, cha, chu, cho, ja, ju and jo are written without y in Romaji. In contrast, kya, kyu and kyo are written with y, so Metaphone emits Y and encodes them as KY. In hya, hyu and hyo, the h precedes y. In addition, g and k are both encoded as K. As a result, gogohyakusa and kakyakuse both receive the metaphone code KKYKS.

Double Metaphone shows similar behavior. Example codes given: Metaphone AK and EKW; Double Metaphone AK.

1 http://kakasi.namazu.org/
2 http://www.postgresql.org/

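Table 5 Encoding table to convert katakana readings to hiragana phonetic symbols.

4.
This section presents two algorithms based on the Japanese syllabary; see also 7).

jpma-1 (Japanese Phonetic Matching Algorithm-1): katakana readings are converted to hiragana phonetic symbols with Table 5. Character n-grams are then taken from the converted string, for example with n = 3 (3-grams), and the n-grams are stored in an inverted index for search. Longer n-grams such as 5-grams are also used.

jpma-2 (Japanese Phonetic Matching Algorithm-2): a variant that relates jpma-1 to Metaphone. After the conversion of Table 5 and its own additional encoding, it generates n-grams in the same way as jpma-1, for example with n = 3 (3-grams), and searches the n-gram index.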
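The n-gram inverted index that jpma-1 and jpma-2 rely on can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' system: the function names are hypothetical, the documents are assumed to be already converted to hiragana phonetic strings, ranking is a plain count of shared n-grams, and the two sample strings are well-known tongue twisters chosen here for illustration rather than taken from the paper's data.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs, n=3):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def search(index, query, n=3):
    """Rank documents by the number of distinct query n-grams they share."""
    hits = defaultdict(int)
    for gram in set(char_ngrams(query, n)):
        for doc_id in index.get(gram, ()):
            hits[doc_id] += 1
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative phonetic strings (assumed, not from the paper's corpus).
docs = {
    "d1": "なまむぎなまごめなまたまご",
    "d2": "となりのきゃくはよくかきくうきゃくだ",
}
index = build_index(docs, n=3)
print(search(index, "なまたまご"))
```

Because matching happens on syllabary characters rather than on Romaji letters, a query only hits documents that share the same sequence of Japanese sounds.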

Table 6 Tongue twister retrieval with Japanese phonetic matching algorithm (1).

Table 7 Tongue twister retrieval with Japanese phonetic matching algorithm (2).

5.
This section uses the Aozora Bunko DVD-ROM 15).

Fig. 3 The remarks column of Aozorabunko.

Table 8 Difficult parts to pronounce.

The texts are processed with ChaSen 1 (chasen-2.4.2 + ipadic-2.7.0). Figures reported for the collection: 2,973 and 1,711,453.

1 http://chasen-legacy.sourceforge.jp/
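Retrieval in this section ranks n-gram index terms with Okapi BM25, using k1 = 1.2, b = 0.75 and k3 = 7. The sketch below shows how such a score is computed for one document. It is an illustration only: the function name and arguments are hypothetical, the idf term uses the common log((N - n + 0.5) / (n + 0.5) + 1) variant as an assumption, and Indri's internal formula may differ in detail.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len,
               k1=1.2, b=0.75, k3=7.0):
    """Okapi BM25 score of one document for one query (hypothetical helper).

    query_terms, doc_terms: lists of index terms (e.g. character n-grams)
    doc_freq: term -> number of documents containing the term
    n_docs: number of documents; avg_len: average document length in terms
    """
    tf = Counter(doc_terms)
    qtf = Counter(query_terms)
    # Length normalisation shared by all terms of this document.
    norm = k1 * ((1.0 - b) + b * len(doc_terms) / avg_len)
    score = 0.0
    for term, q in qtf.items():
        f = tf.get(term, 0)
        if f == 0:
            continue
        n = doc_freq.get(term, 0)
        # Assumed idf variant; it is never negative.
        idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
        score += idf * (k1 + 1) * f / (norm + f) * (k3 + 1) * q / (k3 + q)
    return score


query = ["なまた", "またま", "たまご"]
doc_a = ["なまむ", "なまた", "またま", "たまご"]
doc_b = ["となり", "なりの"]
df = {"なまた": 1, "またま": 1, "たまご": 1}
print(bm25_score(query, doc_a, df, n_docs=2, avg_len=3.0))
```

The k3 factor only matters when an index term occurs more than once in the query, which can happen with repetitive n-grams in tongue twisters.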

Character 5-grams serve as index terms, and the index is built with Indri 1 from documents written in the TREC SGML format. For jpma-2, 3-grams are used.

  <DOC>
  <DOCNO>EX-00001</DOCNO>
  <TEXT>
  </TEXT>
  </DOC>

Fig. 4 The TREC SGML document format (an example of character-based 3-grams of phonetic syllables).

As in 6), the text is first analyzed with ChaSen, and its reading is converted into n-gram index terms. Retrieval uses Indri with Okapi BM25 (k1 = 1.2, b = 0.75, k3 = 7) for both jpma-1 and jpma-2. Tables 6 and 7 show the search results of jpma-1 and jpma-2, and the user evaluation refers to the difficult parts listed in Table 8.

1 http://sourceforge.net/apps/trac/lemur/

6.

(Grant No. 21700273)

1) Vol.82, No.12, pp.50-65 (2005).
2) Fromkin, V.: Slips of the tongue, pp.181-187 (1973).
3) Vol.19, No.3, pp.193-198, DOI:http://dx.doi.org/10.2496/apr.19.193 (1999).
4) (6), pp.131-132 (2004).
5) (19), pp.83-86 (2007).
6) (1987).
7) NL, Vol.2011-NL-201, No.14, pp.1-6, http://id.nii.ac.jp/1001/00074041/ (2011).
8) Navarro, G.: A guided tour to approximate string matching, Computing Surveys (CSUR), Vol.33, No.1, pp.31-88 (online), DOI:http://dx.doi.org/10.1145/375360.375365 (2001).
9) Elmagarmid, A. K., Ipeirotis, P. G. and Verykios, V. S.: Duplicate record detection: A survey, IEEE Trans. Knowl. Data Eng., Vol.19, No.1, pp.1-16 (online), available from http://doi.ieeecomputersociety.org/10.1109/tkde.2007.9 (2007).
10) Zobel, J. and Dart, P. W.: Phonetic string matching: lessons from information retrieval, SIGIR '96 Proceedings, ACM, pp.166-172 (online), DOI:http://dx.doi.org/10.1145/243199.243258 (1996).
11) French, J. C., Powell, L. and Schulman, E.: Applications of approximate word matching in information retrieval, CIKM '97 Proceedings, ACM, pp.9-15 (online), DOI:http://dx.doi.org/10.1145/266714.266721 (1997).
12) Taylor, J. M. and Mazlack, L. J.: Humorous wordplay recognition, SMC (4), pp.3306-3311 (online), DOI:http://dx.doi.org/10.1109/ICSMC.2004.1400851 (2004).
13) Vol.20, No.5, pp.696-708, DOI:http://dx.doi.org/10.3156/jsoft.20.696 (2008).
14) Cambouropoulos, E., Crochemore, M., Iliopoulos, C. S., Mohamed, M. and Sagot, M.-F.: All maximal-pairs in step-leap representation of melodic sequence, Inf. Sci., Vol.177, No.9, pp.1954-1962 (online), DOI:http://dx.doi.org/10.1016/j.ins.2006.11.012 (2007).
15) http://www.aozora.gr.jp/kizokeikaku/ 2007 2010 DVD-ROM (2007).
16) The U.S. National Archives and Records Administration: The Soundex Indexing System (online), available from http://www.archives.gov/research/census/soundex.html (2007).
17) Philips, L.: The Double Metaphone Search Algorithm, C/C++ Users Journal (online), available from http://drdobbs.com/cpp/184401251 (2000).
18) Knuth, D. E.: The Art of Computer Programming Volume 3 Sorting and Searching Second Edition, pp.375-376 (2004).
19) Davidson, L.: Retrieval of misspelled names in an airlines passenger record system, Commun. ACM, Vol.5, pp.169-171 (online), DOI:http://doi.acm.org/10.1145/366862.366913 (1962).
20) (2008).