1 1 n-gram 2 Applying Phonetic Matching Algorithm to Tongue Twister Retrieval in Japanese Michiko Yasukawa 1 and Hidetoshi Yokoo 1 In this paper, we propose a Japanese phonetic matching algorithm for tongue twister retrieval. We first apply a standard English phonetic algorithm to Japanese texts written in Romaji. Since a transliterated text in Romaji does not retain its original phonetic sound of Japanese, some search results are poor from the viewpoint of approximate phonetic matching in Japanese. To solve the problem, we present a new approach to phonetic matching, which is based on the Japanese syllabary and on the n-gram based inverted index search. We also report user evaluation of searched results with our approach. 1. 1 Gunma University 1) 2) 5) 4) 6) 7) 8),9) 10),11) (wordplay) 12) 13) 14) 15) 2. (phonetic matching algorithm) Soundex 16) Metaphone 17) Soundex Russell 1918 1922 (surname) 16) Soundex SMITH SMYTH S-530 1 1 18) Soundex 1 c 2011 Information Processing Society of Japan
Step-1 Step-2 Step-3 Step-4 a, e, h, i, o, u, w, y b, f, p, v 1 l 4 c, g, j, k, q, s, x, z 2 m, n 5 d, t 3 r 6 2 h w 3 0 3 4 Fig. 1 1 Soundex Basic coding rules and processing of the Soundex Indexing System. 19) Soundex 1 18) Soundex Soundex Calculator 1 Soundex Soundex 4 ( 5 ) Soundex Philips 17) Metaphone Double Metaphone Metaphone Soundex Metaphone Soundex Soundex 6 (123456) Metaphone 16 (0BFHJKLMNPRSTWXY) 0 th X sh ch Metaphone Soundex Metaphone Soundex Metaphone Soundex 1 Double Metaphone (primary) (secondary) 1 2 Metaphone SMITH 1 SM0 Double Metaphone primary SM0 secondary XMT 2 Double Metaphone secondary primary Double Metaphone Metaphone Metaphone AUTO AT OTTO OT Double Metaphone OTTO AT AUTO OTTO 2 ( 6 ) 20) metaphone Fig. 2 Number of unique n-grams of Hiragana/Katakana, Romaji, metaphone in a Japanese-language dictionary 20) 1 http://www.eogn.com/soundex/ 2 c 2011 Information Processing Society of Japan
1 Metaphone Table 1 Approximate string matching in Japanese using metaphone. A () ( ) ( ) () ( ) ( ) ( ) () ( ) () B ( ) () C () () ( ) ( ) D ( ) ( ) () ( ) () () () () () () () ( ) () E ( ) ( ) () () L7 L6 L4 L6 L2 L4 L7 L6 L4 L4 L6 2 Metaphone (1) Table 2 Encoding table to convert katakana readings to Romaji and Metaphone (1). ( ) ( ) ( ) K (ka) (ki) (ku) (ke) (ko) K (ga) (gi) (gu) (ge) (go) S (sa) (su) (se) (so) S (za) (zu) (ze) (zo) X (shi) (chi) J (ji) T (ta) (te) (to) T (da) (di) (du) (de) (do) TS (tsu) ( ) N (na) (ni) (nu) (ne) (no) H (ha) (he) (ho) B (ba) (bi) (bu) (be) (bo) (hi) F (fu) P (pa) (pi) (pu) (pe) (po) M (ma) (mi) (mu) (me) (mo) Y (ya) (yu) (yo) ( ) R (ra) (ri) (ru) (re) (ro) W (wa) (wa) (wo) ( ) 3 Metaphone (2) Table 3 Encoding table to convert katakana readings to Romaji and Metaphone (2). A 1 (a) (a) 2 I 1 (i) (i) 2 U 1 (u) (u) 2 E 1 (e) (e) 2 O 1 (o) (o) 2 TS ( ) 4 2 ( ) 4 1 Y ( ) ( ) H ( ) ( ) 1 2 Table 4 4 Classification of sounds behind the Japanese germinate consonant. ( ) 3 c 2011 Information Processing Society of Japan
3. 23 20) KAKASI 1 PostgreSQL 2 fuzzystrmatch Metaphone metaphone metaphone 2 4 10 metaphone 5) metaphone metaphone 5 metaphone 8 20 7 1 1 8 A E L8 9 0 L8 8 / / / / 1 http://kakasi.namazu.org/ 2 http://www.postgresql.org/ D 5 metaphone KKYKS (kakyakuse) metaphone (kagayakasu) 1 Metaphone kya gaya KY metaphone metaphone 2 ( ) metaphone metaphone ( ) KAKASI ( ) metaphone 2 3 (sha)(shu)(sho)(cha)(chu)(cho) (ja)(ju)(jo) y Metaphone Y (kya) (kyu)(kyo) Metaphone KY KY Metaphone (hya)(hyu)(hyo) y h (g) (k) K ( ) (gogohyakusa) metaphone KKYKS ( ) (kakyakuse) metaphone KKYKS Metaphone Double Metaphone ( ) ( ) Metaphone AK EKW Double Metaphone AK 4 c 2011 Information Processing Society of Japan
2 5 Table 5 Encoding table to convert katakana readings to hiragana phonetic symbols. ( ) ( ) ( ) ( ) 4. 2 7) ) ( ) 5 1 5 ( ) jpma-1 (Japanese Phonetic Matching Algorithm-1) jpma-1 5 n 1 n n-gram n=3 (3-gram) ( ) 1 3 { } n-gram n-gram (inverted index) n n n 5-gram 2 jpma-1 Metaphone jpma-2 (Japanese Phonetic Matching Algorithm-2) Metaphone jpma-2 1 2 ( )( ) ( ) ( ) jpma-2 5 jpma-1 n n-gram n=3 (3-gram) 1 3 { } jpma-1 n-gram 5 c 2011 Information Processing Society of Japan
6 (1) Table 6 Tongue twister retrieval with Japanese phonetic matching algorithm (1). ( ) ( ) ( ) ( ) ( ) ( ) () ( ) ( ) ( ) ( ) ( ) () ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) () () () 7 (2) Table 7 Tongue twister retrieval with Japanese phonetic matching algorithm (2). ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) () ( ) ( ) () ( ) ( ) () ( ) () ( ) () ( ) ( ) () () ( ) ( ) () () () 5. 2 1 1 1 15) DVD-ROM 5 Fig. 3 3 The remarks column of Aozorabunko. 8 Table 8 Difficult parts to pronounce. 6 () 5 ( ) ( ) ( ) 4 ( ) ( ) () () () 3 ( ) ( ) () 2 ( ) ( ) () ( ) ( ) 1 ( ) ( ) () ( ) ( ) ( ) () ( ) () ( ) ( ) ChaSen (chasen-2.4.2 + ipadic- 2.7.0) 1 10 100 2,973 1,711,453 1 http://chasen-legacy.sourceforge.jp/ 6 c 2011 Information Processing Society of Japan
5 1 5-gram (index term) Indri 1 TREC SGML 5 5 jpma-2 3-gram <DOC> <DOCNO>EX-00001</DOCNO> <TEXT> </TEXT> </DOC> 4 TREC SGML ( 3-gram ) Fig. 4 The TREC SGML document format (An example of character based 3-grams of phonetic syllables). 6) 2 3 1 ChaSen 1 5 5-gram Indri Okapi BM25 (k1 = 1.2, b = 0.75, k3 = 7) jpma-1 jpma-2 3 A3 8 20 7 1 1 http://sourceforge.net/apps/trac/lemur/ 2 3 ( ) 6 7 jpma-1 jpma-2 1 1 () 8 ( ) ( ) ( ) () ( ) ( ) ( ) ( ) ( ) ( )jpma-1 jpma-2 jpma-2 jpma-1 jpma-2 jpma-1 jpma-2 6. 2 7 c 2011 Information Processing Society of Japan
( : 21700273) 1) ( ) Vol.82, No.12, pp.50 65 (2005). 2) Fromkin, V.: Slips of the tongue, pp.181 187 (1973). 3) Vol.19, No.3, pp.193 198 DOI:http://dx.doi.org/10.2496/apr.19.193 (1999). 4) (6) pp.131 132, (2004). 5) (19) pp.83 86, (2007). 6) (1987). 7) NL Vol. 2011-NL-201, No. 14, pp. 1 6 http://id.nii.ac.jp/1001/00074041/ (2011). 8) Navarro, G.: A guided tour to approximate string matching, Computing Surveys (CSUR), Vol.33, No.1, pp.31 88 (online), DOI:http://dx.doi.org/10.1145/375360.375365 (2001). 9) Elmagarmid, A. K., Ipeirotis, P. G. and Verykios, V. S.: Duplicate record detection: A survey, IEEE Trans. Knowl. Data Eng., Vol. 19, No. 1, pp. 1 16 (online), available from http://doi.ieeecomputersociety.org/10.1109/tkde.2007.9 (2007). 10) Zobel, J. and Dart, P.W.: Phonetic string matching: lessons from information retrieval, SI- GIR 96 Proceedings, ACM, pp.166 172 (online), DOI:http://dx.doi.org/10.1145/243199.243258 (1996). 11) French, J. C., Powell, L. and Schulman, E.: Applications of approximate word matching in information retrieval, CIKM 97 Proceedings, ACM, pp. 9 15 (online), DOI:http://dx.doi.org/10.1145/266714.266721 (1997). 12) Taylor, J.M. and Mazlack, L.J.: Humorous wordplay recognition, SMC (4), pp.3306 3311 (online), DOI:http://dx.doi.org/10.1109/ICSMC.2004.1400851 (2004). 13) Vol.20, No.5, pp.696 708 DOI:http://dx.doi.org/10.3156/jsoft.20.696 (2008). 14) Cambouropoulos, E., Crochemore, M., Iliopoulos, C.S., Mohamed, M. and Sagot, M.-F.: All maximal-pairs in step-leap representation of melodic sequence, Inf. Sci., Vol.177, No.9, pp.1954 1962 (online), DOI:http://dx.doi.org/10.1016/j.ins.2006.11.012 (2007). 15) http://www.aozora.gr.jp/kizokeikaku/ 2007 2010 DVD-ROM (2007). 16) The U.S. National Archives and Records Administration: The Soundex Indexing System, (online), available from http://www.archives.gov/research/census/soundex.html (2007). 17) Philips, L.: The Double Metaphone Search Algorithm, C/C++ Users Journal, (online), available from http://drdobbs.com/cpp/184401251 (2000). 18) E. The Art of Computer Programming Volume 3 Sorting and Searching Second Edition pp.375 376, (2004). 19) Davidson, L.: Retrieval of misspelled names in an airlines passenger record system, Commun. ACM, Vol.5, pp.169 171 (online), DOI:http://doi.acm.org/10.1145/366862.366913 (1962). 20) (2008). 8 c 2011 Information Processing Society of Japan