Detection and Correction for Errors in Hiragana Sequences by a Hiragana Character N-gram

Similar documents
Natural Language Processing. Statistical Inference: n-grams

TnT Part of Speech Tagger

Speech Recognition Lecture 5: N-gram Language Models. Eugene Weinstein Google, NYU Courant Institute Slide Credit: Mehryar Mohri

The Noisy Channel Model and Markov Models

Exploring Asymmetric Clustering for Statistical Language Modeling

CMPT-825 Natural Language Processing

Information Retrieval

Machine Learning for natural language processing

Natural Language Processing (CSE 490U): Language Models

The distribution of characters, bi- and trigrams in the Uppsala 70 million words Swedish newspaper corpus

A Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister

A Simple Introduction to Information, Channel Capacity and Entropy

ERROR DETECTION AND CORRECTION IN TOPONYM RECOGNITION IN CARTOGRAPHIC MAPS *

On a New Model for Automatic Text Categorization Based on Vector Space Model

Statistical Methods for NLP

Language Processing with Perl and Prolog

Probabilistic Spelling Correction CE-324: Modern Information Retrieval Sharif University of Technology

N-grams. Motivation. Simple n-grams. Smoothing. Backoff. N-grams L545. Dept. of Linguistics, Indiana University Spring / 24

Language Technology. Unit 1: Sequence Models. CUNY Graduate Center Spring Lectures 5-6: Language Models and Smoothing. required hard optional

Foundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model

Empirical Methods in Natural Language Processing Lecture 10a More smoothing and the Noisy Channel Model

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs

Graphical models for part of speech tagging

Foundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM

Fun with weighted FSTs

Jurafsky & Martin Ch. 6: to 6.6 incl.

Statistical Methods for NLP

Math 4740: Homework 5 Solutions

Tuning as Linear Regression

1 Evaluation of SMT systems: BLEU

arxiv: v1 [cs.cl] 21 May 2017

Cross-Lingual Language Modeling for Automatic Speech Recognition

SYNTHER A NEW M-GRAM POS TAGGER

Good morning, everyone. On behalf of the Shinsekai Type Study Group, today we would like to talk about the Japanese writing system, specifically about

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.

Dept. of Linguistics, Indiana University Fall 2015

Binary Convolutional Codes

Error Correction through Language Processing

Chapter 3: Basics of Language Modelling

Probabilistic Language Modeling

Optical Character Recognition of Jutakshars within Devanagari Script

Prenominal Modifier Ordering via MSA. Alignment

Discriminative Training

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging

Language Models. Data Science: Jordan Boyd-Graber University of Maryland SLIDES ADAPTED FROM PHILIP KOEHN

Chapter 11 1 LEARNING TO FIND CONTEXT BASED SPELLING ERRORS

ECE 564/645 - Digital Communications, Spring 2018 Homework #2 Due: March 19 (In Lecture)

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance

Naïve Bayes, Maxent and Neural Models

Statistical NLP: Lecture 7. Collocations. (Ch 5) Introduction

Entropy as an Indicator of Context Boundaries An Experiment Using a Web Search Engine

Noisy Subsequence Recognition Using Constrained String Editing Involving Substitutions, Insertions, Deletions and Generalized Transpositions 1

Fast Logistic Regression for Text Categorization with Variable-Length N-grams

Natural Language Processing SoSe Language Modelling. (based on the slides of Dr. Saeedeh Momtazi)

Chapter 3: Basics of Language Modeling

Are you talking Bernoulli to me? Comparing methods of assessing word frequencies

Language Models. Philipp Koehn. 11 September 2018

ACS Introduction to NLP Lecture 3: Language Modelling and Smoothing

An Algorithm for Fast Calculation of Back-off N-gram Probabilities with Unigram Rescaling

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

Lecture 4: Smoothing, Part-of-Speech Tagging. Ivan Titov Institute for Logic, Language and Computation Universiteit van Amsterdam

Comparing Three Plagiarism Tools (Ferret, Sherlock, and Turnitin)

Language Models. CS6200: Information Retrieval. Slides by: Jesse Anderton

Wavelet Transform in Speech Segmentation

An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition

Journal of Memory and Language

Using Conservative Estimation for Conditional Probability instead of Ignoring Infrequent Case

Good-Turing Smoothing Without Tears

Primary Factors Contributing to Japan's Extremely Hot Summer of 2010

Recap: Language models. Foundations of Natural Language Processing Lecture 4 Language Models: Evaluation and Smoothing. Two types of evaluation in NLP

Automatically Evaluating Text Coherence using Anaphora and Coreference Resolution

Learning Features from Co-occurrences: A Theoretical Analysis

EECS730: Introduction to Bioinformatics

CS246 Final Exam. March 16, :30AM - 11:30AM

Statistical Substring Reduction in Linear Time

CS 224N HW: #3.

CRF Word Alignment & Noisy Channel Translation

Unsupervised Vocabulary Induction

A Surface-Similarity Based Two-Step Classifier for RITE-VAL

Formal Modeling in Cognitive Science Lecture 29: Noisy Channel Model and Applications;

Quiz 1, COMS 4705. Name: Good luck! Quiz 1, page 1 of 7

Section Summary. Sequences. Recurrence Relations. Summations. Examples: Geometric Progression, Arithmetic Progression. Example: Fibonacci Sequence

Word-Transliteration Alignment

Regular expressions and automata

NLP: N-Grams. Dan Garrette December 27, Predictive text (text messaging clients, search engines, etc)

Random Processes. By: Nick Kingsbury

Significance tests for the evaluation of ranking methods

Empirical Methods in Natural Language Processing Lecture 5 N-gram Language Models

Statistical Natural Language Processing

From perceptrons to word embeddings. Simon Šuster University of Groningen

Chapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze

Improved Decipherment of Homophonic Ciphers

Stochastic Contextual Edit Distance and Probabilistic FSTs

Featurizing Text. Bob Stine Dept of Statistics, Wharton School University of Pennsylvania

Lecture 12: Algorithms for HMMs

And for polynomials with coefficients in F_2 = Z/2: Euclidean algorithm for gcd's, concept of equality mod M(x), extended Euclid for inverses mod M(x)

CS1800: Mathematical Induction. Professor Kevin Gold

Efficient Cryptanalysis of Homophonic Substitution Ciphers

Unit 8: Part 2: PD, PID, and Feedback Compensation

Transcription:

Vol. 40 No. 6, June 1999

Detection and Correction for Errors in Hiragana Sequences by a Hiragana Character N-gram

Hiroyuki Shinnou (Dept. of Systems Engineering, Faculty of Engineering, Ibaraki University)

In this paper, we propose a hiragana character N-gram method to detect and correct errors in Japanese hiragana sequences, and we investigate the proper value of N. The word N-gram method is known to be effective for detecting and correcting errors in text. However, it is difficult to construct a word N-gram table, even for N = 3. Moreover, in Japanese the word method requires morphological analysis and a high cost for looking up an N-word sequence in the word N-gram table, so at present it is not a practical way to revise text. If the target of revision is limited to simple errors in Japanese hiragana sequences, however, the hiragana character N-gram lets us detect and correct those errors without the problems above. In this method a higher N gives higher recall but lower precision because of data sparseness, so the proper N must be chosen with the corpus size and the weight given to recall in mind. In the experiments we constructed 3-, 4-, 5- and 6-gram tables from five years of newspaper articles. Using these tables, we examined how effectively the method revises simple errors in hiragana sequences caused by a single hiragana character insertion, deletion, substitution or reversal. We conclude that the hiragana character N-gram is effective for detecting and correcting errors in hiragana sequences, and that N = 4 is the realistic choice.

1.

[Japanese text of this section was not recovered from the source.]
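Only the English abstract of the method survives in this copy, so the following is a minimal Python sketch of the idea it describes: count hiragana character N-grams in a corpus, then flag every window of N characters in an input sequence whose count falls at or below a frequency threshold. The function names, the two-line toy corpus and the threshold of 0 are illustrative assumptions, not the paper's implementation or data.

    from collections import Counter

    def build_ngram_table(corpus_lines, n):
        """Count every character n-gram occurring in a corpus of hiragana strings."""
        table = Counter()
        for line in corpus_lines:
            for i in range(len(line) - n + 1):
                table[line[i:i + n]] += 1
        return table

    def detect_errors(sequence, table, n, threshold=0):
        """Return the start positions of n-character windows whose corpus count
        is at or below the threshold, i.e. the suspected error locations."""
        return [i for i in range(len(sequence) - n + 1)
                if table[sequence[i:i + n]] <= threshold]

    # Toy usage (made-up two-line corpus, not the newspaper data used in the paper).
    corpus = ["きょうはいいてんきです", "あしたはあめでしょう"]
    table = build_ngram_table(corpus, 3)
    print(detect_errors("きょうはいいてんきです", table, 3))   # [] - every trigram was seen
    print(detect_errors("きょうはいえてんきです", table, 3))   # positions around the substituted character

With the real tables, the absolute threshold would instead be derived from the threshold ratio discussed in Sections 2 and 3.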

2.

[Japanese text of Section 2 (Subsections 2.1-2.3) was not recovered from the source. Surviving fragments mention the N-character window from position i to i + N − 1, a threshold ratio of x%, and OCR output as an error source.]
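The abstract limits the targeted errors to a single hiragana character insertion, deletion, substitution or reversal. The sketch below generates correction candidates for a flagged sequence under exactly that error model and ranks them against an N-gram table; the ranking by rarest-then-total n-gram count is an assumption made for illustration, not the scoring used in the paper.

    from collections import Counter

    HIRAGANA = [chr(c) for c in range(ord("ぁ"), ord("ん") + 1)]  # basic hiragana block

    def candidates(seq):
        """All strings reachable from seq by one insertion, deletion,
        substitution or reversal (swap of adjacent characters)."""
        cands = set()
        for i in range(len(seq) + 1):
            for ch in HIRAGANA:                          # insertion
                cands.add(seq[:i] + ch + seq[i:])
        for i in range(len(seq)):
            cands.add(seq[:i] + seq[i + 1:])             # deletion
            for ch in HIRAGANA:                          # substitution
                cands.add(seq[:i] + ch + seq[i + 1:])
        for i in range(len(seq) - 1):                    # reversal of adjacent characters
            cands.add(seq[:i] + seq[i + 1] + seq[i] + seq[i + 2:])
        cands.discard(seq)
        return cands

    def score(seq, table, n):
        """Illustrative score: prefer candidates whose rarest n-gram is most
        frequent, breaking ties by the total n-gram count."""
        counts = [table[seq[i:i + n]] for i in range(len(seq) - n + 1)]
        return (min(counts), sum(counts))

    # Toy usage with a tiny made-up bigram table.
    table = Counter()
    for line in ["きょうはいいてんきです", "あしたはいいてんきでしょう"]:
        for i in range(len(line) - 1):
            table[line[i:i + 2]] += 1
    best = max(candidates("きょうはいててんきです"), key=lambda c: score(c, table, 2))
    print(best)   # -> きょうはいいてんきです for this toy corpus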

2.4

[Japanese text of Section 2.4 was not recovered from the source. Surviving fragments refer to a quantity m + 2 with 0 < m < N − 1.]

Table 1  Length of sequences in test data

  Length | Number of sequences
  -------+--------------------
     6   |  723
     7   |  489
     8   |  290
     9   |  202
    10   |  115
    11   |   74
    12   |   41
    13   |   25
    14   |   18
    15   |   23
  Total  | 2000

3.  /  3.1  /  3.2

[Japanese text of Sections 3.1 and 3.2 was not recovered from the source. Surviving fragments and the abstract indicate that Section 3.1 describes the construction of the 3- to 6-gram tables from five years (1990-1994) of CD-ROM newspaper articles, and Section 3.2 the construction of the 2,000-sequence test data with an error ratio of 1.0%.]
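The test data described in the abstract and in Section 3.2 consists of hiragana sequences corrupted by a single-character error. A hedged sketch of such an error injector, useful for reproducing this kind of test set, is given below; the random choices and the function name are illustrative, not the paper's actual procedure.

    import random

    HIRAGANA = [chr(c) for c in range(ord("ぁ"), ord("ん") + 1)]  # basic hiragana block

    def inject_single_error(seq, rng=random):
        """Return a copy of seq containing one randomly placed hiragana error:
        an insertion, deletion, substitution or reversal of adjacent characters."""
        kind = rng.choice(["insert", "delete", "substitute", "reverse"])
        if kind == "insert":
            i = rng.randrange(len(seq) + 1)
            return seq[:i] + rng.choice(HIRAGANA) + seq[i:]
        if kind == "delete":
            i = rng.randrange(len(seq))
            return seq[:i] + seq[i + 1:]
        if kind == "substitute":
            i = rng.randrange(len(seq))
            return seq[:i] + rng.choice(HIRAGANA) + seq[i + 1:]
        i = rng.randrange(len(seq) - 1)                  # reversal
        return seq[:i] + seq[i + 1] + seq[i] + seq[i + 2:]

    # Toy usage: corrupt a correct sequence to build test data.
    random.seed(0)
    print(inject_single_error("きょうはいいてんきです"))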

Table 2  Threshold corresponding to threshold ratio 1%

  3-gram   75
  4-gram    5
  5-gram    1
  6-gram    0

Table 3  Result of experiment 1

           Count (of 2000)   Rate
  3-gram        313          0.844
  4-gram        232          0.884
  5-gram        286          0.857
  6-gram        328          0.836

  (The rate equals 1 − count / 2000 and corresponds to p_1 in the formulas below; the Japanese column headers were not recovered.)

Table 4  Detection results of experiments 2-5

           Exp. 2   Exp. 3   Exp. 4   Exp. 5   Average
  3-gram   0.607    0.921    0.898    0.926    0.838
  4-gram   0.721    0.955    0.940    0.970    0.897
  5-gram   0.789    0.974    0.955    0.982    0.925
  6-gram   0.810    0.977    0.962    0.982    0.933

  (The average corresponds to p_2 = R in the formulas below.)

Table 5  Correction results of experiments 2-5

           Exp. 2         Exp. 3         Exp. 4         Exp. 5         Average
  3-gram   0.636 (775)    0.770 (1841)   0.792 (1796)   0.846 (1852)   0.761
  4-gram   0.782 (921)    0.877 (1910)   0.879 (1879)   0.924 (1939)   0.866
  5-gram   0.731 (1007)   0.854 (1947)   0.845 (1909)   0.882 (1964)   0.828
  6-gram   0.684 (1035)   0.817 (1954)   0.801 (1923)   0.835 (1963)   0.784

  (The counts in parentheses are reproduced from the source; their Japanese column label was not recovered.)

Precision, recall and F-measure are defined as

  P = \frac{r\,p_2}{(1 - r)(1 - p_1) + r\,p_2}, \qquad R = p_2, \qquad
  F = \frac{(\beta^2 + 1.0)\,P\,R}{\beta^2 P + R} \qquad (\beta = 1.0,\ r = 0.01)

where, for T test sequences with error ratio r, T(1 − r) sequences are correct and Tr contain an error; T(1 − r)(1 − p_1) correct sequences are flagged, T r p_2 erroneous sequences are flagged, and T((1 − r)(1 − p_1) + r p_2) sequences are flagged in total, which yields the expression for P.

Table 6  Evaluation of error detection

           Precision   Recall   F-measure
  3-gram     0.051      0.838     0.097
  4-gram     0.072      0.897     0.134
  5-gram     0.061      0.925     0.115
  6-gram     0.054      0.933     0.103

  (Several entries are truncated in the source; the truncated digits shown here were restored from the definitions above, with which the surviving fragments agree.)

3.3  /  3.4

[Japanese text of Sections 3.3 and 3.4 was not recovered from the source. Surviving fragments indicate that Section 3.3 examines the effect of the error ratio r and Section 3.4 the effect of the threshold ratio x, varied between 0.0% and 5.0%.]
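A small worked computation of these definitions follows. The value r = 0.01 matches the 1.0% error ratio of the test data; the values of p1 and p2 are hypothetical stand-ins chosen only to show the arithmetic and why precision stays low when errors are rare.

    def precision_recall_f(r, p1, p2, beta=1.0):
        """Precision, recall and F-measure for sequence-level error detection,
        following P = r*p2 / ((1-r)*(1-p1) + r*p2), R = p2,
        F = (beta**2 + 1) * P * R / (beta**2 * P + R)."""
        precision = r * p2 / ((1 - r) * (1 - p1) + r * p2)
        recall = p2
        f_measure = (beta ** 2 + 1.0) * precision * recall / (beta ** 2 * precision + recall)
        return precision, recall, f_measure

    # Hypothetical example: 1.0% error ratio, 88% of correct sequences pass
    # unflagged (p1), 90% of erroneous sequences are flagged (p2).
    # Precision comes out around 0.07 even with good p1 and p2.
    print(precision_recall_f(r=0.01, p1=0.88, p2=0.90))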

Fig. 1  Relation between error ratio and F-measure
Fig. 2  Relation between threshold ratio and precision
Fig. 3  Relation between threshold ratio and recall
Fig. 4  Relation between threshold ratio and F-measure

Table 7  F-measure corresponding to minimum threshold ratio

           Threshold ratio
  3-gram       0.01%
  4-gram          7%   (leading digits truncated in the source)
  5-gram       0.72%
  6-gram       2.08%

  (The F-measure values of this table are truncated in the source and are not reproduced here.)

3.5

[Japanese text of Section 3.5, which examines the F-measure when β is set to 2.0 and 3.0, was not recovered from the source.]

Fig. 5  Relation between threshold ratio and precision in β = 2
Fig. 6  Relation between threshold ratio and precision in β = 3

4.  /  4.1

[Japanese text of Section 4.1, which discusses the proper value of N, was not recovered from the source. Surviving fragments cite the 4-gram figures at the 1% threshold ratio (precision 0.072, recall 0.897), a recall level of about 0.9, and comparisons among the 3-gram, 4-gram and 5-gram tables.]

Table 8  F-measure corresponding to a threshold of 1

           Threshold ratio
  3-gram       0.02%
  4-gram          3%   (leading digits truncated in the source)
  5-gram       1.44%
  6-gram       4.17%

  (The F-measure values of this table are truncated in the source, except 0.083 for the 6-gram.)

4.2

[Japanese text of Section 4.2 was not recovered from the source.]

4.3  /  4.4

[Japanese text of Sections 4.3 and 4.4 was not recovered from the source. Surviving fragments compare the 5-gram and 4-gram tables (precision 0.081 vs. 0.072 at the 1.0% threshold ratio) and cite references 5), 9) and 10).]

5.

[Japanese text of the concluding section was not recovered from the source; fragments mention the tables for N = 3, 4, 5, 6 and the choice of N = 4.]

References

1) [Japanese reference; authors and title not recovered], Vol. 36, No. 1, pp. 32-40 (1995).
2) Mays, E., Damerau, F. and Mercer, R.: Context Based Spelling Correction, Information Processing and Management, Vol. 27, No. 5, pp. 517-522 (1991).
3) [Japanese reference; authors and title not recovered], 49, pp. 181-182 (1994).
4) [Japanese reference; authors and title not recovered], SLP-19-15 (1997).
5) [Japanese reference; authors and title not recovered], pp. 445-448 (1997).
6) Kernighan, M., Church, K. and Gale, W.: A Spelling Correction Program Based on a Noisy Channel Model, COLING-90, Vol. 2, pp. 205-210 (1990).
7) [Japanese reference; authors and title not recovered] (1996).
8) [Japanese reference; authors and title not recovered], 97--2 (1997).
9) Golding, A. and Schabes, Y.: Combining Trigram-based and Feature-based Methods for Context-Sensitive Spelling Correction, 34th Annual Meeting of the Association for Computational Linguistics, pp. 71-78 (1996).
10) [Japanese reference; authors and title not recovered], bit, Vol. 30, No. 10, pp. 19-22 (1998).

(Received April 3, 1998; accepted March 5, 1999)

[The end of Section 5, a note on the 1990-1994 CD-ROM newspaper data, and the author biography, all written in Japanese, were not recovered from the source.]