Detection and Correction for Errors in Hiragana Sequences by a Hiragana Character N-gram


Vol. 40 No. 6, June 1999

Detection and Correction for Errors in Hiragana Sequences by a Hiragana Character N-gram

Hiroyuki Shinnou (Dept. of Systems Engineering, Faculty of Engineering, Ibaraki University)

In this paper, we propose the hiragana character N-gram method for detecting and correcting errors in Japanese hiragana sequences, and we investigate the proper choice of N. The word N-gram method is known to be effective for detecting and correcting errors in text. However, a word N-gram table is difficult to construct, even for N = 3. Moreover, in Japanese the word method requires morphological analysis and incurs a high cost for looking up an N-word sequence in the word N-gram table, so it is not currently a realistic approach to text revision. If the target of revision is limited to simple errors in Japanese hiragana sequences, however, the hiragana character N-gram lets us detect and correct such errors without these problems. In this method a higher N yields higher recall but lower precision because of data sparseness, so the proper N must be chosen by weighing the corpus size against the importance of recall. In our experiments we constructed 3-gram, 4-gram, 5-gram and 6-gram tables from five years of newspaper articles. Using each table, we examined how effectively the method revises simple errors in hiragana sequences caused by a single hiragana character insertion, deletion, substitution or reversal. We conclude that the hiragana character N-gram is effective for detecting and correcting errors in hiragana sequences, and that N = 4 is the realistic choice.

1. Introduction — [Japanese text of the introduction was lost in extraction; the surviving fragments cite ref. 2).]
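As a concrete illustration of the table-construction step described in the abstract, the following Python sketch (not the paper's implementation; the hiragana regular expression and the function name are assumptions) counts hiragana character N-grams over the hiragana runs of a corpus:

from collections import Counter
import re

# Hiragana block (U+3041..U+3096) plus the prolonged-sound mark; adjust as needed.
HIRAGANA_RUN = re.compile(r"[\u3041-\u3096\u30fc]+")

def build_ngram_table(corpus_lines, n=4):
    """Count hiragana character n-grams over maximal hiragana runs."""
    table = Counter()
    for line in corpus_lines:
        for run in HIRAGANA_RUN.findall(line):
            for i in range(len(run) - n + 1):
                table[run[i:i + n]] += 1
    return table

# Usage: table_4 = build_ngram_table(open("newspaper.txt", encoding="utf-8"), n=4)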

[Japanese body text of this page was lost in extraction.] The surviving fragments show that Section 2 (including 2.2) describes the method, citing refs. 3), 4), 5) and 6): each N-character window of a hiragana sequence, covering positions i through i + N − 1, is looked up in the hiragana character N-gram table, with errors such as OCR misrecognitions assumed to occur at a rate of x%.
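Since the Japanese description of the procedure did not survive, the following Python sketch is a hedged reconstruction of the general idea rather than the paper's algorithm: detection flags N-character windows whose table count falls below a threshold; correction generates every string reachable by one hiragana insertion, deletion, substitution or reversal and keeps the candidate whose N-grams are best attested (the scoring function used here is an assumption).

HIRAGANA = [chr(c) for c in range(0x3042, 0x3094)]  # あ .. ん (rough character set)

def detect_errors(seq, table, n=4, threshold=1):
    """Start positions of n-grams whose corpus count is below the threshold."""
    return [i for i in range(len(seq) - n + 1)
            if table.get(seq[i:i + n], 0) < threshold]

def candidates(seq):
    """All strings reachable by one insertion, deletion, substitution or reversal."""
    cands = set()
    for i in range(len(seq) + 1):
        for c in HIRAGANA:
            cands.add(seq[:i] + c + seq[i:])                    # insertion
    for i in range(len(seq)):
        cands.add(seq[:i] + seq[i + 1:])                        # deletion
        for c in HIRAGANA:
            cands.add(seq[:i] + c + seq[i + 1:])                # substitution
    for i in range(len(seq) - 1):
        cands.add(seq[:i] + seq[i + 1] + seq[i] + seq[i + 2:])  # reversal
    cands.discard(seq)
    return cands

def score(seq, table, n=4):
    """Simple adequacy score: number of attested n-grams in the sequence."""
    return sum(1 for i in range(len(seq) - n + 1) if table.get(seq[i:i + n], 0) > 0)

def correct(seq, table, n=4, threshold=1):
    """Return the best-scoring single-edit candidate if the input looks erroneous."""
    if not detect_errors(seq, table, n, threshold):
        return seq
    return max(candidates(seq), key=lambda c: score(c, table, n))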

p. 2692 — [Japanese body text lost in extraction.] Table 1: Length of sequences in test data. The surviving fragments indicate that the 3-gram through 6-gram tables were built from newspaper CD-ROM articles and that a threshold ratio of 1.0% is used in the first experiment.
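The test data contain the four simple error types named in the abstract. A minimal Python sketch (the sampling scheme below is an assumption, not the paper's procedure) for introducing one such error into a correct hiragana sequence:

import random

HIRAGANA = [chr(c) for c in range(0x3042, 0x3094)]  # あ .. ん (rough character set)

def corrupt(seq):
    """Introduce one random error: insertion, deletion, substitution or reversal.

    Assumes len(seq) >= 2 so that deletion and reversal are always possible;
    substitution may occasionally pick the original character again.
    """
    kind = random.choice(["insert", "delete", "substitute", "reverse"])
    if kind == "insert":
        i = random.randrange(len(seq) + 1)
        return seq[:i] + random.choice(HIRAGANA) + seq[i:]
    if kind == "delete":
        i = random.randrange(len(seq))
        return seq[:i] + seq[i + 1:]
    if kind == "substitute":
        i = random.randrange(len(seq))
        return seq[:i] + random.choice(HIRAGANA) + seq[i + 1:]
    i = random.randrange(len(seq) - 1)          # reversal of adjacent characters
    return seq[:i] + seq[i + 1] + seq[i] + seq[i + 2:]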

p. 2693 — [Japanese body text lost in extraction; the recoverable tables and formulas are reconstructed below.]

Table 2: Threshold corresponding to threshold ratio. At a threshold ratio of 1%, the thresholds are 75 (3-gram), 5 (4-gram) and 1 (5-gram); the 6-gram value was not recovered.

Table 3: Result of experiment 1 — 3-gram 313 (2000), 4-gram 232 (2000), 5-gram 286 (2000), 6-gram 328 (2000).

Table 4: Detection results of the experiment (one row per N-gram table; the column headings and values were not recovered).

Table 5: Correction results of the experiment (one row per N-gram table; the Japanese column headings were not recovered):
3-gram: (775) (1841) (1796) (1852)
4-gram: (921) (1910) (1879) (1939)
5-gram: (1007) (1947) (1909) (1964)
6-gram: (1035) (1954) (1923) (1963)

Table 6: Evaluation of error detection.

Evaluation measures. Let T be the number of test sequences, r the error ratio of the test data (r = 0.01 in the experiments), p1 the probability that a correct sequence is judged correct, and p2 the probability that an erroneous sequence is detected. Then Tr sequences contain an error and T(1 − r) do not; T(1 − r)(1 − p1) correct sequences are falsely flagged and T·r·p2 erroneous sequences are correctly flagged, for T((1 − r)(1 − p1) + r·p2) flagged sequences in total. Precision and recall are therefore

    P = r·p2 / ((1 − r)(1 − p1) + r·p2),    R = p2,

and the F-measure with weight β is

    F = (1 + β²)·P·R / (β²·P + R),

with β = 1.0 unless stated otherwise. Section 3.3 examines how the measures vary with the error ratio r (varied from 0.0% to 5.0%).
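A small Python sketch of these evaluation measures (the p1 and p2 values in the example are hypothetical; only r = 0.01 comes from the paper):

def evaluate(r, p1, p2, beta=1.0):
    """Precision, recall and F-measure of error detection.

    r  : error ratio of the test data
    p1 : probability that a correct sequence is judged correct
    p2 : probability that an erroneous sequence is detected (= recall)
    """
    precision = r * p2 / ((1 - r) * (1 - p1) + r * p2)
    recall = p2
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# Example with the error ratio used in the experiments (r = 0.01): a detector
# that passes 99.5% of correct sequences (p1 = 0.995) and catches 80% of
# erroneous ones (p2 = 0.8) gives P ≈ 0.62, R = 0.8.
print(evaluate(0.01, 0.995, 0.8))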

p. 2694 — [Figures; only captions and scattered table values were recovered.]
Fig. 1: Relation between error ratio and F-measure.
Fig. 2: Relation between threshold ratio and precision.
Fig. 3: Relation between threshold ratio and recall.
Fig. 4: Relation between threshold ratio and F-measure.
Table 7: F-measure corresponding to the minimum threshold ratio (recovered fragments: 3-gram 0.01% / 49, 4-gram 7% / 67, 5-gram 0.72% / 44, 6-gram 2.08% / value lost).
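Figures 2–4 plot the detection measures against the threshold ratio, and the operating point is chosen where the F-measure peaks. A hedged sketch of that selection step on labelled development data (a reconstruction, not the paper's procedure; since the paper's exact definition of the threshold ratio is not recoverable here, this version sweeps absolute count thresholds):

def best_threshold(dev_pairs, table, n=4, thresholds=(1, 2, 5, 10, 25, 75), beta=1.0):
    """Pick the count threshold maximizing the F-measure on development data.

    dev_pairs: list of (hiragana_sequence, has_error) pairs.
    table:     n-gram counts, e.g. from build_ngram_table().
    """
    def flagged(seq, t):
        return any(table.get(seq[i:i + n], 0) < t
                   for i in range(len(seq) - n + 1))

    best = (0.0, None)
    for t in thresholds:
        tp = sum(1 for s, e in dev_pairs if e and flagged(s, t))
        fp = sum(1 for s, e in dev_pairs if not e and flagged(s, t))
        fn = sum(1 for s, e in dev_pairs if e and not flagged(s, t))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
        if f > best[0]:
            best = (f, t)
    return best  # (best F-measure, best threshold)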

p. 2695 — [Japanese body text lost in extraction; only figure captions and scattered values were recovered.]
Fig. 5: Relation between threshold ratio and precision for β = 2.
Fig. 6: Relation between threshold ratio and precision for β = 3.
The surviving fragments compare the 3-gram through 6-gram tables (threshold ratios such as 0.02% for the 3-gram, 3% for the 4-gram and 1.44% for the 5-gram appear) and discuss the choice of N, with the 4-gram table favoured as the practical choice.

p. 2696 — [Japanese body text lost in extraction.]
Table 8: F-measure corresponding to a threshold of 1 (recovered fragments: 3-gram 0.02% / 60, 4-gram 3% / 92, 5-gram 1.44% / 15, 6-gram 4.17%).
The rest of the page (Section 4, including 4.2) analyses the detection and correction errors of the 4-gram table; only scattered counts and a citation of ref. 8) survive.

p. 2697 — [Japanese body text lost in extraction.] The surviving fragments continue the error analysis (Section 4.4, which mentions the 4-gram and 5-gram tables and cites refs. 9) and 10)) and state the conclusion: hiragana character N-gram tables with N = 3, 4, 5 and 6 were built and compared for detecting and correcting simple errors in hiragana sequences.

p. 2698 — References (the Japanese-language entries lost most of their bibliographic details in extraction; page numbers were lost throughout):
1) (in Japanese; details not recovered).
2) Mays, E., Damerau, F. and Mercer, R.: Context Based Spelling Correction, Information Processing and Management, Vol. 27, No. 5 (1991).
3) (in Japanese, 1994).
4) (in Japanese), SLP-19-15 (1997).
5) (in Japanese, 1997).
6) Kernighan, M., Church, K. and Gale, W.: A Spelling Correction Program Based on a Noisy Channel Model, COLING-90, Vol. 2 (1990).
7) (in Japanese, 1996).
8) (in Japanese, 1997).
9) Golding, A. and Schabes, Y.: Combining Trigram-based and Feature-based Methods for Context-Sensitive Spelling Correction, 34th Annual Meeting of the Association for Computational Linguistics (1996).
10) (in Japanese), bit, Vol. 30, No. 10 (1998).
