Detection and Correction for Errors in Hiragana Sequences by a Hiragana Character N-gram

Similar documents
Natural Language Processing. Statistical Inference: n-grams

TnT Part of Speech Tagger

Speech Recognition Lecture 5: N-gram Language Models. Eugene Weinstein Google, NYU Courant Institute Slide Credit: Mehryar Mohri

The Noisy Channel Model and Markov Models

Exploring Asymmetric Clustering for Statistical Language Modeling

CMPT-825 Natural Language Processing

Information Retrieval

Machine Learning for natural language processing

Natural Language Processing (CSE 490U): Language Models

The distribution of characters, bi- and trigrams in the Uppsala 70 million words Swedish newspaper corpus

A Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister

A Simple Introduction to Information, Channel Capacity and Entropy

ERROR DETECTION AND CORRECTION IN TOPONYM RECOGNITION IN CARTOGRAPHIC MAPS *

On a New Model for Automatic Text Categorization Based on Vector Space Model

Statistical Methods for NLP

Language Processing with Perl and Prolog

Probabilistic Spelling Correction CE-324: Modern Information Retrieval Sharif University of Technology

N-grams. Motivation. Simple n-grams. Smoothing. Backoff. N-grams L545. Dept. of Linguistics, Indiana University Spring / 24

Language Technology. Unit 1: Sequence Models. CUNY Graduate Center Spring Lectures 5-6: Language Models and Smoothing. required hard optional

Foundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model

Empirical Methods in Natural Language Processing Lecture 10a More smoothing and the Noisy Channel Model

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs

Graphical models for part of speech tagging

Foundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM

Fun with weighted FSTs

Jurafsky & Martin Ch. 6: to 6.6 incl.

Statistical Methods for NLP

Math 4740: Homework 5 Solutions

Tuning as Linear Regression

1 Evaluation of SMT systems: BLEU

arxiv: v1 [cs.cl] 21 May 2017

Cross-Lingual Language Modeling for Automatic Speech Recognition

SYNTHER A NEW M-GRAM POS TAGGER

Good morning, everyone. On behalf of the Shinsekai Type Study Group, today we would like to talk about the Japanese writing system, specifically about

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.

Dept. of Linguistics, Indiana University Fall 2015

Binary Convolutional Codes

Error Correction through Language Processing

Chapter 3: Basics of Language Modelling

Probabilistic Language Modeling

Optical Character Recognition of Jutakshars within Devanagari Script

Prenominal Modifier Ordering via MSA. Alignment

Discriminative Training

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging

Language Models. Data Science: Jordan Boyd-Graber University of Maryland SLIDES ADAPTED FROM PHILIP KOEHN

Chapter 11 1 LEARNING TO FIND CONTEXT BASED SPELLING ERRORS

ECE 564/645 - Digital Communications, Spring 2018 Homework #2 Due: March 19 (In Lecture)

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance

Naïve Bayes, Maxent and Neural Models

Statistical NLP: Lecture 7. Collocations. (Ch 5) Introduction

Entropy as an Indicator of Context Boundaries An Experiment Using a Web Search Engine

Noisy Subsequence Recognition Using Constrained String Editing Involving Substitutions, Insertions, Deletions and Generalized Transpositions 1

Fast Logistic Regression for Text Categorization with Variable-Length N-grams

Natural Language Processing SoSe Language Modelling. (based on the slides of Dr. Saeedeh Momtazi)

Chapter 3: Basics of Language Modeling

Are you talking Bernoulli to me? Comparing methods of assessing word frequencies

Language Models. Philipp Koehn. 11 September 2018

ACS Introduction to NLP Lecture 3: Language Modelling and Smoothing

An Algorithm for Fast Calculation of Back-off N-gram Probabilities with Unigram Rescaling

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

Lecture 4: Smoothing, Part-of-Speech Tagging. Ivan Titov Institute for Logic, Language and Computation Universiteit van Amsterdam

Comparing Three Plagiarism Tools (Ferret, Sherlock, and Turnitin)

Language Models. CS6200: Information Retrieval. Slides by: Jesse Anderton

Wavelet Transform in Speech Segmentation

An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition

Journal of Memory and Language

Using Conservative Estimation for Conditional Probability instead of Ignoring Infrequent Case

Good-Turing Smoothing Without Tears

Primary Factors Contributing to Japan's Extremely Hot Summer of 2010

Recap: Language models. Foundations of Natural Language Processing Lecture 4 Language Models: Evaluation and Smoothing. Two types of evaluation in NLP

Automatically Evaluating Text Coherence using Anaphora and Coreference Resolution

Learning Features from Co-occurrences: A Theoretical Analysis

EECS730: Introduction to Bioinformatics

CS246 Final Exam. March 16, :30AM - 11:30AM

Statistical Substring Reduction in Linear Time

CS 224N HW: #3.

CRF Word Alignment & Noisy Channel Translation

Unsupervised Vocabulary Induction

A Surface-Similarity Based Two-Step Classifier for RITE-VAL

Formal Modeling in Cognitive Science Lecture 29: Noisy Channel Model and Applications;

Quiz 1, COMS 4705. Name: Good luck! Quiz 1, page 1 of 7

Section Summary. Sequences. Recurrence Relations. Summations. Examples: Geometric Progression, Arithmetic Progression. Example: Fibonacci Sequence

Word-Transliteration Alignment

Regular expressions and automata

NLP: N-Grams. Dan Garrette December 27, Predictive text (text messaging clients, search engines, etc)

Random Processes. By: Nick Kingsbury

Significance tests for the evaluation of ranking methods

Empirical Methods in Natural Language Processing Lecture 5 N-gram Language Models

Statistical Natural Language Processing

From perceptrons to word embeddings. Simon Šuster University of Groningen

Chapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze

Improved Decipherment of Homophonic Ciphers

Stochastic Contextual Edit Distance and Probabilistic FSTs

Featurizing Text. Bob Stine Dept of Statistics, Wharton School University of Pennsylvania

Lecture 12: Algorithms for HMMs

And for polynomials with coefficients in F_2 = Z/2: Euclidean algorithm for gcd's, concept of equality mod M(x), extended Euclid for inverses mod M(x)

CS1800: Mathematical Induction. Professor Kevin Gold

Efficient Cryptanalysis of Homophonic Substitution Ciphers

Unit 8: Part 2: PD, PID, and Feedback Compensation

Transcription:

Vol. 40 No. 6, June 1999

Detection and Correction for Errors in Hiragana Sequences by a Hiragana Character N-gram

Hiroyuki Shinnou (Dept. of Systems Engineering, Faculty of Engineering, Ibaraki University)

In this paper, we propose a hiragana character N-gram method to detect and correct errors in Japanese hiragana sequences, and we investigate the proper value of N. The word N-gram method is known to be effective for detecting and correcting errors in text. However, it is difficult to construct a word N-gram table, even for N = 3. Moreover, in Japanese the word method requires morphological analysis and a high cost for looking up an N-word sequence in the word N-gram table, so at present it is not a practical way to revise text. If the target of revision is limited to simple errors in Japanese hiragana sequences, however, the hiragana character N-gram lets us detect and correct those errors without the problems above. In this method a higher N gives higher recall but lower precision because of data sparseness, so the proper N must be chosen with the corpus size and the weight given to recall in mind. In the experiments we constructed 3-, 4-, 5- and 6-gram tables from five years of newspaper articles. Using these tables, we examined how effectively the method revises simple errors in hiragana sequences caused by a single hiragana character insertion, deletion, substitution or reversal. We conclude that the hiragana character N-gram is effective for detecting and correcting errors in hiragana sequences, and that N = 4 is the realistic choice.

1.

[Japanese text of this section was not recovered from the source.]
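Only the English abstract of the method survives in this copy, so the following is a minimal Python sketch of the idea it describes: count hiragana character N-grams in a corpus, then flag every window of N characters in an input sequence whose count falls at or below a frequency threshold. The function names, the two-line toy corpus and the threshold of 0 are illustrative assumptions, not the paper's implementation or data.

    from collections import Counter

    def build_ngram_table(corpus_lines, n):
        """Count every character n-gram occurring in a corpus of hiragana strings."""
        table = Counter()
        for line in corpus_lines:
            for i in range(len(line) - n + 1):
                table[line[i:i + n]] += 1
        return table

    def detect_errors(sequence, table, n, threshold=0):
        """Return the start positions of n-character windows whose corpus count
        is at or below the threshold, i.e. the suspected error locations."""
        return [i for i in range(len(sequence) - n + 1)
                if table[sequence[i:i + n]] <= threshold]

    # Toy usage (made-up two-line corpus, not the newspaper data used in the paper).
    corpus = ["きょうはいいてんきです", "あしたはあめでしょう"]
    table = build_ngram_table(corpus, 3)
    print(detect_errors("きょうはいいてんきです", table, 3))   # [] - every trigram was seen
    print(detect_errors("きょうはいえてんきです", table, 3))   # positions around the substituted character

With the real tables, the absolute threshold would instead be derived from the threshold ratio discussed in Sections 2 and 3.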

2.

[Japanese text of Section 2 (Subsections 2.1-2.3) was not recovered from the source. Surviving fragments mention the N-character window from position i to i + N − 1, a threshold ratio of x%, and OCR output as an error source.]
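The abstract limits the targeted errors to a single hiragana character insertion, deletion, substitution or reversal. The sketch below generates correction candidates for a flagged sequence under exactly that error model and ranks them against an N-gram table; the ranking by rarest-then-total n-gram count is an assumption made for illustration, not the scoring used in the paper.

    from collections import Counter

    HIRAGANA = [chr(c) for c in range(ord("ぁ"), ord("ん") + 1)]  # basic hiragana block

    def candidates(seq):
        """All strings reachable from seq by one insertion, deletion,
        substitution or reversal (swap of adjacent characters)."""
        cands = set()
        for i in range(len(seq) + 1):
            for ch in HIRAGANA:                          # insertion
                cands.add(seq[:i] + ch + seq[i:])
        for i in range(len(seq)):
            cands.add(seq[:i] + seq[i + 1:])             # deletion
            for ch in HIRAGANA:                          # substitution
                cands.add(seq[:i] + ch + seq[i + 1:])
        for i in range(len(seq) - 1):                    # reversal of adjacent characters
            cands.add(seq[:i] + seq[i + 1] + seq[i] + seq[i + 2:])
        cands.discard(seq)
        return cands

    def score(seq, table, n):
        """Illustrative score: prefer candidates whose rarest n-gram is most
        frequent, breaking ties by the total n-gram count."""
        counts = [table[seq[i:i + n]] for i in range(len(seq) - n + 1)]
        return (min(counts), sum(counts))

    # Toy usage with a tiny made-up bigram table.
    table = Counter()
    for line in ["きょうはいいてんきです", "あしたはいいてんきでしょう"]:
        for i in range(len(line) - 1):
            table[line[i:i + 2]] += 1
    best = max(candidates("きょうはいててんきです"), key=lambda c: score(c, table, 2))
    print(best)   # -> きょうはいいてんきです for this toy corpus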

2.4

[Japanese text of Section 2.4 was not recovered from the source. Surviving fragments refer to a quantity m + 2 with 0 < m < N − 1.]

Table 1  Length of sequences in test data

  Length | Number of sequences
  -------+--------------------
     6   |  723
     7   |  489
     8   |  290
     9   |  202
    10   |  115
    11   |   74
    12   |   41
    13   |   25
    14   |   18
    15   |   23
  Total  | 2000

3.  /  3.1  /  3.2

[Japanese text of Sections 3.1 and 3.2 was not recovered from the source. Surviving fragments and the abstract indicate that Section 3.1 describes the construction of the 3- to 6-gram tables from five years (1990-1994) of CD-ROM newspaper articles, and Section 3.2 the construction of the 2,000-sequence test data with an error ratio of 1.0%.]
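The test data described in the abstract and in Section 3.2 consists of hiragana sequences corrupted by a single-character error. A hedged sketch of such an error injector, useful for reproducing this kind of test set, is given below; the random choices and the function name are illustrative, not the paper's actual procedure.

    import random

    HIRAGANA = [chr(c) for c in range(ord("ぁ"), ord("ん") + 1)]  # basic hiragana block

    def inject_single_error(seq, rng=random):
        """Return a copy of seq containing one randomly placed hiragana error:
        an insertion, deletion, substitution or reversal of adjacent characters."""
        kind = rng.choice(["insert", "delete", "substitute", "reverse"])
        if kind == "insert":
            i = rng.randrange(len(seq) + 1)
            return seq[:i] + rng.choice(HIRAGANA) + seq[i:]
        if kind == "delete":
            i = rng.randrange(len(seq))
            return seq[:i] + seq[i + 1:]
        if kind == "substitute":
            i = rng.randrange(len(seq))
            return seq[:i] + rng.choice(HIRAGANA) + seq[i + 1:]
        i = rng.randrange(len(seq) - 1)                  # reversal
        return seq[:i] + seq[i + 1] + seq[i] + seq[i + 2:]

    # Toy usage: corrupt a correct sequence to build test data.
    random.seed(0)
    print(inject_single_error("きょうはいいてんきです"))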

Table 2  Threshold corresponding to threshold ratio 1%

  3-gram   75
  4-gram    5
  5-gram    1
  6-gram    0

Table 3  Result of experiment 1

           Count (of 2000)   Rate
  3-gram        313          0.844
  4-gram        232          0.884
  5-gram        286          0.857
  6-gram        328          0.836

  (The rate equals 1 − count / 2000 and corresponds to p_1 in the formulas below; the Japanese column headers were not recovered.)

Table 4  Detection results of experiments 2-5

           Exp. 2   Exp. 3   Exp. 4   Exp. 5   Average
  3-gram   0.607    0.921    0.898    0.926    0.838
  4-gram   0.721    0.955    0.940    0.970    0.897
  5-gram   0.789    0.974    0.955    0.982    0.925
  6-gram   0.810    0.977    0.962    0.982    0.933

  (The average corresponds to p_2 = R in the formulas below.)

Table 5  Correction results of experiments 2-5

           Exp. 2         Exp. 3         Exp. 4         Exp. 5         Average
  3-gram   0.636 (775)    0.770 (1841)   0.792 (1796)   0.846 (1852)   0.761
  4-gram   0.782 (921)    0.877 (1910)   0.879 (1879)   0.924 (1939)   0.866
  5-gram   0.731 (1007)   0.854 (1947)   0.845 (1909)   0.882 (1964)   0.828
  6-gram   0.684 (1035)   0.817 (1954)   0.801 (1923)   0.835 (1963)   0.784

  (The counts in parentheses are reproduced from the source; their Japanese column label was not recovered.)

Precision, recall and F-measure are defined as

  P = \frac{r\,p_2}{(1 - r)(1 - p_1) + r\,p_2}, \qquad R = p_2, \qquad
  F = \frac{(\beta^2 + 1.0)\,P\,R}{\beta^2 P + R} \qquad (\beta = 1.0,\ r = 0.01)

where, for T test sequences with error ratio r, T(1 − r) sequences are correct and Tr contain an error; T(1 − r)(1 − p_1) correct sequences are flagged, T r p_2 erroneous sequences are flagged, and T((1 − r)(1 − p_1) + r p_2) sequences are flagged in total, which yields the expression for P.

Table 6  Evaluation of error detection

           Precision   Recall   F-measure
  3-gram     0.051      0.838     0.097
  4-gram     0.072      0.897     0.134
  5-gram     0.061      0.925     0.115
  6-gram     0.054      0.933     0.103

  (Several entries are truncated in the source; the truncated digits shown here were restored from the definitions above, with which the surviving fragments agree.)

3.3  /  3.4

[Japanese text of Sections 3.3 and 3.4 was not recovered from the source. Surviving fragments indicate that Section 3.3 examines the effect of the error ratio r and Section 3.4 the effect of the threshold ratio x, varied between 0.0% and 5.0%.]
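A small worked computation of these definitions follows. The value r = 0.01 matches the 1.0% error ratio of the test data; the values of p1 and p2 are hypothetical stand-ins chosen only to show the arithmetic and why precision stays low when errors are rare.

    def precision_recall_f(r, p1, p2, beta=1.0):
        """Precision, recall and F-measure for sequence-level error detection,
        following P = r*p2 / ((1-r)*(1-p1) + r*p2), R = p2,
        F = (beta**2 + 1) * P * R / (beta**2 * P + R)."""
        precision = r * p2 / ((1 - r) * (1 - p1) + r * p2)
        recall = p2
        f_measure = (beta ** 2 + 1.0) * precision * recall / (beta ** 2 * precision + recall)
        return precision, recall, f_measure

    # Hypothetical example: 1.0% error ratio, 88% of correct sequences pass
    # unflagged (p1), 90% of erroneous sequences are flagged (p2).
    # Precision comes out around 0.07 even with good p1 and p2.
    print(precision_recall_f(r=0.01, p1=0.88, p2=0.90))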

Fig. 1  Relation between error ratio and F-measure
Fig. 2  Relation between threshold ratio and precision
Fig. 3  Relation between threshold ratio and recall
Fig. 4  Relation between threshold ratio and F-measure

Table 7  F-measure corresponding to minimum threshold ratio

           Threshold ratio
  3-gram       0.01%
  4-gram          7%   (leading digits truncated in the source)
  5-gram       0.72%
  6-gram       2.08%

  (The F-measure values of this table are truncated in the source and are not reproduced here.)

3.5

[Japanese text of Section 3.5, which examines the F-measure when β is set to 2.0 and 3.0, was not recovered from the source.]

Fig. 5  Relation between threshold ratio and precision in β = 2
Fig. 6  Relation between threshold ratio and precision in β = 3

4.  /  4.1

[Japanese text of Section 4.1, which discusses the proper value of N, was not recovered from the source. Surviving fragments cite the 4-gram figures at the 1% threshold ratio (precision 0.072, recall 0.897), a recall level of about 0.9, and comparisons among the 3-gram, 4-gram and 5-gram tables.]

Table 8  F-measure corresponding to a threshold of 1

           Threshold ratio
  3-gram       0.02%
  4-gram          3%   (leading digits truncated in the source)
  5-gram       1.44%
  6-gram       4.17%

  (The F-measure values of this table are truncated in the source, except 0.083 for the 6-gram.)

4.2

[Japanese text of Section 4.2 was not recovered from the source.]

4.3  /  4.4

[Japanese text of Sections 4.3 and 4.4 was not recovered from the source. Surviving fragments compare the 5-gram and 4-gram tables (precision 0.081 vs. 0.072 at the 1.0% threshold ratio) and cite references 5), 9) and 10).]

5.

[Japanese text of the concluding section was not recovered from the source; fragments mention the tables for N = 3, 4, 5, 6 and the choice of N = 4.]

References

1) [Japanese reference; authors and title not recovered], Vol. 36, No. 1, pp. 32-40 (1995).
2) Mays, E., Damerau, F. and Mercer, R.: Context Based Spelling Correction, Information Processing and Management, Vol. 27, No. 5, pp. 517-522 (1991).
3) [Japanese reference; authors and title not recovered], 49, pp. 181-182 (1994).
4) [Japanese reference; authors and title not recovered], SLP-19-15 (1997).
5) [Japanese reference; authors and title not recovered], pp. 445-448 (1997).
6) Kernighan, M., Church, K. and Gale, W.: A Spelling Correction Program Based on a Noisy Channel Model, COLING-90, Vol. 2, pp. 205-210 (1990).
7) [Japanese reference; authors and title not recovered] (1996).
8) [Japanese reference; authors and title not recovered], 97--2 (1997).
9) Golding, A. and Schabes, Y.: Combining Trigram-based and Feature-based Methods for Context-Sensitive Spelling Correction, 34th Annual Meeting of the Association for Computational Linguistics, pp. 71-78 (1996).
10) [Japanese reference; authors and title not recovered], bit, Vol. 30, No. 10, pp. 19-22 (1998).

(Received April 3, 1998; accepted March 5, 1999)

[The end of Section 5, a note on the 1990-1994 CD-ROM newspaper data, and the author biography, all written in Japanese, were not recovered from the source.]