Detection and Correction for Errors in Hiragana Sequences by a Hiragana Character N-gram


Vol. 40 No. 6, June 1999

Detection and Correction for Errors in Hiragana Sequences by a Hiragana Character N-gram

Hiroyuki Shinnou (Dept. of Systems Engineering, Faculty of Engineering, Ibaraki University)

In this paper, we propose the hiragana character N-gram method for detecting and correcting errors in Japanese hiragana sequences, and we investigate the proper choice of N. The word N-gram method is known to be effective for detecting and correcting errors in text. However, a word N-gram table is difficult to construct, even for N = 3. Moreover, in Japanese the word method requires morphological analysis and incurs a high cost for looking up an N-word sequence in the word N-gram table, so it is not currently a realistic approach to text revision. If the target of revision is limited to simple errors in Japanese hiragana sequences, however, the hiragana character N-gram lets us detect and correct such errors without these problems. In this method a higher N yields higher recall but lower precision because of data sparseness, so the proper N must be chosen by weighing the corpus size against the importance of recall. In our experiments we constructed 3-gram, 4-gram, 5-gram and 6-gram tables from five years of newspaper articles. Using each table, we examined how effectively the method revises simple errors in hiragana sequences caused by a single hiragana character insertion, deletion, substitution or reversal. We conclude that the hiragana character N-gram is effective for detecting and correcting errors in hiragana sequences, and that N = 4 is the realistic choice.

1. Introduction — [Japanese text of the introduction was lost in extraction; the surviving fragments cite ref. 2).]
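As a concrete illustration of the table-construction step described in the abstract, the following Python sketch (not the paper's implementation; the hiragana regular expression and the function name are assumptions) counts hiragana character N-grams over the hiragana runs of a corpus:

from collections import Counter
import re

# Hiragana block (U+3041..U+3096) plus the prolonged-sound mark; adjust as needed.
HIRAGANA_RUN = re.compile(r"[\u3041-\u3096\u30fc]+")

def build_ngram_table(corpus_lines, n=4):
    """Count hiragana character n-grams over maximal hiragana runs."""
    table = Counter()
    for line in corpus_lines:
        for run in HIRAGANA_RUN.findall(line):
            for i in range(len(run) - n + 1):
                table[run[i:i + n]] += 1
    return table

# Usage: table_4 = build_ngram_table(open("newspaper.txt", encoding="utf-8"), n=4)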

[Japanese body text of this page was lost in extraction.] The surviving fragments show that Section 2 (including 2.2) describes the method, citing refs. 3), 4), 5) and 6): each N-character window of a hiragana sequence, covering positions i through i + N − 1, is looked up in the hiragana character N-gram table, with errors such as OCR misrecognitions assumed to occur at a rate of x%.
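Since the Japanese description of the procedure did not survive, the following Python sketch is a hedged reconstruction of the general idea rather than the paper's algorithm: detection flags N-character windows whose table count falls below a threshold; correction generates every string reachable by one hiragana insertion, deletion, substitution or reversal and keeps the candidate whose N-grams are best attested (the scoring function used here is an assumption).

HIRAGANA = [chr(c) for c in range(0x3042, 0x3094)]  # あ .. ん (rough character set)

def detect_errors(seq, table, n=4, threshold=1):
    """Start positions of n-grams whose corpus count is below the threshold."""
    return [i for i in range(len(seq) - n + 1)
            if table.get(seq[i:i + n], 0) < threshold]

def candidates(seq):
    """All strings reachable by one insertion, deletion, substitution or reversal."""
    cands = set()
    for i in range(len(seq) + 1):
        for c in HIRAGANA:
            cands.add(seq[:i] + c + seq[i:])                    # insertion
    for i in range(len(seq)):
        cands.add(seq[:i] + seq[i + 1:])                        # deletion
        for c in HIRAGANA:
            cands.add(seq[:i] + c + seq[i + 1:])                # substitution
    for i in range(len(seq) - 1):
        cands.add(seq[:i] + seq[i + 1] + seq[i] + seq[i + 2:])  # reversal
    cands.discard(seq)
    return cands

def score(seq, table, n=4):
    """Simple adequacy score: number of attested n-grams in the sequence."""
    return sum(1 for i in range(len(seq) - n + 1) if table.get(seq[i:i + n], 0) > 0)

def correct(seq, table, n=4, threshold=1):
    """Return the best-scoring single-edit candidate if the input looks erroneous."""
    if not detect_errors(seq, table, n, threshold):
        return seq
    return max(candidates(seq), key=lambda c: score(c, table, n))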

p. 2692 — [Japanese body text lost in extraction.] Table 1: Length of sequences in test data. The surviving fragments indicate that the 3-gram through 6-gram tables were built from newspaper CD-ROM articles and that a threshold ratio of 1.0% is used in the first experiment.
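The test data contain the four simple error types named in the abstract. A minimal Python sketch (the sampling scheme below is an assumption, not the paper's procedure) for introducing one such error into a correct hiragana sequence:

import random

HIRAGANA = [chr(c) for c in range(0x3042, 0x3094)]  # あ .. ん (rough character set)

def corrupt(seq):
    """Introduce one random error: insertion, deletion, substitution or reversal.

    Assumes len(seq) >= 2 so that deletion and reversal are always possible;
    substitution may occasionally pick the original character again.
    """
    kind = random.choice(["insert", "delete", "substitute", "reverse"])
    if kind == "insert":
        i = random.randrange(len(seq) + 1)
        return seq[:i] + random.choice(HIRAGANA) + seq[i:]
    if kind == "delete":
        i = random.randrange(len(seq))
        return seq[:i] + seq[i + 1:]
    if kind == "substitute":
        i = random.randrange(len(seq))
        return seq[:i] + random.choice(HIRAGANA) + seq[i + 1:]
    i = random.randrange(len(seq) - 1)          # reversal of adjacent characters
    return seq[:i] + seq[i + 1] + seq[i] + seq[i + 2:]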

p. 2693 — [Japanese body text lost in extraction; the recoverable tables and formulas are reconstructed below.]

Table 2: Threshold corresponding to threshold ratio. At a threshold ratio of 1%, the thresholds are 75 (3-gram), 5 (4-gram) and 1 (5-gram); the 6-gram value was not recovered.

Table 3: Result of experiment 1 — 3-gram 313 (2000), 4-gram 232 (2000), 5-gram 286 (2000), 6-gram 328 (2000).

Table 4: Detection results of the experiment (one row per N-gram table; the column headings and values were not recovered).

Table 5: Correction results of the experiment (one row per N-gram table; the Japanese column headings were not recovered):
3-gram: (775) (1841) (1796) (1852)
4-gram: (921) (1910) (1879) (1939)
5-gram: (1007) (1947) (1909) (1964)
6-gram: (1035) (1954) (1923) (1963)

Table 6: Evaluation of error detection.

Evaluation measures. Let T be the number of test sequences, r the error ratio of the test data (r = 0.01 in the experiments), p1 the probability that a correct sequence is judged correct, and p2 the probability that an erroneous sequence is detected. Then Tr sequences contain an error and T(1 − r) do not; T(1 − r)(1 − p1) correct sequences are falsely flagged and T·r·p2 erroneous sequences are correctly flagged, for T((1 − r)(1 − p1) + r·p2) flagged sequences in total. Precision and recall are therefore

    P = r·p2 / ((1 − r)(1 − p1) + r·p2),    R = p2,

and the F-measure with weight β is

    F = (1 + β²)·P·R / (β²·P + R),

with β = 1.0 unless stated otherwise. Section 3.3 examines how the measures vary with the error ratio r (varied from 0.0% to 5.0%).
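A small Python sketch of these evaluation measures (the p1 and p2 values in the example are hypothetical; only r = 0.01 comes from the paper):

def evaluate(r, p1, p2, beta=1.0):
    """Precision, recall and F-measure of error detection.

    r  : error ratio of the test data
    p1 : probability that a correct sequence is judged correct
    p2 : probability that an erroneous sequence is detected (= recall)
    """
    precision = r * p2 / ((1 - r) * (1 - p1) + r * p2)
    recall = p2
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# Example with the error ratio used in the experiments (r = 0.01): a detector
# that passes 99.5% of correct sequences (p1 = 0.995) and catches 80% of
# erroneous ones (p2 = 0.8) gives P ≈ 0.62, R = 0.8.
print(evaluate(0.01, 0.995, 0.8))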

p. 2694 — [Figures; only captions and scattered table values were recovered.]
Fig. 1: Relation between error ratio and F-measure.
Fig. 2: Relation between threshold ratio and precision.
Fig. 3: Relation between threshold ratio and recall.
Fig. 4: Relation between threshold ratio and F-measure.
Table 7: F-measure corresponding to the minimum threshold ratio (recovered fragments: 3-gram 0.01% / 49, 4-gram 7% / 67, 5-gram 0.72% / 44, 6-gram 2.08% / value lost).
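Figures 2–4 plot the detection measures against the threshold ratio, and the operating point is chosen where the F-measure peaks. A hedged sketch of that selection step on labelled development data (a reconstruction, not the paper's procedure; since the paper's exact definition of the threshold ratio is not recoverable here, this version sweeps absolute count thresholds):

def best_threshold(dev_pairs, table, n=4, thresholds=(1, 2, 5, 10, 25, 75), beta=1.0):
    """Pick the count threshold maximizing the F-measure on development data.

    dev_pairs: list of (hiragana_sequence, has_error) pairs.
    table:     n-gram counts, e.g. from build_ngram_table().
    """
    def flagged(seq, t):
        return any(table.get(seq[i:i + n], 0) < t
                   for i in range(len(seq) - n + 1))

    best = (0.0, None)
    for t in thresholds:
        tp = sum(1 for s, e in dev_pairs if e and flagged(s, t))
        fp = sum(1 for s, e in dev_pairs if not e and flagged(s, t))
        fn = sum(1 for s, e in dev_pairs if e and not flagged(s, t))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
        if f > best[0]:
            best = (f, t)
    return best  # (best F-measure, best threshold)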

p. 2695 — [Japanese body text lost in extraction; only figure captions and scattered values were recovered.]
Fig. 5: Relation between threshold ratio and precision for β = 2.
Fig. 6: Relation between threshold ratio and precision for β = 3.
The surviving fragments compare the 3-gram through 6-gram tables (threshold ratios such as 0.02% for the 3-gram, 3% for the 4-gram and 1.44% for the 5-gram appear) and discuss the choice of N, with the 4-gram table favoured as the practical choice.

p. 2696 — [Japanese body text lost in extraction.]
Table 8: F-measure corresponding to a threshold of 1 (recovered fragments: 3-gram 0.02% / 60, 4-gram 3% / 92, 5-gram 1.44% / 15, 6-gram 4.17%).
The rest of the page (Section 4, including 4.2) analyses the detection and correction errors of the 4-gram table; only scattered counts and a citation of ref. 8) survive.

p. 2697 — [Japanese body text lost in extraction.] The surviving fragments continue the error analysis (Section 4.4, which mentions the 4-gram and 5-gram tables and cites refs. 9) and 10)) and state the conclusion: hiragana character N-gram tables with N = 3, 4, 5 and 6 were built and compared for detecting and correcting simple errors in hiragana sequences.

p. 2698 — References (the Japanese-language entries lost most of their bibliographic details in extraction; page numbers were lost throughout):
1) (in Japanese; details not recovered).
2) Mays, E., Damerau, F. and Mercer, R.: Context Based Spelling Correction, Information Processing and Management, Vol. 27, No. 5 (1991).
3) (in Japanese, 1994).
4) (in Japanese), SLP-19-15 (1997).
5) (in Japanese, 1997).
6) Kernighan, M., Church, K. and Gale, W.: A Spelling Correction Program Based on a Noisy Channel Model, COLING-90, Vol. 2 (1990).
7) (in Japanese, 1996).
8) (in Japanese, 1997).
9) Golding, A. and Schabes, Y.: Combining Trigram-based and Feature-based Methods for Context-Sensitive Spelling Correction, 34th Annual Meeting of the Association for Computational Linguistics (1996).
10) (in Japanese), bit, Vol. 30, No. 10 (1998).
