Detection and Correction for Errors in Hiragana Sequences by a Hiragana Character N-gram
Transcription
[Page 1] Vol. 40, No. 6, June 1999

Detection and Correction for Errors in Hiragana Sequences by a Hiragana Character N-gram

Hiroyuki Shinnou (Dept. of Systems Engineering, Faculty of Engineering, Ibaraki University)

In this paper, we propose the hiragana character N-gram method to detect and correct errors in Japanese hiragana sequences, and we investigate the proper value of N. The word N-gram method is known to be effective for detecting and correcting errors in texts. However, it is difficult to construct a word N-gram table, even for N = 3. Moreover, in Japanese, the word method requires morphological analysis and incurs a high cost for looking up an N-word sequence in the word table. Thus, at present the word N-gram method is not practical for text revision. However, if the target of revision is limited to simple errors in Japanese hiragana sequences, the hiragana character N-gram lets us detect and correct those errors without the above problems. In this method, a higher N yields higher recall but lower precision, because of data sparseness. The proper N must therefore be set by weighing the corpus size against the weight given to recall. In our experiments, we constructed 3-, 4-, 5- and 6-gram tables from five years of newspaper articles. Using each table in turn, we examined how effectively the method revises simple errors in hiragana sequences caused by a single hiragana character insertion, deletion, substitution or reversal. We conclude that the hiragana character N-gram is effective for detecting and correcting errors in hiragana sequences, and that N = 4 is the realistic choice.

1. [Japanese section heading and body not recoverable from this transcription.]
[Page 2] Japanese body text not recoverable from this transcription. The surviving fragments concern the detection procedure (Section 2.2): errors such as OCR misrecognitions are assumed, and for each position i in a hiragana sequence, the character N-gram covering positions i through i+N−1 is looked up in the N-gram table; a window whose frequency falls below a threshold ratio of x% is flagged as an error.
[Page 3] Japanese body text not recoverable. The fragments discuss how many N-gram windows a single error corrupts (m + 2 windows, for 0 < m < N − 1), and indicate that the N-gram tables were built from five years of CD-ROM newspaper articles for N = 3 through 6, with a threshold ratio of 1.0%; the 5-gram and 6-gram tables suffer most from sparseness.
Table 1: Length of sequences in the test data (caption only; cell values not recoverable).
[Page 4] Japanese body text not recoverable. Recoverable captions and formulas:
Table 2: Threshold corresponding to each threshold ratio. At a ratio of 1%: 3-gram 75, 4-gram 5, 5-gram 1, 6-gram (value garbled).
Table 3: Results of experiment 1: 313, 232, 286 and 328 detections out of 2,000, apparently for the 3-, 4-, 5- and 6-gram tables respectively.
Table 4: Detection results of the experiment (caption only; cell values garbled).
Table 5: Correction results of the experiment (caption only; parenthesized counts such as (775) and (1841) survive but cannot be assigned to cells).
Table 6: Evaluation of error detection (caption only).
The fragments are consistent with the following evaluation model. Let T be the number of checked positions, r the error ratio (r = 0.01 in the experiments), p1 the probability that a correct position passes the check, and p2 the probability that an erroneous position is flagged. Then T(1 − r)(1 − p1) + T·r·p2 positions are flagged in total, of which T·r·p2 are true errors, giving

  P = r·p2 / ((1 − r)(1 − p1) + r·p2)   (precision)
  R = p2                                 (recall)
  F = (β² + 1)·P·R / (β²·P + R)          (F-measure; β = 1.0 unless stated otherwise)
[Page 5] Japanese body text not recoverable. Recoverable captions:
Fig. 1: Relation between error ratio and F-measure.
Fig. 2: Relation between threshold ratio and precision.
Fig. 3: Relation between threshold ratio and recall.
Fig. 4: Relation between threshold ratio and F-measure.
Table 7: F-measure corresponding to the minimum threshold ratio: 3-gram at 0.01%, 5-gram at 0.72%, 6-gram at 2.08% (the 4-gram ratio and the F values are garbled).
[Page 6] Japanese body text not recoverable. Recoverable captions:
Fig. 5: Relation between threshold ratio and precision with β = 2.
Fig. 6: Relation between threshold ratio and precision with β = 3.
The fragments compare the 3- through 6-gram tables as β and the threshold ratio vary, and examine the case N = 5 separately.
[Page 7] Japanese body text not recoverable. Recoverable caption:
Table 8: F-measure corresponding to a threshold of 1: 3-gram at 0.02%, 5-gram at 1.44%, 6-gram at 4.17% (the 4-gram ratio and the F values are garbled).
The surrounding fragments belong to Section 4 (discussion), but the text cannot be reconstructed.
[Page 8] Japanese body text not recoverable. The fragments cover Sections 4.4 and 5 (conclusion): comparisons of the 4-gram and 5-gram tables at threshold ratios around 1.0%, citations of related work on context-sensitive spelling correction 9),10), and a restatement that hiragana character N-gram tables for N = 3, 4, 5, 6 were evaluated.
[Page 9] References (reconstructed from the interleaved two-column transcription; Japanese-language entries are fragmentary):
1) (Japanese-language entry; authors and title not recoverable), Vol. 36, No. 1 (1995).
2) Mays, E., Damerau, F. and Mercer, R.: Context Based Spelling Correction, Information Processing and Management, Vol. 27, No. 5 (1991).
3) (Japanese-language entry; title not recoverable) (1994).
4) (Japanese-language entry), SLP-19-15 (1997).
5) (Japanese-language entry) (1997).
6) Kernighan, M., Church, K. and Gale, W.: A Spelling Correction Program Based on a Noisy Channel Model, COLING-90, Vol. 2 (1990).
7) (Japanese-language entry) (1996).
8) (Japanese-language entry), 97--2 (1997).
9) Golding, A. and Schabes, Y.: Combining Trigram-based and Feature-based Methods for Context-Sensitive Spelling Correction, 34th Annual Meeting of the Association for Computational Linguistics (1996).
10) (Japanese-language entry), bit, Vol. 30, No. 10 (1998).