Kneser-Ney smoothing explained


18 January 2014

Language models are an essential element of natural language processing, central to tasks ranging from spellchecking to machine translation. Given an arbitrary piece of text, a language model determines whether that text belongs to a given language.

We can give a concrete example with a probabilistic language model: a specific construction which uses probabilities to estimate how likely any given string is to belong to a language. Consider a probabilistic English language model $P_E$. We would expect the probability $P_E(\text{I went to the store})$ to be quite high, since this is valid English. On the other hand, we expect the probabilities $P_E(\text{store went to I the})$ and $P_E(\text{Ich habe eine Katze})$ to be very low, since these fragments do not constitute proper English text.

I don't aim to cover the entirety of language models at the moment; that would be an ambitious task for a single blog post.
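To make the idea concrete, here is a toy sketch (my own, not code from the original post) of how a probabilistic language model scores a string: the probability of the whole sentence factors into per-word conditional probabilities via the chain rule, here under a bigram (Markov) assumption. The names `sentence_logprob` and `bigram_prob` are hypothetical; the callback stands in for any estimator of $P(w_i \mid w_{i-1})$, such as the Kneser-Ney model derived later in this post.

```python
import math
from typing import Callable, Sequence

def sentence_logprob(words: Sequence[str],
                     bigram_prob: Callable[[str, str], float]) -> float:
    """Return sum_i log P(w_i | w_{i-1}), using <s> as a start-of-sentence symbol."""
    logp = 0.0
    prev = "<s>"
    for word in words:
        # bigram_prob(word, prev) is any smoothed estimate of P(word | prev)
        logp += math.log(bigram_prob(word, prev))
        prev = word
    return logp

# A well-formed English sentence should score higher than a scrambled one:
#   sentence_logprob("I went to the store".split(), p) >
#   sentence_logprob("store went to I the".split(), p)
```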

If you haven't encountered language models or n-grams before, I recommend the following resources:

- Language model on Wikipedia
- Chapter 4 of Jurafsky and Martin's Speech and Language Processing
- Chapter 7 of Statistical Machine Translation (see summary slides online)

I'd like to jump ahead to a trickier subject within language modeling known as Kneser-Ney smoothing. This smoothing method is most commonly applied in an interpolated form,[1] and this is the form that I'll present today.

Kneser-Ney evolved from absolute-discounting interpolation, which makes use of both higher-order (i.e., higher-n) and lower-order language models, reallocating some probability mass from 4-grams or 3-grams to simpler unigram models. The formula for absolute-discounting smoothing as applied to a bigram language model is presented below:

$$P_{abs}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1} w')} + \alpha \, p_{abs}(w_i)$$

Here $\delta$ refers to a fixed discount value, and $\alpha$ is a normalizing constant. The details of this smoothing are covered in Chen and Goodman (1999).

The essence of Kneser-Ney is in the clever observation that we can take advantage of this interpolation as a sort of backoff model. When the first term (in this case, the discounted relative bigram count) is near zero, the second term (the lower-order model) carries more weight. Inversely, when the higher-order model matches strongly, the second, lower-order term has little weight.
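As a concrete illustration (my own sketch, not code from the post), the function below implements absolute-discounting interpolation for bigrams. It assumes `bigram_counts` is a `collections.Counter` keyed by `(prev, word)` pairs, uses a maximum-likelihood unigram model as the lower-order distribution, and picks `delta = 0.75` as a typical discount; these names and defaults are all illustrative choices.

```python
from collections import Counter

def absolute_discounting(bigram_counts: Counter, delta: float = 0.75):
    """Build P_abs(word | prev) from bigram counts, discounting each count by `delta`."""
    context_counts = Counter()   # c(w_{i-1}): how often each word opens a bigram
    unigram_counts = Counter()   # c(w_i): how often each word closes a bigram
    follower_types = Counter()   # |{w' : c(w_{i-1}, w') > 0}|
    for (prev, word), count in bigram_counts.items():
        context_counts[prev] += count
        unigram_counts[word] += count
        follower_types[prev] += 1
    total_tokens = sum(unigram_counts.values())

    def p_abs(word: str, prev: str) -> float:
        p_unigram = unigram_counts[word] / total_tokens   # MLE unigram back-off
        c_prev = context_counts[prev]
        if c_prev == 0:
            return p_unigram                  # unseen context: fall back entirely
        discounted = max(bigram_counts[(prev, word)] - delta, 0.0) / c_prev
        alpha = delta * follower_types[prev] / c_prev      # mass freed by discounting
        return discounted + alpha * p_unigram

    return p_abs
```

The normalizer `alpha` is the probability mass removed from the seen bigrams in this context, so interpolating it with a normalized unigram model redistributes that mass rather than inventing new mass.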

The Kneser-Ney design retains the first term of absolute-discounting interpolation, but rewrites the second term to take advantage of this relationship. Whereas absolute-discounting interpolation in a bigram model would simply default to a unigram model in the second term, Kneser-Ney depends upon the idea of a continuation probability associated with each unigram. This probability for a given token $w_i$ is proportional to the number of bigrams which it completes:

$$P_{\text{continuation}}(w_i) \propto |\{w_{i-1} : c(w_{i-1}, w_i) > 0\}|$$

This quantity is normalized by dividing by the total number of bigram types (note that $j$ is a free variable):

$$P_{\text{continuation}}(w_i) = \frac{|\{w_{i-1} : c(w_{i-1}, w_i) > 0\}|}{|\{w_{j-1} : c(w_{j-1}, w_j) > 0\}|}$$
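In code, the continuation probability needs only type counts, not token counts. The sketch below (again mine, under the same assumed `bigram_counts` Counter keyed by `(prev, word)` pairs) counts, for each word, the number of distinct bigram types it completes and normalizes by the total number of bigram types.

```python
from collections import Counter

def continuation_prob(bigram_counts: Counter):
    """Build P_continuation(word) from the set of observed bigram types."""
    continuation_counts = Counter()   # |{w_{i-1} : c(w_{i-1}, w_i) > 0}|
    for (prev, word) in bigram_counts:
        continuation_counts[word] += 1
    total_bigram_types = len(bigram_counts)   # number of distinct bigram types

    def p_continuation(word: str) -> float:
        return continuation_counts[word] / total_bigram_types

    return p_continuation
```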

The common example used to demonstrate the efficacy of Kneser-Ney is the phrase "San Francisco." Suppose this phrase is abundant in a given training corpus. Then the unigram probability of "Francisco" will also be high. If we unwisely use something like absolute-discounting interpolation in a context where our bigram model is weak, the unigram model portion may take over and lead to some strange results. Dan Jurafsky gives the following example context:

    I can't see without my reading ___________.

A fluent English speaker reading this sentence knows that the word "glasses" should fill in the blank. But since "San Francisco" is a common term, absolute-discounting interpolation might declare that "Francisco" is a better fit: $P_{abs}(\text{Francisco}) > P_{abs}(\text{glasses})$.

Kneser-Ney fixes this problem by asking a slightly harder question of our lower-order model. Whereas the unigram model simply tells us how likely a word $w_i$ is to appear, Kneser-Ney's second term determines how likely a word $w_i$ is to appear in an unfamiliar bigram context.

Kneser-Ney in whole follows:

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1} w')} + \lambda(w_{i-1}) \frac{|\{w_{i-1} : c(w_{i-1}, w_i) > 0\}|}{|\{w_{j-1} : c(w_{j-1}, w_j) > 0\}|}$$

where $\lambda$ is a normalizing constant:

$$\lambda(w_{i-1}) = \frac{\delta}{c(w_{i-1})} \, |\{w' : c(w_{i-1}, w') > 0\}|$$

Note that the denominator of the first term can be simplified to a unigram count. Here is the final interpolated Kneser-Ney smoothed bigram model, in all its glory:

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - \delta, 0)}{c(w_{i-1})} + \lambda(w_{i-1}) \frac{|\{w_{i-1} : c(w_{i-1}, w_i) > 0\}|}{|\{w_{j-1} : c(w_{j-1}, w_j) > 0\}|}$$
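Putting the pieces together, here is a minimal sketch (again my own, not the post's code) of the interpolated Kneser-Ney bigram model above, using the simplified unigram-count denominator and the context-dependent normalizer $\lambda(w_{i-1})$. It reuses the assumed `bigram_counts` Counter and the illustrative `delta = 0.75` discount.

```python
from collections import Counter

def kneser_ney(bigram_counts: Counter, delta: float = 0.75):
    """Build interpolated Kneser-Ney P_KN(word | prev) from bigram counts."""
    context_counts = Counter()        # c(w_{i-1})
    follower_types = Counter()        # |{w' : c(w_{i-1}, w') > 0}|
    continuation_counts = Counter()   # |{w_{i-1} : c(w_{i-1}, w_i) > 0}|
    for (prev, word), count in bigram_counts.items():
        context_counts[prev] += count
        follower_types[prev] += 1
        continuation_counts[word] += 1
    total_bigram_types = len(bigram_counts)

    def p_kn(word: str, prev: str) -> float:
        p_cont = continuation_counts[word] / total_bigram_types
        c_prev = context_counts[prev]
        if c_prev == 0:
            return p_cont                 # unseen context: continuation model only
        discounted = max(bigram_counts[(prev, word)] - delta, 0.0) / c_prev
        lam = delta * follower_types[prev] / c_prev   # lambda(w_{i-1})
        return discounted + lam * p_cont

    return p_kn
```

Under this scheme "Francisco" receives a small continuation count, because it completes very few distinct bigram types (almost always following "San"), even though its raw unigram count is large; "glasses" follows many different words and so wins in the reading-glasses context.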

Further reading

If you enjoyed this post, here is some further reading on Kneser-Ney and other smoothing methods:

- Bill MacCartney's smoothing tutorial (very accessible)
- Chen and Goodman (1999)
- Section 4.9.1 in Jurafsky and Martin's Speech and Language Processing

Footnotes

1. For the canonical definition of interpolated Kneser-Ney smoothing, see S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," Computer Speech and Language, vol. 13, no. 4, pp. 359-394, 1999.


Jon Gauthier, Cambridge, Massachusetts