CMPT-825 Natural Language Processing


CMPT-825 Natural Language Processing
Anoop Sarkar
http://www.cs.sfu.ca/~anoop
February 27, 2008

Outline
Cross-Entropy and Perplexity
Smoothing n-gram Models
Add-one Smoothing
Additive Smoothing
Good-Turing Smoothing
Backoff Smoothing
Event Space for n-gram Models

How good is a model?
So far we have seen the probability of a sentence: $P(w_0, \ldots, w_n)$.
What is the probability of a collection of sentences, that is, the probability of a corpus?
Let $T = s_0, \ldots, s_m$ be a text corpus with sentences $s_0$ through $s_m$. What is $P(T)$?
Assume that $P(\cdot)$ was trained on some training data, and $T$ is the test data.

How good is a model?
$T = s_0, \ldots, s_m$ is the text corpus with sentences $s_0$ through $s_m$.
$$P(T) = \prod_{i=0}^{m} P(s_i) \qquad P(s_i) = P(w_0^i, \ldots, w_n^i)$$
Let $W_T$ be the length of the text $T$ measured in words. Then for the unigram model, $P(T) = \prod_{w \in T} P(w)$.
A problem: we want to compare two different models $P_1$ and $P_2$ on $T$. To do this we use the per-word perplexity of the model:
$$PP_P(T) = P(T)^{-\frac{1}{W_T}} = \sqrt[W_T]{\frac{1}{P(T)}}$$

How good is a model?
The per-word perplexity of the model is: $PP_P(T) = P(T)^{-\frac{1}{W_T}}$
Recall that $PP_P(T) = 2^{H_P(T)}$, where $H_P(T)$ is the cross-entropy of $P$ for text $T$. Therefore,
$$H_P(T) = \log_2 PP_P(T) = -\frac{1}{W_T} \log_2 P(T)$$
Above we used a unigram model $P(w)$, but the same derivation holds for bigram, trigram, ... models.
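
The relationship between perplexity and cross-entropy is easy to check on a toy example. Below is a minimal sketch in Python, assuming a hypothetical training and test corpus small enough that every test word also occurs in training (so no zero probabilities arise before smoothing is introduced).

```python
import math
from collections import Counter

def unigram_mle(train_tokens):
    """Maximum-likelihood unigram probabilities P(w) = c(w) / N."""
    counts = Counter(train_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cross_entropy_and_perplexity(model, test_tokens):
    """H_P(T) = -(1/W_T) * sum log2 P(w);  PP_P(T) = 2 ** H_P(T)."""
    log_prob = sum(math.log2(model[w]) for w in test_tokens)
    W_T = len(test_tokens)
    H = -log_prob / W_T
    return H, 2 ** H

# Toy (hypothetical) data: train and test share the same vocabulary, so no
# zero probabilities occur -- smoothing (next slides) handles the general case.
train = "the cat sat on the mat the cat ate".split()
test = "the cat sat on the mat".split()
P = unigram_mle(train)
H, PP = cross_entropy_and_perplexity(P, test)
print(f"cross-entropy = {H:.3f} bits/word, perplexity = {PP:.3f}")
```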

How good is a model?
Lower cross-entropy and perplexity values are better: lower values mean a better model, and they correlate with the performance of the language model in various applications.
The cross-entropy of a language model on test data (unseen data) corresponds to the number of bits required to encode that data.
On various real-life datasets, typical perplexity values yielded by n-gram models on English text range from about 50 to almost 1000 (corresponding to cross-entropies from about 6 to 10 bits/word).

Outline
Cross-Entropy and Perplexity
Smoothing n-gram Models
Add-one Smoothing
Additive Smoothing
Good-Turing Smoothing
Backoff Smoothing
Event Space for n-gram Models

Bigram Models
In practice:
$P(\text{Mork read a book}) = P(\text{Mork} \mid {<}start{>}) \cdot P(\text{read} \mid \text{Mork}) \cdot P(\text{a} \mid \text{read}) \cdot P(\text{book} \mid \text{a}) \cdot P({<}stop{>} \mid \text{book})$
$$P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$
On unseen data, $c(w_{i-1}, w_i)$ or, worse, $c(w_{i-1})$ could be zero:
$$\sum_{w_i} \frac{c(w_{i-1}, w_i)}{c(w_{i-1})} = ?$$
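
A short sketch of maximum-likelihood bigram estimation on hypothetical toy sentences; the sentences and the <start>/<stop> markers are made up for illustration, but the code shows exactly how unseen events end up with zero (or undefined) probability.

```python
from collections import Counter

# Toy corpus (hypothetical), padded with sentence-boundary markers.
sentences = [["Mork", "read", "a", "book"], ["Mindy", "read", "a", "book"]]

bigram_counts = Counter()
context_counts = Counter()
for s in sentences:
    padded = ["<start>"] + s + ["<stop>"]
    for w_prev, w in zip(padded, padded[1:]):
        bigram_counts[(w_prev, w)] += 1
        context_counts[w_prev] += 1

def p_mle(w, w_prev):
    """P(w | w_prev) = c(w_prev, w) / c(w_prev); zero (or undefined) on unseen events."""
    if context_counts[w_prev] == 0:
        return 0.0   # c(w_prev) == 0: the estimate is not even defined
    return bigram_counts[(w_prev, w)] / context_counts[w_prev]

print(p_mle("read", "Mork"))    # 1.0 -- seen once after "Mork"
print(p_mle("wrote", "Mork"))   # 0.0 -- unseen bigram: the problem smoothing addresses
```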

Smoothing
Smoothing deals with events that have been observed zero times. Smoothing algorithms also tend to improve the accuracy of the model.
$$P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$
Not just unobserved events: what about events observed only once?

Add-one Smoothing
Maximum-likelihood estimate: $P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
Add-one smoothing: $P(w_i \mid w_{i-1}) = \frac{1 + c(w_{i-1}, w_i)}{V + c(w_{i-1})}$
Let $V$ be the number of words in our vocabulary. Assign a count of 1 to unseen bigrams.

Add-one Smoothing
$P(\text{Mindy read a book}) = P(\text{Mindy} \mid {<}start{>}) \cdot P(\text{read} \mid \text{Mindy}) \cdot P(\text{a} \mid \text{read}) \cdot P(\text{book} \mid \text{a}) \cdot P({<}stop{>} \mid \text{book})$
Without smoothing: $P(\text{read} \mid \text{Mindy}) = \frac{c(\text{Mindy}, \text{read})}{c(\text{Mindy})} = 0$
With add-one smoothing (assuming $c(\text{Mindy}) = 1$ but $c(\text{Mindy}, \text{read}) = 0$): $P(\text{read} \mid \text{Mindy}) = \frac{1}{V + 1}$

Additive Smoothing (Lidstone 1920, Jeffreys 1948)
Maximum-likelihood estimate: $P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
Add-one smoothing works horribly in practice: 1 seems to be too large a count for unobserved events.
Additive smoothing: $P(w_i \mid w_{i-1}) = \frac{\delta + c(w_{i-1}, w_i)}{(\delta \cdot V) + c(w_{i-1})}$, with $0 < \delta \leq 1$
Still works horribly in practice, but better than add-one smoothing.
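
A minimal sketch of additive smoothing, with hypothetical counts chosen to match the Mindy example above; setting delta = 1 recovers add-one smoothing.

```python
from collections import Counter

def additive_bigram_prob(w, w_prev, bigram_counts, context_counts, vocab, delta=0.5):
    """Additive (Lidstone) smoothing:
    P(w | w_prev) = (delta + c(w_prev, w)) / (delta * V + c(w_prev)).
    With delta = 1 this reduces to add-one smoothing."""
    V = len(vocab)
    return (delta + bigram_counts[(w_prev, w)]) / (delta * V + context_counts[w_prev])

# Hypothetical counts matching the slide's example: c(Mindy) = 1, c(Mindy, read) = 0.
bigram_counts = Counter()
context_counts = Counter({"Mindy": 1})
vocab = {"Mindy", "Mork", "read", "a", "book", "<stop>"}   # toy vocabulary, V = 6

# Add-one (delta = 1) gives 1 / (V + 1), as on the previous slide.
print(additive_bigram_prob("read", "Mindy", bigram_counts, context_counts, vocab, delta=1.0))
# A smaller delta reserves less probability mass for unseen bigrams.
print(additive_bigram_prob("read", "Mindy", bigram_counts, context_counts, vocab, delta=0.1))
```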

Good-Turing Smoothing (Good, 1953)
Maximum-likelihood estimate: $P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
Imagine you're sitting at a sushi bar with a conveyor belt. You see going past you 10 plates of tuna, 3 plates of unagi, 2 plates of salmon, 1 plate of shrimp, 1 plate of octopus, and 1 plate of yellowtail.
Chance that you will observe a new kind of seafood: $\frac{3}{18}$
How likely are you to see another plate of salmon? It should be $< \frac{2}{18}$.

Good-Turing Smoothing
How many types of seafood (words) were seen once? Use this to predict probabilities for unseen events.
Let $n_1$ be the number of events that occurred once: $p_0 = \frac{n_1}{N}$
The Good-Turing estimate states that for any n-gram that occurs $r$ times, we should pretend that it occurs $r^*$ times, where
$$r^* = (r + 1) \frac{n_{r+1}}{n_r}$$

Good-Turing Smoothing
10 tuna, 3 unagi, 2 salmon, 1 shrimp, 1 octopus, 1 yellowtail
How likely is new data? Let $n_1$ be the number of items occurring once, which is 3 in this case. $N$ is the total, which is 18.
$$p_0 = \frac{n_1}{N} = \frac{3}{18} \approx 0.166$$

Good-Turing Smoothing
10 tuna, 3 unagi, 2 salmon, 1 shrimp, 1 octopus, 1 yellowtail
How likely is octopus? Since $c(\text{octopus}) = 1$, the GT estimate is $1^*$:
$$r^* = (r + 1) \frac{n_{r+1}}{n_r} \qquad p_{GT} = \frac{r^*}{N}$$
To compute $1^*$, we need $n_1 = 3$ and $n_2 = 1$:
$$1^* = 2 \cdot \frac{1}{3} = \frac{2}{3} \qquad p_1 = \frac{2/3}{18} = 0.037$$
What happens when $n_{r+1} = 0$? (smoothing before smoothing)
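
The sushi numbers above can be reproduced directly from counts-of-counts. A minimal sketch, using the conveyor-belt observations as the data:

```python
from collections import Counter

# The sushi-bar data from the slide.
observations = ["tuna"] * 10 + ["unagi"] * 3 + ["salmon"] * 2 + \
               ["shrimp", "octopus", "yellowtail"]

counts = Counter(observations)                  # c(item)
count_of_counts = Counter(counts.values())      # n_r = number of items seen r times
N = len(observations)                           # 18

# Probability of seeing a brand-new kind of seafood: p_0 = n_1 / N
p0 = count_of_counts[1] / N
print(f"p_0 = {p0:.3f}")                        # 3/18 = 0.167

def adjusted_count(r):
    """Good-Turing adjusted count r* = (r + 1) * n_{r+1} / n_r.
    Undefined when n_{r+1} = 0 -- that gap is what Simple Good-Turing
    (next slide) fills in by smoothing the n_r values themselves."""
    if count_of_counts[r + 1] == 0:
        raise ValueError(f"n_{r + 1} = 0: smooth the counts-of-counts first")
    return (r + 1) * count_of_counts[r + 1] / count_of_counts[r]

r_star = adjusted_count(1)                      # 2 * n_2 / n_1 = 2/3
print(f"1* = {r_star:.3f}, p_1 = {r_star / N:.3f}")   # 0.667, 0.037
```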

Simple Good-Turing: linear interpolation for missing $n_{r+1}$
$$f(r) = a + b \cdot r, \quad a \approx 2.3, \quad b \approx -0.17$$

r    n_r = f(r)
1    2.14
2    1.97
3    1.80
4    1.63
5    1.46
6    1.29
7    1.12
8    0.95
9    0.78
10   0.61
11   0.44
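
A tiny sketch of this fill-in rule. The exact coefficients are an assumption: a = 2.31 and b = -0.17 reproduce the table above to two decimal places (the slide lists the rounded a = 2.3).

```python
# Linear fit used to fill in missing counts-of-counts for Good-Turing.
a, b = 2.31, -0.17   # assumed values chosen to reproduce the slide's table

def smoothed_n(r):
    """Smoothed count-of-counts f(r) = a + b*r, usable when the raw n_{r+1} is 0."""
    return a + b * r

for r in range(1, 12):
    print(f"{r:2d}  {smoothed_n(r):.2f}")
```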

Comparison between Add-one and Good-Turing

freq r   n_r (num with freq r)   NS p_r   Add1 p_r   SGT p_r
0        0                       0        0.0294     0.12
1        3                       0.04     0.0588     0.03079
2        2                       0.08     0.0882     0.06719
3        1                       0.12     0.1176     0.1045
5        1                       0.2      0.1764     0.1797
10       1                       0.4      0.3235     0.3691

$N = (1 \cdot 3) + (2 \cdot 2) + 3 + 5 + 10 = 25$
$V = 1 + 3 + 2 + 1 + 1 + 1 = 9$
Important: we added a new word type for unseen words. Let's call it UNK, the unknown word.
Check that $1.0 = \sum_r n_r \, p_r$:
$0.12 + (3 \cdot 0.03079) + (2 \cdot 0.06719) + 0.1045 + 0.1797 + 0.3691 = 1.0$

Comparison between Add-one and Good-Turing

freq r   n_r (num with freq r)   NS p_r   Add1 p_r   SGT p_r
0        0                       0        0.0294     0.12
1        3                       0.04     0.0588     0.03079
2        2                       0.08     0.0882     0.06719
3        1                       0.12     0.1176     0.1045
5        1                       0.2      0.1764     0.1797
10       1                       0.4      0.3235     0.3691

NS = No smoothing: $p_r = \frac{r}{N}$
Add1 = Add-one smoothing: $p_r = \frac{1 + r}{V + N}$
SGT = Simple Good-Turing: $p_0 = \frac{n_1}{N}$, $p_r = \frac{(r+1) \frac{n_{r+1}}{n_r}}{N}$, with linear interpolation for the missing values where $n_{r+1} = 0$ (Gale and Sampson, 1995) http://www.grsampson.net/agtf1.html
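
A quick sanity check of the table. The NS and Add1 columns can be recomputed from the $n_r$ values; the SGT column is copied from the slide (it comes from Gale and Sampson's full procedure, which is not re-derived here) and checked to sum to one.

```python
# Check of the comparison table above.
n = {1: 3, 2: 2, 3: 1, 5: 1, 10: 1}           # n_r: number of word types with frequency r
N = sum(r * n_r for r, n_r in n.items())      # 25 tokens
V = 1 + sum(n.values())                       # 9 types, including one UNK type for r = 0

ns   = {r: r / N for r in n}                           # no smoothing
add1 = {r: (1 + r) / (V + N) for r in list(n) + [0]}   # add-one
sgt  = {0: 0.12, 1: 0.03079, 2: 0.06719, 3: 0.1045, 5: 0.1797, 10: 0.3691}  # from the slide

print({r: round(p, 4) for r, p in add1.items()})       # matches the Add1 column

# Sum check: one UNK type at r = 0 plus n_r types at each observed frequency r.
total = sgt[0] + sum(n[r] * sgt[r] for r in n)
print(f"SGT sum = {total:.4f}")                        # approximately 1.0
```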

Simple Backoff Smoothing: incorrect version
$$P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$
In add-one or Good-Turing smoothing: $P(\text{the} \mid \text{string}) = P(\text{Fonz} \mid \text{string})$ if both bigrams are unseen.
If $c(w_{i-1}, w_i) = 0$, then use $P(w_i)$ (back off). The same idea works for trigrams: back off to bigrams and then to unigrams.
This works better in practice, but the probabilities get mixed up (unseen bigrams, for example, can get higher probabilities than seen bigrams).

Backoff Smoothing: Jelinek-Mercer Smoothing
$$P_{ML}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$
$$P_{JM}(w_i \mid w_{i-1}) = \lambda P_{ML}(w_i \mid w_{i-1}) + (1 - \lambda) P_{ML}(w_i), \quad \text{where } 0 \leq \lambda \leq 1$$
Notice that $P_{JM}(\text{the} \mid \text{string}) > P_{JM}(\text{Fonz} \mid \text{string})$, as we wanted.
Jelinek-Mercer (1980) describe an elegant recursive form of this interpolation:
$$P_{JM}(\text{n-gram}) = \lambda P_{ML}(\text{n-gram}) + (1 - \lambda) P_{JM}(\text{(n-1)-gram})$$
What about $P_{JM}(w_i)$? For missing unigrams: $P_{JM}(w_i) = \lambda P_{ML}(w_i) + (1 - \lambda) \frac{\delta}{V}$
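
A minimal sketch of Jelinek-Mercer interpolation for a bigram model on hypothetical toy counts; lambda is fixed here, although in practice it is tuned on held-out data (see deleted interpolation below).

```python
from collections import Counter

train = "the cat sat on the mat the dog sat on the cat".split()   # toy corpus

unigram = Counter(train)
bigram = Counter(zip(train, train[1:]))
N = len(train)
V = len(unigram)

def p_ml_unigram(w):
    return unigram[w] / N

def p_ml_bigram(w, w_prev):
    return bigram[(w_prev, w)] / unigram[w_prev] if unigram[w_prev] else 0.0

def p_jm(w, w_prev, lam=0.7, delta=1.0):
    """P_JM(w | w_prev) = lam * P_ML(w | w_prev) + (1 - lam) * P_JM(w),
    with the unigram level itself interpolated with delta / V."""
    p_uni = lam * p_ml_unigram(w) + (1 - lam) * delta / V
    return lam * p_ml_bigram(w, w_prev) + (1 - lam) * p_uni

# Both (cat, the) and (cat, dog) are unseen bigrams, but "the" is a frequent
# word, so it gets the higher interpolated probability, as the slide wants.
print(p_jm("the", "cat") > p_jm("dog", "cat"))   # True
```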

Backoff Smoothing: Many alternatives
$$P_{JM}(\text{n-gram}) = \lambda P_{ML}(\text{n-gram}) + (1 - \lambda) P_{JM}(\text{(n-1)-gram})$$
Different methods for finding the values of $\lambda$ correspond to a variety of different smoothing methods.
Katz Backoff (combines Good-Turing with backoff smoothing):
$$P_{katz}(y \mid x) = \begin{cases} \frac{c^*(xy)}{c(x)} & \text{if } c(xy) > 0 \\ \alpha(x) P_{katz}(y) & \text{otherwise} \end{cases}$$
where $\alpha(x)$ is chosen to make sure that $P_{katz}(y \mid x)$ is a proper probability:
$$\alpha(x) = 1 - \sum_{y} \frac{c^*(xy)}{c(x)}$$
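
A sketch of Katz-style backoff for bigrams. Two simplifications are assumptions made here for brevity: a fixed discount of 0.5 stands in for the Good-Turing adjusted counts $c^*$, and alpha(x) is additionally normalized over the unseen continuations so that the distribution sums to exactly one.

```python
from collections import Counter

train = "the cat sat on the mat the dog sat on the cat".split()   # toy corpus
unigram = Counter(train)
bigram = Counter(zip(train, train[1:]))
N = len(train)
vocab = set(unigram)
DISCOUNT = 0.5   # stand-in for Good-Turing discounting of seen counts

def p_unigram(y):
    return unigram[y] / N

def c_star(x, y):
    """Discounted count c*(xy) for seen bigrams, 0 for unseen ones."""
    return bigram[(x, y)] - DISCOUNT if bigram[(x, y)] > 0 else 0.0

def alpha(x):
    """Leftover mass for context x, normalized over its unseen continuations."""
    leftover = 1.0 - sum(c_star(x, y) / unigram[x] for y in vocab if bigram[(x, y)] > 0)
    unseen_mass = sum(p_unigram(y) for y in vocab if bigram[(x, y)] == 0)
    return leftover / unseen_mass

def p_katz(y, x):
    if bigram[(x, y)] > 0:
        return c_star(x, y) / unigram[x]
    return alpha(x) * p_unigram(y)

print(sum(p_katz(y, "the") for y in vocab))   # 1.0: a proper distribution
```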

Backoff Smoothing: Many alternatives
$$P_{JM}(\text{n-gram}) = \lambda P_{ML}(\text{n-gram}) + (1 - \lambda) P_{JM}(\text{(n-1)-gram})$$
Deleted Interpolation (Jelinek, Mercer): compute the $\lambda$ values to minimize cross-entropy on held-out data, which is deleted from the initial set of training data.
Improved JM smoothing uses a separate $\lambda$ for each context $w_{i-1}$:
$$P_{JM}(w_i \mid w_{i-1}) = \lambda(w_{i-1}) P_{ML}(w_i \mid w_{i-1}) + (1 - \lambda(w_{i-1})) P_{ML}(w_i)$$
where $\sum_i \lambda(w_i) = 1$ because $\sum_{w_i} P(w_i \mid w_{i-1}) = 1$
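
A minimal sketch of deleted interpolation using the Expectation-Maximization algorithm: a single global lambda (rather than one per context word, as in the improved variant) is tuned on hypothetical held-out data withheld from training.

```python
from collections import Counter

train = "the cat sat on the mat the dog sat on the cat".split()   # toy training data
heldout = "the cat sat on the mat".split()                        # toy held-out data

unigram = Counter(train)
bigram = Counter(zip(train, train[1:]))
N = len(train)

def p_uni(w):
    return unigram[w] / N

def p_bi(w, w_prev):
    return bigram[(w_prev, w)] / unigram[w_prev] if unigram[w_prev] else 0.0

lam = 0.5
for _ in range(20):                      # EM iterations
    expected_bi = 0.0
    total = 0
    for w_prev, w in zip(heldout, heldout[1:]):
        mix = lam * p_bi(w, w_prev) + (1 - lam) * p_uni(w)
        expected_bi += lam * p_bi(w, w_prev) / mix   # E-step: bigram responsibility
        total += 1
    lam = expected_bi / total                        # M-step: re-estimate lambda
print(f"lambda tuned on held-out data: {lam:.3f}")
```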

Backoff Smoothing: Many alternatives
$$P_{JM}(\text{n-gram}) = \lambda P_{ML}(\text{n-gram}) + (1 - \lambda) P_{JM}(\text{(n-1)-gram})$$
Witten-Bell smoothing: use the (n-1)-gram model when the n-gram model has too few unique words in the n-gram context.
Absolute discounting (Ney, Essen, Kneser):
$$P_{abs}(y \mid x) = \begin{cases} \frac{c(xy) - D}{c(x)} & \text{if } c(xy) > 0 \\ \alpha(x) P_{abs}(y) & \text{otherwise} \end{cases}$$
Compute $\alpha(x)$ as was done in Katz smoothing.
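
A sketch of absolute discounting in its interpolated form, a common variant of the backoff formulation above: a fixed D is subtracted from every seen bigram count and the freed mass is redistributed through the unigram model. The toy data and D = 0.75 are assumptions.

```python
from collections import Counter

train = "the cat sat on the mat the dog sat on the cat".split()   # toy corpus
unigram = Counter(train)
bigram = Counter(zip(train, train[1:]))
N = len(train)
D = 0.75   # assumed discount

followers = Counter(x for (x, _y) in bigram)   # N1+(x): distinct words seen after x

def p_abs(y, x):
    """max(c(xy) - D, 0) / c(x) plus the freed mass D * N1+(x) / c(x) times P(y)."""
    discounted = max(bigram[(x, y)] - D, 0.0) / unigram[x]
    backoff_weight = D * followers[x] / unigram[x]
    return discounted + backoff_weight * (unigram[y] / N)

print(sum(p_abs(y, "the") for y in set(unigram)))   # 1.0: a proper distribution
```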

Backoff Smoothing: Many alternatives
$$P_{JM}(\text{n-gram}) = \lambda P_{ML}(\text{n-gram}) + (1 - \lambda) P_{JM}(\text{(n-1)-gram})$$
Kneser-Ney smoothing
Simple interpolation gives $P(\text{Francisco} \mid \text{eggplant}) > P(\text{stew} \mid \text{eggplant})$: Francisco is common, so interpolation assigns $P(\text{Francisco} \mid \text{eggplant})$ a high value. But Francisco occurs in few contexts (only after San), whereas stew is common and occurs in many contexts.
Hence, weight the interpolation based on the number of contexts for the word, using discounting.
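
The Kneser-Ney intuition can be seen in a few lines: base the lower-order probability on how many distinct contexts a word follows (its continuation count) rather than on its raw frequency. The sentences below are hypothetical.

```python
from collections import Counter

# Toy sentences: "Francisco" is frequent overall but only ever follows "San";
# "stew" is slightly less frequent but follows many different words.
sentences = [
    "they ate the eggplant stew".split(),
    "he cooked a hearty stew".split(),
    "she ordered beef stew".split(),
    "we visited San Francisco".split(),
    "he moved to San Francisco".split(),
    "she loves San Francisco".split(),
    "they toured San Francisco".split(),
]

bigram_types = set()
unigram = Counter()
for s in sentences:
    unigram.update(s)
    bigram_types.update(zip(s, s[1:]))

# Continuation count: number of distinct words that precede w.
continuation = Counter(y for (_x, y) in bigram_types)
total_bigram_types = len(bigram_types)

def p_continuation(w):
    return continuation[w] / total_bigram_types

# Raw frequency prefers Francisco; the continuation probability prefers stew.
print(unigram["Francisco"], unigram["stew"])                  # 4 vs. 3
print(p_continuation("Francisco") < p_continuation("stew"))   # True
```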

Backoff Smoothing: Many alternatives
$$P_{JM}(\text{n-gram}) = \lambda P_{ML}(\text{n-gram}) + (1 - \lambda) P_{JM}(\text{(n-1)-gram})$$
Modified Kneser-Ney smoothing (Chen and Goodman): multiple discounts, one each for counts of one, two, and three or more.
Finding $\lambda$: use generalized line search (Powell search) or the Expectation-Maximization algorithm.

Trigram Models
Revisiting the trigram model:
$$P(w_1, w_2, \ldots, w_n) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1, w_2) \cdot P(w_4 \mid w_2, w_3) \cdots P(w_i \mid w_{i-2}, w_{i-1}) \cdots P(w_n \mid w_{n-2}, w_{n-1})$$
Notice that the length of the sentence, $n$, is variable. What is the event space?

The stop symbol
Let $\Sigma = \{a, b\}$ and the language be $\Sigma^*$, so $L = \{\epsilon, a, b, aa, ab, ba, bb, \ldots\}$
Consider a unigram model: $P(a) = P(b) = 0.5$
Then $P(a) = 0.5$, $P(b) = 0.5$, $P(aa) = 0.5^2 = 0.25$, $P(bb) = 0.25$, and so on.
But $P(a) + P(b) + P(aa) + P(bb) = 1.5$!! We want $\sum_w P(w) = 1$.

The stop symbol
What went wrong? There is no probability for $P(\epsilon)$.
Add a special stop symbol: $P(a) = P(b) = 0.25$, $P(stop) = 0.5$
Then $P(stop) = 0.5$, $P(a \cdot stop) = P(b \cdot stop) = 0.25 \cdot 0.5 = 0.125$, $P(aa \cdot stop) = 0.25^2 \cdot 0.5 = 0.03125$, ...
(now the sum is no longer greater than one)

The stop symbol
With this new stop symbol we can show that $\sum_w P(w) = 1$.
Notice that the probability of any sequence of length $n$ is $0.25^n \cdot 0.5$. Also, there are $2^n$ sequences of length $n$.
$$\sum_w P(w) = \sum_{n=0}^{\infty} 2^n \cdot 0.25^n \cdot 0.5 = \sum_{n=0}^{\infty} 0.5^n \cdot 0.5 = \sum_{n=0}^{\infty} 0.5^{n+1} = \sum_{n=1}^{\infty} 0.5^n = 1$$
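
A small numeric check of this derivation: truncating the sum at longer and longer maximum lengths approaches 1.

```python
# Sum 2^n * 0.25^n * 0.5 over sequence lengths n = 0 .. max_len.
def total_probability(max_len):
    return sum((2 ** n) * (0.25 ** n) * 0.5 for n in range(max_len + 1))

for max_len in (1, 5, 10, 20, 50):
    print(max_len, total_probability(max_len))   # 0.75, 0.984..., ..., approaching 1.0
```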