6.864 (Fall 2007): Smoothed Estimation, and Language Modeling

Overview
- The language modeling problem
- Trigram models
- Evaluating language models: perplexity
- Linear interpolation
- Discounting methods

The Language Modeling Problem
We have some (finite) vocabulary, say V = {the, a, man, telescope, Beckham, two, ...}
We have an (infinite) set of strings V†, for example:
  the
  a
  the fan
  the fan saw Beckham
  the fan saw saw
  the fan saw Beckham play for Real Madrid

The Language Modeling Problem (Continued)
We have a training sample of example sentences in English.
We need to learn a probability distribution P̂, i.e., a function that satisfies
  Σ_{x ∈ V†} P̂(x) = 1,   P̂(x) ≥ 0 for all x ∈ V†
For example:
  P̂(the) = 10^-12
  P̂(the fan) = 10^-8
  P̂(the fan saw Beckham) = 2 × 10^-8
  P̂(the fan saw saw) = 10^-15
  P̂(the fan saw Beckham play for Real Madrid) = 2 × 10^-9
Usual assumption: the training sample is drawn from some underlying distribution P, and we want P̂ to be as close to P as possible.
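As a concrete (toy) picture of what an estimated distribution P̂ looks like, here is a small Python sketch that builds the empirical (maximum-likelihood) distribution over whole sentences from a made-up training sample; the corpus is an illustrative assumption, not data from the lecture. It also hints at the sparse-data problem: any sentence not in the sample gets probability 0.

    from collections import Counter

    # A toy training sample of sentences (illustrative only).
    training_sample = [
        "the fan saw Beckham",
        "the fan saw Beckham",
        "the fan cheered",
        "Beckham played for Real Madrid",
    ]

    counts = Counter(training_sample)
    total = sum(counts.values())

    # Empirical estimate: P_hat(x) = count(x) / total. It sums to 1 over the
    # sentences seen in training and assigns 0 to everything else -- exactly
    # the sparseness problem the rest of the lecture addresses.
    P_hat = {sentence: c / total for sentence, c in counts.items()}

    for sentence, p in P_hat.items():
        print(f"P_hat({sentence!r}) = {p:.3f}")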

Why on earth would we want to do this?!
Speech recognition was the original motivation. (Related problems are optical character recognition, handwriting recognition.)
The estimation techniques developed for this problem will be VERY useful for other problems in NLP.

Deriving a Trigram Probability Model
Step 1: Expand using the chain rule:
  P(w_1, w_2, ..., w_n) = P(w_1 | START)
                          × P(w_2 | START, w_1)
                          × P(w_3 | START, w_1, w_2)
                          × P(w_4 | START, w_1, w_2, w_3)
                          × ...
                          × P(w_n | START, w_1, w_2, ..., w_{n-1})
                          × P(STOP | START, w_1, w_2, ..., w_{n-1}, w_n)
For example:
  P(the, dog, laughs) = P(the | START) × P(dog | START, the) × P(laughs | START, the, dog) × P(STOP | START, the, dog, laughs)

Deriving a Trigram Probability Model
Step 2: Make Markov independence assumptions:
  P(w_1, w_2, ..., w_n) = P(w_1 | START)
                          × P(w_2 | START, w_1)
                          × P(w_3 | w_1, w_2)
                          × ...
                          × P(w_n | w_{n-2}, w_{n-1})
                          × P(STOP | w_{n-1}, w_n)
General assumption:
  P(w_i | START, w_1, w_2, ..., w_{i-2}, w_{i-1}) = P(w_i | w_{i-2}, w_{i-1})
For example:
  P(the, dog, laughs) = P(the | START) × P(dog | START, the) × P(laughs | the, dog) × P(STOP | dog, laughs)

The Trigram Estimation Problem
Remaining estimation problem:
  P(w_i | w_{i-2}, w_{i-1}),   for example   P(laughs | the, dog)
A natural estimate (the "maximum likelihood estimate"):
  P_ML(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})
  P_ML(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog)
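A minimal Python sketch of the maximum-likelihood trigram estimate above, assuming sentences are padded with two START symbols and a STOP symbol (a common convention matching the P(w_1 | START) and P(w_2 | START, w_1) terms); the toy corpus and function names are made up for illustration.

    from collections import defaultdict

    trigram_counts = defaultdict(int)   # Count(w_{i-2}, w_{i-1}, w_i)
    history_counts = defaultdict(int)   # Count(w_{i-2}, w_{i-1}) for trigram histories

    def add_sentence(words):
        # Two START pad symbols and a STOP symbol, so the model also predicts
        # sentence boundaries, as in the expansion above.
        padded = ["START", "START"] + list(words) + ["STOP"]
        for u, v, w in zip(padded, padded[1:], padded[2:]):
            trigram_counts[(u, v, w)] += 1
            history_counts[(u, v)] += 1

    def p_ml(w, u, v):
        # P_ML(w | u, v) = Count(u, v, w) / Count(u, v); left at 0.0 for unseen histories.
        if history_counts[(u, v)] == 0:
            return 0.0
        return trigram_counts[(u, v, w)] / history_counts[(u, v)]

    add_sentence("the dog laughs".split())
    add_sentence("the dog barks".split())
    print(p_ml("laughs", "the", "dog"))   # Count(the, dog, laughs) / Count(the, dog) = 1/2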

Evaluating a Language Model
We have some test data: n sentences S_1, S_2, S_3, ..., S_n.
We could look at the probability under our model, Π_{i=1}^{n} P(S_i), or more conveniently the log probability:
  log Π_{i=1}^{n} P(S_i) = Σ_{i=1}^{n} log P(S_i)
In fact the usual evaluation measure is perplexity:
  Perplexity = 2^{-x}   where   x = (1/W) Σ_{i=1}^{n} log_2 P(S_i)
and W is the total number of words in the test data.

Some Intuition about Perplexity
Say we have a vocabulary V of size N = |V|, and a model that predicts
  P(w) = 1/N   for all w ∈ V.
It is easy to calculate the perplexity in this case:
  Perplexity = 2^{-x}   where   x = log_2 (1/N)
so Perplexity = N.
Perplexity is a measure of effective "branching factor".

Some History
Shannon conducted experiments on the entropy of English, i.e., how good are people at the perplexity game?
  C. Shannon. Prediction and entropy of printed English. Bell Systems Technical Journal, 30:50-64, 1951.

Some History
Chomsky (in Syntactic Structures (1957)):
  "Second, the notion 'grammatical' cannot be identified with 'meaningful' or 'significant' in any semantic sense. Sentences (1) and (2) are equally nonsensical, but any speaker of English will recognize that only the former is grammatical.
    (1) Colorless green ideas sleep furiously.
    (2) Furiously sleep ideas green colorless.
  ... Third, the notion 'grammatical in English' cannot be identified in any way with the notion 'high order of statistical approximation to English'. It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not." (my emphasis)
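The perplexity computation is easy to write down; this Python sketch assumes we already have some function returning a model's probability for a whole sentence (here a hypothetical uniform model), and the final check reproduces the Perplexity = N intuition.

    import math

    def perplexity(sentences, sentence_prob, total_words):
        # x = (1/W) * sum_i log2 P(S_i);  Perplexity = 2^{-x}
        x = sum(math.log2(sentence_prob(s)) for s in sentences) / total_words
        return 2 ** (-x)

    # Sanity check: a model that assigns each of N words probability 1/N.
    N = 10_000
    sentences = [["w"] * 5, ["w"] * 7]                # dummy test "sentences"
    total_words = sum(len(s) for s in sentences)      # W = 12
    uniform_prob = lambda s: (1.0 / N) ** len(s)      # P(S) = (1/N)^{|S|}
    print(perplexity(sentences, uniform_prob, total_words))   # ~ 10000.0 = N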

Sparse Data Problems
A natural estimate (the "maximum likelihood estimate"):
  P_ML(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})
  P_ML(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog)
Say our vocabulary size is N = |V|; then there are N^3 parameters in the model.
  e.g., N = 20,000  ⇒  20,000^3 = 8 × 10^12 parameters

The Bias-Variance Trade-Off
(Unsmoothed) trigram estimate:
  P_ML(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})
(Unsmoothed) bigram estimate:
  P_ML(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})
(Unsmoothed) unigram estimate:
  P_ML(w_i) = Count(w_i) / Count()
  (where Count() is the total number of words seen in the training data)
How close are these different estimates to the "true" probability P(w_i | w_{i-2}, w_{i-1})?

Linear Interpolation
Take our estimate P̂(w_i | w_{i-2}, w_{i-1}) to be
  P̂(w_i | w_{i-2}, w_{i-1}) = λ_1 P_ML(w_i | w_{i-2}, w_{i-1}) + λ_2 P_ML(w_i | w_{i-1}) + λ_3 P_ML(w_i)
where λ_1 + λ_2 + λ_3 = 1, and λ_i ≥ 0 for all i.

Our estimate correctly defines a distribution:
  Σ_{w ∈ V} P̂(w | w_{i-2}, w_{i-1})
    = Σ_{w ∈ V} [λ_1 P_ML(w | w_{i-2}, w_{i-1}) + λ_2 P_ML(w | w_{i-1}) + λ_3 P_ML(w)]
    = λ_1 Σ_w P_ML(w | w_{i-2}, w_{i-1}) + λ_2 Σ_w P_ML(w | w_{i-1}) + λ_3 Σ_w P_ML(w)
    = λ_1 + λ_2 + λ_3
    = 1
(We can also show that P̂(w | w_{i-2}, w_{i-1}) ≥ 0 for all w ∈ V.)
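A rough Python sketch of the linearly interpolated estimate, assuming p_ml_trigram, p_ml_bigram and p_ml_unigram are maximum-likelihood estimators like the ones above; the particular λ values are placeholders, not values from the lecture.

    def interpolated_prob(w, u, v, p_ml_trigram, p_ml_bigram, p_ml_unigram,
                          lambdas=(0.6, 0.3, 0.1)):
        """P_hat(w | u, v) = l1*P_ML(w|u,v) + l2*P_ML(w|v) + l3*P_ML(w)."""
        l1, l2, l3 = lambdas
        assert abs(l1 + l2 + l3 - 1.0) < 1e-9 and min(lambdas) >= 0.0
        return (l1 * p_ml_trigram(w, u, v)
                + l2 * p_ml_bigram(w, v)
                + l3 * p_ml_unigram(w))

    # Because the lambdas are non-negative and sum to 1, summing this estimate
    # over all w in the vocabulary gives l1 + l2 + l3 = 1, i.e. a valid distribution.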

How to estimate the λ values?
Hold out part of the training set as "validation" data.
Define Count_2(w_1, w_2, w_3) to be the number of times the trigram (w_1, w_2, w_3) is seen in the validation set.
Choose λ_1, λ_2, λ_3 to maximize
  L(λ_1, λ_2, λ_3) = Σ_{w_1, w_2, w_3 ∈ V} Count_2(w_1, w_2, w_3) log P̂(w_3 | w_1, w_2)
such that λ_1 + λ_2 + λ_3 = 1, λ_i ≥ 0 for all i, and where
  P̂(w_i | w_{i-2}, w_{i-1}) = λ_1 P_ML(w_i | w_{i-2}, w_{i-1}) + λ_2 P_ML(w_i | w_{i-1}) + λ_3 P_ML(w_i)

An Iterative Method (Based on the EM Algorithm)
Initialization: Pick arbitrary/random values for λ_1, λ_2, λ_3.
Step 1: Calculate the following quantities:
  c_1 = Σ_{w_1,w_2,w_3 ∈ V} Count_2(w_1, w_2, w_3) λ_1 P_ML(w_3 | w_1, w_2) / [λ_1 P_ML(w_3 | w_1, w_2) + λ_2 P_ML(w_3 | w_2) + λ_3 P_ML(w_3)]
  c_2 = Σ_{w_1,w_2,w_3 ∈ V} Count_2(w_1, w_2, w_3) λ_2 P_ML(w_3 | w_2) / [λ_1 P_ML(w_3 | w_1, w_2) + λ_2 P_ML(w_3 | w_2) + λ_3 P_ML(w_3)]
  c_3 = Σ_{w_1,w_2,w_3 ∈ V} Count_2(w_1, w_2, w_3) λ_3 P_ML(w_3) / [λ_1 P_ML(w_3 | w_1, w_2) + λ_2 P_ML(w_3 | w_2) + λ_3 P_ML(w_3)]
Step 2: Re-estimate the λ_i's as
  λ_1 = c_1 / (c_1 + c_2 + c_3),   λ_2 = c_2 / (c_1 + c_2 + c_3),   λ_3 = c_3 / (c_1 + c_2 + c_3)
Step 3: If the λ_i's have not converged, go to Step 1.

Allowing the λ's to vary
Take a function Φ that partitions histories, e.g.,
  Φ(w_{i-2}, w_{i-1}) = 1   if Count(w_{i-2}, w_{i-1}) = 0
                        2   if 1 ≤ Count(w_{i-2}, w_{i-1}) ≤ 2
                        3   if 3 ≤ Count(w_{i-2}, w_{i-1}) ≤ 5
                        4   otherwise
Introduce a dependence of the λ's on the partition:
  P̂(w_i | w_{i-2}, w_{i-1}) = λ_1^{Φ(w_{i-2},w_{i-1})} P_ML(w_i | w_{i-2}, w_{i-1}) + λ_2^{Φ(w_{i-2},w_{i-1})} P_ML(w_i | w_{i-1}) + λ_3^{Φ(w_{i-2},w_{i-1})} P_ML(w_i)
where λ_1^{Φ(w_{i-2},w_{i-1})} + λ_2^{Φ(w_{i-2},w_{i-1})} + λ_3^{Φ(w_{i-2},w_{i-1})} = 1, and λ_i^{Φ(w_{i-2},w_{i-1})} ≥ 0 for all i.

Our estimate correctly defines a distribution:
  Σ_{w ∈ V} P̂(w | w_{i-2}, w_{i-1})
    = Σ_{w ∈ V} [λ_1^{Φ(w_{i-2},w_{i-1})} P_ML(w | w_{i-2}, w_{i-1}) + λ_2^{Φ(w_{i-2},w_{i-1})} P_ML(w | w_{i-1}) + λ_3^{Φ(w_{i-2},w_{i-1})} P_ML(w)]
    = λ_1^{Φ(w_{i-2},w_{i-1})} Σ_w P_ML(w | w_{i-2}, w_{i-1}) + λ_2^{Φ(w_{i-2},w_{i-1})} Σ_w P_ML(w | w_{i-1}) + λ_3^{Φ(w_{i-2},w_{i-1})} Σ_w P_ML(w)
    = λ_1^{Φ(w_{i-2},w_{i-1})} + λ_2^{Φ(w_{i-2},w_{i-1})} + λ_3^{Φ(w_{i-2},w_{i-1})}
    = 1
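A rough Python sketch of the iterative (EM-style) updates above, assuming the validation trigram counts Count_2 are given as a dictionary and the three maximum-likelihood estimators are available as functions; initialization, iteration count, and names are illustrative choices.

    def estimate_lambdas(count2, p_tri, p_bi, p_uni, iters=50):
        """EM-style updates for (l1, l2, l3) on validation counts count2[(w1, w2, w3)]."""
        l1, l2, l3 = 1/3, 1/3, 1/3          # arbitrary initialization
        for _ in range(iters):
            c1 = c2 = c3 = 0.0
            for (w1, w2, w3), c in count2.items():
                # Expected "responsibility" of each mixture component for this trigram.
                t1 = l1 * p_tri(w3, w1, w2)
                t2 = l2 * p_bi(w3, w2)
                t3 = l3 * p_uni(w3)
                z = t1 + t2 + t3
                if z == 0.0:
                    continue
                c1 += c * t1 / z
                c2 += c * t2 / z
                c3 += c * t3 / z
            total = c1 + c2 + c3
            l1, l2, l3 = c1 / total, c2 / total, c3 / total
        return l1, l2, l3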

An Alternative Definition of the λ's
A small change: take our estimate P̂(w_i | w_{i-2}, w_{i-1}) to be
  P̂(w_i | w_{i-2}, w_{i-1}) = λ_1 P_ML(w_i | w_{i-2}, w_{i-1}) + (1 - λ_1) [λ_2 P_ML(w_i | w_{i-1}) + (1 - λ_2) P_ML(w_i)]
where 0 ≤ λ_1 ≤ 1, and 0 ≤ λ_2 ≤ 1.
Next, define
  λ_1 = Count(w_{i-2}, w_{i-1}) / (α + Count(w_{i-2}, w_{i-1}))
  λ_2 = Count(w_{i-1}) / (α + Count(w_{i-1}))
where α is a parameter chosen to optimize the probability of a development set.

An Alternative Definition of the λ's (continued)
Define
  U(w_{i-2}, w_{i-1}) = |{w : Count(w_{i-2}, w_{i-1}, w) > 0}|
  U(w_{i-1}) = |{w : Count(w_{i-1}, w) > 0}|
Next, define
  λ_1 = Count(w_{i-2}, w_{i-1}) / (α U(w_{i-2}, w_{i-1}) + Count(w_{i-2}, w_{i-1}))
  λ_2 = Count(w_{i-1}) / (α U(w_{i-1}) + Count(w_{i-1}))
where α is a parameter chosen to optimize the probability of a development set.

Discounting Methods
Say we've seen the following counts:

  x                Count(x)    P_ML(w_i | w_{i-1})
  the              48
  the, dog         15          15/48
  the, woman       11          11/48
  the, man         10          10/48
  the, park         5           5/48
  the, job          2           2/48
  the, telescope    1           1/48
  the, manual       1           1/48
  the, afternoon    1           1/48
  the, country      1           1/48
  the, street       1           1/48

The maximum-likelihood estimates are high (particularly for low count items).

Discounting Methods
Now define "discounted" counts, for example (a first, simple definition):
  Count*(x) = Count(x) - 0.5
New estimates:

  x                Count(x)    Count*(x)    Count*(x)/Count(the)
  the              48
  the, dog         15          14.5         14.5/48
  the, woman       11          10.5         10.5/48
  the, man         10           9.5          9.5/48
  the, park         5           4.5          4.5/48
  the, job          2           1.5          1.5/48
  the, telescope    1           0.5          0.5/48
  the, manual       1           0.5          0.5/48
  the, afternoon    1           0.5          0.5/48
  the, country      1           0.5          0.5/48
  the, street       1           0.5          0.5/48
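A small Python sketch of the discounting example above: subtract 0.5 from each seen bigram count and recompute the estimates, then add up the leftover ("missing") mass, which is defined formally just below. The counts are the ones from the table; everything else is illustrative.

    bigram_counts = {                       # Count(the, w) from the example table
        "dog": 15, "woman": 11, "man": 10, "park": 5, "job": 2,
        "telescope": 1, "manual": 1, "afternoon": 1, "country": 1, "street": 1,
    }
    count_the = sum(bigram_counts.values())   # Count(the) = 48
    discount = 0.5

    for w, c in bigram_counts.items():
        c_star = c - discount                 # Count*(the, w) = Count(the, w) - 0.5
        print(f"the, {w}: {c}  ->  {c_star}  ({c_star}/{count_the})")

    # Leftover mass: alpha(the) = 1 - sum_w Count*(the, w) / Count(the)
    alpha_the = 1 - sum(c - discount for c in bigram_counts.values()) / count_the
    print(alpha_the)   # 10 * 0.5 / 48 = 5/48 ~ 0.104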

We now have some "missing probability mass":
  α(w_{i-1}) = 1 - Σ_w Count*(w_{i-1}, w) / Count(w_{i-1})
e.g., in our example, α(the) = 10 × 0.5 / 48 = 5/48.
Divide the remaining probability mass between words w for which Count(w_{i-1}, w) = 0.

Katz Back-Off Models (Bigrams)
For a bigram model, define two sets
  A(w_{i-1}) = {w : Count(w_{i-1}, w) > 0}
  B(w_{i-1}) = {w : Count(w_{i-1}, w) = 0}
A bigram model:
  P_KATZ(w_i | w_{i-1}) =
    Count*(w_{i-1}, w_i) / Count(w_{i-1})                        if w_i ∈ A(w_{i-1})
    α(w_{i-1}) P_ML(w_i) / Σ_{w ∈ B(w_{i-1})} P_ML(w)            if w_i ∈ B(w_{i-1})
where
  α(w_{i-1}) = 1 - Σ_{w ∈ A(w_{i-1})} Count*(w_{i-1}, w) / Count(w_{i-1})

Katz Back-Off Models (Trigrams)
For a trigram model, first define two sets
  A(w_{i-2}, w_{i-1}) = {w : Count(w_{i-2}, w_{i-1}, w) > 0}
  B(w_{i-2}, w_{i-1}) = {w : Count(w_{i-2}, w_{i-1}, w) = 0}
A trigram model is defined in terms of the bigram model:
  P_KATZ(w_i | w_{i-2}, w_{i-1}) =
    Count*(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})                                         if w_i ∈ A(w_{i-2}, w_{i-1})
    α(w_{i-2}, w_{i-1}) P_KATZ(w_i | w_{i-1}) / Σ_{w ∈ B(w_{i-2}, w_{i-1})} P_KATZ(w | w_{i-1})     if w_i ∈ B(w_{i-2}, w_{i-1})
where
  α(w_{i-2}, w_{i-1}) = 1 - Σ_{w ∈ A(w_{i-2}, w_{i-1})} Count*(w_{i-2}, w_{i-1}, w) / Count(w_{i-2}, w_{i-1})

Good-Turing Discounting
Invented during WWII by Alan Turing (and Good?), and later published by Good. Frequency estimates were needed within the Enigma code-breaking effort.
Define n_r = number of elements x for which Count(x) = r.
Modified count for any x with Count(x) = r and r > 0:
  (r + 1) n_{r+1} / n_r
This leads to the following estimate of the "missing mass":
  n_1 / N
where N is the size of the sample. This is the estimate of the probability of seeing a new element x on the (N + 1)'th draw.
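Going back to the bigram Katz definition above, here is a rough Python sketch of P_KATZ(w | v), assuming bigram and unigram counts are available as dictionaries and using the simple fixed 0.5 discount; the tiny counts at the end are made up for illustration.

    def p_katz_bigram(w, v, bigram_counts, unigram_counts, discount=0.5):
        """Sketch of P_KATZ(w | v) with a fixed 0.5 discount on seen bigrams."""
        count_v = sum(c for (u, _x), c in bigram_counts.items() if u == v)   # Count(v)
        seen = {x for (u, x) in bigram_counts if u == v}                     # A(v)
        total = sum(unigram_counts.values())
        p_ml = lambda x: unigram_counts.get(x, 0) / total                    # P_ML(x)

        if w in seen:
            # Discounted estimate for words seen after v: Count*(v, w) / Count(v).
            return (bigram_counts[(v, w)] - discount) / count_v

        # Missing mass alpha(v), redistributed over B(v) in proportion to P_ML.
        alpha_v = 1.0 - sum(bigram_counts[(v, x)] - discount for x in seen) / count_v
        unseen_mass = sum(p_ml(x) for x in unigram_counts if x not in seen)
        return alpha_v * p_ml(w) / unseen_mass

    # Tiny illustration: "dog" was seen after "the", "zebra" was not.
    bigrams = {("the", "dog"): 15, ("the", "woman"): 11}
    unigrams = {"the": 48, "dog": 20, "woman": 15, "zebra": 1}
    print(p_katz_bigram("dog", "the", bigrams, unigrams))    # (15 - 0.5) / 26
    print(p_katz_bigram("zebra", "the", bigrams, unigrams))  # alpha(the) * P_ML(zebra) / unseen mass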

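The Good-Turing quantities are also straightforward to compute from a bag of counts; this Python sketch uses a made-up sample to show the modified counts (r + 1) n_{r+1} / n_r and the missing-mass estimate n_1 / N. (For the largest observed r, n_{r+1} = 0, so practical implementations smooth the n_r values first.)

    from collections import Counter

    sample = "the cat sat on the mat the dog sat".split()   # toy sample, size N = 9
    counts = Counter(sample)                                 # Count(x) for each type x
    n = Counter(counts.values())                             # n_r = number of types with count r
    N = len(sample)

    def good_turing_count(r):
        # Modified count for an item seen r > 0 times: (r + 1) * n_{r+1} / n_r.
        return (r + 1) * n[r + 1] / n[r] if n[r] else 0.0

    for x, r in counts.items():
        print(x, r, round(good_turing_count(r), 3))

    print("missing mass n_1 / N =", n[1] / N)   # estimated probability of an unseen type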
Summary
Three steps in deriving the language model probabilities:
  1. Expand P(w_1, w_2, ..., w_n) using the chain rule.
  2. Make Markov independence assumptions:
       P(w_i | w_1, w_2, ..., w_{i-2}, w_{i-1}) = P(w_i | w_{i-2}, w_{i-1})
  3. Smooth the estimates using low-order counts.
Other methods used to improve language models:
  Topic or long-range features.
  Syntactic models.
It's generally hard to improve on trigram models, though!!

Further Reading
See:
  "An Empirical Study of Smoothing Techniques for Language Modeling". Stanley Chen and Joshua Goodman. 1998. Harvard Computer Science Technical Report TR-10-98. (Gives a very thorough evaluation and description of a number of methods.)
  "On the Convergence Rate of Good-Turing Estimators". David McAllester and Robert E. Schapire. In Proceedings of COLT 2000. (A pretty technical paper, giving confidence intervals on Good-Turing estimators. Theorems 1, 3 and 9 are useful in understanding the motivation for Good-Turing discounting.)