Language Models. Philipp Koehn. 11 September 2018



Language Models 1
Language models answer the question: how likely is it that a string of English words is good English?
Help with reordering: p_LM(the house is small) > p_LM(small the is house)
Help with word choice: p_LM(I am going home) > p_LM(I am going house)

N-Gram Language Models 2
Given: a string of English words W = w_1, w_2, w_3, ..., w_n
Question: what is p(W)?
Sparse data: many good English sentences will not have been seen before
Decomposing p(W) using the chain rule:
p(w_1, w_2, w_3, ..., w_n) = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) ... p(w_n | w_1, w_2, ..., w_{n-1})
(not much gained yet: p(w_n | w_1, w_2, ..., w_{n-1}) is just as sparse)

Markov Chain 3
Markov assumption: only the previous history matters
Limited memory: only the last k words are included in the history (older words are less relevant)
kth-order Markov model
For instance, a 2-gram language model:
p(w_1, w_2, w_3, ..., w_n) ≈ p(w_1) p(w_2 | w_1) p(w_3 | w_2) ... p(w_n | w_{n-1})
What is conditioned on, here w_{i-1}, is called the history

Estimating N-Gram Probabilities 4
Maximum likelihood estimation: collect counts over a large text corpus
p(w_2 | w_1) = count(w_1, w_2) / count(w_1)
Millions to billions of words are easy to get (trillions of English words are available on the web)
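As a minimal sketch (my own illustration, not from the slides), the maximum likelihood estimate above can be computed directly from corpus counts; the toy corpus and the sentence markers are assumptions for the example.

```python
from collections import defaultdict

def train_bigram_mle(sentences):
    """Estimate p(w2|w1) = count(w1, w2) / count(w1) from tokenized sentences."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            unigram[w1] += 1            # count of the history word
            bigram[(w1, w2)] += 1       # count of the bigram
    return {pair: c / unigram[pair[0]] for pair, c in bigram.items()}

# Toy corpus (illustrative only)
corpus = [["the", "house", "is", "small"], ["the", "house", "is", "big"]]
p = train_bigram_mle(corpus)
print(p[("the", "house")])   # 1.0
print(p[("is", "small")])    # 0.5
```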

Example: 3-Gram 5
Counts for trigrams and estimated word probabilities

the green (total: 1748)
word     count  prob.
paper    801    0.458
group    640    0.367
light    110    0.063
party    27     0.015
ecu      21     0.012

the red (total: 225)
word     count  prob.
cross    123    0.547
tape     31     0.138
army     9      0.040
card     7      0.031
,        5      0.022

the blue (total: 54)
word     count  prob.
box      16     0.296
.        6      0.111
flag     6      0.111
,        3      0.056
angel    3      0.056

225 trigrams in the Europarl corpus start with "the red"; 123 of them end with "cross", so the maximum likelihood probability is 123/225 = 0.547.

How good is the LM? 6
A good model assigns a text of real English W a high probability
This can also be measured with cross-entropy:
H(W) = -1/n log_2 p(w_1, ..., w_n)
Or with perplexity:
perplexity(W) = 2^{H(W)}
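A small illustrative calculation (my own sketch, not from the slides): cross-entropy is the average negative log2 probability per predicted token, and perplexity is 2 raised to that value. The probabilities below are the rounded per-word values from the next slide.

```python
import math

def cross_entropy(word_probs):
    """Average negative log2 probability per predicted token."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

def perplexity(word_probs):
    return 2 ** cross_entropy(word_probs)

probs = [0.109, 0.144, 0.489, 0.905, 0.002, 0.472,
         0.147, 0.056, 0.194, 0.089, 0.290, 0.99999]
print(round(cross_entropy(probs), 3))  # ≈ 2.65 (slide reports 2.634, from unrounded probabilities)
print(round(perplexity(probs), 3))     # ≈ 6.3  (slide reports 6.206)
```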

Example: 3-Gram 7
prediction                      p_LM      -log2 p_LM
p_LM(i | </s> <s>)              0.109     3.197
p_LM(would | <s> i)             0.144     2.791
p_LM(like | i would)            0.489     1.031
p_LM(to | would like)           0.905     0.144
p_LM(commend | like to)         0.002     8.794
p_LM(the | to commend)          0.472     1.084
p_LM(rapporteur | commend the)  0.147     2.763
p_LM(on | the rapporteur)       0.056     4.150
p_LM(his | rapporteur on)       0.194     2.367
p_LM(work | on his)             0.089     3.498
p_LM(. | his work)              0.290     1.785
p_LM(</s> | work .)             0.99999   0.000014
average                                   2.634

Comparison 1-4-Gram 8
(-log2 probabilities for each word under unigram, bigram, trigram, and 4-gram models)

word         unigram   bigram   trigram   4-gram
i            6.684     3.197    3.197     3.197
would        8.342     2.884    2.791     2.791
like         9.129     2.026    1.031     1.290
to           5.081     0.402    0.144     0.113
commend      15.487    12.335   8.794     8.633
the          3.885     1.402    1.084     0.880
rapporteur   10.840    7.319    2.763     2.350
on           6.765     4.140    4.150     1.862
his          10.678    7.316    2.367     1.978
work         9.993     4.816    3.498     2.394
.            4.896     3.020    1.785     1.510
</s>         4.828     0.005    0.000     0.000
average      8.051     4.072    2.634     2.251
perplexity   265.136   16.817   6.206     4.758

Count Smoothing 9

Unseen N-Grams 10
We have seen "i like to" in our corpus
We have never seen "i like to smooth" in our corpus
p(smooth | i like to) = 0
Any sentence that includes "i like to smooth" will be assigned probability 0

Add-One Smoothing 11
For all possible n-grams, add a count of one.
p = (c + 1) / (n + v)
where c = count of the n-gram in the corpus, n = count of the history, v = vocabulary size
But there are many more unseen n-grams than seen n-grams
Example: Europarl bigrams
86,700 distinct words
86,700^2 = 7,516,890,000 possible bigrams
but only about 30,000,000 words (and bigrams) in the corpus

Add-α Smoothing 12
Add α < 1 to each count
p = (c + α) / (n + αv)
What is a good value for α? It could be optimized on a held-out set
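A minimal sketch of additive smoothing (my own illustration, not part of the slides): with α = 1 this reduces to add-one smoothing, and a simple grid search over a held-out set can pick α; the count tables and candidate values are assumptions.

```python
import math

def add_alpha_prob(bigram, bigram_counts, history_counts, vocab_size, alpha):
    """p(w2|w1) = (c(w1,w2) + alpha) / (c(w1) + alpha * v); alpha=1 gives add-one."""
    c = bigram_counts.get(bigram, 0)
    n = history_counts.get(bigram[0], 0)
    return (c + alpha) / (n + alpha * vocab_size)

def heldout_perplexity(heldout_bigrams, bigram_counts, history_counts, vocab_size, alpha):
    log_sum = sum(math.log2(add_alpha_prob(b, bigram_counts, history_counts,
                                           vocab_size, alpha))
                  for b in heldout_bigrams)
    return 2 ** (-log_sum / len(heldout_bigrams))

def tune_alpha(heldout_bigrams, bigram_counts, history_counts, vocab_size,
               candidates=(1.0, 0.1, 0.01, 0.001, 0.0001)):
    # Pick the alpha that minimizes held-out perplexity.
    return min(candidates,
               key=lambda a: heldout_perplexity(heldout_bigrams, bigram_counts,
                                                history_counts, vocab_size, a))
```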

What is the Right Count? 13
Example: the 2-gram "red circle" occurs in a 30-million-word corpus exactly once
Maximum likelihood estimation tells us that its probability is 1/30,000,000
... but we would expect it to occur less often than that
Question: how often does a 2-gram that occurs once in a 30,000,000-word corpus occur in the wild?
Let's find out:
get the set of all 2-grams that occur once (red circle, funny elephant, ...)
record the size of this set: N_1
get another 30,000,000-word corpus
for each 2-gram in the set: count how often it occurs in the new corpus (many occur never, some once, fewer twice, even fewer 3 times, ...)
sum up all these counts (0 + 0 + 1 + 0 + 2 + 1 + 0 + ...)
divide by N_1: that is our test count t_c
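The procedure above translates directly into code; this is an illustrative sketch (the token lists and the choice of bigrams are assumptions, not from the slides).

```python
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def test_count(train_tokens, new_tokens, r=1):
    """Empirical 'test count' of bigrams that occur exactly r times in training."""
    train_counts = Counter(bigrams(train_tokens))
    once = [bg for bg, c in train_counts.items() if c == r]   # the set, of size N_r
    new_counts = Counter(bigrams(new_tokens))
    total = sum(new_counts[bg] for bg in once)                 # 0 + 0 + 1 + 0 + 2 + ...
    return total / len(once) if once else 0.0                  # divide by N_r
```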

Example: 2-Grams in Europarl 14
Count   Adjusted count       Adjusted count        Test count
c       (c+1) n/(n+v^2)      (c+α) n/(n+αv^2)      t_c
0       0.00378              0.00016               0.00016
1       0.00755              0.95725               0.46235
2       0.01133              1.91433               1.39946
3       0.01511              2.87141               2.34307
4       0.01888              3.82850               3.35202
5       0.02266              4.78558               4.35234
6       0.02644              5.74266               5.33762
8       0.03399              7.65683               7.15074
10      0.04155              9.57100               9.11927
20      0.07931              19.14183              18.95948
Add-α smoothing with α = 0.00017

Deleted Estimation 15
Estimate true counts in held-out data
split the corpus in two halves: training and held-out
counts in training: C_t(w_1, ..., w_n)
number of n-grams with training count r: N_r
total times n-grams of training count r are seen in held-out data: T_r
Held-out estimator:
p_h(w_1, ..., w_n) = T_r / (N_r N)   where count(w_1, ..., w_n) = r
Both halves can be switched and the results combined:
p_h(w_1, ..., w_n) = (T_r^1 + T_r^2) / (N (N_r^1 + N_r^2))   where count(w_1, ..., w_n) = r
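An illustrative sketch of the held-out estimator with both halves switched and combined (my own code; the bigram order, tokenization, and the use of the held-out half's size for N are assumptions).

```python
from collections import Counter

def bigram_counts(tokens):
    return Counter(zip(tokens, tokens[1:]))

def deleted_estimation(half1, half2):
    """Map each training count r to the probability assigned to any bigram with that count."""
    c1, c2 = bigram_counts(half1), bigram_counts(half2)
    N = sum(c2.values())  # number of bigram tokens in a half (halves assumed equal-sized)

    def tallies(train, heldout):
        N_r, T_r = Counter(), Counter()
        for bg, r in train.items():
            N_r[r] += 1                   # how many bigrams have training count r
            T_r[r] += heldout.get(bg, 0)  # how often they occur in the held-out half
        return N_r, T_r

    N1, T1 = tallies(c1, c2)  # train on half 1, hold out half 2
    N2, T2 = tallies(c2, c1)  # and the reverse
    return {r: (T1[r] + T2[r]) / (N * (N1[r] + N2[r]))
            for r in set(N1) | set(N2)}
```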

Good-Turing Smoothing 16
Adjust actual counts r to expected counts r* with the formula
r* = (r + 1) N_{r+1} / N_r
N_r: number of n-grams that occur exactly r times in the corpus
N_0: number of n-grams that occur zero times (all possible n-grams minus those observed)
Where does this formula come from? The derivation is in the textbook.
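A minimal sketch of the count adjustment (my own illustration): compute the counts-of-counts N_r and apply r* = (r+1) N_{r+1} / N_r; the fallback for missing N_{r+1} is an assumption, not part of the slides.

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts, vocab_size, order=2):
    """Map each observed count r (and r=0) to its Good-Turing adjusted count r*."""
    counts_of_counts = Counter(ngram_counts.values())               # N_r for r >= 1
    counts_of_counts[0] = vocab_size ** order - len(ngram_counts)   # unseen n-grams
    adjusted = {}
    for r, N_r in counts_of_counts.items():
        N_r_plus_1 = counts_of_counts.get(r + 1, 0)
        if N_r_plus_1 > 0:
            adjusted[r] = (r + 1) * N_r_plus_1 / N_r
        else:
            adjusted[r] = float(r)  # no data to adjust with: keep the raw count (common fallback)
    return adjusted
```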

Good-Turing for 2-Grams in Europarl 17
Count   Count of counts   Adjusted count   Test count
r       N_r               r*               t
0       7,514,941,065     0.00015          0.00016
1       1,132,844         0.46539          0.46235
2       263,611           1.40679          1.39946
3       123,615           2.38767          2.34307
4       73,788            3.33753          3.35202
5       49,254            4.36967          4.35234
6       35,869            5.32928          5.33762
8       21,693            7.43798          7.15074
10      14,880            9.31304          9.11927
20      4,546             19.54487         18.95948
The adjusted count is fairly accurate when compared against the test count

Backoff and Interpolation 18

Back-Off 19
In a given corpus, we may never observe
  Scottish beer drinkers
  Scottish beer eaters
Both have count 0: our smoothing methods will assign them the same probability
Better: back off to bigrams
  beer drinkers
  beer eaters

Interpolation 20
Higher- and lower-order n-gram models have different strengths and weaknesses
high-order n-grams are sensitive to more context, but have sparse counts
low-order n-grams consider only very limited context, but have robust counts
Combine them:
p_I(w_3 | w_1, w_2) = λ_1 p_1(w_3) + λ_2 p_2(w_3 | w_2) + λ_3 p_3(w_3 | w_1, w_2)
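A small sketch of fixed-weight interpolation (my own illustration; the component models and the example weights are assumed inputs, and the weights should sum to one).

```python
def interpolate_trigram(w1, w2, w3, p1, p2, p3, lambdas=(0.1, 0.3, 0.6)):
    """p_I(w3 | w1, w2) as a weighted sum of unigram, bigram, and trigram estimates.

    p1, p2, p3 map n-gram tuples to probabilities; lambdas sum to 1.
    """
    l1, l2, l3 = lambdas
    return (l1 * p1.get((w3,), 0.0)
            + l2 * p2.get((w2, w3), 0.0)
            + l3 * p3.get((w1, w2, w3), 0.0))
```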

Recursive Interpolation 21
We can trust some histories w_{i-n+1}, ..., w_{i-1} more than others
Condition the interpolation weights on the history: λ_{w_{i-n+1}, ..., w_{i-1}}
Recursive definition of interpolation:
p_n^I(w_i | w_{i-n+1}, ..., w_{i-1}) = λ_{w_{i-n+1}, ..., w_{i-1}} p_n(w_i | w_{i-n+1}, ..., w_{i-1})
  + (1 - λ_{w_{i-n+1}, ..., w_{i-1}}) p_{n-1}^I(w_i | w_{i-n+2}, ..., w_{i-1})

Back-Off 22
Trust the highest-order language model that contains the n-gram
p_n^BO(w_i | w_{i-n+1}, ..., w_{i-1}) =
  α_n(w_i | w_{i-n+1}, ..., w_{i-1})                                        if count_n(w_{i-n+1}, ..., w_i) > 0
  d_n(w_{i-n+1}, ..., w_{i-1}) p_{n-1}^BO(w_i | w_{i-n+2}, ..., w_{i-1})    otherwise
Requires:
  an adjusted prediction model α_n(w_i | w_{i-n+1}, ..., w_{i-1})
  a discounting function d_n(w_1, ..., w_{n-1})
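The recursive case distinction can be sketched as follows (my own illustration; the α values, discount weights, and unigram table are assumed to be precomputed elsewhere).

```python
def p_backoff(word, history, alpha, d, unigram_alpha):
    """Back-off probability p_n^BO(word | history).

    alpha[(history, word)]: adjusted probability for seen n-grams
    d[history]:             discounting mass reserved for unseen continuations
    unigram_alpha[word]:    base case of the recursion (empty history)
    """
    if not history:
        return unigram_alpha.get(word, 0.0)
    if (history, word) in alpha:
        return alpha[(history, word)]
    # Unseen n-gram: back off to the shorter history, scaled by d(history).
    return d.get(history, 1.0) * p_backoff(word, history[1:], alpha, d, unigram_alpha)
```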

Back-Off with Good-Turing Smoothing 23
Previously, we computed n-gram probabilities based on relative frequency
p(w_2 | w_1) = count(w_1, w_2) / count(w_1)
Good-Turing smoothing adjusts counts c to expected counts c*
count*(w_1, w_2) <= count(w_1, w_2)
We use these expected counts for the prediction model (but 0 remains 0)
α(w_2 | w_1) = count*(w_1, w_2) / count(w_1)
This leaves probability mass for the discounting function
d_2(w_1) = 1 - Σ_{w_2} α(w_2 | w_1)

Example 24
Good-Turing discounting is used for all positive counts

count   prediction     p             GT count   α
3       p(big | a)     3/7 = 0.43    2.24       2.24/7 = 0.32
3       p(house | a)   3/7 = 0.43    2.24       2.24/7 = 0.32
1       p(new | a)     1/7 = 0.14    0.446      0.446/7 = 0.06

1 - (0.32 + 0.32 + 0.06) = 0.30 is left for the back-off weight d_2(a)
Note: the actual value of d_2 is slightly higher, since the lower-order model's predictions for events already seen at this level are not used.

Diversity of Predicted Words 25
Consider the bigram histories "spite" and "constant"
both occur 993 times in the Europarl corpus
only 9 different words follow "spite"
  almost always followed by "of" (979 times), due to the expression "in spite of"
415 different words follow "constant"
  most frequent: "and" (42 times), "concern" (27 times), "pressure" (26 times), but a huge tail of singletons: 268 different words
We are more likely to see a new bigram that starts with "constant" than one that starts with "spite"
Witten-Bell smoothing considers the diversity of predicted words

Witten-Bell Smoothing 26
Recursive interpolation method
Number of possible extensions of a history w_1, ..., w_{n-1} in the training data:
N_{1+}(w_1, ..., w_{n-1}, •) = |{w_n : c(w_1, ..., w_{n-1}, w_n) > 0}|
Lambda parameters:
1 - λ_{w_1, ..., w_{n-1}} = N_{1+}(w_1, ..., w_{n-1}, •) / (N_{1+}(w_1, ..., w_{n-1}, •) + Σ_{w_n} c(w_1, ..., w_{n-1}, w_n))
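A minimal sketch of the 1 - λ computation from history statistics (my own code; the counts come from an assumed n-gram count dictionary, and the individual continuation counts in the toy example are made up so that they sum to the slide's totals).

```python
from collections import defaultdict

def witten_bell_one_minus_lambda(ngram_counts):
    """For each history, compute 1 - λ = N1+(history, •) / (N1+(history, •) + Σ c)."""
    extensions = defaultdict(int)   # N1+(history, •): number of distinct continuations
    totals = defaultdict(int)       # Σ_w c(history, w)
    for ngram, c in ngram_counts.items():
        history = ngram[:-1]
        extensions[history] += 1
        totals[history] += c
    return {h: extensions[h] / (extensions[h] + totals[h]) for h in extensions}

# Toy reconstruction of the "spite" history from the next slide: "of" occurs 979 times,
# the other 8 continuation counts are invented but sum to 993 in total.
spite = {("spite", w): c for w, c in
         zip(["of", "w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8"],
             [979, 3, 2, 2, 2, 2, 1, 1, 1])}
print(round(witten_bell_one_minus_lambda(spite)[("spite",)], 5))  # 0.00898
```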

Witten-Bell Smoothing: Examples 27
Let us apply this to our two examples:
1 - λ_spite = N_{1+}(spite, •) / (N_{1+}(spite, •) + Σ_{w_n} c(spite, w_n)) = 9 / (9 + 993) = 0.00898
1 - λ_constant = N_{1+}(constant, •) / (N_{1+}(constant, •) + Σ_{w_n} c(constant, w_n)) = 415 / (415 + 993) = 0.29474

Diversity of Histories 28
Consider the word "York"
a fairly frequent word in the Europarl corpus, it occurs 477 times
as frequent as "foods", "indicates" and "providers"
in a unigram language model this gives it a respectable probability
However, it almost always directly follows "New" (473 times)
Recall: the unigram model is only used if the bigram model is inconclusive
"York" is an unlikely second word in an unseen bigram
in a back-off unigram model, "York" should have a low probability

Kneser-Ney Smoothing 29
Kneser-Ney smoothing takes the diversity of histories into account
Count of histories for a word:
N_{1+}(• w) = |{w_i : c(w_i, w) > 0}|
Recall: maximum likelihood estimation of a unigram language model
p_ML(w) = c(w) / Σ_i c(w_i)
In Kneser-Ney smoothing, raw counts are replaced with counts of histories:
p_KN(w) = N_{1+}(• w) / Σ_{w_i} N_{1+}(• w_i)
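A minimal sketch of the Kneser-Ney continuation probability for unigrams (my own illustration; the bigram counts are an assumed input).

```python
from collections import defaultdict

def kneser_ney_unigram(bigram_counts):
    """p_KN(w) = N1+(• w) / Σ_w' N1+(• w'), based on distinct left contexts."""
    histories = defaultdict(set)
    for (w1, w2), c in bigram_counts.items():
        if c > 0:
            histories[w2].add(w1)
    total = sum(len(h) for h in histories.values())
    return {w: len(h) / total for w, h in histories.items()}

# "York" follows almost only "New", so it has very few distinct histories and
# gets a low p_KN, even though its raw unigram count is respectable.
```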

Modified Kneser-Ney Smoothing 30
Based on interpolation
p_n^BO(w_i | w_{i-n+1}, ..., w_{i-1}) =
  α_n(w_i | w_{i-n+1}, ..., w_{i-1})                                        if count_n(w_{i-n+1}, ..., w_i) > 0
  d_n(w_{i-n+1}, ..., w_{i-1}) p_{n-1}^BO(w_i | w_{i-n+2}, ..., w_{i-1})    otherwise
Requires:
  an adjusted prediction model α_n(w_i | w_{i-n+1}, ..., w_{i-1})
  a discounting function d_n(w_1, ..., w_{n-1})

Formula for α for the Highest-Order N-Gram Model 31
Absolute discounting: subtract a fixed D from all non-zero counts
α(w_n | w_1, ..., w_{n-1}) = (c(w_1, ..., w_n) - D) / Σ_w c(w_1, ..., w_{n-1}, w)
Refinement: three different discount values
D(c) = D_1 if c = 1, D_2 if c = 2, D_3+ if c >= 3

Discount Parameters 32
Optimal discounting parameters D_1, D_2, D_3+ can be computed quite easily
Y = N_1 / (N_1 + 2 N_2)
D_1 = 1 - 2Y N_2 / N_1
D_2 = 2 - 3Y N_3 / N_2
D_3+ = 3 - 4Y N_4 / N_3
The values N_c are the counts of n-grams with exactly count c
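These closed-form estimates translate directly into code; a minimal sketch (my own), where counts_of_counts maps a count c to N_c, fed here with the Europarl bigram counts-of-counts from the Good-Turing slide.

```python
def mkn_discounts(counts_of_counts):
    """Compute the modified Kneser-Ney discounts D1, D2, D3+ from N1..N4."""
    N1, N2, N3, N4 = (counts_of_counts.get(c, 0) for c in (1, 2, 3, 4))
    Y = N1 / (N1 + 2 * N2)
    D1 = 1 - 2 * Y * N2 / N1
    D2 = 2 - 3 * Y * N3 / N2
    D3_plus = 3 - 4 * Y * N4 / N3
    return D1, D2, D3_plus

print(mkn_discounts({1: 1_132_844, 2: 263_611, 3: 123_615, 4: 73_788}))
# ≈ (0.68, 1.04, 1.37)
```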

Formula for d for the Highest-Order N-Gram Model 33
Probability mass set aside from seen events:
d(w_1, ..., w_{n-1}) = ( Σ_{i ∈ {1,2,3+}} D_i N_i(w_1, ..., w_{n-1}) ) / Σ_{w_n} c(w_1, ..., w_n)
The N_i for i ∈ {1, 2, 3+} are computed based on the count of extensions of the history w_1, ..., w_{n-1} with count 1, 2, and 3 or more, respectively.
Similar to Witten-Bell smoothing

Formula for α for Lower-Order N-Gram Models 34
Recall: based on the count of histories N_{1+}(• w) in which a word may appear, not on raw counts
α(w_n | w_1, ..., w_{n-1}) = (N_{1+}(• w_1, ..., w_n) - D) / Σ_w N_{1+}(• w_1, ..., w_{n-1}, w)
Again, three different values for D (D_1, D_2, D_3+), based on the count of the history w_1, ..., w_{n-1}

Formula for d for Lower-Order N-Gram Models 35
Probability mass set aside for the d function:
d(w_1, ..., w_{n-1}) = ( Σ_{i ∈ {1,2,3+}} D_i N_i(w_1, ..., w_{n-1}) ) / Σ_{w_n} c(w_1, ..., w_n)

Interpolated Back-Off 36
Back-off models use only the highest-order n-gram: if it is sparse, it is not very reliable
  two different n-grams with the same history that each occur once get the same probability
  one may be an outlier, the other under-represented in training
To remedy this, always consider the lower-order back-off models
Adapting the α function into an interpolated α_I function by adding back-off:
α_I(w_n | w_1, ..., w_{n-1}) = α(w_n | w_1, ..., w_{n-1}) + d(w_1, ..., w_{n-1}) p_I(w_n | w_2, ..., w_{n-1})
Note that the d function needs to be adapted as well

Evaluation 37
Evaluation of smoothing methods: perplexity for language models trained on the Europarl corpus

Smoothing method                   bigram   trigram   4-gram
Good-Turing                        96.2     62.9      59.9
Witten-Bell                        97.1     63.8      60.4
Modified Kneser-Ney                95.4     61.6      58.6
Interpolated Modified Kneser-Ney   94.5     59.3      54.0

Efficiency 38

Managing the Size of the Model 39
Millions to billions of words are easy to get (trillions of English words are available on the web)
But: huge language models do not fit into RAM

Number of Unique N-Grams 40
Number of unique n-grams in the Europarl corpus (29,501,088 tokens, words and punctuation)

Order     Unique n-grams   Singletons
unigram   86,700           33,447 (38.6%)
bigram    1,948,935        1,132,844 (58.1%)
trigram   8,092,798        6,022,286 (74.4%)
4-gram    15,303,847       13,081,621 (85.5%)
5-gram    19,882,175       18,324,577 (92.2%)

Remove singletons of higher-order n-grams

Efficient Data Structures 41
[Figure: trie storage of the language model, sharing common history prefixes across 1-gram to 4-gram entries; nodes store word probabilities (e.g., number p: -0.275) and back-off weights (e.g., very large boff: -0.106).]
Need to store probabilities for
  the very large majority
  the very large number
Both share the history "the very large": no need to store the history twice
Use a trie
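As an illustration of the idea (my own sketch, not the data structure of any particular LM toolkit), a trie can share common history prefixes and store a probability and back-off weight at each node; the two log probabilities below are taken from the figure.

```python
class TrieNode:
    """One node per history prefix; children are keyed by the next word."""
    __slots__ = ("children", "logprob", "backoff")

    def __init__(self):
        self.children = {}
        self.logprob = None   # log probability of the n-gram ending at this node
        self.backoff = 0.0    # back-off weight of this history

class NgramTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, ngram, logprob, backoff=0.0):
        node = self.root
        for word in ngram:                 # shared prefixes are stored only once
            node = node.children.setdefault(word, TrieNode())
        node.logprob, node.backoff = logprob, backoff

    def lookup(self, ngram):
        node = self.root
        for word in ngram:
            node = node.children.get(word)
            if node is None:
                return None
        return node.logprob

# "the very large majority" and "the very large number" share the path
# the -> very -> large, so that history is stored once.
trie = NgramTrie()
trie.add(("the", "very", "large", "majority"), -1.147)
trie.add(("the", "very", "large", "number"), -0.275)
print(trie.lookup(("the", "very", "large", "number")))  # -0.275
```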

Reducing Vocabulary Size 42
For instance: each number is treated as a separate token
Replace them with a number token NUM
but: we want our language model to prefer
  p_LM(I pay 950.00 in May 2007) > p_LM(I pay 2007 in May 950.00)
not possible with a number token:
  p_LM(I pay NUM in May NUM) = p_LM(I pay NUM in May NUM)
Replace each digit (with a unique symbol, e.g., @ or 5) to retain some distinctions:
  p_LM(I pay 555.55 in May 5555) > p_LM(I pay 5555 in May 555.55)
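A one-line version of the digit-collapsing trick (my own sketch): map every digit to the symbol "5" before counting n-grams.

```python
import re

def collapse_digits(text, symbol="5"):
    """Replace every digit with a single symbol, e.g. '950.00' -> '555.55'."""
    return re.sub(r"[0-9]", symbol, text)

print(collapse_digits("I pay 950.00 in May 2007"))  # I pay 555.55 in May 5555
```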

Summary 43
Language models: how likely is it that a string of English words is good English?
N-gram models (Markov assumption)
Perplexity
Count smoothing
  add-one, add-α
  deleted estimation
  Good-Turing
Interpolation and back-off
  Good-Turing
  Witten-Bell
  Kneser-Ney
Managing the size of the model