Foundations of Natural Language Processing
Lecture 4: Language Models: Evaluation and Smoothing
Alex Lascarides (slides based on those from Alex Lascarides, Sharon Goldwater and Philipp Koehn)
25 January 2019

Recap: Language models

Language models tell us P(w) = P(w_1 ... w_n): how likely is this sequence of words to occur? Roughly: is this sequence of words a good one in my language?

LMs are used as a component in applications such as speech recognition, machine translation, and predictive text completion.

To reduce sparse data, N-gram LMs assume words depend only on a fixed-length history, even though we know this isn't true.

Evaluating a language model

Intuitively, a trigram model captures more context than a bigram model, so it should be a better model. That is, it should more accurately predict the probabilities of sentences. But how can we measure this?

Two types of evaluation in NLP

Extrinsic: measure performance on a downstream application. For an LM, plug it into a machine translation/ASR/etc. system. This is the most reliable evaluation, but it can be time-consuming. And of course, we still need an evaluation measure for the downstream system!

Intrinsic: design a measure that is inherent to the current task. This can be much quicker and easier during the development cycle, but it is not always easy to figure out what the right measure is: ideally, one that correlates well with extrinsic measures.

Let's consider how to define an intrinsic measure for LMs.
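Before turning to intrinsic measures, here is a minimal sketch (not from the slides) of the kind of model being evaluated: a bigram LM that scores a sentence as a product of conditional probabilities. The probabilities below are invented purely for illustration; a real model would estimate them from corpus counts.

```python
# Hypothetical bigram probabilities P(w_i | w_{i-1}); "<s>" marks the sentence start.
bigram_prob = {
    ("<s>", "the"): 0.2,
    ("the", "cat"): 0.05,
    ("cat", "sat"): 0.1,
}

def sentence_prob(words, probs):
    """P(w_1 ... w_n) under the Markov assumption: each word depends only on the previous one."""
    history = ["<s>"] + words[:-1]
    p = 1.0
    for prev, w in zip(history, words):
        # Unseen bigrams get probability 0 here -- hence the need for smoothing, later.
        p *= probs.get((prev, w), 0.0)
    return p

print(sentence_prob(["the", "cat", "sat"], bigram_prob))  # 0.2 * 0.05 * 0.1 ~ 0.001
```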

Entropy

Definition of the entropy of a random variable X:

H(X) = -\sum_x P(x) \log_2 P(x)

Intuitively: a measure of uncertainty/disorder. Also: the expected value of -\log_2 P(X).

Entropy example: one event (outcome)

P(a) = 1
H(X) = -1 \log_2 1 = 0

Entropy example: 2 equally likely events

P(a) = 0.5, P(b) = 0.5
H(X) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = -\log_2 0.5 = 1

Entropy example: 4 equally likely events

P(a) = 0.25, P(b) = 0.25, P(c) = 0.25, P(d) = 0.25
H(X) = -0.25 \log_2 0.25 - 0.25 \log_2 0.25 - 0.25 \log_2 0.25 - 0.25 \log_2 0.25 = -\log_2 0.25 = 2

Entropy example: 3 equally likely events and one more likely than the others

P(a) = 0.7, P(b) = 0.1, P(c) = 0.1, P(d) = 0.1
H(X) = -0.7 \log_2 0.7 - 0.1 \log_2 0.1 - 0.1 \log_2 0.1 - 0.1 \log_2 0.1
     = -0.7 \log_2 0.7 - 0.3 \log_2 0.1
     = -(0.7)(-0.5146) - (0.3)(-3.3219)
     = 0.36020 + 0.99658
     = 1.35678

Entropy example: 3 equally likely events and one much more likely than the others

P(a) = 0.97, P(b) = 0.01, P(c) = 0.01, P(d) = 0.01
H(X) = -0.97 \log_2 0.97 - 0.01 \log_2 0.01 - 0.01 \log_2 0.01 - 0.01 \log_2 0.01
     = -0.97 \log_2 0.97 - 0.03 \log_2 0.01
     = -(0.97)(-0.04394) - (0.03)(-6.6439)
     = 0.04262 + 0.19932
     = 0.24194

Entropy as y/n questions

How many yes-no questions (bits) do we need to find out the outcome? For a uniform distribution with 2^n outcomes: n yes-no questions. (The slides illustrate this with example distributions whose entropies are 0, 1, 2, 3, 1.35678 and 0.24194.)
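The entropy values in the examples above can be checked with a few lines of Python. This is just a direct transcription of the definition H(X) = -\sum_x P(x) \log_2 P(x), not part of the original slides:

```python
import math

def entropy(dist):
    """H(X) = -sum_x P(x) * log2 P(x); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(entropy([1.0]))                     # 0.0  (a single certain outcome)
print(entropy([0.5, 0.5]))                # 1.0  (2 equally likely events)
print(entropy([0.25] * 4))                # 2.0  (4 equally likely events)
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.35678
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.24194
```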

Entropy as encoding sequences

Assume that we want to encode a sequence of events X. Each event is encoded by a sequence of bits, and we want to use as few bits as possible. For example:
Coin flip: heads = 0, tails = 1.
4 equally likely events: a = 00, b = 01, c = 10, d = 11.
3 events, one more likely than the others: a = 0, b = 10, c = 11.
Morse code: e has a shorter code than q.
The average number of bits needed to encode X is at least the entropy of X.

The Entropy of English

Given the start of a text, can we guess the next word? For humans, the measured entropy is only about 1.3. Meaning: on average, given the preceding context, a human would need only 1.3 yes/no questions to determine the next word. This is an upper bound on the true entropy, which we can never know (because we don't know the true probability distribution). But what about N-gram models?

Cross-entropy

Our LM estimates the probability of word sequences. A good model assigns high probability to sequences that actually have high probability (and low probability to others). Put another way, our model should have low uncertainty (entropy) about which word comes next. We can measure this using cross-entropy.

Computing cross-entropy

For w_1 ... w_n with large n, the per-word cross-entropy is well approximated by:

H_M(w_1 ... w_n) = -\frac{1}{n} \log_2 P_M(w_1 ... w_n)

This is just the average negative log probability our model assigns to each word in the sequence (i.e., normalized for sequence length). Lower cross-entropy means the model is better at predicting the next word. Note that cross-entropy ≥ entropy: our model's uncertainty can be no less than the true uncertainty.
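A small sketch of the per-word cross-entropy computation. The toy model below, which assigns probability 1/8 to every word regardless of context, is a hypothetical stand-in used only to exercise the formula; it is not from the slides:

```python
import math

def per_word_cross_entropy(words, model_prob):
    """H_M(w_1 ... w_n) = -(1/n) * sum_i log2 P_M(w_i | w_1 ... w_{i-1})."""
    total_log_prob = sum(math.log2(model_prob(w, words[:i])) for i, w in enumerate(words))
    return -total_log_prob / len(words)

# Hypothetical model: every word gets probability 1/8 whatever the history,
# so the per-word cross-entropy must come out as exactly 3 bits.
uniform_8 = lambda word, history: 1 / 8
print(per_word_cross_entropy("i spent three years".split(), uniform_8))  # 3.0
```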

Cross-entropy example

Using a bigram model from Moby Dick, compute the per-word cross-entropy of "I spent three years before the mast" (here, without using end-of-sentence padding):

-\frac{1}{7} ( \log_2 P(I) + \log_2 P(spent|I) + \log_2 P(three|spent) + \log_2 P(years|three) + \log_2 P(before|years) + \log_2 P(the|before) + \log_2 P(mast|the) )
= -\frac{1}{7} ( -6.9381 - 11.0546 - 3.1699 - 4.2362 - 5.0 - 2.4426 - 8.4246 )
= -\frac{1}{7} ( -41.2660 )
\approx 6

The per-word cross-entropy of the unigram model is about 11. So the unigram model has about 5 bits more uncertainty per word than the bigram model. But what does that mean?

Data compression

If we designed an optimal code based on our bigram model, we could encode the entire sentence in about 42 bits (6 bits x 7 words). A code based on our unigram model would require about 77 bits. ASCII uses an average of 24 bits per word (168 bits total)! So better language models can also give us better data compression, as elaborated by the field of information theory.

Perplexity

LM performance is often reported as perplexity rather than cross-entropy. Perplexity is simply 2^{cross-entropy}: the average branching factor at each decision point, if our distribution were uniform. So 6 bits of cross-entropy means our model's perplexity is 2^6 = 64: equivalent uncertainty to a uniform distribution over 64 outcomes.
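The arithmetic connecting cross-entropy, perplexity, and encoding length in the example above can be sketched as follows; the per-word figures are the ones quoted on the slides:

```python
# Per-word cross-entropy (bits/word) of the Moby Dick models on the 7-word sentence.
bigram_H = 41.2660 / 7   # ~5.9 bits/word, rounded to 6 on the slide
unigram_H = 11.0
n_words = 7

print(2 ** bigram_H)        # bigram perplexity: ~60 (2**6 = 64 with the rounded figure)
print(2 ** unigram_H)       # unigram perplexity: 2048
print(bigram_H * n_words)   # ~41 bits to encode the whole sentence ("about 42")
print(unigram_H * n_words)  # 77 bits with the unigram model
```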

Interpreting these measures

I measure the cross-entropy of my LM on some corpus as 5.2. Is that good? No way to tell! Cross-entropy depends on both the model and the corpus. Some language is simply more predictable (e.g. casual speech vs academic writing). So lower cross-entropy could mean the corpus is easy, or the model is good. We can only compare different models on the same corpus. Should we measure on training data or held-out data? Why?

Sparse data, again

Suppose now we build a trigram model from Moby Dick and evaluate the same sentence. But "I spent three" never occurs, so P_{MLE}(three | I spent) = 0, which means the cross-entropy is infinite. This is basically right: our model says "I spent three" should never occur, so our model is infinitely wrong/surprised when it does! Even with a unigram model, we will run into words we never saw before. So even with short N-grams, we need better ways to estimate probabilities from sparse data.

Smoothing

The flaw of MLE: it estimates probabilities that make the training data maximally probable, by making everything else (unseen data) minimally probable. Smoothing methods address the problem by stealing probability mass from seen events and reallocating it to unseen events. There are lots of different methods, based on different kinds of assumptions; we will discuss just a few.

Add-One (Laplace) Smoothing

Just pretend we saw everything one more time than we did:

P_{ML}(w_i | w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}

P_{+1}(w_i | w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i) + 1}{C(w_{i-2}, w_{i-1})} ?

Add-One (Laplace) Smoothing (continued)

NO! The sum over possible w_i (in vocabulary V) must equal 1:

\sum_{w_i \in V} P(w_i | w_{i-2}, w_{i-1}) = 1

If we increase the numerator, we must change the denominator too.

Add-one smoothing: normalization

We want:

\sum_{w_i \in V} \frac{C(w_{i-2}, w_{i-1}, w_i) + 1}{C(w_{i-2}, w_{i-1}) + x} = 1

Solve for x:

\sum_{w_i \in V} ( C(w_{i-2}, w_{i-1}, w_i) + 1 ) = C(w_{i-2}, w_{i-1}) + x
\sum_{w_i \in V} C(w_{i-2}, w_{i-1}, w_i) + \sum_{w_i \in V} 1 = C(w_{i-2}, w_{i-1}) + x
C(w_{i-2}, w_{i-1}) + v = C(w_{i-2}, w_{i-1}) + x
v = x

where v is the vocabulary size. So the add-one estimate is

P_{+1}(w_i | w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i) + 1}{C(w_{i-2}, w_{i-1}) + v}

Add-one example (1)

Moby Dick has one trigram that begins with "I spent" (it's "I spent in") and the vocabulary size is 17231. Comparison of MLE vs add-one probability estimates:

                           MLE    +1 estimate
\hat{P}(three | I spent)    0      0.00006
\hat{P}(in | I spent)       1      0.0001

\hat{P}(in | I spent) seems very low, especially since "in" is a very common word. But can we find better evidence that this method is flawed?

Add-one example (2)

Suppose we have a more common bigram w_1, w_2 that occurs 100 times, 10 of which are followed by w_3:

                           MLE              +1 estimate
\hat{P}(w_3 | w_1, w_2)    10/100 = 0.1     11/17331 \approx 0.0006

This shows that the very large vocabulary size makes add-one smoothing steal way too much from seen events. In fact, MLE is pretty good for frequent events, so we shouldn't want to change these much.
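A minimal sketch of the add-one estimate with the corrected denominator, reproducing the toy numbers from the two examples above; the helper function is illustrative, not code from the course:

```python
def add_one_prob(ngram_count, history_count, vocab_size):
    """P_+1(w | history) = (C(history, w) + 1) / (C(history) + v)."""
    return (ngram_count + 1) / (history_count + vocab_size)

V = 17231  # Moby Dick vocabulary size, as on the slide

# Example (1): the history "I spent" occurs once, followed only by "in".
print(add_one_prob(0, 1, V))     # P(three | I spent) ~ 0.00006
print(add_one_prob(1, 1, V))     # P(in | I spent)    ~ 0.0001

# Example (2): a history seen 100 times, 10 of them followed by w3.
print(10 / 100)                  # MLE estimate: 0.1
print(add_one_prob(10, 100, V))  # add-one estimate: ~0.0006
```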

Add-α (Lidstone) Smoothing

We can improve things by adding α < 1:

P_{+\alpha}(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i) + \alpha}{C(w_{i-1}) + \alpha v}

Like Laplace, this assumes we know the vocabulary size in advance. But if we don't, we can just add a single unknown (UNK) item, and use this for all unknown words during testing. Then: how do we choose α?

Optimizing α (and other model choices)

Use a three-way data split: training set (80-90%), held-out (or development) set (5-10%), and test set (5-10%). Train the model (estimate probabilities) on the training set with different values of α, and choose the α that minimizes cross-entropy on the development set. Report final results on the test set. (A small sketch of this tuning loop appears below, after the Good-Turing introduction.) More generally, use the dev set for evaluating different models, debugging, and optimizing choices. The test set simulates deployment: use it only once! This avoids overfitting to the training set and even to the test set.

Better smoothing: Good-Turing

The previous methods changed the denominator, which can have big effects even on frequent events. Good-Turing changes the numerator. Think of it like this: MLE divides the count c of an N-gram by the count n of its history:

P_{ML} = \frac{c}{n}

Good-Turing uses adjusted counts c* instead:

P_{GT} = \frac{c^*}{n}

Good-Turing in Detail

Push every probability total down to the count class below. Each count is reduced slightly (Zipf): we're discounting! The slide works this through in a small table with columns:
c: the count
N_c: the number of different items with count c
P_c: the MLE estimate of the probability of such an item
P_c[total]: the total MLE probability mass for all items with that count
c*: the Good-Turing smoothed version of the count
P*_c and P*_c[total]: the Good-Turing versions of P_c and P_c[total]
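Returning to add-α smoothing and the tuning recipe above, here is a small sketch of the tuning loop. The function and variable names (dev_bigrams, bigram_counts, unigram_counts) are illustrative assumptions, not code from the course:

```python
import math

def add_alpha_prob(bigram_count, history_count, alpha, vocab_size):
    """P_+alpha(w | w_prev) = (C(w_prev, w) + alpha) / (C(w_prev) + alpha * v)."""
    return (bigram_count + alpha) / (history_count + alpha * vocab_size)

def dev_cross_entropy(alpha, dev_bigrams, bigram_counts, unigram_counts, vocab_size):
    """Per-word cross-entropy of the add-alpha bigram model on held-out data."""
    total = 0.0
    for prev, w in dev_bigrams:
        p = add_alpha_prob(bigram_counts.get((prev, w), 0),
                           unigram_counts.get(prev, 0),
                           alpha, vocab_size)
        total += math.log2(p)
    return -total / len(dev_bigrams)

def best_alpha(candidates, dev_bigrams, bigram_counts, unigram_counts, vocab_size):
    """Choose the alpha that minimizes cross-entropy on the development set."""
    return min(candidates,
               key=lambda a: dev_cross_entropy(a, dev_bigrams, bigram_counts,
                                               unigram_counts, vocab_size))

# e.g. best_alpha([1.0, 0.1, 0.01, 0.001], dev_bigrams, bigram_counts, unigram_counts, 17231)
```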

Some Observations

The basic idea is to arrange the discounts so that the amount we add to the total probability in row 0 is matched by all the discounting in the other rows. Note that we only know N_0 if we actually know what's missing, and we can't always estimate what words are missing from a corpus. But for bigrams, we often assume N_0 = V^2, where V is the number of different (observed) words in the corpus.

Good-Turing Smoothing: The Formulae

The Good-Turing discount depends on the (real) adjacent count:

c^* = (c + 1) \frac{N_{c+1}}{N_c}

P^*_c = \frac{c^*}{n} = \frac{(c + 1) N_{c+1}}{N_c \, n}

Since the counts N_c tend to go down as c goes up, the multiplier N_{c+1}/N_c is < 1. The sum of all the discounts is N_1/N_0, and we need it to be, given that this is our GT count for row 0!

Good-Turing for 2-Grams in Europarl

Count c   Count of counts N_c   Adjusted count c*   Test count t_c
0         7,514,941,065         0.00015             0.00016
1         1,132,844             0.46539             0.46235
2         263,611               1.40679             1.39946
3         123,615               2.38767             2.34307
4         73,788                3.33753             3.35202
5         49,254                4.36967             4.35234
6         35,869                5.32928             5.33762
8         21,693                7.43798             7.15074
10        14,880                9.31304             9.11927
20        4,546                 19.54487            18.95948

The t_c are the average counts in the test set of bigrams that occurred c times in the training corpus: fairly close to the estimates c*.

Good-Turing justification: 0-count items

Estimate the probability that the next observation is previously unseen (i.e., will have count 1 once we see it). This part uses MLE!

P(unseen) = \frac{N_1}{n}

Divide that probability equally amongst all unseen events:

P^*_{GT} = \frac{1}{N_0} \cdot \frac{N_1}{n}, \quad \text{i.e. } c^* = \frac{N_1}{N_0}
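A minimal sketch of the adjusted-count formula c* = (c + 1) N_{c+1} / N_c, applied to the first few count-of-counts from the Europarl table above so the output can be compared against its adjusted-count column:

```python
def good_turing_adjusted_count(c, N):
    """c* = (c + 1) * N_{c+1} / N_c, where N is the count-of-counts table."""
    return (c + 1) * N[c + 1] / N[c]

# Count-of-counts N_c for Europarl bigrams (first rows of the table above).
N = {0: 7_514_941_065, 1: 1_132_844, 2: 263_611, 3: 123_615, 4: 73_788, 5: 49_254}

for c in range(5):
    print(c, good_turing_adjusted_count(c, N))
# Output closely matches the adjusted-count column:
# ~0.00015, ~0.4654, ~1.4068, ~2.3877, ~3.3375
```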

Good-Turing justification: 1-count items

Estimate the probability that the next observation was seen once before (i.e., will have count 2 once we see it):

P(\text{seen once before}) = \frac{2 N_2}{n}

Divide that probability equally amongst all 1-count events:

P^*_{GT} = \frac{1}{N_1} \cdot \frac{2 N_2}{n}, \quad \text{i.e. } c^* = \frac{2 N_2}{N_1}

The same reasoning applies to higher-count items.

Summary

We can measure the relative goodness of LMs on the same corpus using cross-entropy: how well does the model predict the next word? We need smoothing to deal with unseen N-grams. Add-1 and Add-α are simple, but not very good. Good-Turing is more sophisticated and yields better models, but we'll see even better methods next time.