Language Modelling: Smoothing and Model Complexity COMP-599 Sept 14, 2016

Announcements A1 has been released Due on Wednesday, September 28th Starter code for Question 4: Includes some of the package import statements that you'll need. Includes code to read the files (and deal with annoying Unicode issues)

Outline Review of last class Justification of MLE probabilistically Overfitting and unseen data Dealing with unseen data: smoothing and regularization

Language Modelling Predict the next word given some context, e.g., Mary had a little ___ : lamb (GOOD), accident (GOOD?), very (BAD), up (BAD)

N-grams Make a conditional independence assumption to make the job of learning the probability distribution easier. Context = the previous N-1 words. Common choices: N is between 1 and 3. Unigram model: P(w_N | C) = P(w_N). Bigram model: P(w_N | C) = P(w_N | w_{N-1}). Trigram model: P(w_N | C) = P(w_N | w_{N-1}, w_{N-2}).

Deriving Parameters from Counts Simplest method: count N-gram frequencies, then divide by the total count. e.g., Unigram: P(cats) = Count(cats) / Count(all words in corpus). Bigram: P(cats | the) = Count(the cats) / Count(the). Trigram: P(cats | feed the) = ? These are the maximum likelihood estimates (MLE).
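
To make this concrete, here is a minimal Python sketch of how bigram MLE estimates can be computed from counts (the toy corpus and function name are my own illustration, not part of the assignment starter code):

```python
from collections import Counter

def mle_bigram_model(corpus_tokens):
    """Estimate bigram probabilities P(w | prev) by relative frequency (MLE)."""
    unigram_counts = Counter(corpus_tokens)
    bigram_counts = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    def prob(word, prev):
        if unigram_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / unigram_counts[prev]
    return prob

tokens = "the cats sat on the mat the cats slept".split()
p = mle_bigram_model(tokens)
print(p("cats", "the"))   # Count(the cats) / Count(the) = 2 / 3
```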

Basic Information Theory Consider some random variable X, distributed according to some probability distribution. We can define information in terms of how much certainty we gain from knowing the value of X. Rank the following in terms of how much information we gain by knowing its value: a fair coin flip; an unfair coin flip where we get tails ¾ of the time; a very unfair coin that always comes up heads.

Likely vs Unlikely Outcomes Observing a likely outcome: less information gained. Intuition: you kinda knew it would happen anyway, e.g., observing the word the. Observing a rare outcome: more information gained! Intuition: it's a bit surprising to see something unusual, e.g., observing the word armadillo. Formal definition of information in bits: I(x) = log_2(1 / P(x)), the minimum number of bits needed to communicate some outcome x.

Entropy The expected amount of information we get from observing a random variable. Let a discrete random variable be drawn from distribution p, taking on one of k possible values with probabilities p_1 ... p_k. Then:
H(p) = Σ_{i=1}^{k} p_i I(x_i) = Σ_{i=1}^{k} p_i log_2(1 / p_i) = -Σ_{i=1}^{k} p_i log_2 p_i
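
As a quick check on the coin-flip ranking from the previous slides, a small Python sketch (the function name and example values are mine):

```python
import math

def entropy(probs):
    """H(p) = -sum_i p_i * log2(p_i); terms with p_i == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy([0.25, 0.75]))  # biased coin: ~0.811 bits
print(entropy([1.0]))         # coin that always comes up heads: 0 bits
```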

Entropy Example Plot of entropy vs. coin toss fairness Maximum fairness = maximum entropy Completely biased = minimum entropy Image source: Wikipedia, by Brona and Alessio Damato

Cross Entropy Entropy is the minimum number of bits needed to communicate some message, if we know what probability distribution the message is drawn from. Cross entropy is for when we don't know, e.g., language is drawn from some true distribution, and the language model we train is an approximation of it: H(p, q) = -Σ_{i=1}^{k} p_i log_2 q_i, where p is the true distribution and q is the model distribution.

Estimating Cross Entropy When evaluating our LM, we assume the test data is a good representative of language drawn from p. Original: H(p, q) = -Σ_{i=1}^{k} p_i log_2 q_i. Estimate: H(p, q) ≈ -(1/N) log_2 q(w_1 ... w_N), where p is the true language distribution (which we don't have access to), q is the language model under evaluation, N is the size of the test corpus in number of tokens, and w_1 ... w_N are the words in the test corpus.
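
A sketch of how this estimate could be computed for a unigram model q (the toy model, test corpus, and function name are invented for illustration):

```python
import math

def cross_entropy_estimate(test_tokens, q):
    """Estimate H(p, q) = -(1/N) log2 q(w_1 ... w_N).
    Under a unigram model, log2 q(w_1 ... w_N) = sum_i log2 q(w_i)."""
    log_prob = sum(math.log2(q[w]) for w in test_tokens)
    return -log_prob / len(test_tokens)

q = {"the": 0.5, "cat": 0.3, "sat": 0.2}        # a toy unigram model
test = ["the", "cat", "sat", "the", "cat"]      # a toy test corpus
print(cross_entropy_estimate(test, q))          # about 1.56 bits per token
```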

Perplexity Cross entropy gives us a number in bits, which is sometimes hard to read. Perplexity makes this easier: Perplexity(p, q) = 2^{H(p, q)}
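
Putting the last two slides together, a self-contained sketch that turns the cross-entropy estimate into a perplexity (toy values are mine):

```python
import math

def perplexity(test_tokens, q):
    """Perplexity(p, q) = 2 ** H(p, q), with H estimated as on the previous slide."""
    h = -sum(math.log2(q[w]) for w in test_tokens) / len(test_tokens)
    return 2 ** h

q = {"the": 0.5, "cat": 0.3, "sat": 0.2}
test = ["the", "cat", "sat", "the", "cat"]
print(perplexity(test, q))   # about 2.95
```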

Warm-Up Exercise Evaluate the given unigram language models using perplexity on the corpus: A B C B B
Model 1: P(A) = 0.3, P(B) = 0.4, P(C) = 0.3
Model 2: P(A) = 0.4, P(B) = 0.5, P(C) = 0.1
Recall: Perplexity(p, q) = 2^{H(p, q)}, where H(p, q) is estimated as -(1/N) log_2 q(w_1 ... w_N).

What is Maximum Likelihood? This way of computing the model parameters corresponds to maximizing the likelihood of (i.e., the probability of generating) the training corpus. Assumption: words (or N-grams) are random variables that are drawn from a categorical probability distribution i.i.d. (independent and identically distributed).

Categorical Random Variables 1-of-K discrete outcomes, each with some probability, e.g., coin flip, die roll, draw of a word from a language model. Probability of a training corpus C = x_1, x_2, ..., x_N, for K = 2: P(C; θ) = Π_{n=1}^{N} P(x_n; θ) = θ^{N_1} (1 - θ)^{N_0}. Can similarly extend for K > 2. Notes: when K = 2, this is called a Bernoulli distribution. It is sometimes incorrectly called a multinomial distribution, which is something else.
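
A small numeric illustration of this likelihood for K = 2 (the coin-flip corpus here is invented):

```python
def bernoulli_likelihood(corpus, theta):
    """P(C; theta) = theta**N1 * (1 - theta)**N0 for a corpus of 0/1 outcomes."""
    n1 = sum(corpus)
    n0 = len(corpus) - n1
    return theta ** n1 * (1 - theta) ** n0

corpus = [1, 0, 1, 1, 0]                    # N1 = 3 heads, N0 = 2 tails
print(bernoulli_likelihood(corpus, 0.5))    # 0.5**5 = 0.03125
print(bernoulli_likelihood(corpus, 0.6))    # 0.6**3 * 0.4**2 = 0.03456
```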

Maximizing Quantities Calculus to the rescue! Take derivative and set to 0. Trick: maximize the log likelihood instead (math works out better)

MLE Derivation for a Bernoulli Maximize the log likelihood:
log P(C; θ) = log(θ^{N_1} (1 - θ)^{N_0}) = N_1 log θ + N_0 log(1 - θ)
d/dθ log P(C; θ) = N_1 / θ - N_0 / (1 - θ) = 0
N_1 / θ = N_0 / (1 - θ)
Solve this to get: θ = N_1 / (N_0 + N_1), or θ = N_1 / N
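
A quick numeric sanity check of this result on an invented corpus with N_1 = 3 and N_0 = 2 (illustrative only):

```python
# For a corpus with N1 = 3 heads and N0 = 2 tails, the MLE predicts theta = 3/5 = 0.6.
def likelihood(theta, n1=3, n0=2):
    return theta ** n1 * (1 - theta) ** n0

best = max([0.4, 0.5, 0.6, 0.7], key=likelihood)
print(best)   # 0.6, matching theta = N1 / N
```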

MLE Derivation for a Categorical The above generalizes to the case where K > 2. Do the derivation! Parameters are now θ_0, θ_1, θ_2, ..., θ_{K-1}. Counts are now N_0, N_1, N_2, ..., N_{K-1}. Note: need to add the constraint Σ_{i=0}^{K-1} θ_i = 1 to ensure that the parameters specify a probability distribution. Use the method of Lagrange multipliers.
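
For reference, one way the constrained derivation can be written out with a Lagrange multiplier (a sketch; the slides leave the details as an exercise):

```latex
\begin{aligned}
\mathcal{L}(\theta, \lambda) &= \sum_{i=0}^{K-1} N_i \log \theta_i
    + \lambda \Big( \sum_{i=0}^{K-1} \theta_i - 1 \Big) \\
\frac{\partial \mathcal{L}}{\partial \theta_i} &= \frac{N_i}{\theta_i} + \lambda = 0
    \quad\Rightarrow\quad \theta_i = -\frac{N_i}{\lambda} \\
\sum_i \theta_i = 1 &\;\Rightarrow\; \lambda = -\sum_i N_i = -N
    \quad\Rightarrow\quad \theta_i = \frac{N_i}{N}
\end{aligned}
```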

Steps 1. Gather a large, representative training corpus 2. Learn the parameters from the corpus to build the model 3. Once the model is fixed, use the model to evaluate on testing data

Overfitting MLE often gives us a model that is too good of a fit to the training data. This is called overfitting. Words that we haven't seen get no probability mass, and the probabilities of the words and N-grams that we have seen are not representative of the true distribution. But when testing, we evaluate the LM on unseen data. Overfitting lowers performance.

Out Of Vocabulary (OOV) Items Suppose we train a LM on the WSJ corpus, which is about economic news in 1987-1989. What probability would be assigned to Grexit? In general, we know that there will be many words in the test data that are not in the training data, no matter how large the training corpus is: neologisms, typos, parts of the text in foreign languages, etc. Remember Zipf's law and the long tail.

Smoothing The training corpus does not have all the words: add a special UNK symbol for unknown words. Estimates for infrequent words are unreliable: modify our probability distributions. Smooth the probability distributions to shift probability mass to cases that we haven't seen before or are unsure about.

MAP Estimation Smoothing means we are no longer doing MLE. We now have some prior belief about what the parameters should be like: maximum a posteriori inference. MLE: find θ_MLE s.t. P(X; θ_MLE) is maximized. MAP: find θ_MAP s.t. P(X; θ_MAP) P(θ_MAP) is maximized.

Add-δ Smoothing Modify our estimates by adding a certain amount to the frequency of each word (sometimes called pseudocounts), e.g., for a unigram model: P(w) = (Count(w) + δ) / (|Corpus| + δ|Lexicon|). Pros: simple. Cons: not the best approach; how to pick δ? Depends on the sizes of the lexicon and corpus. When δ = 1, this is called Laplace discounting.
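
A minimal sketch of an add-δ unigram estimate (the function name is mine, and the lexicon is assumed to include an UNK symbol):

```python
from collections import Counter

def make_add_delta_unigram(corpus_tokens, lexicon, delta=1.0):
    """P(w) = (Count(w) + delta) / (|Corpus| + delta * |Lexicon|)."""
    counts = Counter(corpus_tokens)
    denom = len(corpus_tokens) + delta * len(lexicon)
    def prob(w):
        return (counts[w] + delta) / denom
    return prob

corpus = "the cat sat on the mat".split()
lexicon = set(corpus) | {"<UNK>"}
p = make_add_delta_unigram(corpus, lexicon, delta=1.0)
print(p("the"))     # (2 + 1) / (6 + 1 * 6) = 0.25
print(p("<UNK>"))   # (0 + 1) / 12 ≈ 0.083
```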

Exercise Suppose we have a LM with a vocabulary of 20,000 items. In the training corpus, we see donkey 10 times. Of these, 5 times it was followed by the word kong; the other 5 times, it was followed by some other word. What is the MLE estimate of P(kong | donkey)? What is the Laplace estimate of P(kong | donkey)?

Interpolation In an N-gram model, as N increases, data sparsity (i.e., unseen or rarely seen events) becomes a bigger problem. With interpolation, we combine models with lower N to mitigate the problem.

Simple Interpolation e.g., combine trigram, bigram, and unigram models: P(w_t | w_{t-2}, w_{t-1}) = λ_1 P_MLE(w_t | w_{t-2}, w_{t-1}) + λ_2 P_MLE(w_t | w_{t-1}) + λ_3 P_MLE(w_t). Need to set Σ_i λ_i = 1 so that the overall sum is a probability distribution. How to select λ_i? We will see shortly.
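
A sketch of how this combination might look in code, assuming the three MLE estimates are already available as functions (all names and the λ values are illustrative):

```python
def interpolated_prob(w, prev2, prev1, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """P(w | prev2, prev1) = l1*P_tri(w | prev2, prev1) + l2*P_bi(w | prev1) + l3*P_uni(w).
    The lambdas must sum to 1 so the result is still a probability distribution."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l3 * p_uni(w)

# Toy usage with hard-coded component estimates:
p = interpolated_prob("lamb", "a", "little",
                      p_tri=lambda w, a, b: 0.5,
                      p_bi=lambda w, b: 0.2,
                      p_uni=lambda w: 0.01)
print(p)   # 0.6*0.5 + 0.3*0.2 + 0.1*0.01 = 0.361
```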

Good-Turing Smoothing A more sophisticated method of modelling OOV items. Remember Zipf's lessons: we shouldn't adjust all words uniformly. The frequency of a word type is related to its rank, so we should be able to model this! Unseen words should behave a lot like words that only occur once in a corpus (hapax legomenon; pl., hapax legomena). Words that occur a lot should behave like other words that occur a lot.

Count of Counts Let's build a histogram to count how many word-types occur a certain number of times in the corpus:
Word frequency c | # word-types with that frequency
1 | f_1 = 3993
2 | f_2 = 1292
3 | f_3 = 664
For some word in bin f_c, that word occurred c times in the corpus; c is the numerator in the MLE. Idea: re-estimate c using f_{c+1}.

Good-Turing Smoothing Defined Let N be the total number of observed word-tokens, and let w_c be a word that occurs c times in the training corpus, so N = Σ_i i · f_i. Then:
P(UNK) = f_1 / N (this is for all OOV words together)
c* = (c + 1) f_{c+1} / f_c
P(w_c) = c* / N (this is for one word that occurs c times)
Example: let N be 100,000, with f_1 = 3,993, f_2 = 1,292, f_3 = 664.
P(UNK) = 3993 / 100000 = 0.03993 (for all unknown words)
c*_1 = 2 * 1292 / 3993 ≈ 0.647
c*_2 = 3 * 664 / 1292 ≈ 1.542
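
A sketch of these adjusted counts in code, using the frequency-of-frequencies table from the slide (the function name is mine):

```python
def good_turing_adjust(f, n):
    """P(UNK) = f[1] / n; c* = (c+1) * f[c+1] / f[c] wherever f[c+1] is observed.
    f: dict mapping frequency c -> number of word-types with that frequency."""
    p_unk = f[1] / n
    c_star = {c: (c + 1) * f[c + 1] / f[c] for c in f if (c + 1) in f}
    return p_unk, c_star

f = {1: 3993, 2: 1292, 3: 664}   # counts of counts from the slide
p_unk, c_star = good_turing_adjust(f, n=100_000)
print(p_unk)       # 0.03993
print(c_star[1])   # 2 * 1292 / 3993 ≈ 0.647
print(c_star[2])   # 3 * 664 / 1292 ≈ 1.542
```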

Good-Turing Refinement In practice, we need to do something a little more: at higher values of c, f_{c+1} is often 0. Solution: estimate f_c as a function of c. We'll assume that a linear relationship exists between log c and log f_c, and use linear regression to learn this relationship: log f_c^LR = a log c + b. For lower values of c, we continue to use f_c; for higher values of c, we use our new estimate f_c^LR.
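
A sketch of fitting that log-log line with ordinary least squares (the f_c values are from the earlier slide; the function name is mine):

```python
import math

def fit_log_log(f):
    """Least-squares fit of log f_c = a * log c + b over the observed (c, f_c) pairs."""
    xs = [math.log(c) for c in f]
    ys = [math.log(fc) for fc in f.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    a = num / den
    b = mean_y - a * mean_x
    return a, b

f = {1: 3993, 2: 1292, 3: 664}
a, b = fit_log_log(f)
smoothed_f_10 = math.exp(a * math.log(10) + b)   # estimated f_c for c = 10
print(a, b, smoothed_f_10)   # slope a is negative: f_c shrinks as c grows
```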

Exercises Suppose we have the following counts: ship (8), pass (7), camp (3), frock (2), soccer (1), mother (1), tops (1). Give the MLE and Good-Turing estimates for the probabilities of: any unknown word; soccer; camp.

Model Selection We now have many slightly different versions of the model (with different hyperparameters). How do we decide between them? Use a development / validation set. Procedure: 1. Train one or more models on the training set. 2. Test (repeatedly, if necessary) on the dev/val set; choose a final hyperparameter setting/model. 3. Test the final model on the final testing set (once only). Steps 1 and 2 can be structured using cross-validation.
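
A self-contained sketch of steps 1 and 2 for picking δ in add-δ smoothing on a tiny invented corpus (everything here is my own illustration, not the assignment setup):

```python
import math
from collections import Counter

def add_delta_unigram(train_tokens, vocab, delta):
    counts = Counter(train_tokens)
    denom = len(train_tokens) + delta * len(vocab)
    return lambda w: (counts[w] + delta) / denom

def perplexity(tokens, prob):
    h = -sum(math.log2(prob(w)) for w in tokens) / len(tokens)
    return 2 ** h

train = "the cat sat on the mat the cat slept".split()
dev = "the dog sat on the mat".split()
vocab = set(train) | set(dev)

# Pick the delta with the lowest dev-set perplexity; the test set is used only afterwards.
best_delta = min([0.01, 0.1, 0.5, 1.0],
                 key=lambda d: perplexity(dev, add_delta_unigram(train, vocab, d)))
print(best_delta)
```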

Model Complexity Trade-Offs In general, there is a trade-off between model expressivity (i.e., what trends you could capture about your data with your model) and how well the model generalizes to unseen data. If you use a highly expressive model (e.g., high values of N in N-gram modelling), it is much easier to overfit, and you need to do smoothing. On the other hand, if your model is too weak, your performance will suffer as well.