ANLP Lecture 6 N-gram models and smoothing


ANLP Lecture 6: N-gram models and smoothing. Sharon Goldwater (some slides from Philipp Koehn). 27 September 2018.

Recap: N-gram models. We can model sentence probabilities by conditioning each word on the $N-1$ previous words. For example, a bigram model: $P(\vec{w}) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$. Or a trigram model: $P(\vec{w}) = \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})$.
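As a concrete illustration of the bigram product (not part of the original slides), here is a minimal Python sketch with made-up probabilities; the sentence-boundary markers <s> and </s> are the ones discussed in the exercises at the end.

```python
# Toy bigram probabilities (hypothetical values, for illustration only).
bigram_prob = {
    ("<s>", "the"): 0.4, ("the", "guests"): 0.05,
    ("guests", "ate"): 0.1, ("ate", "</s>"): 0.2,
}

def sentence_prob(words, probs):
    """P(w) = product over i of P(w_i | w_{i-1}), with sentence markers."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for bigram in zip(padded, padded[1:]):
        p *= probs.get(bigram, 0.0)  # any unseen bigram drives the product to 0
    return p

print(sentence_prob(["the", "guests", "ate"], bigram_prob))  # 0.4 * 0.05 * 0.1 * 0.2
```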

MLE estimates for N-grams. To estimate each word probability, we could use MLE: $P_{ML}(w_2 \mid w_1) = \frac{C(w_1, w_2)}{C(w_1)}$. But what happens when I compute $P(\text{consuming} \mid \text{commence})$? Assume we have seen "commence" in our corpus, but we have never seen "commence consuming".

MLE estimates for N-grams (continued). Since we have never seen "commence consuming", $P_{ML}(\text{consuming} \mid \text{commence}) = 0$, so any sentence containing "commence consuming" gets probability 0: "The guests shall commence consuming supper"; "Green inked commence consuming garden the".
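A minimal sketch of MLE bigram estimation from raw counts, using a tiny made-up corpus (not real data); note how the unseen bigram "commence consuming" comes out as exactly zero.

```python
from collections import Counter

# Tiny illustrative corpus (invented for this example).
corpus = "we shall commence the meeting now . we commence again tomorrow .".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_mle(w2, w1):
    """P_ML(w2 | w1) = C(w1, w2) / C(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(p_mle("the", "commence"))        # seen bigram: 1/2
print(p_mle("consuming", "commence"))  # unseen bigram: 0.0
```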

The problem with MLE. MLE estimates probabilities that make the observed data maximally probable by assuming anything unseen cannot happen (and also assigning too much probability to low-frequency observed events). It over-fits the training data. We tried to avoid zero-probability sentences by modelling with smaller chunks (n-grams), but even these will sometimes have zero probability under MLE. Today: smoothing methods, which reassign probability mass from observed to unobserved events, to avoid overfitting and zero probabilities.

Today's lecture: How does add-alpha smoothing work, and what are its effects? What are some more sophisticated smoothing methods, and what information do they use that simpler methods don't? What are training, development, and test sets used for? What are the trade-offs between higher order and lower order n-grams? What is a word embedding and how can it help in language modelling?

Add-One Smoothing. For all possible bigrams, add one more count. We have $P_{ML}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$; is $P_{+1}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1})}$?

Add-One Smoothing (continued). No! The sum over possible $w_i$ (in vocabulary $V$) must equal 1: $\sum_{w_i \in V} P(w_i \mid w_{i-1}) = 1$. This holds for $P_{ML}$, but we increased the numerator, so we must change the denominator too.

Add-One Smoothing: normalization. We want:
$\sum_{w_i \in V} \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + x} = 1$
Solve for $x$:
$\sum_{w_i \in V} \left( C(w_{i-1}, w_i) + 1 \right) = C(w_{i-1}) + x$
$\sum_{w_i \in V} C(w_{i-1}, w_i) + \sum_{w_i \in V} 1 = C(w_{i-1}) + x$
$C(w_{i-1}) + v = C(w_{i-1}) + x$
So $P_{+1}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + v}$, where $v$ is the vocabulary size.
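A minimal sketch of the normalized add-one estimate on the same kind of toy counts; the vocabulary size v enters the denominator exactly as derived above, so the probabilities for a fixed history sum to 1.

```python
from collections import Counter

corpus = "we shall commence the meeting now . we commence again tomorrow .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab_size = len(unigrams)  # v = number of word types

def p_add_one(w2, w1):
    """P_{+1}(w2 | w1) = (C(w1, w2) + 1) / (C(w1) + v)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

# The unseen bigram now gets a small but non-zero probability.
print(p_add_one("consuming", "commence"))
# Probabilities over the vocabulary still sum to 1 for a fixed history.
print(sum(p_add_one(w, "commence") for w in unigrams))
```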

Add-One Smoothing: effects. Add-one smoothing: $P_{+1}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + v}$. A large vocabulary means $v$ is often much larger than $C(w_{i-1})$, so it overpowers the actual counts. Example: in Europarl, $v = 86{,}700$ word types (30m tokens, max $C(w_{i-1}) = 2$m).

Add-One Smoothing: effects. $P_{+1}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + v}$. Using $v = 86{,}700$, compute some example probabilities:

For C(w_{i-1}) = 10,000:
  C(w_{i-1}, w_i)   P_ML     P_{+1}
  100               1/100    1/970
  10                1/1k     1/10k
  1                 1/10k    1/48k
  0                 0        1/97k

For C(w_{i-1}) = 100:
  C(w_{i-1}, w_i)   P_ML     P_{+1}
  100               1        1/870
  10                1/10     1/9k
  1                 1/100    1/43k
  0                 0        1/87k

The problem with Add-One smoothing. All smoothing methods "steal from the rich to give to the poor", but add-one smoothing steals way too much. ML estimates for frequent events are quite accurate, and we don't want smoothing to change these much.

Add-α Smoothing. Add $\alpha < 1$ to each count: $P_{+\alpha}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + \alpha}{C(w_{i-1}) + \alpha v}$. Simplifying notation, with $c$ the n-gram count and $n$ the history count: $P_{+\alpha} = \frac{c + \alpha}{n + \alpha v}$. What is a good value for $\alpha$?

Optimizing α. Divide the corpus into a training set (80-90%), a held-out (development or validation) set (5-10%), and a test set (5-10%). Train the model (estimate probabilities) on the training set with different values of α. Choose the value of α that minimizes perplexity on the dev set. Report final results on the test set.
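A minimal sketch of this tuning loop, with a toy corpus standing in for real training and dev splits; perplexity here is computed as 2 to the power of the negative average log2 probability of the dev bigrams.

```python
import math
from collections import Counter

def train_counts(tokens):
    """Collect bigram and unigram counts from a token list."""
    return Counter(zip(tokens, tokens[1:])), Counter(tokens)

def perplexity(alpha, bigrams, unigrams, vocab_size, dev_tokens):
    """Perplexity of an add-alpha bigram model on dev data:
    2 ** (-average log2 probability per bigram)."""
    log_prob, n = 0.0, 0
    for w1, w2 in zip(dev_tokens, dev_tokens[1:]):
        p = (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab_size)
        log_prob += math.log2(p)
        n += 1
    return 2 ** (-log_prob / n)

# Toy data standing in for real training and dev splits.
train = "the cat sat on the mat . the dog sat on the rug .".split()
dev = "the cat sat on the rug .".split()
bigrams, unigrams = train_counts(train)
v = len(set(train) | set(dev))

best_alpha = min([0.001, 0.01, 0.1, 0.5, 1.0],
                 key=lambda a: perplexity(a, bigrams, unigrams, v, dev))
print("best alpha on dev set:", best_alpha)
```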

A general methodology. The training/dev/test split is used across machine learning. The development set is used for evaluating different models, debugging, and optimizing parameters (like α). The test set simulates deployment and is only used once the final model and parameters are chosen (ideally: once per paper). This avoids overfitting to the training set and even to the test set.

Is add-α sufficient? Even if we optimize α, add-α smoothing makes pretty bad predictions for word sequences. Some cleverer methods such as Good-Turing improve on this by discounting less from very frequent items. But there's still a problem...

Remaining problem. In a given corpus, suppose we never observe "Scottish beer drinkers" or "Scottish beer eaters". If we build a trigram model smoothed with Add-α or Good-Turing, which example has higher probability?

Remaining problem. Previous smoothing methods assign equal probability to all unseen events. Better: use information from lower-order N-grams (shorter histories), e.g. "beer drinkers" vs. "beer eaters". Two ways: interpolation and backoff.

Interpolation. Higher- and lower-order N-gram models have different strengths and weaknesses: high-order N-grams are sensitive to more context, but have sparse counts; low-order N-grams consider only very limited context, but have robust counts. So, combine them: $P_I(w_3 \mid w_1, w_2) = \lambda_1 P_1(w_3) + \lambda_2 P_2(w_3 \mid w_2) + \lambda_3 P_3(w_3 \mid w_1, w_2)$, e.g. $\lambda_1 P_1(\text{drinkers}) + \lambda_2 P_2(\text{drinkers} \mid \text{beer}) + \lambda_3 P_3(\text{drinkers} \mid \text{Scottish, beer})$.
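A minimal sketch of linear interpolation with invented component probabilities and weights (illustrative values only, not estimates from any corpus):

```python
# Hypothetical component probabilities for P(drinkers | Scottish, beer).
p_unigram = 1e-4          # P1(drinkers)
p_bigram = 0.05           # P2(drinkers | beer)
p_trigram = 0.0           # P3(drinkers | Scottish, beer): unseen trigram

lambdas = (0.2, 0.3, 0.5)  # interpolation weights; must sum to 1
assert abs(sum(lambdas) - 1.0) < 1e-9

p_interp = (lambdas[0] * p_unigram
            + lambdas[1] * p_bigram
            + lambdas[2] * p_trigram)
print(p_interp)  # non-zero even though the trigram was never seen
```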

Interpolation. Note that the $\lambda_i$ must sum to 1:
$1 = \sum_{w_3} P_I(w_3 \mid w_1, w_2)$
$\;\; = \sum_{w_3} \left[ \lambda_1 P_1(w_3) + \lambda_2 P_2(w_3 \mid w_2) + \lambda_3 P_3(w_3 \mid w_1, w_2) \right]$
$\;\; = \lambda_1 \sum_{w_3} P_1(w_3) + \lambda_2 \sum_{w_3} P_2(w_3 \mid w_2) + \lambda_3 \sum_{w_3} P_3(w_3 \mid w_1, w_2)$
$\;\; = \lambda_1 + \lambda_2 + \lambda_3$

Fitting the interpolation parameters. In general, any weighted combination of distributions is called a mixture model, so the $\lambda_i$ are called interpolation parameters or mixture weights. Their values are chosen to optimize perplexity on a held-out data set.

Back-Off. Trust the highest-order language model that contains the N-gram; otherwise, back off to a lower-order model. Basic idea: discount the probabilities slightly in the higher-order model and spread the extra mass between lower-order N-grams. But the maths gets complicated to make the probabilities sum to 1.

Back-Off Equation.
$P_{BO}(w_i \mid w_{i-N+1}, \ldots, w_{i-1}) =$
$\quad P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})$ if $\mathrm{count}(w_{i-N+1}, \ldots, w_i) > 0$
$\quad \alpha(w_{i-N+1}, \ldots, w_{i-1}) \, P_{BO}(w_i \mid w_{i-N+2}, \ldots, w_{i-1})$ otherwise.
This requires an adjusted prediction model $P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})$ and backoff weights $\alpha(w_1, \ldots, w_{N-1})$. See the textbook for details and explanation.
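Proper backoff needs discounted probabilities and normalizing α weights, which is where the maths gets complicated. Purely as a structural sketch (not the equation above, and not a normalized probability), the following recursion uses a fixed multiplier instead of proper backoff weights, in the spirit of so-called "stupid backoff" scoring:

```python
from collections import Counter

tokens = "the cat sat on the mat . the dog sat on the rug .".split()

# Counts of 1-, 2-, and 3-grams, all stored as tuples in one Counter.
counts = Counter()
for n in (1, 2, 3):
    counts.update(zip(*(tokens[i:] for i in range(n))))
total = len(tokens)

def score(w, history, discount=0.4):
    """Use the longest history with a non-zero count; otherwise shorten the
    history and multiply by a fixed discount. A structural sketch only."""
    if not history:
        return counts[(w,)] / total
    if counts[history + (w,)] > 0:
        return counts[history + (w,)] / counts[history]
    return discount * score(w, history[1:], discount)

print(score("mat", ("on", "the")))   # seen trigram: 1/2
print(score("rug", ("sat", "the")))  # unseen trigram: backs off to P(rug | the)
```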

Do our smoothing methods work here? Example from MacKay and Bauman Peto (1994): "Imagine, you see, that the language, you see, has, you see, a frequently occurring couplet, you see, you see, in which the second word of the couplet, see, follows the first word, you, with very high probability, you see. Then the marginal statistics, you see, are going to become hugely dominated, you see, by the words you and see, with equal frequency, you see." $P(\text{see})$ and $P(\text{you})$ are both high, but "see" nearly always follows "you". So $P(\text{see} \mid \text{novel})$ should be much lower than $P(\text{you} \mid \text{novel})$.

Diversity of histories matters! A real example: the word "York" is a fairly frequent word in the Europarl corpus, occurring 477 times, as frequent as "foods", "indicates" and "providers", so a unigram language model gives it a respectable probability. However, it almost always directly follows "New" (473 times). So, in unseen bigram contexts, "York" should have low probability: lower than predicted by the unigram model used in interpolation or backoff.

Kneser-Ney Smoothing. Kneser-Ney smoothing takes the diversity of histories into account. Count of distinct histories for a word: $N_{1+}(\bullet\, w_i) = |\{ w_{i-1} : c(w_{i-1}, w_i) > 0 \}|$. Recall the maximum likelihood estimate for a unigram language model: $P_{ML}(w_i) = \frac{C(w_i)}{\sum_{w_i} C(w_i)}$. In KN smoothing, replace raw counts with counts of histories: $P_{KN}(w_i) = \frac{N_{1+}(\bullet\, w_i)}{\sum_{w_i} N_{1+}(\bullet\, w_i)}$.
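A minimal sketch of the continuation-count idea at the unigram level only, on a toy corpus; full Kneser-Ney additionally uses discounting and interpolation, as noted on the next slide.

```python
from collections import Counter, defaultdict

tokens = "new york is big but new york is far from old york".split()
bigrams = Counter(zip(tokens, tokens[1:]))

# N_{1+}(. w): number of distinct words that precede w at least once.
histories = defaultdict(set)
for w1, w2 in bigrams:
    histories[w2].add(w1)
n1plus = {w: len(h) for w, h in histories.items()}
total_continuations = sum(n1plus.values())

unigrams = Counter(tokens)
for w in ("york", "is"):
    p_ml = unigrams[w] / len(tokens)          # raw-count unigram estimate
    p_kn = n1plus[w] / total_continuations    # continuation-count estimate
    print(w, round(p_ml, 3), round(p_kn, 3))
```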

Kneser-Ney in practice. The original version used backoff; later, modified Kneser-Ney was introduced using interpolation (Chen and Goodman, 1998). The equations are fairly complex, but until recently this was the best smoothing method for word n-grams. See Chen and Goodman for extensive comparisons of KN and other smoothing methods. KN (and other methods) is implemented in language modelling toolkits like SRILM (classic), KenLM (good for really big models), the OpenGrm NGram library (uses finite state transducers), etc.

Bayesian interpretations of smoothing. We contrasted MLE (which has a mathematical justification, but practical problems) with smoothing (heuristic approaches with better practical performance). It turns out that many smoothing methods are mathematically equivalent to forms of Bayesian estimation (which uses priors and uncertainty in parameters), so these have a mathematical justification too! Add-α smoothing corresponds to a Dirichlet prior; Kneser-Ney smoothing corresponds to a Pitman-Yor prior. See MacKay and Bauman Peto (1994), Goldwater (2006, pp. 13-17), Goldwater et al. (2006), and Teh (2006).

Are we done with smoothing yet? We've considered methods that predict rare/unseen words using: uniform probabilities (add-α, Good-Turing); probabilities from lower-order n-grams (interpolation, backoff); and the probability of appearing in new contexts (Kneser-Ney). What's left?

Word similarity. Consider two words with $C(w_1) \gg C(w_2)$, e.g. salmon and swordfish. Can $P(\text{salmon} \mid \text{caught two})$ tell us something about $P(\text{swordfish} \mid \text{caught two})$? N-gram models: no.

Word similarity in language modeling. Early version: class-based language models (J&M 4.9.2). Define classes $c$ of words, by hand or automatically: $P_{CL}(w_i \mid w_{i-1}) = P(c_i \mid c_{i-1}) P(w_i \mid c_i)$ (an HMM). Recent version: distributed language models. Current models have better perplexity than MKN (modified Kneser-Ney), and there is ongoing research to make them more efficient. Examples: the Log-Bilinear LM (Mnih and Hinton, 2007), the Recurrent Neural Network LM (Mikolov et al., 2010), LSTM LMs, etc.
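A minimal sketch of the class-based factorization, with invented word classes and probabilities purely for illustration:

```python
# Hypothetical hard class assignments and class/emission probabilities.
word_class = {"salmon": "FISH", "swordfish": "FISH", "caught": "VERB"}
p_class = {("VERB", "FISH"): 0.3}                  # P(c_i | c_{i-1})
p_word_given_class = {("salmon", "FISH"): 0.4,     # P(w_i | c_i)
                      ("swordfish", "FISH"): 0.01}

def p_cl(w, prev_w):
    """P_CL(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i)."""
    c, prev_c = word_class[w], word_class[prev_w]
    return p_class.get((prev_c, c), 0.0) * p_word_given_class.get((w, c), 0.0)

print(p_cl("salmon", "caught"))     # 0.3 * 0.4
print(p_cl("swordfish", "caught"))  # the rarer word still benefits from its class
```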

Distributed word representations (also called word embeddings). Each word is represented as a high-dimensional vector (50-500 dimensions), e.g. salmon is [0.1, 2.3, 0.6, 4.7, ...]. Similar words are represented by similar vectors, e.g. swordfish is [0.3, 2.2, 1.2, 3.6, ...].
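A minimal sketch of measuring that similarity with cosine similarity, using the toy 4-dimensional vectors from the slide (real embeddings have 50-500 dimensions):

```python
import math

salmon = [0.1, 2.3, 0.6, 4.7]
swordfish = [0.3, 2.2, 1.2, 3.6]

def cosine(u, v):
    """cos(u, v) = (u . v) / (|u| |v|); close to 1 for similar directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(salmon, swordfish))  # high similarity for these similar words
```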

Training the model. Goal: learn word representations (embeddings) such that words that behave similarly are close together in high-dimensional space (the slide shows a 2-dimensional example plot). We'll come back to this later in the course...

Training the model. N-gram LM: collect counts, and maybe optimize some parameters; (relatively) quick, especially these days (minutes to hours). Distributed LM: learn the representation for each word, solved with machine learning methods (e.g., neural networks); can be extremely time-consuming (hours to days). Learned embeddings seem to encode both semantic and syntactic similarity, using different dimensions (Mikolov et al., 2013).

Using the model. We want to compute $P(w_1 \ldots w_n)$ for a new sequence. N-gram LM: again, relatively quick. Distributed LM: can be slow, but it varies; often LMs are not used in isolation anyway (instead an end-to-end neural model does some of the same work). This is an active area of research for distributed LMs.

Other Topics in Language Modeling. Many active research areas in language modeling: factored/morpheme-based/character language models, which back off to word stems, part-of-speech tags, and/or sub-word units; domain adaptation, when only a small domain-specific corpus is available; and time and space efficiency, which are both key issues (especially on mobile devices!).

Summary. We can estimate sentence probabilities by breaking down the problem, e.g. by instead estimating N-gram probabilities. Longer N-grams capture more linguistic information, but are sparser. Different smoothing methods capture different intuitions about how to estimate probabilities for rare/unseen events. There is still lots of work on how to improve these models.

Announcements. Assignment 1 will go out on Monday: build and experiment with a character-level N-gram model. It is intended for students to work in pairs, and we strongly recommend you do: you can discuss and learn from your partner. We'll have a signup sheet if you want to choose your own partner. On Tue/Wed, we will assign partners to anyone who hasn't already signed up with a partner (or told us they want to work alone). You may not work with the same partner for both assessed assignments.

Questions and exercises (lectures 5-6)
1. What does "sparse data" refer to, and why is it important in language modelling?
2. Write down the equations for the Noisy Channel framework and explain what each term refers to for an example task (say, speech recognition).
3. Re-derive the equations for a trigram model without looking at the notes.
4. Given a sentence, show how its probability is computed using a unigram, bigram, or trigram model.
5. Using a unigram model, I compute the probability of the word sequence "the cat bit the dog" as 0.00057. Give another word sequence that has the same probability under this model.
6. Given a probability distribution, compute its entropy.
7. Here are three different distributions (a), (b), (c), each over five outcomes. Which has the highest entropy? The lowest?

   (a)    (b)    (c)
   0.3    0.2    0.7
   0.5    0.25   0.18
   0.1    0.15   0.05
   0.05   0.23   0.04
   0.05   0.17   0.03

8. What is the purpose of the begin/end-of-sentence markers in an n-gram model?
9. Given a text, how would you compute P(to | want) using MLE, and using add-1 smoothing? What about P(want | to)? Which conditional probability, P(to | want) or P(want | to), is needed to compute the probability of "I want to go" under a bigram model?
10. Consider the following trigrams: (a) "private eye maneuvered" and (b) "private car maneuvered". (Note: a "private eye" is slang for a detective.) Suppose that neither of these has been observed in a particular corpus, and we are using backoff to estimate their probabilities. What are the bigrams that we will back off to in each case? In which case is the backoff model likely to provide a more accurate estimate of the trigram probability? Why?
11. Suppose I have a smallish corpus to train my language model, and I'm not sure I have enough data to train a good 4-gram model. So, I want to know whether a trigram model or a 4-gram model is likely to make better predictions on other data similar to my corpus, and by how much. Let's say I only consider two types of smoothing methods: add-alpha smoothing with interpolation, or Kneser-Ney. Describe what I should do to answer my question. What steps should I go through, what experiments do I need to run, and what can I conclude from them?

References
Chen, S. F. and Goodman, J. (1998). An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Center for Research in Computing Technology, Harvard University.
Goldwater, S. (2006). Nonparametric Bayesian Models of Lexical Acquisition. PhD thesis, Brown University.
Goldwater, S., Griffiths, T. L., and Johnson, M. (2006). Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems 18, pages 459-466, Cambridge, MA. MIT Press.
MacKay, D. and Bauman Peto, L. (1994). A hierarchical Dirichlet language model. Natural Language Engineering, 1(1).
Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In INTERSPEECH, pages 1045-1048.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013). Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746-751.
Mnih, A. and Hinton, G. (2007). Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641-648. ACM.
Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 985-992, Sydney, Australia.