Foundations of Natural Language Processing, Lecture 5: More smoothing and the Noisy Channel Model

Foundations of Natural Language Processing Lecture 5: More smoothing and the Noisy Channel Model. Alex Lascarides (slides based on those from Alex Lascarides, Sharon Goldwater and Philipp Koehn). 30 January 2018.

Recap: Smoothing for language models. N-gram LMs reduce sparsity by assuming each word depends only on a fixed-length history. But even this assumption isn't enough: we still encounter lots of unseen N-grams in a test set or new corpus. If we use MLE, we'll assign 0 probability to unseen items, which makes the model useless as an LM. Smoothing solves this problem: it moves probability mass from seen items to unseen items.

Smoothing methods so far. Add-α smoothing (α = 1 or α < 1): very simple, but no good when the vocabulary size is large. Good-Turing smoothing: estimate the probability of seeing (any) item with count c (e.g., count 0) from the proportion of items already seen with count c+1 (e.g., count 1), then divide that probability evenly between all possible items with count c.

Good-Turing smoothing. If n is the count of the history, then

P_GT = c* / n,   where   c* = (c + 1) N_{c+1} / N_c

N_c: number of N-grams that occur exactly c times in the corpus.
N_0: total number of unseen N-grams.

Example: for the trigram probability P_GT(three | I spent), n is the count of "I spent" and c is the count of "I spent three".
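The adjusted counts can be computed directly from a table of N-gram counts. Below is a minimal Python sketch (not from the lecture) that builds the count-of-counts table N_c and applies c* = (c+1) N_{c+1} / N_c; the trigram counts for the history "I spent" are made-up illustrative numbers, and the sketch simply falls back to the raw count when N_{c+1} = 0, i.e. it ignores the "holes" problem discussed on the next slide.

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts):
    """Compute Good-Turing adjusted counts c* = (c+1) * N_{c+1} / N_c.

    A minimal sketch: it assumes no 'holes' in the count-of-counts
    (N_{c+1} > 0 whenever N_c > 0), which real data rarely satisfies
    without smoothing the N_c values themselves.
    """
    # N_c: how many distinct N-grams occur exactly c times
    count_of_counts = Counter(ngram_counts.values())
    adjusted = {}
    for ngram, c in ngram_counts.items():
        n_c = count_of_counts[c]
        n_c1 = count_of_counts.get(c + 1, 0)
        adjusted[ngram] = (c + 1) * n_c1 / n_c if n_c1 > 0 else c
    return adjusted

# Hypothetical trigram counts for the history "I spent"
counts = {("I", "spent", "three"): 2, ("I", "spent", "a"): 2, ("I", "spent", "my"): 1}
c_star = good_turing_adjusted_counts(counts)
n = sum(counts.values())                    # count of the history "I spent"
p_gt = c_star[("I", "spent", "three")] / n  # P_GT(three | I spent)
```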

Problems with Good-Turing. Assumes we know the vocabulary size (no unseen words) [but again, use UNK: see J&M 4.3.2]. Doesn't allow 'holes' in the counts (if N_i > 0, then N_{i-1} must also be > 0) [can estimate using linear regression: see J&M 4.5.3]. Applies discounts even to high-frequency items [but see J&M 4.5.3]. But there's a more fundamental problem...

Remaining problem. In the training corpus, suppose we see "Scottish beer" but neither "Scottish beer drinkers" nor "Scottish beer eaters". If we build a trigram model smoothed with Add-α or G-T, which example has higher probability?

Remaining problem. Previous smoothing methods assign equal probability to all unseen events. Better: use information from lower-order N-grams (shorter histories), e.g. "beer drinkers" vs. "beer eaters". Two ways: interpolation and backoff.

Interpolation. Combine higher- and lower-order N-gram models, since they have different strengths and weaknesses: high-order N-grams are sensitive to more context, but have sparse counts; low-order N-grams have limited context, but robust counts. If P_N is the N-gram estimate (from MLE, GT, etc.; N = 1..3), use:

P_int(w_3 | w_1, w_2) = λ_1 P_1(w_3) + λ_2 P_2(w_3 | w_2) + λ_3 P_3(w_3 | w_1, w_2)

P_int(three | I, spent) = λ_1 P_1(three) + λ_2 P_2(three | spent) + λ_3 P_3(three | I, spent)
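As a concrete illustration, here is a small Python sketch of the interpolation formula; p1, p2 and p3 stand for pre-estimated unigram, bigram and trigram models (MLE, GT, etc.), and the lambda values are placeholders rather than tuned weights.

```python
def interpolate(w1, w2, w3, p1, p2, p3, lambdas=(0.2, 0.3, 0.5)):
    """Linear interpolation of unigram, bigram and trigram estimates.

    p1, p2, p3 are assumed to be pre-estimated probability functions
    (e.g. MLE or Good-Turing); the lambda values here are placeholders
    and would normally be tuned on held-out data.
    """
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "mixture weights must sum to 1"
    return l1 * p1(w3) + l2 * p2(w3, w2) + l3 * p3(w3, w1, w2)
```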

Interpolation. Note that the λ_i's must sum to 1:

1 = Σ_{w_3} P_int(w_3 | w_1, w_2)
  = Σ_{w_3} [λ_1 P_1(w_3) + λ_2 P_2(w_3 | w_2) + λ_3 P_3(w_3 | w_1, w_2)]
  = λ_1 Σ_{w_3} P_1(w_3) + λ_2 Σ_{w_3} P_2(w_3 | w_2) + λ_3 Σ_{w_3} P_3(w_3 | w_1, w_2)
  = λ_1 + λ_2 + λ_3

Fitting the interpolation parameters. In general, any weighted combination of distributions is called a mixture model, so the λ_i's are interpolation parameters or mixture weights. The values of the λ_i's are chosen to optimize perplexity on a held-out data set.
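One simple way to choose the weights, sketched below, is a brute-force grid search that evaluates held-out perplexity for each candidate (λ_1, λ_2, λ_3) on the simplex; the function names and the held-out data format are assumptions for illustration, and in practice the weights are usually fitted with EM rather than grid search.

```python
import itertools
import math

def fit_lambdas(heldout_trigrams, p1, p2, p3, steps=10):
    """Brute-force grid search for interpolation weights on held-out data.

    heldout_trigrams is a list of (w1, w2, w3) tuples; p1/p2/p3 are the
    pre-trained component models. Assumes the interpolated probability is
    never zero on the held-out set (otherwise the log would fail).
    """
    grid = [i / steps for i in range(steps + 1)]
    best, best_ppl = None, float("inf")
    for l1, l2 in itertools.product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-12:          # outside the simplex
            continue
        l3 = max(l3, 0.0)
        logprob = sum(math.log(l1 * p1(w3) + l2 * p2(w3, w2) + l3 * p3(w3, w1, w2))
                      for w1, w2, w3 in heldout_trigrams)
        ppl = math.exp(-logprob / len(heldout_trigrams))  # perplexity = exp(-avg log-likelihood)
        if ppl < best_ppl:
            best, best_ppl = (l1, l2, l3), ppl
    return best
```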

Katz Back-Off. Solve the problem in a similar way to Good-Turing smoothing: discount the trigram-based probability estimates, which leaves some probability mass to share among the estimates from the lower-order model(s). Katz backoff: Good-Turing-discount the observed counts, but instead of using the saved mass for unseen items, use it for the backoff estimates.

Back-Off Formulae. Trust the highest-order language model that contains the N-gram:

P_BO(w_i | w_{i-N+1}, ..., w_{i-1}) =
  P*(w_i | w_{i-N+1}, ..., w_{i-1})                                 if count(w_{i-N+1}, ..., w_i) > 0
  α(w_{i-N+1}, ..., w_{i-1}) · P_BO(w_i | w_{i-N+2}, ..., w_{i-1})  otherwise

Back-Off. Requires an adjusted prediction model P*(w_i | w_{i-N+1}, ..., w_{i-1}) and backoff weights α(w_1, ..., w_{N-1}). The exact equations get complicated because the probabilities must sum to 1; see the textbook for details if interested.
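To make the recursion concrete, here is a hedged Python sketch of the back-off definition above; `p_star` (the discounted estimate) and `alpha` (the left-over mass per history) are assumed to be precomputed elsewhere so that the distribution normalises, which is exactly the complication the slide alludes to.

```python
def p_backoff(word, history, counts, p_star, alpha):
    """Katz back-off probability, following the recursive definition above.

    A sketch only: p_star(word, history) is the discounted (e.g. Good-Turing)
    estimate and alpha(history) is the left-over mass for that history; both
    are assumed to be precomputed so that probabilities sum to one.
    """
    if counts.get(tuple(history) + (word,), 0) > 0:
        return p_star(word, history)           # the N-gram was observed: use the discounted estimate
    if not history:                             # no shorter history left: fall back to the unigram estimate
        return p_star(word, history)
    # back off: drop the earliest context word and rescale by alpha
    return alpha(tuple(history)) * p_backoff(word, history[1:], counts, p_star, alpha)
```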

Do our smoothing methods work here? Example from MacKay and Peto (1995): "Imagine, you see, that the language, you see, has, you see, a frequently occurring couplet, you see, you see, in which the second word of the couplet, see, follows the first word, you, with very high probability, you see. Then the marginal statistics, you see, are going to become hugely dominated, you see, by the words you and see, with equal frequency, you see." P(see) and P(you) are both high, but "see" nearly always follows "you". So P(see | novel) should be much lower than P(you | novel).

Diversity of histories matters! A real example: the word "York" is a fairly frequent word in the Europarl corpus, occurring 477 times, as frequent as "foods", "indicates" and "providers", so in a unigram language model it gets a respectable probability. However, it almost always directly follows "New" (473 times). So, in unseen bigram contexts, "York" should have low probability: lower than predicted by the unigram model as used in interpolation/backoff.

Kneser-Ney Smoothing. Kneser-Ney smoothing takes the diversity of histories into account. Count of distinct histories for a word:

N_{1+}(• w_i) = |{w_{i-1} : c(w_{i-1}, w_i) > 0}|

Recall the maximum likelihood estimate of the unigram language model:

P_ML(w_i) = C(w_i) / Σ_w C(w)

In KN smoothing, replace raw counts with the count of histories:

P_KN(w_i) = N_{1+}(• w_i) / Σ_w N_{1+}(• w)
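A minimal Python sketch of the continuation-count idea for the unigram case is shown below; the bigram counts are toy numbers chosen to mimic the "York" example, not real Europarl statistics.

```python
from collections import defaultdict

def kn_unigram_probs(bigram_counts):
    """Kneser-Ney continuation probability for unigrams.

    P_KN(w) = N_1+(. w) / sum_w' N_1+(. w'), where N_1+(. w) is the number
    of distinct words that precede w in the training bigrams.
    bigram_counts maps (w_prev, w) pairs to counts.
    """
    preceding = defaultdict(set)          # w -> set of distinct left contexts
    for (w_prev, w), c in bigram_counts.items():
        if c > 0:
            preceding[w].add(w_prev)
    total = sum(len(s) for s in preceding.values())
    return {w: len(s) / total for w, s in preceding.items()}

# Toy illustration of the "York" effect: York is frequent but almost always follows "New"
bigrams = {("New", "York"): 473, ("in", "York"): 4,
           ("fresh", "foods"): 2, ("frozen", "foods"): 2, ("healthy", "foods"): 2}
probs = kn_unigram_probs(bigrams)   # "foods" now outranks "York" despite its lower raw count
```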

Kneser-Ney in practice. The original version used backoff; later, modified Kneser-Ney was introduced using interpolation. Fairly complex equations, but until recently the best smoothing method for word n-grams. See Chen and Goodman (1999) for extensive comparisons of KN and other smoothing methods. KN (and other methods) are implemented in language modelling toolkits like SRILM (classic), KenLM (good for really big models), the OpenGrm Ngram library (uses finite state transducers), etc.

Are we done with smoothing yet? We've considered methods that predict rare/unseen words using: uniform probabilities (add-α, Good-Turing); probabilities from lower-order n-grams (interpolation, backoff); the probability of appearing in new contexts (Kneser-Ney). What's left?

Word similarity. Suppose we have two words with similar counts, C(w_1) ≈ C(w_2): salmon and swordfish. Can P(salmon | caught two) tell us anything about P(swordfish | caught two)? For N-gram models: no.

Word similarity in language modeling. Early version: class-based language models (J&M 4.9.2). Define classes c of words, by hand or automatically, and use P_CL(w_i | w_{i-1}) = P(c_i | c_{i-1}) P(w_i | c_i) (an HMM). Recent version: distributed language models. Current models have better perplexity than MKN; ongoing research aims to make them more efficient. Examples: the Recurrent Neural Network LM (Mikolov et al., 2010), the Log-Bilinear LM (Mnih and Hinton, 2007), and extensions.

Distributed word representations. Each word is represented as a high-dimensional vector (50-500 dims), e.g., salmon is [0.1, 2.3, 0.6, 4.7, ...]. Similar words are represented by similar vectors, e.g., swordfish is [0.3, 2.2, 1.2, 3.6, ...]. More about this later in the course.
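Similarity between such vectors is typically measured with cosine similarity; the sketch below uses the toy 4-dimensional vectors from this slide purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors (illustrative values only)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

salmon    = [0.1, 2.3, 0.6, 4.7]   # toy 4-dimensional embeddings;
swordfish = [0.3, 2.2, 1.2, 3.6]   # real models use 50-500 dimensions
print(cosine(salmon, swordfish))   # close to 1.0 for similar words
```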

Training the model. Goal: learn word representations (embeddings) such that words that behave similarly are close together in high-dimensional space. [Figure: 2-dimensional example of an embedding space.]

Training the model. N-gram LM: collect counts, maybe optimize some parameters; (relatively) quick, especially these days (minutes to hours). Distributed LM: learn the representation for each word, using ML methods like neural networks that iteratively improve the embeddings; can be extremely time-consuming (hours to days). Learned embeddings capture both semantic and syntactic similarity.

Using the model. We want to compute P(w_1 ... w_n) for a new sequence. N-gram LM: again, relatively quick. Distributed LM: often prohibitively slow for real applications; making this practical is an active area of research for distributed LMs.

Other Topics in Language Modeling. Many active research areas! Modeling issues: morpheme-based language models; syntactic language models; domain adaptation, when only a small corpus is available in the domain of interest. Implementation issues: speed, both to train and to use in real-time applications like translation and speech recognition; disk space and memory, especially important for mobile devices.

Back to the big picture. However we train our LM, we will want to use it in some application. Now, a bit more detail about how that can work. We need another concept from information theory: the Noisy Channel Model.

Noisy channel model. We imagine that someone tries to communicate a sequence to us, but noise is introduced; we only see the output sequence. [Diagram: symbol sequence P(Y) → noisy/errorful encoding P(X|Y) → output sequence P(X).]

Noisy channel model. The same setup, instantiated for different applications:

Application: Y → X
Speech recognition: spoken words → acoustic signal
Machine translation: words in L_1 → words in L_2
Spelling correction: intended words → typed words

Example: spelling correction. P(Y): distribution over the words the user intended to type. A language model! P(X|Y): distribution describing what the user is likely to type, given what they meant. Could incorporate information about common spelling errors, key positions, etc. Call it a noise model. P(X): resulting distribution over what we actually see. Given some particular observation x (say, effert), we want to recover the most probable y that was intended.

Noisy channel as probabilistic inference. Mathematically, what we want is argmax_y P(y|x), read as "the y that maximizes P(y|x)". Rewrite using Bayes' Rule:

argmax_y P(y|x) = argmax_y P(x|y) P(y) / P(x) = argmax_y P(x|y) P(y)
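Applied to spelling correction, the decision rule is just an argmax over candidate intended words. In the Python sketch below, `candidates`, `language_model` and `noise_model` are hypothetical components standing in for the pieces described on the previous slides, not functions from any particular library.

```python
def correct(x, candidates, language_model, noise_model):
    """Noisy-channel decoding: argmax_y P(x|y) P(y).

    A sketch under the assumptions that candidates(x) enumerates plausible
    intended words y for the observed string x, language_model(y) returns
    P(y), and noise_model(x, y) returns P(x|y).
    """
    return max(candidates(x), key=lambda y: noise_model(x, y) * language_model(y))

# e.g. correct("effert", candidates, lm, noise) might return "effort"
```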

Noisy channel as probabilistic inference. So to recover the best y, we will need: a language model, which will be fairly similar for different applications; and a noise model, which depends on the application (acoustic model, translation model, misspelling model, etc.). Both are normally trained on corpus data.

You may be wondering: if we can train P(X|Y), why can't we just train P(Y|X)? Who needs Bayes' Rule? Answer 1: sometimes we do train P(Y|X) directly. Stay tuned... Answer 2: training P(X|Y) or P(Y|X) requires input/output pairs, which are often limited: misspelled words with their corrections, transcribed speech, translated text. But LMs can be trained on huge unannotated corpora, giving a better model, which can help improve overall performance.

Model versus algorithm. We defined a probabilistic model, which says what we should do. E.g., for spelling correction: given a trained LM and noise model (we haven't yet said how to train the NM), find the intended word that is most probable given the observed word. We haven't considered how we would do that. It is a search problem: there are (infinitely) many possible inputs that could have generated what we saw; which one is best? We need to design an algorithm that can solve the problem.

Summary. Different smoothing methods account for different aspects of sparse data and word behaviour. Interpolation/backoff: leverage the advantages of both higher- and lower-order N-grams. Kneser-Ney smoothing: accounts for the diversity of histories. Distributed representations: account for word similarity. The noisy channel model combines an LM with a noise model to define the best solution for many applications. Next time, we will consider how to find this solution.

References:
Chen, S. F. and Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–394.
MacKay, D. J. C. and Peto, L. C. B. (1995). A hierarchical Dirichlet language model. Natural Language Engineering, 1(3):289–308.
Mikolov, T., Karafiát, M., Burget, L., Černocký, J. H., and Khudanpur, S. (2010). Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048, Makuhari, Chiba, Japan.
Mnih, A. and Hinton, G. (2007). Three new graphical models for statistical language modelling. In Proc. of ICML, pages 641–648, Corvallis, Oregon, USA.