Empirical Methods in Natural Language Processing Lecture 10a: More smoothing and the Noisy Channel Model (most slides from Sharon Goldwater; some adapted from Philipp Koehn). Nathan Schneider, ENLP Lecture 10a, 5 October 2016

Recap: Smoothing for language models. N-gram LMs reduce sparsity by assuming each word depends only on a fixed-length history. But even this assumption isn't enough: we still encounter lots of unseen N-grams in a test set or new corpus. If we use MLE, we'll assign 0 probability to unseen items: useless as an LM. Smoothing solves this problem: move probability mass from seen items to unseen items. Add-α smoothing (α = 1 or α < 1) is very simple, but no good when the vocabulary size is large.
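To make the recap concrete, here is a minimal sketch of add-α smoothing for bigram estimates; the toy corpus, the vocabulary definition, and the α value are illustrative assumptions, not from the lecture.

```python
from collections import Counter

def add_alpha_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size, alpha=0.5):
    """P(w | w_prev) = (count(w_prev, w) + alpha) / (count(w_prev) + alpha * V)."""
    return (bigram_counts[(w_prev, w)] + alpha) / (unigram_counts[w_prev] + alpha * vocab_size)

# Toy corpus, for illustration only
tokens = "the cat sat on the mat".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)   # here: observed types; in practice the vocabulary would include UNK

print(add_alpha_bigram_prob("the", "cat", bigram_counts, unigram_counts, V))  # seen bigram
print(add_alpha_bigram_prob("the", "dog", bigram_counts, unigram_counts, V))  # unseen bigram: still > 0
```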

Good-Turing smoothing: intuition. Estimate the probability of seeing (any) item with c counts (e.g., 0 counts) as the proportion of items already seen with c+1 counts (e.g., 1 count). Divide that probability evenly between all possible items with c counts.

Good-Turing smoothing. If n is the count of the history, then P_GT = c* / n, where c* = (c + 1) N_{c+1} / N_c. Here N_c is the number of N-grams that occur exactly c times in the corpus, and N_0 is the total number of unseen N-grams. Example: for the trigram probability P_GT(three | I spent), n is the count of "I spent" and c is the count of "I spent three".
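A minimal sketch of the adjusted-count computation, assuming simple Good-Turing with no smoothing of the N_c counts themselves; the trigram counts are invented for illustration.

```python
from collections import Counter

def good_turing_adjusted_count(c, count_of_counts):
    """c* = (c + 1) * N_{c+1} / N_c.
    Breaks down if N_{c+1} == 0: the 'holes' problem discussed on the next slide."""
    return (c + 1) * count_of_counts[c + 1] / count_of_counts[c]

# Toy trigram counts, illustrative only
trigram_counts = Counter({
    ("i", "spent", "three"): 1,
    ("i", "spent", "four"): 1,
    ("we", "spent", "three"): 2,
    ("they", "spent", "a"): 3,
})
count_of_counts = Counter(trigram_counts.values())   # N_c: how many trigrams occur exactly c times

c = trigram_counts[("i", "spent", "three")]          # c = 1
c_star = good_turing_adjusted_count(c, count_of_counts)
n = 2                                                # count of the history "i spent" in this toy data
print(c_star / n)                                    # P_GT(three | i spent)
```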

Problems with Good-Turing. Assumes we know the vocabulary size (no unseen words) [but again, use UNK: see J&M 4.3.2]. Doesn't allow "holes" in the counts (requires that if N_i > 0, then N_{i-1} > 0) [can estimate using linear regression: see J&M 4.5.3]. Applies discounts even to high-frequency items [but see J&M 4.5.3]. But there's a more fundamental problem...

Remaining problem. In the training corpus, suppose we see "Scottish beer" but neither "Scottish beer drinkers" nor "Scottish beer eaters". If we build a trigram model smoothed with add-α or Good-Turing, which example has higher probability?

Remaining problem. Previous smoothing methods assign equal probability to all unseen events. Better: use information from lower-order N-grams (shorter histories), e.g. the bigrams "beer drinkers" vs. "beer eaters". Two ways: interpolation and backoff.

Interpolation. Combine higher- and lower-order N-gram models, since they have different strengths and weaknesses: high-order N-grams are sensitive to more context, but have sparse counts; low-order N-grams have limited context, but robust counts. If P_N is the N-gram estimate (from MLE, GT, etc.; N ∈ {1, 2, 3}), use:

P_int(w_3 | w_1, w_2) = λ_1 P_1(w_3) + λ_2 P_2(w_3 | w_2) + λ_3 P_3(w_3 | w_1, w_2)

P_int(three | I, spent) = λ_1 P_1(three) + λ_2 P_2(three | spent) + λ_3 P_3(three | I, spent)
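As a sketch, the interpolation formula translates directly into code; the estimator functions p_uni, p_bi, p_tri and the λ values here are placeholders, not the lecture's.

```python
def interpolated_trigram_prob(w1, w2, w3, p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """P_int(w3 | w1, w2) = l1*P1(w3) + l2*P2(w3 | w2) + l3*P3(w3 | w1, w2).
    p_uni, p_bi, p_tri can be MLE, Good-Turing, or other estimators; the lambdas must sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(w3) + l2 * p_bi(w3, w2) + l3 * p_tri(w3, w1, w2)

# e.g. P_int(three | I, spent), given whatever component estimators are available:
# interpolated_trigram_prob("I", "spent", "three", p_uni, p_bi, p_tri)
```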

Interpolation. Note that the λ_i's must sum to 1:

1 = Σ_{w_3} P_int(w_3 | w_1, w_2)
  = Σ_{w_3} [λ_1 P_1(w_3) + λ_2 P_2(w_3 | w_2) + λ_3 P_3(w_3 | w_1, w_2)]
  = λ_1 Σ_{w_3} P_1(w_3) + λ_2 Σ_{w_3} P_2(w_3 | w_2) + λ_3 Σ_{w_3} P_3(w_3 | w_1, w_2)
  = λ_1 + λ_2 + λ_3

Fitting the interpolation parameters. In general, any weighted combination of distributions is called a mixture model, so the λ_i's are also called interpolation parameters or mixture weights. Their values are chosen to optimize perplexity on a held-out data set.
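One simple way to fit the mixture weights, as a sketch of the idea above, is a brute-force grid search minimizing held-out perplexity; the estimators p_uni, p_bi, p_tri are assumed to be supplied from training data.

```python
import math

def heldout_perplexity(lambdas, heldout_trigrams, p_uni, p_bi, p_tri):
    """Perplexity of held-out trigrams under the interpolated model."""
    l1, l2, l3 = lambdas
    total_log_prob = 0.0
    for w1, w2, w3 in heldout_trigrams:
        p = l1 * p_uni(w3) + l2 * p_bi(w3, w2) + l3 * p_tri(w3, w1, w2)
        if p <= 0:                    # e.g. unigram weight is 0 and the higher orders are unseen
            return float("inf")
        total_log_prob += math.log2(p)
    return 2 ** (-total_log_prob / len(heldout_trigrams))

def best_lambdas(heldout_trigrams, p_uni, p_bi, p_tri, steps=10):
    """Brute-force grid search over mixture weights (l1, l2, l3) summing to 1."""
    candidates = [(i / steps, j / steps, (steps - i - j) / steps)
                  for i in range(steps + 1) for j in range(steps + 1 - i)]
    return min(candidates,
               key=lambda ls: heldout_perplexity(ls, heldout_trigrams, p_uni, p_bi, p_tri))
```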

Back-off. Trust the highest-order language model that contains the N-gram:

P_BO(w_i | w_{i-N+1}, ..., w_{i-1})
  = P*(w_i | w_{i-N+1}, ..., w_{i-1})                                  if count(w_{i-N+1}, ..., w_i) > 0
  = α(w_{i-N+1}, ..., w_{i-1}) · P_BO(w_i | w_{i-N+2}, ..., w_{i-1})   otherwise

Back-off. Requires an adjusted prediction model P*(w_i | w_{i-N+1}, ..., w_{i-1}) and backoff weights α(w_1, ..., w_{N-1}). The exact equations get complicated in order to make the probabilities sum to 1; see the textbook for details if interested.
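A sketch of the back-off control flow only; the adjusted model p_star and the weights alpha are assumed to be precomputed (e.g., with Good-Turing discounting) so that probabilities sum to 1, which is exactly the complication the slide alludes to.

```python
def backoff_prob(word, history, counts, p_star, alpha):
    """Recursive back-off: use the adjusted estimate P* when the full N-gram was seen in
    training; otherwise scale the lower-order back-off estimate by the weight alpha(history).
    counts maps N-gram tuples to training counts; p_star(word, history) and alpha[history]
    are assumed precomputed so the distribution normalizes."""
    if not history:                                     # base case: (adjusted) unigram estimate
        return p_star(word, ())
    if counts.get(history + (word,), 0) > 0:            # full N-gram observed in training
        return p_star(word, history)
    # back off to a shorter history, weighted by the mass reserved for unseen continuations
    return alpha.get(history, 1.0) * backoff_prob(word, history[1:], counts, p_star, alpha)
```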

Do our smoothing methods work here? Example from MacKay and Peto (1995): "Imagine, you see, that the language, you see, has, you see, a frequently occurring couplet, you see, you see, in which the second word of the couplet, see, follows the first word, you, with very high probability, you see. Then the marginal statistics, you see, are going to become hugely dominated, you see, by the words you and see, with equal frequency, you see." P(see) and P(you) are both high, but see nearly always follows you. So P(see | novel) should be much lower than P(you | novel).

Diversity of histories matters! A real example: the word York is a fairly frequent word in the Europarl corpus, occurring 477 times, as frequent as foods, indicates and providers, so in a unigram language model it gets a respectable probability. However, it almost always directly follows New (473 times). So in unseen bigram contexts, York should have low probability: lower than predicted by the unigram model as used in interpolation/backoff.

Kneser-Ney Smoothing. Kneser-Ney smoothing takes the diversity of histories into account. Count of distinct histories for a word: N_{1+}(• w_i) = |{w_{i-1} : c(w_{i-1}, w_i) > 0}|. Recall the maximum likelihood estimate of a unigram language model: P_ML(w_i) = C(w_i) / Σ_w C(w). In KN smoothing, replace raw counts with the count of histories: P_KN(w_i) = N_{1+}(• w_i) / Σ_w N_{1+}(• w).
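A minimal sketch of the continuation-count idea at the unigram level, using invented bigram counts that mimic the York example.

```python
from collections import Counter

def kn_unigram_probs(bigram_counts):
    """P_KN(w) = N_1+(. w) / sum_w' N_1+(. w'):
    each word is scored by how many distinct words precede it, not by how often it occurs."""
    continuation = Counter(w2 for (w1, w2), c in bigram_counts.items() if c > 0)
    total = sum(continuation.values())
    return {w: n / total for w, n in continuation.items()}

# Toy data: "york" is frequent but almost always follows "new"
bigram_counts = Counter({("new", "york"): 473, ("in", "york"): 4,
                         ("the", "food"): 5, ("some", "food"): 3, ("good", "food"): 2})
probs = kn_unigram_probs(bigram_counts)
print(probs["york"], probs["food"])   # "food" gets the higher continuation probability
```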

Kneser-Ney in practice. The original version used backoff; modified Kneser-Ney, introduced later, uses interpolation. Fairly complex equations, but until recently the best smoothing method for word n-grams. See Chen and Goodman (1999) for extensive comparisons of KN and other smoothing methods. KN (and other methods) is implemented in language modelling toolkits like SRILM (classic), KenLM (good for really big models), the OpenGrm NGram library (uses finite-state transducers), etc.

KenLM. KenLM, by Kenneth Heafield, is a highly efficient language modeling toolkit. https://kheafield.com/code/kenlm/

Are we done with smoothing yet? We've considered methods that predict rare/unseen words using: uniform probabilities (add-α, Good-Turing); probabilities from lower-order n-grams (interpolation, backoff); probability of appearing in new contexts (Kneser-Ney). What's left?

Word similarity. Suppose we have two words with C(w_1) ≫ C(w_2), e.g. salmon and swordfish. Can P(salmon | caught two) tell us anything about P(swordfish | caught two)? N-gram models: no.

Word similarity in language modeling. Early version: class-based language models (J&M 4.9.2). Define classes c of words, by hand or automatically: P_CL(w_i | w_{i-1}) = P(c_i | c_{i-1}) P(w_i | c_i) (an HMM). Recent version: distributed language models. Current models have better perplexity than MKN. Ongoing research to make them more efficient. Examples: Recurrent Neural Network LM (Mikolov et al., 2010), Log-Bilinear LM (Mnih and Hinton, 2007) and extensions.
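A sketch of the class-based bigram formula with hard class assignments; the classes and probability tables below are made up for illustration.

```python
def class_bigram_prob(w_prev, w, word2class, p_class_trans, p_word_given_class):
    """P_CL(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i), with one class per word."""
    c_prev, c = word2class[w_prev], word2class[w]
    return p_class_trans[(c_prev, c)] * p_word_given_class[(w, c)]

# Toy hand-defined classes and probabilities, illustrative only
word2class = {"salmon": "FISH", "swordfish": "FISH", "caught": "VERB"}
p_class_trans = {("VERB", "FISH"): 0.3}                      # P(c_i | c_{i-1})
p_word_given_class = {("salmon", "FISH"): 0.6, ("swordfish", "FISH"): 0.1}   # P(w_i | c_i)

print(class_bigram_prob("caught", "swordfish", word2class, p_class_trans, p_word_given_class))
```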

Distributed word representations. Each word is represented as a high-dimensional vector (50-500 dims), e.g., salmon is [0.1, 2.3, 0.6, 4.7, ...]. Similar words are represented by similar vectors, e.g., swordfish is [0.3, 2.2, 1.2, 3.6, ...]. More about this later in the course.
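Similarity between such vectors is commonly measured with cosine similarity (not stated on the slide, but standard practice); a minimal sketch using the toy vectors above.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

salmon    = [0.1, 2.3, 0.6, 4.7]
swordfish = [0.3, 2.2, 1.2, 3.6]
print(cosine_similarity(salmon, swordfish))   # close to 1: similar words, similar vectors
```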

Training the model. Goal: learn word representations (embeddings) such that words that behave similarly are close together in high-dimensional space. 2-dimensional example: [figure not reproduced].

Training the model. N-gram LM: collect counts, maybe optimize some parameters; (relatively) quick, especially these days (minutes to hours). Distributed LM: learn the representation for each word, using ML methods like neural networks that iteratively improve the embeddings; can be extremely time-consuming (hours to days). Learned embeddings capture both semantic and syntactic similarity.

Using the model. We want to compute P(w_1 ... w_n) for a new sequence. N-gram LM: again, relatively quick. Distributed LM: often prohibitively slow for real applications; speeding this up is an active area of research for distributed LMs.

Other topics in language modeling. Many active research areas! Modeling issues: morpheme-based language models; syntactic language models; domain adaptation, when only a small corpus is available in the domain of interest. Implementation issues: speed, both to train and to use in real-time applications like translation and speech recognition; disk space and memory, especially important for mobile devices.

Back to the big picture. However we train our LM, we will want to use it in some application. Now, a bit more detail about how that can work. We need another concept from information theory: the Noisy Channel Model.

Noisy channel model. We imagine that someone tries to communicate a sequence to us, but noise is introduced, and we only see the output sequence:

    symbol sequence P(Y)  →  noisy/errorful encoding P(X|Y)  →  output sequence P(X)

Application           Y                X
Speech recognition    spoken words     acoustic signal
Machine translation   words in L_1     words in L_2
Spelling correction   intended words   typed words

Example: spelling correction. P(Y): distribution over the words the user intended to type. A language model! P(X|Y): distribution describing what the user is likely to type, given what they meant. Could incorporate information about common spelling errors, key positions, etc. Call it a noise model. P(X): resulting distribution over what we actually see. Given some particular observation x (say, effert), we want to recover the most probable y that was intended.

Noisy channel as probabilistic inference. Mathematically, what we want is argmax_y P(y | x), read as "the y that maximizes P(y | x)". Rewrite using Bayes' Rule:

argmax_y P(y | x) = argmax_y P(x | y) P(y) / P(x)
                  = argmax_y P(x | y) P(y)
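A minimal sketch of this inference for spelling correction; the candidate list, language model, and noise model below are toy placeholders rather than the models the lecture has in mind.

```python
def correct(observed, candidates, lm_prob, noise_prob):
    """Return argmax_y P(x | y) * P(y) over a candidate set of intended words."""
    return max(candidates, key=lambda y: noise_prob(observed, y) * lm_prob(y))

# Toy models, for illustration only
lm = {"effort": 3e-4, "effect": 1e-4, "egret": 1e-6}           # P(y): unigram LM probabilities
def lm_prob(y):
    return lm.get(y, 1e-9)

def noise_prob(x, y):                                          # P(x | y): crude edit-based noise model
    mismatches = sum(a != b for a, b in zip(x, y)) + abs(len(x) - len(y))
    return 0.1 ** mismatches

print(correct("effert", lm.keys(), lm_prob, noise_prob))       # -> "effort"
```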

Noisy channel as probabilistic inference. So to recover the best y, we will need: a language model, which will be fairly similar across different applications; and a noise model, which depends on the application (acoustic model, translation model, misspelling model, etc.). Both are normally trained on corpus data.

You may be wondering: if we can train P(X|Y), why can't we just train P(Y|X)? Who needs Bayes' Rule? Answer 1: sometimes we do train P(Y|X) directly. Stay tuned... Answer 2: training P(X|Y) or P(Y|X) requires input/output pairs, which are often limited: misspelled words with their corrections, transcribed speech, translated text. But LMs can be trained on huge unannotated corpora, giving a better model, which can help improve overall performance.

Model versus algorithm. We defined a probabilistic model, which says what we should do. E.g., for spelling correction: given a trained LM and noise model (we haven't said yet how to train the NM), find the intended word that is most probable given the observed word. We haven't considered how we would do that. It is a search problem: there are (infinitely) many possible inputs that could have generated what we saw; which one is best? We need to design an algorithm that can solve the problem.

Summary. Different smoothing methods account for different aspects of sparse data and word behaviour. Interpolation/backoff: leverage the advantages of both higher- and lower-order N-grams. Kneser-Ney smoothing: accounts for diversity of histories. Distributed representations: account for word similarity. The noisy channel model combines an LM with a noise model to define the best solution for many applications. Next time, we will consider how to find this solution.

References

Chen, S. F. and Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359-394.

MacKay, D. J. C. and Peto, L. C. B. (1995). A hierarchical Dirichlet language model. Natural Language Engineering, 1(3):289-308.

Mikolov, T., Karafiát, M., Burget, L., Černocký, J. H., and Khudanpur, S. (2010). Recurrent neural network based language model. In INTERSPEECH, pages 1045-1048, Makuhari, Chiba, Japan.

Mnih, A. and Hinton, G. (2007). Three new graphical models for statistical language modelling. In Proc. of ICML, pages 641-648, Corvallis, Oregon, USA.