Machine Learning for natural language processing

N-grams and language models
Laura Kallmeyer, Heinrich-Heine-Universität Düsseldorf, Summer 2016

Introduction

Goals:
- Estimate the probability that a given sequence of words occurs in a specific language.
- Model the most probable next word for a given sequence of words.

Based on Jurafsky & Martin (2015), chapter 4, and Chen & Goodman (1999).

Table of contents

1. Motivation
2. N-grams
3. Maximum likelihood estimation
4. Evaluating language models
5. Unknown words
6. Smoothing

Motivation

Examples from Jurafsky & Martin (2015):

(1) Please turn your homework ...
    What is a probable continuation? Rather "in" or "over", and not "refrigerator".

(2) a. all of a sudden I notice three guys standing on the sidewalk
    b. on guys all I of notice sidewalk three a sudden standing the
    Which of the two word orders is better?

Language model (LM): a probabilistic model that gives $P(w_1 \dots w_n)$ and $P(w_n \mid w_1 \dots w_{n-1})$.

Motivation

Applications: tasks in which we have to identify words in noisy, ambiguous input, such as speech recognition, handwriting recognition, and spelling correction:

(3) a. their is only one written exam in this class
    b. there is only one written exam in this class

Machine translation: among a series of different word orders in the target language, we have to choose the best one:

(4) a. Das Fahrrad wird er heute reparieren.
    b. The bike will he today repair.
    c. The bike he will today repair.
    d. The bike he will repair today.

N-grams

Notation: $w_1^m = w_1 \dots w_m$.

Question: How can we compute $P(w_1^m)$?

$P(w_1^m) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1^2) \dots P(w_m \mid w_1^{m-1}) = \prod_{k=1}^{m} P(w_k \mid w_1^{k-1})$

But: computing $P(w_k \mid w_1^{k-1})$ for a large $k$ is not feasible.

Approximation of $P(w_k \mid w_1^{k-1})$: n-grams, i.e., look at just the $n-1$ last words: $P(w_k \mid w_{k-n+1}^{k-1})$.

Special cases:
- unigrams: $P(w_k)$
- bigrams: $P(w_k \mid w_{k-1})$
- trigrams: $P(w_k \mid w_{k-2} w_{k-1})$

N-grams

With n-grams, we get

1. Probability of a sequence of words: $P(w_1^l) \approx \prod_{k=1}^{l} P(w_k \mid w_{k-n+1}^{k-1})$
2. Probability of a next word: $P(w_l \mid w_1^{l-1}) \approx P(w_l \mid w_{l-n+1}^{l-1})$

These are strong independence assumptions, called Markov assumptions.

Example (with bigrams): $P(\text{einfach} \mid \text{die Klausur war nicht}) \approx P(\text{einfach} \mid \text{nicht})$

Maximum likelihood estimation (MLE)

Question: How do we estimate the n-gram probabilities?

Maximum likelihood estimation (MLE): Get n-gram counts from a corpus and normalize so that the values lie between 0 and 1.

$P(w_k \mid w_{k-n+1}^{k-1}) = \frac{C(w_{k-n+1}^{k-1} w_k)}{C(w_{k-n+1}^{k-1})}$

In the bigram case, this amounts to

$P(w_k \mid w_{k-1}) = \frac{C(w_{k-1} w_k)}{C(w_{k-1})}$

We augment sentences with an initial <s> and a final </s>.

Maximum likelihood estimation (MLE)

Example from Jurafsky & Martin (2015)

Training data:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

Some bigram probabilities:
$P(\text{I} \mid \text{<s>}) = \frac{2}{3}$, $P(\text{Sam} \mid \text{<s>}) = \frac{1}{3}$, $P(\text{am} \mid \text{I}) = \frac{2}{3}$,
$P(\text{</s>} \mid \text{Sam}) = \frac{1}{2}$, $P(\text{Sam} \mid \text{am}) = \frac{1}{2}$, $P(\text{do} \mid \text{I}) = \frac{1}{3}$
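
To make this concrete, here is a minimal counting sketch in Python (the names p_mle, unigram_counts and bigram_counts are illustrative, not from the slides) that reproduces the bigram probabilities above from the toy corpus; the same counters are reused in later sketches:

```python
from collections import Counter

# Toy training data from the example above, with boundary markers attached.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """MLE bigram estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))   # 2/3
print(p_mle("am", "I"))    # 2/3
print(p_mle("Sam", "am"))  # 1/2
```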

Maximum likelihood estimation (MLE)

Practical issues:
- In practice, n is mostly between 3 and 5, i.e., we use trigrams, 4-grams or 5-grams.
- LM probabilities are always represented as log probabilities. Advantage: adding replaces multiplying, and numerical underflow is avoided.

$p_1 \cdot p_2 \cdots p_l = \exp(\log p_1 + \log p_2 + \dots + \log p_l)$
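
A small illustration of why log space matters; the probability values below are made up purely to show the underflow:

```python
import math

# One hundred hypothetical conditional probabilities of 1e-5 each:
# their product underflows to 0.0, while the sum of logs stays usable.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
print(product)   # 0.0 (underflow)

log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # about -1151.3; comparisons and ranking still work in log space
```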

Maximum likelihood estimation (MLE)

Reminder: $\log 1 = 0$, $\log 0 = -\infty$.

[Plot: $\log_{10} x$ for $x$ between 0 and 1]

Evaluating language models

The data is usually separated into a training set (80% of the data), a test set (10% of the data), and sometimes a development set (10% of the data).

The model is estimated from the training set. The higher the probability of the test set, the better the model.

Instead of measuring the probability of the test set, LMs are usually evaluated with respect to the perplexity of the test set. The higher the probability, the lower the perplexity.

Evaluating language models

The perplexity of a test set $W = w_1 w_2 \dots w_N$ is defined as

$PP(W) = P(W)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(W)}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}} = \sqrt[N]{\prod_{k=1}^{N} \frac{1}{P(w_k \mid w_1^{k-1})}}$

With our n-gram model, we then get for the perplexity:

$PP(W) = \sqrt[N]{\prod_{k=1}^{N} \frac{1}{P(w_k \mid w_{k-n+1}^{k-1})}}$
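
As a sketch, the definition translates directly into code via log probabilities; cond_prob(word, history) stands for any conditional model, e.g. the p_mle bigram sketch from above:

```python
import math

def perplexity(tokens, cond_prob):
    """PP(W) = exp(-1/N * sum_k log P(w_k | w_1 ... w_{k-1}))."""
    N = len(tokens)
    log_prob = sum(math.log(cond_prob(tokens[k], tokens[:k])) for k in range(N))
    return math.exp(-log_prob / N)

# Example: score "I am Sam </s>" with the MLE bigram sketch from above,
# using <s> only as context for the first word.
sent = "<s> I am Sam </s>".split()
bigram = lambda word, history: p_mle(word, history[-1] if history else "<s>")
print(perplexity(sent[1:], bigram))  # (1/9)^(-1/4) = 9^(1/4), roughly 1.73
```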

Evaluating language models

A different way to think about perplexity: it measures the weighted average branching factor of a language.

Example: $L = \{a, b, c, d\}$. Frequencies are such that $P(a) = P(b) = P(c) = P(d) = \frac{1}{4}$ (independent of the context). For any word sequence $w$ over $L$, given this model, we obtain

$PP(w) = \sqrt[|w|]{\prod_{k=1}^{|w|} \frac{1}{1/4}} = \sqrt[|w|]{4^{|w|}} = 4$

The perplexity of any $w$ under this model is 4.

Evaluating language models

Example: $L = \{a, b, c, d\}$. Words in the language contain three times as many a's as they contain b's, c's or d's: $P(a) = \frac{1}{2}$ and $P(b) = P(c) = P(d) = \frac{1}{6}$. For any word sequence $w$ over $L$ with these frequencies and with $|w| = 6n$:

$PP(w) = \sqrt[6n]{\prod_{k=1}^{n} \frac{1}{\frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{6} \cdot \frac{1}{6} \cdot \frac{1}{6}}} = \sqrt[6n]{2^{6n} \cdot 3^{3n}} = 2\sqrt{3} \approx 3.46$

Assume that we use the same model but test it on a $W$ with equal numbers of a's, b's, c's and d's, $|W| = 4n$. Then we get

$PP(W) = \sqrt[4n]{\prod_{k=1}^{n} \frac{1}{\frac{1}{2} \cdot \frac{1}{6} \cdot \frac{1}{6} \cdot \frac{1}{6}}} = \sqrt[4n]{2^{4n} \cdot 3^{3n}} = 2\sqrt[4]{27} \approx 4.56$
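
The two worked results can be checked numerically with the perplexity sketch above, using hypothetical models that ignore the history entirely:

```python
# Frequencies from the example: P(a) = 1/2, P(b) = P(c) = P(d) = 1/6.
freq = {"a": 1/2, "b": 1/6, "c": 1/6, "d": 1/6}
skewed = lambda word, history: freq[word]

print(perplexity(list("aaabcd"), skewed))    # ~3.46 (three a's per b, c, d)
print(perplexity(list("abcd"), skewed))      # ~4.56 (equal frequencies in the test set)

uniform = lambda word, history: 1 / 4
print(perplexity(list("abcddcba"), uniform)) # 4.0 (uniform model, branching factor 4)
```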

Unknown words

Problem: New text can contain unknown words or unseen n-grams. In these cases, with the algorithm seen so far, we would assign probability 0 to the entire text (and we would not be able to compute perplexity at all).

Example from Jurafsky & Martin (2015): words following the bigram "denied the" in WSJ Treebank 3, with counts:

denied the allegations  5
denied the speculation  2
denied the rumors       1
denied the report       1

If the test set contains "denied the offer" or "denied the loan", the model would estimate its probability as 0.

Unknown words

Unknown or out-of-vocabulary words: add a pseudo-word <UNK> to your vocabulary.

Two ways to train the probabilities concerning <UNK>:
1. Choose a vocabulary V fixed in advance. Any word w ∉ V in the training set is converted to <UNK>. Then estimate probabilities for <UNK> as for all other words.
2. Replace the first occurrence of every word w in the training set with <UNK>. Then estimate probabilities for <UNK> as for all other words.
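
A minimal sketch of the second strategy (the function name is illustrative): rewrite the training tokens so that the first occurrence of every word type becomes <UNK>, then train on the rewritten corpus as usual:

```python
def replace_first_occurrences(tokens):
    """Replace the first occurrence of every word type with <UNK>."""
    seen = set()
    rewritten = []
    for tok in tokens:
        if tok in seen:
            rewritten.append(tok)
        else:
            seen.add(tok)
            rewritten.append("<UNK>")
    return rewritten

print(replace_first_occurrences("I am Sam Sam I am".split()))
# ['<UNK>', '<UNK>', '<UNK>', 'Sam', 'I', 'am']
```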

Smoothing

Unseen n-grams: To avoid zero probabilities, we apply smoothing: take off some probability mass from the events seen in training and assign it to unseen events.

Laplace smoothing (or add-one smoothing): Add 1 to the count of all n-grams in the training set before normalizing into probabilities. Not used so much for n-grams, but for other tasks, for instance text classification.

For unigrams, if $N$ is the size of the training set and $|V|$ the size of the vocabulary, we replace $P(w) = \frac{C(w)}{N}$ with

$P_{Laplace}(w) = \frac{C(w) + 1}{N + |V|}$

For bigrams, we replace $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$ with

$P_{Laplace}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + |V|}$
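
A sketch of the Laplace-smoothed bigram estimate, reusing the counters from the MLE sketch earlier (so the vocabulary is the set of word types in that toy corpus):

```python
# Vocabulary = word types observed in training (12 types in the toy corpus).
V = set(unigram_counts)

def p_laplace(word, prev):
    """P_Laplace(word | prev) = (C(prev word) + 1) / (C(prev) + |V|)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + len(V))

print(p_laplace("Sam", "am"))    # (1 + 1) / (2 + 12), roughly 0.14
print(p_laplace("green", "am"))  # unseen bigram, but no longer probability 0
```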

Smoothing

Smoothing methods for n-grams that use the (n-1)-grams, (n-2)-grams, etc.:

Backoff: use the trigram if it has been seen; otherwise fall back to the bigram and, if this has not been seen either, to the unigram.

Interpolation: always use a weighted combination of the trigram, bigram and unigram probabilities.

Linear interpolation:

$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$ with $\sum_i \lambda_i = 1$.
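
A sketch of linear interpolation as a function of three component estimators; the estimators and the weights here are placeholders, and in practice the weights are tuned on held-out data as described just below:

```python
def p_interpolated(word, w1, w2, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """lambda1*P(w|w1 w2) + lambda2*P(w|w2) + lambda3*P(w), with sum(lambdas) = 1."""
    l1, l2, l3 = lambdas
    return l1 * p_tri(word, w1, w2) + l2 * p_bi(word, w2) + l3 * p_uni(word)
```

Since the weights sum to 1 and each component is a distribution over the vocabulary, the interpolated estimate is again a proper probability distribution.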

Smoothing

More sophisticated: each $\lambda$ is computed conditioned on the context.

$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1(w_{n-2} w_{n-1}) P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2(w_{n-2} w_{n-1}) P(w_n \mid w_{n-1}) + \lambda_3(w_{n-2} w_{n-1}) P(w_n)$

In both cases, the probabilities are first estimated from the training corpus, and the $\lambda$ parameters are then estimated from separate held-out data. They are estimated such that they maximize the likelihood of the held-out data. This can be done, for example, using the EM algorithm (to be introduced later in this course).

Smoothing

Most commonly used n-gram smoothing method: the Kneser-Ney algorithm (Kneser & Ney 1995; Chen & Goodman 1999).

Idea: discount the count of an n-gram by the average discount we see in a held-out corpus.

Table from Jurafsky & Martin (2015): average bigram counts in a held-out corpus of 22 million words, for all bigrams in 22 million words of AP newswire training data (Church & Gale, 1991):

Count in training    Average count in held-out
0                    0.0000270
1                    0.448
2                    1.25
3                    2.24
4                    3.23
5                    4.21
6                    5.23
7                    6.21
8                    7.21
9                    8.26

For counts of 2 and above, the held-out average lies roughly 0.75 below the training count, which motivates subtracting a fixed discount d of about 0.75.

Smoothing

Absolute discounting:

$P_{AbsoluteDiscounting}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) - d}{C(w_{i-1})} + \lambda(w_{i-1}) P(w_i)$

Possible discount: $d = 0.75$.

Further refinement: replace $P(w_i)$ with the probability that we see $w_i$ as a continuation. Underlying intuition: if $w_i$ has appeared in many contexts, it is a more likely continuation in a new context.

$P_{continuation}(w_i) = \frac{|\{w : C(w w_i) > 0\}|}{|\{(w_{j-1}, w_j) : C(w_{j-1} w_j) > 0\}|} = \frac{|\{w : C(w w_i) > 0\}|}{\sum_{w_j} |\{w_{j-1} : C(w_{j-1} w_j) > 0\}|}$

Smoothing

Interpolated Kneser-Ney smoothing:

$P_{KN}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1} w_i) - d, 0)}{C(w_{i-1})} + \lambda(w_{i-1}) P_{continuation}(w_i)$

where $\lambda(w_{i-1})$ is a normalizing constant:

$\lambda(w_{i-1}) = \frac{d}{C(w_{i-1})} \cdot |\{w : C(w_{i-1} w) > 0\}|$

Here, $\frac{d}{C(w_{i-1})}$ is the normalized discount, and $|\{w : C(w_{i-1} w) > 0\}|$ is the number of word types that can follow $w_{i-1}$, i.e., the number of times we applied the normalized discount.

Smoothing

Finally, we obtain as a generalization (Chen & Goodman, 1999):

Interpolated Kneser-Ney, for $n > 1$:

$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(C(w_{i-n+1}^{i}) - d, 0)}{C(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1}) P_{KN}(w_i \mid w_{i-n+2}^{i-1})$

and the recursion terminates with

$P_{KN}(w_i) = \frac{|\{w : C(w w_i) > 0\}|}{|\{(w_{j-1}, w_j) : C(w_{j-1} w_j) > 0\}|}$
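
As an illustration, here is a small sketch of interpolated Kneser-Ney for the bigram case (the base case of the recursion above), trained on a flat token stream; it assumes the conditioning word occurred in training and makes no attempt at efficiency:

```python
from collections import Counter

def train_kn_bigram(tokens, d=0.75):
    """Bigram interpolated Kneser-Ney, following the formulas above."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    left_contexts = Counter(w2 for (w1, w2) in bigrams)  # distinct left contexts per word
    num_bigram_types = len(bigrams)

    def p_kn(word, prev):
        # Discounted relative frequency (assumes C(prev) > 0).
        discounted = max(bigrams[(prev, word)] - d, 0) / unigrams[prev]
        # lambda(prev) = normalized discount * number of word types following prev.
        types_after_prev = sum(1 for (w1, _) in bigrams if w1 == prev)
        lam = (d / unigrams[prev]) * types_after_prev
        # Continuation probability: share of bigram types ending in `word`.
        p_cont = left_contexts[word] / num_bigram_types
        return discounted + lam * p_cont

    return p_kn

p_kn = train_kn_bigram("<s> I am Sam </s> <s> Sam I am </s>".split())
print(p_kn("am", "I"))   # seen bigram: discounted count plus backoff mass
print(p_kn("Sam", "I"))  # unseen bigram: only the continuation term contributes
```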

References

Chen, Stanley F. & Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech and Language 13. 359–394.

Church, K. W. & W. A. Gale. 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language 5. 19–54.

Jurafsky, Daniel & James H. Martin. 2015. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Draft of the 3rd edition.

Kneser, R. & H. Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, 181–184. Detroit, MI.