ACS Introduction to NLP Lecture 3: Language Modelling and Smoothing

Stephen Clark
Natural Language and Information Processing (NLIP) Group
sc609@cam.ac.uk

Language Modelling

A language model is a probability distribution over strings.
A core component of many language processing systems: speech recognition, handwriting recognition, machine translation, ...
Many of the estimation techniques carry over to other NLP problems: POS tagging, parsing, ...

Classic N-gram Language Models

P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_n | w_1, ..., w_{n-1})
                 ≈ P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_n | w_{n-2}, w_{n-1})

Chain rule followed by independence assumptions; here the independence assumptions give a trigram language model.
Notice the chain rule is exact.
Note also that we don't have to apply the chain rule using this ordering of the words.
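
To make the factorisation concrete, here is a minimal Python sketch (not from the lecture) that estimates trigram probabilities by relative frequency and multiplies them to score a sentence; the toy corpus and the <s>/</s> padding symbols are illustrative assumptions.

```python
from collections import Counter

# A toy sketch of the trigram factorisation above: relative-frequency trigram
# estimates multiplied together to score a sentence. The corpus and the
# <s>/</s> padding symbols are illustrative assumptions.
corpus = [["the", "cat", "sat"], ["the", "cat", "slept"]]

trigram_counts, bigram_counts = Counter(), Counter()
for sent in corpus:
    padded = ["<s>", "<s>"] + sent + ["</s>"]
    for i in range(2, len(padded)):
        trigram_counts[tuple(padded[i-2:i+1])] += 1
        bigram_counts[tuple(padded[i-2:i])] += 1

def trigram_prob(w1, w2, w3):
    """Relative-frequency estimate of P(w3 | w1, w2)."""
    if bigram_counts[(w1, w2)] == 0:
        return 0.0  # undefined without smoothing; treated as zero here
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

def sentence_prob(sent):
    """P(w_1, ..., w_n) under the trigram independence assumption."""
    padded = ["<s>", "<s>"] + sent + ["</s>"]
    p = 1.0
    for i in range(2, len(padded)):
        p *= trigram_prob(padded[i-2], padded[i-1], padded[i])
    return p

print(sentence_prob(["the", "cat", "sat"]))  # 0.5 on this toy corpus
```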

Philosophical Aside: what do the probabilities mean?

I am happy with the idea of the following probabilities:
a fair coin landing heads = 1/2 (frequentist)
Red Rum winning the Grand National = 1/5 (subjectivist)
rain tomorrow = 5/6 (subjectivist)
a Radium atom decaying in the next 30s = 1/2 (objectivist)

But what about P(pig)?
Perhaps conditional probabilities are more intuitive: P(pig | the, fast)?
Probabilities of words reflect the underlying statistical distribution of the text.

Statistical Methods in NLP: Chomsky

"It must be recognised that the notion of a probability of a sentence is an entirely useless one, under any interpretation of this term" (Chomsky, 1969)

[taken from Chapter 1 of Young and Bloothooft, eds, Corpus-Based Methods in Language and Speech Processing]
[Colorless green ideas sleep furiously example]

It's all about Prediction

"Statistics is the science of learning from observations and experience" (Ney, 1997)

"By chance we mean something like a guess. Why do we make guesses? We make guesses when we wish to make a judgement but have incomplete information or uncertain knowledge. We want to make a guess as to what things are, or what things are likely to happen. Often we wish to make a guess because we have to make a decision... Sometimes we make guesses because we wish, with our limited knowledge, to say as much as we can about some situation. Really, any generalization is in the nature of a guess. Any physical theory is a kind of guesswork. There are good guesses and there are bad guesses. The theory of probability is a system for making better guesses. The language of probability allows us to speak quantitatively about some situation which may be highly variable, but which does have some consistent average behavior." (Guess who?)

[taken from Chapter 1 of Young and Bloothooft, eds, Corpus-Based Methods in Language and Speech Processing]

Overfitting

Maximum likelihood estimation (relative frequency estimation in our case) can suffer from overfitting.
Because we are choosing parameters to make the data as probable as possible, the resulting model can look too much like the data.
The classic example of this occurs in language modelling with a word we haven't seen before.

The Problem with Zeros

P̂(w_3 | w_2, w_1) = f(w_1, w_2, w_3) / f(w_1, w_2)

If w_1 followed by w_2 is unseen then the estimate is undefined.
If w_1 followed by w_2 followed by w_3 is unseen then the estimate is zero.
Zeros propagate through the product to give a zero probability for the whole string.
Smoothing techniques are designed to solve this problem (called smoothing because the resulting distributions tend to be more uniform).
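
Continuing the toy sketch from the trigram slide, a quick check of how a single zero estimate wipes out the whole product (the test words here are again made up):

```python
# Reuses trigram_prob and sentence_prob from the toy trigram sketch above.
print(trigram_prob("<s>", "the", "dog"))     # 0.0: seen context, unseen trigram
print(trigram_prob("the", "dog", "sat"))     # context "the dog" unseen: undefined, returned as 0.0 here
print(sentence_prob(["the", "dog", "sat"]))  # 0.0: the zero propagates through the product
```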

Add-one Smoothing

P̂(w_i | w_{i-1}) = (1 + f(w_{i-1}, w_i)) / Σ_{w_i} (1 + f(w_{i-1}, w_i))
                 = (1 + f(w_{i-1}, w_i)) / (V + Σ_{w_i} f(w_{i-1}, w_i))

where V is the size of the vocabulary.

A simple technique, not widely used anymore.
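
A minimal sketch of the add-one estimate above for bigrams; the tiny corpus and vocabulary are illustrative assumptions, not the lecture's data.

```python
from collections import Counter

# A toy add-one (Laplace) smoothed bigram estimate, following the formula above.
corpus = [["the", "cat", "sat"], ["the", "cat", "slept"]]

bigram_counts, context_counts = Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    for prev, w in zip(padded, padded[1:]):
        bigram_counts[(prev, w)] += 1
        context_counts[prev] += 1

vocab = {w for sent in corpus for w in sent} | {"</s>"}
V = len(vocab)  # size of the vocabulary the distribution ranges over

def add_one_prob(prev, w):
    """(1 + f(prev, w)) / (V + sum over w of f(prev, w))"""
    return (1 + bigram_counts[(prev, w)]) / (V + context_counts[prev])

print(add_one_prob("cat", "sat"))  # seen bigram: 2/7
print(add_one_prob("cat", "the"))  # unseen bigram still gets mass: 1/7
```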

Linear Interpolation

Consider "burnish the" and "burnish thou", where f(burnish, the) = f(burnish, thou) = 0
Add-one smoothing would assign the same probability to P(the | burnish) and P(thou | burnish).
But "burnish the" is intuitively more likely (because "the" is more likely than "thou").

Linear Interpolation

P(w_i | w_{i-1}) = λ P̂(w_i | w_{i-1}) + (1 - λ) P̂(w_i)

P is the interpolated model and P̂ is the maximum likelihood (relative frequency) estimate.

P(w_i | w_{i-1}, w_{i-2}) = λ_1 P̂(w_i | w_{i-1}, w_{i-2}) + (1 - λ_1)(λ_2 P̂(w_i | w_{i-1}) + (1 - λ_2) P̂(w_i))
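
A minimal sketch of the bigram case of this interpolation, using the burnish example from the previous slide; the toy corpus and the λ value are assumptions, and in practice the λs are tuned on held-out data.

```python
from collections import Counter

# A toy sketch of bigram-unigram interpolation. The corpus and lambda value
# are illustrative assumptions; real lambdas are set on held-out data.
corpus = [["burnish", "it", "now"], ["the", "pot", "is", "hot"],
          ["the", "cat", "sat"], ["thou", "art", "here"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))
N = sum(unigrams.values())

def interpolated(prev, w, lam=0.7):
    p_bigram = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    p_unigram = unigrams[w] / N
    return lam * p_bigram + (1 - lam) * p_unigram

# Both bigrams are unseen, but "the" is a more frequent unigram than "thou",
# so the interpolated model prefers "burnish the" over "burnish thou".
print(interpolated("burnish", "the"), interpolated("burnish", "thou"))
```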

Backoff

P(w_i | w_{i-1}, w_{i-2}) = P̂(w_i | w_{i-1}, w_{i-2})   if f(w_{i-2}, w_{i-1}, w_i) > 0
                          = α_1 P̂(w_i | w_{i-1})         if f(w_{i-2}, w_{i-1}, w_i) = 0 and f(w_{i-1}, w_i) > 0
                          = α_2 P̂(w_i)                   otherwise

where the αs are required to ensure a proper distribution
although see "Large Language Models in Machine Translation", Brants et al., based on 2 trillion tokens, which uses "stupid backoff" and ignores the αs
[see Google n-gram corpus]
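
A minimal sketch of the "stupid backoff" variant mentioned above (not the properly normalised backoff with αs); the toy corpus is an assumption, and 0.4 is the fixed backoff factor used by Brants et al. Note that it returns scores rather than true probabilities.

```python
from collections import Counter

# A toy "stupid backoff" scorer: use the trigram relative frequency if the
# trigram was seen, otherwise back off to the bigram, then the unigram,
# multiplying by a fixed factor (0.4, as in Brants et al.) instead of
# computing proper normalising alphas. The result is a score, not a probability.
corpus = [["the", "cat", "sat"], ["the", "cat", "slept"], ["a", "dog", "sat"]]

unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))
    trigrams.update(zip(sent, sent[1:], sent[2:]))
N = sum(unigrams.values())

def stupid_backoff(w1, w2, w3, factor=0.4):
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return factor * bigrams[(w2, w3)] / unigrams[w2]
    return factor * factor * unigrams[w3] / N

print(stupid_backoff("the", "cat", "sat"))  # trigram seen: 1/2
print(stupid_backoff("a", "cat", "sat"))    # unseen trigram: backs off to bigram "cat sat"
```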

Back to Tagging...

Tag sequence probabilities can be smoothed (or backed off):

P(t_i | t_{i-1}, t_{i-2}) = λ_1 P̂(t_i | t_{i-1}, t_{i-2}) + (1 - λ_1)(λ_2 P̂(t_i | t_{i-1}) + (1 - λ_2) P̂(t_i))

A simple solution for unknown words is to replace them with UNK:

P(w_i | t_i) = P(UNK | t_i)

where any word in the training data occurring less than, say, 5 times is replaced with UNK.
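
A minimal sketch of the UNK replacement described above; the threshold of 5 follows the slide, while the toy training data and helper names are my own.

```python
from collections import Counter

# A toy sketch of the UNK trick: words seen fewer than `threshold` times in
# training are mapped to UNK, and unseen test words are looked up as UNK,
# so P(w | t) for them becomes P(UNK | t).
def apply_unk(train_words, threshold=5):
    counts = Counter(train_words)
    return [w if counts[w] >= threshold else "UNK" for w in train_words]

def normalise(word, vocab):
    return word if word in vocab else "UNK"

train = ["the", "cat"] * 6 + ["aardvark"]   # "aardvark" occurs only once
vocab = set(apply_unk(train))
print(normalise("aardvark", vocab), normalise("the", vocab))  # UNK the
```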

Better Handling of Unknown Words

Lots of clues as to what the tag of an unknown word might be:
proper nouns (NNP) are likely to be unknown
if the word ends in -ing, likely to be VBG
...

P(w | t) = (1/Z) P(unknown word | t) P(capitalized | t) P(endings | t)

But now we're starting to see the weaknesses of generative models for taggers.
Conditional models can deal with these features directly; see Ratnaparkhi's tagger as an example.
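
A minimal sketch of this kind of feature-product model for unknown words; the tag set and probability tables are made-up assumptions (and P(unknown word | t) is folded into them for brevity), purely to illustrate the renormalisation by Z.

```python
# A toy feature-product model for unknown words, in the spirit of the formula
# above: multiply per-tag likelihoods of simple features and renormalise by Z,
# the sum over tags. The tag set and probability tables are made-up numbers.
TAGS = ["NNP", "VBG", "NN"]
p_cap = {"NNP": 0.9, "VBG": 0.05, "NN": 0.1}    # P(capitalised | t)
p_ing = {"NNP": 0.01, "VBG": 0.9, "NN": 0.05}   # P(ends in -ing | t)

def unknown_word_distribution(word):
    is_cap = word[0].isupper()
    is_ing = word.endswith("ing")
    raw = {}
    for t in TAGS:
        score = p_cap[t] if is_cap else 1 - p_cap[t]
        score *= p_ing[t] if is_ing else 1 - p_ing[t]
        raw[t] = score
    z = sum(raw.values())                        # the normalising constant Z
    return {t: s / z for t, s in raw.items()}

print(unknown_word_distribution("Gallifrey"))    # capitalised: most mass on NNP
print(unknown_word_distribution("blithering"))   # -ing suffix: most mass on VBG
```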

Reading for Today's Lecture

Jurafsky and Martin, Speech and Language Processing, chapter on N-grams
Manning and Schutze, Foundations of Statistical Natural Language Processing, chapter on Statistical Inference: n-gram models over sparse data
Chen and Goodman, An Empirical Study of Smoothing Techniques for Language Modeling, TR-10-98, 1998 (lots of technical and empirical details)
Chapter 1 of Corpus-Based Methods in Language and Speech Processing, eds Steve Young and Gerrit Bloothooft, Kluwer 1997 (if available)