Week 13: Language Modeling II. Smoothing in Language Modeling. Irina Sergienya, 07.07.2015

A couple of words first... There are many more smoothing techniques [e.g. Katz back-off, Jelinek-Mercer, ...] and techniques to improve LMs [e.g. caching, skipping models, clustering, sentence mixtures, ...]. The concepts are the same but the formulas differ, so check several sources before implementation.

Recall: Language Models. Given a sentence s, we would like to estimate how likely it is to see such a sentence in the language: P(w_1^{length(s)}) = ∏_k P(w_k | w_1^{k-1}), with the maximum-likelihood estimate P(w_k | w_1^{k-1}) = C(w_1^k) / C(w_1^{k-1}). Problems: sparseness of the training data (not enough data to estimate the probabilities), zero probability of unseen events. Solution: SMOOTHING!
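A minimal Python sketch of this maximum-likelihood estimate, using a hypothetical `counts` dictionary of toy n-gram counts; it is here only to make the zero-probability problem concrete.

```python
from collections import Counter

def mle_ngram_prob(history, word, counts):
    """Maximum-likelihood estimate P(w_k | history) = C(history + w_k) / C(history)."""
    numerator = counts.get(history + (word,), 0)
    denominator = counts.get(history, 0)
    return numerator / denominator if denominator > 0 else 0.0

# Toy counts (hypothetical): n-gram tuples -> how often they occur in training data.
counts = Counter({("a",): 3, ("a", "book"): 2, ("read",): 1})
print(mle_ngram_prob(("a",), "book", counts))   # 2/3, a seen event
print(mle_ngram_prob(("read",), "a", counts))   # 0.0, an unseen event -> the sparseness problem
```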

Smoothing: Take some probability mass from seen events and assign it to unseen events. Before smoothing P(seen) = 1 and P(unseen) = 0; after smoothing P(seen) = 0.999... and P(unseen) > 0.

Recall: Laplace Smoothing. Unsmoothed: P(w_n | w_1^{n-1}) = C(w_1^n) / C(w_1^{n-1}). Smoothed: P_Laplace(w_n | w_1^{n-1}) = (C(w_1^n) + 1) / (C(w_1^{n-1}) + V).
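A small sketch of the add-one estimate on the same kind of toy counts as above; `vocab_size` (V) is an assumed parameter.

```python
def laplace_prob(history, word, counts, vocab_size):
    """Add-one (Laplace) estimate:
    P_Laplace(w_n | w_1^{n-1}) = (C(w_1^n) + 1) / (C(w_1^{n-1}) + V)."""
    return (counts.get(history + (word,), 0) + 1) / (counts.get(history, 0) + vocab_size)

# Same toy counts as in the earlier sketch, with a hypothetical vocabulary of 10,000 types:
counts = {("a",): 3, ("a", "book"): 2, ("read",): 1}
print(laplace_prob(("a",), "book", counts, vocab_size=10_000))   # (2 + 1) / (3 + 10000)
print(laplace_prob(("read",), "a", counts, vocab_size=10_000))   # (0 + 1) / (1 + 10000), no longer zero
```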

Recall: Good-Turing Smoothing. Use the count of events we have seen once to help estimate the count of events we have never seen: P(unseen) is estimated from the mass of events seen once, P(seen once) from the mass of events seen more than once, and so on.

Recall: Good-Turing Smoothing. Use the count of events we have seen once to help estimate the count of events we have never seen. Let N_c be the number of events we have seen exactly c times. For a word w with C(w) = c, estimate: P_GoodTuring(w) = (1/N) · (c+1) · N_{c+1} / N_c. Here N is the total number of tokens (M from the previous lecture).
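A rough sketch of the Good-Turing estimate for a seen word, assuming a toy `word_counts` Counter; real implementations smooth the N_c counts (they get noisy for large c), which this sketch does not.

```python
from collections import Counter

def good_turing_prob(word, word_counts):
    """Good-Turing estimate for a word w seen c = C(w) times:
    P_GT(w) = (1/N) * (c + 1) * N_{c+1} / N_c,
    where N_c is the number of word types seen exactly c times and N is the
    total number of tokens. No smoothing of the N_c values is done here."""
    c = word_counts[word]
    freq_of_freqs = Counter(word_counts.values())
    N = sum(word_counts.values())
    n_c, n_c_plus_1 = freq_of_freqs[c], freq_of_freqs[c + 1]
    if n_c == 0 or n_c_plus_1 == 0:
        return c / N  # fall back to the MLE when the ratio is undefined
    return (c + 1) * n_c_plus_1 / (n_c * N)

word_counts = Counter("the cat sat on the mat the cat ran".split())
print(good_turing_prob("sat", word_counts))  # (1+1) * N_2 / (N_1 * N) = 2 * 1 / (4 * 9)
```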

[Slide from Dan Jurafsky, MOOC "Natural Language Processing": Language Modeling. Advanced: Good-Turing Smoothing]

Today: Interpolation; Absolute discounting; Kneser-Ney smoothing: Back-off Kneser-Ney, Interpolated Kneser-Ney, Modified Kneser-Ney.

Interpolation. Concept 1: Problem: no measurement is perfect. Solution: combine several measurements, with a different degree of trust in each of them = INTERPOLATION (value 1, value 2, value 3).

Interpolation. Concept 2: Problem: no measurement is perfect. Solution: combine several measurements, with a different degree of trust in each of them: value = α_1·value_1 + α_2·value_2 + α_3·value_3, where the α_i are coefficients (weights). Usually α_i ∈ [0, 1] and Σ_i α_i = 1. For i = 2: α_1 = α, α_2 = 1 − α. For i = 3: α_1 = α, α_2 = β, α_3 = 1 − α − β.

Interpolation. Concept 3: value = α_1·value_1 + α_2·value_2 + α_3·value_3. Example: value_1 = 112 kg − 90 kg = 22 kg, value_2 = 20 kg, value_3 = 25 kg; α_1 = .3, α_2 = .2, α_3 = .5; value = 22·.3 + 20·.2 + 25·.5 = 23.1 ≈ 23 kg.

Examples of interpolation? Taking the mean value (splitting a bill equally, α_i = 1/n); estimating an election result from an opinion poll; basically, anywhere you try to assess a true value from several sources.

Linear Interpolation in LM. We've never seen "read a book", but we might have seen "a book", and we've certainly seen "book". Linear interpolation: P_INT(w_i | w_{i-2}, w_{i-1}) = α_3 · P(w_i | w_{i-2}, w_{i-1}) + α_2 · P(w_i | w_{i-1}) + α_1 · P(w_i). Example: P(read a book) = .5·0 + .2·0.0006 + .3·1.74·10⁻³ = 6.42·10⁻⁴.
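A small sketch of the three-way linear interpolation, with the slide's weights (.5, .2, .3) hard-coded as defaults and the component probabilities passed in directly; it reproduces the 6.42·10⁻⁴ result.

```python
def interpolated_prob(p_tri, p_bi, p_uni, alphas=(0.5, 0.2, 0.3)):
    """Linear interpolation of trigram, bigram and unigram estimates:
    P_INT(w_i | w_{i-2}, w_{i-1}) =
        a3 * P(w_i | w_{i-2}, w_{i-1}) + a2 * P(w_i | w_{i-1}) + a1 * P(w_i)."""
    a3, a2, a1 = alphas
    return a3 * p_tri + a2 * p_bi + a1 * p_uni

# The slide's "read a book" example: the trigram estimate is 0 (never seen),
# while the bigram ("a book") and unigram ("book") estimates are small but non-zero.
print(interpolated_prob(0.0, 0.0006, 1.74e-3))  # 6.42e-4
```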

Absolute discounting: Discount all non-zero n-gram counts by a small constant amount D and interpolate with the bigram model: P_AD(w_i | w_{i-2}, w_{i-1}) = max(C(w_{i-2} w_{i-1} w_i) − D, 0) / C(w_{i-2} w_{i-1}) + (1 − λ) · P_AD(w_i | w_{i-1}). The first term is the discounted n-gram, (1 − λ) is the interpolation weight, and P_AD(w_i | w_{i-1}) is the lower-order n-gram.

Absolute discounting. Interpolation weight: If Z seen word types occur after w_{i-2} w_{i-1} in the training data, discounting reserves the probability mass P(U) = Z · D / C(w_{i-2} w_{i-1}), to be distributed according to P_AD(w_i | w_{i-1}). Set (1 − λ) = P(U) = Z · D / C(w_{i-2} w_{i-1}). N.B.: with N_1, N_2 the number of n-grams that occur once or twice, D = N_1 / (N_1 + 2·N_2) works well in practice.
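A toy sketch of absolute discounting at the bigram level (the trigram case on the slide works the same way one order up), with invented counts and unigram probabilities; D = 0.75 is an arbitrary placeholder rather than the N_1/(N_1 + 2·N_2) estimate.

```python
def absolute_discount_bigram(history, word, bigram_counts, unigram_probs, D=0.75):
    """Absolute discounting for bigrams, interpolated with a unigram model:
    P_AD(w | h) = max(C(h, w) - D, 0) / C(h) + (1 - lambda) * P(w),
    where (1 - lambda) = Z * D / C(h) and Z is the number of word types seen after h."""
    history_count = sum(c for (h, _), c in bigram_counts.items() if h == history)
    Z = sum(1 for (h, _) in bigram_counts if h == history)
    discounted = max(bigram_counts.get((history, word), 0) - D, 0) / history_count
    reserved_mass = Z * D / history_count  # probability mass freed up by discounting
    return discounted + reserved_mass * unigram_probs.get(word, 0.0)

# Toy counts and unigram probabilities, invented for illustration.
bigram_counts = {("want", "to"): 6, ("want", "a"): 3, ("want", "some"): 1}
unigram_probs = {"to": 0.05, "a": 0.04, "some": 0.01, "food": 0.002}
print(absolute_discount_bigram("want", "to", bigram_counts, unigram_probs))    # seen bigram
print(absolute_discount_bigram("want", "food", bigram_counts, unigram_probs))  # unseen bigram
```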

Kneser-Ney Smoothing. Idea: higher-order models work better, but when a count is small or zero, the lower-order models can help a lot. However, lower-order models should be used wisely: "San Francisco" is common, so absolute discounting will give "Francisco" a high probability in future predictions, while actually "Francisco" occurs only after "San" => the bigram model is better in this case. The key idea is to take into account the contexts each word occurs in.

Kneser-Ney Smoothing. Contexts: N_1+(• w_i) = |{w_{i-1} : C(w_{i-1} w_i) > 0}| is the number of different words w_{i-1} that w_i follows, e.g. N_1+(• read) = 2, N_1+(• a) = 5. N_1+(• •) = Σ_{w_i} N_1+(• w_i), e.g. N_1+(• •) = 2 + 6 + 2 + 5 + 2 = 17.

Kneser-Ney Smoothing. Lower-order: Replace raw counts with counts of contexts: P_KN(w_i) = N_1+(• w_i) / N_1+(• •), e.g. P_KN(read) = 2/17, P_KN(a) = 5/17, P_KN(to) = 6/17.
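A sketch of this continuation-count unigram distribution, computed from a hypothetical list of training bigrams.

```python
from collections import defaultdict

def kn_continuation_probs(training_bigrams):
    """Kneser-Ney lower-order distribution: P_KN(w) = N_1+(. w) / N_1+(. .),
    where N_1+(. w) is the number of distinct words that w follows in training."""
    left_contexts = defaultdict(set)
    for prev, w in training_bigrams:
        left_contexts[w].add(prev)
    total = sum(len(ctx) for ctx in left_contexts.values())  # N_1+(. .)
    return {w: len(ctx) / total for w, ctx in left_contexts.items()}

# Tiny invented corpus of bigrams: "a" follows 3 distinct words, "book" follows 2.
training_bigrams = [("read", "a"), ("want", "a"), ("buy", "a"),
                    ("a", "book"), ("the", "book")]
print(kn_continuation_probs(training_bigrams))  # {"a": 3/5, "book": 2/5}
```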

Back-off Kneser-Ney Smoothing. KN smoothing is similar to absolute discounting, but uses the KN estimate for the lower order: P_BKN(w_i | w_{i-1}) = (C(w_{i-1} w_i) − D) / C(w_{i-1}) if C(w_{i-1} w_i) > 0, and α(w_{i-1}) · P_KN(w_i) otherwise, where P_KN(w_i) = N_1+(• w_i) / N_1+(• •). We back off to the lower-order model when the bigram count is 0; α is a normalization constant.

Back-off Kneser-Ney Smoothing. Example with D = 0.5 and α = 0.01: when the bigram count is available, P_BKN(a | want) = (10 − 0.5) / 292 = 0.03253; when the bigram has never been seen, P_BKN(to | want) = 0.01 · 6/17 = 0.00353.
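A sketch that reproduces the example numbers, with the slide's D = 0.5 and α = 0.01 passed in as constants; in a real model α would be chosen per history so that the probabilities sum to one.

```python
def backoff_kn_bigram(prev, word, bigram_counts, unigram_counts, kn_unigram,
                      D=0.5, alpha=0.01):
    """Back-off Kneser-Ney for bigrams with the slide's constants:
    discounted bigram estimate if the bigram was seen, otherwise alpha * P_KN(word)."""
    c = bigram_counts.get((prev, word), 0)
    if c > 0:
        return (c - D) / unigram_counts[prev]
    return alpha * kn_unigram.get(word, 0.0)

bigram_counts = {("want", "a"): 10}
unigram_counts = {"want": 292}
kn_unigram = {"a": 5 / 17, "to": 6 / 17, "read": 2 / 17}  # continuation probabilities from the earlier slide

print(backoff_kn_bigram("want", "a", bigram_counts, unigram_counts, kn_unigram))   # (10 - 0.5) / 292 = 0.03253
print(backoff_kn_bigram("want", "to", bigram_counts, unigram_counts, kn_unigram))  # 0.01 * 6/17 = 0.00353
```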

Interpolated Kneser-Ney Smoothing. Again similar to absolute discounting with the KN estimate for the lower order, but now interpolating instead of backing off: P_IKN(w_i | w_{i-1}) = max(C(w_{i-1} w_i) − D, 0) / C(w_{i-1}) + α(w_{i-1}) · P_KN(w_i), where P_KN(w_i) = N_1+(• w_i) / N_1+(• •). IKN is used for the higher orders and the Kneser-Ney continuation estimate for the unigram; α is a normalization constant.
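A sketch of the interpolated variant for bigrams, on toy counts. The lower-order weight is computed here as D·Z/C(w_{i-1}), the probability mass set aside by discounting (as on the absolute-discounting slide); this is one common choice for the normalization constant α, assumed for illustration.

```python
def interpolated_kn_bigram(prev, word, bigram_counts, unigram_counts, kn_unigram, D=0.5):
    """Interpolated Kneser-Ney for bigrams: always add the weighted KN continuation
    probability to the discounted bigram estimate. The lower-order weight,
    D * Z / C(prev), is the mass set aside by discounting (Z = distinct continuations of prev)."""
    history_count = unigram_counts[prev]
    Z = sum(1 for (h, _w) in bigram_counts if h == prev)
    discounted = max(bigram_counts.get((prev, word), 0) - D, 0) / history_count
    lower_order_weight = D * Z / history_count
    return discounted + lower_order_weight * kn_unigram.get(word, 0.0)

bigram_counts = {("want", "a"): 10, ("want", "some"): 4}
unigram_counts = {"want": 292}
kn_unigram = {"a": 5 / 17, "to": 6 / 17}
print(interpolated_kn_bigram("want", "a", bigram_counts, unigram_counts, kn_unigram))   # seen bigram + interpolation
print(interpolated_kn_bigram("want", "to", bigram_counts, unigram_counts, kn_unigram))  # unseen bigram, KN unigram only
```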

Modified Kneser-Ney Smoothing. Chen & Goodman introduced modified Kneser-Ney: interpolation is used instead of back-off; a separate discount is used for one- and two-counts instead of a single discount for all counts, D(c) = D_1 if c = 1, D_2 if c = 2, D_3+ if c ≥ 3; and the discounts are estimated on held-out data instead of using a formula based on training counts. Modified Kneser-Ney consistently had the best performance.
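A sketch of the count-dependent discount function; the D_1, D_2, D_3+ values below are placeholders, since Chen & Goodman estimate them on held-out data.

```python
def modified_kn_discount(c, D1=0.7, D2=1.0, D3plus=1.3):
    """Count-dependent discount of modified Kneser-Ney:
    D(c) = 0 for c = 0, D1 for c = 1, D2 for c = 2, D3+ for c >= 3.
    The discount values are placeholders and should be tuned on held-out data."""
    if c == 0:
        return 0.0
    if c == 1:
        return D1
    if c == 2:
        return D2
    return D3plus

print([modified_kn_discount(c) for c in range(5)])  # [0.0, 0.7, 1.0, 1.3, 1.3]
```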

Questions: We've just seen interpolation with lower-order models. What else could be interpolated? And why not simply take high-order models and back off to, or interpolate with, lower-order models?

References:
Dan Jurafsky, Christopher Manning, MOOC "Natural Language Processing", lecture "Language Modeling. Advanced: Good-Turing Smoothing".
Dan Jurafsky, Christopher Manning, MOOC "Natural Language Processing", lecture "Advanced: Kneser-Ney Smoothing".
Bill MacCartney, "NLP Lunch Tutorial: Smoothing", 2005.
Joshua T. Goodman, "A Bit of Progress in Language Modeling", 2001.
Philipp Koehn, "Statistical Machine Translation", chapter "Language Models", 2009.
Daniel Jurafsky, James H. Martin, "Speech and Language Processing", 1999.