Language Technology, CUNY Graduate Center, Spring 2013
Unit 1: Sequence Models
Lectures 5-6: Language Models and Smoothing
(readings: required / hard / optional)
Professor Liang Huang (liang.huang.sh@gmail.com)

Python Review: Styles (do not write the left form when you can write the right form)
  do not write: for key in d.keys():                           write: for key in d:
  do not write: if d.has_key(key):                             write: if key in d:
  do not write: i = 0; for x in a: ... i += 1; a[0:len(a)-i]   write: for i, x in enumerate(a): ... a[:-i]
  do not write: for line in sys.stdin.readlines():             write: for line in sys.stdin:
  do not write: for x in a: print x, / print                   write: print " ".join(map(str, a))
  do not write: s = ""; for i in range(lev): s += " "; print s    write: print " " * lev
2

Noisy-Channel Model 3

Noisy-Channel Model: the language model p(t1 ... tn) over the target sequence 4

Applications of Noisy-Channel
  application          source               observed             p(source)               p(observed | source)
  spelling correction  correct text         text with mistakes   prob. of language text  noisy spelling
5
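
As a concrete (toy) illustration of the decoding rule, here is a minimal Python sketch of noisy-channel spelling correction; the candidate list, language model, and channel model below are made-up stand-ins, not the lecture's actual models:

    def correct(observed, candidates, p_lm, p_channel):
        """Return the candidate w maximizing p_lm(w) * p_channel(observed | w)."""
        return max(candidates,
                   key=lambda w: p_lm.get(w, 0.0) * p_channel.get((observed, w), 0.0))

    # toy numbers, assumed for illustration only
    p_lm = {"the": 0.02, "thy": 0.0001}                       # language model p(w)
    p_channel = {("teh", "the"): 0.1, ("teh", "thy"): 0.01}   # channel model p(typo | intended)

    print(correct("teh", ["the", "thy"], p_lm, p_channel))    # -> the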

Noisy Channel Examples
Th qck brwn fx jmps vr th lzy dg. Ths sntnc hs ll twnty sx lttrs n th lphbt.
I cnduo't bvleiee taht I culod aulaclty uesdtannrd waht I was rdnaieg. Unisg the icndeblire pweor of the hmuan mnid, aocdcrnig to rseecrah at Cmabrigde Uinervtisy, it dseno't mttaer in waht oderr the lterets in a wrod are, the olny irpoamtnt tihng is taht the frsit and lsat ltteer be in the rhgit pclae.
Therestcanbeatotalmessandyoucanstillreaditwithoutaproblem.Thisisbecausethehumanminddoesnotreadeveryletterbyitself,butthewordasawhole.
6

Noisy Channel Examples 7

Language Model for Generation search suggestions 8

Language Models
problem: what is P(w) = P(w1 w2 ... wn)?
ranking: P(an apple) > P(a apple) = 0, P(he often swim) = 0
prediction: what's the next word? use P(wn | w1 ... wn-1): "Obama gave a speech about ___."
chain rule: P(w1 w2 ... wn) = P(w1) P(w2 | w1) ... P(wn | w1 ... wn-1)   (a sequence prob., not just a joint prob.)
trigram: P(w1) P(w2 | w1) P(w3 | w1 w2) ... P(wn | wn-2 wn-1)
bigram:  P(w1) P(w2 | w1) P(w3 | w2) ... P(wn | wn-1)
unigram: P(w1) P(w2) P(w3) ... P(wn)
0-gram:  P(w) P(w) P(w) ... P(w)
9
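
To make the factorizations concrete, here is a small sketch that scores a sentence under the bigram approximation; the probability tables are made up purely for illustration:

    def bigram_prob(words, p_first, p_bigram):
        """P(w1 ... wn) ~= P(w1) * prod_i P(w_i | w_{i-1})  (bigram approximation)."""
        prob = p_first.get(words[0], 0.0)
        for prev, cur in zip(words, words[1:]):
            prob *= p_bigram.get((prev, cur), 0.0)   # unseen bigram -> 0 without smoothing
        return prob

    # toy probability tables, purely illustrative
    p_first  = {"an": 0.01, "a": 0.03}
    p_bigram = {("an", "apple"): 0.02, ("a", "apple"): 0.0}

    print(bigram_prob(["an", "apple"], p_first, p_bigram))   # positive
    print(bigram_prob(["a", "apple"], p_first, p_bigram))    # 0.0, as in the ranking example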

Estimating n-gram Models
maximum likelihood: p_ml(x) = c(x)/N;  p_ml(y | x) = c(x y)/c(x)
problem: unknown words/sequences (unobserved events): the sparse data problem
solution: smoothing
10
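
A minimal sketch of these maximum-likelihood estimates, counting unigrams and bigrams from a tiny toy corpus (no sentence-boundary symbols, just to keep it short):

    from collections import Counter

    def ml_estimates(sentences):
        """p_ml(x) = c(x)/N and p_ml(y | x) = c(x y)/c(x), from raw counts."""
        unigrams, bigrams = Counter(), Counter()
        for sent in sentences:
            unigrams.update(sent)
            bigrams.update(zip(sent, sent[1:]))
        N = sum(unigrams.values())
        p_uni = {x: c / N for x, c in unigrams.items()}
        p_bi = {(x, y): c / unigrams[x] for (x, y), c in bigrams.items()}
        return p_uni, p_bi

    corpus = [["he", "often", "swims"], ["he", "swims"]]
    p_uni, p_bi = ml_estimates(corpus)
    print(p_uni["he"], p_bi[("he", "swims")])   # 0.4 and 0.5; any unseen n-gram would get 0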

Smoothing have to give some probability mass to unseen events (by discounting from seen events) Q1: how to divide this wedge up? Q2: how to squeeze it into the pie? (D. Klein) 11

ML, MAP, and smoothing
simple question: what's P(H) if you see H H H H?
always maximize the posterior: what's the best model m given data d?
with a uniform prior, this is the same as maximizing the likelihood (just explain the data):
  argmax_m p(m | d) = argmax_m p(m) p(d | m)   (Bayes rule; p(d) does not depend on m)
                    = argmax_m p(d | m)         when p(m) is uniform
12

ML, MAP, and smoothing
what if we have an arbitrary prior, like p(θ) = θ(1-θ)?
maximum a posteriori estimation (MAP); MAP approaches MLE with infinite data
MAP = MLE + smoothing
this prior is just two extra tosses, unbiased (one H, one T)
you can inject other priors, like 4 extra tosses with 3 Hs
13
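
A tiny worked sketch of the coin example: with the prior p(θ) ∝ θ(1-θ), i.e. two extra unbiased pseudo-tosses, the MAP estimate is just the MLE with pseudocounts added; the H H H H counts are the ones from the previous slide:

    def mle(heads, tails):
        return heads / (heads + tails)

    def map_estimate(heads, tails, pseudo_h=1, pseudo_t=1):
        """MAP under a beta-style prior = MLE with pseudocounts added ('MLE + smoothing')."""
        return (heads + pseudo_h) / (heads + tails + pseudo_h + pseudo_t)

    print(mle(4, 0))                  # 1.0: ML says tails is impossible after H H H H
    print(map_estimate(4, 0))         # 5/6: the theta*(1-theta) prior = two extra unbiased tosses
    print(map_estimate(4, 0, 3, 1))   # another prior: 4 extra tosses, 3 of them heads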

Smoothing: Add One (Laplace)
MAP: add a pseudocount of 1 to every word in the vocabulary
P_lap(x) = (c(x) + 1) / (N + V);  P_lap(unk) = 1 / (N + V)
V is the vocabulary size; the same probability for all unknown words
how much prob. mass goes to unknowns in the above diagram?
e.g., N = 10^6 words, V = 26^20, V_obs = 10^5, V_unk = 26^20 - 10^5
14
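
A small sketch of add-one smoothing over unigram counts; the counts and the vocabulary size V here are toy numbers passed in as assumptions:

    from collections import Counter

    def laplace(counts, V):
        """p_lap(x) = (c(x) + 1) / (N + V); every unseen word gets 1 / (N + V)."""
        N = sum(counts.values())
        return lambda x: (counts.get(x, 0) + 1) / (N + V)

    counts = Counter({"the": 3, "cat": 1})       # N = 4 observed tokens
    p = laplace(counts, V=10)                    # V = assumed total vocabulary size
    print(p("the"), p("unicorn"))                # 4/14 and 1/14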

Smoothing: Add Less than One
add one puts too much weight on unseen words!
solution: add less than one (Lidstone) to each word in V
P_lid(x) = (c(x) + λ) / (N + λV);  P_lid(unk) = λ / (N + λV), where 0 < λ < 1 is a parameter
still the same probability for all unknowns, but smaller
Q: how to tune this λ? on held-out data!
15
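
A sketch of add-λ (Lidstone) smoothing together with the tuning step: pick the λ with the lowest cross-entropy on held-out data (the λ grid, V, and the toy corpora are assumptions for illustration):

    import math
    from collections import Counter

    def lidstone(counts, V, lam):
        """p_lid(x) = (c(x) + lam) / (N + lam * V)."""
        N = sum(counts.values())
        return lambda x: (counts.get(x, 0) + lam) / (N + lam * V)

    def cross_entropy(p, heldout):
        """Average negative log2 probability per held-out token."""
        return -sum(math.log2(p(w)) for w in heldout) / len(heldout)

    train   = ["the", "cat", "sat", "on", "the", "mat"]
    heldout = ["the", "dog", "sat"]
    counts, V = Counter(train), 10               # V: assumed vocabulary size

    best = min((cross_entropy(lidstone(counts, V, lam), heldout), lam)
               for lam in (0.01, 0.1, 0.5, 1.0))
    print("best lambda:", best[1])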

Smoothing: Witten-Bell
key idea: use one-count things to guess for zero-counts (a recurring idea for unknown events; also used in Good-Turing)
probability mass for unseen events: T / (N + T), where T = # of seen types
think of 2 kinds of events: one for each token, one for each type; T/(N+T) is the MLE of seeing a new type (T among the N+T events are new)
divide this mass evenly among the V - T unknown words:
P_wb(x) = T / ((V - T)(N + T))   for an unknown word
P_wb(x) = c(x) / (N + T)         for a known word
the bigram case is more involved; see the J&M chapter for details
16
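
A minimal sketch of the unigram Witten-Bell estimate above: the T seen types keep mass N/(N+T), and the T/(N+T) mass for new events is split evenly over the V-T unseen words (V is again an assumed vocabulary size):

    from collections import Counter

    def witten_bell(counts, V):
        """p_wb(x) = c(x)/(N+T) if x was seen, else T / ((V-T)*(N+T))."""
        N, T = sum(counts.values()), len(counts)   # N tokens, T seen types
        def p(x):
            c = counts.get(x, 0)
            return c / (N + T) if c > 0 else T / ((V - T) * (N + T))
        return p

    counts = Counter({"the": 3, "cat": 1})         # N = 4, T = 2
    p = witten_bell(counts, V=10)
    print(p("the"), p("dog"))                      # 3/6, and 2/(8*6) for each unseen word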

Smoothing: Good-Turing
again, one-count words in training ~ unseen words in test
let N_c = # of words with frequency c in training
P_GT(x) = c*(x) / N  where  c*(x) = (c(x) + 1) N_{c(x)+1} / N_{c(x)}
total adjusted mass: sum_c c* N_c / N = sum_c (c+1) N_{c+1} / N
remaining mass: N_1 / N, split evenly among the unknown words
17
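
A sketch of the Good-Turing adjusted counts on toy data; it computes N_c from the raw counts and reserves the N_1/N mass for unknowns, with no smoothing of N_c itself, so it only behaves on counts where the needed N_{c+1} is nonzero:

    from collections import Counter

    def good_turing(counts):
        """c*(x) = (c(x)+1) * N_{c(x)+1} / N_{c(x)};  P_GT(x) = c*(x) / N."""
        N = sum(counts.values())
        Nc = Counter(counts.values())         # Nc[c] = number of word types seen exactly c times
        def p(x):
            c = counts[x]
            c_star = (c + 1) * Nc[c + 1] / Nc[c]
            return c_star / N
        unseen_mass = Nc[1] / N               # N1/N, to be split evenly among unknown words
        return p, unseen_mass

    counts = Counter({"a": 1, "b": 1, "c": 1, "d": 2, "e": 3})   # N = 8
    p, unseen = good_turing(counts)
    print(p("a"), unseen)    # c* = 2 * N2/N1 = 2/3, so 1/12; unseen mass = 3/8
    # note: p("e") would come out 0 because N4 = 0 (the "problem 1" on the next slide)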

Smoothing: Good-Turing
from Church and Gale (1991): bigram LMs, unigram vocab size = 4x10^5; T_r gives the frequencies in the held-out data (see f_empirical).
18

Smoothing: Good-Turing
Good-Turing is much better than add-(less-than-)one
problem 1: N_{c_max + 1} = 0, so the adjusted count c*_max = 0
solution: only adjust counts for words with count less than k (e.g., 5)
problem 2: what if N_c = 0 for some middle c?
solution: smooth N_c itself
19

Smoothing: Backoff 20

Smoothing: Interpolation 21
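
A hedged sketch of the two standard recipes: linear interpolation mixes bigram and unigram estimates with a weight λ (tuned on held-out data), while simple backoff uses the bigram only when it was seen and otherwise falls back to a scaled unigram (a fixed-weight form, not Katz backoff with proper discounting); all numbers below are made up:

    def interpolate(w, prev, p_bi, p_uni, lam=0.7):
        """p(w | prev) = lam * p_bigram(w | prev) + (1 - lam) * p_unigram(w)."""
        return lam * p_bi.get((prev, w), 0.0) + (1 - lam) * p_uni.get(w, 0.0)

    def backoff(w, prev, p_bi, p_uni, alpha=0.4):
        """Use the bigram if it was seen; otherwise fall back to a scaled unigram."""
        return p_bi[(prev, w)] if (prev, w) in p_bi else alpha * p_uni.get(w, 0.0)

    # toy tables, purely illustrative
    p_uni = {"apple": 0.001, "swim": 0.0005}
    p_bi  = {("an", "apple"): 0.02}

    print(interpolate("apple", "an", p_bi, p_uni))   # mixes both estimates
    print(backoff("swim", "often", p_bi, p_uni))     # unseen bigram -> backed-off unigram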

Entropy and Perplexity
classical entropy (uncertainty): H(X) = -sum_x p(x) log p(x), i.e., how many bits (on average) for encoding
sequence entropy (distribution over sequences): H(L) = lim_n (1/n) H(w1 ... wn) = lim_n -(1/n) sum_{w1...wn in L} p(w1...wn) log p(w1...wn)
Shannon-McMillan-Breiman theorem: H(L) = lim_n -(1/n) log p(w1...wn)
so we don't need all w in L! if w is long enough, just taking -(1/n) log p(w) is enough!
perplexity is 2^{H(L)}
Q: why 1/n?
22
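
A small sketch of the perplexity computation implied by the shortcut above: estimate H as -(1/n) log2 p(w1 ... wn) on a test sequence and take 2^H; cond_prob stands in for whatever model supplies p(w_i | history), and the 1/n is exactly the per-word normalization the last question asks about:

    import math

    def perplexity(words, cond_prob):
        """2 ** ( -(1/n) * sum_i log2 p(w_i | w_1 ... w_{i-1}) )."""
        log_prob = sum(math.log2(cond_prob(w, words[:i])) for i, w in enumerate(words))
        return 2 ** (-log_prob / len(words))

    # a uniform "0-gram" model over a 1000-word vocabulary gives perplexity 1000
    print(perplexity(["any", "three", "words"], lambda w, hist: 1 / 1000))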

Perplexity of English
on a 1.5-million-word WSJ test set:
  unigram: 962   (9.9 bits)
  bigram:  170   (7.4 bits)
  trigram: 109   (6.8 bits)
higher-order n-grams generally have lower perplexity, but the gain beyond trigrams is not that significant
what about humans??
23

Shannon Game guess the next letter; compute entropy (bits per char) 0-gram: 4.76, 1-gram: 4.03, 2-gram: 3.32, 3-gram: 3.1 native speaker: ~1.1 (0.6~1.3); me: ~2.3 http://math.ucsd.edu/~crypto/java/entropy/ 24