CSA4050: Advanced Topics in Natural Language Processing. Lecture Statistics III: Statistical Approaches to NLP

University of Malta, BSc IT (Hons) Year IV
CSA4050: Advanced Topics in Natural Language Processing
Lecture Statistics III: Statistical Approaches to NLP
Witten-Bell Discounting: Unigrams, Bigrams
Dept. of Computer Science and AI, 2003/04
Lecturer: Michael Rosner

Add-One Smoothing

Add-one smoothing is not a very good method because:
- Too much probability mass is assigned to previously unseen bigrams.
- Too little probability is assigned to those bigrams having non-zero counts.
- Add-one is worse at predicting zero-count bigrams than other methods.
- The variances of the counts produced by the add-one method are actually worse than those for the unsmoothed method.
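To make the first criticism concrete, here is a rough back-of-the-envelope calculation in Python. All numbers are invented for illustration (they are not from the lecture), and the bigram distribution is treated jointly, with one pseudo-count per possible bigram type, to keep the arithmetic simple:

```python
# Illustration of how add-one smoothing over-allocates mass to unseen bigrams.
V = 10_000                # assumed vocabulary size (word types)
N = 1_000_000             # assumed number of observed bigram tokens
seen_types = 300_000      # assumed number of distinct bigrams actually observed

possible_bigrams = V * V              # 10^8 possible bigram types
unseen_types = possible_bigrams - seen_types

# Add-one gives every possible bigram type one extra count.
total_counts = N + possible_bigrams
mass_to_unseen = unseen_types / total_counts

print(f"Probability mass assigned to unseen bigrams: {mass_to_unseen:.1%}")
# With these numbers, roughly 99% of the mass goes to bigrams never seen.
```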

Witten-Bell (1991) Discounting

- Assume that a zero-frequency event is one that has not happened yet.
- When it does happen, it will be the first time it happens.
- So the probability of seeing a zero-frequency N-gram can be modelled by the probability of seeing an N-gram for the first time.
- Key idea: use the count of things you've seen for the first time to help estimate the count of things you've never seen.
- What are the things we have seen for the first time?

Probability of N-Grams Seen for the First Time

The number of times an N-gram was seen for the first time is the same as the number of N-gram types observed in the corpus. We therefore estimate the total probability mass of all the zero-count N-grams as

    T / (N + T)

where T is the number of observed types and N is the number of observed tokens. Dividing by Z, the number of unseen N-grams, gives the probability per unseen N-gram:

    T / (Z (N + T))
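As a quick worked example with invented numbers (not from the lecture): if a corpus contains N = 1000 N-gram tokens covering T = 250 distinct types, the mass reserved for unseen N-grams is 250 / 1250 = 0.2; if Z = 750 possible N-grams were never observed, each of them receives 250 / (750 × 1250) ≈ 0.00027.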

Discounting

Multiplying this probability by N yields the adjusted count for zero-count N-grams:

    c*_i = N · T / (Z (N + T))

Now, the total probability mass assigned to unseen N-grams, T / (N + T), came from discounting the probability of the seen N-grams. Clearly, the remaining probability mass is

    1 - T / (N + T) = N / (N + T)

This multiplier can be used for obtaining discounted probabilities p*_i = (N / (N + T)) p_i and smoothed counts c*_i = (N / (N + T)) c_i for the seen N-grams.

Summary for Unigrams

    p*_i = T / (Z (N + T))        if c_i = 0
    p*_i = (N / (N + T)) p_i      if c_i > 0

We now extend the analysis to bigrams.
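A minimal Python sketch of these unigram formulas. The function name, the toy corpus, and the explicit `vocab` argument are my own additions for illustration, not part of the lecture; it assumes at least one vocabulary word is unseen, so Z > 0:

```python
from collections import Counter

def witten_bell_unigram(tokens, vocab):
    """Witten-Bell smoothed unigram probabilities, following the summary above.

    `tokens` is the training corpus as a list of words; `vocab` is the assumed
    full vocabulary, needed to know Z, the number of unseen word types.
    """
    counts = Counter(tokens)
    N = len(tokens)              # number of observed tokens
    T = len(counts)              # number of observed types
    Z = len(vocab) - T           # number of unseen types (assumed > 0)
    probs = {}
    for w in vocab:
        c = counts.get(w, 0)
        if c == 0:
            probs[w] = T / (Z * (N + T))         # per-word share of the reserved mass T/(N+T)
        else:
            probs[w] = (N / (N + T)) * (c / N)   # discounted MLE estimate
    return probs

# Toy usage (hypothetical data): "dog" never occurs, so it gets the unseen share.
vocab = {"the", "cat", "sat", "on", "mat", "dog"}
p = witten_bell_unigram("the cat sat on the mat".split(), vocab)
assert abs(sum(p.values()) - 1.0) < 1e-9   # the smoothed distribution still sums to 1
```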

Witten-Bell for Bigrams

- Generalise "seeing a unigram for the first time" to "seeing a bigram with a given first member for the first time".
- The probability of an unseen bigram w_x w_k is estimated using the probability of bigrams w_x w_i seen for the first time, i.e. from the number of distinct w_x w_i types actually observed.
- Not all conditioning events are equally informative, e.g. "Coca ..." vs. "the ...": there are very few "Coca x" types compared to "the x" types.
- Therefore we should give a lower estimate for unseen bigrams in contexts where very few distinct word types follow the conditioning event.

Witten-Bell for Bigrams: Unseen Bigrams 1

For the collection of unseen bigrams w_x w_i, the total probability mass to be assigned is

    Σ_{i : c(w_x w_i) = 0} p*(w_i | w_x) = T(w_x) / (N(w_x) + T(w_x))

where
- N(w) is the number of observed bigram tokens beginning with w
- T(w) is the number of observed bigram types beginning with w

Witten-Bell for Bigrams: Unseen Bigrams 2

To obtain the adjusted probability per unseen bigram, divide by Z(w_x), the number of unseen bigrams beginning with w_x:

    p*(w_i | w_x) = T(w_x) / (Z(w_x) (N(w_x) + T(w_x)))

Since we know the vocabulary size V (the total number of word types), we know the number of theoretically possible bigrams beginning with w_x. We also know T(w_x), the number of observed bigram types beginning with w_x. From these we can infer

    Z(w_x) = V - T(w_x)

Witten-Bell for Bigrams 3

The probability mass transferred to unseen bigrams must now be discounted from the observed ones, whose adjusted probability is

    p*(w_i | w_x) = c(w_x w_i) / (c(w_x) + T(w_x))

where T(w_x) is the number of observed bigram types beginning with w_x.

Summary for Bigrams

    p*(w_i | w_x) = T(w_x) / (Z(w_x) (N(w_x) + T(w_x)))    if c(w_x w_i) = 0
    p*(w_i | w_x) = c(w_x w_i) / (c(w_x) + T(w_x))          if c(w_x w_i) > 0
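Putting the bigram case together, here is a minimal Python sketch of the summary above. The function name and toy corpus are hypothetical additions; `V` is the assumed vocabulary size, used only to compute Z(w_x) = V - T(w_x):

```python
from collections import Counter, defaultdict

def witten_bell_bigram(tokens, V):
    """Witten-Bell smoothed bigram probabilities p*(w_i | w_x), per the summary above.

    Assumes w_x occurred at least once as a context; otherwise the lecture's
    formulas are undefined and a separate fallback would be needed.
    """
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])   # c(w_x) = N(w_x): bigram tokens starting with w_x
    types_after = defaultdict(set)
    for (w_x, w_i) in bigram_counts:
        types_after[w_x].add(w_i)

    def prob(w_i, w_x):
        c_xi = bigram_counts[(w_x, w_i)]
        c_x = context_counts[w_x]
        T_x = len(types_after[w_x])         # observed bigram types starting with w_x
        Z_x = V - T_x                       # unseen bigram types starting with w_x
        if c_xi == 0:
            return T_x / (Z_x * (c_x + T_x))
        return c_xi / (c_x + T_x)

    return prob

# Toy usage (hypothetical data, vocabulary of 6 word types):
prob = witten_bell_bigram("the cat sat on the mat".split(), V=6)
print(prob("cat", "the"), prob("dog", "the"))   # seen vs. unseen continuation of "the"
```

With this toy corpus, the two seen continuations of "the" share probability 0.5 and the four unseen ones share the remaining 0.5, so the conditional distribution p*(· | the) still sums to one.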