N-grams: Motivation. Simple n-grams. Smoothing. Backoff. L545, Dept. of Linguistics, Indiana University, Spring 2013

L545, Dept. of Linguistics, Indiana University, Spring 2013

Morphosyntax
We just finished talking about morphology (cf. words), and pretty soon we're going to discuss syntax (cf. sentences). In between, we'll handle words in context.
Today: n-gram language modeling. Next time: POS tagging.
Both of these topics are covered in more detail in L645.

Motivation
An n-gram is a stretch of text n words long.
Approximation of language: n-grams tell us something about language, but don't capture structure.
Efficient: finding and using every, e.g., two-word collocation in a text is quick and easy to do.
N-grams can help in a variety of NLP applications:
Word prediction: n-grams can be used to aid in predicting the next word of an utterance, based on the previous n-1 words
Context-sensitive spelling correction
Machine translation post-editing
...

Corpus-based NLP
Corpus (pl. corpora) = a computer-readable collection of text and/or speech, often with annotations.
We use corpora to gather probabilities and other information about language use.
A corpus used to gather prior information = training data; testing data = the data one uses to test the accuracy of a method.
A word may refer to:
type = a distinct word (e.g., like)
token = an individual occurrence of a word (e.g., the type like might have 20,000 tokens in a corpus)
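To make the type/token distinction concrete, here is a minimal Python sketch; the toy sentence is invented for illustration and whitespace tokenization is assumed:

    # A minimal sketch: counting word types vs. word tokens in a toy corpus.
    corpus = "i like cats and i like dogs".split()

    num_tokens = len(corpus)       # every occurrence counts: 7
    num_types = len(set(corpus))   # distinct words only: 5 (i, like, cats, and, dogs)

    print(num_tokens, num_types)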

Let's assume we want to predict the next word, based on the previous context of I dreamed I saw the knights in.
What we want to find is the likelihood of w_8 being the next word, given that we've seen w_1, ..., w_7, i.e., P(w_8 | w_1, ..., w_7).
But, to start with, we examine P(w_1, ..., w_8). In general, for w_n, we are concerned with:
(1) P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1, ..., w_{n-1})
These probabilities are impractical to calculate, as such long sequences hardly ever occur in a corpus, if at all; and there would be too much data to store, even if we could calculate them.

Unigrams
Approximate these probabilities with n-grams, for a given n.
Unigrams (n = 1):
(2) P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n)
Easy to calculate, but they lack contextual information:
(3) The quick brown fox jumped
We would like to say that over has a higher probability in this context than lazy does.

Bigrams
Bigrams (n = 2) give context and are still easy to calculate:
(4) P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-1})
(5) P(over | The, quick, brown, fox, jumped) ≈ P(over | jumped)
The probability of a sentence:
(6) P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) ... P(w_n | w_{n-1})

Markov models
A bigram model is also called a first-order Markov model: first-order because it has one element of memory (one token in the past).
Markov models are essentially weighted FSAs, i.e., the arcs between states have probabilities; the states in the FSA are words.
More on Markov models when we hit POS tagging...

Bigram example
What is the probability of seeing the sentence The quick brown fox jumped over the lazy dog?
(7) P(The quick brown fox jumped over the lazy dog) = P(The | START) P(quick | The) P(brown | quick) ... P(dog | lazy)
Probabilities are generally small, so log probabilities are usually used.
Q: Does this favor shorter sentences?
A: Yes, but it also depends upon P(END | lastword).
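As a concrete sketch of equation (7) and of working with log probabilities, here is a small Python example; the bigram probability values, the <s>/</s> markers, and the sentence_logprob helper are all assumptions made up for illustration:

    import math

    # Assumed bigram probabilities (not estimated from any real corpus).
    bigram_prob = {
        ("<s>", "the"): 0.30, ("the", "quick"): 0.05, ("quick", "brown"): 0.20,
        ("brown", "fox"): 0.10, ("fox", "jumped"): 0.15, ("jumped", "over"): 0.25,
        ("over", "the"): 0.40, ("the", "lazy"): 0.02, ("lazy", "dog"): 0.30,
        ("dog", "</s>"): 0.50,
    }

    def sentence_logprob(words):
        """Sum of log P(w_n | w_{n-1}) over the sentence, with START/END markers."""
        padded = ["<s>"] + words + ["</s>"]
        return sum(math.log(bigram_prob[(prev, w)])
                   for prev, w in zip(padded, padded[1:]))

    print(sentence_logprob("the quick brown fox jumped over the lazy dog".split()))

Summing logs avoids the numerical underflow you would get from multiplying many small probabilities directly.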

Trigrams
Trigrams (n = 3) encode more context.
Wider context: P(know | did, he) vs. P(know | he)
Generally, trigrams are still short enough that we will have enough data to gather accurate probabilities.

Training n-gram models
Go through the corpus and calculate relative frequencies:
(8) P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
(9) P(w_n | w_{n-2}, w_{n-1}) = C(w_{n-2}, w_{n-1}, w_n) / C(w_{n-2}, w_{n-1})
This technique of gathering probabilities from a training corpus is called maximum likelihood estimation (MLE).
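A minimal Python sketch of MLE bigram estimation, equation (8); the toy sentences and the p_mle helper are invented for illustration:

    from collections import Counter

    sentences = [["i", "like", "cats"], ["i", "like", "dogs"], ["cats", "like", "dogs"]]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        unigram_counts.update(padded)
        bigram_counts.update(zip(padded, padded[1:]))

    def p_mle(w, prev):
        """Relative-frequency (maximum likelihood) bigram probability."""
        return bigram_counts[(prev, w)] / unigram_counts[prev]

    print(p_mle("like", "i"))     # C(i, like) / C(i) = 2/2 = 1.0
    print(p_mle("cats", "like"))  # C(like, cats) / C(like) = 1/3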

Know your corpus
We mentioned earlier the split between training and testing data.
It's important to remember what your training data is when applying your technology to new data: if you train your trigram model on Shakespeare, then you have learned the probabilities in Shakespeare, not the probabilities of English overall.
What corpus you use depends on your purpose.

Smoothing
Assume a bigram model has been trained on a good corpus (i.e., it has learned MLE bigram probabilities).
It won't have seen every possible bigram: lickety split is a possible English bigram, but it may not be in the corpus.
Problem = data sparsity: zero-probability bigrams that are actually possible bigrams in the language.
Smoothing techniques account for this: adjust probabilities to account for unseen data, and make zero probabilities non-zero.

Add-One
One way to smooth is to add a count of one to every bigram. In order to still have a probability distribution, all probabilities need to sum to one; thus, we add the number of word types to the denominator (we added one to every type of bigram, so we need to account for all our numerator additions):
(10) P*(w_n | w_{n-1}) = (C(w_{n-1}, w_n) + 1) / (C(w_{n-1}) + V)
V = total number of word types in the lexicon

Add-One example
So, if treasure trove never occurred in the data, but treasure occurred twice, we have:
(11) P*(trove | treasure) = (0 + 1) / (2 + V)
The probability won't be very high, but it will be better than 0.
If the surrounding probabilities are high, treasure trove could be the best pick; if the probability were zero, it would have no chance of appearing.
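A quick numeric sketch of equation (11); the bigram and unigram counts come from the example above, but the vocabulary size V is an assumed value:

    # Add-one (Laplace) estimate for the unseen bigram "treasure trove".
    count_bigram = 0      # C(treasure, trove): never seen in training data
    count_treasure = 2    # C(treasure): seen twice
    V = 10_000            # assumed number of word types in the lexicon

    p_smoothed = (count_bigram + 1) / (count_treasure + V)
    print(p_smoothed)     # small (~1e-4), but no longer zero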

Discounting
An alternative way of viewing smoothing is as discounting: lowering non-zero counts to obtain the probability mass we need for the zero-count items.
The discounting factor can be defined as the ratio of the smoothed count to the MLE count.
Jurafsky and Martin show that add-one smoothing can discount probabilities by a factor of 10! Too much of the probability mass is now in the zeros.
We will examine one way of handling this; more in L645.

Witten-Bell Discounting
Idea: use the counts of words you have seen once to estimate those you have never seen.
Instead of simply adding one to every n-gram, compute the probability of w_{i-1}, w_i by seeing how likely w_{i-1} is to start any bigram: words that begin lots of bigrams lead to higher unseen-bigram probabilities.
Non-zero bigrams are discounted in essentially the same manner as zero-count bigrams; Jurafsky and Martin show that they are only discounted by about a factor of one.

Witten-Bell Discounting formula
(12) For zero-count bigrams: P*(w_i | w_{i-1}) = T(w_{i-1}) / ( Z(w_{i-1}) (N(w_{i-1}) + T(w_{i-1})) )
T(w_{i-1}) = number of bigram types starting with w_{i-1}; this determines how high the value will be (the numerator)
N(w_{i-1}) = number of bigram tokens starting with w_{i-1}; N(w_{i-1}) + T(w_{i-1}) gives the total number of events to divide by
Z(w_{i-1}) = number of bigram types starting with w_{i-1} and having zero count; this distributes the probability mass among all zero-count bigrams starting with w_{i-1}
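A small Python sketch of formula (12) for zero-count bigrams; the toy corpus, the vocabulary, and the witten_bell_unseen helper are assumptions for illustration:

    from collections import Counter

    vocab = ["i", "like", "cats", "dogs"]
    sentences = [["i", "like", "cats"], ["i", "like", "dogs"]]

    bigram_counts = Counter()
    for sent in sentences:
        bigram_counts.update(zip(sent, sent[1:]))

    def witten_bell_unseen(prev):
        """Witten-Bell estimate for any zero-count bigram starting with `prev`."""
        starts = [b for b in bigram_counts if b[0] == prev]
        T = len(starts)                            # bigram types starting with prev
        N = sum(bigram_counts[b] for b in starts)  # bigram tokens starting with prev
        Z = sum(1 for w in vocab if (prev, w) not in bigram_counts)  # zero-count continuations
        return T / (Z * (N + T))

    # "like i" and "like like" were never seen; each gets T/(Z*(N+T)) = 2/(2*(2+2)) = 0.25
    print(witten_bell_unseen("like"))

Note that the estimate is the same for every zero-count continuation of the same history word, which is why the helper only takes the history as an argument.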

Kneser-Ney (absolute discounting)
Witten-Bell discounting is based on using relative discounting factors. Kneser-Ney simplifies this by using absolute discounting: instead of multiplying by a ratio, simply subtract some discounting factor.
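A minimal sketch of absolute discounting for a single seen bigram; the counts and the discount D are assumed values, and the redistribution of the freed-up mass (and Kneser-Ney's continuation counts) is omitted here:

    # Absolute discounting: subtract a fixed amount D from the count instead of
    # rescaling by a ratio. All values below are assumptions for illustration.
    D = 0.75              # absolute discount
    count_bigram = 3      # C(w_{n-1}, w_n)
    count_prev = 10       # C(w_{n-1})

    p_discounted = max(count_bigram - D, 0) / count_prev
    print(p_discounted)   # (3 - 0.75) / 10 = 0.225; the 0.75 is freed up for unseen bigrams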

Class-based
Intuition: we may not have seen a word before, but we may have seen a word like it.
Never observed Shanghai, but have seen other cities.
Can use a type of hard clustering, where each word is assigned to only one class (IBM clustering):
(13) P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) P(w_i | c_i)
POS tagging equations will look fairly similar to this...
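A tiny sketch of equation (13) with hard clusters; the class assignments and all probability values are invented for illustration:

    # Class-based (IBM-style) approximation: P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) * P(w_i | c_i).
    word_class = {"Shanghai": "CITY", "London": "CITY", "visited": "VERB"}
    p_class_bigram = {("VERB", "CITY"): 0.4}            # assumed P(c_i | c_{i-1})
    p_word_given_class = {("Shanghai", "CITY"): 0.05}   # assumed P(w_i | c_i)

    def p_class_based(w, prev):
        ci, cprev = word_class[w], word_class[prev]
        return p_class_bigram[(cprev, ci)] * p_word_given_class[(w, ci)]

    # "visited Shanghai" gets probability mass even if that word bigram was never observed:
    print(p_class_based("Shanghai", "visited"))   # 0.4 * 0.05 = 0.02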

Backoff models: Basic idea
Assume a trigram model for predicting language, where we haven't seen a particular trigram before. Maybe we've seen the bigram, or the unigram.
Backoff models allow one to try the most informative n-gram first and then back off to lower-order n-grams.

Backoff equations
Roughly speaking, this is how a backoff model works.
If the trigram has a non-zero count, use that:
(14) P̂(w_i | w_{i-2} w_{i-1}) = P(w_i | w_{i-2} w_{i-1})
Else, if the bigram count is non-zero, use that:
(15) P̂(w_i | w_{i-2} w_{i-1}) = α_1 P(w_i | w_{i-1})
In all other cases, use the unigram information:
(16) P̂(w_i | w_{i-2} w_{i-1}) = α_2 P(w_i)

Backoff models: example
Assume we have never seen the trigram maples want more before. If we have seen want more, we use that bigram to calculate a probability estimate, P(more | want).
But we are now assigning probability to P(more | maples, want), which was zero before, so we no longer have a true probability model. This is why α_1 was used in the previous equations: to down-weight the backed-off probability.
In general, backoff models are combined with discounting models.
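A small Python sketch in the spirit of equations (14)-(16); the probability tables and the weights alpha1/alpha2 are made-up values (a real model would estimate them, together with discounting, so the distribution still sums to one):

    # Backoff: use the trigram estimate if seen, else the (weighted) bigram, else the unigram.
    p_tri = {("i", "really", "like"): 0.5}
    p_bi = {("really", "like"): 0.3, ("want", "more"): 0.2}
    p_uni = {"like": 0.05, "more": 0.04}
    alpha1, alpha2 = 0.4, 0.1   # assumed backoff weights

    def p_backoff(w, w2, w1):
        """Estimate P(w | w2 w1) by backing off through shorter histories."""
        if (w2, w1, w) in p_tri:
            return p_tri[(w2, w1, w)]
        if (w1, w) in p_bi:
            return alpha1 * p_bi[(w1, w)]
        return alpha2 * p_uni.get(w, 0.0)

    print(p_backoff("like", "i", "really"))     # trigram seen: 0.5
    print(p_backoff("more", "maples", "want"))  # trigram unseen, bigram seen: 0.4 * 0.2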

A note on information theory
Some very useful notions for n-gram work can be found in information theory. Basic ideas:
entropy = a measure of how much information is needed to encode something
perplexity = a measure of the amount of surprise of an outcome
mutual information = the amount of information one item has about another item (e.g., collocations have high mutual information)
Take L645 to find out more...
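As a small worked sketch of how perplexity relates to cross-entropy for a language model (perplexity = 2^H, where H is the average negative log2 probability per test word), with assumed per-word probabilities:

    import math

    # Assumed P(w_i | history) for four test words under some model.
    word_probs = [0.2, 0.1, 0.25, 0.05]

    N = len(word_probs)
    cross_entropy = -sum(math.log2(p) for p in word_probs) / N   # bits per word
    perplexity = 2 ** cross_entropy
    print(cross_entropy, perplexity)   # roughly 3.0 bits, perplexity around 8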