Language Models Hongning Wang CS@UVa

Notion of Relevance. Retrieval models can be organized by how they operationalize relevance:
- Similarity-based: relevance ≈ similarity(Rep(q), Rep(d)): vector space model (Salton et al., 75), regression model (Fox, 83), probabilistic distribution model (Wong & Yao, 89)
- Probability of relevance P(r=1 | q, d), r ∈ {0, 1}, via generative models:
  - Document generation: classical probabilistic model (Robertson & Sparck Jones, 76)
  - Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
- Probabilistic inference P(d→q) or P(q→d): probabilistic concept space model (Wong & Yao, 95), inference network model (Turtle & Croft, 91)

Query generation models. Start from the odds of relevance:

O(R=1 | Q, D) ∝ P(Q, D | R=1) / P(Q, D | R=0)
             = [P(Q | D, R=1) P(D | R=1)] / [P(Q | D, R=0) P(D | R=0)]
             ≈ P(Q | D, R=1) · P(D | R=1) / P(D | R=0)    (assuming P(Q | D, R=0) ≈ P(Q | R=0))

The first factor is the query likelihood p(q | θ_d); the ratio P(D | R=1) / P(D | R=0) is the document prior. Assuming a uniform document prior, we have O(R=1 | Q, D) ∝ P(Q | D, R=1). Now the question is how to compute P(Q | D, R=1). Generally this involves two steps: (1) estimate a language model based on D, and (2) compute the query likelihood according to the estimated model. Language models, which we will cover now!

What is a statistical LM? A model specifying a probability distribution over word sequences, e.g.,

p("Today is Wednesday") ≈ 0.001
p("Today Wednesday is") ≈ 0.0000000000001
p("The eigenvalue is positive") ≈ 0.00001

It can be regarded as a probabilistic mechanism for generating text, and is thus also called a generative model.

Why is a LM useful? It provides a principled way to quantify the uncertainties associated with natural language, and allows us to answer questions like:
- Given that we see "John" and "feels", how likely are we to see "happy" as opposed to "habit" as the next word? (speech recognition)
- Given that we observe "baseball" three times and "game" once in a news article, how likely is it to be about sports? (text categorization, information retrieval)
- Given that a user is interested in sports news, how likely is the user to use "baseball" in a query? (information retrieval)

Source-Channel framework [Shannon 48]. A source emits X with probability P(X); the transmitter (encoder) sends X through a noisy channel P(Y | X); the receiver (decoder) recovers X from Y:

X̂ = argmax_X p(X | Y) = argmax_X p(Y | X) p(X)    (Bayes' rule)

When X is text, p(X) is a language model. Many examples:
- Speech recognition: X = word sequence, Y = speech signal
- Machine translation: X = English sentence, Y = Chinese sentence
- OCR error correction: X = correct word, Y = erroneous word
- Information retrieval: X = document, Y = query
- Summarization: X = summary, Y = document

Language model for text. A probability distribution over word sequences. Using the chain rule (from conditional probabilities to the joint probability):

p(w_1 w_2 ... w_n) = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) ... p(w_n | w_1, w_2, ..., w_{n-1})

Complexity: O(V^n), where n is the maximum document (sentence) length, so we need independence assumptions! A rough estimate of the parameter space: Webster's Third New International Dictionary has 475,000 main headwords, and the average English sentence length is 14.3 words, giving on the order of 475,000^14 parameters. How large is this? Roughly 3.38 × 10^66 TB of storage.

Unigram language model. Generate a piece of text by generating each word independently:

p(w_1 w_2 ... w_n) = p(w_1) p(w_2) ... p(w_n),   s.t.  Σ_{i=1}^{N} p(w_i) = 1,  p(w_i) ≥ 0

This is essentially a multinomial distribution over the vocabulary, and it is the simplest and most popular choice! [Figure: a unigram language model drawn as a bar chart of p(w) over the words w1 ... w25.]
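
As a small illustration (not from the slides): under a unigram model, generating text is just repeated independent draws from one distribution over the vocabulary, and the probability of a piece of text is the product of its word probabilities. The toy distribution below is made up purely for illustration.

```python
import random
from math import log

# A made-up unigram distribution over a tiny vocabulary (probabilities sum to 1)
unigram = {"text": 0.1, "mining": 0.08, "data": 0.12, "the": 0.4, "is": 0.2, "clustering": 0.1}

def log_prob(words, lm):
    """log p(w1 ... wn) = sum_i log p(wi) under the independence assumption."""
    return sum(log(lm[w]) for w in words)

def generate(lm, n, seed=0):
    """Generate n words by sampling each word independently from p(w)."""
    rng = random.Random(seed)
    vocab, probs = zip(*lm.items())
    return rng.choices(vocab, weights=probs, k=n)

print(log_prob(["data", "mining"], unigram))
print(" ".join(generate(unigram, 8)))
```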

More sophisticated LMs. N-gram language models: in general,

p(w_1 w_2 ... w_n) = p(w_1) p(w_2 | w_1) ... p(w_n | w_1 ... w_{n-1})

An N-gram model conditions only on the past N-1 words; e.g., a bigram model:

p(w_1 ... w_n) = p(w_1) p(w_2 | w_1) p(w_3 | w_2) ... p(w_n | w_{n-1})

Other options include remote-dependence language models (e.g., maximum entropy models) and structured language models (e.g., probabilistic context-free grammars).

Why just unigram models? There are several difficulties in moving toward more complex models:
- They involve more parameters, so they need more data to estimate.
- They significantly increase the computational complexity, in both time and space.
- Capturing word order or structure may not add much value for topical inference.
But using more sophisticated models can still be expected to improve performance...

Language models for IR [Ponte & Croft SIGIR 98]. Query: "data mining algorithms". Each document is represented by its own language model:
- Text mining paper → document LM with high p(text), p(mining), p(association), p(clustering), ...
- Food nutrition paper → document LM with high p(food), p(nutrition), p(healthy), p(diet), ...
Which model would most likely have generated this query?

Ranking docs by query likelihood. Estimate a document language model θ_{d_i} for each document d_i, then rank by the likelihood of the query under each model:

d_1 → θ_{d_1} → p(q | θ_{d_1})
d_2 → θ_{d_2} → p(q | θ_{d_2})
...
d_N → θ_{d_N} → p(q | θ_{d_N})

Justification: the probability ranking principle (PRP), since O(R=1 | Q, D) ∝ P(Q | D, R=1).

Justification from PRP (query generation):

O(R=1 | Q, D) ∝ P(Q, D | R=1) / P(Q, D | R=0)
             = [P(Q | D, R=1) P(D | R=1)] / [P(Q | D, R=0) P(D | R=0)]
             ≈ P(Q | D, R=1) · P(D | R=1) / P(D | R=0)    (assuming P(Q | D, R=0) ≈ P(Q | R=0))

The first factor is the query likelihood p(q | θ_d); the ratio of priors is the document prior. Assuming a uniform document prior, we have O(R=1 | Q, D) ∝ P(Q | D, R=1).

Retrieval as language model estimation. Document ranking is based on the query likelihood

log p(q | d) = Σ_{i=1}^{n} log p(w_i | d),   where q = w_1 w_2 ... w_n

and p(w | d) is the document language model. The retrieval problem thus becomes the estimation of p(w_i | d); the common approach is maximum likelihood estimation (MLE).
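
A minimal sketch of ranking by query log-likelihood (the function names and the plug-in estimator `p_w_given_d` are placeholders, not from the slides; without smoothing, a zero probability would make `log` fail, which motivates the rest of the lecture):

```python
from math import log

def query_log_likelihood(query_words, doc, p_w_given_d):
    """log p(q|d) = sum_i log p(w_i|d), for some estimator p_w_given_d(w, doc) > 0."""
    return sum(log(p_w_given_d(w, doc)) for w in query_words)

def rank_documents(query_words, docs, p_w_given_d):
    """Return (score, doc_index) pairs sorted by query log-likelihood, highest first."""
    scored = [(query_log_likelihood(query_words, d, p_w_given_d), i)
              for i, d in enumerate(docs)]
    return sorted(scored, reverse=True)
```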

Recap: maximum likelihood estimation. Data: a document d with counts c(w_1), ..., c(w_N). Model: a multinomial distribution p(W | θ) with parameters θ_i = p(w_i). The maximum likelihood estimator is θ̂ = argmax_θ p(W | θ), where, up to the multinomial coefficient,

p(W | θ) ∝ Π_{i=1}^{N} θ_i^{c(w_i)},   so   log p(W | θ) = Σ_{i=1}^{N} c(w_i) log θ_i + const.

Using the Lagrange multiplier approach with the probability requirement Σ_{i=1}^{N} θ_i = 1, we tune θ_i to maximize

L(W, θ) = Σ_{i=1}^{N} c(w_i) log θ_i + λ (Σ_{i=1}^{N} θ_i − 1)

Setting the partial derivatives to zero, ∂L/∂θ_i = c(w_i)/θ_i + λ = 0, gives θ_i = −c(w_i)/λ; since Σ_i θ_i = 1, we get λ = −Σ_i c(w_i). The ML estimate is therefore

θ_i = c(w_i) / Σ_{i=1}^{N} c(w_i)
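
A minimal sketch of the closed-form ML estimate derived above, θ_i = c(w_i) / Σ_j c(w_j), assuming the document is simply a whitespace-tokenized string:

```python
from collections import Counter

def mle_multinomial(doc_text):
    """ML estimate of a unigram (multinomial) LM: theta_i = c(w_i) / sum_j c(w_j)."""
    counts = Counter(doc_text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

theta = mle_multinomial("text mining is the study of mining patterns from text")
print(theta["text"], theta["mining"])  # both 0.2 (2 occurrences out of 10 tokens)
```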

Problem with MLE: unseen events. In a large collection of Yelp reviews there are 440K tokens, but only 30,000 unique words occur, and only 0.04% of all possible bigrams occur (word frequency roughly follows Zipf's law, f(k; s, N) ∝ 1/k^s). This means any word or n-gram that does not occur in the collection gets a zero probability under MLE, so no future query or document could contain those unseen words/n-grams. [Figure: log-log plot of word frequency vs. word rank in Wikipedia (Nov 27, 2006).]

Problem with MLE. What probability should we give a word that has not been observed in the document? (log 0 is undefined.) If we want to assign non-zero probabilities to such words, we will have to discount the probabilities of the observed words. This is the so-called smoothing.

Illustration of language model smoothing. [Figure: p(w | d) plotted over words w. The maximum likelihood estimate p_ML(w) = count of w / count of all words is discounted for the seen words, and the smoothed LM reassigns the freed mass as nonzero probabilities to the unseen words.]

General idea of smoothing. All smoothing methods try to:
1. Discount the probability of words seen in a document.
2. Re-allocate the extra counts so that unseen words will have a non-zero count.

Smoothing methods. Method 1: Additive smoothing. Add a constant δ to the count of each word; with δ = 1 this is "add one" (Laplace) smoothing:

p(w | d) = (c(w, d) + 1) / (|d| + |V|)

where c(w, d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size. Problems? Hint: are all words equally important?
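
A sketch of additive (add-δ) smoothing under these definitions; the vocabulary must be fixed in advance, and generalizing the slide's add-one formula to an arbitrary δ is my assumption here:

```python
from collections import Counter

def additive_smoothing(doc_text, vocabulary, delta=1.0):
    """p(w|d) = (c(w,d) + delta) / (|d| + delta*|V|); delta=1 gives Laplace / add-one."""
    counts = Counter(doc_text.lower().split())
    doc_len = sum(counts.values())
    denom = doc_len + delta * len(vocabulary)
    return {w: (counts.get(w, 0) + delta) / denom for w in vocabulary}
```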

Add-one smoothing for bigrams. [Figure: bigram count table with add-one smoothing.]

After smoothing: add-one gives too much probability mass to the unseen events. [Figure.]

Refine the idea of smoothing. Should all unseen words get equal probabilities? We can use a reference language model to discriminate among unseen words. Discounted ML estimate:

p(w | d) = p_seen(w | d)        if w is seen in d
         = α_d · p(w | REF)     otherwise

where p(w | REF) is the reference language model and

α_d = (1 − Σ_{w seen} p_seen(w | d)) / Σ_{w unseen} p(w | REF)

Smoothing methods. Method 2: Absolute discounting. Subtract a constant δ from the count of each seen word:

p(w | d) = [max(c(w; d) − δ, 0) + δ |d|_u p(w | REF)] / |d|

where |d|_u is the number of unique words in d. Problems? Hint: what about varied document lengths?
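
A sketch of absolute discounting as written above, assuming the reference model p_ref is given as a dict over the vocabulary (names are placeholders):

```python
from collections import Counter

def absolute_discounting(doc_text, p_ref, delta=0.7):
    """p(w|d) = [max(c(w,d) - delta, 0) + delta * |d|_u * p_ref(w)] / |d|."""
    counts = Counter(doc_text.lower().split())
    doc_len = sum(counts.values())
    n_unique = len(counts)  # |d|_u: number of unique words in d
    return {w: (max(counts.get(w, 0) - delta, 0.0) + delta * n_unique * p_ref[w]) / doc_len
            for w in p_ref}
```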

Smoothing methods. Method 3: Linear interpolation (Jelinek-Mercer). Shrink uniformly toward p(w | REF):

p(w | d) = (1 − λ) · c(w, d) / |d| + λ · p(w | REF)

where c(w, d) / |d| is the MLE and λ ∈ [0, 1] is the smoothing parameter. Problems? Hint: what is missing?
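
A sketch of Jelinek-Mercer interpolation with the same conventions (p_ref is the reference/collection LM; λ is the free parameter the slide points at):

```python
from collections import Counter

def jelinek_mercer(doc_text, p_ref, lam=0.5):
    """p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p_ref(w)."""
    counts = Counter(doc_text.lower().split())
    doc_len = sum(counts.values())
    return {w: (1 - lam) * counts.get(w, 0) / doc_len + lam * p_ref[w] for w in p_ref}
```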

Smoothing methods. Method 4: Dirichlet prior (Bayesian). Assume μ · p(w | REF) pseudo counts for each word:

p(w | d) = [c(w; d) + μ p(w | REF)] / (|d| + μ)
         = |d| / (|d| + μ) · c(w, d) / |d| + μ / (|d| + μ) · p(w | REF)

where μ is the smoothing parameter. Problems?
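
A sketch of Dirichlet prior smoothing with the same conventions; note how the effective interpolation weight |d| / (|d| + μ) now depends on the document length:

```python
from collections import Counter

def dirichlet_smoothing(doc_text, p_ref, mu=2000.0):
    """p(w|d) = (c(w,d) + mu * p_ref(w)) / (|d| + mu)."""
    counts = Counter(doc_text.lower().split())
    doc_len = sum(counts.values())
    return {w: (counts.get(w, 0) + mu * p_ref[w]) / (doc_len + mu) for w in p_ref}
```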

Dirichlet prior smoothing as a Bayesian estimator. Posterior of the LM: p(θ | d) ∝ p(d | θ) p(θ), the likelihood of the document given the model times the prior over models. With a conjugate prior, the posterior has the same form as the prior, and the prior can be interpreted as extra/pseudo data. The Dirichlet distribution is the conjugate prior for the multinomial distribution:

Dir(θ | α_1, ..., α_N) = [Γ(α_1 + ... + α_N) / (Γ(α_1) ... Γ(α_N))] Π_{i=1}^{N} θ_i^{α_i − 1}

For the extra/pseudo word counts we set α_i = μ p(w_i | REF).

Some background knowledge. A conjugate prior yields a posterior distribution in the same family as the prior; familiar prior/likelihood pairs are Gaussian → Gaussian, Beta → Binomial, and Dirichlet → Multinomial. The Dirichlet distribution is continuous, and a sample from it is a parameter vector of a multinomial distribution.

Dirichlet prior smoothing (cont'd). The posterior distribution of the parameters is

p(θ | d) = Dir(θ | c(w_1) + α_1, ..., c(w_N) + α_N)

Property: if θ ~ Dir(θ | α), then E(θ_i) = α_i / Σ_j α_j. The predictive distribution is the same as the posterior mean:

p(w_i | θ̂) = ∫ p(w_i | θ) Dir(θ | c(w_1) + α_1, ..., c(w_N) + α_N) dθ
           = [c(w_i) + α_i] / [Σ_j (c(w_j) + α_j)]
           = [c(w_i) + μ p(w_i | REF)] / (|d| + μ)

which is exactly Dirichlet prior smoothing.

Estimating μ using leave-one-out [Zhai & Lafferty 02]. Each occurrence of a word w in document d_i is held out in turn and predicted by p(w | d_i^{−w}), the smoothed model estimated from the rest of the document. The leave-one-out log-likelihood is

L_{−1}(μ | C) = Σ_{i=1}^{N} Σ_{w ∈ V} c(w, d_i) log[ (c(w, d_i) − 1 + μ p(w | C)) / (|d_i| − 1 + μ) ]

and μ is chosen by the maximum likelihood estimator μ̂ = argmax_μ L_{−1}(μ | C).
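
A sketch of this leave-one-out objective; the slides do not specify an optimizer, so a simple grid search over arbitrary candidate μ values is used here purely for illustration:

```python
from collections import Counter
from math import log

def loo_log_likelihood(doc_counts, p_coll, mu):
    """L_{-1}(mu|C) = sum_d sum_w c(w,d) * log((c(w,d) - 1 + mu*p(w|C)) / (|d| - 1 + mu))."""
    ll = 0.0
    for counts in doc_counts:
        doc_len = sum(counts.values())
        for w, c in counts.items():
            ll += c * log((c - 1 + mu * p_coll[w]) / (doc_len - 1 + mu))
    return ll

def estimate_mu(docs, candidate_mus=(100, 500, 1000, 2000, 5000)):
    """Pick mu maximizing the leave-one-out log-likelihood (grid search for illustration)."""
    doc_counts = [Counter(d.lower().split()) for d in docs]
    total = Counter()
    for c in doc_counts:
        total.update(c)
    n_tokens = sum(total.values())
    p_coll = {w: c / n_tokens for w, c in total.items()}  # collection LM p(w|C)
    return max(candidate_mus, key=lambda mu: loo_log_likelihood(doc_counts, p_coll, mu))
```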

Understanding smoothing. Query q = "the algorithms for data mining"; "algorithms", "data", and "mining" are the topical words.

                    the      algorithms   for      data       mining
p_ML(w | d1):       0.04     0.001        0.02     0.002      0.003      →  p(q | d1) = 4.8 × 10^-12
p_ML(w | d2):       0.02     0.001        0.01     0.003      0.004      →  p(q | d2) = 2.4 × 10^-12

Here p("algorithms" | d1) = p("algorithms" | d2), p("data" | d1) < p("data" | d2), and p("mining" | d1) < p("mining" | d2), so intuitively d2 should score higher, yet p(q | d1) > p(q | d2). We should therefore make p("the") and p("for") less different across documents, and smoothing helps achieve exactly this. After smoothing with p(w | d) = 0.1 · p_DML(w | d) + 0.9 · p(w | REF):

                      the      algorithms   for      data       mining
p(w | REF):           0.2      0.00001      0.2      0.00001    0.00001
Smoothed p(w | d1):   0.184    0.000109     0.182    0.000209   0.000309  →  p(q | d1) = 2.35 × 10^-13
Smoothed p(w | d2):   0.182    0.000109     0.181    0.000309   0.000409  →  p(q | d2) = 4.53 × 10^-13

and now p(q | d1) < p(q | d2)!

Two-stage smoothing [Zhai & Lafferty 02].
Stage 1 explains unseen words with a Dirichlet prior (Bayesian), using the collection LM p(w | C) and parameter μ.
Stage 2 explains noise in the query with a two-component mixture against a user background model p(w | U) and parameter λ:

p(w | d) = (1 − λ) · [c(w, d) + μ p(w | C)] / (|d| + μ) + λ · p(w | U)
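
A sketch of the two-stage formula; falling back to the collection LM when no user background model is supplied is my assumption here, not something the slide specifies:

```python
from collections import Counter

def two_stage_smoothing(doc_text, p_coll, p_user=None, mu=2000.0, lam=0.1):
    """p(w|d) = (1-lam) * (c(w,d) + mu*p(w|C)) / (|d| + mu) + lam * p(w|U)."""
    p_user = p_user or p_coll  # assumption: use the collection LM if no user model is given
    counts = Counter(doc_text.lower().split())
    doc_len = sum(counts.values())
    return {w: (1 - lam) * (counts.get(w, 0) + mu * p_coll[w]) / (doc_len + mu)
               + lam * p_user[w]
            for w in p_coll}
```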

Smoothing & TF-IDF weighting. Recall the general smoothing scheme (where did we see it before?):

p(w | d) = p_seen(w | d)      if w is seen in d       (smoothed ML estimate)
         = α_d · p(w | C)     otherwise                (reference language model)
α_d = (1 − Σ_{w seen} p_seen(w | d)) / Σ_{w unseen} p(w | C)

The key rewriting step of the retrieval formula:

log p(q | d) = Σ_{w: c(w,q)>0} c(w, q) log p(w | d)
             = Σ_{w: c(w,d)>0, c(w,q)>0} c(w, q) log p_seen(w | d) + Σ_{w: c(w,d)=0, c(w,q)>0} c(w, q) log[α_d p(w | C)]
             = Σ_{w: c(w,d)>0, c(w,q)>0} c(w, q) log[ p_seen(w | d) / (α_d p(w | C)) ] + |q| log α_d + Σ_{w: c(w,q)>0} c(w, q) log p(w | C)

Similar rewritings are very common when using probabilistic models for IR.

Understanding smoothing. Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain

log p(q | d) = Σ_{w_i ∈ d ∩ q} log[ p_seen(w_i | d) / (α_d p(w_i | C)) ] + |q| log α_d + Σ_{w_i ∈ q} log p(w_i | C)

The first term behaves like TF weighting, with p(w_i | C) in the denominator playing the role of IDF weighting; the |q| log α_d term provides document length normalization (a longer document is expected to have a smaller α_d); the last term does not depend on the document and can be ignored for ranking. So smoothing with p(w | C) yields TF-IDF weighting plus document length normalization.
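
To make the rewriting concrete, here is a sketch that scores a document using only the query terms it contains, assuming Dirichlet smoothing, for which p_seen(w | d) = (c(w, d) + μ p(w | C)) / (|d| + μ), α_d = μ / (|d| + μ), and hence p_seen(w | d) / (α_d p(w | C)) = 1 + c(w, d) / (μ p(w | C)):

```python
from collections import Counter
from math import log

def dirichlet_score(query_words, doc_text, p_coll, mu=2000.0):
    """log p(q|d) up to a document-independent constant, via the rewritten form:
    sum over matched terms of log(1 + c(w,d)/(mu*p(w|C))) + |q| * log(mu/(|d|+mu))."""
    counts = Counter(doc_text.lower().split())
    doc_len = sum(counts.values())
    alpha_d = mu / (doc_len + mu)
    score = len(query_words) * log(alpha_d)  # document length normalization
    for w in query_words:
        if counts.get(w, 0) > 0 and p_coll.get(w, 0) > 0:
            score += log(1 + counts[w] / (mu * p_coll[w]))  # TF-IDF-like matched-term weight
    return score
```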

What you should know:
- How to estimate a language model
- The general idea and the different ways of smoothing
- The effect of smoothing

Today's reading: Introduction to Information Retrieval, Chapter 12: Language models for information retrieval.