Retrieval Models: Language models

CS-590I Information Retrieval. Retrieval Models: Language models. Luo Si, Department of Computer Science, Purdue University.

Outline: introduction to language models; the unigram language model; document language model estimation (maximum likelihood estimation, maximum a posteriori estimation, Jelinek-Mercer smoothing); model-based feedback.

Vector space model for information retrieval: documents and queries are vectors in the term space, and relevance is measured by the similarity between document vectors and the query vector. Problems with the vector space model: ad-hoc term weighting schemes, ad-hoc similarity measurement, and no justification of the relationship between relevance and similarity. We need more principled retrieval models.

A language model can be created for any language sample: a document, a collection of documents, or a sentence, paragraph, chapter, or query. The size of the language sample affects the quality of the language model: long documents yield more accurate models, short documents yield less accurate models, and a model built from a single sentence, paragraph, or query may not be reliable.

A document language model defines a probability distribution over indexed terms, e.g., the probability of generating a term; the probabilities sum to 1. A query can be seen as observed data from unknown models; a query also defines a language model (more on this later). How might the models be used for IR? Rank documents by $\Pr(q \mid d)$, or rank documents by the language models of q and d using the Kullback-Leibler (KL) divergence between the models (covered later).

Example: for the query q = {sport, basketball}, estimate a language model for each document, then estimate the generation probability $\Pr(q \mid d)$ to produce the retrieval results. Documents: d1 = {sport, basketball, ticket, sport}; d2 = {basketball, ticket, finance, ticket, sport}; d3 = {stock, finance, finance, stock}.

Three basic problems for language models: What type of probabilistic distribution can be used to construct language models? How do we estimate the parameters of the distribution of the language models? How do we compute the likelihood of generating queries given the language models of documents?

A language model is built with a multinomial distribution over single terms (i.e., unigrams) in the vocabulary. Example: five words in the vocabulary (sport, basketball, ticket, finance, stock). For a document d, its language model is $\{P_d(\text{sport}), P_d(\text{basketball}), P_d(\text{ticket}), P_d(\text{finance}), P_d(\text{stock})\}$. Formally, the language model is $\{P_d(w) \text{ for any word } w \text{ in vocabulary } V\}$, with $\sum_w P_d(w) = 1$ and $0 \le P_d(w) \le 1$.

Estimate a multinomial model for each document: d1 = {sport, basketball, ticket, sport}; d2 = {basketball, ticket, finance, ticket, sport}; d3 = {stock, finance, finance, stock}.

Maximum Likelihood Estimation: find the model parameters that maximize the generation likelihood, $M^* = \arg\max_M \Pr(D \mid M)$. There are K words in the vocabulary, $w_1, \ldots, w_K$ (e.g., K = 5). Data: one document d with counts $tf_d(w_1), \ldots, tf_d(w_K)$ and length $|d|$. Model: a multinomial M with parameters $\{p_d(w_i)\}$. Likelihood: $\Pr(d \mid M)$, and $M^* = \arg\max_M \Pr(d \mid M)$.

Maximum Likelihood Estimation (derivation):
$p(d \mid M) = \frac{|d|!}{tf_d(w_1)! \cdots tf_d(w_K)!} \prod_{i=1}^{K} p_d(w_i)^{tf_d(w_i)} \propto \prod_{i=1}^{K} p_d(w_i)^{tf_d(w_i)}$
$l(d \mid M) = \log p(d \mid M) = \sum_{i=1}^{K} tf_d(w_i)\,\log p_d(w_i)$
Use the Lagrange multiplier approach: $l' = \sum_{i} tf_d(w_i)\,\log p_d(w_i) + \lambda\Big(\sum_{i} p_d(w_i) - 1\Big)$.
Set the partial derivatives to zero: $\frac{\partial l'}{\partial p_d(w_i)} = \frac{tf_d(w_i)}{p_d(w_i)} + \lambda = 0 \;\Rightarrow\; p_d(w_i) = -\frac{tf_d(w_i)}{\lambda}$.
Since $\sum_i p_d(w_i) = 1$, we get $\lambda = -\sum_i tf_d(w_i) = -|d|$, so the maximum likelihood estimate is $p_d(w_i) = \frac{tf_d(w_i)}{|d|}$.
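To make the estimator concrete, here is a minimal Python sketch of the MLE unigram model $p_d(w) = tf_d(w)/|d|$ applied to the three example documents; the function and variable names are illustrative, not from the lecture.

```python
from collections import Counter

def mle_language_model(doc_tokens):
    """Maximum likelihood unigram model: p_d(w) = tf_d(w) / |d|."""
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    return {word: tf / length for word, tf in counts.items()}

docs = {
    "d1": ["sport", "basketball", "ticket", "sport"],
    "d2": ["basketball", "ticket", "finance", "ticket", "sport"],
    "d3": ["stock", "finance", "finance", "stock"],
}
for name, tokens in docs.items():
    print(name, mle_language_model(tokens))
# d1 {'sport': 0.5, 'basketball': 0.25, 'ticket': 0.25}
```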

Maximum Likelihood Estimation example: for d1 = {sport, basketball, ticket, sport}, $(p_{sp}, p_b, p_t, p_f, p_{st}) = (0.5, 0.25, 0.25, 0, 0)$; for d2 = {basketball, ticket, finance, ticket, sport}, $(0.2, 0.2, 0.4, 0.2, 0)$; for d3 = {stock, finance, finance, stock}, $(0, 0, 0, 0.5, 0.5)$.

Maximum Likelihood Estimation assigns zero probabilities to words unseen in a small sample. A specific example: only two words in the vocabulary, $w_1 = \text{sport}$ and $w_2 = \text{business}$, like (head, tail) for a coin; a document generates a sequence of the two words, like flipping the coin many times: $\Pr(d \mid M) = p_d(w_1)^{tf_d(w_1)}\,(1 - p_d(w_1))^{tf_d(w_2)}$. If we only observe two words (flip the coin twice), the MLE estimators are: (business, sport) gives $P_d(w_1) = 0.5$; (sport, sport) gives $P_d(w_1) = 1$?; (business, business) gives $P_d(w_1) = 0$?

A specific example: observing only two words (flipping the coin twice), the MLE estimators are: (business, sport) gives $P_d(w_1)^* = 0.5$; (sport, sport) gives $P_d(w_1)^* = 1$?; (business, business) gives $P_d(w_1)^* = 0$? This is the data sparseness problem.

Approaches to address the data sparseness problem: maximum a posteriori (MAP) estimation, shrinkage, and the Bayesian ensemble approach.

Maximum A Posteriori Estimation: select the model that maximizes the probability of the model given the observed data, $M^* = \arg\max_M \Pr(M \mid D) = \arg\max_M \Pr(D \mid M)\Pr(M)$. Here $\Pr(M)$ is the prior belief/knowledge; the prior is used to avoid zero probabilities. A specific example with only two words in the vocabulary (sport, business): for a document d, $\Pr(M \mid d) \propto p_d(w_1)^{tf_d(w_1)}\, p_d(w_2)^{tf_d(w_2)}\, \Pr(M)$, where $\Pr(M)$ is the prior distribution.

Maximum A Posteriori Estimation: introduce a prior on the multinomial distribution. Use the prior $\Pr(M)$ to avoid zero probabilities; intuitively, most coins are more or less unbiased. Use a Dirichlet prior on $p(w)$: $\mathrm{Dir}(p \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_K)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{i=1}^{K} p(w_i)^{\alpha_i - 1}$, with $\sum_i p(w_i) = 1$ and $0 \le p(w_i) \le 1$. The $\alpha_i$ are hyper-parameters, and the leading factor is a constant with respect to p. $\Gamma(x)$ is the gamma function, $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt$, with $\Gamma(n+1) = n!$ for integer n.

For the two-word example, a Dirichlet prior: $\Pr(M) \propto p(w_1)^{2}\,(1 - p(w_1))^{2}$, i.e., a symmetric Dirichlet with $\alpha_1 = \alpha_2 = 3$, which peaks at $P(w_1) = 0.5$.

Maximum A Posteriori: $M^* = \arg\max_M \Pr(M \mid D) = \arg\max_M \Pr(D \mid M)\Pr(M)$. For the two-word example,
$\Pr(d \mid M)\Pr(M) \propto p_d(w_1)^{tf_d(w_1)}(1 - p_d(w_1))^{tf_d(w_2)}\, p_d(w_1)^{\alpha_1 - 1}(1 - p_d(w_1))^{\alpha_2 - 1} = p_d(w_1)^{tf_d(w_1) + \alpha_1 - 1}(1 - p_d(w_1))^{tf_d(w_2) + \alpha_2 - 1}$,
so the hyper-parameters act as pseudo counts: $M^* = \arg\max_{p_d(w_1)} p_d(w_1)^{tf_d(w_1) + \alpha_1 - 1}(1 - p_d(w_1))^{tf_d(w_2) + \alpha_2 - 1}$.

A specific example: observe only two words (flip a coin twice), both sport. Is $P_d(w_1)^* = 1$? The posterior is the likelihood times the prior $P(w_1)^2(1 - P(w_1))^2$, which pulls the estimate away from 1.

A specific example: observe only two words (flip a coin twice), both sport. Is $P_d(w_1)^* = 1$? With the Dirichlet prior, $p_d(w_1)^* = \frac{tf_d(w_1) + \alpha_1 - 1}{(tf_d(w_1) + \alpha_1 - 1) + (tf_d(w_2) + \alpha_2 - 1)} = \frac{2 + 3 - 1}{(2 + 3 - 1) + (0 + 3 - 1)} = \frac{4}{6} = \frac{2}{3}$.

Maximum A Posteriori Estimation: use a Dirichlet prior for the multinomial distribution. How should the parameters of the Dirichlet prior be set?

Maximum A Posteriori Estimation: use a Dirichlet prior for the multinomial distribution. There are K terms in the vocabulary. Multinomial: $p = \{p_d(w_1), \ldots, p_d(w_K)\}$, $\sum_i p_d(w_i) = 1$, $0 \le p_d(w_i) \le 1$. Dirichlet prior: $\mathrm{Dir}(p \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_K)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{i=1}^{K} p_d(w_i)^{\alpha_i - 1}$, where the $\alpha_i$ are hyper-parameters and the normalizing factor is a constant with respect to p.

MAP estimation for the unigram language model:
$p^* = \arg\max_p \frac{\Gamma(\alpha_1 + \cdots + \alpha_K)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_i p_d(w_i)^{tf_d(w_i)} \prod_i p_d(w_i)^{\alpha_i - 1} = \arg\max_p \prod_i p_d(w_i)^{tf_d(w_i) + \alpha_i - 1}$, subject to $\sum_i p_d(w_i) = 1$, $0 \le p_d(w_i) \le 1$.
Use the Lagrange multiplier approach and set the derivative to zero:
$p_d^*(w_i) = \frac{tf_d(w_i) + \alpha_i - 1}{\sum_j \big(tf_d(w_j) + \alpha_j - 1\big)}$
The pseudo counts are set by the hyper-parameters.

MAP estimation for the unigram language model (Lagrange multiplier, derivative set to zero): $p_d^*(w_i) = \frac{tf_d(w_i) + \alpha_i - 1}{\sum_j (tf_d(w_j) + \alpha_j - 1)}$. How do we determine appropriate values for the hyper-parameters? When nothing is observed from a document, $p_d^*(w_i) = \frac{\alpha_i - 1}{\sum_j (\alpha_j - 1)}$. What is the most likely $p_d(w_i)$ without looking at the content of the document?

MAP estimation for the unigram language model: what is the most likely $p_d(w_i)$ without looking at the content of the document? It is the unigram probability of the collection, $\{p_c(w_1), p_c(w_2), \ldots, p_c(w_K)\}$; without any other information, we guess the behavior of one member from the behavior of the whole population. Therefore set $p_d^*(w_i) = \frac{\alpha_i - 1}{\sum_j (\alpha_j - 1)} = p_c(w_i)$, i.e., $\alpha_i - 1 = \mu\, p_c(w_i)$ with $\mu$ a constant.

MAP estimation for the unigram language model:
$p^* = \arg\max_p \frac{\Gamma(\alpha_1 + \cdots + \alpha_K)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_i p_d(w_i)^{tf_d(w_i)} \prod_i p_d(w_i)^{\mu\, p_c(w_i)}$, subject to $\sum_i p_d(w_i) = 1$, $0 \le p_d(w_i) \le 1$.
Use the Lagrange multiplier approach and set the derivative to zero:
$p_d^*(w_i) = \frac{tf_d(w_i) + \mu\, p_c(w_i)}{\sum_j tf_d(w_j) + \mu \sum_j p_c(w_j)} = \frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}$
Here $\mu\, p_c(w_i)$ acts as pseudo counts and $\mu$ as a pseudo document length.

Dirichlet MAP estimation for the unigram language model. Step 0: compute the probability of each word in the whole collection with the collection unigram language model, $p_c(w_i) = \frac{\sum_d tf_d(w_i)}{\sum_d |d|}$. Step 1: for each document d, compute its smoothed unigram language model (Dirichlet smoothing) as $p_d(w_i) = \frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}$.
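A minimal sketch of steps 0 and 1, assuming already tokenized documents; the helper names and the default $\mu = 2000$ are illustrative choices, not values given in the lecture.

```python
from collections import Counter

def collection_model(all_docs):
    """Step 0: collection unigram model p_c(w) = total count of w / total tokens."""
    counts, total = Counter(), 0
    for tokens in all_docs:
        counts.update(tokens)
        total += len(tokens)
    return {w: tf / total for w, tf in counts.items()}

def dirichlet_smoothed_model(doc_tokens, p_c, mu=2000.0):
    """Step 1: p_d(w) = (tf_d(w) + mu * p_c(w)) / (|d| + mu)."""
    tf = Counter(doc_tokens)
    d_len = len(doc_tokens)
    return {w: (tf.get(w, 0) + mu * p_c[w]) / (d_len + mu) for w in p_c}
```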

Dirichlet MAP estimation for the unigram language model. Step 2: for a given query $q = \{tf_q(w_1), \ldots, tf_q(w_K)\}$ and each document d, compute the likelihood $p(q \mid d) = \prod_{i=1}^{K} p_d(w_i)^{tf_q(w_i)} = \prod_{i=1}^{K} \left(\frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}\right)^{tf_q(w_i)}$. The larger the likelihood, the more relevant the document is to the query.
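Continuing the sketch above (it reuses collection_model and dirichlet_smoothed_model), step 2 can be scored in log space to avoid numerical underflow; the names and the skipping of query words outside the collection vocabulary are assumptions.

```python
import math

def query_log_likelihood(query_tokens, doc_tokens, p_c, mu=2000.0):
    """log p(q|d) = sum over query occurrences of log p_d(w)."""
    p_d = dirichlet_smoothed_model(doc_tokens, p_c, mu)
    return sum(math.log(p_d[w]) for w in query_tokens if w in p_d)

docs = {
    "d1": ["sport", "basketball", "ticket", "sport"],
    "d2": ["basketball", "ticket", "finance", "ticket", "sport"],
    "d3": ["stock", "finance", "finance", "stock"],
}
p_c = collection_model(docs.values())
query = ["sport", "basketball"]
ranking = sorted(docs, key=lambda d: query_log_likelihood(query, docs[d], p_c), reverse=True)
print(ranking)  # documents sorted by decreasing log p(q|d)
```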

Comparing Dirichlet smoothing with TF-IDF weighting. Dirichlet smoothing: $p(q \mid d) = \prod_{i=1}^{K}\left(\frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}\right)^{tf_q(w_i)}$. How does this relate to TF-IDF weighting, $sim(q, d) = \sum_{i=1}^{K} tf_q(w_i)\, tf_d(w_i)\, idf(w_i) \,/\, norm(d)$?

Dirichlet smoothing: $p(q \mid d) = \prod_{i=1}^{K}\left(\frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}\right)^{tf_q(w_i)}$, which gives
$\log p(q \mid d) = \sum_{i=1}^{K} tf_q(w_i)\,\log\!\left(1 + \frac{tf_d(w_i)}{\mu\, p_c(w_i)}\right) - |q|\log(|d| + \mu) + \sum_{i=1}^{K} tf_q(w_i)\,\log\big(\mu\, p_c(w_i)\big)$.
TF-IDF weighting: $sim(q, d) = \sum_{i=1}^{K} tf_q(w_i)\, tf_d(w_i)\, idf(w_i) \,/\, norm(d)$.

Dirichlet smoothing, step by step:
$p(q \mid d) = \prod_{i=1}^{K}\left(\frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}\right)^{tf_q(w_i)}$
$\log p(q \mid d) = \sum_{i=1}^{K} tf_q(w_i)\left[\log\big(tf_d(w_i) + \mu\, p_c(w_i)\big) - \log(|d| + \mu)\right]$
$= \sum_{i=1}^{K} tf_q(w_i)\,\log\frac{\mu\, p_c(w_i) + tf_d(w_i)}{\mu\, p_c(w_i)} - |q|\log(|d| + \mu) + \sum_{i=1}^{K} tf_q(w_i)\,\log\big(\mu\, p_c(w_i)\big)$
$= \sum_{i=1}^{K} tf_q(w_i)\,\log\!\left(1 + \frac{tf_d(w_i)}{\mu\, p_c(w_i)}\right) - |q|\log(|d| + \mu) + \sum_{i=1}^{K} tf_q(w_i)\,\log\big(\mu\, p_c(w_i)\big)$

Dirichlet smoothing: the last term, $\sum_i tf_q(w_i)\log(\mu\, p_c(w_i))$, does not depend on the document, so it is irrelevant for ranking:
$\log p(q \mid d) \;\propto_{rank}\; \sum_{i=1}^{K} tf_q(w_i)\,\log\!\left(1 + \frac{tf_d(w_i)}{\mu\, p_c(w_i)}\right) - |q|\log(|d| + \mu)$.
TF-IDF weighting: $sim(q, d) = \sum_{i=1}^{K} tf_q(w_i)\, tf_d(w_i)\, idf(w_i) \,/\, norm(d)$.

Dirichlet smoothing: look at the TF-IDF-like part, $\log\!\left(1 + \frac{tf_d(w_i)}{\mu\, p_c(w_i)}\right)$. The term grows as $tf_d(w_i)$ grows (a TF effect) and shrinks as the collection probability $p_c(w_i)$ grows (an IDF effect), so query-likelihood retrieval with Dirichlet smoothing implicitly performs TF-IDF-like weighting.

Dirichlet smoothing hyper-parameter: $p_d(w_i) = \frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}$. When $\mu$ is very small, the estimate approaches the MLE estimator; when $\mu$ is very large, it approaches the probability on the whole collection. How do we set an appropriate $\mu$?

Leave-one-out validation: with $p_d(w_i) = \frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}$, hold out one word occurrence from the document and predict it with the model estimated from the rest: $p(w_1 \mid d_{/w_1}) = \frac{tf_d(w_1) - 1 + \mu\, p_c(w_1)}{|d| - 1 + \mu}$, and in general $p(w_j \mid d_{/w_j}) = \frac{tf_d(w_j) - 1 + \mu\, p_c(w_j)}{|d| - 1 + \mu}$.

Leave-one-out validation: the leave-one-out log-likelihood of a document is $l_{-1}(\mu, d) = \sum_{j=1}^{|d|} \log\frac{tf_d(w_j) - 1 + \mu\, p_c(w_j)}{|d| - 1 + \mu}$, summing over the word occurrences of d; over the whole collection, $l_{-1}(\mu, C) = \sum_{d \in C}\sum_{j=1}^{|d|} \log\frac{tf_d(w_j) - 1 + \mu\, p_c(w_j)}{|d| - 1 + \mu}$. Choose $\mu^* = \arg\max_\mu l_{-1}(\mu, C)$.
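A small sketch of this selection; the candidate grid for $\mu$ and the helper names are assumptions, since the lecture does not say how the argmax is searched (it reuses the collection model p_c from the earlier sketch).

```python
import math
from collections import Counter

def loo_log_likelihood(all_docs, p_c, mu):
    """l_{-1}(mu, C): leave-one-out log-likelihood summed over every word occurrence."""
    total = 0.0
    for tokens in all_docs:
        tf = Counter(tokens)
        d_len = len(tokens)
        for w in tokens:  # hold out this single occurrence
            total += math.log((tf[w] - 1 + mu * p_c[w]) / (d_len - 1 + mu))
    return total

def select_mu(all_docs, p_c, candidates=(100, 500, 1000, 2000, 5000)):
    """mu* = argmax_mu l_{-1}(mu, C), searched over a candidate grid."""
    return max(candidates, key=lambda mu: loo_log_likelihood(all_docs, p_c, mu))
```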

What type of document/collection would get a large $\mu$? One in which most documents use vocabulary and wording patterns similar to the whole collection. What type of document/collection would get a small $\mu$? One in which most documents use vocabulary and wording patterns that differ from the whole collection.

Shrinkage. Maximum likelihood (MLE) builds the model purely from the document's own data and then generates the query words, so the model may not be accurate when the document is short (many unseen words). A shrinkage estimator builds a more reliable model by consulting more general models (e.g., the collection language model). Example: an estimate of P(Lung_Cancer | Smoke) for West Lafayette can be shrunk toward the estimates for Indiana and the U.S.

Jelinek-Mercer smoothing: assume each word is generated from the document language model (MLE) with probability $\lambda$ and from the collection language model (MLE) with probability $1 - \lambda$; this is a linear interpolation between the document language model and the collection language model. JM smoothing: $p_d(w_i) = \lambda\,\frac{tf_d(w_i)}{|d|} + (1 - \lambda)\, p_c(w_i)$.

Relationship between JM smoothing and Dirichlet smoothing:
$p_d(w_i) = \frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu} = \frac{1}{|d| + \mu}\big(tf_d(w_i) + \mu\, p_c(w_i)\big) = \frac{|d|}{|d| + \mu}\cdot\frac{tf_d(w_i)}{|d|} + \frac{\mu}{|d| + \mu}\, p_c(w_i)$,
so Dirichlet smoothing is JM smoothing with a document-dependent $\lambda = \frac{|d|}{|d| + \mu}$.

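A brief sketch of JM smoothing that also checks numerically that Dirichlet smoothing coincides with JM smoothing at $\lambda = |d|/(|d| + \mu)$; it reuses p_c and dirichlet_smoothed_model from the earlier sketches, and the specific $\mu$ and document are illustrative.

```python
from collections import Counter

def jm_smoothed_model(doc_tokens, p_c, lam=0.5):
    """Jelinek-Mercer: p_d(w) = lam * tf_d(w)/|d| + (1 - lam) * p_c(w)."""
    tf = Counter(doc_tokens)
    d_len = len(doc_tokens)
    return {w: lam * tf.get(w, 0) / d_len + (1 - lam) * p_c[w] for w in p_c}

# Dirichlet smoothing equals JM smoothing with document-dependent lam = |d| / (|d| + mu):
doc, mu = ["sport", "basketball", "ticket", "sport"], 2000.0
lam = len(doc) / (len(doc) + mu)
jm = jm_smoothed_model(doc, p_c, lam)
dirichlet = dirichlet_smoothed_model(doc, p_c, mu)
assert all(abs(jm[w] - dirichlet[w]) < 1e-12 for w in p_c)
```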
Equivalence of retrieval based on query generation likelihood and retrieval based on the Kullback-Leibler (KL) divergence between the query and document language models. The KL divergence between two probability distributions is $KL(p \,\|\, q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$. It measures the distance between the two distributions and is never negative (it is zero only when the distributions are identical). How can this be proved?
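One standard argument, not spelled out in the lecture, uses the inequality $\log x \le x - 1$:

```latex
% Non-negativity of KL divergence (Gibbs' inequality), using log x <= x - 1:
\begin{aligned}
-KL(p \,\|\, q) &= \sum_x p(x)\,\log\frac{q(x)}{p(x)} \\
                &\le \sum_x p(x)\left(\frac{q(x)}{p(x)} - 1\right)
                 = \sum_x q(x) - \sum_x p(x) = 1 - 1 = 0,
\end{aligned}
% so KL(p || q) >= 0, with equality iff p(x) = q(x) for all x.
```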

Equivalence of query generation likelihood and KL-divergence retrieval:
$Sim(q, d) = -KL(q \,\|\, d) = -\sum_w q(w)\log\frac{q(w)}{p_d(w)} = \sum_w q(w)\log p_d(w) - \sum_w q(w)\log q(w)$.
The first term is the log-likelihood of query generation, and the second term is a document-independent constant, so ranking by $-KL(q \,\|\, d)$ is equivalent to ranking by query generation likelihood. This also generalizes the query representation to a distribution (fractional term weighting).
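A sketch of ranking by negative KL divergence under these definitions; it reuses dirichlet_smoothed_model from the earlier sketch, and the names and the skipping of query words outside the collection vocabulary are assumptions.

```python
import math
from collections import Counter

def query_language_model(query_tokens):
    """Empirical query model: q(w) = tf_q(w) / |q|."""
    tf = Counter(query_tokens)
    return {w: c / len(query_tokens) for w, c in tf.items()}

def neg_kl_score(query_tokens, doc_tokens, p_c, mu=2000.0):
    """Score = sum_w q(w) * log p_d(w); the query-entropy part of -KL(q||d)
    is the same for every document, so it is dropped for ranking."""
    q = query_language_model(query_tokens)
    p_d = dirichlet_smoothed_model(doc_tokens, p_c, mu)
    return sum(q_w * math.log(p_d[w]) for w, q_w in q.items() if w in p_d)
```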

Retrieval with language models: estimate a language model for each document d (and, for the KL-divergence approach, a query language model for q); produce retrieval results either by estimating the generation probability $\Pr(q \mid d)$ or by calculating the KL divergence $KL(q \,\|\, d)$ between the query and document language models.

Model-based feedback: estimate a feedback query language model $q_F$ from the feedback documents in the initial results, and form a new query model by interpolation, $q' = (1 - \alpha)\, q + \alpha\, q_F$. Retrieval then calculates the KL divergence $KL(q' \,\|\, d)$ against each document language model. With $\alpha = 0$ there is no feedback ($q' = q$); with $\alpha = 1$ there is full feedback ($q' = q_F$).

Estimating $q_F$: assume a generative mixture model produces each word in the feedback document(s). For each word, given a fixed $\lambda$, flip a coin: with probability $\lambda$ the word comes from the unknown topic model $q_F(w)$, and with probability $1 - \lambda$ from the background (collection) model $p_C(w)$. Estimate $q_F^* = \arg\max_{q_F} l(X; \lambda, q_F) = \arg\max_{q_F} \sum_{w} c(w, F)\,\log\big(\lambda\, q_F(w) + (1 - \lambda)\, p_C(w)\big)$, where $c(w, F)$ is the count of word w in the feedback documents.

For each word, there is a hidden variable telling which language model it came from: the known background model $p_C(w)$ (e.g., the 0.12, to 0.05, it 0.04, a 0.02, sport 0.0001, basketball 0.00005), chosen with probability $1 - \lambda = 0.8$, or the unknown query topic model $p(w \mid \theta_F)$ ("Basketball": sport = ?, basketball = ?, game = ?, player = ?), chosen with probability $\lambda = 0.2$. If we knew the value of the hidden variable for each word, the MLE estimator could be applied directly; since the variable is hidden, we use the EM algorithm.

For each word, the hidden variable is $z = 1$ (feedback topic) or $z = 0$ (background). Step 1 (Expectation): estimate the hidden variable based on the current model parameters:
$p(z = 1 \mid w) = \frac{p(z = 1)\, p(w \mid z = 1)}{p(z = 1)\, p(w \mid z = 1) + p(z = 0)\, p(w \mid z = 0)} = \frac{\lambda\, q_F^{(t)}(w)}{\lambda\, q_F^{(t)}(w) + (1 - \lambda)\, p_C(w)}$
e.g., the (0.1), basketball (0.7), game (0.6), is (0.2), ...
Step 2 (Maximization): update the model parameters based on the guess in step 1:
$q_F^{(t+1)}(w_i) = \frac{c(w_i, F)\, p(z_i = 1 \mid w_i)}{\sum_j c(w_j, F)\, p(z_j = 1 \mid w_j)}$

The Expectation-Maximization (EM) algorithm for $q_F$. Step 0: initialize $q_F^{(0)}$ and fix $\lambda$ (e.g., $\lambda = 0.5$). Step 1 (Expectation): $p(z = 1 \mid w) = \frac{\lambda\, q_F^{(t)}(w)}{\lambda\, q_F^{(t)}(w) + (1 - \lambda)\, p_C(w)}$. Step 2 (Maximization): $q_F^{(t+1)}(w_i) = \frac{c(w_i, F)\, p(z_i = 1 \mid w_i)}{\sum_j c(w_j, F)\, p(z_j = 1 \mid w_j)}$. Iterate steps 1 and 2 until convergence.
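A minimal sketch of this EM loop for the feedback model; the initialization, fixed iteration count, and names are assumptions rather than details from the lecture.

```python
from collections import Counter

def estimate_feedback_model(feedback_tokens, p_c, lam=0.5, iterations=20):
    """EM for the two-component mixture: word ~ lam * q_F(w) + (1 - lam) * p_c(w)."""
    counts = Counter(feedback_tokens)
    total = sum(counts.values())
    # Step 0: initialize q_F, e.g., with the empirical distribution of the feedback text.
    q_f = {w: c / total for w, c in counts.items()}
    for _ in range(iterations):
        # E-step: p(z = 1 | w) for each feedback word.
        post = {
            w: lam * q_f[w] / (lam * q_f[w] + (1 - lam) * p_c.get(w, 1e-12))
            for w in counts
        }
        # M-step: re-estimate q_F from counts weighted by p(z = 1 | w).
        norm = sum(counts[w] * post[w] for w in counts)
        q_f = {w: counts[w] * post[w] / norm for w in counts}
    return q_f
```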

Properties of the parameter $\lambda$: if $\lambda$ is close to 0, most common words can be generated by the collection language model, so the estimated query (topic) language model contains more topic words; if $\lambda$ is close to 1, the query language model itself has to generate the common words, so it contains fewer topic words.

Summary: introduction to language models; the unigram language model; document language model estimation (maximum likelihood estimation, maximum a posteriori estimation, Jelinek-Mercer smoothing); model-based feedback.